Automate web scraping with Github Actions

#code #data #scraping #sports #analytics #football #statistics #rstats
Author

Ismael Gómez

Published

April 2, 2024

Intro & Problem details

In this article, I will detail a way to solve a common problem: keeping a database updated with data from the web. This process can be tedious and error-prone when done manually, so a good alternative is to do the web scraping with code. Ideally, though, that code should run “on its own”, without someone having to remember to trigger the task each time. The solution I present here is to automate it with Github Actions.

Github Actions allows us to execute code periodically, which makes it an ideal tool for this type of task. Below, I will show you how to set up a Github Actions workflow that automates extracting data from a website, processing it, and generating two CSV files.

This tutorial is aimed at people who have basic knowledge of R and Github.

For a reference example of how this can be applied, you can review the posts that show the match data of the Chilean national soccer team, which let you monitor the historical goal difference with an interactive dashboard and the performance ranking of its coaches.

Github Actions to the rescue

Github Actions allows us, among many other things, to automate the execution of code stored in a Github repository. This is, of course, particularly useful when that code performs a recurring task.

In a way, this can be considered an example of the Continuous Integration (CI) concept from DevOps.

In simple words, to make this work you need to set up a Workflow, a file whose content mainly specifies a) when the process(es) will be executed (the Jobs), and b) what will be done and how it will be done.
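To make that structure concrete, here is a minimal sketch of a Workflow file showing only those two parts (the full file we will actually use is reviewed below):

Code
# a) when it will run: the triggers
on:
  schedule:
    - cron: '0 0 * * *'   # on a schedule
  workflow_dispatch:       # or manually, from the Actions tab

# b) what will be done and how: one or more Jobs, each made of steps
jobs:
  update-data:
    runs-on: ubuntu-latest
    steps:
      - name: One step of the Job
        run: echo "commands or actions go here"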

Next, I will detail the necessary steps you should consider. For definitions and more details, check the official Github Actions documentation. If you want to compare your progress with mine, here is the Github repository of Dato Fútbol.

1) Create a repository in Github

Of course, we assume here that you already have a Github account. If you don’t have one, you should create one before anything else!

Once in your account:

  • Press “New”

In this repository we will configure the Workflow and add the code that will run the web scraping and data processing periodically. It is also where we will store the data in CSV format.

  • Try to give it a name that reflects the objective you are looking for. I called mine gha-chile-games. If you want to follow the steps in this tutorial you can use github-actions-test, for example.

  • You can make the repository either public or private. If you later want to share your code and/or data, I suggest making it public to facilitate access.

2) Create a new Workflow

  • Once in the repository go to the “Actions” tab and then press “New workflow”

  • As you can see, there you will be able to choose from a large number of predefined options, each designed for a different objective. In this case, for didactic purposes, we will configure our own Workflow manually, so you must press “set up a workflow yourself”.

  • When a Workflow is created for the first time, you will notice that a folder called .github/workflows/ is automatically created in your repository with a file inside called main.yml. This file is empty by default. You can rename it as you prefer, as long as the name makes it easy to identify what the Workflow does (in my case I called it chile-data-scraping.yml). All the Workflows you create will be saved in this folder.

  • Each Workflow has a specific YML file associated with its configuration.
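By the end of the tutorial, the repository layout will look roughly like this (the file names are the ones used in my example):

Code
.github/
  workflows/
    chile-data-scraping.yml   # Workflow configuration
update-data.R                 # web scraping and data processing script
chile_games.csv               # output data, updated by the Workflow
chile_dts_stats.csv           # output data, updated by the Workflow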

3) Configuring the Workflow

When we talk about configuring the Workflow, we refer to writing a series of instructions in a standard format that will mainly define what we will do, when and how we will do it.

As an example, below we will review each part of the YML file used in the example of the Chile national team matches: “chile-data-scraping”:

Code
name: CI chile games and dts stats

# Controls when the workflow will run
on:
  # Triggers the workflow based on the schedule:
  schedule:
    - cron: '0 0 * * *'

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "update-data"
  update-data:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      - name: Set up R
        uses: r-lib/actions/setup-r@v2

      - name: Install packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          packages: |
            any::rvest 
            any::dplyr
            any::stringr
            any::lubridate
            any::readr
            any::janitor
    
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - name: Check out repository
        uses: actions/checkout@v3
      
      # Runs the R code for data update
      - name: Import data
        run: Rscript -e 'source("update-data.R")'

      # Runs a set of commands using the runner's shell
      - name: Commit results
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
          git add chile_games.csv
          git add chile_dts_stats.csv
          git commit -m 'Data updated' || echo "No changes to commit"
          git push origin || echo "No changes to commit"

A) When will it run?

After assigning a name to the Workflow using name:, the on: block configures the options that define when it will run. In this example we have two options configured:

  • Schedule a Workflow run with the cron: syntax; its value “0 0 * * *” in this example means “run once a day at midnight (UTC)”. You can read more about cron syntax here; a few other patterns are sketched after this list.

  • workflow_dispatch: simply enables the option to run the Workflow manually from Github.
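For reference, the five cron fields are minute, hour, day of month, month and day of week, and Github Actions evaluates them in UTC. A few other patterns, as a sketch:

Code
schedule:
  - cron: '0 0 * * *'     # every day at 00:00 UTC (the one used in this example)
  # - cron: '0 */6 * * *' # every 6 hours
  # - cron: '30 8 * * 1'  # Mondays at 08:30 UTC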

You can also trigger a run for other events, for example when code is pushed to the repo or a pull request is opened, as sketched below.
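A minimal sketch of those event-based triggers (the branch name main is just an assumption for illustration):

Code
on:
  push:
    branches: [main]       # run on every push to main
  pull_request:
    branches: [main]       # run when a pull request targets main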

B) What we will do and how

This is the core of the Workflow and contains everything we want to do. Keep in mind that the Workflow runs on a server provided by Github, so we need to set up everything we are going to use there: the operating system, the programs, the packages, and so on.

  • jobs: starts the Jobs that make up the Workflow (in this case only one, called update-data). runs-on: specifies the operating system the runner will use (in this case the latest available version of Ubuntu Linux). To use Windows set windows-latest, and for Mac, macos-latest.

The following details the steps we want to take for the single Job configured:

  • Install R: here we use a pre-defined action that is openly available to the community (r-lib/actions/setup-r@v2).

  • Install the R packages our code will use: in addition to explicitly listing the packages, here we also use a pre-defined action that manages the package dependencies (r-lib/actions/setup-r-dependencies@v2).

  • actions/checkout@v3 checks out the repository into the runner's workspace, confirming along the way that we have access permissions and that everything is in order to execute the following steps.

  • With run: we execute our R script update-data.R through Rscript in the console.

  • Finally, we run some git commands to commit and push the data (the first run will commit the CSV files in any case; after that, a commit is made only when the data has changed). If there are no changes in the data, nothing else is done. A small variation of this step is sketched after this list.
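As a side note, an equivalent way of writing that last step is to check explicitly whether the staged files changed before committing. This is just a sketch of that variation, not what the Workflow above uses:

Code
      - name: Commit results
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
          git add chile_games.csv chile_dts_stats.csv
          # commit and push only if the staged files actually changed
          git diff --staged --quiet || (git commit -m 'Data updated' && git push origin)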


Final details

It is important to respect the whitespace and the hierarchy of each part of the YML file (the code indentation).

If you are interested, here you can check the R Code for scraping & data processing:

Code
library(rvest)
library(dplyr)
library(stringr)
library(lubridate)
library(readr)
library(janitor)


# chile games: scrape the historical table of matches
path = 'https://www.partidosdelaroja.com/1970/01/partidos-clase-a.html'
web = read_html(path)

# extract the matches table from the page
games = web %>% 
         html_nodes(xpath = "//table[@class='table3']") %>%
         html_table(fill = T) %>% 
         as.data.frame()

columns = c("num_game", "date", "city", "team_home", "goals_home", "team_away", "goals_away", "competition")

names(games) = columns

# drop the header rows, strip parenthetical text from the goal columns,
# convert types, parse dates and keep only rows with a valid score
games_ok = games %>% 
          slice(3:nrow(games)) %>%
          mutate(across(c(goals_home, goals_away), ~gsub("[(].*", "", .x))) %>% 
          mutate(across(c(num_game, goals_home, goals_away), ~as.numeric(.x))) %>% 
          mutate(date = dmy(date)) %>% 
          filter(!is.na(goals_home))

write_csv(games_ok, "chile_games.csv")


# chile dts: scrape the coaches (entrenadores) statistics table
path_dt = "https://www.partidosdelaroja.com/1970/01/entrenadores.html"
web_dt = read_html(path_dt)

data_dt = web_dt %>% 
          html_nodes(xpath = "//*[@id='post-body-7333737142337163810']/center[1]/table") %>%
          html_table(fill = T) %>% 
          as.data.frame()

# use the first row of the table as the column names
names(data_dt) = data_dt[1, ]

numeric_columns = c("pj", "pg", "pe", "pp", "gf", "gc", "dif")

# clean the column names, drop the header and total rows,
# convert the numeric columns and parse the tenure dates
dt_stats = data_dt %>%
          clean_names() %>% 
          rename("dt" = "entrenador") %>% 
          filter(!dt %in% c("Entrenador", "Total", "Sin entrenador")) %>% 
          mutate(across(all_of(numeric_columns), ~as.numeric(.x))) %>% 
          mutate(desde = dmy(desde),
                 hasta = dmy(hasta)) 

write_csv(dt_stats, "chile_dts_stats.csv")
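If you want to verify the script locally before pushing it to the repository, a quick check could look like this (assuming the packages listed above are installed):

Code
# run the scraping/processing script and inspect the generated files
source("update-data.R")
head(readr::read_csv("chile_games.csv"))
head(readr::read_csv("chile_dts_stats.csv"))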

Finally, thanks to the workflow_dispatch option in our configuration, you do not have to wait until midnight to find out whether the whole process works. You can trigger the Workflow manually by clicking “Actions” -> “[the-name-of-the-workflow]” -> “Run workflow” -> “Run workflow” (yes, again), as the following image shows:

If the whole process works, a green check will be shown (yellow rectangle in the image). You can also click on the name of the Workflow run to check the details of each step, either while the process is running or once it has finished.


Conclusion

We reviewed in detail an efficient way to keep a database updated through the configuration of a Github Actions Workflow: an R script is executed periodically to perform a specific web scraping task and update the CSV files when there is new information.

I invite you to explore this tool. Its potential is enormous and goes far beyond updating a database: it can handle different tasks such as deploying a Shiny app or rendering and publishing a website, to name a few.