If you are looking for a way to get data that is displayed on a website and save it to a Google Sheet every <some time interval>—because a record number of
cmd+v keystrokes aren’t really the stats that resonate with you and your friends—please, read on.
An example problem:
In January 2020, Los Angeles County’s Department of Public Health started publishing Coronavirus data specific to Los Angeles County at http://publichealth.lacounty.gov/media/Coronavirus/locations.htm. Every day the numbers are updated, providing a snapshot of the current status.
If someone is interested in previous weeks’ data in order to make inferences about future numbers and needs, then today’s snapshot in isolation isn’t very useful. What makes that data useful is aggregating it in a format that can be easily plotted and analyzed. Since publichealth.lacounty.gov doesn’t provide an API to get the historical data, a custom program needs to be written.
Writing a program that scrapes and saves the data
- Identify data of interest. In this example, we’ll keep it simple and focus on capturing the number of cases and the number of deaths in LA county.
- Create a new Google spreadsheet (I’m not going to go into details, but check this out if you are not familiar with creating a new spreadsheet)
- Give the spreadsheet a name and add column headers for the data that you will be capturing
- Open the script editor.
- Rename the function and add enough pseudocode to help you stay on task, then press cmd+s (or File, then Save), which will open a prompt to name the file.
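To make the pseudocode step concrete, a starting skeleton might look something like this (the function name and step comments are just placeholders):

```javascript
// Placeholder skeleton; rename the function to fit your project.
function captureLACountyStats() {
  // 1. Fetch the page content from the county site
  // 2. Extract the number of cases and the number of deaths
  // 3. Append a timestamped row to the tracking spreadsheet
}
```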
What is Apps Script?
Apps Script is Google’s cloud-based scripting platform. It uses JavaScript and provides built-in services for automating and extending Google products like Sheets, Docs, and Gmail.
Getting the site content
Quick refresher: web scraping is extracting data from a website. To extract the data, we first need to get the content. To do that, we can use UrlFetchApp. To create our own timestamp, we can use Utilities.formatDate.
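A minimal sketch of the fetch-and-timestamp step. Note that UrlFetchApp and Utilities only exist inside the Apps Script runtime, and the time zone and date format shown are assumptions:

```javascript
// Fetch the raw HTML of the county page as a string.
function getPageContent() {
  var response = UrlFetchApp.fetch('http://publichealth.lacounty.gov/media/Coronavirus/locations.htm');
  return response.getContentText();
}

// Build a timestamp for the row we are about to record, e.g. "2020-04-01 08:00".
// The time zone and format are assumptions; use whatever suits your sheet.
function getTimestamp() {
  return Utilities.formatDate(new Date(), 'America/Los_Angeles', 'yyyy-MM-dd HH:mm');
}
```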
Extracting the data
There are a few approaches that you can take to parsing the content:
- XML Service: this can be used to parse HTML, but it can be cumbersome to navigate through the DOM tree. Here is an example implementing custom functions like getElementById to simplify it. https://sites.google.com/site/scriptsexamples/learn-by-example/parsing-html
I opted for regex.
Beware of asynchronous code and oopsies
The first regex to match the number of cases worked perfectly in https://regex101.com/, but group 2 returned empty when I added it to the Apps Script function.
Digging a little deeper, I found that the two values of interest, the number of cases and the number of deaths, are being returned asynchronously after the initial page loads. The values are hard coded into a JS file—not from a database 🤔—but if a user scrolls, the same values are included in a table immediately available on page load—meaning the data is hardcoded in two different places. (🤔🤔🤔 This design decision baffles me!…but that’s not the current topic so moving past that.)
If the values weren’t immediately available, then we would either need to see if accessing the API that returns the data is a possibility, or delay capturing the web content until after the page finishes loading. Since the data is available in the table, we can adjust the regex to pull it from there.
Also watch out for oopsies like a number being split in two by a span tag. I doubt this was intentional (it was probably the result of a GUI edit), but it will affect your regex.
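Putting that together, here is a sketch of the extraction step. The live page’s markup isn’t reproduced in this post, so the label text and patterns below are assumptions to adapt:

```javascript
// Pull the case and death counts out of the page HTML with regex.
// The label strings ("Laboratory Confirmed Cases", "Deaths") and the
// surrounding markup are assumptions; adjust them to match the live page,
// and remember that a count may be split by a stray <span>.
function extractCounts(html) {
  var casesMatch = html.match(/Laboratory Confirmed Cases[\s\S]*?([\d,]+)/);
  var deathsMatch = html.match(/Deaths[\s\S]*?([\d,]+)/);
  return {
    cases: casesMatch ? parseInt(casesMatch[1].replace(/,/g, ''), 10) : null,
    deaths: deathsMatch ? parseInt(deathsMatch[1].replace(/,/g, ''), 10) : null
  };
}
```

Because this function takes the HTML as a plain string, it can be tested against a saved copy of the page before wiring it up to UrlFetchApp.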
Saving the data to a Google Sheet
Apps Script has several helpful methods for a number of tasks, including getActiveSpreadsheet, getDataRange, and appendRow. These and many others can be found on the documentation page: https://developers.google.com/apps-script/reference/spreadsheet/spreadsheet.
A couple things to note:
- A separate function is created to add a row to the sheet. I set it up this way so that the function could be reused, for example, to add a row to different sheets (tabs) within the spreadsheet at a future point.
- An object is passed into the function instead of an array. This decision was made to improve readability and make it easier to make insertRowInTracker its own function.
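A sketch of the save step under those two notes. It assumes the script is bound to the spreadsheet and writes to the active sheet, and the property names on the object are assumptions:

```javascript
// Append one row to the tracker sheet. Taking an object instead of an
// array keeps call sites readable, e.g.:
//   insertRowInTracker({ date: ts, cases: 7955, deaths: 223 });
function insertRowInTracker(rowData) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  sheet.appendRow([rowData.date, rowData.cases, rowData.deaths]);
}
```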
Automate running the script
Setting up a trigger to run your function so you don’t have to press the button is a beautiful thing. A time-driven trigger (sometimes called a clock trigger) is similar to a Unix cron job. You can manage the triggers you set up through a dashboard, with code, or both. Below is a hybrid example. First we’ll set up a function to create a trigger, and run that function once.
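A sketch of that one-time setup function, using Apps Script’s ScriptApp trigger builder (the name of the function being scheduled and the hour are assumptions):

```javascript
// Run this once (e.g. from the script editor) to schedule the scraper.
// It creates a time-driven trigger that calls captureLACountyStats once
// a day, in the hour starting at 8am in the script's time zone.
function createDailyTrigger() {
  ScriptApp.newTrigger('captureLACountyStats')
    .timeBased()
    .everyDays(1)
    .atHour(8)
    .create();
}
```

Running it more than once creates duplicate triggers, which is why it’s kept separate from the scraping function itself.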
To validate that the trigger was created, you can click on the clock icon (between save and play) to open the Triggers dashboard, where the new trigger should be listed.
If you click on the pencil tool, bottom right, a modal will open with configuration and notification options—in case you want to get a notification if the script fails for some reason.
Nicely done! Now take a moment to sip something and appreciate all the reminders and keystrokes that you avoided…unless you need to backfill data.
Tips to backfill your data
This is manual work, but can be helpful and is sometimes necessary.
Check the Wayback Machine. It is an archive site that captures how websites looked at a given point in time. It appears that, at least for smaller sites, a capture happens when a user searches for the site on the Wayback Machine. Larger circles on the calendar mean more hits/snapshots on that specific day.
Check news articles, press releases, or social media feeds with timestamps from reputable sources for reported numbers.
May the way back be with you!