Scrape and save data to Google Sheets with Apps Script

Kamie Robinson
6 min read · Mar 31, 2020

If you are looking for a way to get data that is displayed on a website and save it to a Google Sheet every <some time interval>—because a record number of cmd+x and cmd+v keystrokes aren’t really the stats that resonate with you and your friends—please, read on.

An example problem:

In January 2020, Los Angeles County’s Department of Public Health started publishing Coronavirus data specific to Los Angeles County at http://publichealth.lacounty.gov/media/Coronavirus/locations.htm. Every day the numbers are updated, providing a snapshot of the current status.

A screen capture from 03/18/2020 showing 190 cases, and 1 death were reported in Los Angeles County
A screen capture from 03/28/2020 showing 1804 cases and 32 deaths were reported in Los Angeles County

If someone is interested in previous weeks’ data in order to make inferences about future numbers and needs, then today’s snapshot in isolation isn’t very useful. What makes that data useful is aggregating it in a format that can be easily plotted and analyzed. Since publichealth.lacounty.gov doesn’t provide an API to get the historical data, a custom program needs to be written.

Writing a program that scrapes and saves

Getting set up

  • Identify data of interest. In this example, we’ll keep it simple and focus on capturing the number of cases and the number of deaths in LA county.
  • Create a new Google spreadsheet (I’m not going to go into details, but check this out if you are not familiar with creating a new spreadsheet)
  • Give the spreadsheet a name and add column headers for the data that you will be capturing
  • Open the script editor.
A screen capture showing that the script editor can be found under Tools in the main menu
You can find the script editor under Tools in the main menu.
  • Rename the function and add enough pseudocode to help you stay on task, then press cmd+s (or File, then Save) to open a prompt to name the file.
A screen capture of the script editor with an outline for the getCovidDataAndUpdateSpreadsheet function
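The screenshot isn’t reproduced here, so as a minimal sketch of that outline (the function name is from this walkthrough; the numbered steps mirror the rest of the article):

```javascript
// Skeleton with pseudocode comments to stay on task.
// The helpers hinted at here are filled in over the next sections.
function getCovidDataAndUpdateSpreadsheet() {
  // 1. Fetch the page content from the LA County site
  // 2. Extract the number of cases and the number of deaths with a regex
  // 3. Build a timestamp for this snapshot
  // 4. Append a row [timestamp, cases, deaths] to the sheet
}
```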

What is Apps Script?

If you are thinking, “Wait a minute, that looks a lot like JavaScript!” you are right. Apps Script is JavaScript enriched with libraries for G Suite applications like Gmail, Calendar, Drive, Sheets, etc. You can find an overview in the docs, here: https://developers.google.com/apps-script/overview.

Getting the site content

A quick refresher: web scraping is extracting data from a website. To extract the data, we first need to get the content. To do that, we can use UrlFetchApp. To create our own timestamp, we can use Utilities.formatDate.

A screen capture of a log showing that the content was fetched and stored in a variable.
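Since the code lives in a screenshot above, here is a sketch of those two pieces. The helper names and the date format are my choices, not the article’s; the time zone is assumed to be LA’s since that is where the data comes from:

```javascript
// Fetch the raw HTML of the page.
// UrlFetchApp.fetch returns an HTTPResponse; getContentText() gives the HTML string.
function getPageContent(url) {
  return UrlFetchApp.fetch(url).getContentText();
}

// Build a timestamp for "now".
// Utilities.formatDate(date, timeZone, format) uses Java SimpleDateFormat patterns.
function makeTimestamp() {
  return Utilities.formatDate(new Date(), 'America/Los_Angeles', 'MM/dd/yyyy');
}
```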

Extracting the data

There are a few approaches that you can take to parsing the content:

  1. XML Service: this can be used to parse HTML, but it can be cumbersome to navigate through the DOM tree. Here is an example implementing custom functions like getElementById to simplify it. https://sites.google.com/site/scriptsexamples/learn-by-example/parsing-html
  2. jQuery: even though developers are encouraged to leverage jQuery on this page, https://developers.google.com/apps-script/guides/html/best-practices#take_advantage_of_jquery, it would require importing the library or creating a Javascript.html page with scripts, similar to the example.
  3. Regex (regular expressions or RegExp): search patterns used to find one or multiple character matches in a string. Support for regex is built into JavaScript.

I opted for regex.

A screen capture showing the logs including the regex results.
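The exact pattern depends on the page’s markup, which isn’t reproduced here, so this is an illustrative sketch against an invented table fragment; the helper name and the sample HTML are mine:

```javascript
// Extract the number that follows a labeled table cell,
// e.g. "Cases</td><td>1,804" -> 1804.
function extractCount(html, label) {
  // Capture the digits (and thousands commas) in the cell after the label's cell.
  var re = new RegExp(label + '<\\/td>\\s*<td[^>]*>([\\d,]+)', 'i');
  var match = html.match(re);
  // Strip the commas and convert to a number; null if the label wasn't found.
  return match ? Number(match[1].replace(/,/g, '')) : null;
}

// Invented sample markup, for illustration only.
var sample = '<tr><td>Cases</td><td>1,804</td></tr><tr><td>Deaths</td><td>32</td></tr>';
```

Calling `extractCount(sample, 'Cases')` returns 1804, and `extractCount(sample, 'Deaths')` returns 32.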

Beware of asynchronous code and oopsies

The first regex to match the number of cases worked perfectly in https://regex101.com/, but group 2 returned empty when I added it to the Apps Script function.

Screen capture of regex101 with the regex expression and highlighted match.

Digging a little deeper, I found that the two values of interest, the number of cases and the number of deaths, are returned asynchronously after the initial page load. The values are hardcoded into a JS file, not pulled from a database 🤔. But if a user scrolls, the same values are included in a table that is available immediately on page load, meaning the data is hardcoded in two different places. (🤔🤔🤔 This design decision baffles me!…but that’s not the current topic, so moving past that.)

A screen capture showing the first data values for cases and deaths returned in a js file and the second included in a table.

If the values weren’t immediately available, then we would either need to see if accessing the API returning the data is a possibility, or delay capturing the web content until after the page finishes loading. Since the values are available in the table, we can adjust the regex to pull the data from there.

Also watch out for oopsies like a number being split by a span. I doubt this was intentional; it was probably the result of a GUI edit, but it will affect your query.

A screen shot with dev tools open showing a span splitting a number between the thousands and hundreds column.
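One way to cope with that is to strip the stray tags before matching. The markup in this sketch is invented to reproduce the problem (a span splitting “1804” into “1” and “804”):

```javascript
// Pull a number out of a cell even when a <span> splits the digits,
// e.g. "<td>1<span></span>804</td>" -> 1804.
function extractNumber(html) {
  // Drop any opening or closing span tags so the digits are contiguous again.
  var cleaned = html.replace(/<\/?span[^>]*>/g, '');
  var match = cleaned.match(/([\d,]+)/);
  return match ? Number(match[1].replace(/,/g, '')) : null;
}
```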

Saving the data to a Google Sheet

Apps Script has several helpful methods for a number of tasks, including getActiveSpreadsheet, getDataRange, and appendRow. These and many others can be found on the documentation page: https://developers.google.com/apps-script/reference/spreadsheet/spreadsheet.

A couple things to note:

  • A separate function is created to add a row to the sheet. I set it up this way so that the function could be reused, for example, to add a row to different sheets (tabs) within the spreadsheet at a future point.
  • An object is passed into the function instead of an array. This decision was made to improve readability and to make it easier to split insertRowInTracker into its own function.
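Putting those two notes together, a sketch of the helper might look like this. The function name insertRowInTracker and the object fields come from the article’s description; the exact columns and signature are assumptions:

```javascript
// Append one snapshot as a row in the active sheet.
// Takes an object rather than an array so the call site stays readable:
//   insertRowInTracker({ date: '03/31/2020', cases: 1804, deaths: 32 });
function insertRowInTracker(rowData) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  // Column order must match the headers added when the sheet was created.
  sheet.appendRow([rowData.date, rowData.cases, rowData.deaths]);
}
```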

Automate running the script

Setting up a trigger to run your function so you don’t have to press the button is a beautiful thing. A time-driven trigger (sometimes called a clock trigger) is like a Unix cron job. You can manage triggers you set up through a dashboard, and/or with code. Below is a hybrid example. First we’ll set up a function to create a trigger and run that once.
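A sketch of such a setup function, using ScriptApp’s trigger builder; the daily-at-6-PM schedule is an example choice, not something prescribed by the article:

```javascript
// One-time setup: schedule getCovidDataAndUpdateSpreadsheet to run daily.
// Run this function once by hand; the trigger persists after that.
function createTrigger() {
  ScriptApp.newTrigger('getCovidDataAndUpdateSpreadsheet')
    .timeBased()
    .everyDays(1)   // once a day
    .atHour(18)     // around 6 PM, in the script's time zone
    .create();
}
```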

Select the createTrigger function from the drop down and press the play button to run it.

To validate that it was created, you can click on the clock icon (between save and play) to open the Triggers dashboard. You should see something like this. 👇

A screen capture of the triggers dashboard showing a trigger that was created.

If you click on the pencil tool, bottom right, a modal will open with configuration and notification options—in case you want to get a notification if the script fails for some reason.

A screen capture of the open modal with options to edit the selected trigger.

Nicely done! Now take a moment to sip something and appreciate all the reminders and keystrokes that you avoided…unless you need to backfill data.

Tips to backfill your data

This is manual work, but can be helpful and is sometimes necessary.

Check the Wayback Machine. It is an archive site that captures websites as they were at a given point in time. It looks like the archiving happens, at least for smaller sites, when a user searches for the site on the Wayback Machine. Larger circles mean more hits/snapshots on that specific day.

A screen capture from Wayback Machine’s site showing days in March 2020 when the site/data was archived.

Check news articles, press releases, or social media feeds with timestamps from reputable sources for reported numbers.

May the way back be with you!


Kamie Robinson

Software Engineer rooted as a Creative. Writing how-tos and about 'ahas' and 'gotchas' of making.