A Designer’s Guide to Scraping Data from HTML Pages with Artoo


Gathering real data for your digital designs can be a challenge. Without access to the client’s data, we’re often on our own, reliant on bad mock data or Lorem ipsum. Artoo can help by scraping HTML pages for you.

Artoo is a neat little web scraper that lets you target data on a webpage and return only the data you’re interested in. It’s quite similar to using jQuery but has some additional features. Artoo does require some knowledge of CSS and jQuery selectors, but those are great tools to know, and they carry over well to your implementation toolkit.

How it Works

Artoo is essentially a function you can run in the console of the web browser. Its goal is to return data from an HTML structure given a specific set or sets of targets defined by the user (e.g., return the data from the HTML table on the page).

You can also get more specific. For example, you can specify only the third table with class name my-table, or only the last character of the last word of the second-to-last row and column of each table. With the power of the selectors, pulling the content you’re looking for should be attainable quickly.
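
To make that kind of targeting concrete, here is a minimal sketch in plain JavaScript. It uses the browser's standard querySelectorAll rather than Artoo itself, and the helper names nthMatchText and lastCharOfLastWord are hypothetical ones I made up for illustration. The my-table class comes from the example above.

```javascript
// Hypothetical helper: text of the n-th element (1-based) matching a selector.
// `doc` is anything with a querySelectorAll method, e.g. the page's `document`.
function nthMatchText(doc, selector, n) {
  const matches = doc.querySelectorAll(selector);
  const el = matches[n - 1];
  return el ? el.textContent.trim() : null;
}

// Hypothetical helper: last character of the last word in a piece of text.
function lastCharOfLastWord(text) {
  const words = text.trim().split(/\s+/);
  return words[words.length - 1].slice(-1);
}

// In the browser console (skipped when no page is available):
if (typeof document !== 'undefined') {
  // Text of the third table with class "my-table", or null if there is none.
  console.log(nthMatchText(document, 'table.my-table', 3));
}
```

Once a selector like this returns what you expect, the same selector string can be handed to Artoo.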

The first step is to identify how to find all instances of the data we’re looking for. Inspect the element in the browser and look at the raw HTML of the page. Does every piece of data we want share a class name? If so, we can tell Artoo to find everything on the page with that class name.

If the HTML has no uniquely identifying characteristics, you’ll have to get a little creative. Look at the structure of the HTML. In the example below, all the product names appear in the first column of every row. Therefore, we should be able to select the first cell of each row.

To accomplish this, we tell Artoo to output all the data in the table, then narrow the output—refining the selectors to only output the first column of each row. I rarely get the intended output the first time, but with a few trials, I can come up with a solution to meet my needs.
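
Since narrowing a selector usually takes a few tries, it can help to preview several candidates side by side before settling on one. The sketch below does that with the standard DOM API instead of Artoo, so it only shows what each selector matches. previewSelectors is a hypothetical helper name, and the two selectors are the broad and narrow versions described above.

```javascript
// Hypothetical helper: for each selector, report how many elements match
// and show the trimmed text of the first few, to compare broad vs. narrow.
function previewSelectors(doc, selectors) {
  return selectors.map((selector) => {
    const texts = Array.from(
      doc.querySelectorAll(selector),
      (el) => el.textContent.trim()
    );
    return { selector, count: texts.length, sample: texts.slice(0, 3) };
  });
}

// In the browser console (skipped when no page is available):
if (typeof document !== 'undefined') {
  console.table(
    previewSelectors(document, ['tbody tr td', 'tbody tr td:first-child'])
  );
}
```

If the narrow selector's count equals the number of rows in the table, that is a good sign it is picking exactly one cell per row.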

An Example

On a recent project, we were trying to gather some realistic data: a list of facilities that produced a specific product. Unfortunately, the HTML of the client’s website had a less-than-ideal structure, and the content lacked visual hierarchy. This made manual copy-and-paste out of the question, and getting access to the data would have taken some time. Fortunately, I remembered Artoo.

I’ll show you what I did, using the Cboe market data for final settlement prices. Our goal will be to list out the names of the products.

  1. Add the Artoo script to your bookmarks (see quick start guide).
  2. Navigate to the page with the data you’re looking to gather, and open the Console via the web browser’s inspector.
  3. Click the Artoo bookmark to start Artoo.
  4. Enter your first command: type artoo.scrape('tbody tr td') into the browser console.
  5. We get back data of the table, but it’s not exactly what we were looking for, nor is it in a usable format. We can add an additional selector such as td:first-child to select the first td of each tr within tbody.

  6. Enter artoo.scrape('tbody tr td:first-child') in the console.

    This is exactly what we were looking for.
  7. Now with some extremely minor text editing, we can remove the keys and quotes from the data structure, and we can easily start using this data in our layouts.
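
The manual cleanup in the last step can also be scripted. This is a minimal sketch, assuming the scrape result is an array whose items are either strings or plain objects with text values; I have not confirmed the exact shape Artoo returns for every call, so check your console output first. toPlainLines is a hypothetical helper, and the product names are placeholders, not real Cboe data.

```javascript
// Hypothetical helper: turn scraped items into one plain-text line each.
// Assumes every item is a string or a plain object of text values.
function toPlainLines(items) {
  return items
    .map((item) =>
      typeof item === 'string' ? item : Object.values(item).join(' ')
    )
    .map((text) => text.trim())
    .filter((text) => text.length > 0) // drop empty cells
    .join('\n');
}

// Placeholder data standing in for a scrape result.
const sample = [' Product A ', '', { name: 'Product B' }];
console.log(toPlainLines(sample));

// In the browser, with Artoo loaded, something like this should work:
// console.log(toPlainLines(artoo.scrape('tbody tr td:first-child')));
```

The output is a plain list with no keys or quotes, ready to paste into a layout.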

This is a pretty simple example, but it showcases how you can leverage scripting to save time getting real data for your designs. The end goal would be to pull this data directly into your design tool with something like react-sketchapp and go from sketching to visual design without copying and pasting.