
Scraping for content marketing ideas & research

Published on 7th March 2016

“Scraping” is a way of automating or scaling the process of gathering information from different websites on the Internet. It’s a bit of a staple in the SEO skill set because of the everyday need for quality assurance, bug testing and SEO diagnostics.

There are some really good articles on scraping, ranging from technical search applications, like this data scraping guide for SEO, to more social data and analytics focused guides like this and this.

With the exception of our very own content strategy helper, I don’t see many posts that talk about scraping applications for content marketing research.

In this post I’m going to talk about the basics of scraping and give some examples of how scraping can be used for research purposes in content marketing projects.

How To Scrape – The Very Basics

The trick with scraping is to have a basic understanding of how a web page’s mark-up is laid out. Combine that with an understanding of the XML path language known as XPath, plus a few tools to help extract data, and you can get started in minutes: Chrome plugins and tools are freely available and simple to use.

Install Scraper for Chrome

Start by installing Scraper for Chrome. It’s an incredibly simple plugin that works a little like this:

[Image: scraper-how-to]

Look at the Scraper dialogue. Under “Selector” you can see some XPath:

//h2/a

Put simply, XPath is a language for selecting elements in a hierarchy. You follow that hierarchy from the top down, which is how you derive the syntax. Take a look at the example above: the XPath selects the contents of all “a” containers that are nested in an “h2” container. In the context of my author page, that’s all post titles and links featured on the page.
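
To make that concrete, here’s a minimal Python sketch of the same selection. It assumes the selector in question is //h2/a, and it runs against a made-up, well-formed snippet rather than a real page. Python’s built-in ElementTree only supports a subset of XPath and writes a leading // as .//, so treat this as an illustration of the idea rather than a stand-in for the Scraper plugin.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for an author page.
SAMPLE_PAGE = """
<html>
  <body>
    <div>
      <h2><a href="/post-one">Post one</a></h2>
      <p>Intro with <a href="/elsewhere">an unrelated link</a>.</p>
      <h2><a href="/post-two">Post two</a></h2>
    </div>
  </body>
</html>
"""


def select_heading_links(markup):
    """Return (text, href) for every <a> that sits directly inside an <h2>."""
    root = ET.fromstring(markup.strip())
    # ".//h2/a" is ElementTree's spelling of the XPath "//h2/a".
    return [(a.text, a.get("href")) for a in root.findall(".//h2/a")]


if __name__ == "__main__":
    for title, link in select_heading_links(SAMPLE_PAGE):
        print(title, link)
```

The link inside the paragraph is skipped because it isn’t nested in an h2, which is exactly the top-down filtering described above. Real pages are rarely well-formed XML, so for live HTML you’d want a proper HTML parser.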

In a more complex web page, it might pay to look closely at the mark-up and style classes used in the containers. Let’s switch to Web Developer Console in Chrome and have a closer look:

[Image: markup-close-up]

The h2 has a class attribute set to “post_title”.

[Image: modified-xpath]

This is a useful thing to know, because you can select all containers that use this style class without traversing through a number of unstyled containers.
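
As a rough sketch of what class-based selection looks like, the Python below filters h2 elements by their class attribute before picking out the links. The markup, the “widget_title” class and the function name are all invented for illustration; only “post_title” comes from the page above. The predicate used here matches the attribute value exactly, so an element carrying several classes would need a fuller XPath engine and something like contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two post titles and one unrelated heading.
SAMPLE_PAGE = """<html><body>
  <div><div>
    <h2 class="post_title"><a href="/post-one">Post one</a></h2>
  </div></div>
  <h2 class="widget_title"><a href="/signup">Newsletter</a></h2>
  <h2 class="post_title"><a href="/post-two">Post two</a></h2>
</body></html>"""


def select_post_titles(markup, css_class="post_title"):
    """Return (text, href) for links inside <h2> elements with the given class."""
    root = ET.fromstring(markup)
    # Equivalent in spirit to the XPath //h2[@class='post_title']/a
    query = f".//h2[@class='{css_class}']/a"
    return [(a.text, a.get("href")) for a in root.findall(query)]


if __name__ == "__main__":
    print(select_post_titles(SAMPLE_PAGE))
```

Note that the first post title is found even though it sits inside two extra div containers; the query never has to spell out that nesting.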


There’s a useful discussion about selecting elements by their CSS style attributes on this Stack thread. XPath is much more powerful than this, too. If you’re interested in learning more, have a look at this quick start reference to some surprisingly powerful functions.

There are a few more tools, pretty standard in the SEO world, that are absolute must-haves for data collection too.

Screaming Frog

The Frog is the best SEO crawler your money can buy.

I don’t need to write a great deal about it, because surely by now you all know this. What not everybody knows is that a few useful features in Screaming Frog let you, should you wish, scrape quite a lot of data from a small to mid-size website, or from a large one if you’re careful.

Scraping with Filters: XPath, CSS Selectors and Regex

Screaming Frog has two massively powerful features if you’re going to use it for data extraction: Include/Exclude URLs and the XPath/Regex/CSS Selector based scraper.

Include URLs is really easy to configure, and gives you quite a lot of control over which pages you’d like to crawl on a site. The example below crawls only the category pages linked to via the Women’s clothing category on asos.com:

[Image: page-category-crawl-frog]
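
If you’d like to sanity check your include / exclude patterns before starting a crawl, a few lines of Python will do it. This is only a sketch of the general idea, not how Screaming Frog works internally: the example.com URLs and both patterns are made up, and I’m assuming each pattern has to match the whole URL, so check the tool’s own documentation for its exact matching rules.

```python
import re


def filter_urls(urls, include=(), exclude=()):
    """Keep URLs that match any include pattern and no exclude pattern."""
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for url in urls:
        # With no include patterns, every URL is a candidate.
        if include_res and not any(r.fullmatch(url) for r in include_res):
            continue
        if any(r.fullmatch(url) for r in exclude_res):
            continue
        kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    candidates = [
        "https://example.com/women/dresses/",
        "https://example.com/men/shirts/",
        "https://example.com/women/shoes/?page=2",
    ]
    print(filter_urls(candidates, include=[r".*/women/.*"], exclude=[r".*\?.*"]))
```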

Several site architecture data collection ideas spring to mind with this feature, but today we’re more interested in scraping.


Let’s try to catalogue how many products asos.com have in some of their product categories:

[Image: asos-extract]

Where:


Obviously the data collected would need a tidy up, but you can see the principles at play. Very simple data extraction, all contained in your trusty desktop crawler.
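
The tidy up itself can be very small. Here’s a hedged sketch that pulls the first whole number out of each scraped string. The sample strings are invented, as I’m only guessing at the kind of text an extraction like this might return.

```python
import re


def extract_count(raw):
    """Return the first integer found in a scraped string, or None if there is none."""
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))


if __name__ == "__main__":
    # Hypothetical scraped values, for illustration only.
    for raw in ["1,234 styles found", "  87 items ", "No results"]:
        print(repr(raw), "->", extract_count(raw))
```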

The only limiting factor to your desktop crawl is allocated memory. Making sure your include / exclude rules are working helps an awful lot, but after that you could just increase the amount of memory in use, if your machine has some to spare!

Here’s the config file location to set a larger amount of memory: C:\Program Files (x86)\Screaming Frog SEO Spider

[Image: modify-screaming-frog-memory]

If you do need to increase your memory allocation, you’ll run into issues unless you’re running 64-bit Windows and 64-bit Java. It’s pretty easy to check which version of Java you have, and uninstall it if necessary, from Add/Remove Programs.

You can also set a reduced crawl limit (play nice and you won’t upset services like Cloudflare).

Scraping with Screaming Frog is awesome and a skill well worth learning. Nate Plaunt shared this image of his project which looks like an impressive combination of Regex and XPath:

[Image: Nate Plaunt’s scraping project]

Google Docs

As fun as using Screaming Frog is for crawling and scraping, sometimes it’s not necessary. Simple is always better, and lots of fun things we’ve made for the community have been based in Google Docs, using the =importXML() function.

Everyone’s written about this feature before, so all I’ll say is that if you want to understand the basic arrangement of an =importXML query in Google Docs, read this post or copy / paste the query from below:


For a really impressive example of functions like =importXML() in action, take a look at Danny’s Link Reclamation Tool.

Extract JPG Image URLs on Any Topic from Reddit Search

With the basics of scraping more or less covered, what can you actually do with scraping?

As you know, we’ve covered a lot of content marketing and ideation concepts on the blog this year. Reddit continues to be such a goldmine for ideas and inspiration. Often we focus on extracting insights and raw ideas, but you could extract raw materials too, like images:

Try this Reddit search: https://www.reddit.com/r/Polaroid/search?q=cat+url%3Ajpg&sort=relevance&t=all
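
If you want to generate searches like this for other subreddits and keywords, the URL is easy to assemble. Below is a small Python sketch that reproduces the link above. The function and parameter names are my own, and it simply assumes Reddit keeps accepting the same q, sort and t query parameters.

```python
from urllib.parse import quote, urlencode


def reddit_search_url(subreddit, keyword, url_contains="jpg"):
    """Build a subreddit search URL that filters on a string in the shared URL."""
    query = urlencode({
        # "url:jpg" asks for results whose shared URL contains "jpg".
        "q": f"{keyword} url:{url_contains}",
        "sort": "relevance",
        "t": "all",
    })
    return f"https://www.reddit.com/r/{quote(subreddit)}/search?{query}"


if __name__ == "__main__":
    print(reddit_search_url("Polaroid", "cat"))
```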

Reddit search has a suite of powerful operators, including a URL: command, which delivers results that contain a string in the shared URL. That’s fabulous for specific file types, like images. Once you’ve found a subject of choice, extract with this XPath:


Pro tip: the scraper extension has presets functionality. You can save your XPath, navigate to a new set of search results and re-run the query without having to re-enter it.


Find Average Salaries for Graduate Jobs Added on Reed


Reed is a huge jobs database in the UK, with a lot of job ads across lots of different verticals. As we know, the jobs market is of continual interest, especially for recent graduates. A content idea like comparing the average salary for a new graduate job in London with that in a different city might make for interesting reading:

Graduate Jobs in London

Vs.

Graduate Jobs in Birmingham

Now, the data’s not perfect here but it could be cleaned up easily. Look in the HTML:

  • £35,000 – £55,000 per annum


That data is easily extracted:


I’m not sure of the answer so, if anyone wants to work that out, go ahead!
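
For anyone who does fancy working it out, the clean-up could look something like this Python sketch. It turns salary strings in the “£35,000 – £55,000 per annum” format into numbers and averages the midpoint of each range. Apart from that one string, the sample values are invented, and using the midpoint is just one reasonable assumption about how to treat a range.

```python
import re
from statistics import mean


def parse_salary_range(text):
    """Return (low, high) in pounds from text like '£35,000 – £55,000 per annum'."""
    amounts = [int(m.replace(",", "")) for m in re.findall(r"£(\d[\d,]*)", text)]
    if not amounts:
        return None  # e.g. "Competitive salary"
    return min(amounts), max(amounts)


def average_midpoint(salary_texts):
    """Average the midpoint of every parseable salary range, or None if there are none."""
    ranges = [r for r in map(parse_salary_range, salary_texts) if r is not None]
    if not ranges:
        return None
    return mean([(low + high) / 2 for low, high in ranges])


if __name__ == "__main__":
    # Only the first string comes from the page above; the rest are hypothetical.
    scraped = ["£35,000 – £55,000 per annum", "£20,000 per annum", "Competitive salary"]
    print(average_midpoint(scraped))
```

Run it once per city on the extracted salary column and you have your comparison.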

Compare the Price of Cider at Asda vs Waitrose


This is a topic close to my heart. In the middle of summer, when you absolutely must have a bottle of cider in your hand, where are you going to head? Waitrose or Asda?

Waitrose Cider Page

The XPath:


vs

Asda Cider Page

The XPath:

Other Examples / Inspiration

This whole post was really inspired by this Reddit thread, where a contributor had hired someone on Upwork to manually collect the data for the post. The format wasn’t great but the post did exceptionally well.

Similarly, this BuzzFeed post discussing how much your rent has increased since 2007 really shows the power of historic data, something we all have access to in our day-to-day jobs.

Responses

  1. I use Screaming Frog every day but have never heard of the Web Scraper Chrome extension until now. I can already think of 100’s of uses for it though! Screaming frog is great for internal SEO issues, but I’ve been looking for something to help directly scrape some weird Google results so think this is perfect. Thanks for the detailed post Richard.

  2. Hi,
    very useful article, but I think you’re linking to the wrong Chrome extension.
    The link is to ‘Web scraper’ but looking at the screenshots you are using ‘Scraper’.

    I think the Chrome Web store link should point to https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd

  3. Thanks for the catch – I’ll update later. My mistake!
