Scraping Schema Markup for Competitive Intelligence

Structured markup is crucial for e-commerce websites that want to stand out in the SERPs. Because e-commerce sites are generally set up to scale, their pages tend to share the same templates, which makes scraping their information very easy. All it takes is a Screaming Frog crawl and OutWit Hub.

For dropshippers and affiliate sites, harvesting competitor data from schema markup tags can be extremely useful. If you sell the same products as your competitors, you can compare pricing, product descriptions, calls to action, special promotions, or anything else, and analyze how you stack up.

Before we can start, we need to figure out where products live on the competitor’s site. If your competitor has a clearly built-out information architecture, it shouldn’t be too tough. Target.com, for example, uses the /p/ directory for its products.

[Image: example of Target.com’s information architecture]

Step 1) Crawl and Collect Product Pages

To get the pages that live under the /p/ directory, fire up Screaming Frog and, under Configuration > Include, add .*/p/.*

include p directory to snag products

Now your Screaming Frog export will only include product pages.
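If you want to sanity-check the include pattern before kicking off a crawl, here is a minimal Python sketch. It assumes the pattern has to match the whole URL (which is why it is wrapped in .* on both sides), and the URLs are made up for illustration, not real Target.com addresses.

```python
import re

# Same pattern entered under Configuration > Include.
# Assumption: the pattern must match the entire URL.
INCLUDE_PATTERN = re.compile(r".*/p/.*")

def filter_product_urls(urls):
    """Return only the URLs that the include pattern matches in full."""
    return [url for url in urls if INCLUDE_PATTERN.fullmatch(url)]

# Hypothetical URLs for illustration only.
sample_urls = [
    "http://www.example.com/p/acer-aspire-laptop/-/A-0000001",
    "http://www.example.com/c/laptops-computers/-/N-0000",
    "http://www.example.com/p/another-laptop/-/A-0000002",
    "http://www.example.com/help/privacy",
]

product_urls = filter_product_urls(sample_urls)
print(product_urls)
```

Only the two URLs that contain the /p/ directory survive the filter, which is the same effect the include rule should have on the crawl.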

So everyone can follow along and work with the same data, I’ve gone ahead and scraped all the laptops that are currently listed on the Target.com site, which you can get here:

 List of Target Laptops (09/10/2013)

Step 2) Analyze Structured Markup and On Page Elements

Take one of the product pages from your Screaming Frog export. For this example, we’ll use the Acer Aspire 11.6″ Touch Screen Laptop PC page. If you enter the URL into the Rich Snippet Testing Tool, you can see that Target uses a ton of structured markup on its product pages.

For this exercise, we’re going to scrape:

  • Price
  • SKU
  • Product Name
  • Battery Charge Life (non-schema element)
  • Call to action/Promotion (non-schema element)
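To get a feel for what the schema elements in that list look like in the source, here is a rough Python sketch that pulls itemprop values out of a page using only the standard library. The HTML is a hypothetical snippet in microdata style, not Target’s actual markup, and the ItempropParser class is my own illustration rather than part of any tool mentioned here.

```python
from html.parser import HTMLParser

class ItempropParser(HTMLParser):
    """Collect the first value seen for each microdata itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._current = None  # itemprop whose text content we are waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # e.g. <meta itemprop="sku" content="...">
            self.values.setdefault(prop, attrs["content"])
        else:
            self._current = prop

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.values.setdefault(self._current, text)
            self._current = None

    def handle_endtag(self, tag):
        self._current = None

# Hypothetical product markup for illustration only.
sample_html = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Example 11.6" Touch Screen Laptop</h1>
  <meta itemprop="sku" content="00000000">
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">$299.99</span>
  </div>
</div>
"""

parser = ItempropParser()
parser.feed(sample_html)
print(parser.values)
```

The non-schema elements (battery life and the promotion) have no itemprop to hook into, so they need a different approach based on the surrounding source code.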

Step 3) Fire up OutWit Hub


OutWit Hub is a desktop scraper/data harvester. It costs $60 a year and is well worth it. OutWit can use cookies, so scraping behind a paywall or on a password-protected site is a non-issue. Instead of making you use XPath to scrape data, OutWit Hub lets you highlight the source code and set markers to scrape everything that lies in between. If you are not a technical marketer and you find yourself wasting time collecting a lot of data by hand, this is a good tool to have in your arsenal.

 Step 4) Build Your Scraper

This may be intimidating at first, but it’s so much more scalable than trying to use Excel or Google Docs to scrape thousands of data points.

In the right-hand menu, click on Scrapers. Enter the example Target URL. This will load the source code.

Click on the “New” button on the lower portion of the screen and name your scraper. I’m calling mine “Target Laptop Scraper.”

[Image: building a scraper in OutWit Hub]

In the search box, start entering the markup for the schema tags you want to scrape. Remember, this isn’t XPath, so you don’t need to worry about the DOM. You only need to figure out what unique source code comes before the element (the schema tag) and what comes after it.

Extreme Close Up!

[Image: close-up of the scraper build screen]

It will take some practice at first, but once you get the hang of it, it will only take a few minutes to set up a custom scraper.
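If the “marker before, marker after” idea is hard to picture, here is a small Python sketch of the same logic. This is not how OutWit Hub is implemented, just my illustration of the concept; the extract_between function and the sample source code are hypothetical.

```python
def extract_between(source, before, after):
    """Return every substring of source that sits between the two markers."""
    results = []
    position = 0
    while True:
        start = source.find(before, position)
        if start == -1:
            break
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break
        results.append(source[start:end].strip())
        position = end + len(after)
    return results

# Hypothetical source code for illustration only.
sample_source = (
    '<span itemprop="price">$299.99</span>'
    "<li>Battery Charge Life: 5 hours</li>"
    '<span itemprop="price">$349.99</span>'
)

prices = extract_between(sample_source, '<span itemprop="price">', "</span>")
battery = extract_between(sample_source, "Battery Charge Life:", "</li>")
print(prices, battery)
```

Notice that the battery life has no schema tag at all. As long as the text before and after it is unique, the same approach works for non-schema elements.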

Step 5) Test Your Scraper

Once you’re done entering in the markers for the data you want to collect, hit the execute button and test your results. You should see something like this:

[Image: scraper test results in OutWit Hub]

 

 Step 6) Put the list of URLs into a .txt file and save it.

disks for saving

Any of these storage devices or your local machine will do
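If your crawl export is large, copying the URLs out by hand gets old fast. Here is a minimal Python sketch that writes one URL per line to a .txt file. It assumes the export is a CSV whose URL column is headed “Address”; check your own export and change the column argument if it differs. The file names and sample rows are made up for the demonstration.

```python
import csv
import os
import tempfile

def write_url_list(csv_path, txt_path, column="Address"):
    """Copy the URL column of a crawl export into a one-URL-per-line text file."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        urls = [row[column].strip() for row in csv.DictReader(src) if row.get(column)]
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(urls) + "\n")
    return len(urls)

# Demonstration with a tiny hypothetical export written to a temp folder.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "crawl_export.csv")
txt_path = os.path.join(workdir, "product_urls.txt")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Address", "Status Code"])
    writer.writerow(["http://www.example.com/p/laptop-one", "200"])
    writer.writerow(["http://www.example.com/p/laptop-two", "200"])

count = write_url_list(csv_path, txt_path)
print(count, "URLs written to", txt_path)
```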

 Step 7) Open the .txt file in Outwit using the file menu

If you go to the left navigation, just under the main directory, there is a subdirectory called “Links.” Click on Links in the left-hand nav. This is what you should see:

a list of links from outwit to scrape

Select all the data using Control+A, then right-click the row with all the URLs.

Step 8) Fast Scrape!

scraping tons of schema with outwit

In the right-click menu, select Auto-Explore > Fast Scrape (Include Selected Data), then select the scraper we just built together.

Here’s a video of the last step in Outwit

Step 9) Bask in the glory of your competitor’s data

 scraped pricing data from target using outwit

In the left-hand navigation, there is a category called “data” with the subcategory “scraped.” That’s where all your data is stored, in case you navigate away from it. Just be careful not to load a new URL in OutWit Hub, or the data will be overwritten and you will have to scrape all over again.

You can export your data to HTML, TXT, CSV, SQL, or Excel. I generally go for an Excel export and use a VLOOKUP to combine the data with the original Screaming Frog crawl from Step 1.
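For anyone who would rather do the VLOOKUP step outside of Excel, here is a rough Python sketch of the same join. It assumes both data sets have been loaded as lists of dictionaries that share a “url” field; the field names and rows are hypothetical.

```python
def vlookup_join(crawl_rows, scraped_rows, key="url"):
    """Attach scraped fields to each crawl row that shares the same key."""
    lookup = {row[key]: row for row in scraped_rows}
    combined = []
    for row in crawl_rows:
        merged = dict(row)
        # Rows with no scraped match keep only their crawl fields.
        merged.update(lookup.get(row[key], {}))
        combined.append(merged)
    return combined

# Hypothetical rows for illustration only.
crawl_rows = [
    {"url": "http://www.example.com/p/laptop-one", "title": "Laptop One"},
    {"url": "http://www.example.com/p/laptop-two", "title": "Laptop Two"},
]
scraped_rows = [
    {"url": "http://www.example.com/p/laptop-one", "price": "$299.99", "sku": "00000001"},
]

combined = vlookup_join(crawl_rows, scraped_rows)
for row in combined:
    print(row)
```

Like a VLOOKUP, this keeps every row from the crawl and fills in scraped values only where the URL matches.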

Got any fun potential use cases?

Share them below in the comments!




14 thoughts on “Scraping Schema Markup for Competitive Intelligence”

  1. Chris Le says:

    Just a warning. There was one time, at this one company I used to work for *ahem* we ran into a client whose GA visits went bonkers for one day. It was discovered by the account manager and analytics manager.

    We dug into GA and noticed it was coming from one area in Boston, with a 1024×768 browser (which nobody actually uses), to one page. That page turned out to be their pricing page.

    a) we notified the client the time and day of the traffic spike
    b) their IT guys looked into their server logs
    c) they cross-referenced the area in Boston to their competitors. It was one of their main competitors.
    d) found the IP range in the logs and banned them forever

    Word of warning: tread carefully. You might get caught.

  2. Ian Howells says:

    Ohhh… I like it. Immediately, I’m thinking I can use this to scrape multiple sites to combine the reviews for a single product since stuff like item model numbers for laptops will be the same regardless of the retailer. Scrape a bunch of the top sites, pull down all the reviews, combine into one database and launch your own aggregated review page (with an affiliate link, naturally).

  3. Rick Backus says:

    This is an awesome post John-Henry. I might have to give OutWit a shot.

  4. John-Henry Scherck says:

    Hey Chris, good call – It’s important to note that Outwit can and should be hooked up to a proxy server.

    The FastScrape option in Outwit doesn’t actually load the page, so it doesn’t cause GA to execute and track the visit, but if they had server side analytics/watched their log files like a hawk – they could see it. That’s where proxies come in handy… but I think you probably know more about that than I do ;)

  5. John-Henry Scherck says:

    Hey Ian, the one issue I’ve had with reviews in OutWit is that if they generate dynamically, I can’t use the Fast Scrape option (because the page doesn’t actually load in the OutWit browser). It gets a little more manual if you want to do reviews, but it’s easily done. Happy scraping, don’t do anything I wouldn’t do :)

  6. John-Henry Scherck says:

    Thanks Rick, glad you liked it, it can be kind of tricky at first and it took me sitting down and watching some video tutorials. Let me know if you get it, I can send you some great resources that will have your team up and running in no time.

  7. Adam Vanderbush says:

    I used to have a startup company called Mass Vector that did this very thing, but made suggestions for pricing optimization. Chris Le is correct about being found out. The problem with Screaming Frog is that it will spider out data way too fast. What you could do instead is use the same technology but run it through the Tor network and rotate the IP address on a regular basis, making it much more difficult to be noticed.

  8. Awesome post JH! I’ve been playing with the SEO Tools for Excel Beta which supports multiple proxy IPs and distributes requests asynchronously. Looks like those xPath parsing problems have been fixed, too! Thanks again – I didn’t know about outwit. Great work!

  9. John-Henry Scherck says:

    Thanks Richard! I had no idea the newest version of SEO Tools for Excel was going to allow for multiple proxies. That is going to be extremely fun to play with :)

  10. Alex Johnson says:

    Thanks for the post. Interesting insight into ways to utilize this tool. I like Ian’s ideas for the ‘reviews’ topics, as that could definitely provide some really strong content for users to peruse through on a site (aka Google likes you more = Page 1 above the fold please and thank you).

    Thanks for the heads up on being found out too … always nice to grab the info needed and go unnoticed in doing so.

  11. Simon says:

    Funny I was looking for something very similar recently. If only screaming frog dumped onpage content as well as everything else it does.

    Does anyone know of a free alternative to outwit?

  12. bestseowiki says:

    hi John-Henry Scherck
    Enjoyed every bit of your blog post. Much thanks again.
    Awesome.

  13. Joe Robison says:

    Thanks so much for the OutWit tip – would this be a suitable replacement for Python/Pyscape/Scrapy scraping? I’ve spent a good amount of time starting to learn Python scraping but couldn’t get past the technical hurdles…

  14. Hi Joe,

    This is suitable for replacing Python-based web scraping, but it won’t scale to the same level. Python allows for true automation, while this only automates part of the process.

    Thanks!

    JH
