Creating Efficient Data Collection Systems for SEO and Social

As marketers we only have so many hours in the day. Nothing is worse than having your team spend time on menial tasks (like data entry) when there’s real work to do. That’s why it’s crucial for managers to create efficient systems – so teams can get more accomplished and make a bigger impact.

Most data entry work is delegated far down the totem pole. Managers are removed from data entry, so making the process as efficient as possible isn’t always a priority. If you aren’t “technical,” it’s easy to accept data collection as a necessary evil and justifiable expense.

Web scraping takes the “entry” part out of “data entry.” It saves a ton of time. However, web scraping is complicated. For a lot of new SEOs, it can be extremely intimidating. It was for me.


It’s our responsibility as managers to give our team members the keys they need to be successful – no matter how complicated or intimidating the task.

Lean Process Creation

Lean manufacturing is a production methodology that treats any use of resources that doesn’t create value for the end customer as waste. When it comes to data collection, I try to eliminate four types of waste that cause inefficiency:

  1. Motion (e.g., opening up and logging into web apps to pull data)
  2. Waiting (e.g., waiting on web apps to compile and export data)
  3. Overproduction (e.g., pulling more data than needed)
  4. Overprocessing (e.g., poor process design that results in wasted time)

In agency life, value for the end customer is actionable data and insights. Anything we do for a client that doesn’t result in actionable data and insights should be minimized as much as possible or cut from the process entirely.

We’ve all worked on teams where we’ve needed to collect data on a daily basis. These tasks may seem like a small time commitment at first, but they really add up. By automating repetitive, low-intelligence tasks, you can give your team entire weeks back on their calendars. Here’s a chart from XKCD visualizing how much time you save by automating a recurring task:

[Chart: XKCD, “Is It Worth the Time?”]

If you can automate a daily five-minute task for someone on your team, they get 1.2 whole days back every year. Your team can use that time to do the things that computers can’t: critical thinking, emotional reasoning, qualitative analysis — stuff you actually want your team spending their time on!
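(The math: five minutes a day × 365 days ≈ 1,825 minutes, or roughly 30 hours – about 1.2 days.)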

Creating a Dynamic Data Collection Template

Let’s say a few big conferences are coming up, and we want to track Twitter follower acquisition as a KPI for everyone who’s presenting. Before handing off data collection to someone on your team, it’s your role as a manager to compartmentalize and automate as much of the data collection process as possible.

I can figure out the XPath for a Twitter user’s follower count pretty quickly – "//li[3]/a/strong" – so I’ll handle that part of the task myself, since it requires some background technical knowledge. I also scraped a hypothetical list of speakers who will be at the same conferences as us, along with their Twitter handles. Using that information, I set up this scraper template:

[Screenshot: Twitter scraper template, formulas shown]

Which translates to:

[Screenshot: Twitter scraper template, formula results]
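If the screenshots are hard to make out, each follower-count cell is an XPathOnUrl call from SEO Tools for Excel. Here’s a minimal sketch, assuming the speakers’ Twitter handles live in column A starting at A2 (the profile-URL pattern and column layout are my assumptions, not necessarily the exact template):

=XPathOnUrl("https://twitter.com/" & A2, "//li[3]/a/strong")

Fill the formula down the column, and each row scrapes that speaker’s current follower count.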

This approach to data entry/collection is far more efficient than pulling the numbers by hand.

Now someone on my team can open this document every day, refresh all the cells by hitting CTRL + ALT + SHIFT + F9, and then copy and paste the newly scraped data into another sheet. Over time, we’ll be able to chart some pretty insightful data, like this:

[Chart: Hypothetical Twitter follower acquisition over time]

If you use these templates, there’s one important thing to remember: Excel caches formula values. When the file is opened, Excel displays the values that were stored the last time the file was saved. To get fresh data, you have to rerun all the formulas by pressing CTRL + ALT + SHIFT + F9.

Recalculation takes about one second per scraper formula, depending on complexity. While Excel is rerunning the formulas, don’t touch anything.

When recording the data, it’s critical to copy and Paste as Values into a separate sheet. That way you’re saving the numerical data, not the scraper formulas – which could cause some headaches down the road.

A warning: it’s important not to overload Excel. If you’re planning on scraping hundreds of data points with XPath every day, do it in sections. To refresh the formulas in one section, use Find and Replace (CTRL + H) and replace = with = – this hack recalculates all the cells you select. It’s not pretty, but if your file is so huge that it crashes Excel, it may be a necessary step.

You can download an example template here.

Anyone with a basic handle on Excel can execute the remainder of this process. Systems like this are great, but XPath can be a little slow and buggy, which is why I’ve taken to scraping JSON whenever possible.

Scraping Made Simple: An Intro to Scraping JSON

JSON is a human-readable data format: it’s extremely simple to pick up and requires no coding background. However, when you look at raw JSON output from an API, it can be a little tough to decipher exactly what’s going on:

[Screenshot: raw Rotten Tomatoes API JSON output, without a JSON viewer]

But if you use a browser extension like JSONView, it’s really easy to pick up the parent-child structure of JSON.

Let’s try scraping JSON with the Rotten Tomatoes API!

[Screenshot: Rotten Tomatoes API JSON output, formatted by JSONView]

Unlike XPath, JSON is absurdly easy to scrape, and JSONView (for Chrome and Firefox) even lets you copy the JSON path with a right click.

First, let’s pick what we’re going to scrape. For this example, the infamous Rotten Tomatoes Total Rating is what we’re going after:

[Screenshot: the target field in the Rotten Tomatoes JSON output]

To scrape JSON with SEO Tools for Excel, just enter this formula structure:

=JsonPathOnUrl([cell with URL or URL],"[object]")

To get the Total count from Rotten Tomatoes, the formula is:

[Screenshot: the JsonPathOnUrl formula for the total count, and the value it returns]

If the object you want to scrape has a parent object, just enter "$.." before the object you want to scrape (h/t Richard Baxter for that pro tip):

=JsonPathOnUrl([cell with URL or URL],"$..[child]")

[Screenshot: formula using "$..title" to pull the movie title]

Which returns:

[Screenshot: the returned title]
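Putting the two patterns together: assuming the Rotten Tomatoes API URL sits in cell A2 (the cell reference is my example, not part of the original template), the formulas would look roughly like this:

=JsonPathOnUrl(A2,"total")
=JsonPathOnUrl(A2,"$..title")

"total" is a top-level object, so it needs no prefix; "title" is nested inside each movie object, so it gets the "$.." prefix.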

Collecting Facebook Data with the Open Graph API

Many APIs require authorization tokens or API keys, but a few don’t – Facebook’s Open Graph API is one of them. To access it, just replace the “www” in a Facebook URL with “graph”:

[Screenshot: JSON output from graph.facebook.com]

All of this is prime for scraping.

Use Case: Record Facebook Stats for Holiday Content

Let’s say we’re working with a big-box retailer this holiday season – and we want to track Facebook stats every day, starting two weeks before Black Friday and running through January 1.

Using the Facebook Open Graph API, it’s going to be easy to scrape and automate this task.

We want to pull:

Talking About Count: =JsonPathOnUrl(URL,"talking_about_count")
Like Count: =JsonPathOnUrl(URL,"likes")

Setting up this dynamic template is pretty straightforward: by concatenating usernames onto the end of the Facebook Open Graph API call, we can pull the stats we need:

[Screenshot: Facebook Open Graph scraper template, formulas shown]
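As a rough sketch of what those cells contain – assuming the page usernames live in column A starting at A2 (my assumption for illustration) – the formulas would look something like:

=JsonPathOnUrl("http://graph.facebook.com/" & A2,"likes")
=JsonPathOnUrl("http://graph.facebook.com/" & A2,"talking_about_count")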

Which pulls this data:

[Screenshot: the Facebook stats pulled into the template]

By collecting these KPIs over time, we can determine which big-box retailer had the most successful holiday Facebook campaign and gain insights for next year.

You can download the example template here.

Scrape Any Social Share Count

To get share metrics for all the major social channels, you can use the SharedCount API. Here’s an example API call for a post I wrote recently. To get the JSON output, go to:

http://api.sharedcount.com/?url=[URL]

and you will see a screen like this:

[Screenshot: SharedCount API JSON output]

All of this information is extremely easy to scrape. If you’re tracking daily social shares for a piece of content, you can throw it all into a template to help scale data collection:

[Screenshot: SharedCount scraper template]
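The template cells follow the same JsonPathOnUrl pattern as the earlier examples. A minimal sketch, assuming the content URLs live in column A starting at A2 (the layout is my assumption, and the key names are taken from the SharedCount output above, so treat them as illustrative):

=JsonPathOnUrl("http://api.sharedcount.com/?url=" & A2,"Twitter")
=JsonPathOnUrl("http://api.sharedcount.com/?url=" & A2,"$..total_count")

Twitter’s count is a top-level key, while Facebook’s total share count is nested, so it needs the "$.." prefix.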

Download the Shared Count Scraper Template

Creating these types of efficient systems instead of manually collecting data saves our teams time and allows us to make a bigger impact for our clients. As managers, it’s our job to help our teams scale their efforts even if they don’t have the technical prowess to do it themselves. Creating similar templates is a pathway to freeing up entire weeks of your team members’ time. Excel can be tricky, and working with new APIs can be frustrating, but once you get the hang of it, you’ll be improving your team’s efficiency dramatically.


