Playing Around with ImportXML in Google Spreadsheets

Put me in front of a Mac and it’s almost as if I never learned to use a computer. Put me in front of Google Spreadsheets and all of the time I’ve spent working with Excel feels a little like time wasted, and not in a good way. I’m just not very used to a spreadsheet that isn’t Excel.

Unafraid of a challenge, I recently decided to give Google’s (exceptional) importXML, importFEED and importHTML functions a try – the ability to fetch information from the web to retrieve the data you need. Mostly to make an interesting blog post, but partly out of envy that Excel doesn’t have this function.

It’s frustrating trying to get XML data into Microsoft Excel – unless you’ve got the time and patience to build some basic Macros or VBscript for your requirements. With Google Docs, it’s really easy.

A few resources

If you want to use Google Docs to extract data from the web, it would be a good idea for you to learn a little xPath. “XPath is used to navigate through elements and attributes in an XML document”, or, in simple terms, you can use xPath to fetch little bits of data contained in structured elements like <span>, <div> or links or pretty much anything, really.

Also, there are a few people who have been doing this a while, and probably have sample spreadsheets that blow some of the examples below away – but you have to start somewhere, right? If you’re already an importXML / Google Docs Ninja, maybe go and find something else to do instead of reading this post.

If you’re interested, I made a Google Docs Spreadsheet with all of the examples below:

http://bit.ly/9Fs7aF

Does anyone know?

“Does anyone know” is such an interesting search on Twitter – just combine that query with a keyword, like “restaurant” and a location for everyone on Twitter looking for a very specific, thing. Great if you happen to be trading in that thing.

anyone know

Try a query like this to pull through results from the Twitter search RSS feed:

=Importfeed(“http://search.twitter.com/search.atom?q=+restaurant+%22anyone+know%22+london+OR+manchester+OR+birmingham”)

anyone-know-data

Twitter followers

A nod to Steven Foskett for this one, and particular kudos for the mention of vCard, the query for LinkedIn connections, Klout score and Alexa Rank. Nice!

Try this query: =importXML(“http://twitter.com/[your-username]“,”//span[@id='follower_count']“)

Which will give you the number of followers you have on your Twitter profile. I added together the total followers that my SEO team have (that’s the three of us) for kicks. I wonder how long it will be before someone totals up all followers counts for all UK agencies? I wonder if there’s a correlation between that data and turnover :-)

follow SNC on Twitter

Pull price data from the web

I think that, after some mild haranguing, Will might have purchased himself a pair of Etymotic headphones. Perhaps my pitch would have gone slightly more efficiently with a little xPath and Google Product search:

For something like this, a way smarter approach to get pricing data from Amazon would be to use their API – but you get the point with this brief example.

Get all of your (competitors) URLs from their sitemap

Try something like this: =ImportXML(“http://www.yourcompetitordomain.com/sitemap.xml”,”//url/loc”)

I mentioned doing this with Excel to find orphaned pages, but you can have a lot more fun with importXML. For one, theoretically you could go off and fetch all keywords contained in the <title> tag of each of the URLs – an instant keyword strategy!

url list from a sitemap file

Pull link data from Blekko

With a query like this: =ImportXML(“http://blekko.com/ws/http://seogadget.com/+/links+/rss”,”//link”)

Blekko is everyone’s favourite new SEO tool, and fair enough, it is quite cool. As Blekko are happy to push their data out via RSS, we’re able to pull this data into our spreadsheets with ImportXML (to be fair this is really easy with Excel, unless you’d like to create multiple columns with different domain queries.

blekko pushes their data out via RSS!

More Blekko – link data tables

Blekko have a feature that allows for a pretty insightful breakdown of their SEO data on your domain. If you want to pull some of that through in to Google Docs, no problem:

link data

Try this query: =importhtml(“http://blekko.com/ws/www.smashingmagazine.com+/seo”,”table”,7)

Have fun

This wasn’t a particularly “advanced” post – I did quite enjoy the thought of what to do next with this data, though. Fetch IP addresses, WHOIS details, root domain links or keyword research data with Google Suggest, the Alchemy API, or plain scraping your competitor home pages. If you’re using importXML, I’d really like to hear how.

Anyway, as I mentioned earlier, please feel free to take the queries from here: http://bit.ly/9Fs7aF – improve them, and let me know what you did.

A little update

I got in touch with my friend Tom from Distilled to see if he wanted to contribute. He’s been out in Vegas, but came back with a tip to solve the problem of Google caching a result for around two hours at a time:

Google docs will cache a URL for ~2 hours and so if you want to crawl a URL more often than that then you need to add a modifier to the URL.

I use int(now()*1000) to generate a unique timestamp and then add that into the URL in a dummy query string. E.g.

http://www.google.com/search?q=seattle+seo+consulting&pws=0&gl=us&time=1354333

The search results won’t change when you change the time value but Google docs will treat it as a fresh URL and crawl it again.

Also – you can do lots of amazingly fancy things using Google Scripts (kind of like macros for google docs) but don’t have a huge amount of time to go into detail about that now!

Well, hopefully Tom will have time soon – thanks for contributing!

Image credits: Tech109



Stay Updated: Sign Up for Webinar & New Blog Alerts

27 thoughts on “Playing Around with ImportXML in Google Spreadsheets

  1. Sam Hamilton says:

    Not trying to stick up for MS but importing XML data into Excel especially Excel 2007+ is easy – http://office.microsoft.com/en-gb/excel-help/import-xml-data-HP010206405.aspx

  2. Hi Richard,

    nice article pretty straight forward but still good to get some ideas of what you can do. And you can always export to Excel.

    Have you taken a look at the Google refine product? i have been playing with it but a lack memory is causing me issues its quiet good at quickly filtering data or looking for trends and you can pull data into it as well.

    Something else to have a look at is DataSift (from the team at TweetMeMe) as that looks to open up a lot of twitter mashing possibilities.

  3. Hey Matt – definitely. I also think there’s a ton of milage in Yahoo Pipes (which, unless I’m mistaken will happily export xml which can be imported into Google docs). I’ve got a few macros and VBscripts to do these things in Excel but it’s quite amazing how much easier it is in Google Docs. Horrible software, mind you.

  4. Hey Sam,

    Not that easy – if you want to form multiple columns, concatenating different queries to form varying URLs for the appropriate XML response it is still a bit of a pain! You have to create a data file and it’s such a mess around compared to Google Docs. If you have an example though – upload the file and let’s take a look. I’d be delighted to learn!

  5. cart2mobile says:

    Thanks for this update on Google spreadsheets. I wasn’t aware of “Does anyone know?”. Therefore this was really of great help.

  6. Ben Joven says:

    Wow! Now if we can only get Google Docs to make calls to the Twitter servers that would be great!

  7. James Morell says:

    Again, not sticking up for MS but I found the XML data tool excel add in really useful over the past couple of weeks: http://office.microsoft.com/en-us/excel-help/create-an-xml-data-file-and-xml-schema-file-from-worksheet-data-HA010263509.aspx

  8. Jemima says:

    I’m a bit of a fan of that twitter fan count – do you know if it’s possible to do the same for facebook pages, perhaps based on the page id?

  9. Is there a way I can extract from the serp for a keyword phrase?

  10. Tristina says:

    xml to google spreadsheet nice article let me try do download competitors url.. thanks for sharing..

  11. Matt says:

    Hey Richard; thanks for an inspiring post but do you mind sharing the query you used to create the columns in the Google Product Search example?

    Thanks!

  12. Matt says:

    never mind; I overlooked the link to your GDoc at the bottom of the post.

    Thanks

  13. Red says:

    I coincidentally tried a few of these a few wks ago. I generated a sitemap for a site, stripped out everything until the urls were left in excel.

    I then scraped the urls for tag and meta description details which all worked well…

    The only thing is the site has 5000 pages +

    I was hoping that I could somehow mass edit some title tages in a marathon manner,,,

    Unfortunately GDocs doesn’t support much pasting into the spreadsheet and only supports 50 =importxml queries…

    Is there anyway to use GDOCS to ref the XPATH code to then create an follow like instance that will affect a sequence of say 500 cells in a column? Otherwise it’s pointless and I’ll have to learn php, RoR +regular expressions – and I don’t want to do that yet. Life is too short!

    Whilst I’m here – does anyone find the XPATH tools at liquidXML any good for these SEO scraping functions?

  14. Mihai C. says:

    I am tring to use the function importxml() but without succes.

    Maybe you can help me. I want to extract a currency exchange rate from xml file, the EUR figure only: http://www.bnr.ro/nbrfxrates.xml

    Nothing works… :( =importXML(“http://www.bnr.ro/nbrfxrates.xml”,”//DataSet/Body/Cube/Rate['EUR']“)

    Any ideas?

  15. WMG says:

    Hmmm does anyone know if GDocs is being flaky for scraping nowadays? Just tried doing a lil GDocs scraping project that I created months ago.

    I’m scraping the SERPS using importXML. Now I get the serps results in GDocs – but when I paste these into excel it does something weird and encodes everything.

    It used to work a treat a few months back. I could paste the cells into excel exactly as the GDocs spreadsheet displayerd them. Now it seems to concatenate url results and add weird encoded characters – I’ve tried paste special etc – is GDocs defunct for scraping now?

    eg. Eg.

    #VALUE! #VALUE! http://www.markosweb [dot] com/www/forex-handel-online.blogspot [dot] com/ http://www.freeadsboard [dot] com/index.php/topic/134024-everything-for-forex-handel/

    Anyone know how to bypass this?

  16. WMG says:

    I found that other people are having issues too:

    http://www.google.com/support/forum/p/Google%20Docs/thread?tid=19733fc7fb48ecd5&hl=en

    There’s a few tidbits there for anyone seeking help – not sure how useful these are as yet

  17. Ryan Boots says:

    I’ve found this to be enormously useful. However, when I couldn’t find any online string builder to help build the importXML strings, I decided to create my own.

    http://www.xpathbuilder.com/

    It’s still very much a work in progress, so I’d love some feedback for ideas for future improvements.

  18. Jeremy says:

    A simple (I use that loosely) would be to create multiple spreadsheets.

    1 Spreadsheet would use =importrange which pulls in data from other spreadsheets. The other spreadsheets would use the =importxml to get the actual data you want.

  19. Wikiopens says:

    You should learn xPath to get more infomation you need

  20. Red says:

    @Ryan Boots Just had a play with that xpathbuilder – really neat and intuitive – it’s a bit like something I created for Google Docs bulk scraping. Does the BING search return 100 results? Re. =importxml(“http://www.bing.com/search?q=kiss+my+ass&count=100″, “//div[@class='sb_tlst']//h3//a/@href”)

    Any way to bring back 1000 results in BING?

  21. Saul says:

    Hi Ryan, great post, but have you noticed the xpath query you use to grab the twitter followers is not working?

    I was using it myself, but ever since a few weeks ago, googledocs says “The xpath query did not return any data”…

    Do you know what is going on or how to update the query so that it works?

    Thank you!

  22. Maire says:

    Thanks so much for this post.

    I wanted to set a financial spreadsheet to help my daughter pick out some “safe” stocks with good dividend yields. Here are my queries.

    Append stock ticker (add ?hl=en if you are in a non-US locale): http://finance.yahoo.com/q/ks?s=

    Different screen scrapers: //tr[td/text()[contains(.,'Forward Annual Dividend Yield')]]/td[2] //tr[td/text()[contains(.,'Revenue Growth')]]/td[2] //tr[td/text()[contains(.,'Earnings Growth')]]/td[2] //tr[td/text()[contains(.,'Current Ratio')]]/td[2] //tr[td/text()[contains(.,'PEG Ratio')]]/td[2] //tr[td/text()[contains(.,'Return on Assets')]]/td[2] //tr[td/text()[contains(.,'Return on Equity')]]/td[2] //tr[td/text()[contains(.,'3 month')]]/td[2] 3 month volume //tr[td/text()[contains(.,'10 day')]]/td[2] 10 day volume

  23. Maire says:

    It looks like the pages are not localized so “&hl=en-US” is not needed.

  24. Ragu says:

    AlexaRank ImportXML function no longer worked for me as of October 2011

  25. Dave says:

    Thank you. Embarrassed to say I am a very late starter into Xpath and Google Docs & having found this post you give some fantastic examples of how to scrape data and use it in a clever way, I have been reading the posts from the guys at Distilled into using Google Docs to perform quick checks for rankings in the SERPS but it seem that I am too late and they have now stopped that function from working, still its great to gain exposure to this and get ideas on how to use these tools thank you.

Comments are closed.