
What we learned watching GoogleBot crawl a JS web app

7th December 2015

A little while ago, we built a prototype universal JS application called History of Humanity. More recently, I decided to start converting it to ES6 and to go back and tidy the code a little (the code is available in the GitHub repo). So I thought we’d share a little of what we learned from that, and what we’ve discovered about Google’s crawling of universal JS apps.

Client-Side Pick-Up

Initially, we discovered two issues with how the application picks up the server-rendered data and instantiates itself on the client side.

The data for the timeline can be thought of as a multidimensional array, where the first level is years, and the second level is items in a year. So for example:

  • 1750
    • Peak of the Little Ice Age
  • 1752
    • The lightning rod invented by Benjamin Franklin
  • 1754
    • Treaty of Pondicherry ends Second Carnatic War and recognizes Muhammed Ali Khan Wallajah as Nawab of the Carnatic
    • King’s College founded by a royal charter of George II of Great Britain
    • The French and Indian War, fought in the U.S. and Canada mostly between the French and their allies and the English and their allies (the North American chapter of the Seven Years’ War)

…and so on. If we want to reference the first item, we can think of it as 1750-0 (the zeroth item in the 1750 set). Similarly, the founding of King’s College would be 1754-1, the French and Indian War 1754-2, and so on.
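
To make the indexing concrete, here’s a minimal sketch of that structure in ES6. The object shape and the getItem helper are illustrative rather than the app’s actual code; an object keyed by year behaves the same way as the multidimensional array described above.

```js
// A sketch of the timeline data shape (illustrative, not the app's real data).
const timeline = {
  1750: ['Peak of the Little Ice Age'],
  1752: ['The lightning rod invented by Benjamin Franklin'],
  1754: [
    'Treaty of Pondicherry ends Second Carnatic War',
    "King's College founded by a royal charter of George II of Great Britain",
    'The French and Indian War'
  ]
};

// Look up an item by its year-position pair, e.g. 1754-1.
const getItem = (year, position) => timeline[year][position];

getItem(1754, 1); // => "King's College founded by a royal charter..."
```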

The way the routing had been built (i.e., quickly and without a lot of thought) meant that the variables for year and position in the stack were passed in on the client side, but the application would tear down the server-rendered HTML and re-render the item in question. We discovered that Google would try to execute this, but its JS execution was timing out, so it only ever got as far as the tear-down, and never finished re-rendering the new item.

As a result, we amended the routing and boot-up logic on the client side to ensure that it used the HTML sent by the server rather than rebuilding it. This fixed the issues we were seeing.
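
For context, here’s roughly what that client-side pick-up looks like with React 0.14. This is a sketch under assumptions: the App component, the #app mount point and window.__INITIAL_STATE__ are illustrative names, not the app’s actual code. The key point is that rendering with the same props the server used lets React’s markup checksum match, so it reuses the server’s DOM instead of tearing it down.

```js
import React from 'react';
import ReactDOM from 'react-dom';
import App from './components/App'; // hypothetical root component

// State the server serialised into the page (illustrative global name).
const initialState = window.__INITIAL_STATE__;

// Rendering into the server-generated markup with the same props means
// React's checksum matches, so the existing DOM is reused rather than
// torn down and re-rendered from scratch.
ReactDOM.render(
  <App year={initialState.year} position={initialState.position} />,
  document.getElementById('app')
);
```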

Takeaway

Make sure your application logic for routing and instantiation is universal, and ensure the pick-up on the client side is flawless to avoid crawling issues.

Pagination Crawling

The second thing we discovered is that despite Google’s apparent improvements in crawling JS, it still requires anything that manipulates a URL to be a link. Initially, we’d supported paging using click events on spans, which moved forward and backward through the list of events, and a pagination list at the bottom of the app using similar logic. Interestingly though, whilst Google can render JS reasonably well, it’s not yet trying to understand the context of that JS.

When we checked the server logs, we could see that Google wasn’t attempting to poke the pagination list, despite the fact that it looks like pagination (albeit, at that point, without links). What we’ve been able to infer from further testing is that while Google understands context from layout and common elements, it doesn’t yet try to fire JS events to see what will happen.

Takeaway

Anything that manipulates URLs still needs to be handled with links, with the destination URL in the href of the anchor. Anything else risks Google not assigning weight, or not crawling to the right page at all. This is potentially a situation that an inexperienced SEO might not spot, and building navigation without links is a far more common approach among JS developers than an SEO might prefer!
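
To illustrate, here’s a hedged before/after sketch in React. The component names and URL scheme are hypothetical; the point is simply that the crawlable version puts the destination URL in an anchor’s href, while still intercepting clicks for client-side routing.

```js
// Before: a span with a click handler. Users can paginate, but Google
// won't fire the event, so the next page is never crawled.
const PageButton = ({ page, onNavigate }) => (
  <span onClick={() => onNavigate(page)}>{page}</span>
);

// After: a real link. The href is crawlable, and we intercept the click
// so users still get client-side routing.
const PageLink = ({ page, onNavigate }) => (
  <a
    href={`/page/${page}`} // hypothetical URL scheme
    onClick={(e) => {
      e.preventDefault();
      onNavigate(page);
    }}
  >
    {page}
  </a>
);
```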

Notes on React 0.14.x and ES6

Once the amends had been made, we updated the app to React 0.14.2 and started taking advantage of ES6 (using Babel and Babelify for transpilation to ES5). This was a reasonably painless process, and the new modular nature of React is awesome. We’ll continue to improve the app over time, and we’ll write up any further discoveries in follow-up posts.
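
For anyone curious, a build along those lines looks roughly like this. It’s a sketch under assumptions: the entry and output paths are illustrative, and which Babel presets you need depends on your Babel version (Babel 6 requires them; Babel 5 didn’t).

```js
const fs = require('fs');
const browserify = require('browserify');
const babelify = require('babelify');

browserify('./src/app.js')                              // illustrative entry point
  .transform(babelify, { presets: ['es2015', 'react'] }) // ES6 + JSX → ES5
  .bundle()
  .pipe(fs.createWriteStream('./dist/bundle.js'));       // illustrative output
```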

Our main takeaways, however, continue to be:

  1. Have a single component at the top of the application that controls state and holds all the functions that modify it. State should live as high in the application as possible and always flow down to the subcomponents that need it. Using Redux can help with this, as it abstracts away the need for component state in the conventional sense entirely, although keeping state in parent components and passing it down has its own architectural advantages (see the sketch after this list).
  2. Ensure consistent routing by planning route architecture at the start of the build, rather than retrofitting it later. Architectural constraints should always be addressed before writing code, especially in heavily nested systems.
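
Here’s a minimal sketch of point 1, assuming illustrative component names: a single stateful component at the top, with state and its modifier functions passed down to stateless children.

```js
import React from 'react';

// A stateless child: receives state and a callback, holds nothing itself.
const Timeline = ({ year, onYearChange }) => (
  <div>
    <h2>{year}</h2>
    <button onClick={() => onYearChange(year + 1)}>Next</button>
  </div>
);

// The single component at the top that owns state and its modifiers.
class App extends React.Component {
  constructor(props) {
    super(props);
    this.state = { year: 1750 };   // all state lives here
    this.setYear = this.setYear.bind(this);
  }

  setYear(year) {
    this.setState({ year });       // the only place state changes
  }

  render() {
    // State flows down; children call back up to modify it.
    return <Timeline year={this.state.year} onYearChange={this.setYear} />;
  }
}
```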

As always, thoughts and feedback are welcome. Leave a comment, or find me on Twitter at @pwatsonwailes.
