Around six months ago, I came across a website that was blocking the MSNbot crawler (renamed Bingbot in October 2010). No one at the company had any idea, and the block had been costing them traffic for nearly a year. Ouch.
This weekend, SEOmoz accidentally blocked Googlebot’s access to Open Site Explorer while trying to deal with some very heavy server load from a distributed botnet posing as a search engine crawler. Accidents are unavoidable, and the occasional slip can happen to the best of us, but what checks can we put in place to mitigate such risks?
This post is based on an article I wrote for State of Search in July 2010. I’ve refreshed and updated some of the advice and added a few new checks, too.
Get into the habit
Time to get the honesty box out: when was the last time you switched user agents? Checked your 304 Not Modified responses? Made sure your canonical “www” redirects and trailing slashes were being added or removed correctly? Some checks are easily missed in today’s “out of the box” code world.
Traffic dropped recently? Are you happy blaming the latest algorithm update or could it be a problem closer to home? Remember, web server configurations can change, often without the SEO being made aware.
Periodically browse your site with a different user agent setting
In the example at the beginning of this post, I mentioned a search engine user agent being blocked from crawling a site. Episodes of traffic outage caused by blocked crawler access are exceedingly rare but, as we’ve seen very recently, they can happen.
Browsing the web with your user agent set to, say, Bingbot can reveal some fascinating oversights, errors or, dare I say, cloaking.
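You don’t even need a browser plugin for a quick spot check. Here’s a minimal Python sketch (the URL is a placeholder; swap in your own site) that builds a request identifying itself with Bingbot’s user agent string, ready to pass to `urllib.request.urlopen`:

```python
import urllib.request

# Placeholder URL -- swap in a page from your own site.
URL = "https://www.example.com/"

# Bingbot's published user-agent string.
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

def request_as(url, user_agent):
    """Build a request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = request_as(URL, BINGBOT_UA)
# urllib normalises header names to "User-agent"; a blocked crawler would
# surface here as a 403 or an odd redirect once you urlopen(req).
print(req.get_header("User-agent"))
```

If the response you get back differs from what a normal browser user agent receives, you’ve either found an accidental block or something rather less innocent.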
Check your canonical redirects and old domain inventory
If you’re a seasoned SEO, there’s nothing new here – but be honest: when was the last time you checked your canonical redirects? Does your “www” redirect in, or out (depending on which you prefer), with a 301 server header response? The same tip applies to title case redirects, trailing slashes and even your old redirected domain inventory.
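The check itself is simple enough to script. This is a sketch of the decision logic only (the hostnames are hypothetical) – in practice you’d feed it the status code and Location header you get back from `urllib` or `curl -I` against the non-canonical hostname:

```python
from urllib.parse import urlsplit

def canonical_redirect_ok(status, location, preferred_host):
    """True only if the response is a 301 pointing at the preferred host.
    A 302, a missing Location header, or a hop to the wrong host all fail."""
    if status != 301 or not location:
        return False
    return urlsplit(location).netloc == preferred_host

# Hypothetical example: example.com should 301 into www.example.com.
print(canonical_redirect_ok(301, "https://www.example.com/", "www.example.com"))  # True
print(canonical_redirect_ok(302, "https://www.example.com/", "www.example.com"))  # False
```

Run the same function over trailing-slash and mixed-case variants of your key URLs and you’ve got a repeatable canonical redirect audit.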
Make sure your rel=”canonical” is correct sitewide
A few mistakes with rel=”canonical” can lead to an unpleasant outcome. On larger sites, though, it’s quite difficult to keep tabs on how rel=”canonical” is configured. Fortunately, SEOmoz’s Web App has a nifty export feature that gives you all of the crawl data for your site (you’re limited to however many URLs your subscription plan allows – most are 10,000, but the limit is a million).
Some of the more interesting values you can extract from this tool are:
- Blocked by X-Robots
- Blocked by meta-robots
- Rel Canonical
If you’re a bit of an Excel geek, you might be interested in the data export for SEOgadget’s most recent crawl from SEOmoz’s crawler. I know that some of this (but *not* all of it) is available via IIS SEO Toolkit (here’s the installation guide) and it looks like SEO Spider has a great deal to offer, too. Dan’s on the case with a feature by feature comparison of all three (and possibly more), so stay tuned.
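If you’d rather roll your own spot check than rely on a crawler export, extracting the rel=”canonical” value from a page is a few lines of stdlib Python. A sketch (the HTML here is a made-up example – in practice you’d feed in the body of each fetched page):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href out of the first <link rel="canonical"> tag found."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "link" and (d.get("rel") or "").lower() == "canonical":
            self.canonical = d.get("href")

# Hypothetical page source for illustration:
html = '<html><head><link rel="canonical" href="https://www.example.com/page/"></head></html>'
finder = CanonicalFinder()
finder.feed(html)
print(finder.canonical)  # https://www.example.com/page/
```

Compare each page’s canonical against the URL you actually crawled and any mismatch (or a sitewide canonical pointing at the homepage – a classic blunder) jumps straight out.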
Beyond 404s – server header checks that get missed
Beyond checking that your error pages return a 404 (and that Google Webmaster Tools isn’t reporting too many), you might want to dig into your server header responses a little deeper. For example, a “304 Not Modified” is a response to an If-Modified-Since field in the client request headers. In English: some web servers will respond with a “not modified” if the page requested hasn’t changed since the last time it was crawled.
I’ve seen 304 responses handled really badly. In one situation, a website was responding normally to all requests except those where the If-Modified-Since header field was present. Instead of returning the correct 304 response, the server collapsed spectacularly with a 403 error. Oops! Test your site with Feed the Bot’s awesome 304 header checker tool (one of my favourite SEO tools).
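You can reproduce that kind of test yourself by attaching an If-Modified-Since header to a request. A sketch, assuming a placeholder URL – pass the result to `urllib.request.urlopen` and a well-behaved server returns 304 with no body if the page hasn’t changed (and certainly shouldn’t 403):

```python
import time
import urllib.request
from email.utils import formatdate

def conditional_request(url, last_crawled_epoch):
    """Build a request asking: has this page changed since we last saw it?"""
    # If-Modified-Since must be an RFC 1123 date in GMT.
    stamp = formatdate(last_crawled_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})

# Placeholder URL; pretend we last crawled it a day ago.
req = conditional_request("https://www.example.com/", time.time() - 86400)
# urllib normalises the header name to "If-modified-since":
print(req.get_header("If-modified-since"))
```

Anything other than a 200 (page changed) or a 304 (page unchanged) coming back from this request is worth investigating.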
Watch out for x-robots
Ever look out for the X-Robots-Tag? It’s part of the robots exclusion protocol (REP) and can be found among the server header responses of a web page. You can noarchive, noindex and nofollow with an X-Robots-Tag, so it’s probably worth checking to see if something unexpected is lurking. You could even try checking for it with (and without) your user agent configured as a search engine…
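Because the directives live in a header rather than the page, they’re easy to miss by eye. A small sketch of the inspection step – the header value is hard-coded here for illustration, but in practice you’d read it from `response.headers.get("X-Robots-Tag")` after fetching the page:

```python
def parse_x_robots(header_value):
    """Split an X-Robots-Tag header value into its individual directives."""
    return {d.strip().lower() for d in header_value.split(",") if d.strip()}

# Hypothetical header value -- this is the sort of thing you don't want
# lurking unnoticed on an important page:
directives = parse_x_robots("noindex, nofollow, noarchive")
print(sorted(directives))  # ['noarchive', 'nofollow', 'noindex']
print("noindex" in directives)  # True
```

Run it across your key pages twice – once with a normal user agent and once posing as a crawler – and compare the two sets for any suspicious differences.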
Keep an eye on the status of your top pages
The top pages report on Open Site Explorer is excellent for making the most of your site’s most linked-to pages. Look out for any 404 errors – a simple 301 redirect could rescue some valuable link juice:
Pro tip – export all of your Open Site Explorer data and run a Xenu crawl on the top pages list. That way, you’ll know the server response codes are fresh and bang up to date.
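Once you have the re-crawled list, picking out the rescue candidates is a one-liner. A sketch over hypothetical (URL, status) pairs, standing in for whatever your link checker exports:

```python
# Hypothetical results from re-crawling an Open Site Explorer export:
pages = [
    ("https://www.example.com/", 200),
    ("https://www.example.com/old-post/", 404),
    ("https://www.example.com/moved/", 301),
]

def pages_needing_redirects(results):
    """404s on linked-to pages: prime candidates for a 301 rescue."""
    return [url for url, status in results if status == 404]

print(pages_needing_redirects(pages))  # ['https://www.example.com/old-post/']
```

Each URL this surfaces has links pointing at it and nothing to show for them – exactly where a quick 301 pays for itself.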
A few pages into this report, I found a URL that returns a 404 despite having six root domains linking to it. I hadn’t looked at Open Site Explorer’s report for my own site in a long time. I’m pretty glad I added this section now…
Sudden performance changes
Has your site suddenly started performing slowly? It might be worth keeping an eye on site performance (page load times), just in case. While page load’s impact in the SERPs is still very much undefined, a slick, well-optimised page load experience is certainly better for customers.
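Even a crude timing loop, run on a schedule, will catch a sudden regression before your visitors complain. A minimal sketch of the timing wrapper – the fetch callable here is a stand-in so it runs offline, but in practice you’d wrap a `urllib.request.urlopen` call on your key pages:

```python
import time

def timed(fetch):
    """Return (result, seconds elapsed) for any fetch callable.
    In real use, fetch would be e.g. lambda: urlopen(url).read()."""
    start = time.perf_counter()
    result = fetch()
    return result, time.perf_counter() - start

# Stand-in for a real page fetch, so the sketch works without a network:
result, elapsed = timed(lambda: "fake response body")
print(f"fetched in {elapsed:.4f}s")
```

Log the elapsed figure each run and a slow-creeping (or overnight) degradation becomes obvious on a chart long before it’s obvious in the browser.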
Watching your site for errors and general housekeeping
Fortunately for all of us, an SEO’s work is never done. Follow every item of advice in this post and you’ll only know your site was OK at a single point in time. General housekeeping activities, like monitoring for errors in Webmaster Tools and keeping an eye on the state of your redirects, are endless tasks. Something that surprises me (especially about Webmaster Tools) is that there doesn’t seem to be an alerts-based feature to keep us informed about sudden changes to our sites – a big spike in 404s, a significant change to pages with internal links or a drop in external links are all signals I’d like to be warned about.
Over to you. What are your oft-overlooked but seriously handy search engine accessibility checks?