Fixing Duplicate Content (and no, I’m not going to talk about pagination)

There's been a lot of attention recently on one of my favourite SEO issues: duplicate content. "Dupe content" is just one of those subjects that never goes away. Why? It seems that for every fix we apply to sort the problem out, an entirely new kind of duplicate content crops up. The other problem is that it takes a while for an inexperienced SEO to learn enough about the issue to diagnose it and solve it on their site once and for all.

A lot of the posts we've seen recently tend to focus on the same subject: pagination! And the same solution: using a noindex, follow tag to sort it out (this includes me, by the way :-) ). This is all fine, but what if you don't have that kind of issue? Or what if you have pagination, you've fixed it, and you still have other problems?

This post is going to focus on two more sources of duplicate content frustration in your internal site architecture: our old friend the session_id, and tracking codes for analytics packages. We're going to talk about user agent detection, conditional redirects and JavaScript onclick events and, if you're really desperate, robots.txt wildcards.

1) Hiding analytics tracking with onclick events

The ultimate irony: Google blogs and offers advice about reducing duplicate content, yet they also have an analytics package whose tracking codes are littered all over the internet, causing heaps of the stuff. How can you prevent leaking a tracking code like ?utm_source= when you need it to track the ROI of the link? Use a JavaScript onclick event.

For the moment, Googlebot can’t see or execute these types of links. Here’s an example:

>>>  Seogadget home page <<<

Here’s the code:

JavaScript onclick event - hidden tracking string in the URL
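
In sketch form (example.com and the utm_ values below are placeholders, not my real tracking code):

<!-- Search engines, and anyone with JavaScript disabled, only ever see the plain href -->
<a href="http://www.example.com/"
   onclick="location.href='http://www.example.com/?utm_source=blog&amp;utm_medium=link&amp;utm_campaign=dupe'; return false;">
  Example home page
</a>
<!-- return false stops the default href from firing, so visitors with JavaScript get the tracked URL -->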

When you click the link, you'll be taken back to my homepage, but you'll see the utm_source= query appended to the end of the URL. Disable JavaScript with the Web Developer Toolbar on this page and mouse over the same link, and you'll only see a canonicalised homepage link, which, for the time being at least, Googlebot will respect. At least you won't be leaking any more URLs into Google's index.

Hiding tracking code (or any query string) from search engines can be useful, but what if that code is already in the index? We’ll come on to that in a moment.

2) Session IDs leaking all over the place

Most SEO blogs say "don't use session IDs". That's great advice, and most recent content management systems no longer use them, but that's not to say that legacy content management systems should be binned.

If you have a site index full of session IDs, you might want to consider setting up a conditional 301 redirect to send any known search bot back to the canonical version of the URL. I call this "session stripping": detecting the user agent on every URL and 301ing away any query string or session ID that would cause duplicate content in a search engine index. My good friend Gareth Jenkins is working with me on a technical post on how to do just that in ASP code. Subscribe to my RSS feed and you'll get it in a few days' time.
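
The ASP details will be in that post; purely as an illustration of the logic, here's a rough sketch of the same idea written as a Node.js/Express middleware (the framework, the crawler list and the parameter name are my own assumptions here, not a drop-in solution):

// Rough sketch: 301 known crawlers away from session_id URLs to the canonical form
const express = require('express');
const app = express();

// Very crude crawler check based on the user agent string (assumed list)
const BOTS = /googlebot|bingbot|slurp/i;

app.use(function (req, res, next) {
  if (BOTS.test(req.get('User-Agent') || '') && req.query.session_id !== undefined) {
    const params = new URLSearchParams(req.query); // copy the query string...
    params.delete('session_id');                   // ...minus the offending parameter
    const qs = params.toString();
    return res.redirect(301, req.path + (qs ? '?' + qs : '')); // bots get the canonical URL
  }
  next(); // ordinary visitors keep their session as normal
});

app.listen(3000);

The same user agent check and redirect can be expressed in a few lines of classic ASP, which is what the follow-up post will cover.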

Here's a good example of a site using user agent detection to strip session IDs from the index. Follow the link below with your Firefox user agent set to Googlebot:

http://www.goldgroup.co.uk/town-planning-recruitment/?session_id={7FA6ADAE-C397-4755-A4F1-0066FE68DC1E}

See? That session ID is cleaned out when you're Googlebot but not when you're a user. Introducing this method has the added benefit of cleaning up the site index: every session ID that gets recrawled is 301 redirected to the canonical form, so after a few weeks your entire site index is cleaned out. This is technically known as conditional redirecting, and there's a lot of debate at the moment about the white-hattedness of the procedure. I personally think this kind of conditional redirection is OK: you're making it easier for search engines to crawl your site, and you're not cloaking your content at all. What's the problem?

3) If you're desperate, the robots.txt wildcard

Let me open this with the following statement: using wildcards to prevent the indexing of session IDs is a bad idea. If every internal link carries a session ID and you block them all, your site won't get crawled at all. That's bad! If you're really stuck, however, you could make sure that your internal linking uses the canonical version of the URL and that you're only using session IDs where absolutely necessary. Here's a wildcard in a robots.txt file:

User-Agent: *
Disallow: /qca/
Disallow: /form/
Disallow: /search/
Disallow: /candidate_community/
Disallow: /campaign/
Disallow: /*?session_id

This example robots.txt would disallow:

http://www.example.com/test-url-devon-40305/?session_id={C355FEB0-4043-4FE9-A07D-D788E441EFDE}

but allow:

http://www.example.com/test-url-devon-40305/

I hope this post provides a little more insight into one of the most important subjects in site architecture for SEO. I personally hope that duplicate content issues never go away, because fixing them can be extremely satisfying ;-)




8 thoughts on "Fixing Duplicate Content (and no, I'm not going to talk about pagination)"

  1. Tom says:

    Another good post. Another way to fix this (though slightly more technical) is to 301 redirect from your tracking URLs to the canonical version. This will still let you track using a lot of analytics packages.

    If that doesn't work for you, you could conditionally 301 redirect for Googlebot (this is what Amazon does).

  2. Gab says:

    With Google executing JS on trusted sites, especially where the URL is obvious in the JS, I wouldn't trust onclick to avoid dupe content. My untested 2 pence.

  3. You can't get much more obvious than the JavaScript link back to my homepage, and Google hasn't discovered the URL yet.

    See?

    I happen to have seen a lot of action with onclick events, and I've seen no evidence to support your theory.

  4. Tbone Malone says:

    Nice post.

    Regarding JavaScript: recently, on a fairly trusted site, we used JavaScript on a link as part of some A/B testing, thinking Google wouldn't index the "B" page. Lo and behold, they indexed it within two days. This was the only link pointing to the new page. It wasn't an "onclick" event though. Don't know if that would matter.

  5. Hey Tbone, I think it does matter. The other problem you have is Google's extremely aggressive discovery methods. They're definitely using toolbar data and HTTP referrers for discovery, so it's very hard to tell what's going on without a really full investigation.

  6. EnhakyEffiff says:

    omg.. good work, dude

  7. Mike says:

    I am in the process of moving a site from one host to another. My new server has Apache UserDir turned on, so my site is visible on both

    exampleserver.com/~mysite and mysite.com

    I have had a problem before with this sort of setup, where Google discovered both ways into the site and penalised me for duplicate content.

    Unfortunately I cannot turn Userdir off.

    Is there a way of preventing this sort of leak with htaccess or robots.txt?

  8. @Mike – you need to speak to g1SMD (you should be able to contact him at SEOmoz)
