
Using Google for duplicate content detection

2nd July 2008

A month or so ago I was looking at a camping equipment website called Outdoor Pros. I love this site and would recommend it to anyone. Being an SEO, however, I couldn’t help but notice that they were using some suspicious looking paginated links on their categories pages, so after getting all excited about my new camping stove I decided to take a quick look in their Google site index to see how search engines might be indexing the site.

This post covers some basic tips on “site diagnostics”, specifically duplicate content detection using Google search: checks that every SEO should do as part of investigating issues that may negatively impact search engine positioning.

Here’s the approach I always follow, using Outdoor Pros as an example site:

1) Use your common sense

Let’s start by doing a site: search in Google.

As you can see from the screen grab, Google is reporting 72,100 indexed pages. Is that too many? If so, you may have some kind of duplicate content issue.

2) Skip around the index and see if you spot something weird

Ok, not terribly technical advice, but it doesn’t have to be.

Click through to around page 10 and take a quick look at the indexed URLs. If you don’t see anything weird, skip ahead another 10 pages. Go as far towards the back of the index as you possibly can, because that’s where the bad stuff usually hides. You’re looking out for malformed URLs, query strings (like ?=sessionid or ?first_page etc.) or many repeated results with the same title / description.
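If you find yourself doing this eyeballing often, the same “smell test” can be scripted against any list of URLs you can export. A minimal sketch in Python – the patterns below are illustrative examples taken from the kinds of strings mentioned above, not a definitive list:

```python
import re

# Patterns that often signal duplicate-content trouble in an index.
# Illustrative only; extend with whatever you spot on your own site.
SUSPICIOUS = [
    re.compile(r"sessionid", re.I),   # session IDs leaking into URLs
    re.compile(r"[?&]first_page="),   # pagination query strings
]

def looks_weird(url: str) -> bool:
    """True if the URL matches any known duplicate-content pattern."""
    return any(p.search(url) for p in SUSPICIOUS)

# Hypothetical sample of indexed URLs:
urls = [
    "http://example.com/tents",
    "http://example.com/knives?first_page=20",
    "http://example.com/stoves?sessionid=abc123",
]
weird = [u for u in urls if looks_weird(u)]
```

Running this over an exported index dump surfaces the suspicious URLs in seconds instead of paging through results by hand.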

In the case of our friends at Outdoor Pros, you can see straight away that something doesn’t look right.

That set of results tells me a lot about this site, and I’ve only been looking at it for 30 seconds. We’ve identified some query strings in the index. They might be causing duplicate content. How do we confirm that though?

3) Assessing if there really is a problem on individual page types

Take one of the query strings we saw in the index. Let’s use:


Is that indexed string causing a problem? Let’s see. The URL was:

It looks like a brand / category page for Kershaw Knives. Checking whether that page is indexed with and without the query string is the first step. Here’s the cached page with the query string and without.

Whoops. There are at least two copies of this page in the index.

But don’t those pages have different content? Well, yes, in that the products each page links to are different, but the brand category page itself is the same every time. Each copy of the page has the same meta title and description – it’s duplicating! That may be why Outdoor Pros don’t rank organically for “Kershaw” or “Kershaw knives”.
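The with/without check above boils down to recognizing that several URLs are really copies of one canonical page. That mapping can be expressed directly: strip the duplicating parameters and see whether two URLs collapse to the same thing. A sketch, assuming a made-up URL and a hypothetical parameter name based on the strings seen in the index:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Query parameters that create duplicate copies of the same page.
# Hypothetical names; adjust to match what your own index shows.
DUPLICATING_PARAMS = {"first_page", "sessionid"}

def canonicalize(url: str) -> str:
    """Strip known duplicating query parameters so every copy of a page
    collapses to one canonical URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in DUPLICATING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Two indexed copies of the same brand page collapse to one URL:
a = canonicalize("http://example.com/kershaw-knives?first_page=20")
b = canonicalize("http://example.com/kershaw-knives")
```

If `a` and `b` come out equal, Google is holding two copies of what you consider one page.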

4) Deciding how many URLs in the index are duplicated

That’s quite easy. To get a feel for the number of URLs that are duplicating, just do a query like inurl:attribute_value_string

This site looks to have at least 13,000 URLs that contain the query string. Drill down a little by picking a few different titles from indexed pages, such as: intitle:"Buck Knives -"

There are 65 pages with that exact <title>. Doh!
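If you can export a list of indexed URLs with their titles (from your own crawl, for instance), this duplicate-title count can be scripted rather than done query by query. A minimal sketch – the sample URLs and titles are invented:

```python
from collections import defaultdict

# (url, <title>) pairs, e.g. from crawling your own site. Sample data is made up.
pages = [
    ("/kershaw-knives", "Kershaw Knives - Outdoor Pros"),
    ("/kershaw-knives?first_page=20", "Kershaw Knives - Outdoor Pros"),
    ("/buck-knives", "Buck Knives - Outdoor Pros"),
    ("/buck-knives?first_page=20", "Buck Knives - Outdoor Pros"),
    ("/tents", "Tents - Outdoor Pros"),
]

def duplicate_titles(pages):
    """Group URLs by <title> and keep only titles shared by more than one URL."""
    by_title = defaultdict(list)
    for url, title in pages:
        by_title[title].append(url)
    return {t: urls for t, urls in by_title.items() if len(urls) > 1}

dupes = duplicate_titles(pages)
```

Each entry in `dupes` is one page title mapped to every URL carrying it – exactly what the intitle: query shows, but across the whole site at once.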

5) How do I fix this?!

Ok, first of all, let me recap what we’ve done so far. We’ve used a basic site: command and taken a common-sense snapshot of how many pages there are in the index. If you’re an e-commerce site with 100,000 indexed pages and only 5,000 products, you might need to think about it.

Next, we drilled down by checking Google’s index at random positions to see if there was anything that didn’t look right. Something was definitely wrong. By carrying out a query that told us how many instances of the query string were present, we had a total number of indexed pages using that string. Finally, we picked a specific page <title> and found 65 instances of the same page.

There is a solution, and sadly just nofollowing the paginated links won’t work. The damage has been done – you have some indexed URLs and some housekeeping to do.

I’m going to offer some advice in this post, but I’ll cover fixing duplicate content issues fully in my next post soon. Add my RSS feed to get that post when it’s done. In the meantime, my best advice to Outdoor Pros is to create a list of all of the query strings that describe paginated pages and set up a rule to noindex,follow anything above the value of the first page.

Here’s my example:

Let’s look at their pants page. :-) It’s a perfectly good pants page and I’ll hear no sniggering at the back of the class, please.

The main url to this page is:

Check out the paginated navigational links. Each one of them produces a different url that looks like this:

The fix? A simple noindex,follow should be added in the page head whenever that query string is generated:

  <head>
    <title>…</title>
    <meta name="robots" content="noindex, follow">
    …
  </head>

This way, the many versions of the same page will be crawled but not indexed. All links on the page will be followed so the products will still be added to Google’s index. You’ve identified the canonical version of your pants page and Google will be grateful. Job done.
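Server-side, the rule can be as simple as: if the request carries one of the paginating query strings, emit the robots meta tag; otherwise leave the canonical page indexable. A sketch of that decision in Python – the parameter name first_page is borrowed from the query strings discussed above, and how you hook this into your templating will vary by platform:

```python
from urllib.parse import urlparse, parse_qs

# Query strings that paginate a category page. Hypothetical name;
# use the actual parameters your platform generates.
PAGINATION_PARAMS = {"first_page"}

def robots_meta(url: str) -> str:
    """Return the robots meta tag for paginated copies of a page,
    or an empty string for the canonical (first) page."""
    query = parse_qs(urlparse(url).query)
    if any(p in query for p in PAGINATION_PARAMS):
        return '<meta name="robots" content="noindex, follow">'
    return ""  # canonical page: leave it indexable
```

The returned tag would be dropped into the page <head> at render time, so every paginated copy declares itself noindex,follow while the main URL stays untouched.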


  1. Nice review of how a couple of basic tools can turn into advanced techniques. Just like good games, good SEO tools take moments to learn and a lifetime to master.

  2. Definitely Pete – thanks for dropping by!

  3. Nice post! Solid tips :-)

    Only thing is I think your code at the end is wrong – should say noindex, follow not noindex, nofollow

    More of this kind of thing please!

  4. Nicely spotted! Fixed – Cheers Tom, hope you’re well!!

  5. very good post Richard.

    If I may ask: couldn’t you use Google Webmaster Tools to get information like duplicate titles directly? It works faster for you and scans the whole site.

  6. Hi there BtoomTurn, you’re right, you could use Webmaster Tools to get duplicate titles. Webmaster Tools is definitely one source of information, though you won’t get the details that you need to perform a complete diagnosis. WMT is a very important step, and I’d put that part of the diagnostic under “Use your common sense” – good call.

  7. Very helpful, and like you said, it might be difficult for a site with over 100,000-odd pages. SEO is more about thinking like a tester, but with a good understanding of search engines – search for what doesn’t work in a site and you will find it, then apply your knowledge to the tools available and you’ll find a solution.

  8. Hi there, very useful tips – fantastic input for a webmaster. Thanks for posting, I’ll look forward to visiting your next post.

  9. Is there any “official” checking tools for duplicate content? Not in terms of URL, but like similar content in two different URLs – any tool to see if Google (or any other SE) sees that as duplicate content?

    This is an issue for ecommerce stores, or product catalogs – where the difference between two products may just be the color, or a slight redesign, and the model number. While we can attempt to craft a different title, or meta description, most of the content remains the same.

  10. You’ve identified the canonical version of your pants page and Google will be grateful.

  11. Thank you! I’ve been trying for over a month to figure out what Google has in the index. Search Console only tells you how many URLs are indexed. I knew I had a bunch of duplicate pages due to not having the URL params configured properly on my Store page when I launched the site. I was using site:domain and even site:domain/subdirectory/ but it was only giving me a few new duplicates every few days. Like Whack-a-Mole. I didn’t know about inurl:, whereby I could simply put a param name (or any phrase of interest) in there. Found all 300+ duplicates and got rid of them in an hour. I tried forums all over the internet and nobody could tell me how to simply see what is in the index. FYI, Bing Webmaster provides an Index Report which is very useful. It would be nice if Search Console provided something similar. :)

Comments are closed.
