Identifying duplicate & cannibalised pages in a large site architecture

11th February 2016

I think we’re all aware by now (or should be!) of the effects that keyword cannibalisation and content duplication can have on a site’s organic visibility.

There are a number of tools that will allow you to check for duplicate content. In my experience, though, they’re generally limited to specific elements of a page. They don’t always tell you whether the content of the page in its entirety (or at its core) is a duplicate of another, or whether it’s just a case of keyword cannibalisation caused by, for example, the same title tags or H1 headings being used across pages.

For large ecommerce sites in particular, this lack of specificity in classifying what’s duplicate vs cannibalised vs unique can be a real problem.

The Limitations of Using Ranking Data

One effective method for identifying internal cannibalisation is ranking data. Tracked over time, the data will show instances where multiple URLs are competing for the same keyword, resulting in fluctuating and devalued rankings as the URLs swap in and out of the organic results.


This approach has been discussed a number of times and, whilst highly effective, it’s still limited to analysing URLs that are already ranking. For a more complete picture, here’s a process we used recently to help solve this problem for a large ecommerce site.

A more rigorous approach

Collect your data

The process all stems from a database export of all categories & subcategories on the domain. I could have gone down the route of downloading the XML sitemaps, running a crawl etc, but what we’re interested in here is the actual logging system used behind the scenes to describe each URL and its relationship to other categories in its simplest form. This is something a crawl can’t achieve on a site that places all URLs in root directories and has no breadcrumb trails to scrape.

This gave us a CSV export containing ~10K URLs with the following columns:

  • Category ID – the unique ID for the page
  • Parent ID – the unique ID of the parent category with which this page is associated
  • Description – the database name given to the page

Using these details we could recreate the URL paths. This gave us a clear understanding of the position of each page within the site architecture.
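Rebuilding the paths from those three columns amounts to walking each page's Parent ID up to the root. A minimal sketch in Python (the slug rule — lowercasing the description and hyphenating spaces — is an assumption; your database may store slugs explicitly):

```python
def build_paths(rows):
    """Rebuild the URL path for each category by walking Parent IDs
    up to the root. `rows` mirrors the CSV export's three columns."""
    by_id = {r["Category ID"]: r for r in rows}
    paths = {}

    def path_for(cat_id):
        if cat_id in paths:              # memoise: each page resolved once
            return paths[cat_id]
        row = by_id[cat_id]
        # Assumed slug convention: lowercase, spaces -> hyphens
        slug = row["Description"].lower().replace(" ", "-")
        parent = row["Parent ID"]
        if parent in by_id:              # prepend the parent's path
            full = path_for(parent) + "/" + slug
        else:                            # no known parent: top-level category
            full = "/" + slug
        paths[cat_id] = full
        return full

    return {cid: path_for(cid) for cid in by_id}

rows = [
    {"Category ID": "1", "Parent ID": "0", "Description": "Clothing"},
    {"Category ID": "2", "Parent ID": "1", "Description": "Mens Shoes"},
]
print(build_paths(rows))
# {'1': '/clothing', '2': '/clothing/mens-shoes'}
```

With ~10K rows this runs in well under a second; the memoisation matters only for deep category trees.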

Exploring your data

Next, we performed a CountIF on the description so we could begin grouping URLs together based on their naming convention. This alone can be enough to locate cannibalisation issues. After this point, it was a case of looking for a unique identifier for any category within the HTML page template and scraping the data via Screaming Frog. We went with ‘number of products/results’, e.g. the results count shown on Amazon category pages:


If this isn’t present, scraping the first 3-5 products on the page is also a good indicator. If memory becomes a problem, Mike King’s excellent post on hooking Screaming Frog up to AWS will solve your issues.
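Screaming Frog's custom extraction handles the scraping itself, but the extraction logic is worth seeing in isolation. A minimal sketch, assuming an Amazon-style results string — the regex is illustrative and would need adjusting to your own template:

```python
import re

def extract_result_count(html):
    """Pull the total result count from a snippet like
    '1-24 of over 2,000 results' (Amazon-style wording)."""
    match = re.search(r"of(?:\s+over)?\s+([\d,]+)\s+results?", html)
    if not match:
        return None                      # template doesn't expose a count
    return int(match.group(1).replace(",", ""))

print(extract_result_count("<span>1-24 of over 2,000 results</span>"))  # 2000
print(extract_result_count("<span>Featured products</span>"))           # None
```

A `None` here is your cue to fall back to scraping the first few products instead.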

CountIfs was then used to count occurrences where both the ‘Description’ and the ‘Number of Results’ were the same. If the result is a single instance, the description is shared but the product counts differ, so we’re looking at a cannibalisation problem. If it’s greater than 1, we’ve pinpointed duplicate categories.

Here are a couple of examples, along with the formulas used, for further context:
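The same COUNTIF/COUNTIFS logic translates directly into Python. A minimal sketch, with the URLs and counts below invented for illustration:

```python
from collections import Counter

pages = [
    {"url": "/mens-shoes",   "description": "Mens Shoes",   "results": 412},
    {"url": "/shoes-mens",   "description": "Mens Shoes",   "results": 412},  # same name & count
    {"url": "/shoes",        "description": "Mens Shoes",   "results": 98},   # same name only
    {"url": "/womens-boots", "description": "Womens Boots", "results": 57},
]

# =COUNTIF on the description column
desc_counts = Counter(p["description"] for p in pages)
# =COUNTIFS on description + number of results
pair_counts = Counter((p["description"], p["results"]) for p in pages)

def classify(page):
    if desc_counts[page["description"]] == 1:
        return "unique"
    if pair_counts[(page["description"], page["results"])] > 1:
        return "duplicate"        # same name *and* same result count
    return "cannibalised"         # same name, different result count

for p in pages:
    print(p["url"], classify(p))
```

Running this labels `/mens-shoes` and `/shoes-mens` as duplicates, `/shoes` as cannibalised, and `/womens-boots` as unique — exactly the three-way split the spreadsheet produces.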

To make this more bulletproof, you could combine the lookup on the number of results with the first 3-5 products. This could then be expanded beyond instances where the ‘Description’ matches, to catch cross-contamination between categories with different naming conventions.
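That extension can be sketched by fingerprinting each page on its result count plus its first few products, ignoring the description entirely. The `top_products` field and SKU values here are hypothetical stand-ins for whatever product identifiers you scrape:

```python
from collections import defaultdict

pages = [
    {"url": "/mens-shoes", "results": 412,
     "top_products": ["SKU101", "SKU102", "SKU103"]},
    {"url": "/shoes-mens", "results": 412,
     "top_products": ["SKU101", "SKU102", "SKU103"]},
    {"url": "/footwear",   "results": 412,
     "top_products": ["SKU900", "SKU901", "SKU902"]},
]

def fingerprint(page):
    # Result count plus the first few product IDs: two pages sharing a
    # fingerprint are near-certain duplicates, whatever they're named.
    return (page["results"], tuple(page["top_products"]))

groups = defaultdict(list)
for p in pages:
    groups[fingerprint(p)].append(p["url"])

duplicates = [urls for urls in groups.values() if len(urls) > 1]
print(duplicates)  # [['/mens-shoes', '/shoes-mens']]
```

Note that `/footwear` survives despite sharing a result count, because its top products differ — which is why the count alone isn’t enough.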
