Find orphaned pages from your sitemap.xml file

by on 27th August 2010

Find Orphaned Pages From Your Sitemap.xml File with Excel and IIS Toolkit

For a little while now I’ve been wrangling with an interesting problem in site architecture diagnostics.

Do I have any orphaned pages on my site, and how do I find them?

Frequently, web developers will write code to generate an XML sitemap that will include all URLs in the database, linked to or no. How can we use this fact to our advantage?

I thought I’d share a strategy using Excel to extract all the URLs on your site, comparing them to the raw export from the amazing IIS Toolkit. (PS – word is that Daniel’s writing a “how to” for IIS, including the installation…)

Import your XML sitemap URLs into Excel

Did you know you can import an XML sitemap into Microsoft Excel? It can be a little crash-tastic, but oh-so rewarding when it works. Let’s use this site’s XML sitemap as the example:

Import an XML feed into Excel

Then, pop your XML URL into the dialogue box:

new source

I’d recommend you only do this if you have reasonably powerful machines, we rock i7’s with 4gbs of ram as a system minimum. Big sitemaps take a while!

The end result is something like this:

excel data

By the way, you could import Twitter search results via the RSS feed into Excel using the same method, if you’re into that kind of thing.

Get your IIS site crawl data

Dan’s working on his next post, so I’m not going to steal any thunder, I promise. Assuming you already have IIS Toolkit, go ahead and export the URL listfrom your site crawl:

IIS Rocks

Import the CSV export into Excel, and turn the data into a table called IIS. If you haven’t played with exporting data from IIS, you should – it’s amazing.

IIS Data Rocks

There’s a ton of brilliant data in the IIS export, including the contents of the server header reponse, internal links counts and the directory the page is hosted in. We’ll talk more about that later on, use for now we’re interested in orphaned pages.

Use VLOOKUP to find pages in your sitemap that have no internal links

At the end of the day, a page in your website with no internal links is very unlikely to rank. Many pages with no internal links indicates some problems with your site architecture that you’re going to need to fix. Alas, how to find them? With a trusty VLOOKUP, that’s how.

We’re most interested in the “linked by” column, which contains the total number of internal links a page has. All you have to do is pull this data into your imported sitemap XML list. Assuming you name your IIS table “IIS”, the query will look like this:

=IFERROR(VLOOKUP(XML[[#This Row],[ns1:loc]],IIS,18,0),”Not crawled”)

Why is this good?

The end result will be a list of URLs in your XML sitemap with corresponding internal link numbers. That’s cool for doing the following:

– Finding URLs in your XML sitemap with no internal links, and are as a result, “Not crawled”
– Finding URLs in your sitemap with very low numbers of internal links

Have fun!

Get insights straight to your inbox

Stay one step ahead of the competition with our monthly Inner Circle email full of resources, industry developments and opinions from around the web.