Find Orphaned Pages From Your Sitemap.xml File with Excel and IIS Toolkit

For a little while now I’ve been wrangling with an interesting problem in site architecture diagnostics. Do I have any orphaned pages on my site, and how do I find them?

Frequently, web developers will write code to generate an XML sitemap that will include all URLs in the database, linked to or no. How can we use this fact to our advantage?

I thought I’d share a strategy using Excel to extract all the URLs on your site, comparing them to the raw export from the amazing IIS Toolkit. (PS – word is that Daniel’s writing a “how to” for IIS, including the installation…)

Import your XML sitemap URLs into Excel

Did you know you can import an XML sitemap into Microsoft Excel? It can be a little crash-tastic, but oh-so rewarding when it works. Let’s use SEOgadget’s XML sitemap as the example:

Import an XML feed into Excel

Then, pop your XML URL into the dialogue box:

new source

I’d recommend you only do this if you have reasonably powerful machines, we rock i7’s with 4gbs of ram as a system minimum. Big sitemaps take a while!

The end result is something like this:

excel data

By the way, you could import Twitter search results via the RSS feed into Excel using the same method, if you’re into that kind of thing.

Get your IIS site crawl data

Dan’s working on his next post, so I’m not going to steal any thunder, I promise. Assuming you already have IIS Toolkit, go ahead and export the URL listfrom your site crawl:

IIS Rocks

Import the CSV export into Excel, and turn the data into a table called IIS. If you haven’t played with exporting data from IIS, you should – it’s amazing.

IIS Data Rocks

There’s a ton of brilliant data in the IIS export, including the contents of the server header reponse, internal links counts and the directory the page is hosted in. We’ll talk more about that later on, use for now we’re interested in orphaned pages.

Use VLOOKUP to find pages in your sitemap that have no internal links

At the end of the day, a page in your website with no internal links is very unlikely to rank. Many pages with no internal links indicates some problems with your site architecture that you’re going to need to fix. Alas, how to find them? With a trusty VLOOKUP, that’s how.

We’re most interested in the “linked by” column, which contains the total number of internal links a page has. All you have to do is pull this data into your imported sitemap XML list. Assuming you name your IIS table “IIS”, the query will look like this:

=IFERROR(VLOOKUP(XML[[#This Row],[ns1:loc]],IIS,18,0),”Not crawled”)

Why is this good?

The end result will be a list of URLs in your XML sitemap with corresponding internal link numbers. That’s cool for doing the following:

– Finding URLs in your XML sitemap with no internal links, and are as a result, “Not crawled”
– Finding URLs in your sitemap with very low numbers of internal links

Have fun, and have a great Bank Holiday weekend :-)



Stay Updated: Sign Up for Webinar & New Blog Alerts

3 thoughts on “Find Orphaned Pages From Your Sitemap.xml File with Excel and IIS Toolkit

  1. Jerry Okorie says:

    Always great to read your post especially how step by step guide using excel VLOOKUP’s. Have a nice bank holidays too.

  2. David says:

    very clever Richard, as usual. I am taking your tip with me to run it on the British Council site, cheers

  3. Thanks David! Good luck – Dan’s working on a “how to” with IIS as we speak!

Comments are closed.