How to Use Wildcards in Robots.txt

There’s been quite a large reaction to Google’s announcement that you don’t have to rewrite your URLs to appear as simple, “static-looking” and keyword-rich as once advised. I’m not going to change a thing about the SEO on my sites, but what if you already have a site that uses dynamic query parameters, and its index at Google is a bit of a mess? Let’s use this website as an example.

I still truly believe the best way to handle a dynamic site is to rewrite and avoid any form of query parameter in the visible URL. But what if you don’t have the time, resources or inclination to rework a core component of your site? Here’s another viable solution: wildcard out all the unwanted query strings, submit your “canonical” URLs via a sitemap.xml, and make sure your internal linking structure correctly mirrors the canonical URLs you referenced in the sitemap file.

Let’s do a quick duplicate content check on the site:

The core website has approximately 50 pages of content; however, Google has indexed more than 300:

http://www.google.co.uk/search?hl=en&safe=off&rlz=1B3GGGL_enGB257GB258&q=site%3Ahttp%3A%2F%2Fwww.sortoutstress.co.uk%2F&btnG=Search&meta

The main cause of the problem is a little query parameter called itemid – for perfect accuracy, let’s use title case as that’s how the query parameter has been indexed: Itemid

How do we get rid of all of those duplicated pages? First, let’s update the robots.txt to include a wildcard that blocks all URLs containing the offending string (and please don’t forget to sort out your internal linking while you’re doing all of this!).

Add the following line to your robots.txt:

Disallow: /*&Itemid
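
For reference, a complete robots.txt using this rule might look something like the following (a minimal sketch; the User-agent line is my assumption and simply applies the rule to all crawlers):

User-agent: *
Disallow: /*&Itemid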

Make sure you test that this will work in Google’s Webmaster Tools before relying on it.

*Robots.txt is case sensitive! Remember this when you’re testing your file.
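
To illustrate what the rule does and doesn’t catch, here are some hypothetical example URLs (made up for illustration, not taken from the site above):

Blocked:      http://www.example.com/index.php?option=com_content&Itemid=29
Not blocked:  http://www.example.com/index.php?option=com_content&itemid=29 (lowercase “itemid” doesn’t match)
Not blocked:  http://www.example.com/index.php?Itemid=29 (no “&” before Itemid, so the pattern doesn’t match)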

Finally, create a sitemap.xml file and submit it via your Webmaster Tools account and/or reference it in your robots.txt file like this:

Sitemap: http://www.sortoutstress.co.uk/sitemap.xml
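
If you don’t already have a sitemap.xml, a minimal one that lists your canonical URLs looks like this (the single entry below is just a sketch; list one <url> element per clean, parameter-free URL on your site):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.sortoutstress.co.uk/</loc>
  </url>
</urlset>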

Some SEOs don’t agree with the use of wildcards, or at least they see it as slightly risky or far from best practice. I totally agree: they’re not ideal. You have to be very careful and give a huge amount of attention to testing. The problem is, we can’t always have our cake and eat it. Sometimes getting a fully rewritten site is out of reach in this year’s budget, or it’s beyond the scope of the site build. Whatever the reason, if you can’t canonicalise the old-fashioned way, why not give this technique some serious consideration?




7 thoughts on “How to Use Wildcards in Robots.txt”

  1. JohnMu says:

    One thing you need to keep in mind is that disallows in the robots.txt will just disallow crawling, it will have less of an impact on actual indexing. If we have reason to believe that a URL which is disallowed from crawling is relevant, we may include it in our search results with whatever information we may have (if we’ve never crawled it, we may just include the URL — if we’ve crawled it in the past, we may include that information).

    To prevent URLs like these from being indexed, I would recommend that you have the server 301 redirect to the appropriate canonical (and of course not link to the incorrect one).

    John

  2. pah says:

    agree there… 301 redirect, scoop any relevant params into a cookie, new sitemap. Lovely. If only I could actually get that implemented on the site I’m working on.

  3. Awesome – thanks for the info. This would be really useful for me – to have the crawler disabled for certain WordPress posts with a particular keyword in the URL… (just amend the slug accordingly). But this only applies to Google, yeah?

  4. Thanks for the tip. I was looking for something like this to block the Itemid parameter.
    I am working on an e-commerce website and the client needs to block such dynamic URLs. So glad to find your blog.

    Bookmarked.

  5. I tried to use the Google robots.txt testing tool, but could not test it; it shows some error and I don’t know how to fix it. Also, wildcards in robots.txt are allowed by Google’s robots, but not necessarily by other search engines.

  6. bhavana says:

    I tried to disallow feed pages with Disallow: /feed but it’s not working.
    What is the correct way? Can I use a $ sign at the end, like
    Disallow: /feed$ or Disallow: /feed/$? Which one is correct?

  7. Linn says:

    We’re trying to prevent search results pages from being indexed by Google. All search result pages start with /?s=. We’ve added the following lines to our robots.txt file, but the pages keep getting indexed in Google:

    User-agent: *
    Disallow: /*?*
    Disallow: /*?

    Any ideas?
