How to Use Wildcards in Robots.txt

The best way to handle duplicate content issues on a site is to manage the causes of duplication at the source: avoid indexable query parameters, keep page bloat under control and run regular technical SEO health checks to catch any surprise behaviours.

But what if you don’t have the time, resource or inclination to rework a core component of your site?

Here’s another viable solution: use the Robots Exclusion Protocol (REP).

A Quick Guide to Robots.txt Wildcards

This post explains how to use wildcards in robots.txt. With a handful of wildcard directives you can stop crawlers from reaching your unwanted query-string URLs, which goes a long way towards keeping them out of the index in the first place.

User-agent: *
Disallow: /*?

Blocks access to every URL that contains a question mark (“?”).

User-agent: Googlebot
Disallow: /*.php$

The $ character is used for “end of URL” matches. This example blocks Googlebot from crawling URLs that end with “.php”.

User-agent: *
Disallow: /search?s=*

Stops any crawler from crawling internal search result pages (URLs beginning with /search?s=).
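
If you want to sanity-check a pattern before it goes anywhere near a live robots.txt file, a few lines of code can emulate the matching locally. The sketch below is purely illustrative: it is plain Python, it assumes Google-style wildcard semantics (* matches any run of characters, a trailing $ anchors the match to the end of the URL), and the rule_matches helper and sample paths are made up for the example rather than taken from any real site.

import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt Disallow pattern matches the URL path,
    assuming Google-style wildcards: '*' matches any characters and a
    trailing '$' anchors the match to the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then expand '*' back into '.*'
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, url_path) is not None

# The three example rules from this post, checked against hypothetical paths.
checks = [
    ("/*?",         "/category/page?sort=price"),  # matches: contains a '?'
    ("/*.php$",     "/old-script.php"),            # matches: ends in .php
    ("/*.php$",     "/old-script.php?x=1"),        # no match: '.php' is not at the end
    ("/search?s=*", "/search?s=robots"),           # matches: internal search result
]
for rule, path in checks:
    verdict = "blocked" if rule_matches(rule, path) else "allowed"
    print(f"{rule:<12} {path:<30} -> {verdict}")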

Use with Caution and Test with Search Console

Some SEOs don’t agree with the use of wildcards, or at least see them as slightly risky and far from best practice. I agree they’re not ideal: you have to be very careful, and every rule deserves thorough testing in Search Console before and after it goes live.

Learn More

Builtvisible are a team of specialists who love search, SEO and creating content marketing that communicates ideas and builds brands.

To learn more about how we can help you, take a look at the services we offer.



12 thoughts on “How to Use Wildcards in Robots.txt”

  1. JohnMu says:

    One thing you need to keep in mind is that disallows in the robots.txt will just disallow crawling, it will have less of an impact on actual indexing. If we have reason to believe that a URL which is disallowed from crawling is relevant, we may include it in our search results with whatever information we may have (if we’ve never crawled it, we may just include the URL — if we’ve crawled it in the past, we may include that information).

    To prevent URLs like these from being indexed, I would recommend that you have the server 301 redirect to the appropriate canonical (and of course not link to the incorrect one).

    John

  2. pah says:

    agree there… 301 redirect, scoop any relevant params into a cookie, new sitemap. Lovely. If only I could actually get that implemented on the site I'm working on.

  3. Awesome – thanks for the info. This would be really useful for me – to have crawlers disabled for certain WordPress posts with a particular keyword in the URL… (just amend the slug accordingly). But this only applies to Google, yeah?

  4. Thanks for the tip. I was looking for something like this to block item IDs.
    I am working on an e-commerce website and the client needs to block such dynamic URLs. So glad to find your blog.

    Bookmarked.

  5. I tried to use the Google robots.txt testing tool, but could not test it – it shows some error and I don't know how to fix it. Also, wildcards in robots.txt are supported by Google's crawlers, but not necessarily by other search engines.

  6. bhavana says:

    I tried to disallow feed pages with Disallow: /feed but it's not working.
    What is the correct way? Can I use the $ sign for the ending, like
    Disallow: /feed$ or Disallow: /feed/$ – which is correct?

  7. Linn says:

    We’re trying to prevent search results pages from being indexed by Google. All search result pages start with /?s=. We’ve added the following lines to our robots.txt file, but the pages keep getting indexed in Google:

    User-agent: *
    Disallow: /*?*
    Disallow: /*?

    Any ideas?

  8. Joseph says:

    How do I prevent indexing of subsequent pages, like site.com/category/1/, site.com/category/2/ etc.? I have many categories.

  9. There’s a setting in Yoast’s SEO plugin to do that.

  10. Charlie Whitworth says:

    Hi Guys,

    What would you guys do about 100,000 URLs such as http://www.example.com/catalog/product_compare/add/product/9363/uenc/aHR0cHM6Ly93d3cubGVpZ2htYW5zLmNvbS9wZXJzb25hbGlzZWQtZ2lmdHM_cD05Mg,,/

    My client has them all 302’d which is disastrous in my opinion – and the robots.txt is not being honoured. I added a rule for the /product_compare/ directory.

    Would it not be better to use a wildcard (/*product_compare) to stop Google from crawling hundreds of thousands of duplicate and thin pages, rather than forcing the crawlers to follow a redirect?

    I am looking to restrict these pages, as more will be generated every time a user makes a comparison on this Magento site. I’m aware that noindexing would be better here but getting the developers to do this will be easier said than done.

    Would be interested to hear your thoughts on this one. I was under the impression that streamlining crawl budget was hugely beneficial – and preferable to endless redirects?

  11. Josh says:

    Will

    User-agent: *
    Disallow: /*&

    block access to every URL that contains an ampersand (“&”)?

    Here is a standard product page on our site:

    https://alohaoutlet.com/Shops/108/en/ItemDetail.aspx?iid=7665

    which is ALSO accessible from several query strings, all of which are preceded by an & symbol, i.e.

    https://alohaoutlet.com/Shops/108/en/ItemDetail.aspx?iid=7665&CatId=1153

    Looking to disallow any URL with an & but to ALLOW URLs with only a ?

    Thanks in advance

  12. Hey Josh – that’ll work. Test it with the robots.txt tester in Search Console (docs linked below), or sanity-check the pattern locally with the short script underneath:

    https://support.google.com/webmasters/answer/6062598?hl=en
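
    Same caveat as the sketch further up the post: this is illustrative Python that assumes Google-style wildcard matching, with your two example URLs as the test paths.

    import re

    # "Disallow: /*&" with '*' expanded to '.*': matches any path containing an '&'
    blocked = re.compile(r"^/.*&")

    print(bool(blocked.match("/Shops/108/en/ItemDetail.aspx?iid=7665")))             # False: '?' only, still crawlable
    print(bool(blocked.match("/Shops/108/en/ItemDetail.aspx?iid=7665&CatId=1153")))  # True: contains '&', disallowed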
