How to Use Wildcards in Robots.txt

The best way to handle duplicate content issues on a site is to tackle the causes of duplication at the source. Avoid indexable query parameters, keep page bloat under control and run regular technical SEO health checks to catch any surprise behaviours.

But what if you don’t have the time, resources or inclination to rework a core component of your site?

Here’s another viable solution: use the Robots Exclusion Protocol (REP).

A Quick Guide to Robots.txt Wildcards

This post explains how to use wildcards in robots.txt. With the right directives you can block crawlers from all of your unwanted query strings or, at the very least, stop them from being picked up in the first place.

User-agent: *
Disallow: /*?

This blocks crawler access to any URL that contains a question mark (“?”).
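
To make the matching rules concrete, here is a minimal Python sketch of Google-style wildcard matching (purely illustrative, not any crawler’s real implementation): rules are compared from the start of the URL path, * matches any run of characters, and a trailing $ anchors the match to the end of the URL.

import re

def rule_matches(pattern: str, path: str) -> bool:
    # Illustrative sketch only: '*' becomes "match anything",
    # a trailing '$' anchors the end of the URL, and everything
    # else is a literal prefix match from the start of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*?", "/product?colour=red"))  # True  - contains a "?"
print(rule_matches("/*?", "/product"))             # False - no query string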

User-agent: Googlebot
Disallow: /*.php$

The $ character is used for “end of URL” matches. This example stops Googlebot from crawling URLs that end with “.php”.
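
Using the same illustrative rule_matches() sketch from above, the $ anchor behaves like this:

print(rule_matches("/*.php$", "/index.php"))         # True  - URL ends with .php
print(rule_matches("/*.php$", "/index.php?page=2"))  # False - "$" means nothing may follow .php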

User-agent: *
Disallow: /search?s=*

This stops any compliant crawler from crawling your internal search results pages.
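
The same sketch applied to the search rule also shows why the trailing * is harmless but not strictly needed: robots.txt rules already match as prefixes from the start of the URL.

print(rule_matches("/search?s=*", "/search?s=shoes"))  # True
print(rule_matches("/search?s=", "/search?s=shoes"))   # True - a plain prefix rule matches too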

Use with Caution and Test with Search Console

Some SEOs don’t agree with the use of wildcards, or at least they see them as slightly risky and far from best practice. I totally agree: they’re not ideal. You have to be very careful and pay close attention to testing in Search Console.
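
One low-risk way to build that confidence before touching the live file is to run a sample of real URLs (from a crawl export or your log files) through the draft rules and review what would be blocked. A rough sketch, reusing the illustrative rule_matches() from above and assuming a hypothetical urls.txt with one URL path per line:

# Both the draft_rules list and urls.txt are assumptions for this example.
draft_rules = ["/*?", "/*.php$", "/search?s="]

with open("urls.txt") as f:
    for path in (line.strip() for line in f if line.strip()):
        hits = [rule for rule in draft_rules if rule_matches(rule, path)]
        if hits:
            print(f"BLOCKED {path}  (matched: {', '.join(hits)})")

Once the sample looks right, confirm the behaviour with the robots.txt testing tool in Search Console before you deploy.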

Have We Been Helpful?

Thank you for reading a post from our Technical archives. Have we been helpful? If you'd like to engage our services or speak with one of our consultants about a project, contact us and we'll be in touch.




7 thoughts on “How to Use Wildcards in Robots.txt”

  1. JohnMu says:

    One thing you need to keep in mind is that disallows in the robots.txt will just disallow crawling, it will have less of an impact on actual indexing. If we have reason to believe that a URL which is disallowed from crawling is relevant, we may include it in our search results with whatever information we may have (if we’ve never crawled it, we may just include the URL — if we’ve crawled it in the past, we may include that information).

    To prevent URLs like these from being indexed, I would recommend that you have the server 301 redirect to the appropriate canonical (and of course not link to the incorrect one).


  2. pah says:

    Agree there… 301 redirect, scoop any relevant params into a cookie, new sitemap. Lovely. If only I could actually get that implemented on the site I'm working on.

  3. Awesome – thanks for the info. This would be really useful for me – to have crawlers blocked from certain WordPress posts with a particular keyword in the URL… (just amend the slug accordingly). But this only applies to Google, yeah?

  4. Thanks for the tip. I was looking for something like this to block item IDs.
    I am working on an e-commerce website and the client needs to block such dynamic URLs. So glad to find your blog.


  5. I tried to use the Google robots.txt testing tool, but could not test it. It's showing some error, don't know how to fix it. Further, wildcards in robots.txt are allowed by Google's robots, but not necessarily by other search engines.

  6. bhavana says:

    I tried to disallow feed pages with Disallow: /feed but it's not working.
    What is the correct way? Can I use the $ sign for the ending, like
    Disallow: /feed$ or Disallow: /feed/$? Which is correct?

  7. Linn says:

    We’re trying to prevent search results pages from being indexed by Google. All search result pages start with /?s=. We’ve added the following lines to our robots.txt file, but the pages keep getting indexed in Google:

    User-agent: *
    Disallow: /*?*
    Disallow: /*?

    Any ideas?
