  1. #1
Senior Coder NancyJ
    Join Date: Feb 2005
    Location: Bradford, UK
    Posts: 3,174
    Thanks: 19
    Thanked 66 Times in 65 Posts

    Robots.txt blocking everything except Google, MSN, Yahoo

I'm trying to construct a robots.txt that only allows crawling by certain bots. I know how to disallow specific bots by name, but not how to allow only specific named ones.

One bot in particular is causing problems (5x the crawl traffic of Googlebot!) - well, awstats says it's a bot 'identified by empty user agent string'. Is it even possible to block a bot that doesn't send a user agent? I'm also thinking that if it doesn't identify itself, it probably won't obey robots.txt anyway. I could block it in PHP, but can anyone think of anything legitimate that might not provide a user agent string?
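    For what it's worth, the "block it in PHP" idea could look something like the sketch below. This is not from the thread; it assumes the check runs at the very top of the page, before any output, and simply refuses requests whose User-Agent header is missing or empty.
    Code:
    <?php
    // Sketch only: refuse requests that send no User-Agent header at all.
    // Some legitimate tools (monitoring scripts, older HTTP libraries, a few
    // proxies) also send an empty UA, so treat this as a policy decision.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? trim($_SERVER['HTTP_USER_AGENT']) : '';
    if ($ua === '') {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
    ?>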

  • #2
Senior Coder NancyJ
    Join Date: Feb 2005
    Location: Bradford, UK
    Posts: 3,174
    Thanks: 19
    Thanked 66 Times in 65 Posts
    I tried blocking the blank user agents with
    Code:
    Options +FollowSymlinks
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
    RewriteRule .* - [F,L]
    but that blocked me, and judging by the server load it blocked everyone else too.
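    The likely culprit is the trailing [OR] flag: with no following RewriteCond to OR against, the condition chain can end up matching every request, which would explain everyone getting blocked. A minimal sketch of the same rule without the dangling flag (assuming Apache mod_rewrite in a .htaccess file; not tested on this particular server):
    Code:
    Options +FollowSymLinks
    RewriteEngine On
    # Forbid only requests whose User-Agent header is completely empty
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule .* - [F,L]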

  • #3
Regular Coder Fisher
    Join Date: Jan 2009
    Posts: 316
    Thanks: 7
    Thanked 92 Times in 91 Posts
    Code:
    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: MSNBot
    Allow: /

    User-agent: Slurp
    Allow: /
    Those are the 3 main ones.

    It's debatable whether the Allow has to come before the Disallow, and Allow might only be honoured by Googlebot. Experiment with different variations and see.
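    One variation worth trying avoids Allow entirely: an empty Disallow line means "nothing is disallowed", and a crawler is supposed to obey the record that most specifically matches its name, falling back to the * record only if nothing else matches. A sketch using the same three bots (not tested against every crawler):
    Code:
    User-agent: Googlebot
    Disallow:

    User-agent: MSNBot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: *
    Disallow: /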
    Last edited by Fisher; 09-09-2009 at 05:49 PM.

  • #4
Senior Coder Pennimus
    Join Date: Jul 2005
    Location: UK
    Posts: 1,051
    Thanks: 6
    Thanked 13 Times in 13 Posts
    Neither MSN nor Slurp supports the Allow directive, so the Allow-based example above will block them.

    Robots.txt is not a very flexible tool, Nancy; you will need to disallow on an individual basis. Plus, there are many more traffic-sending search engines than those three, so blanket-blocking everything else, even if you could do it, would be a poor idea. And even if you could, as you say, why would they listen to your robots.txt file? Most aggressive bots are malicious and won't take any notice at all!

    If they were genuine bots that would listen, they would identify themselves with a user agent and a web address.
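    For illustration, disallowing on an individual basis looks like the sketch below; the bot names are placeholders, not bots named anywhere in this thread.
    Code:
    # Block specific bots by name, one record per bot (names are examples only)
    User-agent: BadBotExample
    Disallow: /

    User-agent: AnotherBotExample
    Disallow: /

    # Everyone else may crawl normally
    User-agent: *
    Disallow: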

  • #5
Regular Coder Fisher
    Join Date: Jan 2009
    Posts: 316
    Thanks: 7
    Thanked 92 Times in 91 Posts
    Quote Originally Posted by Pennimus
    Neither MSN or Slurp support the Allow directive so the above will block them.
    You may be correct, but according to their respective sites, both support the Allow:
    Yahoo
    Bing

    I agree that a malicious bot will ignore it anyway. I like this guy's approach, even if it is a little heavy-handed: Bot Trap
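    For context, the generic bot-trap idea (outlined here in general terms, not this particular author's implementation) is to disallow a directory that no well-behaved crawler should ever request, hide a link to it on the page, and have a server-side script log or ban any client that fetches it anyway. The robots.txt half, with /bot-trap/ as an assumed example path:
    Code:
    # Well-behaved crawlers will never request this directory; anything that
    # does is ignoring robots.txt and can be banned by the script behind it.
    User-agent: *
    Disallow: /bot-trap/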

  • #6
Senior Coder Pennimus
    Join Date: Jul 2005
    Location: UK
    Posts: 1,051
    Thanks: 6
    Thanked 13 Times in 13 Posts
    I stand corrected. Historically it was a Google-only innovation, but it has evidently (at last) been adopted by the others. Thanks for correcting me.

