  • #1
    Supreme Master coder! abduraooft's Avatar
    Join Date
    Mar 2007
    Location
    N/A
    Posts
    14,863
    Thanks
    160
    Thanked 2,224 Times in 2,211 Posts

    robots.txt - should crawl but not index

    Hi all,

    I have a search page at a URL like http://mysite.com/search/ for searching a set of dynamically created profiles.

    I use the same page to display the search results, with pagination. So when someone hits the submit button on this search page (without selecting anything), they reach http://mysite.com/search/page/1/?from=0&to=0 (and can use the pagination links to reach the other pages too).

    I've also added a "browse" link in the footer, which randomly shows different links such as
    http://mysite.com/search/page/10/?from=0&to=0
    http://mysite.com/search/page/20/?from=0&to=0 etc. (where from and to correspond to the drop-downs for selecting the age range)

    Each of these search pages has links to various profile pages like
    http://mysite.com/profile/1234
    http://mysite.com/profile/2143 etc.

    My question is: how can I direct search engines to crawl all these search/browse pages without indexing them? I only need the profile pages indexed. (Blocking the search pages with "Disallow" and providing a sitemap is not as effective as allowing all the search pages to be crawled.)

    I believe I can use a meta tag like
    Code:
    <meta name="robots" content="noindex"/>
    , but would that be effective? Or is there any way to write a rule in the robots.txt file to achieve this?
    Last edited by abduraooft; 10-10-2009 at 10:43 AM.
    The Dream is not what you see in sleep; Dream is the thing which doesn't let you sleep. --(Dr. APJ. Abdul Kalam)

  • #2
    Regular Coder
    Join Date
    Jan 2009
    Posts
    316
    Thanks
    7
    Thanked 92 Times in 91 Posts
    I don't have an exact answer for you, but it sounds a lot like preventing session IDs from being spidered. The noindex tag should be sufficient.

    You could look into canonical and nofollow links. Canonical links should weed out the duplicate content. From the Google canonical link:
    If your site has identical or vastly similar content that's accessible through multiple URLs, this format provides you with more control over the URL returned in search results

  • Users who have thanked Fisher for this post:

    abduraooft (10-11-2009)

  • #3
    Supreme Master coder! abduraooft's Avatar
    Thanks, I've added
    PHP Code:
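    <?php
    // On search result pages, ask crawlers not to index the page
    // and point them at the canonical search URL.
    if ($page == 'search' && isset($query_count)) {
        echo '<meta name="robots" content="noindex"/>' . "\n";
        echo '<link rel="canonical" href="http://mysite.com/search/" />' . "\n";
    }
    ?>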
    and waiting .....

  • #4
    Regular Coder
    Join Date
    Nov 2007
    Posts
    110
    Thanks
    0
    Thanked 1 Time in 1 Post
    I think robots.txt and the meta tag do the same thing: hiding pages from Googlebot. But if a page is not indexed, there is no chance of it being crawled. So the two are related, i.e. crawled pages get indexed... correct me if I'm wrong.
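
For reference, the two mechanisms are generally understood to act at different stages. A Disallow rule in robots.txt asks a compliant crawler not to fetch the page at all, so its links are not discovered through it. A noindex meta tag can only be read once the page has been fetched. Below is a minimal sketch of that difference. The crawlOutcome() helper and its result keys are hypothetical, the rule matching is plain prefix matching (no wildcards, Allow rules or user-agent groups), and the fetched page is assumed to carry no nofollow directive.

```php
<?php
// Hypothetical helper: models what a compliant crawler does with one URL path.
// $disallowedPrefixes stands in for the Disallow rules of robots.txt.
function crawlOutcome(string $path, array $disallowedPrefixes, bool $hasNoindexMeta): array
{
    foreach ($disallowedPrefixes as $prefix) {
        // Simplified rule check: the path starts with a disallowed prefix.
        if ($prefix !== '' && strncmp($path, $prefix, strlen($prefix)) === 0) {
            // Not fetched: links on the page are never discovered and
            // any noindex meta tag on it is never read.
            return ['fetched' => false, 'links_followed' => false, 'noindex_seen' => false];
        }
    }
    // Fetched: links to profile pages can be followed, and a noindex
    // meta tag (if present) is seen by the crawler.
    return ['fetched' => true, 'links_followed' => true, 'noindex_seen' => $hasNoindexMeta];
}

// Option A: "Disallow: /search/" in robots.txt.
$blocked = crawlOutcome('/search/page/10/', ['/search/'], true);

// Option B: no Disallow rule, noindex meta tag on the search pages.
$crawled = crawlOutcome('/search/page/10/', [], true);
```

Under these assumptions, only option B lets the crawler reach the profile links through the search pages while still seeing the request not to index them.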

  • #5
    New to the CF scene
    Join Date
    Oct 2009
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi,

    As an SEO beginner, I have read this discussion and noted down the difficulties and solutions regarding robots.txt for future reference.

