  1. #1
    Regular Coder
    Join Date
    Feb 2007
    Location
    London
    Posts
    225
    Thanks
    16
    Thanked 2 Times in 2 Posts

    Get number of Google hits

    Based on code found elsewhere, I'm trying to return the number of hits Google shows for any given word.

    Here's the complete script, in case you're interested, and it works perfectly. You won't need to wade through this in order to understand my point later on:


    PHP Code:
    function getWordCount($word) {
        // Connect and send request. If can't connect, return false
        if (($h = fsockopen('www.google.com', 80)) !== false) {
            fwrite($h, "GET /search?hl=en&q=%22".urlencode($word)."%22 HTTP/1.1\r\n");
            fwrite($h, "Connection: close\r\n");
            fwrite($h, "Host: www.google.com\r\n\r\n");

            // Read response
            $response = '';
            while (!feof($h)) $response .= fread($h, 8096);
            fclose($h);

            // Locate the element that holds the result statistics
            $needle = 'resultStats';
            $pos = strpos($response, $needle);

            if ($pos !== false) {
                $response = strip_tags(substr($response, $pos));
                $response = substr($response, (strlen($needle)+4), (strlen($needle)+30));

                // Keep what lies between ' of' and 'for'
                $needle2 = ' of';
                $pos = strpos($response, $needle2);
                $response = substr($response, $pos);
                $expl = explode('for', $response);

                // Strip everything except digits and whitespace
                $response = preg_replace('/[^\d\s]/', '', $expl[0]);
            }
            else {
                echo $needle." was not found";
            }

            return $response;
        }

        return false;
    }
    The reason for this post is that I'm very conscious that this code depends entirely on the precise details of the html that Google outputs.

    If $word == (for example) 'hamster', then the code searches the html output for the part that contains the phrase:

    "Results 1 - 10 of about 16,900,000 for hamster "
    I finally managed to code something that succeeds, however messily, in grabbing the number of hits... Taken from the code above:

    PHP Code:
    $needle2 = ' of';
    $pos = strpos($response, $needle2);
    $response = substr($response, $pos);
    $expl = explode('for', $response);

    This cuts the text at ' of' and then explodes what remains at 'for', leaving the number we want (the number of Google hits for 'hamster'). The number ends up in the first element of the array, i.e. in $expl[0], and I then strip away all the unwanted stuff with:

    PHP Code:
    $response = preg_replace('/[^\d\s]/', '', $expl[0]);
    This all seems a very clumsy and insecure way of going about this.

    Can anyone suggest an alternative method of using PHP to query Google for this data? Something that doesn't depend on the html layout?

    As a side-question, what's the status of the legality of performing such a request? Am I breaking any laws by grabbing statistics from Google in this manner?

    Thanks for any insights.

  • #2
    New Coder
    Join Date
    Nov 2009
    Location
    Phoenix
    Posts
    17
    Thanks
    1
    Thanked 1 Time in 1 Post
    This is interesting.

    My first thought is that you can parse out the count more reliably if you change your code to this:

    Code:
    $response = "Results 1 - 10 of about 16,900,000 for hamster "; 
    
    $needle2 = ' of about ';
    $pos = strpos($response, $needle2);
    $response = substr($response, $pos+10);
    $expl = explode(' for ', $response);  
    
    echo $expl[0]."<br>"; // returns count
    echo trim($expl[1]); // returns search phrase
    This should better isolate the count. Testing for the position of ' of about ' and ' for ' (with outer spaces) avoids problems when the search words themselves contain the character combinations 'of' or 'for'. Also, the substring should start at the end of the matching string ($pos+10, since ' of about ' is 10 characters long), not at its beginning.

    This makes the regex unnecessary.
    Last edited by TopDogger; 03-26-2010 at 05:55 AM.

  • #3
    Regular Coder
    Join Date
    Feb 2007
    Location
    London
    Posts
    225
    Thanks
    16
    Thanked 2 Times in 2 Posts
    Thanks for these ideas.

    Try Googling:

    "purple rodent-like"
    (a random weird phrase I just improvised in the hope that it would have very few hits!)

    You'll notice that Google outputs:

    Results 1 - 6 of 6 for "purple rodent-like". (0.09 seconds)
    ...which doesn't contain the word 'about'!

    I had already realised this, hence not having handled my 'explodes' in quite the way you suggest. Your revised code is, however, clearly an improvement, and it's easily amended to reflect the fact that sometimes the text reads 'of', and sometimes 'of about'.
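    A minimal sketch of that amendment, assuming the stats line has already been isolated as plain text (and PHP 7.1 or later for the nullable return type). It uses a single regular expression with an optional 'about' rather than the explode approach, which is a swapped-in technique and not the code from either post. The function name parseHitCount is hypothetical, and the pattern only reflects the two sample lines quoted in this thread, so it would break as soon as Google changed the wording.

```php
<?php
// Hypothetical helper: pull the hit count out of a stats line such as
// "Results 1 - 10 of about 16,900,000 for hamster".
// Returns the count as an int, or null when the line does not match.
function parseHitCount(string $statsLine): ?int
{
    // " of ", an optional "about ", digits with thousands separators, then " for ".
    if (preg_match('/ of (?:about )?([\d,]+) for /', $statsLine, $m) !== 1) {
        return null;
    }

    // Drop the commas before converting to an integer.
    return (int) str_replace(',', '', $m[1]);
}

var_dump(parseHitCount('Results 1 - 10 of about 16,900,000 for hamster '));
var_dump(parseHitCount('Results 1 - 6 of 6 for "purple rodent-like". (0.09 seconds)'));
```

    Because preg_match stops at the leftmost match, a search phrase that happens to contain ' of ' or ' for ' later in the line should not disturb the count.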

    Without meaning to appear ungrateful though, the primary purpose of my posting was to see if perhaps there's some way of getting at this data without recourse to such methods. I can't think of anything that could possibly work, so I may be stuck with php code that will break in the event that Google ever changes the formatting of their search results output.

    If no-one has any alternative ideas, then I'll certainly go with your regex-less, neater version.

    This is just a shot in the dark, but might there be some Google-provided API that could help with getting these figures?

    What about the legal question I raised at the bottom of my initial posting?

    P.S. How bizarre that "purple rodent-like" has actually been used by people!

    P.P.S. Welcome to this forum!
    Last edited by cfructose; 03-26-2010 at 07:10 AM. Reason: Adding something

  • #4
    New Coder
    Join Date
    Nov 2009
    Location
    Phoenix
    Posts
    17
    Thanks
    1
    Thanked 1 Time in 1 Post
    If you do the search with the quotes, then you are doing an exact match search and it changes the results. An exact match returns only pages that use the exact phrase somewhere on the page, plus pages that are identified with that keyword theme due to text links.

    I thought you were just doing a typical broad match (without quotes) which returns any page that uses any of the search words somewhere on the page.

    If you are trying to see how many competing pages there are for a particular search phrase, you should use an exact match.

    That still becomes a little tricky because it looks like Google adds the word "about" whenever there are more than 1,000 results and when you do a broad match search, but it isn't even consistent about that. Major search engines only display the first 1,000 results. You cannot get to the search results pages beyond that.

    Getting back to your original question, I don't know of an easier way to pull the search results count from Google. Be aware that you should be careful about using a script like this. Don't set it up to check 500 phrases in a few minutes. Google started blocking IPs for search results tools a few years ago. They do not like scrapers. If they detect too many search requests in a short period of time, they will likely block your IP. If you have any Google accounts, they know who you are through Google cookies. Big Brother is watching!
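    A rough sketch of spacing the requests out, for anyone who does run such a script. Everything here is an assumption: the function name lookupThrottled is hypothetical, the lookup is passed in as a callable so that any hit-count function can be plugged in, and the default delay of 2 seconds is an arbitrary illustration, not a limit published by Google.

```php
<?php
// Hypothetical helper: run $lookup once per phrase, pausing between calls
// so that the requests are not fired in a rapid burst.
function lookupThrottled(array $phrases, callable $lookup, float $delaySeconds = 2.0): array
{
    $results = [];
    $first = true;

    foreach ($phrases as $phrase) {
        if (!$first) {
            // usleep() takes microseconds.
            usleep((int) round($delaySeconds * 1000000));
        }
        $first = false;
        $results[$phrase] = $lookup($phrase);
    }

    return $results;
}

// Usage with a stand-in lookup (no network access) and a short delay.
$fakeLookup = function ($phrase) {
    return strlen($phrase);
};
print_r(lookupThrottled(['hamster', 'purple rodent-like'], $fakeLookup, 0.1));
```

    A fixed pause is the simplest option. It does nothing about the total number of requests per day, so it would not by itself keep a script within whatever Google tolerates.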

    Thanks for the welcome. I will probably be visiting regularly.
    Last edited by TopDogger; 03-26-2010 at 03:49 PM.

  • Users who have thanked TopDogger for this post:

    cfructose (03-31-2010)

  • #5
    Regular Coder
    Join Date
    Feb 2007
    Location
    London
    Posts
    225
    Thanks
    16
    Thanked 2 Times in 2 Posts
    Yeah, I'll sometimes be doing a single word, and sometimes a phrase.

    Anyhow, your cautions about Big Brother are enough to convince me that I ought to abandon this idea entirely! If the project I'm working on is to be successful at all, then there would indeed be hundreds of 'scrapes' per minute.

    Ho hum. Back to the drawing board.

    I don't suppose you can think of any clever ways to work out whether any given two-word phrase is a common collocation in the English language (i.e. whether they occur together with notable frequency), short of querying Google?

    My initial plan was to set a minimum number for Google hits (say, 1000), and if the query string ended up returning more than that number, then I would know that those two words have been used together often enough that they're worth considering as an example for my purposes. (My purposes require knowing that these two words aren't utter nonsense / ludicrously rare when juxtaposed).
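    The threshold plan above could be sketched roughly as follows. The function name isCommonCollocation and the $getHitCount callable are hypothetical; the callable stands in for whatever source of hit counts ends up being used, and is assumed to return an int or null.

```php
<?php
// Hypothetical helper: treat a two-word phrase as a common collocation
// when its hit count is above a minimum (1000, as in the plan above).
function isCommonCollocation(string $first, string $second, callable $getHitCount, int $minHits = 1000): bool
{
    $phrase = $first . ' ' . $second;
    $hits = $getHitCount($phrase);

    // A failed lookup (null) is treated as "not common".
    return $hits !== null && $hits > $minHits;
}

// Usage with made-up counts, purely for illustration.
$fakeCounts = function ($phrase) {
    return $phrase === 'hamster wheel' ? 250000 : 12;
};
var_dump(isCommonCollocation('hamster', 'wheel', $fakeCounts));
var_dump(isCommonCollocation('purple', 'rodent-like', $fakeCounts));
```

    Keeping the hit-count source behind a callable means the threshold logic would not need to change if the scraper were swapped for something else.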

    An open source database of words alongside the frequency of occurrence of other words before and after them seems like a highly unlikely thing to exist! (I've searched, of course, and can find nothing usable!)

    Well, even though I would be delighted to hear any suggestions about how this apparently intractable problem could be solved, I shan't hold my breath!

    Thanks so much for your input. It was invaluable.

  • #6
    Regular Coder xconspirisist's Avatar
    Join Date
    Jun 2006
    Location
    Great Britain.
    Posts
    138
    Thanks
    1
    Thanked 6 Times in 6 Posts
    Forgive me because I did not read the entire thread, but it seems like you are trying to page-scrape google.... why are you not using the google api?
    If I have been helpful, use the "thank" button - It makes me happy!

    xconspirisist.co.uk - homepage of my online alias
    technowax.net - a community for people interested in all forms of modern technology.

  • Users who have thanked xconspirisist for this post:

    cfructose (03-31-2010)

  • #7
    Regular Coder
    Join Date
    Feb 2007
    Location
    London
    Posts
    225
    Thanks
    16
    Thanked 2 Times in 2 Posts
    There is such a thing?!


    To answer your rhetorical question "Why aren't you using the Google API", well, because I'm so ill-informed that I had no idea it existed!

    You've rescued me from my despair!

    Well, I went to the site, and I already have an AJAX Search API Key.
    Gosh, that was quick and simple!

    Thanks for the heads-up.

    Am I right in presuming that I can now legitimately perform thousands of requests for such data without Google blacklisting me?

  • #8
    Regular Coder xconspirisist's Avatar
    Join Date
    Jun 2006
    Location
    Great Britain.
    Posts
    138
    Thanks
    1
    Thanked 6 Times in 6 Posts
    heh, glad to be of help and avoid you re-coding the wheel

    I believe the free API key allows you up to 1,000 requests a day or something. Google certainly won't blacklist you for using it because it is a published feature, as long as you stay within their fair usage policies. Check out the FAQ.

