  1. #1
    New Coder
    Join Date
    Apr 2009
    Location
    US Florida
    Posts
    25
    Thanks
    4
    Thanked 0 Times in 0 Posts

    Why does this simple PHP spider script output a bunch of junk?

    I'm trying to learn this spider scraping thing. Why does this script output a bunch of junk, and how do I clean it up?

    <?php
    $original_file = file_get_contents("http://www.domain.com");

    $stripped_file = strip_tags($original_file, "<a>");

    preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

    //DEBUGGING
    //$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
    //$matches[1] now contains only the HREFs in the A tags; ex: link

    header("Content-type: text/plain");

    //Set the content type to plain text so the print below is easy to read!

    print_r($matches); //View the array to see if it worked
    ?>

  • #2
    Regular Coder
    Join Date
    Mar 2006
    Posts
    238
    Thanks
    3
    Thanked 37 Times in 37 Posts
    Could you explain what you would like to see as the result of the script? I have run it. For me it outputs an array, with $matches[0] containing the complete <a> tags and $matches[1] containing the values of the href attributes. Which result would you like to get? Could you explain the problem a little more, please?

  • #3
    New Coder
    Join Date
    Apr 2009
    Location
    US Florida
    Posts
    25
    Thanks
    4
    Thanked 0 Times in 0 Posts
    Apparently it's supposed to go to domain.com and pull the links; correct me if I'm wrong.
    What I want it to print is the links. Below is what I'm getting. Is this what you're getting?

    Array ( [0] => Array ( [0] => English [1] => Español [2] => Domain.com - It All Starts with a great Domain [3] => [4] => [5] => [6] => Sign up for Domain.com Offers [7] => Support [8] => Domains [9] => Web Hosting [10] => VPS Hosting [11] => Hosting [12] => Web Design [13] => Hosting [14] => Marketing [15] => Start Your Website, Scorching Fast Rock Solid Hosting [16] => Learn More! [17] => RENEW [18] => TRANSFER TO DOMAIN.COM [19] => WHOIS Lookup [20] => See all domain extensions [21] => Included FREE: ■ Total DNS Management ■ URL Forwarding ■ Email Forwarding ■ Transfer Lock [22] => Domain names [23] => Host your site [24] => VPS hosting [25] => *Terms [26] => [27] => [28] => [29] => [30] => Home [31] => Domain Names [32] => Web Hosting [33] => Website Builder [34] => Professional Web Design [35] => Email [36] => VPS Hosting [37] => eCommerce Hosting [38] => eCommerce Web Design [39] => Online Marketing [40] => Email Marketing [41] => PPC Marketing [42] => SEO Services [43] => SSL Certificates [44] => Private Domain Registration [45] => About Us [46] => Customer Support [47] => Blog [48] => Login [49] => WHOIS [50] => About [51] => Support [52] => FAQ [53] => Affiliate Program [54] => Legal Notices [55] => Privacy Policy [56] => Registration Agreement ) [1] => Array ( [0] => javascript:void(0) [1] => ?lang=es [2] => / [3] => https://secure.domain.com/order/usc/...p?siteid=42566 [4] => /account [5] => https://secure.domain.com/webmail/?siteid=42566 [6] => /newsletter/subscribe.php [7] => /contact [8] => /domains/ [9] => /hosting/ [10] => /vps/ [11] => /email/ [12] => /designstudio/ [13] => /ssl/ [14] => /marketing/ [15] => /hosting/ [16] => /hosting/ [17] => /domains/renewal.php [18] => /domains/transfer.php [19] => https://secure.domain.com/services/w...p?siteid=42566 [20] => /domains/search.php [21] => /domains/tools.php [22] => /domains/ [23] => /hosting/ [24] => /vps/ [25] => javascript:void(0) [26] => /ssl/ [27] => /domains/tld_us.php [28] => 
http://twitter.com/domaindotcom [29] => /vps/ [30] => / [31] => /domains/ [32] => /hosting/ [33] => /sitebuilder/ [34] => /designstudio/ [35] => /email/ [36] => /vps/ [37] => /hosting/ [38] => /designstudio/ecommerce.php [39] => /marketing/ [40] => /marketing/email.php [41] => /marketing/promotion.php [42] => /marketing/seo.php [43] => /ssl/ [44] => /domains/whoisprivacy.php [45] => /about/ [46] => /contact/ [47] => /blog/ [48] => /account [49] => https://secure.domain.com/services/w...p?siteid=42566 [50] => /about [51] => /contact [52] => https://secure.domain.com/KM/script_...unt_name=42566 [53] => /affiliate/ [54] => https://secure.domain.com/common/agr...p?siteid=42566 [55] => https://secure.domain.com/order/regi...p?siteid=42566 [56] => https://secure.domain.com/order/regi...p?siteid=42566 ) )

    I see links inside all this junk. How can I clean this stuff out and show just the links?
    Last edited by agfre44_9873; 09-07-2009 at 11:54 AM.

  • #4
    Regular Coder
    Join Date
    Mar 2006
    Posts
    238
    Thanks
    3
    Thanked 37 Times in 37 Posts
    You mean you would like to get rid of everything except absolute or relative URLs? I have modified your regular expression a little:
    PHP Code:
    <?php
    $original_file = file_get_contents("http://www.domain.com");

    $stripped_file = strip_tags($original_file, "<a>");

    preg_match_all("/<a(?:[^>]*)href=([\"']?)(?=http|\/|\.)([^\"' >]*?)\\1(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

    echo '<pre>' . print_r($matches[2], true) . '</pre>';
    ?>
    Now it returns absolute URLs that start with "http" and relative URLs that start with "/" or "." (e.g. "/example.php" or "../example.php").

    Please run it and look at the result.

    I would also suggest changing domain.com to some other URL for tests. I hope domain.com, which you initially used for the test, does not mind the free advertising, but they could still be unhappy if we experiment too much. Another reason to use a different domain is to test against different HTML and different situations. For example, all URLs at domain.com are surrounded with double quotes. This could be different if you run your script on some other site. I would even suggest creating a special page with very bad HTML for testing.
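    One way to try this without touching a live site is to run the pattern against a small hand-written string. The snippet below is only a sketch: $test_html, $links, and the example.com address are made up for the test, and the pattern is the same one used in the code above.

```php
<?php
// Hypothetical test markup (not taken from any real site): one double-quoted
// absolute URL, one single-quoted relative URL, a javascript: link, and a
// relative URL that starts with neither "/" nor ".".
$test_html =
    '<a href="http://example.com/a.php">double quotes</a>' . "\n" .
    "<a href='/b.php'>single quotes</a>" . "\n" .
    '<a href="javascript:void(0)">script link</a>' . "\n" .
    '<a class="x" href="d.php">bare relative link</a>';

// Same pattern as in the post above: group 1 is the optional quote,
// group 2 is the URL, which must start with "http", "/" or ".".
preg_match_all("/<a(?:[^>]*)href=([\"']?)(?=http|\/|\.)([^\"' >]*?)\\1(?:[^>]*)>(?:[^<]*)<\/a>/is", $test_html, $matches);

$links = $matches[2];

// Print one URL per line instead of dumping the whole array.
echo implode("\n", $links), "\n";
```

    With this sample, the javascript: link and the bare "d.php" link are skipped, because neither starts with "http", "/" or ".". Adding more odd cases to $test_html is a cheap way to see where the pattern stops working.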
    Last edited by SKDevelopment; 09-07-2009 at 01:43 PM.

  • Users who have thanked SKDevelopment for this post:

    agfre44_9873 (09-07-2009)

  • #5
    New Coder
    Join Date
    Apr 2009
    Location
    US Florida
    Posts
    25
    Thanks
    4
    Thanked 0 Times in 0 Posts
    Hey, that's a lot better. Thank you very much! One thing I don't understand is what goes inside preg_match_all(). My intention is to go to http://www.dol.gov and look for info on migrant workers. If I use their search engine, I get the following link:

    http://www.dol.gov/search/AdvSearch....h_term=migrant

    I put this in my small script and it pulls out all the links from the above search. I also want the spider to get the short piece of info under each link. For example:

    /*this is 1st link under search result*/
    Compliance Assistance By Law - The Migrant and Seasonal Agricultural Worker Protection Act
    /*end link*/

    /*content under link*/
    This Page E-mail This Page The Migrant and Seasonal Agricultural Worker ProtectionReturn to By Law Menu OVERVIEW The Migrant and Seasonal Agricultural Worker Protectionprovides employment-related protections to migrant and seasonal agricultural workers and

    I want to get the content also. Every link and its content are inside a table. DOL uses the following format to display links and content:


    <tr>
    <td>
    <p><a href="http://www.dol.gov/dol/compliance/comp-msawpa.htm">Compliance Assistance By Law - The Migrant and Seasonal Agricultural Worker Protection Act</a></p>
    </td>
    <td>
    <p>34k</p>
    </td>
    </tr>
    <tr>
    <td>
    <p>This Page E-mail This Page The Migrant and Seasonal Agricultural Worker ProtectionReturn to By Law Menu OVERVIEW The Migrant and Seasonal Agricultural Worker Protectionprovides employment-related protections to migrant and seasonal agricultural workers and</p>
    </td>
    <td>
    </td>
    </tr>

    I know I have to look at the tags. My question is, where do I start? How do I set up preg_match_all() to grab both the link and the content? Also, how can I get rid of the array and just have the links and content?
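    One possible starting point, swapping preg_match_all() for PHP's DOM extension, is sketched below. It is not a confirmed solution for the DOL site: it assumes the DOM extension is installed, that every result row looks like the two rows quoted above, and that the description always sits in the first cell of the row right after the link. $html, $results, and the array keys are made-up names, and the description text is shortened.

```php
<?php
// Sample markup modelled on the two table rows quoted above (description
// shortened). On a real run this string would come from file_get_contents().
$html = '<table>'
    . '<tr><td><p><a href="http://www.dol.gov/dol/compliance/comp-msawpa.htm">'
    . 'Compliance Assistance By Law - The Migrant and Seasonal Agricultural Worker Protection Act'
    . '</a></p></td><td><p>34k</p></td></tr>'
    . '<tr><td><p>This Page E-mail This Page The Migrant and Seasonal Agricultural Worker Protection</p></td><td></td></tr>'
    . '</table>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$results = array();

// Assumption: a result is a <tr> that holds a link, and its description is
// in the first <td> of the <tr> that follows it.
foreach ($xpath->query('//tr[td/p/a]') as $row) {
    $link = $xpath->query('td/p/a', $row)->item(0);
    $next = $xpath->query('following-sibling::tr[1]/td[1]', $row)->item(0);
    $results[] = array(
        'url'     => $link->getAttribute('href'),
        'title'   => trim($link->textContent),
        'content' => $next ? trim($next->textContent) : '',
    );
}

// Print plain lines rather than dumping the array with print_r().
foreach ($results as $r) {
    echo $r['title'], "\n", $r['url'], "\n", $r['content'], "\n\n";
}
```

    Looping over $results and echoing the fields is what gets rid of the array in the output. If the real page nests the rows differently, the two XPath expressions are the part to adjust.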
    Last edited by agfre44_9873; 09-07-2009 at 07:42 PM.

