  1. #1
    New Coder
    Join Date
    Oct 2008
    Thanked 0 Times in 0 Posts

    Help with a function to crawl an entire website for links

    I'm trying to crawl a specific website for links and show them at the end. The problem I'm facing is that it only shows the links from the single page I submit, not from all the pages on the website. I tried several loops with no success; please give me some advice.
    Here is the code:
    <?php
    if (isset($_POST['Submit'])) {
        function getLinks($link)
        {
            /*** return array ***/
            $ret = array();
            /*** a new dom object ***/
            $dom = new DOMDocument;
            /*** get the HTML (suppress errors); the load call was missing from the post, loadHTMLFile() is assumed ***/
            @$dom->loadHTMLFile($link);
            /*** remove silly white space ***/
            $dom->preserveWhiteSpace = false;
            /*** get the links from the HTML ***/
            $links = $dom->getElementsByTagName('a');
            /*** loop over the links ***/
            foreach ($links as $tag) {
                /*** textContent also works for anchors that have no child nodes ***/
                $ret[$tag->getAttribute('href')] = $tag->textContent;
            }
            return $ret;
        }
        /*** a link to search ***/
        $link = $_POST['address'];
        /*** get the links ***/
        $urls = getLinks($link);
        /*** check for results ***/
        if (sizeof($urls) > 0) {
            foreach ($urls as $key => $value) {
                /*** anything starting with http:// or https:// is treated as external ***/
                if (preg_match('/^https?:\/\//i', $key)) {
                    echo '<span style="color:RED;">' . htmlspecialchars($key) . ' - external</span><br />';
                } else {
                    echo '<span style="color:BLUE;">' . htmlspecialchars($link . $key) . ' - internal</span><br />';
                }
            }
        } else {
            echo 'No links found at ' . htmlspecialchars($link);
        }
    }
    ?>
    <br /><br />
    <form action="" method="post" enctype="multipart/form-data" name="link">
    <input name="address" type="text" value="" />
    <input name="Submit" type="Submit" />
    </form>

  2. #2
    Regular Coder adarshakb
    Join Date
    Jun 2009
    Silicon valley of india
    Thanked 1 Time in 1 Post
    Call the getLinks() function recursively.

    After getting all the links on a page, do the following:
    1. Store the current page's link in a global array (or any other data structure, such as a linked list).
    2. Call getLinks() for each link on the current page, but only if it is not already in the global array (i.e. not already crawled) and it belongs to the same website. If you follow every link, you may end up crawling other websites as well.
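    The two steps above could be sketched roughly as follows. This is only an illustration under some assumptions, not a finished crawler: sameSiteUrl() and crawl() are hypothetical names, the $maxPages limit is an added safety cap, and the link fetcher is passed in as a callable so that the getLinks() function from the first post (which returns an array keyed by href) can be plugged in. URL resolution is deliberately simplified: ports, "../" segments and query strings are not normalized.

```php
<?php
// Hypothetical helper: turn an href found on $pageUrl into an absolute URL
// on the same host, or return null if the link should not be followed.
function sameSiteUrl($pageUrl, $href)
{
    // Drop any #fragment so the same page is not crawled twice.
    $href = trim(preg_replace('/#.*$/', '', (string) $href));
    if ($href === '' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
        return null;
    }
    $page = parse_url($pageUrl);
    if (empty($page['scheme']) || empty($page['host'])) {
        return null;
    }
    $root = $page['scheme'] . '://' . $page['host'];
    if (preg_match('/^https?:\/\//i', $href)) {
        // Absolute URL: follow it only if it points at the same host.
        $host = parse_url($href, PHP_URL_HOST);
        return (is_string($host) && strcasecmp($host, $page['host']) === 0) ? $href : null;
    }
    if (substr($href, 0, 2) === '//') {
        return null; // protocol-relative links are skipped in this sketch
    }
    if ($href[0] === '/') {
        return $root . $href; // root-relative link
    }
    // Plain relative link: resolve against the current page's directory.
    $path = isset($page['path']) ? $page['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// Recursive crawl. $visited plays the role of the "global array" and is
// passed by reference; $fetchLinks is any callable returning href => text.
function crawl($url, array &$visited, $fetchLinks, $maxPages = 50)
{
    if (isset($visited[$url]) || count($visited) >= $maxPages) {
        return; // already crawled, or the page cap was reached
    }
    $visited[$url] = true; // step 1: remember this page
    foreach ($fetchLinks($url) as $href => $text) {
        $next = sameSiteUrl($url, $href);
        if ($next !== null) {
            crawl($next, $visited, $fetchLinks, $maxPages); // step 2
        }
    }
}
```

    With the original getLinks() defined, usage might look like `$visited = array(); crawl($_POST['address'], $visited, 'getLinks');`, after which `array_keys($visited)` holds every page that was reached. Marking a page as visited before following its links is what stops two pages that link to each other from recursing forever.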

