  1. #1
    Regular Coder
    Join Date
    Jul 2003
    Posts
    601
    Thanks
    17
    Thanked 0 Times in 0 Posts

    Manually Crawling for Data (w/o RSS Feed)

    Hey Guys,

    I'm trying to find some information on manually crawling/pulling news data from a particular site, when it doesn't offer an RSS feed.

    I currently have a tabbed-panel of RSS fed content, but there's one place I need to manually pull the news info from.

    Any good place to start reading up/looking?

    Thanks in advance.

  • #2
    Regular Coder
    Join Date
    Oct 2008
    Posts
    214
    Thanks
    5
    Thanked 22 Times in 22 Posts
    Not sure if I understand you...

    You want to get some data from an external website and that data isn't XML (RSS) formatted?

    If so, you will need to read the whole page, parse it, and then display the parsed data on your side...

    Depending on which technology you can use (PHP or other) and what the target data looks like, the solution may vary greatly...

    More details are needed.

  • #3
    Regular Coder
    Join Date
    Jul 2003
    Posts
    601
    Thanks
    17
    Thanked 0 Times in 0 Posts
    Correct. I would like to retrieve it via PHP somehow.

    The site is very simple: a news site with headlines. Clicking a headline takes you to the specific news page. All I would like to grab is the text of each headline and the link it goes to.

    Does that help?

  • #4
    Regular Coder
    Join Date
    Dec 2006
    Posts
    166
    Thanks
    9
    Thanked 4 Times in 4 Posts
    Maybe something like the cURL function in combination with regular expressions?

  • #5
    Regular Coder
    Join Date
    Jul 2003
    Posts
    601
    Thanks
    17
    Thanked 0 Times in 0 Posts
    I see something like this for pulling a direct RSS feed using cURL:

    http://phpsense.com/php/php-curl-functions.html

    But I'm not sure how I would get the info/links directly from a specific page ..

  • #6
    Regular Coder
    Join Date
    Dec 2006
    Posts
    166
    Thanks
    9
    Thanked 4 Times in 4 Posts
    Well you can download the webpage source using cURL and then use a regular expression to find the HTML tag which encloses the heading. Then you can extract the info between the tags.
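
    A minimal sketch of the download step, assuming the cURL extension is enabled on your host. The function name fetch_page and the user-agent string are made up for illustration; the curl_* calls are the standard PHP ones.

```php
<?php
// Download a page's HTML with cURL.
// Returns the body as a string, or null if the request failed.
// fetch_page() is a hypothetical helper name, not part of PHP.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed overall
        CURLOPT_USERAGENT      => 'HeadlineFetcher/0.1', // made-up identifier
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses as a failed fetch.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}

// Usage (placeholder URL, not the real news site):
// $html = fetch_page('http://example.com/news.htm');
```

    The returned string is the raw page source, which you can then hand to whatever parsing step you choose.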

  • #7
    Regular Coder
    Join Date
    Jul 2003
    Posts
    601
    Thanks
    17
    Thanked 0 Times in 0 Posts
    Well, the source code looks like this:

    Code:
    <LI TYPE=news><A HREF="titlename.htm">new title here</A></LI>
    <LI TYPE=news><A HREF="titlename2.htm">news title 2 here</A></LI>

    Do you think that's the best way to do it? I'm curious what kind of load time/delay there would be in loading the whole page ... Any thoughts?

    Thanks.
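
    For markup shaped like the two lines above, the regular-expression approach could look roughly like this. It is only a sketch: extract_headlines is a made-up name, and the pattern assumes every headline is written exactly as <LI TYPE=news><A HREF="...">...</A>, so it will miss items if the site changes its markup.

```php
<?php
// Pull the link target and headline text out of each news list item.
// extract_headlines() is a hypothetical helper; the pattern assumes the
// <LI TYPE=news><A HREF="...">title</A> layout shown in the post.
function extract_headlines(string $html): array
{
    // i = case-insensitive, s = let "." match newlines inside a title
    $pattern = '~<LI\s+TYPE=news>\s*<A\s+HREF="([^"]+)"[^>]*>(.*?)</A>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $headlines = [];
    foreach ($matches as $m) {
        $headlines[] = [
            'url'   => html_entity_decode($m[1]),
            'title' => trim(html_entity_decode(strip_tags($m[2]))),
        ];
    }
    return $headlines;
}

$sample = <<<HTML
<LI TYPE=news><A HREF="titlename.htm">new title here</A></LI>
<LI TYPE=news><A HREF="titlename2.htm">news title 2 here</A></LI>
HTML;

foreach (extract_headlines($sample) as $h) {
    echo $h['title'], ' -> ', $h['url'], "\n";
}
// prints:
// new title here -> titlename.htm
// news title 2 here -> titlename2.htm
```

    The links in that markup are relative, so they would need the news site's base address put in front of them before being shown on another site. If the markup turns out to be messier than this, PHP's DOMDocument class is usually more forgiving than a regex.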

  • #8
    Master Coder
    Join Date
    Jun 2003
    Location
    Cottage Grove, Minnesota
    Posts
    9,549
    Thanks
    8
    Thanked 1,094 Times in 1,085 Posts
    I think the term you're looking for is "web page scraping".
    http://www.google.com/search?hl=en&q...earch&aq=f&oq=

    several techniques ... some ethical issues to deal with too.

  • #9
    Regular Coder
    Join Date
    Jul 2003
    Posts
    601
    Thanks
    17
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by 194673 View Post
    Well you can download the webpage source using cURL and then use a regular expression to find the HTML tag which encloses the heading. Then you can extract the info between the tags.
    Could I also use fsockopen instead of cURL? Seems that's more popular with hosts ...
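
    fsockopen can do the same job, but you have to write the HTTP request and split the response yourself. A rough sketch, assuming plain HTTP on port 80 and no redirects; fetch_page_socket and split_http_response are made-up helper names:

```php
<?php
// Split a raw HTTP response into headers and body.
// Returns the body for a 2xx status, otherwise null.
function split_http_response(string $response): ?string
{
    $parts = explode("\r\n\r\n", $response, 2);
    if (count($parts) < 2) {
        return null; // no header/body separator found
    }
    // The status line looks like "HTTP/1.1 200 OK".
    if (!preg_match('~^HTTP/\d\.\d\s+2\d\d~', $parts[0])) {
        return null;
    }
    return $parts[1];
}

// Fetch http://$host$path over a raw socket (hypothetical helper).
function fetch_page_socket(string $host, string $path = '/', int $timeoutSeconds = 10): ?string
{
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeoutSeconds);
    if ($fp === false) {
        return null; // could not connect
    }
    stream_set_timeout($fp, $timeoutSeconds);

    // HTTP/1.0 keeps the server from replying with chunked encoding,
    // which this sketch does not decode.
    fwrite($fp, "GET {$path} HTTP/1.0\r\nHost: {$host}\r\nConnection: close\r\n\r\n");

    $response = stream_get_contents($fp);
    fclose($fp);

    return $response === false ? null : split_http_response($response);
}

// Usage (placeholder host and path):
// $html = fetch_page_socket('example.com', '/news.htm');
```

    cURL takes care of redirects, HTTPS and encodings for you, so the socket version is mostly worth it only when cURL is not available on the host.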

  • #10
    Regular Coder
    Join Date
    Jul 2003
    Posts
    601
    Thanks
    17
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by mlseim View Post
    I think the term you're looking for is "web page scraping".
    http://www.google.com/search?hl=en&q...earch&aq=f&oq=

    several techniques ... some ethical issues to deal with too.
    Yes, I have found similar info. It won't be an issue, as I've confirmed with the site what this will be doing. It will simply grab news titles and link directly to their site for added traffic.

    The only potential issue is the bandwidth hit in grabbing the links, so I would like it to happen only a limited number of times a day.
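
    One way to limit how often the remote page is hit is to keep a local copy and only re-download it once that copy is older than some threshold. A sketch, assuming the web server can write to the cache file; cached_fetch and the example path are made up for illustration:

```php
<?php
// Return the cached page if it is fresh enough, otherwise call $fetch
// (any callable returning the page source or null) and store the result.
// cached_fetch() is a hypothetical helper name.
function cached_fetch(string $cacheFile, int $maxAgeSeconds, callable $fetch): ?string
{
    clearstatcache(); // make sure filemtime() is not a stale cached value

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached; // fresh copy, no remote request made
        }
    }

    $fresh = $fetch();
    if ($fresh !== null) {
        file_put_contents($cacheFile, $fresh, LOCK_EX);
        return $fresh;
    }

    // The download failed: fall back to the old copy if there is one.
    $stale = is_file($cacheFile) ? file_get_contents($cacheFile) : false;
    return $stale === false ? null : $stale;
}

// Usage: refresh at most every 6 hours, i.e. about 4 requests a day.
// $html = cached_fetch('/tmp/headlines.html', 6 * 3600, function () {
//     return null; // replace with your own download code
// });
```

    With this in place, page views on your side are served from the local file, and the other site only sees a handful of requests per day.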

  • #11
    Master Coder
    Join Date
    Jun 2003
    Location
    Cottage Grove, Minnesota
    Posts
    9,549
    Thanks
    8
    Thanked 1,094 Times in 1,085 Posts
    If you were able to contact that other site's owner, maybe you could have them
    put up an RSS feed for the data you want. That would be a win-win situation for
    both of you. It would help their bandwidth, and it would give you the XML you need
    to easily display the information (with links back to their site).

    That's really the way it should be done.

