  1. #1
    Regular Coder
    Join Date
    Apr 2007
    Posts
    317
    Thanks
    24
    Thanked 3 Times in 3 Posts

    Need suggestions on parsing information

    Hi guys,

    I've set up a mail pipe to retrieve lottery results by email. This is the first time I've done this, and I'd like some advice on how best to proceed. Below is a sample email that I need to parse to retrieve the "game name", the "drawing date" and the "results" of the game. My question is: if all of these values change with every email, how do I reliably locate and parse the correct sections of this mess, given that there are random spaces or characters in odd places? I'm not looking for someone to write me code, just for some pointers to follow.

    Code:
    <head><meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DISO=
    -8859-1">
    
    <title>Florida Lottery Winning Numbers</title>
    <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=
    1">
    </head>
    <!-- Top -->
    <body bgcolor=3D"#D6E69D">
    <table width=3D"500" border=3D"1" bordercolor=3D"#99CC00" align=3D"center" =
    cellpadding=3D"0" cellspacing=3D"0">
     <tr>
       <td><table width=3D"500" border=3D"0" bgcolor=3D"#FFFFFF" align=3D"cent=
    er" cellpadding=3D"0" cellspacing=3D"0">
         <tr>
           <td><a href=3D"http://www.flalottery.com"><img src=3D"http://www.fl=
    alottery.com/exptkt/header.gif" alt=3D"Florida Lottery Winning Numbers" wid=
    th=3D"500" height=3D"90" border=3D"0"></a></td>
         </tr>
    <!-- draw_date -->
         <tr>
           <td align=3D"center"><table width=3D"100%" border=3D"0" cellspacing=
    =3D"0" cellpadding=3D"8">
               <tr>
                 <td><div align=3D"center"><font face=3D"Arial" size=3D"5" col=
    or=3D"#8CC43F"><strong>Thursday, January 19,  2012 Draws</strong></font></d=
    iv></td>
               </tr>
           </table></td>
         </tr>
               <tr>
                 <td colspan=3D"2"><hr color=3D"#99CC00" size=3D"1" width=3D"9=
    0%"></td>
               </tr>
    
    <!-- Play4 midday -->
         <tr>
           <td align=3D"center"><table width=3D"100%" border=3D"0" cellspacing=
    =3D"0" cellpadding=3D"0">
               <tr>
                 <td width=3D"40%"><div align=3D"center"><a href=3D"http://www=
    .flalottery.com/inet/games-play4Main.do"><img src=3D"http://www.flalottery.=
    com/exptkt/play4.gif" alt=3D"Play 4" width=3D"117" height=3D"49" vspace=3D"=
    3" border=3D"0"></a></div></td>
                 <td width=3D"60%"><div align=3D"left"><font face=3D"Arial" si=
    ze=3D"5" color=3D"#000000"><strong><font color=3D"#666666">Midday:</font>  =
    2 - 0 - 5 - 6<br>
                         </strong></font></div></td>
               </tr>
           </table></td>
         </tr>
         <tr>
           <td align=3D"center"><hr color=3D"#99CC00" size=3D"1" width=3D"90%"=
    ></td>
         </tr>
    
    <!-- Cash3 midday-->
         <tr>
           <td align=3D"center"><table width=3D"100%" border=3D"0" cellspacing=
    =3D"0" cellpadding=3D"0">
               <tr>
                 <td width=3D"40%"><div align=3D"center"><a href=3D"http://www=
    .flalottery.com/inet/games-cash3Main.do"><img src=3D"http://www.flalottery.=
    com/exptkt/cash3.gif" alt=3D"Cash 3" width=3D"117" height=3D"46" vspace=3D"=
    3" border=3D"0"></a></div></td>
                 <td width=3D"60%"><div align=3D"left"><font face=3D"Arial" si=
    ze=3D"5" color=3D"#000000"><strong><font color=3D"#666666">Midday:</font> 6=
     - 2 - 0<br>
                        </strong></font></div></td>
               </tr>
           </table></td>
         </tr>
         <tr>
           <td align=3D"center"><hr color=3D"#99CC00" size=3D"1" width=3D"90%"=
    ></td>
         </tr>
    
    <!-- Bottom -->
         <tr>
           <td><table width=3D"100%" border=3D"0" align=3D"center" cellpadding=
    =3D"5" cellspacing=3D"0">
               <tr>
                 <td><font face=3D"Arial" size=3D"1">Please note every effort =
    has been made to ensure that the enclosed information is accurate; however,=
     in the event of an error, the winning numbers and prize amounts in the off=
    icial record of the Florida Lottery shall be controlling.<p>
    To unsubscribe from receiving Florida Lottery e-mail, please <a href=3D"htt=
    p://secondchance.flalottery.com/secondchance/vip_login.do"> click here</a>,=
     log in to your account and update your e-mail preferences.=20
    </font></td>
               </tr>
           </table></td>
         </tr>
         <tr>
           <td><a href=3D"http://www.flalottery.com"><img border=3D"0" src=3D"=
    http://www.flalottery.com/exptkt/footer.gif" width=3D"500" height=3D"40" al=
    t=3D"www.flalottery.com"></a></td>
         </tr>
       </table></td>
     </tr>
    </table>
    </body>
    </html>
    Last edited by macleodjb; 01-20-2012 at 03:10 AM. Reason: added stuff

  • #2
    Supreme Overlord Spookster
    Join Date
    May 2002
    Location
    Marion, IA USA
    Posts
    6,280
    Thanks
    4
    Thanked 83 Times in 82 Posts
    As long as they are consistent in how they write this out, you can pick up on the patterns using regular expressions.

    Patterns:
    Gives you the date
    <!-- draw_date --> followed by a bunch of junk and then <strong> date </strong>

    Gives you the game names
    <!-- Play4 midday -->
    <!-- Cash3 midday-->

    Gives you results
    Midday:</font> precedes the results for each game type and each has unique pattern of results
    = 2 - 0 - 5 - 6
    6= - 2 - 0
    not really sure what that means.

    Link below will get you started on how to parse it with regular expressions. It also demonstrates use of DOM but I don't think you are going to be able to use that here. This HTML is pretty horrible.
    http://www.codingforums.com/showthread.php?t=244867
    Spookster
    CodingForums Supreme Overlord
    All Hail Spookster
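
The stray `=` characters at the line ends and the `=3D` sequences in the sample are most likely quoted-printable transfer encoding, where `=3D` stands for a literal `=` and a trailing `=` marks a soft line break. The email headers are not shown in the thread, so that is an assumption. Under that assumption, decoding the body first removes most of the "junk" before any pattern matching happens. A minimal sketch, with `$rawBody` as a hypothetical variable holding a fragment of the sample:

```php
<?php
// Fragment of the sample email, still encoded (hypothetical variable name).
// The "=" before the newline is a soft line break; "=3D" is an encoded "=".
$rawBody = "<font color=3D\"#666666\">Midday:</font> 6=\n - 2 - 0<br>";

// quoted_printable_decode() is a PHP built-in: it drops soft line breaks
// and turns =XX escapes back into the bytes they represent.
$decoded = quoted_printable_decode($rawBody);

echo $decoded, "\n";
```

After decoding, the result `6 - 2 - 0` sits on one line, so later patterns do not have to account for the line wrapping.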

  • #3
    Master Coder
    Join Date
    Jun 2003
    Location
    Cottage Grove, Minnesota
    Posts
    9,538
    Thanks
    8
    Thanked 1,093 Times in 1,084 Posts
    If they offer results by email, do they also offer RSS feeds with the latest results?

    I would use their RSS feed instead of the email. You could have a PHP script
    automatically grab the RSS (XML file), save data in a database, and also SMS message
    your phone, or send a nice, clean email to you with any statistical data you wish.
    You would be creating the email you get using your PHP script.

    Find out if they offer an RSS feed, or an API (that would also work well).

    EDIT:
    I just found their feed here:
    http://www.flalottery.com/video/en/theWinningNumber.xml

    That can easily be accessed and parsed by a PHP script automatically, using a CRON job.
    You can then use the data however you want.


    .
    Last edited by mlseim; 01-20-2012 at 04:26 AM.
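
As a rough illustration of the feed approach, here is a sketch that parses an RSS-style document with SimpleXML. The structure of the real feed is not shown in the thread, so the element names and the sample XML below are assumptions made up for the example; a cron-driven script would fetch the feed URL instead of using an inline string.

```php
<?php
// Hypothetical feed content. The real feed's layout is not shown in the
// thread, so these element names are assumptions for illustration only.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <item>
      <title>Play 4 Midday</title>
      <description>2 - 0 - 5 - 6</description>
    </item>
    <item>
      <title>Cash 3 Midday</title>
      <description>6 - 2 - 0</description>
    </item>
  </channel>
</rss>
XML;

// Parse the XML string into a SimpleXMLElement tree.
$feed = simplexml_load_string($xml);

// Collect game name => result pairs, casting each node to a plain string.
$results = [];
foreach ($feed->channel->item as $item) {
    $results[(string) $item->title] = (string) $item->description;
}

print_r($results);
```

From here the `$results` array could be saved to a database or formatted into an email, as suggested above.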

  • #4
    Regular Coder
    I guess posting the Florida email content was a bad example. I need to get this working for states that do not offer the RSS option. I am using the RSS feed for Florida, but quite a few states don't offer it.

    As for parsing with regular expressions, I'm going to have to read up on them in detail because I've never really understood them well. For example, I can spot the patterns, but it's the junk in the middle that I don't need that concerns me. I'm not sure how to get rid of it. I want to feel confident that my script will pull out the data without leftover junk and without grabbing the wrong data.

    If you have any more suggestions please send them over. Thanks

  • #5
    Master Coder
    How about using a service like some of these?
    https://www.google.com/search?q=stat...es&btnG=Search

    I realize they might have subscription costs, but the data is all
    in one place, and easy to access. I believe the time and energy
    you save would be worth the cost.

    Parsing even 10 states with HTML parsing would be a nightmare, and if they
    changed anything on their webpage (like a new layout design), you'd be
    starting all over again.

    How about this thought ... maybe you can tell us which states DON'T
    offer the RSS feed results. It's possible that 40 states offer it, and 10 don't.
    That might make it easier to swallow.


    .
    Last edited by mlseim; 01-20-2012 at 02:30 PM.

  • #6
    Regular Coder
    Here's my first question about regular expressions. How can I retrieve the chunk of content between <!-- draw_date --> and the following block <!-- {whatever} -->?

    I tried this, but it returns an empty result.
    PHP Code:
    $date_pattern = "/<!-- draw_date -->(.*)<!--/";
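
One likely reason that pattern finds nothing: without the `s` modifier, `.` does not match newlines, and in the sample the two comments sit on different lines. The capture is also greedy, so on a single-line input it would run to the last `<!--` rather than the next one. A minimal sketch of both fixes, using a shortened and already decoded stand-in for the email (`$html` is a hypothetical variable):

```php
<?php
// Shortened, already-decoded stand-in for the email body (hypothetical).
$html = "<!-- draw_date -->\n"
      . "<tr><td><strong>Thursday, January 19,  2012 Draws</strong></td></tr>\n"
      . "<!-- Play4 midday -->\n"
      . "<tr><td>Midday: 2 - 0 - 5 - 6</td></tr>\n"
      . "<!-- Cash3 midday-->\n";

// Original attempt: "." stops at the newline, so nothing matches.
$found = preg_match('/<!-- draw_date -->(.*)<!--/', $html, $m);
var_dump($found);

// "s" lets "." match newlines; "?" makes the capture stop at the first "<!--".
preg_match('/<!-- draw_date -->(.*?)<!--/s', $html, $m);

// Drop the markup and surrounding whitespace to leave just the date text.
$date = trim(strip_tags($m[1]));
echo $date, "\n";
```

The captured group still contains table markup, which is why `strip_tags()` and `trim()` are applied before using the value.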

  • #7
    Supreme Overlord Spookster
    That's easy

    PHP Code:
    <?php
    $subject = 'fsdjf idsfi sidh dsfh <!-- draw_date --> sdofijos.dsflsad <!- asdfsd ->';
    // Lazily match everything between "<!--" and "-->"
    $pattern = '/<!--(.*?)-->/';
    echo htmlspecialchars($subject) . "<br>";
    preg_match_all($pattern, $subject, $matches);
    var_dump($matches);
    ?>
    produces:
    Code:
    fsdjf idsfi sidh dsfh <!-- draw_date --> sdofijos.dsflsad <!- asdfsd ->
    array
      0 => 
        array
          0 => string '<!-- draw_date -->' (length=18)
      1 => 
        array
          0 => string ' draw_date ' (length=11)
    Last edited by Spookster; 01-20-2012 at 07:38 PM.

  • #8
    Regular Coder
    I'm not sure I follow that, but it doesn't look like what I'm after. I wanted to get the content between those two tags: <!-- tag --> content <!-- tag -->

    so in your example it would return the following.
    Code:
    sdofijos.dsflsad

  • #9
    Regular Coder
    I just tried to use this as my regular expression to get between those points.

    Code:
    "/->[A-Za-z0-9-^_]+<!-/"
    no luck
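
A probable cause for that attempt failing: the character class `[A-Za-z0-9-^_]` contains no space and no period, so it cannot span ` sdofijos.dsflsad `. A sketch that captures lazily instead, run against the test string from the earlier post (note that its second marker is written `<!-` with a single dash):

```php
<?php
// Test string from the earlier post; its second marker is "<!-" (one dash).
$subject = 'fsdjf idsfi sidh dsfh <!-- draw_date --> sdofijos.dsflsad <!- asdfsd ->';

// The class has no space or "." in it, so this attempt cannot match.
$classMatched = preg_match('/->[A-Za-z0-9-^_]+<!-/', $subject);

// Capture anything (lazily) between the end of one comment and the start
// of the next; \s* on both sides keeps surrounding whitespace out of the group.
preg_match('/-->\s*(.*?)\s*<!-/s', $subject, $m);
$between = $m[1];

echo $between, "\n";
```

Listing allowed characters tends to be brittle for HTML; matching "anything up to the next marker" avoids having to predict what appears in between.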

  • #10
    Supreme Overlord Spookster
    Quote Originally Posted by macleodjb View Post
    I'm not sure I follow that, but it doesn't look like what I'm after. I wanted to get the content between those two tags: <!-- tag --> content <!-- tag -->

    so in your example it would return the following.
    Code:
    sdofijos.dsflsad
    No it doesn't. Did you look at the post? Did you even try it? I posted the code and the output it produces. Show me how it doesn't work.

  • #11
    Regular Coder
    Here is the output from your post above.

    PHP Code:
    Array
    (
        [0] => Array
            (
                [0] =>
                [1] =>
                [2] =>
                [3] =>
                [4] =>
            )

        [1] => Array
            (
                [0] =>  Top
                [1] =>  draw_date
                [2] =>  Play4 midday
                [3] =>  Cash3 midday
                [4] =>  Bottom
            )

    )


    This returns what is between the start and end of each tag. What I am looking to do is return what is between the draw_date tag and the Play4 midday tag. That will make it easier to pull out the date as well as the results.

    For example:
    PHP Code:
    <!-- First Tag --> i.e. (<!-- draw_date -->)

    <b>Here is the content I want to return</b>

    <!-- Second Tag --> i.e. (<!-- Play4 midday -->)

  • #12
    Supreme Overlord Spookster
    What I posted will search the string and return all matches it finds between those tags, which is what you asked for:

    "How can I retrieve the chunk of content between <!-- draw_date --> and the following block <!-- {whatever} -->?"

  • #13
    Regular Coder
    Quote Originally Posted by macleodjb View Post
    Here's my first question about regular expressions. How can I retrieve the chunk of content between <!-- draw_date --> and the following block <!-- {whatever} -->?

    I tried this, but it returns an empty result.
    PHP Code:
    $date_pattern = "/<!-- draw_date -->(.*)<!--/";

    No, what I was looking for was the content in between. See the above. The first tag is <!-- draw_date --> and the following block <!-- {whatever} --> in this example would be <!-- Play4 midday -->.

    In my original post I was attempting to use the full first tag and the opening delimiter of the following tag, with anything (.*) in between. My lack of knowledge of regular expressions, I guess, made this hard to understand. Then in my next attempt I tried using your example with just the end delimiter and the start delimiter: "-->(.*)<!--"

    Sorry for the confusion.
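
To tie the two readings of the question together, one option is to capture each comment label together with everything that follows it up to the next comment. The sketch below does that on a shortened, already decoded stand-in for the email; `$html` and `$sections` are hypothetical names, and the approach only holds as long as the sender keeps these comment markers in place.

```php
<?php
// Shortened stand-in for the decoded email body (hypothetical).
$html = <<<HTML
<!-- Top -->
<body>
<!-- draw_date -->
<strong>Thursday, January 19,  2012 Draws</strong>
<!-- Play4 midday -->
<strong><font>Midday:</font>  2 - 0 - 5 - 6<br></strong>
<!-- Cash3 midday-->
<strong><font>Midday:</font> 6 - 2 - 0<br></strong>
<!-- Bottom -->
<p>footer</p>
HTML;

// Group 1: the comment label. Group 2: everything up to the next comment
// (or the end of the string), found with a lookahead so it is not consumed.
preg_match_all('/<!--\s*(.*?)\s*-->(.*?)(?=<!--|\z)/s', $html, $matches, PREG_SET_ORDER);

$sections = [];
foreach ($matches as $match) {
    // Remove markup, then collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($match[2])));
    $sections[$match[1]] = $text;
}

// Pull the individual numbers out of one results section.
preg_match_all('/\d+/', $sections['Play4 midday'], $digits);

print_r($sections);
print_r($digits[0]);
```

Keying the sections by comment label means the game name, the date, and the results each come from a known slot, which makes it easier to notice when an email no longer has the expected layout.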

