  1. #1
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts

    PHP DOM function to parse HTML data source as if CSV.

    I am trying to understand a PHP DOM function I found for parsing an HTML data source. It is close to what I need, but I need to understand it before I can adjust it.

    I have the following as a data source format I can't change.
    Code:
    <html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>
    Awful, I know. Yes, it is given to me as a single line. It is basically an HTML table wrapped in html and body tags.

    I need a two-dimensional array, as if it had been read from a CSV file. So I need this:
    Code:
    Array
    (
        [0] => Array
            (
    	    [0] => header1
    	    [1] => header2
    	    [2] => header3
    	    [3] => header4
    	    [4] => header5
    	    [5] => header6
    	    [6] => header7
    	    [7] => header8
    	    [8] => header9
            )
    
        [1] => Array
            (
    	    [0] => value1
    	    [1] => value2
    	    [2] => value3
    	    [3] => value4
    	    [4] => value5
    	    [5] => value6
    	    [6] => value7
    	    [7] => value8
    	    [8] => value9
            )
    
        [2] => Array
            (
    	    [0] => value10
    	    [1] => value11
    	    [2] => value12
    	    [3] => value13
    	    [4] => value14
    	    [5] => value15
    	    [6] => value16
    	    [7] => value17
    	    [8] => value18
            )
    
        [3] => Array
            (
    	    [0] => value19
    	    [1] => value20
    	    [2] => value21
    	    [3] => value22
    	    [4] => value23
    	    [5] => value24
    	    [6] => value25
    	    [7] => value26
    	    [8] => value27
            )
    )
    Now, I hail from the PHP4 days, so I thought regex might be a way forward. A quick bounce around Google and I had the message DON'T USE REGEX TO PARSE HTML thrown at me more than a few times.

    In the course of looking I found this function on:
    http://www.phpro.org/examples/Get-Te...ween-Tags.html

    PHP Code:
    /*
     * @get text between tags
     * @param string $tag The tag name
     * @param string $html The XML or XHTML string
     * @param int $strict Whether to use strict mode
     * @return array
     */
    function getTextBetweenTags($tag, $html, $strict = 0)
    {
        /*** a new dom object ***/
        $dom = new domDocument;

        /*** load the html into the object ***/
        if ($strict == 1)
        {
            $dom->loadXML($html);
        }
        else
        {
            $dom->loadHTML($html);
        }

        /*** discard white space ***/
        $dom->preserveWhiteSpace = false;

        /*** the tag by its tag name ***/
        $content = $dom->getElementsByTagname($tag);

        /*** the array to return ***/
        $out = array();
        foreach ($content as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }

        /*** return the results ***/
        return $out;
    }

    Now I understand some of it, but I am not familiar with the PHP DOM. Yes, I've read his tutorial and much of the manual. I haven't yet reached that magic function or worked example that makes it all fall into place in my head.

    So using this to call the function above:
    PHP Code:
    $sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';
    print '<pre>'; print_r(getTextBetweenTags("td", $sHtml, "0")); print '</pre>';
    and I get:
    Code:
    Array
    (
        [0] => header1
        [1] => header2
        [2] => header3
        [3] => header4
        [4] => header5
        [5] => header6
        [6] => header7
        [7] => header8
        [8] => header9
        [9] => value1
        [10] => value2
        [11] => value3
        [12] => value4
        [13] => value5
        [14] => value7
        [15] => value8
        [16] => value9
        [17] => value10
        [18] => value11
        [19] => value12
        [20] => value13
        [21] => value14
        [22] => value15
        [23] => value16
        [24] => value17
        [25] => value18
        [26] => value19
        [27] => value20
        [28] => value21
        [29] => value22
        [30] => value23
        [31] => value24
        [32] => value25
        [33] => value26
        [34] => value27
    )
    I could chunk the array down into what I need (only discovered that yesterday, thank you forum), but I suspect it will be much better to alter the function above to give the result I need.
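
For reference, the chunking route mentioned above might look roughly like this. It is only a sketch: the flat list and the column count are made-up stand-ins for the real output of getTextBetweenTags(), and it assumes every row has the same number of cells.

```php
<?php
// Hypothetical flat list, standing in for what getTextBetweenTags() returns.
$flat = array('h1', 'h2', 'h3', 'a1', 'a2', 'a3', 'b1', 'b2', 'b3');

// Assumption: every row has the same number of cells (3 here, 9 in the real feed).
$columns = 3;

// array_chunk() splits the list into consecutive groups of $columns items.
$rows = array_chunk($flat, $columns);

print_r($rows);
```

The weak point is the fixed column count: if the source ever gains or loses a column, the rows silently shift, which is one reason to group by the actual tr elements instead.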

    If well explained I also think it'd be a good introduction and practical example of the PHP DOM scripting that is new to me.

    Would anyone mind showing me how to adjust the function above to be what I need and perhaps explain it a little as they go?

    Thanks

    Matt

  • #2
    Senior Coder ahallicks's Avatar
    Join Date
    May 2006
    Location
    Lancaster, UK
    Posts
    1,134
    Thanks
    1
    Thanked 57 Times in 55 Posts
    You could change $out[] = $item->nodeValue; to $out[$tag] = $item->nodeValue;

    Which would basically give you a multi-dimensional array like:

    Code:
    Array
    (
        [th] => Array
            (
    	    [0] => header1
    	    [1] => header2
    	    [2] => header3
    	    [3] => header4
            )
        [td] => Array
            (
    	    [0] => value1
    	    [1] => value2
    	    [2] => value3
    	    [3] => value4
            )
    )
    etc

    PHP DOM, in my humble opinion, is brilliant! But it does take a little getting used to. The easiest way to start learning is to get your head around the actual DOM, that is, the Document Object Model. Think of it as a tree of nodes: some have children, some have parents and some have siblings. Using PHP DOM you can reference various elements by using these 'ancestor' references.

    The basic steps are: you create a DOMDocument with $dom = new DOMDocument();
    Then you load a string (XML/HTML/etc.) or an XML/HTML file into this DOMDocument and use the DOM functions (such as getElementsByTagName) to fetch items from it.

    You can then use these elements as you see fit. Look into SimpleXML too, as that is also a powerful parser for PHP.
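
The basic steps described above can be sketched as follows. The markup is an assumed, shortened sample rather than the real data source, and the variable names are only illustrative.

```php
<?php
// Assumed sample markup, much shorter than the real data source.
$html = '<html><body><table><tr><th>name</th></tr><tr><td>Matt</td></tr></table></body></html>';

// 1. Create the document object.
$dom = new DOMDocument();

// 2. Load the markup; loadHTML() tolerates sloppy HTML, loadXML() expects well-formed XML.
$dom->loadHTML($html);

// 3. Fetch nodes by tag name; the result is a DOMNodeList you can foreach over.
$cells = $dom->getElementsByTagName('td');

foreach ($cells as $cell) {
    // parentNode walks one step up the tree, from the td to its tr.
    echo $cell->parentNode->nodeName, ': ', $cell->nodeValue, "\n";
}
```

Each item in the list is a node with parentNode, childNodes, firstChild and nextSibling properties, which is what makes it possible to move around the tree from any starting point.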
    "write it for FireFox then hack it for IE."
    Quote Originally Posted by Mhtml View Post
    Domains are like women - all the good ones are taken unless you want one from some foreign country.
    Reputation is your friend

    Development & SEO Tools

  • Users who have thanked ahallicks for this post:

    MattyUK (02-11-2010)

  • #3
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts

    Hi ahallicks

    Firstly, thank you for your reply. It's good to hear the DOM is worth learning. I've dabbled in the JavaScript DOM before but never in PHP until now.

    Wouldn't your adjustment give a single-dimension associative array?

    The $tag variable would be the "td" supplied as the function parameter, so this creates a single-dimension associative array whose one element is overwritten on every pass of the loop and ends up containing only the last value.

    PHP Code:
    /*
     * @get text between tags
     * @param string $tag The tag name
     * @param string $html The XML or XHTML string
     * @param int $strict Whether to use strict mode
     * @return array
     */
    function getTextBetweenTags($tag, $html, $strict = 0)
    {
        /*** a new dom object ***/
        $dom = new domDocument;

        /*** load the html into the object ***/
        if ($strict == 1)
        {
            $dom->loadXML($html);
        }
        else
        {
            $dom->loadHTML($html);
        }

        /*** discard white space ***/
        $dom->preserveWhiteSpace = false;

        /*** the tag by its tag name ***/
        $content = $dom->getElementsByTagname($tag);

        /*** the array to return ***/
        $out = array();
        foreach ($content as $item)
        {
            /*** add node value to the out array ***/
            $out[$tag] = $item->nodeValue;
        }

        /*** return the results ***/
        return $out;
    }

    $sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

    print '<pre>'; print_r(getTextBetweenTags("td", $sHtml, "0")); print '</pre>';
    Gives output:
    Code:
    Array
    (
        [td] => value27
    )

    Thank you for the idea however. I guess one approach might be to get the code to step through any child tags. Then we could pass it the tr tag knowing it'll get the values from all td tags it contained. I'm not sure of the array insert code that would be needed. Simply haven't thought about it yet. Might try a few mock ups now.

    Thanks.

  • #4
    Senior Coder Dormilich's Avatar
    Join Date
    Jan 2010
    Location
    Behind the Wall
    Posts
    3,335
    Thanks
    13
    Thanked 348 Times in 344 Posts
    Quote Originally Posted by MattyUK View Post
    Firstly, thank you for your reply. It's good to hear the DOM is worth learning. I've dabbled in the JavaScript DOM before but never in PHP until now.
    PHP DOM is just as easy as JavaScript DOM, because the DOM is a language-independent API. You only have to fit it into the actual language's syntax (and there could hardly be a greater difference).
    The computer is always right. The computer is always right. The computer is always right. Take it from someone who has programmed for over ten years: not once has the computational mechanism of the machine malfunctioned.
    André Behrens, NY Times Software Developer

  • #5
    Senior Coder ahallicks's Avatar
    Join Date
    May 2006
    Location
    Lancaster, UK
    Posts
    1,134
    Thanks
    1
    Thanked 57 Times in 55 Posts
    You could change $out[] = $item->nodeValue; to $out[$tag][] = $item->nodeValue;

  • #6
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts

    Thank you both.

    ahallicks, that modification would just make an associative array again. Two dimensions, yes, but not grouped by row.

    The problem, I think, is the $tag variable, since it is simply the string passed in to tell the function which tag in the DOM to target. Because each of the values returned comes from the same tag type, it will overwrite the previous one or, in the second case, be lumped together under the one associative array key it created.

    Check out what I mean below:
    PHP Code:
    /*
     * @get text between tags
     * @param string $tag The tag name
     * @param string $html The XML or XHTML string
     * @param int $strict Whether to use strict mode
     * @return array
     */
    function getTextBetweenTags($tag, $html, $strict = 0)
    {
        /*** a new dom object ***/
        $dom = new domDocument;

        /*** load the html into the object ***/
        if ($strict == 1)
        {
            $dom->loadXML($html);
        }
        else
        {
            $dom->loadHTML($html);
        }

        /*** discard white space ***/
        $dom->preserveWhiteSpace = false;

        /*** the tag by its tag name ***/
        $content = $dom->getElementsByTagname($tag);

        /*** the array to return ***/
        $out = array();
        foreach ($content as $item)
        {
            /*** add node value to the out array ***/
            $out[$tag][] = $item->nodeValue;
        }

        /*** return the results ***/
        return $out;
    }

    $sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

    print '<pre>'; print_r(getTextBetweenTags("td", $sHtml, "0")); print '</pre>';
    Gives:
    Code:
    Array
    (
        [td] => Array
            (
                [0] => value1
                [1] => value2
                [2] => value3
                [3] => value4
                [4] => value5
                [5] => value6
                [6] => value7
                [7] => value8
                [8] => value9
                [9] => value10
                [10] => value11
                [11] => value12
                [12] => value13
                [13] => value14
                [14] => value15
                [15] => value16
                [16] => value17
                [17] => value18
                [18] => value19
                [19] => value20
                [20] => value21
                [21] => value22
                [22] => value23
                [23] => value24
                [24] => value25
                [25] => value26
                [26] => value27
            )
    
    )
    I guess you may have meant the actual tag name being interrogated at that moment. In which case you may have meant:
    Code:
    $out[][$item->nodeName] = $item->nodeValue;
    But that produces:
    Code:
    Array
    (
        [0] => Array
            (
                [td] => value1
            )
    
        [1] => Array
            (
                [td] => value2
            )
    
        [2] => Array
            (
                [td] => value3
            )
    
        [3] => Array
            (
                [td] => value4
            )
    ...
    and so on
    I think this is because the call to getElementsByTagname($tag); essentially says get all the TDs, irrespective of their relationship as children of any specific TR tag.

    Even if we somehow changed it to reference the parent node name perhaps...
    Code:
    $out[][$item->parentNode->nodeName] = $item->nodeValue;
    The next 'row' would still overwrite the previous one's values, since they are using the same associative array key.

    I don't know the DOM commands/functions to start at a TR, collect all child TD elements, then move on to the next TR.

    Dormilich, then I may not have been using the DOM after all. I just recall accessing elements under the old document.all structure and changing attributes, reading values etc.

    I can picture the dom structure but don't know how to move around it well enough in PHP. Hence asking here. How would we foreach each TD child element of every TR element?

    Thank you again for the replies.

  • #7
    Senior Coder Dormilich's Avatar
    Join Date
    Jan 2010
    Location
    Behind the Wall
    Posts
    3,335
    Thanks
    13
    Thanked 348 Times in 344 Posts
    Quote Originally Posted by MattyUK View Post
    I don't know the DOM commands/functions to start at a TR, collect all child TD elements, then move on to the next TR.
    1. get all TRs
    2. loop over them
    3. in the loop, get all child TDs
    4. loop

    that is, a nested loop.

    ex. (simplified)
    PHP Code:
    $tr = $dom->getElementsByTagName("tr");
    $l = $tr->length;
    for ($i = 0; $i < $l; $i++)
    {
        // item($i) returns the i-th node of the DOMNodeList
        $td = $tr->item($i)->getElementsByTagName("td");
        $m = $td->length;
        for ($j = 0; $j < $m; $j++)
        {
            // further code
        }
    }


  • Users who have thanked Dormilich for this post:

    MattyUK (02-12-2010)

  • #8
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts
    Thank you. I saw the DOM properties firstChild and nextSibling but couldn't quite see how to look at only siblings of a particular type. I guess I was over-thinking it. This approach should work. Thank you. I'll rework the code based on a nested loop approach.

    Thanks again Dormilich. I guess a night's sleep helped me see your point more clearly too.

  • #9
    Senior Coder Dormilich's Avatar
    Join Date
    Jan 2010
    Location
    Behind the Wall
    Posts
    3,335
    Thanks
    13
    Thanked 348 Times in 344 Posts
    Quote Originally Posted by MattyUK View Post
    Thank you. I saw the DOM properties firstChild and nextSibling but couldn't quite see how to look at only siblings of a particular type.
    Node->nodeType, Node->nodeName, Node->localName or Element->tagName would work too, but you need more check statements.
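
A rough sketch of what those extra checks could look like, walking each row's direct children and keeping only th and td elements. The sample markup and variable names are assumptions for illustration, not the real feed.

```php
<?php
// Assumed sample markup with one header row and one data row.
$html = '<table><tr><th>h1</th><th>h2</th></tr><tr><td>a</td><td>b</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$rows = array();
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = array();
    // childNodes holds every direct child, including text and comment nodes.
    foreach ($tr->childNodes as $child) {
        // Skip anything that is not an element (whitespace text, comments).
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        // Accept header cells and data cells alike.
        $name = strtolower($child->nodeName);
        if ($name === 'th' || $name === 'td') {
            $cells[] = $child->nodeValue;
        }
    }
    $rows[] = $cells;
}

print_r($rows);
```

Because the type check happens per child, a row that mixes th and td cells, or contains stray comments, is still handled in a single loop.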
    Last edited by Dormilich; 02-12-2010 at 12:24 PM.

  • #10
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts
    Hi Dormilich

    Thanks for your help. I slapped myself upside the head and looked at it again from a nested viewpoint. I have this so far.

    PHP Code:
    function getHtmlTableText($html){
        //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html

        $dom = new domDocument;
        //How should I sanitize the $html input? Do I need to since using dom?
        //Load html into dom object
        $dom->loadHTML($html);
        //discard white space
        $dom->preserveWhiteSpace = false;
        //get the rows
        $rows = $dom->getElementsByTagname('tr');
        //initialize the output array
        $rArr = array();
        //row count int var
        $rCount = 0;
        //loop the rows
        foreach ($rows as $row)
        {
            //How to cleanly accommodate header cells? Don't want to replicate the entire loop

            //get the cells in the row
            $cells = $row->getElementsByTagname('td');
            //try OR
            //$cells = $row->getElementsByTagname('th'||'td');//Bad.
            //Concat?
            //$cells = $row->getElementsByTagname('th');
            //$cells .= $row->getElementsByTagname('td');//Bad
            //Addition?
            //$cells = $cells + $row->getElementsByTagname('td');//Bad
            //How do you join two domNode objects together? or use getElementsByTagname with OR multiple tags
            //loop the cells
            foreach ($cells as $cell)
            {
                //add to output array
                $rArr[$rCount][] = $cell->nodeValue;
            }//from: foreach ($cells as $cell){
            //increment row count
            $rCount++;
        }//from: foreach ($rows as $row){

        //Return output
        return $rArr;

    }//from: function getHtmlTableText($tag,$html,$strict=0){


    $sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

    print '<pre>'; print_r(getHtmlTableText($sHtml, "0")); print '</pre>';
    I put the remaining questions inside the code, but this is working so far. Thanks.

    I set aside stepping through the nodes by index and went with a foreach after getting all nodes of a particular type. My thinking was that they may change the format down the line and introduce comment tags or malformed markup in the source.

    Does that make sense or am I off on a wrong track again?

    Node->nodeType, Node->nodeName, Node->localName or Element->tagName would work too, but you need more check statements.
    I'll look those up. Thank you.

    My goal now is to answer the questions in the code above. Particularly how to handle the header row as cleanly as possible.


    Thanks for everyone's help so far.
    Last edited by MattyUK; 02-13-2010 at 04:08 PM. Reason: typos. code header question attempts added

  • #11
    Senior Coder Dormilich's Avatar
    Join Date
    Jan 2010
    Location
    Behind the Wall
    Posts
    3,335
    Thanks
    13
    Thanked 348 Times in 344 Posts
    How should I sanitize the $html input? Do I need to since using dom?
    something like validating against a HTML DTD?

    Any way of saying tag A or B (th or td)? Or is there a way of appending two domNode objects together, e.g. $headers.$cells?
    you can of course merge arrays (thanks to PHP not sticking to the DOM output data types (that wouldn’t work in JavaScript)). you could also say: get all TDs and if there are none get all THs.

    $headers.$cells wouldn’t work anyway (string concatenation on arrays!)
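
One way to ask for "th or td" in a single call is an XPath query instead of getElementsByTagName(). The following is only a sketch under assumptions: the sample markup is shortened, and the XPath route is an alternative technique, not something the function from phpro.org uses.

```php
<?php
// Assumed sample markup with one header row and one data row.
$html = '<table><tr><th>h1</th><th>h2</th></tr><tr><td>a</td><td>b</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = array();
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = array();
    // "th|td" is an XPath union: child th elements or child td elements,
    // evaluated relative to the current row passed as the context node.
    foreach ($xpath->query('th|td', $tr) as $cell) {
        $cells[] = $cell->nodeValue;
    }
    $rows[] = $cells;
}

print_r($rows);
```

This keeps a single inner loop for both header and data rows, so nothing has to be merged or concatenated afterwards.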

  • #12
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts
    Thanks.

    something like validating against a HTML DTD?
    Well, the $html is essentially user input. I know I need to check it for what is expected, but I'm not sure of the best approach in this case. Validate against a DTD! Can you give me a pointer to get started on that approach, if there is no better way of thwarting malicious code in the source?

    $headers.$cells wouldn’t work anyway (string concatenation on arrays!)
    I thought objects were returned, not arrays. I haven't a single clue how to concat objects, so I gave everything I could think of a go. I'm not even sure how to examine an object fully to discover more about it. var_dump isn't helping all that much. I was just eager not to introduce more loops if possible.

    I tested they were objects with this code.
    PHP Code:
    ...
            //get the cells in the row
            $cells = $row->getElementsByTagname('td');
            if (is_object($cells)) { return '$cells is an object'; }
            if (is_array($cells)) { return '$cells is an array'; }
    ...
    Anyway thanks to your help I now have the following code. I'd appreciate your feedback or improvements on it:

    PHP Code:
    function getHtmlTableText($html){
        //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
        //Thanks to Dormilich for help.
        $dom = new domDocument;

        //How should I sanitize the $html input? Do I need too? Won't the dom parsing simply fail if it is badly formatted/encoded.

        //Load html into dom object
        $dom->loadHTML($html);
        //discard white space
        $dom->preserveWhiteSpace = false;
        //get the rows
        $rows = $dom->getElementsByTagname('tr');
        //initialize the output array
        $rArr = array();
        //row count int var
        $rCount = 0;
        //loop the rows
        foreach ($rows as $row)
        {
            //get the cells in the row if they are td or th
            if (strtolower($row->firstChild->nodeName) == 'th' || strtolower($row->firstChild->nodeName) == 'td')
            {
                $cells = $row->getElementsByTagname($row->firstChild->nodeName);
            }
            else
            {
                //If both td and th fail then what on earth are we reading??
                //Better run away.
                return false;
            }
            //$cells = $row->getElementsByTagname('td');
            //if(is_object($cells)){return '$cells is an object';}
            //if(is_array($cells)){return '$cells is an array';}

            //loop the cells
            foreach ($cells as $cell)
            {
                //add to output array
                $rArr[$rCount][] = $cell->nodeValue;
            }
            //increment row count
            $rCount++;
        }//from: foreach ($rows as $row){
        //Return output
        return $rArr;

    }//from: function getHtmlTableText($tag,$html){



    $sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

    print '<pre>'; print_r(getHtmlTableText($sHtml, "0")); print '</pre>';

  • #13
    Senior Coder Dormilich's Avatar
    Join Date
    Jan 2010
    Location
    Behind the Wall
    Posts
    3,335
    Thanks
    13
    Thanked 348 Times in 344 Posts
    Well, the $html is essentially user input. I know I need to check it for what is expected, but I'm not sure of the best approach in this case. Validate against a DTD! Can you give me a pointer to get started on that approach, if there is no better way of thwarting malicious code in the source?
    prepending a DTD can be made before loading into the DOMDocument.

    that’s a DTD => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

    validating is done through DOMDocument->validate(); although you should be prepared that most users probably don’t know that there is an HTML standard at all and therefore the validation fails.
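
As a lighter-weight alternative to DTD validation, the parser's own complaints can be collected and inspected. This is a sketch under assumptions: the malformed markup is invented for the example, and this is a different technique from validate(); it reports what libxml noticed while parsing, it does not check the document against a DTD and it says nothing about whether the content is malicious.

```php
<?php
// Assumed malformed input: a stray </b> and a td that is never closed.
$html = '<table><tr><td>a</b></td><td>b</tr></table>';

// Collect parser complaints instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with message, line and level properties.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// loadHTML() repairs what it can, so the cells are still readable.
$cells = array();
foreach ($dom->getElementsByTagName('td') as $td) {
    $cells[] = $td->nodeValue;
}
print_r($cells);
```

Since nodeValue returns plain text, running it through htmlspecialchars() before echoing it into a page is the usual guard against markup sneaking into the output.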

    I thought objects were returned, not arrays. I haven't a single clue how to concat objects, so I gave everything I could think of a go. I'm not even sure how to examine an object fully to discover more about it. var_dump isn't helping all that much. I was just eager not to introduce more loops if possible.
    hm, makes sense after all. well, currently I see no way to merge those 2 objects.

    the objects returned conform to the DOM, that is, every method or property is listed in the DOM (resp. in the PHP manual)


    Anyway thanks to your help I now have the following code. I'd appreciate your feedback or improvements on it:
    although it doesn’t seem to matter, always write the method and property names correctly cased (i.e. getElementsByTagName, not getElementsbyTagname), other languages (e.g. JavaScript) will throw an error there.

    (comments inside)
    PHP Code:
    function getHtmlTableText($html){
        //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
        //Thanks to Dormilich for help.
        $dom = new domDocument;

        //How should I sanitize the $html input? Do I need too? Won't the dom parsing simply fail if it is badly formatted/encoded.
        # there is W3C’s "Tidy" … I haven’t used it with PHP (actually I don’t use it, because I know what valid HTML looks like)

        //Load html into dom object
        $dom->loadHTML($html);
        //discard white space
        $dom->preserveWhiteSpace = false;
        //get the rows
        $rows = $dom->getElementsByTagname('tr');
        //initialize the output array
        $rArr = array();
        //row count int var
        $rCount = 0;
        //loop the rows
        foreach ($rows as $row)
        {
            //get the cells in the row if they are td or th
            # see below
            if (strtolower($row->firstChild->nodeName) == 'th' || strtolower($row->firstChild->nodeName) == 'td')
            # if you’re unlucky, the first child is neither TD or TH, but the second one is
            {
                $cells = $row->getElementsByTagname($row->firstChild->nodeName);
            }
            else
            {
                //If both td and th fail then what on earth are we reading??
                # invalid code ;)
                //Better run away.
                # not necessary, $cells will be empty and thus the loop not executed
                return false;
            }
            //$cells = $row->getElementsByTagname('td');

            //loop the cells
            foreach ($cells as $cell)
            {
                //add to output array
                $rArr[$rCount][] = $cell->nodeValue;
            }
            //increment row count
            $rCount++;
        }//from: foreach ($rows as $row){
        //Return output
        return $rArr;
    }


    PHP Code:
    $cells = $tr->getElementsByTagName("td");
    if (0 == $cells->length)
    {
        $cells = $tr->getElementsByTagName("th");
    }

    Note: text should be retrieved using the CharacterData->data or Text->wholeText properties (that one was added by PHP, I think).

    PS. just to have it mentioned, the sample HTML is not valid (see http://validator.w3.org)
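
    For rows that mix TH and TD (a row header followed by data cells, say), falling back from one tag name to the other returns only part of the row. A minimal sketch of one way around that, assuming only direct children of the row matter: walk the row's child nodes and keep anything named th or td. The helper name rowCellTexts and the sample markup are made up for illustration, and text is still read through nodeValue as in the code above.

    ```php
    <?php
    // Hypothetical helper (not from the thread): return the text of every
    // TH or TD that is a direct child of the row, in document order.
    function rowCellTexts(DOMElement $row): array
    {
        $texts = [];
        foreach ($row->childNodes as $child) {
            // Skip whitespace text nodes, comments and other non-elements.
            if ($child->nodeType !== XML_ELEMENT_NODE) {
                continue;
            }
            $name = strtolower($child->nodeName);
            if ($name === 'th' || $name === 'td') {
                $texts[] = $child->nodeValue;
            }
        }
        return $texts;
    }

    // Made-up sample row with one header cell and two data cells.
    $dom = new DOMDocument();
    $dom->loadHTML('<table><tr><th>name</th><td>alpha</td><td>beta</td></tr></table>');
    $row = $dom->getElementsByTagName('tr')->item(0);
    print_r(rowCellTexts($row));
    ```

    Because it never looks at firstChild, this also copes with a row whose first child is a whitespace text node.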
    Last edited by Dormilich; 02-13-2010 at 07:12 PM.
    The computer is always right. The computer is always right. The computer is always right. Take it from someone who has programmed for over ten years: not once has the computational mechanism of the machine malfunctioned.
    André Behrens, NY Times Software Developer

  • Users who have thanked Dormilich for this post:

    MattyUK (02-13-2010)

  • #14
    Regular Coder
    Join Date
    Jul 2002
    Posts
    301
    Thanks
    7
    Thanked 2 Times in 2 Posts
    Good call/catch, thank you again.

    This is the end result of the efforts:
    PHP Code:
    function getHtmlTableText($html){
        //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
        //Thanks to Dormilich

        $dom = new DOMDocument;

        /*Later on. Sanitize HTML. Dormilich: that’s a DTD => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
        validating is done through DOMDocument->validate();
        Source isn't valid html but is what is provided.
        Not sure if this will protect against malicious but valid code.
        Scripts etc so could use htmlspecialchars and or strip_tags.
        */

        //Load html into dom object
        $dom->loadHTML($html);

        //discard white space
        $dom->preserveWhiteSpace = false;

        //get the rows
        $rows = $dom->getElementsByTagName('tr');

        //initialize the output array
        $rArr = array();

        //row count int var
        $rCount = 0;

        //loop the rows
        foreach($rows as $row)
        {
            //get the cells in the row if they are th or td; this approach doesn't rely on the firstChild requirement
            $cells = $row->getElementsByTagName('th');
            if(0 == $cells->length)
            {
                $cells = $row->getElementsByTagName('td');
            } //from: if(0 == $cells->length)

            //loop the cells
            foreach ($cells as $cell)
            {
                //add to output array
                //Note: Look up CharacterData->data or Text->wholeText??? rather than nodeValue
                //$rArr[$rCount][] = htmlspecialchars($cell->nodeValue);
                $rArr[$rCount][] = strip_tags($cell->nodeValue);
            } //from: foreach ($cells as $cell)

            //increment row count
            $rCount++;
        } //from: foreach ($rows as $row)

        //Return output
        return $rArr;
    } //from: function getHtmlTableText($html){



    $sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

    print '<pre>';print_r(getHtmlTableText($sHtml));print '</pre>';
    I feel as if I learned a lot from your help. Thank you.
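
    On the sanitizing question left in the code comments: a minimal sketch, assuming the worry is markup ending up in the extracted text. nodeValue hands back decoded text, so entity-encoded markup such as &lt;b&gt; comes out as a literal <b>; escaping with htmlspecialchars() at output time takes care of that. libxml_use_internal_errors() keeps the parser from emitting warnings on invalid input. The helper name loadHtmlQuietly and the sample cell are made up for illustration.

    ```php
    <?php
    // Hypothetical helper (the name is illustrative, not from the thread):
    // parse HTML while collecting libxml's complaints instead of emitting warnings.
    function loadHtmlQuietly(string $html): DOMDocument
    {
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();                  // discard the collected messages
        libxml_use_internal_errors($previous);  // restore the caller's setting
        return $dom;
    }

    // Made-up sample: the cell text contains entity-encoded markup.
    $dom  = loadHtmlQuietly('<table><tr><td>a &lt;b&gt; c</td></tr></table>');
    $text = $dom->getElementsByTagName('td')->item(0)->nodeValue;

    echo $text, "\n";                   // decoded text, contains a literal <b>
    echo htmlspecialchars($text), "\n"; // escaped again before printing into HTML
    ```

    Whether to escape inside the parser function or only where the array is printed is a design choice; escaping at the point of output keeps the array usable for non-HTML targets such as a CSV file.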
    Last edited by MattyUK; 02-13-2010 at 07:56 PM.

  • #15
    Senior Coder Dormilich's Avatar
    Join Date
    Jan 2010
    Location
    Behind the Wall
    Posts
    3,335
    Thanks
    13
    Thanked 348 Times in 344 Posts
    PHP Code:
    $cell->nodeValue
    // should be
    $cell->firstChild->data
    // or
    $cell->firstChild->wholeText
    // because that makes sure you actually get text.

    // DOM-3 alternative: PHP's DOMNode does expose this property
    $cell->textContent
    Ah, and the text of an element does not contain tags (so strip_tags() is not required), because nested tags are child elements. On the other hand, you could loop through all child nodes and return the text data …
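
    A minimal sketch of that loop, assuming the goal is the concatenated text of a cell including text nested inside child elements. The helper name collectText and the sample cell are made up for illustration.

    ```php
    <?php
    // Hypothetical helper (not from the thread): concatenate the data of
    // every text node found below $node, descending into child elements.
    function collectText(DOMNode $node): string
    {
        $text = '';
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                $text .= $child->data;        // CharacterData->data
            } elseif ($child->nodeType === XML_ELEMENT_NODE) {
                $text .= collectText($child); // recurse into nested elements
            }
        }
        return $text;
    }

    // Made-up sample: a cell with plain text around a nested element.
    $dom = new DOMDocument();
    $dom->loadHTML('<table><tr><td>plain <b>bold</b> tail</td></tr></table>');
    $cell = $dom->getElementsByTagName('td')->item(0);
    echo collectText($cell), "\n"; // prints "plain bold tail"
    ```

    Reading only firstChild->data would stop at "plain " here, which is why the recursion matters once cells contain inline markup.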

    If I were mean, I’d explain how to do that with SimpleXML or XSLT deserialisation … but I don’t want to ruin the DOM learning experience.
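
    For the curious, a rough sketch of the SimpleXML route, assuming the source happens to be well-formed XML (the sample table in this thread is, but arbitrary HTML often is not, and simplexml_load_string() returns false in that case). The shortened sample string is made up for illustration.

    ```php
    <?php
    // Shortened, made-up version of the sample table from this thread.
    $html = '<html><body><table>'
          . '<tr><th>header1</th><th>header2</th></tr>'
          . '<tr><td>value1</td><td>value2</td></tr>'
          . '</table></body></html>';

    $xml  = simplexml_load_string($html);
    $rows = [];
    if ($xml !== false) {
        foreach ($xml->body->table->tr as $tr) {
            $row = [];
            // children() yields TH and TD alike, in document order.
            foreach ($tr->children() as $cell) {
                $row[] = (string) $cell;
            }
            $rows[] = $row;
        }
    }
    print_r($rows);
    ```

    The hard-coded body->table->tr path is the weak spot: it only fits a document shaped exactly like the sample.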

    Quote Originally Posted by MattyUK View Post
    I feel as if I learned a lot from your help. Thank you.
    I don’t mind getting a reputation for that. *gg*
    Last edited by Dormilich; 02-13-2010 at 08:11 PM.

  • Users who have thanked Dormilich for this post:

    MattyUK (02-13-2010)

