  1. #1
    New to the CF scene
    Join Date
    May 2009
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts

    RegExp assistance

    Hi

    I am writing a small JavaScript application which reads an RSS feed. The description tag of each item contains a CDATA section.

    I want to parse the CDATA section and extract bits of information for display on Google Maps.

    Below is an example of the contents of this CDATA section.

    Code:
    <div><b>Projekt:</b> Clausholmvej</div>
    <div><b>Længdegrad:</b> 55.642802</div>
    <div><b>Breddegrad:</b> 12.338333</div>
    <div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>
    <div>Kryds 1</div>
    <div> </div>
    <div><em>Mere kursiv tekst</em></div>
    <div> </div>
    <div><strong>Fed tekst</strong></div>
    <div> </div>
    <div><a href="http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364">Et link</a></div></div></div>
    I am able to extract "Projekt", "Længdegrad" and "Breddegrad" using a regular expression.

    I am, however, not able to extract "Indhold" (everything after the Indhold: label; it was marked in red in the original post). Indhold is an HTML snippet that is supposed to go into the Google Maps window, which pops up when a marker is clicked.

    Could you help me and suggest a regexp that extracts Indhold for me? Please note that the CDATA example is copied verbatim as it arrives from the RSS generator (SharePoint...). I have no control over the generated RSS. Also note that the ExternalClass... value is expected to vary.

    NB: I have tried putting the contents of the CDATA section into a jQuery object and processing it via jQuery. While this approach works in Google Chrome, it fails in Internet Explorer, which is a mandatory supported browser for this application. I am therefore forced to extract these data bits with a plain old regexp.

    Thank you so much for your help.

    Best regards
    Andreas

  • #2
    New Coder
    Join Date
    Dec 2008
    Posts
    58
    Thanks
    2
    Thanked 1 Time in 1 Post
    Can you post the RegExp you are using and also what the results are? The more code you provide, the easier it is to help.

  • #3
    New to the CF scene
    Join Date
    May 2009
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Below is the code I use to extract the three primitive data items.

    Code:
    var projectDetails = {};
    
    jQuery.each(["Projekt", "Længdegrad", "Breddegrad"], function(){
       var pattern = new RegExp("<div><b>" + this + ":<\/b>(.+)<\/div>");
       var match = pattern.exec(cdataText);
       projectDetails[this] = jQuery.trim(match[1]);
    });
    				
    var project = projectDetails["Projekt"];
    var lat = projectDetails["Længdegrad"];
    var lng = projectDetails["Breddegrad"];
    Here cdataText is the sample text I gave you in my first post.

    For that sample I would get:
    Code:
    project == "Clausholmvej"
    lat == "55.642802"
    lng == "12.338333"
    Now I just need a regular expression to extract the html snippet "Indhold".

    Best regards
    Andreas
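A minimal sketch of why the same pattern comes up empty for Indhold, assuming the feed text contains real line feeds as in the first post. The helper name `extractField` and the shortened `sample` string are hypothetical and only there for illustration. In JavaScript, `.` does not match a line feed, so `(.+)</div>` can only match within a single line, and the Indhold line has no closing `</div>` on it.

```javascript
// Hypothetical helper: same pattern as above, but returns null instead of
// throwing when the field is not found.
function extractField(text, name) {
    var pattern = new RegExp("<div><b>" + name + ":</b>(.+)</div>");
    var match = pattern.exec(text);
    return match ? match[1].replace(/^\s+|\s+$/g, "") : null;
}

// Shortened stand-in for the cdata text from the first post.
var sample = [
    "<div><b>Projekt:</b> Clausholmvej</div>",
    "<div><b>Længdegrad:</b> 55.642802</div>",
    "<div><b>Indhold:</b> <div class=ExternalClass5039>",
    "<div>Kryds 1</div></div></div>"
].join("\n");

var projekt = extractField(sample, "Projekt");   // "Clausholmvej"
var indhold = extractField(sample, "Indhold");   // null: no </div> on that line
```

With the original loop, the null result would surface as an error on `match[1]`, so a guard like the one in the helper may be worth keeping.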

  • #4
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    26,603
    Thanks
    80
    Thanked 4,500 Times in 4,464 Posts
    Not sure you can do this with a single regular expression, at all.

    Not even easy to do with multiple reg exps.

    I'd just opt to do it in ordinary string manipulation code.

    Possible implementation:
    Code:
    <script>
    var test = 
         "<div><b>Projekt:</b> Clausholmvej</div>"
       + "<div><b>Længdegrad:</b> 55.642802</div>"
       + "<div><b>Breddegrad:</b> 12.338333</div>"
       + "<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>"
       + "<div>Kryds 1</div>"
       + "<div> </div>"
       + "<div><em>Mere kursiv tekst</em></div>"
       + "<div> </div>"
       + "<div><strong>Fed tekst</strong></div>"
       + "<div> </div>"
       + "<div><a href=\"http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364\">Et link</a></div></div></div>"
       + "<div>MORE STUFF HERE</div>";
    
    function findIndHold( findIn )
    {
       var str = findIn.toLowerCase();
       // find start:
       var indholdAt = str.indexOf("indhold"); // we are past the <div>
       var extAt = str.indexOf("<div", indholdAt ); // the div with class=Extern...
       var startAt = str.indexOf("<div",extAt+4); // start of part we care about!
       // find matching end
       var cur = startAt + 4;
       var count = 2;
       while ( count > 0 )
       {
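           var find1 = str.indexOf("<div",cur);
           var find2 = str.indexOf("</div",cur);
           if ( find2 < 0 ) return null; // unbalanced markup: no closing tag left
           if ( find1 >= 0 && find1 < find2 ) // find1 is -1 when no opening div remains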
           {
               ++count;
               cur = find1 + 4;
           } else {
               --count;
               cur = find2 + 5;
           }
       }
       var endAt = cur - 5;
       return findIn.substring( startAt, endAt );
    }
    alert(test);
    alert( findIndHold(test) );
    </script>
    Seems to work. Test it with other examples.

    I did the toLowerCase() so it works if you might have <DIV> in place of <div> in some places. If you are sure you don't need that, you can omit it.
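The same depth-counting idea can also be sketched with a single tag-matching regexp instead of two indexOf calls. This is a variation, not the code above: the function name `findIndholdByDepth` is made up for illustration, and it returns everything inside the ExternalClass wrapper div (including any leading line feed) rather than starting at the first inner div.

```javascript
// Walk over every <div ...> and </div> tag after the Indhold label and
// track the nesting depth until the wrapper div is closed again.
function findIndholdByDepth(text) {
    var label = text.toLowerCase().indexOf("indhold:");
    if (label < 0) return null;
    var tag = /<(\/?)div\b[^>]*>/gi;
    tag.lastIndex = label;           // start scanning at the label
    var first = tag.exec(text);      // the <div class=ExternalClass...> wrapper
    if (!first || first[1] === "/") return null;
    var start = tag.lastIndex;       // content begins right after the wrapper tag
    var depth = 1;
    var m;
    while ((m = tag.exec(text)) !== null) {
        depth += (m[1] === "/") ? -1 : 1;
        if (depth === 0) {
            return text.substring(start, m.index);
        }
    }
    return null;                     // wrapper div was never closed
}

// Nested divs, mixed case, attributes, and nothing after the snippet.
var nested = "<div><b>Indhold:</b> <div class=ExternalClassX>"
    + "<div>a<div>b<div>c</div></div></div>"
    + "<DIV style=\"color: red;\">d</DIV></div></div>";
var result = findIndholdByDepth(nested);
```

Because the regexp carries the `i` flag, no lower-cased copy is needed for the tag search itself.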

  • Users who have thanked Old Pedant for this post:

    awarberg (05-20-2009)

  • #5
    Gütkodierer
    Join Date
    Apr 2009
    Posts
    2,127
    Thanks
    1
    Thanked 426 Times in 424 Posts
    It's actually quite similar to what you did before:

    Code:
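    // Note: the <r><![CDATA[ ... ]]></r> literal is E4X, which only Firefox supported;
    // it is used here only to build a multi-line test string.
    var cdataText=(<r><![CDATA[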
    	<div><b>Projekt:</b> Clausholmvej</div>
    	<div><b>Længdegrad:</b> 55.642802</div>
    	<div><b>Breddegrad:</b> 12.338333</div>
    	<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>
    	<div>Kryds 1</div>
    	<div> </div>
    	<div><em>Mere kursiv tekst</em></div>
    	<div> </div>
    	<div><strong>Fed tekst</strong></div>
    	<div> </div>
    	<div><a href="http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364">Et link</a></div></div></div>
    ]]></r>).toString();
    
    cdataText = cdataText.replace(/\n/g, "\\n");
    
    var pattern = /<div><b>Indhold:<\/b> <div class=ExternalClass.*?>(.*?<\/div>)<\/div><\/div>/;
    var match = pattern.exec(cdataText);
    
    var indhold = (match[1]).replace(/\\n/g, "\n");
    
    alert(indhold);
    The only strange thing I did there is replace the line feeds and put them back in afterwards, because "." in a JavaScript regexp does not match line feeds, so the pattern would otherwise stop at the end of the first line.
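An alternative to swapping the line feeds out and back in is to use `[\s\S]` in place of `.`, since that character class matches any character including line feeds. A minimal sketch, assuming the same fixed layout as the sample (no nested divs inside the content); the shortened `cdataText` below is a stand-in for the real feed text.

```javascript
// Sample text with real line feeds, shortened from the first post.
var cdataText = [
    "<div><b>Projekt:</b> Clausholmvej</div>",
    "<div><b>Indhold:</b> <div class=ExternalClass5039DAD8>",
    "<div>Kryds 1</div>",
    "<div> </div>",
    "<div><em>Mere kursiv tekst</em></div></div></div>"
].join("\n");

// [\s\S] matches any character including line feeds, so the text does not
// need to be rewritten before matching.
var pattern = /<div><b>Indhold:<\/b> <div class=ExternalClass[^>]*>([\s\S]*?<\/div>)<\/div><\/div>/;
var match = pattern.exec(cdataText);
var indhold = match ? match[1] : null;
```

This also avoids touching any literal backslash-n sequences that might already be present in the content.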

  • Users who have thanked venegal for this post:

    awarberg (05-20-2009)

  • #6
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    26,603
    Thanks
    80
    Thanked 4,500 Times in 4,464 Posts
    No, I don't think that works, Venegal.

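    Suppose that the input data looks like this, with my additions (the nested embed1/embed2 divs and the line break before the final closing tags) shown in red in the original post: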
    Code:
    	<div><b>Projekt:</b> Clausholmvej</div>
    	<div><b>Længdegrad:</b> 55.642802</div>
    	<div><b>Breddegrad:</b> 12.338333</div>
    	<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>
    	<div>Kryds 1</div>
    	<div> </div>
    	<div><em>Mere kursiv tekst</em><div>embed1<div>embed2</div></div></div>
    	<div> </div>
    	<div><strong>Fed tekst</strong></div>
    	<div> </div>
    	<div><a href="http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364">Et link</a>
            </div></div></div>
    Your regexp will terminate at the end of the </div> after my added text.

    You have assumed that his HTML will always have exactly that same format.

    If you happen to be right, then yes, your solution works. But if the Indhold content is indeed just arbitrary HTML coming from some database, then you can't possibly predict that a run of three </div>'s in a row will be the real end of the search string. And even as simple a thing as the HTML ending in
    Code:
    </div>        </div>        </div>
    (that is, with lots of spaces) would cause your regex to fail.
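Both failure cases can be reproduced in a few lines. This is only a sketch: the inputs are shortened, single-line stand-ins for the data above (so the line feed handling does not come into play), and the variable names are made up for illustration.

```javascript
// The pattern from the previous post, unchanged.
var pattern = /<div><b>Indhold:<\/b> <div class=ExternalClass.*?>(.*?<\/div>)<\/div><\/div>/;

// Nested divs: the first run of three </div> tags is inside the content.
var nestedInput = "<div><b>Indhold:</b> <div class=ExternalClassX>"
    + "<div>Kryds 1</div>"
    + "<div><em>tekst</em><div>embed1<div>embed2</div></div></div>"
    + "<div><strong>Fed tekst</strong></div></div></div>";
var nestedMatch = pattern.exec(nestedInput);
var truncated = nestedMatch[1];
// truncated stops right after "embed2</div>" and loses the "Fed tekst" div.

// Whitespace between the closing tags: no match at all.
var spacedInput = "<div><b>Indhold:</b> <div class=ExternalClassX>"
    + "<div>Kryds 1</div> </div> </div>";
var spacedMatch = pattern.exec(spacedInput);   // null
```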

  • #7
    Gütkodierer
    Join Date
    Apr 2009
    Posts
    2,127
    Thanks
    1
    Thanked 426 Times in 424 Posts
    You're right, that would fail indeed. Regex is not the right tool for parsing complex markup, and I didn't read your answer before posting myself, sorry.

    But when helping out with a regex problem I like to assume the simplest case until told otherwise, and then adapt, if necessary, in order to keep things from getting needlessly convoluted.

    Maybe awarberg knows that the html is always of the same form and there will never be nested divs?

    Or, if not, maybe the end of that snippet is always the end of the cdata?

    Or, if not, maybe after the "Indhold" part there will always be another part starting with "<div><b>Something:" like the ones before?

    In my opinion there are too many possible scenarios, in which a simple regexp will work perfectly, to just reject a regexp solution completely in favor of a more flexible and complex one.

  • #8
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    26,603
    Thanks
    80
    Thanked 4,500 Times in 4,464 Posts
    Yes, I admit to trying to always find the most general answers. That's why I even used toLowerCase( ) to make sure that looking for <div and </div would work. And that's why I didn't search for <div>, since I assumed it would be possible to encounter something like <div style="color: red;">

    But yeah, my solution is quite possibly overkill. Maybe the original poster will eventually reply and we'll find out.

  • #9
    New to the CF scene
    Join Date
    May 2009
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Hi guys

    Thank you very much for your help!

    To answer the questions:

    - After Indhold, there will not be another "<div><b>Something:" part.
    - The CDATA section contains only the HTML snippet, so the end of the snippet immediately precedes the end of the CDATA section.

    I have opted for the regexp solution by venegal since it is easier for me to understand. I can see I was somewhat close, but the newline part must have tripped me up.

    As I mentioned, the list comes from SharePoint and the HTML snippets are created using the SharePoint Content Editor. I have made some trial runs and it seems unlikely that HTML snippets containing nested divs will be generated. I think the regexp solution will be robust. But, obviously, I would like to have a generic regexp which doesn't make this assumption. Is it possible to use back references (e.g. \0) in the regexp to solve this?

    Anyway, the solution works now - in IE7 as well - so thank you very much for your help.

    Best regards
    Andreas
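On the back reference question: a back reference (\1 to \9; \0 is the NUL character in JavaScript) only repeats text that an earlier group already captured, so it cannot count opening and closing tags, and JavaScript regexps have no recursion to do that either. Given the two answers above, though, nesting can be sidestepped by matching greedily and anchoring at the end of the text. A minimal sketch under those assumptions (Indhold is the last field and the snippet ends the CDATA section); the function name `extractIndhold` is made up for illustration.

```javascript
function extractIndhold(cdataText) {
    // Greedy [\s\S]* runs to the end of the text, then backtracks just far
    // enough to leave the two closing </div> tags (wrapper and outer div).
    var pattern = /<div><b>Indhold:<\/b>\s*<div class=ExternalClass[^>]*>([\s\S]*)<\/div>\s*<\/div>\s*$/;
    var match = pattern.exec(cdataText);
    return match ? match[1] : null;
}

// Nested divs and whitespace between the closing tags.
var nested = "<div><b>Projekt:</b> Clausholmvej</div>\n"
    + "<div><b>Indhold:</b> <div class=ExternalClassX>\n"
    + "<div>Kryds 1<div>embed1<div>embed2</div></div></div>\n"
    + "<div>Fed tekst</div> </div>  </div>\n";
var indhold = extractIndhold(nested);
```

If anything ever follows the Indhold field, this pattern would capture too much, so it only holds as long as those two assumptions do.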

