  1. #1
    Senior Coder xelawho's Avatar
    Join Date
    Nov 2010
    Posts
    2,989
    Thanks
    56
    Thanked 557 Times in 554 Posts

    another regex question

    I am slowly getting my head around regex, but really it is mostly a mystery to me.

    Here's the thing: I have a string (although I have no idea how that string will look). All I know is that the string will contain a word (I don't know what that word is either). I don't know if the string will be a paragraph, a sentence or a sentence fragment (the sentence may be cut off, either at the start or the end).

    But I need to get as much of the sentence containing the word as possible, without getting too much.

    So I figure that these are the "rules":

    - Start capturing from the closest word before the variable word that starts with a capital (uppercase) letter.
    - If there is no word that starts with a capital before the variable word, start capturing from the start of the string.
    - Equally, if the part of the string after the variable word contains a full stop/period, finish capturing at the full stop.
    - If not, capture until the end of the string.

    I know it's not perfect logic, but it doesn't have to be - all I want to do is to be able to show the word in some sort of context, like Word does when you do spellcheck.
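For reference, the four rules above can be sketched without one big regex. This is only a rough illustration: `extractContext` is a made-up name, the search is case-sensitive, and it only treats a full stop as the end of a sentence, exactly as the rules say. Any capitalised word (a name, or "I") counts as a sentence start, which is the known imperfection in the logic.

```javascript
// Hypothetical helper (not from any library): applies the four rules literally.
function extractContext(text, word) {
  const at = text.indexOf(word);
  if (at === -1) return null; // word not present

  // Rules 1 and 2: start at the last capitalised word before the target,
  // or at the start of the string if there is none.
  let start = 0;
  const capital = /\b[A-Z]/g;
  let m;
  while ((m = capital.exec(text)) !== null && m.index < at) {
    start = m.index;
  }

  // Rules 3 and 4: stop at the first full stop after the target,
  // or at the end of the string if there is none.
  const stop = text.indexOf(".", at + word.length);
  const end = stop === -1 ? text.length : stop + 1;

  return text.slice(start, end);
}

console.log(extractContext("The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy"));
```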

    Any suggestions?
    Last edited by xelawho; 01-31-2013 at 05:50 PM. Reason: clarifying

  • #2
    Senior Coder
    Join Date
    Apr 2011
    Location
    London, England
    Posts
    2,120
    Thanks
    15
    Thanked 354 Times in 353 Posts
    Something like this:

    Code:
    (?:^|\.)\s?([^.]*wibble[^.]*)(?:$|\.)
    You can test it here.

    But I haven't tried to match a capital letter..
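A quick way to try the pattern above in JavaScript is shown below. The sentence is in capture group 1, so the full stops that delimit it are left out. `firstSentenceWith` is just an illustrative wrapper name, and the sample strings are made up.

```javascript
// The pattern from the post, unchanged.
const sentenceWithWibble = /(?:^|\.)\s?([^.]*wibble[^.]*)(?:$|\.)/;

// Hypothetical wrapper: returns capture group 1, or null when nothing matches.
function firstSentenceWith(text) {
  const m = sentenceWithWibble.exec(text);
  return m ? m[1] : null;
}

console.log(firstSentenceWith("Foo bar. A wibble here. Baz."));
// Only "." counts as a boundary, so "?" and "!" are read straight through:
console.log(firstSentenceWith("Really? A wibble here."));
```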
    "I'm here to save your life. But if I'm going to do that, I'll need total uninanonynymity." Me Myself & Irene.
    Validate your HTML and CSS

  • #3
    Senior Coder
    Join Date
    Apr 2011
    Location
    London, England
    Posts
    2,120
    Thanks
    15
    Thanked 354 Times in 353 Posts
    This version

    Code:
    (?:^|\.|\;)\s?([A-Z][^.]*wibble[^.]*)(?:$|\.)
    looks for either a full stop or a semicolon, and requires the sentence to start with a capital letter.
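A small check of the capital-letter version, using made-up sample strings. The second call shows the trade-off: a fragment that is cut off at the start has no capital letter to anchor on, so nothing matches at all. `capitalisedSentenceWith` is only an illustrative name.

```javascript
// The pattern from the post, unchanged.
const capitalised = /(?:^|\.|\;)\s?([A-Z][^.]*wibble[^.]*)(?:$|\.)/;

// Hypothetical wrapper: returns capture group 1, or null when nothing matches.
function capitalisedSentenceWith(text) {
  const m = capitalised.exec(text);
  return m ? m[1] : null;
}

console.log(capitalisedSentenceWith("foo bar. A wibble here. Baz."));
console.log(capitalisedSentenceWith("was wibble here. Next one"));
```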

  • #4
    Supreme Master coder! Philip M's Avatar
    Join Date
    Jun 2002
    Location
    London, England
    Posts
    18,243
    Thanks
    203
    Thanked 2,555 Times in 2,533 Posts
    Here's my suggestion:-

    Code:
    <html>
    <head>
    </head>
    <body>
    
    Enter word to find <input type = "text" id = "theword" onblur = "findit()">
    
    <script type = "text/javascript">
    
    var text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ipsum leo, scelerisque at dapibus ac, consectetur vel ipsum. Morbi et metus ut diam molestie ullamcorper. Suspendisse rutrum semper semper. Donec volutpat neque in lorem tempus scelerisque. Curabitur dignissim rhoncus quam ac suscipit. Donec viverra quam lobortis neque porta a sagittis urna tristique. Suspendisse nec lacus nisi. Pellentesque fermentum massa sit amet magna hendrerit vestibulum. Sed elit libero, scelerisque eu eleifend ut, interdum gravida nunc. Etiam ut nisi sapien, et tempus sem. Nam vel mi est. Mauris congue felis ut ante bibendum vehicula. Nullam nec sapien arcu, eget cursus lorem. Donec blandit, dolor tristique ornare dictum, arcu sapien vulputate dolor, et placerat risus odio ut magna. Ut magna mauris, pellentesque at ultricies vitae, fermentum vitae dolor."
    
    //var ts = text.split(/\.|;/);   // split at period or semi-colon
    var ts = text.split(".");  // split at period only
    
    function findit() {
        var intext = false;
        for (var i = 0; i < ts.length; i++) {
            var found = false;
            var tofind = document.getElementById("theword").value;
            var regexp = new RegExp(tofind, 'gi');  // setting regex case insensitive and global
            if (regexp.test(ts[i])) {
                found = true;
                intext = true;
            }
            if (found) { alert("The word " + tofind + " was found in the sentence:- " + "\n" + ts[i]); }
        }
        if (!intext) { alert("The word " + tofind + " was not found."); }
    }
    
    </script>
    
    </body>
    </html>
    Christians only have one spouse. This is called monotony.
    - Pupil's answer to Catholic Elementary School test.
    Last edited by Philip M; 01-31-2013 at 08:21 PM.

    All the code given in this post has been tested and is intended to address the question asked.
    Unless stated otherwise it is not just a demonstration.
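One thing to watch when a typed-in word is passed straight to `new RegExp`: characters such as `+`, `(` or `.` are treated as regex syntax. A possible workaround, sketched here, is to escape the input first. `escapeRegExp` and `findSentences` are illustrative names, not part of the code above, and the sample text is made up.

```javascript
// Escape every character that has a special meaning in a regex.
function escapeRegExp(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of the split-and-test approach that returns the
// matching sentences instead of alerting them.
function findSentences(text, word) {
  // No "g" flag, so test() keeps no state between calls.
  const re = new RegExp(escapeRegExp(word), "i");
  return text
    .split(".")
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return re.test(s); });
}

console.log(findSentences("We know 1+1 is two. Also 11 is eleven.", "1+1"));
```

Without the escaping step, the search term `1+1` means "one or more 1s followed by a 1", so it would also match the `11` in the second sentence.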

  • #5
    Senior Coder xelawho's Avatar
    Join Date
    Nov 2010
    Posts
    2,989
    Thanks
    56
    Thanked 557 Times in 554 Posts
    thanks Andrew - the first one was very close. I changed it to
    Code:
    (?:|^)?[\w]([^.]*wibble[^.]*)($:|\.|\?|\!|$)
    to start the capture at the beginning of the sentence or the beginning of the string, instead of the end of the previous one, and to end on a full stop, exclamation mark, question mark or just the end of the string.

    seems right to me. Thank you both for your suggestions.

  • #6
    Senior Coder xelawho's Avatar
    Join Date
    Nov 2010
    Posts
    2,989
    Thanks
    56
    Thanked 557 Times in 554 Posts
    no, wait - that doesn't work. it ends if the sentence ends with a full stop, but keeps going if it is a ! or ?
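That behaviour is what you would expect from `[^.]*`: it excludes only the full stop, so it happily runs through `!` and `?` until it reaches one. A possible fix, sketched below, is to exclude all three terminators in the negated class. `sentenceAround` is an illustrative name, and the sketch assumes `word` is plain letters with no regex metacharacters.

```javascript
// Hypothetical helper: one regex, with ., ! and ? all treated as sentence ends.
function sentenceAround(text, word) {
  const re = new RegExp(
    "(?:^|[.!?])\\s*" +                // start of string, or end of previous sentence
    "([^.!?]*" + word + "[^.!?]*" +    // the sentence itself, containing the word
    "(?:[.!?]|$))"                     // its terminator, or the end of the string
  );
  const m = re.exec(text);
  return m ? m[1] : null;
}

console.log(sentenceAround("Wow! I am happy? Yes.", "happy"));
```

The terminator sits inside the capture group here, so the returned sentence keeps its closing punctuation.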

  • #7
    Supreme Master coder! Philip M's Avatar
    Join Date
    Jun 2002
    Location
    London, England
    Posts
    18,243
    Thanks
    203
    Thanked 2,555 Times in 2,533 Posts
    Use mine!

    Code:
    var ts = text.split(/\.|;|\?|!/);   // split at period or semi-colon or ? or !
    Does your regex allow you to find a variable word? Or a phrase? Not just wibble!
    Last edited by Philip M; 01-31-2013 at 09:07 PM.


  • #8
    Senior Coder xelawho's Avatar
    Join Date
    Nov 2010
    Posts
    2,989
    Thanks
    56
    Thanked 557 Times in 554 Posts
    Here's the thing: Let's say the string is this:
    "The dog jumped over the moon. He was happy to see me. I left in a hurry"

    and the word is "happy"

    in that case, all I want is
    "He was happy to see me."

    If it's
    "was happy to see me. I left in a hurry"

    all I want is
    "was happy to see me."

    If it's
    "The dog jumped over the moon. He was happy to see"

    all I want is:
    "He was happy to see"

    splitting it on the punctuation is probably the safest way, but then I have to loop through the array to find out which split is the one that I want. Which is why regex seems to be the answer...
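For what it's worth, the loop over the array does not have to be written by hand. A minimal sketch, assuming an environment with `Array.prototype.find` and `String.prototype.includes` (both newer than this thread): `sentenceContaining` is an illustrative name. It uses `match` rather than `split` so that each piece keeps its closing punctuation.

```javascript
// Hypothetical helper: cut the text into sentence-like chunks, then pick
// the first chunk that contains the word.
function sentenceContaining(text, word) {
  // Each chunk is a run of non-terminators plus an optional terminator.
  const chunks = text.match(/[^.!?]+[.!?]?/g) || [];
  const hit = chunks.find(function (chunk) { return chunk.includes(word); });
  return hit === undefined ? null : hit.trim();
}

console.log(sentenceContaining("The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy"));
console.log(sentenceContaining("was happy to see me. I left in a hurry", "happy"));
console.log(sentenceContaining("The dog jumped over the moon. He was happy to see", "happy"));
```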

  • #9
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,118
    Thanks
    80
    Thanked 4,555 Times in 4,519 Posts
    And what about
    "aardvarks whistle. happy dogs bark"
    ???

    What do you want to get out of that?

    Logically, it would be "happy dogs bark", as the period before "happy" belongs in another sentence. But it's your call.
    An optimist sees the glass as half full.
    A pessimist sees the glass as half empty.
    A realist drinks it no matter how much there is.

  • #10
    Senior Coder xelawho's Avatar
    Join Date
    Nov 2010
    Posts
    2,989
    Thanks
    56
    Thanked 557 Times in 554 Posts
    in that case I would want happy dogs bark

    but sentences will always begin with a capital, and end with . or ! or ?

    the problem is that the string that contains the word may not be a complete sentence.

  • #11
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,118
    Thanks
    80
    Thanked 4,555 Times in 4,519 Posts
    Here's my answer.

    I'll let you figure out if you can combine the 4 regexps into one.

    Note that I stop on the first match, because some text patterns will match more than one of the regexps, but the regexps are purposely ordered by most desirable match.

    The hack to get rid of a leading period is just that: a hack. But it works.

    Code:
    <script type="text/javascript">
    function findSentenceByWord( text, word )
    {
        var re1 = new RegExp( "[A-Z\\.][^A-Z\\.]+?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
        var re2 = new RegExp( "^[\\s\\S]*?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
        var re3 = new RegExp( "[A-Z\\.][^A-Z\\.]+?" + word + "[\\s\\S]*$", "" );
        var re4 = new RegExp( "^[\\s\\S]*?" + word + "[\\s\\S]*$", "" );
        var res = [ re1, re2, re3, re4 ];
        for ( var r = 0; r < res.length; ++r )
        {
            var re = res[r];
            if ( re.test( text ) )
            {
                document.write("Match on regexp " + (r+1) + "<br/>");
                var m = text.match(re)[0];
                if ( m.charAt(0) == "." ) { m = m.substring(1); }
                document.write( m + "<br/>");
                return;
            }
        }
    }
    
    function demo( text, word )
    {
        document.write( "<hr/>Testing <i><b>" + text + "</b></i> for word " + word + "<br/>" );
        findSentenceByWord( text, word );
    }    
    
    demo( "The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy" );
    demo( "was happy to see me. I left in a hurry", "happy" );
    demo( "The dog jumped over the moon. He was happy to see", "happy" );
    demo( "aardvarks whistle. happy dogs bark", "happy" );
    demo( "happy happy happy! and even more happy?", "happy" );
    demo( "all the happy dogs", "happy" );</script>
    I dump out which regexp matched so that you can see that indeed all 4 are needed, depending on the input.

  • #12
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,118
    Thanks
    80
    Thanked 4,555 Times in 4,519 Posts
    Quote Originally Posted by xelawho View Post
    in that case I would want happy dogs bark

    but sentences will always begin with a capital, and end with . or ! or ?
    If that is true, why did you include this example:
    If it's
    "was happy to see me. I left in a hurry"
    "was happy to see me." does not start with a capital letter.

    My answer includes code to handle that case. It could be less code if you were *SURE* that a sentence always starts with a capital letter.

  • #13
    Senior Coder
    Join Date
    Apr 2011
    Location
    London, England
    Posts
    2,120
    Thanks
    15
    Thanked 354 Times in 353 Posts
    This revision
    Code:
    (?:|^)?[\w]([^.]*wibble[^.]*)($:|\.|\?|\!|$)
    is incorrect. Should be
    Code:
    (?:^|\.|\?\!)?[\w]([^.]*wibble[^.]*)(?:\.|\?|\!|$)
    (?: denotes a non-capturing group, and the | at the beginning was incorrect. The revised opening group also allows for the previous sentence ending with a ? or !

  • #14
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,118
    Thanks
    80
    Thanked 4,555 Times in 4,519 Posts
    Here's a slightly better version. Handles the sentence *before* "happy" ending with ? or ! (not just period).

    Has the interesting effect of changing *which* "happy" is found in demo #5. If you really wanted the first one found, I could fix it to do that. But I'm assuming that's a case you aren't too worried about.
    Code:
    <script>
    function findSentenceByWord( text, word )
    {
        var re1 = new RegExp( "[A-Z\\.\\?\\!][^A-Z\\.\\?\\!]+?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
        var re2 = new RegExp( "^[\\s\\S]*?" + word + "[^\\.\\?\\!]*[\\.\\?\\!]", "" );
        var re3 = new RegExp( "[A-Z\\.\\?\\!][^A-Z\\.\\?\\!]+?" + word + "[\\s\\S]*$", "" );
        var re4 = new RegExp( "^[\\s\\S]*?" + word + "[\\s\\S]*$", "" );
        var res = [ re1, re2, re3, re4 ];
        for ( var r = 0; r < res.length; ++r )
        {
            var re = res[r];
            if ( re.test( text ) )
            {
                document.write("Match on regexp " + (r+1) + "<br/>");
                var m = text.match(re)[0];
                m = m.replace( /^[\.\?\!]?\s*/, "" );
                document.write( m + "<br/>");
                return;
            }
        }
    }
    
    function demo( text, word )
    {
        document.write( "<hr/>Testing <i><b>" + text + "</b></i> for word " + word + "<br/>" );
        findSentenceByWord( text, word );
    }    
    
    demo( "The dog jumped over the moon. He was happy to see me. I left in a hurry", "happy" );
    demo( "was happy to see me. I left in a hurry", "happy" );
    demo( "The dog jumped over the moon. He was happy to see", "happy" );
    demo( "aardvarks whistle. happy dogs bark", "happy" );
    demo( "aardvarks whistle dixie! happy dogs bark", "happy" );
    demo( "happy happy happy! and even more happy?", "happy" );
    demo( "all the happy dogs", "happy" );
    </script>
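The same four-pattern fallback can also be written so that it returns its result instead of calling `document.write`, which makes it easier to check outside a browser. This is only a sketch of the revised version above: `findSentence` and the `{ rule, sentence }` return shape are made up for illustration, and `word` is assumed to contain no regex metacharacters.

```javascript
// Hypothetical pure-function variant of the ordered-regexp approach.
function findSentence(text, word) {
  const sentenceStart = "[A-Z.?!][^A-Z.?!]+?"; // a capital or the previous terminator
  const anyPrefix = "^[\\s\\S]*?";             // fall back to the start of the string
  const toTerminator = "[^.?!]*[.?!]";         // run up to the next . ? or !
  const toEnd = "[\\s\\S]*$";                  // or up to the end of the string

  // Ordered from most to least desirable match, as in the post.
  const patterns = [
    sentenceStart + word + toTerminator,
    anyPrefix + word + toTerminator,
    sentenceStart + word + toEnd,
    anyPrefix + word + toEnd
  ];

  for (let i = 0; i < patterns.length; i++) {
    const m = text.match(new RegExp(patterns[i]));
    if (m) {
      // Drop a leading terminator and whitespace left over from the previous sentence.
      return { rule: i + 1, sentence: m[0].replace(/^[.?!]?\s*/, "") };
    }
  }
  return null;
}

console.log(findSentence("happy happy happy! and even more happy?", "happy"));
```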

  • #15
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,118
    Thanks
    80
    Thanked 4,555 Times in 4,519 Posts
    Andrew: I'm pretty sure this is wrong:
    (?:^|\.|\?\!)

    The ^ character only means negation when used inside of [ ].

    In any case, you forgot the | between \? and \! if you were looking for "or" conditions. And also, in any case, you are missing parens.

    But I'm pretty sure that should be
    (?:[^\.\?\!])
    But I think that
    (?!(\.|\?|\!))
    would also work. ?! is a *negative* lookahead: it consumes nothing and only asserts that what follows does not match. The ! is what makes it negative, not the ^

    Did you test it? Against many samples, as I did?

    *********

    EDIT: I did test it.

    I tested both your version:
    /(?:^|\.|\?|\!)?[\w]([^.]*happy[^.]*)(?:\.|\?|\!|$)/
    (I added the missing | before the first \!)

    And my modification:
    /(?:[^\.\?\!])?[\w]([^.]*happy[^.]*)(?:([\.|\?|\!]|$))/;

    Neither passed all tests.
    Neither could find "happy" in aardvarks whistle. happy dogs bark

    Neither isolated the sentence in either
    aardvarks whistle dixie! happy dogs bark
    or
    happy happy happy! and even more happy?
    (that is, in both cases they returned the entire test string)

    I will say that your (?:^|\.|\?|\!) seemed to have mostly worked. Surprised me.
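For anyone following along, the different meanings of `^`, `|` and `?!` in this exchange can be checked directly. The snippet below is just a few standalone checks with made-up variable names. It may also explain why the `(?:^|...)` version mostly worked: outside square brackets, `^` is a start-of-string anchor rather than a negation.

```javascript
// Outside [...], ^ anchors the match to the start of the input.
const startsWithHappy = /^happy/;
console.log(startsWithHappy.test("happy dogs"));   // true
console.log(startsWithHappy.test("very happy"));   // false

// As the first character inside [...], ^ negates the class.
const notATerminator = /[^.?!]/;
console.log(notATerminator.test("?!."));           // false: every character is excluded

// (?!...) is a negative lookahead: it checks what follows but consumes nothing.
const aNotBeforeTerminator = /a(?![.?!])/;
console.log(aNotBeforeTerminator.exec("a."));      // null
console.log(aNotBeforeTerminator.exec("ab")[0]);   // "a"

// Inside [...], | is a literal pipe character, not "or".
const pipeInsideClass = /[.|?|!]/;
console.log(pipeInsideClass.test("|"));            // true
```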
    Last edited by Old Pedant; 01-31-2013 at 10:56 PM.

