  1. #1
    Regular Coder
    Join Date
    Apr 2006
    Location
    Northbrook, IL
    Posts
    394
    Thanks
    8
    Thanked 6 Times in 6 Posts

    Question regex for complex PHP string parsing - odd or even repetition?

    I'm trying to build a php parser for syntax highlighting using regex and i'm first parsing out complex stuff like strings and overloaded operators. I may need to do simple and complex strings in different regexes, instead of 2 passes, but if possible i'd like to do it in as few as possible. I need to get both delimiters as well as the contents.

    so far i have come up with the following:
    Code:
    (['"])(.*?)(?<=(\\{2})|(?<!\\{1}))(\1)
    this works pretty well but trips up on things like
    Code:
    '\\\''
    and incorrectly grabs the first trailing single quote as the second delimiter, ignoring that it's escaped. It sees that the quote has 2 backslashes before it and ends up matching it.

    i thought this would give me a bit more headroom:
    Code:
    (['"])(.*?)(?<=(\\{2}|\\{4})|(?<!\\{1}|\\{3}))(\1)
    but it doesn't work as expected.

    I seem to need something that would determine if a quote is preceded by an even or odd number of escaping backslashes, but i don't know if there is such a thing in regex.

    any advice? thanks,
    Leon
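
    One common way to get the odd/even behaviour asked about above, without counting backslashes in a lookbehind, is to consume every backslash together with the character it escapes, so a quote can only close the string when it is not part of such a pair. A minimal PHP sketch under that assumption (the variable names and the sample source line are invented for illustration, and this is not a pattern taken from the thread):

```php
<?php
// Sample PHP source text to scan; nowdoc keeps every backslash literal.
$source = <<<'SRC'
$a = '\\\''; $b = "x";
SRC;

// (['"])        opening delimiter
// (?:\\.|...)*  either a backslash plus the character it escapes,
//               or any character that is neither a backslash nor the delimiter
// (\1)          closing delimiter, same kind as the opening one
$pattern = '/([\'"])((?:\\\\.|(?!\1)[^\\\\])*)(\1)/s';

preg_match($pattern, $source, $m);

echo $m[0], "\n"; // whole literal, delimiters included
echo $m[2], "\n"; // contents only
```

    Because backslashes are always eaten in pairs with whatever follows them, an even run leaves the quote free to close the string and an odd run swallows it, with no need to enumerate {2}, {4} and so on.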

  • #2
    Regular Coder
    Join Date
    Apr 2006
    Location
    Northbrook, IL
    Posts
    394
    Thanks
    8
    Thanked 6 Times in 6 Posts
    ha! i figured out a better way, and this works well:
    Code:
    (['"])((?:\\|.|)*?)(?<!\\)(\1)
    ...now to add heredoc to the mix, then do a second pass with a simple regex for each match based on initial delimiters to parse out special stuff like complex variables and escapes.
    Last edited by Leeoniya; 01-18-2008 at 02:53 PM.
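
    A quick check of the pattern above suggests one remaining edge case: the (?<!\\) lookbehind rejects any closing quote that directly follows a backslash, so a literal that ends in an escaped backslash, such as 'a\\', is not closed where PHP would close it. A small sketch (the sample line is invented for illustration):

```php
<?php
// Pattern from the post above, written as a PHP single-quoted string.
$pattern = '/([\'"])((?:\\\\|.|)*?)(?<!\\\\)(\1)/';

// Two separate literals; the first one ends in an escaped backslash.
$source = <<<'SRC'
'a\\' . 'b'
SRC;

preg_match($pattern, $source, $m);

// The first literal is really 'a\\' (5 characters), but the match runs on
// until it reaches a quote that is not preceded by a backslash.
echo $m[0], "\n";
echo strlen($m[0]), "\n";
```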

  • #3
    Senior Coder
    Join Date
    Mar 2003
    Location
    Atlanta
    Posts
    1,037
    Thanks
    14
    Thanked 30 Times in 28 Posts
    You couldn't just use highlight_string()? Or are you just interested in building a parser?
    Most of my questions/posts are fairly straightforward and simple. I post long verbose messages in an attempt to be thorough.

  • #4
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    Quote Originally Posted by Leeoniya
    ha! i figured out a better way, and this works well:
    Code:
    (['"])((?:\\|.|)*?)(?<!\\)(\1)
    ...now to add heredoc to the mix, then do a second pass with a simple regex for each match based on initial delimiters to parse out special stuff like complex variables and escapes.
    Maybe this is not the answer you expect, but it's a very bad idea to build a parser this way.
    I'm a big fan of regex, and I'm convinced that you can build a parser this way after a lot of hard work, but as soon as you have to make a minor change to the source, everything will blow up. The second problem with this approach is that regexes are not efficient once they become too long and complicated.


    best regards

  • #5
    Regular Coder
    Join Date
    Apr 2006
    Location
    Northbrook, IL
    Posts
    394
    Thanks
    8
    Thanked 6 Times in 6 Posts
    that's too bad, because i just came up with a good heredoc matcher:
    Code:
    (<<<[ \t]*?)([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)$(.*?)^(\2){1};?$
    hehe
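
    The heredoc pattern above relies on ^ and $ matching at line boundaries and on . crossing newlines, which in PCRE requires the m and s modifiers. The post does not show which modifiers were used, so they are an assumption here. A minimal sketch with an invented sample:

```php
<?php
// Heredoc pattern from the post; the "ms" modifiers are an assumption:
//   m  lets ^ and $ match at the start/end of each line
//   s  lets . match newlines, so (.*?) can span the heredoc body
$pattern = '/(<<<[ \t]*?)([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)$(.*?)^(\2){1};?$/ms';

// Invented sample source, built line by line so nothing is interpolated.
$source = implode("\n", [
    '$x = <<<EOT',
    'hello $name',
    'EOT;',
    'echo 1;',
]);

preg_match($pattern, $source, $m);

echo $m[2], "\n";       // heredoc identifier
echo trim($m[3]), "\n"; // heredoc body without the surrounding newlines
```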

    also, highlight_string() is not CSS based, so styles are not easily modified, if at all.

    the final purpose of this is to actually build a text editor using php and something like php-gtk or wxphp when it comes out. the parser will be extended with all the regexes kept in a wordfile, similar to ultraedit, but with standard regex syntax rather than custom.

    this comes from evaluating like 20 different text editors and finding that none of them have all the features i would like to see. most deficiencies for me are in "live" stuff: dynamic brace matching, dynamic html tag matching, proper hinting for html/css attributes, file browse dialogs when typing url() in css or src= in html, live color pickers when typing background: in css or color: in html, and the ability to have custom keyword files and style them differently than, say, reserved words or keywords1. things like indent guides or code folding aren't that necessary, but hidden character display should be customizable in color and style, etc.

    the best editors i have found for this that come close are Gridinsoft Notepad, phpDesigner pro 2008, nusphere phped and some scintilla based stuff.

    few of them can be extended to account for poor parsing of one or another language, and if they can, it doesn't always have the flexibility i would like.
    Last edited by Leeoniya; 01-18-2008 at 03:50 PM.

  • #6
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    You looked in the wrong place; take a look at emacs, vi, scite or quanta.

    anyway, if this is what you want:

    Code:
    (<<<\s*?)([\w_\x7f-\xff][\w\d_\x7f-\xff]*)$(.*?)^(\2){1};?$
    - replace [ \t]* with \s*, it's the same thing
    - what is *? for? * means 0 or more and ? means 0 or one; you must pick one, because it's redundant even if it's not wrong
    - a-zA-Z is \w
    - 0-9 is \d
    - $ means end of the line; escape one of them if you want to match the $ in front of a php variable
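
    One thing worth checking in the rewritten pattern: in PCRE, \w covers letters, digits and the underscore, not just a-zA-Z, so using it for the first character also accepts identifiers that start with a digit. A small sketch comparing the two identifier patterns on their own (the variable names and test identifiers are illustrative):

```php
<?php
// Identifier part of the original heredoc pattern (letter or _ first).
$strict = '/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/';
// Identifier part after the suggested \w / \d substitution.
$loose  = '/^[\w_\x7f-\xff][\w\d_\x7f-\xff]*$/';

foreach (['EOT', '_end', '9lives'] as $name) {
    printf(
        "%-7s strict=%d loose=%d\n",
        $name,
        preg_match($strict, $name),
        preg_match($loose, $name)
    );
}
```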

    look at the syntax and modifier sections at this link:

    http://www.php.net/manual/en/ref.pcre.php

    best regards

  • #7
    Regular Coder
    Join Date
    May 2006
    Location
    Wales
    Posts
    820
    Thanks
    1
    Thanked 82 Times in 79 Posts
    what is *?, * means 0 or more, ? means 0 or one, you must decide, because it's redundant even is not wrong
    The ? makes + and * non-greedy, e.g.:

    PHP Code:
    //Without the ? this matches 'aaarqsarq'
    preg_match('#[a-z]+rq#', 'aaarqsarqaa');

    //With the ? this matches 'aaarq'
    preg_match('#[a-z]+?rq#', 'aaarqsarqaa');

  • #8
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    this depends on how PCRE_UNGREEDY is set,

    I agree with the argument, but not with that way of writing a regex.

    best regards
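
    For reference, PCRE_UNGREEDY corresponds to the U pattern modifier in PHP, which inverts the default: plain quantifiers become lazy and a trailing ? makes them greedy again. A short sketch reusing the example string from post #7 (the result variable names are illustrative):

```php
<?php
$subject = 'aaarqsarqaa';

preg_match('#[a-z]+rq#',   $subject, $greedy);   // default: greedy
preg_match('#[a-z]+?rq#',  $subject, $lazy);     // ? makes it lazy
preg_match('#[a-z]+rq#U',  $subject, $ungreedy); // U: lazy by default
preg_match('#[a-z]+?rq#U', $subject, $regreedy); // U plus ?: greedy again

echo $greedy[0], "\n";
echo $lazy[0], "\n";
echo $ungreedy[0], "\n";
echo $regreedy[0], "\n";
```

    Since U is off unless a pattern asks for it, the *? in the heredoc pattern behaves as a lazy quantifier by default.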

  • #9
    Regular Coder
    Join Date
    Apr 2006
    Location
    Northbrook, IL
    Posts
    394
    Thanks
    8
    Thanked 6 Times in 6 Posts
    - a-zA-Z is \w
    - 0-9 is \d
    i just copied and pasted from php manual for variable definition...yes i agree with refining it.
    - $ means end of the line, escape one of them if you want to match the $ front of a php variable
    end of line assertion is intentional and is there for proper heredoc spec: http://us.php.net/types.string
    - replace [ \t]* with \s*, is the same thing
    it is not the same thing.
    http://www.oreilly.com/catalog/regex...ter/part1B.pdf
    \s will match a space, a tab or a line break. i don't want to match line breaks there because that invalidates the heredoc - http://us.php.net/types.string
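
    The difference is easy to confirm: in PCRE, \s matches line breaks (\n and \r) as well as spaces and tabs, whereas [ \t] does not. A minimal sketch with an invented input in which the identifier wrongly sits on the line after <<<:

```php
<?php
// "<<<" followed by a line break, then the identifier: not a valid heredoc start.
$broken = "<<<\nEOT";

$withClass = preg_match('/<<<[ \t]*EOT/', $broken); // spaces and tabs only
$withS     = preg_match('/<<<\s*EOT/', $broken);    // any whitespace, newlines too

printf("[ \\t]* matched: %d\n", $withClass);
printf("\\s* matched:    %d\n", $withS);
```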

    i'm not just slapping these together randomly because i dont know what i'm doing, most of the function you see is there for a reason, otherwise i would have asked for help.

    did you look in the wrong place, take a look to emacs, vi, scite or quanta.
    if you did not notice, all the software i had mentioned is Windows only, but you are suggesting that the RIGHT place to look was at Linux based editors like Quanta and Emacs and VI?
    or should I install a linux distro just so i can edit my scripts?

    i have tried scite/scintilla based editors, and couldn't find a way to make them dynamically match and style xhtml tags except possibly through macros like in Ultraedit. i also found a php syntax highlighting bug in the first php file that i opened and was not about to go fixing it. i really like their tab and hidden char display, but dislike that you can't control the color of the hidden chars.

    I tried vi/cream but it was too foreign to learn and customize quickly...i might as well abandon my task at hand for learning VI, and i definitely want to in the future, but now is not the time. google recently turned up an emacs style editor called epsilon which i would love to try sometime, but the trial is not available currently.

    E-control editor was also very close to what i needed..i think it uses synedit. and jEdit was also very close. Programmers notepad 2 is scintilla based, so it suffers from the same issues, but i liked it a lot otherwise. Notepad2 and Notepad++ both use scintilla, while PSpad uses synedit i think, and it comes close as well.

    anyhow...
    Last edited by Leeoniya; 01-18-2008 at 10:44 PM.

  • #10
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    that is quick info, but not accurate: there are many variants of regex. perl is based on the POSIX standard, and php uses two variants. PCRE is perl's, but modified. Use the php docs instead for what you need.

    \s will match a space, a tab or a line break, i dont want to match linebreaks there because that invalidates the heredoc - http://us.php.net/types.string
    I agree with the heredoc argument about newlines, but I'm not sure that \s includes \r and \n. I don't know how it works if you need to use the multiline modifier, /m, which could be useful in your regex.
    Anyway, all this makes the regex easier to read and debug.

    i'm not just slapping these together randomly because i dont know what i'm doing, most of the function you see is there for a reason, otherwise i would have asked for help.
    I don't think I said that, and I'm sorry if I said something wrong. My intention was to say that regex is not the proper way to do this because the problem is too complex, that's all.

    if you did not notice, all the software i had mentioned is Windows only, but you are suggesting that the RIGHT place to look was at Linux based editors like Quanta and Emacs and VI?
    I used emacs on Windows for many years, and AFAIK there is a port of vi also. I don't know about quanta and scite.

    or should I install a linux distro just so i can edit my scripts?
    I'm not a linux activist. I'm happy with it, but I'm not so sure that it is the proper OS for everybody.

    i have tried scite/scintilla based editors, and couldnt find a way to make it dynamically match and style xhtml tags except possibly through macros like in Ultraedit. i also found a php syntax highlighting bug in the first php file that i opened and was not about to go fixing it. i really like their tab and hidden char display, but dislike that you cant control the color of the hidden chars.

    I tried vi/cream but it was too foreign to learn and customize quickly...i might as well abandon my task at hand for learning VI, and i definitely want to in the future, but now is not the time. google recently turned up an emacs style editor called epsilon which i would love to try sometime, but the trial is not available currently.

    E-control editor was also very close to what i needed..i think it uses synedit. and jEdit was also very close. Programmers notepad 2 is scintilla based, so suffers from the same issues, but i liked a lot otherwise. Notepad2 and Notepad++ both use scintilla, while PSpad uses synedit i think, and it comes close as well.

    anyhow...
    that's the reason I like emacs: if I don't like something, I extend it.

    best regards

  • #11
    Regular Coder
    Join Date
    Apr 2006
    Location
    Northbrook, IL
    Posts
    394
    Thanks
    8
    Thanked 6 Times in 6 Posts
    after a bit more searching i actually discovered a very good editor that i glanced at before, but overlooked perhaps because of lacking code hints and tag auto-completion maybe. behold dev-PHP:
    http://devphp.sourceforge.net/

    i'm currently using phpDesigner Pro 2008 and find it to be the best for php of any IDE that i've used. I always loved the syntax highlighting accuracy and live brace and tag matching.

    well i discovered the secret component that it uses is also embedded in dev-PHP...it's called SynWeb...which i think is a mod or implementation of synEdit and it is brilliant. Lightning fast, lightweight, portable, and accurate. dev-PHP still seems lacking in the code hint/tag completion department, and it has no function/class browser, but for a portable editor of css/html/php/js it is great and has FTP support. it has better highlighting accuracy than scintilla based notepad++, and also has live tag matching. (np++/scintilla was the only one that passed the php parsing test here out of a lot of highly rated editors: http://planetozh.com/blog/2007/06/te...hmatch-review/)

    SynWeb: http://flatdev.ovh.org/
    Synweb implementations: http://flatdev.ovh.org/downloads.php?project=2

    scite does have a win32 build, and i have tried it. it's good but not great. i like the notepad++ implementation of the same scilexer.dll better, but then again, scite is meant to be a demo of the scintilla component rather than a full featured editor.

    i will definitely try emacs/vi again when i have time. i'm almost ashamed to lack proficiency in these true text editors. thanks for the advice.

    cheers,
    Leon

