Mastering Regular Expressions

Author: Jeffrey Friedl
4.4
All Stack Overflow 64
This Year Stack Overflow 7
This Month Stack Overflow 1

Comments

by anonymous   2017-10-30
check this book it will make you an expert https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124
by anonymous   2017-08-20

If the quoted sub-strings to be matched do NOT contain escaped characters, then both Karl Barker's and Pierce's answers will both match correctly. However, of the two, Pierce's expression is more efficient:

reobj = re.compile(r"""
    # Match double quoted substring (no escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"]*             # Zero or more non-".
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.VERBOSE)

But if the quoted sub-string to be matched DOES contain escaped characters, (e.g. "She said: \"Hi\" to me.\n"), then you'll need a different expression:

reobj = re.compile(r"""
    # Match double quoted substring (allow escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"\\]*           # {normal} Zero or more non-", non-\.
      (?:               # Begin {(special normal*)*} construct.
        \\.             # {special} Escaped anything.
        [^"\\]*         # more {normal} Zero or more non-", non-\.
      )*                # End {(special normal*)*} construct.
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.DOTALL | re.VERBOSE)

There are several expressions I'm aware of that will do the trick, but the one above (taken from MRE3) is the most efficient of the bunch. See my answer to a similar question where these various, functionally identical expressions are compared.

by dreftymac   2017-08-20

Regular-expressions.info has a tutorial section. This site is pretty comprehensive for regular expressions in general, although the emphasis is on support in scripting and programming languages.

The J. Friedl book on mastering regular expressions (mentioned elsewhere), is definitely an outstanding resource, and essential reading if you are serious about doing what the title says.

Anyone who uses regular expressions routinely, or as part of their job, would do well to pick up the book. Browsing around for tutorials is not really conducive to proficiency, in regex-land.

by anonymous   2017-08-20

Matching (possibly nested) BBCode tags

A multi-pass approach is required if the elements are nested. This can be accomplished in one of two ways; matching from the inside out (no recursive expression required), or from the outside in (which requires a recursive expression). (See also my answer to a similar question: PHP, nested templates in preg_replace) However, since the Javascript regex engine does not support recursive expressions, the only way to (correctly) do this using regex is from the inside out. Below is a tested function which replaces BBCode SIZE tags with SPAN html tags from the inside out. Note that the (fast) regex below is complex (for one thing it implements Jeffrey Friedl's "unrolling-the-loop" efficiency technique - See: Mastering Regular Expressions (3rd Edition) for details), and IMO all complex regexes should be thoroughly commented and formatted for readability. Since Javascript has no free-spacing mode, the regex below is first presented fully commented in PHP free-spacing mode. The uncommented js regex actually used is identical to the verbose commented one.

Regex to match innermost of (possibly nested) SIZE tags:

// Regular expression in commented (PHP string) format.
$re = '% # Match innermost [size=ddd]...[/size] structure.
    \[size=            # Literal start tag name, =.
    (\d+)\]            # $1: Size number, ending-"]".
    (                  # $2: Element contents.
      # Use Friedls "Unrolling-the-Loop" technique:
      #   Begin: {normal* (special normal*)*} construct.
      [^[]*            # {normal*} Zero or more non-"[".
      (?:              # Begin {(special normal*)*}.
        \[             # {special} Tag open literal char,
        (?!            # but only if NOT start of
          size=\d+\]   # [size=ddd] open tag
        | \/size\]     # or [/size] close tag.
        )              # End negative lookahead.
        [^[]*          # More {normal*}.
      )*               # Finish {(special normal*)*}.
    )                  # $2: Element contents.
    \[\/size\]         # Literal end tag.
    %ix';

Javascript function: parseSizeBBCode(text)

function parseSizeBBCode(text) {
    // Here is the same regular expression in javascript syntax:
    var re = /\[size=(\d+)\]([^[]*(?:\[(?!size=\d+\]|\/size\])[^[]*)*)\[\/size\]/ig;
    while(text.search(re) !== -1) {
        text = text.replace(re, '<span style="font-size: $1pt">$2</span>');
    }
    return text;
}

Example input:

r'''
[size=10] size 10 stuff
    [size=20] size 20 stuff
        [size=30] size 30 stuff [/size]
    [/size]
[/size]
'''

Example output:

r'''
<span style="font-size: 10pt"> size 10 stuff
    <span style="font-size: 20pt"> size 20 stuff
        <span style="font-size: 30pt"> size 30 stuff </span>
    </span>
</span>
'''

Disclaimer - Don't use this solution!

Note that using regex to parse BBCode is fraught with peril! (There are a lot of "gotchas" not mentioned here.) Many would say that it is impossible. However, I would strongly disagree and have in fact written a complete BBCode parser (in PHP) which uses recursive regular expressions and works quite nicely (and is fast). You can see it in action here: New 2011 FluxBB Parser (Note that it uses some very complex regexes not for the faint of heart).

But in general, I would strongly warn against parsing BBCode using regex unless you have a very deep and thorough understanding of regular expressions (which can be gained from careful study and practice of Friedl's masterpiece). In other words, if you are not a master of regex (i.e. a regex guru), steer clear from using them for any but the most trivial of applications.

by anonymous   2017-08-20

This is similar to this problem. And since you are using PCRE, using the recursion syntax, there is actually a solution.

/
(?(DEFINE)                # define a named capture for later convenience
  (?P<parenthesized>      # define the group "parenthesized" which matches a
                          # substring which contains correctly nested
                          # parentheses (it does not have to be enclosed in
                          # parentheses though)
    [^()]*                # match arbitrarily many non-parenthesis characters
    (?:                   # start non capturing group
      [(]                 # match a literal opening (
      (?P>parenthesized)  # recursively call this "parenthesized" subpattern
                          # i.e. make sure that the contents of these literal ()
                          # are also correctly parenthesized
      [)]                 # match a literal closing )
      [^()]*              # match more non-parenthesis characters
    )*                    # repeat
  )                       # end of "parenthesized" pattern
)                         # end of DEFINE sequence

# Now the actual pattern begins

(?<=[(])                  # ensure that there is a literal ( left of the start
                          # of the match
(?P>parenthesized)?       # match correctly parenthesized substring
$                         # ensure that we've reached the end of the input
/x                        # activate free-spacing mode

The gist of this pattern is obviously the parenthesized subpattern. I should maybe elaborate a bit more on that. It's structure is this:

(normal* (?:special normal*)*)

Where normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling-the-loop". It's used to match anything that has the structure

nnnsnnsnnnnsnnsnn

Where n is matched by normal and s is matched by special.

In this particular case, things are a bit more complicated though, because we are also using recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is part of). You can view the (?P>...) syntax a bit like a backreference - except the engine does not try to match what the group ... matched, but instead applies it's subpattern again.

Also note that my pattern will not give you an empty string for correctly parenthesized patterns, but will fail. You could fix this, by leaving out the lookbehind. The lookbehind is actually not necessary, because the engine will always return the left-most match.

EDIT: Judging by two of your examples, you don't actually want everything after the last unmatched parenthesis, but only everything until the first comma. You can use my result and split on , or try Bohemian's answer.

Further reading:

EDIT: I noticed that you mentioned in your question that you are using Object Pascal. In that case you are probably not actually using PCRE, which means there is no support for recursion. In that case there can be no full regex solution to the problem. If we impose a limitation like "there can only be one more nesting level after the last unmatched parenthesis" (as in all your examples), then we can come up with a solution. Again, I'll use "unrolling-the-loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx.

(?<=[(])         # make sure we start after an opening (
(?=              # lookahead checks that the parenthesis is not matched
  [^()]*([(][^()]*[)][^()]*)*
                 # this matches an arbitrarily long chain of parenthesized
                 # substring, but allows only one nesting level
  $              # make sure we can reach the end of the string like this
)                # end of lookahead
[^(),]*([(][^()]*[)][^(),]*)*
                 # now actually match the desired part. this is the same
                 # as the lookahead, except we do not allow for commas
                 # outside of parentheses now, so that you only get the
                 # first comma-separated part

If you ever add an input example like aaa(xxx(yyy()) where you want to match xxx(yyy()) then this approach will not match it. In fact, no regex that does not use recursion can handle arbitrary nesting levels.

Since your regex flavor doesn't support recursion, you are probably better off without using regex at all. Even if my last regex matches all your current input examples, it's really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudocode gives you everything after the last unmatched (:

while you can read another character from the string
    if that character is "(", push the current position onto the stack
    if that character is ")", pop a position from the stack
# you've reached the end of the string now
if the stack is empty, there is no match
else the top of the stack is the position of the last unmatched parenthesis;
     take a substring from there to the end of the string

To then obtain everything up to the first unnested comma, you can walk that result again:

nestingLevel = 0
while you can read another character from the string
    if that character is "," and nestingLevel == 0, stop
    if that character is "(" increment nestingLevel
    if that character is ")" decrement nestingLevel
take a substring from the beginning of the string to the position at which
  you left the loop

These two short loops will be much easier for anyone else to understand in the future and are a lot more flexible than a regex solution (at least one without recursion).

by anonymous   2017-08-20

First of all, please, don't monkey patch. This is bad design, just make a helper function that takes an argument you need (string in your case).

def title_case(string)
    no_caps = %w(a an the at by and as but or for in nor on at up to on of from by)
    no_caps_regex = /\b(#{no_caps.join('|')})\b/i # match separate words from above, case-insensitive

    # you will need ActiveSupport (or Rails) for +String#titleize+ support
    titleized = string.titleize
    handle_special = titleized.gsub(/\b(mc)(.+?)\b/i) do |match| 
        [$1, $2].map(&:capitalize).join 
    end

    no_capsed = handle_special.gsub(no_caps_regex) { |match| match.downcase }
end

title_case('mcdonalds is fast food, but mrmcduff is not')
# => "McDonalds Is Fast Food, but Mrmcduff Is Not"

UPDATE: I am sorry about that, it was really bad reading, but I still want to elaborate on the confused terms you noted:

  1. Monkey patching is a technique, available for some dynamic languages (Ruby or Javascript, for example) where you can change (add or remove methods/properties) to already existing classes, such as String, Fixnum, DateTime and others. Often this technique is used for "enhancing" core types (exactly like you did in your code, adding method title_case to String). The problem here is that if any other library developer chooses the same name and adds it to String class, and you eventually want to try his library in your project, your implementations will clash together and which one is added later wins (depending on the code loading time, usually yours). This will either brake your code or brake the library which is no good also.

    Another similar problem, is when you try to "fix" some bug in third party library this way. You monkey patch it, everything works and you forget about it. Then 6 months later you decide to upgrade the library to a new version and suddenly everything blows up, because library code clashed with your changes and you may even not to remember about your monkey patch (or it may even be another developer, that doesn't even know about your monkey patch existence).

  2. Helper function - is just some function that you can add a) to a separate file, called helper b) or just to the current controller/model (the place you need it).

  3. \b is a mark in regex that tells regex engine to treat the following text as a separate word, i.e. /as/ regex can match for word as and also for word fast since it contains as. If you instead use /\bas\b/, only as will be matched.

    Regexes are very powerful, please, find some time to learn them, you'll boost your text processing skills to a next level. Combined with some console tools knowledge (I mean commands in UNIX terminals, such as ls, ps, find, grep and etc.), they can be very powerful in day-to-day routines such as "whether yesterday logs contain some ip?", "what is the process name that eats all memory on my machine right now?" or "what are all files in my project that contain this function call?".

    The classic book on this subject is J. Friedl's "Mastering regular expressions", highly recommended.

Have a nice day.

by Kibbee   2017-08-20

If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes all the way from the very basic concepts, all the way up to talking about how different engines work underneath. The last 4 chapters also gives a dedicated chapter to each of PHP, .Net, Perl, and Java. I learned a lot from it, and still use it as a reference.

by anonymous   2017-08-20

I'm not c# savvy but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:

http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0

It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.

-Ryan

by anonymous   2017-08-20

Alternate answer to specific question:

anubhava's solution works accurately, but is slow because it must perform a negative lookahead at each and every character position in the string. A simpler approach is to use reverse logic. i.e. Instead of verifying that: /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate this, lets create a function: hasSpecial() which tests if a string has one of the special chars. Here are two versions, the first uses anubhava's second regex:

function hasSpecial_1(text) {
    // If regex matches, then string does NOT contain special chars.
    return /^((?!&#|<|>)[\s\S])*$/.test(text) ? false : true;
}
function hasSpecial_2(text) {
    // If regex matches, then string contains (at least) one special char.
    return /[<>]|&#/.test(text) ? true : false;
}

These two functions are functionally equivalent, but the second one is probably quite a bit faster.

Note that when I originally read this question, I misinterpreted it to really want to exclude HTML special chars (including HTML entities). If that were the case, then the following solution will do just that.

Test if a string contains HTML special Chars:

It appears that the OP want to ensure a string does not contain any special HTML characters including: <, >, as well as decimal and hex HTML entities such as: &#160;, &#xA0;, etc. If this is the case then the solution should probably also exclude the other (named) type of HTML entities such as: &amp;, &lt;, etc. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.

Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)

FALSE test using positive regex:

function hasHtmlSpecial_1(text) {
    /* Commented regex:
        # Match string having no special HTML chars.
        ^                  # Anchor to start of string.
        [^<>&]*            # Zero or more non-[<>&] (normal*).
        (?:                # Unroll the loop. ((special normal*)*)
          &                # Allow a & but only if
          (?!              # not an HTML entity (3 valid types).
            (?:            # One from 3 types of HTML entities.
              [a-z\d]+     # either a named entity,
            | \#\d+        # or a decimal entity,
            | \#x[a-f\d]+  # or a hex entity.
            )              # End group of HTML entity types.
            ;              # All entities end with ";".
          )                # End negative lookahead.
          [^<>&]*          # More (normal*).
        )*                 # End unroll the loop.
        $                  # Anchor to end of string.
    */
    var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
    // If regex matches, then string does NOT contain HTML special chars.
    return re.test(text) ? false : true;
}

Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))

TRUE test using negative regex:

function hasHtmlSpecial_2(text) {
    /* Commented regex:
        # Match string having one special HTML char.
          [<>]           # Either a tag delimiter
        | &              # or a & if start of
          (?:            # one of 3 types of HTML entities.
            [a-z\d]+     # either a named entity,
          | \#\d+        # or a decimal entity,
          | \#x[a-f\d]+  # or a hex entity.
          )              # End group of HTML entity types.
          ;              # All entities end with ";".
    */
    var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
    // If regex matches, then string contains (at least) one special HTML char.
    return re.test(text) ? true : false;
}

Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.