Regular Expressions Cookbook: Detailed Solutions in Eight Programming Languages

Author: Jan Goyvaerts, Steven Levithan

Comments

by anonymous   2019-07-21

I suggest buying yourself this book:

http://www.amazon.ca/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/1449319432/ref=sr_1_1?ie=UTF8&qid=1444846344&sr=8-1&keywords=regular+expression+cookbook

if you are struggling with such a basic regex (which can be found pretty much everywhere on the net). It's a cookbook, meaning that the solutions can be used directly as they appear in the book.

From page 249:

^\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})

You have three capture groups:

1st and 2nd: ([0-9]{3}); 3rd: ([0-9]{4})

You can use those capture groups to return the different sections of the phone number. If you do not need the captured text, use non-capturing groups (?:...) for increased performance (in this context that should not be necessary unless you run the regex in a loop).

The \(? and \)? parts are for the optional parentheses.

You need to escape the parentheses since they are used for grouping in regex. The question mark (?) makes the preceding element or group optional.

Use [- ]? to make a space or hyphen optional between the digit sequences.
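
As a rough sketch of how the capture groups could be read out, here is the pattern quoted above in Python. The answer does not name a language, so Python is an assumption; PHONE and split_phone are illustrative names, and the sample numbers are made up.

```python
import re

# The pattern exactly as quoted above (note: no trailing $ anchor).
PHONE = re.compile(r"^\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})")

def split_phone(text):
    """Return the three captured digit groups, or None if there is no match."""
    m = PHONE.match(text)
    return m.groups() if m else None

print(split_phone("(555) 123-4567"))  # parentheses, space, hyphen
print(split_phone("555-123-4567"))    # hyphens only
print(split_phone("5551234567"))      # no separators at all
print(split_phone("55-123-4567"))     # too few digits in the first group
```

Because the pattern as quoted has no $ at the end, input with extra characters after the last four digits would still match; append $ if that is not wanted.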

Additional note: That is a pretty generic regex, so it should be compatible with any programming language.

However, you should know that even though the general structure of regular expressions is pretty much the same everywhere, there are significant differences among language implementations when it comes to fancier or more advanced features.

Next time, you should also specify which programming language you want the regex for (for example: PHP, JavaScript, etc.)

by anonymous   2019-07-21

There are two problems with your query:

  1. Tagsoup adds namespaces

    Either register the namespace (it seems reasonable to declare the default namespace as you're probably only dealing with XHTML):

    basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
    

    or use * as namespace indicator for each element:

    basex -ipage.xhtml "//*:div[@id='ps-content']"
    
  2. XML/XQuery is case sensitive

    I already corrected it in my queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) already yield the expected result.


Tagsoup can be used from within BaseX; you do not have to call it separately for HTML input. Make sure to include tagsoup in your default Java classpath, e.g., by installing libtagsoup-java in Debian.

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

You can even query the HTML page directly from BaseX if you want to:

basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'

Using -i didn't work for me with tagsoup, but you can use doc(...) instead.

by anonymous   2017-08-20

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.

If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn not only how to write and debug your own patterns but also how to understand patterns written by others.

Start simple

Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.

Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.

If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)

Order from the menu

Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.

The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].

Think of character classes as menus: pick just one.
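
A small sketch of the menu idea, using Python's re module (Python is an assumption here; the names and the word list are only for illustration):

```python
import re

# [Nn] picks exactly one character from the menu; [a-c] is a range.
name = re.compile(r"[Nn]ick")
a_to_c = re.compile(r"[a-c]")

def matches_name(word):
    """True when the whole word is 'Nick' or 'nick'."""
    return name.fullmatch(word) is not None

accepted = [w for w in ["Nick", "nick", "Rick", "Nnick"] if matches_name(w)]
print(accepted)

# Each match of a character class consumes a single character.
print(a_to_c.findall("abcdef"))
```

'Nnick' is rejected because the class matches one character only, so the second 'n' has nothing to pair with.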

Helpful shortcuts

Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match non-negative integers: one way to write that is [0-9]+. Digits are a frequent match target, so you could instead use \d+ to match non-negative integers. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).

The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
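
The shortcuts can be tried out like this in Python (an assumption; the sample string is made up). One caveat specific to Python 3: on str patterns \d and \w also match non-ASCII digits and letters unless re.ASCII is passed, so \d is not strictly identical to [0-9] there.

```python
import re

sample = "Order 66 shipped to Bay_7 at 0930"

digits = re.findall(r"\d+", sample)       # runs of digits
words = re.findall(r"\w+", sample)        # runs of word characters
chunks = re.findall(r"\S+", "a  b\tc\n")  # runs of non-whitespace

print(digits)
print(words)
print(chunks)
```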

Once is not enough

From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are

  • * (zero or more times)
  • + (one or more times)
  • {n} (exactly n times)
  • {n,} (at least n times)
  • {n,m} (at least n times but no more than m times)

Putting some of these blocks together, the pattern [Nn]*ick matches all of

  • ick
  • Nick
  • nick
  • Nnick
  • nNick
  • nnick
  • (and so on)

The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
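
To see the quantifiers in action, here is a short Python sketch (Python and the extra test word 'Rick' are assumptions added for illustration):

```python
import re

nick = re.compile(r"[Nn]*ick")
candidates = ["ick", "Nick", "nick", "Nnick", "nNick", "nnick", "Rick"]

# fullmatch requires the whole string to fit the pattern.
matched = [c for c in candidates if nick.fullmatch(c)]
print(matched)

# search only needs some substring to fit, and [Nn]* may match zero times,
# so 'Rick' still contains a match: the trailing 'ick'.
inside_rick = nick.search("Rick").group()
print(inside_rick)

# ? makes the preceding item optional; {n,m} bounds the repeat count.
optional_b = [s for s in ["abc", "ac", "abbc"] if re.fullmatch(r"ab?c", s)]
bounded = [s for s in ["a", "aa", "aaa", "aaaa"] if re.fullmatch(r"a{2,3}", s)]
print(optional_b, bounded)
```

The 'Rick' case is the "* always succeeds" lesson again: a starred subpattern never blocks a match on its own.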

Grouping

A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.

To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
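
The difference between the two patterns, plus a small capture example, might look like this in Python (an assumption; fits and the 'retries=3' string are hypothetical and only for illustration):

```python
import re

no_group = re.compile(r"0abc+0")   # + applies to 'c' only
grouped = re.compile(r"0(abc)+0")  # + applies to the whole 'abc'

def fits(pattern, text):
    """True when the entire text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(fits(no_group, "0abccc0"), fits(no_group, "0abcabc0"))
print(fits(grouped, "0abcabc0"), fits(grouped, "0abccc0"))

# Capturing: parentheses also save the text they matched.
m = re.fullmatch(r"(\w+)=(\d+)", "retries=3")
key, value = m.group(1), m.group(2)
print(key, value)
```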

Alternation

Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).

For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
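
The scope of | is easy to get wrong, so here is a hedged Python sketch (Python and the 'Hello' prefix are assumptions added to make the scope visible):

```python
import re

# Without parentheses the alternatives are 'Hello Nick' and 'nick'.
unscoped = re.compile(r"Hello Nick|nick")
# With parentheses only the name varies.
scoped = re.compile(r"Hello (Nick|nick)")

def fits(pattern, text):
    """True when the entire text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(fits(unscoped, "Hello Nick"), fits(unscoped, "nick"), fits(unscoped, "Hello nick"))
print(fits(scoped, "Hello Nick"), fits(scoped, "Hello nick"), fits(scoped, "nick"))

# The group also captures which alternative was chosen.
chosen = scoped.fullmatch("Hello nick").group(1)
print(chosen)
```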

Escaping

Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
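
In Python (an assumption), the contrast between \d+ and \\d\+ can be checked directly. The re.escape helper is not mentioned in the answer; it is shown here as one way to escape a literal string without doing it by hand.

```python
import re

text = "x 42 \\d+"  # the characters: x 42 \d+

as_shorthand = re.findall(r"\d+", text)  # one or more digits
as_literal = re.findall(r"\\d\+", text)  # backslash, 'd', plus sign
print(as_shorthand, as_literal)

# re.escape produces a pattern that matches the given string literally.
literal_sum = re.compile(re.escape("1+1=2"))
escaped_ok = literal_sum.fullmatch("1+1=2") is not None
# Unescaped, + is a quantifier on the first '1', so the literal text fails.
unescaped_ok = re.fullmatch(r"1+1=2", "1+1=2") is not None
print(escaped_ok, unescaped_ok)
```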

Greediness

Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.

For example, say the input is

"Hello," she said, "How are you?"

You might expect ".+" to match only 'Hello,' and then be surprised to see that it matched from 'Hello' all the way through 'you?'.

To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches a literal left parenthesis, followed by one or more characters, and terminated by a right parenthesis.

If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
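
Both inputs from this section can be run through Python's re module (Python is an assumption; the question does not say which language was used):

```python
import re

line = '"Hello," she said, "How are you?"'

greedy = re.search(r'".+"', line).group()  # runs to the last quote
lazy = re.search(r'".+?"', line).group()   # stops at the first closing quote
print(greedy)
print(lazy)

parens = "(123) (456)"
first_lazy = re.search(r"\((.+?)\)", parens).group(1)
first_greedy = re.search(r"\((.+)\)", parens).group(1)
all_lazy = re.findall(r"\((.+?)\)", parens)
print(first_lazy, first_greedy, all_lazy)
```

The greedy capture swallows the middle ') (' because .+ takes as much as it can while still leaving a final ')' for the pattern to end on.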

(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)

Anchors

Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.

Say you want to match comments of the form

-- This is a comment --

you'd write ^--\s+(.+)\s+--$.
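
A quick Python check of that bookend pattern (Python and the helper name comment_text are assumptions for illustration):

```python
import re

comment = re.compile(r"^--\s+(.+)\s+--$")

def comment_text(line):
    """Return the text between the dashes, or None if the line is not a comment."""
    m = comment.match(line)
    return m.group(1) if m else None

print(comment_text("-- This is a comment --"))
print(comment_text("x -- not at the start --"))
print(comment_text("-- not at the end -- x"))
```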

Build your own

Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.

Tools for writing and debugging regexes:

  • RegExr (for JavaScript)
  • Perl: YAPE: Regex Explain
  • Regex Coach (engine backed by CL-PPCRE)
  • RegexPal (for JavaScript)
  • Regular Expressions Online Tester
  • Regex Buddy
  • Regex 101 (for PCRE, JavaScript, Python)
  • Visual RegExp
  • Expresso (for .NET)
  • Rubular (for Ruby)
  • Regular Expression Library (Predefined Regexes for common scenarios)
  • Txt2RE
  • Regex Tester (for JavaScript)
  • Regex Storm (for .NET)

Books

Free resources

  • Regular Expressions - Everything you should know (PDF Series)
  • Regex Syntax Summary
  • How Regexes Work

Footnote

†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java has Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
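
The footnote names Perl and Java; as an added illustration, Python's counterpart is the re.DOTALL flag, and the [\s\S] workaround behaves the same way there (the sample text is made up):

```python
import re

text = "line one\nline two"

default_dot = re.search(r".+", text).group()            # stops at the newline
dotall_dot = re.search(r".+", text, re.DOTALL).group()  # crosses the newline
any_char = re.search(r"[\s\S]+", text).group()          # workaround, no flag

print(repr(default_dot))
print(repr(dotall_dot))
print(repr(any_char))
```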