An Introduction to Formal Languages and Automata

Category: Computer Science
Author: Peter Linz
3.0
This Year Stack Overflow 2
This Month Stack Overflow 2

Comments

by anonymous   2017-08-20

Update: In a comment on your question, you mentioned you want to clean wiki markup and remove balanced sequences of {{ ... }}. Section 6 of the Perl FAQ covers this: Can I use Perl regular expressions to match balanced text?

Consider the following program:

#! /usr/bin/perl

use warnings;
use strict;

use Text::Balanced qw/ extract_tagged /;

# for demo only
*ARGV = *DATA;

while (<>) {
  if (s/^(.+?)(?=\{\{)//) {
    print $1;
    my(undef,$after) = extract_tagged $_, "{{" => "}}";

    if (defined $after) {
      $_ = $after;
      redo;
    }
  }

  print;
}

__DATA__
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. {{delete me}} Sed quis
nulla ut dolor {{me too}} fringilla
mollis {{ quis {{ ac }} erat.

Its output:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit.  Sed quis
nulla ut dolor  fringilla
mollis {{ quis  erat.

For your particular example, you could use

$text =~ s/[^ac]|a(?!c)|(?<!a)c//g;

That is, only delete an a or c when they aren't part of an ac sequence.

In general, this is tricky to do with a regular expression.

Say you don't want foo followed by optional whitespace and then bar in $str. Often, it's clearer and easier to check separately. For example:

die "invalid string ($str)"
  if $str =~ /^.*foo\s*bar/;

You might also be interested in an answer to a similar question, where I wrote

my $nofoo = qr/
  (      [^f] |
    f  (?! o) |
    fo (?! o  \s* bar)
  )*
/x;

my $pattern = qr/^ $nofoo bar /x;

To understand the complication, read How Regexes Work by Mark Dominus. The engine compiles regular expressions into state machines. When it's time to match, it feeds the input string to the state machine and checks whether the state machine finishes in an accept state. So to exclude a string, you have to specify a machine that accepts all inputs except a particular sequence.

What might help is a /v regular expression switch that creates the state machine as usual but then complements the accept-state bit for all states. It's hard to say whether this would really be useful as compared with separate checks because a /v regular expression may still surprise people, just in different ways.

If you're interested in the theoretical details, see An Introduction to Formal Languages and Automata by Peter Linz.

by anonymous   2017-08-20

The patterns in your question match the same text. In terms of implementation, they correspond to different automata and side effects (i.e., whether they capture substrings).

In a comment below, Garrett Albright points out a subtle distinction. Whereas (.|\n) matches any character, [.\n] matches either a literal dot or a newline. Although dot is no longer special inside a character class, other characters such as -, ^, and ] along with sequences such as [:lower:] take special meanings inside a character class. Care is necessary to preserve special semantics from one context to the other, but sometimes it isn’t possible such as in the case of \1 as an archaic way of writing $1 outside a character class. Inside a character class, \1 always matches the character SOH.

Character classes ([...]) are optimized for matching one out of some set of characters, and alternatives (x|y) allow for more general choices of varying lengths. You will tend to see better performance if you keep these design principles in mind. Regex implementations transform source code such as /[abc]/ into finite-state automata, usually NFAs. What we think of as regex engines are more-or-less bookkeepers that assist execution of those target state machines. The sufficiently smart regex compiler will generate the same machine code for equivalent regexes, but this is difficult and expensive in the general case because of the lurking exponential complexity.

For an accessible introduction to the theory behind regexes, read “How Regexes Work” by Mark Dominus. For deeper study, consider An Introduction to Formal Languages and Automata by Peter Linz.