Extended Regular Expressions

General Information

Extended regular expressions (ERE-s) are used in several places in the LDM system -- primarily as a pattern for matching

Consequently, it's important to know how to use ERE-s.

See the official description of ERE-s.

There is an excellent tool for determining a matching ERE from example text.

Here is an incomplete summary of ERE syntax:

        .           Any single character
        [chars]     Character class: One  of chars
        [^chars]    Character class: None of chars
        text1|text2 Alternative: text1 or text2

        ?           0 or 1 of the preceding text
        *           0 or N of the preceding text (N > 0)
        +           1 or N of the preceding text (N > 1)
        {M,N}       M through N of the preceding text (either may be missing)

        (text)      Grouping of text
                 (either to set the borders of an alternative or
                 for making backreferences where the Nth group can
                 be used in a pqact PIPE action with (for example) \N)

        ^           Start of string anchor
        $           End   of string anchor

        \char       escape that particular character
                 (for instance to specify the characters ".[]()"

Pathological ERE-s

Don't use ERE-s with a ".*" prefix because:

  1. They can take up to three orders of magnitude longer to match against a string than their unprefixed counterparts; and
  2. The ".*" prefix adds nothing to the ERE: the same set of strings is matched if it's omitted.

This only applies to ERE-s with a ".*" prefix: the ERE ".*", by itself, is perfectly OK.

The inefficiency of pathological ERE-s can be seen by using the UNIX time utility with the LDM's regex utility. First the non-pathological case:

      $ time regex -n 10000 \
        -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
      no match

      real	0m0.044s
      user	0m0.040s
      sys	0m0.000s

The above indicates that ten-thousand comparisons of the given string against the ERE took 0.04 seconds of user-time.

The timing of the corresponding pathological ERE is much different:

      $ time regex -n 10000 \
        -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
      no match

      real	0m18.424s
      user	0m17.720s
      sys	0m0.020s

The above took 17.72 seconds of user-time, meaning that the non-pathological ERE is 443 times more efficient than the pathological one. More complex pathological ERE-s have even worse results.