Alternation, grouping, capturing and back-references

Alternation

 

An alternation provides several subexpressions as alternatives that can be matched at the current point. The meta-character used is the vertical bar (|). It has the lowest precedence of all regular expression operators, so each alternative extends as far as possible on both sides of the bar, and you usually need grouping to reduce its scope. When you match against /this is blue|red/, you will match either "this is blue" or just "red". To get what was probably intended, you have to use grouping (described below), like this: /this is (blue|red)/.
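As a minimal sketch, the difference in scope can be checked with Python's re module. The choice of Python and the sample strings are assumptions made for illustration only; the two patterns are the ones discussed above.

```python
import re

ungrouped = re.compile(r"this is blue|red")
grouped = re.compile(r"this is (blue|red)")

# Without grouping, the two alternatives are "this is blue" and "red".
ungrouped_hit = ungrouped.fullmatch("red") is not None           # True
ungrouped_miss = ungrouped.fullmatch("this is red") is not None  # False

# With grouping, only the colour is subject to the alternation.
grouped_hit = grouped.fullmatch("this is red") is not None       # True
grouped_miss = grouped.fullmatch("red") is not None              # False
```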

When you use a traditional NFA engine, the order of the alternatives in an alternation is respected. This means that if the first alternative leads to a successful overall match, the remaining alternatives are never tried. For example, /this is (green|green-blue)/ will never match its second alternative with a traditional NFA, because nothing after the group forces the engine to backtrack. Only a DFA or a POSIX NFA guarantees the leftmost-longest match.
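Python's re module is a backtracking engine of the traditional NFA kind, so it can be used to sketch this behaviour. The input string is an assumption chosen for the example.

```python
import re

text = "this is green-blue"

# The first alternative succeeds, so "green-blue" is never tried.
ordered = re.match(r"this is (green|green-blue)", text)
first_choice = ordered.group(1)

# Putting the longer alternative first restores the longer match.
reordered = re.match(r"this is (green-blue|green)", text)
longer_choice = reordered.group(1)

print(first_choice)   # green
print(longer_choice)  # green-blue
```

With such an engine, listing the longer alternative first is the usual way to obtain the longer match.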

Grouping

 

As in mathematics, you need grouping to modify the scope of an operator. As expected, the grouping operator is (…). It may be used to limit the scope of an alternation or to increase the scope of a quantifier. For example: /this is (blue|red)/ or /([a-z][0-9])+/. But the grouping operator (…) also has the side effect of capturing. Since capturing can be annoying or inefficient, two solutions may be provided to avoid it:

  • A special RegexMode that disables capturing for (…) expressions
  • An alternative grouping-only operator (?:…) that does not capture
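The second option can be sketched in Python, which supports (?:…). The input string is made up for the example.

```python
import re

text = "a1b2c3"

# Same quantifier scope in both patterns; only the first one captures.
capturing = re.fullmatch(r"([a-z][0-9])+", text)
non_capturing = re.fullmatch(r"(?:[a-z][0-9])+", text)

print(capturing.groups())      # ('c3',)
print(non_capturing.groups())  # ()
```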

Capturing

 

Capturing has two purposes: obtaining the text matched by a subexpression of a regular expression, and supporting back-references. Since regular expressions are available from within many general-purpose languages, the ability of these host languages to access the text matched by a regular expression or by one of its subexpressions is a really powerful feature.

Capturing of parenthesized items, along with access to the text preceding a match, the text of the match and the text following the match, is available in most implementations. Some implementations also provide named captures. A common operator for a named capture is (?<name>…), although the exact spelling varies (Python, for instance, uses (?P<name>…)). Here is an example of both normal capture and named capture:
/(.+)\/(.+)\/(?<filename>.*)/

If you match this example against a pathname containing at least two directory names, the filename will be captured in the named capture "filename". Capture group 2 will hold the last directory name and capture group 1 everything that precedes it, since .+ is greedy and can also match slashes.
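Here is a sketch of the same expression in Python. Two things are adapted: Python spells the named capture (?P<filename>…), and the slash needs no escaping because no /…/ delimiters are involved. The sample path is an assumption.

```python
import re

pattern = re.compile(r"(.+)/(.+)/(?P<filename>.*)")
m = pattern.fullmatch("/usr/local/bin/perl")

print(m.group(1))           # /usr/local
print(m.group(2))           # bin
print(m.group("filename"))  # perl
```

To capture exactly one directory name per group, [^/]+ could be used in place of .+.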

Since the capture operator is also a grouping operator that can be used to scope a quantifier, a given capture can occur more than once during a match. Compare these two expressions: /([a-z])*/ and /([a-z]*)/. The first one repeats the capture, and in most languages capture group 1 will contain the last letter matched by the expression. Under .NET, it is also possible to access all the previous matches of the group. Since the group may match zero times while the overall match still succeeds, the capture group may also be undefined. In the second expression, the capture is done only once and holds the whole run of characters that has matched. Once again, the quantifier * may match nothing, but this time the capture is made regardless: it is always defined, although it may contain an empty string.
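The two expressions can be compared with a short Python sketch, where an undefined capture shows up as None. The inputs are chosen for illustration.

```python
import re

repeated = re.compile(r"([a-z])*")  # capture inside the repetition
whole = re.compile(r"([a-z]*)")     # repetition inside the capture

last_letter = repeated.fullmatch("abc").group(1)
all_letters = whole.fullmatch("abc").group(1)

# On empty input both patterns still match, but the captures differ.
undefined = repeated.fullmatch("").group(1)
empty = whole.fullmatch("").group(1)

print(repr(last_letter))  # 'c'
print(repr(all_letters))  # 'abc'
print(repr(undefined))    # None
print(repr(empty))        # ''
```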

Since these captures are accessed after a match, in the flow of the host language, the way this is done depends heavily on the host language itself. For example, in Perl, capture groups are available through the variables $1, $2 and so on. $& gives access to the whole matched text. $` and $' return the preceding and following text respectively. And, really specific to Perl, $+ returns the text of the highest-numbered capture group that took part in the match.
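For comparison, here is a sketch of how the same information can be reached in Python, where it lives on the match object instead of in global variables. The sample text and pattern are assumptions, and the comments only point to the rough Perl counterpart.

```python
import re

text = "set name = value;"
m = re.search(r"(\w+) = (\w+)", text)

matched = m.group(0)           # whole match, like $&
before = text[:m.start()]      # text preceding the match, like $`
after = text[m.end():]         # text following the match, like $'
first, second = m.group(1, 2)  # capture groups, like $1 and $2

print(repr(before), repr(matched), repr(after))
# 'set ' 'name = value' ';'
```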

Back-references

 

Back-references permit access to a capture inside the regular expression itself. In most implementations, the first 9 capture groups can be referenced using the meta-sequences \1 to \9. Some implementations also permit access to groups above 9, but since this notation conflicts with octal escape sequences, the two may be tricky to distinguish.

Implementations that provide named captures also provide back-references to them. The syntax varies: for example, .NET languages use \k<name> and Python uses (?P=name).
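A small Python sketch of both forms, using a doubled-word search as the example. The pattern, the group name "word" and the sample sentence are assumptions made for illustration.

```python
import re

# Numbered back-reference: \1 must match the same text as group 1.
numbered = re.compile(r"\b(\w+) \1\b")
# Named back-reference, using Python's (?P=name) syntax.
named = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

text = "this is is a test"

doubled_numbered = numbered.search(text).group(1)
doubled_named = named.search(text).group("word")

print(doubled_numbered)  # is
print(doubled_named)     # is
```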

Atomic grouping

 

Atomic grouping is similar to possessive quantifiers. It only affects regular expression backtracking and is explained in that section. The syntax of atomic grouping is (?>…).