Literals

The simplest regexes are simple substring patterns:

The match operator (// or, more formally, m//) contains a regular expression--in this example, hat. Even though that reads like a word, it means "the h character, followed by the a character, followed by the t character, appearing anywhere in the string." Each character is an atom in the regex: an indivisible unit of the pattern. The regex binding operator (=~) is an infix operator (fixity) which applies the regular expression on its right to the string produced by the expression on its left. When evaluated in scalar context, a match evaluates to a true value if it succeeds.

The negated form of the binding operator (!~) evaluates to false if the match succeeds.
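A minimal sketch of both binding operators follows. The variable names and the sample string are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $name = 'Chatfield';

# =~ applies the regex on its right to the string on its left.
my $has_hat = ( $name =~ /hat/ );    # true: "hat" appears inside "Chatfield"

# !~ inverts the result of the same match.
my $no_hat  = ( $name !~ /hat/ );    # false: the match succeeded

print "Found a hat!\n"  if $has_hat;
print "No hat found.\n" if $no_hat;
```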

The qr// Operator and Regex Combinations

Regexes are first-class entities in modern Perl when created with the qr// operator:

You may interpolate and combine them into larger and more complex patterns:
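As a rough illustration, the sketch below stores two regexes in variables and then interpolates them into a larger pattern. The names `$hat` and `$field` and the sample string are assumptions.

```perl
use strict;
use warnings;

# Compile two small regexes once and keep them in variables.
my $hat   = qr/hat/;
my $field = qr/field/;

# Hypothetical sample string.
my $name = 'Chatfield';

# A qr// object can stand on the right side of =~ by itself...
my $matches_hat = ( $name =~ $hat );

# ...or interpolate into a larger pattern: "hat" followed by "field".
my $matches_both = ( $name =~ /$hat$field/ );

print "Found a hat in a field!\n" if $matches_both;
```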

Quantifiers

Regular expressions are far more powerful than previous examples have demonstrated; you can search for a literal substring within a string with the index operator. Using the regex engine for that is like flying your autonomous combat helicopter to the corner store to buy spare cheese.

Regular expressions get more powerful through the use of regex quantifiers, which allow you to specify how often a regex component may appear in a matching string. The simplest quantifier is the zero or one quantifier, or ?:

Any atom in a regular expression followed by the ? character means "match zero or one of this atom." This regular expression matches if there are zero a characters immediately following a c character and immediately preceding a t character. It also matches if there is one and only one a character between the c and t characters.
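One way to see this behavior, assuming the pattern under discussion looks like `ca?t` and using a made-up word list:

```perl
use strict;
use warnings;

# The a is optional: zero or one a between c and t.
my $cat_or_ct = qr/ca?t/;

# Hypothetical candidates; "caat" has too many a characters.
my @matched = grep { $_ =~ $cat_or_ct } qw( cat ct caat );

print "@matched\n";
```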

The one or more quantifier, or +, matches only if there is at least one of the preceding atom in the appropriate place in the string to match:

There is no theoretical limit to the number of quantified atoms which can match.
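A small sketch of the one or more quantifier, again assuming a `ca+t` style pattern and an invented word list:

```perl
use strict;
use warnings;

# At least one a must appear between c and t.
my $some_a = qr/ca+t/;

# "ct" has no a at all, so it is the only candidate that fails.
my @matched = grep { $_ =~ $some_a } qw( cat caat caaaat ct );

print "@matched\n";
```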

The zero or more quantifier is *; it matches if there are zero or more instances of the quantified atom in the string to match:

This may seem useless, but it combines nicely with other regex features to indicate that you don't care about what may or may not be in that particular position in the string to match. Even so, most regular expressions benefit from using the ? and + quantifiers far more than the * quantifier. (Friedl's book explains why.)
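For comparison, a sketch of the zero or more quantifier with an assumed `ca*t` pattern and an invented word list:

```perl
use strict;
use warnings;

# Any number of a characters, including none, between c and t.
my $any_a = qr/ca*t/;

# "cot" fails because o is not an a, and zero a characters
# would still require t right after c.
my @matched = grep { $_ =~ $any_a } qw( ct cat caaat cot );

print "@matched\n";
```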

Finally, you can specify the number of times an atom may match with numeric quantifiers. {n} means that a match must occur exactly n times.

{n,} means that a match must occur at least n times, but may occur more times:

{n,m} means that a match must occur at least n times and cannot occur more than m times:
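The three numeric forms can be compared side by side. This is only a sketch; the patterns and the candidate words are assumptions.

```perl
use strict;
use warnings;

# Hypothetical candidates with one to four a characters.
my @words = qw( cat caat caaat caaaat );

my $exactly_two  = qr/ca{2}t/;      # exactly two a characters
my $two_or_more  = qr/ca{2,}t/;     # at least two
my $two_or_three = qr/ca{2,3}t/;    # at least two, at most three

my @exact = grep { $_ =~ $exactly_two  } @words;
my @open  = grep { $_ =~ $two_or_more  } @words;
my @range = grep { $_ =~ $two_or_three } @words;

print "{2}:   @exact\n";
print "{2,}:  @open\n";
print "{2,3}: @range\n";
```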

Metacharacters

Regular expressions get more powerful as atoms get more general. For example, the . character in a regular expression means "match any character except a newline". If you wanted to search a list of dictionary words for every word which might match 7 Down ("Rich soil") in a crossword puzzle, you might write:
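A possible version of that search, assuming the clue gives the first and last letters (l and m) and using a tiny made-up word list in place of a dictionary file:

```perl
use strict;
use warnings;

# l, then any two characters, then m.
my $seven_down = qr/l..m/;

# Hypothetical stand-in for a dictionary.
my @words = qw( loam lime lamb );

my @candidates = grep { $_ =~ $seven_down } @words;

print "@candidates\n";
```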

Of course, if your list of potential matches were anything other than a list of words, this metacharacter could cause false positives, as it also matches punctuation characters, whitespace, numbers, and many other characters besides word characters. The \w metacharacter represents all alphanumeric characters and the underscore:

Use the \d metacharacter to match digits:

Use the \s metacharacter to match whitespace, whether a literal space, a tab character, a carriage return, a form-feed, or a newline:

These three metacharacters have negated forms. To match any character except a word character, use \W. To match a non-digit character, use \D. To match anything but whitespace, use \S.
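The following sketch exercises each of these metacharacters and one negated form. All sample strings and variable names are assumptions.

```perl
use strict;
use warnings;

# \w rejects the punctuation that . would accept.
my $word_chars = qr/l\w\wm/;
my $loam_ok    = ( 'loam' =~ $word_chars );    # true
my $dashes_ok  = ( 'l--m' =~ $word_chars );    # false

# \d matches a single digit; three in a row here.
my $has_digits = ( 'Room 101' =~ /\d\d\d/ );   # true

# \s matches a tab just as well as a literal space.
my $has_space  = ( "two\twords" =~ /two\swords/ );    # true

# \W is the negation of \w.
my $has_nonword = ( 'loam!' =~ /\W/ );         # true: the ! character

print "loam passes the \\w check\n" if $loam_ok;
```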

If the range of allowed characters in these four groups isn't specific enough, you can specify your own character classes by enclosing them in square brackets:

If the characters in your character set form a contiguous range, you can use the hyphen character (-) as a shortcut to express that range.

Move the hyphen character to the start of the class to include it in the class:

Just as the word and digit class metacharacters (\w and \d) have negations, so too you can negate a character class. Use the caret (^) as the first element of the character class to mean "anything except these characters":
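A short sketch of custom character classes, covering a plain set, a range, a leading hyphen, and a negated class. The class contents and test strings are assumptions.

```perl
use strict;
use warnings;

my $vowel           = qr/[aeiou]/;     # any one of these five letters
my $lowercase       = qr/[a-z]/;       # a contiguous range
my $hyphen_or_digit = qr/[-0-9]/;      # leading hyphen is a literal hyphen
my $not_vowel       = qr/[^aeiou]/;    # anything except these letters

my $rhythm_has_vowel = ( 'rhythm' =~ $vowel );           # false
my $shout_has_lower  = ( 'HELLO'  =~ $lowercase );       # false
my $hyphen_matches   = ( '-'      =~ $hyphen_or_digit ); # true
my $audio_has_other  = ( 'audio'  =~ $not_vowel );       # true: the d

print "rhythm contains none of a, e, i, o, u\n" unless $rhythm_has_vowel;
```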

Greediness

The + and * quantifiers by themselves are greedy quantifiers; they match as many times as possible. This is particularly pernicious when using the tempting-but-troublesome "match any amount of anything" pattern .*:

The problem is more obvious when you expect to match a short portion of a string. Greediness always tries to match as much of the input string as possible first, backing off only when it's obvious that the match will not succeed. Thus you may not be able to fit all of the results into the four boxes in 7 Down if you go looking for "loam" with:

You'll get Alabama, Belgium, and Bethlehem for starters. The soil might be nice there, but they're all too long--and the matches start in the middle of the words.
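To make the problem concrete, here is a sketch that assumes a pattern along the lines of `l\w*m` and a handful of invented candidates. The capturing parentheses are used only to show what the greedy quantifier consumed.

```perl
use strict;
use warnings;

# l, then as many word characters as possible, then m.
my $seven_down = qr/l\w*m/;

# Every one of these matches, though only one fits four boxes.
my @matches = grep { $_ =~ $seven_down }
    qw( loam Alabama Belgium Bethlehem );

# The match starts mid-word and runs to the last possible m.
my ($inside_word) = 'Bethlehem' =~ /(l\w*m)/;

# .* is greedier still: it runs from the first l to the last m.
my ($span) = 'loam or lime or llama' =~ /(l.*m)/;

print "@matches\n";
print "$inside_word\n";
print "$span\n";
```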

Regex anchors force a match at a specific position in a string. The start of string anchor (\A) ensures that any match will start at the beginning of the string:

Similarly, the end of string anchor (\Z) ensures that any match will end at the end of the string (or just before a trailing newline).

If you're not fortunate enough to have a Unix word dictionary file available, the word boundary metacharacter (\b) matches only at the boundary between a word character (\w) and a non-word character (\W):
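A sketch of both approaches follows. The four-letter pattern, the word list, and the sample sentence are assumptions.

```perl
use strict;
use warnings;

# Anchored at both ends: the whole string must be l, two word
# characters, and m.
my $anchored = qr/\Al\w{2}m\Z/;

my @fits = grep { $_ =~ $anchored } qw( loam loamy Alabama Belgium );

# Word boundaries do a similar job inside running text.
my $bounded = qr/\bl\w{2}m\b/;

my $in_sentence = ( 'rich loam soil' =~ $bounded );    # true
my $in_alabama  = ( 'Alabama'        =~ $bounded );    # false

print "@fits\n";
```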

Sometimes you can't anchor a regular expression. In those cases, you can turn a greedy quantifier into a parsimonious quantifier by appending the ? quantifier:

In this case, the regular expression engine will prefer the shortest possible potential match, increasing the number of characters identified by the .*? token combination only if the current number fails to match. Because * matches zero or more times, the minimal potential match for this token combination is zero characters:

If this isn't what you want, use the + quantifier to match one or more items:

The ? quantifier modifier also applies to the ? (zero or one matches) quantifier as well as the range quantifiers. In every case, it causes the regex to match as few times as possible.
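The difference is easiest to see by capturing what each form consumed. This sketch uses an invented markup-like string; it is not meant as a way to parse real markup.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = '<b>bold</b> and <i>italic</i>';

# Greedy: runs from the first < to the last >.
my ($greedy) = $text =~ /<(.*)>/;

# Non-greedy: stops at the first > that lets the match succeed.
my ($lazy) = $text =~ /<(.*?)>/;

# .*? is satisfied by zero characters; .+? needs at least one.
my $star_matches_empty = ( '<>' =~ /<.*?>/ );    # true
my $plus_matches_empty = ( '<>' =~ /<.+?>/ );    # false

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```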

In general, the greedy patterns .+ and .* are tempting but dangerous tools. For simple programs which need little maintenance, they may be quick and easy to write, but non-greedy matching seems to match human expectations better. If you find yourself writing a lot of regular expressions with greedy matches, test them thoroughly with a comprehensive and automated test suite with representative data to lessen the possibility of unpleasant surprises.

Capturing

It's often useful to match part of a string and use it later; perhaps you want to extract an address or an American telephone number from a string:

Named Captures

Given a string, $contact_info, which contains contact information, you can apply the $phone_number regular expression and capture any matches into a variable with named captures:

The capturing construct can look like a big wad of punctuation, but it's fairly simple when you can recognize it as a single chunk:

The parentheses enclose the entire capture. The ?<name> construct must follow the left parenthesis. It provides a name for the capture buffer. The rest of the construct within the parentheses is a regular expression. If and when the regex matches this fragment, Perl stores the captured portion of the string in the special variable %+: a hash where the key is the name of the capture buffer and the value is the portion of the string which matched the buffer's regex.
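A sketch of a named capture in use. The definitions of `$area_code`, `$local_number`, and `$phone_number`, the capture name `phone`, and the contact string are all assumptions made so that the example is self-contained.

```perl
use strict;
use warnings;

# Assumed building blocks for an American telephone number.
my $area_code    = qr/\(\d{3}\)/;      # escaped literal parentheses
my $local_number = qr/\d{3}-?\d{4}/;
my $phone_number = qr/$area_code\s?$local_number/;

# Hypothetical contact information.
my $contact_info = 'Call (555) 123-4567 for details';

my $found;
if ( $contact_info =~ /(?<phone>$phone_number)/ ) {
    # %+ maps each capture name to the text it matched.
    $found = $+{phone};
    print "Found a number: $found\n";
}
```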

Parentheses are special to Perl 5 regular expressions; by default they perform the same grouping behavior as parentheses do in regular Perl code. They also enclose one or more atoms to capture whatever portion of the matched string they match. To use literal parentheses in a regular expression, you must preface them with a backslash, just as in the $area_code variable.

Anonymous Captures

Named captures are new in Perl 5.10, but captures have existed in Perl for many years. You may encounter anonymous captures as well:

The parentheses enclose the fragment to capture, but there is no regex directive giving the name of the capture. Instead, Perl stores the captured substring in a series of magic variables starting with $1 and continuing for as many capture groups as are present in the regex. The first matching capture that Perl finds goes into $1, the second into $2, and so on.
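The same idea with anonymous captures might look like this. The regex fragments and the sample string are assumptions, as are the variable names.

```perl
use strict;
use warnings;

# Assumed building blocks; the escaped parentheses are literal
# and do not capture.
my $area_code    = qr/\(\d{3}\)/;
my $local_number = qr/\d{3}-?\d{4}/;

# Hypothetical contact information.
my $contact_info = 'Call (555) 123-4567 for details';

my ( $area, $local );
if ( $contact_info =~ /($area_code)\s?($local_number)/ ) {
    # Copy the magic variables right away; the next successful
    # match will overwrite them.
    ( $area, $local ) = ( $1, $2 );
    print "Area code $area, local number $local\n";
}
```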

While the syntax for named captures is longer than for anonymous captures, it provides additional clarity. You do not have to count the number of opening parentheses to figure out whether a particular capture is $4 or $5, and composing regexes from smaller regexes is much easier, as they're less sensitive to changes in position or the presence or absence of capturing in individual atoms.

Grouping and Alternation

Previous examples have all applied quantifiers to simple atoms. They can also apply to more complex subpatterns as a whole:

If you expand the regex manually, the results may surprise you:

This still matches, but consider a more specific pattern:
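The sketch below contrasts a quantifier applied to a single atom with one applied to a parenthesized group. The patterns and strings are invented for illustration and are not necessarily the ones this section has in mind.

```perl
use strict;
use warnings;

# Without grouping, ? applies only to the space before it.
my $ungrouped = qr/\Apork ?beans/;

# With grouping, ? applies to "pork " as a whole.
my $grouped = qr/\A(pork )?beans/;

my @candidates = ( 'pork beans', 'porkbeans', 'beans' );

my @ungrouped_hits = grep { $_ =~ $ungrouped } @candidates;
my @grouped_hits   = grep { $_ =~ $grouped   } @candidates;

print join( ', ', @ungrouped_hits ), "\n";
print join( ', ', @grouped_hits ),   "\n";
```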

Some regexes need to match one thing or another. Use the alternation metacharacter (|) to do so:

The alternation metacharacter indicates that either preceding fragment may match. Be careful about what you interpret as a regex fragment, however:

It's possible to interpret the pattern rice|beans as meaning ric, followed by either e or b, followed by eans--but that's incorrect. Alternations always include the entire fragment to the nearest regex delimiter, whether the start or end of the pattern, an enclosing parenthesis, another alternation character, or a square bracket.

To reduce confusion, use named fragments in variables ($rice|$beans) or group alternation candidates in non-capturing groups:

The (?:) sequence groups a series of atoms but suppresses capturing behavior. In this case, it groups three alternatives.
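A sketch of how far each alternative extends, and of how a non-capturing group limits it. The three alternatives and the test strings are assumptions.

```perl
use strict;
use warnings;

# Ungrouped: the alternatives are "\Arice" and "beans\Z", so the
# anchors apply to one side each.
my $loose = qr/\Arice|beans\Z/;

# Grouped: the anchors apply to every alternative.
my $strict = qr/\A(?:pork|rice|beans)\Z/;

my $loose_pudding  = ( 'rice pudding' =~ $loose );     # true
my $strict_pudding = ( 'rice pudding' =~ $strict );    # false

# The alternation is whole words, never "ric", then e or b, then "eans".
my @dishes = grep { $_ =~ $strict } qw( pork rice beans riceeans ricbeans );

print "@dishes\n";
```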

Other Escape Sequences

Perl interprets several characters in regular expressions as metacharacters, which represent something different than their literal characters. Square brackets always denote a character class and parentheses group and optionally capture pattern fragments.

To match a literal instance of a metacharacter, escape it with a backslash (\). Thus \( refers to a single left parenthesis and \] refers to a single right square bracket. \. refers to a literal period character instead of the "match anything but an explicit newline character" atom.

Other useful metacharacters that often need escaping are the pipe character (|) and the dollar sign ($). Don't forget about the quantifiers either: *, +, and ? also qualify.

To avoid escaping everything (and worrying about forgetting to escape interpolated values), use the metacharacter disabling characters. The \Q metacharacter disables metacharacter processing until it reaches the \E sequence. This is especially useful when taking match text from a source you don't control when writing the program:

The $literal_text argument can contain anything--the string ** ALERT **, for example. With \Q and \E, Perl will not interpret the zero-or-more quantifier as a quantifier. Instead, it will parse the regex as \*\* ALERT \*\* and attempt to match literal asterisk characters.
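A sketch of this technique. The function name `contains_literal` and its arguments are hypothetical; only `\Q` and `\E` come from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains $literal_text
# exactly as written, with no metacharacter treated specially.
sub contains_literal {
    my ( $text, $literal_text ) = @_;

    return $text =~ /\Q$literal_text\E/ ? 1 : 0;
}

# The asterisks are matched literally, not as quantifiers.
my $alert = contains_literal( '** ALERT ** disk full', '** ALERT **' );

# The period is matched literally, so "abc" does not qualify.
my $dot = contains_literal( 'abc', 'a.c' );

print "alert found\n" if $alert;
```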

Assertions

The regex anchors (\A and \Z) are a form of regex assertion, which requires that a condition is present but doesn't actually match a character in the string. That is, the regex qr/\A/ will always match, no matter what the string contains. The metacharacters \b and \B are also assertions.

Zero-width assertions match a pattern, not just a condition in the string. Most importantly, they do not consume the portion of the string that they match. For example, to find a cat on its own, you might use a word boundary assertion:

... but if you want to find a non-disastrous feline, you might use a zero-width negative look-ahead assertion:

The construct (?!...) matches the phrase cat only if the phrase astrophe does not immediately follow.
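Both ideas can be sketched as follows. The variable names and sample strings are assumptions.

```perl
use strict;
use warnings;

# A cat that ends at a word boundary.
my $just_a_cat = qr/cat\b/;

# A cat not immediately followed by "astrophe".
my $safe_feline = qr/cat(?!astrophe)/;

my $pet_found     = ( 'my cat sleeps' =~ $just_a_cat );     # true
my $catalog_found = ( 'catalog'       =~ $just_a_cat );     # false

my $disaster   = ( 'catastrophe' =~ $safe_feline );         # false
my $harmless   = ( 'concatenate' =~ $safe_feline );         # true

print "safe feline found\n" if $harmless;
```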

The zero-width positive look-ahead assertion:

... matches the phrase cat only if the phrase astrophe immediately follows. This may seem useless, as a normal regular expression can accomplish the same thing, but consider if you want to find all non-catastrophic words in the dictionary which start with cat. One possibility is:

Because the assertion is zero-width, it consumes none of the source string. Thus the anchored .*\Z pattern fragment must be present; otherwise the capture would only capture the cat portion of the source string.
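One possible shape for such a search is sketched below. The exact pattern, the variable names, and the tiny word list are assumptions; the point is only to show that the look-ahead consumes nothing, so the capture needs `.*\Z` to reach the end of the word.

```perl
use strict;
use warnings;

# Starts with cat, is not "catastrophe...", and captures the whole word.
my $harmless_cat_word = qr/\A(?!catastrophe)(cat.*)\Z/;

# Hypothetical stand-in for a dictionary.
my @words = qw( cat catalog catastrophe caterpillar dog );

my @harmless = map { $_ =~ $harmless_cat_word ? $1 : () } @words;

# Without .*\Z, the capture stops after the three letters of "cat".
my ($short) = 'catalog' =~ /\A(?!catastrophe)(cat)/;

print "@harmless\n";
print "$short\n";
```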

Add section on Regex flags:

/i /g /s /m /x
