regex
Alexander Chernyshev edited this page Jun 16, 2022 · 3 revisions
Regular expressions, also known as RegEx or RegExp, are a standard language for text matching, used by a great many tools. If you work with text matching regularly, you should definitely learn it.
It is very easy to start using RegEx for basic things, but it is very powerful (and complicated) if you want to master it.
Some notes on RegEx to keep in mind:
- RegEx is greedy by default: it will match as much as possible if you permit it, often with unexpected results. Try to limit your patterns.
- RegEx uses anchors: special symbols that tell RegEx to match only at specific positions, such as word boundaries and line starts/ends. Anchors are a must for any complicated pattern.
- RegEx's main instrument is masks, which allow more than one specific symbol at a given place. Most likely you already know masks from other notations, like `*` as a mask for any number of any symbols in glob patterns and Windows file masks.
- RegEx allows quantifiers: how many of something you want to match.
- RegEx allows lookarounds (lookaheads and lookbehinds), positive or negative, which check a condition without including it in the match.
- RegEx allows defining groups, either positional or named, and referencing their values, either in the same pattern definition or in the substitution.
- RegEx uses many special characters. If you need to match one of those literally in text, escape it with `\`. For example, `|` means OR, but to match it literally you need to use `\|`.
- RegEx allows several ways to write the same pattern. If you can't write the pattern one way, just use one of the alternatives. Like, OTLoV filtering uses `\s`, for example, or just `.`.
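The greediness and escaping notes above can be tried out in a few lines. This is a minimal sketch using Python's `re` module, which is an assumption made for illustration: OTLoV uses the .NET flavor, and although these particular patterns should behave the same way there, the sample text and variable names are made up.

```python
import re

text = '<a href="x">first</a> and <a href="y">second</a>'

# Greedy: .* grabs as much as possible, so the match runs to the LAST </a>.
greedy = re.search(r"<a.*</a>", text).group(0)

# Lazy: .*? stops at the first place where the rest of the pattern can match.
lazy = re.search(r"<a.*?</a>", text).group(0)

# Limiting the pattern: [^<]* cannot cross into the next tag at all.
limited = re.findall(r">([^<]*)</a>", text)

# Escaping: an unescaped | means OR, an escaped \| matches a literal pipe.
pipe_or = re.findall(r"a|b", "a|b")
pipe_literal = re.findall(r"a\|b", "a|b")

print(greedy)        # the whole text
print(lazy)          # only the first link
print(limited)       # the two link texts
print(pipe_or, pipe_literal)
```

Limiting the character set (`[^<]*`) is usually easier to reason about than relying on lazy quantifiers.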
There are numerous RegEx cheat sheets; just search for a few and use whichever you like.
There are great services that help you understand RegEx better and learn to use it. I prefer regex101.com (you need to select the .NET flavor there).
A few of the most commonly used basic patterns:
Pattern | Meaning |
---|---|
`.` | Any one symbol (except a newline, unless allowed by options) |
`\s` | Whitespace symbol (space, tab, newline, etc.). Lowercase `s`. |
`\S` | Not a whitespace symbol. Uppercase `S`. |
`\w` | Word character (letter, digit, underscore) |
`[a-z]` | Specific set of characters, lowercase `a` to `z` in this case (RegEx is case sensitive by default, but OTLoV uses a default option to make it case insensitive; this can be changed) |
`[^:]` | Any character except `:` |
`.*` | `*` is a quantifier: any number (0 or more) of the element before it. Here the preceding `.` denotes any character, so `.*` means any number of any characters. |
`\w+` | `+` is a quantifier: one or more of the element before it. Here the preceding `\w` means a word character, so `\w+` means one or more word characters. |
`[a-zA-Z_]+` | Similar to `\w+`, written as an explicit set, but narrower: it does not include digits. |
`fail(ure)?` | `?` is a quantifier: zero or one of the element before it. Here the group `(ure)` comes before `?` and must match the characters `ure`. So `fail(ure)?` will match `fail` and `failure` (as there are no anchors here, the pattern will also match `fail` inside `failed`, for example). |
`^` | Anchor matching the start of the line. For multiline text the behavior depends on options. |
`\n` | Matches the newline character (OTLoV converts all line breaks to `\n` only). |
`$` | Anchor matching the end of the line. For multiline text the behavior depends on options. |
`\d{1,5}` | `\d` means a decimal digit, `{1,5}` is a quantifier: 1 to 5 occurrences. So `\d{1,5}` means from 1 to 5 digits. |
`ab\|ba` | OR pattern: either `ab` or `ba`. |
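A few of the basic patterns above, checked against short strings. This is a rough sketch in Python's `re` module (an assumption for illustration; OTLoV itself uses the .NET flavor, where options are set differently), and all sample strings are made up.

```python
import re

# \S+ : runs of non-whitespace characters, split on spaces and tabs.
tokens = re.findall(r"\S+", "disk  full\terror")

# [^:]* : everything up to (but not including) the first colon.
host = re.match(r"[^:]*", "host:8080").group(0)

# fail(ure)? without anchors also matches the "fail" inside "failed".
# (?:...) is used so findall returns the whole match, not the group.
fails = re.findall(r"fail(?:ure)?", "fail failure failed")

# \d{1,5} : at most five digits per match, so seven digits give two matches.
digits = re.findall(r"\d{1,5}", "id 1234567")

# ^ and $ match at every line only when the multiline option is on.
lines_multi = re.findall(r"^\w+$", "one\ntwo words\nthree", re.MULTILINE)
lines_single = re.findall(r"^\w+$", "one\ntwo words\nthree")

# \w includes digits, while [a-zA-Z_] does not.
word_chars = re.findall(r"\w+", "err42")
letters_only = re.findall(r"[a-zA-Z_]+", "err42")

# Case-insensitive matching has to be requested explicitly in Python.
case_insensitive = re.findall(r"[a-z]+", "Error", re.IGNORECASE)

print(tokens, host, fails, digits)
print(lines_multi, lines_single)
print(word_chars, letters_only, case_insensitive)
```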
More difficult examples:
Pattern | Meaning |
---|---|
`\b(OK\|KO)\b` | `\b` is a word boundary (a `\w` character on one side only). `(...)` is a group. `OK\|KO` is an OR pattern: either `OK` or `KO`. `\b(OK\|KO)\b` will match the standalone words `OK` or `KO`, but not those characters inside a word (say, `OKKIE` won't be matched, as there is no word boundary after `OK`). |
`(?<!no )fail` | `(?<!...)` is a negative lookbehind: the pattern denoted by `...` must NOT appear immediately before. `fail` matches `fail` literally. `(?<!no )fail` means: match `fail` if there is no `no ` before it, i.e. `no fail` won't match, while `fail` will be matched. |
`\d*[1-9]\d*` | A number that does not consist only of zeroes. |
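The three patterns above can be verified with a short script. A minimal sketch, again assuming Python's `re` module rather than the .NET flavor used by OTLoV; the test strings are invented for the example.

```python
import re

# \b(OK|KO)\b : only standalone words match; the OK inside OKKIE is skipped.
words = re.findall(r"\b(OK|KO)\b", "OK OKKIE KO")

# (?<!no )fail : the first "fail" is preceded by "no " and is rejected,
# the second one is accepted. We record where each match starts.
fail_positions = [m.start() for m in re.finditer(r"(?<!no )fail", "no fail, fail")]

# \d*[1-9]\d* : a whole string of digits that contains at least one non-zero digit.
candidates = ["0040", "000", "7", "0"]
nonzero = [s for s in candidates if re.fullmatch(r"\d*[1-9]\d*", s)]

print(words)
print(fail_positions)
print(nonzero)
```

`re.fullmatch` is used for the last pattern because the pattern itself has no anchors; with a plain search you would add `^` and `$` (or `\b`) around it.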
To use matched values in replacements (`Replacement`):
Pattern | Meaning |
---|---|
`$0` | Full text matched by the RegEx pattern |
`$1` | Text matched by the first RegEx group (groups are defined with parentheses and numbered from left to right; non-capturing groups, written as `(?:...)`, are skipped) |
`$2` | Value of the second RegEx group |
`${name}` | Value of the named group `name`, defined as `(?<name>...)` |
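To see group references in a replacement at work, here is a small sketch in Python. Note the assumption loudly: Python's `re` module spells these references differently from the .NET syntax in the table. It uses `\g<0>` instead of `$0`, `\1` instead of `$1`, `\g<name>` instead of `${name}`, and `(?P<name>...)` to define a named group. The pattern and sample text are made up.

```python
import re

# One named group (key) and one positional group (the number).
# In Python the named group is also counted, so key is group 1 here.
pattern = re.compile(r"(?P<key>\w+)=(\d+)")
text = "retries=3 timeout=30"

# Whole match (the $0 idea): wrap every match in brackets.
wrapped = pattern.sub(r"[\g<0>]", text)

# Numbered groups (the $1 / $2 idea): swap value and key.
swapped = pattern.sub(r"\2:\1", text)

# Named group (the ${name} idea): keep only the key.
keys_only = pattern.sub(r"\g<key>", text)

# A non-capturing group (?:...) is skipped when groups are numbered,
# so the digits are still group 1.
skipped = re.sub(r"(?:id)-(\d+)", r"#\1", "id-42 id-7")

print(wrapped)
print(swapped)
print(keys_only)
print(skipped)
```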