Retina is a regex-based programming language. It's main feature is taking some text via standard input and repeatedly applying regex operations to it (e.g. matching, splitting, and most of all replacing). Under the hood, it uses .NET's regex engine, which means that both the .NET flavour and the ECMAScript flavour are available.
Retina was mainly developed for Code golf which may explain its very terse configuration syntax and some weird design decisions.
There is an up-to-date Windows binary of Retina in the root directory of the repository. Alternatively, you can build it yourself from the C# sources. The code requires .NET 4.5. I do not regularly test Retina with Mono, but previous versions have worked without problems.
Each program is grouped into stages. A stage consists of one or two parts depending on its type (or mode). Each stage transforms its input (string) to an output string which is passed to the next stage. The first stage reads from the standard input stream. Each stage may optionally print its result to the standard output stream in addition to passing it on. By default, only the very last stage prints its result. Stages can also be grouped into loops.
By default, Retina reads a single source file, where each line is a separate "part" (so each stage is one or two lines):
Retina ./program.ret However, escape sequences exist (`\n` or `$n` depending on whether you're in a regex or a substitution). If you actually want to use multiple lines for each parts (e.g. to split a complicated regex over multiple lines in free-spacing mode), you can use the `-m` flag. In this case, each part is either given as a separate file, or directly on the command line, by using the `-e` flag. So all of the following are valid invocations:
Retina -m ./pattern.rgx
Retina -m -e "foo.*"
Retina -m ./pattern.rgx ./replacement.rpl
Retina -m ./pattern.rgx -e "bar"
Retina -m -e "foo.*" ./replacement.rpl
Retina -m -e "foo.*" -e "bar"
Retina -m ./pattern1.rgx ./replacement1.rpl ./pattern2.rgx
Retina -m ./pattern1.rgx -e "bar" -e "foo*"
Whenever Retina reads a file, it will try to open the file as either UTF-8 or UTF-32. When neither of those is valid, it will open the file as ISO 8859-1 instead. This gives a way to use an encoding where each byte is a single character without having to specify the encoding manually.
- Since linefeeds separate parts, no part can contain a literal linefeed character. However, Retina will replace all pilcrows (
¶
) with linefeeds after the file has been split into parts. When using ISO 8859-1 encoding, this gives a single-byte way to include linefeeds in the source code. In the unlikely event that you actually need pilcrows in your code and want to use single-line mode, this substitution can be deactivated with the-P
flag. - Retina expects the file to use Unix-style line endings (a single linefeed character, 0x0A). If the file uses Windows-style line endings (
\r\n
, carriage return 0x0D followed by linefeed), the carriage returns will be contained at the ends of the parts.
Regardless of whether a stage consists of one or two parts, the first one will be the pattern. First and foremost, this will contain the regex to be used. However, if the file contains at least one backtick (`
), the code before the first backtick will be used to configure the stage - let's call this the configuration string. As an example, the pattern file
_Ss`a.
configures Retina with _Ss
(more on that later) and defines the regex as a.
. Any further backticks are simply part of the regex. If you want to use backticks in your regex, but do not want to configure Retina, just start your pattern file with a single backtick (which translates to an empty configuration).
Most types of stages only use a single file.
If the stage operates in Replace mode, it consists of two parts, where the first is a pattern as described above, and the second is the replacement string. You can use all the usual references to capturing groups like $n
. If no replacement part is supplied (but Replace mode is enforced), the replacement string is assumed to be the empty string.
Currently, the configuration string simply is a (mostly) unordered bunch of characters, which switches between different options, modes and flags. This may get more complicated in the future. If multiple conflicting options are used, the latter option will override the former.
Some characters are available in all modes, some only in specific ones. Mode-specific options are denoted by non-alphanumeric characters and are mentioned below when the individual modes are discussed.
All regex modifiers available in .NET (through RegexOptions
), except Compiled
are available through the configuration string. This means you don't have use inline modifiers like (?m)
in the regex (although you can). All regex modifiers are available through lower case letters:
c
: Is forCultureInvariant
. Quoting MSDN: "Specifies that cultural differences in language is ignored. For more information, see the "Comparison Using the Invariant Culture" section in the Regular Expression Options topic."e
: ActivatesECMAScript
mode and essentially changes the regex flavour. Some of the other modifiers don't work in combination with this.i
: Makes the pattern case-insensitive.m
: ActivatesMultiline
mode, which makes^
and$
match the beginning and end of lines, respectively, in addition to the beginning and end of the entire input.n
: ActivatesExplicitCapture
mode. Quoting MSDN: "Specifies that the only valid captures are explicitly named or numbered groups of the form (?…). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…). For more information, see the "Explicit Captures Only" section in the Regular Expression Options topic."r
: ActivatesRightToLeft
mode. For more information see MSDN.s
: ActivatesSingleline
mode, which makes.
match newline characters.x
: Activates free-spacing mode, in which all whitespace in the pattern is ignored (unless escaped), and single-line comments can be written with#
.
The default mode for a stage depends on whether there are any further parts in the program after the current pattern: if this pattern is the last part of the program, the stage defaults to Match mode. Otherwise it defaults to Replace mode. The following upper-case letters in the configuration string can override these defaults:
M
: Match mode.R
: Replace modeS
: Split mode.G
: Grep mode.A
: AntiGrep mode.T
: Transliterate mode.
The following options apply to all modes:
(
and)
:(
opens a loop and)
closes a loop. All stages between(
and)
(inclusive) will be repeated in a loop until an iteration doesn't change the result. Note that(
and)
may appear in the same stage, looping only that stage. Also, the order of(
and)
within a single stage is irrelevant - they are always treated as if all(
appear before all)
. Furthermore, an unmatched)
assumes a(
in the first stage, and an unmatched(
assumes a)
in the last stage. Loops can be nested. These options makes Retina Turing-complete (see below for details).+
: Short-hand for()
.;
and:
: Turn Silent mode off and on, respectively (this determines whether the result of a stage or loop is printed to the standard output stream). Every stage and every loop has a separate silent flag. All but the last outermost loop are silent by default. If a loop has been closed with)
or+
in the current configuration string before the;
or:
, the silent mode of that (last closed) loop is affected. Otherwise, the stage's own silent mode is affected. This means that(:)
defines a stage which is looped and prints its result on each iteration, but()+
is a stage which is looped but only prints the result once the loop terminates. By default non-silent stages always print a trailing linefeed. This is a rare case where the order of options in the configuration string is not arbitrary.\
: If the current stage is not silent, suppress its trailing linefeed. Like;
and:
it's relative position with respect to)
or+
matters.
As outlined above, Retina currently supports 6 different operation modes: Match, Replace, Split, Grep, AntiGrep, Transliterate.
This mode takes the regex with its modifiers and applies it to the input. By default, the result is the number of matches.
Match mode currently supports the following options:
!
: Instead of the number of matches, the result is a newline-separated list of all matches.&
: Consider overlapping matches. Normally, the regex engine will start looking for the next match after the end of the previous match. With this mode, Retina will instead look for the next match after the beginning of the previous match. Note that this will still not take into account overlapping matches which start at the same position (but end at different positions.)
Ultimately, this mode will probably receive the most elaborate configuration scheme, in order to print capturing groups or other information about the match.
Replace mode does what it says on the tin: it replaces all matches of the regex in the input with the replacement string, and returns the result. Replace mode currently doesn't have any dedicated options, but will probably get at least one more option in the future: limiting the number of replacements done per iteration (e.g. replace only the first match).
Replace mode does not use .NET's built-in Regex.Replace
, but a custom implementation instead. All of the substitution elements $...
that are valid in .NET also work in Replace mode. However, Retina understands the following additional substitution elements:
$n
: Is an escape sequence for a linefeed character (0x0A), much like\n
inside a regex.
This passes the regex on to Regex.Split
. The result of Regex.Split
separated by newlines is the result of this stage. This means that you can use capturing groups to include parts of the matches in the output.
Split mode comes with one additional option:
_
: Filter all empty strings out of the result before printing.
I might add more configuration options in order to control the output format or to limit the number of splits in the future.
Grep mode makes Retina assume grep's basic mode of operation: the input is split into individual lines, the regex is matched against each line, and the result consists of all lines that yielded a match.
AntiGrep mode is almost the same, except that it prints all lines which didn't yield a match.
This mode is rather versatile mode, which is mainly intended for substituting individual characters for others. Its format is slightly different from the other modes, in that it consists of up to four segments on a single line. All of the following are valid Transliterate stages:
config`from
config`from`to
config`from`to`regex
Here, config
is just the regular configuration string, which needs to contain T
to select this mode. If to
is not given it is assumed to be an empty string. If regex
is not given or empty it is assumed to be [\s\S]+
, which always matches the entire input.
First, both from
and to
are expanded to lists of characters, using the following rules (reading left to right):
- The following escape sequences are known (only for convenience - you can also embed the characters directly in the source code):
\a
(bell, 0x07),\b
(backspace, 0x08),\f
(form feed,0x0C
),\n
(line feed,0x0A
),\r
(carriage return,0x0D
),\t
(tab,0x09
),\v
(vertical tab,0x0B
). If these are encountered, they are added to the character list. - If a
\
is encountered which is not followed by one of these characters, the next character itself is added to the list (this let's you add, e.g.,`
to the list). If there is a single\
left at the end of the stage, it is also added to the list literally. - If a
-
is encountered before the end of the stage and a single character was added to the list just before this, it denotes a range of characters. Ranges can be ascending or descending (they can also be degenerate, e.g.a-a
). Note that the-
also acts as an escape for the next character, which means thatA-`
is a range and does not end the current part. However, the above special escape sequences are not known, so they cannot be used at the right end of a range:a-\n
would add a range froma
to\
and the single charactern
. However\n-a
would be a range from the line feed (0x0A) toa
. d
is equivalent to0-9
.w
is equivalent to_0-9A-Za-z
.- In any case not covered above, the encountered character is just added to the list literally.
If the expanded to
list is shorter than the from
list, it will be padded to the same length by repeating its last character. That is, the following specifications are equivalent:
T`d`abc
T`0123456789`abcccccccc
After this preprocessing has been done, the regex is applied to the input string. Parts which aren't matched are left unchanged, but in the matches, every character found in from
is replaced by the character at the same position in to
. Characters not found in from
are left unchanged as well. If to
is empty, the characters in from
are removed from the matches instead.
This mode allows simple character transformations which would otherwise require listing every character separately. Some examples:
Tx`A-Za-z`a-zA-Z # Swap the case of all ASCII letters in the input.
Tx`w`_da-zA-Z # Same, but three characters shorter.
Tx`a-z`A-Z`\b. # Capitalise the first letter of each word.
Tx`A-Z`N-ZA-M # ROT-13.
Tx`A-Z`Z-A # Atbash cypher.
Transliterate mode currently doesn't have any dedicated options.
While the .NET-flavour itself is just short of being Turing-complete (as far as I know), Retina is, thanks to repeated Replace mode via the +
option. As an example, I've implemented a 2-Tag System "interpreter" in Retina.
Pattern file:
+`^(.).(\w*)(?=\|.*\1>(\w*))|^(?<2>\w+).*
Replacement file:
$2$3
With this Retina code in place, an arbitrary tag system can be supplied via STDIN, as long as its alphabet consists only of alphanumerical characters and underscores (this limitation could easily be removed). The input must consist of the initial word, as well as the set of production rules available for the tag system. For instance, the first example system given on Wikipedia, would be written as
baa|a>ccbaH,b>cca,c>cc
Note that the pattern is hardcoded to 2-tag systems, although it would also be possible to generalise this to n
-tag systems, where n
could be encoded in unary in the input string.