Raw text annotators
Many input texts have sections that need to be parsed and other sections that should be skipped or replaced. For example, in XML, it may only be necessary to parse text contained between certain tags, and to skip certain tags inside that text.
The simplest way to do this is by applying regular expression filters to the raw text, either with the --textAnnotators option (a semicolon-delimited list of paths) or by adding the file path to the talismane.core.annotators.text-annotators key in the configuration file. For example, the configuration file might look like this:
    talismane {
      core {
        ...
        annotators {
          text-annotators = [
            "path/to/my_annotators.txt"
            ${languagePack}"text_marker_filters.txt"
          ]
          ...
The textAnnotators file has the following tab-delimited format per line:
FilterType Markers Regex GroupNumber* Replacement
The meaning of these fields is given below:
- FilterType: currently the only allowable value is RegexMarkerFilter
- Markers: a comma-delimited list of markers to be applied by this filter. See marker types below.
- Regex: the regular expression to be found in the text. This follows the standard Java format.
- GroupNumber*: the group number within the regular expression that indicates the actual text to be marked. This is useful when some context is required to determine which text needs to be marked, but the context itself should not be marked. More information about groups can be found in the Groups and Capturing section of the Java Pattern class documentation.
- Replacement: Only required if the marker type is REPLACE (the FilterType itself is always RegexMarkerFilter). As in Java Patterns, it can include placeholders $1, $2, etc., which get filled in from the groups in the regex.
Default group: By default, if the GroupNumber parameter is omitted, Talismane marks the entire expression, even when the regex contains groups (marked by parentheses). To select a specific group for marking, explicitly enter the group number. Groups are numbered starting at 1 in the order in which their opening parentheses appear, with group 0 always referring to the entire expression.
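The difference between group 0 and an explicit group can be checked directly with java.util.regex. The following is a minimal sketch, not Talismane code: the class `GroupDemo` and its helper methods are hypothetical names, and the sample text is invented. It uses the pattern from the "Figure" example filter, where the leading newline serves only as context.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Talismane.
class GroupDemo {
    /** Returns {start, end} offsets of the given group in the first match, or null if nothing matches. */
    static int[] markedSpan(String regex, int groupNumber, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(groupNumber), m.end(groupNumber) };
    }

    /** Returns the text covered by the given group in the first match, or null if nothing matches. */
    static String markedText(String regex, int groupNumber, String text) {
        int[] span = markedSpan(regex, groupNumber, text);
        return span == null ? null : text.substring(span[0], span[1]);
    }

    public static void main(String[] args) {
        String text = "Some text.\nFigure 2: A caption.";
        String regex = "\\n(Figure \\d+:)";

        // Group 0 is the whole match, including the newline used as context.
        int[] whole = markedSpan(regex, 0, text);
        // Group 1 is only the parenthesised label.
        int[] label = markedSpan(regex, 1, text);

        System.out.println(whole[0] + "-" + whole[1]);   // prints 10-20
        System.out.println(label[0] + "-" + label[1]);   // prints 11-20
        System.out.println(markedText(regex, 1, text));  // prints Figure 2:
    }
}
```

With group 1 selected, the marked span starts one character later than with group 0, so the newline is used for matching but left unmarked.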
Below is a table of allowable markers. Markers are either stack-based or unary. Stack-based markers mark both the beginning and end of a section of text, and can be nested. Unary markers apply a single action at a given point in the text: if unary markers (e.g. START and STOP markers) are placed inside an area marked by a stack-based marker, their action will only affect this area. For maximum robustness, the best strategy is to reserve stack-based markers for very short segments, and use unary markers instead of excessive nesting.
Marker type | Description |
---|---|
SKIP | Skip any text matching this filter (stack-based). |
INCLUDE | Include any text matching this filter (stack-based). |
OUTPUT | Skip any text matching this filter, and output its raw content in any output file produced by Talismane (stack-based). |
SENTENCE_BREAK | Insert a sentence break. |
SPACE | Replace the text with a space (unless the previous segment ends with a space already). Only applies if the current text is marked for processing. |
REPLACE | Replace the text with another text. Should only be used for encoding replacements which don't change meaning - e.g. replace `&eacute;` by "é". Only applies if the current text is marked for processing. |
STOP | Mark the beginning of a section to be skipped (without an explicit end). Note that the processing will stop at the beginning of the match. If this marker is placed inside an area marked by SKIP, INCLUDE or OUTPUT, it will only take effect within this area. It can be reversed by a START marker. |
START | Mark the beginning of a section to be processed (without an explicit end). Note that the processing will begin AFTER the end of the match. If this marker is placed inside an area marked by SKIP, INCLUDE or OUTPUT, it will only take effect within this area. It can be reversed by a STOP marker. |
OUTPUT_START | Mark the beginning of a section to be outputted (without an explicit end). Will only actually output if processing is stopped. Stopping needs to be marked separately (via a STOP marker). Note that the output will begin at the beginning of the match. If this marker is placed inside an area marked by OUTPUT, it will only take effect within this area. It can be reversed by an OUTPUT_STOP marker. |
OUTPUT_STOP | Mark the end of a section to be outputted (without an explicit beginning). Starting the processing needs to be marked separately. Note that the output will stop at the end of the match. If this marker is placed inside an area marked by OUTPUT, it will only take effect within this area. It can be reversed by an OUTPUT_START marker. |
The text marked for raw output will only be included if the output template explicitly includes it using the precedingRawOutput field (as is the case for the default templates).
Default behaviour for processing: By default, Talismane will assume that the input file/stream should be processed from the very beginning. If this is not the case (e.g. for an XML file), the user should set the parameter processByDefault=false.
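The interplay between processByDefault and the unary START and STOP markers can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not Talismane's implementation: the class `StartStopDemo` and the method `processedText` are hypothetical, only one START and one STOP regex are supported, stack-based markers are ignored, and a START found while already processing (or a STOP found while already stopped) is assumed to have no effect.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation for illustration; not part of Talismane.
class StartStopDemo {
    /** Returns the concatenation of all sections that would be marked for processing. */
    static String processedText(String text, String startRegex, String stopRegex,
                                boolean processByDefault) {
        // A single alternation lets both markers be visited in text order.
        Matcher m = Pattern.compile("(?<start>" + startRegex + ")|(?<stop>" + stopRegex + ")")
                .matcher(text);
        StringBuilder processed = new StringBuilder();
        boolean processing = processByDefault;
        int sectionStart = 0;
        while (m.find()) {
            if (m.group("start") != null) {
                if (!processing) {
                    // START: processing begins AFTER the end of the match.
                    processing = true;
                    sectionStart = m.end();
                }
            } else if (processing) {
                // STOP: processing stops at the BEGINNING of the match.
                processed.append(text, sectionStart, m.start());
                processing = false;
            }
        }
        if (processing) {
            // No closing STOP: process through to the end of the input.
            processed.append(text, sectionStart, text.length());
        }
        return processed.toString();
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Ignored</title><text>Hello world.</text>"
                + "<note>Ignored too</note></doc>";
        // With processByDefault=false, only the content between the markers remains.
        System.out.println(processedText(xml, "<text>", "</text>", false)); // prints Hello world.
    }
}
```

With processByDefault=true, the same input would also keep everything before the first STOP match, which is why XML input usually calls for processByDefault=false.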
RegexMarkerFilter SKIP <skip>.*</skip>
RegexMarkerFilter SKIP <b>
RegexMarkerFilter SKIP </b>
RegexMarkerFilter SKIP \n(Figure \d+:) 1
RegexMarkerFilter INCLUDE <text>(.*)</text> 1
RegexMarkerFilter OUTPUT <marker>.*</marker>
RegexMarkerFilter OUTPUT <marker>(.*)</marker> 1
RegexMarkerFilter SKIP,SENTENCE_BREAK (\r\n|[\r\n]){2} 0
RegexMarkerFilter SPACE [^-\r\n](\r\n|[\r\n]) 1
RegexMarkerFilter REPLACE &eacute; é
RegexMarkerFilter START <text>
RegexMarkerFilter STOP </text>
RegexMarkerFilter STOP,OUTPUT_START <marker>
RegexMarkerFilter START,OUTPUT_STOP </marker>
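Since the Replacement field uses Java-style placeholders, its effect can be previewed with Matcher.replaceAll before a filter is added to an annotators file. The sketch below is illustrative only: the class `ReplaceDemo` is a hypothetical name, the second pattern (LaTeX-style quotes) is an invented example, and it is an assumption that Talismane's REPLACE marker behaves like replaceAll in every detail.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Talismane.
class ReplaceDemo {
    /** Applies a regex replacement to the whole text, expanding $1, $2, ... from the groups. */
    static String replace(String regex, String replacement, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Plain encoding replacement: an HTML entity becomes the character it encodes.
        // "\u00e9" is the letter e with an acute accent, escaped to keep the source ASCII-only.
        System.out.println(replace("&eacute;", "\u00e9", "caf&eacute;"));

        // Replacement with a placeholder: $1 is filled in from the first group.
        System.out.println(replace("``(.*?)''", "\"$1\"", "He said ``hello'' twice."));
        // prints He said "hello" twice.
    }
}
```

A literal dollar sign or backslash in a replacement string needs escaping in Java (for example via Matcher.quoteReplacement); whether the same holds for an annotators file is not confirmed here.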