Raw text annotators

Assaf Urieli edited this page Mar 31, 2017 · 3 revisions

Many input texts have sections that need to be parsed and other sections that should be skipped or replaced. For example, in XML, it may only be necessary to parse text contained between certain tags, and to skip certain tags inside that text.

The simplest way to do this is to apply regular expression filters to the raw text, either by using the --textAnnotators option (a semicolon-delimited list of paths), or by adding the file path to the talismane.core.annotators.text-annotators key in the configuration file. For example, the configuration file might look like this:

talismane {
  core {
    ...
    annotators {
      text-annotators = [
        "path/to/my_annotators.txt"
        ${languagePack}"text_marker_filters.txt"
      ]

      ...

The textAnnotators file has the following tab-delimited format per line:

FilterType	Markers	Regex	GroupNumber*	Replacement

The meaning of these fields is given below:

  • FilterType: currently the only allowable value is RegexMarkerFilter
  • Markers: a comma-delimited list of markers to be applied by this filter. See marker types below.
  • Regex: the regular expression to be found in the text. This follows the standard Java format.
  • GroupNumber*: the group number within the regular expression that indicates the actual text to be marked. This is useful when some context is required to determine which text needs to be marked, but the context itself should not be marked. More information about groups can be found in the Groups and Capturing section of the Java Pattern class.
  • Replacement: Only required if the marker type is REPLACE. As in Java Patterns, it can include placeholders $1, $2, etc., which get filled in from the groups in the regex.
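
The placeholder behaviour of the Replacement field can be tried out in plain Java. The sketch below does not call Talismane; the class name `ReplacementDemo` and the quote-normalising regex are assumptions made for illustration, and it only shows how `$1` is filled in by the standard `java.util.regex` classes.

```java
import java.util.regex.Pattern;

class ReplacementDemo {
    // Applies a regex and a replacement template using standard Java semantics.
    static String replace(String regex, String replacement, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // A literal replacement with no placeholder (\u00e9 is the letter e-acute).
        System.out.println(replace("&eacute;", "\u00e9", "caf&eacute;"));
        // $1 is filled in from the first capturing group (hypothetical example regex).
        System.out.println(replace("``(.*?)''", "\"$1\"", "He said ``hello''"));
    }
}
```

Since `$` introduces a placeholder in the Java replacement syntax, a literal dollar sign in a replacement string has to be escaped as `\$`.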

Default group: By default, if the GroupNumber parameter is omitted, Talismane marks the entire expression, even if the regex contains groups (marked by parentheses). To select a specific group for marking, explicitly enter the group number. Groups are numbered starting at 1 in the order in which their opening parentheses appear, with group 0 always referring to the entire expression.
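
To see which text a given group number selects, the numbering rule can be checked with the standard Java regex classes. This is a minimal sketch, independent of Talismane; the class name `GroupDemo` is an assumption, and the sample regex is the "Figure 2:" one from the examples on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns the text of the given group for the first match, or null if nothing matches.
    static String groupText(String regex, int group, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String text = "intro\nFigure 2: a caption";
        // Group 0 is the whole match, including the newline used as context.
        System.out.println("[" + groupText("\\n(Figure \\d+:)", 0, text) + "]");
        // Group 1 is only the parenthesised part, without the newline.
        System.out.println("[" + groupText("\\n(Figure \\d+:)", 1, text) + "]");
    }
}
```

With nested parentheses such as `((a)(b))`, group 1 is the outer group, group 2 is `(a)` and group 3 is `(b)`, following the order of the opening parentheses.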

Below is a table of allowable markers. Markers are either stack-based or unary. Stack-based markers mark both the beginning and the end of a section of text, and can be nested. Unary markers apply a single action at a given point in the text: if unary markers (e.g. start and end markers) are placed inside an area marked by a stack-based marker, their action will only affect this area. For maximum robustness, the best strategy is to reserve stack-based markers for very short segments, and to use unary markers instead of excessive nesting.
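
The interaction between the two kinds of marker can be pictured with a toy model. The sketch below is not Talismane's implementation; the class `MarkerStackDemo` and its methods are hypothetical, and it assumes only what the paragraph above states: stack-based areas nest, and a unary marker changes the state of the innermost open area only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MarkerStackDemo {
    // One entry per open area; the top entry says whether text is currently processed.
    private final Deque<Boolean> stack = new ArrayDeque<>();

    MarkerStackDemo(boolean processByDefault) {
        stack.push(processByDefault);
    }

    // Stack-based marker: open a nested area with its own processing state.
    void openArea(boolean process) { stack.push(process); }

    // End of a stack-based area: the enclosing state becomes current again.
    void closeArea() { stack.pop(); }

    // Unary marker: change the state of the innermost area only.
    void unary(boolean process) { stack.pop(); stack.push(process); }

    boolean isProcessing() { return stack.peek(); }

    public static void main(String[] args) {
        MarkerStackDemo state = new MarkerStackDemo(true);
        state.openArea(true);                      // e.g. the start of an INCLUDE area
        state.unary(false);                        // e.g. a STOP inside that area
        System.out.println(state.isProcessing());  // state inside the area
        state.closeArea();                         // the end of the INCLUDE area
        System.out.println(state.isProcessing());  // state of the enclosing text
    }
}
```

In this model the STOP is forgotten as soon as the enclosing area closes, which is the sense in which a unary marker "only affects this area".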

Marker type	Description
SKIP	Skip any text matching this filter (stack-based).
INCLUDE	Include any text matching this filter (stack-based).
OUTPUT	Skip any text matching this filter, and output its raw content in any output file produced by Talismane (stack-based).
SENTENCE_BREAK	Insert a sentence break.
SPACE	Replace the text with a space (unless the previous segment already ends with a space). Only applies if the current text is marked for processing.
REPLACE	Replace the text with another text. Should only be used for encoding replacements which don't change meaning, e.g. replace "&eacute;" by "é". Only applies if the current text is marked for processing.
STOP	Mark the beginning of a section to be skipped (without an explicit end). Note that the processing will stop at the beginning of the match. If this marker is placed inside an area marked by SKIP, INCLUDE or OUTPUT, it will only take effect within this area. It can be reversed by a START marker.
START	Mark the beginning of a section to be processed (without an explicit end). Note that the processing will begin AFTER the end of the match. If this marker is placed inside an area marked by SKIP, INCLUDE or OUTPUT, it will only take effect within this area. It can be reversed by a STOP marker.
OUTPUT_START	Mark the beginning of a section to be outputted (without an explicit end). Will only actually output if processing is stopped. Stopping needs to be marked separately (via a STOP marker). Note that the output will begin at the beginning of the match. If this marker is placed inside an area marked by OUTPUT, it will only take effect within this area. It can be reversed by an OUTPUT_STOP marker.
OUTPUT_STOP	Mark the end of a section to be outputted (without an explicit beginning). Starting the processing needs to be marked separately (via a START marker). Note that the output will stop at the end of the match. If this marker is placed inside an area marked by OUTPUT, it will only take effect within this area. It can be reversed by an OUTPUT_START marker.
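
The match boundaries described for START and STOP can be illustrated with a small simulation. This sketch is not Talismane code; the class `StartStopDemo` is hypothetical, it handles a single START regex and a single STOP regex with no nesting, and it assumes that nothing is processed until the first START (as if processByDefault were false).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartStopDemo {
    // Returns the parts of the text that lie between a START match and the next STOP match.
    static String processed(String text, String startRegex, String stopRegex) {
        Matcher start = Pattern.compile(startRegex).matcher(text);
        Matcher stop = Pattern.compile(stopRegex).matcher(text);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos <= text.length() && start.find(pos)) {
            int from = start.end();                 // START: begin after the end of the match
            boolean stopped = stop.find(from);
            int to = stopped ? stop.start() : text.length(); // STOP: stop at the beginning of the match
            out.append(text, from, to);
            if (!stopped) {
                break;                              // no STOP found: processed to the end of the text
            }
            pos = Math.max(stop.end(), pos + 1);    // always move forward, even on empty matches
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<doc><text>Hello</text><meta>x</meta><text> world</text></doc>";
        System.out.println(processed(xml, "<text>", "</text>")); // prints "Hello world"
    }
}
```

Neither the `<text>` nor the `</text>` tag ends up in the processed text, because the START boundary is the end of its match and the STOP boundary is the beginning of its match.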

The text marked for raw output will only be included if the output template explicitly includes it using the precedingRawOutput field (as is the case for the default templates).

Default behaviour for processing: By default, Talismane will assume that the input file/stream should be processed from the very beginning. If this is not the case (e.g. for an XML file), the user should set the parameter processByDefault=false.

Text marker filter examples

  • To skip the XML tag <skip> and its contents:
    RegexMarkerFilter	SKIP	<skip>.*</skip>
  • To skip the XML tag <b>, but not its contents:
    RegexMarkerFilter	SKIP	<b>
    RegexMarkerFilter	SKIP	</b>
  • To skip the text "Figure 2:" at the beginning of a paragraph. Note that the "\n" character is used to enforce the start-of-paragraph constraint, and we explicitly indicate that the group to be marked is group 1:
    RegexMarkerFilter	SKIP	\n(Figure \d+:)	1
  • To include the text INSIDE the XML tag <text>:
    RegexMarkerFilter	INCLUDE	<text>(.*)</text>	1
  • To mark the text inside the XML tag <marker> for output along with Talismane's analysis. Note that the output will be placed just prior to the token closest to the marker, and that the marker tag itself is included in the output:
    RegexMarkerFilter	OUTPUT	<marker>.*</marker>
  • Same as above, but excluding the marker tag itself from the output:
    RegexMarkerFilter	OUTPUT	<marker>(.*)</marker>	1
  • To replace a double newline by a sentence break, and a single newline by a space, except when the line ends with a hyphen. The regexes below handle Unix, Windows and classic Mac OS (version 9 and earlier) newline characters; the atomic group (?>...) in the first one prevents a single Windows \r\n from being matched as two newlines:
    RegexMarkerFilter	SKIP,SENTENCE_BREAK	(?>\r\n|[\r\n]){2}	0
    RegexMarkerFilter	SPACE	[^-\r\n](\r\n|[\r\n])	1
  • To replace any occurrence of &eacute; by é:
    RegexMarkerFilter	REPLACE	&eacute;	é
  • To start processing when we reach the XML tag <text>:
    RegexMarkerFilter	START	<text>
  • To stop processing when we reach the XML tag </text>:
    RegexMarkerFilter	STOP	</text>
  • To start outputting raw text when we reach the XML tag <marker>:
    RegexMarkerFilter	STOP,OUTPUT_START	<marker>
  • To stop outputting raw text when we reach the XML tag </marker>:
    RegexMarkerFilter	START,OUTPUT_STOP	</marker>
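
The newline regexes in the list above can be checked outside Talismane before they go into an annotators file. The following is a rough sketch under stated assumptions: the class `NewlineDemo`, the helper `replaceGroup` and the visible `<SB>` stand-in for a sentence break are all hypothetical, and the two rules are applied one after the other as plain string replacements, which simplifies how markers actually work. It uses an atomic-group form of the double-newline regex, `(?>\r\n|[\r\n]){2}`, so that backtracking cannot split a single Windows `\r\n` into two line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewlineDemo {
    // Replaces only the given group of each match, leaving the rest of the match in place.
    static String replaceGroup(String text, String regex, int group, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start(group)).append(replacement);
            last = m.end(group);
        }
        return out.append(text, last, text.length()).toString();
    }

    static String normalise(String text) {
        // Two consecutive line breaks: stand-in for SKIP,SENTENCE_BREAK on group 0.
        String step1 = replaceGroup(text, "(?>\\r\\n|[\\r\\n]){2}", 0, "<SB>");
        // One line break not preceded by a hyphen: stand-in for SPACE on group 1.
        return replaceGroup(step1, "[^-\\r\\n](\\r\\n|[\\r\\n])", 1, " ");
    }

    public static void main(String[] args) {
        System.out.println(normalise("one\r\ntwo\n\nthree")); // prints "one two<SB>three"
    }
}
```

A line ending in a hyphen, such as "hy-\nphen", is left untouched by both rules, because the character class in the second regex refuses a hyphen immediately before the line break.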