
JCoRe File Collection Reader

JCoRe File Reader for reading in text files.

Descriptor Path:

de.julielab.jcore.reader.file.desc.jcore-file-reader

Objective

This collection reader reads text files and provides them to UIMA for further processing.

Requirement and Dependencies

The input and output of an AE are handled via annotation objects. The classes corresponding to these objects are part of the JCoRe Type System.

Using the CR - Descriptor Configuration

In UIMA, each component is configured by an XML descriptor. A preconfigured descriptor is available under src/main/resources/de/julielab/jcore/ and can be edited further if desired; see the UIMA SDK User's Guide for more information.

1. Parameters

| Parameter Name | Parameter Type | Mandatory | Multivalued | Description |
|---|---|---|---|---|
| InputDirectory | String | yes | no | Directory where the text files reside. |
| UseFilenameAsDocId | Boolean | no | no | If set to true, the file name (without extension) is used as the document id. |
| PublicationDatesAsFile | String | no | no | A file that maps document ids to publication dates. |
| ReadSubDirs | Boolean | no | no | If set to true, all subdirectories of the InputDirectory are read. |
| FileNameSplitUnderscore | Boolean | no | no | Only used in conjunction with "`UseFilenameAsDocId`": if set to true, the file name is additionally split at underscores ("`_`") when determining the document id. |
| AllowedFileExtensions | String | no | yes | A list of file extensions that should be read. If empty, all files are read. |
| OriginalFolder | String | no | no | Path to the folder where the "original" files reside. [1] |
| OriginalFileExt | String | no | no | File extension of the "original" files. [1] |
| SentencePerLine | Boolean | no | no | If true, `Sentence` annotations are stored in the `CAS` according to a "one line, one sentence" format. [1] |
| TokenByToken | Boolean | no | no | If true, `Token` annotations are stored in the `CAS`, where every whitespace-separated "entity" in the document is one token. [1] |
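
To illustrate how `UseFilenameAsDocId` and `FileNameSplitUnderscore` might interact, here is a minimal sketch in plain Java. It is not the reader's actual implementation: the class `DocIdSketch` and the method `docIdFromFileName` are hypothetical, and the sketch assumes that the extension starts at the last dot and that the underscore split keeps only the part before the first underscore.

```java
final class DocIdSketch {

    /**
     * Derives a document id from a file name (hypothetical helper).
     * Assumptions: the extension starts at the last '.', and with
     * splitOnUnderscore the id ends at the first '_' before the extension.
     */
    static String docIdFromFileName(String fileName, boolean splitOnUnderscore) {
        int end = fileName.lastIndexOf('.');
        if (end <= 0) {
            end = fileName.length(); // no extension (or a leading dot only)
        }
        if (splitOnUnderscore) {
            int underscore = fileName.indexOf('_');
            if (underscore >= 0 && underscore < end) {
                end = underscore;
            }
        }
        return fileName.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(docIdFromFileName("12345.txt", false));          // 12345
        System.out.println(docIdFromFileName("12345_abstract.txt", false)); // 12345_abstract
        System.out.println(docIdFromFileName("12345_abstract.txt", true));  // 12345
    }
}
```

Under these assumptions, the underscore split is useful when several files for the same document share an id prefix, such as `12345_abstract.txt` and `12345_title.txt`.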

[1] The last four parameters (OriginalFolder, OriginalFileExt, SentencePerLine, TokenByToken) are best used in conjunction with each other. For instance, suppose you have free-text documents and, alongside them, versions of the same documents that are structured so that each sentence is on its own line and tokens are separated by whitespace. If the document text in the CAS should look like the "original" free-text file rather than the structured version, set these parameters accordingly.
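
The "one line, one sentence" and whitespace-token conventions described above can be sketched as simple offset computations. The following plain-Java example is an illustration only, not the reader's code: the class `LineTokenSketch` and its methods are hypothetical, lines are assumed to end with `\n`, and mapping the resulting offsets onto the text of the "original" file is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class LineTokenSketch {

    /** Returns {begin, end} character offsets (end exclusive) of every non-blank line. */
    static List<int[]> sentenceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int begin = 0;
        while (begin <= text.length()) {
            int end = text.indexOf('\n', begin);
            if (end < 0) {
                end = text.length(); // last line without trailing newline
            }
            if (!text.substring(begin, end).trim().isEmpty()) {
                spans.add(new int[] {begin, end});
            }
            begin = end + 1;
        }
        return spans;
    }

    /** Returns {begin, end} offsets (end exclusive) of every whitespace-separated token. */
    static List<int[]> tokenSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++; // skip whitespace between tokens
            }
            int begin = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++; // consume the token
            }
            if (i > begin) {
                spans.add(new int[] {begin, i});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "JCoRe reads files .\nIt is fast .";
        for (int[] s : sentenceSpans(text)) {
            System.out.println("Sentence " + s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
        for (int[] t : tokenSpans(text)) {
            System.out.println("Token " + t[0] + "-" + t[1] + ": " + text.substring(t[0], t[1]));
        }
    }
}
```

In a UIMA component, spans like these would typically become the begin and end offsets of `Sentence` and `Token` annotations.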

2. Predefined Settings

| Parameter Name | Parameter Syntax | Example |
|---|---|---|
| InputDirectory | valid path to the files to read in | `data/files/` |
| UseFilenameAsDocId | Boolean | `false` |
| PublicationDatesAsFile | valid path to the file that maps document ids to publication dates | `data/publicationDates` |
| ReadSubDirs | Boolean | `false` |
| FileNameSplitUnderscore | Boolean | `false` |
| AllowedFileExtensions | String (multivalued) | `empty` |
| OriginalFolder | valid path to the "original" files | `none` |
| OriginalFileExt | file extension as a string | `txt` |
| SentencePerLine | Boolean | `false` |
| TokenByToken | Boolean | `false` |
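
The effect of the `ReadSubDirs` and `AllowedFileExtensions` settings on file selection can be sketched in plain Java as follows. This is a sketch under assumptions, not the reader's implementation: the class `FileSelectionSketch` and its methods are hypothetical, subdirectories are assumed to be traversed recursively, and extensions are assumed to be given without a leading dot.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FileSelectionSketch {

    /** Lists regular files below inputDirectory; an empty extension set accepts all files. */
    static List<Path> selectFiles(Path inputDirectory, boolean readSubDirs,
                                  Set<String> allowedExtensions) throws IOException {
        // Depth 1 visits only the directory's direct children.
        int depth = readSubDirs ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(inputDirectory, depth)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> allowedExtensions.isEmpty()
                            || allowedExtensions.contains(extensionOf(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Returns the text after the last '.', or "" if the name has no dot. */
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway directory tree for the demonstration.
        Path dir = Files.createTempDirectory("file-selection-sketch");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(dir.resolve("b.xml"));
        Path sub = Files.createDirectory(dir.resolve("sub"));
        Files.createFile(sub.resolve("c.txt"));

        // Defaults from the table: ReadSubDirs = false, AllowedFileExtensions empty.
        System.out.println(selectFiles(dir, false, Collections.<String>emptySet()).size()); // 2
        // Recursive traversal restricted to .txt files.
        Set<String> txtOnly = new HashSet<>(Arrays.asList("txt"));
        System.out.println(selectFiles(dir, true, txtOnly).size()); // 2
    }
}
```

With the defaults, only files directly inside `InputDirectory` are selected, regardless of extension.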

3. Capabilities

| Type | Input | Output |
|---|---|---|
| de.julielab.jcore.types.Date | | `+` |
| de.julielab.jcore.types.pubmed.Header | | `+` |
| de.julielab.jcore.types.Sentence | | `+` |
| de.julielab.jcore.types.Token | | `+` |
