This basically amounts to adding a new document parser so that the documents can be recognized and indexed. The collection class does two things:
- Walk through the input directory to find the files based on some rules, e.g. files that end with .gz
- Read the file in order to find and return documents
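As a rough illustration of the first responsibility, the sketch below walks a directory and keeps only files whose names end with an allowed suffix. It is not Anserini's `discoverFiles` implementation: the class `FileDiscoverySketch` and the method `discover` are hypothetical names used only for this example, and only the suffix rule is modeled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not Anserini code: shows the "walk and filter" idea only.
class FileDiscoverySketch {

  // Returns the regular files under root whose names end with an allowed suffix.
  static List<Path> discover(Path root, Set<String> allowedFileSuffix) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> {
            String name = p.getFileName().toString();
            return allowedFileSuffix.stream().anyMatch(name::endsWith);
          })
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a tiny throwaway corpus: one matching file, one non-matching file.
    Path root = Files.createTempDirectory("corpus");
    Files.createFile(root.resolve("part-00.gz"));
    Files.createFile(root.resolve("README.txt"));
    for (Path p : discover(root, Collections.singleton(".gz"))) {
      System.out.println(p.getFileName());
    }
  }
}
```

A real collection would also apply the prefix and directory rules described in the steps below; they are left out here to keep the sketch short.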
The detailed steps are:
- Add a new class under the package `io.anserini.index.collections`. This class should extend the `Collection` class. Its name should be something like `MyOwnCollection`, where `MyOwn` is the name of your collection. The class will be instantiated as `c = (Collection)Class.forName("io.anserini.index.collections."+collectionClass+"Collection").newInstance();`
- In the constructor, define your own `skippedFilePrefix`, `allowedFilePrefix`, `skippedFileSuffix`, `allowedFileSuffix`, and `skippedDirs`. `discoverFiles` relies on these sets to decide which files and folders to include or exclude.
- Override the functions `prepareInput` and `finishInput`. `prepareInput` takes a file path as its argument, and you can initialize the `BufferedReader` (or something like that) there. `finishInput` is called after the file has been processed, and you can close the `BufferedReader` (or something like that) there.
- Add a new record reader under the package `io.anserini.document`. This class should extend the `Indexable` class. Typically, the function `next` in `Collection` (since `Collection` implements `Iterator`) can call a function in the record reader to read one document at a time.
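The steps above can be sketched roughly as follows. This is a self-contained illustration, not the real Anserini API: `StubIndexable` and `StubCollection` are simplified stand-ins for `Indexable` and `Collection` whose exact signatures are assumed, the one-document-per-line `id<TAB>text` format is invented for the example, and `readNextRecord` is a hypothetical helper name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

// Stand-in for Anserini's Indexable (assumed shape, not the real interface).
interface StubIndexable {
  String id();
  String content();
}

// Stand-in for Anserini's Collection (assumed shape, not the real class).
abstract class StubCollection implements Iterator<StubIndexable> {
  protected Set<String> allowedFileSuffix = new HashSet<>();
  protected Set<String> skippedDirs = new HashSet<>();
  abstract void prepareInput(Path file) throws IOException;
  abstract void finishInput() throws IOException;
}

// Record reader: parses one line of the hypothetical "id<TAB>text" format.
class MyOwnRecord implements StubIndexable {
  private final String id;
  private final String content;

  MyOwnRecord(String id, String content) { this.id = id; this.content = content; }

  // Returns the next document, or null once the reader is exhausted.
  static MyOwnRecord readNextRecord(BufferedReader reader) throws IOException {
    String line = reader.readLine();
    if (line == null) return null;
    String[] parts = line.split("\t", 2);
    return new MyOwnRecord(parts[0], parts.length > 1 ? parts[1] : "");
  }

  public String id() { return id; }
  public String content() { return content; }
}

class MyOwnCollection extends StubCollection {
  private BufferedReader reader;
  private StubIndexable buffered; // document read ahead by hasNext()

  MyOwnCollection() {
    allowedFileSuffix.add(".tsv");  // only index *.tsv files
    skippedDirs.add("backup");      // ignore this folder during discovery
  }

  @Override
  void prepareInput(Path file) throws IOException {
    reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
    buffered = null;
  }

  @Override
  void finishInput() throws IOException {
    if (reader != null) { reader.close(); reader = null; }
  }

  @Override
  public boolean hasNext() {
    if (buffered == null && reader != null) {
      try {
        buffered = MyOwnRecord.readNextRecord(reader);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return buffered != null;
  }

  @Override
  public StubIndexable next() {
    if (!hasNext()) throw new NoSuchElementException();
    StubIndexable doc = buffered;
    buffered = null;
    return doc;
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("docs", ".tsv");
    Files.write(file, "d1\tfirst document\nd2\tsecond document\n".getBytes(StandardCharsets.UTF_8));
    MyOwnCollection c = new MyOwnCollection();
    c.prepareInput(file);
    while (c.hasNext()) {
      StubIndexable doc = c.next();
      System.out.println(doc.id() + " -> " + doc.content());
    }
    c.finishInput();
  }
}
```

The read-ahead in `hasNext` is one possible design: it lets `next` stay a thin wrapper around the record reader while still honoring the `Iterator` contract.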
Please take a look at TrecCollection and TrecRecord for a full example.