Adding a new collection essentially means adding a new document parser so that the documents can be recognized and indexed. The collection class does two things:

Walk through the input directory to find files based on some rules, e.g. files that end with .gz
Read each file to find and return documents
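
As an illustration of the first task, here is a minimal sketch of suffix-based file discovery. It is not Anserini's actual discoverFiles implementation: the class FileDiscoverySketch and its method signature are hypothetical, and only the name allowedFileSuffix is borrowed from the steps described on this page.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper, not part of Anserini: walks a directory tree
// and keeps only regular files whose name ends with an allowed suffix.
class FileDiscoverySketch {
  static List<Path> discoverFiles(Path root, Set<String> allowedFileSuffix) throws IOException {
    List<Path> found = new ArrayList<>();
    // Files.walk returns a lazy stream that must be closed, hence try-with-resources.
    try (Stream<Path> paths = Files.walk(root)) {
      paths.filter(Files::isRegularFile)
           .filter(p -> {
             String name = p.getFileName().toString();
             return allowedFileSuffix.stream().anyMatch(name::endsWith);
           })
           .forEach(found::add);
    }
    // Sort so that files are processed in a stable order.
    Collections.sort(found);
    return found;
  }
}
```

A real collection would presumably also consult the skipped prefixes, skipped suffixes, and skipped directories; they are left out here to keep the sketch short.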

The detailed steps are:

Add a new class under the package io.anserini.index.collections. This class should extend the Collection class. Name it something like MyOwnCollection, where MyOwn is the name of your collection. The Collection suffix matters because the class is instantiated by name via reflection: c = (Collection)Class.forName("io.anserini.index.collections."+collectionClass+"Collection").newInstance();
In the constructor, define your own skippedFilePrefix, allowedFilePrefix, skippedFileSuffix, allowedFileSuffix, and skippedDirs. discoverFiles relies on these sets to decide which files and folders to include or exclude.
Override the methods prepareInput and finishInput. prepareInput takes a file path as its argument, and you can initialize a BufferedReader (or something similar) there. finishInput is called after the file has been processed, and you can close the BufferedReader (or something similar) there.
Add a new record reader under the package io.anserini.document. This class should extend the Indexable class. Typically, the next method in Collection (Collection implements Iterator) calls into the record reader to read one document at a time.
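
The steps above can be sketched roughly as follows. This is a self-contained illustration, not code that compiles against Anserini: MyOwnRecord and MyOwnCollection are hypothetical stand-ins that do not extend the real Indexable and Collection classes, the readNext helper is invented for the example, and the one-document-per-line "id<TAB>content" input format is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical record reader; in Anserini this class would extend Indexable.
class MyOwnRecord {
  final String id;
  final String content;

  MyOwnRecord(String id, String content) {
    this.id = id;
    this.content = content;
  }

  // Parses one "id<TAB>content" line (assumed format); returns null at end of input.
  static MyOwnRecord readNext(BufferedReader reader) throws IOException {
    String line = reader.readLine();
    if (line == null) {
      return null;
    }
    String[] parts = line.split("\t", 2);
    return new MyOwnRecord(parts[0], parts.length > 1 ? parts[1] : "");
  }
}

// Hypothetical collection; in Anserini this class would extend Collection.
class MyOwnCollection implements Iterator<MyOwnRecord> {
  private BufferedReader reader;
  private MyOwnRecord buffered;

  // Called before a file is processed: open the reader.
  public void prepareInput(Path file) throws IOException {
    reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
    buffered = null;
  }

  // Called after the file is processed: close the reader.
  public void finishInput() throws IOException {
    if (reader != null) {
      reader.close();
      reader = null;
    }
  }

  @Override
  public boolean hasNext() {
    if (buffered != null) {
      return true;
    }
    if (reader == null) {
      return false;
    }
    try {
      // Delegate to the record reader to fetch one document at a time.
      buffered = MyOwnRecord.readNext(reader);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffered != null;
  }

  @Override
  public MyOwnRecord next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    MyOwnRecord record = buffered;
    buffered = null;
    return record;
  }
}
```

A caller would invoke prepareInput on each discovered file, drain the iterator, and then invoke finishInput before moving on to the next file.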

Please take a look at TrecCollection and TrecRecord for a full example.
