Anserini: uniCOIL w/ doc2query-T5 for MS MARCO V1

This page describes how to reproduce the uniCOIL experiments in the following paper:

Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.

In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting. Thus, no neural inference is involved. For details on how to train uniCOIL and perform inference, please see this guide.

Note that Pyserini provides a comparable reproduction guide, so if you don't like Java, you can get exactly the same results from Python.

Passage Ranking

Data Prep

We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

# Alternate mirrors of the same data, pick one:
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-b8.tar

tar xvf collections/msmarco-passage-unicoil-b8.tar -C collections/

To confirm, msmarco-passage-unicoil-b8.tar is ~3.3 GB and has MD5 checksum eb28c059fad906da2840ce77949bffd7.
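If `md5sum` is not at hand, the checksum can also be verified from Python. The following is a minimal sketch using only the standard library; the helper name `md5_of_file` is ours, and the path assumes the tarball was downloaded into collections/ as above.

```python
import hashlib
from pathlib import Path

# Expected checksum, as quoted in the text above.
EXPECTED_MD5 = "eb28c059fad906da2840ce77949bffd7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    tarball = Path("collections/msmarco-passage-unicoil-b8.tar")
    if tarball.exists():
        actual = md5_of_file(tarball)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{tarball} not found; download it first")
```

Reading in chunks keeps memory use flat, which matters for a file of several gigabytes.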

Indexing

We can now index these docs as a JsonVectorCollection using Anserini:

sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
 -input collections/msmarco-passage-unicoil-b8/ \
 -index indexes/lucene-index.msmarco-passage.unicoil-b8 \
 -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
 -threads 12

The important indexing options to note here are -impact -pretokenized: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
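To make the effect of these two options more concrete, here is a conceptual sketch of impact scoring in Python. It is not Anserini's implementation. It assumes each document carries a "vector" of token-to-integer-weight pairs and that query term weights are expressed by repeating tokens; the document contents, tokens, and weights below are made up for illustration.

```python
from collections import Counter

# Hypothetical document in the style of a JsonVectorCollection entry:
# "vector" maps each (already tokenized) term to an integer impact weight.
doc = {
    "id": "0",
    "contents": "illustrative passage text",
    "vector": {"manhattan": 95, "project": 88, "atomic": 60, "##s": 12},
}


def impact_score(query_tokens, doc_vector):
    """Sum of (query term count) x (document impact weight) over query terms."""
    query_weights = Counter(query_tokens)
    return sum(qw * doc_vector.get(tok, 0) for tok, qw in query_weights.items())


# "Pretokenized" here means the query is split on whitespace and nothing else.
query = "manhattan manhattan project".split()
print(impact_score(query, doc["vector"]))  # 2*95 + 1*88 = 278
```

Because the stored weights are used directly, there is no document length normalization of the kind BM25 applies.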

Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.

Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries. The queries are already stored in the repo, so we can run retrieval directly:

target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.unicoil-b8 \
 -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
 -output runs/run.msmarco-passage.unicoil-b8.tsv -format msmarco \
 -impact -pretokenized

Note that, mirroring the indexing options, we also specify -impact -pretokenized here. Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread).

With -format msmarco, runs are already in the MS MARCO output format, so we can evaluate directly:

python tools/scripts/msmarco/msmarco_passage_eval.py \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.unicoil-b8.tsv

The results should be as follows:

#####################
MRR @10: 0.35155222404147896
QueriesRanked: 6980
#####################

This corresponds to the effectiveness reported in the paper.
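For readers unfamiliar with the metric, MRR@10 averages, over all judged queries, the reciprocal rank of the first relevant passage in the top 10 (contributing 0 if there is none). The sketch below illustrates the computation on toy data. It is not the official evaluation script, and it assumes the run is in the tab-separated `qid`, `docid`, `rank` layout; the helper names are ours.

```python
def parse_run(lines):
    """Parse tab-separated 'qid, docid, rank' lines into rank-ordered docid lists."""
    rows = {}
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        rows.setdefault(qid, []).append((int(rank), docid))
    return {qid: [d for _, d in sorted(items)] for qid, items in rows.items()}


def mrr_at_k(run, qrels, k=10):
    """run: {qid: [docid, ...]}; qrels: {qid: set of relevant docids}."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    # Average over judged queries; unanswered queries count as 0.
    return total / len(qrels) if qrels else 0.0


# Toy example: query 1 finds its relevant passage at rank 2, query 2 at rank 1.
run_lines = ["1\tA\t1", "1\tB\t2", "2\tC\t1", "2\tD\t2"]
qrels = {"1": {"B"}, "2": {"C"}}
print(mrr_at_k(parse_run(run_lines), qrels))  # (1/2 + 1/1) / 2 = 0.75
```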

Document Ranking

Data Prep

We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset with uniCOIL processing:

# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/ZmF6SKpgMZJYXd6/download -O collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar

tar xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/

To confirm, msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar should have MD5 checksum of 88f365b148c7702cf30c0fb95af35149.

Indexing

We can now index these docs as a JsonVectorCollection using Anserini:

sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
 -input collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8/ \
 -index indexes/lucene-index.msmarco-doc-per-passage-expansion.unicoil-d2q-b8 \
 -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
 -threads 12

The important indexing options to note here are -impact -pretokenized: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.

Upon completion, we should have an index with 20,545,677 documents. The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around an hour.

Retrieval

We can now run retrieval:

target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-doc-per-passage-expansion.unicoil-d2q-b8 \
 -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.unicoil.tsv.gz \
 -output runs/run.msmarco-doc.unicoil-d2q-b8.tsv -format msmarco \
 -hits 1000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100 \
 -impact -pretokenized

Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 50 minutes (on a single thread). Note that, mirroring the indexing options, we specify -impact -pretokenized here also.
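The -selectMaxPassage options in the command above turn passage-level hits into a document ranking. The following sketch shows the idea as we understand it: each indexed unit has an id of the form docid, delimiter, segment number; only the best-scoring passage per document is kept; and the top documents are returned. The ids and scores are made up, and this is an illustration rather than Anserini's code.

```python
def select_max_passage(hits, delimiter="#", max_docs=100):
    """hits: list of (passage_id, score). Returns [(docid, best_score)], best first."""
    best = {}
    for passage_id, score in hits:
        # Everything before the first delimiter is taken to be the document id.
        docid = passage_id.split(delimiter, 1)[0]
        if docid not in best or score > best[docid]:
            best[docid] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:max_docs]


hits = [("D1#0", 12.0), ("D2#3", 15.5), ("D1#2", 14.0), ("D3#1", 9.0)]
print(select_max_passage(hits, max_docs=2))  # [('D2', 15.5), ('D1', 14.0)]
```

This is also why -hits is set higher than -selectMaxPassage.hits: several of the 1000 passage hits may collapse into the same document.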

With -format msmarco, runs are already in the MS MARCO output format, so we can evaluate directly:

python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
 --run runs/run.msmarco-doc.unicoil-d2q-b8.tsv

The results should be as follows:

#####################
MRR @100: 0.352997702662614
QueriesRanked: 5193
#####################

Reproduction Log