Add documentation for MS Marco Passage Ranking using ElasticSearch
* update_elastic_msmarco_passage_file

* integrate msmarco-passage instruction into elastirini.md

* Delete elastirini-msmarco-passage.md

* Minor updates
w329li authored and Ryan Clancy committed Nov 22, 2019
1 parent 663f94c commit e502d76
Showing 1 changed file with 37 additions and 1 deletion.
38 changes: 37 additions & 1 deletion docs/elastirini.md
@@ -56,7 +56,7 @@ If at some point one of the ELK components is failing for some reason, or if you

Once we have a local instance of Elasticsearch up and running, we can index using Elasticsearch through Elastirini.

First, let us create the index in Elasticsearch.
First, let us create the index in Elasticsearch. Replace `<index_name>` and adjust the BM25 parameters to suit your use case.

```
curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/<index_name>' \
Expand Down Expand Up @@ -139,3 +139,39 @@ Evaluation can be performed using `trec_eval`:
```
eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust2004.txt run.es.robust04.bm25.topics.robust04.301-450.601-700.txt
```
# Elasticsearch on MS MARCO (Passage)
To prepare the MS MARCO passage data, follow the Anserini guide [BM25 Baselines on MS MARCO (Passage)](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md). As in the previous section, there are three steps:

1. Create the index in Elasticsearch with the `curl` command shown above, replacing `<index_name>` with `msmarco-passage` (the index name used in the steps below). Remember to set the BM25 parameters to `k1 = 0.82` and `b = 0.68`.
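
As a rough sketch of where these values go, the request below sets the index-wide default similarity to BM25 with the parameters from step 1. It assumes the index is named `msmarco-passage` and deliberately omits the field mappings and any other settings from the full `curl` command earlier in this document. `INDEX_NAME`, `BM25_SETTINGS`, and `create_index` are illustrative names, not part of Elastirini.

```shell
# Assumed index name; it matches the -es.index value used when indexing.
INDEX_NAME="msmarco-passage"

# Only the similarity settings are shown here. The mappings from the full
# index-creation request are intentionally left out of this sketch.
BM25_SETTINGS='{
  "settings": {
    "index": {
      "similarity": {
        "default": { "type": "BM25", "k1": 0.82, "b": 0.68 }
      }
    }
  }
}'

# Hypothetical helper: send the index-creation request to a local Elasticsearch.
create_index() {
  curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' \
    "localhost:9200/${INDEX_NAME}" -d "${BM25_SETTINGS}"
}
```

Calling `create_index` once Elasticsearch is up sends the request. If the index already exists, it has to be deleted first before it can be recreated with different parameters.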

2. Index the documents as a `JsonCollection` through Elastirini:
```
sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator JsoupGenerator \
 -es -es.index msmarco-passage -threads 9 -input msmarco-passage/collection_jsonl -storePositions -storeDocvectors -storeRawDocs
```
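
Once indexing finishes, one way to sanity-check the result is to ask Elasticsearch how many documents the index holds through its `_count` endpoint. The sketch below reuses the host and credentials from the `curl` command earlier in this document; `count_docs` is a hypothetical helper name. The MS MARCO passage collection contains 8,841,823 passages, so the reported count should approach that number once the index has refreshed.

```shell
# Hypothetical helper: print the _count response for an index.
# $1: index name (defaults to the index used in this guide)
count_docs() {
  curl --silent --user elastic:changeme \
    "localhost:9200/${1:-msmarco-passage}/_count"
}
```

Running `count_docs msmarco-passage` returns a small JSON object whose `count` field is the number of indexed documents.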
3. Retrieve and evaluate on the dev set.
Since the dev set contains many queries (> 100k), retrieving all of them would take a long time. To speed this up, we use only the queries that appear in the qrels file:
```
python ./src/main/python/msmarco/filter_queries.py --qrels msmarco-passage/qrels.dev.small.tsv \
--queries msmarco-passage/queries.dev.tsv --output_queries msmarco-passage/queries.dev.small.tsv
```
The output queries file should contain 6980 lines.
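
A quick way to confirm this is to count the lines in the filtered file. The helper below is a small sketch; `check_line_count` is a hypothetical name, and the path in the usage note is the output path from the command above.

```shell
# Hypothetical helper: compare a file's line count with an expected value.
# $1: file path, $2: expected number of lines
check_line_count() {
  # wc may pad its output with spaces on some systems, so strip whitespace.
  actual=$(wc -l < "$1" | tr -d '[:space:]')
  if [ "$actual" -eq "$2" ]; then
    echo "ok: $1 has $actual lines"
  else
    echo "mismatch: $1 has $actual lines, expected $2" >&2
    return 1
  fi
}
```

For this guide, the call would be `check_line_count msmarco-passage/queries.dev.small.tsv 6980`.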

We can now retrieve this smaller set of queries with Elastirini. This takes about half an hour on a modern desktop with an SSD:
```
sh target/appassembler/bin/SearchElastic -topicreader TsvString -es.index msmarco-passage \
-topics msmarco-passage/queries.dev.small.tsv \
-output msmarco-passage/run.dev.small.tsv
```
There are also other `-es` parameters that you can specify as you see fit.

To perform the evaluation with `trec_eval`, run:
```
./eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
msmarco-passage/qrels.dev.small.tsv msmarco-passage/run.dev.small.tsv
```
The output should be:
```
map all 0.1956
recall_1000 all 0.8573
```
Average precision (MAP) and recall@1000 are the two metrics we care about most. For more information, see the table in the `BM25 Tuning` section of [BM25 Baselines on MS MARCO (Passage)](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).
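
To compare a run against these reference numbers without reading the output by eye, the metric values can be pulled out with `awk`. This is only a sketch: `metric_value` is a hypothetical helper, and it assumes the three-column layout (metric name, `all`, value) shown in the output above.

```shell
# Hypothetical helper: read trec_eval output on stdin and print the value
# of the summary ("all") row for the metric named in $1.
metric_value() {
  awk -v name="$1" '$1 == name && $2 == "all" { print $3 }'
}
```

For example, piping the `trec_eval` command above into `metric_value map` prints only the MAP value, which can then be compared with the number reported here.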
