Add bindings for MS MARCO V2.1 doc: prebuilt indexes, topics, qrels (c…
lintool authored Jun 17, 2024
1 parent 49d8c43 commit 20eb50b
Showing 8 changed files with 306 additions and 19 deletions.
52 changes: 45 additions & 7 deletions docs/prebuilt-indexes.md
@@ -1,8 +1,8 @@

# Pyserini: Prebuilt Indexes

Pyserini provides a number of pre-built Lucene indexes.
To list what's available in code:
Pyserini provides a number of prebuilt Lucene indexes.
To list what's available:

```python
from pyserini.search.lucene import LuceneSearcher
@@ -12,21 +12,21 @@ from pyserini.index.lucene import IndexReader
IndexReader.list_prebuilt_indexes()
```

It's easy to initialize a searcher from a pre-built index:
It's easy to initialize a searcher from a prebuilt index:

```python
searcher = LuceneSearcher.from_prebuilt_index('robust04')
```

You can use this simple Python one-liner to download the pre-built index:
You can use this simple Python one-liner to download the prebuilt index:

```
python -c "from pyserini.search.lucene import LuceneSearcher; LuceneSearcher.from_prebuilt_index('robust04')"
```

The downloaded index will be in `~/.cache/pyserini/indexes/`.
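
To check which indexes have already been downloaded, the cache directory can be listed directly. The helper below is only a sketch and not part of Pyserini: the function name `list_cached_indexes` and its `cache_dir` parameter are made up for illustration, and it assumes each downloaded index is unpacked into its own subdirectory of the cache.

```python
from pathlib import Path


def list_cached_indexes(cache_dir=None):
    """Return the names of subdirectories found in the index cache.

    Hypothetical helper: defaults to the cache location mentioned above and
    assumes one subdirectory per unpacked index.
    """
    if cache_dir is None:
        root = Path.home() / '.cache' / 'pyserini' / 'indexes'
    else:
        root = Path(cache_dir)
    if not root.is_dir():
        # Nothing has been downloaded yet (or the cache lives elsewhere).
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `list_cached_indexes()` with no arguments inspects the default location; passing a path is useful if the cache has been moved.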

It's similarly easy to initialize an index reader from a pre-built index:
It's similarly easy to initialize an index reader from a prebuilt index:

```python
index_reader = IndexReader.from_prebuilt_index('robust04')
@@ -42,8 +42,22 @@ The output will be:
Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
Nope, that's not a bug.
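
Code that consumes these statistics may want to treat the `-1` sentinel as "unknown" rather than as a count. A minimal sketch, assuming the statistics arrive as a plain dictionary with a `unique_terms` key; the helper name `unique_terms_or_none` is made up for illustration:

```python
from typing import Optional


def unique_terms_or_none(stats: dict) -> Optional[int]:
    """Map the -1 sentinel (index not optimized) to None."""
    value = stats.get('unique_terms', -1)
    return None if value == -1 else value
```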

Below is a summary of the pre-built indexes that are currently available.
Detailed configuration information for the pre-built indexes is stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
Pyserini also provides a number of prebuilt Faiss indexes.
To list what's available:

```python
from pyserini.search.faiss import FaissSearcher
FaissSearcher.list_prebuilt_indexes()
```

And to initialize a specific Faiss index:

```python
searcher = FaissSearcher.from_prebuilt_index('msmarco-v1-passage.bge-base-en-v1.5', None)
```

Below is a summary of the prebuilt indexes that are currently available.
Detailed configuration information for the prebuilt indexes is stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).



@@ -200,6 +214,30 @@ Detailed configuration information for the pre-built indexes are stored in [`pys
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2-passage-augmented.d2q-t5.20220808.4d6d2a.README.md">readme</a>]
<dd>Lucene index (+docvectors) of the MS MARCO V2 augmented passage corpus with doc2query-T5 expansions.
</dd>
<dt></dt><b><code>msmarco-v2.1-doc</code></b>
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md">readme</a>]
<dd>Lucene index of the MS MARCO V2.1 document corpus.
</dd>
<dt></dt><b><code>msmarco-v2.1-doc-slim</code></b>
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md">readme</a>]
<dd>Lucene index of the MS MARCO V2.1 document corpus ('slim' version).
</dd>
<dt></dt><b><code>msmarco-v2.1-doc-full</code></b>
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md">readme</a>]
<dd>Lucene index of the MS MARCO V2.1 document corpus ('full' version).
</dd>
<dt></dt><b><code>msmarco-v2.1-doc-segmented</code></b>
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md">readme</a>]
<dd>Lucene index of the MS MARCO V2.1 segmented document corpus.
</dd>
<dt></dt><b><code>msmarco-v2.1-doc-segmented-slim</code></b>
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md">readme</a>]
<dd>Lucene index of the MS MARCO V2.1 segmented document corpus ('slim' version).
</dd>
<dt></dt><b><code>msmarco-v2.1-doc-segmented-full</code></b>
[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md">readme</a>]
<dd>Lucene index of the MS MARCO V2.1 segmented document corpus ('full' version).
</dd>
</dl>
</details>
<details>
88 changes: 88 additions & 0 deletions pyserini/prebuilt_index_info.py
@@ -561,6 +561,94 @@
"documents": 138364198,
"unique_terms": 41177061,
"downloaded": False
},

# MS MARCO V2.1 document corpus, three indexes with different amounts of information (and sizes).
"msmarco-v2.1-doc": {
"description": "Lucene index of the MS MARCO V2.1 document corpus.",
"filename": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.tar.gz",
"readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.tar.gz",
],
"md5": "cecd55856c34afa82f1a499705c9df02",
"size compressed (bytes)": 54190811494,
"total_terms": 12710796540,
"documents": 10960555,
"unique_terms": 44599151,
"downloaded": False
},
"msmarco-v2.1-doc-slim": {
"description": "Lucene index of the MS MARCO V2.1 document corpus ('slim' version).",
"filename": "lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675.tar.gz",
"readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675.tar.gz",
],
"md5": "2c31a8c0a7133eb6ea04c91ceffa7e08",
"size compressed (bytes)": 6191736133,
"total_terms": 12710796540,
"documents": 10960555,
"unique_terms": 44599151,
"downloaded": False
},
"msmarco-v2.1-doc-full": {
"description": "Lucene index of the MS MARCO V2.1 document corpus ('full' version).",
"filename": "lucene-inverted.msmarco-v2.1-doc-full.20240418.4f9675.tar.gz",
"readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-full.20240418.4f9675.tar.gz",
],
"md5": "794708c9cf8fc6eafff07c5485e934b9",
"size compressed (bytes)": 102997532522,
"total_terms": 12710796540,
"documents": 10960555,
"unique_terms": 44599151,
"downloaded": False
},

# MS MARCO V2.1 segmented document corpus, three indexes with different amounts of information (and sizes).
"msmarco-v2.1-doc-segmented": {
"description": "Lucene index of the MS MARCO V2.1 segmented document corpus.",
"filename": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.tar.gz",
"readme": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.tar.gz"
],
"md5": "6ec4cd595c9fe1ad91b43eabb39a637c",
"size compressed (bytes)": 60071133069,
"total_terms": 22707699649,
"documents": 113520750,
"unique_terms": 29040364,
"downloaded": False
},
"msmarco-v2.1-doc-segmented-slim": {
"description": "Lucene index of the MS MARCO V2.1 segmented document corpus ('slim' version).",
"filename": "lucene-inverted.msmarco-v2.1-doc-segmented-slim.20240418.4f9675.tar.gz",
"readme": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-segmented-slim.20240418.4f9675.tar.gz"
],
"md5": "3c6c946c722a201b65903a92f082ea4f",
"size compressed (bytes)": 15374492909,
"total_terms": 22707699649,
"documents": 113520750,
"unique_terms": 29040364,
"downloaded": False
},
"msmarco-v2.1-doc-segmented-full": {
"description": "Lucene index of the MS MARCO V2.1 segmented document corpus ('full' version).",
"filename": "lucene-inverted.msmarco-v2.1-doc-segmented-full.20240418.4f9675.tar.gz",
"readme": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-segmented-full.20240418.4f9675.tar.gz"
],
"md5": "a43d09d31ae4e5dac81f5cfde1a810a7",
"size compressed (bytes)": 146130406504,
"total_terms": 22707699649,
"documents": 113520750,
"unique_terms": 29040364,
"downloaded": False
}
}
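
Each entry above records an `md5` checksum for its tarball. For a tarball fetched by hand, the checksum can be checked with the standard library. This is a sketch under that assumption, not Pyserini's own download code; `md5_of_file` and `matches_expected` are illustrative names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so multi-gigabyte tarballs never sit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's digest with the 'md5' field of a metadata entry."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `matches_expected(tarball_path, 'cecd55856c34afa82f1a499705c9df02')` would check a local copy of the `msmarco-v2.1-doc` tarball, where `tarball_path` is wherever the file was saved.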

@@ -0,0 +1,34 @@
# msmarco-v2.1-doc-segmented

Lucene inverted index of the MS MARCO V2.1 segmented document corpus.

Note that there are three variants of this index:

+ `msmarco-v2.1-doc-segmented` (56G uncompressed): the "default" version, which stores term frequencies and the raw text. This supports bag-of-words queries, but no phrase queries and no relevance feedback (unless the documents are parsed on the fly).
+ `msmarco-v2.1-doc-segmented-slim` (15G uncompressed): the "slim" version, which stores term frequencies only. This supports bag-of-words queries, but no phrase queries and no relevance feedback. There is no way to fetch the raw text from this index.
+ `msmarco-v2.1-doc-segmented-full` (137G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. This supports bag-of-words queries, phrase queries, and relevance feedback.

These indexes were generated on 2024/04/19 at Anserini commit [`4f9675`](https://github.com/castorini/anserini/commit/4f967519baa1bc634f7dd2998d7a408c27120b1c) on `tuna` with the following commands:

```bash
nohup bin/run.sh io.anserini.index.IndexCollection \
-collection MsMarcoV2DocCollection \
-input /mnt/collections/msmarco/msmarco_v2.1_doc_segmented/ \
-generator DefaultLuceneDocumentGenerator \
-index indexes/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675/ \
-threads 8 -storeRaw -optimize >& logs/log.msmarco-v2.1-doc-segmented.20240418.4f9675.txt &

nohup bin/run.sh io.anserini.index.IndexCollection \
-collection MsMarcoV2DocCollection \
-input /mnt/collections/msmarco/msmarco_v2.1_doc_segmented/ \
-generator DefaultLuceneDocumentGenerator \
-index indexes/lucene-inverted.msmarco-v2.1-doc-segmented-slim.20240418.4f9675/ \
-threads 8 -optimize >& logs/log.msmarco-v2.1-doc-segmented-slim.20240418.4f9675.txt &

nohup bin/run.sh io.anserini.index.IndexCollection \
-collection MsMarcoV2DocCollection \
-input /mnt/collections/msmarco/msmarco_v2.1_doc_segmented/ \
-generator DefaultLuceneDocumentGenerator \
-index indexes/lucene-inverted.msmarco-v2.1-doc-segmented-full.20240418.4f9675/ \
-threads 8 -storePositions -storeDocvectors -storeRaw -optimize >& logs/log.msmarco-v2.1-doc-segmented-full.20240418.4f9675.txt &
```
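
The trade-off between the three variants can be expressed as picking the smallest index that supports what a task needs. The sketch below hard-codes the sizes and capabilities from the list above. The feature labels and the `smallest_variant` helper are made up for illustration, and the default variant is treated as not supporting relevance feedback (that is, without parsing documents on the fly).

```python
# Sizes (in GB) and capabilities as described in the list above.
VARIANTS = {
    'msmarco-v2.1-doc-segmented-slim': {
        'size_gb': 15,
        'features': {'bag_of_words'},
    },
    'msmarco-v2.1-doc-segmented': {
        'size_gb': 56,
        'features': {'bag_of_words', 'raw_text'},
    },
    'msmarco-v2.1-doc-segmented-full': {
        'size_gb': 137,
        'features': {'bag_of_words', 'raw_text',
                     'phrase_queries', 'relevance_feedback'},
    },
}


def smallest_variant(required):
    """Return the smallest variant whose features cover `required`."""
    candidates = [
        (info['size_gb'], name)
        for name, info in VARIANTS.items()
        if set(required) <= info['features']
    ]
    if not candidates:
        raise ValueError(f'no variant supports {sorted(required)}')
    return min(candidates)[1]
```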
@@ -0,0 +1,34 @@
# msmarco-v2.1-doc

Lucene inverted index of the MS MARCO V2.1 document corpus.

Note that there are three variants of this index:

+ `msmarco-v2.1-doc` (51G uncompressed): the "default" version, which stores term frequencies and the raw text. This supports bag-of-words queries, but no phrase queries and no relevance feedback (unless the documents are parsed on the fly).
+ `msmarco-v2.1-doc-slim` (5.8G uncompressed): the "slim" version, which stores term frequencies only. This supports bag-of-words queries, but no phrase queries and no relevance feedback. There is no way to fetch the raw text from this index.
+ `msmarco-v2.1-doc-full` (96G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. This supports bag-of-words queries, phrase queries, and relevance feedback.

These indexes were generated on 2024/04/19 at Anserini commit [`4f9675`](https://github.com/castorini/anserini/commit/4f967519baa1bc634f7dd2998d7a408c27120b1c) on `tuna` with the following commands:

```bash
nohup bin/run.sh io.anserini.index.IndexCollection \
-collection MsMarcoV2DocCollection \
-input /mnt/collections/msmarco/msmarco_v2.1_doc/ \
-generator DefaultLuceneDocumentGenerator \
-index indexes/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675/ \
-threads 8 -storeRaw -optimize >& logs/log.msmarco-v2.1-doc.20240418.4f9675.txt &

nohup bin/run.sh io.anserini.index.IndexCollection \
-collection MsMarcoV2DocCollection \
-input /mnt/collections/msmarco/msmarco_v2.1_doc/ \
-generator DefaultLuceneDocumentGenerator \
-index indexes/lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675/ \
-threads 8 -optimize >& logs/log.msmarco-v2.1-doc-slim.20240418.4f9675.txt &

nohup bin/run.sh io.anserini.index.IndexCollection \
-collection MsMarcoV2DocCollection \
-input /mnt/collections/msmarco/msmarco_v2.1_doc/ \
-generator DefaultLuceneDocumentGenerator \
-index indexes/lucene-inverted.msmarco-v2.1-doc-full.20240418.4f9675/ \
-threads 8 -storePositions -storeDocvectors -storeRaw -optimize >& logs/log.msmarco-v2.1-doc-full.20240418.4f9675.txt &
```
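
The README gives sizes in gigabytes, whereas the metadata in `pyserini/prebuilt_index_info.py` records the compressed download size in bytes. A small helper, sketched here as an illustration rather than anything Pyserini provides, converts those byte counts to GiB for a rough idea of the download involved. The `to_gib` name is an assumption; the byte counts are copied from the metadata entries for these three indexes.

```python
GIB = 1024 ** 3

# "size compressed (bytes)" values from the metadata entries in this commit.
COMPRESSED_BYTES = {
    'msmarco-v2.1-doc': 54190811494,
    'msmarco-v2.1-doc-slim': 6191736133,
    'msmarco-v2.1-doc-full': 102997532522,
}


def to_gib(num_bytes):
    """Convert a byte count to binary gigabytes (GiB)."""
    return num_bytes / GIB


for name, size in COMPRESSED_BYTES.items():
    print(f'{name}: {to_gib(size):.1f} GiB compressed')
```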
13 changes: 13 additions & 0 deletions pyserini/search/_base.py
@@ -82,14 +82,18 @@ def safe_getattr(cls, attr):
'dl20-unicoil': 'TREC2020_DL_UNICOIL',
'dl20-unicoil-noexp': 'TREC2020_DL_UNICOIL_NOEXP',
'dl21': 'TREC2021_DL',
'dl21-doc': 'TREC2021_DL',
'dl21-unicoil': 'TREC2021_DL_UNICOIL',
'dl21-unicoil-noexp': 'TREC2021_DL_UNICOIL_NOEXP',
'dl22': 'TREC2022_DL',
'dl22-doc': 'TREC2022_DL',
'dl22-unicoil': 'TREC2022_DL_UNICOIL',
'dl22-unicoil-noexp': 'TREC2022_DL_UNICOIL_NOEXP',
'dl23': 'TREC2023_DL',
'dl23-doc': 'TREC2023_DL',
'dl23-unicoil': 'TREC2023_DL_UNICOIL',
'dl23-unicoil-noexp': 'TREC2023_DL_UNICOIL_NOEXP',
'rag24.raggy-dev': 'TREC2024_RAG_RAGGY_DEV',
'msmarco-doc-dev': 'MSMARCO_DOC_DEV',
'msmarco-doc-dev-unicoil': 'MSMARCO_DOC_DEV_UNICOIL',
'msmarco-doc-dev-unicoil-noexp': 'MSMARCO_DOC_DEV_UNICOIL_NOEXP',
@@ -102,9 +106,11 @@ def safe_getattr(cls, attr):
'msmarco-passage-dev-subset-distill-splade-max': 'MSMARCO_PASSAGE_DEV_SUBSET_DISTILL_SPLADE_MAX',
'msmarco-passage-test-subset': 'MSMARCO_PASSAGE_TEST_SUBSET',
'msmarco-v2-doc-dev': 'MSMARCO_V2_DOC_DEV',
'msmarco-v2-doc.dev': 'MSMARCO_V2_DOC_DEV',
'msmarco-v2-doc-dev-unicoil': 'MSMARCO_V2_DOC_DEV_UNICOIL',
'msmarco-v2-doc-dev-unicoil-noexp': 'MSMARCO_V2_DOC_DEV_UNICOIL_NOEXP',
'msmarco-v2-doc-dev2': 'MSMARCO_V2_DOC_DEV2',
'msmarco-v2-doc.dev2': 'MSMARCO_V2_DOC_DEV2',
'msmarco-v2-doc-dev2-unicoil': 'MSMARCO_V2_DOC_DEV2_UNICOIL',
'msmarco-v2-doc-dev2-unicoil-noexp': 'MSMARCO_V2_DOC_DEV2_UNICOIL_NOEXP',
'msmarco-v2-passage-dev': 'MSMARCO_V2_PASSAGE_DEV',
@@ -419,12 +425,18 @@ def safe_getattr(cls, attr):
'dl22-passage': 'TREC2022_DL_PASSAGE',
'dl23-doc': 'TREC2023_DL_DOC',
'dl23-passage': 'TREC2023_DL_PASSAGE',
'dl21-doc-msmarco-v2.1': 'TREC2021_DL_DOC_MSMARCO_V21',
'dl22-doc-msmarco-v2.1': 'TREC2022_DL_DOC_MSMARCO_V21',
'dl23-doc-msmarco-v2.1': 'TREC2023_DL_DOC_MSMARCO_V21',
'rag24.raggy-dev': 'TREC2024_RAG_RAGGY_DEV',
'msmarco-doc-dev': 'MSMARCO_DOC_DEV',
'msmarco-passage-dev-subset': 'MSMARCO_PASSAGE_DEV_SUBSET',
'msmarco-v2-doc-dev': 'MSMARCO_V2_DOC_DEV',
'msmarco-v2-doc-dev2': 'MSMARCO_V2_DOC_DEV2',
'msmarco-v2-passage-dev': 'MSMARCO_V2_PASSAGE_DEV',
'msmarco-v2-passage-dev2': 'MSMARCO_V2_PASSAGE_DEV2',
'msmarco-v2.1-doc.dev': 'MSMARCO_V21_DOC_DEV',
'msmarco-v2.1-doc.dev2': 'MSMARCO_V21_DOC_DEV2',
'ntcir8-zh': 'NTCIR8_ZH',
'clef2006-fr': 'CLEF2006_FR',
'trec2002-ar': 'TREC2002_AR',
@@ -565,6 +577,7 @@ def safe_getattr(cls, attr):
topics_mapping = {k: v for k, v in topics_mapping.items() if v is not None}
qrels_mapping = {k: v for k, v in qrels_mapping.items() if v is not None}
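
The new alias keys (for example `dl21-doc` next to `dl21`) point at the same symbol name, and the two filters above drop any key whose symbol could not be resolved. The stand-alone sketch below imitates that pattern under stated assumptions: `FakeTopics` is a made-up stand-in for the real topics class, its values are placeholders, and `safe_getattr` is assumed to return `None` for a missing attribute, which this diff does not show.

```python
class FakeTopics:
    """Hypothetical stand-in for the real topics class (placeholder values)."""
    TREC2021_DL = 'resource-for-dl21'
    MSMARCO_V2_DOC_DEV = 'resource-for-msmarco-v2-doc-dev'


def safe_getattr(cls, attr):
    # Assumed behaviour: resolve the attribute, or return None if it is missing.
    return getattr(cls, attr, None)


names = {
    'dl21': 'TREC2021_DL',
    'dl21-doc': 'TREC2021_DL',                    # alias for the same symbol
    'msmarco-v2-doc-dev': 'MSMARCO_V2_DOC_DEV',
    'msmarco-v2-doc.dev': 'MSMARCO_V2_DOC_DEV',   # alias for the same symbol
    'rag24.raggy-dev': 'TREC2024_RAG_RAGGY_DEV',  # absent from the stand-in
}

resolved = {k: safe_getattr(FakeTopics, v) for k, v in names.items()}
# Same filter as above: keep only the keys that resolved to something.
topics_mapping = {k: v for k, v in resolved.items() if v is not None}
```

With this stand-in, both spellings of an alias resolve to the same value, and the unresolved `rag24.raggy-dev` key is dropped instead of mapping to `None`.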


def get_topics(collection_name):
"""
Parameters
33 changes: 21 additions & 12 deletions scripts/generate_docs_from_prebuilt_indexes.py
@@ -14,20 +14,15 @@
# limitations under the License.
#

import os
import sys

# Use Pyserini in this repo (as opposed to pip install)
sys.path.insert(0, './')

from pyserini.prebuilt_index_info import *


__boilerplate__ = '''
# Pyserini: Prebuilt Indexes
Pyserini provides a number of pre-built Lucene indexes.
To list what's available in code:
Pyserini provides a number of prebuilt Lucene indexes.
To list what's available:
```python
from pyserini.search.lucene import LuceneSearcher
@@ -37,21 +32,21 @@
IndexReader.list_prebuilt_indexes()
```
It's easy to initialize a searcher from a pre-built index:
It's easy to initialize a searcher from a prebuilt index:
```python
searcher = LuceneSearcher.from_prebuilt_index('robust04')
```
You can use this simple Python one-liner to download the pre-built index:
You can use this simple Python one-liner to download the prebuilt index:
```
python -c "from pyserini.search.lucene import LuceneSearcher; LuceneSearcher.from_prebuilt_index('robust04')"
```
The downloaded index will be in `~/.cache/pyserini/indexes/`.
It's similarly easy to initialize an index reader from a pre-built index:
It's similarly easy to initialize an index reader from a prebuilt index:
```python
index_reader = IndexReader.from_prebuilt_index('robust04')
@@ -67,8 +62,22 @@
Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
Nope, that's not a bug.
Below is a summary of the pre-built indexes that are currently available.
Detailed configuration information for the pre-built indexes is stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
Pyserini also provides a number of prebuilt Faiss indexes.
To list what's available:
```python
from pyserini.search.faiss import FaissSearcher
FaissSearcher.list_prebuilt_indexes()
```
And to initialize a specific Faiss index:
```python
searcher = FaissSearcher.from_prebuilt_index('msmarco-v1-passage.bge-base-en-v1.5', None)
```
Below is a summary of the prebuilt indexes that are currently available.
Detailed configuration information for the prebuilt indexes is stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
'''
