Add bindings for MS MARCO V2.1 doc: prebuilt indexes, topics, qrels (castorini#1917)
mrkarezina · Jun 17, 2024 · 20eb50b
1 parent 49d8c43
Showing 8 changed files with 306 additions and 19 deletions.
diff --git a/docs/prebuilt-indexes.md b/docs/prebuilt-indexes.md
@@ -1,8 +1,8 @@
 
 # Pyserini: Prebuilt Indexes
 
-Pyserini provides a number of pre-built Lucene indexes.
-To list what's available in code:
+Pyserini provides a number of prebuilt Lucene indexes.
+To list what's available:
 
 ```python
 from pyserini.search.lucene import LuceneSearcher
@@ -12,21 +12,21 @@ from pyserini.index.lucene import IndexReader
 IndexReader.list_prebuilt_indexes()
 ```
 
-It's easy initialize a searcher from a pre-built index:
+It's easy to initialize a searcher from a prebuilt index:
 
 ```python
 searcher = LuceneSearcher.from_prebuilt_index('robust04')
 ```
 
-You can use this simple Python one-liner to download the pre-built index:
+You can use this simple Python one-liner to download the prebuilt index:
 
 ```
 python -c "from pyserini.search.lucene import LuceneSearcher; LuceneSearcher.from_prebuilt_index('robust04')"
 ```
 
 The downloaded index will be in `~/.cache/pyserini/indexes/`.
 
-It's similarly easy initialize an index reader from a pre-built index:
+It's similarly easy to initialize an index reader from a prebuilt index:
 
 ```python
 index_reader = IndexReader.from_prebuilt_index('robust04')
@@ -42,8 +42,22 @@ The output will be:
 Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
 Nope, that's not a bug.
 
-Below is a summary of the pre-built indexes that are currently available.
-Detailed configuration information for the pre-built indexes are stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
+Pyserini also provides a number of prebuilt Faiss indexes.
+To list what's available:
+
+```python
+from pyserini.search.faiss import FaissSearcher
+FaissSearcher.list_prebuilt_indexes()
+```
+
+And to initialize a specific Faiss index:
+
+```python
+searcher = FaissSearcher.from_prebuilt_index('msmarco-v1-passage.bge-base-en-v1.5', None)
+```
+
+Below is a summary of the prebuilt indexes that are currently available.
+Detailed configuration information for the prebuilt indexes is stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
 
 
 
@@ -200,6 +214,30 @@ Detailed configuration information for the pre-built indexes are stored in [`pys
 [<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2-passage-augmented.d2q-t5.20220808.4d6d2a.README.md">readme</a>]
 <dd>Lucene index (+docvectors) of the MS MARCO V2 augmented passage corpus with doc2query-T5 expansions.
 </dd>
+<dt></dt><b><code>msmarco-v2.1-doc</code></b>
+[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md">readme</a>]
+<dd>Lucene index of the MS MARCO V2.1 document corpus.
+</dd>
+<dt></dt><b><code>msmarco-v2.1-doc-slim</code></b>
+[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md">readme</a>]
+<dd>Lucene index of the MS MARCO V2.1 document corpus ('slim' version).
+</dd>
+<dt></dt><b><code>msmarco-v2.1-doc-full</code></b>
+[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md">readme</a>]
+<dd>Lucene index of the MS MARCO V2.1 document corpus ('full' version).
+</dd>
+<dt></dt><b><code>msmarco-v2.1-doc-segmented</code></b>
+[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md">readme</a>]
+<dd>Lucene index of the MS MARCO V2.1 segmented document corpus.
+</dd>
+<dt></dt><b><code>msmarco-v2.1-doc-segmented-slim</code></b>
+[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md">readme</a>]
+<dd>Lucene index of the MS MARCO V2.1 segmented document corpus ('slim' version).
+</dd>
+<dt></dt><b><code>msmarco-v2.1-doc-segmented-full</code></b>
+[<a href="../pyserini/resources/index-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md">readme</a>]
+<dd>Lucene index of the MS MARCO V2.1 segmented document corpus ('full' version).
+</dd>
 </dl>
 </details>
 <details>
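
As a usage sketch for the bindings documented above, the following runs a few TREC 2021 DL topics against one of the new indexes. It assumes a working Pyserini installation (including Java) in which the keys `msmarco-v2.1-doc-slim`, `dl21-doc`, and `dl21-doc-msmarco-v2.1` from this commit are available. The function name `run_dl21_on_v21_doc` and its parameters are hypothetical. Imports are deferred into the function body because the first call downloads the index (roughly 6 GB compressed for the slim variant).

```python
def run_dl21_on_v21_doc(index_name="msmarco-v2.1-doc-slim", k=10, max_topics=3):
    """Search a few TREC 2021 DL topics against an MS MARCO V2.1 doc index."""
    # Deferred imports: Pyserini starts a JVM, and the index is fetched on first use.
    from pyserini.search import get_topics, get_qrels
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher.from_prebuilt_index(index_name)
    topics = get_topics("dl21-doc")               # topics key added in this commit
    qrels = get_qrels("dl21-doc-msmarco-v2.1")    # qrels key added in this commit

    results = {}
    for qid in list(topics)[:max_topics]:
        hits = searcher.search(topics[qid]["title"], k=k)
        judged = qrels.get(qid, {})
        # Pair each hit with its judgment as stored in the qrels (None if unjudged).
        results[qid] = [(hit.docid, hit.score, judged.get(hit.docid)) for hit in hits]
    return results
```

The slim variant is enough for bag-of-words retrieval; switching `index_name` to `msmarco-v2.1-doc` would additionally allow fetching the raw text of each hit.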

diff --git a/pyserini/prebuilt_index_info.py b/pyserini/prebuilt_index_info.py
@@ -561,6 +561,94 @@
         "documents": 138364198,
         "unique_terms": 41177061,
         "downloaded": False
+    },
+
+    # MS MARCO V2.1 document corpus, three indexes with different amounts of information (and sizes).
+    "msmarco-v2.1-doc": {
+        "description": "Lucene index of the MS MARCO V2.1 document corpus.",
+        "filename": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.tar.gz",
+        "readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
+        "urls": [
+            "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.tar.gz",
+        ],
+        "md5": "cecd55856c34afa82f1a499705c9df02",
+        "size compressed (bytes)": 54190811494,
+        "total_terms": 12710796540,
+        "documents": 10960555,
+        "unique_terms": 44599151,
+        "downloaded": False
+    },
+    "msmarco-v2.1-doc-slim": {
+        "description": "Lucene index of the MS MARCO V2.1 document corpus ('slim' version).",
+        "filename": "lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675.tar.gz",
+        "readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
+        "urls": [
+            "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675.tar.gz",
+        ],
+        "md5": "2c31a8c0a7133eb6ea04c91ceffa7e08",
+        "size compressed (bytes)": 6191736133,
+        "total_terms": 12710796540,
+        "documents": 10960555,
+        "unique_terms": 44599151,
+        "downloaded": False
+    },
+    "msmarco-v2.1-doc-full": {
+        "description": "Lucene index of the MS MARCO V2.1 document corpus ('full' version).",
+        "filename": "lucene-inverted.msmarco-v2.1-doc-full.20240418.4f9675.tar.gz",
+        "readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
+        "urls": [
+            "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-full.20240418.4f9675.tar.gz",
+        ],
+        "md5": "794708c9cf8fc6eafff07c5485e934b9",
+        "size compressed (bytes)": 102997532522,
+        "total_terms": 12710796540,
+        "documents": 10960555,
+        "unique_terms": 44599151,
+        "downloaded": False
+    },
+
+    # MS MARCO V2.1 segmented document corpus, three indexes with different amounts of information (and sizes).
+    "msmarco-v2.1-doc-segmented": {
+        "description": "Lucene index of the MS MARCO V2.1 segmented document corpus.",
+        "filename": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.tar.gz",
+        "readme": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md",
+        "urls": [
+            "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.tar.gz"
+        ],
+        "md5": "6ec4cd595c9fe1ad91b43eabb39a637c",
+        "size compressed (bytes)": 60071133069,
+        "total_terms": 22707699649,
+        "documents": 113520750,
+        "unique_terms": 29040364,
+        "downloaded": False
+    },
+    "msmarco-v2.1-doc-segmented-slim": {
+        "description": "Lucene index of the MS MARCO V2.1 segmented document corpus ('slim' version).",
+        "filename": "lucene-inverted.msmarco-v2.1-doc-segmented-slim.20240418.4f9675.tar.gz",
+        "readme": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md",
+        "urls": [
+            "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-segmented-slim.20240418.4f9675.tar.gz"
+        ],
+        "md5": "3c6c946c722a201b65903a92f082ea4f",
+        "size compressed (bytes)": 15374492909,
+        "total_terms": 22707699649,
+        "documents": 113520750,
+        "unique_terms": 29040364,
+        "downloaded": False
+    },
+    "msmarco-v2.1-doc-segmented-full": {
+        "description": "Lucene index of the MS MARCO V2.1 segmented document corpus ('full' version).",
+        "filename": "lucene-inverted.msmarco-v2.1-doc-segmented-full.20240418.4f9675.tar.gz",
+        "readme": "lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md",
+        "urls": [
+            "https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v2.1-doc-segmented-full.20240418.4f9675.tar.gz"
+        ],
+        "md5": "a43d09d31ae4e5dac81f5cfde1a810a7",
+        "size compressed (bytes)": 146130406504,
+        "total_terms": 22707699649,
+        "documents": 113520750,
+        "unique_terms": 29040364,
+        "downloaded": False
     }
 }
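
The `md5` and `size compressed (bytes)` fields above can be used to sanity-check a manually downloaded tarball. The sketch below is not Pyserini's own download code; the helper names `md5_of_file` and `matches_entry` are hypothetical, and `ENTRY` copies only the fields it needs from the `msmarco-v2.1-doc-slim` entry.

```python
import hashlib
import os
import tempfile

# Fields copied from the "msmarco-v2.1-doc-slim" registry entry above.
ENTRY = {
    "filename": "lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675.tar.gz",
    "md5": "2c31a8c0a7133eb6ea04c91ceffa7e08",
    "size compressed (bytes)": 6191736133,
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a multi-gigabyte tarball never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_entry(path, entry):
    """Return True if both the file size and the MD5 match the registry entry."""
    if os.path.getsize(path) != entry["size compressed (bytes)"]:
        return False  # cheap check first; avoids hashing a truncated download
    return md5_of_file(path) == entry["md5"]


# Demo on a tiny temporary file standing in for a real download.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(b"hello")
    demo_md5 = md5_of_file(demo_path)
    demo_ok = matches_entry(demo_path, {"md5": demo_md5, "size compressed (bytes)": 5})
    demo_bad = matches_entry(demo_path, ENTRY)
```

Checking the size before hashing is a design choice: it rejects incomplete downloads without reading the whole file.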
 

diff --git a/...x-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md b/...x-metadata/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675.README.md
@@ -0,0 +1,34 @@
+# msmarco-v2.1-doc-segmented
+
+Lucene inverted index of the MS MARCO V2.1 segmented document corpus.
+
+Note that there are three variants of this index:
+
++ `msmarco-v2.1-doc-segmented` (56G uncompressed): the "default" version, which stores term frequencies and the raw text. This supports bag-of-words queries, but no phrase queries and no relevance feedback (unless the documents are parsed on the fly).
++ `msmarco-v2.1-doc-segmented-slim` (15G uncompressed): the "slim" version, which stores term frequencies only. This supports bag-of-words queries, but no phrase queries and no relevance feedback. There is no way to fetch the raw text from this index.
++ `msmarco-v2.1-doc-segmented-full` (137G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. This supports bag-of-words queries, phrase queries, and relevance feedback.
+
+These indexes were generated on 2024/04/19 at Anserini commit [`4f9675`](https://github.com/castorini/anserini/commit/4f967519baa1bc634f7dd2998d7a408c27120b1c) on `tuna` with the following commands:
+
+```bash
+nohup bin/run.sh io.anserini.index.IndexCollection \
+  -collection MsMarcoV2DocCollection \
+  -input /mnt/collections/msmarco/msmarco_v2.1_doc_segmented/ \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.msmarco-v2.1-doc-segmented.20240418.4f9675/ \
+  -threads 8 -storeRaw -optimize >& logs/log.msmarco-v2.1-doc-segmented.20240418.4f9675.txt &
+
+nohup bin/run.sh io.anserini.index.IndexCollection \
+  -collection MsMarcoV2DocCollection \
+  -input /mnt/collections/msmarco/msmarco_v2.1_doc_segmented/ \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.msmarco-v2.1-doc-segmented-slim.20240418.4f9675/ \
+  -threads 8 -optimize >& logs/log.msmarco-v2.1-doc-segmented-slim.20240418.4f9675.txt &
+
+nohup bin/run.sh io.anserini.index.IndexCollection \
+  -collection MsMarcoV2DocCollection \
+  -input /mnt/collections/msmarco/msmarco_v2.1_doc_segmented/ \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.msmarco-v2.1-doc-segmented-full.20240418.4f9675/ \
+  -threads 8 -storePositions -storeDocvectors -storeRaw -optimize >& logs/log.msmarco-v2.1-doc-segmented-full.20240418.4f9675.txt &
+```
diff --git a/...urces/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md b/...urces/index-metadata/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md
@@ -0,0 +1,34 @@
+# msmarco-v2.1-doc
+
+Lucene inverted index of the MS MARCO V2.1 document corpus.
+
+Note that there are three variants of this index:
+
++ `msmarco-v2.1-doc` (51G uncompressed): the "default" version, which stores term frequencies and the raw text. This supports bag-of-words queries, but no phrase queries and no relevance feedback (unless the documents are parsed on the fly).
++ `msmarco-v2.1-doc-slim` (5.8G uncompressed): the "slim" version, which stores term frequencies only. This supports bag-of-words queries, but no phrase queries and no relevance feedback. There is no way to fetch the raw text from this index.
++ `msmarco-v2.1-doc-full` (96G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. This supports bag-of-words queries, phrase queries, and relevance feedback.
+
+These indexes were generated on 2024/04/19 at Anserini commit [`4f9675`](https://github.com/castorini/anserini/commit/4f967519baa1bc634f7dd2998d7a408c27120b1c) on `tuna` with the following commands:
+
+```bash
+nohup bin/run.sh io.anserini.index.IndexCollection \
+  -collection MsMarcoV2DocCollection \
+  -input /mnt/collections/msmarco/msmarco_v2.1_doc/ \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.msmarco-v2.1-doc.20240418.4f9675/ \
+  -threads 8 -storeRaw -optimize >& logs/log.msmarco-v2.1-doc.20240418.4f9675.txt &
+
+nohup bin/run.sh io.anserini.index.IndexCollection \
+  -collection MsMarcoV2DocCollection \
+  -input /mnt/collections/msmarco/msmarco_v2.1_doc/ \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.msmarco-v2.1-doc-slim.20240418.4f9675/ \
+  -threads 8 -optimize >& logs/log.msmarco-v2.1-doc-slim.20240418.4f9675.txt &
+
+nohup bin/run.sh io.anserini.index.IndexCollection \
+  -collection MsMarcoV2DocCollection \
+  -input /mnt/collections/msmarco/msmarco_v2.1_doc/ \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.msmarco-v2.1-doc-full.20240418.4f9675/ \
+  -threads 8 -storePositions -storeDocvectors -storeRaw -optimize >& logs/log.msmarco-v2.1-doc-full.20240418.4f9675.txt &
+```
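
The six indexing commands in the two READMEs differ only in the corpus name and the storage flags. A minimal sketch of that pattern follows; the function `index_command` and the table `VARIANT_FLAGS` are hypothetical helpers, and the `nohup` wrapper and log redirection from the original commands are left out.

```python
# Flags that differ between the three variants, as listed in the READMEs above.
VARIANT_FLAGS = {
    "": ["-storeRaw"],                                              # default
    "-slim": [],                                                    # term frequencies only
    "-full": ["-storePositions", "-storeDocvectors", "-storeRaw"],  # everything
}


def index_command(corpus, variant, stamp="20240418.4f9675", threads=8):
    """Return the IndexCollection argument list for one corpus/variant pair."""
    name = f"lucene-inverted.{corpus}{variant}.{stamp}"
    # The input directories use underscores where the index names use hyphens.
    input_dir = "/mnt/collections/msmarco/" + corpus.replace("-", "_") + "/"
    return [
        "bin/run.sh", "io.anserini.index.IndexCollection",
        "-collection", "MsMarcoV2DocCollection",
        "-input", input_dir,
        "-generator", "DefaultLuceneDocumentGenerator",
        "-index", f"indexes/{name}/",
        "-threads", str(threads),
        *VARIANT_FLAGS[variant],
        "-optimize",
    ]


if __name__ == "__main__":
    for corpus in ("msmarco-v2.1-doc", "msmarco-v2.1-doc-segmented"):
        for variant in VARIANT_FLAGS:
            print(" ".join(index_command(corpus, variant)))
```

Every variant keeps `-optimize`, so each index is merged into a single segment.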
diff --git a/pyserini/search/_base.py b/pyserini/search/_base.py
@@ -82,14 +82,18 @@ def safe_getattr(cls, attr):
     'dl20-unicoil': 'TREC2020_DL_UNICOIL',
     'dl20-unicoil-noexp': 'TREC2020_DL_UNICOIL_NOEXP',
     'dl21': 'TREC2021_DL',
+    'dl21-doc': 'TREC2021_DL',
     'dl21-unicoil': 'TREC2021_DL_UNICOIL',
     'dl21-unicoil-noexp': 'TREC2021_DL_UNICOIL_NOEXP',
     'dl22': 'TREC2022_DL',
+    'dl22-doc': 'TREC2022_DL',
     'dl22-unicoil': 'TREC2022_DL_UNICOIL',
     'dl22-unicoil-noexp': 'TREC2022_DL_UNICOIL_NOEXP',
     'dl23': 'TREC2023_DL',
+    'dl23-doc': 'TREC2023_DL',
     'dl23-unicoil': 'TREC2023_DL_UNICOIL',
     'dl23-unicoil-noexp': 'TREC2023_DL_UNICOIL_NOEXP',
+    'rag24.raggy-dev': 'TREC2024_RAG_RAGGY_DEV',
     'msmarco-doc-dev': 'MSMARCO_DOC_DEV',
     'msmarco-doc-dev-unicoil': 'MSMARCO_DOC_DEV_UNICOIL',
     'msmarco-doc-dev-unicoil-noexp': 'MSMARCO_DOC_DEV_UNICOIL_NOEXP',
@@ -102,9 +106,11 @@ def safe_getattr(cls, attr):
     'msmarco-passage-dev-subset-distill-splade-max': 'MSMARCO_PASSAGE_DEV_SUBSET_DISTILL_SPLADE_MAX',
     'msmarco-passage-test-subset': 'MSMARCO_PASSAGE_TEST_SUBSET',
     'msmarco-v2-doc-dev': 'MSMARCO_V2_DOC_DEV',
+    'msmarco-v2-doc.dev': 'MSMARCO_V2_DOC_DEV',
     'msmarco-v2-doc-dev-unicoil': 'MSMARCO_V2_DOC_DEV_UNICOIL',
     'msmarco-v2-doc-dev-unicoil-noexp': 'MSMARCO_V2_DOC_DEV_UNICOIL_NOEXP',
     'msmarco-v2-doc-dev2': 'MSMARCO_V2_DOC_DEV2',
+    'msmarco-v2-doc.dev2': 'MSMARCO_V2_DOC_DEV2',
     'msmarco-v2-doc-dev2-unicoil': 'MSMARCO_V2_DOC_DEV2_UNICOIL',
     'msmarco-v2-doc-dev2-unicoil-noexp': 'MSMARCO_V2_DOC_DEV2_UNICOIL_NOEXP',
     'msmarco-v2-passage-dev': 'MSMARCO_V2_PASSAGE_DEV',
@@ -419,12 +425,18 @@ def safe_getattr(cls, attr):
     'dl22-passage': 'TREC2022_DL_PASSAGE',
     'dl23-doc': 'TREC2023_DL_DOC',
     'dl23-passage': 'TREC2023_DL_PASSAGE',
+    'dl21-doc-msmarco-v2.1': 'TREC2021_DL_DOC_MSMARCO_V21',
+    'dl22-doc-msmarco-v2.1': 'TREC2022_DL_DOC_MSMARCO_V21',
+    'dl23-doc-msmarco-v2.1': 'TREC2023_DL_DOC_MSMARCO_V21',
+    'rag24.raggy-dev': 'TREC2024_RAG_RAGGY_DEV',
     'msmarco-doc-dev': 'MSMARCO_DOC_DEV',
     'msmarco-passage-dev-subset': 'MSMARCO_PASSAGE_DEV_SUBSET',
     'msmarco-v2-doc-dev': 'MSMARCO_V2_DOC_DEV',
     'msmarco-v2-doc-dev2': 'MSMARCO_V2_DOC_DEV2',
     'msmarco-v2-passage-dev': 'MSMARCO_V2_PASSAGE_DEV',
     'msmarco-v2-passage-dev2': 'MSMARCO_V2_PASSAGE_DEV2',
+    'msmarco-v2.1-doc.dev': 'MSMARCO_V21_DOC_DEV',
+    'msmarco-v2.1-doc.dev2': 'MSMARCO_V21_DOC_DEV2',
     'ntcir8-zh': 'NTCIR8_ZH',
     'clef2006-fr': 'CLEF2006_FR',
     'trec2002-ar': 'TREC2002_AR',
@@ -565,6 +577,7 @@ def safe_getattr(cls, attr):
 topics_mapping = {k: v for k, v in topics_mapping.items() if v is not None}
 qrels_mapping = {k: v for k, v in qrels_mapping.items() if v is not None}
 
+
 def get_topics(collection_name):
     """
     Parameters
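
The mapping changes above add alias keys (for example `msmarco-v2-doc.dev` alongside `msmarco-v2-doc-dev`) that point at the same constant name, and the `if v is not None` filter drops keys whose constant cannot be resolved. The sketch below illustrates that behaviour in isolation. It is an assumption about how the pieces fit together, not the actual contents of `_base.py`: `FakeTopics`, its placeholder value, the body of `safe_getattr`, and `name_mapping` are all hypothetical.

```python
class FakeTopics:
    """Stand-in for the Java-backed topics enum; only one constant is defined."""
    MSMARCO_V2_DOC_DEV = "placeholder-for-msmarco-v2-doc-dev"


def safe_getattr(cls, attr):
    # Assumed behaviour: return None instead of raising for unknown constants.
    return getattr(cls, attr, None)


name_mapping = {
    "msmarco-v2-doc-dev": "MSMARCO_V2_DOC_DEV",
    "msmarco-v2-doc.dev": "MSMARCO_V2_DOC_DEV",    # alias for the key above
    "rag24.raggy-dev": "TREC2024_RAG_RAGGY_DEV",   # not defined on FakeTopics
}

# Resolve constant names, then apply the same None filter as in the diff.
topics_mapping = {k: safe_getattr(FakeTopics, v) for k, v in name_mapping.items()}
topics_mapping = {k: v for k, v in topics_mapping.items() if v is not None}
```

With this filter, a key whose constant is missing from the underlying enum is silently absent from the mapping, while both aliases resolve to the same value.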

diff --git a/scripts/generate_docs_from_prebuilt_indexes.py b/scripts/generate_docs_from_prebuilt_indexes.py
@@ -14,20 +14,15 @@
 # limitations under the License.
 #
 
-import os
-import sys
-
-# Use Pyserini in this repo (as opposed to pip install)
-sys.path.insert(0, './')
 
 from pyserini.prebuilt_index_info import *
 
 
 __boilerplate__ = '''
 # Pyserini: Prebuilt Indexes
 
-Pyserini provides a number of pre-built Lucene indexes.
-To list what's available in code:
+Pyserini provides a number of prebuilt Lucene indexes.
+To list what's available:
 
 ```python
 from pyserini.search.lucene import LuceneSearcher
@@ -37,21 +32,21 @@
 IndexReader.list_prebuilt_indexes()
 ```
 
-It's easy initialize a searcher from a pre-built index:
+It's easy to initialize a searcher from a prebuilt index:
 
 ```python
 searcher = LuceneSearcher.from_prebuilt_index('robust04')
 ```
 
-You can use this simple Python one-liner to download the pre-built index:
+You can use this simple Python one-liner to download the prebuilt index:
 
 ```
 python -c "from pyserini.search.lucene import LuceneSearcher; LuceneSearcher.from_prebuilt_index('robust04')"
 ```
 
 The downloaded index will be in `~/.cache/pyserini/indexes/`.
 
-It's similarly easy initialize an index reader from a pre-built index:
+It's similarly easy to initialize an index reader from a prebuilt index:
 
 ```python
 index_reader = IndexReader.from_prebuilt_index('robust04')
@@ -67,8 +62,22 @@
 Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
 Nope, that's not a bug.
 
-Below is a summary of the pre-built indexes that are currently available.
-Detailed configuration information for the pre-built indexes are stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
+Pyserini also provides a number of prebuilt Faiss indexes.
+To list what's available:
+
+```python
+from pyserini.search.faiss import FaissSearcher
+FaissSearcher.list_prebuilt_indexes()
+```
+
+And to initialize a specific Faiss index:
+
+```python
+searcher = FaissSearcher.from_prebuilt_index('msmarco-v1-passage.bge-base-en-v1.5', None)
+```
+
+Below is a summary of the prebuilt indexes that are currently available.
+Detailed configuration information for the prebuilt indexes is stored in [`pyserini/prebuilt_index_info.py`](../pyserini/prebuilt_index_info.py).
 
 '''
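
The boilerplate above is followed in the generated page by one `<dt>`/`<dd>` entry per index. As a rough sketch of how a registry entry could be turned into that markup (the function `render_entry` is hypothetical and is not taken from the script), using the layout visible in the `docs/prebuilt-indexes.md` diff:

```python
def render_entry(name, info):
    """Render one registry entry in the <dt>/<dd> layout of docs/prebuilt-indexes.md."""
    readme = "../pyserini/resources/index-metadata/" + info["readme"]
    return (
        f"<dt></dt><b><code>{name}</code></b>\n"
        f'[<a href="{readme}">readme</a>]\n'
        f"<dd>{info['description']}\n"
        f"</dd>"
    )


# Fields copied from the "msmarco-v2.1-doc" entry in prebuilt_index_info.py.
entry = {
    "description": "Lucene index of the MS MARCO V2.1 document corpus.",
    "readme": "lucene-inverted.msmarco-v2.1-doc.20240418.4f9675.README.md",
}
html = render_entry("msmarco-v2.1-doc", entry)
```

Generating the entries from the registry keeps the documentation and `prebuilt_index_info.py` from drifting apart.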