Replication and tweaks of 20 Newsgroups (castorini#1204)
lintool authored May 17, 2020
1 parent 5506a3f commit 793d92c
Showing 2 changed files with 43 additions and 31 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -92,6 +92,7 @@ For the most part, manual copying and pasting of commands into a shell is requir

+ [Working with AI2's COVID-19 Open Research Dataset](docs/experiments-cord19.md)
+ [Baselines for the TREC-COVID Challenge](docs/experiments-covid.md)
+ [Working with the 20 Newsgroups Dataset](docs/experiments-20newsgroups.md)
+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to running BM25 baselines on the MS MARCO Passage Retrieval Task](docs/experiments-msmarco-passage.md)
+ [Guide to running BM25 baselines on the MS MARCO Document Retrieval Task](docs/experiments-msmarco-doc.md)
73 changes: 42 additions & 31 deletions docs/experiments-20newsgroups.md
@@ -1,71 +1,82 @@
# Anserini: 20 Newsgroups
# Anserini: Working with the 20 Newsgroups Dataset

This page contains instructions for how to index the 20 Newsgroups dataset.
This page contains instructions for how to index the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/).

## Data Prep

We're going to use `20newsgroups/` as the working directory.
There are many versions of the 20 Newsgroups dataset available on the web; we're specifically going to use [this one](http://qwone.com/~jason/20Newsgroups/) (the "bydate" version).
We're going to use `collections/20newsgroups/` as the working directory.
First, we need to download and extract the dataset:

```sh
```bash
mkdir -p collections/20newsgroups/
wget -nc http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -P collections/20newsgroups
tar -xvzf collections/20newsgroups/20news-bydate.tar.gz -C collections/20newsgroups
```

To confirm, `20news-bydate.tar.gz` should have MD5 checksum of `d6e9e45cb8cb77ec5276dfa6dfc14318`.
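To verify the download against the checksum above, a quick check (a sketch assuming GNU coreutils; on macOS, `md5 -q` replaces `md5sum`):

```bash
# Verify the tarball against the MD5 checksum given above.
expected="d6e9e45cb8cb77ec5276dfa6dfc14318"
actual=$(md5sum collections/20newsgroups/20news-bydate.tar.gz | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
  echo "checksum OK"
else
  echo "checksum mismatch: got $actual" >&2
fi
```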

After untaring, you should see the following two folders:
```
After unpacking, you should see the following two folders:

```bash
ls collections/20newsgroups/20news-bydate-test
ls collections/20newsgroups/20news-bydate-train
```

There are docs with the same id in different categories.
For example, doc `123` can exists in `misc.forsale` & `sci.crypt` even if the two docs have different text. Hence we need prune the dataset by ensuring each doc has a unique id.
To prune and merge them into one folder:
```
For example, doc `123` exists in `misc.forsale` and `sci.crypt`, with different texts.
Since we assume unique docids when building an index, we need to clean the dataset first.
To prune and merge both train and test splits into one folder:

```bash
python src/main/python/20newsgroups/prune_and_merge.py \
--paths collections/20newsgroups/20news-bydate-test \
collections/20newsgroups/20news-bydate-train \
--paths collections/20newsgroups/20news-bydate-test collections/20newsgroups/20news-bydate-train \
--out collections/20newsgroups/20news-bydate
```
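To see which docids the script has to resolve, one way to list filenames (the docids) that appear in more than one newsgroup folder (a sketch assuming standard Unix tools):

```bash
# Print docids (filenames) that occur in more than one newsgroup folder.
find collections/20newsgroups/20news-bydate-train -type f \
  | awk -F/ '{print $NF}' \
  | sort | uniq -d | head
```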

Now you should see train & test merged into one folder in `20newsgroups/20news-bydate/`.
Now you should see the train and test splits merged into one folder in `collections/20newsgroups/20news-bydate/`.
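As a sanity check, the merged folder should contain one file per unique docid; the count should match the train + test total of 18,846 documents reported in the indexing table:

```bash
# Count documents in the merged folder; expect 18,846 (train + test).
find collections/20newsgroups/20news-bydate -type f | wc -l
```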

# Indexing
## Indexing

To index train & test together:
```
To index train and test together:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate \
-index indexes/lucene-index.20newsgroups.pos+docvectors+raw \
-index indexes/lucene-index.20newsgroups.all \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw
-storePositions -storeDocvectors -storeRaw -optimize
```

To index the train set:
```
To index just the train set:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate-train \
-index indexes/lucene-index.20newsgroups.train.pos+docvectors+raw \
-index indexes/lucene-index.20newsgroups.train \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw
-storePositions -storeDocvectors -storeRaw -optimize
```

To index the test set:
```
To index just the test set:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate-test \
-index indexes/lucene-index.20newsgroups.test.pos+docvectors+raw \
-index indexes/lucene-index.20newsgroups.test \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw
-storePositions -storeDocvectors -storeRaw -optimize
```

You should see similar states as the table below.
Indexing should take just a few seconds.
For reference, the number of docs indexed should be exactly as follows:

| | # of docs | pre-built index |
|---------------|----------:|-----------------|
| Train | 11,314 | [[download](https://www.dropbox.com/s/npg5eovr92h5k7w/lucene-index.20newsgroups.train.tar.gz)]
| Test | 7,532 | [[download](https://www.dropbox.com/s/aptj8hz9wti3qaf/lucene-index.20newsgroups.test.tar.gz)]
| Train + Test | 18,846 | [[download](https://www.dropbox.com/s/qo2wt6fzu01yt4c/lucene-index.20newsgroups.all.tar.gz)]

For convenience, we also provide pre-built indexes above.

| | Index Duration | # of docs |
|---------------|-----------------|-----------|
| Train | ~12 seconds | 11,314 |
| Test | ~6 seconds | 7,532 |
| Train + Test | ~15 seconds | 18,846 |
## Replication Log
