Replication and tweaks of 20 Newsgroups (castorini#1204)
lintool authored May 17, 2020
1 parent 5506a3f commit 793d92c
Showing 2 changed files with 43 additions and 31 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -92,6 +92,7 @@ For the most part, manual copying and pasting of commands into a shell is requir

+ [Working with AI2's COVID-19 Open Research Dataset](docs/experiments-cord19.md)
+ [Baselines for the TREC-COVID Challenge](docs/experiments-covid.md)
+ [Working with the 20 Newsgroups Dataset](docs/experiments-20newsgroups.md)
+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to running BM25 baselines on the MS MARCO Passage Retrieval Task](docs/experiments-msmarco-passage.md)
+ [Guide to running BM25 baselines on the MS MARCO Document Retrieval Task](docs/experiments-msmarco-doc.md)
73 changes: 42 additions & 31 deletions docs/experiments-20newsgroups.md
@@ -1,71 +1,82 @@
# Anserini: 20 Newsgroups
# Anserini: Working with the 20 Newsgroups Dataset

This page contains instructions for how to index the 20 Newsgroups dataset.
This page contains instructions for how to index the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/).

## Data Prep

We're going to use `20newsgroups/` as the working directory.
There are many versions of the 20 Newsgroups dataset available on the web; we're specifically going to use [this one](http://qwone.com/~jason/20Newsgroups/) (the "bydate" version).
We're going to use `collections/20newsgroups/` as the working directory.
First, we need to download and extract the dataset:

```sh
```bash
mkdir -p collections/20newsgroups/
wget -nc http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -P collections/20newsgroups
tar -xvzf collections/20newsgroups/20news-bydate.tar.gz -C collections/20newsgroups
```

To confirm, `20news-bydate.tar.gz` should have MD5 checksum of `d6e9e45cb8cb77ec5276dfa6dfc14318`.
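To verify the download against the checksum above, a quick check (a sketch assuming GNU coreutils; on macOS, `md5 -q` replaces `md5sum`):

```bash
# Verify the tarball against the MD5 checksum given above.
expected="d6e9e45cb8cb77ec5276dfa6dfc14318"
actual=$(md5sum collections/20newsgroups/20news-bydate.tar.gz | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
  echo "checksum OK"
else
  echo "checksum mismatch: got $actual" >&2
fi
```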

After untaring, you should see the following two folders:
```
After unpacking, you should see the following two folders:

```bash
ls collections/20newsgroups/20news-bydate-test
ls collections/20newsgroups/20news-bydate-train
```

There are docs with the same id in different categories.
For example, doc `123` can exists in `misc.forsale` & `sci.crypt` even if the two docs have different text. Hence we need prune the dataset by ensuring each doc has a unique id.
To prune and merge them into one folder:
```
For example, doc `123` exists in `misc.forsale` and `sci.crypt`, with different texts.
Since we assume unique docids when building an index, we need to clean the dataset first.
To prune and merge both train and test splits into one folder:

```bash
python src/main/python/20newsgroups/prune_and_merge.py \
--paths collections/20newsgroups/20news-bydate-test \
collections/20newsgroups/20news-bydate-train \
--paths collections/20newsgroups/20news-bydate-test collections/20newsgroups/20news-bydate-train \
--out collections/20newsgroups/20news-bydate
```
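To see which docids the script has to resolve, one way to list filenames (the docids) that appear in more than one newsgroup folder (a sketch assuming standard Unix tools):

```bash
# Print docids (filenames) that occur in more than one newsgroup folder.
find collections/20newsgroups/20news-bydate-train -type f \
  | awk -F/ '{print $NF}' \
  | sort | uniq -d | head
```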

Now you should see train & test merged into one folder in `20newsgroups/20news-bydate/`.
Now you should see the train and test splits merged into one folder in `collections/20newsgroups/20news-bydate/`.
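As a sanity check, the merged folder should contain one file per unique docid; the count should match the train + test total of 18,846 documents reported in the indexing table:

```bash
# Count documents in the merged folder; expect 18,846 (train + test).
find collections/20newsgroups/20news-bydate -type f | wc -l
```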

# Indexing
## Indexing

To index train & test together:
```
To index train and test together:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate \
-index indexes/lucene-index.20newsgroups.pos+docvectors+raw \
-index indexes/lucene-index.20newsgroups.all \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw
-storePositions -storeDocvectors -storeRaw -optimize
```

To index the train set:
```
To index just the train set:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate-train \
-index indexes/lucene-index.20newsgroups.train.pos+docvectors+raw \
-index indexes/lucene-index.20newsgroups.train \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw
-storePositions -storeDocvectors -storeRaw -optimize
```

To index the test set:
```
To index just the test set:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
-input collections/20newsgroups/20news-bydate-test \
-index indexes/lucene-index.20newsgroups.test.pos+docvectors+raw \
-index indexes/lucene-index.20newsgroups.test \
-generator DefaultLuceneDocumentGenerator -threads 2 \
-storePositions -storeDocvectors -storeRaw
-storePositions -storeDocvectors -storeRaw -optimize
```

You should see similar states as the table below.
Indexing should take just a few seconds.
For reference, the number of docs indexed should be exactly as follows:

| | # of docs | pre-built index |
|---------------|----------:|-----------------|
| Train | 11,314 | [[download](https://www.dropbox.com/s/npg5eovr92h5k7w/lucene-index.20newsgroups.train.tar.gz)]
| Test | 7,532 | [[download](https://www.dropbox.com/s/aptj8hz9wti3qaf/lucene-index.20newsgroups.test.tar.gz)]
| Train + Test | 18,846 | [[download](https://www.dropbox.com/s/qo2wt6fzu01yt4c/lucene-index.20newsgroups.all.tar.gz)]

For convenience, we also provide pre-built indexes above.

| | Index Duration | # of docs |
|---------------|-----------------|-----------|
| Train | ~12 seconds | 11,314 |
| Test | ~6 seconds | 7,532 |
| Train + Test | ~15 seconds | 18,846 |
## Replication Log
