Pyserini: Anserini Integration with Python

Pyserini provides a simple Python interface to the Anserini IR toolkit via pyjnius.

A low-effort way to try out Pyserini is to look at our online notebooks, which will allow you to get started with just a few clicks. For convenience, we've pre-built a few common indexes, available to download here.

Pyserini versions adopt the convention of X.Y.Z.W, where X.Y.Z tracks the version of Anserini, and W is used to distinguish different releases on the Python end. The current stable release of Pyserini is v0.9.2.0 on PyPI. The current experimental release of Pyserini on TestPyPI is behind the current stable release (i.e., do not use). In general, documentation is kept up to date with the latest code in the repo.

If you're looking to work with the COVID-19 Open Research Dataset (CORD-19), start with this guide.

Installation

Install via PyPI

pip install pyserini==0.9.2.0

Simple Usage

Here's a sample pre-built index on TREC Disks 4 & 5 to play with (used in the TREC 2004 Robust Track):

wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz
mkdir indexes
tar xvfz index-robust04-20191213.tar.gz -C indexes
rm index-robust04-20191213.tar.gz

Use the SimpleSearcher for searching:

from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('indexes/index-robust04-20191213/')
hits = searcher.search('hubble space telescope')

# Print the first 10 hits:
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

The results should be as follows:

 1 LA071090-0047   16.85690
 2 FT934-5418      16.75630
 3 FT921-7107      16.68290
 4 LA052890-0021   16.37390
 5 LA070990-0052   16.36460
 6 LA062990-0180   16.19260
 7 LA070890-0154   16.15610
 8 FT934-2516      16.08950
 9 LA041090-0148   16.08810
10 FT944-128       16.01920

To further examine the results:

# Grab the raw text:
hits[0].raw

# Grab the raw Lucene Document:
hits[0].lucene_document

Configure BM25 parameters and use RM3 query expansion:

# Set BM25 parameters to k1=0.9, b=0.4:
searcher.set_bm25(0.9, 0.4)
# Set RM3 parameters: 10 expansion terms, 10 feedback documents,
# and an original query weight of 0.5:
searcher.set_rm3(10, 10, 0.5)

hits2 = searcher.search('hubble space telescope')

# Print the first 10 hits:
for i in range(0, 10):
    print(f'{i+1:2} {hits2[i].docid:15} {hits2[i].score:.5f}')

Usage of the Analyzer API

Pyserini exposes Lucene Analyzers in Python with the Analyzer class. Here's a demonstration of its functionality:

from pyserini.analysis import pyanalysis

# Default analyzer for English uses the Porter stemmer:
analyzer = pyanalysis.Analyzer(pyanalysis.get_lucene_analyzer())
tokens = analyzer.analyze('City buses are running on time.')
print(tokens)
# Result is ['citi', 'buse', 'run', 'time']

# We can explicitly specify the Porter stemmer as follows:
analyzer = pyanalysis.Analyzer(pyanalysis.get_lucene_analyzer(stemmer='porter'))
tokens = analyzer.analyze('City buses are running on time.')
print(tokens)
# Result is same as above.

# We can explicitly specify the Krovetz stemmer as follows:
analyzer = pyanalysis.Analyzer(pyanalysis.get_lucene_analyzer(stemmer='krovetz'))
tokens = analyzer.analyze('City buses are running on time.')
print(tokens)
# Result is ['city', 'bus', 'running', 'time']

# Create an analyzer that doesn't stem, simply tokenizes:
analyzer = pyanalysis.Analyzer(pyanalysis.get_lucene_analyzer(stemming=False))
tokens = analyzer.analyze('City buses are running on time.')
print(tokens)
# Result is ['city', 'buses', 'running', 'time']
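
Later Pyserini releases expose a stopword-removal flag on the same helper. Here's a sketch under that assumption (the stopwords parameter may not exist in v0.9.2.0, so verify against your version):

# Sketch, assuming get_lucene_analyzer() accepts a stopwords flag:
analyzer = pyanalysis.Analyzer(pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False))
tokens = analyzer.analyze('City buses are running on time.')
print(tokens)
# Expected result retains the stopwords:
# ['city', 'buses', 'are', 'running', 'on', 'time']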

Usage of the Query Builder API

The pyquerybuilder module provides functionality for constructing Lucene queries through Pyserini; these queries can be issued directly through the SimpleSearcher. Instead of issuing the query hubble space telescope directly, we can construct exactly the same query manually as follows:

from pyserini.search import pyquerybuilder

# First, create term queries for each individual query term:
term1 = pyquerybuilder.get_term_query('hubble')
term2 = pyquerybuilder.get_term_query('space')
term3 = pyquerybuilder.get_term_query('telescope')

# Then, assemble into a "bag of words" query:
should = pyquerybuilder.JBooleanClauseOccur['should'].value

boolean_query_builder = pyquerybuilder.get_boolean_query_builder()
boolean_query_builder.add(term1, should)
boolean_query_builder.add(term2, should)
boolean_query_builder.add(term3, should)

query = boolean_query_builder.build()

Then issue the query:

hits = searcher.search(query)

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

The results should be exactly the same as above.

By manually constructing queries, it is possible to define the boost for each query term individually. For example:

boost1 = pyquerybuilder.get_boost_query(term1, 2.)
boost2 = pyquerybuilder.get_boost_query(term2, 1.)
boost3 = pyquerybuilder.get_boost_query(term3, 1.)

should = pyquerybuilder.JBooleanClauseOccur['should'].value

boolean_query_builder = pyquerybuilder.get_boolean_query_builder()
boolean_query_builder.add(boost1, should)
boolean_query_builder.add(boost2, should)
boolean_query_builder.add(boost3, should)

query = boolean_query_builder.build()

hits = searcher.search(query)

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

Note that the results are different, because we've placed more weight on the term hubble.
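
The occur clauses aren't limited to should. As a sketch, assuming JBooleanClauseOccur mirrors Lucene's BooleanClause.Occur and also exposes a 'must' key, we can require a term to match while keeping the others optional:

# Sketch: require 'hubble' to appear in every hit; the 'must' key is
# an assumption based on Lucene's BooleanClause.Occur.MUST.
must = pyquerybuilder.JBooleanClauseOccur['must'].value
should = pyquerybuilder.JBooleanClauseOccur['should'].value

boolean_query_builder = pyquerybuilder.get_boolean_query_builder()
boolean_query_builder.add(term1, must)
boolean_query_builder.add(term2, should)
boolean_query_builder.add(term3, should)

query = boolean_query_builder.build()
hits = searcher.search(query)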

Usage of the Index Reader API

The IndexReaderUtils class provides methods for accessing and manipulating an inverted index.

IMPORTANT NOTE: Be aware of whether a method takes or returns analyzed or unanalyzed terms. "Analysis" refers to processing by a Lucene Analyzer, which typically includes tokenization, stemming, stopword removal, etc. If a method expects an unanalyzed term but is called with an analyzed one, it will reanalyze the term; since the analysis of an already-analyzed term is sometimes itself a valid term, the method can return incorrect results without triggering any warning or error.
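
To make the pitfall concrete, here's a small sketch using the Analyzer described earlier (the exact stems depend on the analyzer configuration):

from pyserini.analysis import pyanalysis

analyzer = pyanalysis.Analyzer(pyanalysis.get_lucene_analyzer())

# 'cities' analyzes to the stem 'citi':
print(analyzer.analyze('cities'))
# Re-analyzing the stem happens to be a no-op for this particular term,
# so a method that silently reanalyzes would still be correct here, but
# that is not guaranteed for every term:
print(analyzer.analyze('citi'))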

Initialize the class as follows:

from pyserini.index import pyutils
from pyserini.analysis import pyanalysis

index_utils = pyutils.IndexReaderUtils('indexes/index-robust04-20191213/')

Use terms() to grab an iterator over all terms in the collection, i.e., the dictionary. Note that these terms are analyzed. Here, we only print out the first 10:

import itertools
for term in itertools.islice(index_utils.terms(), 10):
    print(f'{term.term} (df={term.df}, cf={term.cf})')

Here's how to fetch term statistics for a particular (unanalyzed) query term, "cities" in this case:

term = 'cities'

# Look up its document frequency (df) and collection frequency (cf).
# Note, we use the unanalyzed form:
df, cf = index_utils.get_term_counts(term)
print(f'term "{term}": df={df}, cf={cf}')

What if we want to fetch term statistics for an analyzed term? This can be accomplished by setting the analyzer parameter to None:

term = 'cities'

# Analyze the term.
analyzed = index_utils.analyze(term)
print(f'The analyzed form of "{term}" is "{analyzed[0]}"')

# Skip term analysis:
df, cf = index_utils.get_term_counts(analyzed[0], analyzer=None)
print(f'term "{term}": df={df}, cf={cf}')

Here's how to fetch and traverse postings:

# Fetch and traverse postings for an unanalyzed term:
postings_list = index_utils.get_postings_list(term)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

# Fetch and traverse postings for an analyzed term:
postings_list = index_utils.get_postings_list(analyzed[0], analyzer=None)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

Here's how to fetch the document vector for a document:

doc_vector = index_utils.get_document_vector('FBIS4-67701')
print(doc_vector)

The result is a dictionary where the keys are the analyzed terms and the values are the term frequencies. To compute the tf-idf representation of a document, do something like this:

tf = index_utils.get_document_vector('FBIS4-67701')
df = {term: (index_utils.get_term_counts(term, analyzer=None))[0] for term in tf.keys()}

The two dictionaries hold the tf and df statistics, from which it is easy to assemble a tf-idf representation, as sketched below.
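
As an illustration, here's a minimal sketch of one common tf-idf variant (log idf); note that obtaining the collection size via stats() is an assumption borrowed from later Pyserini releases:

import math

# Assumption: the index reader exposes stats() with a 'documents' count;
# substitute the collection size by other means if it does not.
n_docs = index_utils.stats()['documents']

tfidf = {term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()}

However, the BM25 score often works better than tf-idf. To compute the BM25 score for a particular term in a document: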

# The keys of get_document_vector() are already analyzed, so we set analyzer to None:
bm25_score = index_utils.compute_bm25_term_weight('FBIS4-67701', 'citi', analyzer=None)
print(bm25_score)

# Alternatively, we pass in the unanalyzed term:
bm25_score = index_utils.compute_bm25_term_weight('FBIS4-67701', 'city')
print(bm25_score)

And so, to compute the BM25 vector of a document:

tf = index_utils.get_document_vector('FBIS4-67701')
bm25_vector = {term: index_utils.compute_bm25_term_weight('FBIS4-67701', term, analyzer=None) for term in tf.keys()}

Another useful feature is computing the score of a specific document with respect to a query, using the compute_query_document_score method. For example:

query = 'hubble space telescope'
docids = ['LA071090-0047', 'FT934-5418', 'FT921-7107', 'LA052890-0021', 'LA070990-0052']

for i, docid in enumerate(docids):
    score = index_utils.compute_query_document_score(docid, query)
    print(f'{i+1:2} {docid:15} {score:.5f}')

The scores should be very close (matching to about four decimal places) to the results above, but not exactly the same, because search performs additional score manipulation to break ties during ranking.

Usage of the Collection API

The collection classes provide interfaces for iterating over a collection and processing documents. Here's a demonstration on the CACM collection:

wget -O cacm.tar.gz https://github.com/castorini/anserini/blob/master/src/main/resources/cacm/cacm.tar.gz?raw=true
mkdir -p collections/cacm
tar xvfz cacm.tar.gz -C collections/cacm
rm cacm.tar.gz

Let's iterate through all documents in the collection:

from pyserini.collection import pycollection
from pyserini.index import pygenerator

collection = pycollection.Collection('HtmlCollection', 'collections/cacm/')
generator = pygenerator.Generator('DefaultLuceneDocumentGenerator')

for (i, fs) in enumerate(collection):
    for (j, doc) in enumerate(fs):
        parsed = generator.create_document(doc)
        docid = parsed.get('id')            # FIELD_ID
        raw = parsed.get('raw')             # FIELD_RAW
        contents = parsed.get('contents')   # FIELD_BODY
        preview = contents.strip().replace('\n', ' ')[:50]
        print(f'{i} {j} -> {docid} {preview}...')

Direct Interaction via Pyjnius

Alternatively, for parts of Anserini that have not yet been integrated into the Pyserini interface, you can interact with Anserini's Java classes directly via pyjnius. First, call Pyserini's setup helper to configure the JVM classpath:

from pyserini.setup import configure_classpath
configure_classpath('pyserini/resources/jars')

Now autoclass can be used to provide direct access to Java classes:

from jnius import autoclass

JString = autoclass('java.lang.String')
JIndexReaderUtils = autoclass('io.anserini.index.IndexReaderUtils')
reader = JIndexReaderUtils.getReader(JString('indexes/index-robust04-20191213/'))

# Fetch raw document contents by id:
rawdoc = JIndexReaderUtils.documentRaw(reader, JString('FT934-5418'))
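
Other helpers on Anserini's IndexReaderUtils can be invoked the same way. For example, the following sketch fetches the parsed document contents (assuming the documentContents method exists in the Anserini version bundled with your Pyserini release):

# Fetch parsed document contents by id; method name assumed from
# Anserini's IndexReaderUtils, so verify against your bundled version:
contents = JIndexReaderUtils.documentContents(reader, JString('FT934-5418'))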

Known Issues

Anserini is designed to work with JDK 11. A JRE path change introduced after JDK 9 breaks pyjnius 1.2.0, as documented in a pyjnius issue and also reported in Anserini. The issue was fixed in pyjnius 1.2.1 (released December 2019); the original error and its fix are documented in accompanying notebooks.
