This sample application contains the code, document schema and dependencies for running the examples from https://docs.vespa.ai/documentation/semantic-qa-retrieval.html, where we build a semantic end-to-end answer retrieval system based on the methodology described in the ReQA: An Evaluation for End-to-End Answer Retrieval Models paper released by Google on October 5th. We reproduce the Recall@K and MRR results reported in the paper on Vespa over the Stanford Question Answering Dataset (SQuAD).
We hope that this sample application can enable more research on semantic retrieval and also enable organizations to build powerful question answering applications using Vespa.
The results reported in ReQA: An Evaluation for End-to-End Answer Retrieval Models versus the Vespa implementation, for sentence level and paragraph level retrieval, are given in the tables below:
Sentence Level Retrieval
Model | MRR | R@1 | R@5 | R@10 |
---|---|---|---|---|
USE_QA for sentence answer retrieval | 0.539 | 0.439 | 0.656 | 0.727 |
USE_QA on Vespa using tensors | 0.538 | 0.438 | 0.656 | 0.726 |
Paragraph Level Retrieval
Model | MRR | R@1 | R@5 | R@10 |
---|---|---|---|---|
USE_QA for paragraph answer retrieval | 0.634 | 0.533 | 0.757 | 0.823 |
USE_QA on Vespa using tensors and Vespa grouping | 0.633 | 0.532 | 0.756 | 0.822 |
The sentence encoding model described in the paper, realized on Vespa, ranks the sentence containing the correct answer at the top position for 44% of the questions when doing sentence level retrieval over a collection of 91,729 sentences, and the paragraph containing the correct answer at the top position for 53% of the questions when doing paragraph level retrieval over a collection of 18,896 paragraphs.
Some sample questions from the SQuAD v1.1 dataset are shown below:
- Which NFL team represented the AFC at Super Bowl 50?
- What color was used to emphasize the 50th anniversary of the Super Bowl?
- What virus did Walter Reed discover?
One can explore the questions and the labeled answers here
Requirements for running this sample application:
- Docker installed and running
- git client to checkout the sample application repository
- Operating system: macOS or Linux, Architecture: x86_64
- Minimum 6GB memory dedicated to Docker (the default is 2GB on Macs).
See also the Vespa quick start guide. This setup differs slightly from the official quick start guide as we build a custom Docker image with the TensorFlow dependencies.
Checkout the sample-apps repository
This step requires that you have a working git client:
$ git clone --depth 1 https://github.com/vespa-engine/sample-apps.git; cd sample-apps/semantic-qa-retrieval
Build a docker image (See Dockerfile for details)
The image builds on the vespaengine/vespa Docker image (latest) and installs Python 3 and the Python dependencies needed to run TensorFlow. This step takes a few minutes.
$ docker build . --tag vespa_semantic_qa_retrieval:1.0
Run the docker container built in the previous step and enter the running docker container
$ docker run --detach --name vespa_qa --hostname vespa-container --privileged vespa_semantic_qa_retrieval:1.0
$ docker exec -it vespa_qa bash
Deploy the document schema and configuration - this will start Vespa services
$ vespa-deploy prepare qa/src/main/application/ && vespa-deploy activate
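Before moving on, you can optionally verify that the Vespa services are up. A minimal sketch in Python (assuming the default query port 8080 inside the container) polls the ApplicationStatus endpoint until it responds:

```python
#!/usr/bin/env python3
# Readiness check: poll the Vespa container's ApplicationStatus endpoint
# until it returns HTTP 200 (assumes the default port 8080 inside the container).
import time
import urllib.error
import urllib.request

url = "http://localhost:8080/ApplicationStatus"
for _ in range(30):
    try:
        with urllib.request.urlopen(url) as response:
            if response.status == 200:
                print("Vespa is ready")
                break
    except urllib.error.URLError:
        pass
    time.sleep(2)
else:
    print("Vespa did not become ready in time")
```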
Download the SQuAD train v1.1 dataset and convert the format to Vespa (sample of 269 questions). The download script extracts a sample set, as processing the whole dataset with the Sentence Encoder for QA takes time.
$ ./qa/bin/download.sh
$ ./qa/bin/convert-to-vespa-squad.py sample_squad.json 2> /dev/null
After the above we have two new files in the working directory: squad_queries.txt and squad_vespa_feed.json. The squad_queries.txt file contains the questions. The sample question set generates 351 sentence documents and 55 context documents (the paragraphs).
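For reference, squad_vespa_feed.json is in the standard Vespa JSON feed format (a list of put operations). The sketch below shows roughly what a single sentence document could look like; the document namespace, field names and embedding cells are illustrative assumptions, as the actual fields are defined by the document schema in qa/src/main/application/ and the conversion script:

```python
# Illustrative sketch of one entry in squad_vespa_feed.json (Vespa JSON feed format).
# The namespace "squad", the field names and the embedding cells below are assumptions;
# the real fields are defined by the sentence document schema and the conversion script.
import json

example_put = {
    "put": "id:squad:sentence::0",
    "fields": {
        "text": "Super Bowl 50 was an American football game ...",
        "context_id": 0,  # the paragraph (context document) this sentence belongs to
        "sentence_embedding": {
            # Tensor cells, truncated here; the USE_QA model produces 512-dimensional embeddings.
            "cells": [
                {"address": {"x": "0"}, "value": 0.0123},
                {"address": {"x": "1"}, "value": -0.0456}
            ]
        }
    }
}

print(json.dumps([example_put], indent=2))
```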
Feed Vespa json
We feed the documents using the Vespa http feeder client:
$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --file squad_vespa_feed.json --endpoint http://localhost:8080
Run evaluation
The evaluation script runs all questions produced by the conversion script. For each question it executes different recall and ranking strategies, and finally it computes the mean reciprocal rank (MRR@100) and the Recall@1, Recall@5 and Recall@10 metrics.
The evaluation script uses the Vespa search API.
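For clarity, a minimal sketch of how these metrics are computed per question is shown below (this is not the actual evaluation.py implementation): the reciprocal rank is 1/position of the document containing the correct answer, Recall@K is 1 if that document is among the top K hits, and both are averaged over all questions.

```python
# Sketch of the per-question MRR@100 and Recall@K computations, averaged over all questions.

def reciprocal_rank(ranked_ids, relevant_id, cutoff=100):
    # 1/position of the document holding the correct answer, 0 if not in the top 'cutoff' hits.
    for position, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0

def recall_at_k(ranked_ids, relevant_id, k):
    # 1 if the document holding the correct answer is among the top k hits, else 0.
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Toy example: (ranked document ids, id of the document with the correct answer) per question.
questions = [(["s3", "s7", "s1"], "s7"), (["s2", "s9", "s4"], "s5")]
mrr = sum(reciprocal_rank(r, rel) for r, rel in questions) / len(questions)
r_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in questions) / len(questions)
print(mrr, r_at_1)  # 0.25 0.0
```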
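A hedged sketch of how a question can be issued against the Vespa search API, with the question embedding passed as a query tensor, is shown below. The rank profile name is taken from the evaluation output further down; the query tensor name, its dimension and the YQL expression are assumptions about what the application package defines:

```python
# Sketch of a Vespa search API request for sentence level retrieval.
# The rank profile 'sentence-semantic-similarity' appears in the evaluation output;
# the query tensor feature name 'query_embedding' and the tensor dimension 'x' are
# assumptions about this application's schema and rank profile.
import json
import urllib.parse
import urllib.request

def query_sentences(question_embedding, hits=100):
    # Pass the question embedding as a query tensor in literal form, e.g. {{x:0}:0.01,{x:1}:-0.02}.
    tensor_literal = "{" + ",".join(
        "{{x:{}}}:{}".format(i, v) for i, v in enumerate(question_embedding)) + "}"
    params = {
        "yql": "select * from sources * where sddocname contains 'sentence';",
        "hits": str(hits),
        "ranking": "sentence-semantic-similarity",
        "ranking.features.query(query_embedding)": tensor_literal,
    }
    url = "http://localhost:8080/search/?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

result = query_sentences([0.01, -0.02, 0.03])  # toy vector; the real USE_QA embeddings are 512-dimensional
for hit in result.get("root", {}).get("children", []):
    print(hit["relevance"], hit["id"])
```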
Running the evaluation.py script:
$ cat squad_queries.txt |./qa/bin/evaluation.py
Which should produce output like this:
Start query evaluation for 269 queries
Sentence retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', MRR@100 0.5799
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@1 0.4498
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@5 0.7398
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@10 0.8290
Paragraph retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', MRR@100 0.7030
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@1 0.5725
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@5 0.8625
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@10 0.9405
Reproducing the paper metrics
To reproduce the paper metrics one needs to convert the entire dataset and run the evaluation over all questions:
$ ./qa/bin/convert-to-vespa-squad.py SQuAD_train_v1.1.json 2> /dev/null
$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --file squad_vespa_feed.json --endpoint http://localhost:8080
$ cat squad_queries.txt |./qa/bin/evaluation.py 2> /dev/null
Which should produce output like this:
Start query evaluation for 87599 queries
Sentence retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', MRR@100 0.5376
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@1 0.4380
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@5 0.6551
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@10 0.7262
Paragraph retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', MRR@100 0.6330
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@1 0.5322
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@5 0.7555
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad', R@10 0.8218