Vespa sample application - Semantic Retrieval for Question-Answer Applications

This sample application contains the code, document schema and dependencies for running the examples from https://docs.vespa.ai/documentation/semantic-qa-retrieval.html, where we build a semantic end-to-end answer retrieval system based on the methodology described in the ReQA: An Evaluation for End-to-End Answer Retrieval Models paper released by Google on October 5th. We reproduce the Recall@K and MRR results reported in the paper on Vespa over the Stanford Question Answering Dataset (SQuAD).

We hope that this sample application can enable more research on semantic retrieval, and also enable organizations to build powerful question-answer applications using Vespa.
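At its core, the approach encodes both questions and candidate answer sentences into a shared vector space with the Universal Sentence Encoder for QA (USE_QA) and ranks sentences by the inner product between the question vector and each sentence vector. Below is a minimal sketch of that ranking step, with random vectors standing in for the real encoder output:

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder output: in the real setup these are 512-dimensional
# USE_QA embeddings of the question and of each candidate answer sentence
# (91,729 sentences for full SQuAD; a small collection is used here).
question = rng.standard_normal(512)
sentences = rng.standard_normal((1000, 512))

# Rank candidate sentences by inner product with the question vector
scores = sentences @ question
top10 = np.argsort(-scores)[:10]
print(list(zip(top10, scores[top10])))

In the sample application, Vespa stores the sentence embeddings as tensor fields and evaluates this similarity at query time.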

Evaluation results for 87,599 questions

The results reported in ReQA: An Evaluation for End-to-End Answer Retrieval Models, compared with the Vespa implementation, are given in the tables below for sentence level retrieval and paragraph level retrieval:

Sentence Level Retrieval

| Model | MRR | R@1 | R@5 | R@10 |
|-------|-----|-----|-----|------|
| USE_QA for sentence answer retrieval | 0.539 | 0.439 | 0.656 | 0.727 |
| USE_QA on Vespa using tensors | 0.538 | 0.438 | 0.656 | 0.726 |

Paragraph Level Retrieval

| Model | MRR | R@1 | R@5 | R@10 |
|-------|-----|-----|-----|------|
| USE_QA for paragraph answer retrieval | 0.634 | 0.533 | 0.757 | 0.823 |
| USE_QA on Vespa using tensors and Vespa grouping | 0.633 | 0.532 | 0.756 | 0.822 |

On average, the sentence tensor encoding model described in the paper and realized on Vespa ranks the sentence with the correct answer at the top 1 position for 44% of the questions when doing sentence level retrieval over a collection of 91,729 sentences, and for 53% of the questions when doing paragraph level retrieval over a collection of 18,896 paragraphs.

Some sample questions from the SQuAD v1.1 dataset are shown below:

  • Which NFL team represented the AFC at Super Bowl 50?
  • What color was used to emphasize the 50th anniversary of the Super Bowl?
  • What virus did Walter Reed discover?

One can explore the questions and the labeled answers here.

Running this sample application

Requirements for running this sample application:

  • Docker installed and running
  • git client to checkout the sample application repository
  • Operating system: macOS or Linux, Architecture: x86_64
  • Minimum 6GB memory dedicated to Docker (the default is 2GB on Macs).

See also the Vespa quick start guide. This setup is slightly different from the official quick start guide, as we build a custom Docker image with the TensorFlow dependencies.

Check out the sample-apps repository

This step requires that you have a working git client:

$ git clone --depth 1 https://github.com/vespa-engine/sample-apps.git; cd sample-apps/semantic-qa-retrieval

Build a Docker image (see the Dockerfile for details)

The image builds on the latest vespaengine/vespa Docker image and installs Python 3 and the Python dependencies needed to run TensorFlow. This step takes a few minutes.

$ docker build . --tag vespa_semantic_qa_retrieval:1.0 

Run the Docker container built in the previous step and enter it

$ docker run --detach --name vespa_qa --hostname vespa-container --privileged vespa_semantic_qa_retrieval:1.0
$ docker exec -it vespa_qa bash 

Deploy the document schema and configuration - this will start Vespa services

$ vespa-deploy prepare qa/src/main/application/ && vespa-deploy activate

Download the SQuAD train v1.1 dataset and convert the format to Vespa JSON (sample of 269 questions)

The download script extracts a sample set, as processing the whole dataset using the Sentence Encoder for QA takes time.

$ ./qa/bin/download.sh
$ ./qa/bin/convert-to-vespa-squad.py sample_squad.json 2> /dev/null

After the above we have two new files in the working directory: squad_queries.txt and squad_vespa_feed.json. The squad_queries.txt file contains the questions. The sample question set generates 351 sentence documents and 55 context documents (the paragraphs).
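For reference, each entry in squad_vespa_feed.json is a Vespa put operation. The sketch below shows the general shape of one such operation; the field names ("text", "sentence_embedding") and the document id are illustrative assumptions, so consult the document schema under qa/src/main/application/ for the actual definitions:

import json

# Sketch of one feed operation; field names and id are assumptions, while the
# cell layout follows Vespa's JSON format for tensor fields.
feed_operation = {
    "put": "id:squad:sentence::0",  # document type "sentence"
    "fields": {
        "text": "Super Bowl 50 was an American football game ...",
        "sentence_embedding": {
            "cells": [
                {"address": {"x": str(i)}, "value": 0.1 * i} for i in range(3)
            ]
        },
    },
}
print(json.dumps([feed_operation], indent=2))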

Feed the Vespa JSON

We feed the documents using the Vespa HTTP feeder client:

$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --file squad_vespa_feed.json --endpoint http://localhost:8080 

Run evaluation

The evaluation script runs all questions produced by the conversion script. For each question it executes different recall and ranking strategies, and finally it computes the mean reciprocal rank (MRR@100) and the Recall@1, Recall@5 and Recall@10 metrics.
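As a reference for how these metrics are defined, here is a minimal sketch in Python (not the actual evaluation.py code) of the per-question values that are averaged into MRR@100 and Recall@K:

# Minimal sketch of the metrics, not the actual evaluation.py code.
# "ranked" is a list of document ids in rank order for one question;
# "relevant" is the set of ids containing the correct answer.

def reciprocal_rank(ranked, relevant, cutoff=100):
    """1/rank of the first relevant hit within the cutoff, else 0 (MRR@100)."""
    for position, doc_id in enumerate(ranked[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def recall_at_k(ranked, relevant, k):
    """1.0 if a relevant document appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0

# Averaging these over all questions yields the reported MRR@100,
# R@1, R@5 and R@10 numbers.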

The evaluation script uses the Vespa search API.
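For illustration, here is a hedged sketch of how a single question could be issued against the search API. The rank profile 'sentence-semantic-similarity' and the document type 'sentence' appear in the evaluation output below, while the query tensor name query_embedding and the exact request parameters are assumptions; see evaluation.py for the real requests:

import json
import urllib.parse
import urllib.request

# Hypothetical question embedding; the real script obtains a 512-dimensional
# vector from the Universal Sentence Encoder for QA.
question_embedding = [0.0] * 512

# Vespa tensor literal form, e.g. {{x:0}:0.0,{x:1}:0.0,...}
tensor_literal = "{" + ",".join(
    "{{x:{0}}}:{1}".format(i, v) for i, v in enumerate(question_embedding)
) + "}"

params = {
    # Restrict recall to the sentence document type
    "yql": 'select * from sources * where sddocname contains "sentence";',
    "hits": "100",
    "ranking.profile": "sentence-semantic-similarity",
    # Query tensor name "query_embedding" is an assumption
    "ranking.features.query(query_embedding)": tensor_literal,
}
url = "http://localhost:8080/search/?" + urllib.parse.urlencode(params)
response = json.load(urllib.request.urlopen(url))
for hit in response["root"].get("children", []):
    print(hit["id"], hit["relevance"])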

Running the evaluation.py script:

$ cat squad_queries.txt |./qa/bin/evaluation.py 

This should produce output like:

Start query evaluation for 269 queries
Sentence retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   MRR@100  0.5799
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@1 0.4498
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@5 0.7398
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@10 0.8290
Paragraph retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   MRR@100  0.7030
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@1 0.5725
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@5 0.8625
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@10 0.9405

Reproducing the paper metrics

To reproduce the paper's metrics, one needs to convert the entire dataset and run the evaluation over all questions:

$ ./qa/bin/convert-to-vespa-squad.py SQuAD_train_v1.1.json 2> /dev/null
$ java -jar $VESPA_HOME/lib/jars/vespa-http-client-jar-with-dependencies.jar --file squad_vespa_feed.json --endpoint http://localhost:8080
$ cat squad_queries.txt |./qa/bin/evaluation.py 2> /dev/null

This should produce output like:

Start query evaluation for 87599 queries
Sentence retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   MRR@100  0.5376
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@1 0.4380
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@5 0.6551
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@10 0.7262
Paragraph retrieval metrics:
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   MRR@100  0.6330
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@1 0.5322
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@5 0.7555
Profile 'sentence-semantic-similarity', doc='sentence', dataset='squad',   R@10 0.8218