DuReadervis

This is the repository for the paper DuReadervis: A Chinese Dataset for Open-domain Document Visual Question Answering (Findings of ACL 2022).

Introduction

Figure 1

Open-domain question answering (Open-domain QA; Figure 1(a)) is widely used in many applications. It usually takes clean text extracted from documents in various formats (e.g., web pages, PDFs, or Word documents) as its information source. However, designing a separate text extraction approach for each format is time-consuming and does not scale.

To address these limitations, we propose the Open-domain Document Visual Question Answering (Open-domain DocVQA) task (Figure 1(b)). In this task, we apply a universal document extractor (e.g., OCR) to extract all the text and layout information from the document images, and then use them together with the visual features in two stages: Document Visual Retrieval (DocVRE), which retrieves relevant document images, and Document Visual Question Answering (DocVQA), which extracts answers from the retrieved document images. The task is more scalable when applied to different application scenarios.
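The retrieval stage can be illustrated with a small BM25 ranker over OCR text, since the dataset release itself ships the top-1 document image retrieved by BM25. This is only a minimal sketch and not the authors' baseline: the class name `BM25`, the character-level `tokenize` function, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: character-level tokens, a simple choice for Chinese text.
    return [ch for ch in text if not ch.isspace()]


class BM25:
    """Hypothetical minimal BM25 ranker over OCR text, one string per document image."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_freqs = [Counter(tokenize(doc)) for doc in documents]
        self.doc_lens = [sum(freqs.values()) for freqs in self.doc_freqs]
        # Fall back to 1.0 so that an all-empty corpus does not divide by zero.
        self.avg_len = (sum(self.doc_lens) / max(len(self.doc_lens), 1)) or 1.0
        # Document frequency: number of documents containing each term.
        df = Counter()
        for freqs in self.doc_freqs:
            df.update(freqs.keys())
        n = len(documents)
        self.idf = {
            term: math.log(1 + (n - count + 0.5) / (count + 0.5))
            for term, count in df.items()
        }

    def score(self, query, index):
        freqs = self.doc_freqs[index]
        # Length normalisation term for this document.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top1(self, query):
        # Index of the highest-scoring document (requires a non-empty corpus).
        scores = [self.score(query, i) for i in range(len(self.doc_freqs))]
        return max(range(len(scores)), key=scores.__getitem__)
```

In a full pipeline, the text of the document returned by `top1` (together with its layout and image) would be passed to a DocVQA model for answer extraction; that second stage is not sketched here.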

To advance this task, we create the first Chinese Open-domain DocVQA dataset called DuReadervis, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. The questions are real ones issued by users to the search engine. There are three main challenges in DuReadervis: (1) long document understanding, (2) noisy texts, and (3) multi-span answer extraction.

Dataset

The Open-domain DocVQA task consists of two stages: DocVRE and DocVQA. We list all the datasets in DuReadervis for both stages in the following table:

| File name | MD5 | Description | Task |
| --- | --- | --- | --- |
| dureader_vis_docvqa.tar.gz | 03559a8d01b3939020c71d4fec250926 | The train and dev datasets for DocVQA. We align the textual answer to the OCR results of documents, tokenize the OCR results with the LayoutXLM tokenizer, and generate the label sequence for training. | DocVQA |
| dureader_vis_open_docvqa.tar.gz | 5907ce4126d3eef8ca32d291dbf14abb | (1) The original dataset for open-domain DocVQA, and (2) the top-1 document image retrieved by BM25. | DocVRE+DocVQA |
| dureader_vis_ocr.tar.gz | 48d17330bc301cd8966d97d954d33853 | The OCR results of all 158K images. | DocVRE |
| dureader_vis_images_part_1.tar.gz | 6f41c5efe457f8acd35de8599e083c89 | Original image part 1 | DocVRE |
| dureader_vis_images_part_2.tar.gz | 6cb6600095aae1e625351bc006bcc906 | Original image part 2 | DocVRE |
| dureader_vis_images_part_3.tar.gz | 00a616c2421ce30a8e1d106f24fe78db | Original image part 3 | DocVRE |
| dureader_vis_images_part_4.tar.gz | 3a8ba1a5bb7c8abbd25a45c1daf9aa85 | Original image part 4 | DocVRE |
| dureader_vis_images_part_5.tar.gz | 920af983ccf39a74f4d438c8d43549f5 | Original image part 5 | DocVRE |
| dureader_vis_images_part_6.tar.gz | a671f142e55f26888cfb965010d88e8c | Original image part 6 | DocVRE |
| dureader_vis_images_part_7.tar.gz | cb53d8a0f17a2a0f8a0791634cf35d96 | Original image part 7 | DocVRE |
| dureader_vis_images_part_8.tar.gz | 3046edb565d90ecb385d5c44430ccc60 | Original image part 8 | DocVRE |
| dureader_vis_images_part_9.tar.gz | 51a8ccf2cce9ef6b614045cac99b2526 | Original image part 9 | DocVRE |
| dureader_vis_images_part_10.tar.gz | 73e9f4282b0a8d432df9fc4a79627134 | Original image part 10 | DocVRE |
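The MD5 values in the table above can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `verify_downloads` are hypothetical helpers, only the first three files are listed for brevity, and the archives are assumed to sit in a single local directory.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the table above (extend with the image parts as needed).
EXPECTED_MD5 = {
    "dureader_vis_docvqa.tar.gz": "03559a8d01b3939020c71d4fec250926",
    "dureader_vis_open_docvqa.tar.gz": "5907ce4126d3eef8ca32d291dbf14abb",
    "dureader_vis_ocr.tar.gz": "48d17330bc301cd8966d97d954d33853",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_downloads("path/to/downloads")` returns a dictionary keyed by file name; a value of `False` means the file is missing or its checksum differs from the table.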

If you focus on the DocVQA task, download dureader_vis_docvqa.tar.gz.

If you focus on the Open-domain DocVQA task, download dureader_vis_docvqa.tar.gz and dureader_vis_open_docvqa.tar.gz.

If you would like to process the dataset using the original OCR results, download dureader_vis_docvqa.tar.gz, dureader_vis_open_docvqa.tar.gz, and dureader_vis_ocr.tar.gz.

If you would like to start from scratch with the original images, download all of the files.
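The download guidance above can be written down as a small lookup table, which may be convenient in a setup script. This is a sketch only: the scenario keys (`"docvqa"`, `"open_docvqa"`, `"ocr"`, `"full"`) and the helper `files_for` are hypothetical names, while the file names come from the dataset table.

```python
# File names taken from the dataset table.
DOCVQA = "dureader_vis_docvqa.tar.gz"
OPEN_DOCVQA = "dureader_vis_open_docvqa.tar.gz"
OCR = "dureader_vis_ocr.tar.gz"
IMAGE_PARTS = [f"dureader_vis_images_part_{i}.tar.gz" for i in range(1, 11)]

# Hypothetical scenario names mapped to the files each one needs.
FILES_BY_SCENARIO = {
    "docvqa": [DOCVQA],
    "open_docvqa": [DOCVQA, OPEN_DOCVQA],
    "ocr": [DOCVQA, OPEN_DOCVQA, OCR],
    "full": [DOCVQA, OPEN_DOCVQA, OCR] + IMAGE_PARTS,
}


def files_for(scenario):
    """Return the list of archive names to download for a given scenario."""
    try:
        return list(FILES_BY_SCENARIO[scenario])
    except KeyError:
        known = ", ".join(sorted(FILES_BY_SCENARIO))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}")
```

For example, `files_for("open_docvqa")` lists the two archives needed for the Open-domain DocVQA task.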

DuReadervis Baseline System

The baseline code will come soon...

Citation

If you find our paper and code useful, please cite the following paper:

@inproceedings{dureadervis2022acl,
  title={DuReader\({}_{\mbox{vis}}\): {A} Chinese Dataset for Open-domain Document Visual Question Answering},
  author={Le Qi and Shangwen Lv and Hongyu Li and Jing Liu and Yu Zhang and Qiaoqiao She and Hua Wu and Haifeng Wang and Ting Liu},
  booktitle={Findings of the Association for Computational Linguistics: {ACL} 2022,
               Dublin, Ireland, May 22-27, 2022},
  pages={1338--1351},
  year={2022}
}

Copyright and License

Copyright 2022 Baidu.com, Inc. All Rights Reserved

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.