- Install the dependencies:

  `pip install -U huggingface_hub`
- Download the dataset you need:

  `huggingface-cli download --repo-type dataset <name>`

  You may need to browse the Hugging Face Hub to find the full repository name of your dataset (if the site is unreachable, try a Hugging Face mirror). A Python alternative to the CLI is sketched below.
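The same download can be scripted with `huggingface_hub.snapshot_download`. This is a minimal sketch, not part of the original instructions: the `local_dir` path is a placeholder, and the commented-out `HF_ENDPOINT` line assumes you want to route requests through a mirror.

```python
import os

# Assumption: setting HF_ENDPOINT redirects huggingface_hub to a mirror
# when the main site is unreachable; replace the URL with your mirror.
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Download a dataset repo by its full name, e.g. "deepmind/narrativeqa".
# local_dir is a hypothetical target path; adjust it to your setup.
path = snapshot_download(
    repo_id="deepmind/narrativeqa",
    repo_type="dataset",
    local_dir="data/narrativeqa",
)
print("Dataset downloaded to:", path)
```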
Name | Description | Structure | Data Splits |
---|---|---|---|
deepmind/narrativeqa | NarrativeQA is an English dataset of stories (film and TV scripts) with corresponding questions, designed to test reading comprehension, especially over long documents. Questions can be answered from either the summary (human-annotated) or the story (original text); each text corresponds to one question. | A typical data point consists of a question-and-answer pair along with a summary/story that can be used to answer the question. Additional information such as the URL, word count, and Wikipedia page is also provided. | training (32,747; saved in 24 parquet files), validation (3,461; saved in 3 parquet files), test (10,557; saved in 8 parquet files), split by story (i.e. the same story cannot appear in more than one split) |
UltraDomain | A collection of datasets spanning multiple domains such as finance, CS, legal, and cooking (20 classes in total). Each dataset is stored in one JSONL file. A single text (context) can correspond to several different questions, i.e. the same article may appear in multiple examples. | 'input' (question); 'context'; 'dataset'; 'label'; 'answers'; '_id'; 'length' | Train set only (total: 3,933), e.g. agriculture (100), art (200); see READ_UltraDomain.ipynb for details. Max size: 438 (legal); min size: 100 (cs) |
qasper | QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 NLP papers. Each question is written by an NLP practitioner who has read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners, who also provide supporting evidence for the answers. | | |
hotpotqa | Split into two JSONL datasets: corpus (5,233,329 articles; each line stores a title, an abstract, and a Wikipedia URL) and queries (97,852 questions). Texts and questions are not paired one-to-one in this dataset. | corpus: dict_keys(['_id', 'title', 'text', 'metadata']); queries: dict_keys(['_id', 'text', 'metadata']) | |
2WikiMultihopQA | Split into three parts (train, dev, test), each stored in a JSON file. In the training set, the context of each row is not the raw text but text from which entities have already been extracted. | '_id', 'type', 'question', 'context', *'supporting_facts', *'evidences', 'answer' | train (167,454), development (12,576), test (12,576) |
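To sanity-check a download, the JSONL-based datasets (UltraDomain, hotpotqa) can be inspected with a few lines of Python. This is a minimal sketch assuming UltraDomain was downloaded to a local directory; the path and the per-domain file name `agriculture.jsonl` are hypothetical placeholders, so adjust them to wherever your files actually landed.

```python
import json
from pathlib import Path

# Hypothetical location of one UltraDomain domain file; adjust the
# path to wherever huggingface-cli placed the download.
jsonl_path = Path("data/UltraDomain/agriculture.jsonl")

# Each line is one JSON example.
with jsonl_path.open(encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

# Each example should expose the fields listed in the table above:
# 'input' (question), 'context', 'dataset', 'label', 'answers',
# '_id', 'length'.
print(len(examples), "examples")
print(sorted(examples[0].keys()))
print(examples[0]["input"])
```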