GitHub - zirconiumnmlee/Summary-of-Datasets-for-GRAG

📢Summary of Datasets for GRAG

This repository summarizes the datasets used in several papers, including how to download them, how to read them with Python, and other notes. It is meant to help us become familiar with the structure of these datasets for further research.

How to download datasets from Huggingface

Install the Hugging Face Hub client, which provides the huggingface-cli command.

pip install -U huggingface_hub

Download the dataset you need:

huggingface-cli download --repo-type dataset <name>

You may need to look up the full name of your dataset (for example, deepmind/narrativeqa) on the Hugging Face website. If the site is not reachable, try a Hugging Face mirror.
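The download step above can also be scripted. The following is a minimal sketch, not part of this repository: the helper names (build_download_command, build_env, download_dataset) are hypothetical, and the mirror URL in the usage comment is a placeholder. It assumes that huggingface-cli is on the PATH and that the client reads the HF_ENDPOINT environment variable to select an alternative endpoint such as a mirror.

```python
import os
import subprocess


def build_download_command(repo_id, local_dir=None):
    """Build the huggingface-cli command line for downloading a dataset repo."""
    cmd = ["huggingface-cli", "download", "--repo-type", "dataset", repo_id]
    if local_dir is not None:
        # Optional: place the files in a specific directory instead of the cache.
        cmd += ["--local-dir", local_dir]
    return cmd


def build_env(endpoint=None):
    """Copy the current environment, optionally pointing the client at a mirror."""
    env = dict(os.environ)
    if endpoint is not None:
        # Assumption: HF_ENDPOINT overrides the default Hugging Face endpoint.
        env["HF_ENDPOINT"] = endpoint
    return env


def download_dataset(repo_id, local_dir=None, endpoint=None):
    """Run the download; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(
        build_download_command(repo_id, local_dir),
        env=build_env(endpoint),
        check=True,
    )


# Usage (the mirror URL is a placeholder, replace it with a real one if needed):
# download_dataset("deepmind/narrativeqa", local_dir="data/narrativeqa")
# download_dataset("deepmind/narrativeqa", endpoint="https://your-mirror.example")
```

Keeping the command construction separate from the subprocess call makes it easy to print the command first and check it before starting a large download.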

Summary of Datasets

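Name	Description	Structure	Data Splits
deepmind/narrativeqa	NarrativeQA is an English dataset of stories (film and TV scripts) and corresponding questions, designed to test reading comprehension, especially on long documents. Questions are answered from either the summary (human-written) or the story (original text). Each example pairs one text with one question.	A typical data point consists of a question and answer pair along with a summary/story which can be used to answer the question. Additional information such as the URL, word count, and Wikipedia page is also provided.	training (32747, saved in 24 parquet files), validation (3461, saved in 3 parquet files), test (10557, saved in 8 parquet files). Splits are made by story (i.e. the same story cannot appear in more than one split).
UltraDomain	A collection of datasets covering multiple domains, such as finance, cs, legal, and cooking (20 domains in total). Each domain is stored in one jsonl file. One text (context) can correspond to several different questions, so the same article can appear in multiple examples.	'input' (question); 'context'; 'dataset'; 'label'; 'answers'; '_id'; 'length'	All data is in a train set (total: 3933). Examples: agriculture (100), art (200). All details can be seen in READ_UltraDomain.ipynb. Max size: 438 (legal). Min size: 100 (cs).
qasper	QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 natural language processing papers. Each question is written by an NLP practitioner who has read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate group of NLP practitioners, who also provide supporting evidence for their answers.
hotpotqa	Split into two jsonl files: corpus (5233329 articles; each line stores a title, an abstract, and the Wikipedia URL) and queries (97852 questions). This dataset does not pair texts with questions one-to-one.	corpus: dict_keys(['_id', 'title', 'text', 'metadata']) queries: dict_keys(['_id', 'text', 'metadata'])
2WikiMultihopQA	Split into three parts (train, dev, test), each stored in one json file. In the training set, the context of each entry is not the raw text but text from which entities have already been extracted.	'_id', 'type', 'question', 'context', 'supporting_facts', 'evidences', 'answer'	train (167454), development (12576), test (12576)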
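As an illustration of the jsonl layout described for UltraDomain, here is a minimal sketch of reading one file and grouping questions by their shared context. It uses only the 'input' and 'context' fields listed in the table. The helper names and the file path in the usage comment are hypothetical, and the notebooks in this repository may read the data differently.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def group_questions_by_context(records):
    """Map each distinct context to the list of questions asked about it."""
    grouped = defaultdict(list)
    for rec in records:
        # 'input' holds the question and 'context' holds the source text.
        grouped[rec["context"]].append(rec["input"])
    return dict(grouped)


# Usage (hypothetical path to one downloaded domain file):
# records = list(read_jsonl("UltraDomain/cs.jsonl"))
# by_context = group_questions_by_context(records)
# print(len(records), "examples over", len(by_context), "distinct contexts")
```

The same read_jsonl helper should also work for the hotpotqa corpus and queries files, since both are described as jsonl. The corpus is large, so iterating over the generator is preferable to loading it all into a list.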

Repository files: README.md, READ_2Wiki.ipynb, READ_UltraDomain.ipynb, READ_deepmind-narrativeqa.ipynb, READ_hotpotqa.ipynb
