The official repository of UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers

ZhishanQ/UniHGKR


🌟 This is the official repository for Dense Heterogeneous Knowledge Retrievers: UniHGKR, and the heterogeneous knowledge retrieval benchmark CompMix-IR.

arXiv

Abstract

Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 4.80 points.

Notes:

We are preparing to update more code and benchmark datasets. Please be patient.

1. CompMix-IR Benchmark

For more detailed information about the CompMix-IR Benchmark, please refer to the CompMix_IR directory.

1.1 Corpus of CompMix-IR:

Download from 🤗 HuggingFace Dataset: Link or ☁️ Google Drive: Link .

The complete version of the CompMix_IR heterogeneous knowledge corpus is approximately 3-4 GB. We also provide a small subset so readers can preview its content and structure: subset of corpus

1.2 QA pairs of CompMix:

CompMix QA pairs: CompMix

ConvMix QA pairs: ConvMix_annotated

or as Huggingface datasets:

CompMix, ConvMix

1.3 Code to evaluate

Code to evaluate whether a retrieved piece of evidence is relevant (positive) to the question:

Code to judge relevance
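The repository's exact criterion lives in the linked script; as a rough illustration only, a common heuristic in heterogeneous QA evaluation marks evidence as positive when it contains a normalized gold answer. The function names and normalization steps below are our own assumptions, not the repo's code:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_positive(evidence: str, gold_answers: list) -> bool:
    """Heuristic: evidence is positive if any normalized gold answer occurs in it."""
    ev = normalize(evidence)
    return any(normalize(ans) in ev for ans in gold_answers if ans)


print(is_positive("Avatar was directed by James Cameron.", ["James Cameron"]))  # → True
print(is_positive("Avatar premiered in 2009.", ["James Cameron"]))              # → False
```

Substring matching over-credits partial mentions, so the actual judging script may use stricter rules; treat this as a sketch of the idea.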

1.4 Data-Text Pairs

These pairs are used in training stages 1 and 2 (heterogeneous self-supervised pretraining and text-anchored embedding alignment).

Download from 🤗 HuggingFace Dataset: Link or ☁️ Google Drive: Link .

The complete version of the Data-Text Pairs is about 1.2 GB. We also provide a small subset so readers can preview its content and structure: subset of data-text pairs

The CompMix_IR directory provides detailed explanations for the keys within each dict item.

2. UniHGKR model checkpoints

| Model Name | Description | 🤗 Huggingface Link | Usage Example |
| --- | --- | --- | --- |
| UniHGKR-base | Adapted for evaluation on CompMix-IR | UniHGKR-base | demo code to use |
| UniHGKR-base-beir | Adapted for evaluation on BEIR | UniHGKR-base-beir | code for evaluation_beir |
| UniHGKR-7B | LLM-based retriever | UniHGKR-7B | demo code to use |
| UniHGKR-7B-pretrained | Trained through Stages 1 and 2; needs fine-tuning before use on an information retrieval task | UniHGKR-7B-pretrained | |
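In practice these checkpoints would be loaded and used to encode instruction-prefixed queries and corpus entries (see the linked demo code for the actual API). The stdlib sketch below stubs the embeddings with toy vectors just to show the retrieval step itself, cosine-ranking corpus entries against a query embedding; the vectors, IDs, and instruction wording are illustrative assumptions, not outputs of the released models:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def retrieve(query_vec, corpus_vecs, top_k=2):
    """Return the IDs of the top_k corpus entries most similar to the query."""
    scored = sorted(corpus_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


# In real use these vectors would come from encoding with a UniHGKR checkpoint,
# with the instruction prepended to the question, e.g.
#   "Given a question, retrieve evidence from tables: who directed Avatar?"
# (illustrative wording, not the paper's exact instruction template).
query_vec = [0.9, 0.1, 0.2]
corpus_vecs = {
    "table_1": [0.8, 0.2, 0.1],
    "text_7":  [0.1, 0.9, 0.3],
    "kb_3":    [0.2, 0.1, 0.9],
}
print(retrieve(query_vec, corpus_vecs, top_k=1))  # → ['table_1']
```

Instruction-awareness lives entirely in the encoder: the same question under a different instruction (e.g. restricting to knowledge-base facts) should embed closer to entries of the requested type.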

3. Code for training and evaluation

3.1 Evaluation on CompMix-IR

3.2 Evaluation on ConvMix

3.3 Evaluation on BEIR

Our variant model UniHGKR-base-beir, adapted for evaluation on BEIR, is available at: https://huggingface.co/ZhishanQ/UniHGKR-base-beir

The code for evaluation on BEIR is at: evaluation_beir.
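BEIR leaderboards are typically reported as nDCG@10 over per-query relevance judgments. As background for reading the evaluation output, here is a stdlib computation of mean nDCG@k, with `qrels` and `results` shaped like BEIR's `{query_id: {doc_id: score}}` dicts (the toy data is ours, not from CompMix-IR or BEIR):

```python
import math


def ndcg_at_k(qrels, results, k=10):
    """Mean nDCG@k; qrels and results are {query_id: {doc_id: score}} dicts."""
    scores = []
    for qid, rels in qrels.items():
        # Rank retrieved docs for this query by descending retrieval score.
        ranked = sorted(results.get(qid, {}).items(),
                        key=lambda kv: kv[1], reverse=True)[:k]
        # DCG: graded relevance discounted by log2 of (rank + 2).
        dcg = sum(rels.get(doc_id, 0) / math.log2(rank + 2)
                  for rank, (doc_id, _) in enumerate(ranked))
        # Ideal DCG: the best ordering of the judged relevance grades.
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores)


qrels = {"q1": {"d1": 1, "d2": 1}}
results = {"q1": {"d1": 0.9, "d3": 0.8, "d2": 0.7}}
print(round(ndcg_at_k(qrels, results, k=10), 4))  # → 0.9197
```

In the toy run, the irrelevant `d3` outranks the relevant `d2`, so the score falls below 1.0; the linked evaluation_beir code presumably delegates this computation to the BEIR toolkit itself.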

✏️ Citation

If you find our paper and resources useful in your research, please consider giving us a star ⭐ and a citation 📝.

@article{min2024unihgkr,
  title={UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers},
  author={Min, Dehai and Xu, Zhiyang and Qi, Guilin and Huang, Lifu and You, Chenyu},
  journal={arXiv preprint arXiv:2410.20163},
  year={2024}
}

📧 Contact


dmin0007[at]student.monash.edu
