The official repository of UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers

ZhishanQ/UniHGKR


🌟 This is the official repository for Dense Heterogeneous Knowledge Retrievers: UniHGKR, and the heterogeneous knowledge retrieval benchmark CompMix-IR.

arXiv

Abstract

Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 4.80 points.

Notes:

We are preparing to update more code and benchmark datasets. Please be patient.

1. CompMix-IR Benchmark

For more detailed information about the CompMix-IR Benchmark, please refer to the CompMix_IR directory.

1.1 Corpus of CompMix-IR:

Download from 🤗 HuggingFace Dataset: Link or ☁️ Google Drive: Link .

The complete version of the CompMix_IR heterogeneous knowledge corpus is approximately 3-4 GB. We also provide a small subset so readers can preview its content and structure: subset of corpus

1.2 QA pairs of CompMix:

CompMix QA pairs: CompMix

ConvMix QA pairs: ConvMix_annotated

or as Huggingface datasets:

CompMix, ConvMix

1.3 Code to evaluate

Code to evaluate whether a retrieved piece of evidence is relevant (positive) to the question:

Code to judge relevance
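The repository's exact criterion lives in the linked script; as a rough illustration only, a common heuristic in heterogeneous QA evaluation marks evidence as positive when it contains a normalized gold answer. The function names and normalization steps below are our own assumptions, not the repo's code:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_positive(evidence: str, gold_answers: list) -> bool:
    """Heuristic: evidence is positive if any normalized gold answer occurs in it."""
    ev = normalize(evidence)
    return any(normalize(ans) in ev for ans in gold_answers if ans)


print(is_positive("Avatar was directed by James Cameron.", ["James Cameron"]))  # → True
print(is_positive("Avatar premiered in 2009.", ["James Cameron"]))              # → False
```

Substring matching over-credits partial mentions, so the actual judging script may use stricter rules; treat this as a sketch of the idea.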

1.4 Data-Text Pairs

These pairs are used in training stages 1 and 2 (heterogeneous self-supervised pretraining and text-anchored embedding alignment).

Download from 🤗 HuggingFace Dataset: Link or ☁️ Google Drive: Link .

The complete version of the Data-Text Pairs is about 1.2 GB. We also provide a small subset so readers can preview its content and structure: subset of data-text pairs

The CompMix_IR directory provides detailed explanations for the keys within each dict item.

2. UniHGKR model checkpoints

| Model Name | Description | 🤗 Huggingface Link | Usage Example |
| --- | --- | --- | --- |
| UniHGKR-base | Adapted for evaluation on CompMix-IR | UniHGKR-base | demo code to use |
| UniHGKR-base-beir | Adapted for evaluation on BEIR | UniHGKR-base-beir | code for evaluation_beir |
| UniHGKR-7B | LLM-based retriever | UniHGKR-7B | demo code to use |
| UniHGKR-7B-pretrained | Trained through Stages 1 and 2; needs fine-tuning before use on an information retrieval task | UniHGKR-7B-pretrained | |
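In practice these checkpoints would be loaded and used to encode instruction-prefixed queries and corpus entries (see the linked demo code for the actual API). The stdlib sketch below stubs the embeddings with toy vectors just to show the retrieval step itself, cosine-ranking corpus entries against a query embedding; the vectors, IDs, and instruction wording are illustrative assumptions, not outputs of the released models:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def retrieve(query_vec, corpus_vecs, top_k=2):
    """Return the IDs of the top_k corpus entries most similar to the query."""
    scored = sorted(corpus_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


# In real use these vectors would come from encoding with a UniHGKR checkpoint,
# with the instruction prepended to the question, e.g.
#   "Given a question, retrieve evidence from tables: who directed Avatar?"
# (illustrative wording, not the paper's exact instruction template).
query_vec = [0.9, 0.1, 0.2]
corpus_vecs = {
    "table_1": [0.8, 0.2, 0.1],
    "text_7":  [0.1, 0.9, 0.3],
    "kb_3":    [0.2, 0.1, 0.9],
}
print(retrieve(query_vec, corpus_vecs, top_k=1))  # → ['table_1']
```

Instruction-awareness lives entirely in the encoder: the same question under a different instruction (e.g. restricting to knowledge-base facts) should embed closer to entries of the requested type.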

3. Code for training and evaluation

3.1 Evaluation on CompMix-IR

3.2 Evaluation on ConvMix

3.3 Evaluation on BEIR

Our variant model UniHGKR-base-beir, adapted for evaluation on BEIR, is available at: https://huggingface.co/ZhishanQ/UniHGKR-base-beir

The code for evaluation on BEIR is at: evaluation_beir.
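BEIR leaderboards are typically reported as nDCG@10 over per-query relevance judgments. As background for reading the evaluation output, here is a stdlib computation of mean nDCG@k, with `qrels` and `results` shaped like BEIR's `{query_id: {doc_id: score}}` dicts (the toy data is ours, not from CompMix-IR or BEIR):

```python
import math


def ndcg_at_k(qrels, results, k=10):
    """Mean nDCG@k; qrels and results are {query_id: {doc_id: score}} dicts."""
    scores = []
    for qid, rels in qrels.items():
        # Rank retrieved docs for this query by descending retrieval score.
        ranked = sorted(results.get(qid, {}).items(),
                        key=lambda kv: kv[1], reverse=True)[:k]
        # DCG: graded relevance discounted by log2 of (rank + 2).
        dcg = sum(rels.get(doc_id, 0) / math.log2(rank + 2)
                  for rank, (doc_id, _) in enumerate(ranked))
        # Ideal DCG: the best ordering of the judged relevance grades.
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores)


qrels = {"q1": {"d1": 1, "d2": 1}}
results = {"q1": {"d1": 0.9, "d3": 0.8, "d2": 0.7}}
print(round(ndcg_at_k(qrels, results, k=10), 4))  # → 0.9197
```

In the toy run, the irrelevant `d3` outranks the relevant `d2`, so the score falls below 1.0; the linked evaluation_beir code presumably delegates this computation to the BEIR toolkit itself.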

✏️ Citation

If you find our paper and resources useful in your research, please consider giving us a star ⭐ and a citation 📝.

@article{min2024unihgkr,
  title={UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers},
  author={Min, Dehai and Xu, Zhiyang and Qi, Guilin and Huang, Lifu and You, Chenyu},
  journal={arXiv preprint arXiv:2410.20163},
  year={2024}
}

📧 Contact


dmin0007[at]student.monash.edu
