Bilingual and Crosslingual Embedding (`BCEmbedding`), developed by NetEase Youdao, encompasses `EmbeddingModel` and `RerankerModel`. The `EmbeddingModel` specializes in generating semantic vectors, playing a crucial role in semantic search and question answering, while the `RerankerModel` excels at refining search results and ranking tasks.
`BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably QAnything [github], an open-source implementation widely integrated in various Youdao products such as Youdao Speed Reading and Youdao Translation.
Distinguished for its bilingual and crosslingual proficiency, `BCEmbedding` excels at bridging the gap between Chinese and English, achieving
- high performance in Semantic Representation Evaluations on MTEB;
- a new benchmark in the realm of RAG Evaluations in LlamaIndex.
Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly on Chinese, English, and their crosslingual tasks. `BCEmbedding`, leveraging the strength of Youdao's translation engine, delivers superior performance across monolingual, bilingual, and crosslingual settings.

`EmbeddingModel` supports Chinese (ch) and English (en) (support for more languages is coming soon), while `RerankerModel` supports Chinese (ch), English (en), Japanese (ja), and Korean (ko).
- Bilingual and Crosslingual Proficiency: Powered by Youdao's translation engine, excelling in Chinese, English, and their crosslingual retrieval tasks, with upcoming support for additional languages.
- RAG-Optimized: Tailored for diverse RAG tasks including translation, summarization, and question answering, ensuring accurate query understanding. See RAG Evaluations in LlamaIndex.
- Efficient and Precise Retrieval: Dual-encoder `EmbeddingModel` for efficient retrieval in the first stage, and cross-encoder `RerankerModel` for enhanced precision and deeper semantic analysis in the second stage (see the sketch after this list).
- Broad Domain Adaptability: Trained on diverse datasets for superior performance across various fields.
- User-Friendly Design: Instruction-free and versatile, usable for multiple tasks without specifying a query instruction for each task.
- Meaningful Reranking Scores: `RerankerModel` provides meaningful relevance scores to improve result quality and optimize large language model performance.
- Proven in Production: Successfully implemented and validated in Youdao's products.
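The two stages compose naturally: the dual encoder cheaply narrows a corpus to a few candidates, and the cross-encoder spends its per-pair cost only on those. Below is a minimal retrieve-then-rerank sketch; the toy passages and `top_k` value are illustrative, and embeddings are normalized explicitly in case `encode` does not return unit-length vectors:
```python
import numpy as np
from BCEmbedding import EmbeddingModel, RerankerModel

query = 'What is BCEmbedding used for?'
passages = [
    'BCEmbedding powers retrieval augmented generation at Youdao.',
    'The weather in Beijing is sunny today.',
    'RerankerModel refines first-stage retrieval results.',
]

# Stage 1: dual-encoder retrieval with EmbeddingModel
embedder = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
embs = embedder.encode([query] + passages)
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # normalize defensively
sims = embs[1:] @ embs[0]  # cosine similarity of each passage to the query
top_k = 2
candidate_ids = np.argsort(-sims)[:top_k]
candidates = [passages[i] for i in candidate_ids]

# Stage 2: cross-encoder reranking with RerankerModel
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
scores = reranker.compute_score([[query, p] for p in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```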
- 2024-01-03: Model Releases - bce-embedding-base_v1 and bce-reranker-base_v1 are available.
- 2024-01-03: Eval Datasets [CrosslingualMultiDomainsDataset] - Evaluate the performance of RAG, using LlamaIndex.
- 2024-01-03: Eval Datasets [Details] - Evaluate the performance of crosslingual semantic representation, using MTEB.
| Model Name | Model Type | Languages | Parameters | Weights |
|---|---|---|---|---|
| bce-embedding-base_v1 | EmbeddingModel | ch, en | 279M | Huggingface, ModelScope |
| bce-reranker-base_v1 | RerankerModel | ch, en, ja, ko | 279M | Huggingface, ModelScope |
First, create a conda environment and activate it.
```bash
conda create --name bce python=3.10 -y
conda activate bce
```
Then install `BCEmbedding` for a minimal installation:
```bash
pip install BCEmbedding==0.1.1
```
Or install from source:
```bash
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```
Use `EmbeddingModel` from `BCEmbedding`; the `cls` pooler is the default:
```python
from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
```
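These embeddings are intended for similarity search. A quick follow-up sketch, assuming `encode` returns a 2-D numpy array (re-normalized defensively here):
```python
import numpy as np

# re-normalize, then compute pairwise cosine similarities of the sentences above
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = embeddings @ embeddings.T  # similarity[i, j] for sentence_i vs sentence_j
print(similarity)
```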
Use `RerankerModel` from `BCEmbedding` to calculate relevance scores and rerank:
```python
from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1', ...]

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)

# method 1: rerank passages
rerank_results = model.rerank(query, passages)
```
NOTE:

- For the `RerankerModel.rerank` method in `BCEmbedding`, we provide advanced preprocessing, used in our production systems, to construct `sentence_pairs` when "query" + "passage" is longer than `max_length`.
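If you only need the ordering and want to stay on the documented `compute_score` API, method 0 plus a sort yields the same ranking. A minimal sketch, reusing `query`, `passages`, and `model` from above:
```python
# rank passages by their relevance scores (descending)
scores = model.compute_score([[query, passage] for passage in passages])
order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
for i in order:
    print(f"{scores[i]:.4f}\t{passages[i]}")
```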
For `EmbeddingModel`:
```python
from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # use 'cpu' if no GPU is available
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize
```
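As a sanity check, this should match the packaged model, assuming `EmbeddingModel.encode` also applies `cls` pooling and normalization by default (an assumption worth verifying against your installed version):
```python
import numpy as np
from BCEmbedding import EmbeddingModel

# compare against BCEmbedding's own encode() on the same sentences
ref = EmbeddingModel(model_name_or_path='maidalun1020/bce-embedding-base_v1').encode(sentences)
print(np.allclose(np.asarray(ref), embeddings.detach().cpu().numpy(), atol=1e-5))
```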
For `RerankerModel`:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# query-passage pairs, e.g. as constructed in the BCEmbedding example above
sentence_pairs = [['query_0', 'passage_0'], ['query_0', 'passage_1'], ...]

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')

device = 'cuda'  # use 'cpu' if no GPU is available
model.to(device)

# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
```
For `EmbeddingModel`:
```python
from sentence_transformers import SentenceTransformer

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init embedding model
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

# set max_length to 512 to avoid an error
model.max_seq_length = 512

# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)
```
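For retrieval with these embeddings, `sentence_transformers` also ships a cosine-similarity helper, so you don't have to hand-roll the dot products:
```python
from sentence_transformers import util

# pairwise cosine similarities; with normalize_embeddings=True this equals the dot product
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```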
For `RerankerModel`:
```python
from sentence_transformers import CrossEncoder

# query-passage pairs, as in the examples above
sentence_pairs = [['query_0', 'passage_0'], ['query_0', 'passage_1'], ...]

# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)
```
We provide evaluation tools for `embedding` and `reranker` models, based on MTEB and C_MTEB.
First, install `MTEB`:
```bash
pip install mteb==1.1.1
```
Run the following command to evaluate `your_embedding_model` (e.g. `maidalun1020/bce-embedding-base_v1`) in monolingual, bilingual, and crosslingual settings (e.g. `["en", "zh", "en-zh", "zh-en"]`):
```bash
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls
```
The evaluation covers 114 datasets across the "Retrieval", "STS", "PairClassification", "Classification", "Reranking", and "Clustering" tasks.
NOTE:

- All models are evaluated with their recommended pooling method (`pooler`). "jina-embeddings-v2-base-en", "m3e-base", "m3e-large", "multilingual-e5-base", and "multilingual-e5-large" use the `mean` pooler, while the others use `cls`.
- The "jina-embeddings-v2-base-en" model should be loaded with `trust_remote_code`.
```bash
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {moka-ai/m3e-base | moka-ai/m3e-large | intfloat/e5-large-v2 | intfloat/multilingual-e5-base | intfloat/multilingual-e5-large} --pooler mean

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path jinaai/jina-embeddings-v2-base-en --pooler mean --trust_remote_code
```
Run the following command to evaluate `your_reranker_model` (e.g. "maidalun1020/bce-reranker-base_v1") in monolingual, bilingual, and crosslingual settings (e.g. `["en", "zh", "en-zh", "zh-en"]`):
```bash
python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1
```
The evaluation covers 12 datasets of the "Reranking" task.
We provide a one-click script to summarize the evaluation results of `embedding` and `reranker` models, as shown in the Embedding Models Evaluation Summary and the Reranker Models Evaluation Summary:
```bash
python BCEmbedding/evaluation/mteb/summarize_eval_results.py --results_dir {your_embedding_results_dir | your_reranker_results_dir}
```
LlamaIndex is a well-known data framework for LLM-based applications, particularly in RAG. Recently, a LlamaIndex Blog evaluated popular embedding and reranker models in a RAG pipeline and attracted great attention. We follow its pipeline to evaluate our `BCEmbedding`.
First, install LlamaIndex and upgrade `transformers` to 4.36.0:
```bash
pip install transformers==4.36.0
pip install llama-index==0.9.22
```
Export your "openai" and "cohere" API keys, along with the OpenAI base URL (e.g. "https://api.openai.com/v1"), as environment variables:
```bash
export OPENAI_BASE_URL={openai_base_url}  # e.g. https://api.openai.com/v1
export OPENAI_API_KEY={your_openai_api_key}
export COHERE_APPKEY={your_cohere_api_key}
```
- Hit Rate: Hit rate calculates the fraction of queries for which the correct answer appears within the top-k retrieved documents. In simpler terms, it measures how often the system gets it right within its top few guesses. The larger, the better.
- Mean Reciprocal Rank (MRR): For each query, MRR evaluates the system's accuracy by looking at the rank of the highest-placed relevant document. Specifically, it is the average of the reciprocals of these ranks across all queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it is second, the reciprocal rank is 1/2, and so on. The larger, the better. (A minimal sketch of both metrics follows this list.)
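Both metrics are straightforward to compute from ranked retrieval results. A minimal sketch, counting a reciprocal rank of 0 when the gold document falls outside the top k (the toy data below is illustrative):
```python
def hit_rate_and_mrr(ranked_ids, relevant_ids, k=5):
    """ranked_ids: per-query lists of retrieved doc ids, best first.
    relevant_ids: the single correct doc id per query."""
    hits, rr = 0, 0.0
    for ranked, gold in zip(ranked_ids, relevant_ids):
        top_k = ranked[:k]
        if gold in top_k:
            hits += 1
            rr += 1.0 / (top_k.index(gold) + 1)
    n = len(relevant_ids)
    return hits / n, rr / n

# two toy queries: gold doc ranked 1st and 3rd respectively
print(hit_rate_and_mrr([[3, 1, 2], [9, 7, 5]], [3, 5], k=3))  # (1.0, 0.666...)
```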
2. Reproduce LlamaIndex Blog
To compare our `BCEmbedding` fairly with other embedding and reranker models, we provide a one-click script that reproduces the results of the LlamaIndex Blog, including our `BCEmbedding`:
```bash
# There should be at least two GPUs available.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_reproduce.py
```
Then, summarize the evaluation results:
```bash
python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_reproduce_results
```
Results reproduced from the LlamaIndex Blog can be checked in the Reproduced Summary of RAG Evaluation, with some obvious conclusions:

- In the `WithoutReranker` setting, our `bce-embedding-base_v1` outperforms all other embedding models.
- With the embedding model fixed, our `bce-reranker-base_v1` achieves the best performance.
- The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` is SOTA.
The evaluation in the LlamaIndex Blog is monolingual, uses a small amount of data, and covers only a specific domain (the "llama2" paper). To evaluate broad domain adaptability as well as bilingual and crosslingual capability, we follow the blog's methodology to build a multiple-domains evaluation dataset (covering "Computer Science", "Physics", "Biology", "Economics", "Math", and "Quantitative Finance"; Details), named CrosslingualMultiDomainsDataset, generated with OpenAI `gpt-4-1106-preview` for high quality.
First, run the following command to evaluate the most popular and powerful embedding and reranker models:
```bash
# There should be at least two GPUs available.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_multiple_domains.py
```
Then, run the following script to summarize the evaluation results:
```bash
python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_results
```
The summary of multiple domains evaluations can be seen in Multiple Domains Scenarios.
Model | Retrieval (47) | STS (19) | PairClassification (5) | Classification (21) | Reranking (12) | Clustering (15) | Avg (119) |
---|---|---|---|---|---|---|---|
bge-base-en-v1.5 | 37.14 | 55.06 | 75.45 | 59.73 | 43.05 | 37.74 | 47.20 |
bge-base-zh-v1.5 | 47.60 | 63.72 | 77.40 | 63.38 | 54.85 | 32.56 | 53.60 |
bge-large-en-v1.5 | 37.15 | 54.09 | 75.00 | 59.24 | 42.68 | 37.32 | 46.82 |
bge-large-zh-v1.5 | 47.54 | 64.73 | 79.14 | 64.19 | 55.88 | 33.26 | 54.21 |
jina-embeddings-v2-base-en | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
m3e-base | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
m3e-large | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
bce-embedding-base_v1 | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |
NOTE:

- Our bce-embedding-base_v1 outperforms other open-source embedding models of various sizes.
- The 114 datasets yield 119 eval results (some datasets contain multiple languages) of "Retrieval", "STS", "PairClassification", "Classification", "Reranking", and "Clustering" in the `["en", "zh", "en-zh", "zh-en"]` setting.
- The crosslingual evaluation datasets we released belong to the `Retrieval` task.
- For more evaluation details, please check Embedding Models Evaluations.
Model | Reranking (12) | Avg (12) |
---|---|---|
bge-reranker-base | 57.78 | 57.78 |
bge-reranker-large | 59.69 | 59.69 |
bce-reranker-base_v1 | 60.06 | 60.06 |
NOTE:

- Our bce-reranker-base_v1 outperforms other open-source reranker models.
- 12 datasets of "Reranking" in the `["en", "zh", "en-zh", "zh-en"]` setting.
- For more evaluation details, please check Reranker Models Evaluations.
NOTE:

- Consistent with our Reproduced Results of the LlamaIndex Blog.
- In the `WithoutReranker` setting, our `bce-embedding-base_v1` outperforms all other embedding models.
- With the embedding model fixed, our `bce-reranker-base_v1` achieves the best performance.
- The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` is SOTA.
For users who prefer a hassle-free experience without needing to download and configure the model on their own systems, `BCEmbedding` is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate `BCEmbedding` into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at Youdao BCEmbedding API. There you will find all the necessary guidance to easily implement `BCEmbedding` across a variety of use cases, ensuring smooth and effective integration for optimal results.
Scan the QR code below to join our WeChat group.
If you use `BCEmbedding` in your research or projects, please feel free to cite and star it:
```bibtex
@misc{youdao_bcembedding_2023,
    title={BCEmbedding: Bilingual and Crosslingual Embedding for RAG},
    author={NetEase Youdao, Inc.},
    year={2023},
    howpublished={\url{https://github.com/netease-youdao/BCEmbedding}}
}
```
`BCEmbedding` is licensed under the Apache 2.0 License.