We present BioSyn for learning biomedical entity representations. You can train BioSyn with the two main components described in our paper: 1) synonym marginalization and 2) iterative candidate retrieval. Once BioSyn is trained, you can use it to normalize biomedical mentions or to encode them as entity embeddings.
- [Mar 17, 2022] Checkpoints of BioSyn for normalizing gene type are released. The BC2GN data used for the gene type has been pre-processed by Tutubalina et al., 2020.
- [Oct 25, 2021] Trained models are uploaded to the Hugging Face Hub (please check out here). In addition to BioBERT, we also trained our model with another pre-trained model, SapBERT, and obtained better performance than reported in our paper.
$ conda create -n BioSyn python=3.7
$ conda activate BioSyn
$ conda install numpy tqdm scikit-learn
$ conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch
$ pip install transformers==4.11.3
Note that the PyTorch build you install has to match your CUDA version.
Datasets consist of queries (train, dev, test, and traindev) and dictionaries (train_dictionary, dev_dictionary, and test_dictionary). The only difference between the dictionaries is that test_dictionary includes train and dev mentions, and dev_dictionary includes train mentions, to increase coverage. The queries are pre-processed by lowercasing, removing punctuation, resolving composite mentions, and resolving abbreviations (Ab3P). The dictionaries are pre-processed by lowercasing and removing punctuation. If you need the pre-processing code, please let us know by opening an issue.
Note that we use the development (dev) set to search for hyperparameters, and train on the traindev (train+dev) set to report the final performance.
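The lowercasing and punctuation-removal steps described above can be sketched as follows. This is a minimal illustration, not the authors' pre-processing script: the function name normalize_name is hypothetical, replacing punctuation with a space (rather than deleting it) is an assumption, and composite-mention and abbreviation resolution are not covered.

```python
import re
import string

# Character class matching any ASCII punctuation mark.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def normalize_name(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Hypothetical helper; the actual BioSyn pre-processing may differ.
    """
    text = text.lower()
    text = _PUNCT_RE.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_name("Ataxia-Telangiectasia (A-T)"))
```

Applying the same function to both queries and dictionary names keeps the two sides comparable at lookup time.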
dataset cannot be shared because of the license issue. Please visit the website or see here for pre-processing scripts.
The following example fine-tunes our model on the NCBI-Disease dataset (train+dev) with BioBERTv1.1.
CUDA_VISIBLE_DEVICES=1 python train.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_dictionary_path ${DATA_DIR}/train_dictionary.txt \
--train_dir ${DATA_DIR}/processed_traindev \
--output_dir ${OUTPUT_DIR} \
--use_cuda \
--topk 20 \
--epoch 10 \
--train_batch_size 16 \
--learning_rate 1e-5 \
--max_length 25
Note that you can train the model on processed_train and evaluate it on processed_dev when you want to search for hyperparameters (the argument --save_checkpoint_all can be helpful).
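Synonym marginalization, one of the two components named at the top of this README, can be illustrated roughly as follows. The sketch assumes that the scores of the top-k candidates are turned into probabilities with a softmax, and that the loss is the negative log of the total probability assigned to candidates labeled as synonyms. The function name marginal_nll, the eps clamp, and the toy inputs are hypothetical; this is not the code in train.py.

```python
import math
from typing import Sequence


def marginal_nll(scores: Sequence[float], labels: Sequence[int], eps: float = 1e-9) -> float:
    """Negative log of the softmax mass placed on synonym candidates.

    scores: similarity scores of the top-k candidates for one mention.
    labels: 1 if the candidate is a synonym of the gold entity, else 0.
    eps:    lower clamp so that log(0) is never taken (an assumption).
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    positive_mass = sum(e for e, y in zip(exps, labels) if y == 1) / total
    return -math.log(max(positive_mass, eps))


if __name__ == "__main__":
    # Toy example: three candidates, the first two are synonyms.
    print(marginal_nll([2.0, 1.0, 0.5], [1, 1, 0]))
```

The loss is low whenever any synonym in the candidate list scores highly, so the model is not forced to rank one particular surface form first.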
The following example evaluates our trained model on the NCBI-Disease dataset (test).
python eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--dictionary_path ${DATA_DIR}/test_dictionary.txt \
--data_dir ${DATA_DIR}/processed_test \
--output_dir ${OUTPUT_DIR} \
--use_cuda \
--topk 20 \
--max_length 25 \
--save_predictions
The predictions are saved in predictions_eval.json with mentions, candidates and accuracies (the argument --save_predictions has to be on).
The following is an example.
{
  "queries": [
    {
      "mentions": [
        {
          "mention": "ataxia telangiectasia",
          "golden_cui": "D001260",
          "candidates": [
            {
              "name": "ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia syndrome",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia variant",
              "cui": "C566865",
              "label": 0
            },
            {
              "name": "syndrome ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "telangiectasia",
              "cui": "D013684",
              "label": 0
            }
          ]
        }
      ]
    }
  ],
  "acc1": 0.9114583333333334,
  "acc5": 0.9385416666666667
}
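To sanity-check a predictions file, accuracy at k could be recomputed from the candidate labels. The sketch below assumes that candidates are nested under mentions, which are nested under queries, and that a query counts as a hit only when every one of its mentions has a candidate with label 1 among its top k. The function name accuracy_at_k and the toy data are hypothetical, and the hit rule is an assumption rather than a statement about eval.py.

```python
from typing import Any, Dict


def accuracy_at_k(result: Dict[str, Any], k: int) -> float:
    """Fraction of queries whose every mention has a correct candidate in its top k."""
    queries = result["queries"]
    if not queries:
        return 0.0
    hits = 0
    for query in queries:
        if all(
            any(cand["label"] == 1 for cand in mention["candidates"][:k])
            for mention in query["mentions"]
        ):
            hits += 1
    return hits / len(queries)


# Toy result using the assumed nesting; the values are illustrative only.
sample = {
    "queries": [
        {"mentions": [{"mention": "a", "golden_cui": "X1",
                       "candidates": [{"name": "a", "cui": "X1", "label": 1}]}]},
        {"mentions": [{"mention": "b", "golden_cui": "X2",
                       "candidates": [{"name": "c", "cui": "X3", "label": 0},
                                      {"name": "b", "cui": "X2", "label": 1}]}]},
    ]
}

if __name__ == "__main__":
    print(accuracy_at_k(sample, 1), accuracy_at_k(sample, 2))
```

For a real file, load it with json.load and pass the resulting dictionary in place of sample.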
We provide a simple script that can normalize a biomedical mention or represent it as an embedding vector with BioSyn.
| Model | Acc@1/Acc@5 |
|---|---|
| biosyn-biobert-ncbi-disease | 91.1/93.9 |
| biosyn-sapbert-ncbi-disease | 92.4/95.8 |
| biosyn-biobert-bc5cdr-disease | 93.2/96.0 |
| biosyn-sapbert-bc5cdr-disease | 93.5/96.4 |
| biosyn-biobert-bc5cdr-chemical | 96.6/97.2 |
| biosyn-sapbert-bc5cdr-chemical | 96.6/98.3 |
| biosyn-biobert-bc2gn | 90.6/95.6 |
| biosyn-sapbert-bc2gn | 91.3/96.3 |
The example below gives the top 5 predictions for the mention ataxia telangiectasia. Note that the initial run will take some time to embed the whole dictionary. You can download the dictionary file here.
python inference.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--dictionary_path ${DATA_DIR}/test_dictionary.txt \
--use_cuda \
--mention "ataxia telangiectasia" \
{
  "mention": "ataxia telangiectasia",
  "predictions": [
    {"name": "ataxia telangiectasia", "id": "D001260|208900"},
    {"name": "ataxia telangiectasia syndrome", "id": "D001260|208900"},
    {"name": "telangiectasia", "id": "D013684"},
    {"name": "telangiectasias", "id": "D013684"},
    {"name": "ataxia telangiectasia variant", "id": "C566865"}
  ]
}
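Several predicted names can share one identifier, so downstream code may want only the distinct identifiers in rank order. The sketch below assumes that "|" separates alternative identifiers for the same entry (suggested by "D001260|208900" being labeled correct for the gold identifier "D001260" in the evaluation example). The function name unique_ids is hypothetical.

```python
from typing import Dict, List


def unique_ids(predictions: List[Dict[str, str]]) -> List[str]:
    """Return distinct identifiers in first-seen (rank) order.

    Assumes '|' joins alternative identifiers inside a single "id" field.
    """
    seen = set()
    ordered: List[str] = []
    for pred in predictions:
        for part in pred["id"].split("|"):
            if part not in seen:
                seen.add(part)
                ordered.append(part)
    return ordered


# Predictions copied from the example output in this README.
predictions = [
    {"name": "ataxia telangiectasia", "id": "D001260|208900"},
    {"name": "ataxia telangiectasia syndrome", "id": "D001260|208900"},
    {"name": "telangiectasia", "id": "D013684"},
    {"name": "telangiectasias", "id": "D013684"},
    {"name": "ataxia telangiectasia variant", "id": "C566865"},
]

if __name__ == "__main__":
    print(unique_ids(predictions))
```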
The example below gives the sparse and dense embeddings of the mention ataxia telangiectasia.
python inference.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--use_cuda \
--mention "ataxia telangiectasia" \
{
  "mention": "ataxia telangiectasia",
  "mention_sparse_embeds": array([0.05979538, 0., ..., 0., 0.], dtype=float32),
  "mention_dense_embeds": array([-7.14258850e-02, ..., -4.03847933e-01], dtype=float32)
}
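One way sparse and dense embeddings like these could be used for candidate retrieval is to score each dictionary entry by a dense inner product plus a weighted sparse inner product and keep the k best. The sketch below is an illustration under that assumption: the function names, the sparse_weight parameter, and the two-dimensional toy vectors are all hypothetical, and inference.py may combine the scores differently.

```python
from typing import List, Sequence, Tuple

Entry = Tuple[str, Sequence[float], Sequence[float]]  # (name, sparse_vec, dense_vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    mention_sparse: Sequence[float],
    mention_dense: Sequence[float],
    dictionary: List[Entry],
    k: int,
    sparse_weight: float = 1.0,
) -> List[Tuple[str, float]]:
    """Rank dictionary entries by dense score + sparse_weight * sparse score."""
    scored = []
    for name, sparse_vec, dense_vec in dictionary:
        score = dot(mention_dense, dense_vec) + sparse_weight * dot(mention_sparse, sparse_vec)
        scored.append((name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy dictionary; the vectors are made up for illustration.
toy_dictionary: List[Entry] = [
    ("telangiectasia", [0.0, 1.0], [0.4, 0.4]),
    ("ataxia telangiectasia", [1.0, 0.0], [0.5, 0.5]),
]

if __name__ == "__main__":
    print(top_k([1.0, 0.0], [0.5, 0.5], toy_dictionary, k=1))
```

In practice the dictionary embeddings would be computed once and cached, which matches the note that the first run is slow.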
The web demo is implemented with the Tornado framework. If a dictionary is not yet cached, it will take a couple of minutes to create the dictionary cache.
python demo.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--use_cuda \
--dictionary_path ./datasets/ncbi-disease/test_dictionary.txt
title={Biomedical Entity Representations with Synonym Marginalization},
author={Sung, Mujeen and Jeon, Hwisang and Lee, Jinhyuk and Kang, Jaewoo},