
This repository accompanies the project Time Matters: Examine Temporal Effects on Biomedical Language Models, presented at the AMIA 2024 Annual Symposium.
The project investigates how the performance of biomedical language models degrades over time due to shifts in data between training and deployment. It focuses on three key biomedical tasks, employing various data drift measurements and statistical analyses to quantify the temporal effects and explore the relationship between data shift and model performance variation.
This study reveals that time impacts model performance, and data shift could serve as an indicator for determining when model updates are necessary. The presentation slides are available for viewing: Time_Matters_Slides_AMIA24.pdf
- Environment Setup
- Time-varying Biomedical Data
- Usage
- Contact and Citation
- Platform:
- Ubuntu 22.04
- Anaconda, Python 3.10.13
- Linux Kernel: 6.8.0-40-generic
- Run the following commands to create the environment:
conda env create -f environment.yml
conda activate tempo0
- MIMIC-IV-Notes
- BioNLP Shared Task (BioNER)
- BioASQ (task b)
Dataset | Time Intervals | Task | Labels | Data Size |
---|---|---|---|---|
MIMIC | 2014-2016, 2008-2010, 2011-2013, 2017-2019 | Phenotype Inference | Top 50 frequent ICD codes | 331,794 notes |
BioNER | 2009ST, 2011EPI, 2011ID, 2013GE | Information Extraction | IOBES Tags of Protein Entity | 49,354 entities |
BioASQ | 2013-2015, 2016-2018, 2019-2020, 2021-2023 | Question Answering | Gold Standard Answer | 5,046 QA pairs |
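
The time intervals in the table act as separate time domains. As a minimal sketch of how train/test pairs across those domains could be enumerated, the snippet below uses the MIMIC intervals. The names `mimic_domains` and `domain_pairs` are hypothetical, and including same-interval (in-domain) pairs is an assumption rather than the repository's confirmed setup.

```python
from itertools import product

# Time domains from the MIMIC row of the table, listed chronologically.
mimic_domains = ["2008-2010", "2011-2013", "2014-2016", "2017-2019"]


def domain_pairs(domains):
    """Return every (train, test) combination, including in-domain pairs."""
    return list(product(domains, repeat=2))


pairs = domain_pairs(mimic_domains)
for train, test in pairs:
    kind = "in-domain" if train == test else "cross-domain"
    print(f"train={train}  test={test}  ({kind})")
```

With four intervals this yields 16 pairs, of which 12 are cross-domain.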
- Data preprocessing and split:
After downloading the data, go to each data folder and run
python build-[data].py
and
python split-[data].py
- Test temporal effect:
Run
sh [data].sh
to test the performance on all time-domain pairs. After getting the results, use
python record_[data].py
to get the experiment result summary.
- Data shift measuring:
  - Run
  analysis/[encoder]_embedings.py
  to get the text embeddings for semantic shift evaluation. Here, [encoder] can be a general-domain encoder such as SBERT or USE, or a biomedical-domain encoder such as BioLORD or MedCPT.
  - Run
  analysis/token_level_metrics.py
  to measure the token-level data shift, using metrics such as TF-IDF cosine similarity and Jaccard similarity.
- Statistical analysis of the performance and the data shift:
Run
analysis/statistic_analysis.py
to explore the relationships between language model performance and data shifts.
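
To illustrate the token-level metrics named in the data shift step, here is a minimal pure-Python sketch of Jaccard similarity and TF-IDF cosine similarity. It is not the code in analysis/token_level_metrics.py. The function names, the smoothed IDF formula, the whitespace tokenization, and the choice to treat each time domain as a single document are all assumptions made for illustration.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between the vocabularies of two token lists."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (token -> weight) for tokenized docs."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy text standing in for two time domains (illustrative only, not real data).
early = "patient admitted with chest pain and fever".split()
late = "patient admitted with chest pain and cough".split()

vec_early, vec_late = tfidf_vectors([early, late])
print("Jaccard:", jaccard(early, late))
print("TF-IDF cosine:", cosine(vec_early, vec_late))
```

Higher values on either metric indicate more lexical overlap between the two time domains, that is, less token-level shift.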
Note: Replace [data] with the actual data path. Since processing differs for each dataset, make sure to specify the correct [data] and file path.
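
One simple way to relate data shift to model performance, as the statistical analysis step describes, is a correlation coefficient across time-domain pairs. The sketch below computes Pearson's r in pure Python. Using Pearson correlation is an assumption, since analysis/statistic_analysis.py may apply different tests, and the numbers are placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT results from the paper):
# one similarity score and one performance score per time-domain pair.
similarity = [0.9, 0.8, 0.7, 0.6]
performance = [0.80, 0.75, 0.72, 0.65]

r = pearson(similarity, performance)
print(f"Pearson r = {r:.3f}")
```

A strongly positive r would suggest that performance falls as similarity between training and test domains falls, which is the kind of relationship the analysis step is meant to examine.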
For any inquiries, feel free to contact the author at: [email protected]
To cite this work:
@misc{liu2024timemattersexaminetemporal,
title={Time Matters: Examine Temporal Effects on Biomedical Language Models},
author={Weisi Liu and Zhe He and Xiaolei Huang},
year={2024},
eprint={2407.17638},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.17638},
}