Zurlo, G., Ronchieri, E. (2024). Abstracts Embeddings Evaluation: A Case Study of Artificial Intelligence and Medical Imaging for the COVID-19 Infection. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing - ICIAP 2023 Workshops. ICIAP 2023. Lecture Notes in Computer Science, vol 14365. Springer, Cham. https://doi.org/10.1007/978-3-031-51023-6_18
The SARS-CoV-2 pandemic triggered unprecedented research efforts across various disciplines. Notably, the field of artificial intelligence (AI) applied to medical imaging has been prominently involved. Given the scarcity of resources available to face this disease, AI-based tools have emerged as potentially valuable assets. Natural Language Processing (NLP) offers a means to expedite the analysis of scientific articles at scale, and has long been recognized as a way to mitigate information overload in biomedical research. Since the beginning of the pandemic, the NLP community has consistently addressed the needs of domain experts by applying cutting-edge methods to enhance comprehension and knowledge discovery.
The primary objective of this study is to assess the adequacy of commonly employed biomedical transformer-based models, trained on pre-pandemic corpora, in capturing the semantic features present in medical imaging literature. Concurrently, we aim to observe the potential advantage of continual and citation-informed pretraining on COVID-19 literature.
To accomplish this, we introduce a unique and independent test set specifically focused on the medical imaging domain. This novel dataset serves as a valuable resource for the extrinsic evaluation of contextual embeddings, comprising realistic text classification tasks based on 560 gold labels for two target variables: the clinical task and the imaging modality.
This project depends on Python. To install it, clone the repository and run pip in the project root, i.e.:
git clone https://github.com/zurlog/abs-embeddings-eval
cd abs-embeddings-eval
pip install -e .
Notebooks in scripts/:
- Embeddings_Extraction: Compute the abstracts embeddings from 15 BERT models.
- Embeddings_Comparison_Modality: Metrics calculations in the prediction of the imaging modality employed.
- Embeddings_Comparison_Task: Metrics calculations in the prediction of the clinical task.
- Setup: Dependencies and utility functions.
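The results files report both accuracy and balanced accuracy. The sketch below illustrates the difference using the standard definition of balanced accuracy (the mean of per-class recall). It is not taken from the notebooks, and the label values ("CT", "X-ray") are hypothetical placeholders rather than the dataset's actual classes.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class weighs the same."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: a classifier that always predicts "CT".
y_true = ["CT"] * 8 + ["X-ray"] * 2
y_pred = ["CT"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

On imbalanced labels, plain accuracy can look good for a classifier that ignores the minority class, while balanced accuracy exposes it.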
Files in results/:
- Modality_accuracy.csv and Modality_balanced_acc.csv: Results of the imaging modality prediction comparison.
- Task_(primary)_accuracy.csv and Task_(primary_balanced_acc.csv: Results of the clinical task prediction comparison.
- 📁 embeddings: Pre-computed vectors stored as serialized Pandas Series.
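A minimal sketch of how a serialized Pandas Series of vectors might be loaded and stacked into a matrix. It assumes the files are pickles in which each element is one abstract's vector; the README does not state the exact format, and the file name, index values, and `load_embeddings` helper are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_embeddings(path):
    """Load a pickled Series of vectors and stack it into a 2-D array.

    Assumes each Series element is one fixed-length vector.
    """
    series = pd.read_pickle(path)
    matrix = np.vstack(series.to_list())
    return series.index.to_list(), matrix


# Toy round trip with made-up data, standing in for a real embeddings file.
toy = pd.Series({"paper_a": [0.1, 0.2, 0.3], "paper_b": [0.4, 0.5, 0.6]})
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_embeddings.pkl"  # hypothetical file name
    toy.to_pickle(toy_path)
    ids, matrix = load_embeddings(toy_path)

print(ids, matrix.shape)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.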
Files in data/:
- subset_wlabels.csv: 560 records subset with gold labels.
With the TensorBoard Embedding Projector, we graphically represented SPECTER embeddings against the corresponding labels. The interactive dashboard allows users to search for specific terms in abstracts, and highlights articles that are adjacent to each other in the low-dimensional embedding space. The user can choose and tune three popular dimensionality reduction methods (UMAP, t-SNE, PCA).
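One common way to feed vectors and labels to the Embedding Projector is a pair of tab-separated files: one with the vectors (no header) and one with the metadata (a header row is required when there is more than one column). The sketch below writes such files; whether the authors prepared their dashboard this way is not stated, and the function name, column names, and label values are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_projector_files(vectors, labels, out_dir):
    """Write vectors.tsv and metadata.tsv for the Embedding Projector."""
    out_dir = Path(out_dir)
    vec_path = out_dir / "vectors.tsv"
    meta_path = out_dir / "metadata.tsv"
    with open(vec_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(vectors)  # one vector per row, no header
    with open(meta_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["modality", "task"])  # hypothetical column names
        writer.writerows(labels)  # same row order as the vectors
    return vec_path, meta_path


# Toy data with made-up vectors and labels.
vectors = [[0.1, 0.2], [0.3, 0.4]]
labels = [("CT", "diagnosis"), ("X-ray", "segmentation")]
with tempfile.TemporaryDirectory() as tmp:
    vec_path, meta_path = write_projector_files(vectors, labels, tmp)
    vector_lines = vec_path.read_text().splitlines()
    metadata_lines = meta_path.read_text().splitlines()

print(len(vector_lines), len(metadata_lines))
```

The metadata rows must follow the same order as the vector rows, since the projector pairs them by position.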
References, Inspiration, Code Snippets, etc.
- Classification Framework from Born et al. (2021). On the role of artificial intelligence in medical imaging of COVID-19. Patterns (New York, N.Y.), 2(6), 100269.
- Labels from Detailed results of systematic meta-analysis [Direct Link]
- Inspiration from González-Márquez et al. (2023). The Landscape of Biomedical Research. bioRxiv, 2023.04.10.536208.