Running Compared Baseline Methods

This folder contains the code for running all compared baseline methods, including BM25, Levenshtein Distance, BioBERT, and SAPBERT.

Run the code

  • Cosine Distance: Run the "matching_simstring.py" file and specify the "sim_measure" parameter as "cosine". More details here.

  • Jaccard Distance: Run the "matching_simstring.py" file and specify the "sim_measure" parameter as "jaccard". More details here.

  • Levenshtein Distance: Run the "matching_strings.py" file and use the "generate_predictions_levenshtein" function. More details here.

  • Jaro-Winkler Distance: Run the "matching_strings.py" file and use the "generate_predictions_jaro_winkler" function. More details here.

  • BM25: Run the "matching_bm25.py" file. More details here.

  • ada002: Run the "generate_embedding_openai.py" file to generate concept embeddings with OpenAI's "text-embedding-ada-002" model (Link). Then run the "matching_embeddings.py" file to calculate pairwise embedding similarities and generate the linking predictions.

  • BioGPT: Run the "generate_embedding_biogpt.py" file to generate concept embeddings with the BioGPT model from here. Then run the "matching_embeddings.py" file to calculate pairwise embedding similarities and generate the linking predictions.

  • SAPBERT: Run the "generate_embedding_hf.py" file to generate concept embeddings with the SAPBERT model from the Hugging Face platform (Link). Then run the "matching_embeddings.py" file to calculate pairwise embedding similarities and generate the linking predictions.
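For readers unfamiliar with the string-based baselines in the list above, the following is a minimal sketch of top-K retrieval by Levenshtein distance. It is not the code in "matching_strings.py": the function names `levenshtein` and `top_k_by_levenshtein`, the lowercasing step, and the example concept names are all assumptions made for illustration.

```python
import heapq
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def top_k_by_levenshtein(query: str, candidates: Sequence[str], k: int = 5) -> List[str]:
    """Return the k candidates with the smallest edit distance to the query.

    Lowercasing before comparison is an assumption, not something the
    repository is known to do.
    """
    return heapq.nsmallest(
        k, candidates, key=lambda c: levenshtein(query.lower(), c.lower())
    )


if __name__ == "__main__":
    # Hypothetical concept names, for illustration only.
    concepts = ["heart attacks", "hearing loss", "myocardial infarction"]
    print(top_k_by_levenshtein("heart attack", concepts, k=2))
```

A smaller distance means a closer match, so candidates are ranked in ascending order. The Jaro-Winkler baseline would rank in the opposite direction if it is expressed as a similarity rather than a distance.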

The remaining baseline methods follow the same pipeline as SAPBERT; only the model is replaced.

  • BioBERT: Utilize the BioBERT model from here.

  • BioClinicalBERT: Utilize the BioClinicalBERT model from here.

  • BioDistilBERT: Utilize the BioDistilBERT model from here.

  • KRISSBERT: Utilize the KRISSBERT model from here.
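The "same pipeline, different model" idea can be sketched as a matching routine that takes the embedding function as a parameter. The sketch below is an assumption about the general shape of such a pipeline, not the repository's implementation: `toy_trigram_embedder` is a hypothetical stand-in (hashed character trigrams) used only so the example runs without downloading a model, and `link_concepts` is an illustrative name.

```python
import math
import zlib
from typing import Callable, Dict, List, Sequence

Embedder = Callable[[str], List[float]]


def toy_trigram_embedder(text: str, dim: int = 64) -> List[float]:
    """Stand-in for a real model: hashed character-trigram counts, L2-normalized."""
    padded = f"  {text.lower()} "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        # crc32 is deterministic across runs, unlike the built-in hash() on str.
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def link_concepts(queries: Sequence[str], concepts: Sequence[str],
                  embed: Embedder, k: int = 3) -> Dict[str, List[str]]:
    """Embed both sides with `embed`, then keep the k most similar concepts per query."""
    concept_vecs = [(c, embed(c)) for c in concepts]
    results: Dict[str, List[str]] = {}
    for q in queries:
        qv = embed(q)
        ranked = sorted(concept_vecs, key=lambda cv: cosine(qv, cv[1]), reverse=True)
        results[q] = [c for c, _ in ranked[:k]]
    return results


if __name__ == "__main__":
    # Hypothetical concept names, for illustration only.
    concepts = ["heart attack", "hearing loss", "renal failure"]
    print(link_concepts(["heart attack"], concepts, toy_trigram_embedder, k=2))
```

Under this design, switching from one baseline model to another would mean passing a different `embed` callable while the matching step stays unchanged.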

File Details

  • File "generate_embedding_openai.py": Generates embeddings for the compared "ada002" method.

  • File "generate_embedding_biogpt.py": Generates embeddings for the compared "BioGPT" method.

  • File "generate_embedding_hf.py": Generates embeddings for the compared "SAPBERT", "BioBERT", "BioClinicalBERT", "BioDistilBERT", and "KRISSBERT" methods.

  • File "matching_simstring.py": Identifies the top-K candidates for the "Cosine Distance" and "Jaccard Distance" methods.

  • File "matching_bm25.py": Identifies the top-K candidates for the "BM25" method.

  • File "matching_strings.py": Identifies the top-K candidates for the "Levenshtein Distance" and "Jaro-Winkler Distance" methods.

  • File "matching_embeddings.py": Identifies the top-K candidates for the embedding-based methods.

  • File "utils/metrics.py": Describes how we calculate the linking accuracy results.

  • File "utils/distances.py": Details the method for calculating the similarity between embedding pairs.

  • File "utils/others.py": Contains other utility functions for data input/output (I/O).
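As a rough illustration of how top-K candidates are commonly scored, the sketch below computes accuracy@K: the fraction of queries whose gold concept appears among the first K predictions. Whether "utils/metrics.py" uses exactly this definition is an assumption; the function name `accuracy_at_k` and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def accuracy_at_k(predictions: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of gold queries whose target is among the top-k ranked predictions.

    `predictions` maps each query to its ranked candidate list (best first);
    `gold` maps each query to its single correct concept. Queries missing
    from `predictions` count as misses.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for query, target in gold.items()
        if target in predictions.get(query, [])[:k]
    )
    return hits / len(gold)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    predictions = {"q1": ["c2", "c1"], "q2": ["c3", "c4"]}
    gold = {"q1": "c1", "q2": "c9"}
    for k in (1, 2):
        print(f"accuracy@{k} = {accuracy_at_k(predictions, gold, k):.2f}")
```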