This repository contains the implementation of "Knowledge Distillation of DNABERT for Prediction of Genomic Elements". It includes the source code for data acquisition, distillation and fine-tuning of the student models, together with usage examples. Pre-trained and fine-tuned models will be available soon.
All models were built upon the framework provided by HuggingFace. Code released by the authors of DNABERT and DistilBERT, as well as an unofficial reimplementation of MiniLM, served as a base for this work and was extended to fit its specific needs.
For more information on the methodology and results, please check the thesis report.
Model | # Layers | # Hidden units | Link |
---|---|---|---|
DNABERT | 12 | 768 | Made available by the authors (see the official DNABERT GitHub) |
DistilBERT | 6 | 768 | dnabert-distilbert |
MiniLM | 6 | 768 | dnabert-minilm |
MiniLM small | 6 | 384 | dnabert-minilm-small |
MiniLM mini | 3 | 384 | dnabert-minilm-mini |
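Once the checkpoints are published, the distilled models should be loadable with the standard HuggingFace API. A minimal sketch, assuming the Hub names follow the fine-tuning examples below (e.g. `Peltarion/dnabert-distilbert`) and that the tokenizer ships with the checkpoint:

```python
# Minimal sketch: load a distilled checkpoint from the HuggingFace Hub.
# The model name is taken from the fine-tuning examples below; availability
# and exact names depend on the release.
from transformers import AutoModel, AutoTokenizer

model_name = "Peltarion/dnabert-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# DNABERT-style models expect DNA as space-separated, overlapping 6-mers.
sequence = "ATGGCA TGGCAT GGCATC GCATCG CATCGA ATCGAT"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number of tokens, hidden size)
```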
Below are examples of how to perform basic distillation, promoter fine-tuning and evaluation. Check the allowed arguments for more advanced options, e.g. `python run_distil.py -h`.
DistilBERT-style distillation:

```bash
python run_distil.py \
    --train_data_file data/pretrain/sample_6_3k.txt \
    --output_dir models \
    --student_model_type distildna \
    --student_config_name src/transformers/dnabert-config/distilbert-config-6 \
    --teacher_name_or_path Path_to_pretrained_DNABERT \
    --mlm \
    --do_train \
    --alpha_ce 2 \
    --alpha_mlm 7 \
    --alpha_cos 1 \
    --per_gpu_train_batch_size 32 \
    --learning_rate 0.0004 \
    --logging_steps 500 \
    --save_steps 8000 \
    --num_train_epochs 2
```
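The `--alpha_*` flags weight the individual DistilBERT-style losses: the soft-label cross-entropy against the teacher, the masked-language-modelling loss, and a cosine loss on the hidden states. A rough sketch of how such a weighted objective is combined; the names are illustrative and not the exact internals of run_distil.py:

```python
import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, teacher_logits, mlm_loss,
                               student_hidden, teacher_hidden, attention_mask,
                               alpha_ce=2.0, alpha_mlm=7.0, alpha_cos=1.0,
                               temperature=2.0):
    """Illustrative DistilBERT-style objective weighted by the --alpha_* flags."""
    # Soft-label loss: KL divergence between temperature-scaled distributions.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Cosine loss aligning student and teacher hidden states on real tokens.
    mask = attention_mask.bool()
    s = student_hidden[mask]          # (n_tokens, hidden)
    t = teacher_hidden[mask]
    target = s.new_ones(s.size(0))    # +1 -> maximise cosine similarity
    loss_cos = F.cosine_embedding_loss(s, t, target)

    return alpha_ce * loss_ce + alpha_mlm * mlm_loss + alpha_cos * loss_cos
```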
MiniLM distillation:

```bash
python run_distil.py \
    --train_data_file data/pretrain/sample_6_3k.txt \
    --output_dir models \
    --student_model_type minidna \
    --student_config_name src/transformers/dnabert-config/minilm-config-6 \
    --teacher_name_or_path Path_to_pretrained_DNABERT \
    --mlm \
    --do_train \
    --per_gpu_train_batch_size 32 \
    --learning_rate 0.0004 \
    --logging_steps 500 \
    --save_steps 8000 \
    --num_train_epochs 2
```
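In contrast to the DistilBERT recipe, MiniLM distils the teacher's last-layer self-attention: the student mimics the teacher's attention distributions and its value-value relations. A rough sketch of that objective (illustrative only, not the exact code in run_distil.py):

```python
import torch
import torch.nn.functional as F

def minilm_loss(student_attn, teacher_attn, student_values, teacher_values):
    """Illustrative MiniLM deep self-attention distillation objective.

    student_attn / teacher_attn: attention probabilities, (batch, heads, seq, seq)
    student_values / teacher_values: value vectors, (batch, heads, seq, head_dim)
    """
    eps = 1e-12

    # 1) Attention-distribution transfer: KL(teacher || student) per query position.
    loss_at = F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")

    def value_relation(v):
        # Scaled dot-product between value vectors, softmax over the key axis.
        return F.softmax(v @ v.transpose(-1, -2) / (v.size(-1) ** 0.5), dim=-1)

    # 2) Value-relation transfer: same KL on the value-value relation matrices.
    loss_vr = F.kl_div((value_relation(student_values) + eps).log(),
                       value_relation(teacher_values), reduction="batchmean")

    return loss_at + loss_vr
```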
Task-specific (promoter) distillation: before running the script, process the promoter dataset into training data with `porcess_finetune_data.py`; the expected 6-mer input format is sketched below.
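The data is expected as sequences of space-separated, overlapping 6-mers (the same k-mer representation DNABERT uses). A tiny sketch of that conversion; the real `porcess_finetune_data.py` may do more (labels, splits, filtering):

```python
def seq_to_kmers(sequence: str, k: int = 6) -> str:
    """Turn a raw DNA string into space-separated overlapping k-mers."""
    sequence = sequence.upper().strip()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

print(seq_to_kmers("ATGGCATCGAT"))
# ATGGCA TGGCAT GGCATC GCATCG CATCGA ATCGAT
```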
```bash
python run_distil.py \
    --train_data_file data/promoters/6mer \
    --output_dir models \
    --student_model_type distildnaprom \
    --student_name_or_path Peltarion/dnabert-distilbert \
    --teacher_model_type dnaprom \
    --teacher_name_or_path Path_to_finetuned_DNABERT \
    --do_train \
    --alpha_ce 1 \
    --alpha_mlm 1 \
    --per_gpu_train_batch_size 32 \
    --learning_rate 0.00005 \
    --logging_steps 500 \
    --save_steps 1000 \
    --num_train_epochs 3 \
    --do_val \
    --eval_data_file data/promoters/6mer
```
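For task-specific distillation the soft targets are the teacher's class logits rather than MLM predictions; judging from the flags above, `--alpha_ce` weights the soft (teacher) loss while `--alpha_mlm` is reused for the hard-label loss, but check run_distil.py to confirm. A generic sketch of that kind of objective:

```python
import torch.nn.functional as F

def classification_distillation_loss(student_logits, teacher_logits, labels,
                                     alpha_soft=1.0, alpha_hard=1.0, temperature=2.0):
    """Illustrative soft-label + hard-label objective for classifier distillation."""
    # Soft loss: match the teacher's temperature-scaled class distribution.
    loss_soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard loss: standard cross-entropy against the promoter labels.
    loss_hard = F.cross_entropy(student_logits, labels)
    return alpha_soft * loss_soft + alpha_hard * loss_hard
```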
Promoter fine-tuning: as above, the promoter dataset must first be processed with `porcess_finetune_data.py`.
```bash
python run_finetune.py \
    --data_dir data/promoters/6mer \
    --output_dir models \
    --model_type distildnaprom \
    --model_name_or_path Peltarion/dnabert-distilbert \
    --do_train \
    --per_gpu_train_batch_size 32 \
    --learning_rate 0.00005 \
    --logging_steps 100 \
    --save_steps 1000 \
    --num_train_epochs 3 \
    --evaluate_during_training
```
Prediction example with a fine-tuned DNABERT:

```bash
python run_finetune.py \
    --data_dir data/promoters/6mer \
    --output_dir models \
    --do_predict \
    --model_type dnaprom \
    --model_name_or_path Path_to_finetuned_DNABERT \
    --per_gpu_eval_batch_size 32
```
Evaluation example with a fine-tuned MiniLM:

```bash
python run_finetune.py \
    --data_dir data/promoters/6mer \
    --output_dir models \
    --do_eval \
    --model_type minidnaprom \
    --model_name_or_path Path_to_finetuned_MiniLM \
    --per_gpu_eval_batch_size 32
```
Visualization example with a fine-tuned DistilBERT:

```bash
python run_finetune.py \
    --data_dir data/promoters/6mer \
    --output_dir models \
    --do_visualize \
    --model_type distildnaprom \
    --model_name_or_path Path_to_finetuned_DistilBERT \
    --per_gpu_eval_batch_size 32
```
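After fine-tuning, a single sequence can also be scored directly from the saved checkpoint. A minimal sketch, assuming the checkpoint in models/ loads through the standard sequence-classification API and that label 1 corresponds to the promoter class:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: score one candidate sequence with a fine-tuned checkpoint.
# "models" is the --output_dir used above; point it at the saved checkpoint.
checkpoint = "models"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# Same space-separated 6-mer format as the training data.
sequence = "ATGGCA TGGCAT GGCATC GCATCG CATCGA ATCGAT"
with torch.no_grad():
    logits = model(**tokenizer(sequence, return_tensors="pt")).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print(f"P(promoter) = {probs[1]:.3f}")  # assumes label 1 = promoter
```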
To cite this work:

```bibtex
@misc{pales2022knowledge,
  title  = {Knowledge Distillation of DNABERT for Prediction of Genomic Elements},
  author = {Pal{\'e}s Huix, Joana},
  year   = {2022}
}
```