TransGEM

Transformer-based model from gene expression to molecules.

TransGEM is a phenotype-based de novo drug design model that can generate new bioactive molecules without relying on disease target information.

Graphical abstract

Setup

Install the environment

  • Create a conda environment:
conda env create -f environment.yaml
  • Activate the environment:
conda activate TransGEM

Download data

The data related to this study can be downloaded here.

Usage

TransGEM training

  • on the subLINCS dataset
python train.py --data_path ./data/ --dataset subLINCS --gene_encoder tenfold_binary --gpu cuda:0 --epochs 200
  • on the HCC515 dataset
python train.py --data_path ./data/ --dataset HCC515 --gene_encoder tenfold_binary --gpu cuda:0 --epochs 200

TransGEM fine-tuning

python ft_train.py --data_path ./data/ --dataset HCC515 --gene_encoder tenfold_binary --gpu cuda:0

Trained TransGEM testing

python test.py --data_path ./data/ --dataset subLINCS --gene_encoder tenfold_binary --gpu cuda:0

Fine-tuned TransGEM testing

python ft_test.py --data_path ./data/ --dataset HCC515 --gene_encoder tenfold_binary --gpu cuda:0

TransGEM application

  • for prostate cancer
python app.py --data_path ./data/ --dataset PC --cell_line PC3 --gene_encoder tenfold_binary --gpu cuda:0 --seq_num 1000
  • for non-small cell lung cancer
python app.py --data_path ./data/ --dataset nsclc --cell_line A549 --gene_encoder tenfold_binary --gpu cuda:0 --seq_num 1000

Model options

  • usage:
python train.py --help
  • optional arguments:
-h, --help            show this help message and exit
--data_path           directory of the input data
--out_path            directory where the training results are written
--dataset             dataset used by the model (subLINCS/HCC515/PC/nsclc)
--gene_encoder        encoding form of the gene expression (value/one_hot/binary/tenfold_binary)
--gpu                 CUDA device id
--hidden_dim          hidden size of the transformer decoder
--ff_dim              dimension of the feed-forward layer
--PE_dropout          dropout of the positional encoding
--TF_dropout          dropout of the transformer layers
--TF_N                number of transformer decoder layers
--TF_H                number of transformer decoder heads
--TF_act              activation function of the transformer layers
--batch_size          batch size
--epochs              number of epochs
--lr                  learning rate of Adam
--cell_line           cell line name of the disease
--pad_idx             id of the pad symbol
--start_idx           id of the start symbol
--end_idx             id of the end symbol
--max_len             maximum length of a generated molecule
--vocab_size          vocabulary size
--k                   number of candidate molecules kept at each beam search step
--alpha               weight balancing the length and score of molecules generated by beam search
--seq_num             number of molecules ultimately retained
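The --k and --alpha options control the beam search decoder. A common way to trade off sequence score against sequence length (a sketch of the standard length-normalization technique, not necessarily TransGEM's exact formula; function names here are hypothetical) is to divide the summed token log-probability by length raised to the power alpha:

```python
def length_penalized_score(log_prob_sum: float, length: int, alpha: float) -> float:
    """Score a candidate sequence; higher is better.
    Dividing by length**alpha keeps long sequences from being
    unfairly penalized for accumulating more negative log-probs
    (alpha=0 disables the normalization)."""
    return log_prob_sum / (length ** alpha)

def rank_beams(beams, alpha=0.6):
    """beams: list of (token_list, summed_log_prob) pairs.
    Returns the beams sorted best-first under the length penalty."""
    return sorted(
        beams,
        key=lambda b: length_penalized_score(b[1], len(b[0]), alpha),
        reverse=True,
    )
```

With alpha > 0, a longer SMILES candidate with a slightly lower total log-probability can still outrank a shorter one, which matters when generating whole molecules rather than fragments.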
  • Model parameters for the 4 encoding forms

| Encoding form  | hidden_dim | ff_dim | TF_N | TF_H |
|----------------|------------|--------|------|------|
| value          | 64         | 2048   | 6    | 8    |
| one_hot        | 64         | 512    | 6    | 8    |
| binary         | 64         | 512    | 6    | 8    |
| tenfold_binary | 64         | 512    | 6    | 8    |
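The name tenfold_binary suggests scaling each gene expression value by ten before converting it to a fixed-width binary vector. The sketch below is one plausible reading of that encoder, not TransGEM's actual implementation; the function name, bit width, and sign-bit convention are all assumptions:

```python
def tenfold_binary(value: float, n_bits: int = 8):
    """Hypothetical sketch of a 'tenfold binary' gene encoder:
    scale the expression value by 10, round to an integer, and
    emit a sign bit (1 = negative) followed by the magnitude in
    fixed-width binary, most significant bit first."""
    scaled = int(round(value * 10))
    sign = 1 if scaled < 0 else 0
    mag = min(abs(scaled), 2 ** (n_bits - 1) - 1)  # clip to fit the width
    bits = [(mag >> i) & 1 for i in reversed(range(n_bits - 1))]
    return [sign] + bits
```

Under this reading, an expression value of 1.23 becomes the integer 12 and encodes as [0, 0, 0, 0, 1, 1, 0, 0]; consult the repository's encoder code for the real scheme.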
