To avoid potential dependency conflicts among Python packages and to facilitate future upgrades, we employ a dual-environment approach for preprocessing and model training.
The tp_pre environment is used for structure prediction and graph construction. The key packages include:
conda create -n tp_pre python=3.7
conda activate tp_pre
pip install tensorflow==1.14.0
pip install tqdm==4.66.1
pip install pyyaml
pip install --upgrade protobuf==3.19.0
conda install -c conda-forge -c bioconda hhsuite==3.3.0
The tp_lmmsg environment is used for feature extraction, training, and testing. The key packages include:
conda create -n tp_lmmsg python=3.7
conda activate tp_lmmsg
pyparsing=2.4.7
rdflib=4.2.2
bio-embeddings=0.2.2
torch=1.10.0 (+ CUDA 11.1)
gensim=4.2.0
conda install -c conda-forge -c bioconda hhsuite==3.3.0
The pyg packages are also required in tp_lmmsg. In the commands below, set TORCH and CUDA to match the torch build above (e.g. TORCH=1.10.0, CUDA=cu111):
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_cluster-1.5.9-cp38-cp38-linux_x86_64.whl
pip install torch_cluster-1.5.9-cp38-cp38-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_scatter-2.0.9-cp38-cp38-linux_x86_64.whl
pip install torch_scatter-2.0.9-cp38-cp38-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_sparse-0.6.12-cp38-cp38-linux_x86_64.whl
pip install torch_sparse-0.6.12-cp38-cp38-linux_x86_64.whl
pip install torch_geometric==2.4.0
Detailed information on installing the pyg wheels can be found at the PyG wheel index (https://data.pyg.org/whl/).
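As a convenience, the three wheels can be fetched in one loop. This is a minimal sketch assuming torch 1.10.0 with CUDA 11.1; note that the cp38 wheels listed above require a Python 3.8 interpreter, so choose the tag that matches your environment's Python version:
# Assumed values matching the pins above; adjust to your setup.
TORCH=1.10.0
CUDA=cu111
PYTAG=cp38  # must match the environment's Python version
for whl in torch_cluster-1.5.9 torch_scatter-2.0.9 torch_sparse-0.6.12; do
    wget "https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/${whl}-${PYTAG}-${PYTAG}-linux_x86_64.whl"
    pip install "${whl}-${PYTAG}-${PYTAG}-linux_x86_64.whl"
done
pip install torch_geometric==2.4.0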
We also provide .yml environment configurations for fast deployment:
conda env create -f tp_pre.yml
conda env create -f tp_lmmsg.yml
To run the model completely, additional tools and databases need to be installed. First, create a temporary working directory:
cd
mkdir temp
- Install psiblast:
  Download:
  wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.12.0/ncbi-blast-2.12.0+-x64-linux.tar.gz
  Decompress:
  tar zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
  Add the path to the system environment in ~/.bashrc (vim ~/.bashrc):
  export BLAST_HOME={your_path}/ncbi-blast-2.12.0+
  export PATH=$PATH:$BLAST_HOME/bin
  Reload and check:
  source ~/.bashrc
  psiblast -h
- Download database: users can download the nrdb90.tar.gz file from our Hugging Face repo and decompress it.
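  For example (a minimal sketch; the target path follows the directory tree shown later):
  # If the archive already contains an nrdb90/ folder, extract into utils/psiblast/ instead.
  mkdir -p utils/psiblast/nrdb90
  tar zxvf nrdb90.tar.gz -C utils/psiblast/nrdb90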
- Install hhsuite and check:
  conda install -c conda-forge -c bioconda hhsuite==3.3.0
  Check:
  hhblits -h
- Download database: the uniclust30_2018_08 database will be used. It can be downloaded from this link. Once the download finishes, rename it according to config.yaml (i.e. uniclust30_2018_08).
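  A placement sketch, assuming the standard Uniclust archive name and the utils/hhblits/ location from the directory tree shown later:
  mkdir -p utils/hhblits
  tar zxvf uniclust30_2018_08_hhsuite.tar.gz -C utils/hhblits/
  # If the extracted directory name differs, rename it to match config.yaml:
  # mv utils/hhblits/<extracted_name> utils/hhblits/uniclust30_2018_08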
- According to the trRosetta introduction, you need to download the trRosetta pre-trained model.
- We have made a backup of the trRosetta pre-trained model in our Hugging Face repo for a quick implementation.
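  A placement sketch for the weights (the archive name is an assumption; the target path follows the directory tree shown later):
  mkdir -p utils/trRosetta
  tar xjvf model2019_07.tar.bz2 -C utils/trRosetta/  # archive name may differ by source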
- SeqVec pre-trained model address: SeqVec download. For a detailed introduction, see the SeqVec model page. We have also made a backup of the SeqVec pre-trained model in our Hugging Face repo.
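  The two model files go under node_embed/, matching the directory tree shown later (a sketch; we assume the download yields options.json and weights.hdf5):
  mkdir -p node_embed/seqvec_pretrain_model
  cp options.json weights.hdf5 node_embed/seqvec_pretrain_model/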
- ProtTrans prerequisites: Transformers installation. For a detailed introduction, see the ProtTrans model page.
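  In the tp_lmmsg environment, the prerequisite install is roughly the following (a sketch; versions are not pinned here, and sentencepiece is assumed because many ProtTrans tokenizers require it):
  conda activate tp_lmmsg
  pip install transformers sentencepiece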
Now we can organize each module in the repository as follows:
.
├── data
│   ├── XU_train
│   │   ├── positive
│   │   └── negative
│   ├── XU_test
│   │   ├── positive
│   │   └── negative
│   └── other datasets ...
│
├── model_files (for ProtTrans models)
│
├── node_embed
│   ├── seqvec_pretrain_model
│   │   ├── options.json
│   │   └── weights.hdf5
│   ├── LM_embedding.py
│   └── other python files
│
├── utils
│   ├── psiblast
│   │   ├── nrdb90
│   │   │   ├── nrdb90.phr
│   │   │   ├── nrdb90.pin
│   │   │   ├── nrdb90.pog
│   │   │   └── ...
│   │   ├── blosum62
│   │   └── blosum62.json
│   ├── trRosetta
│   │   ├── model2019_07
│   │   │   ├── model.xaa.data-00000-of-00001
│   │   │   ├── model.xaa.index
│   │   │   ├── model.xaa.meta
│   │   │   └── ...
│   │   ├── predict_many.py
│   │   └── utils.py
│   ├── hhblits (optional)
│   │   └── uniclust30_2018_08
│   │       ├── uniclust30_2018_08_a3m_db
│   │       ├── uniclust30_2018_08_a3m_db.index
│   │       ├── uniclust30_2018_08_a3m_db.ffdata
│   │       └── ...
│   └── other python files
│
├── saved_models (TP_LMMSG model checkpoints)
│
└── other files ...
At this point, we have completed all environment configuration and software installation. If users want to modify the tools, they can check config.yaml and the Python scripts in node_embed/ and utils/.
To run the entire preprocessing, training, and testing pipeline, users can use the Bash script run.sh.
Because the entire workflow uses two environments, please note the following steps before running the script:
Check the path to Conda:
conda info | grep -i 'base environment'
Modify the following command in the run.sh file (via vim run.sh):
source /home/username/anaconda3/etc/profile.d/conda.sh
to match the correct path of the conda.sh file on your system.
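For orientation, the pattern that run.sh relies on looks roughly like this (a sketch of the environment-switching logic, not the actual script contents):
# Sourcing conda.sh makes `conda activate` work inside a non-interactive script.
source /home/username/anaconda3/etc/profile.d/conda.sh
conda activate tp_pre
# ... structure prediction and graph construction (preprocessing) ...
conda activate tp_lmmsg
# ... feature extraction, training, and testing ...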
Then use the following commands:
cd TP_LMMSG
bash run.sh
To train on different datasets, users can modify the dataset paths in the run.sh script.
If you wish to modify the training and testing code, you can directly edit train.py and test.py. You can either modify the args or use the command line for parameter tuning:
python train.py -use_hhm False -use_lm uni bfd brt -e 5 -d 37 -top_k 10
Note: -top_k should be set according to the shortest sequence in the dataset.
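A quick way to find that minimum length in a FASTA file (a sketch; the input path is an example):
# Sum per-record lengths (handles multi-line FASTA), then take the minimum.
awk '/^>/ {if (seq) print length(seq); seq=""} /^[^>]/ {seq = seq $0} END {if (seq) print length(seq)}' \
    data/XU_train/positive/example.fasta | sort -n | head -n 1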
python test.py -use_hhm False -use_lm uni bfd brt -d 27
If you wish to include hhblits, modify the run.sh script to replace generate_features_no_hmm.py with generate_features_hmm.py. Then conduct training and testing as normal.
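For example, the swap can be made with sed (a sketch; verify the matching line in run.sh first):
sed -i 's/generate_features_no_hmm\.py/generate_features_hmm.py/' run.sh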