Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at Neural Information Processing Systems (NeurIPS), 2023. The code is currently a raw version and will be updated as soon as possible. If you have any inquiries, feel free to contact [email protected]
The environment requirements are the same as those of Uni-Mol.
The RDKit version should be 2022.9.5.
https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing
It currently includes the training data, the trained checkpoint, and the test data for DUD-E.
The training dataset is included in the Google Drive folder as train_no_test_af.zip. It contains several files:
dict_pkt.txt: dictionary for pocket atom types
dict_mol.txt: dictionary for molecule atom types
train.lmdb: train dataset
valid.lmdb: validation dataset
Use py_scripts/lmdb_utils.py to read the lmdb file. The keys in the lmdb files and corresponding descriptions are shown below:
"atoms": "atom types for each atom in the ligand"
"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"
"pocket_atoms": "atom types for each atom in the pocket"
"pocket_coordinates": "3D coordinates for each atom in the pocket"
"mol": "RDKit molecule object for the ligand"
"smi": "SMILES string for the ligand"
"pocket": "PDB ID of the pocket"
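The keys listed above can be inspected with a few lines of Python. The following is a minimal sketch and not the contents of py_scripts/lmdb_utils.py: it assumes that each LMDB value is a pickled dict (as in Uni-Mol-style LMDB files) and that "coordinates" holds one array of per-atom 3D coordinates per conformation. The function names `iter_lmdb_records` and `summarize_record` are hypothetical.

```python
import pickle


def iter_lmdb_records(path):
    """Yield (key, record) pairs from a single-file LMDB database.

    Assumption: every value is a pickled dict. Unpickling the "mol" entry
    needs RDKit to be importable, since it is an RDKit molecule object.
    """
    import lmdb  # third-party package, imported lazily

    env = lmdb.open(path, subdir=False, readonly=True, lock=False,
                    readahead=False, meminit=False)
    try:
        with env.begin() as txn:
            for key, value in txn.cursor():
                yield key, pickle.loads(value)
    finally:
        env.close()


def summarize_record(record):
    """Return basic counts for one record and check ligand shapes.

    Assumption: record["coordinates"] is a sequence of conformations,
    each with one coordinate triple per ligand atom.
    """
    n_atoms = len(record["atoms"])
    conformations = record["coordinates"]
    return {
        "pocket": record.get("pocket"),
        "smi": record.get("smi"),
        "n_ligand_atoms": n_atoms,
        "n_conformations": len(conformations),
        "n_pocket_atoms": len(record["pocket_atoms"]),
        # True when every conformation has exactly one entry per ligand atom
        "conformations_match_atoms": all(len(c) == n_atoms for c in conformations),
    }
```

For example, `for key, rec in iter_lmdb_records("train.lmdb"): print(summarize_record(rec))` would print one summary per complex, provided the assumptions above hold for the released files.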
The dataset is compiled from the PDBbind dataset and combines authentic protein-ligand complexes with complexes generated through HomoAug, a technique for augmenting data with homology-based transformations.
DUD-E
├── gene id
│ ├── receptor.pdb
│ ├── crystal_ligand.mol2
│ ├── actives_final.ism
│ ├── decoys_final.ism
│ ├── mols.lmdb (containing all actives and decoys)
│ ├── pocket.lmdb
lit_pcba
├── target name
│ ├── PDBID_protein.mol2
│ ├── PDBID_ligand.mol2
│ ├── actives.smi
│ ├── inactives.smi
│ ├── mols.lmdb (containing all actives and inactives)
│ ├── pocket.lmdb
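Before running evaluation, it can help to confirm that every target directory follows the layouts above. The sketch below is an illustration only and is not part of the repository; the function name `find_incomplete_targets` is hypothetical. For lit_pcba, only the fixed file names are listed, because the PDBID_* names differ per target.

```python
from pathlib import Path

# File names taken from the DUD-E layout shown above.
DUDE_FILES = (
    "receptor.pdb",
    "crystal_ligand.mol2",
    "actives_final.ism",
    "decoys_final.ism",
    "mols.lmdb",
    "pocket.lmdb",
)

# Fixed file names from the lit_pcba layout (PDBID_* files are omitted).
LIT_PCBA_FILES = ("actives.smi", "inactives.smi", "mols.lmdb", "pocket.lmdb")


def find_incomplete_targets(root, expected_files):
    """Map each target sub-directory of `root` to the expected files it lacks."""
    missing = {}
    for target_dir in sorted(Path(root).iterdir()):
        if not target_dir.is_dir():
            continue
        absent = [name for name in expected_files
                  if not (target_dir / name).is_file()]
        if absent:
            missing[target_dir.name] = absent
    return missing
```

Calling `find_incomplete_targets("DUD-E", DUDE_FILES)` returns an empty dict when every target is complete; the root path here is a placeholder for wherever the data was unpacked.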
See py_scripts/write_dude_multi.py.
Please refer to the HomoAug directory for details.
bash drugclip.sh
bash test.sh
bash retrieval.sh
In the Google Drive folder, you can find example files for pocket.lmdb and mols.lmdb under the retrieval directory.
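Conceptually, retrieval with a contrastive model amounts to ranking molecules by how similar their embeddings are to the pocket embedding. The sketch below illustrates that general idea with cosine similarity on plain Python lists. It is an assumption-laden illustration, not the code that retrieval.sh runs: the scoring function, the function names, and the toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_molecules(pocket_embedding, molecule_embeddings, top_k=None):
    """Return (name, score) pairs sorted from most to least similar.

    `molecule_embeddings` maps a molecule name to its embedding vector.
    """
    scored = [
        (name, cosine_similarity(pocket_embedding, embedding))
        for name, embedding in molecule_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]
```

In practice the embeddings would come from the trained pocket and molecule encoders, and the similarity would typically be computed in batches on a GPU rather than one pair at a time.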
If you find our work useful, please cite our paper:
@inproceedings{gao2023drugclip,
author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
booktitle = {NeurIPS 2023},
year = {2023},
url = {https://openreview.net/forum?id=lAbCgNcxm7},
}