In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.
We evaluate our sequence and MSA models – EvoDiff-Seq and EvoDiff-MSA, respectively – across a range of generation tasks to demonstrate their power for controllable protein design. Below, we provide documentation for running our models.
EvoDiff is described in this preprint; if you use the code from this repository or the results, please cite the preprint.
- Evodiff
- Table of contents
- Installation
- Available models
- Unconditional generation
- Conditional sequence generation
- Analysis
- Downloading generated sequences
To download our code, we recommend creating a clean conda environment with Python v3.8.5.
conda create --name evodiff python=3.8.5
In that new environment, install EvoDiff:
pip install evodiff
pip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch
You will also need to install PyTorch (we tested our models on v2.0.1), PyTorch Geometric, and PyTorch Scatter.
We provide a notebook with installation guidance that can be found in examples/evodiff.ipynb. It also includes examples on how to generate a smaller number of sequences and MSAs using our models. We recommend following this notebook if you would like to use our models to generate proteins.
Our downstream analysis scripts make use of a variety of tools we do not include in our package installation. To run the scripts, please download the following packages in addition to EvoDiff:
- TM score
- Omegafold
- ProteinMPNN
- ESM-IF1; see this Jupyter notebook for setup details.
- PGP
- DR-BERT
We refer to the setup instructions outlined by the authors of those tools.
We obtain sequences from the UniRef50 dataset, which contains approximately 42 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the OpenFold dataset, which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. The intrinsically disordered region (IDR) data were obtained from the Reverse Homology GitHub.
For the scaffolding structural motifs task, we use the baselines compiled in RFDiffusion. We provide pdb and fasta files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide pdb files used for conditionally generating MSAs in the examples/scaffolding-msas folder.
To access the UniRef50 test sequences, use the following code:
test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
The filenames for the train and validation OpenFold splits are saved in data/valid_msas.csv and data/train_msas.csv.
To load a model:
from evodiff.pretrained import OA_DM_38M
model, collater, tokenizer, scheme = OA_DM_38M()
Available evodiff models are:
D3PM_BLOSUM_640M()
D3PM_BLOSUM_38M()
D3PM_UNIFORM_640M()
D3PM_UNIFORM_38M()
OA_DM_640M()
OA_DM_38M()
MSA_D3PM_BLOSUM_RANDSUB()
MSA_D3PM_BLOSUM_MAXSUB()
MSA_D3PM_UNIFORM_RANDSUB()
MSA_D3PM_UNIFORM_MAXSUB()
MSA_OA_DM_RANDSUB()
MSA_OA_DM_MAXSUB()
It is also possible to load our LRAR baseline models:
LR_AR_640M()
LR_AR_38M()
Note: if you want to download a BLOSUM model, you will first need to download data/blosum62-special-MSA.mat.
We investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective.
In order-agnostic autoregressive diffusion (OADM), one amino acid is converted to a special mask token at each step in the forward process.
After
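The one-token-per-step masking described above can be illustrated with a small, self-contained sketch. This is not the EvoDiff implementation: the mask symbol `#`, the function name `oadm_forward`, and the use of a uniformly random masking order are assumptions made only for illustration.

```python
import random

MASK = "#"  # placeholder mask symbol, chosen for this sketch only


def oadm_forward(sequence, seed=0):
    """Yield the sequence after each forward step, masking one residue per step."""
    rng = random.Random(seed)
    order = list(range(len(sequence)))
    rng.shuffle(order)  # random masking order (assumed here)
    tokens = list(sequence)
    for position in order:
        tokens[position] = MASK
        yield "".join(tokens)


trajectory = list(oadm_forward("FSLFDKDGDG"))
for step, state in enumerate(trajectory, start=1):
    print(step, state)
```

In this sketch each step masks exactly one previously unmasked position, so a sequence of length L is fully masked after L steps.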
To explicitly leverage evolutionary information, we designed and trained EvoDiff-MSA models using the MSA Transformer architecture on the OpenFold dataset. To do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max"). Within each subsampling strategy, we then trained EvoDiff-MSA models with the OADM and D3PM corruption schemes.
EvoDiff can generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids. All available models can be used to unconditionally generate new sequences, without needing to download the training datasets.
To unconditionally generate 100 sequences from EvoDiff-Seq, run the following script:
python evodiff/generate.py --model-type oa_dm_38M --num-seqs 100
The default model type is oa_dm_640M; the other available EvoDiff models are:
oa_dm_38M
d3pm_blosum_38M
d3pm_blosum_640M
d3pm_uniform_38M
d3pm_uniform_640M
Our LRAR baseline models are also available:
lr_ar_38M
lr_ar_640M
An example of unconditionally generating a sequence of a specified length can be found in this notebook.
To evaluate the generated sequences, we implement our self-consistency Omegafold ESM-IF pipeline, as shown in analysis/self_consistency_analysis.py. To use this evaluation script, you must have the dependencies listed under the Installation section installed.
To explicitly leverage evolutionary information, we design and train EvoDiff-MSA models using the MSA Transformer architecture on the OpenFold dataset. To do so, we subsample MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences (“Random”) or by greedily maximizing for sequence diversity (“Max”).
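The two subsampling strategies can be sketched on a toy alignment as follows. This is an illustrative sketch, not the code used to train EvoDiff-MSA: the function names are hypothetical, keeping the query as the first row is an assumption, and "greedily maximizing for sequence diversity" is interpreted here as farthest-first selection under Hamming distance. Truncation to 512 residues is omitted.

```python
import random


def hamming(a, b):
    """Number of aligned positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def subsample_random(msa, depth, seed=0):
    """Keep the query (row 0) and draw the remaining rows uniformly at random."""
    rng = random.Random(seed)
    rest = rng.sample(msa[1:], min(depth - 1, len(msa) - 1))
    return [msa[0]] + rest


def subsample_max_hamming(msa, depth):
    """Keep the query, then greedily add the row farthest from those already chosen."""
    chosen = [0]
    remaining = list(range(1, len(msa)))
    while remaining and len(chosen) < depth:
        # Score each candidate by its smallest distance to any chosen row.
        best = max(
            remaining,
            key=lambda i: min(hamming(msa[i], msa[j]) for j in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return [msa[i] for i in chosen]


# Toy alignment; real MSAs are much deeper and longer.
toy_msa = ["ACDEF", "ACDEY", "AC-EF", "WCDKY", "ACDEF"]
max_rows = subsample_max_hamming(toy_msa, depth=3)
random_rows = subsample_random(toy_msa, depth=3)
print(max_rows)
print(random_rows)
```

The greedy variant picks the most divergent row first and skips the exact duplicate of the query, while the random variant can return redundant rows.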
It is possible to unconditionally generate an entire MSA, using the following script:
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming
The default model type is msa_oa_dm_maxsub, which is EvoDiff-MSA OADM trained on Max subsampled sequences. The other available EvoDiff models are:
- EvoDiff-MSA OADM trained on random subsampled sequences: msa_oa_dm_randsub
- EvoDiff-MSA D3PM-BLOSUM trained on Max subsampled sequences: msa_d3pm_blosum_maxsub
- EvoDiff-MSA D3PM-BLOSUM trained on random subsampled sequences: msa_d3pm_blosum_randsub
- EvoDiff-MSA D3PM-Uniform trained on Max subsampled sequences: msa_d3pm_uniform_maxsub
- EvoDiff-MSA D3PM-Uniform trained on random subsampled sequences: msa_d3pm_uniform_randsub
You can also specify a desired number of sequences per MSA, sequence length, batch size, and more.
EvoDiff’s OADM diffusion framework induces a natural method for conditional sequence generation by fixing some subsequences and predicting the remainder. Because the model is trained to generate proteins with an arbitrary decoding order, this is easily accomplished by simply masking and decoding the desired portions. We apply EvoDiff’s power for controllable protein design across three scenarios: conditioning on evolutionary information encoded in MSAs, inpainting functional domains, and scaffolding structural motifs.
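The mask-and-decode idea can be sketched without a trained model. In the example below, `toy_predictor` is a hypothetical stand-in that samples amino acids uniformly at random; a real run would call an EvoDiff model at that point. The mask symbol and function names are likewise assumptions for illustration only.

```python
import random

MASK = "#"  # placeholder mask symbol, chosen for this sketch only
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_predictor(tokens, position, rng):
    """Stand-in for a trained model: returns a uniformly random amino acid."""
    return rng.choice(AMINO_ACIDS)


def inpaint(sequence, start, end, predictor=toy_predictor, seed=0):
    """Mask sequence[start:end], then decode it one position at a time."""
    rng = random.Random(seed)
    tokens = list(sequence)
    masked_positions = list(range(start, end))
    for position in masked_positions:
        tokens[position] = MASK
    rng.shuffle(masked_positions)  # arbitrary decoding order
    for position in masked_positions:
        tokens[position] = predictor(tokens, position, rng)
    return "".join(tokens)


original = "FSLFDKDGDGTITTKELGTV"
generated = inpaint(original, start=5, end=12)
print(original)
print(generated)
```

Only the masked window changes; the residues outside it are carried over unchanged and act as the conditioning context.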
First, we test the ability of EvoDiff-MSA (msa_oa_dm_maxsub) to generate new query sequences conditioned on the remainder of an MSA, thus generating new members of a protein family without needing to train family-specific generative models.
To generate a new query sequence given an alignment, use the --start-msa flag. This starts conditional generation by sampling from a validation MSA. To run this script, you must have the OpenFold dataset and splits downloaded.
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-msa
If you want to generate on a custom MSA, it is possible to retrofit existing code.
The code can also generate an alignment given a query sequence: use the --start-query flag, which starts with the query and generates the alignment.
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-query
NOTE: you can only specify one of the above flags at a time. You cannot specify both (--start-query and --start-msa) together.
Please look at generate.py for more information.
Because EvoDiff generates directly in sequence space, we hypothesized that it could natively generate intrinsically disordered regions (IDRs). IDRs are regions within a protein sequence that lack secondary or tertiary structure, and they carry out important and diverse functional roles in the cell directly facilitated by their lack of structure. Despite their prevalence and critical roles in function and disease, IDRs do not fit neatly in the structure-function paradigm and remain outside the capabilities of structure-based protein design methods.
We used inpainting with EvoDiff-Seq and EvoDiff-MSA to intentionally generate disordered regions conditioned on their surrounding structured regions, and then used DR-BERT to predict disorder scores for each residue in the generated and natural sequences. Note: to generate with our scripts here, you must have the IDR dataset downloaded. Different pre-processing steps may apply to other datasets.
To run our code and generate IDRs from EvoDiff-Seq, run the following:
python evodiff/conditional_generation.py --model-type oa_dm_640M --cond-task idr --num-seqs 1
or equivalently, from EvoDiff-MSA:
python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --query-only --max-seq-len 150 --num-seqs 1
Both commands sample IDRs from the IDR dataset and generate new ones.
Given that the fixed functional motif includes the residue identities for the motif, we show that a sequence-only model can be used for a motif scaffolding task. We used EvoDiff to generate scaffolds for a set of 17 motif-scaffolding problems by fixing the functional motif, supplying only the motif's amino-acid sequence as conditioning information, and then decoding the remainder of the sequence.
For the scaffolding structural motifs task, we provide pdb and fasta files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide a3m files used for conditionally generating MSAs in the examples/scaffolding-msas folder. Please view the PDB codes available and select an appropriate code. In this example, we use PDB code 1prw with domains 16-35 (FSLFDKDGDGTITTKELGTV) and 52-71 (INEVDADGNGTIDFPEFLTM). An example of generating 1 MSA scaffold of a structural motif can be found in this notebook.
To generate from EvoDiff-Seq:
python evodiff/conditional_generation.py --model-type oa_dm_640M --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 100 --scaffold-min 50 --scaffold-max 100
The --start-idxs and --end-idxs flags indicate the start and end indices of the motif being scaffolded. To define multiple motifs, supply the start and end indices of each motif as additional arguments, as in the example above.
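The 1prw example uses residues 16-35 and 52-71 in the text but indices 15-34 and 51-70 on the command line, which suggests the flags take 0-indexed, inclusive positions. Under that assumption (inferred from the example, not confirmed against the script), the following sketch converts 1-indexed residue ranges into the flag list; `motif_flags` is a hypothetical helper, not part of EvoDiff.

```python
def motif_flags(ranges):
    """Convert 1-indexed inclusive residue ranges into 0-indexed flag pairs."""
    flags = []
    for first, last in ranges:
        # Shift both bounds down by one; the end index stays inclusive.
        flags += ["--start-idxs", str(first - 1), "--end-idxs", str(last - 1)]
    return flags


# Residue ranges and motif sequences quoted in the 1prw example.
ranges = [(16, 35), (52, 71)]
motifs = ["FSLFDKDGDGTITTKELGTV", "INEVDADGNGTIDFPEFLTM"]

# Sanity check: each inclusive range should span exactly the motif length.
lengths_match = all(
    last - first + 1 == len(motif)
    for (first, last), motif in zip(ranges, motifs)
)

flags = motif_flags(ranges)
print(lengths_match)
print(" ".join(flags))
```

Checking the range length against the motif sequence is a cheap way to catch off-by-one errors before launching a generation run.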
Equivalent code for generating a new scaffold sequence from EvoDiff-MSA:
python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 1 --query-only
To generate a custom scaffold for a given motif, supply the PDB ID and the residue indices of the motif. The code will download the PDB for you. In some cases, PDB files downloaded from RCSB will be incomplete or contain additional residues. We have implemented code to circumvent PDB-reading issues, but we recommend care when generating files for this task.
To analyze the quality of the generations, we look at:
- amino acid KL divergence (aa_reconstruction_parity_plot)
- secondary structure KL divergence (evodiff/analysis/calc_kl_ss.py)
- model perplexity for sequences (evodiff/analysis/sequence_perp.py)
- model perplexity for MSAs (evodiff/analysis/msa_perp.py)
- Fréchet inception distance (evodiff/analysis/calc_fid.py)
- Hamming distance (evodiff/analysis/calc_nearestseq_hamming.py)
- RMSD score (analysis/rmsd_analysis.py)
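As a rough illustration of the first metric in this list, the sketch below computes a KL divergence between the amino-acid frequency distributions of two sets of sequences. It is not the code behind aa_reconstruction_parity_plot: the function names, the pseudocount, and the direction of the divergence (generated relative to reference) are assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_distribution(sequences, pseudocount=1e-6):
    """Amino-acid frequencies over a set of sequences, smoothed to avoid zeros."""
    counts = Counter(
        residue for seq in sequences for residue in seq if residue in AMINO_ACIDS
    )
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """KL(p || q) in nats over the 20 standard amino acids."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


# Toy inputs; real comparisons would use test-set and generated sequences.
reference = ["ACDEFGHIKLMNPQRSTVWY", "AAAACCCC"]
generated = ["ACDEFGHIKLMNPQRSTVWY", "AAAAAAAA"]

p_ref = aa_distribution(reference)
p_gen = aa_distribution(generated)
kl = kl_divergence(p_gen, p_ref)
kl_self = kl_divergence(p_ref, p_ref)
print(kl, kl_self)
```

A value near zero means the generated sequences reproduce the reference amino-acid composition; larger values indicate a compositional shift.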
We also compute the self-consistency perplexity to evaluate the foldability of generated sequences. To do so, we make use of various tools:
- TM score
- Omegafold
- ProteinMPNN
- ESM-IF1; see this Jupyter notebook for setup details.
- PGP
- DR-BERT
We refer to the setup instructions outlined by the authors of those tools.
Our analysis scripts for iterating over these tools are in the evodiff/analysis/downstream_bash_scripts folder. Once we run the scripts in this folder, we analyze the results in self_consistency_analysis.py.
We provide all generated sequences on the EvoDiff Zenodo.
To download our unconditionally generated sequences from the unconditional_generations.csv file:
curl -L -o unconditional_generations.csv "https://zenodo.org/record/8332830/files/unconditional_generations.csv?download=1"
To extract all unconditionally generated sequences created using the EvoDiff-Seq oa_dm_640M model, run the following code:
import pandas as pd
df = pd.read_csv('unconditional_generations.csv', index_col = 0)
subset = df.loc[df['model'] == 'evodiff_oa_dm_640M']
The CSV files containing generated data are organized as follows:
- Unconditional generations from sequence based models: unconditional_generations.csv
  - sequence: generated sequence
  - min hamming dist: minimum Hamming distance between generated sequence and all training sequences
  - seq len: length of generated sequence
  - model: model type used for generations; one of evodiff_oa_dm_38M, evodiff_oa_dm_640M, evodiff_d3pm_uniform_38M, evodiff_d3pm_uniform_640M, evodiff_d3pm_blosum_38M, evodiff_d3pm_blosum_640M, carp_38M, carp_640M, lr_ar_38M, lr_ar_640M, esm_1b, or esm_2
- Sequence predictions for unconditional structure generation baselines: esmif_predictions_unconditional_structure_generations.csv
  - sequence: predicted protein sequence from protein structure (using ESM-IF1 model)
  - seq len: length of generated sequence
  - model: 'foldingdiff' or 'rfdiffusion'
- Sequence generation via evolutionary alignments: msa_evolution_conditional_generations.csv
  - sequence: generated query sequences
  - seq len: length of generated sequence
  - model: model type used for generations; one of evodiff_msa_oa_dm_maxsub, evodiff_msa_oa_dm_randsub, esm_msa_1b, or potts
- Generated IDRs: idr_conditional_generations.csv
  - sequence: subsampled sequence that contains IDR
  - seq len: length of generated sequence
  - gen_idrs: the generated IDR sequence
  - original_idrs: the original IDR sequence
  - start_idxs: indices corresponding to start of IDR in sequence
  - end_idxs: indices corresponding to end of IDR in sequence (inclusive)
  - model: model type used for generations; evodiff_seq_oa_dm_640M or evodiff_msa_oa_dm_maxsub
- Successfully generated scaffolds: msa_scaffold.csv (EvoDiff-MSA generations) or seq_scaffold.csv (EvoDiff-Seq generations)
  - pdb: pdb code corresponding to scaffold task
  - seqs: generated scaffold and motif
  - start_idxs: indices corresponding to start of motif
  - end_idxs: indices corresponding to end of motif
  - seq len: length of generated sequence
  - scores: average predicted local distance difference test (pLDDT) of sequence
  - rmsd: motifRMSD between predicted motif coordinates and crystal motif coordinates
  - model: model type used for generations
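Given the column layout above, a per-model summary of unconditional_generations.csv can be sketched with the standard library alone. The rows below are made-up toy values that only mimic the documented column names (including the unnamed index column); `mean_by_model` is a hypothetical helper, and the real file may differ in details not described here.

```python
import csv
import io
from collections import defaultdict

# Toy rows that only mimic the documented columns; the values are made up.
toy_csv = """,sequence,min hamming dist,seq len,model
0,MKV,0.5,3,evodiff_oa_dm_640M
1,MKTA,0.7,4,evodiff_oa_dm_640M
2,GG,0.4,2,carp_640M
"""


def mean_by_model(handle, column):
    """Average a numeric column separately for each value of the model column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row["model"]] += float(row[column])
        counts[row["model"]] += 1
    return {model: totals[model] / counts[model] for model in totals}


summary = mean_by_model(io.StringIO(toy_csv), "seq len")
print(summary)
```

To run this on the downloaded file, pass `open("unconditional_generations.csv", newline="")` in place of the `io.StringIO` object.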
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos are subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third party trademarks or logos is subject to those third-party's policies.