
Commit

Code release for "Language Models Generalize Beyond Natural Proteins". (facebookresearch#527)

Co-authored-by: kwanUm <[email protected]>
Co-authored-by: Tom Sercu <[email protected]>
3 people authored Apr 17, 2023
1 parent c7ba180 commit d6a2a8b
Showing 33 changed files with 3,475 additions and 8 deletions.
9 changes: 5 additions & 4 deletions README.md
@@ -2,20 +2,20 @@

[![atlas](https://user-images.githubusercontent.com/3605224/199301187-a9e38b3f-71a7-44be-94f4-db0d66143c53.png)](https://esmatlas.com)

***Update March 2023:*** ESM Atlas was updated to `v2023_02` bringing the number of predicted protein structures from 617 million to a total of 772 million. This update was simultaneous with the MGnify 2023_02 release in collaboration with EBI. We also release pre-computed ESM2 embeddings for the whole Atlas.
***Update April 2023:*** Code for the two simultaneous preprints on protein design is now released! Code for "Language models generalize beyond natural proteins" is under [examples/lm-design/](examples/lm-design/). Code for "A high-level programming language for generative protein design" is under [examples/protein-programming-language/](examples/protein-programming-language/).

This repository contains code and pre-trained weights for **Transformer protein language models** from the Meta Fundamental AI Research Protein Team (FAIR), including our state-of-the-art [**ESM-2** and **ESMFold**](#esmfold), as well as [**MSA Transformer**](https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1), [**ESM-1v**](#zs_variant) for predicting variant effects and [**ESM-IF1**](#invf) for inverse folding.
Transformer protein language models were introduced in the [2019 preprint](https://doi.org/10.1101/622803) of the paper ["Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences"](https://doi.org/10.1073/pnas.2016239118).
ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks.
ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.

In November 2022, we released `v0` of the [ESM Metagenomic Atlas](https://esmatlas.com), an open atlas of 617 million predicted metagenomic protein structures.
The Atlas was updated in March 2023 in collaboration with EBI. The new `v2023_02` adds another 150 million predicted structures to the Atlas.
The Atlas was updated in March 2023 in collaboration with EBI. The new `v2023_02` adds another 150 million predicted structures to the Atlas, as well as pre-computed ESM2 embeddings.
Bulk download, blog post and the resources provided on the Atlas website are documented [on this README](#atlas).

In December 2022, we released two simultaneous preprints on protein design.
["Language models generalize beyond natural proteins"](https://doi.org/10.1101/2022.12.21.521521) uses ESM2 to design de novo proteins. The data associated with the preprint can be found in [scripts/design_lm/](scripts/design_lm/).
["A high-level programming language for generative protein design"](https://doi.org/10.1101/2022.12.21.521526) uses ESMFold to design proteins according to a high-level programming language.
* "Language models generalize beyond natural proteins" ([PAPER](https://doi.org/10.1101/2022.12.21.521521), [CODE](examples/lm-design/)) uses ESM2 to design de novo proteins. The code and data associated with the preprint can be found [here](examples/lm-design/).
* "A high-level programming language for generative protein design" ([PAPER](https://doi.org/10.1101/2022.12.21.521526), [CODE](examples/protein-programming-language/)) uses ESMFold to design proteins according to a high-level programming language.



@@ -78,6 +78,7 @@ For transformer protein language models:

<details><summary><b>What's New</b></summary>

- April 2023: Code for the protein design preprints released under [examples/lm-design/](examples/lm-design/) and [examples/protein-programming-language/](examples/protein-programming-language/).
- March 2023: We release an update to the ESM Metagenomic Atlas, `v2023_02`. See [website](https://esmatlas.com/) and [bulk download details](#atlas).
- December 2022: The Meta Fundamental AI Research Protein Team (FAIR) released two simultaneous preprints on protein design:
["Language models generalize beyond natural proteins" (Verkuil, Kabeli, et al., 2022)](https://doi.org/10.1101/2022.12.21.521521), and ["A high-level programming language for generative protein design" (Hie, Candido, et al., 2022)](https://doi.org/10.1101/2022.12.21.521526).
1,097 changes: 1,097 additions & 0 deletions examples/lm-design/2N2U.pdb


32 changes: 32 additions & 0 deletions examples/lm-design/README.md
@@ -0,0 +1,32 @@
# LM design examples

This folder contains code demonstrating protein design with a language model. The code was used to perform the two design tasks described in the paper [Language models generalize beyond natural proteins](https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1).


## Notebook examples

Refer to the two notebooks in this folder to run the fixed backbone and free generation design tasks.


## Shell examples

To run the two design tasks from shell, do the following:

1. First, install additional requirements: ```pip install -r additional_requirements.txt```
2. Run fixed-backbone design: ```python -m lm_design task=fixedbb pdb_fn=$PWD/2N2U.pdb```
3. Run free-generation design: ```python -m lm_design task=free_generation```

Notes:
- Use the ```seed=<number>``` flag to generate different designs, e.g.
  ```python -m lm_design task=free_generation seed=42```
- Control the generated length in free generation with ```free_generation_length=<number>```, e.g.
  ```python -m lm_design task=free_generation free_generation_length=68```

Additional, more advanced configuration options can be found in [config.yaml](conf/config.yaml).


## Paper data
The data from the preprint is available under [paper-data/](paper-data).
This includes designed sequences, their predicted structures, experimental validation results, linear projection for pairwise distance prediction, and details on dataset construction for model training.
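The released designed sequences can be inspected with a few lines of standard-library Python. A minimal sketch, assuming the sequences are distributed in FASTA format (the exact file names and layout are documented in [paper-data/](paper-data), so treat the file name here as hypothetical):

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dict."""
    records, name, seq = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(seq)
            name, seq = line[1:], []
        elif line:
            seq.append(line)
    if name is not None:
        records[name] = "".join(seq)
    return records

# Hypothetical usage:
# with open("paper-data/designs.fasta") as f:
#     designs = read_fasta(f)
```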
Empty file added examples/lm-design/__init__.py
Empty file.
3 changes: 3 additions & 0 deletions examples/lm-design/additional_requirements.txt
@@ -0,0 +1,3 @@
nltk
py3Dmol
hydra
Empty file.
62 changes: 62 additions & 0 deletions examples/lm-design/conf/config.yaml
@@ -0,0 +1,62 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
seed: 0
num_seqs: 1
test_mode: False
allow_missing_residue_coords: True
suppress_AA: 'C'
disable_cuda: False
cuda_device_idx: # Set to a numeric value to override the default GPU device.
task: free_generation # fixedbb or free_generation
pdb_fn: # set as empty string when using free_generation
free_generation_length: 100

tasks:
free_generation:
num_iter: 170000
resample_y_every: 3
resample_y_temp: 1
stage_fixedbb_args: ${tasks.fixedbb}


fixedbb:
num_iter: 170000

# Accept/Reject
accept_reject:
energy_cfg:
struct_w: 3
LM_w: 2
ngram_w: 1
ngram_orders: [1,2,3]
temperature:
scheduler: StepLR
step_size: 10000
gamma: 0.5
initial: 8



# Hydra config
hydra:
job_logging:
formatters:
colorlog:
datefmt: "%m-%d %H:%M:%S"
handlers:
file:
class: logging.FileHandler
mode: w
filename: logging.l
console:
class: logging.StreamHandler
stream: ext://sys.stdout

hydra_logging:
handlers:
console:
class: logging.StreamHandler
stream: ext://sys.stdout
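The `accept_reject` section above combines a structure term, a language-model term, and an n-gram term into one energy, with an MCMC temperature that decays on a StepLR-style schedule. A minimal sketch of that acceptance rule, using only the weights and schedule parameters from the config (the energy values themselves are placeholders, not the paper's actual energy functions):

```python
import math
import random

def step_temperature(i, initial=8.0, step_size=10000, gamma=0.5):
    """StepLR-style schedule: multiply the temperature by gamma every step_size iterations."""
    return initial * (gamma ** (i // step_size))

def total_energy(e_struct, e_lm, e_ngram, struct_w=3.0, lm_w=2.0, ngram_w=1.0):
    """Weighted sum of the three energy terms (weights from the accept_reject config)."""
    return struct_w * e_struct + lm_w * e_lm + ngram_w * e_ngram

def accept(delta_energy, temperature):
    """Metropolis criterion: always accept improvements, else accept with prob exp(-dE/T)."""
    if delta_energy <= 0:
        return True
    return random.random() < math.exp(-delta_energy / temperature)
```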
197 changes: 197 additions & 0 deletions examples/lm-design/fixed_backbone.ipynb
@@ -0,0 +1,197 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4f84941e",
"metadata": {},
"source": [
"# Fixed Backbone design from LM\n",
"\n",
"This notebook demonstrates the Fixed Backbone design task from the paper [Language models generalize beyond natural proteins\n",
"](https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1).\n",
"\n",
"Given an input structure as a .pdb file, the LM is used iteratively in an MCMC optimization to find a sequence that folds to that structure.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d378b7f4-0792-446b-9e95-f7025bee5bec",
"metadata": {},
"outputs": [],
"source": [
"# First install additional dependencies\n",
"!pip install -r additional_requirements.txt\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfd13d6a",
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"import os\n",
"import time\n",
"import hydra\n",
"import py3Dmol\n",
"from lm_design import Designer\n",
"\n",
"# Params\n",
"pdb_fn = os.getcwd() + '/2N2U.pdb'\n",
"seed = 0 # Use different seeds to get different sequence designs for the same structure\n",
"TASK = \"fixedbb\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "989996bf",
"metadata": {},
"outputs": [],
"source": [
"# Load hydra config from config.yaml\n",
"with hydra.initialize_config_module(config_module=\"conf\"):\n",
" cfg = hydra.compose(\n",
" config_name=\"config\", \n",
" overrides=[\n",
" f\"task={TASK}\", \n",
" f\"seed={seed}\", \n",
" f\"pdb_fn={pdb_fn}\", \n",
" # 'tasks.fixedbb.num_iter=100' # DEBUG - use a smaller number of iterations\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63178538",
"metadata": {},
"outputs": [],
"source": [
"# Create a designer from configuration\n",
"des = Designer(cfg, pdb_fn)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86d25575",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Run the designer\n",
"start_time = time.time()\n",
"des.run_from_cfg()\n",
"print(f\"finished after {(time.time() - start_time) / 3600:.2f} hours\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6d9f742",
"metadata": {},
"outputs": [],
"source": [
"print(\"Output seq:\", des.output_seq)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba6c8c66",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Fold output with ESMFold API\n",
"output_seq = des.output_seq\n",
"# Fold with api:\n",
"# curl -X POST --data \"GENGEIPLEIRATTGAEVDTRAVTAVEMTEGTLGIFRLPEEDYTALENFRYNRVAGENWKPASTVIYVGGTYARLCAYAPYNSVEFKNSSLKTEAGLTMQTYAAEKDMRFAVSGGDEVWKKTPTANFELKRAYARLVLSVVRDATYPNTCKITKAKIEAFTGNIITANTVDISTGTEGSGTQTPQYIHTVTTGLKDGFAIGLPQQTFSGGVVLTLTVDGMEYSVTIPANKLSTFVRGTKYIVSLAVKGGKLTLMSDKILIDKDWAEVQTGTGGSGDDYDTSFN\" https://api.esmatlas.com/foldSequence/v1/pdb/\n",
"import requests\n",
"import json\n",
"url = 'https://api.esmatlas.com/foldSequence/v1/pdb/'\n",
"r = requests.post(url, data=output_seq)\n",
"output_struct = r.text\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5c06ab3",
"metadata": {},
"outputs": [],
"source": [
"# Visualize output structure\n",
"view = py3Dmol.view(width=800, height=800)\n",
"view.addModel(output_struct, 'pdb')\n",
"view.setStyle({'cartoon': {'color': 'spectrum'}})\n",
"view.zoomTo()\n",
"view.show()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7247225",
"metadata": {},
"outputs": [],
"source": [
"des.x_logits.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e5c184",
"metadata": {},
"outputs": [],
"source": [
"# Visualize wild type structure\n",
"wt_struct_file = pdb_fn\n",
"view = py3Dmol.view(width=800, height=800)\n",
"view.addModel(open(wt_struct_file).read(), 'pdb')\n",
"view.setStyle({'cartoon': {'color': 'spectrum'}})\n",
"view.zoomTo()\n",
"view.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "222ec344",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"vscode": {
"interpreter": {
"hash": "5502aca739f2549ad2771378ffc455b2bbb8b06f1a91617971f7097758a3cf84"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
