Update atlas science citation (facebookresearch#512)
* Update Atlas paper to Science, and 772M proteins

* Update paper citation to Science version, 772M proteins
tomsercu authored Apr 4, 2023
1 parent bc88c16 · commit a944dc5
Showing 1 changed file with 21 additions and 17 deletions: README.md
@@ -2,10 +2,10 @@

[![atlas](https://user-images.githubusercontent.com/3605224/199301187-a9e38b3f-71a7-44be-94f4-db0d66143c53.png)](https://esmatlas.com)

***Update March 2023:*** ESM Atlas was updated to `v2023_02`, simultaneous with the MGnify 2023_02 release in collaboration with EBI. We also release pre-computed ESM2 embeddings for the whole Atlas.
***Update March 2023:*** ESM Atlas was updated to `v2023_02`, bringing the number of predicted protein structures from 617 million to a total of 772 million. This update was simultaneous with the MGnify 2023_02 release in collaboration with EBI. We also release pre-computed ESM2 embeddings for the whole Atlas.

This repository contains code and pre-trained weights for **Transformer protein language models** from the Meta Fundamental AI Research Protein Team (FAIR), including our state-of-the-art [**ESM-2** and **ESMFold**](#esmfold), as well as [**MSA Transformer**](https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1), [**ESM-1v**](#zs_variant) for predicting variant effects and [**ESM-IF1**](#invf) for inverse folding.
Transformer protein language models were introduced in the preprint of the paper ["Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019)](https://doi.org/10.1101/622803).
Transformer protein language models were introduced in the [2019 preprint](https://doi.org/10.1101/622803) of the paper ["Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences"](https://doi.org/10.1073/pnas.2016239118).
ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks.
ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.

@@ -14,21 +14,25 @@ The Atlas was updated in March 2023 in collaboration with EBI. The new `v2023_02`
Bulk download, blog post and the resources provided on the Atlas website are documented [on this README](#atlas).

In December 2022, we released two simultaneous preprints on protein design.
["Language models generalize beyond natural proteins" (Verkuil, Kabeli, et al., 2022)](https://doi.org/10.1101/2022.12.21.521521) uses ESM2 to design de novo proteins. The data associated with the preprint can be found in [scripts/design_lm/](scripts/design_lm/).
["A high-level programming language for generative protein design" (Hie, Candido, et al., 2022)](https://doi.org/10.1101/2022.12.21.521521) uses ESMFold to design proteins according to a high-level programming language.
["Language models generalize beyond natural proteins"](https://doi.org/10.1101/2022.12.21.521521) uses ESM2 to design de novo proteins. The data associated with the preprint can be found in [scripts/design_lm/](scripts/design_lm/).
["A high-level programming language for generative protein design"](https://doi.org/10.1101/2022.12.21.521521) uses ESMFold to design proteins according to a high-level programming language.



<details><summary><b>Citation</b></summary>
For ESM2, ESMFold and ESM Atlas:
```bibtex
@article{lin2022evolutionary,
title={Evolutionary-scale prediction of atomic level protein structure with a language model},
author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
year={2022},
journal={bioRxiv},
note={bioRxiv 2022.07.20.500902},
url={https://doi.org/10.1101/2022.07.20.500902},
@article{lin2023evolutionary,
title = {Evolutionary-scale prediction of atomic-level protein structure with a language model},
author = {Zeming Lin and Halil Akin and Roshan Rao and Brian Hie and Zhongkai Zhu and Wenting Lu and Nikita Smetanin and Robert Verkuil and Ori Kabeli and Yaniv Shmueli and Allan dos Santos Costa and Maryam Fazel-Zarandi and Tom Sercu and Salvatore Candido and Alexander Rives},
journal = {Science},
volume = {379},
number = {6637},
pages = {1123--1130},
year = {2023},
doi = {10.1126/science.ade2574},
url = {https://www.science.org/doi/abs/10.1126/science.ade2574},
note = {Earlier versions as preprint: bioRxiv 2022.07.20.500902},
}
```

@@ -78,8 +78,8 @@ For transformer protein language models:
- December 2022: The Meta Fundamental AI Research Protein Team (FAIR) released two simultaneous preprints on protein design:
["Language models generalize beyond natural proteins" (Verkuil, Kabeli, et al., 2022)](https://doi.org/10.1101/2022.12.21.521521), and ["A high-level programming language for generative protein design" (Hie, Candido, et al., 2022)](https://doi.org/10.1101/2022.12.21.521521).
- November 2022: ESM Metagenomic Atlas, a repository of 600M+ metagenomic structures, released; see the [website](https://esmatlas.com/) and [bulk download details](#atlas)
- November 2022: ESMFold - new end-to-end structure prediction model released (see [Lin et al. 2022](https://doi.org/10.1101/2022.07.20.500902))
- August 2022: ESM-2 - new SOTA Language Models released (see [Lin et al. 2022](https://doi.org/10.1101/2022.07.20.500902))
- November 2022: ESMFold - new end-to-end structure prediction model released (see [Lin et al. 2022](https://www.science.org/doi/abs/10.1126/science.ade2574))
- August 2022: ESM-2 - new SOTA Language Models released (see [Lin et al. 2022](https://www.science.org/doi/abs/10.1126/science.ade2574))
- April 2022: New inverse folding model ESM-IF1 released, trained on CATH and UniRef50 predicted structures.
- August 2021: Added flexibility to tokenizer to allow for spaces and special tokens (like `<mask>`) in sequence.
- July 2021: New pre-trained model ESM-1v released, trained on UniRef90 (see [Meier et al. 2021](https://doi.org/10.1101/2021.07.09.450648)).
@@ -95,7 +99,7 @@ For transformer protein language models:

| Shorthand | `esm.pretrained.` | Dataset | Description |
|-----------|-----------------------------|---------|--------------|
| ESM-2 | `esm2_t36_3B_UR50D()` `esm2_t48_15B_UR50D()` | UR50 (sample UR90) | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with [Lin et al. 2022](https://doi.org/10.1101/2022.07.20.500902) (Aug 2022 update). |
| ESM-2 | `esm2_t36_3B_UR50D()` `esm2_t48_15B_UR50D()` | UR50 (sample UR90) | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with [Lin et al. 2022](https://www.science.org/doi/abs/10.1126/science.ade2574) (Aug 2022 update). |
| ESMFold | `esmfold_v1()` | PDB + UR50 | End-to-end single sequence 3D structure predictor (Nov 2022 update). |
| ESM-MSA-1b| `esm_msa1b_t12_100M_UR50S()` | UR50 + MSA | MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with [Rao et al. 2021](https://www.biorxiv.org/content/10.1101/2021.02.12.430858v2) (ICML'21 version, June 2021). |
| ESM-1v | `esm1v_t33_650M_UR90S_1()` ... `esm1v_t33_650M_UR90S_5()`| UR90 | Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with [Meier et al. 2021](https://doi.org/10.1101/2021.07.09.450648). |
@@ -228,7 +232,7 @@ print(struct.b_factor.mean()) # this will be the pLDDT

Besides `esm.pretrained.esmfold_v1()`, which is the best-performing model and the one we recommend using, we
also provide `esm.pretrained.esmfold_v0()`, which was used for the experiments in
[Lin et al. 2022](https://doi.org/10.1101/2022.07.20.500902).
[Lin et al. 2022](https://www.science.org/doi/abs/10.1126/science.ade2574).
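
A minimal usage sketch of the Python API, assuming a CUDA-capable GPU and that ESMFold's extra dependencies are installed (the example sequence and output filename are illustrative):

```python
import torch
import esm

# Load the recommended model; esm.pretrained.esmfold_v0() can be swapped in
# to match the setting used for the paper's experiments.
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # assumes a CUDA-capable GPU is available

# Illustrative sequence; any single protein sequence works the same way.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # prediction returned as a PDB-format string

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)  # per-residue pLDDT confidence is stored in the B-factor column
```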

We also provide a command line interface (`esm-fold`) that efficiently predicts structures in bulk from a FASTA file using ESMFold:
```
@@ -265,7 +269,7 @@ By default, predictions will be batched together so that shorter sequences are p

The `--cpu-offload` flag can be useful for making predictions on longer sequences. It will attempt to offload some parameters to CPU RAM rather than storing them on the GPU.

Finally, the ablation experiments for LMs of varying sizes [Lin et al. 2022 table S1](https://doi.org/10.1101/2022.07.20.500902) are released as `esm.pretrained.esmfold_structure_module_only_*()`. We don't recommend using these models for structure prediction.
Finally, the language models of varying sizes from the ablation experiments in [Lin et al. 2022, Table S1](https://www.science.org/doi/abs/10.1126/science.ade2574) are released as `esm.pretrained.esmfold_structure_module_only_*()`. We don't recommend using these models for structure prediction.


### Compute embeddings in bulk from FASTA <a name="bulk_fasta"></a>
@@ -691,7 +695,7 @@ Comparison to related protein language models on structure prediction tasks.
fitted with logistic regression on 20 structures.
For more details on the method, see [Rao et al. 2020](https://doi.org/10.1101/2020.12.15.422761).
* For structure prediction, an AlphaFold2 structure module is trained directly from the frozen language model embeddings.
For more details on the method, see [Lin et al. 2022](https://doi.org/10.1101/2022.07.20.500902).
For more details on the method, see [Lin et al. 2022](https://www.science.org/doi/abs/10.1126/science.ade2574); a sketch of how the frozen embeddings and contact maps are extracted is shown after this list.
* Direct coupling analysis methods (Gremlin, mfDCA, Psicov) and ESM-MSA-1 use the [trRosetta MSAs](https://yanglab.nankai.edu.cn/trRosetta/benchmark/), while other methods predict from single sequence.
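
The comparisons above start from representations and contact maps taken from the frozen language model. A minimal extraction sketch follows, assuming the 650M ESM-2 checkpoint (not listed in the table above) and an illustrative sequence:

```python
import torch
import esm

# Load a pretrained ESM-2 model and its alphabet; the 650M checkpoint name is an
# assumption here -- other esm.pretrained.* language models follow the same pattern.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_protein", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)

embeddings = results["representations"][33]  # frozen per-residue embeddings (incl. BOS/EOS tokens)
contacts = results["contacts"]               # predicted inter-residue contact probabilities
```

Downstream evaluations (e.g., the logistic regression probe or the structure module) are trained on top of these frozen outputs.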


