Automated pipeline for massive PPI prediction and figure creation.
PPIFold is a tool for analyzing Protein-Protein Interactions from AlphaPulldown, with automated pre- and post-processing. It is used to generate PPI predictions for multiple systems without wasting time on generating initial files and sorting results. It predicts the best homo-oligomer for a protein and the best interface for interacting with specific proteins. This allows for the prediction of massive multimeric complexes with numerous PPIs.
- AlphaFold database
- Conda
- SignalP5 (optional)
- Singularity and Singularity Image
Installation of the AlphaFold database :
sudo apt install aria2
git clone https://github.com/deepmind/alphafold.git
cd alphafold
scripts/download_all_data.sh /<Directory> > download.log 2> download_all.log
SignalP5 installation (optional) :
https://services.healthtech.dtu.dk/services/SignalP-5.0/9-Downloads.php
tar -xvzf signalp-5.0b.Linux.tar.gz
cd signalp-5.0b/
sudo cp bin/signalp /usr/local/bin
sudo cp -r lib/* /usr/local/lib
Note
If you do not want to use SignalP, set --use_signalP to False.
Singularity installation :
https://docs.sylabs.io/guides/3.0/user-guide/installation.html#install-on-linux
Build the Singularity image (score generation) :
singularity build alpha-analysis_jax_0.4.sif alpha_analysis_jax0.4.def
PPIFold installation :
conda create -n PPIFold -c omnia -c bioconda -c conda-forge python==3.11 openmm==8.0 pdbfixer==1.9 kalign2 networkx hhsuite hmmer
conda activate PPIFold
pip install PPIFold
pip install -U "jax[cuda12]"
You need two initial files :
test.txt
This file needs to be a ".txt" file.
The initial file can be set up using UniProt IDs, FASTA sequences, or both.
UniProt IDs need to be on the same line, separated by commas.
Ex :
UniprotID1,UniprotID2,UniprotID3...
The FASTA sequence needs to start with ">", followed by the protein name.
Ex :
>Name
MFKRSGSLSLALMSSFCSSSLATPLSSAEFDHVARKCAPSVATSTLAAIAK
VESRFDPLAIHDNTTGETLHWQDHTQATQVVRHRLDARHSLDVGLMQINSR
NFSMLGLTPDGALKACPSLSAAANMLKSRYAGGETIDEKQIALRRAISAYN
TGNFIRGFANGYVRKVETAAQSLVPALIEPPQDDHKALKSEDTWDVWGSYQ
RRSQEDGVGGSIAPQPPDQDNGKSADDNQVLFDLY
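The layout described above can be checked before a run with a short Python sketch. This is only an illustration, not PPIFold's own parser: `parse_input` is a hypothetical helper, and it assumes that the line of UniProt IDs comes before any FASTA entry.

```python
from typing import Dict, List, Optional, Tuple


def parse_input(text: str) -> Tuple[List[str], Dict[str, str]]:
    """Split the initial file into UniProt IDs and named FASTA sequences."""
    uniprot_ids: List[str] = []
    sequences: Dict[str, str] = {}
    current_name: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A FASTA header starts a new named sequence.
            current_name = line[1:].strip()
            sequences[current_name] = ""
        elif current_name is not None:
            # Lines after a header are joined into one sequence.
            sequences[current_name] += line
        else:
            # Assumption: lines before the first header hold UniProt IDs.
            uniprot_ids.extend(p.strip() for p in line.split(",") if p.strip())
    return uniprot_ids, sequences


example = "O50331,O50333\n>Name\nMFKRSGSLSL\nALMSSFCSSS\n"
ids, seqs = parse_input(example)
print(ids)
print(seqs)
```

An empty ID list together with an empty sequence dictionary would indicate that the file does not follow either format.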
conf.txt
The conf.txt file needs to contain all paths.
Path_Uniprot_ID : Path and name of the initial file.
Path_AlphaFold_Data : Path to the AlphaFold database (default: ./alphadata).
Path_Singularity_Image : Path and name of the Singularity image.
Path_Pickle_Feature : Path to your feature folder (default: ./feature).
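A quick way to verify that conf.txt lists every path is sketched below in Python. It assumes one `key : value` pair per line, which is a guess based on the list above rather than a confirmed file format; `parse_conf`, `missing_keys`, and the example paths are hypothetical.

```python
from typing import Dict, List

EXPECTED_KEYS = (
    "Path_Uniprot_ID",
    "Path_AlphaFold_Data",
    "Path_Singularity_Image",
    "Path_Pickle_Feature",
)


def parse_conf(text: str) -> Dict[str, str]:
    """Read 'key : value' lines (assumed layout) into a dictionary."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def missing_keys(conf: Dict[str, str]) -> List[str]:
    """Return the expected keys that are absent or have an empty value."""
    return [key for key in EXPECTED_KEYS if not conf.get(key)]


example = (
    "Path_Uniprot_ID : ./test.txt\n"
    "Path_AlphaFold_Data : ./alphadata\n"
    "Path_Pickle_Feature : ./feature\n"
)
conf = parse_conf(example)
print(missing_keys(conf))
```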
To use PPIFold, simply run the PPIFold command in the folder containing conf.txt and test.txt.
PPIFold --use_mmseq Boolean --make_multimers Boolean --max_aa Integer --use_signalP Boolean --org String
Optional arguments
--use_mmseq Enable or disable MMseqs2 for feature generation, set to True by default
--make_multimers Set to True by default. Set it to False if you only want to generate features and MSAs
--max_aa The maximum length of a model that your GPU can generate (depends on VRAM), set to 2000 by default (24 GB of VRAM)
--use_signalP Use SignalP if your proteins can be periplasmic, set to True by default
--org If you use SignalP, you can select the organism (gram-, gram+, arch or euk), set to gram- by default
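When launching several runs from a script, the options above can be assembled programmatically. The following Python sketch only builds and prints the command line; `build_command` is a hypothetical helper, and it assumes Boolean values are passed as the literal strings True and False, as the usage line suggests.

```python
import shlex
from typing import List


def build_command(
    use_mmseq: bool = True,
    make_multimers: bool = True,
    max_aa: int = 2000,
    use_signalP: bool = True,
    org: str = "gram-",
) -> List[str]:
    """Assemble a PPIFold command line using the documented defaults."""
    if org not in ("gram-", "gram+", "arch", "euk"):
        raise ValueError(f"unknown organism: {org}")
    return [
        "PPIFold",
        "--use_mmseq", str(use_mmseq),
        "--make_multimers", str(make_multimers),
        "--max_aa", str(max_aa),
        "--use_signalP", str(use_signalP),
        "--org", org,
    ]


# Example: generate features and MSAs only, without SignalP.
cmd = build_command(make_multimers=False, use_signalP=False)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory containing conf.txt and test.txt.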
Tip
Save all your pickle files in the same directory.
This pipeline applies cutoffs on PAE (10), iQ-score (35) and hiQ-score (50).
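How these three cutoffs might combine is sketched below in Python. The direction and strictness of each comparison are assumptions (lower PAE is better, higher iQ-score and hiQ-score are better, the PAE cutoff also applies to homo-oligomers), and `passes_cutoffs` is a hypothetical helper rather than PPIFold's implementation.

```python
PAE_CUTOFF = 10.0   # cutoff stated for PAE
IQ_CUTOFF = 35.0    # cutoff stated for the iQ-score (hetero interactions)
HIQ_CUTOFF = 50.0   # cutoff stated for the hiQ-score (homo-oligomers)


def passes_cutoffs(pae: float, score: float, homo_oligomer: bool = False) -> bool:
    """Return True when a prediction clears the PAE and score cutoffs.

    Assumptions: PAE must be strictly below its cutoff, and the score
    must be at or above the cutoff matching the interaction type.
    """
    threshold = HIQ_CUTOFF if homo_oligomer else IQ_CUTOFF
    return pae < PAE_CUTOFF and score >= threshold


print(passes_cutoffs(pae=4.2, score=61.0))
print(passes_cutoffs(pae=12.0, score=80.0))
print(passes_cutoffs(pae=3.0, score=40.0, homo_oligomer=True))
```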
MSA depth
All aligned homologous sequences for O50333.
The y-axis represents the number of homologous sequences, the x-axis represents the positions in the sequence. The color represents the sequence identity.
Residue interaction table
Table of distances between contacting residues of O50331 and O50333.
Chains represent different proteins. Two residues in contact are specified, along with their distances. Distances are calculated from the center of mass of the residues. The distance threshold is 10 angstroms, and the PAE threshold is 5.
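The contact criterion can be illustrated with a small, dependency-free Python sketch. It computes each residue's center of mass and keeps inter-chain pairs within 10 angstroms. The PAE filter is left out, the inclusive comparison is an assumption, and the residue names and coordinates are invented for the example.

```python
import math
from typing import Dict, List, Tuple

ATOMIC_MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}
Atom = Tuple[str, float, float, float]  # element, x, y, z (angstroms)


def center_of_mass(atoms: List[Atom]) -> Tuple[float, float, float]:
    """Mass-weighted mean position of a residue's atoms."""
    total = sum(ATOMIC_MASS[a[0]] for a in atoms)
    x = sum(ATOMIC_MASS[a[0]] * a[1] for a in atoms) / total
    y = sum(ATOMIC_MASS[a[0]] * a[2] for a in atoms) / total
    z = sum(ATOMIC_MASS[a[0]] * a[3] for a in atoms) / total
    return (x, y, z)


def contacts(
    chain_a: Dict[str, List[Atom]],
    chain_b: Dict[str, List[Atom]],
    threshold: float = 10.0,
) -> List[Tuple[str, str, float]]:
    """List residue pairs whose centers of mass lie within the threshold."""
    pairs = []
    for name_a, atoms_a in chain_a.items():
        com_a = center_of_mass(atoms_a)
        for name_b, atoms_b in chain_b.items():
            distance = math.dist(com_a, center_of_mass(atoms_b))
            if distance <= threshold:
                pairs.append((name_a, name_b, round(distance, 2)))
    return pairs


# Invented toy coordinates, two chains.
chain_a = {"ALA1": [("C", 0.0, 0.0, 0.0), ("N", 2.0, 0.0, 0.0)]}
chain_b = {
    "GLY5": [("C", 6.0, 0.0, 0.0)],
    "SER9": [("O", 30.0, 0.0, 0.0)],
}
found = contacts(chain_a, chain_b)
print(found)
```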
Distogram
Distance map between each pair of residues of O50331 and O50333.
The x and y axes represent interacting proteins. Pixels inside the black squares represent intra-protein residue distances, while pixels outside represent inter-protein residue distances. The color represents the distance in angstroms: blue indicates a short distance between two residues, and yellow indicates a large distance.
Interaction network
Protein-protein interaction network with iQ-score and homo-oligomers (hiQ-score) predictions.
This network represents interactions between R388 proteins. Each interaction is represented by a line connecting two proteins, colored according to the corresponding iQ-score. A loop on a protein indicates its best homo-oligomer, the one with the highest hiQ-score.
iQ-Score heatmap
Heatmap of iQ-score between each PPI.
Color represents the iQ-score, with a better iQ-score indicated by a lighter color. The black boxes represent either poor PAE, homo-oligomers, or overly large total protein length.
Protein interface
Amino acid sequence with the different interfaces used in interactions.
Each interface with a protein is represented by all contact residues, which are colored. The last interaction represents the interface used in homo-oligomerization. If two proteins use the same interface, they will have the same colors.
OOM_int.txt
A text file containing interactions that are too large, based on --max_aa.
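A Python sketch of how such a size check could work is given below. It assumes that an interaction is "too large" when the summed length of its two proteins exceeds --max_aa, which the source does not state explicitly; `split_by_size` and the protein lengths are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def split_by_size(
    lengths: Dict[str, int], max_aa: int = 2000
) -> Tuple[List[Pair], List[Pair]]:
    """Separate protein pairs that fit within max_aa from those that do not."""
    fits: List[Pair] = []
    too_large: List[Pair] = []
    for a, b in combinations(sorted(lengths), 2):
        # Assumption: the limit applies to the summed length of both proteins.
        if lengths[a] + lengths[b] > max_aa:
            too_large.append((a, b))
        else:
            fits.append((a, b))
    return fits, too_large


# Invented sequence lengths, in amino acids.
lengths = {"P1": 900, "P2": 1200, "P3": 300}
fits, too_large = split_by_size(lengths)
print(fits)
print(too_large)
```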
Shallow_MSA.txt
A text file containing proteins with an MSA depth lower than 100 sequences.
Warning
Results for proteins with fewer than 100 sequences in the MSA are not reliable enough to validate or invalidate predicted PPIs.
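The depth check can be approximated with the Python sketch below. It assumes the alignment is available as FASTA- or A3M-style text in which every sequence has a ">" header line; the format PPIFold actually reads is not specified here, and `msa_depth` and `is_shallow` are hypothetical helpers.

```python
MIN_DEPTH = 100  # threshold stated for Shallow_MSA.txt


def msa_depth(alignment_text: str) -> int:
    """Count sequences by counting '>' header lines (assumed format)."""
    return sum(1 for line in alignment_text.splitlines() if line.startswith(">"))


def is_shallow(alignment_text: str, min_depth: int = MIN_DEPTH) -> bool:
    """Return True when the MSA holds fewer than min_depth sequences."""
    return msa_depth(alignment_text) < min_depth


# Invented three-sequence alignment.
small_msa = ">query\nMFKRSG\n>hit1\nMFKKSG\n>hit2\nMF-RSG\n"
print(msa_depth(small_msa), is_shallow(small_msa))
```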
table.cyt
A file for manually generating a network in Cytoscape.
_summary.signalp5
A file that summarizes the signal peptides of all proteins.
.pdb file
Model structure, with residues colored according to their interaction interface.
After creating test.txt and conf.txt, fill in conf.txt with all your paths.
Activate your Conda environment.
You must run the command in the directory containing conf.txt and test.txt.
Command :
PPIFold