Here we provide a few worked examples of creating graph datasets from protein structures.
The data contained within PPISP is drawn from DeepPPISP [1]. The authors collate a number of protein-protein interaction structures from three existing datasets. This is a node-classification task, where the goal is to predict whether or not a residue in the graph participates in a protein-protein interaction. The authors also make available additional evolutionary information in the form of a position-specific scoring matrix (PSSM) for each protein.
The authors describe the dataset construction as follows: "The three benchmark datasets are given, i.e., Dset_186, Dset_72 and PDBset_164. Dset_186 consists of 186 protein sequences with the resolution less than 3.0 Å with sequence homology less than 25%. Dset_72 and PDBset_164 were constructed as the same as Dset_186. Dset_72 has 72 protein sequences and PDBset_164 consists of 164 protein sequences. These protein sequences in the three benchmark datasets have been annotated. Thus, we have 422 different annotated protein sequences. We remove two protein sequences for they do not have PSSM file."
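The counts in the quoted description can be checked with a few lines of arithmetic. The sketch below is purely illustrative: the dictionary and variable names are our own, and the final count of usable sequences is derived from the quoted numbers rather than stated by the authors.

```python
# Sequence counts per benchmark set, as given in the DeepPPISP description.
benchmark_sizes = {"Dset_186": 186, "Dset_72": 72, "PDBset_164": 164}

# Total number of annotated sequences across the three sets.
total_annotated = sum(benchmark_sizes.values())

# Two sequences are removed because they have no PSSM file.
n_missing_pssm = 2
n_usable = total_annotated - n_missing_pssm

print(total_annotated)  # 422
print(n_usable)         # 420
```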
The data contained within PSCDB is drawn from the Protein Structural Change Database [2]. The dataset consists of paired protein structures in their bound and unbound forms across 7 classes of structural rearrangement motion. Several tasks can be formulated with this dataset, e.g. predicting the bound conformation of a protein as an edge-prediction task, or a graph-classification task predicting which class of structural rearrangement a protein undergoes upon ligand binding.
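One possible way to frame the edge-prediction task is to build a contact graph for each conformation and treat the edges that appear only in the bound form as prediction targets. The following is a minimal sketch under assumptions that are not part of PSCDB itself: each residue is represented by a single 3D coordinate (for example its Cα atom), both conformations share the same residue indexing, and two residues are connected when they lie within a hypothetical `cutoff` of 8.0 Å. The function names and toy coordinates are illustrative only.

```python
from itertools import combinations
from math import dist


def contact_edges(coords, cutoff=8.0):
    """Return residue index pairs (i, j), i < j, that are closer than cutoff (in Å)."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if dist(a, b) < cutoff
    }


def edge_prediction_targets(unbound, bound, cutoff=8.0):
    """Compare contact graphs of the unbound and bound forms of one protein."""
    e_unbound = contact_edges(unbound, cutoff)
    e_bound = contact_edges(bound, cutoff)
    return {
        "input_edges": e_unbound,        # edges of the unbound graph (model input)
        "gained": e_bound - e_unbound,   # contacts formed upon binding
        "lost": e_unbound - e_bound,     # contacts broken upon binding
    }


# Toy example: three residues; the third moves closer upon binding.
unbound = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
bound = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]

targets = edge_prediction_targets(unbound, bound)
print(targets["gained"])  # {(1, 2)}
```

The choice of cutoff and of residue representation changes which edges count as gained or lost, so both should be treated as tunable parameters rather than fixed properties of the dataset.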
This dataset is the union of structural protein-nucleic acid interactions sourced from ccPDB. We combine the DNA_560 and RNA_410 sets to produce a graph-classification task (RNA or DNA) and a node-classification task (whether or not a residue interacts).
DNA_560 is a dataset of 560 non-redundant DNA-interacting protein chains from the PDB. The maximum PDB resolution is 3.0 Å, the minimum chain length is 80 amino acids, and the interaction distance (distance between DNA and amino acid) is 0-4.0 Å. The dataset was generated with Blastclust (25%) and LPC.
RNA_410 is a dataset of 410 non-redundant RNA-interacting protein chains from the PDB. The maximum PDB resolution is 3.0 Å, the minimum chain length is 80 amino acids, and the interaction distance (distance between RNA and amino acid) is 0-4.0 Å. The dataset was generated with Blastclust (25%) and LPC.
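The graph-level and node-level labels can both be derived from criteria like those listed above. The sketch below is an illustration, not the ccPDB or LPC procedure: it labels a residue as interacting when any of its atoms lies within 4.0 Å of any nucleic acid atom, and attaches the nucleic acid type as the graph label. The data layout (a dict mapping residue IDs to atom coordinates), the function names and the toy coordinates are assumptions made for the example.

```python
from math import dist

INTERACTION_CUTOFF = 4.0  # Å, upper bound of the interaction distance
MIN_CHAIN_LENGTH = 80     # minimum chain length in amino acids
MAX_RESOLUTION = 3.0      # Å, maximum (worst) accepted resolution


def passes_filters(chain_length, resolution):
    """Check the chain-length and resolution criteria for one protein chain."""
    return chain_length >= MIN_CHAIN_LENGTH and resolution <= MAX_RESOLUTION


def label_residues(residue_atoms, nucleic_atoms, cutoff=INTERACTION_CUTOFF):
    """Map each residue ID to 1 if any of its atoms is within cutoff of a nucleic acid atom."""
    return {
        res_id: int(any(dist(a, n) <= cutoff for a in atoms for n in nucleic_atoms))
        for res_id, atoms in residue_atoms.items()
    }


# Toy example: two residues and a single nucleic acid atom.
residue_atoms = {
    "A:10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "A:11": [(10.0, 0.0, 0.0)],
}
nucleic_atoms = [(4.0, 0.0, 0.0)]

example = {
    "graph_label": "DNA",  # graph-classification target (RNA or DNA)
    "node_labels": label_residues(residue_atoms, nucleic_atoms),
}
print(example["node_labels"])  # {'A:10': 1, 'A:11': 0}
```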
[1] Zeng, M., Zhang, F., Wu, F.-X., Li, Y., Wang, J., & Li, M. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz699
[2] Amemiya, T., Koike, R., Kidera, A., & Ota, M. (2011). PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Research, 40(D1), D554–D558. https://doi.org/10.1093/nar/gkr966