
Datasets

Here we provide a few worked examples of creating graph datasets from protein structures.

PPISP

The data in PPISP is drawn from DeepPPISP [1], whose authors collate a number of protein-protein interaction structures from three existing datasets. This is a node-classification task: the goal is to predict whether or not a residue in the graph participates in a protein-protein interaction. The authors also make available additional evolutionary information in the form of a PSSM for each protein.
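As a rough illustration of the node-classification setup, the sketch below builds a per-residue feature vector by concatenating a one-hot amino-acid encoding with that residue's PSSM row, and pairs it with a binary interaction label. The `ResidueSample` class, the `featurise` helper, and the toy values are hypothetical and are not part of this repository or of DeepPPISP; a 20-column PSSM row per residue is assumed.

```python
from dataclasses import dataclass
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


@dataclass
class ResidueSample:
    """Hypothetical container for one node (residue) in a protein graph."""
    features: List[float]  # one-hot amino acid (20) + PSSM row (20)
    label: int             # 1 = participates in an interaction, 0 = does not


def featurise(residue: str, pssm_row: List[float], label: int) -> ResidueSample:
    """Concatenate a one-hot residue encoding with its PSSM row."""
    if len(pssm_row) != len(AMINO_ACIDS):
        raise ValueError("expected one PSSM score per standard amino acid")
    one_hot = [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
    return ResidueSample(features=one_hot + list(pssm_row), label=label)


# Toy example: an alanine with an all-zero PSSM row, labelled as interacting.
sample = featurise("A", [0.0] * 20, label=1)
print(len(sample.features), sample.label)
```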

The authors describe the dataset construction as follows: The three benchmark datasets are given, i.e., Dset_186, Dset_72 and PDBset_164. Dset_186 consists of 186 protein sequences with the resolution less than 3.0 Å with sequence homology less than 25%. Dset_72 and PDBset_164 were constructed as the same as Dset_186. Dset_72 has 72 protein sequences and PDBset_164 consists of 164 protein sequences. These protein sequences in the three benchmark datasets have been annotated. Thus, we have 422 different annotated protein sequences. We remove two protein sequences for they do not have PSSM file.
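The counts quoted above can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
# Sequence counts per benchmark set, as quoted from the DeepPPISP authors.
dataset_sizes = {"Dset_186": 186, "Dset_72": 72, "PDBset_164": 164}

total = sum(dataset_sizes.values())  # all annotated sequences
missing_pssm = 2                     # sequences removed for lacking a PSSM file
usable = total - missing_pssm

print(total, usable)  # 422 420
```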

PSCDB

The data in PSCDB is drawn from the Protein Structural Change Database [2]. The dataset consists of paired protein structures in their bound and unbound forms across 7 classes of structural rearrangement motion. Several tasks can be formulated with this dataset, for example predicting the bound conformation of a protein as an edge-prediction task, or a graph-classification task that predicts which class of structural rearrangement a protein undergoes upon ligand binding.
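One way to frame the edge-prediction task is to build a residue contact graph for each conformation and treat the contacts that appear or disappear on binding as the prediction target. The sketch below assumes one representative coordinate per residue (for example the Cα atom) and a contact cutoff of 8.0 Å. Both choices, along with the function name and toy coordinates, are assumptions made for illustration and are not taken from PSCDB or from this repository.

```python
from itertools import combinations
from math import dist
from typing import Dict, Set, Tuple

Coord = Tuple[float, float, float]
Edge = Tuple[int, int]


def contact_edges(coords: Dict[int, Coord], cutoff: float = 8.0) -> Set[Edge]:
    """Return residue pairs whose coordinates lie within `cutoff` angstroms."""
    return {
        (i, j)
        for i, j in combinations(sorted(coords), 2)
        if dist(coords[i], coords[j]) <= cutoff
    }


# Toy coordinates for three residues in the unbound and bound conformations.
unbound = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0), 3: (20.0, 0.0, 0.0)}
bound = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0), 3: (10.0, 0.0, 0.0)}

unbound_edges = contact_edges(unbound)
bound_edges = contact_edges(bound)

gained = bound_edges - unbound_edges  # contacts formed on binding
lost = unbound_edges - bound_edges    # contacts broken on binding
print(sorted(gained), sorted(lost))
```

In this toy case residue 3 moves closer to residue 2 on binding, so the pair (2, 3) is the only edge a model would need to predict.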

PROTEINS_LIGANDS

PROTEINS_NUCLEOTIDES

PROTEINS_METAL

PROTEINS_NUCLEIC

This dataset is the union of the structural protein-nucleic acid interaction sets sourced from ccPDB (DNA_560 and RNA_410, described below). We combine them to produce a graph-classification task (RNA/DNA) and a node-classification task (a residue does/does not interact).
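The way the two source sets map onto the two tasks can be sketched as follows. The chain identifiers, the per-residue flags, and the `build_tasks` helper are made up for illustration; the actual file layout in this directory may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical inputs: chain ID -> per-residue interaction flags (1 = interacts).
dna_chains: Dict[str, List[int]] = {"1abc_A": [0, 1, 1, 0]}
rna_chains: Dict[str, List[int]] = {"2xyz_B": [1, 0, 0]}


def build_tasks(
    dna: Dict[str, List[int]], rna: Dict[str, List[int]]
) -> Tuple[Dict[str, str], Dict[str, List[int]]]:
    """Return (graph_labels, node_labels) for the union of both sets."""
    graph_labels = {chain: "DNA" for chain in dna}
    graph_labels.update({chain: "RNA" for chain in rna})
    node_labels = {**dna, **rna}  # residue-level labels carry over unchanged
    return graph_labels, node_labels


graph_labels, node_labels = build_tasks(dna_chains, rna_chains)
print(graph_labels)
print(node_labels["1abc_A"])
```

A chain that appears in both source sets would need explicit handling; this sketch simply lets the RNA entry overwrite the DNA one.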

DNA_560

DNA_560 is a dataset of 560 non-redundant DNA-interacting protein chains from the PDB. It was generated with Blastclust (25%) and LPC. The maximum PDB resolution is 3.0 Å, the minimum chain length is 80 amino acids, and the interaction distance (the distance between DNA and an interacting amino acid) is 0-4.0 Å.

RNA_410

RNA_410 is a dataset of 410 non-redundant RNA-interacting protein chains from the PDB. It was generated with Blastclust (25%) and LPC. The maximum PDB resolution is 3.0 Å, the minimum chain length is 80 amino acids, and the interaction distance (the distance between RNA and an interacting amino acid) is 0-4.0 Å.
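Both DNA_560 and RNA_410 define an interaction by a distance of 0-4.0 Å between an amino acid and the nucleic acid. A minimal sketch of such a distance-based labelling is shown below. It assumes the criterion is the minimum atom-atom distance, which is a simplification and may not match how LPC computes contacts. The function name and coordinates are hypothetical.

```python
from math import dist
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float, float]


def label_residues(
    residue_atoms: Dict[int, Sequence[Coord]],
    nucleic_atoms: Sequence[Coord],
    cutoff: float = 4.0,
) -> Dict[int, int]:
    """Label a residue 1 if any of its atoms is within `cutoff` angstroms
    of any nucleic-acid atom, otherwise 0."""
    labels = {}
    for res_id, atoms in residue_atoms.items():
        close = any(dist(a, n) <= cutoff for a in atoms for n in nucleic_atoms)
        labels[res_id] = int(close)
    return labels


# Toy example: residue 10 sits 3 angstroms from a nucleotide atom,
# residue 11 is far away.
protein = {10: [(0.0, 0.0, 0.0)], 11: [(30.0, 0.0, 0.0)]}
nucleic = [(3.0, 0.0, 0.0)]
labels = label_residues(protein, nucleic)
print(labels)
```

The resulting 0/1 values are the kind of per-residue targets used in the node-classification task.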

References

[1] Zeng, M., Zhang, F., Wu, F.-X., Li, Y., Wang, J., & Li, M. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz699

[2] Amemiya, T., Koike, R., Kidera, A., & Ota, M. (2011). PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Research, 40(D1), D554–D558. https://doi.org/10.1093/nar/gkr966
