Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
where Z is the normalizing factor:
Table 2.5 lists some of the publicly available databases that integrate PPI datasets from multiple sources. However, a common problem with integrating multiple scored datasets is the low agreement between the schemes or experiments used to produce these data. Moreover, most scoring methods favor high-abundance proteins and do not filter out common contaminants effectively [Pu et al. 2015]. Better schemes for scoring and integrating PPI datasets are therefore still needed.
2.7 Enhancing PPI Networks by Integrating Functional Interactions
In addition to the presence of spurious interactions, another limitation in existing PPI datasets is the lack of coverage for true interactions among certain kinds of proteins (the “sparse zone” [Rolland et al. 2014]). This is in part due to limitations in experimental protocols (e.g., washing away of weakly connected proteins during purification of pull-down complexes in TAP experiments), and in part due to the under-representation of certain groups of proteins in these experiments (e.g., membrane proteins). The paucity of true interactions can considerably affect downstream analysis including protein complex prediction. For example, in an analysis by Srihari and Leong [2012a] using protein complexes from MIPS and CYC2008, it was found that many true complexes are embedded in sparse and disconnected regions of the PPI network, thereby altering their dense connectivity and modularity. As we shall see in a subsequent chapter, many computational methods find it difficult to identify these sparse complexes.
Computational prediction of protein interactions can be a good alternative to experimental protocols for enriching the PPI network with true interactions and for “densifying” regions of the network that are sparsely connected. However, accurate prediction of physical interactions between proteins is a difficult problem in itself, and as several studies have noted [Von Mering et al. 2003, Szklarczyk et al. 2011, Srihari and Leong 2012a], most predicted interactions tend to be “functional associations” (that is, relationships connecting functionally similar pairs of proteins) rather than actual physical interactions between the proteins. Nevertheless, if these functional interactions succeed in “topologically enhancing” the PPI network, they can still aid downstream analysis, including protein complex prediction.
Table 2.5 Publicly available databases that integrate PPI datasets from multiple experimental, literature, and computational sources
Database | Source | Reference |
ComPPI | http://comppi.linkgroup.hu/ | [Veres et al. 2015] |
GeneMANIA | http://www.genemania.org/ | [Warde-Farley et al. 2010] |
HIPPIE | http://cbdm.mdc-berlin.de/tools/hippie/ | [Schaefer et al. 2012] |
HitPredict | http://hintdb.hgc.jp/htp/ | [Patil et al. 2011] |
HumanNet | http://www.functionalnet.org/ | [Lee et al. 2011] |
I2D/OPHID | http://ophid.utoronto.ca/ophidv2.204/ | [Brown and Jurisica 2005, Brown and Jurisica 2007, Kotlyar et al. 2015] |
InnateDB | http://www.innatedb.com/ | [Lynn et al. 2008] |
IntScore | http://intscore.molgen.mpg.de/ | [Kamburov et al. 2012] |
InWeb | http://www.lagelab.org/resources/ | [Li et al. 2017] |
iRefIndex | http://irefindex.org/wiki/index.php?title=iRefIndex | [Razick et al. 2008, Turner et al. 2010] |
MyProteinNet | http://netbio.bgu.ac.il/myproteinnet/ | [Basha et al. 2015] |
MatrixDB | http://matrixdb.univ-lyon1.fr/ | [Chautard et al. 2011] |
IID/OPHID | http://ophid.utoronto.ca/iid/ | [Brown and Jurisica 2005, Kotlyar et al. 2015, Kotlyar et al. 2016] |
PrePPI | http://bhapp.c2b2.columbia.edu/PrePPI/ | [Zhang et al. 2012, Zhang et al. 2013] |
PSICQUIC | http://psicquic.googlecode.com/ | [Aranda et al. 2011] |
STRING | http://string-db.org/ | [Von Mering et al. 2003, Szklarczyk et al. 2011] |
UniHI | http://www.unihi.org/ | [Kalathur et al. 2014] |
Computational Prediction of Protein Interactions
Although high-throughput techniques produce large amounts of data, the mapped fraction of the interactome of most organisms is far from complete [Cusick et al. 2009, Hart et al. 2006, Huang et al. 2007]. For example, while ∼70% of the interactomes from model organisms including S. cerevisiae have been mapped, these interactomes still lack interactions among membrane proteins [Von Mering et al. 2002, Hart et al. 2006, Huang et al. 2007]. Likewise, estimates show that less than 50% of the interactomes from higher-order organisms including human (∼10%) and other mammals have been mapped [Hart et al. 2006, Stumpf et al. 2008, Vidal 2016]. Computational prediction could partially compensate for this lack of coverage by predicting interactions between proteins in network regions with low coverage. Here, we present only a brief conceptual overview of computational methods developed for protein interaction prediction; for methodological details and a comprehensive list of these methods, readers are referred to the excellent surveys by Valencia and Pazos [2002], Obenauer and Yaffe [2004], Zahiri et al. [2013], Ehrenberger et al. [2015], and Keskin et al. [2016].
Gene Neighbors. A commonly used approach to predicting protein interactions in prokaryotes relies on co-transcribed or co-regulated sets of genes. It is based on the observation that, in prokaryotes, proteins encoded by genes that are transcribed or regulated as single units (e.g., as operons) are often involved in similar functions and tend to physically interact. Computational methods exist to predict operons in bacterial genomes using intergenic distances [Ermolaeva et al. 2011, Price et al. 2005]. Analysis of gene-order conservation in bacterial and archaeal genomes shows that protein products of 63–75% of operonic genes physically interact [Dandekar et al. 1998]. In eukaryotes, evidence from yeast and worm [Teichmann and Babu 2002, Snel et al. 2004] shows that co-regulated sets of genes encode proteins that are functionally similar and highly likely to interact. These studies therefore provide the basis for predicting new interactions between proteins from sets of co-transcribed and co-regulated genes [Huynen et al. 2000, Bowers et al. 2004].
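To make the intergenic-distance idea concrete, the following is a minimal Python sketch. It is not the method of any of the operon predictors cited above: it simply groups adjacent genes on the same strand whose intergenic gap falls below a threshold, and then lists all gene pairs within each group as candidate interactions. The `Gene` record, the function names, and the default threshold of 50 bp are assumptions made purely for illustration; published methods use statistical models of intergenic distance and additional evidence.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes separated by at most max_gap bases.

    max_gap is an illustrative assumption, not a published cutoff.
    Overlapping genes yield a negative gap and are grouped together.
    """
    operons = []
    current = []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # bases strictly between the genes
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            operons.append(current)  # close the previous group
        current = [gene]
    if current:
        operons.append(current)
    return operons


def candidate_interactions(operons):
    """Return all unordered pairs of gene names within the same group."""
    pairs = set()
    for operon in operons:
        names = sorted(g.name for g in operon)
        pairs.update(combinations(names, 2))
    return pairs
```

Pairs produced this way are only candidates: as noted above, co-transcribed genes frequently encode functionally related proteins, so such predictions are better treated as functional associations unless supported by further evidence.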
Phylogenetic Profiles. Similar phylogenetic profiles between proteins provide strong evidence for protein interactions [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. For a given protein, a phylogenetic profile is constructed as a vector of N elements, where N is the number of genomes (species). The presence or absence of the protein in a genome is indicated as 1 or 0 at the corresponding position in the phylogenetic profile. The phylogenetic profiles of a collection of proteins can be clustered using a bit-distance measure to generate clusters of proteins with similar patterns of presence and absence across genomes. Proteins appearing in the same cluster are considered to be co-evolving, and are therefore inferred to be functionally related and physically interacting. This inference is based on the hypothesis that interacting