Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
…, Cn and D1, …, Dm, and another protein B interacts with C1, …, Cn and E1, …, Em',. Suppose that each of these proteins can be localized to say h subcellular compartments with equal chance. The probability that A and B interact with a Ci in two different places is therefore x = (1 − p) · (1 − p) · (h − 1)/h. Thus, the probability that A and B interact with each C1, …, Cn in two different compartments is xn, and the probability that A and B interact with some Ci in the same compartment is 1 − xn, which monotonically increases with n. That is, the more common partners A and B have (as reported by the screen), the higher the chance they are in the same compartment. Thus, the more common partners the two proteins have in the PPI network, the higher the chance they satisfy the co-localization requirement for interaction. In other words, topology-based schemes based on counting common partners can be viewed as ways to guess whether A and B are likely to be in the same compartment (without needing to know explicitly which compartment); that is, these indices are in fact based on exploiting the biological fact that two proteins must be in the same compartment to physically interact (i.e., a piece of biological information). Therefore, Functional Similarity Weighting (FS Weight) [Chua et al. 2006], Iterative Czekanowski-Dice (CD) weight [Liu et al. 2008], and Dice coefficient [Zhang et al. 2008] can be classified as both topology-based as well as biological evidence-based schemes.
FS Weight, proposed by Chua et al. [2006], is inspired from the graph theory measure Czekanowski-Dice (CD) distance, which is given by
where Nu includes u and all the neighbors of u, similarly for Nv, and NuΔNv = (Nu − Nv) ∪ (Nv − Nu) is the symmetric difference between the sets of neighbors Nu and Nv. CD is a distance (dissimilarity) measure. If Nu = Nv, then CD(u, v) = 0; and if Nu ∩ Nv = ∅, then CD(u, v) = 1.
FS Weight takes inspiration from CD to use common neighbors, and estimates the confidence of the physical interaction between u and v based on their common neighbors. Chua et al. [2006] show that proteins share functions with their strictly indirect neighbors more than with their strictly direct neighbors, and proteins share functions with even higher likelihood with proteins that are simultaneously their direct and indirect neighbors than with proteins that are either strictly their direct or strictly their indirect neighbors. The weight FS(u, v) of the interaction between u and v is estimated as
where λu,v and λv,u are used to penalize protein pairs with very few neighbors, and are given by
where navg is the average number of level-1 neighbors that a protein has in the network. FS Weight thus assigns higher weight to protein pairs with larger number of common neighbors. FS Weight can be used to predict new functional associations between proteins based on the number of neighbors they share; some of these functional associations may be physical interactions.
In Iterative CD, proposed by Liu et al. [2008], the weight for each interaction is computed using the number of neighbors shared between the two interacting proteins, in a manner similar to FS Weight. However, Iterative CD then iteratively corrects these weights for the interactions, such that the weights computed in an iteration uses the weights computed in the previous iteration. By doing so, Iterative CD progressively reinforces the weights of true interactions and dampens the weights of false-positive interactions with each iteration. Iterative CD begins with an unweighted network (all weights set to 1) or a network with weights coming from some prior evidence (e.g., reliabilities of experiments from which the protein pairs were inferred). The weight wk(u, v) for each protein pair (u, v) in the k-th (k > 1) iteration is estimated as
where w1(u, v) = 1 if the interaction (u, v) exists in the original network, w1(u, v) = 0 if the interaction (u, v) does not exist; alternatively, w1(u, v) = r(e)(u,v), which is the reliability of the experiment e used to infer (u, v). The parameters λu and λv penalize protein pairs with very few common neighbors. Liu et al. show that the iterative procedure converges in two iterations for typical PPI networks, but 10–30 iterations may be necessary if the network has high levels of noise. The procedure produces a weight between 0 and 1 for each interaction, and a cut-off (recommended 0.2) is used to filter out low-scoring interactions.
Luo et al. [2015] scored interactions based on collaborative-filtering (CF). This approach is inspired by personalized recommendation in e-commerce where the problem is to identify useful patterns reflecting the connection between users and items from their usage history, and make reliable predictions for possible user-item links based on these patterns [Herlocker et al. 2002]. The CF scheme first computes the similarity sim(u, v) between a pair of interacting proteins u and v in the PPI network using the cosine distance of their adjacency vectors,
where 〈…〉 denotes the inner product between the vectors, and ||.|| denotes the Euclidean norm of the vectors. Therefore, the closer sim(u, v) is to 1, more the common neighbors that u and v share. This cosine distance is then rescaled by incorporating a Cγ parameter which, similar to λ in FS Weight and Iterative CD, is used to account for spurious neighbors in the network,
Like in e-commerce where people judge an item based on multiple reviews, preferably from known contacts, the score sim(u, v) from Equation (2.21) is rescaled as:
where (ru,v)d (d being a tunable parameter) is the mean of the scores of interactions with common neighbors, {(u, x) : x ∈ N(u) ∩ N(v)} ∪ {(v, x) : x ∈ N(u) ∩ N(v)}, in the PPI network.
Voevodski et al. [2009] developed PageRank Affinity, a scoring scheme inspired from Google’s PageRank algorithm [Haveliwala 2003], to score interactions in the PPI network. PageRank Affinity uses random walks to estimate the interconnectedness of protein interactions in the network. A random walk is a Markov process where in each step the walk moves from the current protein to the next with a certain preset probability. PageRank Affinity begins with an unweighted network G (with all weights set to 1) given by the adjacency matrix
The transition probability matrix for the random walks (often called the random walk matrix) is the normalized adjacency matrix where each row sums to one,
where D is the degree matrix, which is a diagonal matrix containing the degrees of the proteins,
Therefore, the transition probabilities are given by the matrix WG, where the transition probabilities pt+1 at step t + 1 are simply pt+1 = ptWG. PageRank