1. Introduction to Molecular Biology

Protein Interaction Networks
Thanks to Mehmet Koyuturk
7. Protein Interaction Networks
Protein-Protein Interactions
Physical association between proteins




Signal transduction, phosphorylation
Docking, complex formation
Permanent vs. transient interactions
Co-location of proteins




Proteins that work in the same cellular component
Soluble location: lysosome, mitochondrial stroma
Membrane location: receptors in plasma membrane,
transporters in mitochondrial membrane
Functional association of proteins



Proteins involved in the same biomolecular activity
Enzymes in the same pathway, co-regulated proteins
Permanent vs Transient Interactions
Permanent interactions




Some proteins form a stable protein
complex that carries out a structural or
functional biomolecular role
These proteins are protein subunits of the
complex and they work together
ATPase subunits, subunits of nuclear pore
Transient interactions



Proteins that come together in certain
cellular states to undertake a biomolecular
function
DNA replicative complex, signal
transduction
Signal Transduction

Phosphorylation


Protein-kinase interaction
Enzyme activation

Signaling cascade
Why Study Protein Interactions?




Identification of functional modules and interconnections
between these modules
Functional annotation based on binding partners and
interaction patterns
Identification of evolutionarily conserved pathways
Identification of drug target proteins to minimize side effects
Identification of Protein Interactions
Traditionally, protein interactions are identified by wet-lab
experiments driven by hypotheses about candidate proteins
Small scale assays




Coimmunoprecipitation: Immunoprecipitate one protein, see if
other is also precipitated
Reliable, but can only verify interactions between suspected
partners
High throughput screening




Throw in thousands of ORFs and see which ones bind to each
other
Yeast two hybrid, tandem affinity purification
Large scale, but a lot of noise
Yeast Two Hybrid
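Split the yeast GAL4 gene, which encodes a transcription
factor required for activation of the GAL genes, into two parts

Activating domain, binding domain
The split protein does not work unless the two parts are in
physical contact
Fuse each part to one of the two candidate proteins: the GAL
genes are activated only if the candidates interact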
Protein Interaction Networks
Organize all identified interactions in a network, where
proteins are represented by nodes and interactions are
represented by edges

TAP identifies a group of proteins that are caught by a
target (bait) protein
Spoke model (star network) vs. matrix model (clique)
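The two encodings of a purification result can be sketched as follows. This is only an illustration: the function names and the example purification (bait A pulling down B, C and D) are hypothetical.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Spoke model: connect the bait to every co-purified protein (star)."""
    return {frozenset((bait, p)) for p in preys if p != bait}

def matrix_edges(bait, preys):
    """Matrix model: connect every pair in the purified group (clique)."""
    group = sorted(set(preys) | {bait})
    return {frozenset(pair) for pair in combinations(group, 2)}

# Hypothetical purification: bait "A" pulls down B, C and D.
spoke = spoke_edges("A", ["B", "C", "D"])
matrix = matrix_edges("A", ["B", "C", "D"])
print(len(spoke), len(matrix))  # 3 edges (star) vs. 6 edges (clique)
```

The spoke model is always a subset of the matrix model, so the choice trades missed interactions against spurious ones.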
Functional Modularity in PPI Networks
A protein complex


Dense subgraph
A signal transduction pathway


Simple path, parallel paths
A protein with common, key,
fundamental role (e.g., a kinase)



Hub node
Computational Prediction of PPIs

Functional association is a higher level conceptualization
of interaction


Functionally associated proteins are likely to show up in
similar contexts


Proteins that act as enzymes catalyzing reactions in the same
metabolic pathway
Co-regulation, co-expression, co-evolution, co-citation…
Functional association between proteins can be
computationally identified by looking at different sources
of data such as sequences, gene expression, literature

Can also be extended to capture physical associations, for
example, by taking into account evolution at structural level
Conservation of Gene Neighborhood

In bacteria, the genome of an
organism is organized in such
a way that functionally
related proteins are coded by
neighboring regions


Operons
When more than one
bacterial species is
considered, it is observed
that this neighborhood
relationship becomes even
more relevant
[Figure: distribution of neighboring genes in
H. influenzae and E. coli into functional
classes]
Comparison of Nine Bacterial Genomes

trpB-trpA is the only
gene pair whose
proximity is conserved
across nine
prokaryotic genomes

These genes encode
the two subunits of
tryptophan synthase
that interact and
catalyze a single
reaction
Close Orthologs

Run of genes



A set of genes on one strand, such
that the gap between adjacent genes is
less than a threshold g (in practice, g
is about 300 bp)
Any pair of genes on the same run
is said to be close
Bidirectional best hits

Genes X1 and X2 from genomes G1
and G2 are bidirectional best hits (BBH) if their sequence
similarity is significant, there is
no Y1 in G1 that is more similar to X2 than X1 is,
and there is no Y2 in G2 that is more
similar to X1 than X2 is
Pair of close bidirectional
best hits: Xa, Ya close in
G1; Xb, Yb close in G2;
Xa and Xb are BBH; Ya and Yb are BBH
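The definitions of runs, bidirectional best hits, and pairs of close BBHs can be sketched as below. This is a minimal illustration under stated assumptions: genes are given as hypothetical `(name, strand, start, end)` tuples sorted by position, a run is taken to be a stretch of consecutive same-strand genes, similarity scores are supplied as nested dictionaries, and `min_score` stands in for whatever significance cut-off is actually used.

```python
from itertools import combinations

def runs(genes, max_gap=300):
    """Split position-sorted genes into runs: same strand, gap < max_gap."""
    result, current = [], []
    for gene in genes:
        if current and (gene[1] != current[-1][1]
                        or gene[2] - current[-1][3] >= max_gap):
            result.append([g[0] for g in current])
            current = []
        current.append(gene)
    if current:
        result.append([g[0] for g in current])
    return result

def best_hit(query, scores, min_score):
    """Highest-scoring significant partner of `query`, or None."""
    hits = {t: s for t, s in scores.get(query, {}).items() if s >= min_score}
    return max(hits, key=hits.get) if hits else None

def bidirectional_best_hits(s12, s21, min_score=50.0):
    """s12[x1][x2]: similarity of x1 (in G1) to x2 (in G2); s21 is the reverse."""
    pairs = set()
    for x1 in s12:
        x2 = best_hit(x1, s12, min_score)
        if x2 is not None and best_hit(x2, s21, min_score) == x1:
            pairs.add((x1, x2))
    return pairs

def close_bbh_pairs(runs1, runs2, bbh):
    """Pairs (Xa, Ya) close in G1 whose BBH partners (Xb, Yb) are close in G2."""
    ortholog = dict(bbh)
    run_of = {g: i for i, run in enumerate(runs2) for g in run}
    found = []
    for run in runs1:
        for xa, ya in combinations(run, 2):
            xb, yb = ortholog.get(xa), ortholog.get(ya)
            if xb in run_of and yb in run_of and run_of[xb] == run_of[yb]:
                found.append(((xa, ya), (xb, yb)))
    return found

# Hypothetical toy genomes and similarity scores.
g1 = [("a1", "+", 0, 900), ("b1", "+", 1000, 1900), ("c1", "-", 2000, 2900)]
g2 = [("a2", "+", 0, 900), ("b2", "+", 950, 1800), ("c2", "+", 5000, 5900)]
s12 = {"a1": {"a2": 200, "b2": 60}, "b1": {"b2": 180}, "c1": {"c2": 150}}
s21 = {"a2": {"a1": 200}, "b2": {"b1": 180, "a1": 60}, "c2": {"c1": 150}}
bbh = bidirectional_best_hits(s12, s21)
close = close_bbh_pairs(runs(g1), runs(g2), bbh)
print(close)
```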
Predicting Interactions

For each pair of close orthologs (occurring in at least one
pair of genomes), calculate a score



Score should increase with the phylogenetic distance between
the two genomes, since closely related organisms are more
likely to have similar genes nearby due to chance alone
Existence of a triplet (P1, P2, P3) should be stronger than the
existence of two pairs (P1, P2 and P1, P3)
Triplet distance can be estimated as the minimum distance
between any pair of organisms (in addition to pair score)
Reconstructing Pathways
Purine Metabolism

Can identify the association between unknown proteins
and known pathways!
Projection of Gene Neighborhood

The composition of operons is evolutionarily variable



A particular set of functionally related genes does not always
comprise an operon
The application of gene neighborhood based interaction
prediction is limited for a single organism
With multiple organisms, it is possible to statistically
strengthen conclusions and project findings on other
organisms


If an operon with functionally related genes exists in several
genomes, a functional association can be predicted for other
organisms, even if the corresponding genes are scattered
Variability turns out to be an advantage for prediction
Gene Neighborhood - Limitations


It is only directly applicable to bacteria (and archaea),
because relevance of gene order does not necessarily
extend to eukaryotes
For closely related species, conserved gene order might
just be due to lack of time for genome rearrangements



We are interested in selective constraints that preserve gene
order
Compared species should be distant enough
But not too distant, because we need a sufficient number of
orthologs to derive statistically meaningful results
Gene Fusion

Domain fusion events


Two protein domains that act as independent proteins
(components) in one organism may form (part of) a single
polypeptide chain (composite) in another organism
Most proteins that are involved in domain fusion events are
known to be subunits of multiprotein complexes (76% in E. coli
metabolic network)
Gene Fusion Based PPI Prediction

A pair of proteins in the
query genome is a
candidate interacting
pair if

They show (local)
sequence similarity to
the same protein
(Rosetta stone) in a
reference genome
They do not show
sequence similarity with
each other
Complete genomes!
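The two conditions can be sketched as a simple filter. This is an illustration only: the inputs `hits` and `similar` are hypothetical precomputed similarity results, the protein names are made up, and the sketch does not check where on the composite protein each query aligns.

```python
def rosetta_stone_pairs(hits, similar):
    """Candidate interacting pairs in the query genome.

    hits[q]: set of reference proteins with local similarity to query protein q.
    similar: set of frozenset({q1, q2}) for query pairs similar to each other.
    """
    candidates = {}
    proteins = sorted(hits)
    for i, p in enumerate(proteins):
        for q in proteins[i + 1:]:
            shared = hits[p] & hits[q]          # common Rosetta stone proteins
            if shared and frozenset((p, q)) not in similar:
                candidates[(p, q)] = shared
    return candidates

# Hypothetical example: P1, P2 and P3 all hit reference protein R1,
# but P1 and P3 are similar to each other and are therefore excluded.
hits = {"P1": {"R1"}, "P2": {"R1"}, "P3": {"R1"}, "P4": {"R2"}}
similar = {frozenset(("P1", "P3"))}
pairs = rosetta_stone_pairs(hits, similar)
print(sorted(pairs))
```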
Predicted Interactions
[Figure: predicted interactions; panels show known physical interactions and proteins in the same pathway]
Gene Fusion Based Prediction - Results

Interactions predicted based on gene fusion events

Distance on circle shows distance on genome
Co-evolution of Interacting Proteins

Selective pressure is likely to act on common function



Proteins that are interacting are expected to either be
conserved together along with their interactions, or not
conserved at all
Hypothesis I: Orthologs of interacting proteins also interact in
other species (supported by evidence, but there are subtleties,
which we will discuss later)
Hypothesis II: If two proteins are
interacting, then they will show
similar conservation patterns
Phylogenetic profiles
Phylogenetic Profiles
Correlation of Phylogenetic Profiles

Assume we have N genomes, protein X has homologs in x
of them, Y has y, and they co-occur in z genomes
Hamming distance:
Pearson correlation:

Mutual information:

Statistical significance:
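A minimal sketch of the four measures in terms of N, x, y and z, assuming the standard definitions for binary presence/absence profiles and a hypergeometric null model for the co-occurrence count. The function names are illustrative, and the Pearson formula assumes neither protein is present in all or none of the genomes.

```python
from math import comb, log2, sqrt

def hamming(N, x, y, z):
    """Number of genomes in which exactly one of X, Y has a homolog."""
    return x + y - 2 * z

def pearson(N, x, y, z):
    """Pearson correlation of two binary profiles (requires 0 < x, y < N)."""
    return (N * z - x * y) / sqrt(x * (N - x) * y * (N - y))

def mutual_information(N, x, y, z):
    """Mutual information (in bits) between the two presence/absence variables."""
    joint = {(1, 1): z, (1, 0): x - z, (0, 1): y - z, (0, 0): N - x - y + z}
    px = {1: x / N, 0: 1 - x / N}
    py = {1: y / N, 0: 1 - y / N}
    return sum((c / N) * log2((c / N) / (px[a] * py[b]))
               for (a, b), c in joint.items() if c > 0)

def hypergeom_pvalue(N, x, y, z):
    """P(co-occurrence >= z) if Y's y genomes were drawn at random from N."""
    tail = sum(comb(x, k) * comb(N - x, y - k) for k in range(z, min(x, y) + 1))
    return tail / comb(N, y)

# Two identical profiles over 10 genomes, each present in 4 of them.
print(hamming(10, 4, 4, 4), pearson(10, 4, 4, 4))
print(mutual_information(10, 4, 4, 4), hypergeom_pvalue(10, 4, 4, 4))
```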


Phylogenetic Profiles - Limitations
Many processes may be common
across lineages



Too many false positives
Database of genomes may be
biased
All organisms are treated equally


Yeast nucleoli and ribosomal proteins

Improvement: Use trees instead of
profiles
Proteins are assumed to be
conserved as a whole


It is domains that interact
Improvement: Use domain profiles
Phylogenetic Tree Based Prediction

Phylogenetic trees of Ntr-family two-component sensor
histidine kinases and their corresponding regulators
Mirror Tree Method

Need to have sufficient
number of genomes that
contain homologs of both
proteins
Matrix Method



Start with families of
proteins that are suspected
to interact
Identify specific pairs of
proteins that interact by
aligning the phylogenetic
trees that underlie the two
families
Assumption: Identical
number of proteins in each
family
Correlated Mutations

Co-evolution of interacting proteins can be followed
more closely by quantifying the degree of co-variation
between pairs of residues from these proteins

Correlated mutations may correspond to compensatory
mutations that stabilize the mutations in one protein with
changes in the other
[Figure: distribution of distances between amino acid positions on a folded protein]
In silico Two-Hybrid

The correlation of mutations between two positions (may
be on different proteins) can be estimated from pairwise
assessment of aligned multiple sequences


Position pairs with high correlation are potential contact points
Interaction index

For a protein pair, compute the aggregate correlation (of
mutations) across all positions
In silico Two-Hybrid
Performance of I2H


I2H predicts physical, rather than functional association
It requires complete genomes and a sufficient number of
homologs
Co-citation Based PPI Prediction



Functionally associated proteins are likely to be cited in
the same research article
We can assess the statistical significance of co-citation
based on a hypergeometric model
Algorithmic problem: How to recognize & match protein
names?

Train algorithm using annotated abstracts via conditional
random fields (CRF)
Performance of Co-citation


The method is robust to
choice of parameters for
name recognition
Statistical
significance is quite
relevant until it
saturates
Integrating PPI Networks

Interaction data
coming from
multiple sources



Different sources
refer to different
levels of interaction
Can integration
handle noise, making
interaction data
more reliable?
Superpose
interactions based on
their reliability
Bayesian Integration

For each prediction method, compute log-likelihood
score




Let P(L|E) be the number of interactions predicted by method
E for which a functional association between the corresponding
proteins is known
Let ~P(L|E) be the number of false positives
Let P(L) and ~P(L) be the corresponding priors
Assign weights to methods based on their log-likelihood
scores
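One common form of such a score is the log of the ratio between the odds of linkage given the evidence and the prior odds, LLS = ln[(P(L|E) / ~P(L|E)) / (P(L) / ~P(L))]. The sketch below assumes that form; the counts are hypothetical and the function name is illustrative.

```python
from math import log

def log_likelihood_score(tp, fp, prior_pos, prior_neg):
    """LLS = ln( (P(L|E) / ~P(L|E)) / (P(L) / ~P(L)) ).

    tp, fp: predicted pairs with / without known functional association.
    prior_pos, prior_neg: associated / non-associated pairs overall.
    """
    return log((tp / fp) / (prior_pos / prior_neg))

# Hypothetical counts: 1000 associated and 9000 non-associated pairs overall.
method_a = log_likelihood_score(80, 20, 1000, 9000)    # strongly enriched
method_b = log_likelihood_score(10, 10, 1000, 9000)    # weakly enriched
no_signal = log_likelihood_score(100, 900, 1000, 9000)  # same odds as the prior
print(method_a, method_b, no_signal)
```

A method whose predictions are no better than the prior scores zero, so the score can serve directly as a relative weight when evidence from several methods is combined.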
Comparison of Prediction Methods

Integrated network captures functional association better

Note that the integrated network is “trained” using available
data on functional association
Classification Based Integration

Points: Proteins, Space: Expression, Conservation, Labels:
Function

Points: Protein Pairs, Space: Co-expression, Co-evolution,
etc., Labels: Existence of Interaction
Performance of Domain Co-evolution
Co-Evolutionary Matrix
Domain Identification
Difference between Predicted PPIs
Pattern Discovery in
Signaling Networks
Reconstruction of Cellular Signaling

Network reconstruction includes

chemically accurate representation of all biochemical events
occurring within a defined signaling network
and incorporates



interconnectivity
functional relationships
that are inferred from experimental data.
Cellular signaling networks operate across several orders of
magnitude in spatio-temporal scale

Quick responses (< 10^-1 s), e.g., protein modifications
Slow responses (minutes to hours), e.g., transcriptional
regulation
Cellular Signaling

Who are the actors?




Receptors reside inside or on the
surface of the cell and bind to
specific chemicals with high
specificity and affinity.
Protein kinases catalyze reactions
involving the transfer of phosphate,
from high-energy donor molecules,
such as ATP, which results in
activation of proteins
Protein phosphatases
dephosphorylate active proteins
Transcription factors
Combinatorics of Cellular Signaling

What is the scope of these actors?




In how many different ways can a signal be transmitted?
In how many different states can a cell be?
Number of receptors, kinases, phosphatases, transcription
factors, and the number of possible interactions between these
Alternative splicing




In eukaryotes, introns are spliced out before translation
Different combinations of introns can be spliced out, resulting
in different products of the same gene
One more level of combinatorial complexity
If a gene has k exons, then splicing of alternative exons can
generate up to 2^k isoforms
Scope of Human Signaling Network
Combinatorial Effects

Genes that code for signaling proteins compose 75% of
all alternatively spliced genes

This implies that cells use alternative splicing extensively to
achieve the extraordinary specificity that is required in
signaling systems
After post-transcriptional modification, number of mRNA
transcripts

3858 for receptors, 1295 for kinases, 375 for phosphatases
After post-translational modification (phosphorylation,
acetylation, methylation), number of distinct protein states

30864 for receptors, 10360 for kinases, 3000 for phosphatases
20-fold increase in number of protein states over genes
Links and Connectivity

Interactions allow for an even greater degree of
combinatorial control


Homo- and heterodimerization of 224 proteins can provide
sufficient specificity to control the expression of 25000 genes
in the human genome (n(n-1)/2 pairs)
If receptors assume only ligand-bound and unbound
states, then k receptors can recognize 2^k different ligand
combinations

If 1% of the estimated 1543 receptors in the human genome can be
independently expressed, then the cell could potentially
respond to 32768 different ligand combinations
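The arithmetic behind these two estimates can be checked directly. The variable names are illustrative, and 1% of 1543 is rounded down to 15 receptors, which is the rounding that reproduces the quoted figure.

```python
# Pairwise combinations of 224 dimerizing proteins: n(n-1)/2.
n_proteins = 224
pairs = n_proteins * (n_proteins - 1) // 2
print(pairs)  # 24976, roughly the 25000 genes quoted above

# 1% of 1543 receptors, each either ligand-bound or unbound: 2^k states.
receptors = 1543
independent = receptors // 100      # 15 independently expressed receptors
combinations_recognized = 2 ** independent
print(combinations_recognized)      # 32768
```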
Signal Reception

Based on the average surface area of a cell and average
area of a receptor, it is estimated that there can be as
many as a few million receptors on the surface of the
typical somatic cell at a given time




~ 30000 distinct receptor types
~130 receptors of each receptor type
A few receptors (~10-40) in high numbers (~10^5 per cell) for
highly differentiated and specialized cells
Many receptors (~2000-3000) in small numbers (~10^2 per cell)
for stem cells or undifferentiated cells
Reconstructing Signaling Networks
Focusing on Parts of the Network

Nodes

Who does a single protein interact with? In what contexts?
Modules

Group of related interactions, e.g., a protein complex
Pathways

Chain of interactions that connect a signaling input to output
Protein Complexes in PPI Networks

Spoke vs matrix
model


Recall that PCP methods like TAP
identify a group of proteins that bind
to each other using a single protein as
bait
How to encode this into a network of
pairwise interactions?
[Figure: actual complex vs. spoke model vs. matrix model]
Protein Complexes in Matrix Model
Modules and Quotients


Define a module as a group of proteins such that all proteins
in the group have identical interactions with the proteins
outside the group
Quotient: Replace proteins in a module with a single node

The edges of the representative node will represent the
interactions of all proteins in the module
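The module test and the quotient operation can be sketched on an adjacency-set representation. This is a minimal illustration; the network, the node names, and the function names are hypothetical.

```python
def is_module(adj, group):
    """True if every protein in `group` has the same neighbors outside it."""
    group = set(group)
    outside = [adj[p] - group for p in group]
    return all(o == outside[0] for o in outside)

def quotient(adj, group, name):
    """Contract `group` into a single node `name`, keeping its outside edges."""
    group = set(group)
    new_adj = {p: {name if q in group else q for q in nbrs}
               for p, nbrs in adj.items() if p not in group}
    new_adj[name] = set().union(*(adj[p] for p in group)) - group
    return new_adj

# Hypothetical network: A and B both interact only with C; C also binds D.
adj = {"A": {"C"}, "B": {"C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(is_module(adj, {"A", "B"}))   # A and B share the outside neighbor C
print(is_module(adj, {"C", "D"}))   # C sees A and B from outside, D sees nothing
print(quotient(adj, {"A", "B"}, "AB"))
```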
Types of Modules

Parallel module

No interaction between proteins in the module
These are likely to correspond to proteins that are
functionally related, but do not interact with each other
Series module

Proteins in the module form a clique among themselves
All proteins in the module perform some function together
(single complex or multiple related complexes)
Prime module

All other topologies
This is probably what you will
observe most of the time
Hierarchical Decomposition

Recursively identify
and contract
modules




This results in a
tree representation
of the network
Each node is a
quotient graph
Leaves are proteins
Root is entire
network
Decomposition of Yeast PPI Network
Identification of Modules

Graph clustering


Find groups of nodes with high interconnectivity (and relatively
low connectivity with outside)
Issues

Definition of clustering metrics
Density

Has to be normalized by subgraph size
Distance-based metrics

A module has low diameter
Normalizing intra-cluster
connectivity with outer
connectivity
Algorithms

The problems are generalizations of maximum clique


Maximum clique itself is NP-hard (enumeration of cliques in
early PPI networks was possible, though, and these were used
as seed subgraphs for dense clusters)
Heuristic approaches



Graph clustering is very well studied
Recall that, while clustering vectors in metric spaces (e.g., gene
expression data), it is common to generate similarity graphs
Bottom-up heuristics

Start with a single node, grow subgraph until “density” is lost
Top-down heuristics

Recursively partition the entire network until subgraph is dense
enough
MCODE Algorithm

Three stages




Vertex weighting
Complex prediction
Post-processing for finding overlapping clusters
Vertex weighting


How “clustered” is a node’s neighborhood?
Use core clustering coefficient instead of clustering coefficient
N : subgraph induced by neighbors of v
K : k-core subgraph of N that maximizes k
d : density of K
weight(v) = k × d
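The vertex weight defined above can be sketched as follows. This is an illustration under stated assumptions: N is taken to be the subgraph induced by the neighbors of v exactly as written here (the published MCODE description may also include v itself), density is the fraction of possible edges present without self-loops, and the example network is hypothetical.

```python
def k_core(adj, k):
    """Nodes of the k-core: repeatedly drop nodes with fewer than k neighbors."""
    nodes = set(adj)
    changed = True
    while changed:
        drop = {v for v in nodes if len(adj[v] & nodes) < k}
        nodes -= drop
        changed = bool(drop)
    return nodes

def density(adj, nodes):
    """Fraction of possible edges present among `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & nodes) for v in nodes) / 2
    return edges / (n * (n - 1) / 2)

def vertex_weight(adj, v):
    """weight(v) = k * d for the highest k-core K of v's neighborhood N."""
    nbrs = adj[v]
    sub = {u: adj[u] & nbrs for u in nbrs}   # N: induced by the neighbors of v
    k, core = 0, set(sub)
    while True:
        nxt = k_core(sub, k + 1)
        if not nxt:
            break
        k, core = k + 1, nxt
    return k * density(sub, core)

# Hypothetical network: v's neighbors a, b, c form a triangle; d is a pendant.
adj = {"v": {"a", "b", "c", "d"}, "a": {"v", "b", "c"},
       "b": {"v", "a", "c"}, "c": {"v", "a", "b"}, "d": {"v"}}
print(vertex_weight(adj, "v"), vertex_weight(adj, "d"))
```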
MCODE Algorithm (cont’d)

Complex prediction

Seed a complex with the node with highest weight
At each node addition, check the neighbors of that node; if
a neighbor's weight is above a given threshold relative to that of the
seed vertex, add that neighbor to the complex as well
Repeat until no node can be added
Once a complex is identified, remove those nodes and find
other complexes
Post-processing

Filter out complexes that do not contain at least a 2-core
Add nodes to allow overlaps up to a given threshold
Complex score: density × size
Scoring Subgraphs

Observe the trade-off between size and density

A single interaction has density one
What is a good cut-off for density?
Statistical significance

What is the expected size of the largest dense subgraph?
Implicitly trades off density and size
If we can analytically characterize the distribution of the
largest dense subgraph, then we can use statistical
significance as a score function (stopping criterion)

This also implicitly handles correction for multiple hypothesis
testing
G(n,p) Model

Let random variable R be the size of the largest subgraph
with density $\rho$

The typical value of R is given by

$r_0 = \dfrac{\log n - \log\log n + \log H_p(\rho)}{H_p(\rho)}$

where $H_p(\rho) = \rho \log(\rho/p) + (1-\rho) \log\left(\dfrac{1-\rho}{1-p}\right)$
denotes divergence

The p-value of a larger dense subgraph is given by

$P(R \ge r_0) \le O\left(\dfrac{\log n}{n^{1/H_p(\rho)}}\right)$
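The typical size can be evaluated numerically as a quick sanity check. The sketch below assumes natural logarithms and a density strictly between p and 1; the function names and the parameter values are illustrative.

```python
from math import log

def divergence(rho, p):
    """H_p(rho) = rho*log(rho/p) + (1-rho)*log((1-rho)/(1-p)), for 0 < rho < 1."""
    return rho * log(rho / p) + (1 - rho) * log((1 - rho) / (1 - p))

def typical_size(n, rho, p):
    """r0 = (log n - log log n + log H_p(rho)) / H_p(rho)."""
    h = divergence(rho, p)
    return (log(n) - log(log(n)) + log(h)) / h

# Hypothetical network sizes and edge probabilities.
sparse = typical_size(5000, 0.5, 0.001)
denser = typical_size(5000, 0.5, 0.01)
print(sparse, denser)
```

In a denser random background, larger subgraphs of the same density are expected by chance, so the typical size grows with p.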
Piecewise G(n,p) Model


Two protein groups: hubs ($V_h$) and regulars ($V_l$)
There is an edge between u and v with probability
$p_h$ if $u, v \in V_h$
$p_b$ if $u \in V_h$, $v \in V_l$, or vice versa
$p_l$ if $u, v \in V_l$
$p_h > p_b > p_l$, $|V_h| < |V_l|$

If $|V_h| \ll |V_l|$, it contributes an additive factor

$r_1 = \dfrac{\log n + 2|V_h| \log B - \log\log n + \log H_{p_l}(\rho)}{H_{p_l}(\rho)}$

where

$B = \dfrac{p_b (1 - p_l)}{p_l} + 1 - p_b$
SIDES Algorithm

Recursive minimum-cut partitioning

Partition nodes into two parts such that the number of edges
in between is minimized, then recurse
[Figure: recursive partitioning, with parts annotated p << 1]
MCODE vs SIDES
[Figures: specificity (%) and sensitivity (%) plotted against -log(p-value), cluster size, and module size]
Correlation (cluster size panel): SIDES 0.76, MCODE 0.43
Correlation (module size panels): SIDES 0.22, MCODE -0.02; SIDES 0.27, MCODE 0.36
Fiedler Vector

For network G, the Laplacian L is defined as follows:

$L(i, i) = \sum_j w(u_i, u_j)$
$L(i, j) = -w(u_i, u_j)$ for $i \ne j$

Here, $w(u_i, u_j)$ denotes the weight of edge $u_i u_j$.

It can be shown that

Matrix L is positive semi-definite, with exactly one zero
eigenvalue for each connected component
For connected G, the eigenvector x that corresponds to the smallest non-zero
eigenvalue minimizes

$x^T L x = \sum_{i<j} w(u_i, u_j) (x(i) - x(j))^2$

among unit vectors orthogonal to the constant vector
This vector is known as the Fiedler vector of network G.
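A dependency-free sketch of how the Fiedler vector might be approximated for a small connected network. It uses shifted power iteration rather than a library eigensolver, which is a choice made for this illustration. The function name, the edge dictionary format, and the example network (two triangles joined by one edge) are assumptions.

```python
def fiedler_vector(weights, n, iterations=500):
    """Approximate the Fiedler vector of a connected graph on nodes 0..n-1.

    weights: dict {(i, j): w} with one entry per undirected edge.
    """
    # Laplacian: L(i,i) = sum_j w(i,j), L(i,j) = -w(i,j).
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    c = 2 * max(L[i][i] for i in range(n))    # upper bound on the largest eigenvalue
    x = [float(i) for i in range(n)]           # arbitrary deterministic start
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]            # stay orthogonal to the constant vector
        # Multiply by (c*I - L); its dominant remaining eigenvector is the Fiedler vector.
        x = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
    return x

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1,
         (3, 4): 1, (3, 5): 1, (4, 5): 1}
x = fiedler_vector(edges, 6)
print([round(v, 3) for v in x])
```

The sign of each entry separates the two triangles, which is the cut a spectral bisection would choose for this network.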
Spectral Graph Clustering

Fiedler vector provides the optimal mapping of the nodes
of the network onto one-dimensional Euclidean space, in the
mean squares sense

This also generalizes to an optimal k-dimensional mapping
Once a one-dimensional mapping is obtained, clustering
algorithms can be used on this one-dimensional space

Find cut points in one-dimensional space
Top-down: Partition one-dimensional space by finding two cut points and recurse on each part
Bottom-up: Merge two closest nodes, recurse
Identification of Signaling Pathways


We would like to identify simple paths (chains of
interactions) in the PPI network, which might
correspond to, for example, signaling cascades, highlighting
the group of proteins and interactions that are responsible
for the transduction of a specific signal
What can we do based solely on interaction data?

In the PPI network, there may be plenty of paths connecting
each pair of nodes
Which ones are interesting?
How long can a pathway be?
How about identifying “most reliable” paths?
Formulating Pathway Identification



Assume that the edges are scored, such that p(u,v)
denotes the likelihood that proteins u and v interact
Then the product of edge scores along a path
quantifies the likelihood that the path exists (assuming independent edges)
Let w(u,v) = -log p(u,v) denote the weight of edge


Then, if we define the weight of a path as the summation of the
weights of the edges on the path, paths with less weight will be
more reliable paths
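The equivalence between maximizing the product of edge likelihoods and minimizing the sum of weights can be illustrated on a toy example. The protein names and likelihood values are hypothetical.

```python
from math import exp, log

def path_weight(path, prob):
    """Sum of w(u, v) = -log p(u, v) over consecutive proteins on the path."""
    return sum(-log(prob[frozenset(edge)]) for edge in zip(path, path[1:]))

# Hypothetical interaction likelihoods between a receptor R, two kinases, and a TF.
prob = {frozenset(("R", "K1")): 0.9, frozenset(("K1", "TF")): 0.8,
        frozenset(("R", "K2")): 0.5, frozenset(("K2", "TF")): 0.95}
w1 = path_weight(["R", "K1", "TF"], prob)   # likelihood 0.9 * 0.8 = 0.72
w2 = path_weight(["R", "K2", "TF"], prob)   # likelihood 0.5 * 0.95 = 0.475
print(w1 < w2, exp(-w1), exp(-w2))
```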
For a given set I of proteins, find all minimum-weight
paths of length k from I to each protein in the network

I might be the set of receptor proteins
Enumerating Pathways

Dynamic programming

For $v \in S \subseteq V$, let W(v, S) be the minimum weight of a simple
path that starts from a protein in I, visits all proteins in S, and
ends in v
This function can be tabulated using the following recursion

$W(v, S) = \min_{u \in S \setminus \{v\}} \left[ W(u, S \setminus \{v\}) + w(u, v) \right]$ for $|S| > 1$

where

$W(v, \{v\}) = 0$ if $v \in I$, and $\infty$ otherwise
For given v, the minimum path from I to v is given by the
minimum W(v, S) over all S of size k that contain v
The running time of this algorithm is $O(k n^k)$

Not feasible for k larger than a few
Color Coding



Color each protein randomly using a set of k colors
Search for paths that contain one protein of each
color => No vertex will be repeated on the path
The dynamic programming algorithm can be modified to
solve this problem

The running time of this algorithm is $O(2^k k m)$
However, this algorithm misses an optimal path if two
proteins on the path happen to be colored identically

For each path, the algorithm succeeds with
probability $k!/k^k > e^{-k}$
Repeat $e^k \ln(n/\epsilon)$
times to make sure that the probability that the
algorithm will fail for at least one protein is less than $\epsilon$
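A minimal sketch of color coding for minimum-weight paths, assuming that a path of length k means a path with k proteins, that the network is given as a nested dictionary of edge weights, and that a fixed number of trials is chosen by the caller. The function names and the toy network are hypothetical.

```python
import random
from math import inf

def color_coding_trial(adj, sources, k, rng):
    """One random coloring: min weight of a colorful k-protein path to each node.

    adj: {u: {v: w(u, v)}}, with undirected edges listed in both directions.
    """
    color = {v: rng.randrange(k) for v in adj}
    # W[(v, S)]: min weight of a path from a source to v using exactly the colors in S.
    W = {(v, frozenset([color[v]])): 0.0 for v in sources}
    for _ in range(k - 1):
        nxt = {}
        for (u, S), wt in W.items():
            for v, w in adj[u].items():
                if color[v] not in S:            # a new color means a new vertex
                    key = (v, S | {color[v]})
                    nxt[key] = min(nxt.get(key, inf), wt + w)
        W = nxt
    best = {}
    for (v, _), wt in W.items():
        best[v] = min(best.get(v, inf), wt)
    return best

def color_coding(adj, sources, k, trials, seed=0):
    """Repeat independent colorings and keep the best path weight per protein."""
    rng = random.Random(seed)
    best = {}
    for _ in range(trials):
        for v, wt in color_coding_trial(adj, sources, k, rng).items():
            best[v] = min(best.get(v, inf), wt)
    return best

# Hypothetical network: receptor R, intermediates A and B, target T.
adj = {"R": {"A": 0.1, "B": 0.5, "T": 2.0},
       "A": {"R": 0.1, "T": 0.2},
       "B": {"R": 0.5, "T": 0.1},
       "T": {"A": 0.2, "B": 0.1, "R": 2.0}}
best = color_coding(adj, {"R"}, k=3, trials=200)
print(best["T"])   # weight of R-A-T, the lightest 3-protein path ending in T
```

Each trial only tracks the set of colors used, not the set of vertices, which is what reduces the table from all vertex subsets to the 2^k color subsets.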
Hunting Biologically Meaningful Paths

Constraining the set of proteins



If a protein is required to be in the path, assign a unique color
to the target protein
If a family is required, assign color to the family
Constraining order of occurrence



Signal transduction often progresses in inward order, from
membrane proteins to nuclear proteins and transcription
factors
Segmented pathways: Assign labels to proteins, where labels
represent cellular component, require paths to be monotonic
with respect to labels
Labels can also be generalized to intervals (consistent
segments)