Genecentric: Finding Graph Theoretic Structure in High-Throughput Epistasis Data

Andrew Gallant, Max Leiserson, M. Kachalov,
Lenore Cowen, Ben Hescott
Tufts University
Protein-protein interaction
High-throughput Interaction Data:
aka ‘The Hairball’
What we want:
What we have:
Question: Can we
infer anything about
"real" pathways from
the low-resolution
graph model of
pairwise interactions?
The hairball: A simple graph model
vertices ↔ genes/proteins
edges ↔ physical interactions or
genetic interactions
simplifications:
• undirected
• loses temporal information
• difficult to decompose into separate
processes
• conflates different PPI types into one
class of "physical interactions"
1) Physical interactions
2) Genetic interactions (epistasis)
Interaction types
• We distinguish here between two types of interaction:
– physical interactions
– genetic interactions
Genetic interactions (epistasis)
Only 18% of yeast genes are essential (the yeast
dies when they’re removed).
For the rest, we can compare the growth of the
double knockout to its component single
knockouts.
Genetic interactions (epistasis)
• For non-essential genes, we can compare
the growth of the double knockout to its
component single knockouts
Picture: Ulitsky
Nonessential Genes
– Some genes are non-essential because they are
only required under certain conditions (e.g. an
enzyme to metabolize a particular nutrient).
– Other genes are non-essential because the
network has some built-in redundancy.
• One gene (completely or partially) compensates for the
loss of another.
• One functional pathway (completely or partially)
compensates for the loss of another.
Redundant pathways
and synthetic lethality
Kelley and Ideker (2005):
Between-Pathway Model (BPM)
In reality, the data are very incomplete:
Between-Pathway Model (BPM)
Kelley and Ideker (2005)
and Ulitsky and Shamir (2007)
• Goal: detect putative BPMs in yeast interactome
• Method:
1) find densely-connected subsets of the physical
protein-protein interaction (PI) network
(putative pathways)
2) check the genetic interaction (GI) network to see if
patterns in density of genetic interactions correlate
with these putative pathways
3) check resulting structures for overrepresentation of
biological function (gene set enrichment)
Kelley and Ideker (2005)
and Ulitsky and Shamir (2007)
[Figure: steps (1), (2) and (3) of the method; panel labels: "enriched for function X", "enriched for function Y"]
Kelley and Ideker (2005)
and Ulitsky and Shamir (2007)
• Problems:
– Sparse data limits the potential scope of discovery
– independent validation is difficult
Further work on this problem:
Synthetic lethality:
– Ulitsky and Shamir (2007)
– Ma, Tarone and Li (2008)
– Brady, Maxwell, Daniels and Cowen (2009)
– Hescott, Leiserson, Cowen and Slonim (2010)
Epistasis (weighted) data:
– Kelley and Kingsford (2011)
– Leiserson, Tatar, Cowen and Hescott (2011)
So: what is the right way to generalize
BPMs to edge weights?
Quantitative interaction data
New methods generate high-throughput data for genetic interactions.
E-MAP, Epistatic Miniarray Profile
• Data is scalar (-22 to 15)
• Synthetic Lethal: x < -2.5
• Synthetic Sick: -2.5 < x < 0
• Alleviating: 0 < x < 2.5
• Synthetic Rescue: x > +2.5
SGA, Synthetic Genetic Array
• smaller weights (-1.1 to 0.8)
Want most negative weight across
[Figure: example graph with weighted genetic-interaction edges]
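The E-MAP score categories listed above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and the handling of the exact boundary values (-2.5, 0 and +2.5), which the slide leaves open, is an assumption.

```python
def classify_emap_score(x):
    """Label an E-MAP score with the categories named on the slide.

    Assumptions (not stated in the talk): -2.5 counts as synthetic sick,
    +2.5 counts as alleviating, and exactly 0 is reported as neutral.
    """
    if x < -2.5:
        return "synthetic lethal"
    if x < 0:
        return "synthetic sick"
    if x == 0:
        return "neutral"
    if x <= 2.5:
        return "alleviating"
    return "synthetic rescue"


for score in (-7.3556, -0.6347, 0.5838, 3.69893):
    print(score, classify_emap_score(score))
```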
What is the Quality of a BPM?
Once we obtain a candidate BPM we can score it using interaction data.
• Sum interactions within
• Sum interactions between
• Take the difference and normalize to create an interaction score
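One way to read the scoring recipe is sketched below. The talk does not give the exact formula, so the sign convention (within minus between, which rewards strongly negative weights across the two pathways) and the normalizer (total number of genes in the BPM) are assumptions, as are the function name and the edge-weight dictionary layout.

```python
from itertools import combinations


def bpm_score(weights, side_a, side_b):
    """Interaction score of a candidate BPM (side_a, side_b).

    weights maps frozenset({u, v}) to the interaction weight; missing
    pairs are treated as weight 0 (an assumption for sparse data).
    """
    def w(u, v):
        return weights.get(frozenset((u, v)), 0.0)

    # Sum of weights inside each of the two putative pathways.
    within = sum(
        w(u, v)
        for side in (side_a, side_b)
        for u, v in combinations(sorted(side), 2)
    )
    # Sum of weights on edges that cross between the two pathways.
    between = sum(w(u, v) for u in side_a for v in side_b)
    # Assumed normalization: divide by the number of genes in the BPM.
    return (within - between) / (len(side_a) + len(side_b))


weights = {
    frozenset(("a", "b")): 3.0,
    frozenset(("a", "c")): -5.0,
    frozenset(("b", "c")): -4.0,
}
print(bpm_score(weights, {"a", "b"}, {"c"}))
```

With these toy weights the within sum is 3.0 and the between sum is -9.0, so the sketch returns (3.0 + 9.0) / 3 = 4.0.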
Genecentric takes the perspective of each gene in turn
What is the ‘best’ candidate BPM that contains node g?
Consider a diverse set of GLOBAL partitions that try to MAXIMIZE our objective function over the whole graph.
Which genes are consistently placed in the same (opposite) partition as g?
So we can extract a gene’s best BPM from a diverse set of good global bipartitions
Idea for constructing the global bipartitions: Maximal cut
Create a random bipartition
For every vertex (gene) assign to a partition at random
Local search method
Now for each gene, v, consider its interaction scores
Unhappy vs happy vertices
Flip
Flip to the other side to make it happy!
Now opposite(v) is same(v) and same(v) is opposite(v).
The flip could change some other vertices to happy or unhappy.
Important properties
Flip will always terminate
- finite number of possible partitions
- weight between partitions decreases with
each flip
- everyone is happy eventually
- local optimum
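The random bipartition and the Flip local search can be sketched as follows. The talk does not spell out the happiness test, so this sketch assumes that same(v) and opposite(v) are the summed weights from v to its own side and to the other side, and that v is unhappy when same(v) < opposite(v), which is the condition under which flipping v lowers the weight between the partitions. The function name and data layout are hypothetical.

```python
import random


def weighted_flip(genes, weights, rng=None):
    """Return a locally optimal bipartition as a dict gene -> 0 or 1.

    genes is a list; weights maps frozenset({u, v}) to a weight.
    """
    rng = rng or random.Random()
    # Create a random bipartition: each gene goes to a side at random.
    side = {g: rng.choice((0, 1)) for g in genes}

    neighbors = {g: {} for g in genes}
    for pair, w in weights.items():
        u, v = tuple(pair)
        neighbors[u][v] = w
        neighbors[v][u] = w

    def is_unhappy(v):
        same = sum(w for u, w in neighbors[v].items() if side[u] == side[v])
        opposite = sum(w for u, w in neighbors[v].items() if side[u] != side[v])
        # Flipping v changes the cut weight by (same - opposite), so a
        # flip lowers the cut weight exactly when same < opposite.
        return same < opposite - 1e-12

    # Keep flipping unhappy vertices until everyone is happy.
    changed = True
    while changed:
        changed = False
        for v in genes:
            if is_unhappy(v):
                side[v] = 1 - side[v]
                changed = True
    return side


genes = ["a", "b", "c", "d"]
weights = {
    frozenset(("a", "b")): 3.0, frozenset(("c", "d")): 3.0,
    frozenset(("a", "c")): -5.0, frozenset(("a", "d")): -5.0,
    frozenset(("b", "c")): -5.0, frozenset(("b", "d")): -5.0,
}
print(weighted_flip(genes, weights, random.Random(0)))
```

In this toy graph the only local optimum puts a and b on one side and c and d on the other, so every run ends there regardless of the random start. On real data different starts end in different local optima, which is where the diversity of bipartitions comes from.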
How we make a BPM from bipartitions
For every gene run weighted flip on the entire graph of interactions, M times (250 times)
Some genes will stay on same side for most runs.
Some genes will stay on the opposite side for most runs.
Most will switch sides among the different runs
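Given the M bipartitions, a gene's candidate BPM can be read off by counting how often every other gene lands on the same or the opposite side. The sketch below assumes each bipartition is a dict mapping gene to side (0 or 1) and uses a consistency threshold c; the function name and input layout are hypothetical, not the tool's actual interface.

```python
def bpm_for_gene(g, bipartitions, c=0.9):
    """Return (module_a, module_b) for gene g.

    module_a holds g plus genes on g's side in at least a fraction c of
    the runs; module_b holds genes on the opposite side in at least a
    fraction c of the runs. Genes that switch sides too often are dropped.
    """
    m = len(bipartitions)
    same_counts = {}
    opposite_counts = {}
    for side in bipartitions:
        for v, s in side.items():
            if v == g:
                continue
            counts = same_counts if s == side[g] else opposite_counts
            counts[v] = counts.get(v, 0) + 1
    module_a = {g} | {v for v, n in same_counts.items() if n / m >= c}
    module_b = {v for v, n in opposite_counts.items() if n / m >= c}
    return module_a, module_b


# Ten toy runs: b always sides with a, c never does, d alternates.
runs = [{"a": 0, "b": 0, "c": 1, "d": i % 2} for i in range(10)]
print(bpm_for_gene("a", runs))
```

Here d switches sides in half of the runs, so it ends up in neither module.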
BPM collection: Removing Redundancies
Remove BPMs that are too large or small
Take the difference and divide by the size
Sort by score, add to final output set if Jaccard index < .66 for all previously added BPMs
Numbers chosen to match previous studies
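The pruning step can be sketched as a greedy filter. Several details here are assumptions rather than statements from the talk: the Jaccard index is taken over the combined gene set of both modules, a higher score is treated as better, and the size limits (3 to 25, the defaults quoted for the tool elsewhere in the talk) are applied to each module separately.

```python
def jaccard(x, y):
    """Jaccard index of two sets: |x & y| / |x | y|."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def prune_bpms(scored_bpms, j=0.66, min_size=3, max_size=25):
    """Greedy redundancy removal over (score, module_a, module_b) tuples."""
    # Remove BPMs whose modules are too large or too small.
    candidates = [
        bpm for bpm in scored_bpms
        if all(min_size <= len(module) <= max_size for module in bpm[1:])
    ]
    kept = []
    # Visit BPMs from best score to worst.
    for score, a, b in sorted(candidates, key=lambda t: t[0], reverse=True):
        genes = a | b
        # Keep a BPM only if it overlaps little with every BPM kept so far.
        if all(jaccard(genes, ka | kb) < j for _, ka, kb in kept):
            kept.append((score, a, b))
    return kept


bpms = [
    (5.0, {"a", "b", "c"}, {"d", "e", "f"}),
    (4.0, {"a", "b", "c"}, {"d", "e", "g"}),  # overlaps the first: 5/7
    (3.0, {"x", "y", "z"}, {"u", "v", "w"}),
    (9.0, {"p", "q"}, {"r", "s", "t"}),       # first module too small
]
print([score for score, _, _ in prune_bpms(bpms)])
```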
How do we measure results?
• FuncAssociate to measure
gene set enrichment
Berriz, Beaver, Cenik, Tasan, Roth, “Next generation
software for functional trend analysis,”
Bioinformatics, 2009, 25(22): 3043-4.
Location of physical interactions
Our Results
Comparison to previous methods: yeast ChromBio E-MAP

Study                  #Modules / (%Enriched)  #BPMs  Enriched Same Function  Enriched Same or Similar Function
Bandyopadhyay et al.   37 (35)                 96     41 (43%)                53 (55%)
Ulitsky et al.         43 (43)                 111    43 (39%)                71 (64%)
Kelley et al.          40 (40)                 98     35 (36%)                52 (53%)
Genecentric            112 (103)               58     39 (67%)                43 (74%)
How does Genecentric work with various data?
[Figure: example weighted interaction graphs for four data sets: SGA, E-MAP (Cell Cycle), E-MAP (S. pombe), E-MAP (MAP-K)]
Genecentric on Various Data Sets

Data Set                      #BPMs  Enriched Same Function  Enriched Same or Similar Function
Collins et al. (Cell Cycle)   58     39 (67%)                43 (74%)
Fiedler et al. (MAP-K)        5      0 (0%)                  4 (80%)
Tong et al. (SGA)             149    8 (5%)                  17 (11%)
Roguev et al.                 16     1 (6%)                  1 (6%)
Consider physical interactions
[Figure: weighted interaction graph with two edge types, labeled "Physical Interactions" and "genetic interactions"]
Physical interactions in Local Cut BPMs

Data Set         PIs within Pathways  Expected by chance within  PIs between Pathways  Expected by chance between
Collins et al.   172                  20                         18                    20
Fiedler et al.   13                   1                          1                     1
Tong et al.      147                  41                         17                    39
Modifying the weights
• How does alleviating interaction data affect the results?
• Does a continuum of possible weights change the results?
• Do extreme weights affect the quality of the results?
Local Cut Weight Variants

Weight scheme                   #BPMs  Enriched Same Function  Enriched Same or Similar Function
Unchanged                       58     39 (67%)                43 (74%)
No alleviating                  26     17 (65%)                19 (73%)
Large values capped             68     4 (6%)                  6 (9%)
Alleviating +1, Aggravating -1  30     3 (10%)                 7 (23%)
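The three modified weight schemes in the table could be produced by simple transformations of the edge weights. The sketch below is one plausible reading only: "no alleviating" is taken to mean that positive weights are set to 0, the cap value is a hypothetical parameter (the talk does not state it), and the function names are made up for illustration.

```python
def no_alleviating(weights):
    """Drop alleviating (positive) interactions by setting them to 0."""
    return {pair: min(w, 0.0) for pair, w in weights.items()}


def cap_large_values(weights, cap):
    """Clip every weight into [-cap, +cap]; cap is a hypothetical value."""
    return {pair: max(-cap, min(cap, w)) for pair, w in weights.items()}


def sign_only(weights):
    """Map alleviating weights to +1, aggravating to -1, and 0 to 0."""
    return {pair: float((w > 0) - (w < 0)) for pair, w in weights.items()}


weights = {("a", "b"): 3.685398, ("a", "c"): -7.321556, ("b", "c"): -0.664347}
print(no_alleviating(weights))
print(cap_large_values(weights, cap=3.0))
print(sign_only(weights))
```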
Genecentric: try this at home
• Project name: Genecentric
• Project homepage:
http://bcb.cs.tufts.edu/genecentric
• Operating system: platform independent
• Programming language: Python
• Other requirements: Python 2.6 or higher
• License: GNU General Public License (GPL 2.0)
Genecentric parameters
• Set M (number of randomized bipartitions)
default 250
• Set C (consistency of same side/opposite side
for inclusion in g’s BPM) default 90%
• Set J (Jaccard index, how much overlap before
similar BPMs are pruned) default .66
• Do you want a min or max size module?
(default 3-25)
• FuncAssociate parameters: genespace, p-value
Genecentric works out of the box
• “New” E-MAP of plasma membrane genes
from Aguilar et al. in 2010.
• 374 genes including those known to be
involved in endocytosis, signaling, lipid
metabolism, eisome function.
• Genecentric was run with default E-MAP
parameters, except C was lowered from .9 to
.8 to produce more BPMs (22 instead of 6)
Genecentric on plasma membrane E-MAP: example BPM
BPM1
• COG6 COG5 COG8 PIB2 COG7
• Intra-Golgi vesicle-mediated transport, protein targeting to vacuole
BPM2
• ARL1 VPS35 GET3 ARL3 SYS1 GOT1 PEP8 SFT2 MNN1 VPS17
• Protein transport, Golgi apparatus, endosome transport, vesicle-mediated transport
Genecentric on plasma membrane E-MAP: example BPM
BPM1
• SLT2 BCK1 CLC1
• Endoplasmic reticulum unfolded protein response
BPM2
• PEX1 PEX6 EDE1 SKN7 ERG4 ADH1 PEX15 ARC18 EMC33
• Protein import into peroxisome matrix, receptor recycling
Biological Findings (cont.)
• Some complexes come up again and again. Could they be global mechanisms of fault tolerance?
In Plasma Membrane:
– COG complex
In ChromBio:
– SWR-C complex (Chromatin remodeling)
– Prefoldin complex (Chaperone)
– MRE11 complex (DNA damage repair)
Co-authors and collaborators
• Ben Hescott
• Max Leiserson
• Diana Tatar
• Maxim Kachalov
thanks.
A Graph Theory Problem
• Our algorithm samples from the maximal
bipartite subgraphs. With what distribution? Is
it uniform? Proportional to the number of
edges that cross the cut?
• What are the properties of the stable bipartite
subgraphs of the synthetic lethal network?
Are they conserved across species?
Approach
• Run the partitioning algorithm 250 times on
the yeast SL network (G).
• For each gene g in G,
– Construct a set A consisting of g and all nodes in G
which wind up in the same set as g at least 70% of
the time.
– Construct another set B consisting of all nodes in
G which wind up in the opposite set from g at
least 70% of the time.
• We call the subgraph of G defined by A and B
the “stable bipartite subgraph of g”, and
designate it as a candidate BPM.
Delete a gene in pathway 1; see if changes in pathway 2 are coherent
[Figure labels: BPM, Deleted Gene, Pathway restriction, log10 ratio, Sort]
Validation: Microarray Data
• Rosetta compendium (Hughes et al, 2000):
– contains yeast expression profiles of 276 deletion mutants
– i.e. for each gene in the yeast genome, it measures how that gene's expression level changes when a particular gene g is deleted, as compared to wildtype yeast.
At step i: N to 1
Calculate weighted percent of
genes in pathway seen so far
and percent of genes not in
pathway:
Score is max difference
How to validate a pathway
Using a permutation test we sample 99 random subsets of genes the same size as the pathway
We calculate the cluster rank score for each of these 99 sets
We sort the 99 test scores together with the pathway score
The p-value is the percentile
A pathway is validated if its p-value is <= 0.1
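The permutation test itself can be sketched independently of how the cluster rank score is computed. In the sketch below the score is passed in as a function, since the talk does not give its full formula; the assumption that a larger score is more extreme, the function name and the toy score used in the example are all hypothetical.

```python
import random


def permutation_p_value(pathway, all_genes, score_fn, n_random=99, rng=None):
    """Empirical p-value of a pathway's score against random gene sets.

    Draws n_random random subsets of all_genes with the same size as the
    pathway, scores each with score_fn, and returns the rank of the
    pathway's own score among all n_random + 1 scores as a fraction.
    """
    rng = rng or random.Random()
    observed = score_fn(pathway)
    pool = sorted(all_genes)
    random_scores = [
        score_fn(set(rng.sample(pool, len(pathway))))
        for _ in range(n_random)
    ]
    # Rank 1 means no random set scored at least as high as the pathway.
    rank = 1 + sum(1 for s in random_scores if s >= observed)
    return rank / (n_random + 1)


all_genes = {"g%d" % i for i in range(50)}
pathway = {"g0", "g1", "g2"}
# Toy score for illustration only: overlap with the pathway itself.
p = permutation_p_value(pathway, all_genes, lambda s: len(s & pathway),
                        rng=random.Random(0))
print(p, "validated" if p <= 0.1 else "not validated")
```

With 99 random sets the smallest possible p-value is 0.01, which is why the threshold of 0.1 corresponds to the pathway ranking among the top ten of the 100 scores.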
We call a pathway “Validated” if its Cluster Rank Score has p-value <= 0.1
Kelley-Ideker Histogram of the Lowest CRS per Pathway per BPM
This histogram displays all the CRS scores from all of the results from Kelley and Ideker’s
BPMs bucketed according to their lowest p value score. The p value scores <= 0.10
indicate a validated BPM.
Ulitsky Histogram of the Lowest CRS per Pathway per BPM
This histogram displays all the CRS scores from all of the results from Ulitsky’s BPMs
bucketed according to their lowest p value score. The p value scores <= 0.10 indicate a
validated BPM.
Ma Histogram of the Lowest CRS per Pathway per BPM
This histogram displays all the CRS scores from all of the results from Ma’s BPMs
bucketed according to their lowest p value score. The p value scores <= 0.10 indicate a
validated BPM.
Brady Histogram of the Lowest CRS per BPM
This histogram displays all the CRS scores from all of the results from Brady’s BPMs
bucketed according to their lowest p value score. The p value scores <= 0.10 indicate a
validated BPM. Clearly, Brady’s BPMs are disproportionately represented in the lower p
value range.
Results

BPM dataset          # paths hit by knockouts  # validated pathways  % validated pathways
Kelley-Ideker (05)   160                       16                    10%
Ulitsky-Shamir (07)  36                        5                     14%
Ma et al. (08)       54                        6                     11%
Our results          959                       230                   24%
A Tantalizing Peek of What We can Do
With More Data!
• A heat map of the
differential expression
of yeast genes in
pathway 2 in response
to the deletion of two
different genes (SHE4
and GAS1) from
pathway 1 in a validated
BPM of Ma et al.
A random-gene validation test couples
the two pathways together