Network Inference

advertisement
Beyond Co-expression:
Gene Network Inference
Patrik D’haeseleer
Harvard University
http:/genetics.med.harvard.edu/~patrik
Beyond Co-expression
• Clustering approaches rely on co-expression of
genes under different conditions
• Assumes co-expression is caused by co-regulation
• We would like to do better than that:
– Causal inference
– What is regulating what?
Gene Network Inference
Overview
• Modeling Issues:
–
–
–
–
•
•
•
•
Level of biochemical detail
Boolean or continuous?
Deterministic or stochastic?
Spatial or non-spatial?
Data Requirements
Linear Models
Nonlinear models
Conclusions
Level of Biochemical Detail
• Detailed models require lots of data!
• Highly detailed biochemical models are only feasible
for very small systems which are extensively studied
• Example: Arkin et al. (1998), Genetics 149(4):1633-48
lysis-lysogeny switch in Lambda:
5 genes, 67 parameters based on 50 years of research,
stochastic simulation required supercomputer
Example: Lysis-Lysogeny
Arkin et al. (1998),
Genetics 149(4):1633-48
Level of Biochemical Detail
• In-depth biochemical simulation of e.g. a whole cell is
infeasible (so far)
• Less detailed network models are useful when data is
scarce and/or network structure is unknown
• Once network structure has been determined, we can
refine the model
Boolean or Continuous?
• Boolean Networks (Kauffman (1993), The Origins of Order)
assumes ON/OFF gene states.
A 0
1
C
0
C = A AND B
B
• Allows analysis at the network-level
• Provides useful insights in network dynamics
• Algorithms for network inference from binary data
Boolean or Continuous?
• Boolean abstraction is poor fit to real data
• Cannot model important concepts:
– amplification of a signal
– subtraction and addition of signals
– compensating for smoothly varying environmental parameter
(e.g. temperature, nutrients)
– varying dynamical behavior (e.g. cell cycle period)
• Feedback control:
negative feedback is used to stabilize expression
 causes oscillation in Boolean model
Deterministic or Stochastic?
• Use of concentrations assumes individual molecules
can be ignored
• Known examples (in prokaryotes) where stochastic
fluctuations play an essential role (e.g. lysislysogeny in lambda)
• Requires stochastic simulation (Arkin et al. (1998),
Genetics 149(4):1633-48), or modeling molecule counts
(e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)
• Significantly increases model complexity
Deterministic or Stochastic?
• Eukaryotes: larger cell volume, typically longer halflives. Few known stochastic effects.
• Yeast: 80% of the transcriptome
is expressed at 0.1-2 mRNA
copies/cell
Holstege, et al.(1998), Cell 95:717-728.
• Human: 95% of transcriptome is
expressed at <5 copies/cell
Velculescu et al.(1997), Cell 88:243-251
Spatial or Non-Spatial
• Spatiality introduces additional complexity:
–
–
–
–
intercellular interactions
spatial differentiation
cell compartments
cell types
• Spatial patterns also provide more data
e.g. stripe formation in Drosophila:
Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454.
• Few (no?) large-scale spatial gene expression data
sets available so far.
Overview
• Modeling Issues:
–
–
–
–
•
•
•
•
Level of biochemical detail
Boolean or continuous?
Deterministic or stochastic?
Spatial or non-spatial?
Data Requirements
Linear Models
Nonlinear models
Conclusions
Overview
• Modeling Issues
• Data Requirements:
–
–
–
–
Lower bounds from information theory
Effect of limited connectivity
Comparison with clustering
Variety of data points needed
• Linear Models
• Nonlinear models
• Conclusions
Lower Bounds from
Information Theory
• How many bits of information are needed just to
specify the connection pattern of a network?
• N2 possible connections between N nodes
 N2 bits needed to specify which connections are
present or absent
O(N) bits of information per “data point”
 O(N) data points needed
•
Effect of Limited Connectivity
• Assume only K inputs per gene (on average)
 NK connections out of N2 possible:
 N2 

 possible connection patterns
 NK 


• Number of bits needed to fully specify the
connection pattern:
 N2 
  NK log N K 
log 2 

NK


 O(Klog(N/K)) data points needed
Comparison with clustering
• Use pairwise correlation comparisons as a stand-in
for clustering
• As number of genes increases, number of false
positives will increase as well  need to use more
stringent correlation test
• If we want to use the same correlation cutoff value r,
we need to increase the number of data points as N
increases
 O(log(N)) data points needed
Summary
Fully connected
Connectivity K
Clustering
N
Klog(N/K)
log(N)
(thousands)
(hundreds?)
(tens)
• Additional constraints reduce data requirements:
– limited connectivity
– choice of regulatory functions
• Network inference is feasible, but does require
much more data than clustering
Variety of Data Points Needed
• To unravel regulation of a gene, need to sample
many different combinations of its regulatory inputs
(different environmental conditions and perturbations)
• Time series data yields dynamics, but all data points
are related
• Steady-state data yields attractors, can sample state
space more efficiently
• Both types of data will be needed, and multiple data
sets of each
Overview
• Modeling Issues
• Data Requirements
• Linear Models:
–
–
–
–
Formulation
Underdetermined problem!
Solution 1: reduce N
Solution 2
• Nonlinear models
• Conclusions
Linear Models
• Basic model: weighted sum of inputs
yi (t  t )   w ji y j (t )  bi or
dyi
  w ji y j  bi
dt
j
j
g1 w12
• Simple network representation:
w55
g2
• Only first-order approximation
w23
g5
g4
g3
• Parameters of the model:
weight matrix containing NxN interaction weights
• “Fitting” the model: find the parameters wji, bi such
that model best fits available data
Underdetermined problem!
• Assumes fully connected network: need at least as
many data points (arrays, conditions) as variables
(genes)!
• Underdetermined (underconstrained, ill-posed)
model: we have many more parameters than data
values to fit
• No single solution, rather infinite number of
parameter settings that will all fit the data equally well
Solution 1: reduce N
• Rather than trying to model all genes, we can
reduce the dimensionality of the problem:
• Network of clusters: construct a linear model based
on the cluster centroids
– rat CNS data (4 clusters): Wahde and Hertz (2000),
Biosystems 55, 1-3:129-136.
– yeast cell cycle (15-18 clusters): Mjolsness et al.(2000),
Advances in Neural Information Processing Systems 12; van
Someren et al.(2000) ISMB2000, 355-366.
• Network of Principle Components: linear model
between “characteristic modes” of the data
Holter et al.(2001), PNAS 98(4):1693-1698.
Solution 2:
• Take advantage of additional information:
–
–
–
–
replicates
accuracy of measurements
smoothness of time series
…
• Most likely, the network will still be poorly
constrained.
 Need a method to identify and extract those parts of
the model that are well-determined and robust
What’s next?
• Regulatory motifs:
once we have identified the corresponding DNA binding
proteins (transcription factors), we can start building the
gene network from there
• Integration with other data:
–
–
–
–
–
–
–
transcription factors
functional annotation
known interactions in the literature
protein-protein interactions
protein expression levels
genetic data
...
Linking Regulatory Motifs
to Expression Data
Patrik D’haeseleer
Harvard University
http:/genetics.med.harvard.edu/~patrik
Introduction
• Gene expression is regulated by Transcription
Factors (TFs), that bind to specific regulatory motifs
in the promoter region of the gene.
Synonyms: regulatory element, regulatory sequence,
promoter elements, promoter motifs, (TF) binding site,
operator (in prokaryotes), …
TF
DNA
regulatory
motif
gene
• Question: Do genes with similar expression
patterns share regulatory motifs?
1: Systematic Determination of
Genetic Network Architecture
0.2
1
2
-0.8
-1.3
-1.8
Time -point
3
Normalized
Expression
Time-point 3
Normalized
Expression
1.2
0.7
-0.3
Tavazoie et al., Nature Genetics 22, 281 – 285 (1999)
Normalized
Expression
Time-point 1
1.5
1
0.5
0
-0.5
1
2
3
-1
-1.5
Time -point
1.5
1
0.5
0
-0.5
1
2
-1
-1.5
-2
Time -point
3
Search for Motifs in Promoter Regions
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence
per gene are searched in
Saccharomyces cerevisiae.
Best Motif Found by AlignACE
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
3
2.5
2
MCB
N=182
1.5
1
0.5
0
-0.5
-1
Number of ORFs
s.d. from mean
Replication & DNA synthesis (2)
-1.5
100
80
60
40
20
0
Distance from ATG (b.p.)
0
00
-1
00
-2
00
-3
00
-4
00
-5
00
-6
00
-7
00
-8
-9
00
35
30
25
20
15
10
5
0
00
0
23
30
11
40
CLUSTER
Number of sites
DNA synthesis and replication (82)
Cell cycle control and mitosis (312)
Recombination and DNA repair (84)
Nuclear organization (720)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
-1
MIPS Functional category (# ORFs)
ORFs within
category
Systematic Determination of Genetic
Network Architecture
• Tavazoie et al., Nature Genetics 22, 281–285 (1999)
• Most motifs found are highly selective for the cluster
they were found in.
• Can find many known binding sites for transcription
factors.
• Also finds many novel regulatory motifs, associated
with specific functional categories.
1) cluster
2) identify regulatory motifs in clustered genes
3) identify TF’s that bind to those motifs
 Gene regulation network
2: Regulatory Element Detection Using
Correlation with Expression
Bussemaker et al., Nature Genetics 27, 167 – 174 (2001)
• What is the contribution of each regulatory motif (or
the TF that binds to that motif) to the expression level
of the genes containing the motif?
• Given a set of known or putative regulatory motifs,
identify all genes that contain the motif in their
promoter region.
• For a single expression experiment (e.g. single point
in a time series), is the presence of the motif
correlated with the expression level of the genes?
• Perform multiple regression of (log) expression level
on the presence/absence of the motifs.
• Plot contribution of motif throughout time series.
Contribution of Motifs to Expression Levels
: the presence of motif 1
is correlated with the
expression levels of the
genes in which it appears
: motif 2 is not correlated
with expression levels of
the genes in which it
appears
: motif 3 is negatively
correlated with expression
levels of the genes in
which it appears
Linear Combination of Motif Contributions
• Find the most highly correlated motif.
• Determine its contribution Fi to expression level by
linear regression.
• Subtract its contribution from the expression levels.
• Find the next highest correlated motif.
• Repeat until no more significantly correlated motifs.
Ag  C  F1 N1g  F2 N 2 g  F3 N 3 g  ...
• Repeat this entire analysis for each time point of a
time series  weights Fi for the individual motifs will
change throughout he time course.
Time Courses of Regulatory Signals
Time (minutes)
Time (minutes)
Ag (t )  C  N1g F1 (t )  N 2 g F2 (t )  N 3 g F3 (t )  ...
• We can think of the time-varying contributions Fi of
each motif as the Regulatory Signals of the
transcription factors that bind to these motifs
Regulatory Element Detection Using
Correlation with Expression
• Bussemaker et al., Nature Genetics 27, 167–174 (2001)
• Can be used with known regulatory motifs, sets of
putative motifs, and even exhaustively on the set of all
motifs up to a certain length (n=7).
• Known motifs generally have high statistical significance.
• Allows us to infer regulatory inputs of (possibly unknown)
transcription factors.
• Accounts for only 30% of total signal present in genomewide expression patterns.
• Purely linear model: no synergistic effects between TF’s,
cooperative binding, etc.
3: Identifying Regulatory Networks
by Combinatorial Analysis
of Promoter Elements
Pilpel et al., Nature Genetics 29, 153–159 (2001)
• Most transcription factors are thought to work in
concert with other TF’s.
 Synergistic effects
• Clustering:
– a motif may occur in more than one cluster, because it may
give rise to different expression patterns depending on its
interaction partners.
– several motifs may occur in the same cluster.
• Correlation with expression pattern:
– by itself, a motif may not show a clear expression pattern.
– contributions of multiple motifs may not be simply additive.
Synergy between Mcm1 and SFF
in Cell Cycle Data Set
Mcm1 and SFF
were not detected
in Tavazoie et al
Yet TFs that bind
these motifs are
known to interact in
control of G2-genes
Expression level
SFF but not Mcm1
EC=0.05
2
Mcm1 but not SFF
0
0
-2
-2
5
EC=0.05
2
10 Time 15
5
10 Time 15
SFF and Mcm1
Bussemaker et al
found that these
motifs are
antagonistic.
Expression level
(Nature. 2000 406:90-4.)
EC=0.23
2
0
-2
5
10
Time
15
Expression Coherence and Synergy
• Expression Coherence (EC) score indicates how
tightly clustered the expression profiles of a set of
genes are.
• For every combination of N=2,3 motifs:
1) Calculate the expression coherence score of the
genes that have the N motifs
2) Calculate the expression coherence score of genes
that have every possible subset of N-1 motifs
3 )Test (statistically) the hypothesis that the score of
the orfs with N motifs is significantly higher than
that of orfs that have any sub set of N-1 motifs
Correlation
-1
The “Combinogram”
-0.5
0
0.5
G2
G1
1
MCB
MSE
URS1
SCB
MCM1'
SFF'
MCB
MSE
URS1
SCB
MCM1'
SFF'
Highly synergistic
interaction between
MCB and SFF
Previously unknown
Subsequently
predicted via
chromatin immunoprecipitation (ChIP)
Expression
Coherence
0.2
0.4
0.6
0.8
(cell cycle data)
Ho et al. Nature. 2002
Identifying Regulatory Networks
by Combinatorial Analysis
of Promoter Elements
• Pilpel et al., Nature Genetics 29, 153–159 (2001)
• Found several known and novel interactions between
regulatory elements active in cell cycle, sporulation
and stress response.
• Doesn’t assume a specific (e.g. linear) model of TF
interactions.
• Combined with TF expression patterns, may allow us
to infer a model of interaction.
Protein Networks
Patrik D’haeseleer
Harvard University
http:/genetics.med.harvard.edu/~patrik
Yeast 2-Hybrid Assays
Transcription Factor
(e.g. Gal4)
BD
Binding site
MATa
MATa
AD
+
Prot1
BD AD
Prot2
BD
AD
Reporter gene
Reconstituted
active TF
“bait” fusion:
“prey” fusion:
BD
Prot1
AD
Prot2
BD
BD
AD
Prot1
AD
Binding site
Prot2
BD
AD
Reporter gene
Fields and Song, Nature 340:245-246 (1989)
Large-Scale 2-Hybrid Data Sets
• Uetz et al, Nature 403:623-627 (2000)
– 6000 x 192 protein pairs screened using protein array
– nearly all 6000 x 6000 pairs, using pooled prey libraries
– total of 957 putative interactions between 1004 proteins
• Ito et al, PNAS 98:4569-4574 (2001)
– nearly all 6000 x 6000 pairs, using bait and prey pools
– total of 4549 putative interactions between 3278 proteins
– core set of 841 interactions between 797 proteins
• Surprisingly little overlap between the data sets,
possibly indicating a large number of missed
interactions (false negative).
Intersections between
Protein Interaction Data Sets
MIPS
1546
MIPS
1546
Ito full
4475
1415
1436
49
54 28
709
Uetz
947
156
61 21
4242
28
109
756
Uetz
947
Ito core
806
648
Causes of False Positives
•
•
•
•
•
•
•
•
•
•
Bait acts as activator
Bait interacts with endogenous activator
Prey binds to DNA
Prey interacts with endogenous transcription factors
Bait interacts with Activation Domain
Prey interacts with DNA Binding Domain
“Sticky” proteins (nonspecific binding)
Changes in plasmid copy number
AD
Various other artifacts
Prot2
BD
Prot1
AD
...
BD
Binding site
Reporter gene
Yeast Protein-Protein Interaction Map
Each node is a protein
Each line is an interaction
5560 putative interactions
3725 different proteins
~ 3 interactions / protein
Uetz, Schwikowski, Fields and co-workers; Ito and co-workers
Membrane
Proteins
Transcription
Factors
- membrane protein
- DNA-binding protein
- all other yeast proteins
- physical interaction between two proteins
Problem: How to Rank Possible Pathways?
Ste2/3
Possible Paths from Ste3:
1045 different paths to
143 transcription factors
Ste12
Rank Predicted Paths by Degree of Expression
Correlation from Microarray Expreriments
• Known pathways often show
correlated expression
• Known interacting proteins
often show correlated
expression
Average Pairwise
Correlation Coefficient
Among Pathway Members
STE3AKR1STE5STE4FAR1CDC24SOH1
STE3AKR1IQG1CDC42BEM4RHO1SKN7
STE3AKR1STE4FAR1FUS3DIG2STE12
STE3AKR1GCS1YGL198WSAS10NET1
0.190
0.059
0.281
-0.106
* Microarray Data Downloaded from Rosetta Inpharmatics
Classical View of MAPK Pathways
adapted from C.Roberts, et al., Science, 287, 873 (2000)
The Protein Network View
• Highly interconnected,
not just a linear
pathway!
• Some proteins are
missing from the protein
interaction data sets
(Cdc42, Ste20).
• Includes several
additional proteins
(especially Akr1, Kss1).
Conclusions
• Protein interaction data and expression data are both
noisy. Combining them increases the accuracy.
• Can estimate protein interaction error rates by
looking at consistency between data sets 
probabilistic interaction model (work in progress).
• Pathways are far more interconnected than often
portrayed.
• Can integrate various other forms of data:
– co-localization of proteins
– homology with known interacting proteins
– “Rosetta Stone” method
Acknowledgments
Roland Somogyi
NCGR:
Stefanie Fuhrman
Xiling Wen
Jason Stewart
Pedro Mendes
Harvard:
Tzachi Pilpel
Martin Steffen
Allegra Petti
John Aach
George Church
UNM:
Stephanie Forrest
Andreas Wagner
David Peabody
Barak Pearlmutter
The Santa Fe Institute
Download