CS 5263 Bioinformatics Reverse-engineering Gene Regulatory Networks

Genes and Proteins
[Diagram: the central dogma and its points of regulation]
Gene (DNA) → transcription (also called expression; subject to transcriptional regulation) → mRNA → translation (subject to translational regulation) → Protein → (de)activation (post-translational regulation). mRNA is also subject to degradation.
Gene Regulatory Networks
• Functioning of cell controlled by interactions between
genes and proteins
• Genetic regulatory network: genes, proteins, and their
mutual regulatory interactions
[Diagram: gene 1, gene 2, and gene 3 linked by activator and repressor interactions]
Reverse-engineering GRNs
• GRNs are large, complex, and dynamic
• Reconstruct the network from observed gene expression
behaviors
– Experimental methods focus on a few genes only
– Computer-assisted analysis: large scale
• Studied since the 1960s
– Mostly theoretical at first
• Attracting much attention since the advent of microarray
technology
• Emerging advanced large-scale assay techniques are
making it even more feasible (ChIP-chip, ChIP-seq, etc.)
Problem Statement
• Assumption: expression value of a gene
depends on the expression values of a set of
other genes
• Given: a set of gene expression values under
different conditions
• Goal: a function for each gene that predicts its
expression value from expression of other genes
– Probabilistically: Bayesian network
– Boolean functions: Boolean network
– Linear functions: linear model
– Other possibilities, such as decision trees and SVMs
Characteristics
• Gene expression data is often noisy,
with missing values
• Only measures mRNA level
– Many genes regulated not only on the
transcriptional level
• # genes >> # experiments: an underdetermined problem!
• Correlation ≠ causality
• Good news: Network structure is
sparse (scale-free)
Methods for GRN inference
• Directed and undirected graphs
– E.g. KEGG, EcoCyc
• Boolean networks
– Kauffman (1969), Liang et al (1999), Shmulevich et al (2002),
Lähdesmäki et al (2003)
• Bayesian networks
– Friedman et al (2000), Murphy and Mian (1999), Hartemink et al
(2002)
• Linear/non-linear regression models
– D’Haeseleer et al (1999), Yeung et al (2002)
• Differential equations
– Chen, He & Church (1999)
• Neural networks
– Weaver, Workman and Stormo (1999)
Boolean Networks
• Genes are either on or off (expressed or not
expressed)
• State of gene Xi at time t is a Boolean function of
the states of some other genes at time t-1
X Y Z | X' Y' Z'
0 0 0 | 0  0  0
0 0 1 | 0  0  0
0 1 0 | 1  0  1
0 1 1 | 0  0  1
1 0 0 | 0  1  0
1 0 1 | 0  1  0
1 1 0 | 1  1  1
1 1 1 | 0  1  1

X' = Y and (not Z)
Y' = X
Z' = Y

[Wiring diagram: X, Y, Z at time t-1 feed X', Y', Z' at time t]
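A minimal Python sketch of this network (the helper names are ours, not from the slides): applying the synchronous update rule from state (1, 1, 0) settles into the fixed-point attractor (0, 0, 0).

```python
def step(state):
    """Synchronously update (X, Y, Z) by one time step."""
    x, y, z = state
    return (y and not z,  # X' = Y and (not Z)
            x,            # Y' = X
            y)            # Z' = Y

def trajectory(state, steps=6):
    """Return the sequence of states visited, starting from `state`."""
    seq = [state]
    for _ in range(steps):
        state = step(state)
        seq.append(state)
    return seq

# Starting from (X, Y, Z) = (1, 1, 0), the network reaches the
# fixed-point attractor (0, 0, 0) in the 2^3 = 8 element state space.
for s in trajectory((True, True, False)):
    print(tuple(int(v) for v in s))
```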
Learning Boolean Networks for
Gene Expression
• Assumptions:
– Deterministic (wiring does not change)
– Synchronized update
– All Boolean functions are equally probable
• Data needed: 2^N state transitions for N genes (in comparison, ~N suffice for linear models)
• General technique: limit the # of inputs per gene to k; the data required drops to 2^k log(N)
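For a sense of scale, a quick calculation (N and k below are made-up but yeast-scale numbers, and the log base is an assumption):

```python
from math import comb, log2

N, k = 6000, 3                 # assumed: yeast-scale gene count, <=3 inputs/gene
print(round(2 ** k * log2(N))) # ~100 transition pairs needed with k limited
print(comb(N, k))              # ~3.6e10 candidate input sets per gene to search
```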
Learning Boolean Networks
• Consistency Problem
– Given: examples S = {<In, Out>}, where In ∈ {0,1}^k and Out ∈ {0,1}
– Goal: learn a Boolean function f such that for every <In, Out> ∈ S, f(In) = Out
– Note:
• Given the same input, the output is unique.
• For k input variables, there are at most 2^k distinct input configurations.
– Example (writing each binary input as its decimal index):
<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
⇒ <1,1> <5,1> <6,1> <2,0> <3,0> <5,0>
Learning Boolean Networks
The examples fill in a partial truth table (indices 0-7 for inputs 000-111):

<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>

Index: 0  1  2  3  4  5  6  7
f:     ?  1  0  0  ?  *  1  ?

• No clash → consistency; here index 5 (input 101) received both outputs 1 and 0 (marked *), so no consistent function exists
• Question marks → undetermined elements
• Checking one candidate function takes O(Mk), where M is the # of experiments
• For N genes, choosing k inputs out of N costs N · C(N, k) · O(Mk) overall
• Best-fit problem: find a function f with the minimum # of errors
• Limited error-size problem: find all functions with error size within a given maximum
Lähdesmäki et al, Machine Learning 2003;52:147-167.
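The consistency check itself is easy to sketch in Python (an assumed implementation of the idea above, not Lähdesmäki et al.'s code): scan the examples once, filling a partial truth table and flagging clashes.

```python
def partial_truth_table(examples, k):
    """examples: list of (input bit-string, output) pairs, e.g. ('001', 1)."""
    table = ['?'] * (2 ** k)          # '?' = undetermined entry
    for bits, out in examples:
        i = int(bits, 2)              # decimal index of the input pattern
        if table[i] == '?':
            table[i] = out
        elif table[i] != out:
            table[i] = '*'            # '*' = clash: inconsistent examples
    return table                      # one pass over M examples: O(Mk)

examples = [('001', 1), ('101', 1), ('110', 1),
            ('010', 0), ('011', 0), ('101', 0)]
table = partial_truth_table(examples, k=3)
print(table)                          # ['?', 1, 0, 0, '?', '*', 1, '?']
print('consistent' if '*' not in table
      else 'clash at 101: no consistent function')
```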
State space and attractor basins
What are some biological interpretations of basins and attractors?
Linear Models
• Expression level of gene at time t depends
linearly on the expression levels of some
genes at time t-1
[Diagram: genes X1, X2, X3 at time t-1 linked by weights W11, W21, W31, W32, W33 to X1, X2, X3 at time t]
o Basic model: Xi(t) = Σj Wij Xj(t-1)
o Xi′(t) = Σj Aij Xj(t), where Xi(t) can be measured and Xi′(t) can be estimated from Xi(t)
o In matrix form: X′ = A · X, where X and X′ are N×M and A is N×N; M is the number of time points and N is the number of genes
Linear Models (cont’d)
• X’NM = ANN ·XNM
• ANN: connectivity matrix, Aij describes the
type and strength of the influence of the jth
gene on the ith gene.
• To solve A, need to solve MN linear
equations
• In general N2 >> MN, therefore underdetermined => infinity number of solutions
Get Around The Curse of
Dimensionality
• Non-linear interpolation to increase the # of time points
• Cluster genes to reduce the # of genes
• Singular Value Decomposition (SVD)
– A = A0 + C(N×N) · V^T(N×N), where cij = 0 if j > M
– Take A0 as a solution: it is guaranteed to have the smallest sum of squares
• Robust regression
– Minimize the # of edges in the network
– Biological networks are sparse (scale-free)
[Diagram: the C matrix, with a block of free entries cij and the remaining entries 0]
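As a small illustration, the smallest-sum-of-squares particular solution A0 can be obtained from the SVD via numpy's pseudoinverse (a sketch on made-up data, not Yeung et al.'s code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4                     # N genes, M time points (M << N)
X  = rng.normal(size=(N, M))     # measured expression levels
Xp = rng.normal(size=(N, M))     # estimated rates X'(t)

# Minimum-norm exact solution of X' = A X; pinv is computed via the SVD of X.
A0 = Xp @ np.linalg.pinv(X)
print(np.allclose(A0 @ X, Xp))   # A0 reproduces the data exactly
```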
Robust Regression
• A = A0 + C(N×N) · V^T(N×N)
• Minimize the # of non-zero entries in A by selecting C
– Set A = 0, then C · V^T = −A0; solve for C
– Over-determined (N² equations, MN free variables)
• Robust regression
– Fit a hyper-plane to a set of points by passing through as many points as possible
[Plot: a line fitted through as many points as possible, ignoring outliers]
Simulation Experiments
[Figure panels: networks reconstructed by SVD alone vs. SVD + robust regression]
Yeung et al, PNAS. 2002;99:6163-8.
Simulation Experiments (cont’d)
[Figure panels: a linear system vs. a nonlinear system close to steady state]
• Does not work for a nonlinear system that is not close to steady state
• The scale-free property does not hold on small networks
Bayesian Networks
[DAG G: X1 → X3, X2 → X3, X2 → X4, X3 → X5]
• A DAG G(V, E), where
– Vertex: a random variable
– Edge: conditional distribution for a variable, given its parents in G
• Markov assumption: ∀i, I(Xi, non-descendants(Xi) | PaG(Xi)), e.g., I(X3, X4 | X2), I(X5, X4 | X3)
• Chain rule: P(X1, X2, …, Xn) = Π(i = 1..n) P(Xi | PaG(Xi)), e.g.,
P (X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P (X4 | X2) P(X5 | X3)
Learning: argmaxG P (G | D) = P (D | G) * P (G) / C
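A toy numeric illustration of this factorization (all conditional probability values below are made up):

```python
P_X1, P_X2 = 0.6, 0.3                        # P(X1 = 1), P(X2 = 1)
P_X3 = {(1, 1): 0.9, (1, 0): 0.7,
        (0, 1): 0.4, (0, 0): 0.1}            # P(X3 = 1 | X1, X2)
P_X4 = {1: 0.8, 0: 0.2}                      # P(X4 = 1 | X2)
P_X5 = {1: 0.5, 0: 0.05}                     # P(X5 = 1 | X3)

def bern(q, v):
    """Probability of value v under a Bernoulli with P(1) = q."""
    return q if v == 1 else 1 - q

def joint(x1, x2, x3, x4, x5):
    # Chain rule: P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)
    return (bern(P_X1, x1) * bern(P_X2, x2) * bern(P_X3[(x1, x2)], x3)
            * bern(P_X4[x2], x4) * bern(P_X5[x3], x5))

print(joint(1, 0, 1, 0, 1))   # 0.6 * 0.7 * 0.7 * 0.8 * 0.5 = 0.1176
```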
Bayesian Networks (Cont’d)
• Equivalence classes of Bayesian networks
– Same topology, different edge directions
– Cannot be distinguished from observation
[Diagram: A → C → B and A ← C ← B both encode I(A, B | C); in the PDAG such edges are left undirected]
• Causality
– A Bayesian network does not directly imply causality
– Causality can be inferred from observation under certain assumptions:
• no hidden common cause
• ……
[Diagram: A → B versus A and B driven by a hidden variable C]
Bayesian Networks for Gene
Expression
[Diagram: a Bayesian network over Genes A-E; each edge carries a conditional distribution, e.g., P(D | E), modeled as multinomial or linear]
Other variables can be added, such as promoter sequences, experimental conditions, and time.
• Deals with noisy data well,
reflects stochastic nature of
gene expression
• Indication of causality
• Practical issues:
– Learning is NP-hard
– Over-fitting
– Equivalence classes of graphs
• Solution:
– Heuristic search, sparse
candidate
– Model averaging
– Learning partial models
Learning Bayesian Nets
• Find G to maximize Score (G | D), where
– Score(G | D) = Σi Score (Xi, PaG(Xi) | D)
• Hill-climbing
– Edge addition, edge removal, edge reversal
• Divide-and-conquer
– Solve for sub-graphs
• Sparse candidate algorithm
– Limit the number of candidate parents for each variable (biological implication: sparse graphs)
– Iteratively modify the candidate set
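A schematic Python sketch of hill-climbing with a decomposable score (the correlation-based toy score and all helper names are ours; real systems use BIC/BDe scores and also try edge reversals):

```python
import itertools
import numpy as np

def creates_cycle(parents, child, trial):
    """True if setting Pa(child) = trial would introduce a directed cycle."""
    graph = {v: set(ps) for v, ps in parents.items()}
    graph[child] = set(trial)
    stack, seen = list(trial), set()
    while stack:                      # walk up parent links from the trial set
        v = stack.pop()
        if v == child:                # child is its own ancestor => cycle
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(graph[v])
    return False

def hill_climb(variables, data, family_score, max_parents=3):
    """Greedy structure search using add/remove-edge moves only."""
    parents = {v: set() for v in variables}        # start from the empty graph
    improved = True
    while improved:
        improved = False
        for child, cand in itertools.permutations(variables, 2):
            trial = parents[child] ^ {cand}        # toggle edge cand -> child
            if len(trial) > max_parents or creates_cycle(parents, child, trial):
                continue
            # Decomposability: only the term for `child`'s family changes.
            if family_score(child, trial, data) > family_score(child, parents[child], data):
                parents[child] = set(trial)
                improved = True
    return parents

# Toy score: summed |correlation| minus a per-parent penalty (illustration only).
def toy_score(child, pa, data):
    return sum(abs(np.corrcoef(data[child], data[p])[0, 1]) for p in pa) - 0.4 * len(pa)

rng = np.random.default_rng(1)
data = {'X1': rng.normal(size=50), 'X3': rng.normal(size=50)}
data['X2'] = data['X1'] + 0.1 * rng.normal(size=50)   # X2 tracks X1
print(hill_climb(['X1', 'X2', 'X3'], data, toy_score))
```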
Partial Models (Features)
• Model Averaging
– Learn many models; common sub-graphs are more likely to be true
– Confidence measure: # of times a sub-graph appears
– Method: bootstrap (see the sketch after this slide)
• Markov relations
– A is in B’s Markov blanket iff A → B or A and B share a common child C (A → C ← B)
– Interpretation: A and B take part in some joint biological interaction
• Order relations
– A is an ancestor of B (A → … → B)
– Interpretation: A is a cause of B
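A toy sketch of the bootstrap confidence measure (illustrative stand-ins only: `learn` here is just correlation thresholding, not full Bayesian network learning):

```python
import numpy as np

def learn(data, genes, thresh=0.8):
    """Return the set of gene pairs whose |correlation| exceeds `thresh`."""
    c = np.corrcoef(data)
    return {(genes[i], genes[j]) for i in range(len(genes))
            for j in range(i + 1, len(genes)) if abs(c[i, j]) > thresh}

def confidence(data, genes, pair, n_boot=200, seed=0):
    """Fraction of bootstrap datasets whose learned network contains `pair`."""
    rng = np.random.default_rng(seed)
    m = data.shape[1]                        # resample experiments (columns)
    hits = sum(pair in learn(data[:, rng.integers(0, m, size=m)], genes)
               for _ in range(n_boot))
    return hits / n_boot

rng = np.random.default_rng(2)
x = rng.normal(size=(1, 76))                 # 76 experiments, as in the slide
data = np.vstack([x, x + 0.3 * rng.normal(size=(1, 76)),
                  rng.normal(size=(1, 76))])
print(confidence(data, ['A', 'B', 'C'], ('A', 'B')))   # high for the linked pair
```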
Experimental Results
[Figure: examples of extracted Markov relations]
• Real biological data set: Yeast
cell cycle data
• 800 genes, 76 experiments,
200-fold bootstrap
• Test for significance and
robustness
– Higher-scoring features are more abundant in real data than in randomized data
– Order relations are more
robust than Markov
relations with respect to
local probability models.
Friedman et al, J Comput Biol. 2000;7:601-20
Transcriptional regulatory
network
• Who regulates whom?
• When?
• Where?
• How?
[Diagram: a TF bound to a gene's promoter]
[Diagram: TFs A and B controlling four genes through different logic — A and not B (g1), A or B (g3), A and B, and not (A and B) (g2, g4)]
PNAS 2003;100(9):5136-41
Data-driven vs. model-driven
methods
[Diagram: data-driven — a gene × condition expression matrix feeds clustering and motif finding (MF); the output is descriptive]
[Diagram: model-driven — learning a model, then post-processing, yields biological insights; the model is explanatory and predictive:
“a description of a process that could have generated the observed data”]
Data-driven approaches
[Diagram: a genes × experiments expression matrix]
• Clustering: hierarchical, K-means, …
• Motif finding: MEME, Gibbs, AlignACE, …
• Assumption
– Co-expressed genes are likely co-regulated: not
necessarily true
• Limitations:
– Clustering is subjective
– Statistically over-represented but non-functional “junk”
motifs
– Hard to find combinatorial motifs
Model-based approaches
• Intuition: find motifs that are not only statistically
over-represented, but are also associated with
the expression patterns
– E.g., a motif appears in many up-regulated genes
but very few other genes => real motif?
• Model: gene expression = f (TF binding motifs,
TF activities)
• Goal: find the function that
– Can explain the observed data and predict future
data
– Captures true relationships among motifs, TFs
and expression of genes
Transcription modeling
e = f (m1, m2, m3, m4)
[Diagram: variables — the promoters of genes g1-g8, the motifs (m1-m4) they contain, their expression values, and the gene labels to be predicted]
Assume that gene expression levels under a certain condition are a function of some TF binding motifs on their promoters.
Different modeling approaches
• Many different models, each with its own
limitations
• Classification models
– Decision tree, support vector machine (SVM), naïve Bayes, …
• Regression models
– Linear regression, regression tree, …
• Probabilistic models
– Bayesian networks, probabilistic Boolean
networks, …
Decision tree
e = f (m1, m2, m3, m4)
[Diagram: genes g1-g8 scored for motifs m1-m4; the tree splits on m1, then on m4 and m2, giving groups A = {g7, g8}, B = {g1, g2, g5}, C = {g4}, D = {g3, g6}]
• Tree structure is learned from data
– Only relevant variables (motifs) are used
– Many possible trees, the smallest one is preferred
• Advantages:
– Easy to interpret
– Can represent complex logic relationships
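A toy version of such a motif-based tree can be fit with scikit-learn (the motif matrix and labels below are made up to be separable like the example; the learned tree may differ from the one drawn, since many trees can fit the same data and the smallest is preferred):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

#              m1 m2 m3 m4   (1 = motif present on the gene's promoter)
X = np.array([[0, 1, 0, 1],   # g1
              [0, 0, 1, 1],   # g2
              [1, 0, 0, 0],   # g3
              [1, 1, 0, 1],   # g4
              [0, 1, 1, 1],   # g5
              [1, 0, 1, 0],   # g6
              [0, 0, 0, 0],   # g7
              [0, 1, 1, 0]])  # g8
y = ['B', 'B', 'D', 'C', 'B', 'D', 'A', 'A']   # expression labels

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=['m1', 'm2', 'm3', 'm4']))
```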
A real example: transcriptional
regulation of yeast stress response
• 52 genes up-regulated in heat shock (positive)
• 156 random unresponsive genes (negative)
• 356 known motifs
[Decision tree: RRPE? — yes → PAC? (yes → 23 (+); no → 4 (−), 3 (+)); no → FHL1? (yes → 11 (+), 1 (−); no → RAP1? (yes → 5 (+); no → 151 (−), 10 (+)))]
• Small tree: only 4 motifs used
• All 4 motifs are well known to be stress-related
• The RRPE-PAC combination is well known
Application to yeast cell-cycle
genes
[Figure: network by our method (Ruan et al., BMC Genomics, 2009) vs. the model network in Science, 2002;298(5594):799-804]
Regression tree
e = f (m1, m2, m3, m4)
[Diagram: the same genes g1-g8 and motifs m1-m4, now with real-valued expression e; the tree splits on m1, m4, and m2, and each leaf predicts a range such as e ≤ 0, 0 < e < 2, or e ≥ 2]
• Similar to decision tree
• Difference: each terminal node predicts a
range of real values instead of a label
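The same toy setup carries over to a regression tree, whose leaves predict numeric expression values (made-up numbers, same hypothetical motif matrix as before):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[0, 1, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [1, 1, 0, 1],
              [0, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0], [0, 1, 1, 0]])
e = np.array([1.2, 1.5, -0.8, 2.4, 1.1, -1.0, 0.1, 0.2])  # expression values

# Each leaf stores the mean expression of the genes routed to it.
reg = DecisionTreeRegressor(max_depth=2).fit(X, e)
print(export_text(reg, feature_names=['m1', 'm2', 'm3', 'm4']))
```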
Multivariate regression tree
• Multivariate labels: use multiple experiments simultaneously
• Use motifs to classify genes into co-expressed groups
• Does not need clustering in advance
[Diagram: motif matrix (g1-g8 × m1-m4) and expression matrix (e1-e5); the tree splits on m1, m4, and m2, grouping genes into {g7}, {g3, g6, g8}, {g1, g2, g5}, and {g4}]
Phuong, T., et al., Bioinformatics, 2004
Modeling with TF activities
• Gene expression = f (binding motifs, TF activities)
g = f (tf1, tf2, tf3, tf4)
[Diagram: rotate the expression matrix so that each experiment becomes a sample: the expression of gene g in an experiment is predicted from the TF levels tf1-tf4 in the same experiment]
[Tree: split on tf1 — tf1 ≤ 0 → g ≤ 0; tf1 > 0 → g > 0]
Soinov et al., Genome Biol, 2003
A Decision Tree Model
Segal et al. Nat Genet.
2003,34(2):166-76.
[Figure: a decision tree model of gene expression, with genes and experiments as dimensions]
Algorithm BDTree
• Gene expression = f (binding motifs, TF
activities)
• Ruan & Zhang, Bioinformatics 2006
• Basic idea:
– Iteratively partition an expression matrix by
splitting genes or experiments
– Split of genes is according to motif scores
– Split of conditions is according to TF
expression levels
– The algorithm decides the best motifs or TFs
to use
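One split decision of this scheme might be sketched as follows (a simplification with an assumed variance-reduction score, not the published BDTree implementation):

```python
import numpy as np

def variance_reduction(block, mask, axis):
    """Drop in total variance when `block` is split along `axis` by `mask`."""
    a = block.compress(mask, axis=axis)
    b = block.compress(~mask, axis=axis)
    if a.size == 0 or b.size == 0:
        return -np.inf
    return block.size * block.var() - (a.size * a.var() + b.size * b.var())

def best_split(block, motifs, tfs):
    """Try every motif (gene split) and TF (experiment split); keep the best."""
    best = (None, None, -np.inf)
    for j in range(motifs.shape[1]):               # split genes on motif j
        gain = variance_reduction(block, motifs[:, j] > 0, axis=0)
        if gain > best[2]:
            best = ('motif', j, gain)
    for j in range(tfs.shape[1]):                  # split experiments on TF j
        gain = variance_reduction(block, tfs[:, j] > 0, axis=1)
        if gain > best[2]:
            best = ('tf', j, gain)
    return best

rng = np.random.default_rng(3)
expr = rng.normal(size=(6, 4))                     # 6 genes x 4 experiments
expr[:3] += 2.0                                    # genes 0-2 behave differently
motifs = np.array([[1], [1], [1], [0], [0], [0]])  # one motif, in genes 0-2
tfs = rng.integers(0, 2, size=(4, 1))              # one TF level per experiment
print(best_split(expr, motifs, tfs))               # typically ('motif', 0, ...)
```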
Transcriptional regulation of
yeast stress response
• 173 experiments under ~20 stress
conditions
• 1411 differentially expressed genes
• ~1200 putative binding motifs
– Combination of ChIP-chip data, PWMs, and
over-represented k-mers (k = 5, 6, 7)
• 466 TFs
[Figure: the expression matrix (genes × experiments) recursively partitioned into blocks, e.g.:]
• Genes with motif FHL1 but no RRPE are down-regulated when Ppt1 is down-regulated and Yfl052w is up-regulated
• ……
• Genes with motifs RRPE & PAC are down-regulated when TFs Tpk1 & Kin82 are up-regulated
Biological validation
• Most motifs and TFs selected by the tree are
well-known to be stress-related
– E.g., motifs RRPE, PAC, FHL1, TFs Tpk1 and
Ppt1
• 42 / 50 blocks are significantly enriched with
some Gene Ontology (GO) functional terms
• 45 / 50 blocks are significantly enriched with
some experimental conditions
RRPE & PAC, ribosome biogenesis (60/94, p < e-65)
RRPE only, ribosome biogenesis (28/99, p < e-18)
FHL1, protein biosynthesis (98/105, p<e-87)
STRE (agggg), carbohydrate metabolism (p < e-20)
PAC, nitrogen metabolism
Relationship between methods
[Diagram: TF expression (t1-t4 × conditions c1-c5), motif matrix (genes g1-g8 × motifs m1-m4), and expression matrices, connected by four classes of methods A-D]
• A, C: from promoter to expression
– A: single condition
– C: multiple conditions
• B, D: from expression to expression
– B: single gene
– D: multiple genes