lecture24 - School of Computer Science

advertisement
Advanced Algorithms
and Models for
Computational Biology
-- a machine learning approach
Systems Biology:
Inferring gene regulatory network using
graphical models
Eric Xing
Lecture 24, April 17, 2006
Gene regulatory networks

Regulation of expression of genes is crucial:



Genes control cell behavior by controlling which proteins are made by a
cell and when and where they are delivered
Regulation occurs at many stages:

pre-transcriptional (chromatin structure)

transcription initiation

RNA editing (splicing) and transport

Translation initiation

Post-translation modification

RNA & Protein degradation
Understanding regulatory processes is a central problem
of biological research
Inferring gene regulatory
networks

Expression network
B
E
R
mostly

sometimes
rarely

gets most attentions so far, many algorithms

still algorithmically challenging
Protein-DNA interaction network
+
+

actively pursued recently, some interesting methods available

lots of room for algorithmic advance
A
C
Inferring gene regulatory
networks

Network of cis-regulatory pathways

Success stories in sea urchin, fruit fly, etc, from decades of experimental
research

Statistical modeling and automated learning just started
1: Expression networks

Early work



Clustering of expression data

Groups together genes with similar expression pattern

Disadvantage: does not reveal structural relations between genes
Boolean Network

Deterministic models of the logical interactions between genes

Disadvantage: deterministic, static
Deterministic linear models



Disadvantage: under-constrained, capture only linear interactions
The challenge:

Extract biologically meaningful information from the expression data

Discover genetic interactions based on statistical associations among data
Currently dominant methodology

Probabilistic network
Probabilistic Network Approach

Characterize stochastic (non-deterministic!)
relationships between expression patterns of different
genes

Beyond pair-wise interactions => structure!


Many interactions are explained by intermediate factors

Regulation involves combined effects of several gene-products
Flexible in terms of types of interactions (not necessarily
linear or Boolean!)
Probabilistic graphical models
• Given a graph G = (V,E), where each node vV is
associated with a random variable Xv
p(X3| X2)
p(X2| X1)
X3
AAAAGAGTCA
X2
X1
X6
p(X1)
p(X6| X2, X5)
p(X4| X1)
p(X5| X4)
X4
X5
• The joint distribution on (X1, X2,…, XN) factors according to
the “parent-of” relations defined by the edges E :
p(X1, X2, X3, X4, X5, X6) = p(X1) p(X2| X1)p(X3| X2) p(X4| X1)p(X5| X4)p(X6| X2, X5)
Probabilistic Inference


Computing statistical queries regarding the network, e.g.:

Is node X independent on node Y given nodes Z,W ?

What is the probability of X=true if (Y=false and Z=true)?

What is the joint distribution of (X,Y) if R=false?

What is the likelihood of some full assignment?

What is the most likely assignment of values to all or a subset the nodes
of the network?
General purpose algorithms exist to fully automate such
computation

Computational cost depends on the topology of the network

Exact inference:


The junction tree algorithm
Approximate inference;

Loopy belief propagation, variational inference, Monte Carlo sampling
Two types of GMs

Directed edges give causality relationships (Bayesian
Network or Directed Graphical Model):

Undirected edges simply give correlations between
variables (Markov Random Field or Undirected
Graphical model):
Bayesian Network

Structure: DAG

Meaning: a node is
conditionally independent
of every other node in the
network outside its Markov
blanket

Location conditional
distributions (CPD) and the
DAG completely determines
the joint dist.
Ancestor
Y1
Y2
Parent
X
Child
Children's co-parent
Descendent
Local Structures &
Independencies

Common parent

B
Fixing B decouples A and C
"given the level of gene B, the levels of A and C are
independent"

C
A
Cascade

Knowing B decouples A and C
A
B
C
"given the level of gene B, the level gene A provides no
extra prediction value for the level of gene C"

V-structure

Knowing C couples A and B
because A can "explain away" B w.r.t. C
"If A correlates to C, then chance for B to also correlate to B
will decrease"

The language is compact, the
concepts are rich!
A
B
C
Why Bayesian Networks?

Sound statistical foundation and intuitive probabilistic
semantics

Compact and flexible representation of (in)dependency
structure of multivariate distributions and interactions

Natural for modeling global processes with local
interactions => good for biology

Natural for statistical confidence analysis of results and
answering of queries

Stochastic in nature: models stochastic processes & deals
(“sums out”) noise in measurements

General-purpose learning and inference

Capture causal relationships
Possible Biological Interpretation

Measured expression
level of each gene
Random variables
(node)
Gene interaction
Probabilistic dependencies
(edge)
B
Common cause
C
A

Intermediate gene

Common/combinatorial effects
A
B
C
A
B
C
More directed probabilistic
networks

Dynamic Bayesian Networks
(Ong et. al)

Temporal series "static"
gene
1

Probabilistic Relational
Models (Segal et al.)

Data fusion: integrating related data
from multiple sources
gene
1
gene
2 gene
3
gene
N
Module network (Segal et al.)


gene
2 gene
3
gene
N
Partition variables into modules that
share the same parents and the
same CPD.
Bayesian Network – CPDs
Local Probabilities: CPD - conditional probability
distribution P(Xi|Pai)

Discrete variables: Multinomial Distribution (can represent any kind
of statistical dependency)
Earthquake
Radio
Burglary
Alarm
Call
E
e
B
b
0.9
0.1
e
b
0.2
0.8
e
b
0.9
0.1
e
b
0.01
0.99
P(A | E,B)
Bayesian Network – CPDs (cont.)
Continuous variables: e.g. linear Gaussian
k
P (X |Y1 ,...,Yk ) ~ N (a0   ai yi ,  )
2
i 1
P(X | Y)

X
Y
Learning Bayesian Network

The goal:

Given set of independent samples (assignments of random
variables), find the best (the most likely?) Bayesian Network
(both DAG and CPDs)
R
B
E
B
E
R
A
A
C
C
(B,E,A,C,R)=(T,F,F,T,F)
(B,E,A,C,R)=(T,F,T,T,F)
……..
(B,E,A,C,R)=(F,T,T,T,F)
E
e
e
e
e
B
b
b
b
b
P(A | E,B)
0.9
0.1
0.2
0.8
0.9
0.1
0.01
0.99
Learning Bayesian Network
• Learning of best CPDs given DAG is easy
– collect statistics of values of each node given specific assignment to its parents
• Learning of the graph topology (structure) is NP-hard
– heuristic search must be applied, generally leads to a locally optimal network
• Overfitting
– It turns out, that richer structures give higher likelihood P(D|G) to the data
(adding an edge is always preferable)
A
B
C
A
B
C
P(C | A) ≤P(C | A, B)
– more parameters to fit => more freedom => always exist more "optimal" CPD(C)
• We prefer simpler (more explanatory) networks
– Practical scores regularize the likelihood improvement complex networks.
BN Learning Algorithms

Structural EM (Friedman 1998)

Expression data

Sparse Candidate Algorithm (Friedman et al.)




Learning Algorithm 
B
R
A
C
Discretizing array signals
Hill-climbing search using local operators: add/delete/swap
of a single edge
Feature extraction: Markov relations, order relations
Re-assemble high-confidence sub-networks from features
Module network learning (Segal et al.)

E
The original algorithm



Heuristic search of structure in a "module graph"
Module assignment
Parameter sharing
Prior knowledge: possible regulators (TF genes)
Confidence Estimates
Bootstrap approach:
B
E
D1
Learn
R
A
C
E
D
resample
D2
Learn
R
1
C (f )   1f  Gi 
m i 1
m
A
C
...
Estimate “Confidence level”:
B
Dm
E
R
B
A
Learn
C
Results from SCA + feature
extraction (Friedman et al.)
KAR4
SST2
TEC1
NDJ1
KSS1 FUS1
AGA2 TOM6
YLR343W
YLR334C
The initially learned network of
~800 genes
MFA1
STE6
PRM1 AGA1
FIG1
FUS3
YEL059W
The “mating response” substructure
A Module Network
Nature Genetics 34, 166 - 176 (2003)
Gaussian Graphical Models

Why?
Sometimes an UNDIRECTED
association graph makes more
sense and/or is more
informative
gene expressions may be
influenced by unobserved factor
that are post-transcriptionally
regulated

B
B
C
A

A
B
C
A
C
The unavailability of the state of B
results in a constrain over A and C
Covariance Selection

Multivariate Gaussian over all continuous expressions
p([ x1 ,..., xn ]) =

1
(2 ) | S |
n
2
1
2
T -1 
1 
{
exp - 2 ( x -  ) S ( x -  )}
The precision matrix KS-1 reveals the topology of the
(undirected) network
E ( xi | x-i ) = ∑(K ij / K ii ) x j


Edge ~ |Kij| > 0
j
Learning Algorithm: Covariance selection

Want a sparse matrix


Regression for each node with degree constraint (Dobra et al.)
Regression for each node with hierarchical Bayesian prior (Li, et al)
Gene modules from identified
using GGM (Li, Yang and Xing)
GGM
A comparison of BN and GGM:
BN
2: Protein-DNA Interaction
Network

Expression networks are not necessarily causal

BNs are indefinable only up to Markov equivalence:
B
and
B
C
C
A
A
can give the same optimal score, but not further distinguishable under a likelihood
score unless further experiment from perturbation is performed


GGM have yields functional modules, but no causal semantics
TF-motif interactions provide direct evidence of casual,
regulatory dependencies among genes

stronger evidence than expression correlations

indicating presence of binding sites on target gene -- more easily verifiable

disadvantage: often very noisy, only applies to cell-cultures, restricted to known
TFs ...
ChIP-chip analysis
Simon I et al (Young RA) Cell 2001(106):697-708
Ren B et al (Young RA) Science 2000 (290):2306-2309
Advantages:
- Identifies “all” the sites where a TF
binds “in vivo” under the experimental
condition.
Limitations:
- Expense: Only 1 TF per experiment
- Feasibility: need an antibody for the TF.
- Prior knowledge: need to know what TF
to test.
Transcription network
[Horak, et al, Genes & Development, 16:3017-3033]
Often reduced to a clustering problem


A transcriptional module can be defined as a set of genes that are:
1.
co-bound (bound by the same set of TFs, up to limits of experimental noise) and
2.
co-expressed (has the same expression pattern, up to limits of experimental noise).
Genes in the same module are likely to be co-regulated, and hence likely
have a common biological function.
Inferring transcription network

+
+
Array/motif/chip data

Learning Algorithm
Clustering + Post-processing

SAMBA (Tanay et al.)

GRAM (Genetic RegulAtory Modules) (BarJoseph et al.)
Bayesian network

Joint clustering model for mRNA signal, motif,
and binding score (Segal, et al.)

Using binding data to define Informative
structural prior (Bernard & Hartemink)
(the only reported work that directly learn the network)
The SAMBA model
edge
conditions
The GE (gene-condition) matrix
no edge
Goal : Find high
similarity submatrices
Goal : Find dense
G=(U,V,E)
subgraphs
The SAMBA approach

Normalization: translate gene/experiment matrix to a
weighted bipartite graph using a statistical model for the
data
Bicluster Random
Subgraph Model
Background Random
Graph Model

=
Likelihood model
translates to sum of
weights over edges
and non edges
Bicluster model: Heavy subgraphs

Combined hashing and local optimization (Tanay et al.): incrementally
finds heavy-weight biclusters that can partially overlap

Other algorithms are also applicable, e.g., spectral graph cut, SDP: but
they mostly finds disjoint clusters
Date pre-processing
From experiments to properties:
p2
p1
p1
p2
p3
Strong
Medium
Medium
Induction Induction Repression
p1
Strong
Binding to
TF T
p2
Medium
Binding to
TF T
Strong complex
binding to
protein P
Medium complex
binding to
Protein P
p4
gene g
Strong
Repression
p1
p2
High
Medium
Sensitivity Sensitivity
p1
High Confidence
Interaction
p2
Medium Confidence
Interaction
A SAMBA module
Genes
Properties
CPA1
GO annotations
CPA2
Post-clustering analysis

Using topological properties to identify genes connecting
several processes.


Specialized groups of genes (transport, signaling) are more likely to
bridge modules.
Inferring the transcription network

Genes in modules with expression properties should be driven by at
least one common TF

There exist gaps between information sources (condition specific TF
binding?)

Enriched binding motifs help identifying “missing TF profiles”
AP-1
STRE
Fhl1
Rap1
CGATGAG
AGGGG
AAAATTTT
Ribosom
Biogenesys
RNA
Processing
Stress
Protein
Biosynthesis
The GRAM model
Cluster genes that
co-express
Statistically significant
union
Group genes bound by
(nearly) the same set of TFs
The GRAM algorithm

Exhaustively search all subsets of TFs
(starting with the largest sets); find
genes bound by subset F with
significance p: G(F,p)

Arg80
Arg81
Leu3
Gcn4
g1
1
1
0
1
g2
1
0
1
1
g3
1
1
1
1
g4
0
0
0
0
g5
1
1
1
1
g6
1
1
0
1
Find an r-tight gene cluster with
centroid c*: B(c*,r)|, s.t.
r
c* = argmaxc |G(F,p) ∩ B(c,r)|

Some local massage to further
improve cluster significance
Arg80
Arg81
Leu3
Gcn4
g1
.0004
.00003
.33
.0004
g2
.00002 0.0006
.02
g3
.0007
.002
g4
.007
.2
g5
.00001 .00001
g6
.00001 .00007
√
√
.15
.0002 √
x
0.04
.7
.0001 .0002 √
.5
.0001 √
.0001
Regulatory Interactions revealed
by GRAM
The Full Bayesian Network
Approach
TF Binding sites
regulation
indicator
p-value for
localization assay
experimental
conditions
expression
level
The Full BN Approach, cond.

Introduce random variables for each observations, and
latent states/events to be modeled

A network with hidden variables:


regulation indicator and binding sites not observed
Results in a very complex model that integrates motif
search, gene activation/repression function, mRNA level
distribution, ...

Advantage: in principle allow heterogeneous information to flow and
cross influence to produce a globally optimal and consistent solution

Disadvantage: in practice, a very difficult statistical inference and
learning problem. Lots of heuristic tricks are needed, computational
solutions are usually not a globally optimum.
A Bayesian Learning Approach
For model p(A|Q ):

Treat parameters/model Q as unobserved random variable(s)

probabilistic statements of Q is conditioned on the values of
the observed variables Aobs
p(Q )
B
E
R
A
R
C
•
Bayes’ rule:
Bayesian estimation:
A
C
p(Q | A)  p( A | Q) p(Q)
posterior
•
B
E
likelihood
prior
Q Bayes   Qp(Q | A)dQ
Informative Structure Prior
(Bernard & Hartemink)

If we observe TF i binds Gene j from location assay, then the
probability of have an edge between node i and j in a
expression network is increased

This leads to a prior probability of an edge being present:

This is like adding a regularizer (a compensating score) to the
originally likelihood score of an expression network

The regularized score is now ...
Informative Structure Prior
(Bernard & Hartemink)

The regularized score of an expression network under
an informative structure prior is:
Score = P(array profile|Chip profile)
= logP(array profile|S) + logP(S)

Now we run the usual structural learning algorithm as in
expression network, but now optimizing
logP(array profile|S) + logP(S), instead logP(array profile|S)
We've found a network…now
what?

How can we validate our results?

Literature.

Curated databases (e.g., GO/MIPS/TRANSFAC).

Other high throughput data sources.

“Randomized” versions of data.

New experiments.

Active learning: proposing the "most informative" new experiments
What is next?

We are still far away from being able to correctly infer
cis-regulatory network in silico, even in a simple multicellular organism


Open computational problems:

Causality inference in network

Algorithms for identifying TF bindings motifs and motif clusters directly from
genomic sequences in higher organisms

Techniques for analyzing temporal-spatial patterns of gene expression (i.e., whole
embryo in situ hybridization pattern)

Probabilistic inference and learning algorithms that handle large networks
Experimental limitation:

Binding location assay can not yet be performed inexpensively in vivo in multicellular organisms

Measuring the temporal-spatial patterns of gene expression in multi-cellular
organisms is not yet high-throughput

Perturbation experiment still low-throughput
More "Direct" Approaches

Principle:


using "direct" structural/sequence/biochemical level evidences of network
connectivity, rather than "indirect" statistical correlations of network products, to
recover the network topology.
Algorithmic approach

Supervised and de novo motif search

Protein function prediction

protein-protein interactions
Networks of cisregulatory pathway
This strategy has been a hot trend, but faces many algorithmic challenges

Experimental approach

In principle feasible via systematic mutational perturbations

In practice very tedious

The ultimate ground truth
A thirty-year network for a single
gene
The Endo16 Model (Eric Davidson, with tens of students/postdocs over years)
Acknowledgments
Sources of some of
the slides:
Mark Gerstein, Yale
Nir Friedman, Hebrew U
Ron Shamir, TAU
Irene Ong, Wisconsin
Georg Gerber, MIT
References
Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D.
Pe'er. In Proc. Fourth Annual Inter. Conf. on Computational Molecular Biology RECOMB, 2000.
Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression,
E. Segal, R. Yelensky, and D. Koller. Bioinformatics, 19(Suppl 1): 273--82, 2003.
A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules Joshua M.
Stuart, Eran Segal, Daphne Koller, Stuart K. Kim, Science, 302 (5643):249-55, 2003.
Module networks: identifying regulatory modules and their condition-specific regulators from
gene expression data. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N.
Nature Genetics, 34(2): 166-76, 2003.
Learning Module Networks E. Segal, N. FriedmanD. Pe'er, A. Regev, and D. Koller.
In Proc. Nineteenth Conf. on Uncertainty in Artificial Intelligence (UAI), 2003
From Promoter Sequence to Expression: A Probabilistic Framework, E. Segal, Y. Barash, I.
Simon, N. Friedman, and D. Koller. In Proc. 6th Inter. Conf. on Research in Computational
Molecular Biology (RECOMB), 2002
Modelling Regulatory Pathways in E.coli from Time Series Expression Profiles, Ong, I. M., J. D.
Glasner and D. Page, Bioinformatics, 18:241S-248S, 2002.
References
Sparse graphical models for exploring gene expression data, Adrian Dobra, Beatrix Jones, Chris
Hans, Joseph R. Nevins and Mike West, J. Mult. Analysis, 90, 196-212, 2004.
Experiments in stochastic computation for high-dimensional graphical models, Beatrix Jones,
Adrian Dobra, Carlos Carvalho, Chris Hans, Chris Carter and Mike West, Statistical Science,
2005.
Inferring regulatory network using a hierarchical Bayesian graphical gaussian model, Fan Li,
Yiming Yang, Eric Xing (working paper) 2005.
Computational discovery of gene modules and regulatory networks. Z. Bar-Joseph*, G. Gerber*,
T. Lee*, N. Rinaldi, J. Yoo, F. Robert, B. Gordon, E. Fraenkel, T. Jaakkola, R. Young, and D.
Gifford. Nature Biotechnology, 21(11) pp. 1337-42, 2003.
Revealing modularity and organization in the yeast molecular network by integrated analysis of
highly heterogeneous genomewide data, A. Tanay, R. Sharan, M. Kupiec, R. Shamir Proc.
National Academy of Science USA 101 (9) 2981--2986, 2004.
Informative Structure Priors: Joint Learning of Dynamic Regulatory Networks from Multiple Types
of Data, Bernard, A. & Hartemink, A. In Pacific Symposium on Biocomputing 2005 (PSB05),
Download