Gene Networks

Genetic Networks
Cellular Networks
• Most processes in the cell are controlled by “networks” of interacting molecules:
  – Metabolic networks
  – Signal transduction networks
  – Regulatory networks
Unifying View
• The cell as a “state machine”
• Cell state S = (P1, P2, …, R1, R2, …, m1, m2, …)
  – P: proteins, R: mRNA molecules, m: metabolites
• Each cell, at any given time, can be characterized by its state S
• Dynamics: Input(t), S(t) => S(t+Δt)
What does it mean?
• Steady cell state – cell type
  – Neuron
  – RBC
  – Muscle cell
  – Tumor cell
• Dynamics – cellular process
  – Differentiation
  – Apoptosis
  – Cell cycle
Gene Regulation Networks
• Regulation of gene expression is crucial
• Regulation occurs at many stages:
  – pre-transcriptional (chromatin structure)
  – transcription initiation
  – RNA editing (splicing) and transport
  – translation initiation
  – post-translational modification
  – RNA and protein degradation
• Understanding regulatory processes is a central problem of biological research
Genetic Network Models: Goals
• Incorporate rule-based dependencies between genes
  – Rule-based dependencies may constitute important biological information.
• Allow systematic study of global network dynamics
  – In particular, the effects of individual genes on long-run network behavior.
• Cope with uncertainty
  – Small sample size, noisy measurements, biological “noise”
• Quantify the relative influence and sensitivity of genes in their interactions with other genes
  – This allows us to focus on individual genes or groups of genes.
• What model should we use?
Level of Biochemical Detail
• Detailed models require lots of data!
• Highly detailed biochemical models are only feasible for very small systems that have been extensively studied
• Example: Arkin et al. (1998), Genetics 149(4):1633-48, lysis-lysogeny switch in lambda phage: 5 genes, 67 parameters based on 50 years of research; the stochastic simulation required a supercomputer
Example: Lysis-Lysogeny
Arkin et al. (1998),
Genetics 149(4):1633-48
Level of Biochemical Detail
• In-depth biochemical simulation of, e.g., a whole cell is infeasible (so far)
• Less detailed network models are useful when data is scarce and/or the network structure is unknown
• Once the network structure has been determined, we can refine the model
Boolean or Continuous?
• Boolean networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.
[Figure: input genes A and B and output gene C, with C = A AND B; shown with A = 0, B = 1, C = 0]
• Allow analysis at the network level
• Provide useful insights into network dynamics
• Algorithms exist for network inference from binary data
Boolean Formalism: Cons
• The Boolean abstraction is a poor fit to real data
• Cannot model important concepts:
  – amplification of a signal
  – subtraction and addition of signals
  – compensation for a smoothly varying environmental parameter (e.g. temperature, nutrients)
  – varying dynamical behavior (e.g. cell cycle period)
• Feedback control: negative feedback is used to stabilize expression
  – causes oscillation in a Boolean model
Boolean Formalism: Pros
• Studies of Boolean models give rise to qualitative phenomena that are also observed by experimentalists.
• Some studied systems exhibit multiple steady states and “switchlike” transitions between them.
• It has been shown experimentally that such systems are “robust” to the exact values of the kinetic parameters of individual reactions.
Concentrations or Molecules?
• Using concentrations assumes that individual molecules can be ignored
• There are known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)
  – Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)
• Significantly increases model complexity
Concentrations or Molecules?
• Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.
• Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell (Holstege et al. (1998), Cell 95:717-728).
• Human: 95% of the transcriptome is expressed at <5 copies/cell (Velculescu et al. (1997), Cell 88:243-251).
Spatial or Non-Spatial
• Spatiality introduces additional complexity:
  – intercellular interactions
  – spatial differentiation
  – cell compartments
  – cell types
• Spatial patterns also provide more data, e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454.
• Few (no?) large-scale spatial gene expression data sets are available so far.
Example: Drosophila Segmentation
[Figure: expression of transcription factors in the embryo along the anterior-posterior axis, from low to high: hb, bcd, gt, Kr and eve (even-skipped) stripe 2]
Deterministic or Stochastic?
• Many sources of stochasticity
  – Biological stochasticity
  – Experimental noise
• Stochastic models can account for those
• Deterministic models are usually simpler to analyze (dynamics, steady states) and interpret
Modeling Approaches
• Boolean Networks
• Linear Models
• Bayesian Networks
Boolean Network
What is a Boolean Network?
• A Boolean network is a kind of graph
• G(V, F): V is a set of nodes (genes), F is a list of Boolean functions
• Every node has only two values: ON (1) and OFF (0)
• Every function f_i gives the resulting value of its node x_i:
  $x_i = f_i(x_1, x_2, \dots, x_n)$
• Representations: standard, wiring, automata
What is a Boolean Network?
• Attractor: certain states are revisited infinitely often, depending on the initial starting state.
• Basin of attraction
• Limit-cycle attractor
Boolean Network Example
[Figure: wiring diagram G’(V’, F’) linking the nodes (genes) x1, x2, x3 at time t to x1, x2, x3 at time t+1; each node is 0 or 1, and each edge either activates or inactivates its target gene]
V = {x1, x2, x3}; each node x_i is updated by a Boolean function f_i(x1, x2, x3)
Trajectory example:
Iteration  x1  x2  x3
1          1   1   0
2          1   1   1
3          0   1   1
4          0   0   1
5          0   0   0
6          0   0   0
Boolean Network Example
Nodes (genes) x1, x2, x3, each 0 or 1, with update rules:
f1(x1, x2, x3) = x2 AND (NOT x3)
f2(x1, x2, x3) = x1
f3(x1, x2, x3) = x2
Trajectory 1 (iterations 1 to 6): 110 → 111 → 011 → 001 → 000 → 000
Trajectory 2 (Start! at 100): 100 → 010 → 101 → 010 → …
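The two trajectories can be reproduced with a few lines of code. This is a minimal sketch, not the slides' own implementation: the function names `step` and `trajectory` are made up for illustration, and the rule for x1 is assumed to be x2 AND (NOT x3), which is the only Boolean function of x2 and x3 consistent with the trajectory table.

```python
def step(state):
    """One synchronous update of the three-gene example network."""
    x1, x2, x3 = state
    # Assumed rules: f1 = x2 AND (NOT x3), f2 = x1, f3 = x2.
    return (x2 & (1 - x3), x1, x2)


def trajectory(start, n_steps):
    """Return the start state followed by n_steps successive states."""
    states = [start]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


traj1 = trajectory((1, 1, 0), 5)  # ends in the fixed point 000
traj2 = trajectory((1, 0, 0), 4)  # falls into the cycle 010 <-> 101
print(traj1)
print(traj2)
```

All genes are updated at once from the previous state (synchronous update), which is what the iteration table implies.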
Basic Structure of Boolean Networks
[Figure: genes A and B regulate gene X]
•Each node is a gene
•1 means active/expressed
•0 means inactive/unexpressed
Boolean function:
A B | X
0 0 | 1
0 1 | 1
1 0 | 0
1 1 | 1
In this example, two genes (A and B) regulate gene X. In principle, any number of “input” genes is possible. Positive/negative feedback is also common (and necessary for homeostasis).
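A Boolean function with a small number of inputs can be stored directly as a lookup table. The sketch below encodes the truth table for X; the names `TRUTH_TABLE_X` and `next_x` are illustrative only.

```python
# Truth table from the slide: (A, B) -> X.
TRUTH_TABLE_X = {
    (0, 0): 1,
    (0, 1): 1,
    (1, 0): 0,
    (1, 1): 1,
}


def next_x(a, b):
    """Look up the value of X for the given values of A and B."""
    return TRUTH_TABLE_X[(a, b)]


# The table is the same function as (NOT A) OR B.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, next_x(a, b), int((not a) or b))
```

A table with k inputs has 2^k rows, so this representation is only practical when each gene has few regulators.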
Dynamics of Boolean Networks
[Figure: genes A, B, C, D, E, F at two successive time points, with activity patterns 011010 and 110110]
At a given time point, all the genes form a genome-wide gene activity pattern (GAP), a binary string of length n.
Consider the state space formed by all possible GAPs.
State Space of Boolean Networks
• Similar GAPs lie close together.
• There is an inherent directionality in the state space.
• Some states are attractors (or limit-cycle attractors). The system may alternate between several attractors.
• Other states are transient.
Picture generated using the program DDLab.
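For a small network the whole state space can be enumerated to find the attractors and their basins. The sketch below does this for the three-gene example network, assuming the rules x1 ← x2 AND (NOT x3), x2 ← x1, x3 ← x2. The helper names are hypothetical, and this is not how DDLab works internally.

```python
from itertools import product


def step(state):
    """Synchronous update of the three-gene example network (assumed rules)."""
    x1, x2, x3 = state
    return (x2 & (1 - x3), x1, x2)


def find_attractor(state):
    """Follow the trajectory from `state` until a state repeats; return the cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    cycle = seen[seen.index(state):]
    # Rotate the cycle so that equal cycles always get the same representation.
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])


# Group all 2^3 states by the attractor they eventually reach.
basins = {}
for s in product((0, 1), repeat=3):
    basins.setdefault(find_attractor(s), []).append(s)

for attractor, basin in basins.items():
    print("attractor", attractor, "basin size", len(basin))
```

Under these rules the network has one fixed point (000) and one limit cycle of length two (010 and 101). Every other state is transient.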
Reverse Engineering Problem
Can we infer the structure and rules of a genetic network
from gene expression measurements?
Reverse Engineering Problem
• Input: gene expression data
• Output: network structure and parameters (or regulation rules)
Gene Expression Time Series Data
[Figure: expression time courses of gene 1, gene 2 and gene 3; horizontal axis: time (min), 0 to 60]
Problem: how can these data be used to infer how these three genes influence each other?
Modelling Gene Expression Data
[Figure: expression time courses of gene 1, gene 2 and gene 3; horizontal axis: time (min), 0 to 60]
• Assume that genes exist in two states: on and off
• If the expression of gene i is above level t_i, consider it on; otherwise, consider it off
Modelling Gene Expression Data
[Figure: the same time courses with the threshold levels t1, t2 and t3 drawn in and each segment of each curve labelled on or off]
Modelling Gene Expression Data
• We obtain the following discretized gene expression data:
time    0  5  10 15 20 25 30 35 40 45 50 55
gene 1  0  0  0  0  0  0  1  1  1  1  1  1
gene 2  0  0  0  0  0  0  0  1  1  0  0  0
gene 3  1  1  1  1  1  1  1  0  0  0  0  0
• The gene expression data is now in the form of bit streams
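The thresholding step can be sketched as follows. The expression values and thresholds here are invented for illustration and are not the data behind the slides; `discretize` is a hypothetical helper name.

```python
def discretize(series, threshold):
    """Map each measurement to 1 (on) if it is above the threshold, else 0 (off)."""
    return [1 if value > threshold else 0 for value in series]


# Hypothetical measurements and per-gene thresholds t_i (not from the slides).
expression = {
    "gene 1": [0.2, 0.3, 1.4, 1.6],
    "gene 2": [0.1, 0.9, 1.1, 0.2],
}
thresholds = {"gene 1": 1.0, "gene 2": 0.5}

bit_streams = {
    gene: discretize(values, thresholds[gene])
    for gene, values in expression.items()
}
print(bit_streams)
```

A measurement exactly at the threshold is treated as off here, matching "above level t_i" in the text.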
Information Theoretic Tools
• We define some necessary information theoretic tools:
Shannon entropy of a data stream
H(X) = - ∑_i p_i log(p_i)
where p_i is the probability that a random element of data stream X is i
(the base of the logarithm can be anything, but must be consistent throughout; usually we use base 2)
Information Theoretic Tools
e.g. Shannon entropy of data streams X and Y
X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
H(X) = - ∑_i p_i log2(p_i)
= -(p(X=0) log2(p(X=0)) + p(X=1) log2(p(X=1)))
= -(0.4 log2(0.4) + 0.6 log2(0.6))
= 0.971
H(Y) = - ∑_i p_i log2(p_i)
= -(0.5 log2(0.5) + 0.5 log2(0.5))
= 1.0
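The entropy calculation is easy to check numerically. The sketch below estimates the probabilities from symbol frequencies and uses base-2 logarithms; the function name `entropy` is illustrative.

```python
from collections import Counter
from math import log2


def entropy(stream):
    """Shannon entropy (in bits) of a sequence, using observed frequencies."""
    n = len(stream)
    return -sum((count / n) * log2(count / n) for count in Counter(stream).values())


X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

print(round(entropy(X), 3))  # 0.971
print(entropy(Y))            # 1.0
```

Symbols that never occur do not appear in the `Counter`, so the 0·log(0) case does not arise.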
Information Theoretic Tools
e.g. Shannon joint entropy of data streams X and Y
X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
H(X, Y) = - ∑ p(x, y) log2(p(x, y))
= -(p(X=0,Y=0) log2(p(X=0,Y=0)) + p(X=1,Y=0) log2(p(X=1,Y=0))
+ p(X=0,Y=1) log2(p(X=0,Y=1)) + p(X=1,Y=1) log2(p(X=1,Y=1)))
= -(0.1 log2(0.1) + 0.4 log2(0.4)
+ 0.3 log2(0.3) + 0.2 log2(0.2))
= 1.85
Information Theoretic Tools
Define:
Conditional Entropy
H(X|Y) = H(X, Y) – H(Y)
H(Y|X) = H(X, Y) – H(X)
Mutual Information
M(X, Y) = H(Y) - H(Y|X)
= H(X) - H(X|Y)
= H(X) + H(Y) - H(X,Y)
Information Theoretic Tools
It is easy to show that:
• Let X be an input data stream and Y be an output data stream
• If M(Y, X) = H(Y), then X exactly determines Y
• Look for pairs (X, Y) where M(Y_{t+1}, X_t) = H(Y_{t+1})
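The test "M(Y, X) = H(Y) means X determines Y" can be tried on the example streams. The sketch below computes the joint entropy by treating each (x, y) pair as one symbol, then uses M(X, Y) = H(X) + H(Y) - H(X, Y). Function names are illustrative, and the stream Z = NOT X is an invented example of a fully determined output.

```python
from collections import Counter
from math import log2


def entropy(stream):
    """Shannon entropy (in bits) of a sequence, using observed frequencies."""
    n = len(stream)
    return -sum((count / n) * log2(count / n) for count in Counter(stream).values())


def joint_entropy(xs, ys):
    """Entropy of the stream of (x, y) pairs."""
    return entropy(list(zip(xs, ys)))


def mutual_information(xs, ys):
    """M(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
Z = [1 - x for x in X]  # invented stream that X determines exactly

print(round(joint_entropy(X, Y), 2))           # 1.85
print(mutual_information(X, Y) < entropy(Y))   # True: X does not determine Y
print(abs(mutual_information(X, Z) - entropy(Z)) < 1e-9)  # True: X determines Z
```

The comparison uses a small tolerance because the entropies are floating-point sums.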
Identification of the Network Graph
• Back to the data (the time index restarts where a new series begins):
time    1 2 3 4 5 6 | 1 2 3 | 1 2 3 | 1 2
gene A  0 0 1 1 1 1 | 0 1 1 | 0 1 1 | 1 1
gene B  0 0 0 1 0 0 | 1 0 1 | 1 0 1 | 1 1
gene C  0 1 1 0 0 0 | 0 1 0 | 1 0 0 | 1 0
• Step 1: put the data in “state transition table” form
Identification of the Network Graph
• State transition table:
Input stream value        | Output stream value
A_{i-1} B_{i-1} C_{i-1}   | A_i B_i C_i
0       0       0         | 0   0   1
0       0       1         | 1   0   1
0       1       0         | 0   0   1
0       1       1         | 1   0   1
1       0       0         | 1   0   0
1       0       1         | 1   1   0
1       1       0         | 1   0   0
1       1       1         | 1   1   0
Identification of the Network Graph
• The state transition table is a lookup table that tells us how to get from state i – 1 to state i
• However, it is difficult to discern functional relationships from it, so…
• Step 2: use information theoretic tools to discover which inputs determine the outputs
Identification of the Network Graph
• Step 2a: calculate entropies
Note: lim_{x→0+} x^x = 1, so x log(x) → 0 as x approaches 0 from the right, and we take (0)log(0) = 0.
H(A_i) = -((0.25)log(0.25) + (0.75)log(0.75)) = 0.81
H(B_i) = -((0.75)log(0.75) + (0.25)log(0.25)) = 0.81
H(C_i) = -((0.5)log(0.5) + (0.5)log(0.5)) = 1
H(A_{i-1}) = H(B_{i-1}) = H(C_{i-1}) = -((0.5)log(0.5) + (0.5)log(0.5)) = 1
H(A_{i-1}, C_{i-1}) = -((0.25)log(0.25) + (0.25)log(0.25)
+ (0.25)log(0.25) + (0.25)log(0.25)) = 2
Identification of the Network Graph
• Step 2a: calculate entropies
H(A_i, A_{i-1}, C_{i-1}) = -((0.25)log(0.25) + (0.25)log(0.25)
+ (0.25)log(0.25) + (0.25)log(0.25)) = 2
H(B_i, A_{i-1}, C_{i-1}) = -((0.25)log(0.25) + (0.25)log(0.25)
+ (0.25)log(0.25) + (0.25)log(0.25)) = 2
H(C_i, A_{i-1}) = -((0.5)log(0.5) + (0.5)log(0.5)) = 1
Identification of the Network Graph
• Step 2b: calculate mutual information
M(A_i, [A_{i-1}, C_{i-1}]) = H(A_i) + H(A_{i-1}, C_{i-1}) - H(A_i, A_{i-1}, C_{i-1})
= 0.81 + 2 – 2
= 0.81
= H(A_i), therefore A_{i-1} and C_{i-1} determine A_i
M(B_i, [A_{i-1}, C_{i-1}]) = H(B_i) + H(A_{i-1}, C_{i-1}) - H(B_i, A_{i-1}, C_{i-1})
= 0.81 + 2 – 2
= 0.81
= H(B_i), therefore A_{i-1} and C_{i-1} determine B_i
M(C_i, A_{i-1}) = H(C_i) + H(A_{i-1}) - H(C_i, A_{i-1})
= 1 + 1 – 1
= 1
= H(C_i), therefore A_{i-1} determines C_i
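Step 2 can be automated by trying input sets of increasing size for each output gene and stopping at the first set whose mutual information with the output equals the output's entropy. The sketch below is one plausible way to do this on the state transition table. The search order and all names are assumptions made for illustration, not the procedure used in the slides.

```python
from collections import Counter
from itertools import combinations
from math import log2

GENES = ("A", "B", "C")

# State transition table: (A, B, C) at step i-1 -> (A, B, C) at step i.
TRANSITIONS = [
    ((0, 0, 0), (0, 0, 1)),
    ((0, 0, 1), (1, 0, 1)),
    ((0, 1, 0), (0, 0, 1)),
    ((0, 1, 1), (1, 0, 1)),
    ((1, 0, 0), (1, 0, 0)),
    ((1, 0, 1), (1, 1, 0)),
    ((1, 1, 0), (1, 0, 0)),
    ((1, 1, 1), (1, 1, 0)),
]


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence, using observed frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def smallest_determining_inputs(out_idx):
    """Return the first smallest input set with M(output, inputs) = H(output)."""
    outputs = [nxt[out_idx] for _, nxt in TRANSITIONS]
    h_out = entropy(outputs)
    for k in range(1, len(GENES) + 1):
        for subset in combinations(range(len(GENES)), k):
            inputs = [tuple(prev[j] for j in subset) for prev, _ in TRANSITIONS]
            joint = list(zip(outputs, inputs))
            mutual = h_out + entropy(inputs) - entropy(joint)
            if abs(mutual - h_out) < 1e-9:
                return tuple(GENES[j] for j in subset)
    return None


wiring = {GENES[i]: smallest_determining_inputs(i) for i in range(len(GENES))}
print(wiring)
```

On this table the search recovers the same wiring as the hand calculation: A and C feed A, A and C feed B, and A alone feeds C.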
Identification of the Boolean Circuits
• Step 3: determine the functional relationship between variables (this is simply the truth table)
A_{i-1} C_{i-1} | A_i
0       0       | 0
0       1       | 1
1       0       | 1
1       1       | 1
A_i = A_{i-1} OR C_{i-1}
Identification of the Boolean Circuits
• Step 3: determine the functional relationship between variables
A_{i-1} C_{i-1} | B_i
0       0       | 0
0       1       | 0
1       0       | 0
1       1       | 1
B_i = A_{i-1} AND C_{i-1}
Identification of the Boolean Circuits
• Step 3: determine the functional relationship between variables
A_{i-1} | C_i
0       | 1
1       | 0
C_i = NOT A_{i-1}
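Once the inputs of each gene are known, the truth tables can be read off the state transition table by projecting each row onto those inputs. A minimal sketch follows. The helper name `truth_table` is hypothetical, and the index convention (A, B, C) = (0, 1, 2) is an assumption of this example.

```python
# State transition table: (A, B, C) at step i-1 -> (A, B, C) at step i.
TRANSITIONS = [
    ((0, 0, 0), (0, 0, 1)),
    ((0, 0, 1), (1, 0, 1)),
    ((0, 1, 0), (0, 0, 1)),
    ((0, 1, 1), (1, 0, 1)),
    ((1, 0, 0), (1, 0, 0)),
    ((1, 0, 1), (1, 1, 0)),
    ((1, 1, 0), (1, 0, 0)),
    ((1, 1, 1), (1, 1, 0)),
]


def truth_table(out_idx, in_idx):
    """Project the transitions onto the chosen inputs and one output gene."""
    table = {}
    for prev, nxt in TRANSITIONS:
        key = tuple(prev[j] for j in in_idx)
        value = nxt[out_idx]
        # The same input pattern must always give the same output.
        if table.setdefault(key, value) != value:
            raise ValueError("the chosen inputs do not determine the output")
    return table


table_a = truth_table(0, (0, 2))  # A_i from (A_{i-1}, C_{i-1})
table_b = truth_table(1, (0, 2))  # B_i from (A_{i-1}, C_{i-1})
table_c = truth_table(2, (0,))    # C_i from A_{i-1}

print(table_a)  # matches A OR C
print(table_b)  # matches A AND C
print(table_c)  # matches NOT A
```

The consistency check raises an error if the chosen inputs are not sufficient, which is a cheap sanity test of step 2.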
Problems With This Approach
• No theory exists for determining the discretization level t_i
• The assumption that genes can be modeled as either ‘on’ or ‘off’ may be sufficient for some genes, but will certainly not be sufficient for all genes
• Ignores noise of all kinds (experimental, biological)
Boolean networks are inherently deterministic
• Conceptually, the regularity of genetic function and interaction is not due to “hard-wired” logical rules, but rather to the intrinsic self-organizing stability of the dynamical system.
• Additionally, we may want to model an open system with inputs (stimuli) that affect the dynamics of the network.
• From an empirical viewpoint, the assumption of only one logical rule per gene may lead to incorrect conclusions when inferring these rules from gene expression measurements, as the latter are typically noisy and the number of samples is small relative to the number of parameters to be inferred.
Linear Models
• Basic model: weighted sum of inputs
  $y_i(t + \Delta t) = \sum_j w_{ji}\, y_j(t) + b_i$  or  $\frac{dy_i}{dt} = \sum_j w_{ji}\, y_j + b_i$
• Simple network representation:
  [Figure: genes g1 to g5 as nodes joined by weighted edges, e.g. w12, w23 and the self-loop w55]
• Only a first-order approximation
• Parameters of the model: weight matrix containing N×N interaction weights
• “Fitting” the model: find the parameters w_ji, b_i such that the model best fits the available data
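The discrete-time form of the model is a single weighted sum per gene. The sketch below applies one update step. The weights, biases and expression levels are invented for illustration, and `W[j][i]` is assumed to hold w_ji, the influence of gene j on gene i.

```python
def linear_step(y, W, b):
    """One update y_i(t + dt) = sum_j W[j][i] * y_j(t) + b[i]."""
    n = len(y)
    return [sum(W[j][i] * y[j] for j in range(n)) + b[i] for i in range(n)]


# Hypothetical two-gene network (values are not from the slides).
W = [
    [0.5, 1.0],  # effect of gene 0 on gene 0 and on gene 1
    [0.0, 0.5],  # effect of gene 1 on gene 0 and on gene 1
]
b = [0.1, 0.0]
y_t = [1.0, 2.0]

y_next = linear_step(y_t, W, b)
print(y_next)
```

With N genes there are N×N weights plus N bias terms to estimate, which is why the amount of data needed grows quickly with network size.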
Underdetermined problem!
• Assumes a fully connected network: we need at least as many data points (arrays, conditions) as variables (genes)!
• Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit
• No single solution; rather, an infinite number of parameter settings will all fit the data equally well
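A tiny example makes the non-uniqueness concrete. With two genes and only one observed transition, very different weight matrices reproduce the data exactly. All numbers below are invented for illustration, and the bias terms are set to zero to keep the sketch short.

```python
def predict(y, W):
    """Linear update without bias: y_i(t + dt) = sum_j W[j][i] * y_j(t)."""
    n = len(y)
    return [sum(W[j][i] * y[j] for j in range(n)) for i in range(n)]


# One observed transition for two genes (hypothetical data).
y_t = [1.0, 1.0]
y_observed = [2.0, 0.0]

# Two different candidate weight matrices, with W[j][i] = w_ji.
W_a = [[2.0, 0.0], [0.0, 0.0]]   # gene 0 activates itself; no other interactions
W_b = [[1.0, -1.0], [1.0, 1.0]]  # dense network with a repressive edge

fits_a = predict(y_t, W_a) == y_observed
fits_b = predict(y_t, W_b) == y_observed
print(fits_a, fits_b)  # both True: the data cannot tell the models apart
```

Four weights are being fitted to two observed values, so a whole family of solutions exists. More time points or extra constraints are needed to narrow it down.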
Solution 1: reduce N
Rather than trying to model all genes, we can reduce the dimensionality of the problem:
• Network of clusters: construct a linear model based on the cluster centroids
  – rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55, 1-3:129-136.
  – yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), Advances in Neural Information Processing Systems 12; van Someren et al. (2000), ISMB2000, 355-366.
• Network of principal components: linear model between “characteristic modes” of the data
  – Holter et al. (2001), PNAS 98(4):1693-1698.
Solution 2:
• Take advantage of additional information:
  – replicates
  – accuracy of measurements
  – smoothness of time series
  – …
• Most likely, the network will still be poorly constrained.
• We need a method to identify and extract those parts of the model that are well-determined and robust
Danger of Overfitting
• The linear model assumes every gene is regulated by all other genes (i.e. full connectivity)
• This is the richest model of its kind
  – Danger of overfitting the training data
  – Will result in poor prediction on new data
• Far from reality: only a few regulators for each gene