To: Lei Liu <leiliu@uiuc

A proposal to infer gene nets from array
data using evolutionary computation
Prof. Jay Mittenthal
Dept. of Cell & Structural Biology, UIUC
Alexander Kosorukoff
Depts. of General Engineering (formerly)
and Computer Science (now), UIUC
The problem: Find the gene net underlying a cell’s
time-varying behavior, using data from a temporal
sequence of microarrays.
The approach: Evolutionary computation. Let a
population of model gene nets that fits the available
data evolve in silico.
Topics:
Representing a gene net:
reactions
topology
kinematics
dynamics
Representing the evolution of a gene net
Simulating the evolution of gene nets
Gene net: A network of reactions
in which genes regulate one another’s activity.
Relevant classes of reactions:
Transcription:
 (TFs to activate G) + gene G → mRNA M
Translation: mRNA M → protein P
Regulation of protein activity: S + P → P*
The topology (connectivity) of a simple gene net:
A monostable flip-flop.
S → P5, P6.
[Diagram: S → TF1* → TF3 → P5; TF2 → TF4 → P6; TF1* blocks the TF4 branch.]
Reaction list (incomplete):
S + TF1 → TF1*
TF1* + G3 → M3
M3 → TF3
TF3 + G5 → M5
M5 → P5
TF2 + NOT TF1* + G4 → M4
M4 → TF4
TF4 + G6 → M6
M6 → P6
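The reaction list can be read as a set of boolean update rules. The sketch below is only an illustration under stated assumptions: updates are synchronous, TF1 and TF2 are always present, the genes G3 to G6 are always available, and every other molecule lasts one step unless it is made again. The function names `step` and `run` are hypothetical.

```python
def step(state, s):
    """One synchronous boolean update; s is the signal S (True or False)."""
    return {
        "TF1": state["TF1"],                       # assumed always present
        "TF2": state["TF2"],                       # assumed always present
        "TF1*": s and state["TF1"],                # S + TF1 -> TF1*
        "M3": state["TF1*"],                       # TF1* + G3 -> M3
        "TF3": state["M3"],                        # M3 -> TF3
        "M5": state["TF3"],                        # TF3 + G5 -> M5
        "P5": state["M5"],                         # M5 -> P5
        "M4": state["TF2"] and not state["TF1*"],  # TF2 + NOT TF1* + G4 -> M4
        "TF4": state["M4"],                        # M4 -> TF4
        "M6": state["TF4"],                        # TF4 + G6 -> M6
        "P6": state["M6"],                         # M6 -> P6
    }


def run(signal_sequence):
    """Return the list of states, starting from a cell holding only TF1 and TF2."""
    names = ["TF1", "TF2", "TF1*", "M3", "TF3", "M5", "P5", "M4", "TF4", "M6", "P6"]
    state = {name: name in ("TF1", "TF2") for name in names}
    history = [state]
    for s in signal_sequence:
        state = step(state, s)
        history.append(state)
    return history


without_signal = run([False] * 6)[-1]
with_signal = run([False] * 6 + [True] * 6)[-1]
print(without_signal["P5"], without_signal["P6"])
print(with_signal["P5"], with_signal["P6"])
```

Under these assumptions the net settles on P6 when S is absent and switches to P5 once S has been present for a few steps.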
Reactions:
Even a simple gene net has several reactions of
genes, mRNAs, proteins, and small molecules.
mRNAs, proteins, and small molecules are
synthesized, but also degraded (more reactions, not
shown).
Genes and their regulation:
Gene regulation can involve switching among two
or more forms of a molecule.
TFs can activate or repress transcription. A given
TF can activate one gene and repress another.
The topology of a gene net can be represented with a
bipartite directed hypergraph.
Bipartite: Molecular nodes and reaction nodes.
[Diagram: molecular nodes S and TF1 joined through a reaction node to TF1*.]
Directed: Reactions have thermodynamically
favored directions.
Hypergraph: In a reaction the production of
outputs can depend on several inputs jointly.
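One possible way to hold such a hypergraph in code is to store each reaction as a node with a tuple of input molecules and a tuple of output molecules. This is a minimal sketch, not the authors' data structure; the names `Reaction`, `build_bipartite`, and `downstream`, and the labels r1 to r3, are hypothetical.

```python
from collections import namedtuple

# A reaction node joins several input molecules to its output molecules.
Reaction = namedtuple("Reaction", ["name", "inputs", "outputs"])

reactions = [
    Reaction("r1", ("S", "TF1"), ("TF1*",)),
    Reaction("r2", ("TF1*", "G3"), ("M3",)),
    Reaction("r3", ("M3",), ("TF3",)),
]


def build_bipartite(reactions):
    """Return directed edges: molecule -> reaction and reaction -> molecule."""
    edges = []
    for r in reactions:
        for molecule in r.inputs:
            edges.append((molecule, r.name))
        for molecule in r.outputs:
            edges.append((r.name, molecule))
    return edges


def downstream(molecule, reactions):
    """Molecules reachable from `molecule` by following reactions forward.

    A reaction is followed as soon as one of its inputs is reached, so this
    is plain reachability; it ignores the joint dependence on all inputs.
    """
    reached, frontier = set(), [molecule]
    while frontier:
        current = frontier.pop()
        for r in reactions:
            if current in r.inputs:
                for out in r.outputs:
                    if out not in reached:
                        reached.add(out)
                        frontier.append(out)
    return reached


print(build_bipartite(reactions))
print(downstream("S", reactions))
```

Keeping the inputs together in one tuple is what makes this a hypergraph: the edge into TF1* carries both S and TF1 at once.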
Kinematics of a gene net:
[Figure: time courses of S, TF2, TF1, TF1*, M3, TF3, M5, P5, M4, TF4, M6, and P6, plotted against time.]
mRNA levels are a small part of the relevant data.
Protein data echo mRNA data with a delay, which
may be long if there is post-transcriptional
regulation.
Protein data should distinguish different forms of a
protein (TF1 vs. TF1*).
Dynamics of a gene net:
From the topology of the network, form a reaction
list. e.g.
reaction i: gene i + TFs for i → mRNA i
Use the reaction list to specify dynamical equations
governing the synthesis and degradation of each
molecule.
boolean:
synthesis: G1 + A + (B OR C) + NOT D → M1
degradation: specify lifetime of molecule
continuous: ordinary differential equation (ODE)
d[M1]/dt = F{TFs} - k*[M1]
(first term: synthesis; second term: degradation)
parameters: binding and rate constants;
weights of TFs
Solve the system of dynamical equations
to generate time courses
of concentrations of molecules.
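As an illustration of generating a time course, the sketch below integrates a single equation of the form d[M]/dt = synthesis - k*[M] with the forward-Euler method. The slides do not give the form of F{TFs}, so the synthesis term is passed in as an arbitrary function of time; the function name `integrate_mrna`, the step size, and the rate values are assumptions chosen for the example.

```python
def integrate_mrna(synthesis, k, m0, dt, steps):
    """Forward-Euler integration of d[M]/dt = synthesis(t) - k*[M].

    synthesis: function of time standing in for F{TFs} (assumed form)
    k:         first-order degradation rate constant
    m0:        initial concentration
    Returns the list of concentrations at t = 0, dt, 2*dt, ...
    """
    m, t = m0, 0.0
    course = [m0]
    for _ in range(steps):
        m += dt * (synthesis(t) - k * m)
        t += dt
        course.append(m)
    return course


# Constant synthesis: the concentration relaxes toward synthesis / k.
course = integrate_mrna(lambda t: 2.0, k=0.5, m0=0.0, dt=0.01, steps=5000)
print(course[-1])
```

With a constant synthesis rate the concentration approaches the ratio of synthesis rate to degradation rate, which gives a quick sanity check on any integrator used in practice.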
Part II: Framework for evolving a gene net
Aim: To find a gene net with topology and dynamics
sufficient to simulate the observed time courses of
concentrations.
Problem: We have partial knowledge of the molecules,
reaction list, and dynamics.
molecules present?
genes? (genome done?)
proteins?
functions known?
assaying each protein in all forms?
small molecules?
inputs and outputs for each reaction known?
dynamics known? (functional form? parameters?)
Dealing with partial knowledge:
Make it up as we go along.
Mutate the interactions of molecules and the dynamics
of reactions iteratively, to fit the model to data.
An element-based representation of molecules
and their interactions facilitates mutation.
proteins:
A protein contains one or more domains. A
domain is a sequence of amino acids.
Proteins interact through pairs of complementary
domains (A, A’) – a dyad.
Within a dyad signaling is often directional, from a
pre-domain A to a post-domain A’.
networks of proteins:
A pathway: RA → A’B → B’C ... U’V → V’T.
Pathways may
converge: R1A, R2B → A’B’C → C’T
or
diverge: RA → A’BC → B’T1, C’T2
Genes, proteins, and their interactions
proteins: XA YB ZC
gene (product DEF):
____a’__b’c’__|__D_E__F__
regulatory region | coding region
Mutation of genes can occur by
point mutation or DNA transfer.
Examples:
deletion of an element:
a → (nothing)
duplication of an element:
a → aa
addition of an element:
a | c and a’ b’ | d → a b’ | c and a’ b’ | d
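The three kinds of mutation can be sketched as list operations if a gene region is held as a list of elements. This is a hypothetical representation for illustration; the function names and the choice of plain Python lists are assumptions, not part of the proposal.

```python
def delete_element(region, index):
    """Return a copy of the region without the element at `index`."""
    return region[:index] + region[index + 1:]


def duplicate_element(region, index):
    """Return a copy of the region with the element at `index` repeated."""
    return region[:index + 1] + region[index:]


def transfer_element(donor, recipient, index, position):
    """Copy donor[index] into the recipient region at `position`."""
    return recipient[:position] + [donor[index]] + recipient[position:]


print(delete_element(["a"], 0))
print(duplicate_element(["a"], 0))
# Regulatory regions of two genes: b' is copied from the second into the first.
print(transfer_element(["a'", "b'"], ["a"], index=1, position=1))
```

Each function returns a new list and leaves its arguments unchanged, so a parent gene survives when a mutant is made from it.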
Part III: Simulating the evolution of a gene net
Aim: To find a gene net with topology and dynamics
sufficient to simulate the observed time courses of
concentrations.
The approach: Evolutionary computation. Let a
population of model gene nets that fits the available
data evolve in silico.
Framework for the method:
Mutate networks, evaluate fitness, select fitter.
Evaluation of fitness:
Avoid integration of ODEs.
Compare rate estimates from data and from ODEs.
Bigger cumulative discrepancy gives lower fitness.
Evaluate fitness in stages, to reject poor nets fast:
Qualitative: Compare signs of rates
Coarse quantitative: Compare rates from
ODEs to rates estimated from data as finite
differences.
Fine quantitative: Compare rates from ODEs
to rates estimated from data with a spline approximation.
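The first two stages can be sketched as follows. This is an illustration under assumptions: the model is a function `model_rate(t, x)` giving the right-hand side of one ODE, it is evaluated at interval midpoints, and the cumulative discrepancy is taken as a sum of squared differences. The spline stage is left out, and the names are hypothetical.

```python
def finite_difference_rates(times, values):
    """Estimate d[x]/dt on each sampling interval from the data."""
    return [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
            for i in range(len(values) - 1)]


def sign(x, tol=1e-9):
    """Return -1, 0, or 1, treating values within `tol` of zero as zero."""
    if abs(x) <= tol:
        return 0
    return 1 if x > 0 else -1


def staged_fitness(times, values, model_rate, max_sign_mismatches=0):
    """Return None if the qualitative stage rejects the model,
    otherwise minus the summed squared rate discrepancy."""
    data_rates = finite_difference_rates(times, values)
    model_rates = [model_rate(0.5 * (times[i] + times[i + 1]),
                              0.5 * (values[i] + values[i + 1]))
                   for i in range(len(values) - 1)]
    # Stage 1, qualitative: compare the signs of the rates.
    mismatches = sum(sign(d) != sign(m)
                     for d, m in zip(data_rates, model_rates))
    if mismatches > max_sign_mismatches:
        return None
    # Stage 2, coarse quantitative: bigger discrepancy gives lower fitness.
    discrepancy = sum((d - m) ** 2 for d, m in zip(data_rates, model_rates))
    return -discrepancy


times, values = [0.0, 1.0, 2.0], [0.0, 2.0, 4.0]
print(staged_fitness(times, values, lambda t, x: 1.0))
print(staged_fitness(times, values, lambda t, x: -1.0))
```

No ODE is integrated here: the model is only asked for its rate at the observed concentrations, which is what makes early rejection cheap.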
Overview of the method
species of network: connectivity given. parameters:
some given, some free. fitness: initial or previous.
cell: a species with connectivity and all parameters
given. fitness calculated from dynamics.
[Flowchart: a population of N species (fitnesses F1, F2, ...) → mutate one species → boolean test → cells of mutated species (a; F1a, b; F1b, ...) → cell-level iteration → assign fitness of best cell to mutant species → discard worst species in population.]
Species-level iteration: Search for the optimal
connectivity by mutation and selection.
Specify an initial population of N species: Use core
models -- the fittest models available from previous
experimentation, with a standard value for fitness.
Species-level iteration: Search for optimal connectivity.
Make a mutated species: Transfer or delete a
response element or an exon.
Cell-level iteration:
Search for values of free parameters
that optimize the mutated species’ fitness:
Make and select cells from the mutated species.
From the mutant species make a cell with
random parameter values.
Evaluate the fitness of the cell in a sequence of
stages. Fitness at each stage defines the cell’s
probability of going to the next stage.
If the cell is rejected, make a new cell.
If enough cells generated from a mutant
species are rejected, reject the mutant
species and make a new one.
If the cell is accepted, add it to the
population of cells for the mutant species,
and make a new cell.
At some point,
Stop making new cells with random
parameter values.
Make a mutant cell by point mutation or
crossover of cells in the population,
choosing parent(s) with probability
proportional to fitness.
Evaluate the fitness of the cell in a
sequence of stages.
After the population has M+1 cells,
discard the least fit cell.
Stop after best cell fitness converges.
Assign the fitness of the best cell to the mutated
species. In the population of N+1 species, discard
the least fit species.
Stop after the fitness of the best species converges.
Compare the fittest species to additional data.
Compare connectivity to that inferred by other
methods, e.g. clustering.
Search for response elements.
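The two nested iterations can be sketched in a few dozen lines. This is a simplified skeleton under assumptions, not the proposed implementation: fitness values are positive, loops run for a fixed number of rounds instead of testing convergence, the staged evaluation and crossover are omitted, and all function names and default settings are hypothetical.

```python
import random


def roulette(population, rng):
    """Pick an item with probability proportional to its positive fitness."""
    total = sum(fitness for fitness, _ in population)
    pick = rng.uniform(0, total)
    running = 0.0
    for fitness, item in population:
        running += fitness
        if running >= pick:
            return item
    return population[-1][1]


def optimize_cells(fitness_of, n_params, rng, m=10, n_random=30, n_mutant=200):
    """Cell-level iteration: random cells first, then point mutation."""
    cells = []
    for _ in range(n_random):
        params = [rng.uniform(0, 1) for _ in range(n_params)]
        cells.append((fitness_of(params), params))
    cells = sorted(cells, reverse=True)[:m]
    for _ in range(n_mutant):
        parent = roulette(cells, rng)
        child = [p + rng.gauss(0, 0.05) for p in parent]
        cells.append((fitness_of(child), child))
        cells.sort(reverse=True)
        cells.pop()  # the population has M+1 cells: discard the least fit
    return cells[0]


def evolve_species(species, mutate, fitness_for, n_params_for, rng, rounds=20):
    """Species-level iteration over a list of (fitness, connectivity) pairs."""
    for _ in range(rounds):
        _, parent = rng.choice(species)
        mutant = mutate(parent, rng)
        best_fitness, _ = optimize_cells(
            lambda params: fitness_for(mutant, params),
            n_params_for(mutant), rng)
        species.append((best_fitness, mutant))  # best cell's fitness
        species.sort(key=lambda pair: pair[0], reverse=True)
        species.pop()  # the population has N+1 species: discard the least fit
    return species[0]
```

Because only the least fit member is discarded at each level, the best fitness in either population never decreases from one round to the next.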
Comments on the method
This procedure functions as a breadth-first search if all
species are equally fit, but as a depth-first search if
some species are much more fit than others.
After a mutant species is made, its connectivity can be
evaluated with the program Netscan, which assembles
networks that meet specified constraints (including
boolean constraints) from a reaction list. This boolean
test can reject mutant networks with inadequate
connectivity before cells are made from them.
Netscan can be downloaded from
http://www.stat.ubc.ca/people/riffraff/
Two related issues:
Dealing with proteins without proteome data:
Solve ODEs for proteins in terms of mRNA
concentrations.
(Chen, He, & Church, 1999 PSB)
Post-transcriptional or post-translational
modification: too bad.
How many free parameters (f.p.) can we evaluate?
want # data points ≥ # f.p.
assay R mRNAs at T times: R*T data points.
if R pairs of ODEs for mRNA + protein,
6+W f.p. per pair:
4 rate constants
2 initial values
W weights in transcription rule.
R*T ≥ R*(6+W), so W ≤ T - 6.
small molecules & hypoth. genes: more f.p.
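The counting argument can be checked with a few lines of arithmetic. The sketch reads the requirement as "at least as many data points as free parameters", which is an interpretation of the slide; the function names are hypothetical.

```python
def max_weights(num_time_points, rate_constants=4, initial_values=2):
    """Largest W with R*T >= R*(6 + W), i.e. W <= T - 6 (never below 0)."""
    return max(0, num_time_points - (rate_constants + initial_values))


def is_determined(num_genes, num_time_points, weights_per_gene):
    """True if the data points are at least as many as the free parameters."""
    data_points = num_genes * num_time_points
    free_params = num_genes * (6 + weights_per_gene)
    return data_points >= free_params


print(max_weights(10))
print(is_determined(5, 10, 4), is_determined(5, 10, 5))
```

The number of genes R cancels out of the bound, so only the number of sampling times limits how many transcription weights each gene can carry.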
Merits of the method
Clustering methods
show groups of genes with similar behavior.
use data from pairs of sampling times.
A dynamical model of the network
uses the whole time span of response, so can show
extended temporal correlations.
can incorporate data on mRNAs, proteins, and
small molecules.
gives more detailed predictions:
shows network structure
gives parameter values
can suggest novel interactions, and attributes
of novel molecules.