A proposal to infer gene nets from array data using evolutionary computation

Prof. Jay Mittenthal
Dept. of Cell & Structural Biology, UIUC

Alexander Kosorukoff
Depts. of General Engineering (formerly) and Computer Science (now), UIUC

The problem: Find the gene net underlying a cell’s time-varying behavior, using data from a temporal sequence of microarrays.

The approach: Evolutionary computation. Let a population of model gene nets that fits the available data evolve in silico.

Topics:
  Representing a gene net: reactions, topology, kinematics, dynamics
  Representing the evolution of a gene net
  Simulating the evolution of gene nets

Gene net: A network of reactions in which genes regulate one another’s activity.

Relevant classes of reactions:
  Transcription: (TFs to activate G) + gene G → mRNA M
  Translation: mRNA M → protein P
  Regulation of protein activity: S + P → P*

The topology (connectivity) of a simple gene net: A monostable flip-flop.

[Figure: wiring diagram of the flip-flop with nodes S, TF1*, TF2, TF3, TF4, P5, and P6.]

Reaction list (incomplete):
  S + TF1 → TF1*
  TF1* + G3 → M3
  M3 → TF3
  TF3 + G5 → M5
  M5 → P5
  TF2 + NOT TF1* + G4 → M4
  M4 → TF4
  TF4 + G6 → M6
  M6 → P6

Reactions:
  Even a simple gene net has several reactions of genes, mRNAs, proteins, and small molecules.
  mRNAs, proteins, and small molecules are synthesized, but also degraded (more reactions, not shown).

Genes and their regulation:
  Gene regulation can involve switching among two or more forms of a molecule.
  TFs can activate or repress transcription. A given TF can activate one gene and repress another.

The topology of a gene net can be represented with a bipartite directed hypergraph.
  Bipartite: Molecular nodes and reaction nodes.
  [Figure: example reaction node connecting S, TF1, and TF1*.]
  Directed: Reactions have thermodynamically favored directions.
  Hypergraph: In a reaction the production of outputs can depend on several inputs jointly.
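The bipartite directed hypergraph can be made concrete with a small data structure. The following is a minimal sketch in Python, not the authors' implementation: the `Reaction` class, the reaction names `r1` to `r9`, and the separate `inhibitors` field (used here for the "NOT TF1*" term) are illustrative assumptions. The reactions themselves are the incomplete flip-flop list given on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction node: several inputs jointly give the outputs (a hyperedge)."""
    name: str
    inputs: tuple            # molecules required jointly
    outputs: tuple           # molecules produced
    inhibitors: tuple = ()   # molecules that must be absent (the NOT terms)


# The (incomplete) flip-flop reaction list; names r1..r9 are hypothetical labels.
REACTIONS = [
    Reaction("r1", ("S", "TF1"), ("TF1*",)),
    Reaction("r2", ("TF1*", "G3"), ("M3",)),
    Reaction("r3", ("M3",), ("TF3",)),
    Reaction("r4", ("TF3", "G5"), ("M5",)),
    Reaction("r5", ("M5",), ("P5",)),
    Reaction("r6", ("TF2", "G4"), ("M4",), inhibitors=("TF1*",)),
    Reaction("r7", ("M4",), ("TF4",)),
    Reaction("r8", ("TF4", "G6"), ("M6",)),
    Reaction("r9", ("M6",), ("P6",)),
]


def bipartite_edges(reactions):
    """Directed edges: molecule -> reaction for inputs and inhibitors,
    reaction -> molecule for outputs."""
    edges = []
    for r in reactions:
        edges += [(m, r.name) for m in r.inputs + r.inhibitors]
        edges += [(r.name, m) for m in r.outputs]
    return edges


def molecules(reactions):
    """All molecular nodes that appear in any reaction, sorted by name."""
    return sorted({m for r in reactions
                   for m in r.inputs + r.outputs + r.inhibitors})


edges = bipartite_edges(REACTIONS)
```

Every edge joins one molecular node and one reaction node, which is the bipartite property; a reaction with several inputs, such as `r1`, is what makes the structure a hypergraph rather than an ordinary graph on molecules.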
Kinematics of a gene net:

[Figure: time courses of S, TF2, TF1, TF1*, M3, TF3, M5, P5, M4, TF4, M6, and P6 for the flip-flop network.]

mRNA levels are a small part of the relevant data.
Protein data echo mRNA data with a delay, which may be long if there is post-transcriptional regulation.
Protein data should distinguish different forms of a protein (TF1 vs. TF1*).

Dynamics of a gene net:

From the topology of the network, form a reaction list.
  e.g. reaction i: gene i + TFs for i → mRNA i
Use the reaction list to specify dynamical equations governing the synthesis and degradation of each molecule.
  boolean:
    synthesis: G1 + A + (B OR C) + NOT D → M1
    degradation: specify lifetime of molecule
  continuous: ordinary differential equation (ODE)
    d[M1]/dt = F{TFs} - k*M1
    (first term: synthesis; second term: degradation)
    parameters: binding and rate constants; weights of TFs
Solve the system of dynamical equations to generate time courses of concentrations of molecules.

Part II: Framework for evolving a gene net

Aim: To find a gene net with topology and dynamics sufficient to simulate the observed time courses of concentrations.

Problem: We have partial knowledge of the molecules, reaction list, and dynamics.
  molecules present?
    genes? (genome done?)
    proteins? functions known? assaying each protein in all forms?
    small molecules?
  inputs and outputs for each reaction known?
  dynamics known? (functional form? parameters?)

Dealing with partial knowledge: Make it up as we go along. Mutate the interactions of molecules and the dynamics of reactions iteratively, to fit the model to data.

An element-based representation of molecules and their interactions facilitates mutation.

proteins:
  A protein contains one or more domains. A domain is a sequence of amino acids.
  Proteins interact through pairs of complementary domains (A, A’), called a dyad.
  Within a dyad signaling is often directional, from a pre-domain A to a post-domain A’.

networks of proteins:
  A pathway: RA → A’B → B’C → ... → U’V → V’T.
  Pathways may converge or diverge:
    converge: R1A, R2B → A’B’C → C’T
    diverge: RA → A’BC → B’T1, C’T2

Genes, proteins, and their interactions

proteins: XA  YB  ZC
gene: DEF
  ____a’__b’c’__|__D_E__F__
  (regulatory region to the left of the bar, coding region to the right)

Mutation of genes can occur by point mutation or DNA transfer. Examples:
  deletion of an element: a → (nothing)
  duplication of an element: a → aa
  addition of an element: a | c,  a’ b’ | d  →  a b’ | c,  a’ b’ | d

Part III: Simulating the evolution of a gene net

Aim: To find a gene net with topology and dynamics sufficient to simulate the observed time courses of concentrations.

The approach: Evolutionary computation. Let a population of model gene nets that fits the available data evolve in silico.

Framework for the method: Mutate networks, evaluate fitness, select the fitter ones.

Evaluation of fitness:
  Avoid integration of ODEs. Compare rate estimates from data and from ODEs. A bigger cumulative discrepancy gives lower fitness.
  Evaluate fitness in stages, to reject poor nets fast:
    Qualitative: Compare signs of rates.
    Coarse quantitative: Compare rates from ODEs to rates estimated from data as finite differences.
    Fine quantitative: Compare rates from ODEs to rates estimated from data with a spline approximation.

Overview of the method

species of network: connectivity given. parameters: some given, some free. fitness: initial or previous.
cell: a species with connectivity and all parameters given. fitness calculated from dynamics.

[Figure: flow diagram. N species with fitnesses F1, F2, ... → mutate one species → boolean test → cells of mutated species (a; F1a, b; F1b, ...) → cell-level iteration → assign fitness of best cell to mutant species → discard worst species in population.]

Species-level iteration: Search for the optimal connectivity by mutation and selection.

Specify an initial population of N species: Use core models, the fittest models available from previous experimentation, with a standard value for fitness.

Species-level iteration: Search for optimal connectivity.
  Make a mutated species: Transfer or delete a response element or an exon.
  Cell-level iteration: Search for values of free parameters that optimize the mutated species’ fitness.
    Make and select cells from the mutated species:
      From the mutant species make a cell with random parameter values.
      Evaluate the fitness of the cell in a sequence of stages. Fitness at each stage defines the cell’s probability of going to the next stage.
      If the cell is rejected, make a new cell. If enough cells generated from a mutant species are rejected, reject the mutant species and make a new one.
      If the cell is accepted, add it to the population of cells for the mutant species, and make a new cell.
    At some point, stop making new cells with random parameter values:
      Make a mutant cell by point mutation or crossover of cells in the population, choosing parent(s) with probability proportional to fitness.
      Evaluate the fitness of the cell in a sequence of stages.
      After the population has M+1 cells, discard the least fit cell.
      Stop after best cell fitness converges.
    Assign the fitness of the best cell to the mutated species.
  In the population of N+1 species, discard the least fit species.
  Stop after the fitness of the best species converges.

Compare the fittest species to additional data.
Compare connectivity to that inferred by other methods, e.g. clustering.
Search for response elements.

Comments on the method

This procedure functions as a breadth-first search if all species are equally fit, but as a depth-first search if some species are much more fit than others.

After a mutant species is made, its connectivity can be evaluated with the program Netscan, which assembles networks that meet specified constraints (including boolean constraints) from a reaction list. This boolean test can reject mutant networks with inadequate connectivity before cells are made from them.
Netscan can be downloaded from http://www.stat.ubc.ca/people/riffraff/

Two related issues:

Dealing with proteins without proteome data:
  Solve ODEs for proteins in terms of mRNA concentrations. (Chen, He, & Church, 1999 PSB)
  Post-transcriptional or post-translational modification: too bad.

How many free parameters (f.p.) can we evaluate?
  Want # data points ≥ # f.p.
  Assay R mRNAs at T times: R*T data points.
  If there are R pairs of ODEs (mRNA + protein), each pair has 6 + W f.p.:
    4 rate constants
    2 initial values
    W weights in transcription rule
  So R*T ≥ R*(6 + W), which allows at most W = T - 6 weights per gene.
  Small molecules and hypothetical genes add more f.p.

Merits of the method

Clustering methods
  show groups of genes with similar behavior.
  use data from pairs of sampling times.

A dynamical model of the network
  uses the whole time span of response, so can show extended temporal correlations.
  can incorporate data on mRNAs, proteins, and small molecules.
  gives more detailed predictions:
    shows network structure
    gives parameter values
    can suggest novel interactions, and attributes of novel molecules.
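The "Evaluation of fitness" slide proposes scoring a cell without integrating its ODEs, by comparing the rates the ODEs predict at the observed states with rates estimated from the data. The following minimal Python sketch covers the qualitative and coarse quantitative stages for a single molecule obeying d[M]/dt = s - k*M. All function names, the 0.8 sign-agreement threshold, the 1/(1 + discrepancy) fitness scale, and the synthetic data are assumptions for illustration. The slides let fitness at each stage set a probability of advancing, and the hard cutoff used here is a simplification of that.

```python
import math


def finite_difference_rates(times, values):
    """Forward-difference estimate of d(value)/dt on each sampling interval."""
    return [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
            for i in range(len(times) - 1)]


def model_rates(rhs, values, params):
    """ODE right-hand side evaluated at the observed states (no integration)."""
    return [rhs(x, params) for x in values[:-1]]


def sign(x, tol=1e-9):
    """Return -1, 0, or 1, treating values within tol of zero as zero."""
    return 0 if abs(x) < tol else (1 if x > 0 else -1)


def sign_agreement(data_rates, ode_rates):
    """Qualitative stage: fraction of intervals where the rate signs match."""
    matches = sum(sign(d) == sign(m) for d, m in zip(data_rates, ode_rates))
    return matches / len(data_rates)


def discrepancy(data_rates, ode_rates):
    """Coarse quantitative stage: cumulative squared rate discrepancy."""
    return sum((d - m) ** 2 for d, m in zip(data_rates, ode_rates))


def staged_fitness(times, values, rhs, params, min_sign_agreement=0.8):
    """Reject at the qualitative stage, else score by rate discrepancy."""
    data = finite_difference_rates(times, values)
    ode = model_rates(rhs, values, params)
    if sign_agreement(data, ode) < min_sign_agreement:
        return 0.0  # rejected early; the threshold is an assumed cutoff
    return 1.0 / (1.0 + discrepancy(data, ode))  # bigger discrepancy, lower fitness


def synthesis_minus_decay(m, params):
    """Right-hand side of d[M]/dt = s - k*M."""
    s, k = params
    return s - k * m


# Synthetic "data": exact solution of the ODE with s = 2, k = 1, M(0) = 0.
times = [0.5 * i for i in range(7)]
values = [2.0 * (1.0 - math.exp(-t)) for t in times]

good = staged_fitness(times, values, synthesis_minus_decay, (2.0, 1.0))
bad = staged_fitness(times, values, synthesis_minus_decay, (0.5, 1.0))
```

With these inputs the cell whose parameters generated the data passes the sign test and gets a positive fitness, while the cell with too little synthesis predicts falling concentrations where the data rise and is rejected at the first stage. A fine quantitative stage would replace `finite_difference_rates` with derivatives of a spline fitted to the data.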