Talks to IPAM group, UCLA, September 26
Why biologists need mathematicians
I. Themes from biological systems
Roger Brent, The Molecular Sciences Institute

A. Why do we care about biology?
1. It’s fascinating
2. It’s us, and it’s a good deal of the rest of our world

1. It’s fascinating
It’s quite different from the hard world. There are biological functions that don’t have great counterparts outside it.
Core functions
-> Self-replication
-> Self-repair
-> Ordered growth
-> Unusual and un-understood approaches to information processing
-> Development into great complexity from an undifferentiated state (using stored information to get there)

Biological functions that don’t have good counterparts in the non-biological world
Neurobiological functions
-> Perception
-> Memory
-> Cognition
-> Self-consciousness
Species and population functions
-> Variation-selection (aka evolution)
-> Distributed coordinated control (immune system)
-> Resilience (ecologies)

2. It’s us, and it’s our world
-> Medicine
-> Agriculture, food, etc.
-> Increasingly, engineering and design

B. Biological systems are built from stored information.
DNA makes RNA makes protein. The DNA, of course, contains genes. Much of the progress in molecular biology in the mid-20th century concerned understanding “the molecular biology of the gene”.
What is a gene?
-> Definition 1. Operational. Rediscovery of Mendel. T. H. Morgan. Allele.
-> Definition 2. Coding sequence, or is that the coding sequence + the control region?
The first definition launched genetics and is still with us in popular culture. The second is what, in the main, allows us to work day to day.

C. For reasons that I hope will become apparent, most of what I talk about over the next two days will deal with non-neurobiological information-handling functions.
These processes are arguably the most important current preoccupation of molecular and cellular biologists.
Let’s first consider the development of an organism from a relatively undifferentiated state. This development depends on the elaboration of the information stored in the genome. At the limit, most of the construction of the cell and the metazoan organism is specified by that stored information.

C. Information processing during development
There is a fairly clean division between stored information (program) and the rest of the organism. Consider the cell or organism as the consequence of an expressed genome. In the cell or organism, information is represented at two levels: the DNA (level 1) and the dynamic mRNA and protein products (level 2). Information stored in level 2 is a subset of what is in level 1 and is synonymous, albeit in a different language. We’ll get back to this.

D. Non-developmental information processing
1. Metabolism
If I were giving this talk in 1950, I would be telling you that the living state could be understood in terms of metabolism, and that if one wanted to use mathematics to understand biology more deeply, one should consider metabolism. Small molecules schlepped from workstation to workstation. There are all kinds of feedback. This is a tale of rates and fluxes. Stoffwechsel (German: metabolism).
Information processing, decision making

E. Regulation
OK, there is overall information flow from DNA to mRNA and protein, especially during development. However, information is also processed in response to changes inside and outside the cell. Regulation is the mother of this information processing.
At the level of the cell, the “primitives”, the “components” that perform information processing, are subsystems of protein molecules and sites.
Examples of regulatory subsystems:
1) The operon
2) The Bicoid gradient
3) Genetic networks of repressors and activators: usually, not
4) Signal transduction networks
1) The operon (draw PL from phage lambda on board)
2) Bicoid protein gradient
3) Genetic networks of repressors and activators: not (draw “Boolean network” on board)

E. Regulation, continued
4) Signal transduction networks: yes
(Figure: part of the pathway governing the response of yeast to α factor.)

F. Overview of non-neurobiological information processing in biology
1. Information processing is the current metaphor for understanding how things work. This is as true in biology as it is in any other human endeavor.
2. Contemporary molecular and cellular biology has its roots in an attempt to understand transmission and elaboration of the genetic information.
3. Certainly, during development, there is an overall flow of information from DNA to mRNA and protein.

F. Non-neurobiological information processing in biology, cont’d
4. However, a good deal of the interesting things that happen after development (the decision making, the phenomena that cause biological changes to occur) are due to processing of information coming from inside and outside the cell.
5. Some of this information processing, particularly information processing that can be slow and stately, occurs through changes in gene expression. However, relatively little of it occurs as multiple cycles of gene activation and repression.

F. Non-neurobiological information processing in biology, cont’d
6. By contrast, a good deal of cellular information processing occurs via protein-protein interactions.
(Figure: mammalian G1 -> S circuitry.)

F. Non-neurobiological information processing in biology, cont’d
7. Genomic biological techniques are generating new kinds of information that promise to heighten our understanding of core biological functions, including these information processing functions.
8. In the understanding of the processes that bring about changes in biological systems, there will be mathematical opportunities.
9. The challenges and opportunities will likely be both problem-specific and specific to each data type.

Why biologists need mathematicians
II. How biological information and knowledge is created
Talks to IPAM group, UCLA, September 26 and 27

A. Ad hoc experimentation (sometimes known as “hypothesis-driven” research)
Modular transcription factor story.
(Figure labels: LexA, lex op, Gene; LexA-GAL4, lex op, Gene.)

A. Ad hoc biological experimentation, cont’d
-> Not always hypothesis driven.
-> Can sometimes be reduced to hypothesis-driven or, more precisely, in the Baconian (Novum Organum) sense, to a test that decides between alternatives.
-> Reasoning toward conclusions, and the conclusions themselves, are typically expressed qualitatively, in natural language (e.g., English).
-> Strength of conclusions depends on finding the right words.
-> Most of what we know, we know from this.

B. Genomic research, defined
Genetics: study of genes
Genomics: systematic study of genes (as long as you can do it in a factory)
Functional genomics: study of what genes do
-- but that is arguably all of biology
-- but wait, as long as you can do it in a factory, it’s functional genomics
-- Functional genomics is a good deal more than the study of differences in gene expression

B. Genomic research, defined, continued
This expanded definition of “functional genomics” would have been OK, but no: along came proteomics, and even worse neologisms. John C. Weinstein has proposed calling all of this “omics”. I will call all of it “genomic biology”.

C. DNA sequencing
1) Sequencing of organismic (genomic) DNA
2) Sequencing of cDNAs, ESTs
3) The main use taught to you so far has probably been to identify genes and assign function to them.
4) In all of this, the main inferential tactic is to suppose that one thing that looks like another thing may do the same thing as that other thing (transitivity).
5) And look: what do we mean by function?

C. DNA polymorphisms, such as SNPs
1) Sequencing (draw on board)
2) PCR (draw on board)
3) Melting (draw on board)
4) Hybridization, for example to microarrays
PCR + does it hybridize to this sequence?
Expensive and difficult to access now due to restricted Affymetrix chip technology. Cost will come down as people with deep pockets challenge this semi-monopoly.
5) PCR approach + Matrix-Assisted Laser Desorption/Ionization Time-of-Flight mass spectrometry (aka “MALDI-TOF mass spec”) (draw on board)

(Figure: MassArray MALDI-TOF mass spectrometry. Stages: desorption/ionization, separation, detection/identification. Courtesy Charles Cantor, Sequenom, Inc.)
(Figure: MassArray allele discovery. TRSP (-12) SNP analysis using combined DNA samples (184 chromosomes). Peaks: C allele 5837, T allele 6165, G allele 6190. Courtesy Charles Cantor, Sequenom, Inc.)

D. Gene expression analysis
1) Methods
-> E. M. Southern
-> Catalog cDNA sequence abundance
-> Affymetrix
-> Brown-Botstein-Davis-DeRisi microspot
-> Suspension array (à la Lynx, Luminex)
2) Quality control problems
-> Surface problems
-> Source of mRNA problems (sample grab, sample treatment)
-> How you turn mRNA into cDNA
-> How you amplify

D. Gene expression analysis, continued
3) Inferential tactics
-> Two things that are expressed at the same time or under the same conditions might do the same thing (“guilt by association”).
-> One thing that is expressed, or that ceases to be expressed, after another thing is expressed may be controlled by the prior-expressed thing (post hoc, ergo propter hoc).
-> Seems as if we could do with a few more ideas, no? (I mean, there are more logical fallacies left.)

(Figure: guessing gene function from expression pattern. Chu et al., 1999. Axis: time after sporulation induction (hours): 0, 1/2, 2, 5, 7, 9, 11.)

E. Protein
-> 2-D gels
-> Two-hybrid
-> GST-pulldown activity
-> Mass spec ID of proteins in complexes
-> Follow-ons
That which can be systematized will be.

2-D gels (draw on board)

(Figure: two-hybrid system. “Prey”: nuc loc seq, acid, epitope tag, interacting cDNA ORF. “Bait”: fused moiety, LexA dimerization, LexA DNA binding. Also labeled: op, “prey” gene, act, “bait” gene.)

Mass production of two-hybrid information: interaction mating
(Figure labels: Baits, Reporter 1: LEU2; Preys, Reporter 2: lacZ; Mating.)

(Figure: aptamer affinity vs LexAop-GFP reporter expression. Colas et al., unpublished. Axes: log KD (M), -8 to -7; % fluorescence above background, 500 nm, 0 to 100.)

Finding new functions by coprecipitation and forced association. Martzen et al., 1999
(Figure labels: biochemical activity; Protein 1; Protein 2; GST; glutathione bead.)

Protein mass spectrometric identification of proteins in complexes
Won’t show the lurid commercial slide again.

Follow-ons
Ways to track information
-> by changes in protein phosphorylation
-> by changes in mass
-> etc., etc.

Things we need
(Figure labels: A, B, C, D; receptors and ligands; cytoplasmic proteins; sites and regulatory proteins; rate constants k.)

Why biologists need mathematicians
III. Things to watch: current areas of genomic biology that seem to be generating mathematical issues

A. “Ken’s light”, courtesy Andrea Way

B. Guessing protein function from interaction pattern. Lok et al., 1998, www.molsci.org
(Figure: interactions in our database, Interaction 1.0. Labels include cdk4, Cart1, and cycD.)
Two patterns of protein interactions
-> Protein complex
-> Signal transduction pathway
Computer searching for patterns of protein interaction (Connect The Dots, or CTD v3.0)
Lok’s algorithm finds noncircular patterns first. The new version is more than 10^3 times faster than the first.

C. Quantitative simulations and models of biological processes
(Figure: part of the pathway governing the response of yeast to α factor.)
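
The simplest case treated in this part of the talk is a single first-order reaction, reactant -> product, with rate constant k. A minimal Python sketch of the two computations for that case might look like the following. It is an illustration only, not the simulation code used for the yeast pathway: the function names, the seed, and the parameter values are assumptions. The stochastic routine follows the outline attributed to Gillespie (1977): compute the reaction propensity from the number of reactants and k, sample an exponentially distributed waiting time, convert one reactant into one product, and loop.

```python
import math
import random


def deterministic_count(n0, k, t):
    """Continuous solution n(t) = n0 * exp(-k * t) of dn/dt = -k * n."""
    return n0 * math.exp(-k * t)


def gillespie_decay(n0, k, t_end, rng):
    """Stochastic simulation of reactant -> product, one event at a time.

    Returns a list of (time, reactants, products) tuples.
    All names and arguments here are illustrative assumptions.
    """
    t, reactants, products = 0.0, n0, 0
    trajectory = [(t, reactants, products)]
    while reactants > 0:
        # Probability per unit time that some reactant molecule reacts.
        propensity = k * reactants
        # Waiting time to the next event is exponential with that rate.
        t += rng.expovariate(propensity)
        if t > t_end:
            break
        reactants -= 1
        products += 1
        trajectory.append((t, reactants, products))
    return trajectory


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    n0, k, t_end = 1000, 0.01, 200.0
    final_time, final_reactants, _ = gillespie_decay(n0, k, t_end, rng)[-1]
    print("stochastic reactants at end:", final_reactants)
    print("continuous reactants at end:", deterministic_count(n0, k, t_end))
```

With large molecule counts the stochastic trajectory stays close to the exponential curve; with small counts the run-to-run variation becomes visible, which is the reason to use the stochastic form at all.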
Continuous and stochastic reaction computations
Reaction: reactant -> product, with rate constant k

Continuous:
$d[\mathrm{reactant}]/dt = -k\,[\mathrm{reactant}]$
$d[\mathrm{reactant}]/[\mathrm{reactant}] = -k\,dt$
$[\mathrm{reactant}]_{t} = [\mathrm{reactant}]_{0}\,e^{-kt}$
(systems of these are solved numerically)

Stochastic (Gillespie, 1977):
startloop
$P_{\mathrm{reactant} \to \mathrm{product}}/\Delta t$ is $f(\#\mathrm{reactants}, k)$
Use $P_{\mathrm{reactant} \to \mathrm{product}}/\Delta t$ to compute reaction PDF
Sample reaction PDF to determine when reaction next occurs
# reactants = # reactants - 1
# products = # products + 1
go to startloop

(Figure: output of n-2 version of α factor simulation. Endy and Lyons, unpublished. Vertical axis: 10^0 to 10^6; horizontal axis: time (seconds), -200 to 1000.)

Test simulation by assaying system output in populations of single cells. Colman-Lerner, unpublished.
(Figure: G1 arrest state and S state, each with reporters P Fus1 YFP and P H2-like CFP.)
Vary amount of proteins, protein complexes, etc. Measure variation. Measure change in system output.

Issues with ongoing simulation work
-> Devising new experimental methods to constrain upstream steps of pathway
-> Coping computationally with increase in number of species
-- Run-time generation of code for each possible species (Levchenko and Bruck, 2000)
-- Computing and keeping track of only those species that come into being during each simulation run (“Moleculizer”, Lok, 2000)

D. Fuzzier stuff
1) Finding appropriate symbolic representation of qualitative knowledge and performing operations on said symbols, i.e., computation on qualitative knowledge
(Figure: mammalian G1 -> S circuitry.)

Limited-vocabulary information
Geographical
-> Names (Berkeley, Berkeley-Oakland line, Interstate 80, Bay Bridge, San Francisco Bay)
-> Relationships (shares border with, is under, north, right, 5 miles, 11 minutes)
-> Verbs (Go)
-> Modifiers (fastest route, under construction)
-> Locations (37.8 N, 122.3 W, 50 m above sea level)
Biological
-> Names (p107, Rb, Raf, ATP, Ovalbumin Estrogen Response Element)
-> Verbs (homodimerizes, ubiquitinates, cleaves, represses)
-> Modifiers (strongly, slowly, most, some)
-> Locations (in cytoplasm, near plasma membrane, in plasma membrane, in nucleus)

Cassini plumbing

D. Fuzzier stuff, cont’d
2) Coming back to deal with selection and fitness
a) Can evolve in silico. You need a source of variation and a way to evaluate fitness.
b) Make variants and use simulation to evaluate their fitness.
John Koza, Forrest Bennett III, et al.
1) Create language to describe circuit genomes (components and connections)
2) Start with 100 random 10-component circuits
3) Use SPICE simulator to simulate circuit behavior
4) Select 4 circuits that best approximate desired behavior (e.g., bandpass filters, amplifiers)
5) Mate, meiose, duplicate, delete, point mutate
6) Select 100 random progeny circuits
7) Go to step 3)

(Figure: op amp. Best circuit from generation 109. Koza et al., 1999. Schematic component labels omitted.)

D(2), cont’d
c) You have seen we may have simulations of biological systems, so we should be able to evolve in silico.
d) Meanwhile, we also seem to be moving toward whole-genome experimental methods to quantitate the contribution individual genes make to the fitness of an organism under selection. Honest numbers, honest quantitation.
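
The in silico loop outlined under D(2), a source of variation plus a way to evaluate fitness by simulation, can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the Koza et al. system: the "genome" is just a list of numbers, the `fitness` function is a stand-in for a real simulator such as SPICE, only mating and point mutation are implemented (no duplication or deletion), and the selected parents are carried over unchanged into the next generation. The population size of 100 and the 4 selected parents mirror the numbers on the slide; every other name and value is hypothetical.

```python
import random


def fitness(genome, target):
    """Stand-in for a simulator: negative squared error against a target."""
    return -sum((g - t) ** 2 for g, t in zip(genome, target))


def mate(a, b, rng):
    """Single-point crossover; assumes genomes have at least 2 components."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def point_mutate(genome, rng, sigma=0.1):
    """Return a copy with one randomly chosen component perturbed."""
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, sigma)
    return child


def evolve(target, pop_size=100, n_parents=4, generations=50, seed=0):
    """Variation-selection loop; returns (best genome, best fitness per generation)."""
    rng = random.Random(seed)
    n = len(target)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Evaluate every genome and keep the few that score best.
        ranked = sorted(population, key=lambda g: fitness(g, target), reverse=True)
        parents = ranked[:n_parents]
        best_history.append(fitness(parents[0], target))
        # Assumption: parents survive unchanged; the rest are mutated offspring.
        population = list(parents)
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            population.append(point_mutate(mate(a, b, rng), rng))
    best = max(population, key=lambda g: fitness(g, target))
    return best, best_history


if __name__ == "__main__":
    best, history = evolve([0.5, -0.25, 0.75, 0.0])
    print("first and last best fitness:", history[0], history[-1])
```

Replacing `fitness` with a call to a simulation of a biological pathway is the step the talk points toward; the selection and variation machinery would stay essentially the same.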
D(2), cont’d. Quantitating contribution of individual genes to fitness. Smith et al., 1996, etc.
(Figure labels: pot of different mutants; 10 gens; 50 gens; organisms with mutations in essential genes gone from population; organisms with mutations in significant genes (5% growth disadvantage) depleted from population.)

D(2), cont’d. Actual detection and quantitation of numbers relevant to evolution.
Thus, we can almost imagine closing the loop between “theory” and experiment on one aspect of evolution: the contribution of individual genes to fitness under a given condition. There are obvious math/statistical issues herein.

E. Conclusions
1) In your thoughts about what sorts of data are worth your insights, it would not be good to be biased by the prevailing genomic hype toward the kinds of data that can be collected using nucleic acid microarrays.
2) There is a very large world of experimental observation out there that needs mathematics at all levels, from basic applied math work on behalf of the innumerate to potentially profound insights.
3) Because most of these problems are not articulated, it requires work to articulate them. In this case, it probably requires people of goodwill on both sides, people who are not afraid of embarrassing themselves in a good cause.