From Genome Sequences to Regulatory Network Phenotypes
(bioinformatic functional genomics)

• Study the systematic operation of genes and their products in whole-genome, whole-cell contexts.
• Discover the effect of every gene on growth, expression, & interaction.
• Test quantitative network models.

Growth, Expression, & Interaction

Harvard Center for Computational Genetics: John Aach, Tim Chen, George Church, Jason Hughes, Jason Johnson, Abby McGuire, Jong Park, Fritz Roth
HMS Genetics: Andy Link, Doug Selinger, Pete Estep, Michael Ching, Martha Bulyk, Sonali Bose, Martin Steffen, Saeed Tavazoie, Annie Chan, Dereth Phillips, Chris Harbison
NCBI: Andrew Neuwald
Affymetrix: David Lockhart, Erik Gentalen
UCSD: Bernhard Palsson
DOE, DARPA, Lipper, NIST, HMR

Sequenced genomes (Science 277: 1433 (1997))

Organism                   # Genes   % Unknown function
S. cerevisiae              6034      49%
E. coli                    4288      38%
B. subtilis                4000      42%
Synechocystis sp.          3168      56%
A. fulgidus                2471      52%
H. influenzae              1740      42%
M. thermoautotrophicum     1855      56%
H. pylori                  1590      43%
M. jannaschii              1692      54%
B. burgdorferi             863       42%
M. pneumoniae              677       51%
M. genitalium              470       31%
Total                      28848     47% FUNs

Choice of Cells

Small genome size: Mycoplasma, Haemophilus, Methanococcus
Energy relevance: Methanobacterium, Synechocystis
Major pathogens: Mycobacterium, Escherichia, Helicobacter
Biotech production: Escherichia, Saccharomyces, Homo
Recombinant protein production, in vivo combinatorial chemistry, BACs, gene delivery, etc.

15 going on 40 complete genomes.
30,000 going on 150,000 complete genes (& intergenic regions).

Smith, et al. (1997) J. Bacteriol. 179:7135-55. Methanobacterium
Blattner, et al. (1997) Science 277, 1453-74. Escherichia
Goffeau, et al. (1996) Science 274, 563-7. Saccharomyces

Metabolic & regulatory databases

4288 / 4909 E. coli orfs / genes
587 - 804 enzymes
720 - 988 metabolic reactions
436 / 1303 metabolites / compounds

Varma & Palsson (1994) Appl. Env. Micro. 60:3724.
Karp et al. (1998) NAR 26:50. EcoCyc
Selkov, et al. (1997) NAR 25:37.
WIT
Robison and Church http://arep.med.harvard.edu

Conceptual Data Model
Biomolecule Interaction, Growth, & Expression Database (BIGED)
Project: TBEID1. Model: TBEID. (C) Author: John Aach, Harvard Center for Computational Genetics. Version: 1.04, 7/7/97.

Figure: entity-relationship diagram. Entities and their attributes:

• Condition Set: Condition Set Number, Description, Comment (S)
• Strain Mix: Strain Mix Number, Strain Mix Name, Description, Preparation Comments (P) (S,N)
• Strain: Strain Number, ProgenitorInd, Description, Comment
• Protein Preparation Set: Prot Prep Set Number, Description, Comment
• Competition Phenotype Expt: Starting Cell Count, Starting Cell Density
• Strain Phenotype Expt: Starting Cell Count, Starting Cell Density
• DNA Protein Binding Expt
• Protein Protein Binding Expt
• Experiment Info: Experiment Number, Experiment Type, Experimenter Name, Description, Comment, Start Time, End Time, Outcome Comment, Success Code, Sample Size, OpenInd
• Experiment Measures Set: Expt Measures Set No, Time of Measurement, Expt Measures Set Type, Description, Comment, Raw Data Sets Descrip, Data Transform Descrip, Outcome Comment, Success Code, Date Recorded, Sample Size, OpenInd
• Results Selection: Results Selection Code, Expt Measures Set Type, Results Selection Description
• Growth: Rel Growth Mutant, Std dev Rel Growth Mutant, Winner Mutant Ind, Rel Growth All, Std dev Rel Growth All, Winner All Ind
• Footprint: Fraction Occupancy, St Dev Frac Occupancy
• mRNA Expression: mRNA Expression Level, Std dev Express Level
• Protein Expression: Cell Fraction, Protein State, Exp Level, Std Dev Prot State Level
• DNA Seq Binding: DNA Seq Bind Const Num, DNA Sequence, Binding Constant, Std Dev Binding Constant
• Non Specific DNA Binding: Non Specific BindingConst, Std Dev Non Spec BindConst
• Protein Protein Binding: Binding Level, Std Dev Binding Level

Relationships drawn between entities: used in, input to, described by, has, exhibits.

Submodel cross-references: * = main model, C = Condition Set Entities, D = DNA and Protein
Elements, N = Names, P = Protein Preparation Entities, S = Strain and Strain Mix Entities

Functional Genomics: Growth, Expression, & Interaction. Why?

Sampled sequence vs. Completed genomes
Random vs. Engineered mutations & environments
Evolutionary models vs. High-throughput assays

Pure comparative genomics challenge:
15% amino acid identity: globins retain heme & oxygen binding functions.
100% amino acid identity: enolase functions vary from enzymatic to major vertebrate lens structural component.

Escherichia coli & Saccharomyces cerevisiae Regulatory and Metabolic Networks

Figure: network diagram linking Expression (DNA, RNA, Protein), Interactions, Environments, Metabolites, and Growth rate through rate constants kR, kP, kc, kI, kD. Steps covered by the rate constants: Initiate, Elongate, Terminate, Fold, Modify, Localize, Degrade.

Translating successful strategies: Metrics (physics envy & killer applications)

Automate                    Data quality              Model quality         Similarity search
X-ray diffraction (1960)    resolution < 0.2 nm       R = |o-c|/o < 0.2     DALI
Sequence (1988)             discrepancy < 0.01% bp    conserved proteins    BLAST
Function (1999)             completion                DNAgibbs              CorFun (growth, expression, & interaction; CorEnvironment)

Ratio of strains over environments, e, times, t_e, and selection coefficients, s_e:

R = R_0 exp[-s_e t_e]

80% of 34 random yeast insertions have s < -0.3% or s > 0.3% at t = 160 generations, e = 1 (rich media); ~50% for t = 15, e = 7.
Should allow comparisons with population allele models.

Other multiplex competitive growth experiments:
Thatcher, et al. (1998) PNAS 95:253.
Link AJ (1994) thesis; (1997) J Bacteriol 179:6228.
Smith V, et al. (1995) PNAS 92:6479.
Shoemaker D, et al. (1996) Nat Genet 14:450.

Multiplex: Tag(Mix) > Process > Decode
Internal standards, identical conditions, microscale

Multiplex DNA sequencing. Church GM, Kieffer-Higgins S. (1988) Science 240:185.
Physical mapping of complex genomes by cosmid multiplex analysis. Evans GA, Lewis KA. (1989) PNAS 86:5030.
Multiplexed biochemical assays with biological chips. Fodor SP, et al. (1993) Nature 364:555.
Lashkari DA, et al.
(1995) An automated multiplex oligonucleotide synthesizer. PNAS 92(17):7912.

Multiplex Competitive Growth Experiments (starting at t = 0)

107 Environments (so far)

Media and additives: minimal media, yeast extract, synthetic rich, Low N, Low P, NaCl, urine, pancreatin, bile, cholate, Triton X-100, 2 acetate, 4 butyrate, 6 hexanoate, homoserine lactone

Combinatorial pools:
a,H,F,Q,t
g,L,Y,N,S
C,I,W,u,E
M,K,T,D,dap
V,P,R,G,thiamine
a,g,C,M,thiamine
H,L,I,K,V
F,Y,W,T,P
Q,N,u,D,R
t,S,E,dap,G
pyridoxin,nicotinate,biotin,pantothenate,A

pH: 5, 6, 7, 8, 9
Temperature: 25, 30, 37, 45

Genome Engineering Challenges:

• Construct any mutant in any background, including multiple mutants, while minimizing hitchhiking mutations.
• Avoid undesired residual activities and neomorphic effects on adjacent genes, which occur in most deletion, insertion, nonsense, or antisense alleles.
• Use full in-frame replacements; computationally track gene overlaps and primer & genomic repeats.

Link, et al. (1997) J. Bacteriol. 179: 6228-6237. (pKO3) http://arep.med.harvard.edu

Crossover PCR in-frame deletions / tag substitutions
Figure: gene of interest and nearby gene, each shown from ATG to TAA; one primer carries a NotI site, the other a Bam site; crossover PCR joins the flanks through the c-tag, leaving ATG-tag-TAA in place of the gene of interest.

pKO3: in-frame tagged deletions
Figure: pKO3 map with repA ts, sacB, tag, camR, and M13 ori. Integration is selected at 43° on Cam; resolving the cointegrant at 30° on sucrose gives either 1 = wild type or 2 = mutant tag.

Primer design for size-tagged PCR (3% agarose; universal tag primer plus size-tagged primers)

Deleted Orf    Length
ygfX           789
yiaU           518
yhcS           348
ydhB           266
yfiE           194
ygoX           141
pssR           106

Competitive Growth Rate Tag Readout: Effects of pH in rich media
Figure: % change from inoculate (about -200 to 700) for each tagged strain (pssR, farR, nhaR, ydhB, yhcS, yidP, yhiF, yidL, uw6519), with one series per condition: r' pH5, r' pH6, r' pH7, r' pH8, r' pH9.

Genome Engineering Current status
Figure: counts of 5, 46, 24, and 20 across the categories Highly Expressed Genes, Putative regulatory FUNs, Highly conserved FUNs, and Flux Balance Predictions.
Link, Phillips, Loferer, in prep.
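The ratio model quoted earlier in the deck, R = R_0 exp[-s_e t_e], can be inverted to estimate a selection coefficient from tag readouts. The following is a minimal sketch under stated assumptions, not the authors' analysis: the function names and example numbers are hypothetical, and ratios are assumed to be measured against an internal standard at known generation counts.

```python
import math


def selection_coefficient(r0, rt, generations):
    """Invert R = R0 * exp(-s * t) for s, given a start and end ratio."""
    return -math.log(rt / r0) / generations


def fit_selection_coefficient(times, ratios):
    """Least-squares slope of ln(R) against t; returns s = -slope."""
    logs = [math.log(r) for r in ratios]
    t_mean = sum(times) / len(times)
    y_mean = sum(logs) / len(logs)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return -sxy / sxx


# Hypothetical noise-free readout for a strain with s = 0.3% per generation.
s_true = 0.003
times = [0, 40, 80, 160]
ratios = [math.exp(-s_true * t) for t in times]
print(selection_coefficient(ratios[0], ratios[-1], times[-1]))
print(fit_selection_coefficient(times, ratios))
```

With real tag intensities the regression form is usually preferable to the two-point form, since it averages measurement noise over all time points.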
Flux balance model with max growth objective (glucose as input):

S . v = b

S = stoichiometric matrix (m x n)
v = vector of n fluxes
b = I/O rate vector
n = 720 metabolic fluxes
m = 436 metabolites

Predict major flux changes: zwf-, zwf- pnt-
& synthetic lethals: zwf- pgi-

Figure: flux map of central metabolism with predicted flux values on each reaction. Labeled nodes: G6P, 6PGA, 6PG, F6P, FDP, GA3P, DHAP, DPG, 3PG, 2PG, PEP, Pyr, AcCoA, Ac, For, Cit, Icit, KG, SuccCoA, Succ, Fum, Mal, OAA, Ru5P, R5P, X5P, S7P, E4P, ATP, H+, NADH, NADPH, FADH, QH2.

Non-coding regions: E. coli: 11%, Yeast: 25%, Human: 95%

Similarity searching for environments, growth, expression, & interaction data, and then the challenges of DNA sequence motifs: short motifs & limited alphabet (4)

CorFun = Z_g . Z_g^T / n, computed from growth, expression, &/or interaction data
n = # environments + genotypes
g = gene sites
(switching n & g gives CorEnv)

Figure: gene-by-gene correlation matrix, shaded for positive and negative correlation, over kdgT, YidX, rspA, mtlA3', mtlA5', o184, ppiA, f105, hrsA, f214, carAB, YiaK, o85, pspA, Yggn. Clusters are labeled A to E; annotated clusters include catabolite repression (glucose & Crp regulated) and log vs. stationary phase regulated.

Expression data from four cultures, allowing three comparisons:

• glucose, 30°C, mating type a
• galactose, 30°C, mating type a
• glucose, 30°C, mating type α
• glucose, 30°C -> 39°C shock, mating type a

Expression Quantitation Options

1) n-dimensional cDNA or protein displays
2) Computer-selected oligomer arrays, by photolithographic or piezoelectric deposition
3) Gridded microarrays from clones
4) Counting 13-bp cDNA tags (SAGE) (20,000 tags means <800 RNAs have S/N>4)

Lockhart, et al. (1997) Nature Biotechnology 15:1359.
DeRisi, et al. (1997) Science 278:680.
Velculescu, et al. (1997) Cell 88:243.

Galactose Regulatory Network
Figure labels: GAL4, GAL80, Gal4p-Gal80p inactive complex, galactose, Gal1p, Gal3p, GAL3, Gal4p-Gal80p active complex, an unresolved link marked "?", and the structural genes for galactose metabolism (GCY1, PGM2, MEL1, GAL2, GAL7, GAL10, GAL1).

GAL3: Fold Change in Expression between Growth in Galactose and Growth in Glucose (Median Fold Change is 3.1)
Figure: fold change (0 to 30) plotted against probe number (1 to 19).

Relative expression of all genes: Galactose vs.
Glucose
Figure: histogram of number of genes against log of fold change (about -2.0 to 2.0). The accompanying table lists, per ORF, the columns orf ID/gene, chip, # probes, med FC, cons FC, expr ratio, and log expr ratio. The most induced entries begin with YBR020w/GAL1, YBR018c/GAL7, YBR019c/GAL10, YDR345c/HXT3, YOR120W/GCY1, and YLR081w/GAL2.

To analyze the most induced genes, we...

• Extracted the intergenic DNA sequence upstream of each translation start using the Saccharomyces Genome Database.
• Used an algorithm for multiple sequence alignment to look for sequence motifs conserved among the most induced (or repressed) genes.
• Looked at the intersection of genes which both matched a conserved motif and were induced (or repressed).

Gibbs Motif Sampling Strategy

1 Initialize the alignment by choosing a random subset of all possible sites as the 'site' alignment, and use all remaining sequences to give a 'non-site' alignment.
2 Select a potential site from among all possible sites.
3 If the site is in the alignment, take it out.
4 Calculate the relative likelihood that the potential site belongs with the 'site' alignment rather than the 'non-site' alignment, based on a Bayesian multinomial distribution model.
5 Randomly choose whether or not to add the site, weighted by this relative likelihood.
6 Repeat from Step 2.

'DNAGibbs': A Modified Gibbs Motif Sampler Optimized for DNA searches

• Either the forward or the reverse strand of a potential site, but not both, may be added to the alignment.
• The near-optimum sampling method was improved so that it is faster and tends to result in higher scoring alignments.
• Simultaneous multiple motif searching was replaced with a more efficient iterative masking approach.
• The model for base frequencies of non-site sequence was fixed using the average nucleotide frequencies of S. cerevisiae.
• Now runs on DEC Unix and Windows platforms, in addition to the formerly supported SGI and Sun Unix platforms.

Finally, exclude motifs with:

• DNAGibbs (maximum log a posteriori likelihood ratio) scores less than 5.
• Good matches (Z < 3 sd below the mean of the aligned positive motifs) with greater than 10% of all yeast genes (ORFs).

*O.G. Berg & P.H. von Hippel, J. Mol. Biol., 193: 723-750 (1987)

Using the top 10 genes induced in galactose, DNAGibbs found UASG, the site recognized by Gal4p
Figure: sequence logo, y-axis Information (Bits).
Previous UASG consensus: CGYTCGGA-GA-AGT---CCGA
Sequence logos were developed by T.D. Schneider & R.M. Stephens, Nucleic Acids Res., 18: 6097-6100 (1990).

Genes that changed between galactose and glucose by more than 2-fold and have strong matches to the UASG motif

Gene       Fold Change   Best Z-Score   # of Sites
GAL1       >65           -1.4           5
GAL7       >42           -0.7           2
GAL10      >38           -1.4           5
GCY1       >12           0.5            1
GAL2       >8            0.4            4
YPL066W    >6            -1.1           1
YPL067C    >6            -1.1           1
YMR318C    4             1.1            1
GAL3       >3            2              2

Galactose Regulatory Network
Figure labels: GAL4, GAL80, Gal4p-Gal80p inactive complex, galactose, Gal1p, Gal3p, GAL3, Gal4p-Gal80p active complex, the structural genes for galactose metabolism (GCY1, PGM2, MEL1, GAL2, GAL7, GAL10, GAL1), and the new candidates YPL067C, YPL066W, and YMR318C attached by links marked "?".

DNAGibbs and mating type

Motif            Score   %ORF   Consensus                    Similarity
mtα-1 (A)        8.9     0.11   ttcctarttng                  P Box
mta-1 (B)        8.5     0.05   anwncwnkmaananantcwtbwtnw
mta-2 (C)        5.0     0.10   aaaycawmawnanwa
mta-3 (D)        28.1    0.31   grnawktacayg                 α2-bind, mtα-mta-1
mtα-mta-1 (E)    20.7    0.34   crtgtanntwyc                 α2-bind, mta-3
mtα-mta-2 (F)    5.3     0.13   kwtnywnnnknnntgtttsa         PRE, mtα-mta-2
mtα-mta-3 (G)    8.6     0.27   tgamaywwtnaama               PRE, mtα-mta-1
mtα-mta-4 (H)    5.3     0.31   rmtgmcngcma                  Q Box

Expected consensus   Sequence            DNABP
P Box                tttcctaattaggnan    Mcm1p
Q Box                tcaatgacag          Matα1p
α2-bind              crtgtaawt           Matα2p
PRE                  tgaaaca             Ste12p

Consensus ref: Herskowitz, et al., in Gene Expression, E. W. Jones, et al., Eds. (CSHL Press, NY, 1992), vol. 2: pp. 583-656.

Figure labels (rows and columns of the matrix comparison): rpoN, cysB, melR, rpoE, flhCD, hipB, tus, araC, rpoH13, ilvY, rpoH14, marR, lacI, carP, deoR, ada, cynR, fhlA, iclR, rhaS, ntrC, galR, fnr, gcvA, lexA, pdhR, arcA, purR, fadR, nagC, torR, cspA, fruR, phoB, metJ, fur, cytR, argR, tyrR, metR, oxyR, ihf, soxS, trpR, glpR, farR, narL, fis, dnaA, crp, rpoS, malT, rpoD19, lrp, hns, ompR, rpoD18, rpoD16, rpoD17, rpoD15

Calibration of 60 E.
coli binding site matrices
Figure: x-axis Z-score, 0 to 10.

Interaction Quantitation Options

Over-expression:
• Yeast two-hybrid screens (in vivo complexity)
• In vitro chip assays (Martha Bulyk, David Lockhart, Erik Gentalen)

Natural levels, environmental regulation:
• Subcellular fractionation (unstable)
• In vivo footprinting (partners unknown)
• In vivo crosslinking (Martin Steffen, Andy Link)

Combinatorial ds-DNA Chips (chemical, photo & enzymatic synthesis)
Figure: single-stranded oligonucleotides attached 3' to SiO2 through a spacer, built by masked n-mer synthesis; a primer is annealed and extended by polymerase to give double-stranded features, for example a specific 16-mer containing AACCGG.

Isolate in vivo crosslinked complexes

• by nucleic acid: CsCl (or hybridization)
• by protein: epitope tag
• analyze protein by DNase, 2D gel, trypsin-LC-ESI-MS/MS
• analyze DNA/RNA by chip

Link et al. (1997) Electrophoresis 18:1259 & 1314

Rich media log-phase, in vivo crosslink, DNaseI digest
Figure: 2D gel with axes pH (4 to 7) and kdal (10 to 100). Labeled spots: grpE, lacI, sspA, efp, ssb, dps, fur, hns, ihfB, purE.

In vivo crosslinking & footprinting summary

• 11% of the E. coli genome is non-coding.
• About 340 / 4328 proteins are likely DNA-binding proteins (2 of the top 380 proteins).
• 24/25 footprinted GATC sites are non-coding. Odds = 10^-27.
• 2/3 crosslinked DNA molecules are likely regulatory binding sites. Odds = 0.04.
• 8/11 top DNA-crosslinked proteins are known DNA-binding proteins. Odds = 10^-16.

Thoughts on chips for crosslinked epitope selections (& generally)

An easy 10-fold enrichment against 40,000 fragments means an expensive 1:4000 Signal:Noise if sequencing (or SAGE) were used. However, spread over a chip, 1:10.

E. coli oligonucleotide chip challenges:

#1) Closely spaced transcripts, e.g.
carAB (intergenic 25-mers overlap; starts are 6 bp apart on average):
P1(pyrimidine) ... 48 bp ... P2(arginine)
gggtaagcaaatttgcattgcttcatactgactgaatgaattaatatgcaaataaagtg

#2) Repeats, e.g. tufA & tufB DNA.
Figure: alignment of tufA and tufB with mismatches marked by *; matching positions vastly outnumber the few marked mismatches.

From Genome Sequences to Regulatory Network Phenotypes: Summary

Expression: Cell-type & condition clustering plus the DNAGibbs algorithm extracts intergenic binding motifs for yeast Gal-Glc, Matα-Mata, & 30°C-39°C comparisons.

Interaction: Strong enrichment for low-abundance wild-type & mutant in vivo E. coli DNA-protein contacts establishes mechanistically anchored intergenic elements.

Growth: Multiplex competitive growth of in-frame replacements for novel E. coli regulatory genes defines cellular system integration & environments.
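The Gibbs motif sampling strategy outlined in the deck (initialize a random site alignment, pick a candidate site, take it out if it is aligned, score it against the site and non-site models, then re-add it at random in proportion to that score) can be sketched as follows. This is a simplified illustration and not DNAGibbs itself: the function names, the `prior_odds` parameter, the pseudocount, and the uniform background are assumptions made for the example, and the sketch scans one strand only and allows overlapping sites.

```python
import math
import random

BASES = "ACGT"


def site_log_ratio(site, counts, n_sites, background, pseudo=0.5):
    """Log likelihood ratio of a candidate site: motif model vs. background."""
    total = n_sites + 4 * pseudo
    score = 0.0
    for col, base in enumerate(site):
        p_motif = (counts[col][base] + pseudo) / total
        score += math.log(p_motif / background[base])
    return score


def gibbs_site_sampler(seqs, width, n_iter=2000, prior_odds=0.05, seed=0):
    """Return (sorted aligned sites, per-column base counts)."""
    rng = random.Random(seed)
    # Assumption: uniform background; DNAGibbs fixes it to yeast averages.
    background = {b: 0.25 for b in BASES}
    candidates = [(i, j) for i, s in enumerate(seqs)
                  for j in range(len(s) - width + 1)]
    counts = [{b: 0 for b in BASES} for _ in range(width)]

    def update(key, delta):
        i, j = key
        for col, base in enumerate(seqs[i][j:j + width]):
            counts[col][base] += delta

    # Step 1: a random subset of all possible sites starts the alignment.
    aligned = set(rng.sample(candidates, max(1, len(candidates) // 20)))
    for key in aligned:
        update(key, +1)

    for _ in range(n_iter):
        key = rng.choice(candidates)           # Step 2: pick a potential site
        if key in aligned:                     # Step 3: take it out if present
            aligned.remove(key)
            update(key, -1)
        i, j = key
        llr = site_log_ratio(seqs[i][j:j + width], counts,
                             len(aligned), background)   # Step 4
        odds = prior_odds * math.exp(llr)
        if rng.random() < odds / (1.0 + odds):  # Step 5: weighted re-add
            aligned.add(key)
            update(key, +1)
    return sorted(aligned), counts


# Hypothetical toy input: three short sequences sharing a CGGA-like core.
toy = ["ACGTCGGATTACA", "TTCGGACCGTA", "GGCGGATTTACG"]
sites, column_counts = gibbs_site_sampler(toy, width=5, n_iter=500, seed=1)
print(sites)
```

The sampler keeps the column counts in step with the alignment so that each candidate is scored against a model that excludes itself, which is the point of removing the site in Step 3 before scoring it in Step 4.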
Escherichia coli & Saccharomyces cerevisiae Regulatory and Metabolic Networks
Population Selection, Flux Balance, & Gibbs