Array quantitation for modeling mutations affecting RNA, protein interactions & cell proliferation

CHI Macroresults through Microarrays 3
George Church, 1-May-02

Thanks to the Lipper Center for Computational Genetics
Government and private grant agencies: NHLBI, NSF, ONR, DOE, DARPA, HHMI, Armenise
Corporate collaborators & sponsors: Affymetrix, GTC, Mosaic, Aventis, Dupont, Cistran

Post-300 genomes & 3D structures
[Figure: nucleotide sequence fragments: gggatttagctcagtt gggagagcgccagact gaa gat / ttg gag gtcctgtgttcgatcc acagaattcgcacca]

Biosystems Measures & Models
[Diagram labels: Environment; Metabolites; RNAi; Insertions; SNPs; DNA; RNA; Replication rate; Protein: in vivo & in vitro interactions; Microbes; Cancer & stem cells; Darwinian; In vitro replication; Small multicellular organisms]

Functional Genomics Challenges
• Systems dynamics and optimality modeling.
• Multiple genetic domains per gene: high density readout of whole genome mutant phenotypes.
• Multiple RNAs & regulatory proteins per gene.
• Many causative genes & haplotypes per disease.
• Polony RNA exon-typing
• Multiplex in situ RNA & protein analyses
• Automated differentiation
• Homologous recombination genome engineering

Human Red Blood Cell ODE model
200 measured parameters
[Figure: metabolic network diagram; node labels include ADP, ATP, AMP, IMP, FDP, DHAP, PEP, GL6P, GO6P, G6P, F6P, GA3P, E4P, RU5P, X5P, S7P, R5P, R1P, PRPP, NAD, NADH, NADP, NADPH, GSH, GSSG, GLCi, GLCe, LACi, PYR, 2PG, 3PG, 1,3 DPG, 2,3 DPG, ADO, ADOe, ADE, INO, INOe, HYPX, HCO3-, K+, Na+]
Jamshidi, Edwards, Fahland, Church, Palsson, B.O. (2001) Bioinformatics 17: 286.
(http://atlas.med.harvard.edu/gmc/rbc.html)

Modeling suboptimality: Segre, Edwards, Vitkup

Calculated and Observed Fluxes in wt
[Figure: FBA fluxes comparison with Sauer data; calculated LP flux (wt) plotted against observed fluxes (Sauer wild type); wild type, C 0.4-limited; CC = 0.97; numbered data points; both axes run from 0 to 200]

Replication rate of a whole-genome set of mutants
Badarinarayana, et al. (2001) Nature Biotech. 19: 1060

Replication rate challenge met: multiple homologous domains
[Figure: selective disadvantage in minimal media for probes 1-3 within the homologous genes thrA, metL, and lysC; values shown: 1.1, 6.7, 1.8, 1.8, 10.4]

Multiple mutations per gene
Correlation between two selection experiments
Badarinarayana, et al. (2001) Nature Biotech. 19: 1060

Comparison of selection data with Flux Balance Optimization predictions on 488 genes

Prediction             Number of genes   Negatively selected   Not negatively selected
Essential              143               80                    63
Reduced growth rate    46                24                    22
Non-essential          299               119                   180

P-value (Chi Square) = 0.004
Position effects, toxin accumulation, non-opt?
Novel duplicates?

Biosystems Measures & Models

RNA quantitation issues
Small fold changes in RNA are important. Example: 1.5-fold in trisomies.
Cross-hybridizing RNAs.
Alternative RNAs, gene families.
Mixed tissues.
In situ hybridization has low multiplex.

Gene Expression database
Aach, Rindone, Church (2000) Genome Research 10: 431-445.
• Microarrays (1): R/G ratios; R, G values; quality indicators [figure labels: ORF, experiment, control]
• Affymetrix (2): averaged PM-MM; "presence"; feature statistics; 25-mers [figure labels: ORF, PM, MM]
• Lynx-MPSS (3), SAGE (4): counts of 14-mer sequence tags for each ORF [example tag: agactagcag]
1 DeRisi, et al., Science 278:680-686 (1997)
2 Lockhart, et al., Nat Biotech 14:1675-1680 (1996)
3 Brenner et al., Massively Parallel Signature Sequencing, Nat Biotechnol. 18:630-4 (2000)
4 Velculescu, et al., Serial Analysis of Gene Expression, Science 270:484-487 (1995)

RNA Cluster Analyses: Cell Cycle
[Figure: expression profile of a cluster (N = 186 ORFs; y-axis in s.d. from mean), labeled Replication & DNA synthesis (2); MIPS functional category (total ORFs) panel with ORFs within functional category (k) and -log10 P-value for DNA synthesis and replication (82), Cell cycle control and mitosis (312), Recombination and DNA repair (84), and Nuclear organization (720); MCB and SCB motif panels showing number of sites against distance from ATG (b.p.) and number of ORFs per cluster for clusters 1-30]
Tavazoie, et al. 1999 Nature Genetics 22:281.

Combining mouse knockouts with RNA array analysis (homeobox gene Crx-/-)
Livesey, Furukawa, Steffen, Church, Cepko (2000) Current Biol. 10:301.
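
The chi-square P-value quoted for the 488-gene comparison of selection data with Flux Balance Optimization predictions can be checked from the counts on that slide. The following is a minimal sketch, assuming a standard Pearson chi-square test of independence on the 3 x 2 table (prediction class against negatively selected or not); the slides do not say which test variant was used, and the names `observed` and `pearson_chi_square` are illustrative only. For 2 degrees of freedom the chi-square survival function is exactly exp(-x/2), so no statistics library is needed.

```python
import math

# Counts from the slide: (negatively selected, not negatively selected)
observed = {
    "essential": (80, 63),
    "reduced growth rate": (24, 22),
    "non-essential": (119, 180),
}

def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            # Expected count if prediction class and selection were independent
            expected = row_total * col_total / grand_total
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, dof

chi2, dof = pearson_chi_square(observed)

# Assumption: the closed form below is valid only because dof == 2
assert dof == 2
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p_value:.4f}")
```

Under these assumptions the statistic is about 11.0 and the P-value rounds to 0.004, in line with the value on the slide.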
Biosystems Measures & Models

Combinatorial arrays for binding constants
Human/Mouse EGR1
HMS: Martha Bulyk, Xiaohua Wang, Martin Steffen
MRC: Yen Choo
[Diagram labels: ds-DNA array; Phage; Combinatorial DNA-binding protein domains; pIII; pVIII; Antibodies; Phycoerythrin - 2º IgG]
Martha Bulyk et al.

Interactions of Adjacent Basepairs in EGR1 Zinc Finger DNA Recognition
Isalan et al., Biochemistry ('98) 37:12026-12033

Wildtype EGR1 Microarray
[Figure labels: high [DNA]; (+) ctrl sequence for wt binding; etc.; alignment oligos]

[Table, row alignment as extracted: finger sequences Wildtype RSDHLTT, RGPDLAR, REDVLIR, LRHNLET, KASNLVS; columns Motifs (weight, all 64) and Ka(app); entries TGG 2.8 nM; GCG 16 nM; 2.5 nM; TAT 5.7 nM; AAA, AAT, ACT, AGA, AGC, AGT, CAT, CCT, CGA, CTT, TTC, TTT; AAT 240 nM]

Biosystems Measures & Models

Common diseases: billions of "new" alleles plus millions of balanced polymorphisms
• 60 new mutations per generation * 5,000 generations since the major bottleneck(s) which set up the linkage patterns (= 300,000 per genome)
• Each of the 3 Gbp in the genome exists in all SNP forms (A, C, G, T, D): 600,000 of each SNP on earth (spread over the common haplotypes). The population frequency will be <0.01%. (Aach et al, 2001 Nature 409: 856)
• Functional genomics (FG) may provide better leads for therapies & diagnostics. (Accuracy goal 1 ppb?)

Projected costs affect our view of what is possible.
In 1985, the dawn of the genome project, $10 per bp would have been $30B per genome.
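
The back-of-envelope figures on the two slides above can be reproduced in a few lines. This is only a sketch of the arithmetic: the mutation rate, generation count, genome size, SNP copy number, and 1985 price come from the slides, while the world population of about 6 billion is an assumption introduced here to relate 600,000 copies to the quoted frequency bound.

```python
# Figures taken from the slides
NEW_MUTATIONS_PER_GENERATION = 60
GENERATIONS_SINCE_BOTTLENECK = 5_000
GENOME_SIZE_BP = 3_000_000_000
COPIES_OF_EACH_SNP_ON_EARTH = 600_000
COST_PER_BP_1985_USD = 10

# Assumption (not on the slides): roughly 6 billion people, two genome copies each
WORLD_POPULATION = 6_000_000_000
GENOME_COPIES = 2 * WORLD_POPULATION

# New alleles accumulated per genome lineage since the bottleneck
new_alleles_per_genome = NEW_MUTATIONS_PER_GENERATION * GENERATIONS_SINCE_BOTTLENECK

# Frequency of any one SNP form, per person and per genome copy, in percent
freq_per_person_pct = 100 * COPIES_OF_EACH_SNP_ON_EARTH / WORLD_POPULATION
freq_per_copy_pct = 100 * COPIES_OF_EACH_SNP_ON_EARTH / GENOME_COPIES

# Cost of one genome at the 1985 price
genome_cost_1985_usd = COST_PER_BP_1985_USD * GENOME_SIZE_BP

print(f"new alleles per genome: {new_alleles_per_genome:,}")
print(f"SNP frequency: {freq_per_person_pct:.3f}% per person, "
      f"{freq_per_copy_pct:.3f}% per genome copy")
print(f"1985 genome cost: ${genome_cost_1985_usd / 1e9:.0f}B")
```

With the assumed population, the per-person figure sits at the 0.01% bound and the per-copy figure falls below it.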
In 2002, Perlegen or Lynx: $3M (10^3 bits/$, 4 logs)
In 2001, the cost of video data collection? 10^13 bits/$
Genotyping & functional genomics demand will probably be as high as permitted by costs.

Why lower-cost, high quality "sequencing"?
Environmental, food, & biodiversity monitoring
Human genome haplotyping
RNA splicing & editing
Immune B & T cell receptor spectra

& How?
Femtoliter (10^-15) scale & low-cost scanners
Polymerase DNA colonies (polonies)
Fluorescent in situ sequencing (FISSEQ)
Mitra & Church Nucleic Acids Res. 27: e34

Polymerase DNA colonies (polonies)
Primer A has 5' immobilizing (Acrydite) modification.
Single Molecule From Library
Primer is Extended by Polymerase
1st Round of PCR
[Figure: template strands with primer sites A' and B through the first round of PCR]

Sequence polonies by sequential, fluorescent single-base extensions
1. Remove 1 strand of DNA.
2. Hybridize Universal Primer.
3. Add Red (Cy3) dTTP.
4. Wash; Scan Red Channel
5. Add Green (FITC) dCTP
6. Wash; Scan Green Channel
[Figure: primer B' annealed to site B on two templates, before and after each base addition]

Primer Extension
26 cycles, 34 Nucleotides
Mean Intensity: 58, 0.5; 40, 6.5; 0.3, 48; 0.4, 43 [channel labels: FITC (C), CY3 (T)]
Polony Template: 3' P' TATTGTTAAAGTGTGTCCTTTGTCGATACTGGTA…5'
Extended primer: 5' P ATAACAATTTCACACAGGAAACAGCTATGACCAT

RNA Exon typing
• Single molecules of RNA dispersed.
• Multiplex polonies spanning all likely variable exons
• Sequential probing of each exon.

Functional Genomics Challenges
• Systems dynamics and optimality modeling.
• Multiple genetic domains per gene: high density readout of whole genome mutant phenotypes.
• Multiple RNAs & regulatory proteins per gene.
• Many causative genes & haplotypes per disease.
• Polony RNA exon-typing
• Multiplex in situ RNA & protein analyses
• Automated differentiation
• Homologous recombination genome engineering

For more information: arep.med.harvard.edu
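
The primer-extension slide quotes 26 cycles for 34 nucleotides. One reading that reproduces both numbers is sketched below, under the assumption that each cycle supplies a single dNTP and the polymerase extends through any homopolymer run in that cycle, so the cycle count equals the number of runs in the newly made strand. The slides do not state this rule explicitly, and the function name `extension_cycles` is illustrative.

```python
from itertools import groupby

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Polony template as printed on the slide, read 3' -> 5'
TEMPLATE_3_TO_5 = "TATTGTTAAAGTGTGTCCTTTGTCGATACTGGTA"

def extension_cycles(template_3_to_5):
    """Return one (base, run_length) pair per cycle for the strand grown 5' -> 3'.

    Assumption: one dNTP is added per cycle and a homopolymer run is
    filled in completely within a single cycle.
    """
    # The template is given 3' -> 5', so the new strand (5' -> 3') pairs
    # position by position and needs complementing but not reversing.
    new_strand = "".join(COMPLEMENT[base] for base in template_3_to_5)
    return [(base, len(list(run))) for base, run in groupby(new_strand)]

cycles = extension_cycles(TEMPLATE_3_TO_5)
n_cycles = len(cycles)
n_nucleotides = sum(length for _, length in cycles)

print(n_cycles, n_nucleotides)  # prints "26 34"
```

Under that assumption the template on the slide yields 26 single-dNTP cycles covering 34 nucleotides, and the complemented strand matches the extended primer sequence printed beneath the template.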