The Biology, Technology and Statistical Modeling of High-throughput Genomics Data
Naomi Altman
Dept. of Statistics, Penn State U.
May 25, 2010

New Tools for Cell Biology
Biology has gone from "data poor" to "data rich" practically overnight.
4 technologies are driving this:
• biological tools: microarrays, sequencing
• informatics tools: computational tools, internet data sharing
There are lots of opportunities for statistical input due to rapidly evolving technology.

Outline
Focus today: Methods for measuring DNA and RNA
Biology - What are we measuring? Why are we measuring it?
multiple objectives:
• characterize organism
• understand a particular process (e.g. tumor growth)
• understand development
• understand disease
• infer evolutionary history
• characterize a sample of mixed organisms
Technology - How are we measuring? What are the sources of bias and variance? What are we sharing?
Statistics - A few problems of great interest.

Some Cell Biology

Biology: DNA 100, A Statistician's Simplification
Every cell has the same genetic material, stored in the double helix of DNA.
The Genome is the set of all DNA in the organism (or in the nucleus).
[Figure: http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/]
The rungs are "base pairs". Each pair consists of 2 bound nucleotides which are designated C, G, A, T.
C binds only to G. A binds only to T.
In a diploid population, most cells have 2 copies of each chromosome and so of each gene.
Cells differ because different genes are active.
The fundamental problems:
• What is the sequence of the DNA?
• Which genes are active, where, when and how?

Biology: Some Questions of Interest, Sequence Analysis
Genetic sequence: What is the genetic code for this species or strain?
The primary data are dye intensities for labels for each nt at each position. After processing, the data are stored as:
AGTCTAGGCT
There is also a quality score.
[Figure: http://stat.fsu.edu/~lilei/lilei/research/sanger-c.gif]

Genotyping
- Where do genes differ among individuals in the same species?
- What do these differences tell us about the phenotype?
- What do these differences tell us about how the species evolved?
- How do these differences evolve between species?

Copy number variation - have additional copies of the gene been inserted into the chromosomes (and where)?

Are there particular DNA sequences that have function such as:
• genes
• exons/introns
• transcription factor binding sites
• other "regulatory" regions
– RNA binding sites
– methylation sites

Metagenomics:
• Can we identify species in a mixed sample by sequencing the DNA?
• Can we recognize DNA for a target (unsequenced) species from a contaminated sample?

Biology: Gene Expression
Expression = Transcription
The DNA unzips. mRNA is created using the DNA as a template. The mRNA is processed, creating a transcript.
Protein Creation = Translation
[Figure: http://www.phschool.com/science/biology_place/biocoach/images/transcription/euovrvw.gif]
[Figure labels: transcription factors, promoter, upstream regulation, gene, downstream regulation, direction of transcription, 5' and 3' ends]
• transcription factors bind to the promoter and bind RNA polymerase
• transcription continues in the 5'-3' direction until the stop codons are reached

Biology: Transcription (DNA 100, A Statistician's Simplification)
[Figure labels: exons, introns, mRNA splice variants (isoforms) with poly-A tails, transcripts]
In protein coding genes, introns are excised from the pre-mRNA. Different combinations of exons form different splice variants. The "poly-A tail" is added and marks this as mRNA.
Some genes encode other functional RNA types. These are important entities, but will not be discussed here.
• The function of each cell is determined by which proteins it produces.
• We might be directly measuring DNA - genotyping, copy number, protein binding sites, methylation sites
• We might be measuring mRNA - gene expression, splice variant expression
There are also tools for direct measurement of proteins, but we will not discuss these here.

Biology: Questions about Transcription & Expression
• Which genes are transcribing (expressing)?
• Which proteins are initiating or obstructing transcription of which genes?
• What proteins are being transcribed (or splice variants, isoforms)?
• Where are the protein binding sites?
• How much transcription is occurring?
• Which genes are being turned off by local mechanisms (RNA binding and epigenetics: methylation, DNA coiling)?
• Which specific cells are expressing?
• Which genes co-express?
• What does gene expression tell us about tissue development?
• Do homologous genes express in the same treatments?
• Do homologous genes in different species express in the same treatments? (e.g. developmental genes)
[Figure: http://scienceblogs.com/pharyngula/upload/2006/09/septuple_hox_lg.php]
[Figure: gene co-expression networks, http://www.biomedcentral.com/14712105/8/217/figure/F4?highres=y]

Biology: Some Questions of Interest, Transcriptome Analysis
• What mechanisms are causing genes to turn on and off?
• What mechanisms cause genes to express different splice variants?
• Which proteins are regulated by transcription and which by other cell mechanisms?

Some Important Technology for Characterizing RNA and DNA

Technology: Sample Preparation, Reverse Transcription PCR
RT-PCR is used to convert RNA (chemically unstable) to complementary DNA (stable).
[Figure: panels "in the cell" and "in the test tube", with labels DNA, mRNA, primer and cDNA]
Nobel prize in chemistry (1993): KARY B. MULLIS for his invention of the polymerase chain reaction (PCR) method.
So we can use the same methods to measure DNA and RNA.

Technology: Sample Preparation, Quantitative PCR
A similar PCR reaction can also be used to quantify the amount of RNA in a sample.
This is called real-time PCR (confusingly, also abbreviated RT-PCR) or q-PCR. It is considered the gold standard for gene expression (although it also has error).
The quantification is based on curve fitting.

Technology: Sample Preparation, Chromatin Immunoprecipitation
To capture the locations at which molecules bind to a chromosome:
• Allow molecules to bind.
• Cross-link chemically to form a more stable but reversible bond.
• Attach a tag to the protein that can be captured chemically.
• Fragment DNA.
• Capture fragments bound to tag.
• Release DNA fragments.
Instead of directly measuring quantities bound to the DNA, we can use ChIP to find the DNA binding sites.
This method can also be used for other chemical modifications to the DNA such as methylation.

Technology: Measuring DNA
Because of PCR and ChIP technologies, methods for measuring DNA can be used to:
• measure DNA
• measure RNA
• find locations on chromosome where chemical events occur

Technology: Microarrays (Measurement)
A microarray is a substrate on which are attached 1000's of single strands of (c)DNA complementary to the items you wish to detect.
There can be from a few thousand to a million probes consisting of these single strands.
A labeled sample of DNA or cDNA is allowed to hybridize (attach) to the probes.
Dye intensity for each probe is summarized by a scanning microscope.
Intensity is expected to be proportional to the amount of material in the labeled sample.

Microarrays come in many formats:
• bead arrays
• GeneChip® (Affymetrix®)
• glass or plastic slides (1 or 2 channel)
Each format has strengths and weaknesses. Different preprocessing methods are required to obtain reasonably accurate quantification.
[Figure credits: Ewa Paszek, Affymetrix Chip-Basic Concepts, http://cnx.org/content/m12387/1.4/ ; http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg ; http://www.illumina.com/Images/technology/beadarray_multi_sample_array_formats_lg.gif]

The most fundamental data is a digitized photo of the array giving the label intensity.
These days most of us use intermediate probe summaries produced by the scanner.

Col  Row  Name             X     Y      Dia.  F635 Median  F635 Mean  F635 SD
1    1    Pro25G           1120  13960  120   8281         7993       1182
2    1    Pro25G           1310  13960  130   8570         8260       1373
3    1    NegativeControl  1490  13940  130   29           30         6
4    1    AT1G07480.1      1680  13960  120   372          373        51
5    1    AT2G41780.1      1870  13960  130   516          509        79
6    1    AT5G67530.1      2050  13960  120   1682         1598       325
7    1    AT3G30751.1      2250  13950  140   35           37         9

The probes are designed to detect various "items" by selecting parts of the gene or transcript to match.
mRNA
• gene expression
• exon expression
• tiling
DNA
• SNPs (biological variation)
• protein binding sites
• methylation sites
• genes (for copy number)

In general, microarrays are species specific (actually, genotype specific).
However, the same array can sometimes be used for closely related species (e.g. human/chimp).
Microarrays require known sequences to be used as probes. Genetic variation affects hybridization.

Technology: Massively Parallel Sequencing (Measurement)
The key to the genomics revolution has been the development of fast accurate DNA sequencing.
"Next generation" sequencing technologies can "read" the genomic sequence of up to millions of short fragments of DNA.
The fragments can be genomic DNA or cDNA. A priori sequence information is "not required".
• New sequencing technologies can sequence 1 - 20 million short fragments of DNA per sample.
• Some common brand names (nt = A, G, C or T)
– SOLiD 17 - 35 nt
– Illumina (Solexa) 17 - 100 nt
– 454 200 - 500 nt
Between methods - short is cheaper (per nt) than long
Within method - short is cheaper (per mRNA) than long

Technology: Massively Parallel Sequencing (Measurement)
DNA
• "de novo" sequencing
• metagenomics
• resequencing (biological variation)
• SNPs (biological variation)
• protein binding sites
• methylation sites
RNA
• gene expression
• exon expression
• non-coding RNA expression
• isoform discovery
• isoform expression
• microarray probe construction

Some data from Marioni et al, 2008
GGAAAGAAGACCCTGTTGGGATTGACTATAGGCTGG
GGAATTTAAATTTTAAGAGGACACAACAATTTAGCC
GGGCATAATAAGGAGATGAGATGATATCATTTAAGA
These are 36mers. They need to be matched to a reference to determine what gene (if any) they represent.
The file size is 11 Gb - a bit inconvenient on my Windows computer!

The most common method is "shot-gun" sequencing. Many identical strands of DNA are fragmented at random sites.
The sequence of the fragments is determined by a sequencer starting from either the 3' or 5' end of the fragment.
ACTTG--------ATCGA
ACGTT------------ACGAT
CTTAG---AATCA
Paired end sequencing uses longer fragments and sequences from both ends, with an unsequenced linker in the center. It is twice as expensive, but more informative for many purposes.

Technology: Gene Assembly (Computation)
For "de novo" sequencing, the fragments must be linked back into the genomic sequence.
The most common method is "shot-gun" sequencing.
ACTAACCTGACT
ATCGAATCGATT
CATTGCATATTG
The fragments are matched by sequence.
single end sequencing
AAGCCTATTAGGCGTA-------------------------------------
GGCGTACCTGATTAG---------------------------
assembled
AAGCCTATTAGGCGTACCTGATTAG---------------------------
or paired end sequencing
AAGCCTATT-----------------------------AGTTCCAAT
AGGTCAAGC-----------------ACCGTAAT
assembled
AGGTCAAGCCTATT-------ACCGTAAT-----AGTTCCAAT

For assembly projects:
• longer is better than shorter
• paired end is better than single
Typical assembly projects:
• de novo sequencing
• isoform detection
• resequencing (may use both the observed fragments and a "reference genome")

Technology: Gene Assembly Problems (Computation)
• errors introduced during either processing or sequencing
– sequencing is based on a signal from C, G, A or T - the nt with the strongest signal is used
– PHRED score is a measure of reliability in the sequence for each position
• genetic variation (imperfect match to reference)
• gene families (perfect matches among regions of different genes)
• not enough sequence (incomplete assembly)

Technology: Resequencing (Computation)
If there is a reference genome or transcriptome, reads are matched to the reference.
reference sequence:
AACGTTACCTGAATTGTGTGACCTAAACTGGAGATCATATCGAATGGTACCAGTAC
reads:
    TTACCTG
         TGAATTGT
                    ACCTAAAC
                     CCTAAACTG
                      CTAAACTGGA
                                        CGAATGGT

For gene expression, this is called RNA-seq.
• for gene and exon expression, if reads are mappable, more reads are better than long reads
• the number of reads falling in a region (e.g. an exon) can be used to quantify expression (figure from Mortazavi et al, 2008)
• reads spanning noncontiguous regions of the genome provide direct evidence of splicing

Typically, in resequencing studies many reads do not match the reference.
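The read-matching and read-counting steps described for resequencing can be sketched in a few lines of Python. This is only a toy illustration, not how production aligners work: it assumes exact matches and reports just the first match of each read. The reference and the first six reads are the ones shown on the resequencing slide; the read GGGGGGGG and the two regions are hypothetical, added to show an unmatched read and a per-region count.

```python
def map_reads(reference, reads):
    """Return a list of (read, start) pairs.

    start is the 0-based position of the first exact match in the
    reference, or None when the read does not occur in it.
    """
    mapped = []
    for read in reads:
        start = reference.find(read)  # first exact occurrence only
        mapped.append((read, start if start >= 0 else None))
    return mapped


def count_by_region(mapped, regions):
    """Count mapped reads whose start lies in each half-open [lo, hi) region."""
    counts = {name: 0 for name in regions}
    for _read, start in mapped:
        if start is None:
            continue  # unmatched reads contribute to no region
        for name, (lo, hi) in regions.items():
            if lo <= start < hi:
                counts[name] += 1
    return counts


# Reference and reads from the slide; the last read is a hypothetical non-match.
reference = "AACGTTACCTGAATTGTGTGACCTAAACTGGAGATCATATCGAATGGTACCAGTAC"
reads = ["TTACCTG", "TGAATTGT", "CCTAAACTG", "CGAATGGT",
         "CTAAACTGGA", "ACCTAAAC", "GGGGGGGG"]

# Hypothetical regions (stand-ins for, e.g., two exons).
regions = {"region_1": (0, 20), "region_2": (20, 56)}

positions = map_reads(reference, reads)
counts = count_by_region(positions, regions)
unmapped = [read for read, start in positions if start is None]

print(positions)
print(counts)
print(unmapped)
```

Real data would also need to handle sequencing errors, genetic variation and reads that match several locations, which is where the quality scores and the statistical problems discussed later come in.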
RNA-seq reads are too short to provide direct splice variant information.

Technology: The Internet (Computation)
A large percentage of all the microarray and sequencing data collected in academic research settings is freely available on the internet along with:
• documentation
• documentation tools
• sequence matching tools
• statistical analysis tools
• visualization tools
• tools to organize the tools

Some examples of what is out there:
• NCBI website: reference genomes, transcriptomes, microarray data, sequencing data, analysis tools ....
• Gene Ontology (GO) Database: standardized vocabulary to annotate genes for many species
• Kyoto Encyclopedia of Genes and Genomes (KEGG): diagrams of known gene networks, genomic information, software tools for network analysis
• Bioconductor: hundreds of R packages for bioinformatics work along with useful databases and tools to download information from, e.g. NCBI, GO and KEGG
• GALAXY: a data and software management system to keep track of analyses
• UCSC Genome Browser: a visualization tool

NCBI Gene Expression Omnibus: 433,240 samples including 7342 platforms.
There is another database for "short read" data.
KEGG has a variety of resources including scanned network diagrams.

Statistics: Statistical Analysis
• normalization
• differential expression
• cross-platform analysis
• combining information
• utilizing known error structure
• eQTL and high-dimensional response problems
• expression networks
• peak finding
• metagenomics
• isoform expression
• from genome to phenome

Statistics: Normalization
Main focus: Remove some of the sample-specific noise to improve signal detection.

Well-studied problem (but improvement still on-going)
Gene Expression Microarrays
• platform-specific methods abound
• work well within study when all arrays should have the same mean expression level (averaged over genes)
Other types of microarrays
• less well studied

New problems
RNA-seq normalization
• the hidden problem: low quality data not returned to the user
• quality scores
• total reads versus mappable reads
• other features depending on sample preparation
Within-platform cross-study batch effects
• lots of studies use the same type of microarray but large nonlinear batch effects are evident
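To make the assumption "all arrays should have the same mean expression level (averaged over genes)" concrete, here is a minimal Python sketch of the simplest normalization built on it: shift each array's log2 intensities so that every array has the same mean. This is an illustration only, not one of the platform-specific methods referred to above, and the array names and intensity values are hypothetical.

```python
def center_arrays(log_intensities):
    """Shift each array so its mean log-intensity equals the grand mean.

    log_intensities maps an array name to a list of log2 intensities,
    one value per gene, with genes in the same order on every array.
    """
    array_means = {
        name: sum(values) / len(values)
        for name, values in log_intensities.items()
    }
    grand_mean = sum(array_means.values()) / len(array_means)
    return {
        name: [v - array_means[name] + grand_mean for v in values]
        for name, values in log_intensities.items()
    }


# Hypothetical data: array_2 is uniformly brighter by 1 unit on the log2 scale,
# i.e. a purely sample-specific effect with no biological difference.
raw = {
    "array_1": [8.0, 10.0, 12.0],
    "array_2": [9.0, 11.0, 13.0],
}

normalized = center_arrays(raw)
print(normalized)
```

After centering, the two arrays agree gene by gene. When the assumption fails (for example, when a treatment changes the expression of most genes), this kind of adjustment would remove real signal along with the noise.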
Statistics: Normalization, New problems (continued)
Cross-platform effects
• different arrays, RNA-seq and qPCR have been applied to the same samples with different results
• we should be able to correct for this so that we can do combined inference

Statistics: Differential Expression
Main focus: Determine if there are treatment effects

Well-studied problem (but improvements are possible)
Gene expression microarrays
• ANOVA-type analysis of means
• Bayes, empirical Bayes and shrinkage methods are often used
• Normal theory tests, permutation tests and bootstrapping are commonly used
• FWER and FDR control are commonly applied to account for the very large number of comparisons

Studied but less thoroughly
microarrays
• ANOVA-type problems with complex designs
• other types of differential measurement
RNA-seq
• 2-sample tests assuming multinomial or Poisson distributions

New problems
RNA-seq
• treatments with no expression
• extra-Poisson or Binomial variation
• Bayes and friends

Statistics: Sequencing Error
Main focus: Account for sequence error as an integral part of the analysis
New problems
Massively parallel sequencing
• account for quality scores during other analyses

Statistics: High-Dimensional Predictors and Response
Main focus: Predict association between 2 types of "omics" data
Examples
• eQTLs: find locations on chromosomes which are associated with gene expression (for every gene, for every location)
• methylation: find methylation sites associated with gene expression
• GWAS: find genes associated with multiple phenotypes (e.g. disease, growth patterns, etc)

Statistics: Metagenomics
Main focus: Determine what organisms are in an environmental sample
Examples
• environmental sampling: scoop up sea water in oil spill area, extract DNA, and try to document all organisms in the water (including unknown organisms)
• ancient DNA: extract all DNA from mammoth hair and separate into mammoth and contaminants

Statistics: Expression Networks
Main focus: Understand how genes work together
Expt. Design, Network Analysis, PDEs
Inputs:
• protein binding
• gene expression
Experiments:
• time course
• gene knock-out, silencing or down-regulation
• gene recovery, enhancing or up-regulation

Statistics: Peak Finding
Main focus: Understand "interesting" locations on chromosome
Example
Where are the binding sites for protein X? Peak finding.
[Figure labels: site, chromosome]

Statistics: Isoform Expression
Main focus: Identify and quantify splice variant expression
Example

isoform fragments  ex 1  ex 2  ex 3  ex 4  ex 5  ...  ex K  isoform count
iso1                                                        n+1
iso2                                                        n+2
:                                                           :
isoI                                                        n+I
read total         T1+   T2+   T3+   T4+   T5+   ...  TK+   N

We observe the number of reads in each exon. We want to infer the number of reads originating from each isoform. We may not know all the exons or isoforms.

Statistics: From Genome to Phenome
Main focus: How do all the pieces fit together to form complex traits?
Combining data from many sources
Genotype affects binding affects expression.
Epigenetics affects expression.
Genotype affects epigenetics.
How do we model the differing types of data together to understand biology on the cellular and larger level?
How can we validate the models?

The Challenge Ahead
Genomics data are challenging because
• the scientific questions are deep
• the possible impact on human life is great (cancer, crops, ...)
• analysis of the data requires at least some knowledge of a very rapidly changing technology
• analysis of the data requires at least some knowledge of a rapidly evolving web-based knowledge repository (some of which has self-replicating errors)
• the jargon is impenetrable (and biologists take for granted that you know what they are talking about)
• the amounts of data are daunting (e.g.
RNA-seq is 20 million short sequences per sample)
• the data are very noisy with noise depending on the technology
• p>>n always (and p is growing while n is not)

The Challenge Ahead
It is worth the effort of learning genomics because
• the scientific questions are deep
• the possible impact on human life is great (cancer, crops, ...)
• the rapid rate of technological change ensures that there will be new statistical problems for many years to come
• huge amounts of data are available
• the work is inherently collaborative
We are definitely in the race.
[Figure labels: molecular biology, computer science, statistics]
Understanding bias, variance and the need to validate slows us down, but we get to the right nest!

with many thanks to:
The Huck Institute of Life Sciences (Penn State)
my biology collaborators at PSU who patiently taught me:
• dePamphilis Lab
• Federoff Lab
• Ma Lab
• Pugh Lab
• Vandenberg Lab
• Baums Lab
• McSteen Lab
and the students who waded through the material with me:
• Statistical Genomics Journal Club
• Bioinformatics II - Microarrays
• Readings in Statistical Genomics

For the Seriously Interested
Bruce Alberts, Molecular Biology of the Cell: Reference Edition, Garland Science (2008), hardback, 1601 pages