‘Omics’ - Analysis of High-Dimensional Data
Achim Tresch
Computational Biology

Schedule: Monday

Lecture: Introduction to Omics
Motivation: Transcriptomics
• Experimental techniques
• Data analysis (overview)
Data exploration of univariate data
• Measures of location and scale
• Bar plot, box plot, histogram, density plot
Data exploration of bivariate data
• Odds ratio, correlation
• Cross table, scatter plot, QQ-plot
Exercises: Introduction to R and Bioconductor
Forensic bioinformatics

Sebastian Dümcke, Arijit Das, Henrik Failmezger

Omics

Omics (Wikipedia): Omics informally refers to a field of study in biology such as genomics, proteomics or metabolomics. Omics aims at the collective characterization and quantification of pools of biologically / biochemically similar molecules that translate into the structure, function, and dynamics of an organism or organisms.

Ingredients for omics research:
- High-throughput experimental techniques for the simultaneous measurement of large numbers of molecules
- Statistical methods for the appropriate analysis of high-dimensional data

Generally, omics data analysis takes longer than data generation!

Genomics: Transcriptomics

Techniques for RNA quantification
Low to medium throughput:
- Northern blot
- Reporter genes
- Reverse transcriptase PCR (RT-PCR)
High throughput:
- Microarrays
- RNA sequencing

Northern Blot

RNA (or DNA) is separated by size on a gel, transferred to a membrane, and hybridized with a gene-specific probe.
RNA -> Northern blot
DNA -> Southern blot
Low throughput and poor quantification
Molecular Biology of the Cell (© Garland Science 2008)

RT-PCR

RNA -> (reverse transcription) -> DNA -> PCR
The course of the PCR (amount of double-stranded DNA) is monitored using a specific fluorescent dye.
Differences in the concentration of a particular mRNA between samples can be calculated as 2^N, where N is the difference in the number of cycles needed to obtain the same amount of product.
Medium throughput, high precision
Molecular Biology of the Cell (© Garland Science 2008)

Microarrays

mRNA is converted to cDNA, labeled, and subsequently hybridized to an array of gene-specific probes (either spotted cDNA samples or oligonucleotides, with either one or two samples per microarray).
Differences in expression between samples are determined as the ratio of fluorescence signals at individual spots.
High throughput, medium precision (low dynamic range)
Molecular Biology of the Cell (© Garland Science 2008)

Next generation sequencing (NGS)

- Massively parallel sequencing techniques enable sequencing of genome-wide cellular RNA pools.
- Typical sequencing read length is 30-100 nucleotides, so RNA or cDNA has to be fragmented.
- A single run comprises 10^6-10^8 reactions, depending on the platform, so most RNAs are covered by multiple “reads”.
- The read occurrence for a particular gene reflects its expression level.
High throughput, precision depends on sequencing depth (number of reads)
Zyklusvorlesung Molekularbiologie WS 2009/10

Next generation sequencing (NGS): Illumina (Solexa) sequencing

DNA fragments are coupled to a glass slide and subjected to bridge amplification.
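
As a minimal sketch of the statement that the read occurrence for a gene reflects its expression level, and that the result depends on sequencing depth, the Python snippet below converts raw read counts per gene into reads per million sequenced reads. The function name, the gene names and the counts are hypothetical. Scaling to one million reads is a common convention assumed here for illustration; it is not a method taken from the slides.

```python
def reads_per_million(read_counts):
    """Scale raw read counts per gene to reads per million sequenced reads.

    read_counts: dict mapping a gene name to its raw read count
    (hypothetical input format, assumed for this sketch).
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads in this sample")
    # Dividing by the total removes the dependence on sequencing depth.
    return {gene: 1e6 * n / total for gene, n in read_counts.items()}


# Two hypothetical runs of the same sample at different sequencing depths.
shallow_run = {"geneA": 500, "geneB": 1500, "geneC": 8000}
deep_run = {"geneA": 5000, "geneB": 15000, "geneC": 80000}

rpm_shallow = reads_per_million(shallow_run)
rpm_deep = reads_per_million(deep_run)
```

In this toy case both runs give the same scaled values, because only the depth differs. In real data the deeper run would estimate the values for rarely sequenced genes more precisely.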
10^6-10^8 individual reads of 30-100 bp are produced at a time by using fluorescently labeled, removable terminator tags.
[Figure: sample preparation and sequencing]

Transcriptomics with Microarrays

Workflow of a microarray experiment

Experimental design (days)
- Frame a biological question
- Choose a microarray platform
- Decide on biological and technical replicates
- Design the series of hybridizations

Technical performance (weeks)
- Obtain the samples
- Isolate total RNA
- Label cDNA or mRNA
- Perform the hybridizations
- Scan the chips

Statistical analysis (days-weeks)
- Extract fluorescence intensities
- Normalize data to remove biases
- Estimate expression changes
- Identify differentially expressed transcripts

Data mining (months)
- Cluster analysis and pattern recognition
- Study lists of genome ontologies
- Search for regulatory motifs
- Reconstruct regulatory circuits
- Design validation and follow-up experiments

After: Gibson, G and SV Muse, 2004.

Transcriptomics with Microarrays

[Figure: RNA sample -> sample amplification and labeling -> labeled sample injected into microarray -> probe array hybridization -> probe array washing and staining -> probe array scanning and intensity quantitation -> fluorescence intensity translated into mRNA abundance]

RNA Sample preparation

Hybridization onto microarray
Quackenbush, 2006

Hybridization onto microarray

[Figure: a probe pair, consisting of a perfect match probe and a mismatch probe]
Each gene is represented by 11-20 probe pairs of 25 nt length, each consisting of a perfect match probe and a mismatch probe.
Perfect match probes are complementary to specific sequences of the target gene, preferentially located at the 3’ end of the gene.
The mismatch probe is identical to the perfect match probe, except for the middle base. It is designed to detect unspecific binding.

Affymetrix Microarrays: Probe Synthesis

To extend all oligonucleotides by one base, four lithographic steps with complementary masks are performed, one mask for each base A, C, T, G.
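
The probe-pair design described above can be illustrated with a short Python sketch. It derives a mismatch probe from a perfect match probe by replacing the middle base with its complement, and summarizes one gene as the mean of the perfect match minus mismatch signals. The function names, the example sequence and the intensities are hypothetical. The simple average difference is an assumption made for illustration and is not presented as the Affymetrix summarization algorithm.

```python
from statistics import mean

# Watson-Crick complements of the four DNA bases
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def make_mismatch(pm_probe):
    """Return the perfect match sequence with its middle base complemented."""
    mid = len(pm_probe) // 2  # index 12, i.e. the 13th base, for a 25-mer
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]


def average_difference(pm_signals, mm_signals):
    """Mean of (PM - MM) over the probe pairs of one gene.

    Subtracting the mismatch signal is a crude correction for
    unspecific binding (an assumption of this sketch).
    """
    if len(pm_signals) != len(mm_signals):
        raise ValueError("each PM signal needs a matching MM signal")
    return mean(pm - mm for pm, mm in zip(pm_signals, mm_signals))


# Hypothetical 25 nt perfect match probe and toy intensities
pm_probe = "ACGTACGTACGTACGTACGTACGTA"
mm_probe = make_mismatch(pm_probe)
signal = average_difference([100, 200, 300], [40, 50, 60])
```

Here `mm_probe` differs from `pm_probe` only at the central position, and `signal` is the average background-corrected intensity over three toy probe pairs.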

Affymetrix Raw Data

Greyscale and false color image of the fluorescence readout

Data Analysis

- Detection of differentially expressed genes
- Identification of similar samples and co-regulated genes in a multi-sample comparison
[Figure: data matrix with axes "genes" and "samples"]

Data Analysis

[Figure, panels A-E: cluster analysis, Pearson correlation matrix, Venn diagram, summary statistics of up-/down-regulation, [phenotypic analysis]]
Koschubs et al., EMBO J, 2009.

Data Analysis: Gene Ontology

[Figure: bar chart over GO terms (biological process), comparing the percentage of genes in the genome with the percentage in the significant gene list (% of total changing genes). Labeled terms: response to chemical stimulus, vitamin metabolic process.]

Microarray Databases

- ArrayExpress (European Bioinformatics Institute): http://www.ebi.ac.uk/arrayexpress
- Gene Expression Omnibus (NCBI)

Acknowledgement

Dietmar Martin, Gene Center, LMU Munich