Part 1 - Transcriptomics 1. Definitions (transcriptome, transcriptomics, EST, cDNA, RT-PCR) 2. Transcriptomics – what is it good for? 3. Overview about transcript regulation (eukaryotic gene structure, transcription factors, micro-RNAs) 4. Methods for measuring gene expression (single genes with qPCR, RT-PCR, Northern blotting, multiple genes with cDNA- or oligonucleotide arrays, cDNA-AFLP, DD-RTPCR, SAGE, MPSS) Part 2 – Expression profiling with microarrays 1. Microarray basics (procedure, platforms, applications) 2. Microarray bioinformatics (normalization, analysis of differentially expressed genes, visualization) 3. Microarray stories (examples from microarray research) Part 1 – Transcriptomics (Gene Expression Profiling) The Language Of Transcriptomics Genome = entire DNA sequence of an organism Transcriptome = percentage of the genetic code that is transcribed into RNA molecules (depends on development, environment, time of the day, tissue) = collection of all gene transcripts present in a given cell/tissue at a given time (“snapshot”) Transcriptomics = global analysis of gene expression = genome-wide expression profiling cDNA = complementary DNA synthesized from mature mRNA by the enzyme reverse transcriptase cDNA library = a population of bacterial transformants or phage lysates in which each mRNA isolated form an organism or tissue is represented as its cDNA insertion in a plasmid or a phage vector The Language Of Transcriptomics RT-PCR = a one or two-step process for converting mRNA to cDNA and the subsequent amplifcation of the reversely-transcribed DNA by PCR EST (Expressed Sequence Tag) = small pieces of DNA (200 - 500 bp) generated by sequencing either one or both ends (5’EST and 3’EST) of an expressed gene (cDNA) = can be used to identify unknown genes and to map their positions within a genome Hybridization = based on complementary molecules, sequences that are able to basepair with one another. Adenine is the complement of thymine, guanine is the complement of cytosine. Therefore, the complementary sequence to GT-C-C-T-A will be C-A-G-G-A-T. When two complementary sequences find each other, they will lock together, or hybridize. The Language Of Transcriptomics PCR (Polymerase chain reaction) = technique that results in the exponential amplification of almost any region of a selected DNA molecule = repeated cycles of denaturation, annealing and extension with the amount of DNA template doubling during each cycle The –omics World Genomics (Genome) DNA Genotype Transcription mRNA Transcriptomics (Transcriptome) Translation Protein provide structure & drive metabolism substrate product Proteomics (Proteome) Metabolomics (Metabolome) Phenotype morphology physiology behaviour ecology Common to all -omics: global (often genome-wide) approaches What Can We Learn From Transcriptomics? get an understanding of genes and pathways involved in biological processes (“guilt by association”: genes with similar expression may be functionally related and under the same genetic control mechanism) help elucidating the function of unknown genes based on their spatial and temporal expression identifies marker genes for diagnosis of diseases gene expression is a proxy for cis- and trans- regulation (allows indirect inferences about genetic differences) may be a proxy for changes in the proteome and metabolome Transcriptomics – Regulation of Transcription Structure Of A Eukaryotic Gene Transcribed region 5’ end of gene Upstream regulatory region 5’ untranslated region of mRNA 3’ end of gene Coding sequence ORF (open reading frame) 3’ untranslated region of mRNA Promoter +1 enhancers e.g. ERE CAAT TATA Transcription start AUG Exon Intron Exon Intron Exon Ter Initiator Terminator codon codon UGA, UAA or UAG Basic promoter sequence motifs such as TATA and CAAT, additional promoter elements such as ERE (ethylene response element, in some plant genes) and up- or downstream regulatory regions on the same strand as the coding region are called cis-elements. Before the RNA transcript (mRNA) serves as a template for protein biosynthesis, noncoding sequences (introns) are eliminated, coding sequences (exons) are fused (referred to as ‘splicing’) and the 5‘ and 3‘ untranslated regions (UTRs) are post-transcriptionally modified. Open reading frames (ORFs) that are translated into a protein always start with the initiator codon AUG and end with one of the terminator codons UGA,UAA or UAG. Regulation Of Transcript Abundance By … …transcription factors Close-up of the promoter of the plant gene strictosidine synthase (alkaloid synthesis) Cis - acting elements -339 -208 BA -108 -103 -100 G-Box CrBPF1 -58 JERE CrMyc2 CrGBFs ORCAs +1 TATA STR gene RNA polymerase II Trans - acting factors (= transcription factors) DNA binding proteins that activate or suppress transcription and thus modulate transcript abundance Regulation Of Transcript Abundance By MicroRNAs = miRNAa = small, single-stranded RNA molecules (~21-mer) that bind to complementary of one or more mRNAs (often transcription factor mRNAs) miRNA – RISC complex binding to complementary mRNA Mature miRNA within RNA-induced silencing complex (RISC) miRNAs either degrade or impair the translation of their target mRNAs! Taken from http://www.ambion.com/techlib/resources/miRNA/index.html 4 Hypothetical Scenarios Of Gene Regulation activator act suppressor + red gene sup microRNA gene red gene transcript X act act X red gene red gene cis-mutation X trans-mutation activator cis-mutation in activator red gene sup X cis-mutation in miRNA microRNA gene microRNA transcript activator activator transcript degradatin of activator transcript activator missing red gene Transcriptomics – Methods For Measuring Gene Expression Methods To Detect Single Gene Transcriptional Changes Northern Blotting transcript-specific radioactive probes are used to identify a target mRNA species within an immobilized RNA sample PCR-based based on the ability of a PCR to exponentially amplify initial differences in transcript number amplified products are visualized either in real time during (qPCR, A) or after (RT-PCR, B) the reaction A ΔRn Hybridizationbased B M.s.-induced - local M.s.-induced - systemic Control T. urticae 1d 1d 3d 3d PR2 Control - local Control - systemic SAMS SAMDC threshold probe ACO CT cycles rRNA RWC L C 14.2 Reporter gene-based 23.5 fusions of a promoter of a gene of interest with a reporter gene: b-glucuronidase (GUS), green fluorescent protein, luciferase reporter activity is measured histochemically or by fluorescence or luminescence allows for detailed spatial and kinetic analyses of transcript accumulation 39.1 41.6 61.1 68.1 18 S Proof of a herbivore-responsive promoter by GUS staining Larvaeattacked Control Methods To Detect Multiple Gene Transcriptional Changes PCR-based: Differential Display RT-PCR, cDNA-AFLP reverse transcription of mRNA into cDNA divide cDNA pool into subsets by selective PCR amplification separation of subpools on high resolution gels quantification of band intensity Hybridization-based: Macroarrays, microarrays a small membrane (macroarray) or glass slide (microarray) containing samples of many genes arranged in a regular pattern mRNA samples of interest are fluorescently labelled and either singly or competitively hybridized to a slide in a single experiment, the expression levels of thousands of genes can be determined by measuring the amount of mRNA bound to each gene on the array Sequencing-based: SAGE (Serial Analysis of Gene Expression) MPSS (Massively Parallel Signature Sequencing) short sequence signatures produced from a defined position within an mRNA the relative abundance of these signatures (tags) in a given library represents a quantitative estimate of expression of that gene no sequence knowledge required! universal platform to study any transcript Sequencing-Based Methods To Detect Multiple Changes SAGE concatemerized tags are sequenced using a traditional automated DNA sequencing method tags are ~9 to 14 bp in length library may contain ~50,000 tags MPSS uses a novel sequencing method whereby thousands of sequences are obtained simultaneously by sequencing off of beads tags are 17-21bp in length library may contain about 4 million tags Solexa (purchased by AWC!) uses a novel sequencing method whereby thousands of sequences are obtained simultaneously by sequencing off of clusters within a flow cell tags are 17bp in length one lane in a flow cell may yield more than 6 million tags SAGE procedure Part 2 – Expression Profiling With Microarrays Microarray Basics Gene Probes (cDNAs, ESTs, oligos) Technologies: Healthy tissue Diseased tissue mRNA mRNA cy3-labeled cDNA cy5-labeled cDNA robotic spotting (ESTs, oligonucleotides) in-situ synthesis (oligonucleotides), Affymetrix, NimbleGen, Agilent Hybridization: competitive in two color arrays single in one color arrays Cy signal ~ amount of mRNA in healthy & diseased tissue Green spots = cDNA from healthy tissue hybridized to the target DNA Red spots = cDNA from diseased tissue hybridized to the target DNA Yellow spots = control and sample cDNA hybridized equally to the target DNA Blue spots = neither control nor sample cDNA hybridized to the target DNA Cross-Species Hybridization: Gene probes and samples originate from different species Microarray Applications Medicine Disease-associated expression patterns (diagnosis) Cell-cycle monitoring (cancer research) Treatment-induced expression pattern (drug development and response) Biology Development and Morphology (juveniles vs adults, male vs female, tissue 1 vs tissue 2) Interactions between organisms (antagonistic, mutualistic, competitive) and organisms and their environments (temperature, radiation, draught, nutrient levels, toxins and heavy metals) Evolution (within- and between species variation, hybrids vs parents, diploids vs polyploids) Functional analyses (wild type vs mutant) Time series Microarray Stories – Examples From Biology Hybrid sunflowers colonizing new habitats… Lai et al., Molecular Ecology 2006 26 lower and 32 higher expressed genes in the hybrid sunflower Helianthus deserticola when compared with its two parental species, H. annuus and H. petiolaris, Among them many transport-related genes that may be important in the desert environment (acting in preventing desiccation) Differentially expressed genes are candidates for ecological divergence Source: Rieseberg lab website Tobacco plants perceiving chemical cues in caterpillar saliva… Halitschke et al., Plant Physiology 2003 majority of the genes up- and downregulated in tobacco when treated with caterpillar saliva were also induced with only two compounds isolated from saliva (fatty-acid amino acid conjugates) tobacco responds differently to mechanical wounding than to caterpillar attack because of these compounds Caterpillar spit Chemical cue Caterpillar spit Chemical cue Microarray Stories – Examples From Biology Behavioural plasticity in honeybees… Whitfield et al., Science 2003 Transition of adult honeybees from hive work to foraging is associated with mRNA abundance in the brain Individual brain expression profiles can reliably predict the behaviour of a bee based on as little as ten predictor genes Nightshade and tobacco respond differently to a common herbivore… Schmidt et al., The Plant Journal 2005 Venn diagram of the numbers of overlapping and non-overlapping up- and down-regulated genes in two plant species of the nightshade family that are induced by tobacco hornworm feeding no “blue print” of defence responses, evolutionary history matters! Steps In A Microarray Experiment Question-driven Goals? Hypotheses? Questions? Platform What technology? Source of gene probes? Cross-species hybridization? Experimental design Replication level Hybridization scheme What statistics, what analysis software? Laboratory steps Sample preparation and labelling Hybridization (manual or robotic) and washing Image acquisition (Scanner with two lasers) Bioinformatic steps Data transformation and normalization Analysis of differentially expressed genes (multiple testing issue, multivariate statistics, gene ontology) Visualization (graphics) Data storage (databases, MIAME standards) Data interpretation Answers? New Hypotheses? Follow-up experiments? Validation? Microarray Bioinformatics – Data Pre-Processing Data Pre-Processing – Overview Resolves systematic errors and bias introduced by experimental platform 1. Data cleaning and transformation (removing flagged spots, background subtraction, taking logarithms) 2. Within-array normalization (removes dye and spatial bias, brings cy3 and cy5 channels on equal footing) linear regression of cy5 against cy3 (scatter plots) linear/non-linear regression of log ratios against average log intensity (MA plots) 3. Between-array normalization (enables comparison of multiple arrays, brings samples hybridized to different arrays on equal footing, box plots) Scaling, Centering, Distribution normalization (quantile) Transformation – Taking Logarithms (log (to base 2)) Cy3 = green (sample 1), Cy5 = red (sample 2) Cy5 > Cy3 = higher expression in sample 2 Cy5 < Cy3 = higher expression in sample 1 Cy5 = Cy3 = equal expression in both samples M (log fold ratio) = log2(cy5/cy3) = log2(cy5) – log2(cy3) A (average log intensities) = (log2(cy5) + log2(cy3))/2 Log fold ratios have a natural symmetry which reflects the biology that is not present in the raw fold difference: 2 fold up- and down-regulated genes have a raw fold difference of 2 and 0.5 and non-regulated genes have a raw fold difference of 1. On the log2 scale 2 fold up- and down-regulated genes correspond to log fold ratios of +1 and -1 and non-regulated genes have a log fold ratio of 0. cy5/cy3 log2 (cy5/cy3) Transformation – Taking Logarithms (log (to base 2)) scatter plot Raw intensities are not evenly distributed across the intensity range, variability of the data increases with intensity and distributions of raw intensities are right skewed in both channels (Cy3 and Cy5) Cy5 raw intensity before histogram Cy3 raw intensity Cy3 raw intensity Cy5 raw intensity The data is spread evenly across the intensity range and the variability is the same at most intensities, plus the intensity distributions are closer to a bell-shaped normal curve. Cy5 log intensity after Cy3 log intensity Cy3 log intensity Cy5 log intensity Within Array Normalization – MA Plots & Regression X-axis = average of cy3 and cy5 log intensities (A) Y-axis = difference between cy5 and cy3 log intensities, log fold ratio (M) Log fold ratio Non-linear regression (loess) Loess normalized Log fold ratio after Linear normalized Log fold ratio before Log fold ratio Linear regression Average log intensity Between Array Normalization – Box Plots Log fold ratio whiskers = outliers or extreme values central line = mean or median box = standard deviation or middle 50% of the data Scaling subtracting the mean equalizes the mean across all samples Distribution normalization Distributionnormalized log ratio Scaled log ratio Centered log ratio Centering subtracting the mean + dividing by data distributions of all arrays are identical (quantile standard deviation (sd) normalization) equalizes mean and sd Microarray Bioinformatics – Analysis Of Differentially Expressed Genes Two Questions A Microarray Experiment Should Answer… 1. Which genes are differentially expressed in one set of samples relative to another? Comparing 2 samples Parametric statistics (one-sample t-test, two-sample t-test) Non-parametric statistics (Wilcoxon sign-rank test, Mann Whitney test) Bootstrap analyses Comparing >2 samples and/or measuring response to more than one variable One-way ANOVA, multifactor ANOVA General Linear Models Bootstrap analyses Mulitplicity of testing False Discovery Rate, Bonferroni correction 2. What are the relationships between genes or samples being measured? Dimensionality reduction Principal components analysis (PCA), multidimensional scaling (MDS) Grouping of genes or samples Hierarchical clustering K-means clustering, Self-organizing maps Bootstrap analyses Analysis Of Differentially Expressed Genes – Which genes are differentially expressed in one set of samples relative to another? Hypothesis Testing: T-test and P-value Each hypothesis test builds a probabilistic model under the null hypothesis that there is no biological effect. Using this model it is possible to calculate the probability of observing a statistic that is at least as extreme as the observed statistic in the data (= p-value). Usually, a p<0.05 is considered small enough to reject the null hypothesis. P-values are also know under type 1 error – the probability of rejecting the null hypothesis when it is actually true (= false positives). Example: unpaired t-test Null hypothesis (H0): gene x is not differentially expressed between A patients and B patients 1. Calculate the t-statistic for gene x (incorporates mean, standard deviation and sample size of both A and B patients) 2. Compare t-statistic with a t-distribution with an appropriate number of degrees of freedom (df) 3. If t-statistic is more extreme than the critical t-statistic at a chosen significance level (e.g. p=0.05) reject the null hypothesis, otherwise accept it T-distribution (df, p=0.05) Hypothesis Testing: Bootstrap Test Example: bootstrap test Null hypothesis (H0): gene x is not differentially expressed between A patients and B patients 1. generate many random data sets by re-sampling the original data, each individual is randomly allocated a measurement from either group 2. compare some property of the real data (e.g. t-statistic) with the distribution of the same property in the random data sets 3. compute proportion of t-statistics that have a more extreme value than the t-statistic from the real data (=pvalue) 4. small p-value indicates differential expression of gene X Distribution of bootstrap t-statistics t-statistic of real data majority of bootstrap statistics are less extreme then the real statistic (p<0.001) Bootstrapping works for paired and unpaired analyses, ANOVA and Cluster analysis, is robust and powerful, but computationally intensive! Multiplicity of Testing Meaning of a p-value? P-value of 0.01 means a false positive rate of 1%. Calculating the false discovery rate in a microarray experiment Imagine an array with 6350 genes and an experiment where 184 genes are differentially expressed at p=0.01. That means 64 genes would be expected to appear differentially expressed even when they’re not (false discovery rate 35%). With decreasing p-value, the false discovery rate also decrease, but so does the number of differentially expressed genes – choose a p-value which balances number of differentially expressed genes and false discovery rate! False discovery rate Fine print: Alternatively – multiply each p-value by the number of genes in the analysis to obtain Bonferroni-adjusted p-values. Usually none of the adjusted p-values is significant, thus Bonferroni correction is too stringent for microarray analyses! The Volcano Plot Arranges Genes … ... along dimensions of biological and statistical significance: The log fold change is plotted on the x-axis while the y-axis represents statistical evidence (either a p-value on a negative log scale – so smaller p-values appear higher up or an odds ratio on a positive log scale – so higher odds ratios appear higher up). The x-axis indicates biological impact of the change; the y-axis indicates the reliability of the change. The researcher can then make judgements about the most promising candidates for follow-up studies, by trading off both these criteria. Example. Up (red) and down (green) regulated genes in a tobacco plant attacked by a caterpillar as compared to an un-attacked plant on a 1.5fold change level (-0.58 and +0.58 on log scale). Black genes either have a too low log fold change or log odds ratio or both. Analysis of Differentially Expressed Genes – What are the relationships between genes or samples being measured? Dimensionality Reduction Methods for visualizing high-dimensional microarray data in two or three dimensions: Principal Components Analysis (PCA) PCA projects highdimensional space into twodimensional space by capturing as much of the variability of the original data as possible 5h 2h 7h 9h starts with a variance-covariance matrix of all genes calculates new variables that each explain a portion of the variance (=principal components) Example: Six arrays representing six time points during yeast sporulation 0.5h 11h three late samples fairly similar, main changes occur over first 7 h plus some processes switched on after the first 5h and then switched off Dimensionality Reduction Methods for visualizing high-dimensional microarray data in two or three dimensions: Multidimensional Scaling (MDS) MDS locates profiles in two-dimensional space such that their distances are as close as possible to their distances in the higher dimensional space starts by calculating the distance between profiles distance measures can vary (Euclidean distance, Pearson correlation, Spearman correlation) Example: Six arrays representing six time points during yeast sporulation a) using Euclidean distance b) using Pearson Correlation 5h 2h 2h 7h 9h 7h 9h 0.5h 5h 11h 0.5h 11h Hierarchical Clustering… …arranges gene and/or sample profiles into a tree so that similar profiles are closer together and dissimilar profiles are farther apart. How it’s done: 1. Calculate distance matrix between genes/samples (different methods) 2. Find nearest entries in distance matrix and join them into a cluster 3. Compute the distance between the newly formed cluster and the other genes/samples and clusters (different methods) 4. Return to step 1 until all genes and clusters are linked Results vary depending on choice of distance metric and cluster linkage! Pearson correlation + single linkage Pearson correlation + complete linkage Pearson correlation + average linkage From Hierarchical Clustering To Heatmaps Heatmap false color image with a dendrogram added on top and on the side very frequently used for visualization of microarray results top dendrogram represents relations between samples (e.g. patients or time points) Side dendrogram represents relations between genes Microarray Data Interpretation – Gene ontology (GO) Output of a microarray analysis: List of differentially expressed genes, annotated mostly with their names – not helpful for biological interpretation! Gene Ontology Consortium developed three structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. For example, the gene product cytochrome c can be described by the: molecular function term oxidoreductase activity, biological process terms oxidative phosphorylation and induction of cell death, cellular component terms mitochondrial matrix and mitochondrial inner membrane. GO is constantly up-dated (http://www.geneontology.org/) and a very valuable tool for micorarray data interpretation! For more information: Microarray case studies: • Halitschke, R., K. Gase, D. Q. Hui, D. D. Schmidt, and I. T. Baldwin. 2003. Molecular interactions between the specialist herbivore Manduca sexta (Lepidoptera, Sphingidae) and its natural host Nicotiana attenuata. VI. Microarray analysis reveals that most herbivore-specific transcriptional changes are mediated by fatty acid-amino acid conjugates. Plant Physiology 131:1894-1902. • Lai, Z., B. L. Gross, Y. Zou, J. Andrews, and L. H. Rieseberg. 2006. Microarray analysis reveals differential gene expression in hybrid sunflower species. Molecular Ecology 15:1213-1227. • Whitfield, C. W., A. M. Cziko, and G. E. Robinson. 2003. Gene expression profiles in the brain predict behavior in individual honey bees. Science 302:296-299. • Schmidt, D. D., C. Voelckel, M. Hartl, S. Schmidt, and I. T. Baldwin. 2005. Specificity in ecological interactions. Attack from the same lepidopteran herbivore results in species-specific transcriptional responses in two solanaceous host plants. Plant Physiology 138:1763-1773. Microarray text books: • Stekel, D., ed. 2003. Microarray Bioinformatics. Cambridge University Press, Cambridge. • Draghici, S., ed. 2003. Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC, Boca Raton. Microarray freeware: http://www.r-project.org/ + http://www.bioconductor.org/ Microarray online info: http://www.ncbi.nlm.nih.gov/About/primer/microarrays.html Solexa sequencing: http://www.illumina.com/morethansequencing/Technology.ilmn