RNA-seq: the future of transcriptomics ……. ? Disclaimer: Tiago Hori is not an expert on RNA-seq Wang et al., 2009 • RNA-seq or RNA-sequencing is not a complete novel idea. • SAGE, long-SAGE, MPSS • The recent developments in next-generation sequencing (NGS) have made whole transcriptomic analyses more accessible. • Does it work? • Comparison with microarrays • Advantages and disadvantages • How does it work? • Challenges • Are microarrays going to go extinct? Weapons of choice: Marioni et al., 2008 There is a good correlation between microarray intensity and count data. There is also good correlation between Affymetrix foldchanges and Illumina-based RNA-seq fold-changes The Pros and Cons of RNA-seq – do the benefits definitely outweigh the problems? Advantages: • Allows for not only the identification of differentially expressed genes, but also identification of differential allelic expression, SNPs, splice variants, new genes or isoforms. • It is not limited to a set number of probes. • It is NOT impacted by background signal or saturation that causes problems in studying highand low-expression transcripts. Wang et al, 2009 The Pros and Cons of RNA-seq – do the benefits definitely outweigh the problems? Disadvantages: • Cost • Dependent on a reference genome or transcriptome. * see Trapnell et al., 2010 – Nature Biotechnology (used 430 million paired-end reads to assemble a transcriptome de-novo • Large amounts of data requiring large storage space and computational power • Statistical methods are still in their infancy How does it work? A) Agilent polyA selection B) NibleGen selection array C) Generation of target cDNA (sequence specific, e.g. for allele discrimination) D) Helicos sequencing Ozsolak and Milos, 2011 How does it work? Oshlack et al., 2010 Mapping Challenges: •Computational power required •Exon junctions •Alleles and SNPs Two main methods: Based on hash tables (local alignment similar to BLAST) Based on prefix/suffix trie BFAST Homer et al., 2009 BWA-SW Li and Durbin et al., 2010 One of the biggest challenge with mapping is to reduce the “RAM footprint” of the reference genome. This is accomplished by different ways of indexing the reference. The other challenge is to map accurately while allowing for variable reads (e.g. SNPs or error) to be mapped. Data summarization: There are 3 main ways of summarizing your data: 1. Counts per exon 2. Counts per transcript 3. Counts per gene (Oshlack et al., 2010) Normalization: Is RNA-seq data absolute mRNA count? • Within libraries: • Length bias • Sequencing efficiency • Between libraries: • Sequencing depth • Over-representation of highly-expressed transcripts Differential Expression detection: Challenges: • Requires biological replication but perhaps not technical replication. • Count data is discrete rather than continuous. • There is evidence the count data follow a negative binomial distribution similar to the Poisson distribution. • Accounting for type I error (False-Discovery) Bioconductor packages: edgeR: Developed for SAGE uses a modified Fisher exact test for dispersed data (means and variance estimated using maximum likelihood) DESeq: Similar to edgeR but uses a different model to estimate means and variance (empirical estimation of mean-variance relationship) BaySeq: Empirical Bayes inference to test of differential expression Systems Biology: DAVID and other microarray techniques used for GO enrichment KEGG pathways What do you do with data and what does it all mean? Resources: