RNA Sequencing
Peter Tsai
Bioinformatics Institute, University of Auckland

What is RNA-seq?
• Study of transcriptomes
• Identify known genes, exons, splicing events, ncRNA, miRNA
• Novel genes or transcripts
• Abundances of transcripts (quantitative expression)
• Differentially expressed transcripts between different conditions
• Reconstructing the transcriptome

General workflow
• Raw data QC
• Map to reference genome
• De novo transcriptome assembly
• Estimate abundance
• Normalisation
• Differential expression analysis
• Require downstream annotation

Quality checks and mapping
• Use FastQC, SolexaQA
• Trim off low-quality regions; keep only properly paired reads
• Most QC software assumes normality, but in RNA-seq data you will probably see non-normality
• You might see some duplicated reads; this is probably due to highly expressed genes
• Reference mapping needs a specific tool that can map across splice junctions between exons, e.g. TopHat
• De novo assembly needs specific software for reconstructing transcriptomes from RNA-seq data, e.g. Trinity

Expression value in RNA-seq
• The total number of reads mapped to a gene/transcript (count data, raw counts, or digital gene expression)

Complexity of using simple counts
• Sequencing depth: the higher the sequencing depth, the higher the counts
• Gene length: counts are proportional to the length of the gene times the mRNA expression level
• Count distribution: differences in how counts are distributed among samples

Normalisation
• RPKM (Mortazavi et al., 2008)
  ◦ Reads Per Kilobase of exon model per Million mapped reads
• FPKM (Trapnell et al., 2010)
  ◦ Fragments Per Kilobase of exon model per Million mapped reads
  ◦ Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable, so fragments rather than reads are counted.
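
The RPKM definition above corrects raw counts for both gene length and sequencing depth. A minimal sketch of that arithmetic follows, assuming counts and exon-model lengths are already available as plain dictionaries; the function name `rpkm`, the gene names, and all numbers are hypothetical and only illustrate the formula RPKM = 10^9 × C / (N × L), where C is the read count for the gene, N the total number of mapped reads, and L the exon-model length in base pairs. Passing fragment counts instead of read counts gives FPKM by the same formula.

```python
def rpkm(counts, lengths_bp, total_mapped=None):
    """Return RPKM per gene.

    counts       -- dict: gene -> number of mapped reads (hypothetical input)
    lengths_bp   -- dict: gene -> exon-model length in base pairs
    total_mapped -- total mapped reads in the library; if omitted,
                    the sum of the supplied counts is used (an assumption)
    """
    if total_mapped is None:
        total_mapped = sum(counts.values())
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million mapped reads)
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_mapped)
        for gene in counts
    }


# Two made-up genes with the same raw count but different lengths.
example_counts = {"geneA": 500, "geneB": 500}
example_lengths = {"geneA": 2000, "geneB": 1000}
example = rpkm(example_counts, example_lengths, total_mapped=10_000_000)
print(example)
```

In this made-up example both genes have 500 reads, but the shorter gene receives twice the RPKM of the longer one, which is the gene-length effect listed under "Complexity of using simple counts".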

Data exploration
[Figure: Replicate 1 vs Replicate 2]

Gene.ID/Description  logFC         logCPM       LR           PValue       FDR
1                    2.563086301   5.07961611   28.4599795   9.57E-08     2.72E-05
2                    4.003686266   2.330395704  28.3288251   1.02E-07     2.72E-05
3                    2.71372512    9.704651395  25.01930526  5.68E-07     0.000100653
4                    -2.052703196  3.402621025  21.11492168  4.33E-06     0.000575287
5                    1.95117636    4.438847349  19.21195535  1.17E-05     0.001244651
6                    2.465833373   12.20593577  10.91756889  0.000952565  0.084460792
7                    1.817858683   5.308092036  10.3738524   0.001278126  0.097137553
8                    1.577603322   6.556675456  9.690419768  0.001852312  0.110687766
9                    1.20515812    4.542565518  9.670466698  0.001872537  0.110687766
10                   1.233090336   10.08249873  9.289827985  0.002304298  0.122588652
11                   1.120581944   12.14988136  7.710102379  0.005491264  0.265577482
12                   1.045292369   4.913492018  7.039209923  0.00797442   0.350270537
13                   1.089867189   3.885246135  6.912558621  0.008559242  0.350270537
14                   1.353955354   2.21406615   5.976193603  0.014500264  0.551010036
15                   1.049933686   3.281031472  5.737563572  0.016605812  0.588952795
16                   -1.032999983  1.480514873  4.712476717  0.029944481  0.995653998
17                   -1.313778857  4.325330722  4.169234925  0.041164384  0.998742102
18                   0.864451602   4.338668381  3.479808135  0.062121942  0.998742102
19                   -0.766266641  5.2972332    3.443865378  0.063486998  0.998742102

[Legend: Up-regulated / Down-regulated]

ERCC spike-in control
• Set of external RNA transcripts with known concentrations
• Dynamic range and lower limit of detection
• Fold-change response
• Internal control, used to measure against defined performance criteria

Dynamic range and lower limit of detection
• The dynamic range can be measured as the difference between the highest and lowest concentrations.
• The lower limit of detection is a measure of sensitivity, defined as the lowest molar amount of ERCC transcript detected in each sample.

Fold-change response

How much library depth is needed for RNA-seq?
• Depends on a number of factors
  ◦ Biological questions
  ◦ Complexity of the organism
  ◦ Types of analysis
  ◦ Types of RNA: miRNA, lncRNA
• Literature search for similar work
• Pilot experiment

Summary
• Have 3 or more biological replicates
• Analyse your data with different normalisation methods
• Perform data exploration
• Use a standard spike-in as internal control
• Validate with qPCR
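
As a rough illustration of the spike-in recommendation, the dynamic range and lower limit of detection described for the ERCC controls could be computed along these lines. This is only a sketch under stated assumptions: the function name `ercc_summary`, the transcript names, the concentrations, the counts, and the `min_count` detection threshold are all hypothetical, and reporting the dynamic range as a log2 ratio of the highest to the lowest detected concentration is one possible way of expressing the "difference" between them.

```python
import math


def ercc_summary(concentrations, counts, min_count=1):
    """Summarise spike-in detection for one sample.

    concentrations -- dict: spike-in transcript -> known input concentration
    counts         -- dict: spike-in transcript -> mapped read count
    min_count      -- reads needed to call a transcript "detected"
                      (hypothetical threshold, chosen for illustration)
    """
    detected = [
        conc
        for name, conc in concentrations.items()
        if counts.get(name, 0) >= min_count
    ]
    if not detected:
        return None  # no spike-in transcript was detected in this sample
    lowest, highest = min(detected), max(detected)
    return {
        # lowest known concentration that still produced reads
        "lower_limit_of_detection": lowest,
        "highest_detected": highest,
        # dynamic range expressed on a log2 scale (an assumption)
        "dynamic_range_log2": math.log2(highest / lowest),
    }


# Made-up spike-in transcripts, concentrations, and counts.
known_conc = {"spike_1": 0.25, "spike_2": 1.0, "spike_3": 64.0, "spike_4": 1024.0}
observed = {"spike_1": 0, "spike_2": 3, "spike_3": 150, "spike_4": 2400}
summary = ercc_summary(known_conc, observed)
print(summary)
```

Here the least concentrated made-up transcript has no reads, so the lower limit of detection is the next concentration up; comparing this value across samples gives a simple sensitivity check.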