RNAseq analysis Bioinformatics Analysis Team McGill University and Genome Quebec Innovation Center bioinformatics.service@mail.mcgill.ca •Module #: Title of Module •2 Why sequence RNA? • Functional studies – Genome may be constant but experimental conditions have pronounced effects on gene expression • Some molecular features can only be observed at the RNA level – Alternative isoforms, fusion transcripts, RNA editing • Interpreting mutations that do not have an obvious effect on protein sequence – ‘Regulatory’ mutations • Prioritizing protein coding somatic mutations (often heterozygous) Modified from Bionformatics.ca RNA-seq Isolate RNAs Generate cDNA, fragment, size select, add linkers Samples of interest Sequence ends Condition 1 Condition 2 (normal colon) (colon tumor) 100s of millions of paired reads 10s of billions bases of sequence RNA-seq – Applications • Gene expression and differential expression • Transcript discovery • SNV, RNA-editing events, variant validation • Allele specific expression • Gene fusion events detection • Genome annotation and assembly • etc ... RNAseq Challenges • RNAs consist of small exons that may be separated by large introns – Mapping splice-reads to the genome is challenging – Ribosomal and mitochondrial genes are misleading • RNAs come in a wide range of sizes – Small RNAs must be captured separately • RNA is fragile and easily degraded – Low quality material can bias the data Modified from Bionformatics.ca RNA-Seq: Overview RNA-Seq: Input Data Input Data: FASTQ End 1 End 2 Control1_R1.fastq.gz Control2_R1.fastq.gz KnockDown1_R1.fastq.gz Control1_R2.fastq.gz Control2_R2.fastq.gz KnockDown1_R2.fastq.gz KnockDown2_R1.fastq.gz KnockDown2_R2.fastq.gz ~ 10Gb each sample @ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1 GGCTCATCTTGAACTGGGTGGCGACCGTCCCTGGCCCCTTCTTGACACCCAGCG + 4=B@D99BDDDDDDD:DD?B<<=?>6B########################### @ERR127302.2 HWI-EAS350_0441:1:1:1056:1163#0/1 GAATGAGAGGCCCTCCCCGTGGAGGCATGGTATCCGGCCGAGGGGGCTTAGTCA Q = -10 log_10 (p) Where Q is the quality and p is the probability of the base being incorrect. QC of raw sequences QC of raw sequences low qualtity bases can bias subsequent anlaysis (i.e, SNP and SV calling, …) QC of raw sequences Positional Base-Content QC of raw sequences QC of raw sequences Species composition (via BLAST) RNA-Seq: Trimming and Filtering Read Filtering • Clip Illumina adapters: • Trim trailing quality 30 between the adapters and reads. First, Trimmomatic uses a two-step approach to find< matches Trimmomatic a two-step to 16 findbp)matches between thepossible adaptersposition and reads. First, short sections uses of each adapter approach (maximum are tested in each within the short adaptersh k(maximum bp) are i tested in each possible position within the reads.sections h If thi s sof or teach n alignment, own as te16 ‘seed’ a perfect or sufficiently close match, reads. h If thi sby sorthe t n alignment, sh kownparameter as te ‘seed’ i a perfect or sufficiently match, determined seedMismatch (see below), the entire alignmentclose between the determined by theisseedMismatch parameter (see below), entire alignment between the read and adapter scored. This two-step strategy results the in considerable efficiency gains, read is scored.can This results while in considerable efficiencyscore gains, sinceand the adapter seed alignment be two-step calculatedstrategy very quickly, the full alignment is since the seed alignment can be calculated very quickly, while the full alignment score is calculated relatively rarely. calculated relatively rarely. The full alignment score is calculated as follows. Each matching base increases the alignment The is calculated as follows. Each matching increases the alignment scorefull byalignment 0.6, whilescore each mismatch reduces the alignment score bybase Q/10. By considering the • Filter for read length ≥ 32 bp score each mismatch reduces Q/10. By considering the qualitybyof0.6, thewhile base calls, mismatches causedthebyalignment read errorsscore havebyless impact. A perfect match Trimmomatic uses a two-step approach to find matches between the adapters quality thesequence base calls, mismatches read25errors less impact. A perfect of a 12 of base will score just caused over 7, by while baseshave are needed to score 15. As match such and read sections of each 16value bp) aresimple tested inusadellab.org each possible of 12 baseshort sequence will score just7adapter while 25 bases are needed to score 15. As such . position w wearecommend values of between -over 15 as7,(maximum the threshold for alignment mode. RNA-Seq: Mapping Assembly vs. Mapping mapping Reference all vs reference reads contig1 assembly all vs all contig2 RNA-seq: Assembly vs Mapping Reference-based RNA-seq Ref. Genome or Transcriptome RNA-seq reads De novo RNA-seq contig1 contig2 Read Mapping • Mapping problem is challenging: – – – – Need to map millions of short reads to a genome Genome = text with billons of letters Many mapping locations possible NOT exact matching: sequencing errors and biological variants (substitutions, insertions, deletions, splicing) • Clever use of the Burrows-Wheeler Transform http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25. increases speed and reduces memory footprint http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 •TableOther mappers: BWA, Bowtie, STAR, GEM, etc. Langmead et al. R25.2 1 Bowtie alignment performance versus SOAP and Maq Table 1 Platform CPU time Wall clock time Reads mapped per Bowtie alignment performance versus SOAP and Maq hour (millions) Bowtie -v 2 Platform Server CPU 15 mtime 7s 15 mclock 41 s time Reads 33.8 mapped per Wall SOAP 91 h 57 m 35 s 91 h 47 m 46 s hour 0.10 (millions) Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) 1,149virtual memory Peak 13,619 footprint (megabytes) 67.4 aligned (%) Bowtie speed-up Reads 351× 67.3 Bowtie-v 2 Server PC Bowtie Maq SOAP 57s s 1516mm741 s s 1517mm41 9117hh5746mm3535s s 9117hh4753mm467 s 29.5 33.8 0.49 0.10 1,353 1,149 804 13,619 -59.8× 351× 71.9 67.4 74.7 67.3 Bowtie Bowtie 1617mm4158s s 28.8 29.5 1,353 1,353 -- 71.9 71.9 Server PC 1718mm5726s s TopHat: Spliced Reads • Bowtie-based • TopHat: finds/maps to possible splicing junctions. • Important to assemble transcripts later (cufflinks) SAM/BAM Control1.bam Control2.bam SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAAT GGGAAATAAACACTATGTTCAA AG KnockDown1.bam KnockDown2.bam ~ 10Gb each bam SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGG AAATAAACACTATGTTCAAAG • Used to store alignments • SAM = text, BAM = binary Read name Bases Base Qualities Flag Reference Position CIGAR Mate Position SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@> … The BAM/SAM format samtools.sourceforge.net picard.sourceforge.net Sort,View, Index, Statistics, Etc. $ samtools flagstat C1.bam 110247820 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 110247820 + 0 mapped (100.00%:nan%) 110247820 + 0 paired in sequencing 55137592 + 0 read1 55110228 + 0 read2 93772158 + 0 properly paired (85.06%:nan%) 106460688 + 0 with itself and mate mapped 3787132 + 0 singletons (3.44%:nan%) 1962254 + 0 with mate mapped to a different chr 738766 + 0 with mate mapped to a different chr (mapQ>=5) $ RNA-Seq: Alignment QC RNA-seQc summary statistics broadinstitute.org/cancer/cga/rna-seqc RNA-seQc covergae graph Home-made Rscript: saturation RPKM Saturation Analysis RNA-Seq:Wiggle UCSC: bigWig Track Format • The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph. • Count the number of read (coverage at each genomic position: Modified from http://biowhat.ucsd.edu/homer/chipseq/ucsc.html RNA-Seq:Gene-level counts HTseq:Gene-level counts Contro1: 11 reads Control2: 16 reads KnockDown1: 4 reads KnockDown2: 5 reads TSPAN16 • Reads (BAM file) are counted for each gene model (gtf file) using HTSeq-count: www-huber.embl.de/users/anders/HTSeq TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2 … NFYA Control1 11 1 435 203 216 2365 6 380 … 888 Control2 KnockDown1 16 4 0 0 743 836 218 416 643 714 5011 2828 1 4 865 431 … … 827 1674 KnockDown2 5 0 739 352 704 2294 0 523 … 1580 RNA-seq: EDA gqSeqUtils R package: Exploratory Data Analysis 50 MEF_WT_P5 WT 25 MEF_WT_P7 WT ● MEF_WT_P6 WT MEF_KO_P6 KO ● Mutation V2 MEF_KO_P5 KO ● KO 0 WT ● MEF_KO_P7 KO −25 ● MEF_KO_P3 KO −50 MEF_WT_P3 WT −50 −25 0 V1 25 RNA-Seq:Gene-level DGE Home-made Rscript: Gene-level DGE • edgeR and DESeq : Test the effect of exp. variables on gene-level read counts • GLM with negative binomial distribution to account for biological variability (not Poisson!!) DEseq edgeR Differential Gene Expression SYMBOL HNRNPC FAIM2 AC019178 SSC5D GGT5 EXOC3L4 FOXS1 AQP5 SLC27A3 TIMP4 logFC -5.26 -4.82 -6.57 -2.95 -3.03 -3.07 -4.02 -3.73 -2.39 -3.29 PValue 9.19E-55 8.02E-29 2.14E-28 2.39E-27 1.03E-26 9.19E-21 1.69E-19 2.82E-19 6.97E-18 1.21E-17 FDR 5.71E-50 2.49E-24 4.42E-24 3.71E-23 1.28E-22 9.51E-17 1.49E-15 2.18E-15 4.81E-14 7.52E-14 counts.C1 12611 191 100 2274 838 359 113 106 736 126 counts.C2 counts.KD1 counts.KD2 12404 244 443 194 11 3 104 1 1 2123 318 276 803 93 117 344 53 34 92 5 8 113 9 8 637 144 129 120 14 12 Downstream Analyses Pathways/Gene Set (e.g. GOSeq) Regulatory Networks Machine Learning / Classifiers RNA-Seq:Transcriptlevel DGE Cufflinks: transcipt assembly • Assembly: Reports the most parsimonious set of transcripts (transfrags) that explain splicing junctions found by TopHat Cufflinks: transcript abundance • Quantification: Cufflinks implements a linear statistical model to estimate an assignment of abundance to each transcript that explains the observed reads with maximum likelihood. Cufflinks: abundance output • Cufflinks reports abundances as Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) C: Number of read pairs (fragments) from transcript N: Total number of mapped read pairs in library L: number of exonic bases for transcript • Normalizes for transcript length and lib. size Cuffdiff: differential transcript expression • Cudiff – Tests for differential expression of a cufflinks assembly RNA-Seq:Generate report Home-made Rscript Generate report – Noozle-based html report which describe the entire analysis and provide a general set of summary statistics as well as the entire set of results Files generated: – index.html, links to detailed statistics and plots For examples of report generated while using our pipeline please visit our website