Finding genes de novo with RNA-seq
Graham Etherington
Graham.Etherington@tsl.ac.uk

Today's topics
• The basics
  – What is RNA-seq, alternative splicing.
• Assembly techniques
  – Reference-based alignment
  – De novo assembly
• Expression analysis
• Tutorials in Galaxy
  – Finding genes through transcript assembly
    • TopHat
    • Cufflinks
  – Expression analysis
    • Cuffcompare
    • Cuffdiff

RNA-seq – the basics
• Genome of interest.
  – How many genes are there?
  – Are some novel?
  – Are there alternatively spliced isoforms?
  – Are some transcripts more abundant than others?
  – Which genes are expressed under different environmental or biological conditions (e.g. lack of a nutrient, pathogen infection, etc.)?

What is RNA-seq?
[Diagram: genome → genes → extract mRNA (transcribed genes) → sequence]

RNA-seq basics - Alternative splicing

Reference-based Alignment
• Use when a closely related reference is available.
• 3 steps
  ① Use a splice-aware aligner (e.g. BLAT, TopHat) to align reads to a reference genome.
  ② Cluster reads from each locus to build a graph representing its possible isoforms.
  ③ Traverse the graph to resolve isoforms. Each different path through the graph represents a potentially different isoform.

Alignment
Seed-and-extend alignment (e.g. BLAST)
Query:  ATCGCGTTACGATCCGTAA
Find all occurrences of the seed 'ATCGCG' in the target.
Target: ATCGCGGTCGTTAATCGCGCGTTCGATCGCGTTACGATCCGTAACGCACCATCGCGTTGC

Alignment
Seed-and-extend alignment (e.g. BLAST)
Query:  ATCGCGTTACGATCCGTAA
Extend alignments
Genome: ATCGCGTTAGTTAATCGCGTTACCGATCGCGTTACGATCCGTAACGCACCATCGCGTTAA

Alignment
• Burrows-Wheeler Transform (BWT)
  – used by the BWA, SOAP, Bowtie (and TopHat) aligners
• Creates a compressed index of the genome.
• Index is a sorted range of substrings from the genome that can be quickly searched.
• Stretches of sequence can be looked up
  – Like the index of a book. Words (sequences) can be looked up in the index, which then points you to the pages (genomic locations) where that word (sequence) is found.
• Narrows down the search space (searches the index instead of the genome).
• Speeds up alignment and requires less memory than older alignment algorithms.

Creating and Traversing Graphs
• Aligned reads
• Create graph that represents alternative splicing
• Traverse graph to find all possible paths
• All possible splice variants from graph

Reference-based Alignment
• Preferable where a high-quality reference exists.
• Can assemble full-length transcripts at a depth of 10x.
• Advantages:
  – Contamination is not a great problem: contaminant reads won't align.
  – Less memory use than de novo assembly
  – Detection of low-abundance transcripts
  – Can identify transcripts missing from the annotated reference

Reference-based Alignment
• Disadvantages:
  – Relies on the accuracy of the reference sequence
    • May contain errors, deletions, misassemblies.
    • Can miss divergent transcripts
  – Reads often align to multiple regions
    • Excluding multi-mapped reads leaves gaps
    • Randomly assigning multi-mapped reads can create false transcripts
  – Can't easily assemble trans-spliced genes (2 pre-mRNAs spliced together to form 1 mature mRNA)

De novo assembly
• Doesn't use a reference sequence.
• Constructs a De Bruijn graph by breaking reads into k-mers and connecting overlapping nodes.
• Graph is traversed to identify paths through it.
• Each path represents a unique sequence.

De Bruijn graphs
• All substrings of length k (k-mers) are generated from each read.
• 5-mers in this example

De Bruijn graphs
• Overlapping k-mers are used to create nodes in the graph.
• Chains of adjacent nodes in the graph are collapsed into a single node.
• Alternative paths through the graph are identified.
• Isoforms identified

De novo Assembly
• Advantages
  – Doesn't need a reference sequence.
  – Sometimes better than reference-based assembly when:
    • the reference is of low quality (e.g. missing bits).
    • unknown exogenous transcripts need to be detected.
    • long introns are expected.
  – Doesn't depend on the correct alignment of reads to splice sites.
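The De Bruijn graph slides can be made concrete with a small Python sketch. It is a toy illustration only, not how any particular assembler is implemented: the two isoform sequences, the read length of 10 and the function names (`kmers`, `build_graph`, `spell_paths`) are all made up for this example, the reads are simulated without errors, and the step that collapses chains of nodes is left out for brevity.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every substring of length k (k-mer) in a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins k-mers that are adjacent in some read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for node in ks:
            graph[node]  # register the node even if it has no successor
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)  # left and right overlap by k-1 bases
    return dict(graph)


def spell_paths(graph):
    """Walk from every source node to a sink and spell out each path."""
    has_incoming = {n for successors in graph.values() for n in successors}
    sources = sorted(n for n in graph if n not in has_incoming)
    sequences = []

    def walk(node, sequence, visited):
        successors = sorted(graph[node] - visited)
        if not successors:  # sink reached (or only visited nodes remain)
            sequences.append(sequence)
        for nxt in successors:
            # Each step along an edge adds one new base to the sequence.
            walk(nxt, sequence + nxt[-1], visited | {nxt})

    for source in sources:
        walk(source, source, {source})
    return sorted(sequences)


# Hypothetical transcripts: isoform_2 skips the middle exon of isoform_1.
isoform_1 = "ATGGCGTTTCAAGGCCTGAAT"
isoform_2 = "ATGGCGTCCTGAAT"

# Simulate error-free reads of length 10 tiled across both isoforms.
read_length = 10
reads = [
    transcript[i:i + read_length]
    for transcript in (isoform_1, isoform_2)
    for i in range(len(transcript) - read_length + 1)
]

graph = build_graph(reads, k=5)
assembled = spell_paths(graph)
for sequence in assembled:
    print(sequence)
```

In this toy graph the node GGCGT has two successors, which is where the two isoforms diverge; the paths rejoin at CCTGA. Real data adds sequencing errors, repeats and uneven coverage, which is why the disadvantages of de novo assembly matter in practice.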
De novo Assembly
• Disadvantages:
  – Lots of data requires lots of RAM
  – Requires greater sequencing depth than reference-based assembly (30x compared with 10x).
  – Highly similar transcripts are likely to be assembled into single transcripts.
  – Sensitive to read errors. Hard to tell errors from low-abundance transcripts.

Expression analysis
The more abundant an RNA, the more times it will be randomly selected for sequencing.
The Cufflinks tool suite assembles transcripts and calculates their abundance.
[Figure: expressed mRNA and sequencing reads for Gene A in Sample 1 (control) and Sample 2 (infected)]

Expression analysis
• Use the number of mapped reads as an indicator of expression.
[Figure: reads mapped back to the genome for Gene A in Sample 1 (control) and Sample 2 (infected), showing differential expression]

Normalisation
• 2 sequence libraries can produce different volumes of data
  – transcript A is present at the same abundance in library X and library Y
  – library X produces 3 times more reads than library Y
  – transcript A in library X will appear 3 times more abundant.
• Need some way to normalise the expression data.
• Fragments Per Kilobase of exon, per Million fragments mapped (FPKM).
  – accounts for the number of reads in the experiment, the length of the transcript and the number of reads aligning to it.
  – allows comparisons between two datasets when there is considerably more data in one dataset than the other.

Tutorials
• Go through the tutorial sheet.
• The task:
  – Reference-based RNA-seq assembly using TopHat and Cufflinks in Galaxy.
  – RNA-seq expression analysis using Cuffcompare and Cuffdiff in Galaxy.
http://galaxy.tsl.ac.uk
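As a worked illustration of the Normalisation slide, the sketch below computes FPKM directly from its definition (fragments per kilobase of exon, per million fragments mapped). The function name `fpkm` and all of the counts are invented for the example, and Cufflinks itself estimates abundances with a more elaborate statistical model, so treat this as the arithmetic behind the unit rather than the tool's method.

```python
def fpkm(fragments_on_transcript, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon, per Million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments_on_transcript / (kilobases * millions_mapped)


# Hypothetical numbers: transcript A is 2,000 bp long and equally abundant
# in both libraries, but library X was sequenced 3 times more deeply.
transcript_length = 2_000

library_y = {"mapped": 10_000_000, "on_transcript_a": 500}
library_x = {"mapped": 30_000_000, "on_transcript_a": 1_500}

fpkm_y = fpkm(library_y["on_transcript_a"], transcript_length, library_y["mapped"])
fpkm_x = fpkm(library_x["on_transcript_a"], transcript_length, library_x["mapped"])

# Raw counts differ 3-fold; the normalised values agree.
print("raw counts:", library_x["on_transcript_a"], "vs", library_y["on_transcript_a"])
print("FPKM:", fpkm_x, "vs", fpkm_y)
```

With these made-up counts the raw read numbers suggest a 3-fold difference, while both FPKM values come out the same, which is the behaviour the slide asks of a normalisation method.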