Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/ Mapping RNA-seq data Matthew Young Alicia Oshlack Bernie Pope Illumina Sequencing Technology 3’ 5’ DNA (0.1-1.0 ug) A G C T G C T A C G A T A C C C G A T C G A T A T C G A T G C T Sample preparation Single molecule Cluster growtharray 5’ Sequencing 1 2 3 4 5 6 7 8 9 T G C T A C G A T … Image acquisition Base calling Slide courtesy of G Schroth, Illumina Raw data • Short sequence reads • Quality scores = -10log10(p) or similar… @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 Handout Reference: Page 2 Quality checks • Base composition • Quality score • PCR Artifacts Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS transcript CDS CDS CDS CDS Difficulty here is that reads spanning a exon-exon junction may not get mapped when mapping to the genome. One strategy: Supplement the reference genome with sequences that span all known or possible junctions. CDS Coding Sequence CDS Exons Introns Aggregate reads based on: exons Exons + junctions All reads start to end of transcript De novo methods CDS CDS Splice Junctions Mapping tools Type Name Link General aligner GMAP/GSNAP http://research-pub.gene.com/gmap/ BFAST http://sourceforge.net/apps/mediawiki/bfast/index.php BOWTIE http://bowtie-bio.sourceforge.net/index.shtml CloudBurst http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php GNUmap http://dna.cs.byu.edu/gnumap/index.shtml MAQ/BWA http://maq.sourceforge.net/ Perm http://code.google.com/p/perm/ RazerS http://www.seqan.de/projects/razers.html Mrfast/mrsfast http://mrfast.sourceforge.net/manual.html SOAP/SOAP2 http://soap.genomics.org.cn/ SHRiMP http://compbio.cs.toronto.edu/shrimp/ QPALMA/GenomeMapper/PALMapper http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper SpliceMap http://www.stanford.edu/group/wonglab/SpliceMap/ SOAPals http://soap.genomics.org.cn/ G-Mo. R-Se http://www.genoscope.cns.fr/externe/gmorse/ TopHat http://tophat.cbcb.umd.edu/ SplitSeek http://solidsoftwaretools.com/gf/project/splitseek Oases http://www.ebi.ac.uk/~zerbino/oases/ MIRA http://sourceforge.net/apps/mediawiki/mira-assembler/index.php De Novo annotator De Novo transcript assembler Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/ Familiarizing yourself with bowtie • The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference. • The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows Wheeler Transform to allow fast searching. • Many optional parameters to tweak alignment. Handout Reference: Pages 2-7 How do you map 10^9, 76bp sequences, to a 10^9 bp reference • Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on # mismatches, indels etc. • However, this is computationally impossible, so we have to come up with something else. Handout Reference: Pages 2-7 Lots of aligners, one general strategy • Quick “heuristic” is performed to cut down the number of candidate alignment regions for each read. • More precise algorithm is employed to decide which of these candidates is a valid alignment. Handout Reference: Pages 2-7 A fragment as seen by an aligner ; Seed Read Fragment Burrows Wheeler acts on this Smith-Waterman acts on this • Heuristic acts only on the seed. Putative mapping locations identified. • A more precise algorithm extends each seed to the full read and ranks them. • The fragment is not sequenced directly, it’s sequence is inferred based on mapping of the read. Handout Reference: Pages 2-7 Our test data set • 76bp single end reads. • Sequenced using the Illumina Genome Analyzer • RNA taken from Mouse myoblast cell line. • We only look at a subset of one lane, full data available here http://www.ncbi.nlm.nih.gov/geo/query/acc.c gi?acc=GSE2084 Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/ A strict mapping bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq Strict_Mapped.sam • -n sets the number of mismatches in the seed • The sum of quality scores at ALL mismatches (not just the seed) must be less than -e. • --un saves the unmapped reads to the file specified • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned Handout Reference: Page 8 Using the defaults bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq test.sam • • • • • -t prints timing information -p sets the number of simultaneous threads --sam makes bowtie output in SAM format --best ensures the best map is returned --un saves the unmapped reads to the file specified Handout Reference: Page 8 Allowing some mismatches Bowtie -t -p 1 -n 3 -e 200 --sam –best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9 Strict_Unmapped.fastq Loose_Mapped.sam • Set the number of seed mismatches to the maximum. • Set -e to a value more appropriate for our read length. • How many reads do you map? Handout Reference: Pages 8-9 A closer look at our data set: fastQC Handout Reference: Pages 9-10 Handout Reference: Pages 9-10 Trimming reads • Bowtie allows you to trim reads before it attempts to map them. • You can trim from the left (5’) end of the read with the --trim5 option. • You can trim from the right (3’) end of the read with the --trim3 option. Handout Reference: Page 10 Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS transcript CDS CDS CDS CDS A library containing these bits of sequence (which do not appear in the genome) can help map junction reads. This is called an exon-junction library. A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from). Handout Reference: Pages 10-12 Options to map more reads… • Trim some bases from the end of the reads using --trim5 and/or --trim3. • Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index. Handout Reference: Pages 9-12 Trim [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 -trim3 25 --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastq Trimmed_Mapped.sam Time loading forward index: 00:00:02 Time loading mirror index: 00:00:02 Seeded quality full-index search: 00:00:47 # reads processed: 531414 # reads with at least one reported alignment: 164256 (30.91%) # reads that failed to align: 367158 (69.09%) Reported 164256 alignments to 1 output stream(s) Time searching: 00:00:53 Overall time: 00:00:54 Handout Reference: Page 10 Then map to junctions [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastq Junctions_Mapped.sam Time loading forward index: 00:00:35 Time loading mirror index: 00:00:34 Seeded quality full-index search: 00:00:24 # reads processed: 367158 # reads with at least one reported alignment: 128756 (35.07%) # reads that failed to align: 238402 (64.93%) Reported 128756 alignments to 1 output stream(s) Time searching: 00:01:43 Overall time: 00:01:45 Handout Reference: Page 11 Number of mapped reads Mapping strategy Command line options No. Mapped Reads No. Unmapped Reads Reference Strict -n 0 -e 1 1,049,050 (47.32%) 1,167,992 (52.68%) Genome Loose -n 3 -e 200 1,783,048 (80.42%) 433,994 (19.58%) Genome Trimming -n 3 --trim5 15 --trim3 25 1,912,003 (86.24%) 305,039 (13.76%) Genome Junctions -n 3 -e 200 2,007,627 (90.55%) 209,415 (9.45%) Junction Library Handout Reference: Page 12 Further options Type Name Link General aligner GMAP/GSNAP http://research-pub.gene.com/gmap/ BFAST http://sourceforge.net/apps/mediawiki/bfast/index.php BOWTIE http://bowtie-bio.sourceforge.net/index.shtml CloudBurst http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php GNUmap http://dna.cs.byu.edu/gnumap/index.shtml MAQ/BWA http://maq.sourceforge.net/ Perm http://code.google.com/p/perm/ RazerS http://www.seqan.de/projects/razers.html Mrfast/mrsfast http://mrfast.sourceforge.net/manual.html SOAP/SOAP2 http://soap.genomics.org.cn/ SHRiMP http://compbio.cs.toronto.edu/shrimp/ QPALMA/GenomeMapper/PALMapper http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper SpliceMap http://www.stanford.edu/group/wonglab/SpliceMap/ SOAPals http://soap.genomics.org.cn/ G-Mo. R-Se http://www.genoscope.cns.fr/externe/gmorse/ TopHat http://tophat.cbcb.umd.edu/ SplitSeek http://solidsoftwaretools.com/gf/project/splitseek Oases http://www.ebi.ac.uk/~zerbino/oases/ MIRA http://sourceforge.net/apps/mediawiki/mira-assembler/index.php De Novo annotator De Novo transcript assembler Handout Reference: Pages 12-13