Mapping Sequence Tags Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers Raw Data Formats • Most common format is FASTQ – Four rows per read @HWI-EAS255_4_FC2010Y_1_43_110_790 TTAATCTACAGAATAGATAGCTAGCATATATTT + hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdh • First row identifies read by experiment and location on lane or plate • Second row gives sequence as read • Fourth row gives quality information Quality Scores • Modelled after PHRED scores: 10*log10( P(error)) • Calibrated by manufacturer based on previous experience • May be systematically conservative or optimistic Representing Quality Scores • SOLiD QV and 454 represent quality scores as decimal numbers • Illumina compresses scores as ASCII characters: quality scores range from 0 to 62 using ASCII 64 to 126 (normally scores are between 0 and 40) hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdh R e h d 82 (82-64)=18 101 (101-64)=37 104 (104-64)=40 100 (100-64)=36 Variation: SOLiD CSFASTA & QV • Color Space FASTA >1_51_64_F3 T10301031230333233203333000021122223 >1_51_127_F3 T20103232332031323101101002003103102 – Each second line contains color space with last base of primer • QV file contains integers with PHRED scores Alignment Pre-processing – Mapping Reads • Huge read numbers (10M – 100M) • Classical Dynamic Programming alignment is far too slow • Reads have more errors than typical Sanger sequence • RNA-Seq reads may be spliced and map between two locations on genome Mapping Software • BLAST and even BLAT (2002 high-speed method) are too slow • Many mapping algorithms Eland MAQ GSNAP BowTie mrFAST SOAP MoM BWA Fast Alignment Approaches • Key to rapid alignment is efficient errortolerant indexing of reference genome • MAQ uses spaced indexing • SOAP uses seed and hash look-up table for binary representation of query and reference • Bowtie & BWA both use Burrows-Wheeler transform to compress query and reference and then map Spaced Seed Method (MAQ) • Each position in the reference is cut into equal-sized pieces, called 'seeds' and these seeds are paired and stored in a lookup table. Each read is also cut up according to this scheme, and pairs of seeds are used as keys to look up matching positions in the reference. • The index can be very large – human genome takes 50GB Trapnell & Salzburg, Nature Biotech, 2009 Burrows-Wheeler Transform • First generate all rotations of the text, then sort them in lexicographic order. The last column is the transform • Repeated motifs get transformed into repeated characters – easily compressed • Easily invertible • Allows hierarchical searching • Aligners maps compressed reads to compressed genome Burrows-Wheeler Algorithm • Trapnell & Salzburg, Nature Biotech, 2009 • • Reads are aligned character by character from right to left against the transformed string. With each new character, the algorithm updates an interval (indicated by blue 'beams') in the transformed string. Each successively aligned new character winnows the list of positions to which the read might map. When all characters in the read have been processed, alignments are represented by any positions within the interval. Much faster than spaced seed method Very efficient storage (2GB for human genome BowTie • Run from command line • Arguments: genome, FASTQ file, output >bowtie e_coli ./e_coli_1000.fq e_coli.map # reads processed: 1000 # reads with at least one reported alignment: 699 (69.90%) # reads that failed to align: 301 (30.10%) Reported 699 alignments to 1 output stream(s) • Pre-built indices at SourceForge • BowTie 2 - gapped alignments (v. Tophat) • Myrna - BowTie alignment ‘in the cloud’ BWA • Two stages: indexing and mapping • Stage 1 arguments: number of errors, number of errors in ‘seed’, reference, query and index filename bwa aln -n 6 -k 1 ref.db.fasta input.query.fq index.sai This allows 1 error in ‘seed’ but up to 6 errors in alignment • Time taken grows exponentially with k • Multi-thread option -t BWA – Second Stage • Transform index file into SAM file; still needs the reference file bwa samse -n 2 ref.db.fasta index.sai input.query.fq out.sam – This writes out up to 2 matches for each read • bwa sampe - paired end • Usually convert SAM to BAM file for easy storage Alignments in Color Space • Make color-space version of reference genome • BW transform reference CS sequence • Transform csfasta and qv to fastq • Map reads Multiple Mapping • Many short reads will map to multiple loci • Defaults are uniquely mapped reads only • Low complexity short reads will often map to very many possible loci • Most aligners don’t do a very good job of handling more than two loci • Hard to work with SAM/BAM information for multiply mapped reads • Rather important for low-complexity regions Using Quality Scores in Mapping • Alignment is already so demanding that most aligners do not use quality information • Some SNP callers use quality information SNP Calling • Variants in 3’ end of reads are more likely to be errors than variants in middle of reads • In Illumina data a true G is read as a T with higher probability than other substitutions • Therefore G->T variants are more likely than other variants to be errors (Bayes Theorem)