Read mapping

Mapping Sequence Tags
Biological Sequence Analysis
BNFO 691/602 Spring 2014
Mark Reimers
Raw Data Formats
• Most common format is FASTQ
– Four rows per read
• First row identifies read by experiment and
location on lane or plate
• Second row gives sequence as read
• Fourth row gives quality information
Quality Scores
• Modelled after PHRED scores:
10*log10( P(error))
• Calibrated by manufacturer based on
previous experience
• May be systematically conservative or
Representing Quality Scores
• SOLiD QV and 454 represent quality
scores as decimal numbers
• Illumina compresses scores as ASCII
characters: quality scores range from 0 to
62 using ASCII 64 to 126 (normally scores
are between 0 and 40)
82 (82-64)=18
101 (101-64)=37
104 (104-64)=40
100 (100-64)=36
Variation: SOLiD CSFASTA & QV
• Color Space FASTA
– Each second line contains color space with
last base of primer
• QV file contains integers with PHRED
Pre-processing – Mapping Reads
• Huge read numbers (10M – 100M)
• Classical Dynamic Programming
alignment is far too slow
• Reads have more errors than typical
Sanger sequence
• RNA-Seq reads may be spliced and map
between two locations on genome
Mapping Software
• BLAST and even BLAT (2002 high-speed
method) are too slow
• Many mapping algorithms
Fast Alignment Approaches
• Key to rapid alignment is efficient errortolerant indexing of reference genome
• MAQ uses spaced indexing
• SOAP uses seed and hash look-up table
for binary representation of query and
• Bowtie & BWA both use Burrows-Wheeler
transform to compress query and
reference and then map
Spaced Seed Method (MAQ)
• Each position in the reference
is cut into equal-sized pieces,
called 'seeds' and these seeds
are paired and stored in a
lookup table. Each read is also
cut up according to this
scheme, and pairs of seeds are
used as keys to look up
matching positions in the
• The index can be very large –
human genome takes 50GB
Trapnell & Salzburg,
Nature Biotech, 2009
Burrows-Wheeler Transform
• First generate all rotations of the text, then
sort them in lexicographic order. The last
column is the transform
• Repeated motifs get transformed into
repeated characters – easily compressed
• Easily invertible
• Allows hierarchical searching
• Aligners maps compressed reads to
compressed genome
Burrows-Wheeler Algorithm
Trapnell & Salzburg,
Nature Biotech, 2009
Reads are aligned character by
character from right to left against the
transformed string. With each new
character, the algorithm updates an
interval (indicated by blue 'beams') in
the transformed string. Each
successively aligned new character
winnows the list of positions to which
the read might map. When all
characters in the read have been
processed, alignments are represented
by any positions within the interval.
Much faster than spaced seed method
Very efficient storage (2GB for human
• Run from command line
• Arguments: genome, FASTQ file, output
>bowtie e_coli ./e_coli_1000.fq
# reads processed: 1000
# reads with at least one reported alignment: 699 (69.90%)
# reads that failed to align: 301 (30.10%)
Reported 699 alignments to 1 output stream(s)
• Pre-built indices at SourceForge
• BowTie 2 - gapped alignments (v. Tophat)
• Myrna - BowTie alignment ‘in the cloud’
• Two stages: indexing and mapping
• Stage 1 arguments: number of errors,
number of errors in ‘seed’, reference,
query and index filename
bwa aln -n 6 -k 1 ref.db.fasta
input.query.fq index.sai
This allows 1 error in ‘seed’ but up to 6 errors in
• Time taken grows exponentially with k
• Multi-thread option -t
BWA – Second Stage
• Transform index file into SAM file; still
needs the reference file
bwa samse -n 2 ref.db.fasta index.sai
input.query.fq out.sam
– This writes out up to 2 matches for each read
• bwa sampe - paired end
• Usually convert SAM to BAM file for easy
Alignments in Color Space
• Make color-space version of reference
• BW transform reference CS sequence
• Transform csfasta and qv to fastq
• Map reads
Multiple Mapping
• Many short reads will map to multiple loci
• Defaults are uniquely mapped reads only
• Low complexity short reads will often map to
very many possible loci
• Most aligners don’t do a very good job of
handling more than two loci
• Hard to work with SAM/BAM information for
multiply mapped reads
• Rather important for low-complexity regions
Using Quality Scores in Mapping
• Alignment is already so demanding that
most aligners do not use quality
• Some SNP callers use quality information
SNP Calling
• Variants in 3’ end of reads are more likely
to be errors than variants in middle of
• In Illumina data a true G is read as a T
with higher probability than other
• Therefore G->T variants are more likely
than other variants to be errors (Bayes