Introduction to Short Read Sequencing Analysis

advertisement
Introduction to Short Read Sequencing Analysis
Jim Noonan
GENE 760
Sequence read lengths remain limiting
Chr1: 249 Mb
249 Mb sequencing read
Current platforms:
• A moderate number (~500,000) of long reads (~10 kb)
• A very large number (>200 M) of short reads (100 bp)
• For most applications reads are aligned to a reference genome
• Short reads contain inherently limited information
• De novo assembly of short reads is difficult
Determining the identity and location of short sequence reads
in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to much larger reference
TAGATTACACAGATTAC
|||||||||||||||||
TAGT TAGATTACACAGATTAC TAGA
Need a computationally efficient method to perform
accurate alignments of millions of reads
Read length requirements vary depending on the
feature being studied
Exome:
80-120 bp
Splice junctions
(connectivity)
Transcriptome:
10,000 bp
Determining the identity and location of short sequence reads
in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to much larger reference
Exome or Genome
TAGATTACACAGATTAC
|||||||||||||||||
TAGT TAGATTACACAGATTAC TAGA
Considerations
•Alignment scoring
•Source of the reads
•Sequencing format (PE or SE)
•Read length
•Error rates
Transcriptome
Topics
•Scoring alignments
•Error rates and quality scores for short read sequencing
•Mapability
•Common algorithms for short read sequence alignment
•Scoring short read sequence aligments
•Uniform data output formats
Scoring alignments
Correct:
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
C
|
C
Match (+1)
C
Mismatch (-1, -2, etc.)
Wrong:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
T
A-TAC
|||||
ATTAC
A--AC
|||||
ATTAC
Gap penalty:
P = a +bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap
Many short read alignment algorithms
allow a fixed number of mismatches
Adapted from Mark Gerstein
Scoring alignments
Correct (polymorphism):
TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC
C
|
C
Match (+1)
C
Mismatch (-1, -2, etc.)
Wrong:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
T
A-TAC
|||||
ATTAC
A--AC
|||||
ATTAC
Gap penalty:
P = a +bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap
Many short read alignment algorithms
allow a fixed number of mismatches
Adapted from Mark Gerstein
Quality scores
A quality score (or Q-score) expresses the probability that a basecall is incorrect.
Given a basecall, A:
• The estimated probability that A is not correct is P(~A);
• The quality score for A is Q (A) = -10 log10 (P(~A))
A quality score of 10 means a probability of 0.1 that A is the wrong basecall.
P(~A) is platform-specific; Q-scores can be compared across platforms.
Quality scores are logarithmic:
Q-score
Error probability
10
0.1
20
0.01
40
0.0001
Error rates in lllumina sequencing reads
3
Reverse
termination
Add next base,
etc.
SEQUEN CIN G
1–3 d ays single-read run
3
SEQUEN CIN G
Add base
3–9 d ays p aired -end run
1–3 d ays single-read run
30 minut es hand s-on t ime
3–9 d ays p aired -end run
Sequencing
8 lanes, up to 96 samples
by
persynthesis
flow cell (run)
with reversible
dye terminators
30 minutes hands-on t ime
Scan flow cell
8 lanes, up to 96 samples
per flow cell (run)
1 cycle
www.illumina.com/ sequencing
w.illumina.com/ sequencing
Individual synthesis reactions
go out of phase
Error rates in lllumina sequencing reads
• Error rates are mismatch rates relative to reference genome
• Error rates increase with increasing cycle number
• Contingent on reference genome quality
• Reads may be trimmed to improve alignment quality
Illumina quality score encoding in FASTQ format
(CASAVA v1.8)
>90% Q30 bases in high quality run
>80% mappable reads
Sources of error in single-molecule sequencing
Illumina:
Consensus signal
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
PacBio:
One molecule, one read
TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC
Sequence templates multiple times
Mapability
•The genome contains non-unique sequences (repeats, segmental duplications)
•Short reads derived from repetitive regions are difficult to map
Chr3
Longer reads:
Paired reads:
repeat
Chr7
repeat
Mapability scores at UCSC
•The genome contains non-unique sequences (repeats, segmental duplications)
•Short reads derived from repetitive regions are difficult to map
36mers, 2 mismatches
75mers, 2 mismatches
100mers, 2 mismatches
Poorly mappable regions of the genome
36mers, 2 mismatches
75mers, 2 mismatches
100mers, 2 mismatches
Common algorithms for mapping short reads to a
reference genome
Program
Website
ELAND (v2)
N/A – integrated into Illumina pipeline
Bowtie
http://bowtie-bio.sourceforge.net/
BWA
http://bio-bwa.sourceforge.net/
Maq
http://maq.sourceforge.net/
Considerations
•Alignment scoring method
•Speed
•Quality aware
•Seeding
•Gapped alignment
Seed-based alignment strategy
Single seed alignments
Reference
Seed
Critical values are seed length and number of mismatches allowed
In ELAND:
Seed length = 32
Number of mismatches = 2
Multiseed alignments
(ELAND v2, others)
Seed interval
contingent on read length
Implementation in ELAND v2
A read must have at least one seed with no more than 2 mismatches and no gaps
Gapped alignment: extend each alignment to full length of read, allowing gaps
up to 10 bp
Resolving ambiguous read alignments with multiple seeds
Reference
Seed
Resolving ambiguous read alignments with multiple seeds
Utility of gapped alignments
RNA-seq
Insertions and deletion variants in
exome and whole genome sequencing
Mapping paired end reads
Read 1
Read 2
Insert size
Insert size
within specified range
ELAND alignment scoring
Base quality values and mismatch positions in a candidate alignment are
used to assign a p value
P values reflect probability that candidate position in genome would
give rise to the observed read if its bases were sequenced at error rates
corresponding to the read’s quality values
Alignment score for a read is computed from p values of all candidate
alignments
If there are two candidates for a read with p values 0.9 and 0.3:
• 0.9/(0.9+0.3) = 0.75, chance highest scoring alignment is correct
• 1- 0.75, chance highest scoring alignment is wrong
• Alignment score = -10 log(0.25) = 6.
BaseSpace
https://basespace.illumina.com/
Spaced-seed indexing of the
reference genome
• Need to break up the genome into
manageable segments
• Create index of short sequences
• Match seeds against genome index
alignment
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
Reference genome indexing using
Burrows-Wheeler transform
•
•
•
•
alignment
Reversible encoding scheme
Simplifies genome sequence
Results in “indexed” genome
Very rapid alignments
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
Bowtie 2
Pre-built Indexed genomes
Bowtie 1 and Bowtie 2
indexes are not compatible
Alignments in Bowtie 2
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Multiseed alignment (ungapped)
Seed length: 16 nt, every 10 nt
# mismatches: 0
Seeds are extended (gaps allowed) to generate alignment
Ref
Read
Match = 2
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
Mismatch = -6
Gap = -11
-5 to open
-3 to extend by 1 bp
Mapping in highly repetitive regions
ELAND is conservative
• Non-unique alignments are flagged; only one is reported in export.txt
• Post-alignment CASAVA analyses ignore these
Bowtie will report non-unique alignments
• User-specified options determine how these are reported
http://bowtie-bio.sourceforge.net/manual.shtml
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Sequence Alignment/Map (SAM) format
Standard format for reporting short read alignment data
• BAM is compressed version
Header
Alignment
info
http://samtools.sourceforge.net/
Summary
•Read the material posted for this lecture on the class wiki
•Next week: first Regulomics lecture
Download