Read Processing and Mapping: From Raw to Analysis-ready Reads
Ben Passarelli
Quake Lab
NGS Workshop
May 30, 2014
From Raw to Analysis-ready Reads
Raw reads → Read assessment and prep → Mapping → Duplicate marking → Local realignment → Base quality recalibration → Analysis-ready reads
Session Topics
• Brief overview of high-throughput sequencing platforms
• Understand read data formats and quality scores
• Identify and fix some common read data problems
• Find a genomic reference for mapping
• Map reads to a reference genome
• Understand alignment output
• Sort, merge, index alignment for further analysis
• Mark/eliminate duplicate reads
• Locally realign at indels
• Recalibrate base quality scores
• How to get started
Sequencing Platforms at a Glance
Illumina Sequencing Platforms

Features            MiSeq       NextSeq 500   HiSeq 2500
# Flowcells         1           1             2
# Sample Mixes      1           1             16
# Clusters / Run    25M         400M          3200M
Max Read Length     2x300       2x150         2x100
Gb / Run            15          120           640
Run Time            55 hours    30 hours      12 days
Reagent Cost / Gb   $79         $32           $36
Single Cell Analysis Toolset
• Built on R Statistics Package
• Differential gene expression analysis and visualization
• PCA
• Unsupervised clustering
• ANOVA (statistical hypothesis testing)
Sample to Raw Reads
• Sample Preparation: C1 Single Cell Capture, Imaging / Lysis, Amp of DNA / cDNA
• QC and Quantification: AATI Fragment Analyzer; Evaluate and Quantitate Harvested C1 DNA products
• Library Construction
• Sequencing: NextSeq 500, 300M or 800M Reads in ~24 hours
• Raw Reads
Solid Phase Amplification
Sequencing Steps
• Clusters are linearized
• Sequencing primer annealed
• All labeled dNTPs added at each cycle
• Intensity of different tags → base call
• Error Profile: substitutions
Library DNA binds to Oligos Immobilized on Glass Flowcell Surface
Instrument Output

Vendor                Instruments             Instrument output
Illumina              MiSeq, NextSeq, HiSeq   Base call file (.bcl)
LifeTech              PGM, Proton             Standard flowgram file (.sff)
Pacific Biosciences   RS                      Trace (.trc.h5), Pulse (.pls.h5), Base (.bas.h5)
Oxford Nanopore       MinION                  Squiggle (.h5)

All outputs → Sequence Data (FASTQ Format)
FASTQ Format (Illumina Example)
Each read record has four lines: header, read bases, separator (with optional repeated header), read quality scores.
Header fields in the first record: Flow Cell ID (D17DBACXX), Lane (2), Tile (1101), Tile Coordinates (12432:5554), Barcode (AGTCAA)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads
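The four-line record layout can be read with a few lines of Python. This is only a minimal sketch: `read_fastq` and `parse_illumina_header` are hypothetical helper names, and the header layout is assumed to match the Illumina example on the slide (colon-separated location fields, a space, then colon-separated read information ending in the barcode).

```python
from typing import Dict, Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, bases, qualities) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, separator, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield header[1:], bases, quals


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Pull out the header fields labeled on the slide (assumed layout)."""
    location, _, read_info = header.partition(" ")
    _instrument, _run, flowcell, lane, tile, x, y = location.split(":")
    return {
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "barcode": read_info.split(":")[-1],
    }


# First record from the slide
record = [
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA",
    "CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT",
    "+",
    "BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ",
]
header, bases, quals = next(read_fastq(record))
fields = parse_illumina_header(header)
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
```

For real files, a maintained parser such as Biopython (listed later in this deck) is probably the safer choice.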
Base Call Quality: Phred Quality Scores
Phred* quality score Q with base-calling error probability P
Q = -10 log10(P)
* Name of first program to assign accurate base quality scores. From the Human Genome Project.

Q score   Probability of base error   Base confidence   Sanger-encoded (Q score + 33) ASCII character
10        0.1                         90%               “+”
20        0.01                        99%               “5”
30        0.001                       99.9%             “?”
40        0.0001                      99.99%            “I”
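The relationship between Q, P, and the encoded character can be checked with a small Python sketch. The function names are illustrative only, and the default offset of 33 assumes Sanger / Illumina 1.8+ encoding.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def phred_to_char(q: int, offset: int = 33) -> str:
    """Encode a quality score as a single ASCII character."""
    return chr(q + offset)


def char_to_phred(c: str, offset: int = 33) -> int:
    """Decode a single ASCII character back to a quality score."""
    return ord(c) - offset


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q), phred_to_char(q))
```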
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126

S - Sanger          Phred+33   range: 0 to 40
I - Illumina 1.3+   Phred+64   range: 0 to 40
L - Illumina 1.8+   Phred+33   range: 0 to 41
Initial Read Assessment and Processing
Common problems that can affect analysis:
Low confidence base calls
• typically toward ends of reads
• criteria vary by application
Presence of adapter sequence in reads
• poor fragment size selection
• protocol execution or artifacts
Over-abundant sequence duplicates
Library contamination
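Since low confidence base calls tend to sit at the ends of reads, one simple remedy is to trim the 3' end until a base meets a quality threshold. The sketch below assumes Phred+33 encoding and a threshold of Q20; both are illustrative choices, and real trimmers typically use windowed or running-sum criteria rather than this single-base rule.

```python
from typing import Tuple


def trim_low_quality_tail(bases: str, quals: str, min_q: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their quality is below min_q."""
    end = len(bases)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quals[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33
print(trim_low_quality_tail("ACGTACGT", "IIIII###"))
```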
Quick Read Assessment: FastQC
Free Download
Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y
Samples reads (200K default): fast, low resource use
Read Assessment Example (Cont’d)
Trim for base quality or adapters
(run or library issue)
Trim leading bases
(library artifact)
Read Assessment Example (Cont’d)
TruSeq Adapter, Index 9
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
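Adapter read-through can be removed by clipping the read where the adapter begins. This is a simplified sketch that uses exact matching only. Production tools tolerate mismatches and use base qualities, so treat `clip_adapter` and its `min_overlap` parameter as hypothetical illustrations. The adapter constant is the TruSeq sequence shown above.

```python
TRUSEQ_INDEX9 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG"


def clip_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove the adapter, and anything after it, from the 3' end of a read."""
    hit = read.find(adapter)
    if hit != -1:
        return read[:hit]  # full adapter present inside the read
    # Otherwise look for the longest adapter prefix that ends the read
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


insert = "ACGTACGTAC"
print(clip_adapter(insert + TRUSEQ_INDEX9[:12], TRUSEQ_INDEX9))
```

Requiring a minimum overlap reduces the chance of clipping genuine sequence that happens to match the first few adapter bases.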
Comprehensive Read Assessment: Prinseq
http://prinseq.sourceforge.net/
Selected Tools to Process Reads
Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/
(partial list)
FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise).
FASTQ Clipper: Removing sequencing adapters
FASTQ Quality Filter: Filters sequences based on quality
FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
*defaults to old Illumina fastq (ASCII offset 64). Use -Q33 option.
SeqPrep https://github.com/jstjohn/SeqPrep
Adapter trimming
Merge overlapping paired-end reads
Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html
(for python programmers)
Especially useful for implementing custom/complex sequence analysis/manipulation
Galaxy http://galaxy.psu.edu
Great for beginners: upload data, point and click
Just about everything you’ll see in today’s presentations
SolexaQA2 http://solexaqa.sourceforge.net
Dynamic trimming
Length sorting (resembles read grouping of Prinseq)
Many Analysis Pipelines Start with Read Mapping
Genotyping/Haplotyping
Gene Expression
https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq
Tumor/Normal Comparison
https://www.broadinstitute.org/gatk/guide/best-practices
http://www.appistry.com/sites/all/themes/appistry/files/pdfs/CGAS_download.pdf
Read Mapping
http://www.broadinstitute.org/igv/
Sequence References and Annotations



http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome
Comprehensive reference information


http://hgdownload.cse.ucsc.edu/downloads.html
Comprehensive reference, annotation, and translation information



ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
References and SNP information data by GATK
Human only



http://cufflinks.cbcb.umd.edu/igenomes.html
Pre-indexed references and gene annotations for Tuxedo suite
Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast

http://www.repeatmasker.org
Fasta Sequence Format
• One or more sequences per file
• “>” denotes beginning of sequence or contig
• Subsequent lines up to the next “>” define sequence
• Lowercase base denotes repeat masked base
• Contig ID may have comments delimited by “|”
>chr1
…
TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga
attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga
aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa
taaccatatttgggatatactggattttaaaaatatatcactaatttcat
…
>chr2
…
>chr3
…
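The FASTA rules above translate directly into a short parser. This is a minimal sketch with hypothetical helper names (`read_fasta`, `masked_fraction`). It keeps the whole header line as the sequence ID and treats lowercase bases as repeat masked, as described on the slide.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence ID: sequence} from the lines of a FASTA file."""
    parts: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # ">" starts a new sequence or contig
            parts[name] = []
        elif name is not None:
            parts[name].append(line)  # sequence continues until the next ">"
    return {n: "".join(chunks) for n, chunks in parts.items()}


def masked_fraction(seq: str) -> float:
    """Fraction of bases that are lowercase (repeat masked)."""
    return sum(c.islower() for c in seq) / len(seq) if seq else 0.0


example = [">chr1\n", "ACGTac\n", "gt\n", ">chr2\n", "NNNN\n"]
for contig, seq in read_fasta(example).items():
    print(contig, len(seq), masked_fraction(seq))
```

Whole-genome references are large, so a real pipeline would stream or index the file rather than hold every sequence in memory.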
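Read Mapping
Novoalign (3.0)
  License: Commercial
  Mismatch allowed: up to 8
  Alignments reported per read: random/all/none
  Gapped alignment: up to 7bp
  Paired-end reads: yes
  Best alignment: highest alignment score
  Trim bases: 3’ end
  Comments: At one time, best performance and alignment quality
SOAP3 (0.01 beta)
  License: GPL v3
  Mismatch allowed: up to 3
  Alignments reported per read: random/all/none
  Gapped alignment: 1-3bp gap
  Paired-end reads: yes
  Best alignment: minimal number of mismatches
  Trim bases: 3’ end
  Comments: Can use NVIDIA CUDA (GPU)
BWA (0.7.8)
  License: GPL v3
  Mismatch allowed: user specified; max is function of read length and error rate
  Alignments reported per read: user selected
  Gapped alignment: yes
  Paired-end reads: yes
  Best alignment: minimal number of mismatches
  Trim bases: 3’ and 5’ end
  Comments: Element of Broad’s “best practices” genotyping workflow
Bowtie2 (2.2.2)
  License: Artistic
  Mismatch allowed: user specified
  Alignments reported per read: user selected
  Gapped alignment: yes
  Paired-end reads: yes
  Best alignment: highest alignment score
  Trim bases: 3’ and 5’ end
  Comments: Smith-Waterman quality alignments, currently fastest
Tophat2 (2.0.11)
  License: Artistic
  Mismatch allowed: uses Bowtie2
  Alignments reported per read: uses Bowtie2
  Gapped alignment: yes; splice junctions, introns
  Paired-end reads: yes
  Best alignment: uses Bowtie2
  Trim bases: uses Bowtie2
  Comments: Currently most popular RNA-seq aligner
STAR (2.3.0e)
  License: GPL v3
  Mismatch allowed: user specified
  Alignments reported per read: user selected
  Gapped alignment: yes; splice junctions, introns
  Paired-end reads: yes
  Best alignment: highest alignment score
  Trim bases: 3’ and 5’ end
  Comments: Very fast; uses memory to achieve performance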
Read Mapping: BWA
BWA Features
• Uses Burrows Wheeler Transform
— fast
— modest memory footprint (<4GB)
• Accurate
• Tolerates base mismatches
— increased sensitivity
— reduces allele bias
• Gapped alignment for both single- and paired-end reads
• Automatically adjusts parameters based on read lengths and
error rates
• Native BAM/SAM output (the de facto standard)
• Large installed base, well-supported
• Open-source (no charge)
Read Mapping: Bowtie2
Bowtie2
• Uses dynamic programming (edit distance scoring)
  o Eliminates need for realignment around indels
  o Can be tuned for different sequencing technologies
• Multi-seed search - adjustable sensitivity
• Input read length limited only by available memory
• Fasta or Fastq input
• Caveats
  o Longer input reads require much more memory
  o Trade-off parallelism with memory requirement
Dynamic Programming Illustration
http://bowtie-bio.sourceforge.net/bowtie2
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2,
Nature Methods. 2012, 9:357-359
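To make "dynamic programming (edit distance scoring)" concrete, here is a textbook edit-distance computation in Python. It is an illustration of the general technique only and is not Bowtie2's implementation, which uses its own scoring scheme and heavily optimized code. The function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # cost of deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (ca != cb)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGT"))
```

Keeping only two rows of the table holds memory proportional to one sequence length, although the running time still grows with the product of both lengths.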
SAM (BAM) Format
Sequence Alignment/Map format
Universal standard
Human-readable (SAM) and compact (BAM) forms
Superset of FASTQ
Structure
Header
version, sort order, reference sequences, read groups,
program/processing history
Alignment records
SAM/BAM Format: Header
[benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
Annotations: samtools to view bam; @HD header with sort order; @SQ reference sequence names with lengths; @RG read groups with platform, library and sample information; @PG program (analysis) history

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
…
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
…
@RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1
@RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1
@RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1
@RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1
@RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1
…
@RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023
@PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.li
@PG ID:bwa PN:bwa VN:0.6.2-r126
SAM/BAM Format: Alignment Records
[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
Columns: 1 read name, 2 flag, 3 reference name, 4 position, 5 mapping quality, 6 CIGAR, 7 mate reference name, 8 mate position, 9 insert size, 10 read bases, 11 base qualities, 12+ optional tags
HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188
TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC
=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/
RG:Z:86-191
HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119
ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA
8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0
RG:Z:SDH023
* Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability
http://samtools.sourceforge.net/SAM1.pdf
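An alignment record is a tab-separated line, so its mandatory columns can be pulled apart in a few lines of Python. This is a sketch with hypothetical helper names. To keep the example short, the read bases and qualities of the first record are replaced with "*", which the SAM format permits. The CIGAR helper sums the operations that consume reference bases (M, D, N, =, X).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REFERENCE_CONSUMING)


def parse_alignment(line: str) -> dict:
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
        "mapq": int(f[4]), "cigar": f[5], "rnext": f[6], "pnext": int(f[7]),
        "tlen": int(f[8]), "seq": f[9], "qual": f[10], "tags": f[11:],
    }


# First record from the slide, with SEQ and QUAL abbreviated to "*"
line = "\t".join([
    "HW-ST605:127:B0568ABXX:2:1201:10933:3739", "147", "chr1", "27675", "60",
    "101M", "=", "27588", "-188", "*", "*", "RG:Z:86-191",
])
rec = parse_alignment(line)
end = rec["pos"] + reference_span(rec["cigar"]) - 1
print(rec["rname"], rec["pos"], end, rec["tlen"])
```

For this record the alignment ends at 27775, and the distance from the mate start (27588) to that end is 188 bases, which matches the magnitude of the reported insert size.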
Compression is Big Win for HTS Data
[Figure: compression ratio improvement for 33.8M 100bp Illumina reads; axis from 3x to 6x]
Preparing for Next Steps
Subsequent steps require sorted and indexed BAMs
Sort orders: karyotypic, lexicographical
Indexing improves analysis performance
Picard tools: fast, portable, free
http://picard.sourceforge.net/command-line-overview.shtml
Sort: SortSam.jar
Merge: MergeSamFiles.jar
Index: BuildBamIndex.jar
Order: sort, merge (optional), index
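The two sort orders differ in how contig names are compared. The sketch below contrasts them in Python. Here karyotypic order is taken to mean the order of sequences in the reference, and the `reference_order` list is an abbreviated, assumed example rather than a complete sequence dictionary.

```python
# Abbreviated reference order (assumption for illustration)
reference_order = ["chrM", "chr1", "chr2", "chr3", "chr10", "chr19", "chr20", "chrX", "chrY"]
rank = {name: i for i, name in enumerate(reference_order)}

contigs = ["chr10", "chrX", "chr2", "chr1", "chrM", "chr20"]

# Lexicographical: plain string comparison, so "chr10" sorts before "chr2"
lexicographical = sorted(contigs)

# Karyotypic: follow the order in which the reference lists its sequences
karyotypic = sorted(contigs, key=rank.__getitem__)

print(lexicographical)
print(karyotypic)
```

Tools that expect one order may reject or mishandle files sorted in the other, so it helps to sort against the same reference used for mapping.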
Duplicate Marking
$ java -Xmx4g -jar <path to picard>/MarkDuplicates.jar \
INPUT=aligned.sorted.bam \
OUTPUT=aligned.sorted.dedup.bam \
VALIDATION_STRINGENCY=LENIENT \
METRICS_FILE=aligned.dedup.metrics.txt \
REMOVE_DUPLICATES=false \
ASSUME_SORTED=true
http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
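The idea behind duplicate marking can be sketched in Python. This is a simplified illustration and not Picard's algorithm: reads that share chromosome, start position, strand, and mate position are treated as one duplicate set, the read with the highest summed base quality is kept as the representative, and the rest get the duplicate flag bit (0x400) set rather than being removed, mirroring REMOVE_DUPLICATES=false. The `Read` class and its fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List

DUPLICATE_FLAG = 0x400  # SAM flag bit for PCR or optical duplicate


@dataclass
class Read:
    name: str
    chrom: str
    pos: int
    reverse: bool
    mate_pos: int
    base_quality_sum: int
    flag: int = 0


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Flag all but the best read in each group of identically placed reads."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.reverse, read.mate_pos)].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: r.base_quality_sum)
        for read in group:
            if read is not best:
                read.flag |= DUPLICATE_FLAG
    return reads


reads = mark_duplicates([
    Read("a", "chr1", 100, False, 300, 3000),
    Read("b", "chr1", 100, False, 300, 3500),
    Read("c", "chr1", 150, False, 300, 2000),
])
print([(r.name, bool(r.flag & DUPLICATE_FLAG)) for r in reads])
```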
SAM/BAM Format: Alignment Records
[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188
TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC
=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/
RG:Z:86-191
http://picard.sourceforge.net/explain-flags.html
http://samtools.sourceforge.net/SAM1.pdf
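The flag column is a bit field, so a value such as 147 can be decoded by testing each bit. The sketch below uses the bit meanings from the SAM specification; the helper name `explain_flag` is illustrative.

```python
from typing import List

SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> List[str]:
    """List the properties whose bits are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(explain_flag(147))
print(explain_flag(147 | 0x400))  # the same read after being marked as a duplicate
```

Duplicate marking changes only this field: the 0x400 bit is added and the rest of the record is left untouched.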
Local Realignment
BWT-based alignment is fast for matching reads to reference
Individual base alignments often sub-optimal at indels
APPROACH
• Fast read mapping with BWT-based aligner
• Realign reads at indel sites using gold standard (but much slower) Smith-Waterman¹ algorithm
BENEFITS
• Refines location of indels
• Reduces erroneous SNP calls
• Very high alignment accuracy in significantly less time, with fewer resources
¹ Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
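For reference, the Smith-Waterman local alignment score can be computed with a short dynamic-programming routine. This sketch returns only the best score, uses a linear gap penalty, and its scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions. It is not the realigner used by GATK.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            # Scores never drop below zero, which is what makes the alignment local
            current.append(max(0, diagonal, up, left))
        best = max(best, max(current))
        previous = current
    return best


# A read against a reference carrying a one-base insertion
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

The quadratic cost of filling the table is why this step is applied only around candidate indel sites rather than across the whole genome.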
Local Realignment
Raw BWA alignment
DePristo MA, et al. A framework for variation discovery and
genotyping using next-generation DNA sequencing data. Nat Genet.
2011 May;43(5):491-8. PMID: 21478889
Post re-alignment at indels
Base Quality Recalibration
STEP 1: Find covariates at non-dbSNP sites using:
Reported quality score
The position within the read
The preceding and current nucleotide (sequencer properties)
java -Xmx4g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I alignment.bam \
-R hg19/ucsc.hg19.fasta \
-knownSites hg19/dbsnp_135.hg19.vcf \
-o alignment.recal_data.grp
STEP 2: Generate BAM with recalibrated base scores:
java -Xmx4g -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R hg19/ucsc.hg19.fasta \
-I alignment.bam \
-BQSR alignment.recal_data.grp \
-o alignment.recalibrated.bam
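The logic of the two steps can be illustrated with a toy calculation: bases are grouped by covariates, mismatches against the reference are counted at sites not known to vary, and an empirical quality is derived for each group. The sketch below is a conceptual illustration, not GATK's model. The input tuple layout and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map (reported_q, cycle, context) to an empirical Phred quality.

    observations: iterable of
    (reported_q, cycle, context, is_mismatch, is_known_site) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [bases seen, mismatches]
    for reported_q, cycle, context, is_mismatch, is_known_site in observations:
        if is_known_site:
            continue  # known variant sites (e.g. dbSNP) are not counted as errors
        entry = counts[(reported_q, cycle, context)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    # Smoothed error rate (assumed smoothing), converted with Q = -10 log10(P)
    return {
        key: -10 * math.log10((mismatches + 1) / (total + 2))
        for key, (total, mismatches) in counts.items()
    }


# 999 bases reported as Q30 at cycle 1 in context "AC", 9 of them mismatches,
# plus one mismatch at a known variant site that is ignored
obs = [(30, 1, "AC", False, False)] * 990 + [(30, 1, "AC", True, False)] * 9
obs.append((30, 1, "AC", True, True))
print(empirical_qualities(obs))
```

In this toy bin the bases were reported as Q30 but behave like roughly Q20, which is the kind of discrepancy recalibration is meant to correct.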
Base Quality Recalibration (Cont’d)
Is there an easier way to get started?!
http://galaxyproject.org/
Click on “Use Galaxy”
Getting Started