Bioinformatics Tips (NGS data processing)

Bioinformatics Tips
NGS data processing and pipeline writing
Na Cai
3rd year DPhil in Clinical Medicine
Supervisor: Jonathan Flint
Example projects
CONVERGE
- 1.7x whole genome sequencing in 12,000 Han Chinese women
- 6,000 cases of major depression (MD), 6,000 controls
- Detailed questionnaire
- 45 TB of sequencing data
Commercial Outbred Mice
- 0.1x whole genome sequencing in 2,000 mice
- Known breeding history
- Extensive phenotyping
- 2 TB of sequencing data
NGS data processing
Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
One step at a time processing
• Make new directories as you go along
• Make flag files to indicate successful completion of the previous command
• Parallelize using make
• This is good for step by step troubleshooting
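A minimal sketch of the directory-per-step and flag-file idea, assuming each step is a single external command. The helper name `run_step` and the `.done` suffix are hypothetical choices for illustration; parallelization with make is not shown here.

```python
import subprocess
import sys
from pathlib import Path


def run_step(name, command, workdir):
    """Run one step in its own directory, guarded by a flag file.

    Returns True if the command was run, False if it was skipped
    because the flag file from an earlier successful run exists.
    """
    step_dir = Path(workdir) / name
    step_dir.mkdir(parents=True, exist_ok=True)  # new directory per step
    flag = step_dir / (name + ".done")           # hypothetical flag name
    if flag.exists():
        return False                             # already completed
    # check=True raises CalledProcessError on a non-zero exit status,
    # so the flag below is only written after a successful command.
    subprocess.run(command, cwd=step_dir, check=True)
    flag.touch()
    return True


if __name__ == "__main__":
    # Placeholder command; a real step would call Picard, GATK, etc.
    ran = run_step("01_hello", [sys.executable, "-c", "print('step ok')"], "work")
    print("ran" if ran else "skipped")
```

Re-running the script skips any step whose flag file exists, so a failed pipeline can be restarted from the step that broke.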
Pipeline writing – Ruffus
http://www.ruffus.org.uk
Setting up Ruffus
Once Ruffus is set up - Help
Once Ruffus is set up – just print
Processing a raw BAM file
• Things to consider
– How many samples one is processing
– Coverage per sample
– Ploidy of subjects
– Size of genome
– Source of DNA and possible contamination
– Server/cluster usage: how the jobs can be parallelized
Processing a raw BAM file
• Some manipulations of BAM files
– Converting between BAM and FASTQ files
– Indexing
– Coordinate sorting
– Splitting or merging
– Filtering out reads
– Masking entire regions
Tools of the Trade
• Picardtools
Tools of the Trade - Picardtools
• Commonly used Picard tools:
– ValidateSamFile
– SamToFastq
– MergeSamFiles
– ReplaceSamHeader
• Cool Picard options:
– SORT_ORDER <default=null>
– CREATE_INDEX <default=false>
– CREATE_MD5_FILE <default=F>
– VALIDATION_STRINGENCY <default=STRICT>
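A sketch of how the options above might be combined on a Picard command line. It only builds the argument list and does not run anything. The helper `picard_command`, the jar path `picard.jar` and the file names are assumptions for illustration; older Picard releases shipped one jar per tool (for example `MergeSamFiles.jar`), and some tools (such as SamToFastq) name their output argument differently.

```python
def picard_command(tool, inputs, output, picard_jar="picard.jar", **options):
    """Build a KEY=VALUE style Picard command line as an argument list.

    picard_jar is an assumed path; adjust it for your installation.
    """
    cmd = ["java", "-jar", picard_jar, tool]
    cmd += ["INPUT=" + str(path) for path in inputs]
    cmd.append("OUTPUT=" + str(output))
    # Extra options such as SORT_ORDER or CREATE_INDEX, in sorted order
    cmd += ["{}={}".format(key, value) for key, value in sorted(options.items())]
    return cmd


# Hypothetical file names, for illustration only.
merge = picard_command(
    "MergeSamFiles",
    ["lane1.bam", "lane2.bam"],
    "sample1.bam",
    SORT_ORDER="coordinate",
    CREATE_INDEX="true",
    VALIDATION_STRINGENCY="LENIENT",
)
print(" ".join(merge))
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later without shell quoting problems.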
Indel Realignment
Image from: http://www.broadinstitute.org/gatk/guide/best-practices
Why Realign Around Indels?
An example of a strand-discordant locus
• Several consecutive “SNPs” only found on reads ending on the right of the homopolymer
• Several consecutive “SNPs” only found on reads ending on the left of the homopolymer
• 7bp “T” homopolymer run
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf
Why Realign Around Indels?
Local realignment uncovers the hidden indel in these reads and eliminates all the potential FP SNPs
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf
How does it work?
Identified intervals:
• Known indels
• Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
• Reads where there is evidence of possible misalignment
Local realignment identifies the most parsimonious alignment along all reads at a problematic locus
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)
2. The score for an alternate consensus is the total sum of the quality scores of mismatching bases
[Figure: the same read pile shown two ways, either consistent with the reference sequence (three adjacent SNPs) or consistent with a 3bp insertion; realigning determines which is better]
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf
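A toy sketch of the scoring rule above: each candidate consensus is scored by the summed base qualities at mismatching positions, and the lowest total wins. This is an illustration under simplifying assumptions (ungapped placement of each read, reads no longer than the consensus, made-up sequences and qualities), not the actual IndelRealigner implementation.

```python
def mismatch_score(read, quals, consensus):
    """Smallest sum of base qualities at mismatches, over every
    ungapped placement of the read along the consensus.
    Assumes the read is no longer than the consensus."""
    best = None
    for offset in range(len(consensus) - len(read) + 1):
        window = consensus[offset:offset + len(read)]
        score = sum(q for base, q, ref in zip(read, quals, window) if base != ref)
        if best is None or score < best:
            best = score
    return best


def consensus_score(reads, consensus):
    """Total score of a read pile against one candidate consensus."""
    return sum(mismatch_score(seq, quals, consensus) for seq, quals in reads)


# Made-up example: the alternate consensus carries a 3bp insertion (AAA).
reference = "TTACGGATCCA"
alternate = "TTACGAAAGATCCA"
reads = [
    ("ACGAAAGA", [30] * 8),
    ("CGAAAGAT", [30] * 8),
    ("GAAAGATC", [30] * 8),
]

best = min([reference, alternate], key=lambda c: consensus_score(reads, c))
print("reference score:", consensus_score(reads, reference))
print("alternate score:", consensus_score(reads, alternate))
print("chosen consensus:", best)
```

In this made-up pile every read matches the alternate consensus exactly, so the insertion explains the data better than a run of adjacent mismatches against the reference.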
The Indel Realignment workflow
Original BAM file → RealignerTargetCreator → Intervals (.intervals) → IndelRealigner → Realigned BAM file
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf
Implementing the Indel Realignment
[Figure: reads from each sample (sample1 … sample7) laid out across sites (Site1 … Site8)]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there → need to pass in data for all samples at the same time
Implementing the Indel Realignment
[Figure: reads from each sample (sample1 … sample7) laid out across sites (Site1 … Site8)]
Once the intervals are identified, reads from any single sample can be realigned individually based on the sample’s own insertion/deletion lengths → only need to pass in one sample’s data at a time
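The two implementation notes above suggest one joint job followed by independent per-sample jobs. A sketch of how those command lines could be generated, assuming GATK 2.x/3.x-style arguments (these tools are not part of GATK 4); the jar path and all file names are hypothetical.

```python
# Assumed location of a GATK 2.x/3.x jar; adjust for your installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def target_creator_command(bams, reference, intervals):
    """One job over ALL samples: find intervals that need realignment."""
    cmd = GATK + ["-T", "RealignerTargetCreator", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]
    return cmd + ["-o", intervals]


def indel_realigner_command(bam, reference, intervals, out_bam):
    """One job PER sample: realign that sample's reads in the intervals."""
    return GATK + [
        "-T", "IndelRealigner", "-R", reference, "-I", bam,
        "-targetIntervals", intervals, "-o", out_bam,
    ]


# Hypothetical file names, for illustration only.
bams = ["sample1.bam", "sample2.bam", "sample3.bam"]
joint_job = target_creator_command(bams, "ref.fa", "all_samples.intervals")
per_sample_jobs = [
    indel_realigner_command(
        bam, "ref.fa", "all_samples.intervals",
        bam.replace(".bam", ".realigned.bam"),
    )
    for bam in bams
]

print(" ".join(joint_job))
for job in per_sample_jobs:
    print(" ".join(job))
```

The per-sample jobs share nothing but the intervals file, so they can be submitted to a cluster in parallel once the joint job has finished.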
Base Quality Score Recalibration (BQSR)
Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
Why BQSR?
• Quality scores are critical for all downstream analysis
• Systematic biases are a major contributor to bad calls
Example: Bias in the qualities reported depending on nucleotide context
[Figure: Empirical − Reported Quality by dinucleotide (AA, AG, CA, CG, GA, GG, TA, TG), y-axis from −10 to 10; original: RMSE = 4.188; recalibrated: RMSE = 0.281]
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3Base_recalibration.pdf
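A toy sketch of the quantity plotted above, "Empirical − Reported Quality": bases are binned by dinucleotide context and reported quality, mismatches are counted per bin, and the observed error rate is converted to a Phred score with Q = −10·log10(p). The +1/+2 smoothing and the input layout are assumptions for illustration, and the real BaseRecalibrator uses more covariates than this.

```python
import math
from collections import defaultdict


def empirical_quality(errors, observations):
    """Phred-scaled observed error rate, Q = -10 * log10(p).
    The +1 / +2 smoothing is an assumption that avoids log10(0)."""
    return -10.0 * math.log10((errors + 1.0) / (observations + 2.0))


def quality_bias(bases):
    """bases: iterable of (dinucleotide, reported_quality, is_mismatch),
    taken from sites that are not known variants.
    Returns {(dinucleotide, reported_quality): empirical - reported}."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [errors, observations]
    for dinuc, reported_q, is_mismatch in bases:
        entry = counts[(dinuc, reported_q)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    return {
        key: empirical_quality(errors, obs) - key[1]
        for key, (errors, obs) in counts.items()
    }


# Made-up counts: "AA" bases behave as reported, "CG" bases are worse.
bases = [("AA", 30, False)] * 998
bases += [("CG", 30, False)] * 990 + [("CG", 30, True)] * 10

for key, delta in sorted(quality_bias(bases).items()):
    print(key, round(delta, 2))
```

A negative value means the sequencer over-reported quality for that bin, which is the kind of context-dependent bias recalibration is meant to remove.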
The BQSR workflow
Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3Base_recalibration.pdf
Implementing the BQSR
[Figure: reads from each sample (sample1 … sample7) laid out across sites (Site1 … Site8)]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to build the recalibration table for the dataset → need to pass in all of the data for each sample
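Because recalibration works on all of one sample's data, the jobs split naturally by sample. A sketch of the two commands per sample, assuming GATK 2.x/3.x-style arguments (`BaseRecalibrator` to build the table, `PrintReads -BQSR` to apply it); the jar path, known-sites file and file names are hypothetical.

```python
# Assumed location of a GATK 2.x/3.x jar; adjust for your installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def base_recalibrator_command(bam, reference, known_sites, table):
    """Build the recalibration table from the whole BAM of one sample.
    known_sites are VCFs of known variants to mask out."""
    cmd = GATK + ["-T", "BaseRecalibrator", "-R", reference, "-I", bam]
    for vcf in known_sites:
        cmd += ["-knownSites", vcf]
    return cmd + ["-o", table]


def print_reads_command(bam, reference, table, out_bam):
    """Write a new BAM with the recalibrated base qualities applied."""
    return GATK + [
        "-T", "PrintReads", "-R", reference, "-I", bam,
        "-BQSR", table, "-o", out_bam,
    ]


# Hypothetical file names, for illustration only.
samples = ["sample1", "sample2"]
jobs = []
for sample in samples:
    bam = sample + ".realigned.bam"
    table = sample + ".recal.table"
    jobs.append(base_recalibrator_command(bam, "ref.fa", ["known_snps.vcf"], table))
    jobs.append(print_reads_command(bam, "ref.fa", table, sample + ".recal.bam"))

for job in jobs:
    print(" ".join(job))
```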
Variant Calling
Image from: http://www.broadinstitute.org/gatk/guide/best-practices
Variant Calling
Multi-sample calling integrates per-sample likelihoods to jointly estimate allele frequency of variation
[Figure labels: sample-associated reads (Individual 1, Individual 2, … Individual N); genotype likelihoods; joint estimate across samples; allele frequency; genotype frequencies; SNPs and indels]
• Simultaneous estimation of:
– Allele frequency (AF) spectrum Pr{AF = i | D}
– The probability that a variant exists Pr{AF > 0 | D}
– Assignment of genotypes to each sample
From: http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5Variant_calling.pdf
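To illustrate how per-sample genotype likelihoods can be pooled into one allele frequency estimate, here is a toy EM estimator for a biallelic site in diploid samples under Hardy-Weinberg proportions. This is a swapped-in, simpler technique and not the model used by the GATK callers, which compute the full Pr{AF = i | D} spectrum; the likelihood values are made up.

```python
def estimate_allele_frequency(genotype_likelihoods, iterations=100, eps=1e-6):
    """EM estimate of the alternate allele frequency at one site.

    genotype_likelihoods: one (L_hom_ref, L_het, L_hom_alt) tuple per
    diploid sample, i.e. Pr(reads | genotype), not all zero.
    Assumes Hardy-Weinberg genotype proportions.
    """
    n = len(genotype_likelihoods)
    freq = 0.5  # arbitrary starting value
    for _ in range(iterations):
        freq = min(max(freq, eps), 1.0 - eps)  # keep away from 0 and 1
        expected_alt = 0.0
        for l_ref, l_het, l_alt in genotype_likelihoods:
            # Posterior weight of each genotype given the current freq
            p_ref = (1.0 - freq) ** 2 * l_ref
            p_het = 2.0 * freq * (1.0 - freq) * l_het
            p_alt = freq ** 2 * l_alt
            total = p_ref + p_het + p_alt
            expected_alt += (p_het + 2.0 * p_alt) / total
        freq = expected_alt / (2.0 * n)  # expected alt alleles / all alleles
    return freq


# Made-up likelihoods: nine confident hom-ref samples and one confident het.
hom_ref = (1.0, 1e-3, 1e-6)
het = (1e-3, 1.0, 1e-3)
print(estimate_allele_frequency([hom_ref] * 9 + [het]))
print(estimate_allele_frequency([hom_ref] * 10))
```

With one heterozygote among ten diploid samples the estimate settles close to 1 in 20 alleles, while a site with only confident hom-ref samples is driven toward zero.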
Implementing Variant Calling
[Figure: reads from each sample (sample1 … sample7) laid out across sites (Site1 … Site8)]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at the site → need to pass in data for all samples at a particular site at the same time
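Since calling needs every sample at a site but sites are independent, one way to parallelize is by genomic region rather than by sample. A sketch assuming GATK 2.x/3.x-style `UnifiedGenotyper` arguments; the jar path, contig lengths, window size and file names are hypothetical.

```python
# Assumed location of a GATK 2.x/3.x jar; adjust for your installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def split_into_regions(contig_lengths, window):
    """Cut each contig into 1-based, inclusive windows: 'contig:start-end'."""
    regions = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            regions.append("{}:{}-{}".format(contig, start, end))
    return regions


def unified_genotyper_command(bams, reference, region, out_vcf):
    """One job per region, reading ALL samples restricted to that region."""
    cmd = GATK + ["-T", "UnifiedGenotyper", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]
    return cmd + ["-L", region, "-o", out_vcf]


# Hypothetical inputs, for illustration only.
bams = ["sample1.recal.bam", "sample2.recal.bam", "sample3.recal.bam"]
regions = split_into_regions({"chr1": 250, "chr2": 100}, window=100)
jobs = [
    unified_genotyper_command(bams, "ref.fa", region, "calls.{}.vcf".format(i))
    for i, region in enumerate(regions)
]

for job in jobs:
    print(" ".join(job))
```

The per-region VCFs would then need to be concatenated in genomic order; that step is not shown.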
Variant Calling Software
• Samtools
• GATK UnifiedGenotyper
• GATK HaplotypeCaller
• Platypus
• Cortex (de novo assembly + variant caller)
Comparison between callers
Acknowledgements
Jonathan Flint
Richard Mott
Winni Kretzschmar
Robbie Davies
Leo Goodstadt (Ruffus)
Kiran Garimella (GATK)
Gerton Lunter (Stampy)
Andy Rimmer (Platypus)
Zam Iqbal (Cortex)
John Broxholme