Bioinformatics Tips NGS data processing and pipeline writing Na Cai 3rd year DPhil in Clinical Medicine Supervisor: Jonathan Flint Example projects CONVERGE - 1.7x whole genome sequencing in 12,000 Han Chinese Women - 6000 Cases of MD, 6000 controls - Detailed questionnaire - 45T of sequencing data Commercial Outbred Mice - 0.1x whole genome sequencing in 2,000 mice - Known breeding history - Extensive phenotyping - 2T of sequencing data NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices One step at a time processing • Make new directories as you go along • Make flag files to indicate successful completion of previous command • Parallelize using make • This is good for step by step troubleshooting Pipeline writing – Ruffus http://www.ruffus.org.uk Setting up Ruffus Once Ruffus is set up - Help Once Ruffus is set up – just print NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices Processing a raw BAM file • Things to consider – How many samples one is processing – Coverage per sample – Ploidy of subjects – Size of genome – Source of DNA and possible contamination – Server/cluster usage: How the jobs can be parallelized Processing a raw BAM file • Some manipulations of bam files – Converting between bams and fastqs – Indexing – Coordinate sorting – Splitting or merging – Filter out reads – Mask entire regions Tools of the Trade • Picardtools Tools of the Trade - Picardtools • Commonly used Picard tools: – – – – ValidateSamFile SamToFastq MergeSamFiles ReplaceSamHeader • Cool Picard options: – – – – SORT_ORDER <default=null> CREATE_INDEX <default=false> CREATE_MD5_FILE <default=F> VALIDATION_STRINGENCY <default=STRICT> NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices Indel Realignment Image from: http://www.broadinstitute.org/gatk/guide/best-practices Why Realign Around Indels? An*example*of*a*strand&discordant*locus* Several) consecuDve) “SNPs”)only)found) on)reads)ending) on)the)right)of)the) homopolymer) Several) consecuDve) “SNPs”)only)found) on)reads)ending) on)the)leH)of)the) homopolymer) 7bp)“T”) homopolymer)run) Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf Why Realign Around Indels? Local*realignment*uncovers*the*hidden*indel*in*these* reads*and*eliminates*all*the*poten7al*FP*SNPs* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf How does it work? Identified intervals: • Known Indels Local* realignment* iden7fies*most*parsimonious* • Indels discovered in original alignments (in CIGAR strings of alignment* along* reads in BAM files)all*reads*at*a*problema7c*locus* • the* Reads where there is evidence of that,* possible misalignment 1.*Find* best*alternate* consensus* sequence* together* with*the* reference,*best*fits*the*reads*in*a*pile*(maximum*of*1*indel)*** Ref:* Three* adjacent* SNPs* AAGCGTCG ! AAGCGTCG ! Realigning* determines* which*is* beVer* AAG---CG ! Read*pile*consistent* with*the*reference* sequence* Read*pile*consistent* with*a*3bp*inser7on* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-22.*The*score*for*an*alternate*consensus*is*the*total*sum*of*the*quality* Realignment.pdf scores*of*mismatching*bases* TheIndel* Indel Realignerworkflow* Workflow Realignment* Original*BAM*file* RealignerTargetCreator) Intervals*(.intervals)* IndelRealigner) Realigned*BAM*file* Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2Realignment.pdf Implementing the Indel Realignment Site1 sample1 reads sample2 reads sample3 reads sample4 … Site2 Site3 Site4 Site5 Site6 Site7 Site8 sample5 sample6 sample7 The RealignerTargetCreater needs as many reads from all the samples at a particular site to determine if reads tend to get misaligned there need to parse in data for all samples at the same time Implementing the Indel Realignment Site1 sample1 reads sample2 reads sample3 reads sample4 … Site2 Site3 Site4 Site5 Site6 Site7 Site8 sample5 sample6 sample7 Once the Intervals are identified, reads from any single sample can be realigned individually based on the sample’s own insertion/deletion lengths only need to parse in one sample’s data at a time Base Quality Score Recalibration (BQSR) Taken from: http://www.broadinstitute.org/gatk/guide/best-practices • Quality%scores%are%cri2cal%for%all%downstream%analysis%% • Systema2c%biases%are%a%major%contributor%to%bad%calls% Why BQSR? Example:%Bias%in%the%quali2es%reported%depending%on%nucleo2de%context%%% RMSE = 0.281 AA AG CA CG GA Dinuc GG 5 0 −5 Empirical − Reported Quality 5 0 −5 recalibrated% −10 original% −10 Empirical − Reported Quality 10 10 RMSE = 4.188 TA TG AA AG CA CG GA GG TA TG Dinuc Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3Base_recalibration.pdf The BQSR workflow Image from: http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3Base_recalibration.pdf Implementing the BQSR Site1 sample1 reads sample2 reads sample3 reads sample4 … Site2 Site3 Site4 Site5 Site6 Site7 Site8 sample5 sample6 sample7 The BaseRecalibrator needs all reads from each samples at all unmasked sites to come up with the recalibration table for the dataset need to parse in all of the data of each sample Variant Calling Image from: http://www.broadinstitute.org/gatk/guide/best-practices Variant Calling Multi-sample calling integrates per sample likelihoods to jointly estimate allele frequency of variation! Sample-associated reads! Genotype likelihoods! Allele frequency! Individual 1! Individual 2! Joint estimate across samples! SNPs and Indels! Individual N! Genotype frequencies! • Simultaneous'es6ma6on'of:' From: http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5– Allele'frequency'(AF)'spectrum'Pr{AF'='i'|'D}' Variant_calling.pdf – The'probability'that'a'variant'exists'Pr{AF'>'0'|'D}'' – Assignment'of'genotypes'to'each'sample' Implementing Variant Calling Site1 sample1 reads sample2 reads sample3 reads sample4 … Site2 Site3 Site4 Site5 Site6 Site7 Site8 sample5 sample6 sample7 The UnifiedGenotyper (and many other callers) needs as many reads from all the samples at a particular site to determine if there is a variant at the site tend need to parse in data for all samples at a particular site at the same time Variant Calling Softwares • • • • • Samtools GATK UnifiedGenotyper GATK HaplotypeCaller Platypus Cortex (denovo assembly + variant caller) Comparison between callers Acknowledgements Jonathan Flint Richard Mott Winni Kretzschmar Robbie Davies Leo Goodstadt (Ruffus) Kiran Garimella (GATK) Gerton Lunter (Stampy) Andy Rimmer (Platypus) Zam Iqbal (Cortex) John Broxholme