Detection and analysis of SNP polymorphisms
Alexis Dereeper
CIBA courses – Brasil 2011

Objectives
• Know and use the available packages and tools for SNP and indel detection from NGS data (assemblies of NGS reads)
• Consider the difficulties encountered when analysing next-generation sequencing data (distinguishing sequencing errors, paralogs and allelic variation)
• Detect SNPs and assign a genotype at every polymorphic position
• Exploit polymorphism data easily via a Web-based application (genetic diversity, LD)
• Obtain a usable dataset to submit for the design of a high-throughput SNP chip (Illumina VeraCode technology)

[Figure: overall workflow. Stages shown: short reads (Solexa); mapping (SAM); allelic variations (positions 867, 1998, 2341); list of SNPs (A/G, T/C, T/G); assignment of genotypes; design of an Illumina SNP chip; exploitation of polymorphism data]

Assignment of genotypes (example from the figure):
Ind1  ATTGTGTCGTAACGTATGTCATGTCGT
Ind2  ATTGTGTCGGAACGTATGTCATGTCGT
Ind3  ATTGTGTCGKAACGTATGTCATGTCGT

Tablet
• Graphical viewer for assemblies of NGS data
• Accepts different formats: ACE, SAM, BAM

Automatic detection of SNPs from a SAM assembly
Example of a pipeline feasible with the Galaxy system: 3 alternatives
Common steps: Fastq → FastQ Groomer → Mapping BWA → SAM assembly
• Alternative 1 (SamTools, VarScan): SAM-to-BAM → Generate Pileup → Pileup file → Pileup2snp → SNP tabular file
• Alternative 2 (SNiPlay Utilities): SamToFastaAlignments → FASTA alignments with IUPAC
• Alternative 3 (PicardTools, SamTools, GATK): AddReadGroupIntoSam → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file → VCFToFastaAlignments

VarScan
• Program for SNP detection from a pileup file: Pileup2snp
• Another module, Pileup2indel, exists for indels but is not yet implemented in Galaxy SouthGreen

Pileup format
Text file describing, for each position: the reference base, the depth of coverage, the variations and the base qualities

Sequence  Position  Ref  Depth  Read bases                    Base qualities
seq1      272       T    24     ,.$.....,,.,.,...,,,.,..^+.   <<<+;<<<<<<<<<<<=<;<;7<&
seq1      273       T    23     ,.....,,.,.,...,,,.,..A       <<<;<<<<<<<<<3<=<<<;<<+
seq1      274       T    23     ,.$....,,.,.,...,,,.,...      7<7;<;<<<<<<<<<=<;<;<<6
seq1      275       A    23     ,$....,,.,.,...,,,.,...^l.    <+;9*<<<<<<<<<=<<:;<<<<
seq1      276       G    22     ...T,,.,.,...,,,.,....        33;+<<7=7<<7<&<<1;<<6<
seq1      277       T    22     ....,,.,.,.C.,,,.,..G.        +7<;<<<<<<<&<=<<:;<<&<
seq1      278       G    23     ....,,.,.,...,,,.,....^k.     %38*<<;<7<<7<=<<<;<<<<<
seq1      279       C    23     A..T,,.,.,...,,,.,.....       ;75&<<<<<<<<<=<<<9<<:<<

SamToFastaAlignments and AceToFastaAlignments: SNiPlay utilities for management of NGS data
Inputs:
• Mapping: SAM format
• Assembly: ACE format
• Threshold values per genotype:

            Depth threshold   Heterozygosity   Depth threshold
            (for position)    (frequency)      (for heterozygosity estimation)
genotype1   1                 0                1
genotype2   4                 0.3              2
genotype3   4                 0.3              2

Outputs, for each contig (e.g. CL1Contig1):
• FASTA alignments including IUPAC codes: CL1Contig1.align.fa, CL1Contig2.align.fa, CL2Contig1.align.fa …
• List of heterozygous positions
• Stats: estimation of the average heterozygosity for each genotype

[Figure: alignment columns with IUPAC codes (A, A, T, Y, W)]

GATK (Genome Analysis ToolKit)
• Package for the analysis of NGS data
• Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)
• Includes tools for depth analysis, quality score recalibration and SNP/indel discovery
• Complementary to two other packages: SamTools, PicardTools

PREPROCESS:
* Index human genome (Picard), we used HG18 from UCSC.
* Convert Illumina reads to Fastq format
* Convert Illumina 1.6 read quality scores to standard Sanger scores

FOR EACH SAMPLE:
1. Align samples to genome (BWA), generates SAI files.
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAM Tools)
4. Sort BAM (SAM Tools)
5. Index BAM (SAM Tools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAM Tools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)

[Figure: two GATK workflows for several Fastq files (RC1 to RC4)]
• Workflow 1: each Fastq (RC1, RC2, RC3, RC4) → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM with read group; then mergeSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
• Workflow 2: Fastq (RC1), Fastq (RC2), Fastq (RC3), Fastq (RC4) → Fastq global → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file

VCF format (Variant Call Format)
Advantages: describes the variations at each position and the genotype assigned to each sample

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO                     FORMAT       NA00001         NA00002
20      14370  rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2  GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51
20      17330  .          T    A    3     q10     NS=3;DP=11;AF=0.017      GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3

Other functionalities of GATK
• DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual
• ReadBackedPhasing module: determines, where possible, which alleles are associated (phase or haplotype) at heterozygous positions

[Figure: phasing example; surviving labels: "And not", AGG, GGA]

SNiPlay: Web-based application for polymorphism analysis
http://sniplay.cirad.fr

Options of SNiPlay
• Select the VCF format
• Load the VCF file
• Load the reference file
• Select the Rice genome as reference

Design of Illumina chip
• Submission file for Illumina
• Genotyping file
• Analysis with the BeadStudio software
• Cartesian coordinates

Allelic files
Individuals: cARB = 1, cSYR = 2, cARA = 3

• PED format
[Excerpt for three individuals: parental IDs 0 0, sex 1, phenotype 0, followed by allele pairs coded as integers from 1 to 4]

• DARwin format
@DARwin 5.0 - ALLELIC - 2
33 20
N°  50  50  122  122  218  218  245  245  261  261  290  290  356  …
1   1   1   1    1    3    3    3    3    4    4    2    2    2    …
2   1   1   1    1    3    3    1    3    4    4    2    2    2    …
3   1   1   1    1    3    3    3    3    4    4    2    2    2    …
4   1   1   1    1    3    3    3    3    4    4    2    2    2    …

• .inp format for Phase
33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T

• Format for TASSEL (association studies)
33 10:2
      50   122  218  245  261  290  356  461  467  560
cARB  A:A  A:A  G:G  G:G  T:T  C:C  C:C  A:A  T:T  T:T
cSYR  A:A  A:A  G:G  A:G  T:T  C:C  C:C  A:A  T:T  C:T
cARA  A:A  A:A  G:G  G:G  T:T  C:C  C:C  A:A  T:T  T:T
cORL  A:A  A:A  G:G  G:G  T:T  C:C  C:C  A:A  T:T  T:T
cLAR  A:G  A:G  A:G  A:G  C:T  C:C  C:C  A:A  T:T  C:T
…

Annotation of SNPs

Diversity analysis
• SeqLib library
• Haplotype networks

[Figure: haplotype network. Labels: low-frequency haplotype; high-frequency haplotypes; distance between 2 haplotypes (number of mutations); group distribution within this haplotype]

Allele sharing between groups
External file (optional):
Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West
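
Sketch: allele sharing between groups
The optional external file above assigns each individual to a group. As a minimal illustration of what allele sharing could mean at a single locus, the sketch below collects the alleles seen in each group and intersects them for every pair of groups. The function names and the genotypes are hypothetical, and this is not necessarily how SNiPlay computes the statistic.

```python
from itertools import combinations

# Group file as shown on the slide (header line, then "individual, group").
GROUP_FILE = """Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West"""


def read_groups(text):
    """Map each individual to its group, skipping the header line."""
    groups = {}
    for line in text.strip().splitlines()[1:]:
        individual, group = (part.strip() for part in line.split(","))
        groups[individual] = group
    return groups


def alleles_per_group(genotypes, groups):
    """Collect the set of alleles observed in each group at one locus."""
    observed = {}
    for individual, genotype in genotypes.items():
        observed.setdefault(groups[individual], set()).update(genotype.split(":"))
    return observed


def shared_alleles(observed):
    """Alleles found in both groups, for every pair of groups."""
    return {
        (a, b): observed[a] & observed[b]
        for a, b in combinations(sorted(observed), 2)
    }


# Hypothetical genotypes at one locus, written as in the TASSEL format (A:G).
genotypes = {
    "Ind1": "A:A", "Ind2": "A:G", "Ind3": "A:A", "Ind4": "G:G",
    "Ind5": "G:G", "Ind6": "A:G", "Ind7": "G:G", "Ind8": "T:T",
}
sharing = shared_alleles(alleles_per_group(genotypes, read_groups(GROUP_FILE)))
```

With these made-up genotypes, the groups Table and East share both A and G, while West shares no allele with either.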
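
Sketch: reading a pileup line
The pileup format slide lists, for each position, the reference base, the depth and a string of read bases. The sketch below counts the bases in that string, assuming the usual SamTools conventions: "." and "," match the reference, a letter is a mismatch, "^" plus one character marks the start of a read, "$" marks the end of a read, and "+N" or "-N" announces an indel of length N. The function names are hypothetical.

```python
import re
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count the bases observed at one position of a pileup file."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            # Start of a read: skip "^" and the mapping-quality character.
            i += 2
        elif c == "$":
            # End of a read: the marker carries no base.
            i += 1
        elif c in "+-":
            # Indel: skip the sign, the length digits and the indel sequence.
            m = re.match(r"\d+", read_bases[i + 1:])
            if m is None:
                raise ValueError("indel marker without a length")
            i += 1 + len(m.group()) + int(m.group())
        else:
            # "." and "," are reference matches; anything else is counted as is.
            counts[ref_base.upper() if c in ".," else c.upper()] += 1
            i += 1
    return counts


def parse_pileup_line(line):
    """Return (sequence, position, reference base, depth, base counts)."""
    chrom, pos, ref, depth, bases = line.split()[:5]
    return chrom, int(pos), ref, int(depth), count_pileup_bases(ref, bases)


# Position 276 from the slide: 21 reads match the reference G, one read shows T.
line = "seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<"
chrom, pos, ref, depth, counts = parse_pileup_line(line)
print(pos, dict(counts))  # 276 {'G': 21, 'T': 1}
```

The "^" case matters for position 272 on the slide, where "^+" is a read start with mapping quality "+", not an insertion.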
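
Sketch: from base counts to an IUPAC code
The SNiPlay utilities slide gives, per genotype, a depth threshold and a heterozygosity frequency, and produces FASTA alignments with IUPAC codes. The sketch below shows one plausible reading of those two thresholds: below the depth threshold the position is written as "N", and the second most frequent base is kept only if its frequency reaches the threshold. The function name, the "N" convention and this exact rule are assumptions, not the documented behaviour of SamToFastaAlignments.

```python
# IUPAC codes for one base or for a pair of bases (diploid case only).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_genotype(counts, min_depth, min_freq):
    """Return an IUPAC code for one position of one genotype.

    counts    -- number of reads supporting each base, e.g. {"T": 5, "G": 5}
    min_depth -- depth threshold below which no call is made (assumed: "N")
    min_freq  -- minimum frequency of the second base to call a heterozygote
    """
    counts = {base: counts.get(base, 0) for base in "ACGT"}
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return "N"
    # Rank bases by decreasing count; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    called = {ranked[0][0]}
    second_base, second_count = ranked[1]
    if second_count > 0 and second_count / depth >= min_freq:
        called.add(second_base)
    return IUPAC[frozenset(called)]


# Thresholds of genotype2 on the slide: depth 4, frequency 0.3.
print(call_genotype({"T": 5, "G": 5}, 4, 0.3))  # K
print(call_genotype({"T": 9, "G": 1}, 4, 0.3))  # T
print(call_genotype({"T": 2, "G": 1}, 4, 0.3))  # N
```

The first call reproduces the K (G or T) shown for Ind3 on the objectives slide.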
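
Sketch: decoding genotypes from a VCF line
In the VCF example, the GT field stores allele indexes (0 for REF, 1 for the first ALT) for each sample. The sketch below turns those indexes back into bases for the two records shown on the slide. The function name is hypothetical, and the line is split on any whitespace because the slide text is space-separated; a real VCF file is tab-delimited.

```python
def decode_genotypes(line, sample_names):
    """Return (chromosome, position, {sample: genotype written with bases})."""
    fields = line.split()
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    alleles = [ref] + alt.split(",")   # index 0 is REF, 1.. are ALT alleles
    keys = fields[8].split(":")        # FORMAT column, e.g. GT:GQ:DP:HQ
    genotypes = {}
    for name, sample in zip(sample_names, fields[9:]):
        values = dict(zip(keys, sample.split(":")))
        gt = values["GT"]
        sep = "|" if "|" in gt else "/"   # "|" phased, "/" unphased
        genotypes[name] = sep.join(
            "." if index == "." else alleles[int(index)]
            for index in gt.split(sep)
        )
    return chrom, pos, genotypes


samples = ["NA00001", "NA00002"]
record = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
          "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51")
print(decode_genotypes(record, samples))
# ('20', 14370, {'NA00001': 'G|G', 'NA00002': 'A|G'})
```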
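
Sketch: converting Illumina quality scores to Sanger scores
The preprocessing list includes the conversion of Illumina 1.6 quality scores to standard Sanger scores, which is also what FastQ Groomer takes care of in Galaxy. The sketch below assumes that these Illumina scores are encoded as Phred+64 characters and that Sanger scores are Phred+33 characters, so the conversion shifts each character by 31. The function name is hypothetical.

```python
def illumina64_to_sanger(quality):
    """Convert a Phred+64 quality string into a Phred+33 (Sanger) string."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64          # Phred score under the +64 encoding
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))
    return "".join(converted)


# "h" encodes Q40 and "B" encodes Q2 under Phred+64.
print(illumina64_to_sanger("hhhB"))  # III#
```

Raising an error on characters below "@" is a design choice: such characters usually mean the file is already in Sanger encoding, and converting it a second time would silently corrupt the scores.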