snp_polymorphisms

Detection and analysis of SNP polymorphisms
Alexis Dereeper
CIBA courses – Brasil 2011
Objectives
• Learn to use the available packages/tools for SNP and INDEL detection from NGS data (assembly of NGS data)
• Consider the difficulties encountered when analysing next-generation sequencing data (distinguishing sequencing errors, paralogs and allelic variation)
• Detect SNPs and assign genotypes at every polymorphic position
• Easily exploit polymorphism data via a Web-based application (genetic diversity, LD)
• Obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)
[Figure: Short reads (Solexa) → Mapping (SAM) → Allelic variations → List of SNPs → Assignation of genotypes → Design of an Illumina SNP chip / Exploitation of polymorphism data. Labels: positions 867, 1998, 2341; alleles A/G, T/C, T/G; individuals Ind1, Ind2, Ind3; genotype sequences with IUPAC codes:]
ATTGTGTCGTAACGTATGTCATGTCGT
ATTGTGTCGGAACGTATGTCATGTCGT
ATTGTGTCGKAACGTATGTCATGTCGT
Tablet
• Graphical viewer for assembly of NGS data
• Accepts different formats: ACE, SAM, BAM
Automatic detection of SNP from SAM assembly
Example of a pipeline feasible with the Galaxy system: 3 alternatives
Tools involved: PicardTools, SamTools, GATK, VarScan, SNiPlay Utilities
Common steps: Fastq → FastQ Groomer → Mapping BWA → SAM assembly
• VarScan: SAM-to-BAM → Generate Pileup → Pileup file → Pileup2snp → SNP tabular file
• SNiPlay Utilities: SamToFastaAlignments → FASTA alignments with IUPAC
• GATK: AddReadGroupIntoSam → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file → VCFToFastaAlignments
Varscan
Program for SNP detection from a Pileup file: Pileup2snp
Another module, Pileup2indel, exists for indels but is not yet implemented in Galaxy SouthGreen
Pileup format
Text file describing, for each position: reference base, depth of coverage, variations, quality
seq1  272  T  24  ,.$.....,,.,.,...,,,.,..^+.  <<<+;<<<<<<<<<<<=<;<;7<&
seq1  273  T  23  ,.....,,.,.,...,,,.,..A      <<<;<<<<<<<<<3<=<<<;<<+
seq1  274  T  23  ,.$....,,.,.,...,,,.,...     7<7;<;<<<<<<<<<=<;<;<<6
seq1  275  A  23  ,$....,,.,.,...,,,.,...^l.   <+;9*<<<<<<<<<=<<:;<<<<
seq1  276  G  22  ...T,,.,.,...,,,.,....       33;+<<7=7<<7<&<<1;<<6<
seq1  277  T  22  ....,,.,.,.C.,,,.,..G.       +7<;<<<<<<<&<=<<:;<<&<
seq1  278  G  23  ....,,.,.,...,,,.,....^k.    %38*<<;<7<<7<=<<<;<<<<<
seq1  279  C  23  A..T,,.,.,...,,,.,.....      ;75&<<<<<<<<<=<<<9<<:<<
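To make the "variations" column of the pileup easier to read, here is a minimal Python sketch that counts the bases observed at one position. It is an illustration only, not the algorithm used by Pileup2snp. The function names `count_pileup_bases` and `variant_frequency` are hypothetical. The sketch assumes the usual pileup conventions: `.` and `,` are matches to the reference on the forward and reverse strand, `^` plus one character marks the start of a read, `$` marks the end of a read, and `+n`/`-n` introduce an indel of length n.

```python
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count the bases observed at one pileup position (illustrative sketch)."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            # Start of a read: '^' is followed by one mapping-quality character.
            i += 2
        elif c == "$":
            # End of a read: a marker only, it carries no base.
            i += 1
        elif c in "+-":
            # Indel: sign, length, then that many inserted/deleted bases (skipped here).
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
        elif c in ".,":
            # Match to the reference base (forward or reverse strand).
            counts[ref_base.upper()] += 1
            i += 1
        else:
            # Mismatch (A, C, G, T, N in either case) or '*' for a deleted base.
            counts[c.upper()] += 1
            i += 1
    return counts


def variant_frequency(ref_base, read_bases):
    """Fraction of observed bases that differ from the reference."""
    counts = count_pileup_bases(ref_base, read_bases)
    depth = sum(counts.values())
    return (depth - counts[ref_base.upper()]) / depth if depth else 0.0
```

For position 277 of the example (reference T, depth 22), `count_pileup_bases("T", "....,,.,.,.C.,,,.,..G.")` gives 20 T, 1 C and 1 G. Two non-reference bases out of 22 are more likely sequencing errors than a real SNP, which is why a detection tool needs frequency and quality thresholds.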
SamToFastaAlignments and AceToFastaAlignments:
SNiPlay utilities for management of NGS data
Input: Mapping (SAM format) or Assembly (Ace format)
Threshold values per genotype:
            Depth   Frequency   Depth
genotype1   1       0           1
genotype2   4       0.3         2
genotype3   4       0.3         2
(Column annotations in the figure: depth threshold; heterozygosity; depth threshold for heterozygosity estimation)
Output, for each contig (e.g. CL1Contig1):
+ FASTA alignments including IUPAC: CL1Contig1.align.fa, CL1Contig2.align.fa, CL2Contig1.align.fa …
+ List of heterozygous positions
+ Stats: estimation of average heterozygosity for each genotype
[Figure: example alignment column; surviving labels: A, A, T, Y, W]
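The role of the depth and frequency thresholds can be sketched in Python as follows. This is an assumption about how such thresholds are typically applied, not the actual SamToFastaAlignments code. The function `call_iupac` and its parameters `min_depth` and `min_freq` are hypothetical names, and the third threshold shown on the slide is not modelled here.

```python
# IUPAC codes for one base (homozygous) or two bases (heterozygous).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_iupac(counts, min_depth, min_freq):
    """Return an IUPAC code for one position of one genotype.

    counts    : dict mapping base (A/C/G/T) to number of reads
    min_depth : below this depth the position is reported as N (assumption)
    min_freq  : minimum frequency of the second allele to call a heterozygote
    """
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return "N"
    # Sort alleles by decreasing read count (ties broken alphabetically).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    alleles = {ranked[0][0]}
    if len(ranked) > 1:
        base, n = ranked[1]
        if n > 0 and n / depth >= min_freq:
            alleles.add(base)
    return IUPAC[frozenset(alleles)]
```

With the thresholds shown for genotype2 (depth 4, frequency 0.3), `call_iupac({"G": 6, "T": 4}, 4, 0.3)` returns `K`, while `call_iupac({"T": 9, "G": 1}, 4, 0.3)` returns `T` because the minor allele is too rare to be trusted.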
GATK (Genome Analysis ToolKit)
• Package for analysis of NGS data.
• Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)
• Includes tools for depth analysis, quality score recalibration, SNP/InDel discovery
• Complementary to 2 other packages: SamTools, PicardTools
PREPROCESS:
* Index the human genome (Picard); we used HG18 from UCSC.
* Convert Illumina reads to Fastq format
* Convert Illumina 1.6 read quality scores to standard Sanger scores
FOR EACH SAMPLE:
1. Align samples to genome (BWA), generates SAI files.
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAM Tools)
4. Sort BAM (SAM Tools)
5. Index BAM (SAM Tools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAM Tools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
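Steps 1 to 5 of the per-sample list can be sketched as a small Python helper that builds the command lines. This is only a sketch under assumptions: the file names are hypothetical, the reference is assumed to be already indexed with `bwa index`, and the `samtools sort -o` form corresponds to recent SAMtools releases (older releases take an output prefix instead). The GATK steps are left out.

```python
import subprocess


def alignment_commands(reference, fastq, prefix):
    """Build (argv, stdout_file) pairs for steps 1-5; nothing is executed here."""
    sai, sam = prefix + ".sai", prefix + ".sam"
    bam, sorted_bam = prefix + ".bam", prefix + ".sorted.bam"
    return [
        (["bwa", "aln", reference, fastq], sai),             # 1. align, SAI on stdout
        (["bwa", "samse", reference, sai, fastq], sam),      # 2. SAI to SAM (single-end)
        (["samtools", "view", "-b", "-S", sam], bam),        # 3. SAM to BAM
        (["samtools", "sort", "-o", sorted_bam, bam], None),  # 4. sort BAM
        (["samtools", "index", sorted_bam], None),           # 5. index BAM
    ]


def run(commands):
    """Run each command in order, redirecting stdout to a file when one is given."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling `run(alignment_commands("ref.fa", "sample1.fastq", "sample1"))` would execute the five steps for one sample, provided BWA and SAMtools are installed.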
[Figure: two GATK workflows for several samples (RC1 to RC4)]
• Per-sample mapping: Fastq (RC1), Fastq (RC2), Fastq (RC3), Fastq (RC4) → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam (each sample separately) → one SAM with read group per sample → mergeSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
• Pooled mapping: Fastq (RC1), Fastq (RC2), Fastq (RC3), Fastq (RC4) → Fastq global → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
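Read groups are what allow a genotyper to tell the samples apart once the SAM files are merged. As a rough sketch of what a tool such as AddReadGroupIntoSam has to do (the actual implementation is not shown in these slides), the Python function below adds an `@RG` header line and an `RG:Z:` tag to every alignment. The function name `add_read_group` is hypothetical.

```python
def add_read_group(sam_lines, rg_id, sample):
    """Return SAM lines with one @RG header and an RG:Z tag on each alignment."""
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Existing header lines (@HD, @SQ, ...) are kept unchanged.
            header.append(line)
        else:
            # Alignment line: append the read-group tag as an optional field.
            records.append(line + "\tRG:Z:" + rg_id)
    # ID is the read-group identifier, SM the sample name.
    header.append("@RG\tID:" + rg_id + "\tSM:" + sample)
    return header + records
```

Each sample (RC1, RC2, …) would receive its own identifier, for example `add_read_group(lines, "RC1", "RC1")`, before the files are merged.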
VCF format (Variant Call Format)
Advantage: describes the variations at each position together with the genotype assignment
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO                     FORMAT       NA00001        NA00002
20      14370  rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2  GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51
20      17330  .          T    A    3     q10     NS=3;DP=11;AF=0.017      GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3
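To show how the genotype assignment is read from a VCF record, here is a minimal Python sketch. The function name `parse_vcf_line` is hypothetical, and the parser splits on any whitespace for convenience (real VCF files are tab-delimited). It is not a complete VCF reader.

```python
def parse_vcf_line(line, sample_names):
    """Decode the GT field of each sample into nucleotide alleles."""
    fields = line.split()
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    # Allele index 0 is REF, 1..n are the ALT alleles in order.
    alleles = [ref] + alt.split(",")
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:GQ:DP:HQ
    genotypes = {}
    for name, value in zip(sample_names, fields[9:]):
        sample = dict(zip(keys, value.split(":")))
        gt = sample["GT"]
        sep = "|" if "|" in gt else "/"  # '|' phased, '/' unphased
        genotypes[name] = sep.join(
            "." if a == "." else alleles[int(a)] for a in gt.split(sep)
        )
    return chrom, int(pos), genotypes
```

Applied to the first record of the example with samples NA00001 and NA00002, the function returns chromosome `20`, position `14370` and the genotypes `G|G` and `A|G`.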
Other functionalities of GATK
• DepthOfCoverage module:
Reports the sequencing depth of coverage for each gene, each position and each individual
• ReadBackedPhasing module:
Determines, where possible, the allele association (phase or haplotype) at heterozygous positions
[Figure: phasing example; surviving labels: "And not", AGG, GGA]
SNiPlay: Web-based application for polymorphism analysis
http://sniplay.cirad.fr
Options of SNiPlay
• Select the VCF format
• Load the VCF file
• Load reference file
• Select the Rice genome as reference
Design of Illumina chip
• Submission file for Illumina
• Genotyping file
• Analysis with the BeadStudio software (Cartesian coordinates)
Allelic files
• PED format
cARB 1 0 0 1 0  1 1  1 1  3 3  3 3  4 4  2 2  2 2  1 1  4 4  4 4
cSYR 2 0 0 1 0  1 1  1 1  3 3  1 3  4 4  2 2  2 2  1 1  4 4  2 4
cARA 3 0 0 1 0  1 1  1 1  3 3  3 3  4 4  2 2  2 2  1 1  4 4  4 4
• DARwin format
@DARwin 5.0 - ALLELIC - 2
33 20
N°  50  50  122  122  218  218  245  245  261  261  290  290  356 …
1   1   1   1    1    3    3    3    3    4    4    2    2    2 …
2   1   1   1    1    3    3    1    3    4    4    2    2    2 …
3   1   1   1    1    3    3    3    3    4    4    2    2    2 …
4   1   1   1    1    3    3    3    3    4    4    2    2    2 …
• .inp format for Phase
33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T
• Format for TASSEL (association studies)
33  10:2
      50   122  218  245  261  290  356  461  467  560
cARB  A:A  A:A  G:G  G:G  T:T  C:C  C:C  A:A  T:T  T:T
cSYR  A:A  A:A  G:G  A:G  T:T  C:C  C:C  A:A  T:T  C:T
cARA  A:A  A:A  G:G  G:G  T:T  C:C  C:C  A:A  T:T  T:T
cORL  A:A  A:A  G:G  G:G  T:T  C:C  C:C  A:A  T:T  T:T
cLAR  A:G  A:G  A:G  A:G  C:T  C:C  C:C  A:A  T:T  C:T
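The allelic formats carry the same genotypes in different encodings. As an illustration, the Python sketch below converts one PED line into TASSEL-style `A:G` pairs. It assumes the common numeric coding 1=A, 2=C, 3=G, 4=T, which is consistent with the examples on this slide, and six leading columns before the genotypes. The function name `ped_genotypes` and the mapping of `0` (missing) to `N` are assumptions.

```python
# Numeric allele coding assumed here: 1=A, 2=C, 3=G, 4=T, 0=missing.
PED_CODES = {"1": "A", "2": "C", "3": "G", "4": "T", "0": "N"}


def ped_genotypes(ped_line):
    """Return (individual name, list of 'X:Y' genotypes) for one PED line."""
    fields = ped_line.split()
    # The first six columns describe the individual; alleles follow in pairs.
    name, alleles = fields[0], fields[6:]
    pairs = zip(alleles[0::2], alleles[1::2])
    return name, [PED_CODES[a] + ":" + PED_CODES[b] for a, b in pairs]
```

For the cSYR line of the PED example, the result is `A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T`, which matches the cSYR row of the TASSEL example.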
Annotation of SNPs
Diversity analysis
SeqLib library
[Figure: haplotype networks. Labels: low frequency haplotype; high frequency haplotypes; distance between 2 haplotypes (number of mutations); group distribution within this haplotype]
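In a haplotype network, the distance between two haplotypes is the number of mutations separating them. A minimal Python sketch of that count, assuming haplotypes of equal length written as strings (the function name `haplotype_distance` is hypothetical, and this is not the SeqLib implementation):

```python
def haplotype_distance(h1, h2):
    """Number of positions at which two haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return sum(a != b for a, b in zip(h1, h2))
```

For example, the two haplotypes `AAGATCCATC` and `AAGGTCCATT` differ at the fourth and tenth sites, so their distance is 2.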
Allele sharing between groups
External file (optional)
Individual, group
Ind1, Table
Ind2, Table
Ind3, Table
Ind4, East
Ind5, East
Ind6, East
Ind7, East
Ind8, West
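The optional external file simply assigns each individual to a group. A short Python sketch for reading such a file is given below. It assumes a comma-separated layout with one header line, as in the example above; the function name `read_groups` is hypothetical and SNiPlay's own parser may differ.

```python
import csv
from collections import defaultdict


def read_groups(lines, has_header=True):
    """Map each group name to the list of its individuals."""
    groups = defaultdict(list)
    rows = csv.reader(lines, skipinitialspace=True)
    for index, row in enumerate(rows):
        if has_header and index == 0:
            continue  # skip the "Individual, group" header line
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        individual, group = row[0].strip(), row[1].strip()
        groups[group].append(individual)
    return dict(groups)
```

With an open file, `read_groups(open("groups.csv"))` would return a dictionary such as `{"Table": [...], "East": [...], "West": [...]}`; the file name is only an example.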