1 - Springer Static Content Server

advertisement
A Beginners Guide To SNP Calling From High-Throughput DNA-Sequencing Data
André Altmann1, Peter Weber2, Daniel Bader2, Michael Preuß3, Elisabeth B.
Binder2, Bertram Müller-Myhsok1
1 Statistical Genetics, Max Planck Institute of Psychiatry, Germany; 2 Molecular
Genetics of Affective Disorder, Max Planck Institute of Psychiatry, Germany; 3
Genetic Epidemiology, Institut für Medizinische Biometrie und Statistik,
University of Lübeck, Germany
Online Supplementary Material
Supplementary Text
Statistical Framework For SNP calling
The probabilistic approaches rely on Bayes’ Theorem for computing genotype
posterior probabilities:
,
where G refers to a specific diploid genotype and E to the evidence that is
available (i.e., reads). The probabilities p(G) and p(E) are prior probabilities for
the genotype and the evidence, respectively. The term p(G|E) refers to the
posterior probability of the genotype G. Hence, we can compute the most
plausible genotype by:
.
The prior for the evidence p(E) remains constant in the maximization and can
therefore be omitted. Thus, it is sufficient to find the genotype that maximizes
the genotype prior times posterior probability. In this case the evidence E simply
consists of rescaled quality values of the bases. More precisely, p(E|G) can be
expressed as
, where ei is the evidence in read i for genotype G at a
specific site in the genome. The p(ei|G) can be simply regarded as rescaled
quality scores. The rescaling may involve for instance the alignment quality or
the position in the read (i.e., sequencing cycle). Regarding the prior: in theory
one can chose a uniform prior for the genotypes. However, better results are
achieved, when selecting a prior that is based on external information, e.g.,
dbSNP or the reference sequence. SOAPsnp and MAQ are two examples for
algorithms that make use of such priors. Instead of basing the genotype prior on
external information, one can use the data from more than one individual.
Briefly, in most projects more than one individual is sequenced. Thus, instead of
carrying out the SNP calling for each dataset separately, some algorithms
support SNP calling from multiple datasets in parallel. Results are in general
more stable due to the improved prior probability (Nielsen et al. 2011).
Supplementary Tables
Table S1: Settings for the programs used for processing the example data. The
mandatory input and output files were omitted from the listing. Moreover, due to
the work with GATK two additional processing steps had to be carried out: a) the
addition of read group information to the bam files and b) reordering the bam
files with respect to a karyotypically ordered reference genome (available in the
GATK bundle). Both steps were carried out using Picard. Highlighted columns
are only meant for increasing the readability of the table.
[TableS1_Program_Settings.pdf]
Table S2: ANNOVAR based functional annotation of the 22,383 variants
discovered in all eight combinations of SNP caller and aligner. The functional
information provided by the Ensembl database (Hubbard et al. 2002) was used
for this annotation. Among the 10,234 exonic variants are 5,657 synonymous
mutations, 4,688 non-synonymous mutations, 29 gains of stop codons, and eight
losses of stop codons.
Functional annotation
Count
Intronic
10,634
Exonic
10,234
UTR3
637
UTR5
384
Exonic;splicing
148
Upstream
92
Intergenic
77
Downstream
48
Splicing
37
ncRNA exonic
37
Upstream;downstream
8
ncRNA_intronic
6
UTR5;UTR3
4
ncRNA splicing
1
Supplementary Figures
Figure S1: Plots provided by SolexaQA for the one of the mate pairs in our
sample data. Inlet (a) depicts a histogram of the maximal read length without a
single base quality below a specified error probability threshold (here 5%).
Evidently, a large fraction of the reads (> 70%) does not contain a single base
with an error probability of 5% or lower. The figure in inlet (b) provides an
overview over the average quality per tile (y-axis) on the slide. The x-axis
corresponds to the position within the read, the average error probability is
color-coded. Just as in Figure 1, it is evident that the read quality decreases
towards the end of the read. Of note, each square corresponds to one black dot in
Figure 1, while the mean of each column corresponds to the red circles in Figure
1.
(a) [FigureS1a_fastq_segments_hist.pdf]
(b) [FigureS1b_fastq_tiles.png]
References
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen
V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L,
Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R,
Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner
W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M
(2002) The Ensembl genome database project. Nucleic Acids Res 30: 3841
Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from
next-generation sequencing data. Nat Rev Genet 12: 443-51
Download