A Beginners Guide To SNP Calling From High-Throughput DNA-Sequencing Data André Altmann1, Peter Weber2, Daniel Bader2, Michael Preuß3, Elisabeth B. Binder2, Bertram Müller-Myhsok1 1 Statistical Genetics, Max Planck Institute of Psychiatry, Germany; 2 Molecular Genetics of Affective Disorder, Max Planck Institute of Psychiatry, Germany; 3 Genetic Epidemiology, Institut für Medizinische Biometrie und Statistik, University of Lübeck, Germany Online Supplementary Material Supplementary Text Statistical Framework For SNP calling The probabilistic approaches rely on Bayes’ Theorem for computing genotype posterior probabilities: , where G refers to a specific diploid genotype and E to the evidence that is available (i.e., reads). The probabilities p(G) and p(E) are prior probabilities for the genotype and the evidence, respectively. The term p(G|E) refers to the posterior probability of the genotype G. Hence, we can compute the most plausible genotype by: . The prior for the evidence p(E) remains constant in the maximization and can therefore be omitted. Thus, it is sufficient to find the genotype that maximizes the genotype prior times posterior probability. In this case the evidence E simply consists of rescaled quality values of the bases. More precisely, p(E|G) can be expressed as , where ei is the evidence in read i for genotype G at a specific site in the genome. The p(ei|G) can be simply regarded as rescaled quality scores. The rescaling may involve for instance the alignment quality or the position in the read (i.e., sequencing cycle). Regarding the prior: in theory one can chose a uniform prior for the genotypes. However, better results are achieved, when selecting a prior that is based on external information, e.g., dbSNP or the reference sequence. SOAPsnp and MAQ are two examples for algorithms that make use of such priors. Instead of basing the genotype prior on external information, one can use the data from more than one individual. Briefly, in most projects more than one individual is sequenced. Thus, instead of carrying out the SNP calling for each dataset separately, some algorithms support SNP calling from multiple datasets in parallel. Results are in general more stable due to the improved prior probability (Nielsen et al. 2011). Supplementary Tables Table S1: Settings for the programs used for processing the example data. The mandatory input and output files were omitted from the listing. Moreover, due to the work with GATK two additional processing steps had to be carried out: a) the addition of read group information to the bam files and b) reordering the bam files with respect to a karyotypically ordered reference genome (available in the GATK bundle). Both steps were carried out using Picard. Highlighted columns are only meant for increasing the readability of the table. [TableS1_Program_Settings.pdf] Table S2: ANNOVAR based functional annotation of the 22,383 variants discovered in all eight combinations of SNP caller and aligner. The functional information provided by the Ensembl database (Hubbard et al. 2002) was used for this annotation. Among the 10,234 exonic variants are 5,657 synonymous mutations, 4,688 non-synonymous mutations, 29 gains of stop codons, and eight losses of stop codons. Functional annotation Count Intronic 10,634 Exonic 10,234 UTR3 637 UTR5 384 Exonic;splicing 148 Upstream 92 Intergenic 77 Downstream 48 Splicing 37 ncRNA exonic 37 Upstream;downstream 8 ncRNA_intronic 6 UTR5;UTR3 4 ncRNA splicing 1 Supplementary Figures Figure S1: Plots provided by SolexaQA for the one of the mate pairs in our sample data. Inlet (a) depicts a histogram of the maximal read length without a single base quality below a specified error probability threshold (here 5%). Evidently, a large fraction of the reads (> 70%) does not contain a single base with an error probability of 5% or lower. The figure in inlet (b) provides an overview over the average quality per tile (y-axis) on the slide. The x-axis corresponds to the position within the read, the average error probability is color-coded. Just as in Figure 1, it is evident that the read quality decreases towards the end of the read. Of note, each square corresponds to one black dot in Figure 1, while the mean of each column corresponds to the red circles in Figure 1. (a) [FigureS1a_fastq_segments_hist.pdf] (b) [FigureS1b_fastq_tiles.png] References Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M (2002) The Ensembl genome database project. Nucleic Acids Res 30: 3841 Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12: 443-51