FaSD-somatic: A fast and accurate somatic SNV detection algorithm for cancer genome sequencing data

Weixin Wang1, 2, †, Panwen Wang1, 2, †, Feng Xu1, 2, Ruibang Luo3, 4, Tak-Wah Lam3, 4, and Junwen Wang1, 2, 5*

1 Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
2 Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China.
3 HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory
4 Department of Computer Science, University of Hong Kong, Hong Kong, China
5 Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
†Both authors contributed equally to this work

1. Introduction for current methods

Both biologists and bioinformaticians have made several efforts to surmount these challenges. ABSOLUTE, a recently proposed algorithm, can quantify copy number variations (CNVs) and point mutations at an absolute rather than a relative level, and can infer the subclonal architecture of tumors from single nucleotide polymorphism (SNP) array data [1]. However, copy number profiles are not guaranteed to be available in large sequencing projects. PurityEst can infer the tumor purity level from the allelic differential representation of heterozygous loci with somatic mutations [2]. Although PurityEst does not require a copy number profile, it takes called somatic mutations as its input, so its purity estimate is tied to the accuracy of somatic mutation detection. Recently emerging single-cell sequencing technology [3, 4] can characterize the genomic features of individual cells rather than those of a mixed population of tumor cells. There is no doubt that sequencing the genome at the single-cell level, especially when coupled with matched transcriptome and proteome profiling, will provide a deeper view of the genetic diversity within tumors.
However, this kind of resource is very limited compared with the high-throughput data from comprehensive cancer genomics projects such as the Cancer Genome Project (CGP), the International Cancer Genome Consortium (ICGC) [5] and The Cancer Genome Atlas (TCGA) [6].

Currently there are three types of methods for calling somatic SNVs from high-throughput sequenced tumor-normal sample pairs. The first type applies a simple subtraction [7, 8]: the genotypes of the paired tumor and normal samples are first identified independently, and loci with a high-quality variant call in the tumor and a high-quality non-variant call in the normal are then treated as somatic SNVs. However, without a simultaneous comparison of both samples, true germline variants that appear at very low frequency in the normal sample are easily miscalled as somatic (false positives). Likewise, when the frequency of a true somatic variant in the tumor is too low to be distinguished from errors, false-negative calls are made. VarScan2 [9] represents the second type of somatic SNV caller, which uses Fisher's exact test to calculate the significance of the allele frequency difference between the tumor and normal samples. It has been argued that VarScan2 reports P-values without any correction for multiple testing, which confounds their statistical interpretation [10]. The third type uses Bayesian models to compare the tumor and matched normal samples simultaneously. JointSNVmix [11] is a representative tool of this category; it first allows users to train the parameters and then performs the classification with the posterior probability. Similarly, SomaticSniper [12] incorporates a prior somatic mutation rate to describe the dependence between tumor and normal samples from the same individual, and employs a derived Bayesian likelihood model to calculate the Phred score of somatic SNVs.
Although Bayesian methods perform well in general, the underlying diploid assumption does not hold for regions with CNVs. Besides the above three types of strategies, the mpileup module coupled with the bcftools module in SAMtools [13] provides an option to compute the Phred-log ratio between the likelihood obtained by treating the two samples independently and the likelihood obtained by requiring the genotype to be identical.

2. Model of FaSD-somatic

First, we define the prior genotype probability P(G_i) of genotype G_i for the normal clone as follows:

$$
P(G_i) =
\begin{cases}
\theta \cdot \dfrac{Ti}{Ti+Tv} & (1) \\[2mm]
\theta \cdot \dfrac{Tv}{Ti+Tv} & (2) \\[2mm]
\dfrac{\theta}{2} \cdot \dfrac{Ti}{Ti+Tv} & (3) \\[2mm]
\dfrac{\theta}{2} \cdot \dfrac{Tv}{Ti+Tv} & (4) \\[2mm]
\theta^{2} \left(\dfrac{Ti}{Ti+Tv}\right)^{2} & (5) \\[2mm]
\theta^{2} \left(\dfrac{Tv}{Ti+Tv}\right)^{2} & (6) \\[2mm]
\theta^{2} \cdot \dfrac{2\,Ti\,Tv}{(Ti+Tv)^{2}} & (7) \\[2mm]
1 - \dfrac{3\theta}{2} - \theta^{2} & (8)
\end{cases}
$$

where θ is the expected whole-genome SNP rate, Ti is the number of genome-wide transitions and Tv is the number of transversions. The estimated SNP rate θ between two distinct human haploid chromosomes has been reported to be close to 0.001 [14]. Meanwhile, recent human genome studies, particularly the 1000 Genomes Project, have shown that a Ti/Tv ratio of around 2 to 2.1 is generally correct for the whole human genome [15]. We therefore use θ = 0.001 and a Ti/Tv ratio of 2.

Cases 1 and 2 occur when the genotype is heterozygous with a transitional or transversional mutation, respectively, and shares one allele with the reference. Cases 3 and 4 occur when the genotype is homozygous variant with a transitional or transversional mutation. Cases 5, 6 and 7 occur when the genotype is heterozygous variant and shares no allele with the reference: case 5 denotes transitional mutations on both haploid chromosomes, case 6 denotes transversional mutations on both, and case 7 denotes one transitional and one transversional mutation. The remaining situation, in which the normal clone's genotype is homozygous reference, is covered by case 8.
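The eight prior cases can be tabulated with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `genotype_case_priors` is hypothetical, cases 1 to 7 are read as θ·Ti/(Ti+Tv), θ·Tv/(Ti+Tv), (θ/2)·Ti/(Ti+Tv), (θ/2)·Tv/(Ti+Tv), θ²·(Ti/(Ti+Tv))², θ²·(Tv/(Ti+Tv))² and θ²·2TiTv/(Ti+Tv)², and case 8 (homozygous reference) is assumed to be whatever remains so that the eight cases sum to one.

```python
def genotype_case_priors(theta=0.001, ti_tv_ratio=2.0):
    """Return {case number: prior} for the eight genotype cases.

    theta and ti_tv_ratio default to the values quoted in the text.
    Case 8 is computed as the remainder (an assumption of this sketch).
    """
    p_ti = ti_tv_ratio / (ti_tv_ratio + 1.0)  # Ti / (Ti + Tv)
    p_tv = 1.0 - p_ti                         # Tv / (Ti + Tv)
    priors = {
        1: theta * p_ti,                # het, one transition, shares ref allele
        2: theta * p_tv,                # het, one transversion, shares ref allele
        3: theta / 2.0 * p_ti,          # homozygous variant, transition
        4: theta / 2.0 * p_tv,          # homozygous variant, transversion
        5: theta ** 2 * p_ti ** 2,      # het non-ref, two transitions
        6: theta ** 2 * p_tv ** 2,      # het non-ref, two transversions
        7: theta ** 2 * 2.0 * p_ti * p_tv,  # het non-ref, one of each
    }
    priors[8] = 1.0 - sum(priors.values())  # homozygous reference
    return priors
```

Calling `genotype_case_priors()` with the defaults shows how strongly the prior favors the homozygous reference genotype; other values of θ or the Ti/Tv ratio can be passed to see how the cases shift.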
Inspired by the dependency between paired tumor and normal samples from the same individual proposed by SomaticSniper [12], we set the prior somatic mutation rate μ, which stands for the probability that one allele mutates between the normal and the tumor clone, to a default value of 0.01. Given the prior genotype probability of the normal clone and the prior somatic mutation rate μ, the joint genotype probability of the tumor and normal clones can be written as follows:

$$
P(gt_T = G_k \mid gt_N = G_j) =
\begin{cases}
\mu & G_k \text{ shares one allele with } G_j \\
\mu^{2} & G_k \text{ shares no allele with } G_j \\
1 - \mu - \mu^{2} & G_k \text{ equals } G_j
\end{cases}
$$

$$
P(gt_T = G_k,\, gt_N = G_j) = P(gt_T = G_k \mid gt_N = G_j)\, P(gt_N = G_j)
$$

Here gt_T and gt_N denote the genotypes of the tumor and normal clones. We had intended to specify the prior somatic mutation rate P(gt_T = G_k | gt_N = G_j) separately for the transition and transversion categories. However, because of the large effect of sequencing errors and the somatic hypermutation machinery, setting a Ti/Tv ratio is not feasible for whole-genome somatic SNVs [16, 17]. Finally, the somatic score can be written as follows:

$$
somatic\_score = -10 \log_{10}
\frac{\displaystyle\sum_{G_i \in \{gt_{N\_obs}\} \cap \{gt_{T\_obs}\}} \left| FaSD\_score_{gt_N = G_i} - FaSD\_score_{gt_T = G_i} \right| P(T \mid G_i)\, P(N \mid G_i)\, P(gt_T = G_i \mid gt_N = G_i)\, P(gt_N = G_i)}
{\displaystyle\sum_{G_j \in \{gt_{N\_obs}\}} \sum_{G_k \in \{gt_{T\_obs}\}} \left| FaSD\_score_{gt_N = G_j} - FaSD\_score_{gt_T = G_k} \right| P(T \mid G_k)\, P(N \mid G_j)\, P(gt_T = G_k \mid gt_N = G_j)\, P(gt_N = G_j)}
$$

The FaSD_score and the genotype likelihoods P(T | G) and P(N | G) (the probability of observing the tumor or normal sequencing information at this locus given a certain genotype) are calculated by our fast and accurate SNP caller [18]. Here gt_T_obs stands for all possible genotypes of the clones in the tumor sample. For example, if alleles A and B are observed, the genotypes AA, AB and BB can be inferred; likewise, if alleles A, B and C are observed, the genotypes AB, AC, BC, AA, BB and CC can be inferred (the phase of the genotypes is ignored).
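The genotype enumeration and the conditional prior described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the symbol μ (`MU`) for the prior somatic mutation rate, all function names, and the dictionary-based inputs are hypothetical. The conditional prior is read as μ when the tumor genotype shares one allele with the normal genotype, μ² when it shares none, and 1 − μ − μ² when the two are equal, and the score is taken as −10·log10 of the ratio between the same-genotype terms and all genotype pairs. The FaSD scores and likelihoods are assumed to be supplied by an external caller.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

MU = 0.01  # default prior somatic mutation rate quoted in the text


def possible_genotypes(observed_alleles):
    """All unordered diploid genotypes that the observed alleles can form."""
    alleles = sorted(set(observed_alleles))
    return list(combinations_with_replacement(alleles, 2))


def conditional_prior(g_tumor, g_normal, mu=MU):
    """P(gt_T = g_tumor | gt_N = g_normal) under the three-way reading above."""
    shared = sum((Counter(g_tumor) & Counter(g_normal)).values())
    if shared == 2:
        return 1.0 - mu - mu ** 2  # identical genotypes
    if shared == 1:
        return mu                  # one allele in common
    return mu ** 2                 # no allele in common


def somatic_score(gts_n, gts_t, lik_n, lik_t, fasd_n, fasd_t, prior_n, mu=MU):
    """Phred-scaled score; every dict is keyed by genotype tuple (assumed inputs)."""
    def term(g_j, g_k):
        return (abs(fasd_n[g_j] - fasd_t[g_k]) * lik_t[g_k] * lik_n[g_j]
                * conditional_prior(g_k, g_j, mu) * prior_n[g_j])

    same = sum(term(g, g) for g in gts_n if g in gts_t)
    total = sum(term(g_j, g_k) for g_j in gts_n for g_k in gts_t)
    if total == 0.0:
        raise ValueError("all weighted joint probabilities are zero")
    if same == 0.0:
        return math.inf
    return -10.0 * math.log10(same / total)
```

For instance, `possible_genotypes("AB")` yields the three genotypes AA, AB and BB, and `possible_genotypes("ABC")` yields six. Because the same-genotype terms are a subset of the terms in the denominator, the score in this sketch is never negative.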
We assume that only one normal clone exists in the normal sample, so the maximum number of genotypes in gt_N_obs is 3. In the above equation, we multiply the tumor and normal clonal joint probability by the absolute difference between FaSD_score_{gt_N = G_i} and FaSD_score_{gt_T = G_i} to emphasize the contribution of lower-frequency mutated alleles to somatic SNV detection. In the FaSD_score computation, the subscript gt_N = G_i or gt_T = G_i directly changes the alternative_score, as we previously defined:

$$
alternative\_score_{G_i} =
\begin{cases}
0 + pseudo\_score & (1) \\
1 + pseudo\_score & (2) \\
2 + pseudo\_score & (3) \\
3 + pseudo\_score & (4)
\end{cases}
$$

$$
FaSD\_score_{G_i} = alternative\_score_{G_i} \times \frac{\sum_{i=1}^{depth} \log_{2} P(read_i \mid ref)}{depth}
$$

Note that we also add a pseudo score to prevent the alternative score from being 0. Here case 1 occurs when G_i is homozygous reference; case 2 occurs when G_i is heterozygous with one variant and one reference allele; case 3 occurs when G_i is homozygous variant; and case 4 occurs when G_i is heterozygous variant without a reference allele.

3. Filtering Parameters for Somatic SNVs

We provide several filtering parameters for users to adjust. The recommended value of each parameter is inspired by SomaticSniper, VarScan2 and SAMtools. Sites, and the bases covering them, that meet the following requirements are processed further to compute the somatic score:
(1) Locus coverage in both the tumor and the normal sample ≥ 8.
(2) Locus non-reference coverage in the tumor sample ≥ 3.
(3) Covered bases with base quality ≥ 25.
(4) Covered bases with mapping quality ≥ 10.
(5) Every non-reference allele with strand bias ≤ 0.8.
In the 2×2 matrix used for the strand bias, the major allele is the allele with the highest observed frequency and the minor allele is the non-reference allele under consideration. If this non-reference allele has the highest frequency, we treat it as the major allele.
The same strand bias calculation has been used in several studies [19], and is defined as follows:

$$
strand\_bias = \frac{\left| \dfrac{b}{a+b} - \dfrac{d}{c+d} \right|}{\dfrac{b+d}{a+b+c+d}}
$$

where a and c are the forward- and reverse-strand counts of the major allele, and b and d are the forward- and reverse-strand counts of the minor allele.

4. Evaluation Metrics

According to JointSNVmix, the best inference of somatic SNVs should have the highest concordance with the somatic database and, at the same time, the lowest concordance with the non-somatic database [11]. The Catalogue of Somatic Mutations in Cancer (COSMIC) v64 [20] was used to generate the somatic mutation gold standard. We merged the VCF files of the coding mutations and the non-coding variants annotated in COSMIC and filtered out the somatic indels, which yielded a total of 628,643 somatic SNVs that are validated and recorded in the published scientific literature. The non-somatic mutation gold standard was built by excluding these 628,643 COSMIC somatic SNVs from the curated germline mutations of the 1000 Genomes Project phase 1 release (2011/05/21). After indel filtering, a total of 38,218,563 non-somatic SNPs constituted the non-somatic mutation gold standard dataset.

To test the callers' performance specifically on shallow-depth data, we applied receiver operating characteristic (ROC) analysis. First, VarScan2's high confidence (HC) somatic SNV call set on the higher-depth LUAD data (~40X) was treated as the benchmark; it contains 110,980 HC non-somatic SNPs and 16,091 HC somatic SNVs. It is well known that an imbalanced dataset reduces classification performance and biases the classification toward the prevalent class [21, 22], so each time we sampled from the 110,980 non-somatic SNPs the same number of loci as the available somatic SNVs, to avoid classification bias. Furthermore, bootstrapping was applied to measure the stability of the performance of the different programs.
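The balanced sampling and bootstrap steps can be sketched as follows. The exact resampling scheme is not spelled out in the text, so this sketch makes its own assumptions: in each replicate the non-somatic class is down-sampled without replacement to the size of the somatic class, the somatic class is resampled with replacement, the AUC is the rank-based (Mann-Whitney) estimate, and the confidence interval is the 2.5th to 97.5th percentile of the replicates. The function names are hypothetical, and the quadratic-time AUC is meant for small illustrative inputs only.

```python
import random


def auc(pos_scores, neg_scores):
    """Rank-based AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_bootstrap_auc(pos_scores, neg_scores, n_boot=1000, seed=0):
    """Return (mean, lower, upper) of the AUC over balanced bootstrap replicates.

    Requires len(neg_scores) >= len(pos_scores).
    """
    rng = random.Random(seed)
    k = len(pos_scores)
    aucs = []
    for _ in range(n_boot):
        neg = rng.sample(neg_scores, k)     # down-sample the prevalent class
        pos = rng.choices(pos_scores, k=k)  # bootstrap the somatic class
        aucs.append(auc(pos, neg))
    aucs.sort()
    lo_idx = (n_boot * 25) // 1000                     # 2.5th percentile
    hi_idx = max(lo_idx, (n_boot * 975) // 1000 - 1)   # 97.5th percentile
    return sum(aucs) / n_boot, aucs[lo_idx], aucs[hi_idx]
```

Here `pos_scores` would hold a caller's predictor values at benchmark somatic loci and `neg_scores` its values at benchmark non-somatic loci; fixing `seed` makes the replicates reproducible.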
We then ran SomaticSniper, VarScan2, SAMtools, JointSNVmix and FaSD-somatic on the independently sequenced lower-depth LUAD data (~4X) and on the 50% sub-sampled higher-depth LUAD data (~20X) from the same sample, and collected the putative somatic SNV loci reported by each caller. The sub-sampling was performed with SAMtools on the original 40X LUAD BAM file.

A different predictor was assigned to each caller. SomaticSniper and FaSD-somatic both output a somatic score; a higher somatic score indicates a higher-quality somatic call, so the score was used as the predictor. VarScan2 outputs a somatic P-value, and a lower P-value indicates a more reliable call, so the negative logarithm of the P-value was used as the predictor. SAMtools provides the CLR in the INFO field of its output file, which gives the Phred-log ratio between the likelihood obtained by treating the two samples independently and the likelihood obtained by requiring the genotype to be identical, so the CLR value was used directly as the predictor. JointSNVmix outputs 9 probabilities, one for each genotype combination. Following the recommendation of its authors, we summed p_AA_AB and p_AA_BB to obtain the somatic genotype probability, which was used as the predictor.

Because VarScan2 and FaSD-somatic have a specific minimum depth requirement for somatic SNV calling (8X), which could make the above comparisons unreliable owing to an insufficient sample size, we lowered the minimum depth requirement of VarScan2 and FaSD-somatic to 3X (only for the 4X LUAD dataset). To make the comparisons as fair as possible, we ran all programs with their default parameters and without any post-filtering. For the JointSNVmix test we used the train and classify subcommands with the JointSNVmix2 model.

5. Data used for evaluation

Several independent tumor-normal paired whole-genome sequencing datasets were picked from TCGA:
(1) Lung Adenocarcinoma (LUAD) with ~4X aligned coverage in both tumor and normal (TCGA-49-4486-01A-01D-1203-02 and TCGA-49-4486-11A-01D-1203-02), sequenced by the Raju Kucherlapati Lab at Harvard Medical School on the Illumina HiSeq platform; the BAM alignment files were downloaded from CGHub.
(2) Glioblastoma Multiforme (GBM) with ~6X coverage (TCGA-06-0188-01A-01D-0373-08 and TCGA-06-0188-10B-01D-0373-08), sequenced by the Broad Institute of MIT and Harvard on the Illumina GA II platform; the raw FASTQ files were downloaded from the Sequence Read Archive (accession codes SRX006310 and SRX006325), mapped to the reference genome hg19 with BWA, and then converted into standard BAM files.
(3) Lung Squamous Cell Carcinoma (LUSC) with ~50X coverage (TCGA-34-2596-01A-01D-0963-08 and TCGA-34-2596-11A-01D-0963-08), sequenced by the Broad Institute of MIT and Harvard; the BAM alignment files were downloaded from CGHub.
Furthermore, we took relatively higher-depth data (~40X) of the same LUAD sample described above (TCGA-49-4486-01A-01D-1931-08 and TCGA-49-4486-11A-01D-1931-08, sequenced by the Broad Institute of MIT and Harvard; the BAM alignment files were downloaded from CGHub) as the benchmark, to demonstrate the capability of FaSD-somatic to call SNVs on shallow-depth data.

6. Evaluation Benchmarked by Databases

As shown in Figure 1, in the paired tumor and normal LUAD samples, which have the lowest sequencing depth (4X each), FaSD-somatic has the highest concordance with the COSMIC validated somatic gold standard among all five somatic SNV callers (maximum concordance 0.0079, compared with 0.0060 for VarScan2, 0.0039 for SomaticSniper, 0.0023 for JointSNVmix and 0.0011 for SAMtools). Even when the quality threshold is lowered and the number of predicted candidates increases accordingly, FaSD-somatic still has the highest concordance with the somatic mutation benchmark.
Regarding concordance with the 1000 Genomes validated germline benchmark, FaSD-somatic has slightly higher concordance among the top 1,000 putative somatic SNV candidates (Supplementary Figure 1). Nevertheless, the germline concordance of FaSD-somatic is still below 20% at the beginning, and it gradually decreases to a value of 10%, the third highest among the callers. Although SomaticSniper and JointSNVmix have surprisingly low concordance with the 1000 Genomes validated germline gold standard among their top-ranking candidates, their values rise very sharply to over 30%.

For the paired tumor and normal GBM samples, with a sequencing depth of 6X each, FaSD-somatic has the highest concordance with the COSMIC validated somatic benchmark (no smaller than 0.002) among all five somatic SNV callers when the number of predicted somatic SNVs is larger than 10,000 (Supplementary Figure 2). In the interval from 1 to 10,000 predicted somatic SNVs, FaSD-somatic's concordance with the somatic gold standard ranks second highest. Its concordance with the germline database starts at 20% and gradually decreases to nearly 10% (Supplementary Figure 3). Over most of the range of the number of predicted somatic SNVs, FaSD-somatic is among the two callers with the lowest concordance with the non-somatic SNP set.

Supplementary Figure 1 Germline concordance analyses of paired tumor and normal LUAD samples. The horizontal axis shows the number of somatic predictions made and the vertical axis represents the fraction of those predictions found in the 1000 Genomes based non-somatic SNP set.

Supplementary Figure 2 Somatic concordance analyses of paired tumor and normal GBM samples. The horizontal axis shows the number of somatic predictions and the vertical axis represents the fraction of those predictions found in the merged COSMIC somatic SNV set.
Supplementary Figure 3 Germline concordance analyses of paired tumor and normal GBM samples. The horizontal axis shows the number of somatic predictions and the vertical axis represents the fraction of those predictions found in the 1000 Genomes based non-somatic SNP set.

The performance of FaSD-somatic on the paired tumor and normal LUSC samples, which have the highest sequencing depth (50X each), is similar to that on LUAD and GBM. FaSD-somatic has the highest somatic concordance among all five somatic SNV callers when the number of predicted somatic SNVs is smaller than 30,000 (Supplementary Figure 4), and the lowest germline concordance among all five callers when the number of predicted somatic SNVs is larger than 60,000 (Supplementary Figure 5). When the number of predicted somatic SNVs is small, which means that the quality threshold is very strict, the evaluation benchmarked by the somatic and non-somatic databases is easily disturbed by the chance inclusion of a small number of somatic or germline loci. We therefore smoothed all concordance analyses where the number of calls is less than 5,000.

Supplementary Figure 4 Somatic concordance analyses of paired tumor and normal LUSC samples. The horizontal axis shows the number of somatic predictions and the vertical axis represents the fraction of those predictions found in the merged COSMIC somatic SNV set.

Supplementary Figure 5 Germline concordance analyses of paired tumor and normal LUSC samples. The horizontal axis shows the number of somatic predictions and the vertical axis represents the fraction of those predictions found in the 1000 Genomes based non-somatic SNP set.

7. Evaluation Benchmarked by Higher-Depth Data

Because the area under the curve (AUC) of a receiver operating characteristic (ROC) curve does not depend on any specific cutoff, it is widely used as an important index of the overall classification performance of a program.
We therefore also applied the AUC to evaluate the performance of each somatic SNV caller. The calls of the five programs on the 4X LUAD dataset were compared with VarScan2's HC calls on the 40X LUAD data, which were sequenced by a different institution but acquired from the identical sample. When calculating the AUC, the corresponding score of each program was used as the predictor, while the result of VarScan2 on the high-depth dataset was used as the gold standard. Owing to the limited sequencing depth, we did not further divide the loci into categories by sequencing depth. For each caller, we ran 1,000 bootstrap replicates to test the stability of its performance. As shown in Supplementary Figure 6 and Supplementary Table 1, the AUC of FaSD-somatic has a mean value of 0.801 with a 95% non-parametric confidence interval of [0.765, 0.835], which is significantly higher than that of JointSNVmix (P-value < 2.2e-16 by Wilcoxon signed rank test, likewise below), SAMtools (P-value = 6.027e-09), SomaticSniper (P-value < 2.2e-16) and VarScan2 (P-value < 2.2e-16).

For the evaluation on the 50% sub-sampled data from the benchmark itself, we divided the loci into two categories: loci with sequencing depth smaller than 10X, and loci with sequencing depth greater than or equal to 10X in both tumor and normal samples. In the first category, the AUC of FaSD-somatic has a mean value of 0.955 with a 95% non-parametric confidence interval of [0.893, 0.989], which does not differ significantly from the AUC of SAMtools (mean value 0.967, 95% non-parametric confidence interval [0.96, 0.973]) (Supplementary Figure 7 and Supplementary Table 2). It is worth mentioning that the upper bound of FaSD-somatic's 95% non-parametric confidence interval is higher than that of SAMtools (0.989 versus 0.973).
Nevertheless, FaSD-somatic's performance is still significantly better than that of JointSNVmix (P-value < 2.2e-16), SomaticSniper (P-value < 2.2e-16) and VarScan2 (P-value < 2.2e-16). The performance of the five callers in the second category is similar to that in the first category. FaSD-somatic and SAMtools are the two best callers among the five, with AUCs significantly higher than the others (AUC mean value of 0.981 and 95% non-parametric confidence interval of [0.979, 0.983] for FaSD-somatic versus 0.995 and [0.993, 0.996] for SAMtools; P-value < 2.2e-16 compared with the other three callers) (Supplementary Figure 8 and Supplementary Table 3).

Supplementary Figure 6 AUC analyses on 4X LUAD dataset.

Supplementary Table 1 Mean value and 95% non-parametric confidence interval of AUC for each caller on 4X LUAD dataset

software\measurement    mean     lower bound    upper bound
FaSD-somatic            0.801    0.765          0.835
JointSNVmix             0.614    0.607          0.621
SAMtools                0.796    0.755          0.834
SomaticSniper           0.668    0.658          0.678
VarScan2                0.76     0.699          0.811

Supplementary Figure 7 AUC analyses on loci with sequencing depth smaller than 10X in sub-sampled LUAD dataset.

Supplementary Table 2 Mean value and 95% non-parametric confidence interval of AUC for each caller on loci with sequencing depth smaller than 10X in sub-sampled LUAD dataset

software\measurement    mean     lower bound    upper bound
FaSD-somatic            0.955    0.893          0.989
JointSNVmix             0.787    0.772          0.8
SAMtools                0.967    0.96           0.973
SomaticSniper           0.755    0.731          0.779
VarScan2                0.681    0.636          0.726

Supplementary Figure 8 AUC analyses on loci with sequencing depth greater than or equal to 10X in sub-sampled LUAD dataset.

Supplementary Table 3 Mean value and 95% non-parametric confidence interval of AUC for each caller on loci with sequencing depth greater than or equal to 10X in sub-sampled LUAD dataset

software\measurement    mean     lower bound    upper bound
FaSD-somatic            0.981    0.979          0.983
JointSNVmix             0.936    0.934          0.939
SAMtools                0.995    0.993          0.996
SomaticSniper           0.885    0.876          0.891
VarScan2                0.825    0.819          0.831

8. Processing Speed

The time these tools take to process the data is a major bottleneck in sequencing data analysis. We compared the running time of FaSD-somatic, JointSNVmix, SAMtools, SomaticSniper and VarScan2 on the same data tested in 3.1. All five programs were tested on a server with a 2.00 GHz Intel(R) Xeon(R) CPU E5-2620 and 64 GB of memory. Using one thread on a single CPU core, FaSD-somatic finishes whole-genome somatic SNV calling within 4,815 seconds, 8,601 seconds and 49,807 seconds on the 4X LUAD, 6X GBM and 50X LUSC datasets, respectively, which is 38% faster than SomaticSniper, 62% faster than SAMtools, 113% faster than VarScan2 and 501% faster than JointSNVmix (Supplementary Figure 9 and Supplementary Table 4).

Supplementary Figure 9 Run time on server using only a single thread, in seconds (time for VarScan2 does not include the BAM to pileup conversion)

Supplementary Table 4 Run time on server using only a single thread

sample\software    FaSD-somatic    SomaticSniper    SAMtools    VarScan2    JointSNVmix
LUAD               1:20:15         1:51:11          2:10:05     2:50:34     8:03:14
GBM                2:23:21         2:45:48          3:35:11     4:03:01     15:35:57
LUSC               13:50:07        15:16:32         17:32:38    41:14:19    116:11:28

Time is in h:m:s format. Running time of VarScan2 does not include the BAM to pileup conversion.

9. References

1. Carter, S.L., et al., Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol, 2012. 30(5): p. 413-21.
2. Su, X., et al., PurityEst: estimating purity of human tumor samples using next-generation sequencing data. Bioinformatics, 2012. 28(17): p. 2265-6.
3. Hou, Y., et al., Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell, 2012. 148(5): p. 873-85.
4. Navin, N., et al., Tumour evolution inferred by single-cell sequencing. Nature, 2011. 472(7341): p. 90-4.
5. Hudson, T.J., et al., International network of cancer genome projects. Nature, 2010. 464(7291): p. 993-8.
6. Collins, F.S. and A.D. Barker, Mapping the cancer genome - Pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies. Scientific American, 2007. 296(3): p. 50-57.
7. Pleasance, E.D., et al., A comprehensive catalogue of somatic mutations from a human cancer genome. Nature, 2010. 463(7278): p. 191-6.
8. Stark, M.S., et al., Frequent somatic mutations in MAP3K5 and MAP3K9 in metastatic melanoma identified by exome sequencing. Nature Genetics, 2012. 44(2): p. 165-169.
9. Koboldt, D.C., et al., VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res, 2012. 22(3): p. 568-76.
10. Hansen, N.F., et al., Shimmer: Detection of genetic alterations in tumors using next generation sequence data. Bioinformatics, 2013.
11. Roth, A., et al., JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 2012. 28(7): p. 907-13.
12. Larson, D.E., et al., SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 2012. 28(3): p. 311-7.
13. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-9.
14. Li, R.Q., et al., SNP detection for massively parallel whole-genome resequencing. Genome Res, 2009. 19(6): p. 1124-1132.
15. Altshuler, D.M., et al., An integrated map of genetic variation from 1,092 human genomes. Nature, 2012. 491(7422): p. 56-65.
16. Campbell, P.J., et al., Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proceedings of the National Academy of Sciences of the United States of America, 2008. 105(35): p. 13081-13086.
17. Yang, Z.H., S. Ro, and B. Rannala, Likelihood models of somatic mutation and codon substitution in cancer genes. Genetics, 2003. 165(2): p. 695-705.
18. Xu, F., et al., A fast and accurate SNP detection algorithm for next-generation sequencing data. Nature Communications, 2012. 3: p. 1258.
19. Guo, Y., et al., The effect of strand bias in Illumina short-read sequencing data. BMC Genomics, 2012. 13.
20. Forbes, S.A., et al., COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Research, 2011. 39: p. D945-D950.
21. Visa, S. and A. Ralescu, The effect of imbalanced data class distribution on fuzzy classifiers - Experimental study, in IEEE Conference on Fuzzy Systems. 2005, IEEE. p. 749-754.
22. Weiss, G.M. and F. Provost, Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 2003. 19: p. 315-354.