FaSD-somatic: A fast and accurate somatic SNV detection
algorithm for cancer genome sequencing data
Weixin Wang1,2,†, Panwen Wang1,2,†, Feng Xu1,2, Ruibang Luo3,4, Tak-Wah Lam3,4, and Junwen Wang1,2,5,*
1 Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
2 Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China.
3 HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory
4 Department of Computer Science, University of Hong Kong, Hong Kong, China
5 Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
† Both authors contributed equally to this work
1. Introduction to current methods
Both biologists and bioinformaticians have made several efforts to surmount those challenges.
ABSOLUTE, a recently proposed algorithm, can quantify copy number variations (CNVs) and point
mutations at an absolute rather than a relative level, and can infer the subclonal architecture of
tumors from single nucleotide polymorphism (SNP) array data [1]. However, copy number profiles
are not guaranteed to be available in large sequencing projects. PurityEst infers the tumor purity
level from the allelic differential representation of heterozygous loci with somatic mutations [2].
Although PurityEst does not require a copy number profile, its estimated purity depends on the
accuracy of the somatic mutation calls that it takes as input. Recently emerging single-cell
sequencing technology [3, 4] can characterize the genomic features of individual cells rather than
of a mixed population of tumor cells. There is no doubt that sequencing the genome at the
single-cell level, especially when coupled with matched transcriptome and proteome profiling,
will provide a deeper view of the genetic diversity within tumors. However, this kind of resource is
very limited compared with the high-throughput data from comprehensive cancer genomic projects
such as the Cancer Genome Project (CGP), the International Cancer Genome Consortium (ICGC) [5]
and The Cancer Genome Atlas (TCGA) [6].
Currently there are three types of methods for calling somatic SNVs from high-throughput sequenced
tumor-normal sample pairs. The first type applies a simple subtraction [7, 8]: genotypes for the
paired tumor and normal samples are first called independently, and loci with a high-quality
variant call in the tumor and a high-quality non-variant call in the normal are then treated as
somatic SNVs. However, because the two samples are not compared simultaneously, a true germline
variant whose frequency is very low in the normal sample is easily called as somatic (a false
positive). Likewise, when the frequency of a true somatic variant in the tumor is too low to be
distinguished from sequencing errors, false-negative calls are made. VarScan2 [9] represents the
second type of somatic SNV caller, which uses Fisher's exact test to calculate the significance of
the allele frequency difference between the tumor and normal samples. It has been argued that
VarScan2 reports P-values without any correction for multiple testing, which confounds the
statistical interpretation [10]. The third type uses Bayesian models to compare tumor and matched
normal samples simultaneously. JointSNVmix [11] is a representative tool of this category: it first
lets users train the parameters and then classifies loci by posterior probability. Similarly,
SomaticSniper [12] incorporates a prior somatic mutation rate to describe the dependence between
tumor and normal samples from the same individual, and employs a derived Bayesian likelihood
model to calculate the Phred score of somatic SNVs. Although Bayesian methods perform well in
general, the underlying diploid assumption does not hold for regions with CNVs. Besides these three
types of strategies, the mpileup module coupled with the bcftools module in SAMtools [13] provides
an option to compute the Phred-log ratio between the likelihood obtained by treating the two
samples independently and the likelihood obtained by requiring the genotype to be identical.
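To make the second strategy concrete, the following is a minimal sketch of a two-sided Fisher's exact test on the reference and variant read counts of a tumor-normal pair. It illustrates the statistical test only and is not VarScan2's implementation; the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are samples (tumor, normal); columns are read counts
    (reference allele, variant allele). Returns the p-value.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: tumor has 20 reference / 10 variant reads,
# normal has 30 reference / 0 variant reads.
p_value = fisher_exact_two_sided(20, 10, 30, 0)
```

A small p-value here indicates that the variant allele frequency differs between tumor and normal; as noted above, such p-values still need multiple-testing correction when applied genome-wide.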
2. Model of FaSD-somatic
Firstly, we define the prior genotype probability P (Gi) of genotype Gi for the normal clone as
follows:
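\[
P(G_i) =
\begin{cases}
\theta \cdot \dfrac{Ti}{Ti + Tv} & (1) \\[6pt]
\theta \cdot \dfrac{Tv}{Ti + Tv} & (2) \\[6pt]
\dfrac{\theta}{2} \cdot \dfrac{Ti}{Ti + Tv} & (3) \\[6pt]
\dfrac{\theta}{2} \cdot \dfrac{Tv}{Ti + Tv} & (4) \\[6pt]
\theta^2 \left( \dfrac{Ti}{Ti + Tv} \right)^2 & (5) \\[6pt]
\theta^2 \left( \dfrac{Tv}{Ti + Tv} \right)^2 & (6) \\[6pt]
\theta^2 \cdot \dfrac{2\,Ti\,Tv}{(Ti + Tv)^2} & (7) \\[6pt]
1 - \theta - \dfrac{\theta}{2} - \theta^2 & (8)
\end{cases}
\]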
where θ is the expected genome-wide SNP rate, Ti is the number of genome-wide transitions and Tv
is the number of transversions. The estimated SNP rate θ between two distinct human haploid
chromosomes has been reported to be close to 0.001 [14]. Meanwhile, recent human genome studies,
particularly from the 1000 Genomes Project, have shown that a Ti/Tv ratio of around 2 to 2.1 is
generally correct for the whole human genome [15]. We therefore use θ = 0.001 and a Ti/Tv ratio
of 2. Cases 1 and 2 occur when the genotype is heterozygous with a transitional or transversional
mutation, respectively, and shares one allele with the reference. Cases 3 and 4 occur when the
genotype is homozygous variant with a transitional or transversional mutation, respectively.
Cases 5, 6 and 7 occur when the genotype is heterozygous variant and shares no allele with the
reference: case 5 denotes transitional mutations on both haploid chromosomes, case 6 denotes
transversional mutations on both, and case 7 denotes one transitional and one transversional
mutation. The remaining situation, in which the normal clone's genotype is homozygous reference,
is case 8.
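As a quick numerical check of the eight cases, the sketch below evaluates the prior with θ = 0.001 and Ti/Tv = 2. It assumes cases 1 to 4 scale θ (or θ/2) by the transition or transversion fraction, cases 5 to 7 scale θ², and case 8 is the remainder 1 − θ − θ/2 − θ²; the function name and dictionary layout are illustrative only.

```python
THETA = 0.001      # expected genome-wide SNP rate (value used in the text)
TI_TV_RATIO = 2.0  # transition/transversion ratio (value used in the text)


def genotype_priors(theta=THETA, ti_tv=TI_TV_RATIO):
    """Return the prior probability of each of the eight genotype cases."""
    ti = ti_tv / (ti_tv + 1.0)  # Ti / (Ti + Tv)
    tv = 1.0 / (ti_tv + 1.0)    # Tv / (Ti + Tv)
    return {
        1: theta * ti,                # het, transition, one reference allele
        2: theta * tv,                # het, transversion, one reference allele
        3: theta / 2 * ti,            # homozygous variant, transition
        4: theta / 2 * tv,            # homozygous variant, transversion
        5: theta ** 2 * ti ** 2,      # het variant, two transitions
        6: theta ** 2 * tv ** 2,      # het variant, two transversions
        7: theta ** 2 * 2 * ti * tv,  # het variant, one of each
        8: 1 - theta - theta / 2 - theta ** 2,  # homozygous reference
    }


priors = genotype_priors()
```

Because the transition and transversion fractions add up to one, the eight case values sum to 1, and the homozygous reference case carries almost all of the prior mass (about 0.9985 with these defaults).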
Inspired by the dependency between paired tumor and normal samples from the same individual
proposed by SomaticSniper, we set the prior somatic mutation rate μ [12], which stands for the
probability that one allele mutates from the normal to the tumor clone, to a default value of
0.01. Given the prior genotype probability of the normal clone and the prior somatic mutation
rate μ, the conditional and joint genotype probabilities of the tumor and normal clones can be
written as follows:
\[
P(gt_T = G_k \mid gt_N = G_j) =
\begin{cases}
\mu & G_k \text{ shares one allele with } G_j \\
\mu^2 & G_k \text{ shares no allele with } G_j \\
1 - \mu - \mu^2 & G_k \text{ equals } G_j
\end{cases}
\]
\[
P(gt_T = G_k,\ gt_N = G_j) = P(gt_T = G_k \mid gt_N = G_j)\, P(gt_N = G_j)
\]
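The conditional and joint priors can be sketched as follows, writing the prior somatic mutation rate as `mu` (default 0.01, as stated in the text). Genotypes are represented here as two-letter strings such as "AG", and "shares one allele" is interpreted as a multiset overlap of exactly one allele; both the representation and that interpretation are assumptions of this sketch.

```python
from collections import Counter

MU = 0.01  # prior somatic mutation rate (default value stated in the text)


def shared_alleles(g1, g2):
    """Number of alleles two unphased genotypes have in common."""
    return sum((Counter(g1) & Counter(g2)).values())


def p_tumor_given_normal(gk, gj, mu=MU):
    """P(gtT = Gk | gtN = Gj) for unphased diploid genotypes."""
    gk, gj = "".join(sorted(gk)), "".join(sorted(gj))
    if gk == gj:
        return 1 - mu - mu ** 2
    if shared_alleles(gk, gj) == 1:
        return mu
    return mu ** 2


def joint_prior(gk, gj, p_normal, mu=MU):
    """P(gtT = Gk, gtN = Gj), given the normal-clone prior P(gtN = Gj)."""
    return p_tumor_given_normal(gk, gj, mu) * p_normal
```

For example, `p_tumor_given_normal("AG", "AA")` gives μ, whereas `p_tumor_given_normal("GG", "AA")` gives μ², reflecting that two alleles must change.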
Here gtT and gtN denote the genotypes of the tumor and normal clones. We intended to specify the
prior somatic mutation rate P(gtT = Gk | gtN = Gj) separately for the transition and transversion
categories. However, because of the large effect of sequencing errors and the somatic
hyper-mutation machinery, setting a Ti/Tv ratio is not feasible for whole-genome somatic SNVs in
human [16, 17].
Finally, the expression of the somatic score can be written as follows:
\[
\mathrm{somatic\_score} = -10 \log_{10}
\frac{\displaystyle \sum_{G_i \in \{gt_{N\_obs}\} \cap \{gt_{T\_obs}\}}
\left| \mathrm{FaSD\_score}_{gt_N = G_i} - \mathrm{FaSD\_score}_{gt_T = G_i} \right|
P(T \mid G_i)\, P(N \mid G_i)\, P(gt_T = G_i \mid gt_N = G_i)\, P(gt_N = G_i)}
{\displaystyle \sum_{G_j \in \{gt_{N\_obs}\}} \sum_{G_k \in \{gt_{T\_obs}\}}
\left| \mathrm{FaSD\_score}_{gt_N = G_j} - \mathrm{FaSD\_score}_{gt_T = G_k} \right|
P(T \mid G_k)\, P(N \mid G_j)\, P(gt_T = G_k \mid gt_N = G_j)\, P(gt_N = G_j)}
\]
The FaSD_score and the genotype likelihoods P(T | G) and P(N | G) (the probability of observing
the sequencing data of the tumor and the normal sample at this locus given a certain genotype)
are calculated by our fast and accurate SNP caller [18]. Here gtT_obs stands for all possible
genotypes of the clones in the tumor sample. For example, if alleles A and B are observed, the
genotypes AA, AB and BB can be inferred; likewise, if alleles A, B and C are observed, the
genotypes AB, AC, BC, AA, BB and CC can be inferred (the phase of each genotype is ignored). We
assume that only one normal clone exists in the normal sample, so gtN_obs contains at most 3
genotypes. In the above equation, we multiply the joint probability of the tumor and normal
clones by the absolute difference between FaSD_score_{gtN=Gi} and FaSD_score_{gtT=Gi} to
emphasize the contribution of lower-frequency mutated alleles to somatic SNV detection. In the
FaSD_score computation, the subscript gtN = Gi or gtT = Gi directly changes the
alternative_score, which we previously defined as:
\[
\mathrm{alternative\_score}_{G_i} =
\begin{cases}
0 + \mathrm{pseudo\_score} & (1) \\
1 + \mathrm{pseudo\_score} & (2) \\
2 + \mathrm{pseudo\_score} & (3) \\
3 + \mathrm{pseudo\_score} & (4)
\end{cases}
\]
\[
\mathrm{FaSD\_score}_{G_i} = \mathrm{alternative\_score}_{G_i} \times
\frac{\sum_{i=1}^{depth} \log_2 P(read_i \mid ref)}{depth}
\]
Note that we also add a pseudo score so that the alternative score is never 0. Here case 1
occurs when Gi is homozygous reference; case 2 occurs when Gi is heterozygous with one variant
and one reference allele; case 3 occurs when Gi is homozygous variant; case 4 occurs when Gi is
heterozygous variant without a reference allele.
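The structure of the somatic score can be sketched as below. The sketch assumes the score is −10·log10 of the ratio between the same-genotype terms (numerator) and all normal-tumor genotype pairs (denominator), and it takes the FaSD scores, genotype likelihoods and joint priors as precomputed inputs, since those come from the SNP caller [18]. All argument names are hypothetical.

```python
import math


def somatic_score(gt_n_obs, gt_t_obs, fasd_n, fasd_t, lik_n, lik_t, joint_prior):
    """Phred-scaled somatic score for one locus.

    gt_n_obs, gt_t_obs: candidate genotypes observed in normal and tumor.
    fasd_n, fasd_t:     genotype -> FaSD score in normal / tumor.
    lik_n, lik_t:       genotype -> P(N | G) / P(T | G).
    joint_prior:        (Gj, Gk) -> P(gtT = Gk, gtN = Gj).
    """
    def term(gj, gk):
        # Weight the joint probability by the absolute FaSD score difference.
        weight = abs(fasd_n[gj] - fasd_t[gk])
        return weight * lik_t[gk] * lik_n[gj] * joint_prior[(gj, gk)]

    shared = set(gt_n_obs) & set(gt_t_obs)
    numerator = sum(term(g, g) for g in shared)
    denominator = sum(term(gj, gk) for gj in gt_n_obs for gk in gt_t_obs)
    if denominator == 0:
        raise ValueError("no informative genotype pair at this locus")
    if numerator == 0:
        return math.inf
    return -10 * math.log10(numerator / denominator)


# Toy locus: the normal supports AA only; the tumor supports AA and AG.
score = somatic_score(
    gt_n_obs=["AA"], gt_t_obs=["AA", "AG"],
    fasd_n={"AA": 1.0}, fasd_t={"AA": 2.0, "AG": 3.0},
    lik_n={"AA": 0.9}, lik_t={"AA": 0.1, "AG": 0.8},
    joint_prior={("AA", "AA"): 0.9899 * 0.998499, ("AA", "AG"): 0.01 * 0.998499},
)
```

Since the numerator terms are a subset of the denominator terms, the ratio never exceeds 1 and the score is non-negative; it grows as more of the weighted probability mass falls on genotype pairs that differ between tumor and normal.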
3. Filtering Parameters for Somatic SNVs
We provide several filtering parameters that users can adjust. The recommended value for each
parameter is inspired by SomaticSniper, VarScan2 and SAMtools. Only sites and covered bases that
meet the following requirements are processed further to compute the somatic score:
(1) Locus coverage ≥ 8 in both the tumor and the normal sample.
(2) Locus non-reference coverage ≥ 3 in the tumor sample.
(3) Covered bases with base quality ≥ 25.
(4) Covered bases with mapping quality ≥ 10.
(5) Every non-reference allele with strand bias ≤ 0.8. In the 2×2 matrix, the major allele is the
allele with the highest observed frequency, and the minor allele is the non-reference allele in
question. If this non-reference allele itself has the highest frequency, we treat it as the major
allele. The same strand bias calculation has been used in several studies [19], and is defined as
follows:
\[
\text{strand bias} = \left| \frac{b}{a + b} - \frac{d}{c + d} \right| \Big/ \left( \frac{b + d}{a + b + c + d} \right)
\]
where a and c represent the forward- and reverse-strand counts of the major allele, and b and d
represent the forward- and reverse-strand counts of the minor allele.
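The strand bias measure and the depth thresholds listed above can be sketched as follows, taking the strand bias as |b/(a+b) − d/(c+d)| divided by (b+d)/(a+b+c+d). The function names and the choice to return `None` when the ratio is undefined are assumptions of this sketch, not part of the FaSD-somatic interface.

```python
def strand_bias(a, b, c, d):
    """Strand bias of a minor allele.

    a, c: forward / reverse strand counts of the major allele.
    b, d: forward / reverse strand counts of the minor allele.
    Returns None when a strand has no reads or the minor allele is absent.
    """
    total = a + b + c + d
    minor = b + d
    if a + b == 0 or c + d == 0 or minor == 0:
        return None
    return abs(b / (a + b) - d / (c + d)) / (minor / total)


def passes_depth_filters(tumor_depth, normal_depth, tumor_nonref_depth,
                         min_depth=8, min_nonref=3):
    """Requirements (1) and (2): coverage in both samples, non-reference reads in tumor."""
    return (tumor_depth >= min_depth and normal_depth >= min_depth
            and tumor_nonref_depth >= min_nonref)


def passes_strand_filter(a, b, c, d, max_bias=0.8):
    """Requirement (5): strand bias of the non-reference allele at most 0.8."""
    bias = strand_bias(a, b, c, d)
    return bias is not None and bias <= max_bias
```

For instance, a minor allele seen 6 times on the forward strand and never on the reverse strand, against 10 + 10 major-allele reads, has a strand bias of 1.625 and is filtered out, whereas a 5 / 5 split gives a strand bias of 0.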
4. Evaluation Metrics
According to JointSNVmix, the best inference of somatic SNVs should have the highest concordance
with the somatic database and, at the same time, the lowest concordance with the non-somatic
database [11]. The Catalogue of Somatic Mutations in Cancer (COSMIC) v64 [20] was used to
generate the somatic mutation gold standard. We merged the VCF files for the coding mutations and
non-coding variants annotated in COSMIC and filtered out the somatic indels, which left a total
of 628,643 somatic SNVs that are validated and recorded in the published scientific literature.
The non-somatic mutation gold standard was built by excluding these 628,643 COSMIC somatic SNVs
from the curated germline mutations of the 1000 Genomes Project phase 1 release of 2011/05/21.
After indel filtering, a total of 38,218,563 non-somatic SNPs constituted the non-somatic
mutation gold standard dataset.
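The concordance measure used in this evaluation can be sketched as the fraction of the top-n ranked calls that appear in a reference set, computed for increasing n. Representing a locus as a (chromosome, position) tuple is an assumption of this sketch, and the example loci are made up.

```python
def concordance_curve(ranked_calls, reference_set, steps):
    """Fraction of the top-n ranked calls found in reference_set, for each n in steps.

    ranked_calls:  loci sorted from most to least confident call.
    reference_set: set of loci in the gold standard (somatic or non-somatic).
    steps:         values of n at which to report the concordance.
    """
    wanted = set(steps)
    curve = []
    hits = 0
    for n, locus in enumerate(ranked_calls, start=1):
        if locus in reference_set:
            hits += 1
        if n in wanted:
            curve.append((n, hits / n))
    return curve


# Hypothetical gold standards: the non-somatic set excludes every somatic locus.
somatic_gold = {("chr1", 100), ("chr3", 400)}
germline = {("chr1", 100), ("chr2", 300)}
non_somatic_gold = germline - somatic_gold

calls = [("chr1", 100), ("chr1", 200), ("chr2", 300), ("chr3", 400)]
somatic_curve = concordance_curve(calls, somatic_gold, steps=[1, 2, 4])
```

A good caller shows a high curve against the somatic gold standard and a low curve against the non-somatic gold standard across the range of n.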
To test the callers' performance specifically on shallow-depth data, we applied receiver
operating characteristic (ROC) analysis. First, VarScan2's high-confidence (HC) somatic SNV call
set on the higher-depth LUAD data (~40X) was treated as the benchmark; it contained 110,980 HC
non-somatic SNPs and 16,091 HC somatic SNVs. It is well known that an imbalanced dataset reduces
classification performance and biases the classification toward the prevalent class [21, 22], so
each time we sampled from the 110,980 non-somatic SNPs the same number of loci as the available
somatic SNVs, to avoid classification bias. Furthermore, bootstrapping was applied to measure the
stability of the performance of the different programs. We then ran SomaticSniper, VarScan2,
SAMtools, JointSNVmix and FaSD-somatic on the independently sequenced lower-depth LUAD data (~4X)
and on the 50% sub-sampled higher-depth LUAD data (~20X) from the same sample, to output the
putative somatic SNV loci for each caller. The sub-sampling was performed with SAMtools on the
original 40X LUAD BAM file. The predictor was assigned differently for each caller.
SomaticSniper and FaSD-somatic both output a somatic score; a higher somatic score indicates a
higher-quality somatic call, so the score was used as the predictor. VarScan2 outputs a somatic
P-value, and a lower P-value indicates a more reliable call, so the negative logarithm of the
P-value was used as the predictor. SAMtools provides the CLR in the INFO field of its output
file, which gives the Phred-log ratio between the likelihood obtained by treating the two samples
independently and the likelihood obtained by requiring the genotype to be identical, so the CLR
value was used directly as the predictor. JointSNVmix outputs 9 probabilities, one for each
genotype combination. Following the recommendation of its author, we summed p_AA_AB and p_AA_BB
to obtain the somatic genotype probability used as the predictor.
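The two derived predictors can be sketched as below. The text does not state the base of the logarithm; base 10 is assumed here, which does not change the ranking of calls. The `floor` argument, which guards against a P-value of exactly 0, is also an assumption of this sketch.

```python
import math


def varscan2_predictor(somatic_p_value, floor=1e-300):
    """Negative log10 of the VarScan2 somatic P-value (larger means more reliable)."""
    return -math.log10(max(somatic_p_value, floor))


def jointsnvmix_predictor(p_aa_ab, p_aa_bb):
    """Somatic genotype probability: normal AA with tumor AB or BB."""
    return p_aa_ab + p_aa_bb
```

With these transformations, every caller's predictor increases with the confidence of the somatic call, so the same ROC procedure can be applied to all five callers.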
Because VarScan2 and FaSD-somatic have a specific minimum depth requirement for somatic SNV
calling (8X), which could make the above comparisons unreliable owing to insufficient sample
size, we lowered the minimum depth requirement of VarScan2 and FaSD-somatic to 3X (only for the
4X LUAD dataset). To make the comparisons as fair as possible, we otherwise ran all programs
with their default parameters and without any post-filtering. For the JointSNVmix test, we used
the train and classify subcommands with the JointSNVmix2 model.
5. Data used for evaluation
Several independent tumor-normal paired whole-genome sequencing datasets were selected from
TCGA: Lung Adenocarcinoma (LUAD) with ~4X aligned coverage in both tumor and normal
(TCGA-49-4486-01A-01D-1203-02 and TCGA-49-4486-11A-01D-1203-02, sequenced by the Raju
Kucherlapati Lab at Harvard Medical School on the Illumina HiSeq platform; the BAM alignment
files were downloaded from CGHub), Glioblastoma Multiforme (GBM) with ~6X coverage
(TCGA-06-0188-01A-01D-0373-08 and TCGA-06-0188-10B-01D-0373-08, sequenced by the Broad Institute
of MIT and Harvard on the Illumina GA II platform; the raw FASTQ files were downloaded from the
Sequence Read Archive (accession codes SRX006310 and SRX006325), mapped to the reference genome
hg19 with BWA, and then converted into standard BAM files), and Lung Squamous Cell Carcinoma
(LUSC) with ~50X coverage (TCGA-34-2596-01A-01D-0963-08 and TCGA-34-2596-11A-01D-0963-08,
sequenced by the Broad Institute of MIT and Harvard; the BAM alignment files were downloaded
from CGHub). Furthermore, we took relatively higher-depth data (~40X) of the same LUAD sample
described above (TCGA-49-4486-01A-01D-1931-08 and TCGA-49-4486-11A-01D-1931-08, sequenced by the
Broad Institute of MIT and Harvard; the BAM alignment files were downloaded from CGHub) as the
benchmark, to show the extraordinary capability of FaSD-somatic to call SNVs on shallow-depth
data.
6. Evaluation Benchmarked by Databases
As shown in Figure 1, in the paired tumor and normal LUAD samples, which have the lowest
sequencing depth (4X each), FaSD-somatic has the highest concordance with the COSMIC validated
somatic gold standard among all five somatic SNV callers (maximum concordance value 0.0079,
compared with 0.0060 for VarScan2, 0.0039 for SomaticSniper, 0.0023 for JointSNVmix and 0.0011
for SAMtools). Even as the quality threshold is lowered and the number of predicted candidates
increases accordingly, FaSD-somatic still has the highest concordance with the somatic mutation
benchmark. In terms of concordance with the 1000 Genomes validated germline benchmark,
FaSD-somatic has slightly higher concordance in the top 1,000 putative somatic SNV candidates
(Supplementary Figure 1). Nevertheless, the germline concordance of FaSD-somatic is still below
20% at the beginning, and it gradually decreases to 10%, the third highest among the callers.
SomaticSniper and JointSNVmix, although they have surprisingly low concordance with the 1000
Genomes validated germline gold standard among the top-ranking candidates, rise very sharply to
over 30%.
For the paired tumor and normal GBM samples, with a sequencing depth of 6X each, FaSD-somatic
has the highest concordance with the COSMIC validated somatic benchmark (no smaller than 0.002)
among all five somatic SNV callers when the number of predicted somatic SNVs is larger than
10,000 (Supplementary Figure 2). In the interval from 1 to 10,000 predicted somatic SNVs,
FaSD-somatic's concordance with the somatic gold standard ranks second highest. In terms of
concordance with the germline database, FaSD-somatic starts at 20% and gradually decreases to
nearly 10% (Supplementary Figure 3). Over most of the range of the number of predicted somatic
SNVs, FaSD-somatic is among the two callers with the lowest concordance with the non-somatic
SNP set.
Supplementary Figure 1 Germline concordance analyses of paired tumor and normal LUAD
samples. The horizontal axis shows the number of somatic predictions made and the vertical axis
represents the fraction of those predictions found to be in the 1000 Genomes based non-somatic
SNPs set.
Supplementary Figure 2 Somatic concordance analyses of paired tumor and normal GBM
samples. The horizontal axis shows the number of somatic predictions and the vertical axis
represents the fraction of those predictions found to be in the merged COSMIC somatic SNVs set.
Supplementary Figure 3 Germline concordance analyses of paired tumor and normal GBM
samples. The horizontal axis shows the number of somatic predictions and the vertical axis
represents the fraction of those predictions found to be in the 1000 Genomes based non-somatic
SNPs set.
The performance of FaSD-somatic on the paired tumor and normal LUSC samples, which have the
highest sequencing depth (50X each), is similar to that on LUAD and GBM. FaSD-somatic has the
highest somatic concordance among all five somatic SNV callers when the number of predicted
somatic SNVs is smaller than 30,000 (Supplementary Figure 4) and the lowest germline concordance
among all five callers when the number of predicted somatic SNVs is larger than 60,000
(Supplementary Figure 5).
When the number of predicted somatic SNVs is small, which means the quality threshold is very
strict, the evaluation benchmarked by the somatic and non-somatic databases is easily disturbed
by the chance inclusion of a small number of somatic or germline loci. We therefore smoothed all
the concordance curves where the number of calls is less than 5,000.
Supplementary Figure 4 Somatic concordance analyses of paired tumor and normal LUSC
samples. The horizontal axis shows the number of somatic predictions and the vertical axis
represents the fraction of those predictions found to be in the merged COSMIC somatic SNVs set.
Supplementary Figure 5 Germline concordance analyses of paired tumor and normal LUSC
samples. The horizontal axis shows the number of somatic predictions and the vertical axis
represents the fraction of those predictions found to be in the 1000 Genomes based non-somatic
SNPs set.
7. Evaluation Benchmarked by Higher-Depth Data
Because the area under the curve (AUC) of a receiver operating characteristic (ROC) curve does
not depend on a specific cutoff, it is widely used as an index of the overall classification
performance of a program. We therefore also used the AUC to evaluate the performance of the
different somatic SNV callers. The calls of the five programs on the 4X LUAD dataset were
compared with VarScan2's HC calls on the 40X LUAD data, which were sequenced by a different
institution but derived from the identical sample. In calculating the AUC, the corresponding
score of each program is used as the predictor, while the result of VarScan2 on the high-depth
dataset is used as the gold standard. Because of the limited sequencing depth, we did not
further divide the loci into categories by sequencing depth. For each caller, we ran 1,000
bootstrap replicates to test the stability of its performance. As shown in Supplementary
Figure 6 and Supplementary Table 1, the AUC of FaSD-somatic has a mean value of 0.801 with a
95% non-parametric confidence interval of [0.765, 0.835], which is significantly higher than
that of JointSNVmix (P-value < 2.2e-16 by Wilcoxon signed-rank test, likewise hereinafter),
SAMtools (P-value = 6.027e-09), SomaticSniper (P-value < 2.2e-16) and VarScan2
(P-value < 2.2e-16).
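The AUC and its bootstrap interval can be sketched as follows. The AUC is computed with the rank-sum identity, and each bootstrap replicate draws, with replacement, as many non-somatic loci as somatic loci so that the classes stay balanced. The exact resampling scheme used in the study is not spelled out, so this percentile bootstrap is an assumption, as are the function names.

```python
import random


def auc(scores, labels):
    """AUC from the rank-sum identity; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc(scores, labels, n_boot=1000, seed=0):
    """Mean AUC and 95% percentile interval over class-balanced bootstrap replicates."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    values = []
    for _ in range(n_boot):
        p = rng.choices(pos, k=len(pos))
        n = rng.choices(neg, k=len(pos))  # same number of negatives as positives
        values.append(auc(p + n, [1] * len(p) + [0] * len(n)))
    values.sort()
    mean = sum(values) / n_boot
    lower = values[int(0.025 * n_boot)]
    upper = values[int(0.975 * n_boot) - 1]
    return mean, lower, upper
```

Here `scores` holds one caller's predictor values and `labels` marks each locus as somatic (1) or non-somatic (0) according to the high-depth benchmark; the returned triple corresponds to the mean, lower bound and upper bound columns of the supplementary tables.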
For the evaluation on the 50% sub-sampled data from the benchmark itself, we divided the loci
into two categories: loci with sequencing depth smaller than 10X, and loci with sequencing
depth greater than or equal to 10X, in both tumor and normal samples. In the first category, the
AUC of FaSD-somatic has a mean value of 0.955 with a 95% non-parametric confidence interval of
[0.893, 0.989], which does not differ significantly from the AUC of SAMtools (mean 0.967, 95%
non-parametric confidence interval [0.96, 0.973]) (Supplementary Figure 7 and Supplementary
Table 2). It is worth mentioning that the upper bound of the 95% non-parametric confidence
interval of FaSD-somatic is higher than that of SAMtools (0.989 versus 0.973). In addition, the
performance of FaSD-somatic is significantly better than that of JointSNVmix
(P-value < 2.2e-16), SomaticSniper (P-value < 2.2e-16) and VarScan2 (P-value < 2.2e-16). The
performance of the five callers in the second category is similar to that in the first category.
FaSD-somatic and SAMtools are the two best of the five callers, with AUCs significantly higher
than the others (mean AUC 0.981 with 95% non-parametric confidence interval [0.979, 0.983] for
FaSD-somatic versus 0.995 and [0.993, 0.996] for SAMtools; P-value < 2.2e-16 compared with the
other three callers) (Supplementary Figure 8 and Supplementary Table 3).
Supplementary Figure 6 AUC analyses on 4X LUAD dataset.
Supplementary Table 1 Mean value and 95% non-parametric confidence interval of AUC for
each caller on 4X LUAD dataset
Software         Mean     Lower bound   Upper bound
FaSD-somatic     0.801    0.765         0.835
JointSNVmix      0.614    0.607         0.621
SAMtools         0.796    0.755         0.834
SomaticSniper    0.668    0.658         0.678
VarScan2         0.76     0.699         0.811
Supplementary Figure 7 AUC analyses on loci with sequencing depth smaller than 10X in
sub-sampled LUAD dataset.
Supplementary Table 2 Mean value and 95% non-parametric confidence interval of AUC for
each caller on loci with sequencing depth smaller than 10X in sub-sampled LUAD dataset
Software         Mean     Lower bound   Upper bound
FaSD-somatic     0.955    0.893         0.989
JointSNVmix      0.787    0.772         0.8
SAMtools         0.967    0.96          0.973
SomaticSniper    0.755    0.731         0.779
VarScan2         0.681    0.636         0.726
Supplementary Figure 8 AUC analyses on loci with sequencing depth greater than or equal to
10X in sub-sampled LUAD dataset.
Supplementary Table 3 Mean value and 95% non-parametric confidence interval of AUC for
each caller on loci with sequencing depth greater than or equal to 10X in sub-sampled LUAD
dataset
Software         Mean     Lower bound   Upper bound
FaSD-somatic     0.981    0.979         0.983
JointSNVmix      0.936    0.934         0.939
SAMtools         0.995    0.993         0.996
SomaticSniper    0.885    0.876         0.891
VarScan2         0.825    0.819         0.831
8. Processing Speed
The time these tools take to process the data is a major bottleneck in sequencing data
analysis. We compared the running time of FaSD-somatic, JointSNVmix, SAMtools, SomaticSniper
and VarScan2 on the same data tested in 3.1. All five programs were tested on a server with a
2.00 GHz Intel(R) Xeon(R) E5-2620 CPU and 64 GB of memory. Using one thread on a single CPU
core, FaSD-somatic finishes whole-genome somatic SNV calling within 4,815 seconds, 8,601
seconds and 49,807 seconds on the 4X LUAD, 6X GBM and 50X LUSC datasets, respectively, which is
38% faster than SomaticSniper, 62% faster than SAMtools, 113% faster than VarScan2 and 501%
faster than JointSNVmix (Supplementary Figure 9 and Supplementary Table 4).
Supplementary Figure 9 Run time on server using only a single thread in seconds (Time for
VarScan2 does not include the BAM to pileup conversion)
Supplementary Table 4 Run time on server using only a single thread
Sample   FaSD-somatic   SomaticSniper   SAMtools    VarScan2    JointSNVmix
LUAD     1:20:15        1:51:11         2:10:05     2:50:34     8:03:14
GBM      2:23:21        2:45:48         3:35:11     4:03:01     15:35:57
LUSC     13:50:07       15:16:32        17:32:38    41:14:19    116:11:28
Time is in h:m:s format. The running time of VarScan2 does not include the BAM to pileup
conversion.
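The relation between the run times in Supplementary Table 4 and the quoted percentages can be checked with a short calculation. The sketch below converts the LUAD run times to seconds and expresses each caller's extra time relative to FaSD-somatic; the quoted percentages appear to correspond most closely to the LUAD run times, and recomputing from the rounded table values reproduces them to within about one or two percentage points. The function names are illustrative.

```python
def to_seconds(hms):
    """Convert an 'h:m:s' string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


# Run times on the 4X LUAD dataset, copied from Supplementary Table 4.
LUAD_TIMES = {
    "FaSD-somatic": "1:20:15",
    "SomaticSniper": "1:51:11",
    "SAMtools": "2:10:05",
    "VarScan2": "2:50:34",
    "JointSNVmix": "8:03:14",
}


def extra_time_percent(times, baseline="FaSD-somatic"):
    """Extra run time of each caller relative to the baseline, in percent."""
    base = to_seconds(times[baseline])
    return {name: 100 * (to_seconds(t) / base - 1)
            for name, t in times.items() if name != baseline}


extra = extra_time_percent(LUAD_TIMES)
```

The first entry confirms that 1:20:15 equals the 4,815 seconds reported for FaSD-somatic on the 4X LUAD dataset.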
9. References
1. Carter, S.L., et al., Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol, 2012. 30(5): p. 413-21.
2. Su, X., et al., PurityEst: estimating purity of human tumor samples using next-generation sequencing data. Bioinformatics, 2012. 28(17): p. 2265-6.
3. Hou, Y., et al., Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell, 2012. 148(5): p. 873-85.
4. Navin, N., et al., Tumour evolution inferred by single-cell sequencing. Nature, 2011. 472(7341): p. 90-4.
5. Hudson, T.J., et al., International network of cancer genome projects. Nature, 2010. 464(7291): p. 993-8.
6. Collins, F.S. and A.D. Barker, Mapping the cancer genome - Pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies. Scientific American, 2007. 296(3): p. 50-57.
7. Pleasance, E.D., et al., A comprehensive catalogue of somatic mutations from a human cancer genome. Nature, 2010. 463(7278): p. 191-6.
8. Stark, M.S., et al., Frequent somatic mutations in MAP3K5 and MAP3K9 in metastatic melanoma identified by exome sequencing. Nature Genetics, 2012. 44(2): p. 165-169.
9. Koboldt, D.C., et al., VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res, 2012. 22(3): p. 568-76.
10. Hansen, N.F., et al., Shimmer: Detection of genetic alterations in tumors using next generation sequence data. Bioinformatics, 2013.
11. Roth, A., et al., JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 2012. 28(7): p. 907-13.
12. Larson, D.E., et al., SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 2012. 28(3): p. 311-7.
13. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-9.
14. Li, R.Q., et al., SNP detection for massively parallel whole-genome resequencing. Genome Res, 2009. 19(6): p. 1124-1132.
15. Altshuler, D.M., et al., An integrated map of genetic variation from 1,092 human genomes. Nature, 2012. 491(7422): p. 56-65.
16. Campbell, P.J., et al., Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proceedings of the National Academy of Sciences of the United States of America, 2008. 105(35): p. 13081-13086.
17. Yang, Z.H., S. Ro, and B. Rannala, Likelihood models of somatic mutation and codon substitution in cancer genes. Genetics, 2003. 165(2): p. 695-705.
18. Xu, F., et al., A fast and accurate SNP detection algorithm for next-generation sequencing data. Nature Communications, 2012. 3: p. 1258.
19. Guo, Y., et al., The effect of strand bias in Illumina short-read sequencing data. BMC Genomics, 2012. 13.
20. Forbes, S.A., et al., COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Research, 2011. 39: p. D945-D950.
21. Visa, S. and A. Ralescu, The effect of imbalanced data class distribution on fuzzy classifiers - Experimental study, in IEEE Conference on Fuzzy Systems. 2005, IEEE. p. 749-754.
22. Weiss, G.M. and F. Provost, Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 2003. 19: p. 315-354.