Response to Reviewers’ Comments
Referee 1
Major Compulsory Revisions
1. My major concerns are lack of data set description and no sign that any ethical board
has approved the data collection and use for the purpose of the paper. Neither the source
of the dataset is described, nor how the data has been collected. Therefore the data is at
the same level of interest, as a simulated dataset.
2. The authors describe the data as “not addicted to opium” versus “opium addicted”.
This might be a poor dataset for genotypic analysis, as the genotypic influence might be
low in comparison to the behavioral factors. It is well known, that for some diseases and
addictions the genetic influence is high, e.g. schizophrenia and smoking. But in these cases
the smoking addiction is a type of self medication to cure some of the symptoms of
schizophrenia. I do not see this in the case of opium addiction, and if available it needs to
be described by the authors, with appropriate references.
Authors’ Response: Identifying the mechanisms underlying genetically complex traits is an
important area of psychiatric neurogenetics. Opioid dependence is a complex disorder with
moderate to high heritability (approximate range 0.4–0.6) (Gelernter 2005). Its high
incidence and complex inheritance patterns suggest that elucidating the role of common
genetic variation in vulnerability might be critical for a better understanding of its
pathophysiology. In this study, we use a sample that was aggregated from multiple genetic
studies of opioid dependence (citation 1). The subjects in this sample were genotyped using
NIAAA's addictions-focused array. A total of 1212 single nucleotide polymorphisms were
genotyped from 130 genes that have been reported as candidate genes for addictions
(citation 2). Whilst several functional loci for alcoholism have been identified from this
set of genetic markers, these markers have not previously been used to study the genetic
variance in vulnerability to opioid dependence.
3. The other concern is the comparison of the methods performance based on used
compute time. Here the bias of the computer implementation might be too large, if the
MDR method (based on Java) is called with improper memory settings and compared to
a well optimized implementation. To overcome this, it is recommended to make the
comparison based on function evaluations, like it is standard in the comparison of
optimization algorithms. But even in this case, the comparison might be unfair as it seems
that the size of the subsets seems to be different for MDR and the proposed algorithms
(up to ternary models for MDR and 32 SNPs for the proposed algorithms).
Authors’ Response: We use the same set of SNPs for both our algorithms and MDR.
Also, we have tried MDR with all possible settings, and hence we feel that the comparison
is fair.
4. Another concern is the size of the dataset, it is with 1212 SNP already preselected or
too small to be compared to real world tasks, as actual datasets for GWAs have at least
500k SNPs or more, so it could be used as a proof of principle and the preprocessing
stages to reduce the dataset up to 1212 SNPs need to be described properly.
Authors’ Response: Although candidate gene studies have their own inherent limitations
(reviewed in Tabor HK, Risch NJ, Myers RM. Candidate-gene approaches for studying
complex genetic traits: practical considerations. Nat Rev Genet. 2002;3:1–7), the use of
smaller focused arrays possibly represents a more practical approach for many studies
than the use of large scale arrays such as GWAS. These focused arrays are able to
overcome the issues of inadequate gene coverage by providing full coverage for a limited
number of candidate genes. Such focused arrays offer the advantages of lower cost and
lower false discovery rate, especially in situations where a dataset may have inadequate
power due to size or other reasons.
5. Furthermore the paper does not discuss potential statistical pitfalls that arise due to
multiple testings of the data. Due to the enormous number of function evaluations in the
MDR case, it might be that the results are only “chance results”.
Authors’ Response: We have now redone our analysis to take care of this. In particular,
we have chosen a large number of subsets of the SNPs and computed the quantities of
interest for each such subset. The results are very similar.
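For illustration only, the repeated-subset check described above could be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names, the number of subsets (200), and `toy_score` (a placeholder for whatever quantity of interest is computed on a subset) are assumptions; only the totals of 1212 SNPs and 32 SNPs per subset are taken from this letter.

```python
import random
import statistics

def sample_snp_subsets(n_snps, subset_size, n_subsets, seed=0):
    """Draw n_subsets random subsets of SNP indices, without replacement within a subset."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_snps), subset_size)) for _ in range(n_subsets)]

def summarize(score, subsets):
    """Evaluate score on every subset and report how much the values vary."""
    values = [score(subset) for subset in subsets]
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def toy_score(subset):
    """Hypothetical placeholder: in practice this would be, e.g., a cross-validated accuracy."""
    return sum(1 for index in subset if index % 2 == 0) / len(subset)

subsets = sample_snp_subsets(n_snps=1212, subset_size=32, n_subsets=200, seed=1)
summary = summarize(toy_score, subsets)
print(summary)
```

A small spread across subsets is consistent with results that do not hinge on one particular selection; by itself it is not a formal correction for multiple testing.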
6. Neither the dataset nor the implementation of the proposed method seems to be
available. The authors should consider to make this available like the used MDR
implementation, to allow for proper comparisons with other methods.
Authors’ Response: We have no objection to releasing our software. We will give it to
anyone upon request. We are also willing to make it available at a public link once the
paper is accepted.
7. The given references in the publication are not appropriate for the quality level of a
Journal publication. For example, in the introduction section several more references are
needed and more modern ones, e.g. the usefulness of SNPs in non-coding regions, which
have been considered useless several years ago, but are now known to be of importance
for regulatory processes. The notion of Linkage Disequilibrium needs to be discussed
at this level.
Authors’ Response: Please note that although candidate gene studies have their own
inherent limitations (reviewed in Tabor HK, Risch NJ, Myers RM. Candidate-gene
approaches for studying complex genetic traits: practical considerations. Nat Rev Genet.
2002;3:1–7), the use of smaller focused arrays possibly represents a more practical
approach for many studies than the use of large scale arrays such as GWAS. These
focused arrays are able to overcome the issues of inadequate gene coverage by providing
full coverage for a limited number of candidate genes. Such focused arrays offer the
advantages of lower cost and lower false discovery rate, especially in situations where a
dataset may have inadequate power due to size or other reasons. Our genetic markers
were obtained in a study conducted by NIAAA. For details about our data please refer to
"Addictions Biology: Haplotype-Based Analysis for 130 Candidate Genes on a Single
Array" by Hodgkinson, et al., Alcohol Alcohol. 2008 Sep-Oct; 43(5): 505–515.
According to this paper, the panel SNPs that we use in our study are able to extract full
haplotype information for candidate genes in alcoholism, other addictions and disorders
of mood and anxiety.
8. It seems to be that in some parts of the publication, the difference between DNA level
and mRNA level are not clear to the authors and it is even not clear if the dataset is the
result of an expression analysis or genomic assessment? An example is the paragraph
“Gene selection”. This needs to be clarified strictly.
Authors’ Response: Perhaps the naming of one of the algorithms as “Gene Selection” is
confusing. As a result, now we call that algorithm “Feature Selection”.
9. The description of the MDR technique is not appropriate, as the original authors of
MDR have done a lot of research on the MDR technique. This needs to be referenced and
discussed properly in the paper.
Authors’ Response: We now have a longer discussion on the MDR algorithm.
10. The authors use only a linear-SVM and do not discuss, why they do not test any other
kernel?
Authors’ Response: We now have employed other kernels with SVM.
11. The techniques in section “Gene Selection” are not appropriately described, a better
algorithmic description is needed.
Authors’ Response: We have given a more detailed description of this algorithm.
12. The authors propose a Random Projection method, which might be interesting, but
unfortunately they do not discuss and test with other well known projections based on the
principal components analysis. There are hundreds of publications, that use these
techniques. A more thorough literature survey is needed.
Authors’ Response: We now have a summary of these techniques in the literature
survey. We also have employed principal component analysis.
13. The section Our Algorithms should be the main part of the contribution, unfortunately
it is poorly written and the exact working of the algorithms can not be assessed by the
reader.
Authors’ Response: We have now reorganized the paper and brought out our
contributions clearly.
14. The authors state in “Results and Discussion” that they have done rigorous
simulations. Unfortunately they do not describe these rigorous simulations in detail.
Authors’ Response: We now have a lot of details. Also, please note that we now
compare our algorithms with several others.
15. Subset selections of genes are of great interest for biomedical researchers, as they
might get further insights into the disease mechanisms. Unfortunately the authors do not
discuss the biological plausibility of the found subsets. Therefore it would be helpful to
give the rs-Numbers of the found SNPs or the corresponding gene.
Authors’ Response: We can supply these numbers upon request.
16. It is unclear, if the same selection of the crossvalidation partitions is used for the
comparison of MDR and the new proposed algorithm.
Authors’ Response: We use the same selection of cross-validation partitions.
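As an illustration of what using the same partitions means in practice, the sketch below builds the folds once and passes the identical folds to every method being compared. It is a minimal sketch under stated assumptions: `majority_class` and `first_feature` are hypothetical placeholder methods (they are not MDR or the proposed algorithms), and the toy data are made up.

```python
import random

def make_folds(n_samples, n_folds, seed=0):
    """Shuffle the sample indices once and split them into disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(fit_predict, X, y, folds):
    """Mean test accuracy of fit_predict over the given, pre-computed folds."""
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predictions = fit_predict([X[i] for i in train_idx],
                                  [y[i] for i in train_idx],
                                  [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

def majority_class(train_X, train_y, test_X):
    """Placeholder method A: always predict the most common training label."""
    label = max(set(train_y), key=train_y.count)
    return [label] * len(test_X)

def first_feature(train_X, train_y, test_X):
    """Placeholder method B: predict 1 when the first genotype value is nonzero."""
    return [1 if row[0] > 0 else 0 for row in test_X]

# Toy genotype data (values 0/1/2); the label is determined by the first column.
X = [[i % 3, (i * 7) % 3] for i in range(30)]
y = [1 if row[0] > 0 else 0 for row in X]

folds = make_folds(len(y), n_folds=5, seed=42)   # built once
acc_a = cross_validate(majority_class, X, y, folds)  # same folds
acc_b = cross_validate(first_feature, X, y, folds)   # same folds
print(acc_a, acc_b)
```

Because both calls receive the same `folds` object, any difference in accuracy comes from the methods and not from how the data happened to be split.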
17. In the comparison with MDR the number of SNPs are given for the MDR analysis, but
not for the proposed algorithms. Do you use 32 SNPs as described under Algorithm 1? If
yes, the comparison is unfair as potentially any other simple subset selection algorithm
might have given this result.
Authors’ Response: Please note that for Algorithm 1 we have used all the SNPs, whereas for
MDR we have used only the best 32. Therefore, if anything, the comparison is unfair to our
algorithms rather than to MDR. It is true that one could use any subset selection
algorithm. In this paper we have chosen to use Algorithm 1.
18. I have some methodological doubts: For the proposed algorithms only the best
accuracy is reported, is this on the test set? Or is it at the crossvalidation level? For the
MDR it seems to be clear, as the results are given on the crossvalidation side and the
testing accuracy. Why is this distinction not made for the 3 algorithms? Typically the
results on the CV-data and the testing accuracy differ, like for the MDR. If the testing
accuracy is used for the selection of subsets, the comparison is flawed. I do not get this
from the paper, therefore the description of the comparison method needs to be given
much more precisely.
Authors’ Response: The accuracies reported are at the cross-validation level.
19. Other publications the authors might consider:
Listgarten, J.; Damaraju, S.; Poulin, B.; Cook, L.; Dufour, J.; Driga, A.; Mackey, J.;
Wishart, D.; Greiner, R. & Zanke, B. Predictive models for breast cancer susceptibility
from multiple single nucleotide polymorphisms. Clin Cancer Res, 2004, 10, 2725-2737.
G. Üstünkar, S. Özöğür-Akyüz, G. W. Weber, C. M. Friedrich and Y. A. Son,
„Selection of Representative SNP Sets for Genome-Wide Association Studies: A
Metaheuristic Approach“, DOI:10.1007/s11590-011-0419-7, Optimization Letters,
Volume 6(6), Page 1207-1218, 2012.
The newer publications by the Moore group, e.g.
Martin, E. R.; Ritchie, M. D.; Hahn, L.; Kang, S. & Moore, J. H. A novel method to
identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol, 2006, 30,
111-123
You should also refer to the publications under the topic „tag-SNP“ selection
Authors’ Response: We now cite all of these papers.
Minor Essential Revisions
The author can be trusted to make these. For example, missing labels on figures, the
wrong use of a term, spelling mistakes.
Authors’ Response: We have done this.
1. The part Normalization is too long in comparison to the other descriptions, it
needs to be shortened.
Authors’ Response: We have done this.
2. Reference 7. “Vapnik” instead of “Vpnik”
Authors’ Response: We have corrected this.
Overall Assessment:
Reject because most of the analysis is not done well, the description of data is incomplete
and the contribution is too low. The paper when properly revised and analysis improved
should be considered as a good conference contribution.
Level of interest: An article of limited interest
Quality of written English: Acceptable
Statistical review: Yes, and I have assessed the statistics in my report.
Declaration of competing interests:
I declare that I have no competing interests
Referee 2
Reviewer's report:
This paper presents algorithmic techniques for genotype-phenotypic correlation studies.
More specifically, the paper delves into the problem of identifying a minimal subset of
genes that can be held accountable for a phenotypic change, through analysis of
genotypic data (SNP) and phenotypic measurements (microarray profiles). Two different
approaches are proposed and comparatively analyzed – first is gene selection using an
SVM-based training method, and the second is dimensionality reduction using random
projection.
Major Compulsory Revisions
The paper addresses an important and actively researched problem. However there are a
number of major issues with this paper that need to be addressed:
i) This paper ignores the wealth of related literature available on this topic. A number of
papers show up if we do a brief Google search on keywords such as “gene selection svm
genotype phenotype”. Many of them deal with gene/feature selection and some of them
also seem to be using SVM (like in this paper). See for example the literature cited in the
review article by Saeys et al., Bioinformatics, 2007. This paper cites almost none of this
and more importantly doesn’t contrast itself from these previous efforts, neither by
method nor by results.
Authors’ Response: We now have supplied a large number of references. Also, our
algorithms have been compared with more algorithms.
ii) Methods in this paper are not clearly presented. Critical details are missing:
a. The SVM method (GSA) [4] used in this paper is described in another paper by the
authors (published in a workshop in 2007). Because of this I was neither able to
understand nor appreciate the author’s method. The authors should consider including
the details of this method as part of this manuscript.
Authors’ Response: We now provide more details on the GSA algorithm.
b. The random projection method uses parameter k (the number of target dimensions).
How is this determined? Also, from the earlier part of the manuscript it seemed like the
random projection was an alternative method to GSA. But the later parts in the paper
combine the two (e.g., Algorithm 2).
Authors’ Response: Please note that we determine the value of k empirically. In other
words, we try different values of k and pick the one that gives the best results.
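The empirical choice of k described above can be sketched as a simple search over candidate values. This is a minimal illustration, not the authors' code: the Gaussian projection matrix, the nearest-centroid classifier (used here as a stand-in for the SVM), the candidate values of k, and the toy data are all assumptions.

```python
import random

def random_projection_matrix(n_features, k, rng):
    """Return an n_features x k matrix of Gaussian entries scaled by 1/sqrt(k)."""
    scale = 1.0 / (k ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
            for _ in range(n_features)]

def project(X, R):
    """Project each row of X (n x d) onto k dimensions using R (d x k)."""
    k = len(R[0])
    return [[sum(value * R[j][c] for j, value in enumerate(row))
             for c in range(k)] for row in X]

def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier: label each test row by its closest class centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: squared_distance(x, centroids[c]))
            for x in test_X]

def cv_accuracy(X, y, folds):
    """Mean test accuracy over pre-computed folds of sample indices."""
    fold_scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predictions = nearest_centroid_predict(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(predictions, test_idx))
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / len(fold_scores)

def select_k(X, y, candidate_ks, folds, seed=0):
    """Try each candidate k and return the one with the best CV accuracy."""
    scores = {}
    for k in candidate_ks:
        R = random_projection_matrix(len(X[0]), k, random.Random(seed))
        scores[k] = cv_accuracy(project(X, R), y, folds)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy genotype matrix (values 0/1/2); labels depend on the first three SNPs.
rng = random.Random(1)
X = [[rng.randint(0, 2) for _ in range(20)] for _ in range(60)]
y = [1 if row[0] + row[1] + row[2] >= 3 else 0 for row in X]
folds = [list(range(i, len(y), 5)) for i in range(5)]

best_k, scores = select_k(X, y, [2, 4, 8, 16], folds)
print(best_k, scores)
```

Note that the accuracy obtained for the selected k is an optimistic estimate when the same folds are used both to choose k and to report the result; a held-out set or nested cross-validation gives a less biased figure.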
c. The algorithms described in the section “Our Algorithms” seemed more like a
description of the methodology used in experimental section than the actual methods
themselves. For example, “Algorithm 3: In this algorithm, we compare the accuracy and
runtime of our GSA and MDR algorithms…”
Authors’ Response: We now supply more details on our algorithms.
iii) The experimental results section also was very difficult to parse. How do algorithms 1
and 2 compare (GSA vs. GSA+RP)? It seems like this part of the paper can benefit from
useful tables and/or Venn Diagrams. The tables presented were not of much help here.
There is no reference to Tables 1 and 2 from the main text. Tables 3-7 are basically the
same table describing MDR for different settings. What are these results conveying and
can they be plotted concisely in charts to show the trends? Result representation is generally
important in these kinds of studies and the presentation and organization of results in this
paper left much to be desired. The opium data set seems like a good choice but not much
domain comment is made available to speak to the general quality of the obtained results.
Authors’ Response: We now provide a comparison of all the algorithms in one Table.
Also, references to tables have been taken care of.
Minor Essential Revisions
i) Sections are not numbered but the paper organization subsection in page 3
refers to them in numbers
Authors’ Response: Please note that BMC MIDM does not permit section numbers. We
have deleted number references.
ii) Page 4: a claim is made to the effect that other algorithms are time consuming
because they use recursion, whereas the authors use some combination of techniques
which are faster. Details of the latter and justification are required.
Authors’ Response: Done.
iii) Algo 1: step 3: which “population”?
Authors’ Response: We have taken care of this.
iv) Algo 3: “Comparison with GSA and MDR” – needs rephrasing
Authors’ Response: Done.
Discretionary Revisions
i) Put sets in {…} in the main body of the paper – e.g., input n genes {g1,g2,…gn}
Authors’ Response: Done.
Level of interest: An article of importance in its field
Quality of written English: Acceptable
Statistical review: Yes, but I do not feel adequately qualified to assess the statistics.
Declaration of competing interests:
I declare that I have no competing interests.
Referee 3
Reviewer's report:
This article improves upon existing methodology for genotype phenotype correlations.
Establishing these correlations has become especially important in areas such as
biomedical informatics. The overarching goal might be to incorporate these findings in
clinical workflows that may accelerate personalized medicine. The authors propose
simulation results in support of their claims and provide a clear comparison on the
performance of their proposed approach to more established methods.
Minor Comments:
I would encourage the authors to move the equations and Theorems to the Appendix
section if possible as these seem to be an unnecessary distraction. Inclusion of
experimental data can only fortify the claims of the authors. It might not be necessary to
list the user-defined SNP ID in Table 1 as it does not have a biological significance.
Authors’ Response: We feel that the theorems and equations can help the reader get a
summary of the ideas quickly. We used a real patient sample that was genotyped by
NIAAA and belonged to a medical doctor. The data were under the protection of an IRB
protocol, and hence the SNP rs numbers (rsxxxxx) were replaced by SNP IDs. We agree that
these SNP IDs do not represent biological sequences.
Level of interest: An article of importance in its field
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a
statistician.
Declaration of competing interests:
No