Response to Reviewers’ Comments

Referee 1

Major Compulsory Revisions

1. My major concerns are the lack of a data set description and no sign that any ethical board has approved the data collection and use for the purpose of the paper. Neither the source of the dataset nor how the data was collected is described. Therefore the data is at the same level of interest as a simulated dataset.

2. The authors describe the data as “not addicted to opium” versus “opium addicted”. This might be a poor dataset for genotypic analysis, as the genotypic influence might be low in comparison to the behavioral factors. It is well known that for some diseases and addictions the genetic influence is high, e.g. schizophrenia and smoking. But in these cases the smoking addiction is a type of self-medication to cure some of the symptoms of schizophrenia. I do not see this in the case of opium addiction, and if available it needs to be described by the authors, with appropriate references.
Authors’ Response: Identifying the underlying mechanisms behind genetically complex traits is one of the important areas in psychiatric neurogenetics. Opioid dependence is a complex disorder with moderate to high heritability (approximate range 0.4–0.6) (Gelernter 2005). The high incidence and complex inheritance patterns suggest that elucidating the roles of common genetic variations in vulnerability might be critical for a better understanding of its pathophysiology. In this study, we use a sample that was aggregated from multiple genetic studies of opioid dependence (citation 1). This sample of subjects was genotyped using NIAAA's addictions-focused array. A total of 1212 single nucleotide polymorphisms were genotyped from 130 genes that have been reported as candidate genes for addictions (citation 2).
Whilst several functional loci have been identified in alcoholism from this set of genetic markers, these markers have not been used to study the genetic variance in vulnerability to opioid dependence.

3. The other concern is the comparison of the methods' performance based on compute time. Here the bias of the computer implementation might be too large if the MDR method (based on Java) is called with improper memory settings and compared to a well-optimized implementation. To overcome this, it is recommended to make the comparison based on function evaluations, as is standard in the comparison of optimization algorithms. But even in this case, the comparison might be unfair, as the size of the subsets seems to be different for MDR and the proposed algorithms (up to ternary models for MDR and 32 SNPs for the proposed algorithms).
Authors’ Response: We use the same set of SNPs for both our algorithms and MDR. Also, we have tried MDR with all possible settings and hence we feel that the comparison is fair.

4. Another concern is the size of the dataset: with 1212 SNPs it is either already preselected or too small to be compared to real-world tasks, as actual datasets for GWAS have at least 500k SNPs. It could therefore be used as a proof of principle, and the preprocessing stages used to reduce the dataset to 1212 SNPs need to be described properly.
Authors’ Response: Although candidate gene studies have their own inherent limitations (reviewed in Tabor HK, Risch NJ, Myers RM. Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet. 2002;3:1–7), the use of smaller focused arrays possibly represents a more practical approach for many studies than the use of large-scale arrays such as those used in GWAS. These focused arrays are able to overcome the issues of inadequate gene coverage by providing full coverage for a limited number of candidate genes.
Such focused arrays offer the advantages of lower cost and lower false discovery rate, especially in situations where a dataset may have inadequate power due to size or other reasons.

5. Furthermore the paper does not discuss potential statistical pitfalls that arise due to multiple testing of the data. Due to the enormous number of function evaluations in the MDR case, it might be that the results are only “chance results”.
Authors’ Response: We have now redone our analysis to address this. In particular, we have chosen a large number of subsets of the SNPs and computed the quantities of interest for each such subset. The results are very similar.

6. Neither the dataset nor the implementation of the proposed method seems to be available. The authors should consider making this available, like the MDR implementation used, to allow for proper comparisons with other methods.
Authors’ Response: We are happy to release our software. We will provide it to anyone upon request, and we are willing to make it publicly available online once the paper is accepted.

7. The given references in the publication are not appropriate for the quality level of a journal publication. For example, in the introduction section several more references are needed, and more modern ones, e.g. on the usefulness of SNPs in non-coding regions, which were considered useless several years ago but are now known to be of importance for regulatory processes. The notion of Linkage Disequilibrium needs to be discussed at this level.
Authors’ Response: Please note that although candidate gene studies have their own inherent limitations (reviewed in Tabor HK, Risch NJ, Myers RM. Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet. 2002;3:1–7), the use of smaller focused arrays possibly represents a more practical approach for many studies than the use of large-scale arrays such as those used in GWAS.
These focused arrays are able to overcome the issues of inadequate gene coverage by providing full coverage for a limited number of candidate genes. Such focused arrays offer the advantages of lower cost and lower false discovery rate, especially in situations where a dataset may have inadequate power due to size or other reasons. Our genetic markers were obtained in a study conducted by NIAAA. For details about our data please refer to “Addictions Biology: Haplotype-Based Analysis for 130 Candidate Genes on a Single Array” by Hodgkinson et al., Alcohol Alcohol. 2008 Sep-Oct; 43(5): 505–515. According to this paper, the panel SNPs that we use in our study are able to extract full haplotype information for candidate genes in alcoholism, other addictions, and disorders of mood and anxiety.

8. It seems that in some parts of the publication the difference between the DNA level and the mRNA level is not clear to the authors, and it is not even clear whether the dataset is the result of an expression analysis or a genomic assessment. An example is the paragraph “Gene selection”. This needs to be clarified strictly.
Authors’ Response: Perhaps the naming of one of the algorithms as “Gene Selection” is confusing. We now call that algorithm “Feature Selection”.

9. The description of the MDR technique is not appropriate, as the original authors of MDR have done a lot of research on the MDR technique. This needs to be referenced and discussed properly in the paper.
Authors’ Response: We now have a longer discussion of the MDR algorithm.

10. The authors use only a linear SVM and do not discuss why they do not test any other kernel.
Authors’ Response: We have now employed other kernels with SVM.

11. The techniques in section “Gene Selection” are not appropriately described; a better algorithmic description is needed.
Authors’ Response: We have given a more detailed description of this algorithm.

12.
The authors propose a Random Projection method, which might be interesting, but unfortunately they do not discuss and test other well-known projections based on principal component analysis. There are hundreds of publications that use these techniques. A more thorough literature survey is needed.
Authors’ Response: We now have a summary of these techniques in the literature survey. We have also employed principal component analysis.

13. The section “Our Algorithms” should be the main part of the contribution; unfortunately it is poorly written and the exact working of the algorithms cannot be assessed by the reader.
Authors’ Response: We have now reorganized the paper and brought out our contributions clearly.

14. The authors state in “Results and Discussion” that they have done rigorous simulations. Unfortunately they do not describe these rigorous simulations in detail.
Authors’ Response: We now describe the simulations in detail. Also, please note that we now compare our algorithms with several others.

15. Subset selections of genes are of great interest for biomedical researchers, as they might give further insights into the disease mechanisms. Unfortunately the authors do not discuss the biological plausibility of the subsets found. Therefore it would be helpful to give the rs numbers of the SNPs found or the corresponding genes.
Authors’ Response: We can supply these numbers upon request.

16. It is unclear whether the same selection of the cross-validation partitions is used for the comparison of MDR and the newly proposed algorithm.
Authors’ Response: We use the same selection of cross-validation partitions.

17. In the comparison with MDR the number of SNPs is given for the MDR analysis, but not for the proposed algorithms. Do you use 32 SNPs as described under Algorithm 1? If yes, the comparison is unfair, as potentially any other simple subset selection algorithm might have given this result.
Authors’ Response: Please note that for Algorithm 1 we have used all the SNPs, and for MDR we have used only the best 32. Therefore, if the comparison is unfair at all, it is unfair to our algorithms rather than to MDR. It is true that one could use any subset selection algorithm. In this paper we have chosen to use Algorithm 1.

18. I have some methodological doubts: For the proposed algorithms only the best accuracy is reported. Is this on the test set, or is it at the cross-validation level? For MDR it seems to be clear, as the results are given both at the cross-validation level and as testing accuracy. Why is this distinction not made for the 3 algorithms? Typically the results on the CV data and the testing accuracy differ, as for MDR. If the testing accuracy is used for the selection of subsets, the comparison is flawed. I do not get this from the paper; therefore the description of the comparison method needs to be given much more precisely.
Authors’ Response: The accuracies reported are at the cross-validation level.

19. Other publications the authors might consider:
Listgarten, J.; Damaraju, S.; Poulin, B.; Cook, L.; Dufour, J.; Driga, A.; Mackey, J.; Wishart, D.; Greiner, R. & Zanke, B. Predictive models for breast cancer susceptibility from multiple single nucleotide polymorphisms. Clin Cancer Res, 2004, 10, 2725-2737.
G. Üstünkar, S. Özöğür-Akyüz, G. W. Weber, C. M. Friedrich and Y. A. Son, “Selection of Representative SNP Sets for Genome-Wide Association Studies: A Metaheuristic Approach”, DOI:10.1007/s11590-011-0419-7, Optimization Letters, Volume 6(6), Page 1207-1218, 2012.
The newer publications by the Moore group, e.g. Martin, E. R.; Ritchie, M. D.; Hahn, L.; Kang, S. & Moore, J. H. A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol, 2006, 30, 111-123.
You should also refer to the publications under the topic “tag-SNP” selection.
Authors’ Response: We now cite all of these papers.
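The protocol stated in the responses to comments 16 and 18 (identical cross-validation partitions for every method, with accuracy reported at the cross-validation level) can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function names, the toy genotype data, and the two placeholder classifiers are all assumptions introduced for the example.

```python
import random

def make_folds(n_samples, n_folds, seed=0):
    """Shuffle sample indices once and split them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cv_accuracy(fit, X, y, folds):
    """Mean held-out accuracy of `fit` over the given, fixed folds."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def fit_majority(X_train, y_train):
    """Placeholder method A (hypothetical): always predict the majority label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority

def fit_first_feature(X_train, y_train):
    """Placeholder method B (hypothetical): predict 1 if the first SNP code is nonzero."""
    return lambda x: 1 if x[0] > 0 else 0

# Toy genotype matrix (0/1/2 codes); labels depend only on the first SNP.
rng = random.Random(1)
X = [[rng.randint(0, 2) for _ in range(5)] for _ in range(60)]
y = [1 if row[0] > 0 else 0 for row in X]

# The folds are built once and passed to every method being compared.
folds = make_folds(len(X), n_folds=5, seed=42)
acc_a = cv_accuracy(fit_majority, X, y, folds)
acc_b = cv_accuracy(fit_first_feature, X, y, folds)
print(f"method A: {acc_a:.3f}  method B: {acc_b:.3f}")
```

Building the folds once and reusing them for each method keeps the comparison paired: any difference in accuracy then reflects the methods and not a difference in how the samples happened to be split.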
Minor Essential Revisions

The author can be trusted to make these. For example, missing labels on figures, the wrong use of a term, spelling mistakes.
Authors’ Response: We have done this.

1. The part “Normalization” is too long in comparison to the other descriptions; it needs to be shortened.
Authors’ Response: We have done this.

2. Reference 7: “Vapnik” instead of “Vpnik”.
Authors’ Response: We have corrected this.

Overall Assessment: Reject because most of the analysis is not done well, the description of data is incomplete and the contribution is too low. The paper, when properly revised and with improved analysis, should be considered as a good conference contribution.
Level of interest: An article of limited interest
Quality of written English: Acceptable
Statistical review: Yes, and I have assessed the statistics in my report.
Declaration of competing interests: I declare that I have no competing interests

Referee 2

Reviewer's report: This paper presents algorithmic techniques for genotype-phenotype correlation studies. More specifically, the paper delves into the problem of identifying a minimal subset of genes that can be held accountable for a phenotypic change, through analysis of genotypic data (SNP) and phenotypic measurements (microarray profiles). Two different approaches are proposed and comparatively analyzed: the first is gene selection using an SVM-based training method, and the second is dimensionality reduction using random projection.

Major Compulsory Revisions

The paper addresses an important and actively researched problem. However there are a number of major issues with this paper that need to be addressed:

i) This paper ignores the wealth of related literature available on this topic. A number of papers show up if we do a brief Google search on keywords such as “gene selection svm genotype phenotype”. Many of them deal with gene/feature selection and some of them also seem to be using SVM (as in this paper).
See for example the literature cited in the review article by Saeys et al., Bioinformatics, 2007. This paper cites almost none of this and, more importantly, doesn’t contrast itself with these previous efforts, neither by method nor by results.
Authors’ Response: We have now supplied a large number of references. Also, our algorithms have been compared with more algorithms.

ii) Methods in this paper are not clearly presented. Critical details are missing:

a. The SVM method (GSA) [4] used in this paper is described in another paper by the authors (published in a workshop in 2007). Because of this I was neither able to understand nor appreciate the authors’ method. The authors should consider including the details of this method as part of this manuscript.
Authors’ Response: We now provide more details on the GSA algorithm.

b. The random projection method uses parameter k (the number of target dimensions). How is this determined? Also, from the earlier part of the manuscript it seemed like the random projection was an alternative method to GSA. But the later parts of the paper combine the two (e.g., Algorithm 2).
Authors’ Response: Please note that we determine the value of k empirically. In other words, we try different values of k and pick the one that gives the best results.

c. The algorithms described in the section “Our Algorithms” seemed more like a description of the methodology used in the experimental section than the actual methods themselves. For example, “Algorithm 3: In this algorithm, we compare the accuracy and runtime of our GSA and MDR algorithms…”
Authors’ Response: We now supply more details on our algorithms.

iii) The experimental results section was also very difficult to parse. How do Algorithms 1 and 2 compare (GSA vs. GSA+RP)? It seems like this part of the paper can benefit from useful tables and/or Venn diagrams. The tables presented were not of much help here. There is no reference to Tables 1 and 2 from the main text.
Tables 3-7 are basically the same table describing MDR for different settings. What are these results conveying, and can they be plotted concisely in charts to show the trends? Result representation is generally important in these kinds of studies, and the presentation and organization of results in this paper left much to be desired. The opium data set seems like a good choice, but not much domain comment is made available to speak to the general quality of the obtained results.
Authors’ Response: We now provide a comparison of all the algorithms in one table. Also, all tables are now referenced from the main text.

Minor Essential Revisions

i) Sections are not numbered, but the paper organization subsection on page 3 refers to them by number.
Authors’ Response: Please note that BMC MIDM does not permit section numbers. We have deleted the number references.

ii) Page 4: a claim is made to the effect that other algorithms are time consuming because they use recursion, whereas the authors use some combination of techniques which are faster. Details of the latter and justification are required.
Authors’ Response: Done.

iii) Algo 1: step 3: which “population”?
Authors’ Response: We have taken care of this.

iv) Algo 3: “Comparison with GSA and MDR” needs rephrasing.
Authors’ Response: Done.

Discretionary Revisions

i) Put sets in {…} in the main body of the paper, e.g., input n genes {g1,g2,…gn}
Authors’ Response: Done.

Level of interest: An article of importance in its field
Quality of written English: Acceptable
Statistical review: Yes, but I do not feel adequately qualified to assess the statistics.
Declaration of competing interests: I declare that I have no competing interests.

Referee 3

Reviewer's report: This article improves upon existing methodology for genotype-phenotype correlations. Establishing these correlations has become especially important in areas such as biomedical informatics.
The overarching goal might be to incorporate these findings in clinical workflows that may accelerate personalized medicine. The authors present simulation results in support of their claims and provide a clear comparison of the performance of their proposed approach to more established methods.

Minor Comments: I would encourage the authors to move the equations and theorems to the Appendix section if possible, as these seem to be an unnecessary distraction. Inclusion of experimental data can only fortify the claims of the authors. It might not be necessary to list the user-defined SNP ID in Table 1, as it does not have a biological significance.
Authors’ Response: We feel that the theorems and equations can help the reader get a summary of the ideas quickly. We used a real patient sample that was genotyped by NIAAA and belonged to a medical doctor. The data were under the protection of an IRB protocol, and hence the SNP rs numbers (rsxxxxx) were replaced by SNP IDs. We agree that these SNP IDs do not represent biological sequences.

Level of interest: An article of importance in its field
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: No