Remarks to the comments of the reviewers

Reviewer #1

We thank reviewer #1, Dr. Peterson, for the critical and very helpful suggestions and for the assessment that the article is of importance in its field. The manuscript has been modified, especially in the ANN description chapter. Our comments on the corresponding points are the following:

To point 1:
The reviewer wrote: The authors left out valuable descriptions on why they didn't use certain predictor variables when developing their artificial neural network model. What was the informativeness of histopathological scores such as biopsy Gleason, pathology Gleason, TNM, pathology stage, extracapsular extension, seminal vesicle invasion, lymph node involvement (yes/no), surgical margins, etc.?

Our response:
The 49 PCa patients had biopsy Gleason sums of 4 (n=2), 5 (n=5), 6 (n=19), 7 (n=18), 8 (n=3) and 9 (n=2). Most patients were treated with radiation therapy and/or hormonal castration, or with active surveillance. Fifteen patients underwent laparoscopic radical prostatectomy. Of those, 11 were pT2, 3 were pT3a and one had seminal vesicle invasion (pT3b). Regarding the surgical margins, 3 patients were R1 and the remaining 12 had R0 status. Of the 14 patients with laparoscopic pelvic lymphadenectomy, three had tumor invasion (pN1) while the other 11 were pN0. We would like to point out that the ANN was developed only for the detection of prostate cancer, not for staging; we therefore give the histopathological information in this comment but not in the revised manuscript.

To point 2:
The reviewer wrote: Please provide a table with statistical test results for the above variables using a k-group test such as F-test or Kruskal-Wallis for the 3 groups (i.e. k=3). What are the p-values for significant difference of the above across the 3 groups? How many missing data are there? Provide more discussion about variable selection prior to ANN implementation.
Our response:
Regarding the missing data, we described the selection of patients for this study in the first chapter of Patients and Methods. From a cohort of 2959 patients with available tPSA and fPSA measurements, a total of 199 patients were included. The selection criteria for this study were at least three tPSA and fPSA measurements before treatment, with a minimum interval of three months between two measurements. We therefore had to exclude most of the initially seen patients, who typically had only one dataset of tPSA and fPSA, but the remaining 199 patients of this study consequently had no missing data.

The variable selection prior to ANN implementation was restricted. Only the clinical data tPSA, fPSA, age, prostate volume and DRE status were available for all patients. No other parameters, such as transitional prostate volume, other serum values or TRUS findings, were available for all patients, and they could therefore not be implemented. Furthermore, the basis of this study was the original "ProstataClass" ANN, which includes only the five values tPSA, fPSA, age, prostate volume and DRE. The aim was to see whether the addition of the PSA velocity and the ANN velocity may add further information to distinguish between the two patient groups, prostate cancer (PCa) and benign prostatic hyperplasia (BPH). Thus, in this study we only compared PCa and BPH, not a third group. However, we added into Table 1 those p-values that were not given in the text of the Results chapter, such as for %fPSA and prostate volume (both were different between PCa and BPH with P < 0.0001).

The reviewer furthermore requested a table with statistical test results for the above variables using a k-group test such as F-test or Kruskal-Wallis for the 3 groups (i.e. k=3). We assume that the 3 groups are those with increasing, stable or decreasing PSA velocity (PSAV) and ANN velocity (ANNV) values.
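The two tests used in the comparisons described here can be illustrated with a small sketch. This is not the SPSS analysis performed in the study; it uses synthetic, illustrative numbers (not patient data) and assumes SciPy is available:

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(0)
# Synthetic, illustrative values only (not the patient data from this study)
pca = rng.normal(11.0, 3.0, 35)    # e.g. tPSA-like values, PCa group
bph = rng.normal(5.0, 2.0, 32)     # BPH group
third = rng.normal(5.5, 2.0, 98)   # a third group for the k = 3 test

# Two-group comparison (PCa vs BPH), as used for the pairwise p-values
u_stat, p_two = mannwhitneyu(pca, bph, alternative="two-sided")

# k-group comparison (k = 3), as requested by the reviewer
h_stat, p_three = kruskal(pca, bph, third)

print(f"Mann-Whitney U p = {p_two:.4g}, Kruskal-Wallis p = {p_three:.4g}")
```

Both are non-parametric rank tests and therefore make no normality assumption about the PSA distributions.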
We calculated all requested comparisons and included two new Tables 6 and 7 for the 3 PSAV groups and the 3 ANNV groups (pages 19 and 20). We added the information that, besides the Mann-Whitney U test, we also used the Kruskal-Wallis test into the statistics section (page 5, last chapter of the Patients and Methods section): Statistical calculations were performed with SPSS 14.0 for Windows (SPSS, Chicago, USA). We used the non-parametric Mann-Whitney U test and the Kruskal-Wallis test. We included a sentence in the Results chapter referring to the two new tables (page 7, lines 9-10): Further descriptive data are given for the patients with increasing, stable and decreasing PSAV (Table 6) and ANNV (Table 7).

Table 6: Median values and p-values between the 3 groups of increasing, stable or decreasing PSAV values

Parameter     | Increasing PSAV (> 0.75 ng/mL/year) | Stable PSAV (-0.75 to 0.75 ng/mL/year) | Decreasing PSAV (< -0.75 ng/mL/year)
              | PCa (n=35)  BPH (n=32)  p-value     | PCa (n=9)   BPH (n=98)  p-value        | PCa (n=5)   BPH (n=20)  p-value
Age (years)   | 65          68          0.15        | 68          68          0.86           | 72          68.5        0.92
tPSA (ng/mL)  | 11.7*       11.5*§      0.72        | 4.8§        4.96        0.96           | 6.2*        4.39        0.067
%fPSA (%)     | 9.6*§       14.1        0.001       | 15          17.4        0.49           | 21          16          0.13
Volume (mL)   | 35          46.5        0.066       | 33          45          0.1            | 55          57.5        0.89

The p-value is given for the comparison between the respective PCa and BPH patients (Mann-Whitney U test).
* significantly different from the respective patients in the stable group (P < 0.05; Mann-Whitney U test)
§ significantly different from the respective patients in the decreasing group (P < 0.05; Mann-Whitney U test)

The Kruskal-Wallis test for the PCa patients between all 3 groups showed significant differences for tPSA (p=0.0003) and %fPSA (p=0.0008), but not for age (p=0.26) or volume (p=0.17). The Kruskal-Wallis test for the BPH patients between all 3 groups showed significant differences for tPSA (p<0.0001), but not for %fPSA (p=0.4), age (p=0.96) or volume (p=0.44).
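The three-group stratification used in these tables can be sketched as follows. The cut-offs (±0.75 ng/mL/year for PSAV, ±4 for ANNV) are taken from the tables; the function name and labels are illustrative only:

```python
def velocity_group(value: float, cutoff: float) -> str:
    """Assign a velocity value to one of the three groups
    (increasing / stable / decreasing) used in Tables 6 and 7."""
    if value > cutoff:
        return "increasing"
    if value < -cutoff:
        return "decreasing"
    return "stable"

# PSAV grouping as in Table 6 (cut-off 0.75 ng/mL/year)
print(velocity_group(1.2, 0.75))    # increasing
print(velocity_group(-0.3, 0.75))   # stable

# ANNV grouping as in Table 7 (cut-off 4)
print(velocity_group(-5.0, 4.0))    # decreasing
```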
Table 7: Median values and p-values between the 3 groups of increasing, stable or decreasing ANNV values

Parameter     | Increasing ANNV (> 4)               | Stable ANNV (-4 to 4)                  | Decreasing ANNV (< -4)
              | PCa (n=22)  BPH (n=16)  p-value     | PCa (n=17)  BPH (n=124)  p-value       | PCa (n=10)  BPH (n=10)  p-value
Age (years)   | 65*         68.5        0.61        | 70          68           0.7           | 62          64          0.29
tPSA (ng/mL)  | 8.3         6.8         0.17        | 10.2        5.3          0.002         | 5.5         4.35        0.5
%fPSA (%)     | 8.0*§       13.15       0.017       | 15.15       17.4         0.85          | 11.2        13.8        0.1
Volume (mL)   | 33.5*       36.5*       0.34        | 50          53.5§        0.87          | 30.5*       34*         0.26

The p-value is given for the comparison between the respective PCa and BPH patients (Mann-Whitney U test).
* significantly different from the respective patients in the stable group (P < 0.05; Mann-Whitney U test)
§ significantly different from the respective patients in the decreasing group (P < 0.05; Mann-Whitney U test)

The Kruskal-Wallis test for the PCa patients between all 3 groups showed significant differences only for %fPSA (p=0.014), but not for age (p=0.32), tPSA (p=0.16) or volume (p=0.98). The Kruskal-Wallis test for the BPH patients between all 3 groups showed significant differences for volume (p=0.022), but not for age (p=0.31), tPSA (p=0.61) or %fPSA (p=0.17).

Points 3-13 are all directed at the ANN and should therefore be considered together. The main topic of this paper is not the construction of a new ANN but the use of a previously trained ANN. Therefore the characteristics of the ANN are not the topic of this paper. Beyond the information in paper [22], further parameters of the ANN used are now given in the revised version of the paper. The determination/measurement of the variables and the characteristics of the population are also the same as described in paper [22]. There we described that we used three ANNs, for the PSA ranges 2-4 ng/mL, 4-10 ng/mL and 10-20 ng/mL respectively. The ANN training parameters were the same for each ANN.

To point 3:
The reviewer wrote: For the ANN models used, how many input nodes were there (don't use variable names).
Rather, state e.g. "there were 6 input nodes based on the input variables x, y, z ...".

Our response:
Each ANN has one input layer with five neurons (according to the five variables age, PSA, %fPSA, TRUS and DRE). We changed the sentence accordingly (page 5, lines 2-3): The back-propagation network consists of one input layer with the five neurons tPSA, %fPSA, patient age, prostate volume, and DRE status.

To point 4:
The reviewer wrote: How many hidden nodes were there in each ANN model?

Our response:
Each ANN contains one hidden layer with three neurons (according to the rule of thumb that the number of hidden neurons should be about half the number of input neurons). However, we also tried using 2 or 4 hidden neurons, but 3 worked best. We added this information into the manuscript (page 5, lines 3-4): Each ANN contains one hidden layer with three neurons.

To point 5:
The reviewer wrote: How many output nodes in each ANN model (3?). This can be assumed, but it's better to say "there were 3 output nodes, one node for each diagnostic class."

Our response:
Each ANN contains only one output neuron representing the output value as the probability of PCa. The groups described in the paper are built from the changes in the PSA values and from the changes in the output values (the probabilities for PCa) of the respective ANNs. The groups are not built from ANNs with three output neurons according to three groups. To clarify this for the readers, we added the sentence (page 5, lines 4-5): Each ANN finally contains one output neuron representing the output value as the probability of PCa.

To point 6:
The reviewer wrote: What activation function was applied at the hidden nodes (logistic, tanh, etc.)?

Our response:
The activation function for the hidden neurons was the tanh. We added this into the Patients and Methods chapter (see point 7).

To point 7:
The reviewer wrote: What function was applied prior to output nodes (softmax, linear, etc.)?
Our response:
The activation function for the output neuron was linear in the range 0 to 1, to yield a value for the probability of PCa. We added the information from points 6 and 7 (page 5, lines 5-7): The activation function for the hidden neurons was the tanh, while the activation function for the output neuron was linear in the range 0 to 1 to get a value for the probability of PCa.

To point 8:
The reviewer wrote: How many epochs or sweeps were used per model, and what was the MSE stopping criteria? It is a good idea to provide a plot of MSE as a function of epochs (sweeps) for an example model.

Our response:
We added these detailed technical ANN data into the manuscript (page 5, lines 7-8): Training of the ANN took place in 4 steps with 100 sweeps each. Stopping criteria were an RMS error of less than 0.001 or a rate of 95% correctly classified samples.

To point 9:
The reviewer wrote: How were the initial ANN weights set and what was their range?

Our response:
The initial weights were set randomly to values between -1 and +1.

To point 10:
The reviewer wrote: Were the variables standardized prior to ANN usage? How was this done?

Our response:
Before training, all variables were normalized to a mean value of 0 and a standard deviation of 1. The facts on points 9 and 10 were given in the manuscript (page 5, lines 9-10): The initial weights were set randomly to values between -1 and +1. Before training all variables were normalized to mean value 0 and standard deviation 1.

To point 11:
The reviewer wrote: What were the values of the ANN learning rate and momentum?

Our response:
We used the default values of NConnect, i.e. for the 4 training steps (see point 8) learning rates of 0.9, 0.7, 0.5 and 0.4, and momentum values of 0.1, 0.4, 0.5 and 0.6, respectively. This very detailed information is only given in this comment, not in the manuscript.
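The network described in points 3-10 (five input neurons, three tanh hidden neurons, one linear output restricted to [0, 1], weights initialized in [-1, +1]) can be sketched as follows. This is a minimal, untrained NumPy illustration of the architecture, not the original NConnect implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# 5 input neurons (tPSA, %fPSA, age, prostate volume, DRE status),
# 3 hidden neurons, 1 output neuron; initial weights uniform in [-1, +1]
W1 = rng.uniform(-1.0, 1.0, size=(3, 5))  # input -> hidden weights
b1 = rng.uniform(-1.0, 1.0, size=3)       # hidden biases
W2 = rng.uniform(-1.0, 1.0, size=3)       # hidden -> output weights
b2 = rng.uniform(-1.0, 1.0)               # output bias

def forward(x: np.ndarray) -> float:
    """Forward pass: tanh hidden layer, linear output restricted to
    [0, 1] so the single output reads as a PCa probability."""
    h = np.tanh(W1 @ x + b1)
    out = float(W2 @ h + b2)
    return min(max(out, 0.0), 1.0)

# One example input vector, already standardized (mean 0, SD 1)
x = np.zeros(5)
p = forward(x)
print(f"PCa probability (untrained sketch): {p:.3f}")
```

The standardization of the inputs (point 10) is assumed to have happened before `forward` is called, matching the description that all variables were normalized before training.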
To point 12:
The reviewer wrote: Was the sample order randomly permuted before each training epoch so that ANN training results are not biased toward the order in which the samples were arranged during training?

Our response:
Before training, the samples were ordered randomly. We added this (page 5, lines 9-10): Before training all variables were normalized to mean value 0 and standard deviation 1 and ordered randomly.

To point 13:
The reviewer wrote: How were the samples partitioned during ANN training and testing? Leave-one-out cross-validation? 10-fold cross-validation? Hold-out method, where e.g. 2/3 of the samples are used for training and 1/3 for testing? Last, the data are purely phenotypic (clinical) and not mechanistic, since PSA is not a biological marker of cytokine signaling, apoptosis, cell survival, neovascularization, or invasion. Thus, the reported AUC values are understandable.

Our response:
The ANN model was evaluated internally by 10-fold cross-validation. As with other ANN models we built with new biomarkers, we here used the same method as in the original "ProstataClass" ANN model, namely 10-fold cross-validation. We agree with the reviewer that all data are clinical data and that the interpretation of the AUC values has to be seen in this light.

Reviewer #2

We thank reviewer #2, Dr. Almeida, for the critical and very helpful suggestions and for the assessment that the analysis of these data is very relevant and should be reported. The manuscript has been modified. Our comments on the corresponding points are the following:

To point 1:
The reviewer wrote: How was the ANN configured - how many hidden nodes? If an adaptive method was used, was it by pruning, by forward addition? Or was this not optimized and 3 hidden nodes were preset as in [22]?

Our response:
We did not train a new ANN; we used the ANN that was trained as described in reference [22].
The main topic of this paper is not the construction of a new ANN but the use of a previously trained ANN. The determination/measurement of the variables and the characteristics of the population in this paper are also the same as described in paper [22]. There we described that we used three ANNs, for the PSA ranges 2-4 ng/mL, 4-10 ng/mL and 10-20 ng/mL respectively. The ANN training parameters were the same for each ANN. Each ANN has one input layer with five neurons (according to the five variables age, PSA, %fPSA, TRUS and DRE). Furthermore, each ANN contains one hidden layer with three neurons (according to the rule of thumb that the number of hidden neurons should be about half the number of input neurons). Finally, each ANN contains one output neuron representing the output value as the probability of PCa. As also requested by reviewer 1, all this information was added (page 5, lines 1-5): The back-propagation network consists of one input layer with the five neurons tPSA, %fPSA, patient age, prostate volume, and DRE status. Each ANN contains one hidden layer with three neurons. Each ANN finally contains one output neuron representing the output value as the probability of PCa.

To point 2:
The reviewer wrote: The ROC curves are derived using the same type of data that was used to train the ANN. Is there any kind of cross-validation? External validation (with data not at all used for training)? Is there any optimization halting procedure to avoid over-fitting? Early stopping criteria are a standard feature of ANN training. Was any of that in place? Whatever the specific answers, the critical issue in this objection is that of validation. As per the description in the report itself, this reviewer would expect the ANN to be over-trained.

Our response:
During training of the ANN reported in [22], 10-fold cross-validation (internal validation) was carried out. Therefore, information about the performance of the ANN was available.
Furthermore, during training, 10% of the data were used for validation to avoid over-fitting. Stopping criteria were the number of 400 sweeps (four steps with different learning coefficients and momentum coefficients), an RMS error of 0.001, or 95% correctly classified samples. See page 5, lines 7-8, according to the comment to reviewer 1. All these facts concern the ANN developed in [22]. Although we did not perform an external validation there, we believe that the validation during training and the 10-fold cross-validation minimize the risk of over-fitting. Again, we used the ANN already described in reference [22] to perform this study on the usefulness of PSA velocity. From this point of view, we took the ANN outputs directly, and the values for the "velocity" of the ANN outputs (ANNV), to derive the ROC curves. We added information on 10-fold cross-validation and over-fitting accordingly (page 5, lines 10-12): To avoid over-fitting we used 10-fold cross-validation. During training always 10% of the data were used for internal validation.

To point 3:
The reviewer wrote: If a number of clinical parameters such as age are already known to be co-variates, why weren't they considered in the analysis? That is where machine learning approaches become the most useful: by being able to include a large variety of parameters without concern for their scales or distributions. If a variable selection procedure were in place (such as boosting or more advanced methods using evolutionary approaches) then the proposed interpretations could be backed by sensitivity analysis (the report says "results not shown" on the topic of assigning predictive sensitivity). Most of the questions posed here are addressed in that report. However, that was a different study and it is not clear what exactly can be assumed to apply to this much smaller study. Some probably can, such as a 10-fold cross-validation, but for others I find no clue.
The suggestion is that the relevant material be pulled in and the missing answers be added. This could be listed in a compact, small "ANN analysis section"; a tabular format would be fine, so the computational statisticians in the audience can put their concerns to rest.

Our response:
We did not use any variable selection procedure in the process of training the ANN described in [22]. However, as suggested by the reviewer, we added a small ANN analysis section on page 5, lines 2-12. This is also in accordance with the ANN method details requested by reviewer 1: The back-propagation network consists of one input layer with the five neurons tPSA, %fPSA, patient age, prostate volume, and DRE status. Each ANN contains one hidden layer with three neurons. Each ANN finally contains one output neuron representing the output value as the probability of PCa. The activation function for the hidden neurons was the tanh, while the activation function for the output neuron was linear in the range 0 to 1 to get a value for the probability of PCa. Training of the ANN took place in 4 steps with 100 sweeps each. Stopping criteria were an RMS error of less than 0.001 or a rate of 95% correctly classified samples. The initial weights were set randomly to values between -1 and +1. Before training all variables were normalized to mean value 0 and standard deviation 1 and ordered randomly. To avoid over-fitting we used 10-fold cross-validation. During training always 10% of the data were used for internal validation.
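The 10-fold cross-validation partitioning mentioned throughout these responses can be sketched as follows. This is a generic illustration (NumPy assumed, 199 samples as in this study), not the SPSS/NConnect procedure actually used:

```python
import numpy as np

def kfold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Randomly shuffle sample indices and split them into k folds;
    each fold serves exactly once as the held-out test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Example: 10 folds over the 199 patients of this study
splits = list(kfold_indices(199, k=10))
print(len(splits), "folds; first test fold size:", len(splits[0][1]))
```

In each of the 10 rounds the ANN would be trained on roughly 90% of the samples and evaluated on the remaining fold, so every patient is used for testing exactly once.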