Dear editor,

We thank the reviewers for their constructive comments and for reading the revised version very carefully. Below we respond to the additional points raised by Reviewer 3. We also indicate in our responses where the paper has changed since the previous revision.

The authors

R1

After the modifications and the answers to the questions raised during the first review I think the paper can be accepted in its current form.

We thank Reviewer 1 for this positive assessment of our paper.

R2

I am satisfied with the authors' responses to the changes I proposed in my initial review, in particular the inclusion of SSVM in the experiments to complete the picture of probabilistic vs utility maximisation approaches.

We thank Reviewer 2 for appreciating the additional experiments we have included.

Minor comments:

- Perhaps I missed it, but it would be good to briefly summarise notation at the beginning, e.g. the use of capital roman font for matrices.

We do not summarize the notation at the beginning, but we have added an appropriate footnote to clarify the notation for matrices.

- pg 10, "Practically, this will be mainly the case for datasets with relatively *low* noise"

Thanks.

R3

Overall, the paper is in a much better shape than the previous version. However, I still hope to see additional changes incorporated.

** MAJOR POINTS

1. On page 4, the authors cited (Sawade et al. 2010) as an example of active learning. However, this paper is about active estimation, not active learning. Please see section 4 of that paper.

This citation was indeed a bit vague. We have deleted it.

2. On page 6, it is not clear to me that regret (5) is always lower than the least favourable case involving the set of risk minimizers $H_L$.

We have updated the formulas on page 6 to make this point clearer.

3. In section 4.3, Theorem 4.2 actually does not justify using marginal probabilities, because it is a lower bound. In contrast, Theorem 4.2 is rather an argument not to use marginal probabilities.

This is discussed quite thoroughly in the manuscript; see the paragraph just before Theorem 4.2.

4. In section 6.1, for the Jaccard index, the exact algorithm of Quevedo et al. (2012) can actually be included as a method to compute the Jaccard index performance measure. Although section 6.2 is for the joint distribution, this same algorithm can also be included for comparison.

We could do that, but this article is about F-measure maximization. We have only additionally shown that there is a link between optimization of the F-measure and the Jaccard index. The goal of this paper is not to exhaustively analyze algorithms that maximize the Jaccard index, so we believe that it would be a bit strange to run additional experiments for optimizing the Jaccard measure.
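As a side remark for the interested reader (this elementary identity is not necessarily the link referred to above): for a single pair of label vectors, writing $a = |\mathbf{y} \wedge \mathbf{h}|$, we have $J = a / (|\mathbf{y}| + |\mathbf{h}| - a)$ and $F = 2a / (|\mathbf{y}| + |\mathbf{h}|)$, hence $J = F / (2 - F)$. Since $x \mapsto x/(2-x)$ is increasing, the two measures rank predictions identically for a fixed outcome, although their expected-utility maximizers differ in general.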
5. Section 7.1.2, the partition of the decision trees based on F-measure seems strange to me, and it conflicts with the entire philosophy of the paper. From my perspective, the probabilistic models (what are called Learning Algorithms in the paper), whether they are kNN, decision trees, PCC or logistic models, are to estimate the probabilities instead of optimising the F-measure. Hence, decision trees as they are, without the modification suggested in the paper, already fulfil this objective. Please see the DTA approach in Ye et al. 2012.

We understand this remark, and we partially agree with it. Decision trees are indeed a somewhat specific case, as we use the F-measure already during the training phase.

In decision trees there are two main components that influence the overall performance of the tree. The first one is correct inference in a given partition of the feature space. Based on the training examples from this partition one can easily compute the F-maximizer using the GFM algorithm (similarly as in kNN, thus during the inference phase); a sketch of this step is given below. The second component is the right criterion for splitting the feature space in the recursive procedure of tree building. There are essentially two options. The first one relies on splitting the tree so as to obtain accurate probability estimates; in the case of the F-measure one is then interested in estimating the probabilities that constitute the matrix P. The second option, described in the manuscript, is to find a split that maximizes the F-measure. This second option is similar to the approach taken in predictive clustering trees, which belong to the most successful tree algorithms for many complex tasks. We agree that in such a case we are already optimizing the F-measure in the training phase.
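Schematically, the inference step in a leaf might look as follows. This is a minimal sketch with our ad-hoc variable names: P_mat[i, s] stands for an estimate of $P(y_i = 1, |\mathbf{y}| = s+1)$ (an entry of the matrix P) and p_zero for an estimate of $P(\mathbf{y} = \mathbf{0}_m)$, both obtained from the relative frequencies of the training label vectors falling into the leaf.

    import numpy as np

    def gfm_inference(P_mat, p_zero):
        # P_mat[i, s] estimates P(y_i = 1, |y| = s + 1); p_zero estimates P(y = 0_m).
        m = P_mat.shape[0]
        # W[s, k] = 1 / ((s + 1) + (k + 1)): the denominators of the F-measure
        W = 1.0 / (np.arange(1, m + 1)[:, None] + np.arange(1, m + 1)[None, :])
        Delta = 2.0 * P_mat @ W  # Delta[i, k]: contribution of label i when |h| = k + 1
        best_h = np.zeros(m, dtype=int)
        best_F = p_zero  # the empty prediction has expected F-measure P(y = 0_m)
        for k in range(m):
            top = np.argsort(Delta[:, k])[::-1][:k + 1]  # k + 1 largest entries of column k
            value = Delta[top, k].sum()
            if value > best_F:
                best_F = value
                best_h = np.zeros(m, dtype=int)
                best_h[top] = 1
        return best_h, best_F

The key point is that only a quadratic number of parameters (the matrix P) is needed, and the maximization then reduces to picking, for every candidate prediction size, the labels with the largest entries in the corresponding column of $\Delta$.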
6. In section 7.2.1, para 4, I don't understand how "high-dimensional problems" explains the substantial decrease of the F-measure with the number of neighbours. Doesn't the subset 0/1 loss require more parameters: it requires the entire joint distribution, but GFM only requires a quadratic number of parameters.

It is not the case that we need to estimate the entire joint distribution very accurately for the subset 0/1 loss; it is enough to estimate the joint mode. Moreover, one cannot easily compare the relative values of different losses: the subset 0/1 loss is already very high for small k. For the F-measure we need accurate estimates of the probabilities, and if we increase the neighborhood we may end up with worse estimates. This can also be observed for MM and Hamming loss: the performance decreases with the size of the neighborhood for the Scene and Enron datasets.

7. The last sentence of section 7.2.4 also applies to all probabilistic models (or learning algorithms as named in the paper).

This advantage indeed applies to all probabilistic models; however, there is always the question of how well the label dependencies can be modeled. Another issue is whether inference can be performed in an efficient manner, which is not possible with all probabilistic models.

APPENDIX

8. Please explain why the proof of Theorem 3.2 can proceed with $\epsilon$ approaching zero, but the proof of Theorem 3.1 can't.

The answer to this question is quite technical, but let us make an attempt to explain it as intuitively as possible. The proof of Theorem 3.2 cannot proceed with the epsilon terms included, because we are not able to derive an oracle solution for the mixed integer linear program when the epsilon terms are considered. However, for Theorem 3.2 this is not needed, because we are able to find a probability distribution that has unique risk minimizers while being arbitrarily close to the oracle solution that we present. The claim of the theorem then immediately follows from taking the limit as epsilon approaches zero.

Let us consider the example of m=4. The vectors

0000 1100 1010 1001 0110 0101 0011 1110 1011 1101 0111 1111

all have a probability mass of 1/12. The subset zero-one loss minimizer is not unique in this case, but let us choose 0000 as the subset zero-one loss minimizer. The F-measure maximizer is 1111. The regret is then 13/24 (see the fraction in Theorem 3.2).

We can easily extend this to an epsilon-case with a unique subset zero-one loss minimizer:

0000: probability $1/12 + 11\epsilon$
1100, 1010, 1001, 0110, 0101, 0011, 1110, 1011, 1101, 0111, 1111: probability $1/12 - \epsilon$ each

So, this suffices as a mathematically correct proof. The same trick cannot be used for Theorem 3.1. If we omitted the epsilon terms in the Lagrangian there, we would end up with a solution that cannot be epsilon-approached by a probability distribution with a unique risk minimizer. As an example, let us again consider the case m=4 and the following probability distribution:

P(1000) = 0.5, P(0111) = 0.5

One of the Hamming loss minimizers is the vector (0000), which has zero expected F-measure. Another vector, (1110), which is also a Hamming loss minimizer, gets an expected F-measure of 0.5833. This yields a regret that is higher than the supremum mentioned in Theorem 3.1, but it is a value that cannot be epsilon-approached when restricting to unique Hamming loss minimizers. For small m, such oracle values can also be double-checked numerically; see the sketch below.
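A minimal brute-force check of such small examples (hypothetical helper code; we make the convention $F(\mathbf{0}_m, \mathbf{0}_m) = 1$ explicit and use exact rational arithmetic):

    from itertools import product
    from fractions import Fraction

    m = 4
    # the twelve vectors of the first example: all y with |y| != 1, mass 1/12 each
    dist = {y: Fraction(1, 12) for y in product([0, 1], repeat=m) if sum(y) != 1}

    def f_measure(y, h):
        if sum(y) + sum(h) == 0:
            return Fraction(1)  # convention F(0_m, 0_m) = 1
        return Fraction(2 * sum(a * b for a, b in zip(y, h)), sum(y) + sum(h))

    def expected_f(h):
        return sum(p * f_measure(y, h) for y, p in dist.items())

    h_star = max(product([0, 1], repeat=m), key=expected_f)  # F-measure maximizer
    print(h_star, expected_f(h_star))                     # maximizer and its expected F
    print(expected_f(h_star) - expected_f((0, 0, 0, 0)))  # regret of the 0/1-loss choice

The same enumeration, with dist replaced accordingly, reproduces the expected F-measures 0 and 7/12 (about 0.5833) of the two Hamming loss minimizers in the second example.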
9. In equation (32), it seems to me that the last summand on the left hand side should not be present.

This sum needs to be present there. $P(0_m)$ is one of the $2^m$ variables in the optimization problem. This term originates from the third sum-term in the primal Lagrangian when taking the partial derivative w.r.t. $P(0_m)$.

10. In equation (41), the case conditions do not seem to hold for $m=4$.

Indeed, this was a small typo on our side. We have corrected it; it does not have any consequences for the correctness of the proof.

** MINOR POINTS

1. At the end of section 4, instead of constructing more complicated examples, one can simply use your Theorem 4.5 and give 1/6 as a regret for large enough $m$.

The example is actually not meant to show how big the regret can be, but to provide more insight into what kind of distributions might lead to a high regret. We believe that this is useful information for readers who want to skip the proof.

2. The first para on page 17 seems to be important. You may want to include it in the abstract and/or the introduction.

We like this comment. This is indeed an important paragraph, but we feel that readers first have to understand the general framework we work with before being able to appreciate the importance of this statement.

3. In Figures 2 and 3, the right graphs can be terminated at around 4000, since almost all the methods have already reached their asymptotic levels by then.

Indeed, we could terminate the plots around 4000. However, in order not to introduce any additional mistakes into the paper, we prefer not to change the graphs.

4. The bottom left plot in Figure 3 is important to show the performance of GFM over FM. More discussion on this will be helpful.

This is an interesting remark. A quick answer is that FM assumes label independence, and if this assumption is violated then it may not perform well, not only for the F-measure but also for the Jaccard index. We have added a short text to this section.

5. In Table 2, Hamming loss, Scene data set, I believe the best method for $l=100$ is JM instead of MM.

Thanks.

** CLARITY

I suggest the authors do a thorough proof-reading. At several points, the wordings are rather awkward.

We have read the paper once again and have changed a few sentences. As none of us is a native English speaker, we can imagine that some wordings could be improved further.

1. In Theorem 3.1, should the set $\mathcal{P}^u_{L_H}$ actually be $\mathcal{P}^u_H$ to be consistent with Definition 2.2? This actually occurs at multiple points throughout the paper.

We actually prefer to use the symbol $L_H$ instead of simply $H$ to denote the Hamming loss. From that perspective the notation is consistent with Def. 2.2.

2. In Theorem 3.1, it seems to me that the requirement for a unique $F$-measure maximiser is unnecessary, as you have remarked near the end of section 2.

We have removed this restriction in Theorem 3.1.

3. In the last para of section 3.3, despite the authors' attempt, I still find "signal strength" confusing. a. Do you mean the "predictive performance" is a good indication of "signal strength", or do you mean the "decrease of the upper bound on the regret with predictive performance" is a good indication? b. Please explain the second sentence: what is a "practical situation"? and what is a hypothetical machine learning method? c. Again, what do you mean by "noise" in the last sentence?

We have changed the discussion to make these things clearer.

4. The entire section 4.3 is hard to understand, including footnote 2. I think this is because the authors tried to re-cast empirical utility maximisation (EUM) (using the term in Ye et al. 2012) into the decision theory approach (again, using the term in Ye et al. 2012). The EUM estimates the threshold $\theta$ using training data, but applies the same threshold on test data. The citations after Theorem 4.5 are based on such an EUM approach that uses training data and test data at separate phases. However, it seems to me that the authors are talking about estimating the threshold and then testing on the same data. The authors have to make such differences clear, because most readers are used to the EUM approach.

The goal of this section is to show that thresholding does not lead to the optimal result, even if the thresholding method has full knowledge of the joint distribution. The method of Zhang et al. (2010) performs thresholding on a validation set with observations that are not independent.

5. In section 5, several of the $\Delta_{ik}$s are missing their superscripts. Moreover, the $\delta$ in the display equation above Theorem 5.1 should actually be $\Delta$.

Thanks. We now do not use the superscript anywhere in the manuscript.

6. On page 17, second para, last sentence: I don't see how the concentration of the joint distribution helps in estimating the required parameters.

In a problem with $m$ labels we have $m$ marginal probabilities and $2^m$ parameters of the joint distribution to estimate. However, if the joint distribution is concentrated on only a few label combinations (the other label combinations having zero probability), say $n$ of them, and if $n < m$, then estimating the joint distribution is not harder than estimating the marginal probabilities. Having the joint distribution, we can easily obtain any parameters we need for our algorithms; see the sketch below.
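To illustrate this point, a schematic sketch (with hypothetical helper names) of estimating the joint distribution by counting label combinations, from which any required parameter can then be derived:

    from collections import Counter

    def estimate_joint(Y):
        # Y: list of binary label vectors (tuples of length m).
        # If the true distribution is concentrated on a few combinations,
        # each of them is observed often, so the relative frequencies are
        # estimated as reliably as the m marginal probabilities.
        n = len(Y)
        return {y: c / n for y, c in Counter(map(tuple, Y)).items()}

    def marginal(joint, i):
        # any derived parameter follows directly, e.g. the marginal P(y_i = 1)
        return sum(p for y, p in joint.items() if y[i] == 1)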
7. In section 6.1, are the training sets used to estimate the parameters $w_i$s?

We do not estimate the $w_i$ directly. We use appropriate frequencies to estimate the required parameters, and then run the inference methods on these estimates.

8. In section 6.2, does "theoretical analysis" refer to Theorem 3.3?

We have rephrased this paragraph. "Theoretical analysis" refers to Theorem 4.2 and Corollary 4.3.

9. In the first sentence of section 7, by "parameters" do you mean the parameters of the probabilistic models?

We mean the parameters that are needed to compute the F-measure: the marginal probabilities or the matrices P or $\Delta$.

10. In the intro to section 7, please give citations for macro- and micro-averaged F-measure.

We have added these.

11. The second sentence in section 7.1.4 can be omitted to prevent confusion.

We have modified this paragraph accordingly.

12. In section 7.1.4, is the $[[y_i=1]]$ the Iverson bracket?

Yes. We have clarified the notation in the text.

13. In section 7.1.4, I suggest to use $t$ as the random variable in the multinomial regressions to prevent confusion with the actual labels in the multi-label problem.

Thanks. We have used $t$ instead of $y$ in Section 7.1.4.

14. In section 7.1.4, para 5, please explain the parenthesised clause.

We have extended this clause to make it clearer.

15. In section 7.1.4, and in the subsequent mentions, I suggest to use P-GFM and P-FM instead of EFP and LFP to remind the reader what these methods stand for.

This is a nice suggestion, but we prefer to keep the notation consistent with the previous paper.

16. In section 7.2.1 para 5, what is "searching time"?

Searching time means the nearest neighbor search time; we have changed the text to make this clear.

17. In section 7.2.3, I suggest to use P-MM instead of BR, since it is basically mode maximisation.

Similarly as above: BR is a popular name for this approach, therefore we have decided to keep it.

18. Table 5, bold 10.51 for BR/Hamming loss.

Thanks.

19. In section 7.3 para 4, what does "blending" mean?

It is another term for ensemble learning.

APPENDIX

20. In the proof of Theorem 3.1, please use either $P_A$ or $P$ throughout.

We need to distinguish between $P(y)$ and $P_A(y)$. The former represents the variables that are optimized in the mixed integer program; the latter represents the oracle solution of that program.

21. In the proof of Theorem 3.1, should the $y^{ABC}$s be $y^{BCD}$?

Corrected.

22. Please explain the phrasing about the individual variable $\lambda_y^-$ at the top of page 40.

We have extended the sentence.

23. In the proof of Theorem 3.1, it is not immediately clear that (27) and (28) are equivalent due to (26).

(27) and (28) are not equivalent. (26), (27), (28) and (29) represent the four cases that can be distinguished when writing out the system just above (26) more explicitly. We agree that this was confusing, and we have updated the text.

24. Please explain where the set of equations at the bottom of page 40 comes from. Are these valid for any $y$ and $y'$?

They follow immediately from the formula above (26). The inequality arises from omitting the $\mu_j$ terms in that formula.

25. Please explain the first sentence on page 41.

This is now explained in more detail.

26. For the proof of Theorem 3.2, please place the proof of the oracle solution into a separate lemma.

We do not believe that it is a good idea to add a lemma in the middle of the proof. In order to state the oracle solution of the optimization problem as a lemma, one would first need to introduce a lot of notation (the part written above the formulation as an optimization problem). As a result, we believe that putting part of the proof into a lemma would only make the proof less readable.

27. For the sets $\Omega^0$ and $\Omega^1$ on page 45, should these sets be parameterized by $q$? Similarly, it seems that the mapping $\Psi$ should also be parameterized by $q$.

Yes, this is a correct remark. We have adjusted this.

28. Instead of using "shifts" the bit, use "toggle" or "switch".

Ok.

29. For the proof of Theorem 4.5, say "expected F-measure of this prediction", instead of omitting "expected".

Ok.