Dear editor,
We thank the reviewers for their constructive comments and for
reading the revised version very carefully. Below we give a response
to the additional points raised by Reviewer 3. We also indicate in
our responses where the paper has changed since the previous
revision.
The authors
R1
After the modifications and the answers to the questions raised during the
first review I think the paper can be accepted in its current form.
We thank reviewer 1 for his appreciation of our paper.
R2
I am satisfied with the authors' responses to the changes I proposed in my
initial review, in particular the inclusion of SSVM in the experiments to
complete the picture of probabilistic vs utility maximisation approaches.
We thank reviewer 2 for his appreciation of the additional experiments we
have included.
Minor comments:
- Perhaps I missed it, but it would be good to briefly summarise notation
at the beginning, e.g. the use of capital roman font for matrices.
We do not summarize the notation at the beginning, but we have added an
appropriate footnote to clarify the notation for matrices.
- pg 10, "Practically, this will be mainly the case for datasets with
relatively *low* noise"
Thanks.
R3
Overall, the paper is in much better shape than the previous version.
However, I still hope to see additional changes incorporated.
** MAJOR POINTS
1. On page 4, the authors cited (Sawade et al. 2010) as an example of
active learning. However, this paper is about active estimation, not active
learning. Please see section 4 of that paper.
This citation is indeed a bit vague. We have deleted it.
2. On page 6, it is not clear to me that regret (5) is always lower than
the least favourable case involving the set of risk minimizers $H_L$.
We have updated the formulas on page 6 to make this point more clear.
3. In section 4.3, Theorem 4.2 actually does not justify using marginal
probabilities, because it is a lower bound.
On the contrary, Theorem 4.2 is rather an argument against using marginal
probabilities. This is discussed quite thoroughly in the manuscript; see
the paragraph just before Theorem 4.2.
4. In section 6.1, for the Jaccard index, the exact algorithm of Quevedo et
al. (2012) can actually be included as a method to compute for the Jaccard
index performance measure. Although section 6.2 is for joint distribution,
this same algorithm can also be included for comparison.
We could do that, but this article is about F-measure maximization. We have
only additionally shown that there is a link between optimization of the
F-measure and the Jaccard index. The goal of this paper is not to
exhaustively analyze algorithms that maximize the Jaccard index, so we
believe that it would be a bit strange to run additional experiments for
optimizing the Jaccard measure.
5. Section 7.1.2, the partition of the decision trees based on F-measure
seems strange to me, and it conflicts with the entire philosophy of the
paper. From my perspective, the probabilistic models (what is called
Learning Algorithms in the paper), whether they are kNN, decision trees,
PCC or logistic models, are to estimate the probabilities instead of
optimising the F-measure. Hence, decision trees as they are, without the
modification suggested in the paper, already fulfil this objective. Please
see the DTA approach in Ye et al. 2012.
We can understand this remark, and we partially agree with it. Decision
trees are indeed a somewhat special case, as we already use the F-measure
during the training phase.
In decision trees there are two main components that influence the overall
performance of the tree. The first one is correct inference within a given
partition of the feature space: based on the training examples from this
partition, one can easily compute the F-maximizer using the GFM algorithm
(similarly to kNN, i.e., during the inference phase). The second component
is the right criterion for splitting the feature space in the recursive
procedure of tree building. There are essentially two options. The first
one relies on splitting a tree in order to get accurate estimates of
probabilities; in the case of the F-measure, one should be interested in
estimating the probabilities that constitute the matrix P. The second
option, described in the manuscript, is to find a split that maximizes the
F-measure. This second option is similar to the approach taken in
predictive clustering trees, which are among the most successful tree
algorithms for many complex tasks. We agree that in such a case we are
already optimizing the F-measure in the training phase, as illustrated by
the sketch below.
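To make this concrete, here is a minimal sketch (our own illustration, not
the implementation used in the paper; all function names are ours, and the
brute-force enumeration merely stands in for GFM, which solves the inner
problem exactly and far more efficiently):

import itertools
import numpy as np

def expected_f(pred, Y):
    # Mean F-measure of a fixed prediction over the label vectors in Y
    # (rows of Y are 0/1 label vectors; F(0, 0) is taken to be 1).
    num = 2.0 * (Y & pred).sum(axis=1)
    den = Y.sum(axis=1) + pred.sum()
    f = np.where(den == 0, 1.0, num / np.maximum(den, 1))
    return f.mean()

def f_maximizer(Y):
    # Brute-force F-maximizer for the empirical distribution in a leaf;
    # feasible only for small m, for illustration only.
    m = Y.shape[1]
    cands = [np.array(c) for c in itertools.product([0, 1], repeat=m)]
    return max(cands, key=lambda h: expected_f(h, Y))

def split_score(X, Y, feature, threshold):
    # F-measure-based split criterion: the weighted expected F-measure of
    # the F-maximizing predictions in the two child partitions.
    left = X[:, feature] <= threshold
    score = 0.0
    for mask in (left, ~left):
        if mask.sum() == 0:
            return -np.inf  # degenerate split, never selected
        Yc = Y[mask]
        score += mask.mean() * expected_f(f_maximizer(Yc), Yc)
    return score

The tree builder would then pick, at every node, the (feature, threshold)
pair with the highest split_score, which is precisely the sense in which
the F-measure is already optimized during training.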
6. In section 7.2.1, para 4, I don't understand how "high-dimensional
problems" explains the substantial decrease of F-measure with the number of
neighbours. Doesn't the subset 0/1 loss require more parameters: it
requires the entire joint distribution, but GFM only requires a quadratic
number of parameters.
It is not that we need to estimate the entire joint distribution very
accurately for the subset 0/1 loss; it is enough to estimate the joint
mode. Moreover, one cannot easily compare the values of different losses
with each other: the subset 0/1 loss is already very high for small k.
For the F-measure we need accurate estimates of probabilities, and if we
increase the neighborhood we may end up with worse estimates. This can also
be observed for MM and Hamming loss: the performance decreases with the
size of the neighborhood for the Scene and Enron datasets.
7. The last sentence of section 7.2.4 also applies to all probabilistic
models (or learning algorithms as named in the paper).
This advantage indeed applies to all probabilistic models; however, there
is always the question of how well the label dependencies can be modeled.
Another issue is whether inference can be performed in an efficient manner,
which is also not possible with all probabilistic models.
APPENDIX
8. Please explain why the proof of Theorem 3.2 can proceed with $\epsilon$
approaching zero, but the proof of Theorem 3.1 cannot.
The answer to this question is quite technical, but let us make an attempt
to explain it as intuitively as possible. The proof of Theorem 3.2 cannot
proceed with epsilon terms because we are not able to derive an oracle
solution for the mixed integer linear program when epsilon terms are
considered. However, for Theorem 3.2 this is not needed, because we are
able to find a probability distribution that has unique risk minimizers
while being arbitrarily close to the oracle solution that we present. The
claim of the theorem then immediately follows from taking the limit as
epsilon approaches zero. Let us consider the example of m=4. The vectors
0000
1100
1010
1001
0110
0101
0011
1110
1011
1101
0111
1111
all have a probability mass of 1/12. The subset zero-one loss minimizer is
not unique in this case, but let us choose 0000 as the subset zero-one loss
minimizer. The F-measure maximizer is 1111. The regret should then be
13/24; see the fraction in Theorem 3.2.
We can easily extend this to an epsilon-case with a unique subset zero-one
loss minimizer:
0000   1/12 + 11 eps
1100   1/12 - eps
1010   1/12 - eps
1001   1/12 - eps
0110   1/12 - eps
0101   1/12 - eps
0011   1/12 - eps
1110   1/12 - eps
1011   1/12 - eps
1101   1/12 - eps
0111   1/12 - eps
1111   1/12 - eps
So, this suffices as a mathematically correct proof.
The same trick cannot be used for Theorem 3.1. If we omitted the
epsilon-terms in the Lagrangian there, we would end up with a solution that
cannot be epsilon-approached by a probability distribution with a unique
risk minimizer. As an example, let us again consider the case m=4 and the
following probability distribution:
1000   0.5
0111   0.5
One of the risk minimizers for Hamming loss is the vector (0000), with zero
F-measure. Another vector, (1110), which is also a Hamming loss minimizer,
gets an F-measure of 0.5833. This is a regret that is higher than the
supremum mentioned in Theorem 3.1, but it is a value that cannot be
epsilon-approached when restricting to unique Hamming loss minimizers.
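For readers who would like to verify such numbers, the following minimal
sketch (our own illustration, not code from the paper) computes the
expected F-measure of a fixed prediction under a given distribution; for
the distribution above it returns 0.5833 for (1110) and 0 for (0000):

import numpy as np

def expected_f_dist(pred, dist):
    # Expected F-measure of a fixed prediction under a distribution given
    # as {label vector (tuple of 0/1 entries): probability mass}.
    total = 0.0
    for y, p in dist.items():
        y = np.array(y)
        den = y.sum() + pred.sum()
        f = 1.0 if den == 0 else 2.0 * (y & pred).sum() / den
        total += p * f
    return total

dist = {(1, 0, 0, 0): 0.5, (0, 1, 1, 1): 0.5}
print(expected_f_dist(np.array([1, 1, 1, 0]), dist))  # 0.5833...
print(expected_f_dist(np.array([0, 0, 0, 0]), dist))  # 0.0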
9. In equation (32), it seems to me that the last summand on the left hand
side should not be present.
This summand needs to be present: P(0_m) is one of the 2^m variables in the
optimization problem, and the term originates from the third sum-term in
the primal Lagrangian when taking the partial derivative w.r.t. P(0_m).
10. In equation (41), the case conditions do not seem to hold for $m=4$.
Indeed, this was a small typo from our side. We have corrected it. This
does not have any consequences for the correctness of the proof.
** MINOR POINTS
1. At the end of section 4, instead of constructing more complicated
examples, one can simply use your Theorem 4.5 and give 1/6 as a regret for
large enough $m$.
The example is actually not meant to show how large the regret can be, but
to provide more insight into what kind of distributions might lead to a
high regret. We believe that this is useful information for readers who
want to skip the proof.
2. The first para on page 17 seems to be important. You may want to include
it in the abstract and/or the introduction.
We like this comment. This is indeed an important paragraph, but we have
the feeling that readers first have to understand the general framework we
work with before being able to appreciate the importance of this paragraph.
3. In Figures 2 and 3, the right graphs can be terminated at around 4000,
since almost all the methods have already reached their asymptotic levels
by then.
Indeed, we could terminate the plots at around 4000. However, in order not
to introduce any additional mistakes/typos into the paper, we will not
change the graphs.
4. The bottom left plot in Figure 3 is important to show the performance of
GFM over FM. More discussion on this will be helpful.
This is an interesting remark. A quick answer is that FM assumes label
independence, and if this assumption is violated then it may not perform
well, not only for the F-measure but also for the Jaccard index. We have
added a short text to this section.
5. In Table 2, Hamming loss, Scene data set, I believe the best method for
$l=100$ is JM instead of MM.
Thanks.
** CLARITY
I suggest the authors do a thorough proof-reading. At several points, the
wordings are rather awkward.
We have read the paper once again and changed a few sentences. As none of
us is a native English speaker, we can imagine that some wordings could be
improved further.
1. In Theorem 3.1, should the set $\mathcal{P}^u_{L_H}$ actually be
$\mathcal{P}^u_H$ to be consistent with Definition 2.2? This actually
occurs at multiple points throughout the paper.
We actually prefer to use the symbol L_H instead of simply H to denote the
Hamming loss. From that perspective the notation is consistent with Def.
2.2.
2. In Theorem 3.1, it seems to me that the requirement for a unique
$F$-measure maximiser is unnecessary, as you have remarked near the end of
section 2.
We have deleted this restriction in Theorem 3.1.
3. In the last para of section 3.3, despite the authors' attempt, I still
find "signal strength" confusing.
a. Do you mean the "predictive performance" is a good indication of "signal
strength", or do you mean the "decrease of the upper bound on the regret
with predictive performance" is a good indication?
b. Please explain the second sentence: what is a "practical situation"? and
what is a hypothetical machine learning method?
c. Again, what do you mean by "noise" in the last sentence?
We have revised the discussion to make these points clearer.
4. The entire section 4.3 is hard to understand, including footnote 2. I
think this is because the authors tried to re-cast empirical utility
maximisation (EUM) (using the term in Ye et al. 2012) into the decision
theory approach (again, using the term in Ye et al. 2012). The EUM
estimates the threshold $\theta$ using training data, but applies the same
threshold on test data. The citations after Theorem 4.5 are based on such
EUM approach that uses training data and test data at separate phases.
However, it seems to me that the authors are talking about estimating the
threshold and then testing on the same data. The authors have to make such
differences clear, because most readers are used to the EUM approach.
The goal of this section is to show that thresholding does not lead to the
optimal result, even if the thresholding method has full knowledge of the
joint distribution. The method of Zhang et al. (2010) performs thresholding
on a validation set in which the observations are not independent.
5. In section 5, several of the $\Delta_{ik}$s are missing their
superscripts. Moreover, the $\delta$ in the display equation above Theorem
5.1 should actually be $\Delta$.
Thanks. We no longer use the superscript anywhere in the manuscript.
6. On page 17, second para, last sentence: I don't see how the
concentration of the joint distribution helps in estimating the required
parameters.
In a problem with $m$ labels we have $m$ marginal probabilities and $2^m$
parameters of the joint distribution to estimate. However, if the joint
distribution is concentrated on only a few label combinations (the other
label combinations having zero probability), say $n$ of them, and if
$n < m$, then estimating the joint distribution is no harder than
estimating the marginal probabilities. Having the joint distribution, we
can easily derive any parameters we need for our algorithms.
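As a small worked example (our own numbers, purely illustrative): for
$m = 10$ labels, the full joint distribution has $2^{10} - 1 = 1023$ free
parameters, against $10$ marginal probabilities. If, however, the mass is
concentrated on only $n = 3$ label combinations, estimating the joint
distribution amounts to estimating $3$ probability masses, which is even
fewer than the $10$ marginals.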
7. In section 6.1, are the training sets used to estimate the parameters
$w_i$s?
We do not estimate the $w_i$ directly. Instead, we use appropriate
frequencies to estimate the required parameters and then run the inference
methods on these estimates; a minimal sketch is given below.
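As an illustration of what we mean by frequency-based estimation (our own
sketch, assuming, as in the GFM setting, that the required quantities are
the probabilities $P(y_i = 1, s_y = s)$, with $s_y$ the number of relevant
labels; the exact parameterization in the paper may differ):

import numpy as np

def estimate_P(Y):
    # Empirical estimate of the matrix P with entries
    # P[i, s-1] = P(y_i = 1, s_y = s), computed by simple frequency
    # counting over a sample Y of 0/1 label vectors (n x m).
    n, m = Y.shape
    s_y = Y.sum(axis=1)          # number of relevant labels per example
    P = np.zeros((m, m))
    for s in range(1, m + 1):
        rows = Y[s_y == s]       # examples with exactly s relevant labels
        if len(rows):
            P[:, s - 1] = rows.sum(axis=0) / n
    return P

# e.g. in kNN, Y would hold the label vectors of the k nearest neighbours
Y = np.array([[1, 0, 0, 0],
              [0, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 0, 0]])
print(estimate_P(Y))

The inference method (here GFM) is then run on this estimate rather than on
explicitly estimated parameters $w_i$.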
8. In section 6.2, does "theoretical analysis" refers to Theorem 3.3?
We have rephrased this paragraph. "Theoretical analysis" refers to Theorem
4.2 and Corollary 4.3.
9. In the first sentence of section 7, by "parameters" do you mean the
parameters of the probabilistic models?
The parameters that are needed to compute the F-measure: marginal
probabilities or matrices P or \Delta.
10. In the intro to section 7, please give citations for macro- and
micro-averaged F-measures.
We have added this.
11. The second sentence in section 7.1.4 can be omitted to prevent
confusion.
We have adequately modified this paragraph.
12. In section 7.1.4, is the [[y_i=1]] the Iverson bracket?
Yes. We have clarified the notation in the text.
13. In section 7.1.4, I suggest to use $t$ as the random variable in the
multinomial regressions to prevent confusion with the actual labels in the
multi-label problem.
Thanks. We have used $t$ instead of $y$ in Section 7.1.4.
14. In section 7.1.4, para 5, please explain the parenthesised clause.
We have extended this clause to be more clear.
15. In section 7.1.4, and in the subsequent mentions, I suggest to use
P-GFM and P-FM instead of EFP and LFP to remind the reader what these
methods stand for.
It is a nice suggestion, but we prefer to keep the notation consistent with
the previous paper.
16. In section 7.2.1 para 5, what is "searching time"?
searching time = nearest neighbor search time (we have changed the text to
make it clear).
17. In section 7.2.3, I suggest to use P-MM instead of BR, since it is
basically mode maximisation.
Similarly to the above: BR is a popular name for this approach, so we have
decided to keep it.
18. Table 5, bold 10.51 for BR/Hamming-Loss.
Thanks.
19. In section 7.3 para 4, what does "blending" mean?
It is another word used for ensemble learning.
APPENDIX
20. In proof of Theorem 3.1, please use either $P_A$ or $P$ throughout.
We need to distinguish between P(y) and P_A(y). The former represents the
variables that are optimized in the mixed integer program; the latter
represents the oracle solution of that program.
21. In proof of Theorem 3.1, should the $y^{ABC}$s be $y^{BCD}$ ?
Corrected.
22. Please explain the phrasing about the individual variable $\lambda_y^-$
at the top of page 40.
We have extended the sentence.
23. In proof of Theorem 3.1, it is not immediately clear that (27) and (28)
are equivalent due to (26).
(27) and (28) are not equivalent. (26), (27), (28) and (29) represent the
four cases that can be distinguished when writing out the system just above
(26) more explicitly. We agree that this was confusing, and we have updated
the text.
24. Please explain where the set of equations at the bottom of page 40
comes from. Are these valid for any $y$ and $y'$?
This follows immediately from the formula above (26). The inequality arises
from omitting the $\mu_j$ terms in that formula.
25. Please explain the first sentence on page 41.
This is explained now more in detail.
26. For the proof of Theorem 3.2, please place the proof of the oracle
solution into a separate lemma.
We don't believe that it is a good idea to add a lemma in the middle of the
proof. In order to state the oracle solution of the optimization problem as
a lemma, one would first need to introduce a lot of notation (the part
written above the formulation as an optimization problem). As a result, we
believe that putting part of the proof into a lemma would only make the
proof less readable.
27. For the sets $\Omega^0$ and $\Omega^1$ on page 45, should these sets be
parameterized by $q$? Similarly, it seems that the mapping $\Psi$ should
also be parameterized by $q$.
Yes, this is a correct remark. We have adjusted this.
28. Instead of using "shifts" the bit, use "toggle" or "switch".
Ok.
29. For proof of Theorem 4.5, say "expected F-measure of this prediction",
instead of omitting "expected".
Ok.