Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Robustness of Bayesian Pool-Based Active Learning Against Prior Misspecification

Nguyen Viet Cuong,1 Nan Ye,2 Wee Sun Lee3

1 Department of Mechanical Engineering, National University of Singapore, Singapore, nvcuong@nus.edu.sg
2 Mathematical Sciences School & ACEMS, Queensland University of Technology, Australia, n.ye@qut.edu.au
3 Department of Computer Science, National University of Singapore, Singapore, leews@comp.nus.edu.sg
Abstract
We study the robustness of active learning (AL) algorithms
against prior misspecification: whether an algorithm achieves
similar performance using a perturbed prior as compared to
using the true prior. In both the average and worst cases of the
maximum coverage setting, we prove that all α-approximate
algorithms are robust (i.e., near α-approximate) if the utility
is Lipschitz continuous in the prior. We further show that robustness may not be achieved if the utility is non-Lipschitz.
This suggests we should use a Lipschitz utility for AL if
robustness is required. For the minimum cost setting, we
can also obtain a robustness result for approximate AL algorithms. Our results imply that many commonly used AL
algorithms are robust against perturbed priors. We then propose the use of a mixture prior to alleviate the problem of
prior misspecification. We analyze the robustness of the uniform mixture prior and show experimentally that it performs
reasonably well in practice.
1 Introduction
In pool-based active learning (AL), training examples are
sequentially selected and labeled from a pool of unlabeled
data, with the aim of obtaining a good classifier using as
few labeled examples as possible (McCallum and Nigam
1998). To achieve computational efficiency, most commonly
used methods greedily select one example at a time based on
some criterion.
In this paper, we consider Bayesian pool-based AL that
assumes data labels are generated from a prior distribution. In theory, the true prior is generally assumed to be
known (Golovin and Krause 2011; Cuong et al. 2013;
Cuong, Lee, and Ye 2014). In practice, it is often unknown
and misspecified; that is, the prior used is different from the true one. This work is the first to investigate the robustness of AL algorithms against prior misspecification, that is, whether an algorithm achieves similar performance using a perturbed prior as compared to using the true prior.
We focus on the analysis of approximate algorithms for
two commonly studied problems: the maximum coverage
problem, which aims to maximize the expected or worst-case utility of the chosen examples given a fixed budget of
queries, and the minimum cost problem, which aims to minimize the expected number of queries needed to identify the true labeling of all examples. We focus on approximate algorithms because previous works have shown that, in general, it is computationally intractable to find the optimal strategy for choosing the examples, while some commonly used AL algorithms can achieve good approximation ratios compared to the optimal strategies (Golovin and Krause 2011; Chen and Krause 2013; Cuong et al. 2013; Cuong, Lee, and Ye 2014). For example, with the version space reduction utility, the maximum Gibbs error algorithm achieves a (1 − 1/e)-approximation of the optimal expected utility (Cuong et al. 2013), while the least confidence algorithm achieves the same approximation of the optimal worst-case utility (Cuong, Lee, and Ye 2014).

Our work shows that many commonly used AL algorithms are robust. In the maximum coverage setting, our main result is that if the utility function is Lipschitz continuous in the prior, all α-approximate algorithms are robust, i.e., they are near α-approximate when using a perturbed prior. More precisely, their performance guarantee on the expected or worst-case utility degrades by at most a constant factor of the ℓ1 distance between the perturbed prior and the true prior. It follows from this result that the maximum Gibbs error and the least confidence algorithms are near (1 − 1/e)-approximate. Our result also implies the robustness of the batch and generalized versions of the maximum Gibbs error algorithm. If the utility is non-Lipschitz, we show that even an optimal algorithm for the perturbed prior may not be robust. This suggests we should use a Lipschitz continuous utility for AL in order to achieve robustness. Similarly, we prove a robustness result for the minimum cost setting that implies the robustness of the generalized binary search AL algorithm (Dasgupta 2004; Nowak 2008).

We also address the difficulty of choosing a good prior in practice. Practically, it is often easier to come up with a set of distributions and combine them using a mixture. Our theoretical results imply robustness when the mixture prior is close to the true prior. In the mixture setting, another interesting question is robustness when the true prior is one of the components of the mixture. In this case, the mixture prior may not necessarily be close to the true prior in terms of ℓ1 distance. We prove that for the uniform mixture prior,
approximate AL algorithms are still robust in the sense that
they are competitive with the optimum performance of the
mixture, which is the performance we expect when modeling. Our experiments show that the uniform mixture performs well in practice.
Related Works: Greedy algorithms for pool-based AL
usually optimize some measure of uncertainty of the selected examples (Settles 2009; Cuong et al. 2013). In the
Bayesian setting, they can be viewed as greedily optimizing
some corresponding average-case or worst-case objective.
For instance, the maximum entropy algorithm (Settles and
Craven 2008), which maximizes the Shannon entropy of the
selected examples, attempts to greedily maximize the policy
entropy in the average case (Cuong et al. 2013). The maximum Gibbs error algorithm, which maximizes the Gibbs error of the selected examples, attempts to greedily maximize
the version space reduction in the average case (Cuong et al.
2013); and the least confidence algorithm, which minimizes
the probability of the most likely label of the selected examples, attempts to maximize the version space reduction in
the worst case (Cuong, Lee, and Ye 2014).
Analyses of these algorithms typically investigate their
near-optimality guarantees in the average or worst case. The
maximum entropy algorithm was shown to have no constant
factor approximation guarantee in the average case (Cuong,
Lee, and Ye 2014). In contrast, the maximum Gibbs error
algorithm has a (1 − 1/e)-factor approximation guarantee
for the average version space reduction objective (Cuong
et al. 2013). This algorithm is a probabilistic version of
the generalized binary search algorithm (Dasgupta 2004;
Golovin and Krause 2011). It can also be applied to the batch
mode setting (Hoi et al. 2006), and was shown to provide a
(1 − e^{−(e−1)/e})-factor approximation to the optimal batch
AL algorithm (Cuong et al. 2013). In the noiseless case, this
batch maximum Gibbs error algorithm is equivalent to the
BatchGreedy algorithm (Chen and Krause 2013).
Cuong, Lee, and Ye (2014) showed the least confidence
algorithm (Culotta and McCallum 2005) has a (1 − 1/e)-factor approximation guarantee with respect to the worst-case version space reduction objective. A similar result in
the worst case was also shown for the generalized maximum
Gibbs error algorithm with an arbitrary loss (Cuong, Lee,
and Ye 2014). These results are due to the pointwise submodularity of version space reduction. AL that exploits submodularity was also studied in (Guillory and Bilmes 2010;
Wei, Iyer, and Bilmes 2015).
2 Preliminaries

Let X be a finite set (a pool) of examples and Y be a finite set of labels. Consider the hypothesis space H def= Y^X consisting of all functions from X to Y. Each hypothesis h ∈ H is a labeling of X. In the Bayesian setting, we assume an unknown true labeling htrue drawn from a prior p0[h] on H. After observing a labeled set D, we obtain the posterior pD[h] def= p0[h | D] using Bayes' rule.

The true labeling htrue may be generated by a complex process rather than directly drawn from a prior p0. For instance, if probabilistic models (e.g., naive Bayes) are used to generate labels for the examples and a prior is imposed on these models instead of the labelings, we can convert this prior into an equivalent prior on the labelings and work with this induced prior. The construction of the induced prior involves computing the probability of labelings with respect to the original prior (Cuong et al. 2013). In practice, we do not need to compute or maintain the induced prior explicitly, as this process is very expensive. Instead, we compute or approximate the AL criteria directly from the original prior on the probabilistic models (Cuong et al. 2013; Cuong, Lee, and Ye 2014).

For any distribution p[h], any example sequence S ⊆ X, and any label sequence y of the same length, p[y; S] denotes the probability that y is the label sequence of S. Formally, p[y; S] def= Σ_h p[h] P[h(S) = y | h], where h(S) = (h(x1), ..., h(xi)) if S = (x1, ..., xi). In the above, P[h(S) = y | h] is the probability that S has label sequence y given the hypothesis h. If h is deterministic, as in our setting, P[h(S) = y | h] = 1(h(S) = y), where 1(·) is the indicator function. Note that p[·; S] is a probability distribution on the label sequences of S. We also write p[y; x] to denote p[{y}; {x}] for x ∈ X and y ∈ Y.

Given a prior, a pool-based AL algorithm is equivalent to a policy for choosing training examples from X. A policy is a mapping from a partial labeling (a labeling of a subset of X) to the next unlabeled example to query. It can be represented by a policy tree whose nodes correspond to unlabeled examples to query, and edges from a node correspond to its labels. When an unlabeled example is queried, we receive its true label according to htrue. We will focus on adaptive policies, which use the observed labels of previously chosen examples to query the next unlabeled example.
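To make these definitions concrete, the following is a minimal Python sketch that materializes a prior over labelings as a dictionary and computes the posterior p_D and the marginal p[y; x] by brute force. The representation and all names are our own illustration, and, as noted above, practical implementations avoid constructing the induced prior explicitly; this is only feasible for tiny pools.

```python
from itertools import product

def make_uniform_prior(num_examples, labels):
    """Uniform prior over all labelings h in H = Y^X (tuples of labels)."""
    hyps = list(product(labels, repeat=num_examples))
    return {h: 1.0 / len(hyps) for h in hyps}

def condition(prior, observed):
    """Posterior p_D[h] = p_0[h | D] for deterministic hypotheses: keep
    labelings consistent with D = [(x, y), ...] and renormalize."""
    post = {h: p for h, p in prior.items()
            if all(h[x] == y for x, y in observed)}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

def label_prob(dist, y, x):
    """p[y; x]: probability that example x has label y under dist."""
    return sum(p for h, p in dist.items() if h[x] == y)

prior = make_uniform_prior(3, (0, 1))   # pool of 3 examples, binary labels
post = condition(prior, [(0, 1)])       # observe that example 0 has label 1
print(label_prob(post, 1, 2))           # -> 0.5 under the uniform prior
```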
3 Robustness: Maximum Coverage Problem
We now consider the robustness of AL algorithms for the maximum coverage problem: find an adaptive policy maximizing the expected or worst-case utility given a budget of k queries (Cuong et al. 2013; Cuong, Lee, and Ye 2014). The utility is a non-negative function f(S, h) : 2^X × H → ℝ≥0. Intuitively, a utility measures the value of querying the examples S when the true labeling is h. Utilities for AL usually depend on the prior, so we shall use the notation f_p(S, h) to denote that the utility f_p depends on a distribution p over H.
f_p is said to be Lipschitz continuous (in the prior) with a Lipschitz constant L if for any S, h, and two priors p and p′,

    |f_p(S, h) − f_{p′}(S, h)| ≤ L ‖p − p′‖,   (1)

where ‖p − p′‖ def= Σ_h |p[h] − p′[h]| is the ℓ1 distance between p and p′. Lipschitz continuity implies boundedness.¹

¹ Choose an arbitrary p′; then for any p, S, h, we have f_p(S, h) ≤ f_{p′}(S, h) + L‖p − p′‖ ≤ f_{p′}(S, h) + 2L ≤ max_{S,h} f_{p′}(S, h) + 2L. A lower bound can be obtained similarly.

An AL algorithm is a mapping from a utility and a prior to a policy. Let x^π_h denote the set of examples selected by a policy π when the true labeling is h. We now analyze the robustness of AL algorithms for both the average case and the worst case.
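To ground definition (1), here is a small sketch of the ℓ1 distance between priors, using the same dictionary representation as the Section 2 sketch (names are ours). One can also check, for example, that the version space reduction utility introduced in Section 3.1 below satisfies (1) with L = 1, since a marginal probability such as p[h(S); S] can change by at most ‖p − p′‖ under a prior perturbation.

```python
def l1_distance(p, q):
    """||p - q|| = sum_h |p[h] - q[h]| for priors given as dicts."""
    hyps = set(p) | set(q)
    return sum(abs(p.get(h, 0.0) - q.get(h, 0.0)) for h in hyps)

# By (1), a Lipschitz utility can change by at most L * l1_distance(p0, p1)
# at any (S, h) when the prior is perturbed from p0 to p1.
```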
3.1 The Average Case
In this case, our objective is to find a policy with maximum expected utility. If p0 is the true prior, the expected utility of a policy π is f^avg_{p0}(π) def= E_{h∼p0}[f_{p0}(x^π_h, h)].

We consider the case where we have already chosen a utility but still need to choose the prior. In practice, the choice is often subjective and may not be the true prior. A natural question is: if we choose a perturbed prior p1 (i.e., a prior not very different from the true prior p0 in terms of ℓ1 distance), can an AL algorithm achieve performance competitive with that obtained using the true prior?

Our first robustness result is for α-approximate algorithms, which return an α-approximate policy of the optimal one. Formally, an average-case α-approximate (0 < α ≤ 1) algorithm A outputs, for any prior p, a policy A(p) satisfying

    f^avg_p(A(p)) ≥ α max_π f^avg_p(π).

When α = 1, we call A an exact algorithm. For notational convenience, we drop the dependency of A on the utility, as we assume a fixed utility here. We consider approximate algorithms because practical algorithms are generally approximate, due to the computational intractability of the problem. We have the following robustness result.

Theorem 1. Assume the utility f_p is Lipschitz continuous with a Lipschitz constant L. Let M be an upper bound of f_p. If A is an average-case α-approximate algorithm, then for any true prior p0 and any perturbed prior p1,
    f^avg_{p0}(A(p1)) ≥ α max_π f^avg_{p0}(π) − (α + 1)(L + M) ‖p1 − p0‖.
Thus, A is robust in the sense that it returns a near α-approximate policy when using a perturbed prior.
Proof. Let C = L + M. For any policy π, note that

    |f^avg_{p0}(π) − f^avg_{p1}(π)|
      = |(Σ_h p0[h] f_{p0}(x^π_h, h) − Σ_h p0[h] f_{p1}(x^π_h, h)) + (Σ_h p0[h] f_{p1}(x^π_h, h) − Σ_h p1[h] f_{p1}(x^π_h, h))|
      ≤ C ‖p1 − p0‖,
where the last inequality holds due to the Lipschitz continuity and boundedness of the utility function f_p. Thus, if π1 = argmax_π f^avg_{p1}(π) and π0 = argmax_π f^avg_{p0}(π), it follows that

    f^avg_{p1}(π1) ≥ f^avg_{p1}(π0) ≥ f^avg_{p0}(π0) − C ‖p1 − p0‖, and
    f^avg_{p0}(π) ≥ f^avg_{p1}(π) − C ‖p1 − p0‖ for all π.
Hence,
    f^avg_{p0}(A(p1)) ≥ f^avg_{p1}(A(p1)) − C ‖p1 − p0‖
                      ≥ α f^avg_{p1}(π1) − C ‖p1 − p0‖
                      ≥ α (f^avg_{p0}(π0) − C ‖p1 − p0‖) − C ‖p1 − p0‖
                      = α max_π f^avg_{p0}(π) − C (α + 1) ‖p1 − p0‖,

where the first and third inequalities are from the above discussions and the second inequality holds as A is α-approximate.
In the above, f^avg_{p0}(A(p1)) is the expected utility of the policy returned by A using p1 as prior. The expected utility is always computed with respect to the true prior p0. Theorem 1 shows that when we use a perturbed prior p1, the expected utility achieved by an average-case α-approximate algorithm degrades by at most a constant factor of the ℓ1 distance between the perturbed prior and the true prior.

Application to Maximum Gibbs Error: Theorem 1 implies the robustness of the maximum Gibbs error algorithm (Cuong et al. 2013). This algorithm greedily selects the next example x* satisfying x* = argmax_x E_{y∼pD[·;x]}[1 − pD[y; x]], where pD is the current posterior and pD[y; x] is the probability (w.r.t. pD) that x has label y. In the binary-class and noiseless setting, it is equivalent to the generalized binary search algorithm (Dasgupta 2004; Nowak 2008; Golovin and Krause 2011). Consider the version space reduction utility f_p(S, h) def= 1 − p[h(S); S], where p[h(S); S] is the probability (w.r.t. p) that S has the labels h(S). We have the following corollary about the algorithm. The proofs of this corollary and the remaining results are in the supplementary material.

Corollary 1. If A is the maximum Gibbs error algorithm, then for any true prior p0 and any perturbed prior p1,

    f^avg_{p0}(A(p1)) ≥ (1 − 1/e) max_π f^avg_{p0}(π) − (4 − 2/e) ‖p1 − p0‖.
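To make the selection rule concrete, here is a minimal Python sketch of the maximum Gibbs error criterion over an explicit posterior in the dictionary representation of the Section 2 sketch. The function names are our own illustration, not code from the paper.

```python
def predictive(posterior, x):
    """Marginal label distribution p_D[. ; x] of example x under a
    posterior over labelings (dict mapping labeling tuples to probs)."""
    probs = {}
    for h, p in posterior.items():
        probs[h[x]] = probs.get(h[x], 0.0) + p
    return probs

def max_gibbs_error_query(posterior, unlabeled):
    """argmax_x E_{y ~ p_D[.;x]}[1 - p_D[y; x]]: query the example with
    the largest Gibbs error under the current posterior."""
    def gibbs_error(x):
        return sum(p * (1.0 - p) for p in predictive(posterior, x).values())
    return max(unlabeled, key=gibbs_error)
```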
Application to Batch Maximum Gibbs Error: We can also obtain a robustness result for the batch version of the maximum Gibbs error algorithm. In the batch setting, the AL algorithm queries a batch of examples in each iteration instead of only one example (Hoi et al. 2006). The batch maximum Gibbs error algorithm is described in Algorithm 1 of (Cuong et al. 2013), and by Theorem 5 of the same work, it is a (1 − e^{−(e−1)/e})-approximate algorithm for the version space reduction utility above. If we restrict the policies to only those in the batch setting, then from Theorem 1, we have the following corollary. Note that the range of the max operation in the corollary is restricted to only batch policies.

Corollary 2. If A is the batch maximum Gibbs error algorithm, then for any true prior p0 and any perturbed prior p1,

    f^avg_{p0}(A(p1)) ≥ (1 − e^{−(e−1)/e}) max_π f^avg_{p0}(π) − (4 − 2e^{−(e−1)/e}) ‖p1 − p0‖.

3.2 The Worst Case

In this case, our objective is to find a policy with maximum worst-case utility. If p0 is the true prior, the worst-case utility of a policy π is f^worst_{p0}(π) def= min_h [f_{p0}(x^π_h, h)].

An algorithm A is a worst-case α-approximate algorithm (0 < α ≤ 1) if for any prior p, we have f^worst_p(A(p)) ≥ α max_π f^worst_p(π). When α = 1, A is an exact algorithm.

For worst-case α-approximate algorithms, we can obtain a robustness result similar to Theorem 1.

Theorem 2. Assume f_p is Lipschitz continuous with a Lipschitz constant L. If A is a worst-case α-approximate algorithm, then for any true prior p0 and any perturbed prior p1,

    f^worst_{p0}(A(p1)) ≥ α max_π f^worst_{p0}(π) − (α + 1) L ‖p1 − p0‖.
The worst-case utility is also computed with respect to the true prior p0 (i.e., using f^worst_{p0} instead of f^worst_{p1}). Theorem 2 shows that when we use a perturbed prior, the worst-case utility achieved by a worst-case α-approximate algorithm degrades by at most a constant factor of the ℓ1 distance between the perturbed prior and the true prior.
Application to Least Confidence: Theorem 2 implies the robustness of the well-known least confidence AL algorithm (Lewis and Gale 1994; Culotta and McCallum 2005) with perturbed priors. This algorithm greedily selects the next example x* satisfying x* = argmin_x {max_{y∈Y} pD[y; x]}. If f_p is the version space reduction utility (considered previously for the maximum Gibbs error algorithm), we have the following corollary.
Corollary 3. If A is the least confidence algorithm, then for any true prior p0 and any perturbed prior p1,

    f^worst_{p0}(A(p1)) ≥ (1 − 1/e) max_π f^worst_{p0}(π) − (2 − 1/e) ‖p1 − p0‖.
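The least confidence rule is a one-line change from the Gibbs error sketch above; here predictive is any function returning the marginal label distribution {y: p_D[y; x]}, such as the one defined in that sketch (names are ours).

```python
def least_confidence_query(predictive, unlabeled):
    """argmin_x max_y p_D[y; x]: query the example whose most likely
    label we are least confident about."""
    return min(unlabeled, key=lambda x: max(predictive(x).values()))
```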
Application to Generalized Maximum Gibbs Error: Theorem 2 also implies the robustness of the worst-case generalized Gibbs error algorithm (Cuong, Lee, and Ye 2014) with a bounded loss. Intuitively, the algorithm greedily maximizes in the worst case the total generalized version space reduction, which is defined as

    t_p(S, h) def= Σ_{h′,h″ : h′(S)≠h(S) or h″(S)≠h(S)} p[h′] L(h′, h″) p[h″],

where L is a non-negative loss function between labelings that satisfies L(h, h′) = L(h′, h) and L(h, h) = 0 for all h, h′. The worst-case generalized Gibbs error algorithm attempts to greedily maximize t^worst_{p0}(π) def= min_h t_{p0}(x^π_h, h), and it is a worst-case (1 − 1/e)-approximate algorithm for this objective (Cuong, Lee, and Ye 2014).

If we assume L is upper bounded by a constant m, we have the following corollary about this algorithm. Note that the bounded loss assumption is reasonable since it holds for various practical loss functions such as the Hamming loss or the F1 loss, which is 1 − F1(h, h′), where F1(h, h′) is the F1 score between h and h′.
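Before stating the corollary, here is a brute-force sketch of t_p(S, h) under the explicit-prior representation used in the earlier sketches; it enumerates O(|H|²) hypothesis pairs, so it is only an illustration, and it follows the disagreement condition as reconstructed above (names are ours).

```python
def total_generalized_vs_reduction(prior, loss, S, h):
    """t_p(S, h): total loss-weighted mass of hypothesis pairs removed
    from the version space once S is observed to be labeled h(S).

    prior: dict labeling -> probability; loss(h1, h2): symmetric loss
    with loss(h, h) = 0; S: iterable of example indices."""
    def consistent(g):
        return all(g[x] == h[x] for x in S)
    return sum(p1 * loss(h1, h2) * p2
               for h1, p1 in prior.items()
               for h2, p2 in prior.items()
               if not (consistent(h1) and consistent(h2)))

# Example loss: Hamming distance between labelings.
hamming = lambda h1, h2: sum(a != b for a, b in zip(h1, h2))
```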
Corollary 4. If A is the worst-case generalized Gibbs error algorithm and the loss function of interest is upper bounded by a constant m ≥ 0, then for any true prior p0 and any perturbed prior p1,

    t^worst_{p0}(A(p1)) ≥ (1 − 1/e) max_π t^worst_{p0}(π) − m (4 − 2/e) ‖p1 − p0‖.

3.3 Discussions

We emphasize that our results are important as they enhance our understanding and confidence about existing AL algorithms. Furthermore, if the utility we want to maximize is not Lipschitz continuous, then even an exact AL algorithm for perturbed priors may not be robust, both in the average and worst cases. We prove this in Theorem 3 below.

Theorem 3. For both the average and worst cases, there is an AL problem with a non-Lipschitz utility such that: for any C, α, ε > 0, there exist a perturbed prior p1 satisfying 0 < ‖p1 − p0‖ < ε and an exact algorithm A* satisfying

    f^c_{p0}(A*(p1)) < α max_π f^c_{p0}(π) − C ‖p1 − p0‖,

where f^c_{p0} ∈ {f^avg_{p0}, f^worst_{p0}} respectively.

This theorem and our results above suggest we should use a Lipschitz utility for AL to maintain robustness.

By taking p1 = p0, Corollaries 1 and 2 can recover the approximation ratios for the maximum Gibbs error and batch maximum Gibbs error algorithms in Theorems 4 and 5 of (Cuong et al. 2013) respectively. Similarly, Corollaries 3 and 4 can recover the ratios for the least confidence and generalized Gibbs error algorithms in Theorems 5 and 8 of (Cuong, Lee, and Ye 2014) respectively. Thus, our corollaries are generalizations of these previous theorems.

If A is α-approximate (in the average or worst case) with an optimal constant α under some computational complexity assumption (Golovin and Krause 2011), then it is also optimal in our theorems under the same assumption. This can be proven easily by contradiction and setting p1 = p0.

If we are only interested in some particular prior p0 and the perturbed priors within a neighborhood of p0, we can relax the Lipschitz assumption (1) to the locally Lipschitz assumption at p0: there exist L and δ such that for all S, h, and p, if ‖p0 − p‖ < δ, then |f_{p0}(S, h) − f_p(S, h)| ≤ L ‖p0 − p‖. Under this relaxed assumption, the theorems and corollaries above still hold for any p1 satisfying ‖p0 − p1‖ < δ.
4 Robustness: Minimum Cost Problem

In this section, we investigate the robustness of AL algorithms for the minimum cost problem in the average case: find an adaptive policy minimizing the expected number of queries to identify the true labeling htrue (Golovin and Krause 2011). This problem assumes htrue is drawn from a prior p0 on a small hypothesis space H (i.e., H does not need to contain all functions from X to Y). After we make a query and observe a label, all the hypotheses inconsistent with the observed label are removed from the current hypothesis space (also called the version space). We stop when there is only one hypothesis htrue left.

We do not consider this problem in the worst case because even the optimal worst-case algorithm may not be robust.² For instance, if the true prior gives probability 1 to one correct hypothesis but the perturbed prior gives positive probabilities to all the hypotheses, then the cost of using the true prior is 0 while the cost of using the perturbed prior is |X|.

² The Lipschitz assumption is not satisfied in this setting.

For any policy π and hypothesis h, let c(π, h) be the cost of identifying h when running π. This is the length of the path corresponding to h in the policy tree of π. For any prior p0 and policy π, the expected cost of π with respect to the prior p0 is defined as c^avg_{p0}(π) def= E_{h∼p0}[c(π, h)].
We will consider α(p)-approximate algorithms that return
a policy whose expected cost is within an α(p)-factor of the
optimal one. Formally, for any prior p, an α(p)-approximate
(α(p) ≥ 1) algorithm A outputs a policy A(p) satisfying
    c^avg_p(A(p)) ≤ α(p) min_π c^avg_p(π).
Note that α(p) may depend on the prior p. When α(p) = 1,
A is an exact algorithm. We have the following robustness
result for the minimum cost problem in the average case.
Theorem 4. Assume c(π, h) is upper bounded by a constant K for all π, h. If A is an α(p)-approximate algorithm, then for any true prior p0 and any perturbed prior p1,

    c^avg_{p0}(A(p1)) ≤ α(p1) min_π c^avg_{p0}(π) + (α(p1) + 1) K ‖p1 − p0‖.
The assumption that c(π, h) ≤ K for all π, h is reasonable since c(π, h) ≤ |X| for all π, h. When H is small, K can be much smaller than |X|.
Application to Generalized Binary Search: Theorem 4 implies the robustness of the generalized binary search algorithm, which is known to be (ln(1/min_h p[h]) + 1)-approximate (Golovin and Krause 2011). The result is stated in the corollary below. By taking p1 = p0, this corollary can recover the previous result by Golovin and Krause (2011) for the generalized binary search algorithm.
Corollary 5. Assume c(π, h) is upper bounded by K for all π, h. If A is the generalized binary search algorithm, then for any true prior p0 and any perturbed prior p1,

    c^avg_{p0}(A(p1)) ≤ (ln(1/min_h p1[h]) + 1) min_π c^avg_{p0}(π) + (ln(1/min_h p1[h]) + 2) K ‖p1 − p0‖.
Theorem 4 can also yield robustness results for problems other than AL. For instance, it implies the robustness of the RAId algorithm for the adaptive informative path planning problem (Lim, Hsu, and Lee 2015).
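As an illustration of the algorithm this section analyzes, here is a minimal sketch of generalized binary search for binary labels with the dictionary-based prior from the Section 2 sketch; the balanced-split selection rule and all names are our own, hedged rendering of the standard algorithm.

```python
def generalized_binary_search(prior, pool, oracle):
    """Greedy GBS: query the example whose prior-mass split between the
    two labels is most balanced, remove inconsistent hypotheses, and
    repeat until one hypothesis remains. Assumes binary labels {0, 1};
    oracle(x) returns the true label of x."""
    version = dict(prior)        # current version space with prior mass
    remaining = set(pool)
    while len(version) > 1 and remaining:
        z = sum(version.values())
        def split(x):
            mass_one = sum(p for h, p in version.items() if h[x] == 1) / z
            return abs(mass_one - 0.5)   # closest to an even split wins
        x = min(remaining, key=split)
        remaining.discard(x)
        y = oracle(x)
        version = {h: p for h, p in version.items() if h[x] == y}
    return max(version, key=version.get)
```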
5 Mixture Prior
Let us consider methods that minimize a regularized loss.
These methods are commonly used and known to be equivalent to finding the maximum a posteriori hypothesis with
an appropriate prior. In practice, the best regularization constant is usually unknown, and a common technique (in passive learning) is to split the available data set into a training
and a validation set, which is used to select the best regularization constant based on performance of the algorithm
trained on the training set. As this method is effective in
practice, we construct a Bayesian version and study its performance, particularly the robustness, when used with AL.
We assume that we have a candidate set of prior distributions
corresponding to different regularization constants, and the
true hypothesis is randomly generated by first randomly selecting a distribution and then selecting a hypothesis using
that distribution. This corresponds to assuming that the prior
distribution is the mixture distribution. For simplicity, we
consider the uniform mixture in this work.
First, we note that optimizing the expected cost of the mixture directly has a lower expected cost than trying to separately identify the appropriate component (corresponding to using a validation set in passive learning) and the best hypothesis given the component (corresponding to using the training set). Hence, we would expect the method to perform favorably in comparison to passive learning when the mixture prior is the true prior.

Results in earlier sections assure us that the method is near optimal when the mixture prior is incorrect but generates hypotheses with probabilities similar to the true prior. What if the true prior is far from the mixture distribution in the ℓ1 distance? In particular, we are interested in the case where the true distribution is one of the mixture components, rather than the mixture itself. The following theorem provides bounds on the performance in such cases. We note that the theorem holds for general priors that may vary in form (e.g., with different probability mass functions) and is not restricted to priors corresponding to regularization constants.

The bounds show that the performance of the mixture is competitive with that of the optimal algorithm, although the constant can be large if some hypotheses have small probabilities under the true distribution. We also provide an absolute bound (instead of a competitive bound), which may be more informative in cases where there are hypotheses with small probabilities. This bound (the first bound in Theorem 5) shows that the expected cost is within a constant factor of the optimal expected cost of the mixture, which is the expected cost we would have to pay if our model were correct. The optimal expected cost of the mixture is in turn better than the expected cost of any two-stage identification procedure that identifies the component and the hypothesis given the component separately, assuming the expectation is taken with respect to the mixture.

Theorem 5. If A is an α(p)-approximate algorithm for the minimum cost problem, then for any true prior p0 and any k-component uniform mixture prior p1 = Σ_{i=1}^k (1/k) p1,i such that p0 ∈ {p1,i}_{i=1}^k, we have

    c^avg_{p0}(A(p1)) ≤ k α(p1) min_π c^avg_{p1}(π), and
    c^avg_{p0}(A(p1)) ≤ α(p1) ((k − 1)/min_h p0[h] + 1) min_π c^avg_{p0}(π).

As a result, if A is generalized binary search, then

    c^avg_{p0}(A(p1)) ≤ k (ln(k/min_h p0[h]) + 1) min_π c^avg_{p1}(π), and
    c^avg_{p0}(A(p1)) ≤ (ln(k/min_h p0[h]) + 1) ((k − 1)/min_h p0[h] + 1) min_π c^avg_{p0}(π).

The algorithm for greedy AL with the mixture model is shown in Algorithm 1. In the algorithm, the unlabeled example x* can be chosen using any AL criterion. The criterion can be computed from the weights and posteriors obtained from the previous iteration. For instance, if the maximum Gibbs error algorithm is used, then at iteration t, we have x* = argmax_x E_{y∼p[·;x]}[1 − p[y; x]], where p[y; x] = Σ_{i=1}^n w^i_{t−1} p^i_{t−1}[y; x]. After x* is chosen, we query its label y* and update the new weights and posteriors, which are always normalized so that Σ_i w^i_t = 1 for all t and Σ_h p^i_t[h] = 1 for all i and t. The algorithm returns the final weights and posteriors, which can be used to make predictions on new examples. More specifically, the predicted label of a new example x is argmax_y Σ_{i=1}^n w^i_k p^i_k[y; x].

Algorithm 1: Active learning for the mixture prior model

    Input: A set of n priors {p¹, p², ..., pⁿ}, the initial normalized weights {w¹, w², ..., wⁿ} for the priors, and the budget of k queries.
    p^i_0 ← p^i; w^i_0 ← w^i for all i = 1, 2, ..., n;
    for t = 1 to k do
        Choose an unlabeled example x* based on an AL criterion;
        y* ← Query-label(x*);
        Update and normalize the weights: w^i_t ∝ w^i_{t−1} · p^i_{t−1}[y*; x*] for all i = 1, 2, ..., n;
        Update each posterior individually using Bayes' rule: p^i_t[h] ∝ p^i_{t−1}[h] · P[h(x*) = y* | h] for each i = 1, 2, ..., n and h ∈ H;
    end for
    return {p¹_k, p²_k, ..., pⁿ_k} and {w¹_k, w²_k, ..., wⁿ_k};

We note that Algorithm 1 does not require the hypotheses to be deterministic. In fact, the algorithm can be used with probabilistic hypotheses where P[h(x) = y | h] ∈ [0, 1]. We also note that computing p^i_t[y; x] for a posterior p^i_t can be expensive. In this work, we approximate it using the MAP hypothesis. In particular, we assume p^i_t[y; x] ≈ p^i_MAP[y; x], the probability that x has label y according to the MAP hypothesis of the posterior p^i_t. This is used to approximate both the AL criterion and the predicted label of a new example.
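A hedged Python rendering of Algorithm 1 under the same explicit-labeling representation as the earlier sketches (the experiments instead use parametric priors and the MAP approximation described above); all names are ours.

```python
def mixture_active_learning(priors, weights, pool, oracle, budget, criterion):
    """A sketch of Algorithm 1 with explicit priors over labelings.

    priors: list of dicts (labeling tuple -> probability); weights: list
    of floats summing to 1; criterion(predictive, pool) returns the next
    query; oracle(x) returns the observed label y*."""
    posteriors = [dict(p) for p in priors]
    w = list(weights)
    pool = set(pool)
    for _ in range(budget):
        def predictive(x):
            # Mixture marginal p[y; x] = sum_i w_i * p_i[y; x].
            probs = {}
            for wi, post in zip(w, posteriors):
                for h, p in post.items():
                    probs[h[x]] = probs.get(h[x], 0.0) + wi * p
            return probs
        x = criterion(predictive, pool)
        pool.discard(x)
        y = oracle(x)
        # w_t^i proportional to w_{t-1}^i * p_{t-1}^i[y*; x*], normalized.
        likes = [sum(p for h, p in post.items() if h[x] == y)
                 for post in posteriors]
        w = [wi * li for wi, li in zip(w, likes)]
        z = sum(w)
        w = [wi / z for wi in w]
        # p_t^i[h] proportional to p_{t-1}^i[h] * 1(h(x*) = y*) (Bayes' rule).
        posteriors = [{h: p / li for h, p in post.items() if h[x] == y}
                      if li > 0 else post
                      for post, li in zip(posteriors, likes)]
    return posteriors, w
```

For instance, the maximum Gibbs error criterion from the Section 3.1 sketch plugs in as criterion = lambda pred, pool: max(pool, key=lambda x: sum(p * (1 - p) for p in pred(x).values())).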
Table 1: AUCs of the maximum Gibbs error algorithm with 1/σ² = 0.01, 0.1, 1, 10 and the mixture prior model on the 20 Newsgroups data set (upper half) and the UCI data sets (lower half). Double asterisks (**) indicate the best score, while an asterisk (*) indicates the second best score in a row (excluding the last column). The last column shows the AUCs of passive learning with the mixture prior model for comparison.

Data set                                  | 0.01    | 0.1     | 1       | 10      | Mixture | Mixture (Passive)
alt.atheism/comp.graphics                 | 87.60** | 87.25   | 84.94   | 81.46   | 87.33*  | 83.92
talk.politics.guns/talk.politics.mideast  | 80.71** | 79.28   | 74.57   | 66.76   | 79.49*  | 76.34
comp.sys.mac.hardware/comp.windows.x      | 78.75** | 78.21*  | 75.07   | 70.54   | 78.21*  | 75.02
rec.motorcycles/rec.sport.baseball        | 86.20** | 85.39   | 82.23   | 77.35   | 85.59*  | 81.56
sci.crypt/sci.electronics                 | 78.08** | 77.35   | 73.92   | 68.72   | 77.42*  | 73.08
sci.space/soc.religion.christian          | 86.09** | 85.12   | 81.48   | 75.51   | 85.50*  | 80.31
soc.religion.christian/talk.politics.guns | 86.16** | 85.01   | 80.91   | 74.03   | 85.46*  | 81.81
Average (20 Newsgroups)                   | 83.37** | 82.52   | 79.02   | 73.48   | 82.71*  | 78.86
Adult                                     | 79.38   | 80.15   | 80.39** | 79.68   | 80.18*  | 77.41
Breast cancer                             | 88.28*  | 88.37** | 86.95   | 83.82   | 88.14   | 89.07
Diabetes                                  | 65.09*  | 64.53   | 64.39   | 65.48** | 64.82   | 64.24
Ionosphere                                | 82.80*  | 82.76   | 81.48   | 77.88   | 82.95** | 81.91
Liver disorders                           | 66.31** | 64.16   | 61.42   | 58.42   | 64.73*  | 65.89
Mushroom                                  | 90.73** | 89.56   | 84.14   | 82.94   | 90.33*  | 73.38
Sonar                                     | 66.75** | 65.45*  | 63.74   | 60.81   | 65.00   | 66.53
Average (UCI)                             | 77.05** | 76.43   | 74.64   | 72.72   | 76.59*  | 74.06
[Figure 1: Average AUCs for passive learning and the maximum Gibbs error AL algorithm with 1/σ² = 0.01, 0.1, 0.2, 1, and 10 on the 20 Newsgroups (a) and UCI (b) data sets.]
6 Experiments

We report experimental results with different priors and the mixture prior. We use the logistic regression model with different L2 regularizers, which are well known to impose a Gaussian prior with mean zero and variance σ² on the parameter space. Thus, we can consider different priors by varying the variance σ² of the regularizer. We consider two experiments with the maximum Gibbs error algorithm. Since our data sets are all binary classification, this algorithm is equivalent to the least confidence and the maximum entropy algorithms. In our first experiment, we compare models that use different priors (equivalently, regularizers). In the second experiment, we run the uniform mixture prior model and compare it with models that use only one prior. For AL, we randomly choose the first 10 examples as a seed set. The scores are averaged over 100 runs of the experiments with different seed sets.
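For readers reproducing this setup, MAP estimation of logistic regression under a Gaussian prior N(0, σ²I) coincides with L2-regularized logistic regression with regularization weight 1/σ². A sketch of this correspondence with scikit-learn, which is our choice of library (the paper does not specify an implementation):

```python
from sklearn.linear_model import LogisticRegression

# scikit-learn's L2-penalized logistic regression minimizes
#   0.5 * ||w||^2 + C * sum_i log-loss_i,
# so a Gaussian prior N(0, sigma^2 I) on w corresponds to C = sigma^2.
# 1/sigma^2 ranges over the values used in the experiments.
models = {inv_var: LogisticRegression(C=1.0 / inv_var)
          for inv_var in [0.01, 0.1, 0.2, 1, 10]}
# models[0.1].fit(X_labeled, y_labeled)  # fit on the currently labeled pool
```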
6.1 Experiment with Different Priors

We run maximum Gibbs error with 1/σ² = 0.01, 0.1, 0.2, 1, 10 on tasks from the 20 Newsgroups and UCI data sets (Joachims 1996; Bache and Lichman 2013) shown in the first column of Table 1. Figure 1 shows the average areas under the accuracy curves (AUC) on the first 150 selected examples for the different regularizers. Figures 1a and 1b give the average AUCs (computed on a separate test set) for the 20 Newsgroups and UCI data sets respectively. We also compare the scores for AL with passive learning.

From Figure 1, AL is better than passive learning for all the regularizers. When the regularizers are close to each other (e.g., 1/σ² = 0.1 and 0.2), the corresponding scores tend to be close. When they are farther apart (e.g., 1/σ² = 0.1 and 10), the scores also tend to be far from each other. In some sense, this confirms our results in the previous sections.
6.2 Experiment with Mixture Prior
We investigate the performance of the mixture prior model proposed in Algorithm 1. For AL, it is often infeasible to use a validation set to choose the regularizers beforehand because we do not initially have any labeled data, so using the mixture prior is a reasonable choice in this case.

We run the uniform mixture prior with regularizers 1/σ² = 0.01, 0.1, 1, 10 and compare it with models that use only one of these regularizers. Table 1 shows the AUCs of the first 150 selected examples for these models on the 20 Newsgroups and the UCI data sets.

From the results, the mixture prior model achieves the second best AUCs for all tasks in the 20 Newsgroups data set. For the UCI data set, the model achieves the best score on Ionosphere and the second best scores on three other tasks. For the remaining three tasks, it achieves the third best scores. On average, the mixture prior model achieves the second best scores for both data sets. Thus, the model performs reasonably well given that we do not know which regularizer is best for the data. We also note that if a bad regularizer is used (e.g., 1/σ² = 10), AL may be even worse than passive learning with the mixture prior.
7 Conclusion
We proved new robustness bounds for AL with perturbed
priors that can be applied to various AL algorithms used in
practice. We showed that if the utility is not Lipschitz, an optimal algorithm on perturbed priors may not be robust. Our
results suggest that we should use a Lipschitz utility for AL
if robustness is required. We also proved novel robustness
bounds for a uniform mixture prior and showed experimentally that this prior is reasonable in practice.
Acknowledgments. We gratefully acknowledge the support of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through
the Australian Research Council Centre of Excellence for
Mathematical and Statistical Frontiers (ACEMS), and of
QUT through a Vice Chancellor’s Research Fellowship. We
also gratefully acknowledge the support of Singapore MOE
AcRF Tier Two grant R-265-000-443-112.
References

Bache, K., and Lichman, M. 2013. UCI machine learning repository. University of California, Irvine.
Chen, Y., and Krause, A. 2013. Near-optimal batch mode active learning and adaptive submodular optimization. In ICML.
Culotta, A., and McCallum, A. 2005. Reducing labeling effort for structured prediction tasks. In AAAI.
Cuong, N. V.; Lee, W. S.; Ye, N.; Chai, K. M. A.; and Chieu, H. L. 2013. Active learning for probabilistic hypotheses using the maximum Gibbs error criterion. In NIPS.
Cuong, N. V.; Lee, W. S.; and Ye, N. 2014. Near-optimal adaptive pool-based active learning with general loss. In UAI.
Dasgupta, S. 2004. Analysis of a greedy active learning strategy. In NIPS.
Forsyth, R. 1990. PC/Beagle user's guide. BUPA Medical Research Ltd.
Golovin, D., and Krause, A. 2011. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks.
Guillory, A., and Bilmes, J. 2010. Interactive submodular set cover. In ICML.
Hoi, S. C.; Jin, R.; Zhu, J.; and Lyu, M. R. 2006. Batch mode active learning and its application to medical image classification. In ICML.
Joachims, T. 1996. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. DTIC Document.
Kohavi, R. 1996. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD.
Lewis, D. D., and Gale, W. A. 1994. A sequential algorithm for training text classifiers. In SIGIR.
Lim, Z. W.; Hsu, D.; and Lee, W. S. 2015. Adaptive informative path planning in metric spaces. International Journal of Robotics Research.
McCallum, A., and Nigam, K. 1998. Employing EM and pool-based active learning for text classification. In ICML.
Nowak, R. 2008. Generalized binary search. In Annual Allerton Conference on Communication, Control, and Computing.
Schlimmer, J. C. 1987. Concept acquisition through representational adjustment. University of California, Irvine.
Settles, B., and Craven, M. 2008. An analysis of active learning strategies for sequence labeling tasks. In EMNLP.
Settles, B. 2009. Active learning literature survey. University of Wisconsin, Madison.
Sigillito, V. G.; Wing, S. P.; Hutton, L. V.; and Baker, K. B. 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest.
Smith, J. W.; Everhart, J. E.; Dickson, W. C.; Knowler, W. C.; and Johannes, R. S. 1988. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In SCAMC.
Wei, K.; Iyer, R.; and Bilmes, J. 2015. Submodularity in data subset selection and active learning. In ICML.
Wolberg, W. H., and Mangasarian, O. L. 1990. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. PNAS.