Extension of TSVM to Multi-Class and Hierarchical Text S

advertisement
arXiv:1211.0210v1 [cs.LG] 1 Nov 2012
Extension of TSVM to Multi-Class and Hierarchical Text
Classification Problems With General Losses
S.Sathi y a Keer thi (1) S.Sundar ar a jan(2) Shir ish Shevade(3)
(1) Cloud and Information Services Lab, Microsoft, Mountain View, CA 94043
(2) Microsoft Research India, Bangalore, India
(3) Computer Science and Automation, Indian Institute of Science, Bangalore, India
keerthi@microsoft.com, ssrajan@microsoft.com, shirish@csa.iisc.ernet.in
Abstract
Transductive SVM (TSVM) is a well known semi-supervised large margin learning method
for binary text classification. In this paper we extend this method to multi-class and hierarchical classification problems. We point out that the determination of labels of unlabeled
examples with fixed classifier weights is a linear programming problem. We devise an
efficient technique for solving it. The method is applicable to general loss functions. We
demonstrate the value of the new method using large margin loss on a number of multiclass and hierarchical classification datasets. For maxent loss we show empirically that our
method is better than expectation regularization/constraint and posterior regularization
methods, and competitive with the version of entropy regularization method which uses
label constraints.
1
Introduction
Consider the following supervised learning problem corresponding to a general structured
output prediction problem:
mins F s (w) =
w,ξ
λ
2
kwk2 +
l
1X
l
ξsi
(1)
i=1
where ξsi = ξ(w, xsi , yis ) is the loss term and {(xsi , yis )}li=1 is the set of labeled examples. For
example, in large margin and maxent models we have
ξ(w, xi , yi ) = max L( y, yi ) − w T ∆f( y, yi ; xi )
y
(2)
ξ(w, xi , yi ) = −w T f( yi ; xi ) + log Z
(3)
P
where ∆f( y, yi ; xi ) = f( yi ; xi ) − f( y; xi ) and Z = y exp(w T f( y; xi )). Text classification problems involve a rich and large feature space (e.g., bag-of-words features) and so linear classifiers
work very well (Joachims, 1999). We particularly focus on multi-class and hierarchical classification problems (and hence our use of scalar notation for y). In multi-class problems y runs
over the classes and, w and f( y; xi ) have one component for each class, with the component
corresponding to y turned on. More generally, in hierarchical classification problems, y runs
over the set of leaf nodes of the hierarchy and, w and f( y; xi ) consist of one component for
each node of the hierarchy, with the node components in the path to leaf node y turned on.
λ > 0 is a regularization parameter. A good default value for λ can be chosen depending on the
loss function used.1 The superscript s denotes ‘supervised’; we will use superscript u to denote
elements corresponding to unlabeled examples.
In semi-supervised learning we use a set of unlabeled examples, {xui }ni=1 and include the
determination of the labels of these examples as part of the training process:
minu F s (w) +
w,y
s.t.
n
X
n
Cu X
ξui
(4)
δ( y, yiu ) = n( y) ∀ y
(5)
n
i=1
i=1
where yu = { yiu }, ξui = ξ(w, xui , yiu ) and δ is the Kronecker delta function. C u is a regularization
parameter for the unlabeled part. A good default value is C u = 1; we use this value in all our
experiments. (5) consists of constraints on the label counts that come from domain knowledge.
(In practice, one specifies φ( y), the fraction of examples in class
P y; then the values in {φ( y)n}
are rounded to integers {n( y)} in a suitable way so that y n( y) = n.2 ) Such constraints
are crucial for the effective solution of the semi-supervised learning problem; without them
the semi-supervised solution tends to move towards assigning the majority class label to most
unlabeled examples. In more general structured prediction problems (5) may include other
domain constraints (Chang et al., 2007). In this paper we will use just the label constraints
in (5).
1
In the experiments of this paper, for multi-class and hierarchical classification with large margin loss, we use λ = 10
and, for binary maxent loss we use λ = 10−3 .
2
We will assume that quite precise values are given for {n( y)}. The effect of noise in these values on the semisupervised solution needs a separate study.
Inspired by the effectiveness of the TSVM model of Joachims (1999), there have been a
number of works on the solution of (4)-(5) for binary classification with large margin losses.
These methods fall into one of two types: (a) combinatorial optimization; and (b) continuous
optimization. See (Chapelle et al., 2008, 2006) for a detailed coverage of various specific
methods falling into these two types. In combinatorial optimization the label set yu is determined
together with w. It is usual to use a sequence of alternating optimization steps (fix yu and
solve for w, and then fix w and solve for yu ) to obtain the solution. An important advantage
of doing this is that each of the sub-optimization problems can be solved using simple and/or
standard solvers. In continuous optimization yu is eliminated and the resulting (non-convex)
optimization problem is solved for w by minimizing
F s (w) +
n
Cu X
n
ρ(w, xui )
(6)
i=1
where ρ(w, xui ) = min y u ξ(w, xui , yiu ). The loss function ξ as well as ρ are usually smoothed so
that the objective function is differentiable and gradient-based optimization techniques can be
employed. Further, the constraints in (5) involving yu are replaced by smooth constraints on w
expressing balance of the mean outputs of each label over the labeled and unlabeled sets.
Zien et al. (2007) extended the continuous optimization approach to (6) for multi-class and
structured output problems. But their experiments only showed limited improvement over
supervised learning. The combinatorial optimization approach, on the other hand, has not
been carefully explored beyond binary classification. Methods based on semi-definite programming (Xu et al., 2006; De Bie and Cristianini, 2004) are impractical, even for medium size
problems. One-versus-rest and one-versus-one ideas have been tried, but it is unclear if they
work well: Zien et al. (2007) and Zubiaga et al. (2009) report failure while Bruzzone et al.
(2006) use a heuristic implementation and report success in one application domain. Unlike
these methods which have binary TSVM as the basis, we take up an implementation of the
approach for the direct multi-class and hierarchical classification formulation in (4)-(5). The
special structure in constraints allows the yu determination step to reduce to a degenerate
transportation linear programming problem. So the well-known transportation simplex method
can be used to obtain yu . We show that even this method is not efficient enough. As an alternative we suggest an effective and much more efficient heuristic label switching algorithm. For
binary classification problems this algorithm is an improved version of the multiple switching
algorithm developed by Sindhwani and Keerthi (2006) for TSVM. Experiments on a number of
multi-class and hierarchical classification datasets show that, like the TSVM method of binary
classification, our method yields a strong lift in performance over supervised learning, especially
when the number of labeled examples is not sufficiently large.
The applicability of our approach to general loss functions is a key advantage. Specialized
to maxent losses, the method offers an interesting alternative to the idea of entropy regularization (Grandvalet and Bengio, 2003) and related methods (Lee et al., 2006). For maxent
losses, there also exist other methods such as expectation regularization/constraint (Mann
and McCallum, 2010) and posterior regularization (Gärtner et al., 2005; Graca et al., 2007;
Ganchev et al., 2009) which use unlabeled examples only to enforce the constraints in (5). In
section 4 we compare our approach with these methods on binary classification and point out
that our method gives a stronger performance.
2
Semi-Supervised Learning Algorithm
The semi-supervised learning algorithm for multi-class and hierarchical classification problems
follows the spirit of the TSVM algorithm (Joachims, 1999). Algorithm 1 gives the steps. It
consists of an initialization part (steps 1-9) that sets starting values for w and yu , followed by
an iterative part (steps 10-15) where w and yu are refined by semi-supervised learning. Using
exactly the same arguments as those in (Joachims, 1999; Sindhwani and Keerthi, 2006) it can
be proved that Algorithm 1 is convergent.
Initialization of w is done by solving the supervised learning problem. This w can be used
to predict yu . However such a yu usually violates the constraints in (5). To choose a yu that
satisfies (5), we do a greedy modification of the predicted yu . Steps 3-9 of Algorithm 1 give the
details.
The iterative part of the algorithm consists of an outer loop and an inner loop. In the outer
loop (steps 10-15) the regularization parameter C u is varied from a small value to the final
value of 1 in annealing steps. This is done to avoid drastic switchings of the labels in yu , which
helps the algorithm reach a better minimum of (4)-(5) and hence achieve better performance.
For example, on ten runs of the multi-class dataset, 20NG (see Table 1) with 100 labeled
examples and 10, 000 unlabeled examples, the average macro F values on test data achieved
by supervised learning, Algorithm 1 without annealing and Algorithm 1 with annealing are,
respectively, 0.4577, 0.5377 and 0.6253. Similar performance differences are seen on other
datasets too.
The inner loop (steps 11-14) does alternating optimization of w and yu for a given C u . In
steps 12 and 13 we use the most recent w and yu as the starting points for the respective
sub-optimization problems. Because of this, the overall algorithm remains very efficient in spite
of the many annealing steps involving C u . Typically, the overall cost of the algorithm is only
about 3-5 times that of solving a supervised learning problem involving (n + l) examples. For
step 12 one can employ any standard algorithm suited to the chosen loss function. In the rest
of the section we will focus on step 13.
Algorithm 1 Semi-Supervised Learning Algorithm
1: Solve the supervised learning problem, (1) and get w.
2: Set initial labels for unlabeled examples, yu using steps 3-9 below.
3: Set Y = { y}, the set of all classes, A y = ; ∀ y, and I = {1, . . . , n}.
4: repeat
5:
Si = max y∈Y w T f( y; xui ) and yi = arg max y∈Y w T f( y; xui ) ∀i ∈ I.
6:
Sort I by decreasing order of Si .
7:
By order allocate i to A yi while not exceeding sizes specified by n( yi ).
8:
Remove all allocated i from I and remove all saturated y (i.e., |A y | = n( y)) from Y .
9: until Y = ;
10: for C u = {10−4 , 3 × 10−4 , 10−3 , 3 × 10−3 , . . . , 1} (in that order) do
11:
repeat
12:
Solve (4) for w with yu fixed.
13:
Solve (4)-(5) for yu with w fixed.
14:
until step 13 does not alter yu
15: end for
2.1
Linear programming formulation
Let us now consider optimizing yu with fixed w. Let us represent each yiu in a 1-of-m representation by defining boolean variables zi y and requiring
Pthat, for each i, exactly one zi y takes the
value 1. This can be done by using the constraint y zi y = 1 for all i. The label constraints
P
become i zi y = n( y) for all y. Let ci y = ξ(w, xui , y). With these definitions the optimization
problem of step 13 becomes (irrespective of the type of loss function used) the integer linear
programming problem,
min
X
ci y zi y s.t.
(7)
zi y = n( y) ∀ y,
(8)
zi y ∈ {0, 1} ∀i, y
(9)
i, y
X
y
zi y = 1 ∀i,
X
i
This is a special case of the well known Transportation problem (Hadley, 1963) in which
the constraint matrix satisfies unimodularity conditions; hence, the solution of the integer
linear programming problem (7)-(9) is same as the solution of the linear programming (LP)
problem, (7)-(8) (note: in LP the integer constraints are left out), i.e., at LP optimality (9)
holds automatically. Previous works (Joachims, 1999; SindhwaniP
and Keerthi, 2006) do not
make this neat connection to linear programming. The constraints y zi y = 1 ∀i allow exactly
n non-zero elements in {zi y }i y ; thus there is degeneracy of order m, i.e., there are (n + m)
constraints but only n non-zero solution elements.
2.2
Transportation simplex method
The transportation simplex method (a.k.a., stepping stone method) (Hadley, 1963) is a standard
and generally efficient way of solving LPs such as (7). However, it is not efficient enough for
typical large scale learning situations in which n, the number of unlabeled examples is large and
m, the number of classes, is small. Let us see why. Each iteration of this method starts with a
basis set of n + m − 1 basis elements. Then it computes reduced costs for all remaining elements.
This step requires O(nm) effort. If all reduced costs are non-negative then it implies that the
current solution is optimal. If this condition does not hold, elements which have negative
reduced costs are potential elements for entering the basis.3 One non-basis element with a
negative reduced cost (say, the element with the most negative reduced cost) is chosen. The
algorithm now moves the solution to a new basis in which an element of the previous basis is
replaced by the newly entering element. This operation corresponds to moving a chosen set of
examples between classes in a loop so that the label constraints are not violated. The number of
such iterations is observed to be O(nm) and so, the algorithm requires O(n2 m2 ) time. Since n
can be large in semi-supervised learning, the transportation simplex algorithm is not sufficiently
efficient. The main cause of inefficiency is that the step (one basis element changed) is too
small for the amount of work put in (computing all reduced costs)!
3
Presence of negative reduced costs may not mean that the current solution is non-optimal. This is due to degeneracy.
It is usually the case that, even when an optimal solution is reached, the transportation algorithm requires several end
steps to move the basis elements around to reach an end state where positive reduced costs are seen.
Algorithm 2 Switching Algorithm to solve (7)-(9)
1: repeat
2:
for each class pair ( y, ȳ) do
Compute δc(i, y, ȳ) for all i in class y and sort the elements in increasing order of δc
3:
values.
4:
Compute δc(ī, ȳ, y) for all ī in class ȳ and sort the elements in increasing order of δc
values.
5:
Align these two lists (so that the best pair is at the top) to form a switch list of 5-tuples,
{(i, y, ī, ȳ, ρ(i, y, ī, ȳ)}.
6:
Remove any 5-tuple with ρ(i, y, ī, ȳ) ≥ 0.
7:
end for
8:
Merge all the switch lists into one and sort the 5-tuples by increasing order of ρ values.
9:
while switch list is non-empty do
10:
Pick the top 5-tuple from the switch list; let’s say it is (i, y, ī, ȳ, ρ(i, y, ī, ȳ)). Move i to
class ȳ and move ī to class y.
11:
From the remaining switch list remove all 5-tuples involving either i or ī.
12:
end while
13: until the merged switch list from step 8 is empty
2.3
Switching algorithm
We now propose an efficient heuristic switching algorithm for solving (7)-(9) that is suited to
the case where n is large but m is small. The main idea is to use only pairwise switching of
labels between classes in order to improve the objective function. (Note that switching makes
sure that the label constraints are not violated.) This algorithm is sub-optimal for m ≥ 3, but
still quite powerful because of two reasons: (a) the solution obtained by the algorithm is usually
close to the true optimal solution; and (b) reaching optimality precisely is not crucial for the
alternating optimization approach (steps 12 and 13 of Algorithm 1) to be effective.
Let us now give the details of the switching algorithm. Suppose, in the current solution, example
i is in class y. Let us say we move this example to class ȳ. The change in objective function due
to the move is given by δc(i, y, ȳ) = ci ȳ − ci y . Suppose we have another example ī which is
currently in class ȳ and we switch i and ī, i.e., move i to class ȳ and move ī to class y. The
resulting change in objective function is given by
ρ(i, y, ī, ȳ) = δc(i, y, ȳ) + δc(ī, ȳ, y)
(10)
The more negative ρ(i, y, ī, ȳ) is, the better will be the objective function reduction due to the
switching of i and ī. The algorithm looks greedily for finding as many good switches as possible
at a time. Algorithm 2 gives the details. Steps 2-12 consist of one major greedy iteration and has
cost O(nm2 ). Steps 2-7 consist of the background work needed to do the greedy switching of
several pairs of examples in steps 9-12. Step 11 is included because, when i and ī are switched,
data related to any 5-tuple in the remaining switch list that involves either i or ī is messed up.
Removing such elements from the remaining switched list allows the algorithm to continue
finding more pairs to apply switching without a need for repeating steps 2-7. It is this multiple
switching idea that gives the needed efficiency lift over the transportation simplex algorithm.
The algorithm is convergent due to the following reasons: the algorithm only performs switchings which reduce the objective function; thus, once a pair of examples is switched, that pair
Ohscal
0
Objfn−InitObjfn
−10
−20
−30
Transportation Simplex
Switching Algorithm
−40
−50
−60 −2
10
0
10
CPU Time
2
10
Figure 1: Comparison of costs of Transportation simplex and Switching algorithms on Ohscal
dataset with 100/5581 labeled/unlabeled examples, on the first entry to step 13 of Algorithm
1. The vertical axis gives the change in objective function from the initial value.
will not be switched again; and, the number of possible switchings is finite. A typical run
of Algorithm 2 requires about 3 loops through steps 2-12. Since this algorithm only allows
pairwise switching of examples, it cannot assure that the class assignments resulting from it
will be optimal for (7)-(9) if m ≥ 3. However, in practice the objective function achieved by
the algorithm is very close to the true optimal value; also, as pointed out earlier, reaching true
optimality turns out to be not crucial for good performance of the semi-supervised algorithm.
2.4
Comparison of the algorithms
Figure 1 shows the performance of transportation simplex and switching algorithms on the
Ohscal dataset (Forman, 2003) with 100/5581 labeled/unlabeled examples. Note that the
cpu times (x-axis) are in log scale. While transportation simplex requires 100 secs, the switch
algorithm reaches close to optimal well within a second. On the binary classification dataset,
aut-avn (Sindhwani and Keerthi, 2006) with 100/35888 labeled/unlabeled examples, the
switch algorithm reaches exact optimality requiring only 0.1 seconds while transportation
simplex requires 30 minutes!
If m is large then steps 2-7 of Algorithm 2 can become expensive. We have applied the switching
algorithm to datasets that have m ≤ 105, but haven’t observed any inefficiency. If m happens to
be much larger then steps 2-7 can be modified to work with a suitably chosen subset of class
pairs instead of all possible pairs.
2.5
Relation with binary TSVM methods
Consider the case m = 2 (binary classification). There is only a single class pair and so step 11
is not needed. Joachims’ original TSVM method (Joachims, 1999) corresponds to the version
of Algorithm 2 in which only one switch (the top candidate in step 10) is made. Sindhwani
and Keerthi’s multiple switching algorithm (Sindhwani and Keerthi, 2006) is more efficient
than Joachims’ method and corresponds to doing one outer loop of Algorithm 2, i.e., steps 2-12.
Algorithm 2 is more improved and is also optimal for m = 2. This can be proved by noting
the following: the algorithm is convergent; at convergence there is no switching pair which
improves the objective function; and, for m = 2 a transportation simplex step corresponds to
switching labels for a set of example pairs. Thus, if the convergent solution is not optimal, a
transportation simplex iteration can be applied to find at least one switching pair that leads to
Table 1: Properties of datasets. N : number of examples, d : number of features, m : number of
classes, Type: M=Multi-Class; H=Hierarchical, with D=Depth and I=# Internal Nodes
N
d
m
Type
D/I
20NG
19928
62061
20
M
la1
3204
31472
6
M
webkb
8277
3000
7
M
ohscal
11162
11465
10
M
reut8
8201
10783
8
M
sector
9619
55197
105
M
mnist
70000
779
10
M
1
0.9
0.95
0.8
0.9
0.7
0.85
0.6
0.5
rcv-mcat
154706
11429
7
H
2/10
0.8
0.75
0.4
0.7
0.3
0.65
0.2
0
20NG
19928
62061
20
H
3/8
rcv−mcat
1
Macro F
Macro F
20NG
usps
9298
256
10
M
1000
2000
LabSize
3000
4000
0
1000
2000
LabSize
3000
4000
Figure 2: Hierarchical classification datasets: Variation of performance (Macro F) as a function
of the number of labeled examples (LabSize). Dashed black line corresponds to supervised
learning; Continuous black line corresponds to the semi-supervised method; Dashed horizontal
red line corresponds to the supervised classifier built using L and U with their labels known.
objective function reduction, which is a contradiction.
3
Experiments with large margin loss
In this section we give results of experiments on our method as applied to multi-class and
hierarchical classification problems using the large margin loss function, (2). We used the loss,
L( y, yi ) = δ( y, yi ). Eight multi-class datasets and two hierarchical classification datasets were
used. Properties of these datasets (Lang, 1995; Forman, 2003; McCallum and Nigam, 1998;
Lewis et al., 2006; LeCun, 2011; Tibshirani, 2011) are given in Table 1. Most of these datasets
are standard text classification benchmarks. We include two image datasets, mnist and usps to
point out that our methods are useful in other application domains too. rcv-mcat is a subset
of rcv1 (Lewis et al., 2006) corresponding to the sub-tree belonging to the high level category
MCAT with seven leaf nodes consisting of the categories, EQUITY, B OND, FOREX, COMMODITY,
SOFT, METAL and ENERGY. In one run of each dataset, 50% of the examples were randomly
chosen to form the unlabeled set, U; 20% of the examples were put aside in a set L to form
labeled data; the remaining data formed the test set. Ten such runs were done to compute the
mean and standard deviation of (test) performance. Performance was measured in terms of
Macro F (mean of the F values associated with various classes).
In the first experiment, we fixed the number of labeled examples (to 80) and varied the number
20NG (LabSize=80)
0.65
Macro F
0.6
0.55
0.5
0.45
0.4 0
10
2
3
10
10
UnLabSize
4
10
5
10
Figure 3: 20NG: Variation of performance (Macro F) as a function of the number of unlabeled
examples (UnLabSize), with the number of labeled examples fixed at 80.
of unlabeled examples from small to big values. The variation of performance as a function
of the number of unlabeled examples, for the multi-class dataset, 20NG, is given in Figure
2. Performance steadily improves as more unlabeled data is added. The same holds in other
datasets too.
Next we fixed the unlabeled data to U and varied the labeled data size from small values up to
|L|. This is an important study for semi-supervised learning methods since their main value is
when labeled data is sparse (lower side of the learning curve). The variation of performance as
a function of the number of unlabeled examples is shown for the two hierarchical classification
datasets in Figure 3 and, results for six multi-class datasets in Figure 4. We observed that
the performance on the 20N G dataset was almost same in the multi-class and hierarchical
classification scenarios. Also, the performance was similar on the M N IST and USPS datasets.
Clearly, semi-supervised learning is very useful and yields good improvement over supervised
learning especially when labeled data is sparse. The degree of improvement is sharp in some
datasets (e.g., reut8) and mild in some datasets (e.g., sector).
While the semi-supervised method is successful in linear classifier settings such as in text
classification and natural language processing, we want to caution, like (Chapelle et al., 2008),
that it may not work well on datasets originating from nonlinear manifold structure.
4
Maxent: Comparison with other semi-supervised methods
One of the nice features of our method is its applicability to general loss functions. Here we
take up the maxent loss, (3) and compare our method with other semi-supervised maxent
methods which make use of domain constraints such as the label constraints in (5). Let
exp(w T f( yiu ; xui ))
piu ( yiu ) = P
,
u
T
y exp(w f( y; x i ))
X
Eiu = −
piu ( yiu ) log piu ( yiu )
yiu
webkb
ohscal
0.9
0.8
0.75
0.8
0.7
0.7
Macro F
Macro F
0.65
0.6
0.6
0.55
0.5
0.5
0.45
0.4
0
0.35
0
0.4
200
400
600
LabSize
800
1000
500
reut8
2500
0.95
0.9
0.9
0.85
0.85
0.8
Macro F
0.8
Macro F
2000
sector
0.95
0.75
0.7
0.75
0.7
0.65
0.65
0.6
0.6
0.55
0
1000
1500
LabSize
0.55
500
1000
LabSize
1500
0.5
0
2000
500
la1
1000
LabSize
1500
2000
1500
2000
usps
0.9
1
0.95
0.8
0.7
Macro F
Macro F
0.9
0.85
0.6
0.8
0.75
0.7
0.5
0.65
0.4
0
200
400
LabSize
600
800
0
500
1000
LabSize
Figure 4: Multi-class datasets: Variation of performance (Macro F) as a function of the number of
labeled examples (LabSize). Dashed black line corresponds to supervised learning; Continuous
black line corresponds to the semi-supervised method; Dashed horizontal red line corresponds
to the supervised classifier built using L and U with their labels known.
gcat
aut−avn
1
1
0.95
0.95
0.85
F of Class 1
F of Class 1
0.9
0.8
0.75
0.9
0.85
0.7
0.8
0.65
0
128
256
LabSize
384
0.75
0
512
128
256
LabSize
384
512
Figure 5: Comparison of maxent methods on gcat and aut-avn datasets. Dashed Black: supervised learning, (1); Continuous Black: our method, (4)-(5); Green: entropy regularization,; Red:
expectation constraint,; Blue: posterior regularization. Dashed horizontal red line corresponds
to the supervised classifier built using L and U with their labels known.
be, respectively, the probability of label yiu , the partition function, and the entropy of the label
probability distribution associated with the i-th unlabeled example.
4.1
Entropy Regularization
The method minimizes the following objective function:
F s (w) + C u
n
X
i=1
Eiu s.t.
n
X
piu ( y) = n( y) ∀ y.
(11)
i=1
Although the original entropy regularization method (Grandvalet and Bengio, 2003) does not
use the domain constraints in (11), these constraints are crucial for getting good performance,
and so we include them. The unlabeled data term in the objective function (which is referred to
as the entropy regularization term), can be viewed as the expected negative log-likelihood of the
label probability distribution on unlabeled data given by the model. This term can be compared
with the unlabeled data term in the objective function associated with our formulation, (4).
While we work with choosing a single label for each example, entropy regularization works
with expectations. A key advantage of our method over entropy regularization is that the use of
alternate optimization of w and yu on (4)-(5) allows an easy handling of the domain constraints.
This advantage can be particularly crucial when dealing with general structured prediction
problems for which gradients of the domain constraint functions involving piu are expensive to
compute (Jiao et al., 2006).
4.2
Expectation Regularization/Constraint
Mann and McCallum (2010) use unlabeled data only to deal with the domain constraints; they
solve the optimization problem,
min F s (w) + C L
w
m X
n
X
(
piu ( y) − n( y))2 .
y=1 i=1
If the n( y) values are known precisely it is better to enforce the label constraints and solve,
instead, the following problem:
min F s (w) s.t.
w
n
X
piu ( y) = n( y) ∀ y
(12)
i=1
Like entropy regularization, a disadvantage of this method is the need to deal with gradients of
constraint functions involving piu .
4.3
Posterior Regularization
This method (Gärtner et al., 2005; Graca et al., 2007; Ganchev et al., 2009) was introduced
mainly to ease the handling of constraints in the expectation regularization/constraint method.
y
This is achieved by introducing intermediate label distributions qi = {qiu ( yiu )} yiu ∀i, forcing the
constraints 4 on {qiu } and including a KL divergence term between {piu } and {qiu }:
minu F s (w) + C K L
w,{qi }
n X
X
i=1 yiu
s.t.
n
X
qiu ( yiu ) log
qiu ( yiu )
piu ( yiu )
qiu ( y) = n( y) ∀ y
(13)
i=1
If alternating optimization is used on w and {qiu }, then, like in our method, we only need to
solve convex optimization problems in each step. We found C K L = 0.1 to be a good default
value.
We implemented entropy regularization and expectation constraint methods, only for binary
classification because of the complexity brought in by vector constraints. The augmented
lagrangian method (Bertsekas and Tsitsiklis, 1997) was used to handle the constraint. Posterior
regularization was implemented as described in (Gärtner et al., 2005). Figure 4 compares the
various methods on the two binary text classification datasets, gcat and aut-avn (Sindhwani
and Keerthi, 2006). gcat has 23149 examples and 47236 features; aut-avn has 71175 examples
and 20707 features. The experimental set up is similar to that in section 3 except: L consists of
512 examples, and, performance was measured in terms of the F measure of the first class.
The performances of expectation constraint and posterior regularization methods are close,
with the latter being slightly inferior due to the use of the intermediate distribution qiu and
alternate optimization. Both these methods are quite inferior to entropy regularization and our
method; clearly, the unlabeled likelihood terms in (11) and (4) play a crucial role in this. Our
4
There is a minor difference with what is originally presented by (Gärtner et al., 2005), who include the labeled
examples in the label constraints. But those equations can be rewritten in the form (13) by appropriately defining n( y).
method is slightly inferior to entropy regularization due to the use of alternate optimization.
All the four methods lift the performance of supervised learning quite well and so they are good
semi-supervised techniques.
5
Conclusion
In this paper we extended the TSVM approach of semi-supervised binary classification to multiclass and hierarchical classification problems with general loss functions, and demonstrated the
effectiveness of the extended approach. As a natural next step we are exploring the approach
for structured output prediction. The yu determination process is harder in this case since
reduction to linear programming is not automatic. But good solutions are still possible. In many
applications of structured output prediction, labeled data consists of examples with partial
labels. Our approach can easily handle this case; all that one has to do is include all unknown
labels as a part of yu .
References
Bertsekas, D. P. and Tsitsiklis, J. N. (1997). Parallel and Distributed Computation: Numerical
Methods. Athena Scientific.
Bruzzone, L., Chi, M., and Marconcini, M. (2006). A novel transductive SVM for semisupervised
classification of remote-sensing images. volume 44, pages 3363–3373.
Chang, M. W., Ratinov, L., and Roth, D. (2007). Guiding semi-supervision with constraintdriven learning. In ACL.
Chapelle, O., Chi, M., and Zien, A. (2006). A continuation method for semi-supervised SVMs.
In ICML.
Chapelle, O., Sindhwani, V., and Keerthi, S. S. (2008). Optimization techniques for semisupervised support vector machines. In JMLR, volume 9, pages 203–233.
De Bie, T. and Cristianini, N. (2004). Convex methods for transduction. In NIPS.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. In JMLR, volume 3, pages 1289–1305.
Ganchev, K., Graca, J., Gillenwater, J., and Taskar, B. (2009). Posterior regularization for
structured latent variable models. Technical report, Dept. of Computer & Information Science,
University of Pennsylvania.
Gärtner, T., Le, Q. V., Burton, S., Smola, A. J., and Vishwanathan, S. V. N. (2005). Large-scale
multiclass transduction. In NIPS.
Graca, J., Ganchev, K., and Taskar, B. (2007). Expectation maximization and posterior
constraints. In NIPS.
Grandvalet, Y. and Bengio, Y. (2003). Semi-supervised learning by entropy minimization. In
NIPS.
Hadley, G. (1963). Linear Programming. Addison-Wesley, 2nd edition.
Jiao, J., Wang, S., Lee, S., Greiner, R., and Schuurmans, D. (2006). Semi-supervised conditional
random fields for improved sequence segmentation and labeling. In ACL.
Joachims, T. (1999). Transductive inference for text classification using support vector
machines. In ICML.
Lang, K. (1995). Newsweeder: Learning to filter netnews. In ICML.
LeCun, Y. (2011). The MNIST database of handwritten digits.
Lee, C. H., Wang, S., Jiao, F., Schuurmans, D., and Greiner, R. (2006). Learning to model
spatial dependency: semi-supervised discriminative random fields. In NIPS.
Lewis, D., Yang, Y., Rose, T., and Li, F. (2006). Rcv1: A new benchmark collection for text
categorization research. In JMLR, volume 5, pages 361–397.
Mann, G. S. and McCallum, A. (2010). Generalized expectation criteria for semi-supervised
learning with weakly labeled data. In JMLR, volume 11, pages 955–984.
McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text
classification. In AAAI Workshop on Learning for Text Categorization.
Sindhwani, V. and Keerthi, S. (2006). Large-scale semi-supervised linear SVMs. In SIGIR.
Tibshirani, R. (2011). USPS handwritten digits dataset.
Xu, L., Wilkinson, D., Southey, F., and Schuurmans, D. (2006). Discriminative unsupervised
learning of structured predictors. In ICML.
Zien, A., Brefeld, U., and Scheffer, T. (2007). Transductive support vector machines for
structured variables. In ICML.
Zubiaga, A., Fresno, V., and Martinez, R. (2009). Is unlabeled data suitable for multiclass
SVM-based web page classification? In NAACL HLT Workshop on Semi-supervised Learning for
Natural Language Processing.
Download