Diverse Ensembles for Active Learning
Prem Melville and Raymond J. Mooney
Machine Learning Group
Department of Computer Sciences
University of Texas at Austin
June 27, 2004
Motivation
• Actively selecting most useful training examples is an important
approach to reducing amount of supervision
• Pool-based sample selection is the most popular approach
– Learner chooses best instance for labeling from a set of unlabeled
examples
• Query by Committee (QBC) is a theoretically well motivated approach
to sample selection [Seung et al. 92]
– Committee of consistent hypotheses is learned
– Examples that cause maximum disagreement amongst this committee are
selected for labeling
• Bagging and AdaBoost have been used to learn effective committees
for QBC [Abe & Mamitsuka 98]
– Known as Query by Bagging (QBag) and Query by Boosting (QBoost)
2
Motivation
• A good ensemble for QBC should be diverse
– i.e., consistent hypotheses that are very different from each other
– Only a committee that effectively samples the version space is productive
for sample selection [Cohn 94]
• Decorate is a recently-developed ensemble method that explicitly
builds diverse ensembles [Melville & Mooney 03,04]
– It’s more accurate than Bagging & AdaBoost when training data is limited
– And does at least as well as AdaBoost when training sets are large
• How effective are Decorate ensembles for sample selection?
– Can the added diversity help select more informative examples than QBag
and QBoost?
3
Outline
• Background on DECORATE
• Active-DECORATE
• Experimental Evaluation
• Additional Experiments
• Future Work and Conclusions
4
Ensemble Diversity
• Combining classifiers is only useful if they disagree on some inputs
– Diversity refers to a measure of disagreement (ambiguity)
• Increasing diversity while maintaining error of ensemble members →
decreases ensemble error [Krogh & Vedelsby 95]
• We use disagreement with ensemble prediction as a measure of diversity
– If Ci(x) is the prediction of the i-th classifier for the label of x
– C*(x) is the prediction of the entire ensemble
– Diversity of the i-th classifier on example x is given by
$$ d_i(x) = \begin{cases} 0 & \text{if } C_i(x) = C^*(x) \\ 1 & \text{otherwise} \end{cases} $$
– Diversity of an ensemble of size m on a training set of size n:
$$ \frac{1}{nm} \sum_{j=1}^{n} \sum_{i=1}^{m} d_i(x_j) $$
• Our approach: build ensembles consistent with training data while
maximizing diversity
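A minimal sketch of the diversity measure defined on this slide, assuming the ensemble prediction C*(x) is a simple majority vote with ties going to the label seen first (the tie rule and the function names are illustrative assumptions, not taken from the slides):

```python
def ensemble_prediction(votes):
    """Majority vote over one example's member predictions (ties: first label seen)."""
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    return max(counts, key=counts.get)

def ensemble_diversity(member_predictions):
    """member_predictions[i][j] is classifier i's label for training example j.

    Returns the fraction of (classifier, example) pairs where the classifier
    disagrees with the ensemble prediction."""
    m = len(member_predictions)
    n = len(member_predictions[0])
    disagreements = 0
    for j in range(n):
        votes = [member_predictions[i][j] for i in range(m)]
        c_star = ensemble_prediction(votes)
        disagreements += sum(1 for v in votes if v != c_star)
    return disagreements / (n * m)

# Three classifiers, four examples; two of the twelve votes disagree with the majority.
preds = [
    ["a", "a", "b", "b"],
    ["a", "b", "b", "b"],
    ["a", "a", "a", "b"],
]
print(ensemble_diversity(preds))
```

A value of 0 means every member always agrees with the ensemble; larger values mean more disagreement.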
6
DECORATE: Basic Approach
• The ensemble is generated iteratively
• Artificially constructed examples are added to training set when
building new members
• Artificial examples are given labels that disagree with current
ensemble’s decisions
• The new classifier is trained on this augmented data
– Thereby forcing it to differ from the current ensemble
– Adding it to the ensemble will therefore increase diversity
• While forcing diversity we still maintain accuracy
– Reject new classifier if adding it to existing ensemble decreases its accuracy
• To produce predictions we take the majority vote of the ensemble
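The iterative procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `base_learner`, `make_artificial`, `size`, and `max_iters` are hypothetical names, the accept/reject test uses training error, and the toy 1-nearest-neighbour learner and label-flipping generator exist only to make the sketch runnable.

```python
import random
from collections import Counter

def majority_vote(ensemble, x):
    """Label predicted by most ensemble members for input x."""
    return Counter(clf(x) for clf in ensemble).most_common(1)[0][0]

def ensemble_error(ensemble, data):
    """Fraction of (x, y) pairs the majority vote gets wrong."""
    wrong = sum(1 for x, y in data if majority_vote(ensemble, x) != y)
    return wrong / len(data)

def decorate(train, base_learner, make_artificial, size=15, max_iters=50):
    """train: list of (x, y). base_learner(data) returns a callable classifier.
    make_artificial(ensemble, train) returns artificial (x, y) pairs whose
    labels disagree with the current ensemble."""
    ensemble = [base_learner(train)]
    error = ensemble_error(ensemble, train)
    iters = 1
    while len(ensemble) < size and iters < max_iters:
        artificial = make_artificial(ensemble, train)
        ensemble.append(base_learner(train + artificial))
        new_error = ensemble_error(ensemble, train)
        if new_error <= error:
            error = new_error   # keep the new member
        else:
            ensemble.pop()      # reject: it made the ensemble less accurate
        iters += 1
    return ensemble

# Toy demo (hypothetical): 1-D inputs, binary labels, 1-nearest-neighbour learner.
def one_nn(data):
    points = list(data)
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(0)

def flipped_artificial(ensemble, train):
    # Random points labelled with the opposite of the ensemble's vote.
    xs = [rng.uniform(0.0, 1.0) for _ in train]
    return [(x, 1 - majority_vote(ensemble, x)) for x in xs]

train = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
committee = decorate(train, one_nn, flipped_artificial, size=5)
print(len(committee), ensemble_error(committee, train))
```

The `max_iters` cap is there so the loop terminates even when many candidates are rejected.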
7
Overview of DECORATE
[Figure, built up over three slides: the Training Examples and the Artificial Examples are passed to the Base Learner, which contributes classifiers C1, then C2, then C3 to the Current Ensemble]
10
Artificial Data
• Examples are generated at each iteration
– Number of examples is proportional to training size (1:1)
• Randomly pick points from approx. training data distribution
– For numeric attributes
• compute mean and std dev & generate values from the Gaussian
– For nominal attributes
• Compute prob. of occurrence of each distinct value & generate values from
this distribution
• To label examples
– Find class membership probabilities predicted by current ensemble
– Select labels s.t. probability of selection is inversely proportional to
ensemble predictions
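A rough sketch of the generation and labeling steps listed on this slide. Several details are assumptions made for illustration: rows are dictionaries, the population standard deviation is used, and a small `eps` guards against a predicted probability of zero (the slide does not say how that case is handled).

```python
import random
import statistics
from collections import Counter

def fit_column_models(rows, numeric_cols, nominal_cols):
    """Per-attribute models: a Gaussian for numeric columns, value frequencies for nominal ones."""
    models = {}
    for c in numeric_cols:
        values = [r[c] for r in rows]
        models[c] = ("gauss", statistics.mean(values), statistics.pstdev(values))
    for c in nominal_cols:
        counts = Counter(r[c] for r in rows)
        models[c] = ("nominal", list(counts.keys()), list(counts.values()))
    return models

def sample_row(models, rng):
    """Draw one artificial example, treating attributes as independent."""
    row = {}
    for c, model in models.items():
        if model[0] == "gauss":
            row[c] = rng.gauss(model[1], model[2])
        else:
            row[c] = rng.choices(model[1], weights=model[2], k=1)[0]
    return row

def inverse_label(class_probs, rng, eps=1e-6):
    """Pick a label with probability proportional to 1 / (ensemble's predicted probability)."""
    labels = list(class_probs)
    weights = [1.0 / max(class_probs[label], eps) for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

rows = [
    {"age": 30.0, "color": "red"},
    {"age": 40.0, "color": "red"},
    {"age": 50.0, "color": "blue"},
]
rng = random.Random(0)
models = fit_column_models(rows, ["age"], ["color"])
fake = sample_row(models, rng)
label = inverse_label({"pos": 0.9, "neg": 0.1}, rng)
print(fake, label)
```

With ensemble probabilities of 0.9 and 0.1, the less likely class is drawn about nine times as often as the more likely one.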
11
Outline
• Background on DECORATE
• Active-DECORATE
• Experimental Evaluation
• Additional Experiments
• Future Work and Conclusions
12
Active-DECORATE
[Figure, shown over two slides: DECORATE builds the Current Ensemble (C1 to C4) from the Training Examples; the ensemble assigns a utility to each of the Unlabeled Examples (0.1, 0.9, 0.3, 0.2, 0.5 in the illustration); the step "Acquire Label" moves a selected example into the Training Examples]
QBag and QBoost are implemented similarly, using Bagging and AdaBoost in place of Decorate
14
Measure of Utility
• To evaluate the expected utility of unlabeled examples we
use the margins on the examples
– Similar to [Abe and Mamitsuka 98]
• Given the class membership probabilities predicted by the
committee
• The margin is defined as diff between highest and second
highest predicted class probability
– Smaller margins imply greater uncertainty in the class label
• Other measures of utility will be discussed later
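As a small illustration of the margin measure, the sketch below assumes that the committee's class probabilities are the plain average of the members' probability vectors; the slides do not spell out how they are combined, so that averaging step and the function names are assumptions.

```python
def committee_probabilities(member_probs):
    """Average the members' class probability vectors (assumed combination rule)."""
    n = len(member_probs)
    k = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n for c in range(k)]

def margin(class_probs):
    """Difference between the highest and second-highest class probability."""
    ranked = sorted(class_probs, reverse=True)
    return ranked[0] - ranked[1]

def most_uncertain(pool_probs):
    """Index of the pool example with the smallest margin."""
    return min(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))

# Three unlabeled examples, each scored by a committee of three members.
pool_probs = [
    committee_probabilities([[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]),
    committee_probabilities([[0.6, 0.4], [0.4, 0.6], [0.5, 0.5]]),
    committee_probabilities([[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]),
]
print(most_uncertain(pool_probs))
```

The second example is selected because the committee is split almost evenly on it.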
15
Summary of Data Sets
Name        Cases  Classes  Attributes
Vowel         990       11          14
Statlog       270        2          14
Primary       339       21          18
Breast-w      699        2           9
Sonar         208        2          61
Glass         214        6           9
Heart-c       303        2          13
Hepatitis     155        2          19
Diabetes      768        2           9
Iris          150        3           4
Labor          57        2          16
Lymph         148        4          18
Credit-g     1000        2          21
Soybean       683       19          35
Heart-h       294        2          14
16
Experimental Methodology
• Compared Active-Decorate with QBag, QBoost and Decorate (using
random sampling)
– Used ensembles of size 15
– Used J48 as the base learner
• J48 is a Java implementation of C4.5 decision tree induction
• Two runs of 10-fold cross-validation were performed on 15 UCI datasets
• In each fold, learning curves were generated
– The set of available examples was treated as the unlabeled pool
– At each iteration, the active learner selected a sample of points to be labeled
and added to the training set
– For the passive learner, Decorate, examples were selected randomly
• At the end of the learning curve, all algorithms see the same examples
– The curves evaluate how well an active learner orders the set of
examples in terms of utility
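The pool-based selection loop described here can be sketched generically as below. All names are hypothetical, and the toy threshold learner and closeness-based utility only stand in for a committee and a margin-style measure so that the loop can run.

```python
def select_batch(pool, utility, batch_size):
    """Return the batch_size pool items with the highest utility."""
    return sorted(pool, key=utility, reverse=True)[:batch_size]

def run_active_learning(pool, label_of, train, utility_of, batch_size, rounds):
    """pool: unlabeled items; label_of(x): labeling oracle; train(labeled) -> model;
    utility_of(model, x) -> float, where higher means more useful to label."""
    pool = list(pool)
    labeled = []
    history = []
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        batch = select_batch(pool, lambda x: utility_of(model, x), batch_size)
        for x in batch:
            pool.remove(x)
            labeled.append((x, label_of(x)))
        model = train(labeled)          # retrain on the enlarged training set
        history.append(len(labeled))
    return model, labeled, history

# Toy stand-ins: learn a threshold on integers 0..90; the true boundary is at 55.
def train_threshold(labeled):
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    low = max(zeros) if zeros else 0
    high = min(ones) if ones else 100
    return (low + high) / 2

def closeness(threshold, x):
    return -abs(x - threshold)          # nearer the boundary = more useful

pool = list(range(0, 100, 10))
oracle = lambda x: 1 if x >= 55 else 0
model, labeled, history = run_active_learning(
    pool, oracle, train_threshold, closeness, batch_size=1, rounds=3)
print(model, labeled, history)
```

A passive learner would use the same loop with a random choice in place of `select_batch`.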
17
Metrics – Data Utilization Ratio
[Figure: learning curves of accuracy vs. number of training examples for Active and Random selection, with the gap between them marked "Examples saved"]
• Primary aim of active learning – reduce amount of data needed to induce accurate
model
18
Metrics – Data Utilization Ratio
• Define target error rate as the error that Decorate can achieve on a given dataset
– Error averaged over pts of the learning curve corresponding to last 50 examples
• Record smallest num of examples required by a learner to achieve same or lower
error
19
Metrics – Data Utilization Ratio
• Data utilization ratio:
– (num of examples required by active learner) / (num of examples required by Decorate)
• Reflects how efficiently the active learner is using data
– Similar to measure used by Abe & Mamitsuka [98]
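One possible reading of the data utilization ratio, written as a sketch. The learning-curve representation (a list of `(num_examples, error)` pairs) and the interpretation of "last 50 examples" as the curve points within the final `last_n` examples are assumptions; the numbers are made up for illustration.

```python
def target_error(curve, last_n=50):
    """Mean error over the curve points that fall within the last `last_n` examples.
    curve: list of (num_examples, error), sorted by num_examples."""
    max_n = curve[-1][0]
    tail = [e for n, e in curve if n > max_n - last_n]
    return sum(tail) / len(tail)

def examples_needed(curve, target):
    """Smallest number of examples at which the error is at or below the target."""
    for n, e in curve:
        if e <= target:
            return n
    return None  # target never reached

def data_utilization_ratio(active_curve, passive_curve, last_n=50):
    target = target_error(passive_curve, last_n)
    a = examples_needed(active_curve, target)
    p = examples_needed(passive_curve, target)
    if a is None or p is None:
        return None
    return a / p

# Illustrative curves only (not results from the talk).
passive = [(10, 0.40), (20, 0.30), (30, 0.25), (40, 0.20), (50, 0.20)]
active = [(10, 0.35), (20, 0.20), (30, 0.18), (40, 0.20), (50, 0.20)]
print(data_utilization_ratio(active, passive, last_n=20))
```

A ratio below 1 means the active learner reached the target error with fewer labeled examples than random sampling.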
20
Metrics – Percentage Error Reduction
[Figure: learning curves of accuracy vs. number of training examples for Active and Random selection, with the gap between them marked "Error Reduction"]
• How much an active learner improves accuracy over random sampling given a
fixed amount of labeled data
• Compute % reduction in error over Decorate
– Average over points on the learning curve
21
Metrics – Percentage Error Reduction
• Towards end of learning curve all methods see almost the same examples
– Hence, main impact of active learning is lower on curve
• Capture this by reporting % error reduction on the 20% of points on the curve where
the largest improvements are produced
– Similar to a measure used by Saar-Tsechansky & Provost [01]
22
Metrics – Percentage Error Reduction
• Error reduction is considered significant if the difference in error of the 2 systems,
averaged across the selected pts of the curve, is statistically significant (p<0.05)
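A sketch of the percentage error reduction over the top 20% of curve points. The slides leave some details open, so this assumes the reduction is computed point by point and then averaged over the selected points; the function name and the example errors are illustrative, and the significance test is not shown.

```python
def pct_error_reduction(active_errors, passive_errors, top_fraction=0.2):
    """Average % error reduction over the fraction of curve points with the
    largest improvements. Both lists hold errors at the same training-set sizes."""
    reductions = [
        100.0 * (p - a) / p
        for a, p in zip(active_errors, passive_errors)
        if p > 0
    ]
    reductions.sort(reverse=True)
    k = max(1, round(top_fraction * len(reductions)))
    top = reductions[:k]
    return sum(top) / len(top)

# Illustrative errors at ten points of a learning curve (not results from the talk).
passive = [0.50, 0.40, 0.30, 0.25, 0.20, 0.20, 0.20, 0.20, 0.20, 0.20]
active = [0.50, 0.30, 0.15, 0.20, 0.18, 0.20, 0.20, 0.20, 0.20, 0.20]
print(pct_error_reduction(active, passive))  # about 37.5: mean of the 50% and 25% reductions
```

Setting `top_fraction=1.0` gives the plain average over the whole curve.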
23
Results – Data Utilization
• On all but one dataset Active-Decorate produces improvements over
Decorate
– On average it requires 78% of the num of examples that Decorate needs
– With as few as 29% of examples on soybean
– On breast-w we notice a ceiling effect, where none of the active methods
improve on Decorate
• Active-Decorate outperforms both QBag and QBoost on 10 datasets
– On some datasets (vowel & primary), QBag & QBoost failed to achieve
the target error
• Decorate itself achieves the target error with far fewer examples than are
available
– e.g. on breast-w it achieves the target error with only 30 of the available
630 examples
– Hence improving on the data utilization of Decorate is fairly challenging
24
Results – Error Reduction
• On all datasets Active-Decorate produces significant
reductions in error over Decorate
• On 8 datasets Active-Decorate produces higher reductions
than other active methods
• It produces a wide range of improvements
– From moderate (4.2% on credit-g) to high (70.68% on vowel)
– With an average reduction of 21.2%
                 QBag     QBoost   Active-Decorate
Mean Err. Red.   13.13%   15.64%   21.15%
No. of Wins      4        3        8
25
Learning Curve for Soybean
26
Outline
• Background on DECORATE
• Active-DECORATE
• Experimental Evaluation
• Additional Experiments
• Future Work and Conclusions
27
Measures of Utility
• There are two main aspects of any QBC approach
– The method employed to construct the committee
– Measure used to rank utility of unlabeled examples
• We compared different methods for constructing committees
– Ranked examples based on margins
• Alternate approach – use Jensen-Shannon (JS) divergence
[Cover & Thomas 91]
– JS-div is a measure of similarity between probability distributions
28
Jensen-Shannon Divergence
• If P_i(x) is the class probability distribution given by the i-th classifier for
example x, then the JS-div of an ensemble of size n is defined as:
$$ JS(P_1, P_2, \ldots, P_n) = H\left(\sum_{i=1}^{n} \frac{1}{n} P_i\right) - \sum_{i=1}^{n} \frac{1}{n} H(P_i) $$
• H(P) is the Shannon entropy of distribution P = {p_j, j = 1, …, K}, defined as:
$$ H(P) = -\sum_{j=1}^{K} p_j \log p_j $$
• Higher values of JS-div indicate greater spread in predicted class
probability distribution
– Zero iff the distributions are identical
– A similar measure was used by [McCallum & Nigam 98]
• We ran experiments, as before, comparing JS-div with margins
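The JS-divergence of a committee's predictions can be computed directly from the definition. The sketch below uses the natural logarithm, which is an assumption since the slide does not state the base; function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (terms with p_j = 0 contribute 0)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def js_divergence(dists):
    """Entropy of the mean distribution minus the mean of the members' entropies."""
    n = len(dists)
    k = len(dists[0])
    mean = [sum(d[j] for d in dists) / n for j in range(k)]
    return entropy(mean) - sum(entropy(d) for d in dists) / n

agree = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]
split = [[1.0, 0.0], [0.0, 1.0]]
print(js_divergence(agree))   # essentially zero: members give identical distributions
print(js_divergence(split))   # log 2: two members that are certain but contradict each other
```

Unlabeled examples would then be ranked by this value, with higher values treated as more useful to label.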
29
Results – Utility Measures
               Data Utilization     % Error Reduction
               Margins   JS-Div     Margins   JS-Div
Mean           0.78      0.83       21.15     17.22
Num. of Wins   7         8          11        4
• In terms of data utilization, the two measures are evenly matched
• On error reduction, using margins is more effective
• JS-div selects examples to reduce uncertainty in predicted class membership probabilities
– Which indirectly helps improve accuracy
• Margins focus more directly on determining the decision boundary
• Cost-sensitive decisions require accurate class probability estimates
– Using JS-div could be more effective in such cases
30
Learning Curve for Vowel
– Often both measures achieve target error with comparable number of examples
– But error reduction produced by margins is higher
31
Committees for Sample Selection
vs. Prediction
• All active methods described use committees to select examples
• In addition to sample selection, they also use the committees for
prediction
• We are evaluating the combination of sample selection and
ensemble method
– Active-Decorate does better than QBag
– Could just be because Decorate is better than Bagging
• Claim: Decorate not only produces accurate committees, but the
committees it produces are also more effective for sample selection
32
Committees for Sample Selection
vs. Prediction
• Implemented variant of Active-Decorate
– At each iteration a committee constructed by Bagging is used to select
examples given to Decorate
– Thus separating evaluation of selector from predictor
• Similarly, implemented a variant using AdaBoost as the selector
• Compared the 3 variants on 4 datasets
• On 3 of 4 datasets, using any selector with Decorate as predictor
performed better than random selection
– On the 4th dataset, the trends are same, but not statistically significant
• Compared to AdaBoost and Bagging, Decorate committees select more
informative examples for training Decorate
33
Learning Curve for Soybean
34
Related Work
• Dagan & Engelson [95] measure utility of examples using vote entropy
– i.e. the entropy of the class distribution based on majority votes of each
committee member
– [McCallum & Nigam 98] showed that it does not perform as well as JS-div
• Another committee-based active learner – Co-Testing [Muslea et al.
00]
– Requires 2 redundant views of the data
– Hence limited applicability
• Expected-error reduction methods [Cohn et al. 96, Roy & McCallum
01, Zhu et al. 03]
– Select examples that are expected to minimize error on the actual test
distribution
– Computationally intensive, and must be tailored to specific learners
– Active meta-learners like Active-Decorate can be applied to any learner
35
Future Work & Conclusions
• Active-Decorate is a simple, yet effective approach to active learning
– Produces significant improvements over Decorate
– In general, it leads to more effective sample selection than QBag and
QBoost
• Using JS-divergence to measure the utility of examples is less
effective for improving classification accuracy than using margins
– JS-div may be a better measure when the objective is improving class
probability estimates
• Active-Decorate is a meta-learning scheme – so it can be applied to
other base learners
– We can compare with other active learners, such as approaches for SVMs
[Tong et al. 01]
36
Questions?
DECORATE is now available as part of the Weka ML package.
Machine Learning Group, UT-Austin
www.cs.utexas.edu/users/ml
37
Ensemble Diversity
• Combining classifiers is only useful if they disagree on some
inputs
– Diversity refers to a measure of disagreement (ambiguity)
• For regression
– Using mean squared error to measure accuracy
– Using variance to measure diversity
– Ensemble generalization error: $E = \bar{E} - \bar{D}$
[Krogh & Vedelsby ′95]
• $\bar{E}$ – average error of the ensemble members
• $\bar{D}$ – average diversity of the ensemble
• Increasing diversity while maintaining error of ensemble
members → decreases ensemble error
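For a single input and a uniformly weighted average of regressors, the relation between ensemble error, average member error, and average diversity can be checked numerically. The sketch below uses made-up predictions and hypothetical names; it illustrates the identity and is not code from the talk.

```python
def decomposition(preds, target):
    """Squared error of the averaged prediction, average member squared error,
    and average diversity (variance of members around the ensemble mean)."""
    m = len(preds)
    mean = sum(preds) / m
    ensemble_err = (mean - target) ** 2
    avg_member_err = sum((f - target) ** 2 for f in preds) / m
    avg_diversity = sum((f - mean) ** 2 for f in preds) / m
    return ensemble_err, avg_member_err, avg_diversity

ens, avg_err, avg_div = decomposition([1.0, 2.0, 4.0], target=3.0)
print(ens, avg_err - avg_div)  # the two values agree up to rounding
```

Because the diversity term is never negative, the ensemble's error at a point is at most the average error of its members.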
38
Diversity for Classification
• For classification the simple linear relation doesn’t hold
– We still have reason to believe that diversity is related to error
reduction [Cunningham ′00]
• Many measures of diversity have been used in the literature
• [Kuncheva et al. ′03] compared different measures
– They show that most of these measures are highly correlated
• No conclusive study points to which measure of diversity
is the best to use
39
Learning Curve for Soybean (Full)
40
Learning Curve for Vowel (Full)
41
Learning Curve for Soybean (Full)
42
Related Work
• There have been other ensemble methods that focus on
diversity
– [Liu & Yao ′99], [Rosen ′96], [Opitz & Shavlik ′96], [Zenobi &
Cunningham ′01], [Tumer and Ghosh ′96], [Opitz ′99]
• How our work differs from others:
– Other methods attempt to optimize accuracy and diversity of
individual ensemble members
• We try to minimize error of entire ensemble by increasing diversity
– Some methods are dependent on the underlying learner (e.g. NN)
• DECORATE is a general meta-learner applicable to any base learner
– We compare with standard ensemble methods – others don’t
• Except for [Opitz ′99]
– We present learning curves, which evaluate performance with varying
amounts of data
43
Modeling Artificial Data
• We use a very crude approximation of the data
distribution
– Assume independence of features
– Assume Gaussian distribution for numeric attributes
• We can do a better job of modeling the data
• But, we get good results with the current method
• It is unclear that a better model will improve
results
– It will however increase run time
44
Artificial vs. Unlabeled Data
• The way we use artificial examples may appear counterintuitive to the
way unlabeled data is used in semi-supervised learning
– Where the labels given to the unlabeled data by the supervised learner are
preserved (instead of being flipped)
• Why does semi-supervised learning work?
– Unlabeled data provides more information about the data distribution
– Artificial data does not
• Why does flipping labels not hurt Decorate?
– If the current ensemble is accurate, aren’t we forcing subsequent
members to not be accurate?
– No – we make sure that the error of the ensemble never increases
45
When Should You Use DECORATE?
1. When you have few training examples
• Or acquiring labeled data is expensive
2. For large amt. of training data you may still do better than Boosting
– DECORATE performs better on 6 of 15 datasets given 100% of the data
– For your dataset there is a good chance that DECORATE will outperform
Boosting even with large amounts of data
3. When your base classifier cannot handle weighted examples
• Boosting can be done with resampling – but might not be desirable
4. When you have noisy data
• Boosting often increases error due to overfitting noisy data [Dietterich 00]
• DECORATE is resilient to noise in data [Melville et al. 04]
46
Other Ensemble Methods
• There are other ensemble methods that we can compare to
– Error-Correcting Output Coding [Dietterich & Bakiri ′95]
– Injecting randomness into the learning algorithm
• We chose to compare to Bagging and Boosting
– They are the most widely used and studied
• We also compared to Random Forests
– Which is not a meta-learner
– But since we use decision trees we also compared with RFs
47
[Figure: four panels titled Labor, Iris, Heart-C, and Breast-W]
48
Bagging [Breiman ′96]
• Each classifier is trained on a set of m training examples
• Examples drawn randomly with replacement from the
original set of size m
– Such a set is called a bootstrap replicate
• Predictions are made by taking the majority vote of the
ensemble
• Ensemble members differ because they’re trained on
different subsets of the data
• Bagging reduces error due to variance of the base classifier
49
Boosting (AdaBoost.M1) [Freund & Schapire ′96]
• Maintains a set of weights over the training examples
• In each iteration classifier Ci is trained to minimize the weighted error
• The weighted error of Ci is used to update the distribution of weights
– Weights of misclassified examples are increased
– Weights of correctly classified examples are decreased
• Next classifier is trained on examples with updated distribution
– This process is repeated for specified number of iterations
• Ensemble predictions made using a weighted vote of individual classifiers
– Weight of each classifier is computed according to its training accuracy
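One round of the weight update can be sketched as follows, using the usual AdaBoost.M1 formulation in which correctly classified examples are scaled by beta = error / (1 - error) and the weights are then renormalized. The function name and the stopping convention (returning `None`) are assumptions made for this illustration.

```python
import math

def adaboost_m1_step(weights, correct):
    """weights: current example weights summing to 1.
    correct[i]: True if the current classifier labels example i correctly.
    Returns (new_weights, vote_weight), or None when the weighted error is 0
    or at least 0.5, where boosting would stop."""
    error = sum(w for w, c in zip(weights, correct) if not c)
    if error <= 0 or error >= 0.5:
        return None
    beta = error / (1 - error)
    # Shrink the weights of correctly classified examples, then renormalize,
    # which raises the relative weight of the misclassified ones.
    scaled = [w * beta if c else w for w, c in zip(weights, correct)]
    total = sum(scaled)
    new_weights = [w / total for w in scaled]
    vote_weight = math.log(1 / beta)
    return new_weights, vote_weight

new_weights, vote = adaboost_m1_step([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(new_weights, vote)
```

After the update the misclassified example carries half of the total weight, so the next classifier is pushed to get it right.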
50
Metrics – Data Utilization Ratio
• Primary aim of active learning – reduce amount of data needed to
induce accurate model
• First, define target error rate as the error that Decorate can achieve on
a given dataset
– Error averaged over pts of the learning curve corresponding to last 50
examples
• Then, record smallest num of examples required by a learner to
achieve same or lower error
• Data utilization ratio:
– (num of exs required by active learner) / (num of exs required by
Decorate)
• Reflects how efficiently the active learner is using data
– Similar to measure used by Abe & Mamitsuka [98]
51
Metrics - Percentage Error Reduction
• How much an active learner improves accuracy over random sampling
given a fixed amt of labeled data
• Compute % reduction in error over Decorate
– Average over points on the learning curve
• Towards end of learning curve all methods see almost the same examples
– Hence, main impact of active learning is lower on curve
• Capture this by reporting % error reduction on the 20% of points on the curve
where the largest improvements are produced
– Similar to a measure used by Saar-Tsechansky & Provost [01]
• Error reduction is considered significant if the diff in the error of the 2
systems averaged across selected pts of the curve is determined to be
statistically significant (p<0.05)
52