Active Learning

Active Learning and
Experimental Design
Anwar Sadat
Delgado Monica
Outline
1. Introduction
2. Scenarios (Approaches to Querying)
3. Query Strategy Frameworks
   a. Uncertainty Sampling
   b. Query-by-Committee
   c. Expected Model Change
   d. Expected Error Reduction
   e. Variance Reduction
   f. Density-Weighted Methods
4. Problem Setting Variants
5. Practical Considerations
6. Conclusion
Definition
Active learning is a special case of semi-supervised machine
learning in which a learning algorithm can interactively
query the user (or some other information source) to obtain
the desired outputs at new data points. In the statistics
literature it is sometimes also called optimal experimental design.
Settles, Burr (2010)
Olsson, Fredrik (2009)
Introduction:
Learning paradigms
Supervised Learning
Unsupervised Learning
Active Learning

Passive Learning vs. Active Learning
General schema of a passive learner
• For all supervised and unsupervised learning tasks, the learner
  gathers a significant amount of data randomly sampled from the
  underlying population distribution and then induces a classifier or model
• Cons:
  • Gathering labeled data is often too expensive (time, money)
Passive Learning vs. Active Learning
General schema of an active learner
• Difference: the ability to ask queries about the world
  based upon past queries and responses
• What exactly a query is and what response it receives
  depend on the exact task at hand
• Pro:
  • The entire dataset need not be labeled; it is the task of the learner
    to request the relevant labels
Active Learning Heuristic
• Start with a pool of unlabeled data (U)
• Pick a few points at random and get their labels using an
  oracle (e.g. a human annotator)
• Repeat the following (see the sketch below):
  1. Fit a classifier to the labels seen so far
  2. Pick the BEST unlabeled point to get a label for
    • (closest to the boundary?)
    • (most uncertain?)
    • (most likely to decrease overall uncertainty?)
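A minimal Python sketch of this loop, assuming scikit-learn, a synthetic pool, and least-confidence selection; the "oracle" is simulated by revealing held-back labels, and the budget of 50 queries is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool U with hidden labels standing in for the oracle.
X, y_oracle = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(np.random.RandomState(0).choice(len(X), size=10, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(50):                                   # query budget
    model.fit(X[labeled], y_oracle[labeled])          # 1. fit to labels seen so far
    probs = model.predict_proba(X[unlabeled])
    # 2. pick the most uncertain point (least confident prediction)
    best = unlabeled[int(np.argmin(probs.max(axis=1)))]
    labeled.append(best)                              # oracle reveals the label
    unlabeled.remove(best)
```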
Active Learning (II): Start with unlabeled data
Active Learning (III): Label a random subset
Active Learning (IV): Fit a classifier to the labeled data
Active Learning (V): Pick the BEST next point to label (e.g. closest to the boundary)
Active Learning (VI): Fit a classifier to the labeled data
Active Learning (VII): Pick the BEST next point to label (e.g. closest to the boundary)
Active Learning (VIII): Fit a classifier to the labeled data
Active learning examples
• Active learning is well motivated in many modern machine learning
  problems where data may be abundant but labels are scarce
  or expensive to obtain
• Applications:
  • Speech recognition
  • Information extraction
  • Classification and filtering
Figure: comparison of uncertainty sampling (active learning) vs. random
sampling for text classification of Baseball vs. Hockey
Scenarios (Approaches to
Querying)
Three Main Active Learning Scenarios (Settles, 2010)
Scenarios
Membership Query Synthesis
• The learner may request labels for any unlabeled instance in
  the input space, including queries that the learner generates
  de novo, rather than those sampled from some underlying
  natural distribution
• Pros:
  • Computationally tractable (for finite domains)
  • Can be extended to regression tasks
  • Promising for automating experiments that do not require a human annotator
• Cons:
  • A human annotator may have difficulty interpreting and labeling
    arbitrary instances
Scenarios
Stream-Based Selective Sampling
• Generally used when obtaining an unlabeled instance is free
  or relatively cheap: the query can first be sampled from the
  actual distribution, and then the learner can decide whether
  or not to request its label
• This approach is called stream-based or sequential active learning, as
  each unlabeled instance is typically drawn one at a time from
  the data source, and the learner must decide whether to
  query or discard it
• Pros:
  • Better when memory or processing power is limited, as with
    mobile and embedded devices
• Cons:
  • The assumption that unlabeled data is available for free might not
    always hold
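A sketch of stream-based selective sampling, assuming a recent scikit-learn (SGDClassifier with log loss) and a simple uncertainty-threshold query rule; the synthetic stream, the 0.2 threshold, and the simulated oracle are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stream-based selective sampling: query an arriving instance only if the
# current model is uncertain about it (probability near 0.5); otherwise discard.
rng = np.random.RandomState(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

# Warm start on a handful of labeled points.
X_seed = rng.randn(10, 5)
y_seed = (X_seed[:, 0] > 0).astype(int)
model.partial_fit(X_seed, y_seed, classes=classes)

threshold = 0.2                         # query when |P(y=1|x) - 0.5| < threshold
for _ in range(1000):                   # the incoming stream
    x = rng.randn(1, 5)
    p = model.predict_proba(x)[0, 1]
    if abs(p - 0.5) < threshold:
        y = int(x[0, 0] > 0)            # simulated oracle label
        model.partial_fit(x, [y])       # learn from the queried instance
```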
Scenarios
Pool-Based Sampling
Figure: the pool-based active learning cycle (Settles, 2010)
• It assumes that there is a small set of labeled data L and a
  large pool of unlabeled data U available
• The learner is supplied with a set of unlabeled examples from
  which it can select queries
• Pros:
  • Probably the most widely used sampling scenario (text
    classification, information extraction, image classification, video
    classification, speech recognition, cancer diagnosis)
  • Applied to many real-world tasks
• Cons:
  • Computationally intensive
  • Needs to evaluate the entire pool at each iteration
Query Strategy Frameworks
1. Uncertainty Sampling
2. Query-By-Committee
3. Expected Model Change
4. Expected Error Reduction
5. Variance Reduction
6. Density-Weighted Methods
Uncertainty Sampling (I)
• An active learner queries the instances about which it is least
  certain how to label (e.g. closest to the decision boundary)
• There are basically three strategies: least confident, margin sampling,
  and entropy
Uncertainty Sampling (II)
• For problems with three or more class labels, the learner can query
  the instance whose prediction is the least confident
• The uncertainty measure:
  $x^*_{LC} = \arg\max_x \, 1 - P_\theta(\hat{y} \mid x)$
  where $\hat{y} = \arg\max_y P_\theta(y \mid x)$ is the class label with the
  highest posterior probability under model θ
• Pros:
  • Appropriate if the objective function is to reduce classification error
• Cons:
  • This strategy only considers information about the most probable label
Uncertainty Sampling (III)
• Partial solution: margin sampling:
  $x^*_{M} = \arg\min_x \, P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$
  where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable
  class labels under the model
• Pros:
  • Corrects the least confident strategy by incorporating the
    posterior of the second most likely label
  • Appropriate if the objective function is to reduce classification error
• Cons:
  • For problems with very large label sets, it still ignores the output
    distribution for the remaining classes
Uncertainty Sampling (IV)
• Solution: entropy
  $x^*_{H} = \arg\max_x \, - \sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x)$
  where $y_i$ ranges over all possible labelings
• Entropy is an information-theoretic measure that represents
  the amount of information needed to “encode” a distribution
• Pros:
  • Generalizes easily to probabilistic multi-label classifiers and to
    more complex structured instances (e.g. sequences, trees)
  • Appropriate if the objective function is to minimize log-loss
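The three uncertainty measures above can be computed directly from a model's posterior probabilities. Below is a small NumPy-only sketch; the example probability values are made up for illustration.

```python
import numpy as np

def uncertainty_scores(probs):
    """Given an (n_samples, n_classes) array of posterior probabilities,
    return the three uncertainty-sampling scores (higher = query first)."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]        # descending per row
    least_confident = 1.0 - sorted_p[:, 0]            # 1 - P(y_hat | x)
    margin = -(sorted_p[:, 0] - sorted_p[:, 1])       # negated: small margin = high score
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return least_confident, margin, entropy

# Example: three candidate instances over three classes
probs = np.array([[0.90, 0.09, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.34, 0.33, 0.33]])
lc, mg, ent = uncertainty_scores(probs)
print(np.argmax(lc), np.argmax(mg), np.argmax(ent))   # each picks instance 2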
Uncertainty Sampling (V)
Figure: heat maps illustrating the query behavior of the three uncertainty
measures for a three-label classification problem
Uncertainty Sampling (VI)
• This strategy may also be employed with non-probabilistic classifiers
• It is also applicable to regression problems (i.e., learning tasks
  where the output variable is a continuous value rather than a
  set of discrete class labels)
• The learner queries the unlabeled instance for which the model
  has the highest output variance in its prediction
• Active learning for regression problems is generally referred to
  as optimal experimental design
Query-By-Committee (I)
• The QBC approach involves maintaining a committee C of
  models which are all trained on the current labeled set L, but
  represent competing hypotheses
• Each committee member votes on the labeling of query candidates
• Pick the instances generating the most disagreement among the hypotheses
• Goal of QBC: minimize the version space (the set of
  hypotheses that are consistent with the current labeled
  training data L), i.e., constrain the size of this space as much as
  possible with as few labeled instances as possible
Query-By-Committee (II)
Figure: version space examples for (a) linear and (b) axis-parallel box
classifiers. All hypotheses are consistent with the labeled training data
in L (as indicated by shaded polygons), but each represents a different
model in the version space
• To implement a QBC selection algorithm:
  • Construct a committee of models that represent different regions
    of the version space
  • Have some measure of disagreement among committee members
Query-By-Committee (III)
• Two approaches for measuring the level of disagreement:
• 1) Vote entropy:
  $x^*_{VE} = \arg\max_x \, - \sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$
  where $y_i$ ranges over all possible labelings, $V(y_i)$ is the number of
  “votes” that a labeling receives from the committee, and C is the committee size
• 2) Kullback-Leibler (KL) divergence:
  $x^*_{KL} = \arg\max_x \, \frac{1}{C} \sum_{c=1}^{C} D\big(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\big)$
  where $\theta^{(c)}$ represents a particular model in the committee and
  $\mathcal{C}$ represents the committee as a whole
Query-By-Committee (IV)
• KL divergence is an information-theoretic measure of the
  difference between two probability distributions
• The “consensus” probability that $y_i$ is the correct label is given by:
  $P_{\mathcal{C}}(y_i \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x)$
• Aside from QBC, other strategies attempt to minimize the version space:
  • A selective sampling algorithm that uses a committee of two neural
    networks, the “most specific” and the “most general” models
    (Cohn et al., 1994)
  • Pool-based margin strategy for SVMs (Tong and Koller, 2000)
  • The membership query algorithms (Angluin; King et al.)
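A minimal QBC sketch using vote entropy, assuming a committee built by training decision trees on bootstrap resamples of L; the committee size, model class, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def vote_entropy(committee, X_unlabeled):
    """Disagreement score per instance: entropy of the committee's hard votes."""
    C = len(committee)
    votes = np.stack([m.predict(X_unlabeled) for m in committee], axis=1)  # (n, C)
    scores = []
    for row in votes:
        _, counts = np.unique(row, return_counts=True)
        p = counts / C
        scores.append(-np.sum(p * np.log(p)))
    return np.array(scores)

# Committee trained on bootstrap resamples of the labeled set L.
rng = np.random.RandomState(0)
X_L = rng.randn(40, 5)
y_L = (X_L[:, 0] + X_L[:, 1] > 0).astype(int)
committee = []
for _ in range(5):
    idx = rng.choice(len(X_L), size=len(X_L), replace=True)
    committee.append(DecisionTreeClassifier(max_depth=3).fit(X_L[idx], y_L[idx]))

X_U = rng.randn(200, 5)
query_idx = int(np.argmax(vote_entropy(committee, X_U)))   # most disagreement
```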
Expected Model Change (I)
• A decision-theoretic approach
• It selects the instance that would impart the greatest change
  to the current model if we knew its label
Figure: the model before and after incorporating a newly labeled instance
Expected Model Change (II)
• Expected Gradient Length (EGL)
• It applies to models that use gradient-based training
• The learner should query the instance x which, if labeled and
  added to L, would result in the new training gradient of the
  largest magnitude
• Since the true label is unknown, it calculates the length as an
  expectation over the possible labelings:
  $x^*_{EGL} = \arg\max_x \sum_i P_\theta(y_i \mid x)\, \big\| \nabla \ell_\theta\big(L \cup \langle x, y_i \rangle\big) \big\|$
  where $\nabla \ell_\theta(L)$ is the gradient of the objective function ℓ with
  respect to the model parameters θ, and $\nabla \ell_\theta(L \cup \langle x, y_i \rangle)$ is
  the new gradient that would be obtained by adding the training tuple
  ⟨x, y⟩ to L
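A sketch of the EGL computation for binary logistic regression; the gradient of the negative log-likelihood stands in for ∇ℓ, and the weight vector w, the synthetic data, and the data shapes are assumptions made for illustration.

```python
import numpy as np

def expected_gradient_length(w, X_L, y_L, x):
    """EGL for binary logistic regression: the expectation over possible labels
    of the gradient norm of the loss on L plus the hypothetical tuple <x, y>."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    def grad(X, y):                      # gradient of the negative log-likelihood
        p = sigmoid(X @ w)
        return X.T @ (p - y)
    p_x = sigmoid(x @ w)                 # P(y=1 | x) under the current model
    egl = 0.0
    for y_i, p_i in [(0, 1.0 - p_x), (1, p_x)]:
        g = grad(np.vstack([X_L, x]), np.append(y_L, y_i))
        egl += p_i * np.linalg.norm(g)
    return egl

# Example usage with random data and current weights w.
rng = np.random.RandomState(0)
X_L = rng.randn(30, 4)
y_L = (X_L[:, 0] > 0).astype(float)
w = np.zeros(4)
scores = [expected_gradient_length(w, X_L, y_L, x) for x in rng.randn(50, 4)]
query_idx = int(np.argmax(scores))
```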
Expected Model Change (III)
• It prefers instances that are likely to most influence the model,
  regardless of the resulting query label
• Cons:
  • Computationally expensive if both the feature space and the set of
    labelings are very large
  • If features are not properly scaled, the EGL approach can be led
    astray (solution: parameter regularization)
Expected Error Reduction (I)
• A decision-theoretic approach
• Query the instances that would most reduce error, i.e., measure
  how much the model's generalization error is likely to be reduced
• The idea is to estimate the expected future error of a model
  trained using $L \cup \langle x, y \rangle$ on the remaining unlabeled
  instances in U, and query the instance with minimal expected
  future error (risk)
• One approach: minimize the expected 0/1-loss:
  $x^*_{0/1} = \arg\min_x \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}(\hat{y} \mid x^{(u)}) \right)$
  where $\theta^{+\langle x, y_i \rangle}$ refers to the new model after it has been
  retrained with the training tuple $\langle x, y_i \rangle$ added to L
Expected Error Reduction (II)
• Objectives:
  • Reduce the expected total number of incorrect predictions
  • Minimize the expected log-loss, which is equivalent to reducing
    the expected entropy over U (maximizing the expected
    information gain of the query x, or the mutual information of the
    output variables over x and U)
• It has been successfully used with naïve Bayes, Gaussian
  random fields, logistic regression and SVMs
• Cons:
  • The most computationally expensive query framework (it
    requires estimating the expected future error over U for each
    query, and a new model must be incrementally re-trained for each
    possible query labeling)
  • Because of the complexity, it is usually used only for simple
    binary classification tasks
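A brute-force sketch of expected error reduction with 0/1-loss, assuming a scikit-learn probabilistic classifier; retraining inside the double loop is exactly the expense noted above, so the candidate set is deliberately restricted here. The synthetic data and candidate count are illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def expected_error_reduction(model, X_L, y_L, X_U, candidates):
    """For each candidate x, retrain with each hypothetical label weighted by
    P_theta(y|x) and score the expected 0/1 risk over the pool U."""
    risks = []
    for i in candidates:
        x = X_U[i:i + 1]
        probs = model.predict_proba(x)[0]
        risk = 0.0
        for y_i, p in zip(model.classes_, probs):
            m = clone(model).fit(np.vstack([X_L, x]), np.append(y_L, y_i))
            risk += p * np.sum(1.0 - m.predict_proba(X_U).max(axis=1))
        risks.append(risk)
    return candidates[int(np.argmin(risks))]

# Usage sketch
rng = np.random.RandomState(0)
X = rng.randn(120, 4)
y = (X[:, 0] > 0).astype(int)
X_L, y_L, X_U = X[:20], y[:20], X[20:]
model = LogisticRegression().fit(X_L, y_L)
best = expected_error_reduction(model, X_L, y_L, X_U, candidates=list(range(30)))
```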
Variance Reduction (I)
• Objective: reduce generalization error indirectly by minimizing
  output variance
• Considering a regression problem, the learner’s expected
  future error can be decomposed:
  $E_T\big[(\hat{y} - y)^2 \mid x\big] = \underbrace{E\big[(y - E[y \mid x])^2\big]}_{\text{noise}} + \underbrace{\big(E_L[\hat{y}] - E[y \mid x]\big)^2}_{\text{bias}} + \underbrace{E_L\big[(\hat{y} - E_L[\hat{y}])^2\big]}_{\text{variance}}$
  where $E_L[\cdot]$ is an expectation over the labeled set L, $E[\cdot \mid x]$ is an
  expectation over the conditional density $P(y \mid x)$, and $E_T[\cdot]$ is an
  expectation over both
Variance Reduction (II)
• The variance reduction query selection strategy then becomes:
  $x^*_{VR} = \arg\min_x \, \langle \tilde{\sigma}^2_{\hat{y}} \rangle$
  where $\langle \tilde{\sigma}^2_{\hat{y}} \rangle$ is the estimated mean output variance once the
  model has been retrained on the query x
• The output variance can be expressed in terms of the Fisher information
  matrix F of the model parameters (MacKay, 1992):
  $\tilde{\sigma}^2_{\hat{y}} \approx \left(\frac{\partial \hat{y}}{\partial \theta}\right)^{\!T} F^{-1} \left(\frac{\partial \hat{y}}{\partial \theta}\right)$
• In other words, to minimize the variance over its parameter
  estimates, an active learner should select data that maximizes
  its Fisher information (or minimizes the inverse thereof)
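A simplified variance-reduction sketch for logistic regression: instead of MacKay's full derivation, it uses an A-optimal-style proxy, minimizing the trace of the inverse Fisher information after hypothetically adding each candidate, which captures the "maximize Fisher information" intuition above. The ridge term, model choice, and synthetic data are assumptions for illustration.

```python
import numpy as np

def fisher_information(w, X):
    """Fisher information matrix of logistic-regression parameters on inputs X."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    W = p * (1.0 - p)                          # per-instance weight p(1-p)
    return (X * W[:, None]).T @ X

def select_by_variance_reduction(w, X_L, X_U):
    """Pick the candidate whose addition most shrinks tr(F^-1),
    a proxy for the model's average parameter (and output) variance."""
    d = X_L.shape[1]
    F_L = fisher_information(w, X_L) + 1e-6 * np.eye(d)   # ridge for invertibility
    scores = []
    for x in X_U:
        F_new = F_L + fisher_information(w, x[None, :])
        scores.append(np.trace(np.linalg.inv(F_new)))
    return int(np.argmin(scores))                          # smallest residual variance

# Usage: w is the current logistic-regression weight vector.
rng = np.random.RandomState(0)
X_L, X_U, w = rng.randn(30, 4), rng.randn(100, 4), rng.randn(4)
query_idx = select_by_variance_reduction(w, X_L, X_U)
```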
Density-Weighted Methods (I)
• The previous methods may spend time querying possible outliers simply
  because they are controversial
Figure: uncertainty sampling can be a poor strategy for classification. Since
A is on the decision boundary, it would be queried as the most uncertain.
However, querying B is likely to result in more information about the data
distribution as a whole
• The basic idea is to query instances that are both the most “informative”
  and the most “representative” (i.e., inhabit dense regions of the input space)
Density-Weighted Methods (II)
• Main idea: informative instances should not only be
  those which are uncertain, but also those which are
  “representative” of the underlying distribution
• 1) The information density framework (Settles and Craven, 2008)
Figure: one candidate lies in the densest region while still very close to the
decision boundary; another is merely very close to the decision boundary
Density-Weighted Methods (III)
• The Settles and Craven approach:
  $x^*_{ID} = \arg\max_x \; \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}\big(x, x^{(u)}\big) \right)^{\beta}$
• Here $\phi_A(x)$ represents the informativeness of x according to
  some “base” query strategy A, such as an uncertainty
  sampling or QBC approach. The second term weights the
  informativeness of x by its average similarity to all other
  instances in the input distribution (as approximated by U),
  subject to a parameter β that controls the relative importance
  of the density term
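A small sketch of the information density weighting, assuming cosine similarity as sim(·,·) and a precomputed base informativeness score φ_A(x) (e.g. the entropy scores from uncertainty sampling); the β value and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(X_U, base_scores, beta=1.0):
    """Information-density re-weighting: multiply a base informativeness score
    by each instance's average similarity to the rest of the pool, raised to beta."""
    sim = cosine_similarity(X_U)              # pairwise similarity over the pool U
    density = sim.mean(axis=1)                # average similarity to the pool
    return base_scores * density ** beta

# Usage: base_scores stands in for phi_A(x), e.g. entropy from uncertainty sampling.
rng = np.random.RandomState(0)
X_U = rng.randn(100, 8)
base_scores = rng.rand(100)
query_idx = int(np.argmax(information_density(X_U, base_scores, beta=1.0)))
```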
Density-Weighted Methods (IV)
• McCallum and Nigam (1998) also developed a density-weighted QBC
  approach for text classification with naïve Bayes
  (a special case of information density)
• Fujii et al. (1998) considered a query strategy for nearest-neighbor
  methods that selects queries that are least similar to the
  labeled instances in L, and most similar to the unlabeled
  instances in U
• Nguyen and Smeulders (2004) proposed a density-based
  approach that first clusters instances and tries to avoid
  querying outliers by propagating label information to
  instances in the same cluster
• Xu et al. (2007) use clustering to construct sets of queries for
  batch-mode active learning with SVMs
Density-Weighted Methods (V)
• Pro:
  • If densities can be pre-computed efficiently and cached for later
    use, the time required to select the next query is essentially no
    different than for the base informativeness measure (e.g.,
    uncertainty sampling)
  • This is advantageous for conducting active learning interactively
    with oracles in real time
Usage Comparison of Different Active Learning Approaches
[Guyon, Cawley, Dror, Lemaire 2009]
[Active Learning Challenge]
http://www.causality.inf.ethz.ch/activelearning.php#cont
Problem Setting Variants
Some common problem setting variants, generalizations, and extensions of
traditional active learning work:
1. Active learning for structured outputs
2. Active feature acquisition and classification
3. Active class selection
4. Active clustering
AL for Structured Outputs (I)
• Many important learning problems involve predicting
  structured outputs on instances, such as sequences and trees
Figure: example of an information extraction problem viewed as a sequence
labeling task
AL for Structured Outputs (II)
• Unlike simpler classification tasks, each instance x in this
  setting is not represented by a single feature vector, but rather
  by a structured sequence of feature vectors
• The appropriate label for a token often depends on its context
  in the sequence
• Labels are typically predicted by a sequence model based on a
  probabilistic finite state machine, such as CRFs or HMMs
• Settles and Craven, “An Analysis of Active Learning Strategies for
  Sequence Labeling Tasks” (2008), discusses how many active learning
  strategies can be adapted for structured labeling tasks
Active Feature Acquisition and Classification (I)
• In some learning domains, instances may have incomplete
  feature descriptions; for example, many data mining tasks in
  modern business are characterized by naturally incomplete
  customer data
• The task of the model is to classify using incomplete
  information as the feature set
• Active feature acquisition seeks to alleviate these problems by
  allowing the learner to request more complete feature
  information, under the assumption that more information can be
  obtained at extra cost/effort
Active Feature Acquisition and Classification (II)
• The goal in active feature acquisition is to select the most
  informative features to obtain during training, rather than randomly
  or exhaustively acquiring all new features for all training instances
• There have been multiple approaches to this feature acquisition
  problem, e.g.:
  • Zheng and Padmanabhan proposed two “single-pass” approaches:
    1. Impute the missing values, and then acquire the ones about
       which the model has the least confidence
    2. Alternatively, impute these values, train a classifier on the
       imputed training instances, and only acquire feature values for the
       instances which are misclassified
  • Incremental active feature acquisition may acquire values for a few
    salient features at a time:
    1. By selecting a small batch of misclassified examples (Melville et al., 2004)
Active Feature Acquisition and Classification (III)
    2. By taking a decision-theoretic approach and acquiring the feature
       values which are expected to maximize some utility function
       (Saar-Tsechansky et al., 2009)
• Some work in active feature acquisition has considered cases in
  which feature acquisition can be done at test time rather
  than during training
• The difference between these learning settings and typical
  active learning is that the “oracle” provides salient feature
  values rather than training labels
Active Class Selection
• So far we have assumed that unlabeled instances are freely (or cheaply)
  available and that obtaining labels is what incurs a cost
• In the active class selection problem, the learner is instead able to
  query for instances of a known class label, and obtaining each
  instance incurs a cost
• Lomasky et al. (2007) propose several active class selection
  query algorithms for an “artificial nose” task:
  • the machine learns to discriminate between different vapor types
    (the class labels), which must be chemically synthesized (to
    generate the instances)
Active Clustering
• Clustering algorithms are the most common examples of
  unsupervised learning
• Since active learning aims at selecting data that will reduce the
  model's uncertainty, unsupervised active learning may seem
  counterintuitive
• Hofmann and Buhmann proposed an active clustering
  algorithm for proximity data, based on an expected value of
  information criterion
  • The idea is to generate (or subsample) the unlabeled instances in
    such a way that they self-organize into groupings with less
    overlap or noise than for clusters induced using random sampling
Practical Considerations
• Until very recently the main question has been: “Can machines
  learn with fewer training instances if they ask questions?”
  • Yes (subject to some assumptions, e.g. a single oracle, the oracle is
    always right, and the cost of labeling instances is uniform)
• These assumptions don't hold in most real-world situations
• The research trend has now taken a turn toward answering the
  question: “Can machines learn more economically if they ask
  questions?”
Noisy Oracles
• The assumption that the oracle is always correct doesn't hold in
  real-world scenarios
• Crowdsourcing tools and annotation games make it possible to average
  out the result by getting labels from multiple oracles
• Question: when should the learner query a potentially noisy
  oracle, or even re-request the label for an instance that seems off?
• Research has shown that labeling quality can be improved by
  selective relabeling
• Questions still remaining for research:
  • How can the learner incorporate varying levels of noise from the same
    oracle?
  • What if repeated labeling is not able to improve labels at all?
Variable Labeling Costs
• The aim is to reduce the total labeling cost; simply reducing the total
  number of labels doesn't solve the problem
• Annotation costs can be reduced by using trained models to
  assist in the labeling of query instances by pre-labeling them
  (similar to data parsing)
• Kapoor et al. propose a method that accounts for labeling
  costs and mislabeling costs. Queries are evaluated by
  summing the labeling cost and the expected cost of mislabeling,
  which depends on the quality of the oracle (the cost of labeling and
  the oracle quality should be known beforehand)
Batch-Mode Active Learning
• Most active learning research is oriented toward settings where
  queries are selected serially, one at a time
• In the presence of multiple annotators, this slows down the
  learning process
• Batch-mode active learning allows the learner to query
  instances in groups; this approach is better suited to parallel
  labeling
• The challenge is to form the query set: querying the “Q-best”
  instances according to an instance-level strategy usually causes
  information overlap and is not helpful
• Research is oriented toward approaches that explicitly
  incorporate diversity in the queries, e.g. Xu et al. (2007) query
  the centroids of different clusters (see the sketch below)
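A sketch of one way to build a diverse batch in the spirit of the clustering approach referenced above (not Xu et al.'s exact SVM-based method): cluster the most uncertain candidates with k-means and query the instance nearest each centroid. The top_frac cutoff, k-means settings, and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_U, batch_size, uncertainty, top_frac=0.25):
    """Cluster the most uncertain pool instances and query the point
    closest to each cluster centroid, so the batch stays diverse."""
    n_top = max(batch_size, int(top_frac * len(X_U)))
    top = np.argsort(uncertainty)[::-1][:n_top]              # most uncertain candidates
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=0).fit(X_U[top])
    batch = []
    for k in range(batch_size):
        members = top[km.labels_ == k]
        d = np.linalg.norm(X_U[members] - km.cluster_centers_[k], axis=1)
        batch.append(int(members[np.argmin(d)]))              # nearest to centroid
    return batch

# Usage: 'uncertainty' could be the entropy scores from uncertainty sampling.
rng = np.random.RandomState(0)
X_U = rng.randn(300, 6)
batch = diverse_batch(X_U, batch_size=5, uncertainty=rng.rand(300))
```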
Alternative Query Types
• Settles et al. (2008) suggest multiple-instance (MI) active learning.
  Here instances are grouped into bags, and the bags are labeled
  rather than the instances
  • A bag is labeled positive if any of its instances is positive
  • A bag is labeled negative if and only if all of its instances are negative
  • It is easier to label bags than to label individual instances
• Mixed-granularity MI learning was also proposed by Settles: the
  learner is able to query instances as well as bags
• Querying on features, as opposed to labels, is another alternative;
  feature-based learning can also be interleaved with instance
  labeling to improve accuracy (e.g. the presence of the word “football”
  in a text document usually means it is a sports document)
Multi-Task Active Learning
• The same data instance can be labeled in multiple ways for
  different subtasks
• It is also sometimes more economical to label all tasks of an
  instance simultaneously
• Qi et al. (2008) study a different multi-task active learning
  scenario, in which images may be labeled for several binary
  classification tasks in parallel. For example, an image might be
  labeled as containing a beach, sunset, mountain, field, etc.,
  which are not all mutually exclusive; however, they are not
  entirely independent, either
Stopping Criteria
• When is it good to stop?
  1. When the cost of acquiring new training data is greater than the cost
     of the errors made by the current model
  2. When the accuracy plateau is recognized to have been reached
• The real stopping criterion for practical applications is based
  on economic or other external factors, which likely come well
  before an intrinsic learner-decided threshold
Analysis of Active Learning
• A number of research studies, including work done by large
  corporations like Google, IBM, and CiteSeer, suggest that it works
• Its success is, however, closely tied to the quality of the annotators
• In most of the studied research, it does reduce the number of instance
  labels required to achieve a given level of accuracy
• However, there are reported studies in which the users opted
  out of active learning because they lacked confidence in the approach
Conclusions
• Active learning is a growing area of research in machine
  learning, driven by the fact that available data is ever
  growing and can be obtained cheaply, while labels remain costly
• A lot of the research is aimed at improving the accuracy of
  the learner with a smaller number of queries
• Current research is aimed at applying active learning to
  practical scenarios and problem solving