Active Learning and Experimental Design
Anwar Sadat Delgado, Monica

Outline
1. Introduction
2. Scenarios (Approaches to Querying)
3. Query Strategy Frameworks
   a. Uncertainty Sampling
   b. Query-by-Committee
   c. Expected Model Change
   d. Expected Error Reduction
   e. Variance Reduction
   f. Density-Weighted Methods
4. Problem Setting Variants
5. Practical Considerations
6. Conclusion

Definition
Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In the statistics literature it is sometimes also called optimal experimental design.
Settles, Burr (2010); Olsson, Fredrik (2009)

Introduction: Learning Paradigms
• Supervised learning
• Unsupervised learning

Active Learning vs. Passive Learning
General schema of a passive learner
• For both supervised and unsupervised learning tasks, the learner needs to gather a significant amount of data, randomly sampled from the underlying population distribution, and then induce a classifier or model
• Cons:
  • Gathering labeled data is often too expensive (time, money)

Active Learning vs. Passive Learning
General schema of an active learner
• Difference: the ability to ask queries about the world based upon past queries and responses
• What exactly a query is, and what response it receives, depends on the task at hand
• Pro:
  • Not all of the data needs to be labeled; it is the learner's task to request the relevant labels

Active Learning Heuristic
• Start with a pool of unlabeled data (U)
• Pick a few points at random and get their labels from an oracle (e.g. a human annotator)
• Repeat the following:
  1. Fit a classifier to the labels seen so far
  2. Pick the BEST unlabeled point to get a label for
     • (closest to the boundary?)
     • (most uncertain?)
     • (most likely to decrease overall uncertainty?)

Active Learning (II)–(VIII)
Figure sequence illustrating the loop: start with unlabeled data; label a random subset; fit a classifier to the labeled data; pick the best next point to label (e.g. the one closest to the boundary); refit the classifier; and repeat.

Active Learning Examples
• Active learning is well motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain
• Applications:
  • Speech recognition
  • Information extraction
  • Classification and filtering
Figure: Comparison of uncertainty sampling (active learning) vs. random sampling for text classification of Baseball vs. Hockey
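The heuristic above is easy to turn into code. Below is a minimal sketch of pool-based uncertainty sampling, assuming a scikit-learn-style classifier and a hypothetical `oracle(i)` function that returns the true label of pool point `i`; the function names, the seed size, and the query budget are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, n_seed=10, budget=50, rng=None):
    """Pool-based active learning with least-confident uncertainty sampling (sketch)."""
    rng = np.random.default_rng(rng)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))  # random seed set
    y_labeled = [oracle(i) for i in labeled]                             # ask the oracle
    model = LogisticRegression(max_iter=1000)

    for _ in range(budget):
        model.fit(X_pool[labeled], y_labeled)                # 1. fit to labels seen so far
        probs = model.predict_proba(X_pool)                  # posteriors for every pool point
        uncertainty = 1.0 - probs.max(axis=1)                # least-confident score
        uncertainty[labeled] = -np.inf                       # never re-query a labeled point
        query = int(np.argmax(uncertainty))                  # 2. pick the "best" point
        labeled.append(query)
        y_labeled.append(oracle(query))                      # get its label

    model.fit(X_pool[labeled], y_labeled)
    return model, labeled
```

Replacing the argmax with a uniformly random choice gives the passive baseline used in comparisons such as the Baseball vs. Hockey figure above.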
Scenarios (Approaches to Querying)
Figure: The three main active learning scenarios (Settles, 2010)

Scenarios: Membership Query Synthesis
• The learner may request labels for any unlabeled instance in the input space, including queries that the learner generates de novo, rather than ones sampled from some underlying natural distribution
• Pros:
  • Computationally tractable (for finite domains)
  • Can be extended to regression tasks
  • A promising approach for automating experiments that do not require a human annotator
• Cons:
  • A human annotator may have difficulty interpreting and labeling arbitrary, synthesized instances

Scenarios: Stream-Based Selective Sampling
• Generally used when obtaining an unlabeled instance is free or relatively cheap: the query can first be sampled from the actual distribution, and the learner can then decide whether or not to request its label
• This approach is called stream-based or sequential active learning, as each unlabeled instance is typically drawn one at a time from the data source, and the learner must decide whether to query or discard it
• Pros:
  • Well suited to settings where memory or processing power is limited, as with mobile and embedded devices
• Cons:
  • The assumption that unlabeled data is available for free might not always hold

Scenarios: Pool-Based Sampling
Figure: The pool-based active learning cycle (Settles, 2010)
• Assumes that there is a small set of labeled data L and a large pool of unlabeled data U available
• The learner is supplied with a set of unlabeled examples from which it can select queries
• Pros:
  • Probably the most widely used sampling method (text classification, information extraction, image classification, video classification, speech recognition, cancer diagnosis)
  • Applied to many real-world tasks
• Cons:
  • Computationally intensive
  • Needs to evaluate the entire pool at each iteration

Query Strategy Frameworks
1. Uncertainty Sampling
2. Query-By-Committee
3. Expected Model Change
4. Expected Error Reduction
5. Variance Reduction
6. Density-Weighted Methods

Uncertainty Sampling (I)
• An active learner queries the instances about which it is least certain how to label (e.g. those closest to the decision boundary)
• There are basically three strategies: least confident, margin sampling, and entropy

Uncertainty Sampling (II)
• For problems with three or more class labels, the learner might query the instances whose prediction is the least confident. The uncertainty measure:
  $x^*_{LC} = \operatorname{argmax}_x \; 1 - P_\theta(\hat{y} \mid x)$
  where $\hat{y} = \operatorname{argmax}_y P_\theta(y \mid x)$ is the class label with the highest posterior probability under model $\theta$
• Pros:
  • Appropriate if the objective is to reduce classification error
• Cons:
  • This strategy only considers information about the most probable label

Uncertainty Sampling (III)
• Partial solution: margin sampling:
  $x^*_{M} = \operatorname{argmin}_x \; P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$
  where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable class labels under the model
• Pros:
  • Corrects the least-confident strategy by incorporating the posterior of the second most likely label
  • Appropriate if the objective is to reduce classification error
• Cons:
  • For problems with very large label sets, it still ignores the output distribution for the remaining classes

Uncertainty Sampling (IV)
• Solution: entropy:
  $x^*_{H} = \operatorname{argmax}_x \; -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x)$
  where $y_i$ ranges over all possible labelings
• Entropy is an information-theoretic measure that represents the amount of information needed to "encode" a distribution
• Pros:
  • Generalizes easily to probabilistic multi-label classifiers and to more complex structured instances (e.g. sequences, trees)
  • Appropriate if the objective is to minimize log-loss
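For concreteness, the three measures above can be computed directly from a model's predicted class posteriors. A minimal sketch, assuming `probs` is an n_samples x n_classes array such as the output of a scikit-learn `predict_proba`:

```python
import numpy as np

def least_confident(probs):
    """1 - P(y_hat | x): uncertainty of the most probable label."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """P(y_hat1 | x) - P(y_hat2 | x): small margins mean ambiguous instances."""
    part = np.sort(probs, axis=1)          # ascending per row
    return part[:, -1] - part[:, -2]

def entropy(probs):
    """-sum_i P(y_i | x) log P(y_i | x): uses the full output distribution."""
    eps = 1e-12                            # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Query selection: argmax of least_confident or entropy, argmin of margin.
```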
Uncertainty Sampling (V)
Figure: Heat maps illustrating the query behavior of the three measures for a three-label classification problem

Uncertainty Sampling (VI)
• This strategy may also be employed with non-probabilistic classifiers
• It is also applicable to regression problems (i.e., learning tasks where the output variable is a continuous value rather than a set of discrete class labels)
• The learner queries the unlabeled instance for which the model has the highest output variance in its prediction
• Active learning for regression problems is generally referred to as optimal experimental design

Query-By-Committee (I)
• The QBC approach involves maintaining a committee C of models which are all trained on the current labeled set L, but represent competing hypotheses
• Each committee member votes on the labelings of query candidates
• The learner picks the instances generating the most disagreement among the hypotheses
• Goal of QBC: minimize the version space (the set of hypotheses consistent with the current labeled training data L), constraining the size of this space as much as possible with as few labeled instances as possible

Query-By-Committee (II)
Figure: Version space examples for (a) linear and (b) axis-parallel box classifiers. All hypotheses are consistent with the labeled training data in L (indicated by shaded polygons), but each represents a different model in the version space
• To implement a QBC selection algorithm:
  • Construct a committee of models that represent different regions of the version space
  • Have some measure of disagreement among committee members

Query-By-Committee (III)
• Two approaches for measuring the level of disagreement:
• 1) Vote entropy:
  $x^*_{VE} = \operatorname{argmax}_x \; -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$
  where $y_i$ ranges over all possible labelings, $V(y_i)$ is the number of "votes" that label receives from the committee members, and $C$ is the committee size
• 2) Kullback-Leibler (KL) divergence:
  $x^*_{KL} = \operatorname{argmax}_x \; \frac{1}{C} \sum_{c=1}^{C} D\!\left(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\right)$
  where $\theta^{(c)}$ represents a particular model in the committee and $\mathcal{C}$ represents the committee as a whole

Query-By-Committee (IV)
• KL divergence is an information-theoretic measure of the difference between two probability distributions:
  $D\!\left(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\right) = \sum_i P_{\theta^{(c)}}(y_i \mid x) \log \frac{P_{\theta^{(c)}}(y_i \mid x)}{P_{\mathcal{C}}(y_i \mid x)}$
• The "consensus" probability that $y_i$ is the correct label is given by:
  $P_{\mathcal{C}}(y_i \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x)$
• Aside from QBC, other strategies also attempt to minimize the version space:
  • A selective sampling algorithm that uses a committee of two neural networks, the "most specific" and the "most general" models (Cohn et al., 1994)
  • A pool-based margin strategy for SVMs (Tong and Koller, 2000)
  • The membership query algorithms (Angluin; King et al.)
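A small sketch of the two disagreement measures, assuming each committee member can return class posteriors for the candidate instances; how the committee is built (bagging, different model classes, etc.) is left open, and the array shapes are illustrative:

```python
import numpy as np

def vote_entropy(committee_probs):
    """committee_probs: array of shape (C, n_samples, n_classes).
    Hard-vote entropy: each member votes for its most probable label."""
    C, n, k = committee_probs.shape
    votes = committee_probs.argmax(axis=2)                                   # (C, n) hard votes
    counts = np.stack([(votes == c).sum(axis=0) for c in range(k)], axis=1)  # (n, k) vote counts
    frac = counts / C                                                        # V(y_i) / C
    eps = 1e-12
    return -np.sum(frac * np.log(frac + eps), axis=1)

def mean_kl_divergence(committee_probs):
    """Average KL divergence of each member's distribution from the consensus."""
    consensus = committee_probs.mean(axis=0)                                 # P_C(y_i | x), shape (n, k)
    eps = 1e-12
    kl = np.sum(committee_probs * np.log((committee_probs + eps) / (consensus + eps)), axis=2)
    return kl.mean(axis=0)                                                   # average over members
```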
Expected Model Change (I)
Figure: Model before and after a query
• A decision-theoretic approach
• Selects the instance that would impart the greatest change to the current model if we knew its label

Expected Model Change (II)
• Expected Gradient Length (EGL)
• Applicable to models that use gradient-based training
• The learner should query the instance x which, if labeled and added to L, would result in the new training gradient of the largest magnitude
• It calculates the length as an expectation over the possible labelings:
  $x^*_{EGL} = \operatorname{argmax}_x \; \sum_i P_\theta(y_i \mid x) \left\| \nabla \ell_\theta\!\left(L \cup \langle x, y_i \rangle\right) \right\|$
  where $\nabla \ell_\theta(L)$ is the gradient of the objective function $\ell$ with respect to the model parameters $\theta$, and $\nabla \ell_\theta(L \cup \langle x, y_i \rangle)$ is the new gradient that would be obtained by adding the training tuple $\langle x, y_i \rangle$ to L

Expected Model Change (III)
• It prefers instances that are likely to most influence the model, regardless of the resulting query label
• Cons:
  • Computationally expensive if both the feature space and the set of labelings are very large
  • If features are not properly scaled, the EGL approach can be led astray (solution: parameter regularization)

Expected Error Reduction (I)
• A decision-theoretic approach
• Query the instances that would most reduce error: it measures how much the model's generalization error is likely to be reduced
• The idea is to estimate the expected future error of the model, re-trained with the query, on the remaining unlabeled instances in U, and to query the instance with minimal expected future error (risk)
• One approach: minimize the expected 0/1-loss:
  $x^*_{0/1} = \operatorname{argmin}_x \; \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}\!\left(\hat{y} \mid x^{(u)}\right) \right)$
  where $\theta^{+\langle x, y_i \rangle}$ refers to the new model after it has been re-trained with the training tuple $\langle x, y_i \rangle$ added to L

Expected Error Reduction (II)
• Objectives:
  • Reduce the expected total number of incorrect predictions
  • Or minimize the expected log-loss, which is equivalent to reducing the expected entropy over U (maximizing the expected information gain of the query x, or the mutual information of the output variables over x and U)
• It has been used successfully with naïve Bayes, Gaussian random fields, logistic regression, and SVMs
• Cons:
  • The most computationally expensive query framework (it requires estimating the expected future error over U for each query, and a new model must be incrementally re-trained for each possible query labeling)
  • Because of this complexity, it is usually applied to simple binary classification tasks

Variance Reduction (I)
• Objective: reduce generalization error indirectly by minimizing output variance
• Considering a regression problem, the learner's expected future error can be decomposed as:
  $E_T\!\left[(\hat{y} - y)^2 \mid x\right] = E\!\left[(y - E[y \mid x])^2\right] + \left(E_L[\hat{y}] - E[y \mid x]\right)^2 + E_L\!\left[(\hat{y} - E_L[\hat{y}])^2\right]$
  i.e. noise + bias + variance, where the first expectation is taken over the conditional density P(y|x), $E_L$ is an expectation over the labeled set L, and $E_T$ is an expectation over both

Variance Reduction (II)
• The variance reduction query selection strategy then becomes:
  $x^*_{VR} = \operatorname{argmin}_x \; \tilde{\sigma}^2_{\hat{y}}$
  where $\tilde{\sigma}^2_{\hat{y}}$ is the estimated mean output variance once the model has been trained on the query x
• The variance can be expressed (approximately) in terms of the Fisher information matrix F (MacKay, 1992):
  $\tilde{\sigma}^2_{\hat{y}} \approx \left(\frac{\partial \hat{y}}{\partial \theta}\right)^{\!\top} F^{-1} \left(\frac{\partial \hat{y}}{\partial \theta}\right)$
• In other words, to minimize the variance over its parameter estimates, an active learner should select data that maximizes its Fisher information (or minimizes the inverse thereof)
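To make the computational cost of the expected error reduction strategy above concrete, here is a brute-force sketch of the expected 0/1-risk computation, assuming a fitted scikit-learn-style probabilistic classifier; the model is re-trained once per candidate and per possible label, which is exactly why the slides call this the most expensive framework.

```python
import numpy as np
from sklearn.base import clone

def expected_error_reduction(model, X_lab, y_lab, X_pool, candidates):
    """Return the candidate index with minimal expected 0/1 risk over the pool U (sketch).
    model: a fitted scikit-learn-style probabilistic classifier."""
    risks = []
    for idx in candidates:
        probs_x = model.predict_proba(X_pool[idx:idx + 1])[0]      # P_theta(y_i | x)
        risk = 0.0
        for yi, p_yi in zip(model.classes_, probs_x):
            # Re-train as if <x, y_i> were added to L ...
            m = clone(model).fit(np.vstack([X_lab, X_pool[idx]]),
                                 np.append(y_lab, yi))
            # ... and measure the expected 0/1 loss over the pool.
            pool_probs = m.predict_proba(X_pool)
            risk += p_yi * np.sum(1.0 - pool_probs.max(axis=1))
        risks.append(risk)
    return candidates[int(np.argmin(risks))]
```

In practice this is only feasible for small pools and simple models, as noted above.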
Density-Weighted Methods (I)
• The previous methods may spend time querying possible outliers simply because they are controversial
Figure: Uncertainty sampling can be a poor strategy for classification. Since A is on the decision boundary, it would be queried as the most uncertain; however, querying B, which lies in the densest region while still being very close to the decision boundary, is likely to yield more information about the data distribution as a whole
• The basic idea is to query instances that are both "informative" and "representative" (i.e., that inhabit dense regions of the input space)

Density-Weighted Methods (II)
• Main idea: informative instances should not only be those which are uncertain, but also those which are "representative" of the underlying distribution
• 1) The information density framework (Settles and Craven, 2008)

Density-Weighted Methods (III)
• The Settles and Craven approach:
  $x^*_{ID} = \operatorname{argmax}_x \; \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \operatorname{sim}\!\left(x, x^{(u)}\right) \right)^{\beta}$
  where $\phi_A(x)$ represents the informativeness of x according to some "base" query strategy A, such as an uncertainty sampling or QBC approach. The second term weights the informativeness of x by its average similarity to all other instances in the input distribution (as approximated by U), subject to a parameter $\beta$ that controls the relative importance of the density term

Density-Weighted Methods (IV)
• McCallum and Nigam (1998) also developed a density-weighted QBC approach for text classification with naïve Bayes (a special case of information density)
• Fujii et al. (1998) considered a query strategy for nearest-neighbor methods that selects queries that are least similar to the labeled instances in L and most similar to the unlabeled instances in U
• Nguyen and Smeulders (2004) proposed a density-based approach that first clusters instances and tries to avoid querying outliers by propagating label information to instances in the same cluster
• Xu et al. (2007) use clustering to construct sets of queries for batch-mode active learning with SVMs

Density-Weighted Methods (V)
• Pros:
  • If densities can be pre-computed efficiently and cached for later use, the time required to select the next query is essentially no different than for the base informativeness measure (e.g., uncertainty sampling)
  • This is advantageous for conducting active learning interactively with oracles in real time

Usage Comparison of Different Active Learning Approaches
Figure: Usage comparison of different active learning approaches (Guyon, Cawley, Dror, Lemaire 2009), Active Learning Challenge: http://www.causality.inf.ethz.ch/activelearning.php#cont

Problem Setting Variants
Some common problem setting variants, generalizations, and extensions of traditional active learning work:
1. Active learning for structured outputs
2. Active feature acquisition and classification
3. Active class selection
4. Active clustering

AL for Structured Outputs (I)
• Many important learning problems involve predicting structured outputs on instances, such as sequences and trees
Figure: Example of an information extraction problem viewed as a sequence labeling task

AL for Structured Outputs (II)
• Unlike simpler classification tasks, each instance x in this setting is not represented by a single feature vector, but rather by a structured sequence of feature vectors
• The appropriate label for a token often depends on its context in the sequence
• Labels are typically predicted by a sequence model based on a probabilistic finite state machine, such as a CRF or an HMM
• Settles and Craven, "An Analysis of Active Learning Strategies for Sequence Labeling Tasks" (2008), discusses how many active learning strategies can be adapted for structured labeling tasks
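One simple way to adapt entropy-based uncertainty sampling to sequence labeling is to score a sequence by the total entropy of its per-token marginals. A rough sketch, assuming a hypothetical `token_marginals(sequence)` helper that returns per-token label marginals (e.g. from forward-backward inference in a CRF or HMM):

```python
import numpy as np

def sequence_token_entropy(token_marginals, sequence):
    """Total token entropy of a sequence.
    token_marginals(sequence) is assumed to return an array of shape
    (sequence_length, n_labels) of per-token marginal label probabilities."""
    marginals = token_marginals(sequence)
    eps = 1e-12                              # avoid log(0)
    return float(-np.sum(marginals * np.log(marginals + eps)))

def most_uncertain_sequence(token_marginals, unlabeled_sequences):
    """Query the unlabeled sequence with the highest total token entropy."""
    scores = [sequence_token_entropy(token_marginals, s) for s in unlabeled_sequences]
    return unlabeled_sequences[int(np.argmax(scores))]
```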
Active Feature Acquisition and Classification (I)
• In some learning domains, instances may have incomplete feature descriptions; for example, many data mining tasks in modern business are characterized by naturally incomplete customer data
• The task of the model is to classify an instance using this incomplete information as the feature set
• Active feature acquisition seeks to alleviate these problems by allowing the learner to request more complete feature information, under the assumption that more information can be obtained at extra cost/effort

Active Feature Acquisition and Classification (II)
• The goal in active feature acquisition is to select the most informative features to obtain during training, rather than randomly or exhaustively acquiring all new features for all training instances
• There have been multiple approaches to this feature acquisition problem, e.g.:
• Zheng and Padmanabhan proposed two "single-pass" approaches (a sketch of the first idea appears below):
  1. Impute the missing values, and then acquire the ones about which the model has the least confidence
  2. Alternatively, impute these values, train a classifier on the imputed training instances, and only acquire feature values for the instances which are misclassified
• Incremental active feature acquisition may instead acquire values for a few salient features at a time:
  1. By selecting a small batch of misclassified examples (Melville et al., 2004)
  2. By taking a decision-theoretic approach and acquiring the feature values which are expected to maximize some utility function (Saar-Tsechansky et al., 2009)

Active Feature Acquisition and Classification (III)
• Other work in active feature acquisition considers cases in which features can be acquired at test time rather than during training
• The difference between these learning settings and typical active learning is that the "oracle" provides salient feature values rather than training labels
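A rough sketch of the first single-pass idea above (impute, then acquire values for the instances the model is least confident about), assuming an already-trained scikit-learn-style classifier and a boolean mask of missing entries; the names and the mean-imputation choice are illustrative:

```python
import numpy as np

def select_instances_for_acquisition(model, X, missing_mask, budget):
    """Single-pass heuristic: impute missing values with column means, score the
    model's confidence, and acquire feature information for the least-confident
    incomplete instances.
    X: (n, d) feature matrix with np.nan for missing entries.
    missing_mask: boolean (n, d) array, True where a value is missing."""
    col_means = np.nanmean(X, axis=0)                  # crude imputation
    X_imputed = np.where(missing_mask, col_means, X)
    probs = model.predict_proba(X_imputed)             # model assumed already trained
    confidence = probs.max(axis=1)
    has_missing = missing_mask.any(axis=1)
    confidence[~has_missing] = np.inf                  # only consider incomplete rows
    return np.argsort(confidence)[:budget]             # least-confident first
```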
Active Class Selection
• Until now we have assumed that the instances used to train the model are freely available and that it is labeling them which incurs a cost
• In an active class selection problem this is inverted: the learner is able to query a known class label, and obtaining each instance incurs a cost
• Lomasky et al. (2007) propose several active class selection query algorithms for an "artificial nose" task:
  • A machine learns to discriminate between different vapor types (the class labels), which must be chemically synthesized (to generate the instances)

Active Clustering
• Clustering algorithms are the most common examples of unsupervised learning
• Since active learning aims at selecting data that will reduce the model's uncertainty, unsupervised active learning may seem counterintuitive
• Hofmann and Buhmann proposed an active clustering algorithm for proximity data, based on an expected value of information criterion
• The idea is to generate (or subsample) the unlabeled instances in such a way that they self-organize into groupings with less overlap or noise than clusters induced from random sampling

Practical Considerations
• Until very recently the main question has been "Can machines learn with fewer training instances if they ask questions?"
  • Yes, subject to some assumptions, e.g.: a single oracle, the oracle is always right, the cost of labeling instances is uniform
• These assumptions do not hold in most real-world situations
• The research trend has now taken a turn toward answering the question "Can machines learn more economically if they ask questions?"

Noisy Oracles
• The assumption that the oracle is always correct does not hold in real-world scenarios
• Crowdsourcing tools and annotation games make it possible to average out the result by getting labels from multiple oracles
• Question: when should the learner query a potentially noisy oracle, or even re-request the label for an instance that seems off?
• Research has shown that labeling quality can be improved by selective relabeling
• Questions still open for research:
  • How can the learner incorporate varying levels of noise from the same oracle?
  • What if repeated labeling is not able to improve labels at all?

Variable Labeling Costs
• The aim is to reduce the total labeling cost; simply reducing the total number of labels does not solve the problem
• Annotation costs can be reduced by using trained models to assist in the labeling of query instances by pre-labeling them (similar to assisted data parsing)
• Kapoor et al. propose a method that accounts for both labeling costs and mislabeling costs: queries are evaluated by summing the cost of labeling and the expected cost of mislabeling, which depends on the quality of the oracle (the labeling cost and oracle quality must be known beforehand)

Batch-Mode Active Learning
• Most active learning research is oriented toward settings where queries are selected serially, one at a time
• In the presence of multiple annotators, this slows down the learning process
• Batch-mode active learning allows the learner to query instances in groups, which is better suited to parallel labeling
• The challenge is forming the query set: simply taking the "Q-best" queries according to an instance-level strategy usually causes information overlap and is not helpful
• Research is therefore oriented toward approaches that explicitly incorporate diversity in the queries, e.g. Xu et al. (2007) query the centroids of different clusters (a sketch follows below)
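A simple sketch of that clustering idea, in the spirit of Xu et al. (2007): pre-filter the pool by uncertainty, cluster the candidates, and query the point closest to each cluster centre so the batch is not redundant. K-means and the candidate-pool size are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_pool, uncertainty, batch_size, candidate_pool=200, seed=0):
    """Select a diverse batch: keep the most uncertain candidates, cluster them,
    and return the index of the most central point of each cluster."""
    candidates = np.argsort(uncertainty)[-candidate_pool:]          # most uncertain candidates
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed)
    assignments = km.fit_predict(X_pool[candidates])
    batch = []
    for c in range(batch_size):
        members = candidates[assignments == c]
        dists = np.linalg.norm(X_pool[members] - km.cluster_centers_[c], axis=1)
        batch.append(int(members[np.argmin(dists)]))                # closest to the centroid
    return batch
```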
Alternative Query Types
• Settles et al. (2008) suggest multiple-instance (MI) active learning: instances are grouped into bags, and the bags are labeled rather than the individual instances
  • A bag is labeled positive if any of its instances is positive
  • A bag is labeled negative if and only if all of its instances are negative
  • It is easier to label bags than to label individual instances
• Mixed-granularity MI learning, also proposed by Settles, lets the learner query instances as well as bags
• Querying on features, as opposed to labels, is another alternative; feature-based learning can also be interleaved with instance-based learning to improve accuracy (e.g. the presence of the word "football" in a text document usually means it is a sports document)

Multi-Task Active Learning
• The same data instance can be labeled in multiple ways for different subtasks
• It is also sometimes more economical to label all tasks of an instance simultaneously
• Qi et al. (2008) study a different multi-task active learning scenario, in which images may be labeled for several binary classification tasks in parallel; for example, an image might be labeled as containing a beach, sunset, mountain, field, etc. These labels are not all mutually exclusive, but they are not entirely independent either

Stopping Criteria
• When is it good to stop?
  1. Recognize when the accuracy plateau has been reached
  2. When the cost of acquiring new training data is greater than the cost of the errors made by the current model
• In practice, the real stopping criterion is based on economic or other external factors, which likely come well before an intrinsic, learner-decided threshold

Analysis of Active Learning
• A body of research, and work done by large organizations such as Google, IBM, and CiteSeer, suggests that active learning works
• Its success is, however, closely tied to the quality of the annotators
• In most of the studied research it does reduce the number of instance labels required to achieve a given level of accuracy
• However, there is also reported research in which users opted out of active learning because they lacked confidence in the approach

Conclusions
• Active learning is a growing area of research in machine learning, backed by the fact that available data keeps growing and keeps getting cheaper to collect
• A lot of the research is aimed at improving the accuracy of the learner with a smaller number of queries
• Current research is aimed at implementing active learning in practical scenarios and real-world problem solving