Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence

Teaching Classification Boundaries to Humans

Sumit Basu (Microsoft Research, Redmond, WA; sumitb@microsoft.com)
Janara Christensen (University of Washington, Seattle, WA; janara@cs.washington.edu)

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Given a classification task, what is the best way to teach the resulting boundary to a human? While machine learning techniques can provide excellent methods for finding the boundary, including the selection of examples in an online setting, they tell us little about how we would teach a human the same task. We propose to investigate the problem of example selection and presentation in the context of teaching humans, and explore a variety of mechanisms in the interest of finding what may work best. In particular, we begin with the baseline of random presentation and then examine combinations of several mechanisms: the indication of an example's relative difficulty, the use of the shaping heuristic from the cognitive science literature (moving from easier examples to harder ones), and a novel kernel-based "coverage model" of the subject's mastery of the task. From our experiments on 54 human subjects learning and performing a pair of synthetic classification tasks via our teaching system, we found that we can achieve the greatest gains with a combination of shaping and the coverage model.

Introduction

Machine learning has yielded a broad variety of impressive results on how best to train a classifier, but what is the best way to teach a classification task to a human? The literature has mostly neglected this question, despite its potential value for education as well as other scenarios. This raises the question of why one might wish to teach humans this particular type of task. Beyond the underlying questions about how best to teach human subjects in this scenario, we believe there are several practical applications. The first and most direct is that of teaching a real-world discrimination task. As humans we are often required to make such classifications: is this a safe link to click on, is this a safe place to use a credit card, is this peach sufficiently ripe, is this a morel mushroom or a poisonous false morel? If we can do better than picking random examples for teaching such tasks, we may be able to improve human performance in the real world. A second scenario that is increasingly important is training (calibrating) human labelers for high inter-annotator agreement with past labelers. By teaching new raters using labels from past raters, we can train them to produce more consistent and thus higher quality data.

We know from both theoretical and empirical findings that classifiers learn best from data that constitute a representative sample, i.e., a random sample that has the same distribution as the target data. There are variations such as active learning, in which a classifier can pick from a palette of unlabeled points to be labeled. Humans are quite different from classifiers, though: simply seeing representative examples does not mean they will perform optimally with respect to them. They will be more sensitive to some features than others, they will not perfectly consider all the information from all the samples, and they will be confused by items that are similar. They may have "blind spots" or inherent biases that lead them to make less than optimal decisions. Active learning approaches in which the human picks the next sample to be labeled may help in limited scenarios, but when there are multiple feature dimensions, the space of possible choices can prove overwhelming. Similarly, automatically choosing examples that would be "best" for a classifier can be disastrous for a human; we found this in our own pilot experiments.
Given the possible benefits and inherent questions, we investigate in this paper what might be the best way to teach a classification task to a human from a palette of available methods. The obvious baseline is random presentation, i.e., picking examples from the same distribution as the test set, but we suspected that it would be possible to do better. We thus developed a set of mechanisms we hypothesized could be helpful to humans in the learning task, some based on the cognitive science and machine learning literature, others based on our models of task mastery. In particular, these mechanisms were (1) an indicator of individual example difficulty, (2) a means of selecting examples in order of increasing difficulty (curriculum learning), and (3) the estimation of a "coverage model" of the subject's mastery over the data space, followed by sampling from its complement. To test the effectiveness of our methods, we developed a synthetic learning task: subjects learned to classify parametrically drawn mushrooms based on presented examples and then took a quiz to determine how much they had learned. As a result of these experiments, we found that we can indeed do substantially better than random presentation with some combinations of these methods.

Related Work

There is a wealth of literature on the general area of teaching with the help of computers, an area known as Intelligent Tutoring Systems, but most of this work is focused on aiding with traditional classroom curricula; the recent compilation of Woolf (2008) provides an excellent overview. The closest to our topic is the work on teaching categories, specifically in the context of teaching vocabulary. The emphasis in that research has been on spacing effects, i.e., selecting the interval of study to maximize retention; see for instance the recent work of Mozer et al. (2009). If we reach further afield into the cognitive science literature, there is much work examining how humans form and learn categories, like the classic work of Markman (1989).

The closest research to our own appears in the machine learning literature, in an investigation by Castro et al. (2008) which sought to examine how active learning heuristics would fare in the context of teaching humans. As active learning is focused on finding the best order of examples to present to an online learning algorithm (see Settles (2010) for a current review of active learning), it was a natural choice to apply these algorithms to humans. The authors presented subjects with a 1-D (single parameter) classification problem, where the goal was to identify the location of the boundary. Subjects were shown examples with the parameter illustrated as synthetic "eggs" of varying shapes. They found the best results came via an approach they termed "yoked learning," in which they bisected the posterior for the location of the boundary at each step given the presented examples, getting ever closer to the solution. We note that in 1-D the boundary is just a point, whereas in higher dimensions (such as in our experiments) the boundary is an (N-1)-dimensional hyperplane, and there is no corresponding bisection approach. While such a method could be used to find a single point on the boundary, our goal is to teach the human the difference between the classes, and for that they need to know the entire classification boundary. Another approach Castro et al. found helpful was letting humans choose the next location to view. Again, that makes sense in a 1-D space, where subjects can quickly converge on the boundary. In the 4-D space of our experiments there are far too many possibilities, and no simple way for subjects to visualize the space. In our study, we had 10,000 points sampled uniformly from the space, and showing all of these or even a representative sample in a random ordering would be quite overwhelming. Furthermore, we were also interested in the case where the feature space could be latent, i.e., where the correspondence to a vector space is not explicit, as with a real-world classification task for which only a similarity metric is available; as such, we did not want to focus on methods that took advantage of an explicit representation of the feature space. We also note that their methods were independent of the user's performance during training. In their workflow, the user was shown labels for the examples they or the algorithm picked (users were tested at intervals on the location of the boundary); as such, their performance was not recorded or used. In our work, with every training example the subject was required to test their knowledge before the label was revealed. Our work leverages this information during training via our "coverage model."

Another thread of related work comes from the "shaping" work in the classic cognitive science literature via its recent application in the machine learning community. The original work, by Skinner and others (Peterson 2004), was based on the theory that it is possible to teach a complex task or behavior by building up and reinforcing smaller subtasks. A generalization of this theory is that it may be more efficient to teach a difficult concept by starting off with easy versions of it (Krueger and Dayan 2009). The latter interpretation has found success in recent work on online learning, in which easier examples are presented before more difficult examples for online training of classifiers; this approach is known as "curriculum learning," as explored in (Bengio et al. 2009) and (Kumar, Packer, and Koller 2010). Given the success of this approach in teaching classifiers (recently) and animals/humans (classically), we added this strategy to our suite of teaching methods.
The work on curriculum learning led Khan, Zhu, and Mutlu (2011) to a related investigation, though with the converse of our goals: the authors studied how humans might choose to teach a classifier (robot). In summary, other than Castro et al. (2008), ours is the only work we are aware of that teaches classification boundaries to humans, and the first to teach higher-dimensional boundaries.

Learning Task

For this study, we wanted to choose a classification task of sufficient complexity that users would not master it with a few examples, but not so difficult that it would take egregiously long to learn. We also wanted to be able to generate a large number of unique learning problems of equivalent difficulty; as such, we focused on synthetic data with randomly selected boundaries. Through extensive piloting, we found 4-dimensional data to be a good compromise: subjects could quickly get a good understanding of the problem without saturating their performance too early. As we wished to make the task as natural as possible, we chose to represent the data as parametrically drawn mushrooms instead of directly showing the numerical values of the four-dimensional instances (see Figure 1). The four dimensions, all between 0 and 1, were mapped to ranges of the stem width, stem height, cap width, and cap height.

Figure 1: Examples of parametrically drawn mushrooms.

The classification boundary was an (N-1)-dimensional hyperplane; we did not add any label noise, and as such the problems were linearly separable. In order to ensure that the tasks would be of equal difficulty and that random guessing would result in a base rate of 50% on average, we required the boundary to pass through the origin. Formally, we randomly chose a 4-dimensional weight vector w, normalized it to unit length, and computed the label for each exemplar x as 1 if w · x ≥ 0 and 0 otherwise. As such, we were able to construct a new problem for every task and every subject: for each trial we would begin by sampling 10,000 points from a uniform distribution and then generating a boundary.
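As a concrete illustration, the sketch below generates one such problem instance. The helper name make_problem and the centering of the features before the boundary test (so that a boundary through the origin yields a roughly 50% base rate) are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def make_problem(n_points=10000, n_dims=4, seed=0):
    """Generate one synthetic problem: a random unit-length w defines a
    linear boundary through the origin, and each point is labeled 1 if
    w.x >= 0 (computed here on centered features) and 0 otherwise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_points, n_dims))  # mushroom parameters in [0, 1]
    w = rng.normal(size=n_dims)
    w /= np.linalg.norm(w)                              # unit-length boundary normal
    margin = (x - 0.5) @ w                              # assumption: center features first
    labels = (margin >= 0).astype(int)
    return x, labels, w

x, labels, w = make_problem()
print(labels.mean())  # close to 0.5, matching the intended 50% base rate
```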
Teaching Methods

As our goal was to investigate what methods might work best for teaching, we developed several mechanisms to compete with our baseline of random presentation.

Baseline

We initially considered two baseline methods: random presentation and the active learning heuristic of presenting the points closest to the boundary. We included the boundary method because it seemed like a good instance of a scheme that is ideal for a learning algorithm but terrible for human learners attempting to perform a classification task. Our intuition was correct: in an experiment with a three-dimensional feature space (also represented as a mushroom), a set of 13 users achieved 53.6% accuracy, far worse than random presentation (on which they achieved 71.7%), a difference that was significant (two-tailed t-test, unequal variances: p=0.001, df=22.7, t=3.76). Furthermore, subjects complained bitterly about how frustrating and confusing this condition was. As its inferiority was clear, we left this method out of the final experiment.

Difficulty Indicator

The first mechanism we considered was an indicator of the difficulty of a given training example. This was shown as a small colored square in the upper right corner of the view (see Figure 2). The color varied from bright green (easy) to bright red (hard). The difficulty value was computed from the negative of the projected absolute distance to the boundary, normalized to be between 0 (easiest) and 1 (hardest):

    d(x) = 1 − |w · x| / z

In this formulation, z is a normalizing constant equal to the maximum possible value of |w · x| given the choice of w.
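A minimal sketch of this difficulty measure follows, reusing the hypothetical make_problem helper from the previous sketch; taking the normalizer z as the largest |w · x| over the sampled pool is one reasonable reading of "the maximum possible value," not a detail from the paper.

```python
import numpy as np

def difficulty(x_centered, w, z):
    """Difficulty in [0, 1]: 1 on the boundary (hardest), 0 at the point
    farthest from it (easiest)."""
    return 1.0 - np.abs(x_centered @ w) / z

# Usage, continuing the make_problem sketch above:
# x, labels, w = make_problem()
# xc = x - 0.5                   # same centering assumption as before
# z = np.abs(xc @ w).max()       # normalizer: largest |w.x| over the sampled pool
# d = difficulty(xc, w, z)       # mapped to a green (0) .. red (1) square in the UI
```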
Curriculum Learning

Our next proposal originated from the cognitive science work on shaping as well as the machine learning work on curriculum learning, i.e., the idea that one can train a learner more efficiently by presenting easier examples first and then showing progressively harder ones. For our study, we used the same difficulty measure as for the difficulty indicator. For any given training example number, we chose a difficulty threshold linearly related to the example number, such that the maximum hardness was reachable by the 20th example (out of 30). Examples were still drawn randomly, but if a drawn instance had difficulty greater than the current threshold, it was returned to the set and a new example was chosen; this process continued until an example was found with difficulty below the threshold. We also added a balance constraint that kept the difference between the number of positive and negative examples seen to no more than one.
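The selection procedure can be sketched as follows. The function below enumerates admissible candidates and picks one uniformly, which is equivalent in distribution to redrawing random examples until one falls under the threshold as described above; the linear ramp threshold = t/20 and all names are illustrative assumptions.

```python
import numpy as np

def pick_curriculum_example(d, labels, seen, n_pos, n_neg, t, ramp_end=20, rng=None):
    """Choose the t-th training example (t = 1..30) under the curriculum:
    the difficulty threshold rises linearly to 1.0 by example `ramp_end`,
    and the counts of positive and negative examples shown may differ by
    at most one. `d` holds per-example difficulties in [0, 1]; `seen` is a
    set of already-used indices."""
    rng = rng or np.random.default_rng()
    threshold = min(1.0, t / ramp_end)  # assumed ramp: max hardness reached by example 20
    # If one class is ahead by one, the next example must come from the other class.
    need = None
    if n_pos - n_neg >= 1:
        need = 0
    elif n_neg - n_pos >= 1:
        need = 1
    candidates = [i for i in range(len(d))
                  if i not in seen and d[i] <= threshold
                  and (need is None or labels[i] == need)]
    return int(rng.choice(candidates))
```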
Coverage Model

Our final scheme for improving learning involved what we term a "coverage model": we model the areas of the example space in which the subject has gained expertise, and then sample from the complement of that distribution. This method has analogies to the boosting approach to classifier combination, in which examples on which the classifier performs poorly are reweighted to focus the efforts of successive learners (Freund and Schapire 1999). As we are not doing batch learning, we cannot explicitly apply the reweighting used in boosting; instead, we use a kernel-based approach to represent the expertise distribution and use it to guide example selection.

Since points near each other would be similar in mushroom appearance and typically in label, we modeled the user's knowledge as a mixture of Gaussian kernels of fixed variance. The kernel centers were at every point on which the subject had been given a training example, as well as 100 random samples chosen from the unlabeled points to prevent overfitting. The form of the function was as follows:

    f(x) = (1/z) Σ_{i ∈ S ∪ U} α_i N(x; x_i, σ²),   with α_i ≥ 0

The set S represents the indices of the examples that have been labeled by the subject during training, the set U is the randomly selected set of 100 unseen examples, z is a normalizing constant such that the distribution sums to one over all possible examples, and σ is a fixed kernel width that we determined via pilot experiments. We require the α_i to be positive to ensure that f is positive everywhere. We define the probability of coverage (user knowledge) to be 1 on those indices in S that the user labeled correctly, 0 on those that she labeled incorrectly, and 0.5 on those in U (which have not been seen yet). Since we only have target values at these points, we can solve for the weights as an (|S| + |U|)-dimensional minimization problem over this discrete set of points:

    L = ‖t − f‖²,   where f_j = (1/z) Σ_{i ∈ S ∪ U} α_i N(x_j; x_i, σ²)

In this expression, L is the loss function and t is the vector of target probabilities. To constrain the α_i to be positive, we parametrize them as α_i = β_i². We can then take partial derivatives of the loss with respect to the β_i via the chain rule:

    ∂L/∂β_i = Σ_j 2 (f_j − t_j) · ∂f_j/∂α_i · 2β_i,   with ∂f_j/∂α_i = (1/z) N(x_j; x_i, σ²)

Given these partials, we use L-BFGS quasi-Newton optimization (Nocedal 1980) to minimize the loss.
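The following is a compact sketch of how such a coverage model could be fit with L-BFGS and used to score a candidate pool. The kernel width, the handling of the normalizer z (folded into the weights here), and the final selection rule in the trailing comment are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np
from scipy.optimize import minimize

def fit_coverage_model(centers, targets, pool, sigma=0.15):
    """Fit non-negative mixture weights alpha = beta**2 so that the kernel
    mixture f matches the per-center targets (1 = labeled correctly,
    0 = labeled incorrectly, 0.5 = unseen), then score the candidate pool."""
    def kernel(a, b):
        # Isotropic Gaussian kernel between each row of `a` and each row of `b`.
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    K = kernel(centers, centers)   # mixture evaluated at the target points, up to the weights
    z = 1.0                        # normalizer folded into the weights in this sketch

    def loss_and_grad(beta):
        alpha = beta ** 2
        f = K @ alpha / z
        resid = f - targets
        loss = float(resid @ resid)
        grad = 2.0 * (K.T @ resid) / z * 2.0 * beta   # chain rule through alpha = beta**2
        return loss, grad

    beta0 = np.full(len(centers), 0.1)
    res = minimize(loss_and_grad, beta0, jac=True, method="L-BFGS-B")
    alpha = res.x ** 2
    coverage = kernel(pool, centers) @ alpha / z      # estimated mastery over the pool
    return coverage

# Sampling from the "complement" of coverage could be as simple as favoring
# low-coverage points, e.g.: next_idx = int(np.argmin(coverage))
```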
Teaching Interface

We designed a teaching interface to test the different methods with our human subjects (Figure 2), consisting of three vertical panes. During training, the leftmost pane shows the current training example, along with buttons with which the subject can indicate whether it is "safe" or "poisonous." The interface then reveals the answer and places the example into the appropriate (safe/poisonous) history pane. The subject can thus review past training examples when making future judgments during both training and testing. There are some important differences in the workflow during testing. First, the subject is not told whether her individual answers are correct, nor are examples added to the history panes; this is to prevent the subject from doing further training during the testing period. Finally, the difficulty indicator is never shown on new examples during testing.

Figure 2: The teaching interface.

Experiments

We recruited a total of 90 subjects at our institution; 54 completed the entire study. As we found in our pilot studies that individuals' capabilities varied widely, we used a within-subjects design in which each subject went through two train/test conditions, the baseline and one other condition. Each condition had a different class boundary for each user. In this way, we were able to measure the relative performance of a given individual between the baseline and one of our methods. To protect against ordering effects, we presented the two conditions in random order. There were a total of five teaching methods; each subject experienced the baseline as well as one other method.

• Base: training examples were chosen randomly.
• Ind: the difficulty indicator was shown; examples were chosen randomly.
• Covg: the coverage model was used to select examples.
• IndCurr: the difficulty indicator was shown, and examples were filtered by difficulty using the curriculum learning approach described above.
• IndCurrCovg: the difficulty indicator was shown; examples were chosen via the coverage model and also filtered by difficulty via curriculum learning.

Subjects were divided into the four possible condition pairs based on the order in which they started the task; as not all completed the task, there is some variance in the number of subjects per pair (see Table 1). Each subject completed two learning tasks, each with 30 training examples followed by 20 test examples; the test examples were always drawn from a uniform distribution. Subjects then filled out a short survey about the tasks.

Results and Analysis

Given the data from the subjects, we sought to understand the differences in performance as well as what may have led to those differences. Below, we consider subjects' performance and perceptions under the various conditions.

Relative Performance by Condition

The core results of our study are in terms of relative performance on the test set between each of the teaching conditions and the baseline. We used the dependent t-test for paired samples (two-sided) to compare each subject's performance in the two conditions they saw (Table 1). The first two conditions (Ind and Covg) did not show significant differences from the baseline, though Ind had a nearly significant negative effect. While not conclusive, it seems possible that the difficulty indicator on its own may have hurt subjects' performance. We hypothesize this might have been due to subjects dismissing more difficult training examples as too hard, or being frustrated by wildly fluctuating difficulty.

Method                 p       d.f. (N-1)   t       Improvement
Base vs. Ind           0.12    12           -1.71   -0.09
Base vs. Covg          0.43    13           -0.81   -0.05
Base vs. IndCurr       0.08    15            1.88    0.08
Base vs. IndCurrCovg   0.007   10            3.36    0.12

Table 1: Test performance improvements (mean differences in per-condition accuracy scores across subjects) with respect to the baseline. Statistically significant results (p < 0.1): IndCurr and IndCurrCovg.

The last two conditions, IndCurr and IndCurrCovg, did produce significant improvements; the strongest effect was for IndCurrCovg, both in magnitude (12 points of absolute improvement, from 61% to 73%) and in significance (p=0.007). This implies that the coverage model in conjunction with curriculum learning has a stronger effect on performance than either on its own. In the remainder of our analysis, we examine the fine-grained differences in behavior and performance to better understand these results.
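The per-pair comparisons in Table 1 correspond to a standard two-sided paired t-test; the snippet below shows the computation on placeholder scores (not the study's data), with one baseline accuracy and one other-condition accuracy per subject.

```python
import numpy as np
from scipy import stats

# Placeholder per-subject accuracy scores for one condition pair (illustrative only).
base_acc = np.array([0.55, 0.60, 0.65, 0.70, 0.50, 0.62, 0.58, 0.61, 0.66, 0.59, 0.63])
other_acc = np.array([0.70, 0.68, 0.75, 0.80, 0.65, 0.72, 0.71, 0.69, 0.78, 0.70, 0.74])

t_stat, p_value = stats.ttest_rel(other_acc, base_acc)  # two-sided paired (dependent) t-test
improvement = float(np.mean(other_acc - base_acc))       # mean per-subject gain
dof = len(base_acc) - 1                                  # d.f. = N - 1, as reported in Table 1
print(f"t={t_stat:.2f}, p={p_value:.3f}, d.f.={dof}, improvement={improvement:.2f}")
```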
Performance During Learning

Our first area of investigation involves how subjects performed during training, by looking at their cumulative training performance (over all examples presented so far). In Figures 3 and 4, we look at the mean performance across subjects for the IndCurr and IndCurrCovg conditions. In both sets of conditions, we see that when subjects are in the baseline condition, their performance improves for the first few examples but then saturates by roughly example 15. For IndCurr, subjects do better earlier on and improve in the same way, but also saturate at a similar point, albeit at a higher level. For IndCurrCovg, on the other hand, we see what appears to be a faster and longer rise in performance before learning saturates. While both the IndCurr and IndCurrCovg conditions have the benefit of curriculum learning, we believe the greater gain in the latter comes from the coverage model's ability to model and highlight areas in which the subject performed poorly or that had not been covered. This may allow the subject to continue learning longer, whereas the other methods may start presenting examples that are redundant.

Figure 3: Mean cumulative subject performance during training for the Base vs. IndCurr condition (N=16).

Figure 4: Mean cumulative subject performance during training for the Base vs. IndCurrCovg condition (N=11).

Distribution of Training Example Difficulty

Given the difference in effect of the coverage model in the IndCurrCovg condition vs. the Covg condition, we suspected that without restriction the model might quickly converge onto the most difficult examples and confuse the user. In Figure 5, we examine the distribution of examples for Base, Covg, and IndCurrCovg over quartiles of difficulty. For IndCurrCovg, the distribution is biased towards easier examples. For Covg, though, it is indeed biased in the other direction.

Figure 5: Distributions of training example difficulties (fraction of training examples in each difficulty quartile for the Base, Covg, and IndCurrCovg conditions).

Performance vs. Example Difficulty

We next wished to investigate how the different strategies affected performance at different levels of example difficulty during testing. We plotted the relative performance for condition pairs over quartiles of difficulty in Figures 6-8. Both IndCurrCovg and IndCurr show improvements over Base across all levels. We also show the relative performance for the Base vs. Covg condition in Figure 8 to investigate why the coverage model was not helpful on its own. Note that while performance on examples of medium hardness increased slightly, performance on the easiest examples dropped substantially. This is potentially a consequence of the more difficult training examples shown in this condition (see Figure 5); the fact that users did worse on the easiest examples suggests they did not learn the concept well at all.

Figure 6: Test performance vs. difficulty: Base vs. IndCurrCovg.

Figure 7: Test performance vs. difficulty: Base vs. IndCurr.

Figure 8: Test performance vs. difficulty: Base vs. Covg.

Survey Results

Finally, we examine the qualitative feedback from users. In Table 2 we look at the answers to "Which condition was more enjoyable?" and in Table 3 to "Which condition was easier?" Subjects overwhelmingly felt the IndCurrCovg condition was better in both respects: the mechanism that most improved performance, then, also most improved the user experience, perhaps because of the greater feeling of mastery. We see this to a lesser extent for IndCurr as well. The results for the other two methods were also interesting: subjects were split on both questions for Base vs. Covg, consistent with their performance. In Base vs. Ind, on the other hand, subjects overwhelmingly preferred the baseline. This adds to our earlier suspicion that the difficulty indicator may have led to confusion and/or frustration.

Condition pair         Base   Other   Same
Base vs. Ind           9      3       0
Base vs. Covg          5      7       2
Base vs. IndCurr       2      12      3
Base vs. IndCurrCovg   3      8       0

Table 2: Survey results for the question "Which condition was more enjoyable?"

Condition pair         Base   Other   Same
Base vs. Ind           9      3       0
Base vs. Covg          5      6       3
Base vs. IndCurr       4      11      2
Base vs. IndCurrCovg   1      9       1

Table 3: Survey results for the question "Which condition was easier?"

Discussion and Future Work

While teaching classification tasks to humans is different from teaching algorithms, it is encouraging to see that some ideas from machine learning are indeed helpful. Given the strong, highly significant (p=0.007) effect on performance, we can say with some confidence that the combination of the difficulty indicator, curriculum learning, and the coverage model (the IndCurrCovg strategy) does improve subjects' performance at learning classification boundaries. With less confidence, we can say that the IndCurr strategy (difficulty indicator and curriculum learning) improves performance as well. As such, it seems the combination of curriculum learning and the coverage model is more powerful than either on its own, as the Covg condition did not produce a significant change in performance. As we had suspected upon seeing the results, the coverage model on its own seemed to emphasize examples that were too hard, whereas when coupled with curriculum learning it helped subjects address their areas of difficulty while not overwhelming them with hard examples early on.
The difficulty indicator, while intuitively a helpful additional source of information about each training example, seemed both to reduce subjects' performance (though the effect was not quite statistically significant) and to diminish their enjoyment, as well as to increase their perception of task difficulty. We hypothesize that subjects became frustrated with and/or ignored examples marked as difficult, and suspect this condition was particularly vexing in the absence of curriculum learning, since hard and easy examples would appear in random sequence. Perhaps seeing the hardness increase in a controlled way (as in the IndCurrCovg condition) reduced or removed the negative effects of the indicator. However, it is also possible that removing the indicator and running a CurrCovg condition would show even greater gains; we hope to investigate this in future work. It may be that subjects are happier and learn better when they simply do not know the relative difficulty of the training example; this could have interesting ramifications for other educational domains as well.

As we outlined earlier, we believe there are a variety of applications for which teaching classification boundaries to humans can be important: we hope these methods and results can be helpful in such contexts, and also that they may extend to a broader set of teaching problems.

References

Andrews, R., Diederich, J., and Tickle, A.B. Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems 8(6) (1995), 373-389.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum Learning. In Proceedings of ICML 2009 (2009).
Castro, R., Kalish, C., Nowak, R., Qian, R., Rogers, T., and Zhu, X. Human Active Learning. In Proceedings of NIPS 2008 (2008).
Freund, Y. and Schapire, R.E. A Short Introduction to Boosting. Journal of the Japanese Society for Artificial Intelligence 14, 5 (1999), 771-780.
Khan, F., Zhu, X., and Mutlu, B. How Do Humans Teach: On Curriculum Learning and Teaching Dimension. In Proceedings of NIPS 2011 (2011).
Krueger, K.A., and Dayan, P. Flexible Shaping: How Learning in Small Steps Helps. Cognition 110, 3 (2009), 380-394.
Kumar, M.P., Packer, B., and Koller, D. Self-Paced Learning for Latent Variable Models. In Proceedings of NIPS 2010 (2010).
Markman, E.M. Categorization and Naming in Children: Problems of Induction. MIT Press (1989).
Mozer, M.C., Pashler, H., Cepeda, N., Lindsey, R., and Vul, E. Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory. In Proceedings of NIPS 2009 (2009).
Nocedal, J. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation 35 (1980), 773-782.
Peterson, G.B. A Day of Great Illumination: B.F. Skinner's Discovery of Shaping. Journal of the Experimental Analysis of Behavior 82 (2004), 317-328.
Settles, B. Active Learning Literature Survey. University of Wisconsin-Madison Computer Science Technical Report 1648 (2010).
Woolf, B.P. Building Intelligent Interactive Tutors. Morgan Kaufmann (2008).