Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles
Pinar Donmez and Jaime Carbonell
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
CIKM '08, Napa Valley, October 2008

Active Learning Assumptions and the Real World
Active learning assumes:
► a unique oracle
► a perfect oracle: always right, never tired
► an oracle that works for free or charges uniformly
The real world has:
► multiple sources of information
► imperfect oracles: unreliable, reluctant
► oracles that are expensive or charge non-uniformly

Solution: Proactive Learning
► Proactive learning generalizes active learning to relax these assumptions
► A decision-theoretic framework to jointly optimize the instance-oracle pair
► A utility optimization problem under a fixed budget constraint

Outline
► Methodology: 3 scenarios
  Reluctance
  Fallibility
  Variable and fixed cost
► Evaluation: problem setup, datasets, results
► Conclusion

Scenario 1: Reluctance
► 2 oracles:
  reliable oracle: expensive, but always answers with a correct label
  reluctant oracle: cheap, but may not respond to some queries
► Define a utility score as the expected value of information at unit cost:
  U(x, k) = P(ans | x, k) · V(x) / C_k

How to Simulate Oracle Unreliability?
► Unreliability can depend on factors such as query difficulty (hard to classify), complexity of the data (requires long and time-consuming analysis), etc. In this work, we model it based on query difficulty.
► Assumptions:
  perfect oracle ~ a classifier with zero training error on the entire data
  imperfect oracle ~ a weak classifier trained on a subset of the entire data
► Train a logistic regression classifier on the subset to obtain P̂(y | x)
► Identify instances with P̂(y | x) ∈ [0.45, 0.5]; these are the unreliable instances
► Challenge: trade off the information value of an instance against the reliability of the oracle

How to Estimate P̂(ans | x, k)?
► Cluster the unlabeled data using k-means
► Ask the reluctant oracle for the label of each cluster centroid
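As a concrete illustration, the utility score U(x, k) defined above turns oracle choice and instance choice into a single argmax over instance-oracle pairs. The following is a minimal sketch, not the authors' code; the array shapes and names are assumptions.

```python
import numpy as np

def select_instance_oracle(p_ans, value, costs):
    """Pick the (instance, oracle) pair maximizing U(x, k) = P(ans|x, k) * V(x) / C_k.

    p_ans: (n_instances, n_oracles) estimated answer probabilities P(ans | x, k)
    value: (n_instances,) information value V(x) of each instance
    costs: (n_oracles,) per-query fee C_k of each oracle
    """
    utility = p_ans * value[:, None] / costs[None, :]
    i, k = np.unravel_index(np.argmax(utility), utility.shape)
    return i, k, utility[i, k]

# Toy example: oracle 0 is reliable (fee 3, always answers),
# oracle 1 is reluctant (fee 1, answers with some probability).
p_ans = np.array([[1.0, 0.5],
                  [1.0, 0.9]])
value = np.array([0.2, 0.6])
costs = np.array([3.0, 1.0])
i, k, u = select_instance_oracle(p_ans, value, costs)
```

Note how the cheap reluctant oracle wins here despite its answer probability being below 1: the cost ratio creates exactly the incentive between oracles that the talk describes.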
► If a label is received: increase P̂(ans | x, reluctant) for nearby points
► If no label is received: decrease P̂(ans | x, reluctant) for nearby points
► Update rule:
  P̂(ans | x, reluctant) = 0.5 · exp( h(x_c^t, y_c^t) · ln Z · (1 − ||x_c^t − x||² / max_d ||x_c^t − x_d||²) )
  where h(x_c^t, y_c^t) ∈ {1, −1} equals 1 when a label is received for centroid x_c^t, and −1 otherwise
► The number of clusters depends on the clustering budget and the oracle fee

► The algorithm works in rounds until the budget is exhausted
► In each round, sampling continues until a label is obtained
► Be careful: you may spend the entire budget on a single attempt
► If no label is received, decrease the utility of the remaining instances:
  Û(x, reluctant) = P̂(ans | x, reluctant) · V(x) / C_round
  where C_round is the amount spent so far in the given round
► This is an adaptive penalization of the reluctant oracle

Algorithm for Scenario 1

Scenario 2: Fallibility
► 2 oracles:
  one perfect but expensive oracle
  one fallible but cheap oracle that always answers
► The algorithm is similar to Scenario 1, with slight modifications
► During exploration:
  the fallible oracle provides the label together with its confidence
  confidence = P̂(y | x) of the fallible oracle
  if P̂(y | x) ∈ [0.45, 0.5], we do not use the label, but we still update P̂(correct | x, k)

Outline of Scenario 2

Scenario 3: Non-uniform Cost
► Uniform cost: fraud detection, face recognition, etc.
► Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
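One way to model a variable-cost oracle is a fee that scales linearly with label uncertainty: zero when the classifier is certain about an instance, and the full base fee at the uniform class distribution. This is a hedged sketch of that idea; the function name, `base_cost` parameter, and scaling are illustrative assumptions, not the paper's code.

```python
import numpy as np

def nonuniform_cost(p, base_cost=1.0):
    """Fee that grows with the label uncertainty of one instance.

    p: (n_classes,) predicted class distribution P(y | x).
    Returns 0 when one class has probability 1, and base_cost when the
    distribution is uniform (max_y P(y | x) = 1/|Y|).
    """
    p = np.asarray(p, dtype=float)
    n = p.size
    # 1 minus the normalized confidence margin above the uniform baseline 1/|Y|
    return base_cost * (1.0 - (p.max() - 1.0 / n) / (1.0 - 1.0 / n))

c_easy = nonuniform_cost([0.99, 0.01])  # near-certain instance: nearly free
c_hard = nonuniform_cost([0.5, 0.5])    # maximally uncertain instance: full fee
```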
► 2 oracles:
  fixed-cost oracle
  variable-cost oracle, whose fee grows with the difficulty of the instance:
  C_nonunif(x) = 1 − ( max_{y∈Y} P̂(y | x) − 1/|Y| ) / ( 1 − 1/|Y| )

Outline of Scenario 3

Evaluation
► Datasets: Face Detection, UCI Letter (V-vs-Y), Spambase, and UCI Adult

Oracle Properties and Costs
► The cost is inversely proportional to reliability
► Higher costs for the fallible oracle, since a noisy label should be penalized more than no label at all
► The cost ratio creates an incentive to choose between oracles

Underlying Sampling Strategy
► Conditional-entropy-based sampling, weighted by a density measure
► Captures the information content of a close neighborhood:
  U(x) = log[ min_{y∈{−1,1}} P̂(y | x, ŵ) · Σ_{k∈N(x)} exp(−||x − k||² / 2) · min_{y∈{−1,1}} P̂(y | k, ŵ) ]
  where N(x) is the set of close neighbors of x

Results: Overall and Reluctance on Spambase Data

Results: Reluctance, Cost Varies Non-uniformly
► Statistically significant results (p < 0.01)

More Light on the Clustering Step
► Run each baseline without the clustering step
► The entire budget is spent in rounds for data elicitation
► No separate clustering budget
► Results on Spambase under Scenario 1, cost ratio 1:3

Conclusion
► Addressed issues with the assumptions of active learning
► Introduced the proactive learning framework
► Analyzed imperfect oracles with differing properties and costs
► Expected utility maximization across oracle-instance pairs
► Effective compared to exploiting a single oracle
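As a closing illustration, the density-weighted utility from the "Underlying Sampling Strategy" slide can be sketched for binary labels as follows. This is an assumed reading of that formula: the kernel width, the names, and the use of min(p, 1 − p) as the per-point uncertainty are illustrative choices, not the authors' implementation.

```python
import numpy as np

def sampling_utility(x, p_x, neighbors, p_neighbors):
    """Density-weighted uncertainty score of candidate x.

    x: (d,) candidate point; p_x: its predicted P(y = +1)
    neighbors: (m, d) close neighbors N(x); p_neighbors: their P(y = +1)
    min_y P(y | .) peaks (at 0.5) for maximally uncertain points, so the score
    favors uncertain candidates surrounded by nearby uncertain neighbors.
    """
    unc_x = min(p_x, 1.0 - p_x)
    d2 = np.sum((np.asarray(neighbors) - np.asarray(x)) ** 2, axis=1)
    unc_n = np.minimum(p_neighbors, 1.0 - np.asarray(p_neighbors))
    return np.log(unc_x * np.sum(np.exp(-d2 / 2.0) * unc_n))

# An uncertain candidate scores higher than a confident one in the same region:
u_uncertain = sampling_utility(np.zeros(2), 0.50, np.array([[0.0, 1.0]]), np.array([0.5]))
u_confident = sampling_utility(np.zeros(2), 0.95, np.array([[0.0, 1.0]]), np.array([0.5]))
```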