A Black-Box Approach to Machine Learning
Yoav Freund

Why do we need learning?
• Computers need functions that map highly variable data:
  Speech recognition: audio signal -> words
  Image analysis: video signal -> objects
  Bio-informatics: micro-array images -> gene function
  Data mining: transaction logs -> customer classification
• For accuracy, functions must be tuned to fit the data source.
• For real-time processing, function computation has to be very fast.

The complexity/accuracy tradeoff
[Figure: error vs. complexity; error falls from trivial performance as classifier complexity grows.]

The speed/flexibility tradeoff
[Figure: flexibility vs. speed; Matlab code, Java code, machine code, digital hardware, analog hardware, in order of decreasing flexibility and increasing speed.]

Theory vs. practice
• Theoretician: I want a polynomial-time algorithm that is guaranteed to perform arbitrarily well in "all" situations. I prove theorems.
• Practitioner: I want a real-time algorithm that performs well on my problem. I experiment.
• My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. I do both.

Plan of talk
• The black-box approach
• Boosting
• Alternating decision trees
• A commercial application
• Boosting the margin
• Confidence-rated predictions
• Online learning

The black-box approach
• Statistical models are not generators, they are predictors.
• A predictor is a function from observation X to action Z.
• After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).
• Goal: find a predictor with small loss (in expectation, with high probability, cumulative, ...).

Main software components
A learner maps training examples $(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)$ to a predictor; a predictor maps an observation $x$ to a prediction $z$.
We assume the predictor will be applied to examples similar to those on which it was trained.

Learning in a system
[Diagram: training examples feed the learning system, which outputs a predictor; the predictor sits inside the target system, mapping sensor data to actions, with feedback flowing back as new training examples.]

Special case: classification
Observation X: an arbitrary (measurable) space.
Outcome $y \in Y$: a finite set $\{1,\ldots,K\}$.
Prediction $\hat{y} \in Z = \{1,\ldots,K\}$. Usually $K = 2$ (binary classification).
Loss: $L(\hat{y},y) = 1$ if $\hat{y} \neq y$, and $0$ if $\hat{y} = y$.

Batch learning for binary classification
Data distribution: $(x,y) \sim \mathcal{D}$, $y \in \{-1,+1\}$
Generalization error: $\epsilon(h) \doteq P_{(x,y)\sim\mathcal{D}}\left[h(x) \neq y\right]$
Training set: $T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$, $T \sim \mathcal{D}^m$
Training error: $\hat{\epsilon}(h) \doteq \frac{1}{m}\sum_{(x,y)\in T}\mathbf{1}\left[h(x)\neq y\right] = P_{(x,y)\sim T}\left[h(x)\neq y\right]$

Boosting
Combining weak learners.

A weighted training set
$(x_1,y_1,w_1),(x_2,y_2,w_2),\ldots,(x_m,y_m,w_m)$

A weak learner
The weak learner receives a weighted training set $(x_1,y_1,w_1),\ldots,(x_m,y_m,w_m)$ and the instances $x_1,x_2,\ldots,x_m$, and outputs a weak rule $h$ with predictions $\hat{y}_1,\hat{y}_2,\ldots,\hat{y}_m$, $\hat{y}_i \in \{-1,+1\}$.
The weak requirement: $\dfrac{\sum_{i=1}^m w_i\,y_i\,\hat{y}_i}{\sum_{i=1}^m w_i} \ge \gamma > 0$

The boosting process
Start with uniform weights $(x_1,y_1,1/n),\ldots,(x_n,y_n,1/n)$; the weak learner returns $h_1$; the booster reweights the examples and calls the weak learner again to get $h_2$, and so on through $h_T$.
Final rule: $F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)$, $\quad f_T(x) = \mathrm{sign}\left(F_T(x)\right)$

AdaBoost
$F_0(x) \equiv 0$
For $t = 1,\ldots,T$:
  $w_i^t = \exp\left(-y_i F_{t-1}(x_i)\right)$
  Get $h_t$ from the weak learner
  $\alpha_t = \frac{1}{2}\ln\dfrac{\sum_{i:\,h_t(x_i)=1,\ y_i=1} w_i^t}{\sum_{i:\,h_t(x_i)=1,\ y_i=-1} w_i^t}$
  $F_t = F_{t-1} + \alpha_t h_t$

Main property of AdaBoost
If the advantages of the weak rules over random guessing are $\gamma_1,\gamma_2,\ldots,\gamma_T$, then the training error of the final rule is at most
$$\hat{\epsilon}\left(f_T\right) \;\le\; \exp\left(-2\sum_{t=1}^T \gamma_t^2\right)$$

Boosting block diagram
[Diagram: inside the strong learner, the booster maintains example weights, passes them to the weak learner, receives a weak rule in return, and accumulates the weak rules into an accurate rule.]
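To make the boosting slides concrete, here is a minimal runnable sketch (not from the talk) of AdaBoost with one-feature threshold rules ("decision stumps") as the weak learner. The stump search is an illustrative choice of weak learner, and $\alpha_t$ uses the classic form $\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, which applies to rules that predict $\pm 1$ on every example; the slide's ratio form also covers rules that may abstain.

```python
import numpy as np

def stump_learner(X, y, w):
    """Illustrative weak learner: search one-feature threshold rules
    ("decision stumps") for the smallest weighted training error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = w[pred != y].sum() / w.sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    return lambda X: np.where(X[:, j] > thresh, sign, -sign)

def adaboost(X, y, T):
    """AdaBoost as on the slides: w_i = exp(-y_i F_{t-1}(x_i)).
    Assumes labels y are in {-1, +1} and 0 < eps_t < 1/2
    (a real implementation would guard the eps_t = 0 case)."""
    F = np.zeros(X.shape[0])          # F_{t-1}(x_i) on the training set
    rules, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)            # example weights
        h = stump_learner(X, y, w)
        pred = h(X)
        eps = w[pred != y].sum() / w.sum()   # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        F += alpha * pred             # F_t = F_{t-1} + alpha_t h_t
        rules.append(h)
        alphas.append(alpha)
    # final rule: f_T(x) = sign(sum_t alpha_t h_t(x))
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, rules)))
```

With weak-rule advantages $\gamma_t = 1/2 - \epsilon_t$, this is exactly the process whose training error the bound above controls.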
What is a good weak learner?
The set of weak rules (features) should be:
• Flexible enough to be (weakly) correlated with most conceivable relations between the feature vector and the label.
• Simple enough to allow an efficient search for a rule with non-trivial weighted training error.
• Small enough to avoid overfitting.
Calculation of the prediction from the observations should be very fast.

Alternating decision trees (Freund, Mason 1997)

Decision trees
[Figure: a decision tree over the (X, Y) plane: the root tests X>3, one branch tests Y>5, and the leaves predict +1 or -1, carving the plane into rectangular regions.]

A decision tree as a sum of weak rules
[Figure: the same tree with a real-valued score at every node (e.g., -0.2 at the root, +0.1, -0.1, -0.3, +0.2 below); the prediction is the sign of the sum of the scores along the path an example follows.]

An alternating decision tree
[Figure: an alternating decision tree with root prediction -0.2 and splitter nodes Y<1, X>3, and Y>5 hanging off prediction nodes (+0.7, +0.2, +0.1, -0.1, -0.3, ...); an example may travel several paths in parallel, and the output is the sign of the total score it collects.]

Example: medical diagnostics
• Cleve dataset from the UC Irvine database.
• Heart-disease diagnostics (+1 = healthy, -1 = sick).
• 13 features from tests (real-valued and discrete).
• 303 instances.

AD-tree for heart-disease diagnostics
[Figure: the learned AD-tree; total score > 0 means healthy, < 0 means sick.]

Commercial deployment

AT&T "buisosity" problem (Freund, Mason, Rogers, Pregibon, Cortes 2000)
• Distinguish business from residence customers using call-detail information (time of day, length of call, ...).
• 230M telephone numbers, label unknown for ~30%.
• 260M calls / day.
• Required computer resources:
  Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock).
  Significant: calculating the classification for ~70M customers.
  Negligible: learning (2 hours on 10K training examples on an offline computer).

AD-tree for "buisosity"
[Figure: the learned AD-tree.]

AD-tree (detail)
[Figure: close-up of part of the tree.]

Quantifiable results
[Figure: precision/recall and accuracy as a function of the score threshold.]
• For accuracy 94%, coverage increased from 44% to 56%.
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

AdaBoost's resistance to overfitting
Why statisticians find AdaBoost interesting.

A very curious phenomenon
Boosting decision trees: using <10,000 training examples, we fit >2,000,000 parameters.

Large margins
$$\mathrm{margin}_{F_T}(x,y) \;\doteq\; \frac{y\sum_{t=1}^T \alpha_t h_t(x)}{\sum_{t=1}^T |\alpha_t|} \;=\; \frac{y\,F_T(x)}{\sum_{t=1}^T |\alpha_t|}$$
$\mathrm{margin}_{F_T}(x,y) > 0 \iff f_T(x) = y$
Thesis: large margins => reliable predictions.
Very similar to SVM.

Experimental evidence

Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)
$H$: a set of binary functions with VC-dimension $d$.
$C = \left\{\sum_i \alpha_i h_i \,\middle|\, h_i \in H,\ \alpha_i \ge 0,\ \sum_i \alpha_i = 1\right\}$
$T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$, $T \sim \mathcal{D}^m$
$\forall c \in C$, $\forall \theta > 0$, with probability $1-\delta$ w.r.t. $T \sim \mathcal{D}^m$:
$$P_{(x,y)\sim\mathcal{D}}\left[\mathrm{sign}(c(x)) \neq y\right] \;\le\; P_{(x,y)\sim T}\left[\mathrm{margin}_c(x,y) \le \theta\right] + \tilde{O}\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right)$$
No dependence on the number of combined functions!

Idea of proof

Confidence-rated predictions
Agreement gives confidence.

A motivating example
[Figure: + and - training points in the plane; points deep inside a homogeneous region get a confident label, while points near the boundary between the classes, or far from all the training data, are marked "Unsure".]

The algorithm (Freund, Mansour, Schapire 2001)
Parameters $\eta > 0$, $\Delta \ge 0$
Hypothesis weight: $w(h) \doteq e^{-\eta\,\hat{\epsilon}(h)}$
Empirical log ratio: $\hat{l}(x) \;\doteq\; \frac{1}{\eta}\ln\dfrac{\sum_{h:\,h(x)=+1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}$
Prediction rule:
$$\hat{p}_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } \hat{l}(x) > \Delta \\ \{-1,+1\} & \text{if } \left|\hat{l}(x)\right| \le \Delta \\ -1 & \text{if } \hat{l}(x) < -\Delta \end{cases}$$

Suggested tuning
Suppose $H$ is a finite set. Tuning $\eta$ and $\Delta$ as functions of $m$, $|H|$, and the confidence parameter $\delta$ (on the order of $\sqrt{\ln(8|H|/\delta)/m}$) yields, for the best hypothesis $h^* \in H$:
1) $P(\text{mistake}) = P_{(x,y)\sim\mathcal{D}}\left[y \notin \hat{p}_{\eta,\Delta}(x)\right] \le 2\,\epsilon(h^*) + O\!\left(m^{-1/2}\right)$
2) $P(\text{abstain}) = P_{(x,y)\sim\mathcal{D}}\left[\hat{p}_{\eta,\Delta}(x) = \{-1,+1\}\right] \le 5\,\epsilon(h^*) + O\!\left(m^{-1/2}\right)$

Confidence-rating block diagram
[Diagram: training examples $(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)$ and a set of candidate rules enter a rater-combiner, which outputs a confidence-rated rule.]
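Here is a minimal sketch of the confidence-rated rule from the preceding slides. It assumes each hypothesis is a Python function mapping one instance to ±1; the function name and the tiny-constant guard are my additions, not part of the talk.

```python
import numpy as np

def log_ratio_predictor(hypotheses, train, eta, delta):
    """Weight each hypothesis h by w(h) = exp(-eta * training error),
    then predict with the empirical log ratio, abstaining inside
    the band [-delta, +delta].
    hypotheses: list of functions x -> {-1, +1};
    train: list of (x, y) pairs with y in {-1, +1}."""
    errs = np.array([np.mean([h(x) != y for x, y in train])
                     for h in hypotheses])
    w = np.exp(-eta * errs)        # w(h) = e^{-eta * eps_hat(h)}
    tiny = 1e-300                  # guard: avoid log(0) when one side is empty

    def predict(x):
        votes = np.array([h(x) for h in hypotheses])
        w_plus = w[votes == +1].sum()
        w_minus = w[votes == -1].sum()
        l_hat = np.log((w_plus + tiny) / (w_minus + tiny)) / eta
        if l_hat > delta:
            return +1
        if l_hat < -delta:
            return -1
        return 0                   # abstain: the weighted vote is too close

    return predict
```

The rule abstains exactly where the low-error hypotheses disagree, which is the "agreement gives confidence" idea on the earlier slide.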
Face detection (Viola & Jones 1999)
• Paul Viola and Mike Jones developed a face detector that works in real time (15 frames per second).
[Video demo.]

Using confidence to save time
• The detector combines 6000 simple features using AdaBoost.
• In most boxes, only 8-9 features are calculated.
[Diagram: a cascade; all boxes are checked against feature 1, only those that might be a face go on to feature 2, and so on; boxes that are definitely not a face are rejected immediately.]

Using confidence to train car detectors

Original image vs. difference image
[Video: the original image side by side with the difference image.]

Co-training (Blum and Mitchell 98)
[Diagram: a partially trained B/W-based classifier and a partially trained difference-based classifier both process highway images; the confident predictions of each classifier are fed back as labeled training data for the other.]

Co-training results (Levin, Freund, Viola 2002)
[Figure: detections of the raw-image detector and the difference-image detector, before and after co-training.]

Selective sampling
[Diagram: unlabeled data passes through the partially trained classifier; a sample of unconfident examples is sent out for labeling, and the resulting labeled examples are added to the training set.]
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby.

Online learning
Adapting to changes.

Online learning
So far, the only statistical assumption was that the data is generated IID. Can we get rid of that assumption?
Yes, if we consider prediction as a repeated game.
An expert is an algorithm that maps the past $(x_1,y_1),(x_2,y_2),\ldots,(x_{t-1},y_{t-1}),x_t$ to a prediction $z_t$.
Suppose we have a set of experts; we believe one is good, but we don't know which one.

Online prediction game
For $t = 1,\ldots,T$:
  Experts generate predictions: $z_t^1, z_t^2, \ldots, z_t^N$
  The algorithm makes its own prediction: $z_t$
  Nature generates the outcome: $y_t$
Total loss of expert $i$: $L_T^i = \sum_{t=1}^T L\left(z_t^i, y_t\right)$
Total loss of the algorithm: $L_T^A = \sum_{t=1}^T L\left(z_t, y_t\right)$
Goal: for any sequence of events, $L_T^A - \min_i L_T^i = o(T)$

A very simple example
• Binary classification.
• N experts.
• One expert is known to be perfect.
• Algorithm: predict like the majority of the experts that have made no mistake so far.
• Bound: $L^A \le \log_2 N$
(A code sketch of this strategy appears after the slides below.)

History of online learning
• Littlestone & Warmuth
• Vovk
• Shafer and Vovk's recent book: "Probability and Finance: It's Only a Game!"
• Innumerable contributions from many fields: Hannan, Blackwell, Davison, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer, ...

Lossless compression
X: an arbitrary input space. $Y = \{0,1\}$. $Z = [0,1]$.
Log loss: $L(z,y) = y\log_2\frac{1}{z} + (1-y)\log_2\frac{1}{1-z}$
Entropy, lossless compression, MDL; statistical likelihood, standard probability theory.

Bayesian averaging
Folk theorem in information theory:
$$z_t = \frac{\sum_{i=1}^N w_t^i\,z_t^i}{\sum_{i=1}^N w_t^i}\,, \qquad w_t^i = 2^{-L_{t-1}^i}$$
$$L_T^A \;\le\; \log_2\sum_{i=1}^N w_1^i \;-\; \log_2\sum_{i=1}^N w_{T+1}^i \;\le\; \min_i L_T^i + \log_2 N$$

Game-theoretical loss
X: an arbitrary space.
Y: a loss for each of N actions, $\mathbf{y} \in [0,1]^N$.
Z: a distribution over the N actions, $\mathbf{p} \in [0,1]^N$, $\|\mathbf{p}\|_1 = 1$.
Loss: $L(\mathbf{p},\mathbf{y}) = \mathbf{p}\cdot\mathbf{y} = E_{i\sim\mathbf{p}}\left[y_i\right]$

Learning in games (Freund and Schapire 94)
An algorithm which knows T in advance guarantees:
$$L_T^A \;\le\; \min_i L_T^i + \sqrt{2T\ln N} + \ln N$$
(A sketch of the underlying exponential-weights update also appears below.)

Multi-armed bandits (Auer, Cesa-Bianchi, Freund, Schapire 95)
The algorithm cannot observe the full outcome $y_t$. Instead, a single action $i_t \in \{1,\ldots,N\}$ is chosen at random according to $\mathbf{p}_t$, and only $y_t^{i_t}$ is observed.
We describe an algorithm that guarantees, with probability $1-\delta$:
$$L_T^A \;\le\; \min_i L_T^i + O\left(\sqrt{NT\ln\frac{NT}{\delta}}\right)$$

Why isn't online learning practical?
• Prescriptions too similar to the Bayesian approach.
• Implementing low-level learning requires a large number of experts.
• Computation increases linearly with the number of experts.
• Potentially very powerful for combining a few high-level experts.
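As promised on the "very simple example" slide, here is a sketch of that majority-of-survivors strategy. The interface (experts as functions, a stream of (x, y) pairs) is an illustrative assumption; the point is that every mistake eliminates at least half of the surviving experts, so with a perfect expert present at most $\log_2 N$ mistakes are made.

```python
def halving(experts, stream):
    """Predict with the majority of the experts that have made no
    mistake so far. experts: list of functions x -> {-1, +1};
    stream: iterable of (x, y) pairs with y in {-1, +1}."""
    alive = list(experts)        # experts with a perfect record so far
    mistakes = 0
    for x, y in stream:
        votes = [e(x) for e in alive]
        pred = 1 if votes.count(1) >= votes.count(-1) else -1
        if pred != y:
            mistakes += 1        # majority was wrong, so at least half
                                 # of the survivors are now eliminated
        alive = [e for e, v in zip(alive, votes) if v == y]
    return mistakes
```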
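And a sketch of the exponential-weights ("Hedge") update behind the learning-in-games bound: each expert's weight decays exponentially with its cumulative loss, and the algorithm plays the normalized weights as its distribution over actions. The full-information interface (all N losses revealed each round) matches the game-theoretical setting above, not the bandit one; the array layout is an assumption of mine.

```python
import numpy as np

def hedge(expert_losses, eta):
    """Exponential-weights sketch. expert_losses: a T x N array whose
    entry [t, i] is the loss in [0, 1] of expert/action i at round t,
    revealed to the algorithm after it commits to a distribution."""
    T, N = expert_losses.shape
    w = np.ones(N)                            # one weight per expert
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                       # play distribution p_t
        total_loss += p @ expert_losses[t]    # expected loss this round
        w *= np.exp(-eta * expert_losses[t])  # downweight lossy experts
    return total_loss
```

Choosing $\eta \approx \sqrt{2\ln N / T}$ keeps the total loss within roughly $\sqrt{2T\ln N}$ of the best expert's, matching the guarantee on the slide.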
Online learning for detector deployment
[Diagram: an online-learning adaptive real-time face detector downloads images and produces face detections; it pulls face-detection code from a detector library (e.g., "Merl frontal 1.0: B/W frontal face detector; indoor, neutral background, direct front-right lighting") and sends feedback back to the library. The detector can be adaptive!]

Summary
• By combining predictors we can:
  Improve accuracy.
  Estimate prediction confidence.
  Adapt online.
• To make machine learning practical:
  Speed up the predictors.
  Concentrate human feedback on hard cases.
  Fuse data from several sources.
  Share predictor libraries.