A Black-Box approach to machine learning

Yoav Freund
Why do we need learning?
• Computers need functions that map highly variable data:
  - Speech recognition: audio signal -> words
  - Image analysis: video signal -> objects
  - Bio-Informatics: micro-array images -> gene function
  - Data Mining: transaction logs -> customer classification
• For accuracy, functions must be tuned to fit the data source.
• For real-time processing, function computation has to be very fast.
The complexity/accuracy tradeoff
[Figure: error as a function of complexity, with the "trivial performance" level marked]
The speed/flexibility tradeoff
[Figure: flexibility vs. speed; Matlab code, Java code, machine code, digital hardware, and analog hardware range from most flexible to fastest]
Theory vs. practice
• Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in “all” situations.
  - I prove theorems.
• Practitioner: I want a real-time algorithm that performs well on my problem.
  - I experiment.
• My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
  - I do both.
Plan of talk
• The black-box approach
• Boosting
• Alternating decision trees
• A commercial application
• Boosting the margin
• Confidence rated predictions
• Online learning
The black-box approach
• Statistical models are not generators, they are predictors.
• A predictor is a function from observation X to action Z.
• After the action is taken, an outcome Y is observed, which implies a loss L (a real-valued number).
• Goal: find a predictor with small loss (in expectation, with high probability, cumulative, ...).
Main software components
[Diagram: a predictor maps an input x to a prediction z; a learner turns the training examples $(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)$ into a predictor]
We assume the predictor will be applied to examples similar to those on which it was trained.
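To make the two components concrete, here is a minimal sketch of the predictor and learner interfaces as plain functions; the type names and the `evaluate` helper are illustrative, not part of the talk.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")   # observation space
Y = TypeVar("Y")   # outcome space
Z = TypeVar("Z")   # action / prediction space

# A predictor is a function from an observation x to an action z.
Predictor = Callable[[X], Z]

# A learner maps a training set of (x, y) pairs to a predictor.
Learner = Callable[[List[Tuple[X, Y]]], Predictor]

def evaluate(predictor: Predictor, data: List[Tuple[X, Y]],
             loss: Callable[[Z, Y], float]) -> float:
    """Average loss of a predictor on examples drawn like the training data."""
    return sum(loss(predictor(x), y) for x, y in data) / len(data)
```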
Learning in a system
[Diagram: training examples feed the learning system, which produces a predictor; the predictor sits inside the target system, mapping sensor data to actions, and feedback from the target system flows back to the learning system]
Special case: Classification
Observation X - arbitrary (measurable) space
Outcome Y - finite set {1,...,K}, $y \in Y$
Prediction Z - {1,...,K}, $\hat y \in Z$
Usually K = 2 (binary classification)
$$\mathrm{Loss}(\hat y, y) = \begin{cases} 1 & \text{if } y \neq \hat y \\ 0 & \text{if } y = \hat y \end{cases}$$
Batch learning for binary classification
Data distribution: $(x,y) \sim \mathcal{D}$, $y \in \{-1,+1\}$
Generalization error: $\epsilon(h) \doteq P_{(x,y)\sim\mathcal{D}}\big[h(x) \neq y\big]$
Training set: $T = \langle (x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\rangle$, $T \sim \mathcal{D}^m$
Training error: $\hat\epsilon(h) \doteq \frac{1}{m}\sum_{(x,y)\in T} \mathbf{1}\big[h(x)\neq y\big] = P_{(x,y)\sim T}\big[h(x) \neq y\big]$
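As a minimal reading of these definitions, the training error is just the fraction of examples the hypothesis gets wrong; a small sketch (the threshold rule and the toy sample are only an example):

```python
def training_error(h, T):
    """Empirical error of h on a sample T of (x, y) pairs with y in {-1, +1}."""
    return sum(1 for x, y in T if h(x) != y) / len(T)

# Toy usage with a one-dimensional threshold rule.
T = [(-2.0, -1), (-1.0, -1), (0.5, +1), (3.0, +1)]
h = lambda x: +1 if x > 0 else -1
print(training_error(h, T))   # 0.0 on this toy sample
```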
Boosting
Combining weak learners
A weighted training set
$(x_1,y_1,w_1),(x_2,y_2,w_2),\ldots,(x_m,y_m,w_m)$
A weak learner
[Diagram: the weak learner receives the weighted training set $(x_1,y_1,w_1),(x_2,y_2,w_2),\ldots,(x_m,y_m,w_m)$ and outputs a weak rule $h$; applied to the instances $x_1,x_2,\ldots,x_m$, the rule produces predictions $\hat y_1,\hat y_2,\ldots,\hat y_m$ with $\hat y_i \in \{0,1\}$]

The weak requirement:
$$\frac{\sum_{i=1}^m y_i\,\hat y_i\,w_i}{\sum_{i=1}^m w_i} > 0$$
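One concrete weak learner satisfying this requirement is an exhaustive search over one-dimensional threshold rules ("decision stumps") with output in {0, 1}. The stump form below is my illustration, assuming scalar instances; the talk does not fix a particular rule class.

```python
def weak_learner(examples):
    """examples: list of (x, y, w) with scalar x, y in {-1, +1}, weight w >= 0.
    Returns the threshold rule maximizing the weighted correlation
    sum_i(w_i * y_i * h(x_i)) / sum_i(w_i), together with that correlation."""
    best_h, best_corr = None, float("-inf")
    total_w = sum(w for _, _, w in examples)
    thresholds = sorted({x for x, _, _ in examples})
    for theta in thresholds:
        for flip in (False, True):
            def h(x, theta=theta, flip=flip):
                fires = (x > theta) != flip   # flip reverses which side fires
                return 1 if fires else 0
            corr = sum(w * y * h(x) for x, y, w in examples) / total_w
            if corr > best_corr:
                best_h, best_corr = h, corr
    return best_h, best_corr
```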
The boosting process
[Diagram: starting from the uniformly weighted training set $(x_1,y_1,1/n),\ldots,(x_n,y_n,1/n)$, each round re-weights the examples $(x_1,y_1,w_1),\ldots,(x_n,y_n,w_n)$ and calls the weak learner again, producing weak rules $h_1, h_2, h_3, \ldots, h_T$]

Final rule:
$$F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x), \qquad f_T(x) = \mathrm{sign}\big(F_T(x)\big)$$
Adaboost
$F_0(x) \equiv 0$
for $t = 1 \ldots T$:
  $w_i^t = \exp\big(-y_i F_{t-1}(x_i)\big)$
  Get $h_t$ from the weak learner
  $\alpha_t = \frac{1}{2}\ln\!\left(\dfrac{\sum_{i:\,h_t(x_i)=1,\,y_i=1} w_i^t}{\sum_{i:\,h_t(x_i)=1,\,y_i=-1} w_i^t}\right)$
  $F_t = F_{t-1} + \alpha_t h_t$
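A short sketch of this loop in code, assuming weak rules with outputs in {0, 1} and a weak_learner routine like the stump search above; the eps smoothing is my addition to avoid division by zero when one of the weight sums is empty.

```python
import math

def adaboost(data, weak_learner, T, eps=1e-10):
    """data: list of (x, y) with y in {-1, +1}; weak_learner takes a weighted
    sample [(x, y, w), ...] and returns (h, correlation) with h(x) in {0, 1}.
    Returns the final rule f_T together with the alphas and weak rules."""
    F = [0.0] * len(data)                 # F_{t-1}(x_i) for each training example
    alphas, rules = [], []
    for t in range(T):
        w = [math.exp(-y * Fi) for (x, y), Fi in zip(data, F)]
        h, _ = weak_learner([(x, y, wi) for (x, y), wi in zip(data, w)])
        w_plus = sum(wi for (x, y), wi in zip(data, w) if h(x) == 1 and y == +1)
        w_minus = sum(wi for (x, y), wi in zip(data, w) if h(x) == 1 and y == -1)
        alpha = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
        alphas.append(alpha)
        rules.append(h)
        F = [Fi + alpha * h(x) for (x, y), Fi in zip(data, F)]
    def f_T(x):                           # f_T(x) = sign(sum_t alpha_t h_t(x))
        return +1 if sum(a * h(x) for a, h in zip(alphas, rules)) >= 0 else -1
    return f_T, alphas, rules
```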
Main property of Adaboost
If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the training error of the final rule is at most
$$\hat\epsilon(f_T) \le \exp\!\left(-\sum_{t=1}^T \gamma_t^2\right)$$
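For example, if every weak rule had the same advantage $\gamma$, the bound becomes $\hat\epsilon(f_T) \le e^{-\gamma^2 T}$, so on the order of $\ln(1/\epsilon)/\gamma^2$ rounds suffice to drive the training error below $\epsilon$.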
Boosting block diagram
[Diagram: inside the strong learner, the booster maintains the example weights and hands them to the weak learner; the weak learner returns a weak rule, and iterating this loop yields an accurate rule]
What is a good weak learner?
The set of weak rules (features) should be:
• Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Simple enough to allow efficient search for a rule with non-trivial weighted training error.
• Small enough to avoid over-fitting.
Calculation of the prediction from the observations should be very fast.
Alternating decision trees
Freund, Mason 1997
Decision Trees
[Figure: a two-feature example: a decision tree that splits on X>3 and then on Y>5, with leaves labeled +1 and -1, shown next to the corresponding axis-parallel partition of the (X,Y) plane]
A decision tree as a sum of weak rules
[Figure: the same tree rewritten as the sign of a sum of weighted rules; each node carries a real-valued score (e.g. -0.2, +0.1, -0.3, +0.2) and each region of the (X,Y) plane is scored by the sum of the rules that apply to it]
An alternating decision tree
[Figure: an alternating decision tree for the same data; prediction nodes (real-valued scores such as 0.0, +0.2, -0.1, +0.7) alternate with decision nodes (Y<1, X>3, Y>5), and an instance is classified by the sign of the sum of the prediction nodes along all the paths it reaches]
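A small sketch of how such a tree scores an instance; the (precondition, condition, score, score) rule layout and the toy numbers are only an illustration of the sum-of-rules picture above, not the exact data structure from the paper.

```python
def adt_score(x, root_score, rules):
    """rules: list of (precondition, condition, score_if_true, score_if_false),
    where precondition/condition are boolean functions of the instance x."""
    total = root_score
    for precondition, condition, a, b in rules:
        if precondition(x):                     # rule is reached on this instance
            total += a if condition(x) else b   # add the matching prediction node
    return total

# Toy tree in the spirit of the figure: root score 0.0, one rule on x[0] > 3.
rules = [(lambda x: True, lambda x: x[0] > 3, +0.7, -0.1)]
x = (5.0, 2.0)
label = 1 if adt_score(x, 0.0, rules) > 0 else -1
```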
Example: Medical Diagnostics
• Cleve dataset from the UC Irvine database.
• Heart disease diagnostics (+1 = healthy, -1 = sick).
• 13 features from tests (real valued and discrete).
• 303 instances.
AD-tree for heart-disease diagnostics
[Figure: the learned AD-tree; a total score > 0 means Healthy, < 0 means Sick]
Commercial Deployment.
AT&T “buisosity” problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
• Distinguish business/residence customers from call detail information (time of day, length of call, ...).
• 230M telephone numbers, label unknown for ~30%.
• 260M calls / day.
• Required computer resources:
  - Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
  - Significant: calculating the classification for ~70M customers.
  - Negligible: learning (2 hours on 10K training examples on an offline computer).
AD-tree for “buisosity”
AD-tree (Detail)
Quantifiable results
[Figure: accuracy and precision/recall as a function of the classifier score]
• For accuracy 94%, increased coverage from 44% to 56%.
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
Adaboost’s resistance to over-fitting
Why statisticians find Adaboost interesting.
A very curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters

Large margins
$$\mathrm{margin}_{F_T}(x,y) \doteq \frac{y\,F_T(x)}{\sum_{t=1}^T \alpha_t} = \frac{y\sum_{t=1}^T \alpha_t h_t(x)}{\sum_{t=1}^T \alpha_t}$$
$$\mathrm{margin}_{F_T}(x,y) > 0 \;\Longleftrightarrow\; f_T(x) = y$$
Thesis: large margins => reliable predictions.
Very similar to SVM.
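A minimal sketch of this quantity on a training set, assuming nonnegative $\alpha_t$ as in the definition above:

```python
def margins(data, alphas, rules):
    """Normalized margin y * F_T(x) / sum_t(alpha_t) for each (x, y) in data."""
    z = sum(alphas)
    return [y * sum(a * h(x) for a, h in zip(alphas, rules)) / z
            for x, y in data]
```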
Experimental Evidence
Theorem
Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998
H: set of binary functions with VC-dimension d
$$C = \left\{\, \textstyle\sum_i \alpha_i h_i \;\middle|\; h_i \in H,\ \alpha_i \ge 0,\ \textstyle\sum_i \alpha_i = 1 \,\right\}$$
$$T = \langle (x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\rangle;\quad T \sim \mathcal{D}^m$$
$\forall c \in C,\ \forall \theta > 0$, with probability $1-\delta$ w.r.t. $T \sim \mathcal{D}^m$:
$$P_{(x,y)\sim\mathcal{D}}\big[\mathrm{sign}(c(x)) \neq y\big] \;\le\; P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] \;+\; \tilde O\!\left(\frac{\sqrt{d/m}}{\theta}\right) + O\!\left(\log\frac{1}{\delta}\right)$$
No dependence on the number of combined functions!!!
Idea of Proof
Confidence rated predictions
Agreement gives confidence
A motivating example
[Figure: a scatter of + and - training points with three query points marked "?"; queries that fall clearly inside the + or - region get a confident label, while queries in ambiguous regions are marked "Unsure"]
The algorithm
Freund, Mansour, Schapire 2001
Parameters: $\eta > 0,\ \Delta > 0$
Hypothesis weight: $w(h) \doteq e^{-\eta\,\hat\epsilon(h)}$
Empirical log ratio:
$$\hat l(x) \doteq \frac{1}{\eta}\,\ln\!\left(\frac{\sum_{h:\,h(x)=1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}\right)$$
Prediction rule:
$$\hat p_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } \hat l(x) > \Delta \\ \{-1,+1\} & \text{if } |\hat l(x)| \le \Delta \\ -1 & \text{if } \hat l(x) < -\Delta \end{cases}$$
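A sketch of this rule for a finite hypothesis set; the function names are illustrative, and the tiny constant inside the logarithm is my addition to guard against an empty vote on one side.

```python
import math

def confidence_rated_rule(H, train_errors, eta, delta):
    """H: list of hypotheses h(x) in {-1, +1}; train_errors[i] = eps_hat(H[i]).
    Returns a predictor that outputs {+1}, {-1}, or {-1, +1} (unsure)."""
    w = [math.exp(-eta * e) for e in train_errors]   # weight per hypothesis
    def predict(x):
        w_plus = sum(wi for h, wi in zip(H, w) if h(x) == +1)
        w_minus = sum(wi for h, wi in zip(H, w) if h(x) == -1)
        l_hat = math.log((w_plus + 1e-300) / (w_minus + 1e-300)) / eta
        if l_hat > delta:
            return {+1}
        if l_hat < -delta:
            return {-1}
        return {-1, +1}   # unsure: output both labels
    return predict
```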
Suggested tuning
Suppose $H$ is a finite set and that $\eta$ and $\Delta$ are set as functions of $m$, $|H|$ and $\delta$ (with $\eta$ growing roughly like $\sqrt{m}$ and $\Delta$ shrinking roughly like $\ln(8|H|/\delta)/\sqrt{m}$).

This yields:

1) $P(\text{mistake}) \doteq P_{(x,y)\sim\mathcal{D}}\big[\,y \notin \hat p_{\eta,\Delta}(x)\,\big] \;\le\; 2\,\epsilon(h^*) + O\!\left(\frac{\ln\frac{8|H|}{\delta}\,\ln m}{\sqrt{m}}\right)$

2) for $m = \Omega\big(\ln\tfrac{1}{\delta} + \ln|H|\big)$:
$P(\text{abstain}) \doteq P_{(x,y)\sim\mathcal{D}}\big[\,\hat p_{\eta,\Delta}(x) = \{-1,+1\}\,\big] \;\le\; 5\,\epsilon(h^*) + O\!\left(\frac{\ln\tfrac{1}{\delta} + \ln|H|}{\sqrt{m}}\right)$

(Here $h^*$ is the best hypothesis in $H$.)
Confidence Rating block diagram
[Diagram: the training examples $(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)$ and a set of candidate rules are fed to a rater-combiner, which outputs a confidence-rated rule]
Face Detection
Viola & Jones 1999
• Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Using confidence to save time
The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.
[Diagram: all boxes are checked against feature 1; only those that might be a face go on to feature 2, and so on, while the rest are discarded as definitely not a face]
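A sketch of this early-rejection idea in code; the staged thresholds and feature functions are placeholders, not the actual Viola-Jones detector.

```python
def might_be_face(box, features, alphas, thresholds):
    """features[t](box) in {0, 1}; alphas[t] is the feature's weight;
    thresholds[t] is the rejection cutoff after evaluating t+1 features."""
    score = 0.0
    for feature, alpha, theta in zip(features, alphas, thresholds):
        score += alpha * feature(box)
        if score < theta:      # confident rejection: stop evaluating features
            return False       # definitely not a face
    return True                # survived all stages: might be a face
```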
Using confidence to train car detectors
Original image vs. difference image
[Figure: an original image and the corresponding difference image]
Co-training
Blum and Mitchell 98
[Diagram: highway images are fed to a partially trained B/W-based classifier and a partially trained difference-based classifier; the confident predictions of each classifier are used to further train the other]
Co-Training Results
Levin, Freund, Viola 2002
[Figure: performance of the raw-image detector and the difference-image detector, before and after co-training]
Selective sampling
[Diagram: a partially trained classifier scans the unlabeled data and selects a sample of unconfident examples; these are labeled and added to the labeled examples used to continue training]
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
Online learning
Adapting to changes
Online learning
So far, the only statistical assumption was that the data is generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeating game.
An expert is an algorithm that maps the past $(x_1,y_1),(x_2,y_2),\ldots,(x_{t-1},y_{t-1}),x_t$ to a prediction $z_t$.
Suppose we have a set of experts; we believe one is good, but we don't know which one.
Online prediction game
For $t = 1,\ldots,T$:
  Experts generate predictions: $z_t^1, z_t^2, \ldots, z_t^N$
  Algorithm makes its own prediction: $z_t$
  Nature generates the outcome: $y_t$
Total loss of expert $i$: $L_T^i = \sum_{t=1}^T L\big(z_t^i, y_t\big)$
Total loss of the algorithm: $L_T^A = \sum_{t=1}^T L\big(z_t, y_t\big)$
Goal: for any sequence of events,
$$L_T^A - \min_i L_T^i = o(T)$$
A very simple example
• Binary classification
• N experts
• One expert is known to be perfect
• Algorithm: predict like the majority of the experts that have made no mistake so far.
• Bound: $L^A \le \log_2 N$
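A minimal sketch of this majority-of-consistent-experts strategy (the "halving" idea): each mistake eliminates at least half of the surviving experts, which is where the $\log_2 N$ bound comes from.

```python
def halving(expert_predictions, outcomes):
    """expert_predictions[t][i] in {-1, +1}; outcomes[t] in {-1, +1}.
    Returns the number of mistakes made by the majority-vote algorithm."""
    N = len(expert_predictions[0])
    alive = set(range(N))                 # experts with no mistakes so far
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        vote = sum(preds[i] for i in alive)
        guess = +1 if vote >= 0 else -1
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}   # drop mistaken experts
    return mistakes
```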
History of online learning
• Littlestone & Warmuth
• Vovk
• Shafer and Vovk's recent book: "Probability and Finance: It's Only a Game!"
• Innumerable contributions from many fields: Hannan, Blackwell, Davison, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer ...
Lossless compression
X - arbitrary input space
Y - {0,1}
Z - [0,1]
Log loss: $L(z,y) = y \log_2\frac{1}{z} + (1-y)\log_2\frac{1}{1-z}$
Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.
Bayesian averaging
Folk theorem in Information Theory:
$$z_t = \frac{\sum_{i=1}^N w_t^i\, z_t^i}{\sum_{i=1}^N w_t^i}\,; \qquad w_t^i = 2^{-L_{t-1}^i}$$
$$\forall T \ge 0:\quad L_T^A \;\le\; \log_2 \sum_{i=1}^N w_1^i \;-\; \log_2 \sum_{i=1}^N w_{T+1}^i \;\le\; \min_i L_T^i + \log_2 N$$
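A small sketch of this averaging rule under the log loss above, assuming the experts' probabilities stay strictly inside (0, 1):

```python
import math

def bayes_mixture(expert_probs, outcomes):
    """expert_probs[t][i] in (0, 1); outcomes[t] in {0, 1}.
    Returns the algorithm's total log loss."""
    N = len(expert_probs[0])
    cum_loss = [0.0] * N                         # L^i_{t-1} for each expert
    total = 0.0
    for probs, y in zip(expert_probs, outcomes):
        w = [2.0 ** (-L) for L in cum_loss]      # w_t^i = 2^(-L^i_{t-1})
        z = sum(wi * zi for wi, zi in zip(w, probs)) / sum(w)
        total += -math.log2(z) if y == 1 else -math.log2(1.0 - z)
        for i, zi in enumerate(probs):           # update each expert's cumulative loss
            cum_loss[i] += -math.log2(zi) if y == 1 else -math.log2(1.0 - zi)
    return total
```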
Game theoretical loss
X - arbitrary space
Y - a loss for each of N actions: $y \in [0,1]^N$
Z - a distribution over N actions: $p \in [0,1]^N,\ \|p\|_1 = 1$
Loss: $L(p,y) = p \cdot y = E_{i\sim p}\big[y_i\big]$
Learning in games
Freund and Schapire 94
An algorithm which knows T in advance guarantees:
$$L_T^A \;\le\; \min_i L_T^i + \sqrt{2T\ln N} + \ln N$$
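A minimal exponential-weights ("Hedge"-style) sketch for this setting; the learning rate below is one standard choice that uses the known horizon T, not necessarily the exact tuning behind the bound above.

```python
import math

def hedge(loss_vectors, N):
    """loss_vectors: T lists of N losses in [0, 1]. Returns total expected loss."""
    T = len(loss_vectors)
    eta = math.sqrt(2.0 * math.log(N) / T)       # tuned using T, known in advance
    w = [1.0] * N
    total = 0.0
    for y in loss_vectors:
        s = sum(w)
        p = [wi / s for wi in w]                 # play the normalized weights
        total += sum(pi * yi for pi, yi in zip(p, y))   # L(p_t, y_t) = p_t . y_t
        w = [wi * math.exp(-eta * yi) for wi, yi in zip(w, y)]
    return total
```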
Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95
The algorithm cannot observe the full outcome $y_t$. Instead, a single $i_t \in \{1,\ldots,N\}$ is chosen at random according to $p_t$, and only $y_t^{i_t}$ is observed.
We describe an algorithm that guarantees:
$$L_T^A - \min_i L_T^i \;\le\; O\!\left(\sqrt{NT\,\ln\frac{NT}{\delta}}\right)$$
with probability $1-\delta$.
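A sketch in the spirit of the Exp3 family for this bandit setting: only the played action's loss is observed, so the update uses an importance-weighted loss estimate. The exploration rate here is a placeholder, not the tuning from the paper.

```python
import math
import random

def exp3(get_loss, N, T, gamma=0.1):
    """get_loss(t, i) returns the loss in [0, 1] of action i at round t.
    Only the chosen action's loss is queried each round."""
    w = [1.0] * N
    total = 0.0
    for t in range(T):
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / N for wi in w]   # mix in uniform exploration
        i = random.choices(range(N), weights=p)[0]           # play a single action
        loss = get_loss(t, i)                                # only this entry of y_t is seen
        total += loss
        estimate = loss / p[i]                               # importance-weighted estimate
        w[i] *= math.exp(-gamma * estimate / N)              # update only the played action
    return total
```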
Why isn't online learning practical?
• Prescriptions too similar to the Bayesian approach.
• Implementing low-level learning requires a large number of experts.
• Computation increases linearly with the number of experts.
• Potentially very powerful for combining a few high-level experts.
Online learning for detector deployment
[Diagram: a detector library (e.g. "Merl frontal 1.0": B/W frontal face detector, indoor, neutral background, direct front-right lighting) supplies face-detection code; downloaded images are processed by an online-learning, adaptive real-time face detector, which returns face detections and sends feedback back to the library]
The detector can be adaptive!
Summary
• By combining predictors we can:
  - Improve accuracy.
  - Estimate prediction confidence.
  - Adapt on-line.
• To make machine learning practical:
  - Speed up the predictors.
  - Concentrate human feedback on hard cases.
  - Fuse data from several sources.
  - Share predictor libraries.