Learning From Data
Lecture 4: Real Learning is Feasible

• Real Learning vs. Verification
• The Two Step Solution to Learning
• Closer to Reality: Error and Noise

M. Magdon-Ismail
CSCI 4100/6100
Recap: Verification

[Figure: a single hypothesis h on (Age, Income) credit data; the data set D gives Ein(h) = 2/9 as an estimate of Eout(h).]

Hoeffding: Eout(h) ≈ Ein(h) (with high probability):

$$P[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,] \le 2e^{-2N\epsilon^2}.$$
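As a quick sanity check of this bound, here is a hedged Python sketch: a biased coin stands in for a fixed hypothesis h, and the values of Eout(h), N, and ε are illustrative choices, not values from the lecture.

```python
import math
import random

E_OUT, N, EPS, TRIALS = 0.3, 100, 0.1, 20_000  # illustrative parameters

# Estimate P[|Ein(h) - Eout(h)| > eps] by redrawing the data set many times.
deviations = sum(
    abs(sum(random.random() < E_OUT for _ in range(N)) / N - E_OUT) > EPS
    for _ in range(TRIALS)
)
print(f"simulated deviation rate: {deviations / TRIALS:.4f}")
print(f"Hoeffding bound 2e^(-2N*eps^2): {2 * math.exp(-2 * N * EPS**2):.4f}")
```

The simulated rate comes out well under the bound, as Hoeffding guarantees for a fixed h.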
Recap: Selection Bias Illustrated with Coins

Coin tossing example:

• If we toss one coin N times and get no heads, it is very surprising:

$$P = \frac{1}{2^N}.$$

We expect it is biased: P[heads] ≈ 0.

• Toss 70 coins N times each, and find one with no heads. Is it surprising? Now

$$P = 1 - \left(1 - \frac{1}{2^N}\right)^{70}.$$

Do we expect P[heads] ≈ 0 for the selected coin? (Both probabilities are computed numerically below.)

This is similar to the "birthday problem": among 30 people, two will likely share the same birthday.

• This is called selection bias. Selection bias is a very serious trap, for example in medical screening.

Search Causes Selection Bias
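To put numbers on these two formulas, here is a minimal Python sketch; the choice of N = 10 tosses per coin is an assumption for illustration.

```python
# Exact probabilities from the formulas above, for an assumed N = 10 tosses.
N, COINS = 10, 70
p_one = 1 / 2**N                     # a single fair coin shows no heads
p_any = 1 - (1 - 1 / 2**N) ** COINS  # at least one of 70 fair coins does

print(f"one coin : {p_one:.6f}")   # ~0.001: very surprising
print(f"70 coins : {p_any:.6f}")   # ~0.066: not surprising at all
```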
Real Learning – Finite Learning Models

[Figure: M hypotheses h1, h2, h3, ..., hM, each a different classifier on (Age, Income) data; on the data set D they achieve Ein(h1) = 2/9, Ein(h2) = 0, Ein(h3) = 5/9, ..., Ein(hM) = 6/9, as estimates of Eout(h1), Eout(h2), Eout(h3), ..., Eout(hM).]

Pick the hypothesis with minimum Ein; will Eout be small?
Hoeffding says that Ein(g) ≈ Eout(g) for Finite H

$$P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 2|H|e^{-2\epsilon^2 N}, \quad \text{for any } \epsilon > 0,$$

$$P[\,|E_{in}(g) - E_{out}(g)| \le \epsilon\,] \ge 1 - 2|H|e^{-2\epsilon^2 N}, \quad \text{for any } \epsilon > 0.$$

We don't care how g was obtained, as long as it is from H.

Some Basic Probability (events A, B):
• Implication: if A =⇒ B (A ⊆ B), then P[A] ≤ P[B].
• Union Bound: P[A or B] = P[A ∪ B] ≤ P[A] + P[B].
• Bayes' Rule: P[A|B] = P[B|A] · P[A] / P[B].

Proof: Let M = |H|. Since g is one of the hm, the event "|Ein(g) − Eout(g)| > ε" implies

"|Ein(h1) − Eout(h1)| > ε" OR ... OR "|Ein(hM) − Eout(hM)| > ε".

So, by the implication and union bounds,

$$P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le P\left[\bigcup_{m=1}^{M} \{|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\}\right] \le \sum_{m=1}^{M} P[\,|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\,] \le 2Me^{-2\epsilon^2 N}.$$

(The last inequality holds because we can apply the Hoeffding bound to each summand.)
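To see the factor of M at work, here is a hedged Python simulation; fair coins stand in for hypotheses with Eout = 0.5, and N, M, ε are illustrative choices. A fixed hypothesis respects the single-hypothesis bound, while the hypothesis selected by minimum Ein does not, which is exactly why the factor |H| is needed.

```python
import math
import random

N, M, EPS, TRIALS = 100, 50, 0.1, 2_000  # illustrative parameters
P_HEADS = 0.5  # every "hypothesis" is a fair coin, so Eout(h) = 0.5

fixed_dev = picked_dev = 0
for _ in range(TRIALS):
    # Ein(hm) for each of the M hypotheses, all judged on the same N points.
    eins = [sum(random.random() < P_HEADS for _ in range(N)) / N for _ in range(M)]
    fixed_dev += abs(eins[0] - P_HEADS) > EPS     # a fixed hypothesis h1
    picked_dev += abs(min(eins) - P_HEADS) > EPS  # g selected by minimum Ein

single = 2 * math.exp(-2 * EPS**2 * N)
print(f"fixed h1 : deviation rate ~ {fixed_dev / TRIALS:.3f}  (bound {single:.3f})")
print(f"min-Ein g: deviation rate ~ {picked_dev / TRIALS:.3f}  (bound {M * single:.3f})")
```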
Interpreting the Hoeffding Bound for Finite |H|

$$P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 2|H|e^{-2\epsilon^2 N}, \quad \text{for any } \epsilon > 0,$$

$$P[\,|E_{in}(g) - E_{out}(g)| \le \epsilon\,] \ge 1 - 2|H|e^{-2\epsilon^2 N}, \quad \text{for any } \epsilon > 0.$$

We don't care how g was obtained, as long as g ∈ H.

Theorem. With probability at least 1 − δ,

$$E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\log\frac{2|H|}{\delta}}.$$

Proof: Let δ = 2|H|e^{−2ε²N}. Then

$$P[\,|E_{in}(g) - E_{out}(g)| \le \epsilon\,] \ge 1 - \delta.$$

In words, with probability at least 1 − δ, |Ein(g) − Eout(g)| ≤ ε, which implies

Eout(g) ≤ Ein(g) + ε.

From the definition of δ, solve for ε:

$$\epsilon = \sqrt{\frac{1}{2N}\log\frac{2|H|}{\delta}}.$$
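The error bar is easy to compute. In this hedged Python sketch, the values N = 1000, |H| = 100, and δ = 0.05 are illustrative assumptions, not values from the lecture.

```python
import math

def error_bar(N: int, H_size: int, delta: float) -> float:
    """The Hoeffding error bar: sqrt(log(2|H|/delta) / (2N))."""
    return math.sqrt(math.log(2 * H_size / delta) / (2 * N))

# With probability >= 95%, Eout <= Ein + 0.064 for these illustrative values.
print(f"{error_bar(N=1000, H_size=100, delta=0.05):.3f}")
```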
Ein Reaches Outside to Eout when |H| is Small

$$E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\log\frac{2|H|}{\delta}}.$$

If N ≫ ln |H|, then Eout(g) ≈ Ein(g).

• Does not depend on X, P(x), f, or how g is found.
• Only requires P(x) to generate the data points independently, and also the test point.

What about Eout ≈ 0?
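Rearranging the error bar for N gives a rough sample-complexity estimate and shows why N only needs to grow with ln |H|. In this hedged sketch, ε = 0.05, δ = 0.05, and the |H| values are illustrative choices.

```python
import math

def samples_needed(H_size: int, eps: float, delta: float) -> int:
    """Smallest N with 2|H| e^(-2 eps^2 N) <= delta, i.e. error bar <= eps."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

for H_size in (10, 100, 10_000):
    print(f"|H| = {H_size:>6}: N >= {samples_needed(H_size, eps=0.05, delta=0.05)}")
# A 1000-fold increase in |H| only roughly doubles the required N.
```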
The 2 Step Approach to Getting Eout ≈ 0

(1) Eout(g) ≈ Ein(g).
(2) Ein(g) ≈ 0.

Together, these ensure Eout ≈ 0.

How do we verify (1), since we do not know Eout? We must ensure it theoretically: Hoeffding.
We can ensure (2) directly (for example with the PLA), modulo being able to guarantee (1).

There is a tradeoff (sketched numerically below):

• Small |H| =⇒ Ein ≈ Eout, because the error bar $\sqrt{\frac{1}{2N}\log\frac{2|H|}{\delta}}$ is small.
• Large |H| =⇒ Ein ≈ 0 is more likely.

[Figure: error versus |H|. The in-sample error decreases with |H| while the model complexity term grows, so the out-of-sample error is minimized at some intermediate |H|*.]
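A hedged numerical sketch of this tradeoff: the complexity penalty is the error bar from the bound above, while the Ein values are purely hypothetical stand-ins for a model whose fit improves with |H| (they are not from the lecture).

```python
import math

N, DELTA = 1000, 0.05  # illustrative values

# Hypothetical Ein values that shrink as the model grows.
for H_size, ein in [(2, 0.20), (16, 0.10), (256, 0.05), (2**16, 0.02), (2**32, 0.015)]:
    penalty = math.sqrt(math.log(2 * H_size / DELTA) / (2 * N))  # complexity term
    print(f"|H| = 2^{int(math.log2(H_size)):>2}: Ein = {ein:.3f}, "
          f"penalty = {penalty:.3f}, bound on Eout = {ein + penalty:.3f}")
# The bound on Eout first falls and then rises: the tradeoff has a sweet spot |H|*.
```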
Feasibility of Learning (Finite Models)

• No Free Lunch: we can't know anything outside D for sure.
• We can "learn" with high probability if D is i.i.d. from P(x):
  Eout ≈ Ein (Ein can reach outside the data set to Eout).
• We want Eout ≈ 0.
• The two step solution. We trade Eout ≈ 0 for two goals:
  (i) Eout ≈ Ein;
  (ii) Ein ≈ 0.
  We know Ein, not Eout, but we can ensure (i) if |H| is small.
  This is a big step!
• What about infinite H, such as the perceptron?
“Complex” Target Functions are Harder to Learn

What happened to the “difficulty” (complexity) of f?

• Simple f =⇒ a small H suffices to get Ein ≈ 0 (need smaller N).
• Complex f =⇒ need a large H to get Ein ≈ 0 (need larger N).
Revising the Learning Problem – Adding in Probability

[Diagram: the unknown target function f : X → Y produces yn = f(xn); the unknown input distribution P(x) generates the inputs x1, x2, ..., xN; the training examples (x1, y1), (x2, y2), ..., (xN, yN) feed the learning algorithm A, which selects the final hypothesis g, with g(x) ≈ f(x), from the hypothesis set H.]
Error and Noise

Error Measure: how to quantify that h ≈ f.
Noise: yn ≠ f(xn).
Finger Print Recognition

Two types of error:

                f = +1 (you)    f = −1 (intruder)
h = +1          no error        false accept
h = −1          false reject    no error

In any application you need to think about how to penalize each type of error (a small computation with these matrices follows after the take-away). For example:

Supermarket:
                f = +1    f = −1
h = +1            0          1
h = −1           10          0

CIA:
                f = +1    f = −1
h = +1            0        1000
h = −1            1          0
Take Away

The error measure is specified by the user. If not, choose one that is
– plausible (conceptually appealing)
– friendly (practically appealing)
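To make the cost matrices concrete, here is a hedged Python sketch of a weighted in-sample error; the little arrays of predictions and labels are made-up illustrations.

```python
# Cost matrices keyed by (h(x), f(x)) with labels in {+1, -1}.
SUPERMARKET = {(+1, +1): 0, (+1, -1): 1,    (-1, +1): 10, (-1, -1): 0}
CIA         = {(+1, +1): 0, (+1, -1): 1000, (-1, +1): 1,  (-1, -1): 0}

def weighted_ein(h_labels, f_labels, cost):
    """Average the cost-matrix entry over the data set."""
    return sum(cost[(h, f)] for h, f in zip(h_labels, f_labels)) / len(f_labels)

# Made-up data: one false reject (h=-1, f=+1) and one false accept (h=+1, f=-1).
h = [+1, -1, +1, -1]
f = [+1, +1, -1, -1]
print("Supermarket Ein:", weighted_ein(h, f, SUPERMARKET))  # (10 + 1) / 4 = 2.75
print("CIA Ein:        ", weighted_ein(h, f, CIA))          # (1 + 1000) / 4 = 250.25
```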
Almost All Error Measures are Pointwise

Compare h and f on individual points x using a pointwise error e(h(x), f(x)):

Binary error (classification): e(h(x), f(x)) = ⟦h(x) ≠ f(x)⟧.
Squared error (regression): e(h(x), f(x)) = (h(x) − f(x))².

In-sample error:

$$E_{in}(h) = \frac{1}{N}\sum_{n=1}^{N} e(h(x_n), f(x_n)).$$

Out-of-sample error:

$$E_{out}(h) = E_x[\,e(h(x), f(x))\,].$$
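A minimal Python sketch of these two pointwise errors and the resulting in-sample average; the data values are made-up illustrations.

```python
def ein(h_vals, f_vals, e):
    """In-sample error: the average pointwise error over the data set."""
    return sum(e(h, f) for h, f in zip(h_vals, f_vals)) / len(f_vals)

binary_error  = lambda h, f: float(h != f)   # classification
squared_error = lambda h, f: (h - f) ** 2    # regression

# Made-up values for illustration.
print(ein([+1, -1, +1], [+1, +1, +1], binary_error))          # 1/3
print(ein([0.9, 0.2, 0.4], [1.0, 0.0, 0.5], squared_error))   # 0.02
```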
Noisy Targets

Consider a customer's credit data:

age             32 years
gender          male
salary          40,000
debt            26,000
years in job    1 year
years at home   3 years
...             ...

Approve for credit? Two customers with the same credit data can have different behaviors.

The target ‘function’ is not a deterministic function but a stochastic one:

‘f(x)’ = P(y|x).
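A hedged sketch of a stochastic target: instead of a fixed label f(x), each x receives a label drawn from P(y|x). The logistic form of P(y = +1|x) below is an assumed illustration, not the lecture's model.

```python
import math
import random

def p_approve(x):
    """Illustrative P(y = +1 | x): a made-up logistic score on (salary, debt)."""
    salary, debt = x
    return 1 / (1 + math.exp(-(salary - debt) / 10_000))

def sample_label(x):
    """Draw y ~ P(y | x)."""
    return +1 if random.random() < p_approve(x) else -1

x = (40_000, 26_000)  # the slide's customer: same data every time
print([sample_label(x) for _ in range(10)])  # identical inputs, varying labels
```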
Learning Setup with Error Measure and Noisy Targets

[Diagram: the unknown target distribution P(y|x) (target function f plus noise) produces yn ~ P(y|xn); the unknown input distribution P(x) generates the inputs x1, x2, ..., xN; the training examples (x1, y1), (x2, y2), ..., (xN, yN) feed the learning algorithm A, which selects the final hypothesis g, with g(x) ≈ f(x), from the hypothesis set H; an error measure now guides both the learning algorithm and the evaluation of g.]