
Machine Learning
Week 2, Lecture 2

Announcements
Hand In
• It is online.
• Use the web board forum for Matlab questions.
• Comments and corrections are very welcome. I will upload new versions as we go along; we are currently at version 3.
• Your data is coming. We might change it over time.
Quiz
• Go through all questions
Recap
Impossibility of Learning!
What is f? There are 256 potential functions; 8 of them have in-sample error 0.
Assumptions are needed.
x1  x2  x3 | f(x)
 0   0   0 |  1
 1   0   0 |  0
 0   1   0 |  1
 1   1   0 |  1
 0   0   1 |  0
 1   0   1 |  ?
 0   1   1 |  ?
 1   1   1 |  ?
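To make the counting concrete, here is a small brute-force sketch (my own, not from the slides) that enumerates all 2^8 = 256 Boolean functions on three bits and counts those that agree with the five observed rows:

```python
import itertools

# The five observed training rows from the table above: (x1, x2, x3) -> f(x)
train = {(0, 0, 0): 1, (1, 0, 0): 0, (0, 1, 0): 1, (1, 1, 0): 1, (0, 0, 1): 0}
inputs = list(itertools.product([0, 1], repeat=3))   # all 8 possible inputs

consistent = 0
for outputs in itertools.product([0, 1], repeat=8):  # all 256 candidate functions
    f = dict(zip(inputs, outputs))
    consistent += all(f[x] == y for x, y in train.items())
print(consistent)  # 8: the data alone cannot distinguish these candidates
```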
No Free Lunch
"All models are wrong, but some models are useful.”
George Box
Machine Learning has many different models
and algorithms
There is no single best model that works best for all
problems (No Free Lunch Theorem)
Assumptions that works well in one domain
may fail in another
Probabilistic Approach
Flip a coin with unknown heads probability μ. Repeat N times independently.
Sample: h, h, h, t, t, h, t, t, h
Sample mean: ν = #heads / N

Hoeffding's Inequality
For any ε > 0: P(|ν − μ| > ε) ≤ 2e^(−2ε²N)
The sample mean is probably approximately correct (PAC).
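A quick simulation sketch (my own, assuming a fair coin, μ = 0.5) that compares the observed failure probability with the Hoeffding bound:

```python
import math
import random

mu, N, eps, trials = 0.5, 100, 0.1, 100_000
bad = sum(
    abs(sum(random.random() < mu for _ in range(N)) / N - mu) > eps
    for _ in range(trials)
)
print(f"empirical P(|nu - mu| > eps): {bad / trials:.4f}")                       # ~0.035
print(f"Hoeffding bound 2e^(-2 eps^2 N): {2 * math.exp(-2 * eps**2 * N):.4f}")   # ~0.271
```

The bound is loose here, but it holds for any μ and without knowing μ.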
Classification Connection
Testing a hypothesis:
Unknown target f
Fixed hypothesis h
Probability distribution P(x) over the inputs
μ is the probability of picking x such that f(x) ≠ h(x);
1 − μ is the probability of picking x such that f(x) = h(x).
μ is the total probability of all the points x where the hypothesis is wrong,
so the sample mean ν estimates the true error rate μ.
Learning?
• Only verification, not learning
• For finite hypothesis sets we used the union bound
• Make sure the in-sample error E_in is close to the out-of-sample error E_out, and minimize E_in
Error Functions
Walmart: discount for a given person. Error function (cost of estimate h(x) given truth f(x)):

             f = Lying   f = True
Est. Lying       0         1000
Est. True        1            0

CIA access (Friday bar stock). Error function:

             f = Lying   f = True
Est. Lying       0            1
Est. True      1000           0

The point: the right error function depends on the application.
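A small sketch (using the hypothetical costs from the tables above) showing how the same predictions score very differently under the two error functions:

```python
# Encoding: 0 = lying, 1 = true; cost[(estimate, truth)] is the penalty.
walmart = {(0, 0): 0, (0, 1): 1000,  # rejecting an honest customer is expensive
           (1, 0): 1, (1, 1): 0}
cia     = {(0, 0): 0, (0, 1): 1,     # admitting an intruder is expensive
           (1, 0): 1000, (1, 1): 0}

def weighted_error(cost, pairs):
    """Average cost over a list of (estimate, truth) pairs."""
    return sum(cost[(h, f)] for h, f in pairs) / len(pairs)

pairs = [(1, 0), (1, 1), (0, 0)]       # one false accept, two correct decisions
print(weighted_error(walmart, pairs))  # ~0.33
print(weighted_error(cia, pairs))      # ~333.33
```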
Final Diagram
[Diagram: the unknown target P(y | x) and the unknown input distribution P(x) generate the data set; the learning algorithm, guided by the error measure e, selects the final hypothesis from the hypothesis set.]
Today
• We are still only talking about classification
• Test sets
• Work towards learning with infinite-size hypothesis spaces for classification
– Reinvestigate the union bound
– Dichotomies
– Break points
The Test Set
For a fixed hypothesis h, N independent data points, and any ε > 0, Hoeffding gives
P(|E_test(h) − E_out(h)| > ε) ≤ 2e^(−2ε²N)
• Split your data into two parts, D_train and D_test
• Train on D_train and select hypothesis h
• Test h on D_test to obtain the test error E_test(h)
• Apply the Hoeffding bound to E_test(h)
Test Set
• Strong bound: with 1000 points, the test error is within 5% of the out-of-sample error with 98% probability
• Unbiased: just as likely to be better than worse
• Problem: we lose data for training
• If the error is high, it is no help that it will also be high in practice
• Can NOT be used to select h (contamination)
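A quick check (my own sketch) of the "strong bound" claim above for N = 1000 and ε = 0.05:

```python
import math

def hoeffding_confidence(N, eps):
    """Probability that |E_test - E_out| <= eps, by the Hoeffding bound."""
    return 1 - 2 * math.exp(-2 * eps**2 * N)

print(f"{hoeffding_confidence(1000, 0.05):.3f}")  # 0.987, i.e. at least 98%
```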
Learning
Pick a tolerance (risk) δ of failing that you can accept.
Set the RHS equal to δ and solve for ε: 2Me^(−2ε²N) = δ ⇒ ε = √( ln(2M/δ) / (2N) )
With probability at least 1 − δ we get the generalization bound
E_out(h) ≤ E_in(h) + √( ln(2M/δ) / (2N) )
This is why we minimize the in-sample error.
Union Bound
Union bound learning: the learning algorithm picks hypothesis h_l.
P(h_l is bad) is at most the probability that some hypothesis is bad, which the union bound bounds by Σ_i P(h_i is bad).
We did not subtract overlapping events!
Hypotheses seem correlated: change h1 slightly into h2; if h1 is bad (poor generalization) then probably so is h2.
We hope to improve on the union bound result.
Goal
• Replace M with something like the effective number of hypotheses
• A general bound, e.g. independent of the target function and the input distribution
• Simple would be nice
Look at finite point sets
Dichotomy
Fix a set of N points X = (x1,…,xN) and a hypothesis set H.
Each h ∈ H gives a dichotomy (h(x1),…,h(xN)): a bit string of length N.
How many different dichotomies do we get? At most 2^N.
This captures the "expressiveness" of the hypothesis set on X.
Growth Function
For a fixed set of N points X = (x1,…,xN) and hypothesis set H, the growth function is the maximal number of dichotomies over any choice of the points:
m_H(N) = max over x1,…,xN of |{ (h(x1),…,h(xN)) : h ∈ H }|
Example 1: Positive Rays
1-dimensional input space (points on the real line); h(x) = +1 for x > a, −1 otherwise.
The dichotomy only changes when a moves to a different gap between the points, so m_H(N) = N + 1.
Example 2: Intervals
1-dimensional input space (points on the real line); h(x) = +1 inside the interval [a1, a2], −1 outside.
Either a1 and a2 lie in separate gaps between the points, or we put them in the same gap (all −1):
m_H(N) = C(N+1, 2) + 1 = N(N+1)/2 + 1
Example 3: Convex Sets
2-dimensional input space (points in the plane); h(x) = +1 inside a convex set, −1 outside.
Put the points on a circle: any subset can be carved out by a convex polygon, so m_H(N) = 2^N.
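A brute-force sketch (my own, not from the slides) that checks the first two growth functions numerically by trying every essentially different placement of a, or of a1 and a2, on N = 5 points:

```python
import itertools

def ray_dichotomies(xs):
    """All dichotomies h(x) = +1 iff x > a, for positive rays."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if x > a else -1 for x in xs) for a in cuts}

def interval_dichotomies(xs):
    """All dichotomies h(x) = +1 iff a1 < x < a2, for intervals."""
    xs = sorted(xs)
    cuts = [xs[0] - 1] + [(u + v) / 2 for u, v in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if a1 < x < a2 else -1 for x in xs)
            for a1, a2 in itertools.combinations_with_replacement(cuts, 2)}

xs = [1, 2, 3, 4, 5]                    # N = 5 points on the real line
print(len(ray_dichotomies(xs)))         # 6  = N + 1
print(len(interval_dichotomies(xs)))    # 16 = N(N+1)/2 + 1
```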
Goal Continued
Recall the generalization bound came from P(|E_in − E_out| > ε) ≤ 2Me^(−2ε²N).
Imagine we can replace M with the growth function m_H(N).
The factor 2e^(−2ε²N) drops exponentially fast in N,
so if the growth function is a polynomial in N, the RHS still drops exponentially in N.
Bright idea:
1. Prove the growth function is polynomial in N
2. Prove we can replace M with the growth function
Bounding Growth Function
• The growth function might be hard to compute
• Instead of computing the exact value,
• prove that it is bounded by a polynomial
Shattering and Break Point
If H achieves all 2^N dichotomies on (x1,…,xN), then we say that H shatters (x1,…,xN).
If no data set of size k can be shattered by H, then k is a break point for H.
If k is a break point for H, then so are all numbers larger than k. Why?
(Any k points of a shattered larger set would be shattered too.)
Revisit Examples
• Positive rays: m_H(N) = N + 1, so 2 is a break point
• Intervals: m_H(N) = N(N+1)/2 + 1, so 3 is a break point
• Convex sets: m_H(N) = 2^N, so there is no break point
2D Linear Classification (Hyperplanes)
Three points in general position can be shattered by a line; three points on a line cannot, but the growth function takes the maximum over point sets, so m_H(3) = 8.
For the 2D linear classification hypothesis set, no set of 4 points can be shattered:
4 is a break point.
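A brute-force sketch (my own; it assumes scipy is available and tests strict separability with a small feasibility LP) that counts the achievable dichotomies of 2D linear classifiers:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i? (feasibility LP)"""
    A = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0

def count_dichotomies(points):
    return sum(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(count_dichotomies([(0, 0), (1, 0), (0, 1)]))          # 8 = 2^3: shattered
print(count_dichotomies([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 14 < 16: not shattered
```

For the four corners of the square, only the two XOR-style labelings fail, and no placement of 4 points does better.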
Break Points and Growth Function
If H has a break point, then the growth function is polynomial (needs proof).
If not, then it is not: by the definition of a break point, m_H(N) = 2^N for all N.
Break Point Game
Has break point 2: list as many dichotomies on three points x1, x2, x3 as possible without shattering any pair of points. For example, these four dichotomies shatter no pair:

x1  x2  x3
 0   0   0
 0   0   1
 0   1   0
 1   0   0

Adding any fifth row from the remaining dichotomies is impossible: it would shatter some pair of points. So 4 is the maximum.
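A brute-force sketch (my own) confirming the outcome of the game: among all sets of dichotomies on 3 points, the largest that shatters no pair has size 4 (the quantity B(3, 2) defined on the next slides):

```python
import itertools

def shatters_a_pair(rows):
    """Do the dichotomies produce all four patterns on some pair of points?"""
    for i, j in itertools.combinations(range(3), 2):
        if {(r[i], r[j]) for r in rows} == {(0, 0), (0, 1), (1, 0), (1, 1)}:
            return True
    return False

all_rows = list(itertools.product([0, 1], repeat=3))  # the 8 dichotomies
best = max(len(subset)
           for n in range(1, 9)
           for subset in itertools.combinations(all_rows, n)
           if not shatters_a_pair(subset))
print(best)  # 4
```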
Proof Coming
If H has a break point, then the growth function is polynomial.
Definition:
B(N, k) is the maximal number of dichotomies possible on N points such that no subset of k points can be shattered by the dichotomies.
This is more general than hypothesis sets: the dichotomies need not come from any H.
Since no data set of size k can be shattered by an H with break point k,
m_H(N) ≤ B(N, k) for any H with break point k.
Computing B(N, k): Boundary Cases
B(N, 1) = 1: we cannot shatter any set of size 1, so there is no way of picking dichotomies that give different classes for a point. Only one dichotomy is possible, since a second dichotomy would give a different class for at least one point.
B(1, k) = 2 for k > 1: there is only one point, thus only 2 dichotomies are possible.
Compute B(N, k): Recursion (N, k > 1)
Take a list L of B(N, k) dichotomies on N points with no k points shattered. Group them by their first N − 1 bits: α dichotomies whose first N − 1 bits appear only once (S1), and 2β dichotomies that come in pairs agreeing on the first N − 1 bits and differing in the last bit (S2). So B(N, k) = α + 2β.
Consider the first N − 1 points: there are α + β different dichotomies (the paired S2 dichotomies are identical here). They still cannot shatter any k points, i.e. α + β ≤ B(N − 1, k).
Consider the first N − 1 points in S2. If they could shatter k − 1 points, we could extend with the last point, where we have both combinations for all of these dichotomies; this gives k points we can shatter, a contradiction. So β ≤ B(N − 1, k − 1).
Hence B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1).
Proof Coming
Claim: B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i), a polynomial in N of degree k − 1.
Base cases: B(N, 1) = 1 = C(N, 0), and B(1, k) = 2 = C(1, 0) + C(1, 1) for k > 1.
Induction step: assume the claim holds for N0; show it for N0 + 1 and k > 1 (k = 1 was a base case).
B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1) ≤ Σ_{i=0}^{k−1} C(N0, i) + Σ_{i=0}^{k−2} C(N0, i)
Change the parameter in the second sum:
= Σ_{i=0}^{k−1} C(N0, i) + Σ_{i=1}^{k−1} C(N0, i − 1)
Pull out the i = 0 term and make it into one sum:
= 1 + Σ_{i=1}^{k−1} ( C(N0, i) + C(N0, i − 1) )
Apply the recurrence for binomials, C(N0, i) + C(N0, i − 1) = C(N0 + 1, i), and add the zero index in again:
= 1 + Σ_{i=1}^{k−1} C(N0 + 1, i) = Σ_{i=0}^{k−1} C(N0 + 1, i)
QED
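A numeric sketch (my own) checking that the recursion, taken with equality from the base cases, reproduces the binomial sum exactly, so the bound is as tight as the recursion allows:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound on B(N, k) from the recursion and the boundary cases."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 10):
    for k in range(2, N + 1):
        assert B_upper(N, k) == sum(comb(N, i) for i in range(k))
print("recursion matches the binomial sum")
```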