Lecture 2: Memorizer and Nearest Neighbor

Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
The new waiter
The coffee shop GuessYou prides itself on serving customers their favorite drinks before they order them.
New waiters are allowed to ask customers what their favorite drink is for one week.
After this week, the waiters need to decide what to serve each customer, without asking any more customers.
The waiters’ tips depend on how well they do!
Supervised learning
In supervised learning, the learner gets to learn from examples with
their true labels.
The learner then needs to devise a prediction rule.
The prediction rule will be used to predict the labels of future
examples.
The success of the prediction rule is measured by how accurately it
predicts the labels of future examples.
In the coffee shop example:
- Waiter = learner (learning algorithm)
- Customers = examples
- Favorite drinks = labels
- Tips after first week = prediction success
Supervised learning
There are two distinct phases:
- Training phase: The learner receives labeled examples, and outputs a prediction rule.
  - The set of labeled examples that the learner receives is the training sample.
- Test phase: The prediction rule is used on new examples.
A simple learning algorithm for the waiter
The waiter can memorize what each customer likes.
Over the first week the waiter writes down what every customer liked.
After the first week, every customer that is in the list gets
the same drink that they asked for before.
For customers not in the list: the waiter will just guess something.
How good is this algorithm?
A formal description
X - the set of all possible examples
Y - the set of all possible labels
A training sample: S = ((x1 , y1 ), . . . , (xm , ym ))
A learning algorithm is any algorithm that has:
- Input: A training sample S
- Output: A prediction rule ĥS : X → Y.
The Memorize algorithm
Memorize algorithm
input A training sample S
output A function ĥS : X → Y.
1: Set ĥS = fSmem, where fSmem is defined as:
\[
\forall x \in \mathcal{X}, \quad f_S^{\mathrm{mem}}(x) =
\begin{cases}
y & \text{if } (x, y) \in S \\
\text{a random label} & \text{otherwise}
\end{cases}
\]
In the second case, the label is drawn uniformly at random from Y.
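A minimal Python sketch of this rule (an illustration, not code from the course; it assumes examples are hashable and the label set Y is given explicitly):

```python
import random

def memorize(sample, labels):
    """A sketch of the Memorize rule: remember the training sample,
    guess uniformly at random on unseen examples."""
    lookup = {x: y for (x, y) in sample}        # remember every labeled example
    def h(x):
        if x in lookup:                         # customer seen during training
            return lookup[x]
        return random.choice(sorted(labels))    # unseen customer: uniform random guess
    return h

# Hypothetical usage:
S = [("Monica", "Tea"), ("Ross", "Coffee")]
h = memorize(S, labels={"Juice", "Tea", "Coffee"})
print(h("Monica"))   # "Tea"
print(h("Joey"))     # a uniformly random drink
```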
How good is the Memorize algorithm?
It depends...
- Perhaps the coffee shop has only a single customer;
- Perhaps each customer visits the coffee shop only once;
- Perhaps after the first week the coffee shop moved and the regular customers were replaced by new ones.
To analyze the algorithm, we need to make some assumptions.
We will use these assumptions throughout most of this course.
The main assumptions:
- There is some distribution of examples and labels, and each labeled example is an independent random draw from this distribution.
- The algorithm does not know this distribution.
A distribution over labeled examples
The coffee shop example:
X = {"Monica", "Phoebe", "Ross", "Joey", "Chandler"}
Y = {"Juice", "Tea", "Coffee"}
D is a distribution over X × Y.
D defines a probability for every pair (x, y ) ∈ X × Y.
This is denoted P(X ,Y )∼D [(X , Y ) = (x, y )], or D((x, y )).
A possible D:
          Monica   Phoebe   Ross   Joey   Chandler
Juice     0        25%      0      0      25%
Tea       20%      10%      0      0      0
Coffee    0        0        20%    0      0
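One hypothetical way to encode this table in Python and draw i.i.d. labeled examples from it (a sketch for illustration; the probabilities follow the table above):

```python
import random

# The joint distribution D as a dictionary over (example, label) pairs.
D = {
    ("Monica", "Tea"): 0.20,
    ("Phoebe", "Juice"): 0.25,
    ("Phoebe", "Tea"): 0.10,
    ("Ross", "Coffee"): 0.20,
    ("Chandler", "Juice"): 0.25,
}

def draw_sample(D, m):
    """Draw a training sample S of m i.i.d. labeled examples from D."""
    pairs, weights = list(D.keys()), list(D.values())
    return random.choices(pairs, weights=weights, k=m)

S = draw_sample(D, m=7)   # e.g. one labeled example per day for a week
print(S)
```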
A distribution over labeled examples
According to our assumption, each pair of a customer and its desired
drink is chosen randomly according to D.
Our training sample is S = ((x1, y1), . . . , (xm, ym)), where each (xi, yi) is drawn independently from D.
- This is denoted S ∼ D^m.
After the training time ends, random pairs (x, y ) (customers with
desired drinks) continue to be drawn from D, but this time the waiter
only observes x, and uses ĥS to predict y .
How to measure the success of ĥS ?
Measuring the success of the algorithm
At test time, if the customers keep coming indefinitely,
the fraction of cases in which the waiter gets the wrong drink
is the probability of error of the waiter’s prediction rule:
err(ĥS, D) := P(X,Y)∼D [ĥS(X) ≠ Y].
Suppose that each customer has only one favorite drink.
So D has the following property:
∀x ∈ X , there is only one y ∈ Y s.t. D((x, y )) > 0.
In this case, any customer that came in the first week will always
receive the correct drink.
Probability of error for the prediction rule of Memorize
Suppose |Y| = k (i.e., k possible drinks)
Suppose each customer x ∈ X has probability px of showing up.
\[ p_x := P_{(X,Y)\sim D}[X = x] = \sum_{y \in \mathcal{Y}} D((x, y)). \]
Denote XS = {x | ∃y s.t. (x, y ) ∈ S}.
Missing mass: \( M_S := \sum_{x \in \mathcal{X} \setminus X_S} p_x \)
- Probability of being surprised by an unfamiliar customer.
The Memorizer makes a mistake on x when:
- A new customer arrives: x ∉ XS, and
- The random label is wrong.
Probability of error of ĥS = fSmem :
err(ĥS , D) = P[x ∈
/ XS and wrong random label]
= P[x ∈
/ XS ] · P[wrong random label | x ∈
/ XS ]
k −1
.
= MS ·
k
Probability of error for the prediction rule of Memorize
Recall: k = |Y| (# drinks), and \( M_S := \sum_{x \in \mathcal{X} \setminus X_S} p_x \) (the missing mass).
We saw that
\[ \mathrm{err}(\hat{h}_S, D) = \frac{k-1}{k}\, M_S. \]
MS depends on the random training sample S.
What is the expected value of MS over random training samples?
Expected missing mass:
\[
E_{S \sim D^m}[M_S]
= E_{S \sim D^m}\Big[\sum_{x \in \mathcal{X} \setminus X_S} p_x\Big]
= E_{S \sim D^m}\Big[\sum_{x \in \mathcal{X}} p_x \cdot I[x \notin X_S]\Big]
= \sum_{x \in \mathcal{X}} p_x \cdot E\big[I[x \notin X_S]\big]
= \sum_{x \in \mathcal{X}} p_x \cdot P[x \notin X_S]
= \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m.
\]
The last step uses the fact that each of the m independent draws misses x with probability 1 − px, so P[x ∉ XS] = (1 − px)^m.
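As a sanity check (not part of the slides), the following sketch estimates E[MS] by simulation and compares it to the closed form above, for an arbitrary illustrative distribution over customers:

```python
import random

p = {"Monica": 0.2, "Phoebe": 0.35, "Ross": 0.2, "Chandler": 0.25}  # illustrative p_x
m, trials = 5, 100_000

# Closed form: sum_x p_x * (1 - p_x)^m
closed_form = sum(px * (1 - px) ** m for px in p.values())

# Monte Carlo estimate of E[M_S] over random training samples
customers, weights = list(p), list(p.values())
total = 0.0
for _ in range(trials):
    seen = set(random.choices(customers, weights=weights, k=m))
    total += sum(px for x, px in p.items() if x not in seen)   # missing mass of this S
print(closed_form, total / trials)   # the two numbers should be close
```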
The case of a uniform distribution
\( \mathrm{err}(\hat{h}_S, D) = \frac{k-1}{k} M_S \), and \( E[M_S] = \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m \).
So the expected error of the Memorizer rule is:
\[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] = \frac{k-1}{k} \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m. \]
For instance, if |X | = N and all customers have the same probability,
px = 1/N:
\[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] = \frac{k-1}{k} (1 - 1/N)^m \le \frac{k-1}{k}\, e^{-m/N}. \]
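For concreteness (the numbers here are illustrative, not from the slides): with k = 3 drinks, N = 100 equally likely customers, and m = 300 training visits, both the exact expression and the bound come out to about 3%:

```python
import math

k, N, m = 3, 100, 300                       # illustrative numbers
exact = (k - 1) / k * (1 - 1 / N) ** m      # exact expected error for uniform p_x = 1/N
bound = (k - 1) / k * math.exp(-m / N)      # the e^{-m/N} upper bound
print(exact, bound)                         # ~0.033 for both
```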
Is that a good result?
Is that a good result?
We got
\[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] = \frac{k-1}{k} \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m. \]
For any fixed distribution D, the error → 0 when m grows.
There is no sample size m which is good for all distributions.
- E.g. if k = 2 and the customers arrive uniformly,
  \[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] \le \frac{1}{2} e^{-m/N}, \]
- but if N > 2m, the expected error is at least about 30%: in that case (1 − 1/N)^m > (1 − 1/(2m))^m, which approaches e^{−1/2} ≈ 0.61, and half of that is about 0.30.
We cannot tell in advance whether one week is enough for the waiter
to train its prediction rule.
We would like distribution-free guarantees: for some sample size m,
the learning algorithm will get low error for any distribution.
Does Memorize actually learn anything?
In some sense, Memorize doesn’t actually learn anything,
because it only “learns” what it already saw.
It cannot generalize to unseen examples.
What can we do to improve this?
Generalizing to unseen examples
If the waiter sees a customer that it did not see during training, but who is similar to a customer that it did see, perhaps the two have the same favorite drink.
What do we mean by “similar”?
- Same age?
- Same gender?
- Same height?
- Same weight?
- ...
We need some distance function: ρ : X × X → R+ .
The nearest neighbor algorithm
S = ((x1 , y1 ), . . . , (xm , ym ))
For x ∈ X , denote by nn(x) the index
of the most similar example to x in S. So:
\[ \rho(x_{nn(x)}, x) = \min_{i \le m} \rho(x_i, x). \]
Nearest Neighbor algorithm
input A training sample S
output A function ĥS : X → Y.
1: Set ĥS = fSnn , where fSnn is defined as:
\[ \forall x \in \mathcal{X}, \quad f_S^{\mathrm{nn}}(x) = y_{nn(x)}. \]
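A minimal sketch of this rule in Python, assuming examples are represented as numeric vectors and ρ is the Euclidean distance (names and data are illustrative, not from the course):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def nearest_neighbor(sample):
    """Return the 1-NN prediction rule for sample S = [(x1, y1), ..., (xm, ym)]."""
    def h(x):
        # training example closest to x, i.e. index nn(x)
        _, y_nn = min(sample, key=lambda pair: euclidean(pair[0], x))
        return y_nn
    return h

# Hypothetical usage: customers as (age, is_female) vectors, labels are drinks.
S = [((25, 1), "Tea"), ((40, 0), "Coffee"), ((22, 1), "Juice")]
h = nearest_neighbor(S)
print(h((24, 1)))   # label of the closest training customer ("Tea")
```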
How to choose the distance function?
Represent each example in X by a vector in Rd ,
where d is the number of features for each example.
feature: a single property of the example.
Map all possible feature values to real numbers.
For instance, possible features for customers:
- age (already a number)
- gender (represent one of the genders as 1, the other as 0)
- eye color: can be mapped to several binary features:
  - Is eye color blue? (0/1)
  - Is eye color brown? (0/1)
  - Is eye color green? (0/1)
- E.g. a 25-year-old female customer with blue eyes: (25, 1, 1, 0, 0)
Define the distance to be the Euclidean distance in R^d:
\[ \rho(x, x') = \|x - x'\| \equiv \sqrt{\sum_{i=1}^{d} \big(x(i) - x'(i)\big)^2}. \]
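For example, a small sketch using the feature encoding above (the feature order is an assumption for illustration):

```python
import math

# (age, is_female, blue eyes, brown eyes, green eyes)
customer_a = (25, 1, 1, 0, 0)   # 25-year-old female, blue eyes
customer_b = (30, 1, 0, 1, 0)   # 30-year-old female, brown eyes

rho = math.sqrt(sum((a - b) ** 2 for a, b in zip(customer_a, customer_b)))
print(rho)   # sqrt(25 + 1 + 1) ≈ 5.20
```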
Examples of feature mappings
Task: Identify digits in scanned images.
- The examples are grayscale photos of size 128 × 128 pixels.
- The labels are the digits 0, . . . , 9.
- Can represent examples in R^d, with d = 128².
- A feature for every pixel. The value of the feature is the intensity of the pixel.
Task: Identify whether a document is a scientific article.
- The examples are text documents of varying length.
- The labels are 0 or 1: 1 if the document is a scientific article.
- Can represent examples in R^d, where d = the number of words in the English language.
- Use a lookup table to map each word to one of 1, . . . , d.
- x(i) = 1 iff word i appears in the document x.
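A minimal sketch of such a bag-of-words mapping (the tiny vocabulary is hypothetical; a real one would cover the whole language):

```python
# Hypothetical lookup table mapping each word to a coordinate 0, ..., d-1.
vocabulary = {"theorem": 0, "proof": 1, "coffee": 2, "experiment": 3}

def to_features(document, vocabulary):
    """Map a text document to a 0/1 vector x with x(i) = 1 iff word i appears."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(to_features("we give a proof of the main theorem", vocabulary))
# [1, 1, 0, 0]
```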
Such feature mappings are used by many learning algorithms.
So we will usually assume in this course that X ⊆ Rd .
Nearest Neighbor illustration
Each example in the training sample creates a Voronoi cell.
All the new examples that fall in this cell will get the same label.
For X = R2 , we can draw the sample and the resulting Voronoi cells.
Nearest Neighbor and the distance function
Will Nearest Neighbor work well?
It depends on the distance function.
If the distance function is the Euclidean distance,
it depends on the representation of X : the feature mapping.
Good case: Many of the features are relevant to the classification task
Bad case: Many of the features are irrelevant.
- Suppose only the customer's age determines the favorite drink.
- Two customers of the same age might still be "far" if there are many other features.
Same distance function can be good for one task, but bad for another!
Can we get close to the best possible prediction rule?
The Bayes-optimal rule
What is the best prediction rule?
Assume Y = {0, 1}. (E.g. Tea and Coffee.)
Define
η(x) = P(X ,Y )∼D [Y = 1 | X = x] ≡ D((x, 1))/D(x).
(E.g. the probability that Ross selects Coffee.)
The best prediction rule is the Bayes-optimal rule:
h∗ (x) := I[η(x) > 1/2].
For a general set of labels Y:
\[ h^*(x) := \operatorname*{argmax}_{y \in \mathcal{Y}} P_{(X,Y) \sim D}[Y = y \mid X = x]. \]
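When the joint distribution is known (as in the toy table earlier; the dictionary values below are illustrative), the Bayes-optimal rule can be computed directly; dividing by P[X = x] does not change the argmax:

```python
D = {                      # toy joint distribution over (customer, drink) pairs
    ("Monica", "Tea"): 0.20,
    ("Phoebe", "Juice"): 0.25,
    ("Phoebe", "Tea"): 0.10,
    ("Ross", "Coffee"): 0.20,
    ("Chandler", "Juice"): 0.25,
}
labels = ["Juice", "Tea", "Coffee"]

def bayes_optimal(x):
    """h*(x) = argmax_y P[Y = y | X = x] = argmax_y D((x, y))."""
    return max(labels, key=lambda y: D.get((x, y), 0.0))

print(bayes_optimal("Phoebe"))   # "Juice" (25% beats 10% for Tea)
```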
The Bayes-optimal rule
Proof for the Bayes-optimal rule:
\[
\begin{aligned}
\mathrm{err}(h, D) &= P_{(X,Y)\sim D}[h(X) \ne Y] \\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}:\, h(x)\ne y} P[X = x, Y = y] \\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}:\, h(x)\ne y} P[X = x]\cdot P[Y = y \mid X = x] \\
&= \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}:\, h(x)\ne y} P[X = x]\cdot P[Y = y \mid X = x] \\
&= \sum_{x\in\mathcal{X}} P[X = x] \sum_{y\in\mathcal{Y}:\, h(x)\ne y} P[Y = y \mid X = x] \\
&= \sum_{x\in\mathcal{X}} P[X = x]\cdot P[Y \ne h(x) \mid X = x] \\
&= \sum_{x\in\mathcal{X}} P[X = x]\,\big(1 - P[Y = h(x) \mid X = x]\big)
\end{aligned}
\]
err(h, D) is smallest when for each x, P[Y = h(x) | X = x] is largest.
Hence argmax_{y∈Y} P(X,Y)∼D [Y = y | X = x] is the optimal choice for h(x).
When does Nearest Neighbor approach the optimal error?
We would like Nearest Neighbor to do well on our task.
This will be the case if:
examples that are close together usually have the same label.
A "nice" distribution is one in which there is some c > 0 such that
\[ \forall x, x' \in \mathcal{X}, \quad |\eta(x) - \eta(x')| \le c\,\|x - x'\|. \]
We say in this case that η is c-Lipschitz.
Theorem
If X ⊆ [0, 1]^d, Y = {0, 1}, η for the distribution D is c-Lipschitz, and the training sample is of size m, then
\[ E_{S \sim D^m}[\mathrm{err}(f_S^{\mathrm{nn}}, D)] \le 2\,\mathrm{err}(h^*, D) + 4c\sqrt{d}\, m^{-1/(d+1)}. \]
Implications of this guarantee
Theorem
If X ⊆ [0, 1]^d, Y = {0, 1}, and η for the distribution D is c-Lipschitz, then
\[ E_{S \sim D^m}[\mathrm{err}(f_S^{\mathrm{nn}}, D)] \le 2\,\mathrm{err}(h^*, D) + 4c\sqrt{d}\, m^{-1/(d+1)}. \]
The nearest neighbor rule approaches twice the error of the
Bayes-optimal.
In some cases, the sample size m needs to be exponential in d.
If d is large, this could be a serious problem.
This is termed the curse of dimensionality:
- More dimensions (features) ⇒ many more examples are needed.
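To see the exponential dependence, one can solve 4c√d · m^{−1/(d+1)} ≤ ε for m (the values c = 1 and ε = 0.1 below are assumptions for illustration):

```python
import math

c, eps = 1.0, 0.1                                  # assumed Lipschitz constant, target excess error
for d in (2, 10, 20):
    m = (4 * c * math.sqrt(d) / eps) ** (d + 1)    # smallest m making the bound <= eps
    print(d, f"{m:.1e}")
# d=2 -> ~1.8e5 examples, d=10 -> ~1.3e23, d=20 -> ~2.0e47
```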
k-Nearest Neighbor
We showed that we can approach 2 × the Bayes-optimal error.
Can we approach the Bayes-optimal error exactly?
S = ((x1 , y1 ), . . . , (xm , ym ))
For x ∈ X, denote by πi(x) the index of the i'th closest example to x in S. So:
\[ i \le j \implies \rho(x_{\pi_i(x)}, x) \le \rho(x_{\pi_j(x)}, x). \]
k-Nearest Neighbors algorithm
input A training sample S, integer k ≥ 1.
output A function ĥS : X → Y.
1: Set ĥS = fSk-nn , where fSk-nn is defined as:
\[ \forall x \in \mathcal{X}, \quad f_S^{k\text{-nn}}(x) = \text{the majority label among } y_{\pi_1(x)}, \dots, y_{\pi_k(x)}. \]
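A sketch extending the earlier 1-NN code to k neighbors (illustrative; ties in the majority vote are broken arbitrarily):

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def k_nearest_neighbors(sample, k):
    """Return the k-NN prediction rule for sample S = [(x1, y1), ..., (xm, ym)]."""
    def h(x):
        # sort training examples by distance to x and keep the k closest
        neighbors = sorted(sample, key=lambda pair: euclidean(pair[0], x))[:k]
        labels = [y for _, y in neighbors]
        return Counter(labels).most_common(1)[0][0]   # majority label
    return h

# Hypothetical usage with a toy sample of (age, is_female) customers:
S = [((25, 1), "Tea"), ((40, 0), "Coffee"), ((22, 1), "Juice"), ((23, 1), "Tea")]
h = k_nearest_neighbors(S, k=3)
print(h((24, 1)))   # majority drink among the 3 closest customers ("Tea")
```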
k-Nearest Neighbors
Theorem
Suppose that k1, k2, . . . is a sequence such that lim_{m→∞} km = ∞ and lim_{m→∞} km/m = 0. Then
\[ \lim_{m \to \infty} E_{S \sim D^m}[\mathrm{err}(f_S^{k_m\text{-nn}}, D)] = \mathrm{err}(h^*, D). \]
By increasing k slowly with the sample size, the correct majority label can be approached at every point in the space X.
Curse of dimensionality still holds for k-nearest neighbors.
In practice, k-nearest neighbors is often used with some small fixed k.
Efficient implementation
Computing distances in high dimensions is expensive.
Also: storing the entire sample is memory-intensive.
How to speed up calculations and save memory?
Idea: project onto lower-dimensional subspace
- Principal Components Analysis (PCA) preserves the general "cloud" shape of the data.
- The Johnson-Lindenstrauss (JL) transform approximately preserves pairwise distances (a small sketch follows below).
Idea: sample compression/condensing
- Store only a few "representative" examples and discard the rest.
- Memory efficient.
- Also results in a prediction speedup.
- It turns out this can also improve generalization!
We will see some of these methods later in the course.
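As a taste of the JL idea mentioned above (a sketch under standard assumptions, not the course's implementation): multiplying by a random Gaussian matrix scaled by 1/√(target dimension) roughly preserves pairwise distances.

```python
import random, math

def random_projection(vectors, target_dim, seed=0):
    """Project d-dimensional vectors to target_dim dimensions using a random
    Gaussian matrix scaled by 1/sqrt(target_dim) (a JL-style construction)."""
    rng = random.Random(seed)
    d = len(vectors[0])
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(target_dim)]  # target_dim x d
    scale = 1 / math.sqrt(target_dim)
    return [[scale * sum(a * x for a, x in zip(row, v)) for row in A] for v in vectors]

# Toy check (illustrative): distances before vs. after projecting 100-dim points to 20 dims.
rng = random.Random(1)
u = [rng.gauss(0, 1) for _ in range(100)]
v = [rng.gauss(0, 1) for _ in range(100)]
pu, pv = random_projection([u, v], target_dim=20)
dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
print(dist(u, v), dist(pu, pv))   # the two distances should be of similar magnitude
```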
Exercise 1
We are releasing it early for your convenience
You can already do Q1 and Q2
For Q3 you need next week’s lecture.
Deadline: 16.11.15
Main part of exercise: implement and run k-Nearest-Neighbors.
Getting started with Matlab: see the link in the exercise.
You may also use Octave instead of Matlab.
More explanations on the exercise will be given in class.