Lecture 2: Memorizer and Nearest Neighbor

Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
The new waiter
The coffee shop GuessYou prides itself on serving customers their favorite drinks before they order them.
New waiters are allowed to ask customers what their favorite drink is for one week.
After this week, the waiters need to decide what to serve each customer, without asking any more customers.
The waiters’ tips depend on how well they do!
Supervised learning
In supervised learning, the learner gets to learn from examples with
their true labels.
The learner then needs to devise a prediction rule.
The prediction rule will be used to predict the labels of future
examples.
The success of the prediction rule is measured by how accurately it
predicts the labels of future examples.
In the coffee shop example:
- Waiter = learner (learning algorithm)
- Customers = examples
- Favorite drinks = labels
- Tips after first week = prediction success
Supervised learning
There are two distinct phases:
- Training phase: The learner receives labeled examples, and outputs a prediction rule.
  - The set of labeled examples that the learner receives is the training sample.
- Test phase: The prediction rule is used on new examples.
A simple learning algorithm for the waiter
The waiter can memorize what each customer likes.
Over the first week the waiter writes down what every customer liked.
After the first week, every customer that is in the list gets
the same drink that they asked for before.
For customers not in the list: the waiter will just guess something.
How good is this algorithm?
A formal description
X - the set of all possible examples
Y - the set of all possible labels
A training sample: S = ((x1 , y1 ), . . . , (xm , ym ))
A learning algorithm is any algorithm that has:
- Input: A training sample S
- Output: A prediction rule ĥS : X → Y.
The Memorize algorithm
Memorize algorithm
input A training sample S
output A function ĥS : X → Y.
1: Set ĥS = fSmem, where fSmem is defined as:
\[
\forall x \in \mathcal{X}, \quad f_S^{\mathrm{mem}}(x) =
\begin{cases}
y & \text{if } (x, y) \in S \\
\text{a random label} & \text{otherwise}
\end{cases}
\]
In the second case, the label is drawn uniformly at random from Y.
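A minimal Python sketch of this rule (an illustration, not code from the course; it assumes examples are hashable and the label set Y is given explicitly):

```python
import random

def memorize(sample, labels):
    """A sketch of the Memorize rule: remember the training sample,
    guess uniformly at random on unseen examples."""
    lookup = {x: y for (x, y) in sample}        # remember every labeled example
    def h(x):
        if x in lookup:                         # customer seen during training
            return lookup[x]
        return random.choice(sorted(labels))    # unseen customer: uniform random guess
    return h

# Hypothetical usage:
S = [("Monica", "Tea"), ("Ross", "Coffee")]
h = memorize(S, labels={"Juice", "Tea", "Coffee"})
print(h("Monica"))   # "Tea"
print(h("Joey"))     # a uniformly random drink
```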
How good is the Memorize algorithm?
It depends...
- Perhaps the coffee shop has only a single customer;
- Perhaps each customer visits the coffee shop only once;
- Perhaps after the first week the coffee shop moved and the regular customers were replaced by new ones.
To analyze the algorithm, we need to make some assumptions.
We will use these assumptions throughout most of this course.
The main assumptions:
- There is some distribution of examples and labels, and each labeled example is an independent random draw from this distribution.
- The algorithm does not know this distribution.
A distribution over labeled examples
The coffee shop example:
X = {"Monica", "Phoebe", "Ross", "Joey", "Chandler"}
Y = {"Juice", "Tea", "Coffee"}
D is a distribution over X × Y.
D defines a probability for every pair (x, y ) ∈ X × Y.
This is denoted P(X ,Y )∼D [(X , Y ) = (x, y )], or D((x, y )).
A possible D:
          Monica   Phoebe   Ross   Joey   Chandler
Juice     0        25%      0      0      25%
Tea       20%      10%      0      0      0
Coffee    0        0        20%    0      0
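One hypothetical way to encode this table in Python and draw i.i.d. labeled examples from it (a sketch for illustration; the probabilities follow the table above):

```python
import random

# The joint distribution D as a dictionary over (example, label) pairs.
D = {
    ("Monica", "Tea"): 0.20,
    ("Phoebe", "Juice"): 0.25,
    ("Phoebe", "Tea"): 0.10,
    ("Ross", "Coffee"): 0.20,
    ("Chandler", "Juice"): 0.25,
}

def draw_sample(D, m):
    """Draw a training sample S of m i.i.d. labeled examples from D."""
    pairs, weights = list(D.keys()), list(D.values())
    return random.choices(pairs, weights=weights, k=m)

S = draw_sample(D, m=7)   # e.g. one labeled example per day for a week
print(S)
```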
A distribution over labeled examples
According to our assumption, each pair of a customer and its desired
drink is chosen randomly according to D.
Our training sample is S = ((x1, y1), . . . , (xm, ym)), where each (xi, yi) is drawn independently from D.
- This is denoted S ∼ D^m.
After the training time ends, random pairs (x, y ) (customers with
desired drinks) continue to be drawn from D, but this time the waiter
only observes x, and uses ĥS to predict y .
How to measure the success of ĥS ?
Measuring the success of the algorithm
At test time, if the customers keep coming indefinitely,
the fraction of cases in which the waiter gets the wrong drink
is the probability of error of the waiter’s prediction rule:
err(ĥS, D) := P(X,Y)∼D [ĥS(X) ≠ Y].
Suppose that each customer has only one favorite drink.
So D has the following property:
∀x ∈ X , there is only one y ∈ Y s.t. D((x, y )) > 0.
In this case, any customer that came in the first week will always
receive the correct drink.
Probability of error for the prediction rule of Memorize
Suppose |Y| = k (i.e., k possible drinks)
Suppose each customer x ∈ X has probability px of showing up.
\[ p_x := P_{(X,Y)\sim D}[X = x] = \sum_{y \in \mathcal{Y}} D((x, y)). \]
Denote XS = {x | ∃y s.t. (x, y ) ∈ S}.
Missing mass: \( M_S := \sum_{x \in \mathcal{X} \setminus X_S} p_x \)
- Probability of being surprised by an unfamiliar customer.
The Memorizer makes a mistake on x when:
- A new customer arrives: x ∉ XS, and
- The random label is wrong.
Probability of error of ĥS = fSmem :
err(ĥS , D) = P[x ∈
/ XS and wrong random label]
= P[x ∈
/ XS ] · P[wrong random label | x ∈
/ XS ]
k −1
.
= MS ·
k
Probability of error for the prediction rule of Memorize
Recall: k = |Y| (# drinks), and \( M_S := \sum_{x \in \mathcal{X} \setminus X_S} p_x \) (the missing mass).
We saw that
\[ \mathrm{err}(\hat{h}_S, D) = \frac{k-1}{k}\, M_S. \]
MS depends on the random training sample S.
What is the expected value of MS over random training samples?
Expected missing mass:
\[
E_{S \sim D^m}[M_S]
= E_{S \sim D^m}\Big[\sum_{x \in \mathcal{X} \setminus X_S} p_x\Big]
= E_{S \sim D^m}\Big[\sum_{x \in \mathcal{X}} p_x \cdot I[x \notin X_S]\Big]
= \sum_{x \in \mathcal{X}} p_x \cdot E\big[I[x \notin X_S]\big]
= \sum_{x \in \mathcal{X}} p_x \cdot P[x \notin X_S]
= \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m.
\]
The last step uses the fact that each of the m independent draws misses x with probability 1 − px, so P[x ∉ XS] = (1 − px)^m.
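As a sanity check (not part of the slides), the following sketch estimates E[MS] by simulation and compares it to the closed form above, for an arbitrary illustrative distribution over customers:

```python
import random

p = {"Monica": 0.2, "Phoebe": 0.35, "Ross": 0.2, "Chandler": 0.25}  # illustrative p_x
m, trials = 5, 100_000

# Closed form: sum_x p_x * (1 - p_x)^m
closed_form = sum(px * (1 - px) ** m for px in p.values())

# Monte Carlo estimate of E[M_S] over random training samples
customers, weights = list(p), list(p.values())
total = 0.0
for _ in range(trials):
    seen = set(random.choices(customers, weights=weights, k=m))
    total += sum(px for x, px in p.items() if x not in seen)   # missing mass of this S
print(closed_form, total / trials)   # the two numbers should be close
```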
The case of a uniform distribution
\( \mathrm{err}(\hat{h}_S, D) = \frac{k-1}{k} M_S \), and \( E[M_S] = \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m \).
So the expected error of the Memorizer rule is:
\[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] = \frac{k-1}{k} \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m. \]
For instance, if |X | = N and all customers have the same probability,
px = 1/N:
\[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] = \frac{k-1}{k} (1 - 1/N)^m \le \frac{k-1}{k}\, e^{-m/N}. \]
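For concreteness (the numbers here are illustrative, not from the slides): with k = 3 drinks, N = 100 equally likely customers, and m = 300 training visits, both the exact expression and the bound come out to about 3%:

```python
import math

k, N, m = 3, 100, 300                       # illustrative numbers
exact = (k - 1) / k * (1 - 1 / N) ** m      # exact expected error for uniform p_x = 1/N
bound = (k - 1) / k * math.exp(-m / N)      # the e^{-m/N} upper bound
print(exact, bound)                         # ~0.033 for both
```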
Is that a good result?
Is that a good result?
We got
\[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] = \frac{k-1}{k} \sum_{x \in \mathcal{X}} p_x (1 - p_x)^m. \]
For any fixed distribution D, the error → 0 when m grows.
There is no sample size m which is good for all distributions.
- E.g. if k = 2 and the customers arrive uniformly,
  \[ E_{S \sim D^m}[\mathrm{err}(\hat{h}_S, D)] \le \frac{1}{2} e^{-m/N}, \]
- but if N > 2m, the expected error is at least about 30%: in that case (1 − 1/N)^m > (1 − 1/(2m))^m, which approaches e^{−1/2} ≈ 0.61, and half of that is about 0.30.
We cannot tell in advance whether one week is enough for the waiter
to train its prediction rule.
We would like distribution-free guarantees: for some sample size m,
the learning algorithm will get low error for any distribution.
Does Memorize actually learn anything?
In some sense, Memorize doesn’t actually learn anything,
because it only “learns” what it already saw.
It cannot generalize to unseen examples.
What can we do to improve this?
Generalizing to unseen examples
If the waiter sees a customer that it did not see during training, but who is similar to a customer that it did see, perhaps the two have the same favorite drink.
What do we mean by “similar”?
- Same age?
- Same gender?
- Same height?
- Same weight?
- ...
We need some distance function: ρ : X × X → R+ .
The nearest neighbor algorithm
S = ((x1 , y1 ), . . . , (xm , ym ))
For x ∈ X , denote by nn(x) the index
of the most similar example to x in S. So:
\[ \rho(x_{nn(x)}, x) = \min_{i \le m} \rho(x_i, x). \]
Nearest Neighbor algorithm
input A training sample S
output A function ĥS : X → Y.
1: Set ĥS = fSnn , where fSnn is defined as:
\[ \forall x \in \mathcal{X}, \quad f_S^{\mathrm{nn}}(x) = y_{nn(x)}. \]
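A minimal sketch of this rule in Python, assuming examples are represented as numeric vectors and ρ is the Euclidean distance (names and data are illustrative, not from the course):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def nearest_neighbor(sample):
    """Return the 1-NN prediction rule for sample S = [(x1, y1), ..., (xm, ym)]."""
    def h(x):
        # training example closest to x, i.e. index nn(x)
        _, y_nn = min(sample, key=lambda pair: euclidean(pair[0], x))
        return y_nn
    return h

# Hypothetical usage: customers as (age, is_female) vectors, labels are drinks.
S = [((25, 1), "Tea"), ((40, 0), "Coffee"), ((22, 1), "Juice")]
h = nearest_neighbor(S)
print(h((24, 1)))   # label of the closest training customer ("Tea")
```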
How to choose the distance function?
Represent each example in X by a vector in Rd ,
where d is the number of features for each example.
feature: a single property of the example.
Map all possible feature values to real numbers.
For instance, possible features for customers:
- age (already a number)
- gender (represent one of the genders as 1, the other as 0)
- eye color: can be mapped to several binary features:
  - Is eye color blue? (0/1)
  - Is eye color brown? (0/1)
  - Is eye color green? (0/1)
- E.g. a 25-year-old female customer with blue eyes: (25, 1, 1, 0, 0)
Define the distance to be the Euclidean distance in R^d:
\[ \rho(x, x') = \|x - x'\| \equiv \sqrt{\sum_{i=1}^{d} \big(x(i) - x'(i)\big)^2}. \]
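For example, a small sketch using the feature encoding above (the feature order is an assumption for illustration):

```python
import math

# (age, is_female, blue eyes, brown eyes, green eyes)
customer_a = (25, 1, 1, 0, 0)   # 25-year-old female, blue eyes
customer_b = (30, 1, 0, 1, 0)   # 30-year-old female, brown eyes

rho = math.sqrt(sum((a - b) ** 2 for a, b in zip(customer_a, customer_b)))
print(rho)   # sqrt(25 + 1 + 1) ≈ 5.20
```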
Examples of feature mappings
Task: Identify digits in scanned images.
- The examples are grayscale photos of size 128 × 128 pixels.
- The labels are the digits 0, . . . , 9.
- Can represent examples in R^d, with d = 128².
- A feature for every pixel. The value of the feature is the intensity of the pixel.
Task: Identify whether a document is a scientific article.
- The examples are text documents of varying length.
- The labels are 0 or 1: 1 if the document is a scientific article.
- Can represent examples in R^d, where d = the number of words in the English language.
- Use a lookup table to map each word to one of 1, . . . , d.
- x(i) = 1 iff word i appears in the document x.
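A minimal sketch of such a bag-of-words mapping (the tiny vocabulary is hypothetical; a real one would cover the whole language):

```python
# Hypothetical lookup table mapping each word to a coordinate 0, ..., d-1.
vocabulary = {"theorem": 0, "proof": 1, "coffee": 2, "experiment": 3}

def to_features(document, vocabulary):
    """Map a text document to a 0/1 vector x with x(i) = 1 iff word i appears."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(to_features("we give a proof of the main theorem", vocabulary))
# [1, 1, 0, 0]
```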
Such feature mappings are used by many learning algorithms.
So we will usually assume in this course that X ⊆ Rd .
Nearest Neighbor illustration
Each example in the training sample creates a Voronoi cell.
All the new examples that fall in this cell will get the same label.
For X = R2 , we can draw the sample and the resulting Voronoi cells.
Nearest Neighbor and the distance function
Will Nearest Neighbor work well?
It depends on the distance function.
If the distance function is the Euclidean distance,
it depends on the representation of X : the feature mapping.
Good case: Many of the features are relevant to the classification task
Bad case: Many of the features are irrelevant.
- Suppose only the customer's age determines the favorite drink.
- Two customers of the same age might still be "far" if there are many other features.
Same distance function can be good for one task, but bad for another!
Can we get close to the best possible prediction rule?
The Bayes-optimal rule
What is the best prediction rule?
Assume Y = {0, 1}. (E.g. Tea and Coffee.)
Define
η(x) = P(X ,Y )∼D [Y = 1 | X = x] ≡ D((x, 1))/D(x).
(E.g. the probability that Ross selects Coffee.)
The best prediction rule is the Bayes-optimal rule:
h∗ (x) := I[η(x) > 1/2].
For a general set of labels Y:
\[ h^*(x) := \operatorname*{argmax}_{y \in \mathcal{Y}} P_{(X,Y) \sim D}[Y = y \mid X = x]. \]
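When the joint distribution is known (as in the toy table earlier; the dictionary values below are illustrative), the Bayes-optimal rule can be computed directly; dividing by P[X = x] does not change the argmax:

```python
D = {                      # toy joint distribution over (customer, drink) pairs
    ("Monica", "Tea"): 0.20,
    ("Phoebe", "Juice"): 0.25,
    ("Phoebe", "Tea"): 0.10,
    ("Ross", "Coffee"): 0.20,
    ("Chandler", "Juice"): 0.25,
}
labels = ["Juice", "Tea", "Coffee"]

def bayes_optimal(x):
    """h*(x) = argmax_y P[Y = y | X = x] = argmax_y D((x, y))."""
    return max(labels, key=lambda y: D.get((x, y), 0.0))

print(bayes_optimal("Phoebe"))   # "Juice" (25% beats 10% for Tea)
```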
The Bayes-optimal rule
Proof for the Bayes-optimal rule:
\[
\begin{aligned}
\mathrm{err}(h, D) &= P_{(X,Y)\sim D}[h(X) \ne Y] \\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}:\, h(x)\ne y} P[X = x, Y = y] \\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}:\, h(x)\ne y} P[X = x]\cdot P[Y = y \mid X = x] \\
&= \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}:\, h(x)\ne y} P[X = x]\cdot P[Y = y \mid X = x] \\
&= \sum_{x\in\mathcal{X}} P[X = x] \sum_{y\in\mathcal{Y}:\, h(x)\ne y} P[Y = y \mid X = x] \\
&= \sum_{x\in\mathcal{X}} P[X = x]\cdot P[Y \ne h(x) \mid X = x] \\
&= \sum_{x\in\mathcal{X}} P[X = x]\,\big(1 - P[Y = h(x) \mid X = x]\big)
\end{aligned}
\]
err(h, D) is smallest when for each x, P[Y = h(x) | X = x] is largest.
Hence argmax_{y∈Y} P(X,Y)∼D [Y = y | X = x] is the optimal choice for h(x).
When does Nearest Neighbor approach the optimal error?
We would like Nearest Neighbor to do well on our task.
This will be the case if:
examples that are close together usually have the same label.
A "nice" distribution is one in which there is some c > 0 such that
\[ \forall x, x' \in \mathcal{X}, \quad |\eta(x) - \eta(x')| \le c\,\|x - x'\|. \]
We say in this case that η is c-Lipschitz.
Theorem
If X ⊆ [0, 1]^d, Y = {0, 1}, η for the distribution D is c-Lipschitz, and the training sample is of size m, then
\[ E_{S \sim D^m}[\mathrm{err}(f_S^{\mathrm{nn}}, D)] \le 2\,\mathrm{err}(h^*, D) + 4c\sqrt{d}\, m^{-1/(d+1)}. \]
Implications of this guarantee
Theorem
If X ⊆ [0, 1]^d, Y = {0, 1}, and η for the distribution D is c-Lipschitz, then
\[ E_{S \sim D^m}[\mathrm{err}(f_S^{\mathrm{nn}}, D)] \le 2\,\mathrm{err}(h^*, D) + 4c\sqrt{d}\, m^{-1/(d+1)}. \]
The nearest neighbor rule approaches twice the error of the
Bayes-optimal.
In some cases, the sample size m needs to be exponential in d.
If d is large, this could be a serious problem.
This is termed the curse of dimensionality:
- More dimensions (features) ⇒ many more examples are needed.
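To see the exponential dependence, one can solve 4c√d · m^{−1/(d+1)} ≤ ε for m (the values c = 1 and ε = 0.1 below are assumptions for illustration):

```python
import math

c, eps = 1.0, 0.1                                  # assumed Lipschitz constant, target excess error
for d in (2, 10, 20):
    m = (4 * c * math.sqrt(d) / eps) ** (d + 1)    # smallest m making the bound <= eps
    print(d, f"{m:.1e}")
# d=2 -> ~1.8e5 examples, d=10 -> ~1.3e23, d=20 -> ~2.0e47
```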
k-Nearest Neighbor
We showed that we can approach 2 × the Bayes-optimal error.
Can we approach the Bayes-optimal error exactly?
S = ((x1 , y1 ), . . . , (xm , ym ))
For x ∈ X, denote by πi(x) the index of the i'th closest example to x in S. So:
\[ i \le j \implies \rho(x_{\pi_i(x)}, x) \le \rho(x_{\pi_j(x)}, x). \]
k-Nearest Neighbors algorithm
input A training sample S, integer k ≥ 1.
output A function ĥS : X → Y.
1: Set ĥS = fSk-nn , where fSk-nn is defined as:
\[ \forall x \in \mathcal{X}, \quad f_S^{k\text{-nn}}(x) = \text{the majority label among } y_{\pi_1(x)}, \dots, y_{\pi_k(x)}. \]
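A sketch extending the earlier 1-NN code to k neighbors (illustrative; ties in the majority vote are broken arbitrarily):

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def k_nearest_neighbors(sample, k):
    """Return the k-NN prediction rule for sample S = [(x1, y1), ..., (xm, ym)]."""
    def h(x):
        # sort training examples by distance to x and keep the k closest
        neighbors = sorted(sample, key=lambda pair: euclidean(pair[0], x))[:k]
        labels = [y for _, y in neighbors]
        return Counter(labels).most_common(1)[0][0]   # majority label
    return h

# Hypothetical usage with a toy sample of (age, is_female) customers:
S = [((25, 1), "Tea"), ((40, 0), "Coffee"), ((22, 1), "Juice"), ((23, 1), "Tea")]
h = k_nearest_neighbors(S, k=3)
print(h((24, 1)))   # majority drink among the 3 closest customers ("Tea")
```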
k-Nearest Neighbors
Theorem
Suppose that k1, k2, . . . is a sequence such that lim_{m→∞} km = ∞ and lim_{m→∞} km/m = 0. Then
\[ \lim_{m \to \infty} E_{S \sim D^m}[\mathrm{err}(f_S^{k_m\text{-nn}}, D)] = \mathrm{err}(h^*, D). \]
By increasing k slowly with the sample size, the correct majority label can be approached at every point in the space X.
Curse of dimensionality still holds for k-nearest neighbors.
In practice, k-nearest neighbors is often used with some small fixed k.
Efficient implementation
Computing distances in high dimensions is expensive.
Also: storing the entire sample is memory-intensive.
How to speed up calculations and save memory?
Idea: project onto lower-dimensional subspace
- Principal Components Analysis (PCA) preserves the general "cloud" shape of the data.
- The Johnson-Lindenstrauss (JL) transform approximately preserves pairwise distances (a small sketch follows below).
Idea: sample compression/condensing
- Store only a few "representative" examples and discard the rest.
- Memory efficient.
- Also results in a prediction speedup.
- It turns out this can also improve generalization!
We will see some of these methods later in the course.
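As a taste of the JL idea mentioned above (a sketch under standard assumptions, not the course's implementation): multiplying by a random Gaussian matrix scaled by 1/√(target dimension) roughly preserves pairwise distances.

```python
import random, math

def random_projection(vectors, target_dim, seed=0):
    """Project d-dimensional vectors to target_dim dimensions using a random
    Gaussian matrix scaled by 1/sqrt(target_dim) (a JL-style construction)."""
    rng = random.Random(seed)
    d = len(vectors[0])
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(target_dim)]  # target_dim x d
    scale = 1 / math.sqrt(target_dim)
    return [[scale * sum(a * x for a, x in zip(row, v)) for row in A] for v in vectors]

# Toy check (illustrative): distances before vs. after projecting 100-dim points to 20 dims.
rng = random.Random(1)
u = [rng.gauss(0, 1) for _ in range(100)]
v = [rng.gauss(0, 1) for _ in range(100)]
pu, pv = random_projection([u, v], target_dim=20)
dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
print(dist(u, v), dist(pu, pv))   # the two distances should be of similar magnitude
```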
Exercise 1
We are releasing it early for your convenience
You can already do Q1 and Q2
For Q3 you need next week’s lecture.
Deadline: 16.11.15
Main part of exercise: implement and run k-Nearest-Neighbors.
Getting started with Matlab: see the link in the exercise.
You may also use Octave instead of Matlab.
More explanations on the exercise will be given in class.