Note to other teachers and users of these slides: We would be delighted if you found our material useful for your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

Course map:
- High-dimensional data: Locality-sensitive hashing, Clustering, Dimensionality reduction
- Graph data: PageRank, SimRank, Community detection, Spam detection
- Infinite data: Filtering data streams, Web advertising, Queries on streams
- Machine learning: SVM, Decision trees, Perceptron, kNN
- Apps: Recommender systems, Association rules, Duplicate document detection

Example: Spam filtering
- Instance space: x ∈ X (|X| = n data points); x is a binary or real-valued feature vector of word occurrences with d features (words + other things, d ~ 100,000).
- Class: y ∈ Y, with y = Spam (+1) or Ham (-1).
- Goal: estimate a function f(x) so that y = f(x).

We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be:
- a real number: regression
- categorical: classification
- a complex object: a ranking of items, a parse tree, etc.
The data is labeled: we have many pairs {(x, y)}, where x is a vector of binary, categorical, or real-valued features and y is the class ({+1, -1}, or a real number).

Task: given data (X, Y), build a model f() to predict Y' based on X'.
Strategy: estimate y = f(x) on the training data (X, Y) and hope that the same f(x) also works to predict the unknown Y' on the test data X'.
- This "hope" is called generalization: we want to build a model that generalizes well to unseen data.
- Overfitting: f(x) predicts Y well but is unable to predict Y'.
- But Jure, how can we do well on data we have never seen before?!?

Idea: pretend we do not know the data/labels we actually do know.
- Split the data into a training set, a validation set, and a test set.
- Build the model f(x) on the training data and see how well f(x) does on the test data. If it does well, then apply it also to X'.
Refinement: cross-validation. Splitting into a single training/validation set is brutal, so:
- Split the data (X, Y) into 10 folds (buckets).
- Take out 1 fold for validation and train on the remaining 9.
- Repeat this 10 times and report the average performance.

Linear models for binary classification:
  f(x) = +1 if w(1)x(1) + w(2)x(2) + ... + w(d)x(d) ≥ θ
         -1 otherwise
- Input: vectors x_j with labels y_j; the vectors x_j are real valued, with ||x||_2 = 1.
- Goal: find the vector w = (w(1), w(2), ..., w(d)), where each w(i) is a real number.
- Note: by mapping x → (x, 1) and w → (w, -θ), the threshold θ folds into the weight vector, so the decision boundary becomes w·x = 0.
(Figure: "+" and "-" points separated by the hyperplane w·x = 0, with w normal to it; the decision boundary is linear.)

(Very) loose motivation: the neuron.
- Inputs are the feature values x(1), ..., x(d); each feature has a weight w(i).
- The activation is the sum f(x) = Σ_i w(i)·x(i) = w·x.
- If f(x) is positive, predict +1; if negative, predict -1.
(Figure: a 2-dimensional spam example with word features such as "viagra" and "nigeria"; points on one side of w·x = 0 are Spam (+1), points on the other side are Ham (-1).)
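A minimal Python/NumPy sketch of this prediction rule (not part of the original slides; the weights and the feature vector in the toy example are made up for illustration):

```python
import numpy as np

def predict(w, x):
    """Linear classifier / neuron: f(x) = +1 if w . x >= 0, else -1.
    The threshold is assumed already folded into w via the (x, 1), (w, -theta) trick."""
    return 1 if np.dot(w, x) >= 0 else -1

# Toy spam example: features = counts of "viagra" and "nigeria", plus a constant 1.
w = np.array([2.0, 1.5, -1.0])   # illustrative weights; the last entry plays the role of -theta
x = np.array([1.0, 0.0, 1.0])    # one occurrence of "viagra", none of "nigeria"
print(predict(w, x))             # +1 -> Spam
```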
Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.

Perceptron: y' = sign(w·x). How do we find the parameters w?
- Start with w_0 = 0.
- Pick training examples x_t one by one.
- Predict the class of x_t using the current w_t: y' = sign(w_t·x_t).
- If y' is correct (i.e., y_t = y'), make no change: w_{t+1} = w_t.
- If y' is wrong, adjust w_t: w_{t+1} = w_t + η·y_t·x_t, where η is the learning rate parameter, x_t is the t-th training example, and y_t is the true t-th class label ({+1, -1}).
(Figure: a misclassified example (x_t, y_t = 1) pulls w_t towards it, giving w_{t+1}.)

Good: the Perceptron convergence theorem. If there exists a set of weights that is consistent (i.e., the data is linearly separable), the Perceptron learning algorithm will converge.
Bad: if the data is not separable, it never converges; the weights dance around indefinitely.
Bad: mediocre generalization: it finds a "barely" separating solution.

The Perceptron will oscillate and won't converge, so when do we stop learning?
(1) Slowly decrease the learning rate η. A classic way is η = c_1/(t + c_2), but then we also need to determine the constants c_1 and c_2.
(2) Stop when the training error stops changing.
(3) Hold out a small test dataset and stop when the test set error stops decreasing.
(4) Stop when we have reached some maximum number of passes over the data.
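A minimal Python/NumPy sketch of the Perceptron training loop just described (not from the original slides; the values of the learning rate eta and max_passes are illustrative, the outer loop uses stopping criterion (4), and the early exit corresponds to convergence on separable data):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_passes=10):
    """Perceptron: start from w = 0 and update only on mistakes.
    X: (n, d) array of examples, y: (n,) array of labels in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for xt, yt in zip(X, y):
            y_pred = 1 if w @ xt >= 0 else -1   # y' = sign(w . x_t)
            if y_pred != yt:                    # conservative: change w only when wrong
                w += eta * yt * xt              # w_{t+1} = w_t + eta * y_t * x_t
                mistakes += 1
        if mistakes == 0:                       # data separated: converged
            break
    return w
```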
We want to separate "+" from "-" using a line.
Data: training examples (x_1, y_1) ... (x_n, y_n). Each example i has x_i = (x_i(1), ..., x_i(d)) with real-valued x_i(j), and y_i ∈ {-1, +1}.
Inner product: w·x = Σ_{j=1}^{d} w(j)·x(j).
Which is the best linear separator (defined by w)?

The distance from the separating hyperplane corresponds to the "confidence" of the prediction.
Example: (Figure: points A and B lie far from the separating boundary; point C lies close to it.) We are more sure about the class of A and B than of C.

Margin γ: the distance of the closest example from the decision line/hyperplane.
The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.

Remember the dot product: A·B = |A|·|B|·cos θ, where |A| = sqrt(Σ_{j=1}^{d} A(j)²) and |A|·cos θ is the length of the projection of A onto B.

What are w·x_1 and w·x_2? (Figure: two configurations of points x_1, x_2 relative to w; the projections onto w give γ_1 in the first case and γ_2 ≈ 2·γ_1 in the second.) So γ roughly corresponds to the margin: the bigger γ, the bigger the separation.

Distance from a point to a line. Assume ||w||_2 = 1 and let:
- Line L: w·x + b = w(1)x(1) + w(2)x(2) + b = 0, with w = (w(1), w(2))
- Point A = (x_A(1), x_A(2))
- Point M on the line, M = (x_M(1), x_M(2))
Then, with H the projection of A onto L:
  d(A, L) = |AH| = |(A - M)·w|
          = |(x_A(1) - x_M(1))·w(1) + (x_A(2) - x_M(2))·w(2)|
          = |x_A(1)·w(1) + x_A(2)·w(2) + b|
          = |w·A + b|
Remember that x_M(1)·w(1) + x_M(2)·w(2) = -b, since M belongs to line L.

Prediction = sign(w·x + b); "confidence" = (w·x + b)·y.
For the i-th datapoint: γ_i = (w·x_i + b)·y_i.
We want to solve max_w min_i γ_i, which can be rewritten as
  max_{w,γ} γ   s.t. ∀i, y_i(w·x_i + b) ≥ γ
Maximizing the margin is good according to intuition, theory (VC dimension), and practice. Here γ is the margin: the distance from the separating hyperplane w·x + b = 0.

The separating hyperplane is defined by the support vectors:
- the points that lie on the +/- planes of the solution;
- if you knew these points, you could ignore the rest;
- generally there are d+1 support vectors (for d-dimensional data).

Problem: if (w·x + b)·y = γ, then (2w·x + 2b)·y = 2γ, so scaling w increases the margin!
Solution: work with a normalized w:
  γ = ((w/||w||)·x + b)·y,   where ||w|| = sqrt(Σ_{j=1}^{d} w(j)²)
Let us also require the support vectors x_j to lie on the planes defined by w·x_j + b = ±1.

We want to maximize the margin γ. What is the relation between the support vectors x_1 and x_2 on opposite planes?
  x_1 = x_2 + 2γ·(w/||w||)
We also know w·x_1 + b = +1 and w·x_2 + b = -1. So:
  w·x_1 + b = w·(x_2 + 2γ·w/||w||) + b = (w·x_2 + b) + 2γ·(w·w)/||w|| = -1 + 2γ·||w|| = +1
and therefore γ = 1/||w||.  (Note: w·w = ||w||².)

We started with max_{w,γ} γ s.t. ∀i, y_i(w·x_i + b) ≥ γ, but w can be arbitrarily large! After normalizing:
  arg max γ = arg max 1/||w|| = arg min ||w|| = arg min ½||w||²
So the problem becomes:
  min_w ½||w||²   s.t. ∀i, y_i(w·x_i + b) ≥ 1
This is called SVM with "hard" constraints.

If the data is not separable, introduce a penalty:
  min_w ½||w||² + C·(# of mistakes)   s.t. ∀i, y_i(w·x_i + b) ≥ 1
We minimize ||w||² plus the number of training mistakes, and set C using cross-validation.
But how should we penalize mistakes? All mistakes are not equally bad!

Introduce slack variables ξ_i:
  min_{w,b,ξ_i≥0} ½||w||² + C·Σ_{i=1}^{n} ξ_i   s.t. ∀i, y_i(w·x_i + b) ≥ 1 - ξ_i
If point x_i is on the wrong side of the margin, it gets penalty ξ_i.
For each data point: if the margin is ≥ 1, we don't care; if the margin is < 1, we pay a linear penalty.

What is the role of the slack penalty C?
- C = ∞: we only want w, b that separate the data.
- C = 0: we can set ξ_i to anything, and then w = 0 (the data is basically ignored).
(Figure: decision boundaries for small, "good", and big values of C.)

SVM in the "natural" form:
  arg min_{w,b} ½·w·w + C·Σ_{i=1}^{n} max{0, 1 - y_i(w·x_i + b)}
The first term is the margin (regularization); the second is the empirical loss L (how well we fit the training data); C is the regularization parameter.
SVM uses the hinge loss max{0, 1 - z} with z = y_i(x_i·w + b). This is equivalent to
  min_{w,b} ½||w||² + C·Σ_{i=1}^{n} ξ_i   s.t. ∀i, y_i(x_i·w + b) ≥ 1 - ξ_i,  ξ_i ≥ 0
(Figure: the 0/1 loss and the hinge loss max{0, 1 - z} as functions of z.)
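As a concrete reference, a small Python/NumPy sketch of this objective in its hinge-loss form (not from the original slides; the function and argument names are illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective in its 'natural' form:
    f(w, b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b)).
    X: (n, d) examples, y: (n,) labels in {+1, -1}; C is the regularization parameter."""
    margins = y * (X @ w + b)                # y_i (w . x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss, i.e. the slack xi_i
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```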
We want to estimate w and b. The standard way: use a solver!
A solver is software for finding solutions to "common" optimization problems. Here we can use a quadratic solver, which minimizes a quadratic function subject to linear constraints.
Problem: solvers are inefficient for big data!

Alternative approach: minimize f(w, b) directly, where
  f(w, b) = ½·Σ_{j=1}^{d} w(j)² + C·Σ_{i=1}^{n} max{0, 1 - y_i(Σ_{j=1}^{d} w(j)·x_i(j) + b)}
Side note: how do we minimize a convex function g(z)? Use gradient descent: to solve min_z g(z), iterate z_{t+1} ← z_t - η·∇g(z_t).

To minimize f(w, b), compute the gradient ∇f(j) with respect to w(j):
  ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w(j)
where L(x_i, y_i) is the empirical loss and
  ∂L(x_i, y_i)/∂w(j) = 0             if y_i(w·x_i + b) ≥ 1
                     = -y_i·x_i(j)   otherwise

Gradient descent:
  Iterate until convergence:
    For j = 1 ... d:
      Evaluate ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w(j)
      Update w(j) ← w(j) - η·∇f(j)
Here η is the learning rate parameter and C is the regularization parameter.
Problem: computing ∇f(j) takes O(n) time, where n is the size of the training dataset!

Stochastic Gradient Descent (SGD): we just had ∇f(j) = w(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w(j).
Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
  ∇f(j)(x_i) = w(j) + C·∂L(x_i, y_i)/∂w(j)
Notice: there is no summation over i anymore.
Stochastic gradient descent:
  Iterate until convergence:
    For i = 1 ... n:
      For j = 1 ... d:
        Compute ∇f(j)(x_i)
        Update w(j) ← w(j) - η·∇f(j)(x_i)
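A minimal Python/NumPy sketch of this SGD update (not from the original slides): the constant learning rate eta, the number of epochs, the random visiting order, and the handling of the bias b are simplifying assumptions for illustration.

```python
import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.01, epochs=10):
    """SGD for f(w, b) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b)).
    Per-example gradient w.r.t. w: w + C * dL/dw, where dL/dw = -y_i * x_i
    if y_i (w . x_i + b) < 1, and 0 otherwise."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):    # visit training examples one by one
            xi, yi = X[i], y[i]
            if yi * (w @ xi + b) < 1:         # hinge loss active: example inside the margin
                w -= eta * (w - C * yi * xi)
                b -= eta * (-C * yi)          # bias update: an assumption, b is not regularized
            else:                             # no loss: only the regularizer acts on w
                w -= eta * w
    return w, b
```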
Example by Leon Bottou: the Reuters RCV1 document corpus.
- Task: predict the category of a document (one vs. the rest classification).
- n = 781,000 training examples (documents), 23,000 test examples.
- d = 50,000 features: one feature per word, after removing stop words and low-frequency words.

Questions: (1) Is SGD successful at minimizing f(w, b)? (2) How quickly does SGD find the minimum of f(w, b)? (3) What is the error on a test set?
(Table: training time, value of f(w, b), and test error for a standard SVM, a "fast SVM", and SGD-SVM.)
Findings: (1) SGD-SVM is successful at minimizing the value of f(w, b); (2) SGD-SVM is super fast; (3) the SGD-SVM test set error is comparable.

SGD-SVM vs. a conventional SVM, measured by the optimization quality |f(w, b) - f(w_opt, b_opt)|: for optimizing f(w, b) to within reasonable quality, SGD-SVM is super fast.

SGD on the full dataset vs. Conjugate Gradient (CG) on a sample of n training examples. Theory says: gradient descent converges in time linear in k, while conjugate gradient converges in √k, where k is the condition number.
Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Choosing the learning rate: we need to choose η and t_0 in the update
  w_{t+1} ← w_t - (η/(t + t_0))·(w_t + C·∂L(x_i, y_i)/∂w)
Leon suggests:
- Choose t_0 so that the expected initial updates are comparable with the expected size of the weights.
- Choose η: select a small subsample, try various rates (e.g., 10, 1, 0.1, 0.01, ...), pick the one that most reduces the cost, and use it for the next 100k iterations on the full dataset.

Sparse linear SVM: the feature vector x_i is sparse (contains many zeros).
- Do not represent x_i as [0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 0, ...]; represent it as a sparse vector x_i = [(4, 1), (9, 5), ...].
Can we do the SGD update w ← w - η·(w + C·∂L(x_i, y_i)/∂w) more efficiently? Approximate it in 2 steps:
(1) w ← w - η·C·∂L(x_i, y_i)/∂w   ... cheap: x_i is sparse, so only a few coordinates j of w are updated
(2) w ← w·(1 - η)                 ... expensive: w is not sparse, so all coordinates need to be updated

Solution 1: represent the vector w as the product of a scalar s and a vector v, w = s·v. Then the two-step update procedure becomes:
(1) v ← v - η·C·∂L(x_i, y_i)/∂w
(2) s ← s·(1 - η)
Solution 2: perform only step (1) for each training example, and perform step (2) with lower frequency and a higher η.

Stopping criteria: how many iterations of SGD?
- Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, and stop when the loss stops decreasing.
- Early stopping: extract two disjoint subsamples A and B of the training data; train on A and stop by validating on B; the resulting number of epochs is an estimate of k; then train for k epochs on the full dataset.

Multiclass SVM. Idea 1: one against all.
- Learn 3 classifiers: + vs. {o, -}, - vs. {o, +}, and o vs. {+, -}, obtaining w_+, b_+, w_-, b_-, w_o, b_o.
- How to classify? Return the class c = arg max_c (w_c·x + b_c). (A prediction sketch follows at the end of this part.)

Idea 2: learn the 3 sets of weights simultaneously!
- For each class c, estimate w_c, b_c.
- We want the correct class to have the highest margin: w_{y_i}·x_i + b_{y_i} ≥ 1 + w_c·x_i + b_c, for all c ≠ y_i and all i.

Optimization problem:
  min_{w,b} ½·Σ_c ||w_c||² + C·Σ_{i=1}^{n} ξ_i
  s.t. w_{y_i}·x_i + b_{y_i} ≥ w_c·x_i + b_c + 1 - ξ_i,  ∀c ≠ y_i, ∀i
       ξ_i ≥ 0,  ∀i
To obtain the parameters w_c, b_c (for each class c), we can use techniques similar to the 2-class SVM.
SVM is widely perceived as a very powerful learning algorithm.
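A small sketch of the one-against-all prediction rule from Idea 1 above (not from the original slides); the classifiers mapping is assumed to hold per-class (w_c, b_c) pairs trained separately, e.g. with the SGD sketch shown earlier, and the example values below are made up.

```python
import numpy as np

def one_vs_all_predict(x, classifiers):
    """Idea 1 (one against all): return c = argmax_c (w_c . x + b_c).
    'classifiers' maps a class label to its (w_c, b_c) pair."""
    scores = {c: w @ x + b for c, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)

# Illustrative use with three made-up binary classifiers for classes "+", "-", "o".
classifiers = {
    "+": (np.array([1.0, 0.0]), 0.0),
    "-": (np.array([-1.0, 0.0]), 0.0),
    "o": (np.array([0.0, 1.0]), -0.5),
}
print(one_vs_all_predict(np.array([0.2, 2.0]), classifiers))  # "o" has the highest score here
```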