Linear Regression and Support Vector Regression
Paul Paisitkriangkrai
paulp@cs.adelaide.edu.au
The University of Adelaide
24 October 2012

Outline
• Regression overview
• Linear regression
• Support vector regression
• Machine learning tools available

Regression Overview
• Clustering: group data based on their characteristics
  – K-means
• Classification: separate data based on their labels
  – Decision tree
  – Linear Discriminant Analysis
  – Neural Networks
  – Support Vector Machines
  – Boosting
• Regression (this talk): find a model that can explain the output given the input
  – Linear Regression
  – Support Vector Regression

Data processing flowchart (Income prediction)
• Raw data -> Pre-processing (noise/outlier removal) -> Processed data -> Feature extraction and selection -> Transformed data -> Regression

  Sex   Age   Height   Income
  M     20    1.7      25,000
  F     30    1.6      55,000
  ...   ...   1.8      30,000
  ...

• One of the age values is a mis-entry (it should have been 25); such outliers are removed during pre-processing.
• Feature selection: height and sex seem to be irrelevant to income, so only age is kept.
[Figure: income plotted against age after pre-processing]

Linear Regression
• Given data with n-dimensional input variables and one real-valued target variable:
  $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$.
• The objective: find a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that returns the best fit.
• Assume that the relationship between x and y is approximately linear. The model can be represented as (w represents the coefficients and b is an intercept)
  $f(\mathbf{x}; w_1, \ldots, w_n, b) = \mathbf{w} \cdot \mathbf{x} + b$

Linear Regression
• To find the best fit, we minimize the sum of squared errors (least squares estimation):
  $\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{m} \left(y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)\right)^2$
• The solution can be found by taking the derivative of the objective with respect to w and setting it to zero:
  $\hat{\mathbf{w}} = (X^T X)^{-1} X^T Y$
• In MATLAB, the back-slash operator computes a least squares solution.

Linear Regression
• To avoid over-fitting, a regularization term can be introduced to penalize the magnitude of w:
  – LASSO: $\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \mathbf{w} \cdot \mathbf{x}_i - b)^2 + C \sum_{j=1}^{n} |w_j|$
  – Ridge regression: $\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \mathbf{w} \cdot \mathbf{x}_i - b)^2 + C \sum_{j=1}^{n} w_j^2$
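As a rough illustration of the least-squares and ridge formulas above, here is a minimal NumPy sketch. The data values are made up, and np.linalg.lstsq plays the role of MATLAB's back-slash operator; this is only a sketch, not part of the original slides.

```python
import numpy as np

# Toy data: m examples, 1 feature (values are made up for illustration).
X = np.array([[20.0], [30.0], [25.0], [40.0]])       # e.g. age
y = np.array([25000.0, 55000.0, 30000.0, 60000.0])   # e.g. income

# Append a column of ones so the intercept b is estimated together with w.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Least squares: w_hat = (X^T X)^{-1} X^T y.
# np.linalg.lstsq is the NumPy analogue of MATLAB's back-slash operator.
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Ridge regression has a similar closed form: (X^T X + C I)^{-1} X^T y,
# where C controls how strongly the magnitude of w is penalised
# (for simplicity the intercept column is penalised here too).
C = 1.0
w_ridge = np.linalg.solve(Xb.T @ Xb + C * np.eye(Xb.shape[1]), Xb.T @ y)

print("least squares  (w, b):", w_hat)
print("ridge          (w, b):", w_ridge)
```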
Support Vector Regression
• Find a function $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ with at most ε-deviation from the targets $y_i$.
• The problem can be written as a convex optimization problem:
  $\min \frac{1}{2}\|\mathbf{w}\|^2$
  s.t. $y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \le \varepsilon$; $(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \le \varepsilon$
• We do not care about errors as long as they are less than ε.
• What if the problem is not feasible? We can introduce slack variables (similar to the soft margin loss function), with C trading off model complexity against errors.
[Figure: income vs. age with an ε-tube around the fitted line]

Support Vector Regression (ε-insensitive loss)
• Assume a linear parameterization $f(\mathbf{x}, \omega) = \mathbf{w} \cdot \mathbf{x} + b$.
• Only the points outside the ε-tube contribute to the final cost:
  $L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max\left(|y - f(\mathbf{x}, \omega)| - \varepsilon, \, 0\right)$
[Figure: ε-tube of width 2ε around f(x); points inside the tube incur zero loss]

Soft margin
• Given training data $\{(\mathbf{x}_i, y_i)\}, \ i = 1, \ldots, m$, minimize
  $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$
  under the constraints
  $y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i$
  $(\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*$
  $\xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, m$
[Figure: slack variables ξ, ξ* measure how far points fall outside the 2ε tube]

How about a non-linear case?

Linear versus Non-linear SVR
• Linear case: $f:$ age $\rightarrow$ income, with $y_i = w_1 x_i + b$.
• Non-linear case: map the data into a higher-dimensional space, e.g. $f:$ (age, 2·age²) $\rightarrow$ income, with $y_i = w_1 x_i + w_2 (2 x_i^2) + b$.

Dual problem
• Primal:
  $\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$
  s.t. $y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i$; $(\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*$; $\xi_i, \xi_i^* \ge 0, \ i = 1, \ldots, m$
• Dual:
  $\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^*)$
  s.t. $\sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0$; $0 \le \alpha_i, \alpha_i^* \le C$
• Primal variables: one w per feature dimension; complexity grows with the dimension of the input space.
• Dual variables: α, α* for each data point; complexity grows with the number of support vectors.
[Figure: points outside the ε-tube have α > 0 (support vectors); points strictly inside have α = 0]

Kernel trick
• Linear: $\langle \mathbf{x}, \mathbf{y} \rangle$
• Non-linear: $\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y})$
• Note: there is no need to compute the mapping function φ(·) explicitly. Instead, we use the kernel function.
• Commonly used kernels:
  – Polynomial kernels: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + 1)^d$
  – Radial basis function (RBF) kernels: $K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{y}\|^2\right)$
• Note: for the RBF kernel, dim(φ(·)) is infinite.

Dual problem for the non-linear case
• Primal:
  $\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$
  s.t. $y_i - (\mathbf{w} \cdot \phi(\mathbf{x}_i)) - b \le \varepsilon + \xi_i$; $(\mathbf{w} \cdot \phi(\mathbf{x}_i)) + b - y_i \le \varepsilon + \xi_i^*$; $\xi_i, \xi_i^* \ge 0, \ i = 1, \ldots, m$
• Dual:
  $\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle - \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^*)$
  s.t. $\sum_{i=1}^{m} (\alpha_i - \alpha_i^*) = 0$; $0 \le \alpha_i, \alpha_i^* \le C$
  where $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$.
• Primal variables: one w per (mapped) feature dimension; complexity grows with the dimension of the mapped space.
• Dual variables: α, α* for each data point; complexity grows with the number of support vectors.

SVR Applications
• Optical Character Recognition (OCR)
  – A. J. Smola and B. Scholkopf, "A Tutorial on Support Vector Regression", NeuroCOLT Technical Report TR-98-030.
• Stock price prediction

SVR Demo

WEKA and linear regression
• The software can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/
• Data set used in this experiment: Computer hardware
• The objective is to predict CPU performance from the given attributes:
  – Machine cycle time in nanoseconds (MYCT)
  – Minimum main memory in kilobytes (MMIN)
  – Maximum main memory in kilobytes (MMAX)
  – Cache memory in kilobytes (CACH)
  – Minimum channels in units (CHMIN)
  – Maximum channels in units (CHMAX)
• The output is expressed as a linear combination of the attributes, each with a specific weight:
  Output $= w_1 a_1 + w_2 a_2 + \ldots + w_n a_n + b$

Evaluation
• Root mean squared error:
  $\sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \ldots + (y_m - \hat{y}_m)^2}{m}}$
• Mean absolute error:
  $\dfrac{|y_1 - \hat{y}_1| + |y_2 - \hat{y}_2| + \ldots + |y_m - \hat{y}_m|}{m}$

WEKA
• Load the data and normalize each attribute to [0, 1].
• Data visualization.

WEKA (Linear Regression)
• Performance = (72.8 × MYCT) + (484.8 × MMIN) + (355.6 × MMAX) + (161.2 × CACH) + (256.9 × CHMAX) − 53.9
• Main memory plays a more important role in the system performance.
• A large machine cycle time (MYCT) does not indicate the best performance.

WEKA (linear SVR)
• Compare the learned weights to those of linear regression:
  Performance = (72.8 × MYCT) + (484.8 × MMIN) + (355.6 × MMAX) + (161.2 × CACH) + (256.9 × CHMAX) − 53.9

WEKA (non-linear SVR)
• The model is expressed as a list of support vectors.

WEKA (Performance comparison)

  Method                            Mean absolute error   Root mean squared error
  Linear regression                 41.1                  69.55
  SVR (linear), C = 1.0             35.0                  78.8
  SVR (RBF), C = 1.0, gamma = 1.0   28.8                  66.3

• The parameter C (for linear SVR) and the pair (C, γ) (for non-linear SVR) need to be cross-validated for better performance.

Other Machine Learning tools
• Shogun toolbox (C++)
  – http://www.shogun-toolbox.org/
• Shark Machine Learning library (C++)
  – http://shark-project.sourceforge.net/
• Machine Learning in Python (Python)
  – http://pyml.sourceforge.net/
• Machine Learning in OpenCV2
  – http://opencv.willowgarage.com/wiki/
• LibSVM, LibLinear, etc.
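Going back to the SVR formulas above, the ε-insensitive loss, the RBF kernel, and the standard dual-form prediction $f(\mathbf{x}) = \sum_i (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b$ (not shown on the slides, but it follows from the dual) can be written directly in NumPy. This is only a sketch of the formulas: the function names are illustrative, and the α, α* values would in practice come from a QP solver such as the one inside LibSVM.

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    """L_eps(y, f) = max(|y - f| - eps, 0): errors inside the eps-tube cost nothing."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x, X_sv, alpha, alpha_star, b, sigma=1.0):
    """Dual-form prediction f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b.

    X_sv holds the support vectors; alpha and alpha_star are the dual
    variables obtained by solving the dual QP above.
    """
    k = np.array([rbf_kernel(xi, x, sigma) for xi in X_sv])
    return np.dot(alpha - alpha_star, k) + b
```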
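Similarly, if scikit-learn is available (it is not in the list above, but is another common Python option), the WEKA-style comparison of linear regression, linear SVR and RBF SVR can be sketched as follows. The data here is randomly generated stand-in data, not the computer-hardware set, so the numbers it prints are not the ones in the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Stand-in data: 200 examples with 6 attributes already scaled to [0, 1]
# (the real experiment would load MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX instead).
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = X @ np.array([1.0, 5.0, 4.0, 2.0, 0.5, 3.0]) + rng.normal(0.0, 0.5, 200)

models = {
    "Linear regression":            LinearRegression(),
    "SVR (linear), C=1.0":          SVR(kernel="linear", C=1.0),
    "SVR (RBF), C=1.0, gamma=1.0":  SVR(kernel="rbf", C=1.0, gamma=1.0),
}

# Training error only, for brevity; as noted above, C (and gamma for the RBF
# kernel) should be chosen by cross-validation for a fair comparison.
for name, model in models.items():
    pred = model.fit(X, y).predict(X)
    mae = mean_absolute_error(y, pred)
    rmse = mean_squared_error(y, pred) ** 0.5
    print(f"{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```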