Linear Regression and Support Vector Regression

Paul Paisitkriangkrai
paulp@cs.adelaide.edu.au
The University of Adelaide
24 October 2012
Outline
• Regression overview
• Linear regression
• Support vector regression
• Machine learning tools available
Regression Overview
[Figure: three scatter plots illustrating clustering, classification, and regression]
• Clustering: group data based on their characteristics
  – e.g., K-means
• Classification: separate data based on their labels
  – e.g., Decision tree, Linear Discriminant Analysis, Neural Networks, Support Vector Machines, Boosting
• Regression (this talk): find a model that can explain the output given the input
  – e.g., Linear Regression, Support Vector Regression
Data processing flowchart (Income prediction)
Raw data → Pre-processing (noise/outlier removal) → Processed data → Feature extraction and selection → Transformed data → Regression

Raw data (example):
Sex   Age   Height   Income
M     20    1.7      25,000
F     30    1.6      55,000
…     …     1.8      30,000

• Pre-processing: one record contains a mis-entry (it should have been 25!) and is cleaned up.
• Feature extraction and selection: height and sex seem to be irrelevant, so only age is kept.
[Figure: processed and transformed data plotted as income versus age]
Linear Regression
• Given data with $n$-dimensional input variables and one target variable (a real number):
  $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}$, where $\mathbf{x} \in \mathbb{R}^n$, $y \in \mathbb{R}$
• The objective: find a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that returns the best fit.
• Assume that the relationship between $\mathbf{x}$ and $y$ is approximately linear. The model can be represented as ($\mathbf{w}$ represents the coefficients, $b$ is an intercept, and $\varepsilon$ is the error term):
  $f(w_1, \ldots, w_n, b) = y = \mathbf{w} \cdot \mathbf{x} + b + \varepsilon$
Linear Regression
• To find the best fit, we minimize the sum of squared errors (least squares estimation):
  $\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{m} (y_i - (\mathbf{w} \cdot \mathbf{x}_i + b))^2$
• The solution can be found by solving
  $\hat{\mathbf{w}} = (X^{T} X)^{-1} X^{T} Y$
  (obtained by taking the derivative of the objective function w.r.t. $\mathbf{w}$ and setting it to zero).
• In MATLAB, the backslash operator computes a least squares solution.
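The slides mention MATLAB's backslash operator; as a rough equivalent sketch in Python/NumPy (synthetic data, not from the slides), the same closed-form least squares fit looks like this:

```python
import numpy as np

# Synthetic 1-D example: y is roughly linear in x (illustrative data only).
rng = np.random.default_rng(0)
x = rng.uniform(20, 60, size=100)                          # e.g. age
y = 1000.0 * x + 5000.0 + rng.normal(0, 2000, size=100)    # e.g. income

# Append a column of ones so the intercept b is estimated together with w.
X = np.column_stack([x, np.ones_like(x)])

# Closed form w_hat = (X^T X)^-1 X^T y; np.linalg.lstsq is the numerically stable way.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
w, b = w_hat
print(f"w = {w:.1f}, b = {b:.1f}")
```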
Linear Regression
$\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{m} (y_i - (\hat{\mathbf{w}} \cdot \mathbf{x}_i + \hat{b}))^2$
• To avoid over-fitting, a regularization term can be introduced (it penalizes the magnitude of $\mathbf{w}$):
  – LASSO: $\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \mathbf{w} \cdot \mathbf{x}_i - b)^2 + C \sum_{j=1}^{n} |w_j|$
  – Ridge regression: $\min_{\mathbf{w}, b} \sum_{i=1}^{m} (y_i - \mathbf{w} \cdot \mathbf{x}_i - b)^2 + C \sum_{j=1}^{n} |w_j|^2$
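Not part of the original slides: a minimal scikit-learn sketch of both regularized fits on synthetic data. Note that scikit-learn calls the penalty weight `alpha`; it plays the same role as C in the formulas above (it multiplies the penalty term).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with a few irrelevant features (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.1, size=100)

# alpha here corresponds to C in the slide's formulation (weight on the penalty).
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: many coefficients driven to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: coefficients shrunk but rarely exactly 0

print("LASSO coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```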
Support Vector Regression
• Find a function $f(\mathbf{x})$ with at most $\varepsilon$-deviation from the target $y$.
• The problem can be written as a convex optimization problem:
  $\min_{\mathbf{w}, b}\ \frac{1}{2}\|\mathbf{w}\|^2$
  s.t. $y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \le \varepsilon$
       $(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \le \varepsilon$
• We do not care about errors as long as they are less than $\varepsilon$.
• What if the problem is not feasible? We can introduce slack variables (similar to the soft-margin loss function); the parameter C then trades off the model complexity against deviations larger than $\varepsilon$.
[Figure: income versus age with a linear fit and an ε-tube around it]
Support Vector Regression
• Assume a linear parameterization: $f(\mathbf{x}, \omega) = \mathbf{w} \cdot \mathbf{x} + b$
• Only the points outside the $\varepsilon$-region contribute to the final cost:
  $L(y, f(\mathbf{x}, \omega)) = \max(|y - f(\mathbf{x}, \omega)| - \varepsilon,\ 0)$
[Figure: the ε-tube around f(x); points outside the tube incur slacks ξ₁ and ξ₂*]
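As a minimal sketch (not from the slides), the ε-insensitive loss above written as a NumPy function:

```python
import numpy as np

def eps_insensitive_loss(y, y_pred, eps=0.1):
    """L(y, f(x)) = max(|y - f(x)| - eps, 0): errors inside the eps-tube cost nothing."""
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

# Points within eps of the prediction contribute zero loss.
print(eps_insensitive_loss(np.array([1.0, 1.05, 2.0]), np.array([1.0, 1.0, 1.0])))
# -> [0.   0.   0.9]
```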
Soft margin
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, m$
• Minimize
  $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^{*})$
• Under the constraints
  $y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i$
  $(\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^{*}$
  $\xi_i, \xi_i^{*} \ge 0, \quad i = 1, \ldots, m$
[Figure: points outside the ε-tube receive slacks ξ₁ and ξ₂*]
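Not part of the original slides: a small sketch of this soft-margin primal written with CVXPY (my own choice of solver library), following the objective and constraints above on synthetic data.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                               # m = 50 points, n = 2 features
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 0.1, size=50)

eps, C = 0.1, 1.0
w, b = cp.Variable(2), cp.Variable()
xi, xi_star = cp.Variable(50, nonneg=True), cp.Variable(50, nonneg=True)

# 1/2 ||w||^2 + C * sum(xi_i + xi_i*)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
constraints = [
    y - (X @ w + b) <= eps + xi,        # y_i - (w.x_i + b) <= eps + xi_i
    (X @ w + b) - y <= eps + xi_star,   # (w.x_i + b) - y_i <= eps + xi_i*
]
cp.Problem(objective, constraints).solve()
print("w =", w.value, "b =", b.value)
```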
How about a non-linear case?
Linear versus Non-linear SVR
• Linear case:
  $f: \text{age} \rightarrow \text{income}$
  $y_i = w_1 x_i + b$
• Non-linear case: map the data into a higher-dimensional space, e.g.
  $f: (\text{age},\ \sqrt{2}\,\text{age}^2) \rightarrow \text{income}$
  $y_i = w_1 x_i + w_2 \sqrt{2}\, x_i^2 + b$
[Figure: income versus age, with a linear fit and a curved (quadratic) fit]
Dual problem
• Primal
  $\min_{\mathbf{w}, b, \xi, \xi^{*}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^{*})$
  s.t. $y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i$
       $(\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^{*}$
       $\xi_i, \xi_i^{*} \ge 0, \quad i = 1, \ldots, m$
  – Primal variables: one $w$ per feature dimension
  – Complexity: the dimensionality of the input space
• Dual
  $\max_{\alpha, \alpha^{*}}\ -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*}) \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^{*})$
  s.t. $\sum_{i=1}^{m} (\alpha_i - \alpha_i^{*}) = 0; \quad 0 \le \alpha_i, \alpha_i^{*} \le C$
  – Dual variables: $\alpha_i, \alpha_i^{*}$ per data point
  – Complexity: the number of support vectors
[Figure: points on or outside the ε-tube have α > 0; points strictly inside have α = 0]
Kernel trick
• Linear: $\langle \mathbf{x}, \mathbf{y} \rangle$
• Non-linear: $\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y})$
• Note: there is no need to compute the mapping function $\phi(\cdot)$ explicitly. Instead, we use the kernel function.
• Commonly used kernels:
  – Polynomial kernels: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^{T}\mathbf{y} + 1)^{d}$
  – Radial basis function (RBF) kernels: $K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{x} - \mathbf{y}\|^{2}\right)$
• Note: for the RBF kernel, $\dim(\phi(\cdot))$ is infinite.
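A small sketch (not in the slides) of the two kernels above written as plain NumPy functions:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x^T y + 1)^d"""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(polynomial_kernel(x, y), rbf_kernel(x, y))
```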
Dual problem for non-linear case
• Primal
  $\min_{\mathbf{w}, b, \xi, \xi^{*}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^{*})$
  s.t. $y_i - (\mathbf{w} \cdot \phi(\mathbf{x}_i)) - b \le \varepsilon + \xi_i$
       $(\mathbf{w} \cdot \phi(\mathbf{x}_i)) + b - y_i \le \varepsilon + \xi_i^{*}$
       $\xi_i, \xi_i^{*} \ge 0, \quad i = 1, \ldots, m$
  – Primal variables: one $w$ per feature dimension
  – Complexity: the dimensionality of the input space
• Dual
  $\max_{\alpha, \alpha^{*}}\ -\frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*}) \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle - \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^{*})$
  where $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$
  s.t. $\sum_{i=1}^{m} (\alpha_i - \alpha_i^{*}) = 0; \quad 0 \le \alpha_i, \alpha_i^{*} \le C$
  – Dual variables: $\alpha_i, \alpha_i^{*}$ per data point
  – Complexity: the number of support vectors
[Figure: points on or outside the ε-tube have α > 0; points strictly inside have α = 0]
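As a hedged illustration (not part of the slides): scikit-learn's SVR solves this kernelized dual internally; its `gamma` parameter corresponds to $1/(2\sigma^2)$ in the RBF kernel above, and only points with non-zero $\alpha_i$ or $\alpha_i^*$ are kept as support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# RBF-kernel SVR; gamma = 1 / (2 * sigma^2) relative to the slide's notation.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

# Only points with alpha_i or alpha_i* > 0 become support vectors.
print("number of support vectors:", len(svr.support_))
print("first dual coefficients (alpha_i - alpha_i*):", svr.dual_coef_[0][:5])
```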
SVR Applications
• Optical Character Recognition (OCR)
A. J. Smola and B. Scholkopf, A Tutorial on Support Vector Regression, NeuroCOLT Technical Report TR-98-030
SVR Applications
• Stock price prediction
SVR Demo
WEKA and linear regression
• The software can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/
• Data set used in this experiment: Computer Hardware
• The objective is to predict CPU performance from the given attributes:
  – Machine cycle time in nanoseconds (MYCT)
  – Minimum main memory in kilobytes (MMIN)
  – Maximum main memory in kilobytes (MMAX)
  – Cache memory in kilobytes (CACH)
  – Minimum channels in units (CHMIN)
  – Maximum channels in units (CHMAX)
• The output is expressed as a linear combination of the attributes; each attribute has a specific weight (a sketch of the same setup follows below):
  – Output = $w_1 a_1 + w_2 a_2 + \ldots + w_n a_n + b$
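The experiment can be sketched outside WEKA as well; the block below uses scikit-learn and is only an illustration. The file name `cpu.csv` and the `PERFORMANCE` column name are assumptions on my part (WEKA ships this data set in ARFF form), so they would need to be adapted.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# 'cpu.csv' and the column names below are assumptions; export/rename the
# WEKA Computer Hardware data accordingly before running this.
data = pd.read_csv("cpu.csv")
attributes = ["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX"]

X = MinMaxScaler().fit_transform(data[attributes])   # normalize each attribute to [0, 1], as in WEKA
y = data["PERFORMANCE"]

model = LinearRegression().fit(X, y)
for name, w in zip(attributes, model.coef_):
    print(f"{name}: {w:.1f}")
print(f"intercept: {model.intercept_:.1f}")
```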
Evaluation
• Root mean squared error:
  $\sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \ldots + (y_m - \hat{y}_m)^2}{m}}$
• Mean absolute error:
  $\dfrac{|y_1 - \hat{y}_1| + |y_2 - \hat{y}_2| + \ldots + |y_m - \hat{y}_m|}{m}$
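A short NumPy sketch (not from the slides) of the two error measures, with made-up values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([25000.0, 55000.0, 30000.0])
y_pred = np.array([27000.0, 50000.0, 31000.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```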
WEKA
Load data and normalize each attribute to [0, 1]
Data visualization
WEKA (Linear Regression)
Performance = (72.8 x MYCT) + (484.8 x MMIN) + (355.6 x MMAX) + (161.2 x CACH) + (256.9 x CHMAX) - 53.9
• Main memory plays a more important role in the system performance.
• A large machine cycle time (MYCT) does not indicate the best performance.
WEKA (Linear SVR)
For comparison, linear regression gave:
Performance = (72.8 x MYCT) + (484.8 x MMIN) + (355.6 x MMAX) + (161.2 x CACH) + (256.9 x CHMAX) - 53.9
WEKA (Non-linear SVR)
• The output includes a list of support vectors.
WEKA (Performance comparison)

Method                             Mean absolute error   Root mean squared error
Linear regression                  41.1                  69.55
SVR (linear), C = 1.0              35.0                  78.8
SVR (RBF), C = 1.0, gamma = 1.0    28.8                  66.3
The parameter C (for linear SVR) and the pair (C, γ) (for non-linear SVR) need to be cross-validated for better performance; a sketch of such a search follows below.
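A hedged sketch of that cross-validation using scikit-learn's GridSearchCV (the slides use WEKA; the data here is only a synthetic stand-in for the normalized CPU attributes):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the normalized CPU data (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 6))
y = 100 * X[:, 1] + 80 * X[:, 2] + rng.normal(0, 5, size=200)

# Grid of candidate (C, gamma) values, scored by mean absolute error with 5-fold CV.
param_grid = {"C": [0.1, 1.0, 10.0, 100.0],
              "gamma": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(SVR(kernel="rbf", epsilon=0.1), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated MAE:", -search.best_score_)
```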
Other Machine Learning tools
• Shogun toolbox (C++)
– http://www.shogun-toolbox.org/
• Shark Machine Learning library (C++)
– http://shark-project.sourceforge.net/
• Machine Learning in Python (Python)
– http://pyml.sourceforge.net/
• Machine Learning in OpenCV 2
– http://opencv.willowgarage.com/wiki/
• LibSVM, LibLinear, etc.