DDA3020 Machine Learning
Lecture 07 Support Vector Machine II
Jicong Fan
School of Data Science, CUHK-SZ
October 17/22, 2022
Outline

1. Lagrange duality and KKT conditions
2. Optimizing SVM by Lagrange duality
3. SVM with slack variables
4. SVM with kernels
5. Others
Lagrange duality
Given a general minimization problem

$$\min_{w\in\mathbb{R}^n} \; f(w) \quad \text{subject to} \quad h_i(w) \le 0,\; i=1,\dots,m, \qquad \ell_j(w) = 0,\; j=1,\dots,r$$

Note that here $w$ denotes the argument we aim to optimize, rather than a data point.

The Lagrangian function:

$$L(w,u,v) = f(w) + \sum_{i=1}^{m} u_i h_i(w) + \sum_{j=1}^{r} v_j \ell_j(w)$$

The Lagrange dual function:

$$g(u,v) = \min_{w\in\mathbb{R}^n} L(w,u,v)$$

The dual problem:

$$\max_{u\in\mathbb{R}^m,\, v\in\mathbb{R}^r} \; g(u,v) \quad \text{subject to} \quad u \ge 0$$
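As a quick illustration (not from the slides), consider the one-dimensional problem $\min_w w^2$ subject to $1 - w \le 0$:

$$L(w,u) = w^2 + u(1-w), \qquad g(u) = \min_w L(w,u) = u - \frac{u^2}{4} \quad (\text{attained at } w = u/2),$$
$$\max_{u \ge 0}\; u - \frac{u^2}{4} \;\Rightarrow\; u^\star = 2, \qquad g(u^\star) = 1 = f(w^\star) \text{ with } w^\star = 1,$$

so the dual optimum matches the primal optimum here (strong duality).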
KKT conditions
Given the general problem

$$\min_{w\in\mathbb{R}^n} \; f(w) \quad \text{subject to} \quad h_i(w) \le 0,\; i=1,\dots,m, \qquad \ell_j(w) = 0,\; j=1,\dots,r$$

the Karush-Kuhn-Tucker conditions, or KKT conditions, are:

$$0 \in \partial f(w) + \sum_{i=1}^{m} u_i\, \partial h_i(w) + \sum_{j=1}^{r} v_j\, \partial \ell_j(w) \qquad \text{(stationarity)}$$
$$u_i \cdot h_i(w) = 0 \;\text{ for all } i \qquad \text{(complementary slackness)}$$
$$h_i(w) \le 0,\;\; \ell_j(w) = 0 \;\text{ for all } i, j \qquad \text{(primal feasibility)}$$
$$u_i \ge 0 \;\text{ for all } i \qquad \text{(dual feasibility)}$$
Reference: S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press, Chapter 5.
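For the toy problem above (an illustration, not from the slides), the pair $w^\star = 1$, $u^\star = 2$ satisfies all four conditions:

$$2w^\star - u^\star = 0 \;\;(\text{stationarity}), \quad u^\star(1 - w^\star) = 0 \;\;(\text{compl. slackness}), \quad 1 - w^\star \le 0 \;\;(\text{primal feas.}), \quad u^\star \ge 0 \;\;(\text{dual feas.})$$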
Optimization of SVM
The objective function of the support vector machine is

$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\;\; y_i(w^\top x_i + b) \ge 1, \;\forall i \tag{1}$$

It can be transformed to

$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\;\; 1 - y_i(w^\top x_i + b) \le 0, \;\forall i \tag{2}$$

Its Lagrange function is

$$L(w,b,\alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i(w^\top x_i + b)\right)$$
Optimization of SVM
Lagrange function:

$$L(w,b,\alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i(w^\top x_i + b)\right)$$

The primal and dual optimal solutions should satisfy the KKT conditions:

Stationarity:

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{3}$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{4}$$

Feasibility:

$$\alpha_i \ge 0, \quad 1 - y_i(w^\top x_i + b) \le 0, \;\forall i \tag{5}$$

Complementary slackness:

$$\alpha_i \left(1 - y_i(w^\top x_i + b)\right) = 0, \;\forall i \tag{6}$$
Optimization of SVM
Lagrange function:

$$L(w,b,\alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i(w^\top x_i + b)\right)$$

Substituting the stationarity conditions into the Lagrange function, we have

$$
\begin{aligned}
L(w,b,\alpha) &= \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i - \sum_{i=1}^{m} \alpha_i y_i \sum_{j=1}^{m} \alpha_j y_j x_j^\top x_i - \sum_{i=1}^{m} \alpha_i y_i\, b \\
&= \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j - b \sum_{i=1}^{m} \alpha_i y_i \\
&= \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\|w\|^2 \\
&= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j
\end{aligned}
$$
Optimization of SVM
Then, we obtain the following dual problem:

$$\max_{\alpha}\;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \tag{12}$$
$$\text{s.t.}\;\; \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \;\forall i \tag{13}$$

It can be solved by any off-the-shelf optimization solver (see the sketch below).

Then, substituting the solved $\alpha$ back into the stationarity condition, we obtain the primal solution $w$:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{14}$$
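A minimal sketch of this off-the-shelf route, assuming the cvxopt QP solver is available; all names here are illustrative, not from the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_hard_margin_dual(X, y):
    """Solve the dual QP (12)-(13); X is (m, n), y has entries in {-1, +1}.

    Returns the dual variables alpha and the primal weights w from eq. (14).
    """
    m = X.shape[0]
    y = y.astype(float)
    # cvxopt minimizes (1/2) a^T P a + q^T a, so we negate the maximization objective.
    P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(m))                  # corresponds to -sum_i alpha_i
    G = matrix(-np.eye(m))                   # -alpha_i <= 0, i.e., alpha_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1))             # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)       # off-the-shelf QP solver
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ X                      # eq. (14): w = sum_i alpha_i y_i x_i
    return alpha, w
```

In practice a tiny ridge (e.g., adding 1e-8 to the diagonal of P) can help if the Gram matrix is singular.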
Optimization of SVM
Solution interpretation:

The primal solution $w$ and the dual solution $\alpha$ must also satisfy the other KKT conditions:

Feasibility: $\alpha_i \ge 0$, $1 - y_i(w^\top x_i + b) \le 0$, $\forall i$
Complementary slackness: $\alpha_i \left(1 - y_i(w^\top x_i + b)\right) = 0$, $\forall i$

Combining the above conditions, we have that for each $x_i$:

If $1 - y_i(w^\top x_i + b) < 0$, then $\alpha_i = 0$;
If $1 - y_i(w^\top x_i + b) = 0$, then $\alpha_i \ge 0$.

If $\alpha_i = 0$, then $x_i$ does not contribute to $w$, i.e., to the SVM classifier.

The data points with $\alpha_i > 0$ construct the classifier and are called support vectors; they lie on the hyperplanes $y_i(w^\top x_i + b) = 1$. We define the support set as $S = \{i \mid \alpha_i > 0\}$. This is why the method is called a support vector machine.
Optimization of SVM
The remaining issue is how to determine the bias parameter $b$.

For any support vector $x_j$, $j \in S$, we have

$$y_j(w^\top x_j + b) = 1, \;\forall j \in S \tag{15}$$
$$\Rightarrow\; y_j\Big(\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b\Big) = 1, \;\forall j \in S \tag{16}$$

Multiplying both sides of the above equation by $y_j$ and using $y_j \cdot y_j = 1$, we have

$$\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b = y_j, \;\forall j \in S \tag{17}$$
$$\Rightarrow\; b = \frac{1}{|S|} \sum_{j \in S} \Big( y_j - \sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j \Big) \tag{18}$$
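Continuing the illustrative sketch above, the support set and the bias from (18) can be computed as:

```python
def compute_bias(X, y, alpha, tol=1e-8):
    """Average the bias over all support vectors, as in eq. (18)."""
    S = np.where(alpha > tol)[0]   # support set S = {i | alpha_i > 0}
    b = np.mean([y[j] - np.sum(alpha * y * (X @ X[j])) for j in S])
    return b, S
```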
Prediction using SVM
Prediction:

Given the optimized parameters $\{\alpha, w, b\}$ and a new data point $x$, its prediction is based on

$$w^\top x + b = \sum_{i=1}^{m} \alpha_i y_i x_i^\top x + \frac{1}{|S|} \sum_{j \in S} \Big( y_j - \sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j \Big) \tag{19}$$

If $w^\top x + b > 0$, then the predicted class of $x$ is $+1$; otherwise it is $-1$.

The prediction is correct if and only if $y(w^\top x + b) > 0$.

Note that the prediction for new data depends only on inner products with the training data, which will be important for deriving the kernel SVM later.
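A minimal prediction function matching (19), continuing the same illustrative sketch:

```python
def predict(X_train, y_train, alpha, b, x_new):
    """Sign of w^T x + b, with w expressed through the dual variables (eq. (19))."""
    score = np.sum(alpha * y_train * (X_train @ x_new)) + b
    return 1 if score > 0 else -1
```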
SVM with slack variables
In the above derivation, we assume that all primal constraints $y_i(w^\top x_i + b) \ge 1$, $\forall i$, can be satisfied, implying that the training data are separable.

However, sometimes samples of different classes overlap (i.e., the data are non-separable), as shown below.

Consequently, some constraints will be violated, and we cannot obtain a feasible solution.
SVM with slack variables
To handle such data, we introduce slack variables $\xi_i \ge 0$.

We allow some errors on the training data, i.e., $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\forall i$, rather than $y_i(w^\top x_i + b) \ge 1$, $\forall i$.

But we hope that such errors $\xi_i$, $\forall i$, are small.

Please plot the corresponding hinge loss with slack variables (a sketch is given below).
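A small sketch of that plot, assuming numpy and matplotlib are available: at the optimum, $\xi_i = \max(0,\, 1 - y_i(w^\top x_i + b))$, i.e., the hinge loss of the margin value.

```python
import numpy as np
import matplotlib.pyplot as plt

margin = np.linspace(-2, 3, 200)            # margin value z = y (w^T x + b)
hinge = np.maximum(0.0, 1.0 - margin)       # slack at the optimum: xi = max(0, 1 - z)

plt.plot(margin, hinge, label='hinge loss max(0, 1 - z)')
plt.axvline(1.0, linestyle='--', color='gray')  # points with z >= 1 incur no slack
plt.xlabel('z = y (w^T x + b)')
plt.ylabel('loss / slack')
plt.legend()
plt.show()
```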
SVM with slack variables
In this case, the SVM is formulated as follows:

$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.}\;\; 1 - \xi_i - y_i(w^\top x_i + b) \le 0, \;\; -\xi_i \le 0, \;\forall i \tag{20}$$

Its Lagrange function is

$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} \alpha_i \left(1 - \xi_i - y_i(w^\top x_i + b)\right) + \sum_{i=1}^{m} \mu_i (-\xi_i),$$

where $\alpha_i, \mu_i \ge 0$, $\forall i$.
SVM with slack variables
Lagrange function:

$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} \alpha_i \left(1 - \xi_i - y_i(w^\top x_i + b)\right) + \sum_{i=1}^{m} \mu_i (-\xi_i)$$

The primal and dual optimal solutions should satisfy the KKT conditions:

Stationarity:

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{21}$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{22}$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \mu_i, \;\forall i \tag{23}$$

Feasibility:

$$\alpha_i \ge 0, \quad 1 - \xi_i - y_i(w^\top x_i + b) \le 0, \quad \xi_i \ge 0, \quad \mu_i \ge 0, \;\forall i \tag{24}$$

Complementary slackness:

$$\alpha_i \left(1 - \xi_i - y_i(w^\top x_i + b)\right) = 0, \quad \mu_i \xi_i = 0, \;\forall i \tag{25}$$
SVM with slack variables
Lagrange function:

$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} \alpha_i \left(1 - \xi_i - y_i(w^\top x_i + b)\right) + \sum_{i=1}^{m} \mu_i (-\xi_i).$$

Substituting all stationarity conditions into the Lagrange function to eliminate the primal variables, we have

$$
\begin{aligned}
L(\alpha,\mu) &= \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i(w^\top x_i + b)\right) + \sum_{i=1}^{m} (C - \alpha_i - \mu_i)\, \xi_i \\
&= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j. \qquad (26)
\end{aligned}
$$
SVM with slack variables
Then, we obtain the following dual problem:

$$\max_{\alpha,\mu}\;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \tag{27}$$
$$\text{s.t.}\;\; \sum_{i=1}^{m} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad \mu_i \ge 0, \quad \alpha_i = C - \mu_i, \;\forall i \tag{28}$$

Utilizing $\alpha_i = C - \mu_i$, we obtain a simpler dual problem:

$$\max_{\alpha}\;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \tag{29}$$
$$\text{s.t.}\;\; \sum_{i=1}^{m} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \;\forall i \tag{30}$$

Note that the only change in the dual problem is the constraint $0 \le \alpha_i \le C$, which was $\alpha_i \ge 0$ in the dual problem of the standard SVM.

Reference: https://nianlonggu.com/2019/06/07/tutorial-on-SVM/
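Continuing the earlier QP sketch (illustrative), only the inequality constraints change: stack $-\alpha_i \le 0$ and $\alpha_i \le C$.

```python
def soft_margin_constraints(m, C):
    """Inequality constraints G a <= h encoding 0 <= alpha_i <= C (eq. (30))."""
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))   # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    return G, h
```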
SVM with slack variables
Solution interpretation

The solution $\alpha_i$ has three cases: $\alpha_i = 0$, $0 < \alpha_i < C$, and $\alpha_i = C$.

$\alpha_i = 0$: the corresponding data point is correctly classified and does not contribute to the classifier; it lies outside the margin.

$0 < \alpha_i < C$: in this case $\mu_i > 0$ due to $\alpha_i = C - \mu_i$; since $\mu_i \xi_i = 0$, we have $\xi_i = 0$. The corresponding data point is correctly classified and contributes to the classifier; it lies on the margin.

$\alpha_i = C$: in this case $\mu_i = 0$, so $\xi_i$ may be positive. The corresponding data point contributes to the classifier and lies on or inside the margin.
  If $\xi_i \le 1$, the data point is still correctly classified, not crossing the decision boundary.
  If $\xi_i > 1$, the data point is incorrectly classified, crossing the decision boundary.
SVM with slack variables
How to determine the bias parameter $b$?

We define $M = \{i \mid 0 < \alpha_i < C\}$.

Since $0 < \alpha_i < C$, we have $\xi_i = 0$.

Then, for any support vector $x_j$, $j \in M$, we have

$$y_j(w^\top x_j + b) = 1, \;\forall j \in M \tag{31}$$
$$\Rightarrow\; y_j\Big(\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b\Big) = 1, \;\forall j \in M \tag{32}$$

Utilizing $y_j \cdot y_j = 1$, we have

$$\sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j + b = y_j, \;\forall j \in M \tag{33}$$
$$\Rightarrow\; b = \frac{1}{|M|} \sum_{j \in M} \Big( y_j - \sum_{i=1}^{m} \alpha_i y_i x_i^\top x_j \Big) \tag{34}$$

Note that using the average over all such support vectors, rather than a single support vector, makes the solution for $b$ more numerically stable.
Optimization of SVM
Why do we prefer to optimize the dual problem, rather than directly optimizing the primal problem? There are two main advantages:

By examining the dual form of the optimization problem, we gain significant insight into the structure of the problem.

The entire algorithm can be written in terms of only inner products between input feature vectors. In the following, we will exploit this property to apply kernels to the classification problem. The resulting algorithm, the kernel support vector machine, will be able to learn efficiently in very high-dimensional feature spaces.

Reference:
https://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-wh
Kernels
In the above derivation, the SVM can only handle linearly separable data (it produces a linear decision boundary).

For non-linearly separable data (e.g., XOR data, and the data shown below), how can we use SVM?

Recall that polynomial regression can handle non-linearly separable data.
SVM with polynomial hypothesis function
Predict $y = 1$ if

$$w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2 + \cdots \ge 0$$

As introduced before, one can choose a high-order polynomial hypothesis function to handle non-linearly separable data:

$$f_{w,b}(x) = w^\top [1;\, x_1;\, x_2;\, x_1 x_2;\, x_1^2;\, x_2^2;\, \cdots] \tag{35}$$

However, in many real problems, such as image classification, the dimensionality of the original features $|x|$ is already very high. Consequently, the dimensionality of the high-order polynomial features becomes far too high, causing high computational cost or overfitting.

To tackle this difficulty, we will introduce kernels.
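A quick illustration of that blow-up (not from the slides): the number of monomials of degree at most $d$ in $n$ variables is $\binom{n+d}{d}$.

```python
from math import comb

def num_poly_features(n, d):
    """Number of monomials of degree <= d in n variables, including the constant term."""
    return comb(n + d, d)

# e.g., a 28x28 grayscale image has n = 784 raw features
print(num_poly_features(784, 2))   # degree-2 expansion: 308,505 features
print(num_poly_features(784, 3))   # degree-3 expansion: about 8.1e7 features
```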
Kernels
Given a new data point $x$, compute its new features based on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ (plotted above), here using the Gaussian kernel:

$$f_1 = \text{similarity}(x, l^{(1)}) = \exp\Big(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\Big) \tag{36}$$
$$f_2 = \text{similarity}(x, l^{(2)}) = \exp\Big(-\frac{\|x - l^{(2)}\|^2}{2\sigma^2}\Big) \tag{37}$$
$$f_3 = \text{similarity}(x, l^{(3)}) = \exp\Big(-\frac{\|x - l^{(3)}\|^2}{2\sigma^2}\Big) \tag{38}$$

Then, we have a new representation $[f_1; f_2; f_3]$ for the data point $x$.
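A small sketch computing these features (the landmark values below are made up for illustration):

```python
import numpy as np

def gaussian_feature(x, landmark, sigma):
    """f = exp(-||x - l||^2 / (2 sigma^2)), as in (36)-(38)."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
landmarks = [np.array([1.0, 2.0]), np.array([0.0, 0.0]), np.array([3.0, -1.0])]
f = np.array([gaussian_feature(x, l, sigma=1.0) for l in landmarks])
print(f)   # f[0] is 1.0 because x coincides with the first landmark
```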
Kernels
Kernel and similarity:

$$f_1 = \text{similarity}(x, l^{(1)}) = \exp\Big(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\Big) \tag{39}$$

If $x \approx l^{(1)}$, then $f_1 \approx 1$.
If $x$ is far from $l^{(1)}$, then $f_1 \approx 0$.
Kernels
Prediction with kernels

Predict "1" if

$$w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 \ge 0$$

Suppose we set $w_0 = -0.5$, $w_1 = 1$, $w_2 = 1$, $w_3 = 0$.

Then we predict "1" whenever $f_1 + f_2 - 0.5 \ge 0$.

Consequently, we obtain a non-linear decision boundary (plotted above).
Kernels
Given $x$:

$$f_i = \text{similarity}(x, l^{(i)}) = \exp\Big(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\Big)$$

How do we obtain the landmark points?

We can set all training data points as landmarks.
SVM with Kernels
We first define the following kernel:

$$k(x_i, x_j) = \phi(x_i)^\top \phi(x_j) \tag{40}$$

Utilizing this kernel to replace $x_i^\top x_j$, we have the following dual problem:

$$\max_{\alpha}\;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{41}$$
$$\text{s.t.}\;\; \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \;\forall i \tag{42}$$

The solution for $b$ becomes

$$b = \frac{1}{|S|} \sum_{j \in S} \Big( y_j - \sum_{i=1}^{m} \alpha_i y_i k(x_i, x_j) \Big) \tag{43}$$

The prediction for a new data point $x$ becomes

$$w^\top \phi(x) + b = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b \tag{44}$$

Since $\alpha$ is sparse, the above classifier is also called a sparse kernel classifier.
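Continuing the earlier QP sketch (illustrative, using the RBF kernel as an example), the only change is replacing the Gram matrix $XX^\top$ by the kernel matrix $K$; prediction then follows (44):

```python
def rbf_kernel_matrix(X1, X2, sigma):
    """K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_decision_function(X_train, y_train, alpha, b, x_new, sigma):
    """Eq. (44): sum_i alpha_i y_i k(x_i, x) + b."""
    k = rbf_kernel_matrix(X_train, x_new[None, :], sigma).ravel()
    return np.sum(alpha * y_train * k) + b

# In the dual QP, use P = matrix(np.outer(y, y) * rbf_kernel_matrix(X, X, sigma))
# instead of P = matrix(np.outer(y, y) * (X @ X.T)).
```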
SVM with Kernels
Widely used kernels:

Polynomial kernel:
$$k(x, x_i) = \Big(1 + \frac{x^\top x_i}{\sigma^2}\Big)^{p}, \quad p > 0 \tag{45}$$

Radial Basis Function (RBF) kernel:
$$k(x, x_i) = \exp\Big(-\frac{\|x - x_i\|^2}{2\sigma^2}\Big) \tag{46}$$

Sigmoidal kernel:
$$k(x, x_i) = \frac{1}{1 + \exp\big(-\frac{x^\top x_i + b}{\sigma^2}\big)} \tag{47}$$
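Small numpy implementations of these three kernels, following the slides' definitions (parameter names and default values are illustrative):

```python
import numpy as np

def polynomial_kernel(x, xi, sigma=1.0, p=2):
    return (1.0 + x @ xi / sigma**2) ** p                        # eq. (45)

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma**2))     # eq. (46)

def sigmoidal_kernel(x, xi, sigma=1.0, b=0.0):
    return 1.0 / (1.0 + np.exp(-(x @ xi + b) / sigma**2))        # eq. (47), as defined on the slide
```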
SVM with Kernels
Figure: Decision boundaries produced by SVM with a 2nd-order polynomial kernel (top-left), a 3rd-order polynomial kernel (top-right), and an RBF kernel (bottom).
Multi-class SVM
SVM is good for binary classification:

$$f(x) > 0 \;\Rightarrow\; x \in \text{Class } 1; \qquad f(x) \le 0 \;\Rightarrow\; x \in \text{Class } 2$$

To classify multiple classes, we can use the one-vs-rest approach, which reduces a $K$-class classification problem to $K$ binary classification problems:
Multi-class SVM
$y \in \{1, 2, 3, \dots, K\}$

Many SVM packages already have built-in multi-class classification functionality.

Otherwise, use the one-vs-rest (one-vs-all) method: train $K$ SVMs, one to distinguish $y = k$ from the rest, for $k = 1, 2, \dots, K$, obtaining $(w^{(1)}, b^{(1)}), \dots, (w^{(K)}, b^{(K)})$.

Predict the label of $x$ as

$$\arg\max_{k \in \{1, 2, \dots, K\}} \; \big(w^{(k)}\big)^\top x + b^{(k)}$$
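A minimal one-vs-rest sketch using scikit-learn's linear SVM (an assumption; the slides do not prescribe a particular library):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, K, C=1.0):
    """Train K binary SVMs, the k-th distinguishing class k from the rest."""
    return [LinearSVC(C=C).fit(X, (y == k).astype(int)) for k in range(K)]

def predict_one_vs_rest(models, X):
    """Pick the class whose classifier gives the largest decision value w_k^T x + b_k."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```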
SVM vs. Logistic regression
SVM: hinge loss

$$\text{loss}(f(x_i), y_i) = \big[1 - (w^\top x_i + b)\, y_i\big]_{+}$$

Logistic regression: log loss (negative log conditional likelihood)

$$\text{loss}(f(x_i), y_i) = -\log P(y_i \mid x_i, w, b) = \log\big(1 + e^{-(w^\top x_i + b) y_i}\big)$$
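A quick sketch comparing the two losses as functions of the margin $z = y\,(w^\top x + b)$, assuming numpy and matplotlib (illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 300)                 # margin value y (w^T x + b)
hinge = np.maximum(0.0, 1.0 - z)            # SVM hinge loss
log_loss = np.log(1.0 + np.exp(-z))         # logistic regression log loss

plt.plot(z, hinge, label='hinge loss')
plt.plot(z, log_loss, label='log loss')
plt.xlabel('y (w^T x + b)')
plt.ylabel('loss')
plt.legend()
plt.show()
```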
SVM vs. Logistic regression
Logistic regression (LR) vs. SVM

$n = |x|$ denotes the number of features, and $m = |D_{\text{train}}|$ is the number of training data points.

If $n$ is large (relative to $m$), the data is likely to be linearly separable; one can use LR or SVM without a kernel.

If $n$ is small and $m$ is intermediate, the data may be non-linearly separable; one can use SVM with a Gaussian kernel.

If $n$ is small and $m$ is large, create/add more features to make the data more separable, and then use LR or SVM without a kernel. Why not choose SVM with a kernel in this case?
Software of SVM
Using SVM in practice
You are not required to implement SVM by yourself; there are many well-implemented software packages, such as libsvm.

When you use such software to learn an SVM model, you need to specify:

Choice of the parameter $C$ (i.e., the trade-off hyper-parameter on the slack variables)

Choice of kernel:
  Linear kernel
  Gaussian kernel, for which you must also set the kernel width (i.e., the variance of the Gaussian). Note: do perform feature scaling before using the Gaussian kernel.
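A minimal scikit-learn sketch of these choices (an assumption; libsvm underlies sklearn's SVC, but the slides do not prescribe a library). Hyper-parameter values are illustrative.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Feature scaling before the Gaussian (RBF) kernel, then an SVM with chosen C and kernel width.
# In the slides' notation, gamma = 1 / (2 * sigma^2).
model = make_pipeline(
    StandardScaler(),
    SVC(C=1.0, kernel='rbf', gamma=0.5),
)
# model.fit(X_train, y_train); model.predict(X_test)
```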
Reading material
References:
Andrew Ng’s note on SVM:
https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf
Chapter 7.1 of Bishop’s book
KKT conditions:
https://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/12-kktpdf
More variants of SVM:
Semi-supervised SVM
Structured SVM
SVM with latent variables
SVM for regression
Summary of SVM
What you need to know:
Lagrange duality and KKT conditions
Support vector machine:
Derivation of large margin
Derivation of hinge loss
Optimization using dual problem and KKT conditions
SVM with slack variables
SVM with kernels
Relationship between SVM and logistic regression