Learning a Linear Classifier
Quiz 1
Jan 24 (Wed), 6:30PM, L18, L19, L20
Only for registered students (regular + audit)
Assigned seating – will be announced soon
Open notes (handwritten only)
No mobile phones, tablets, etc.
Bring your institute ID card
If you don’t bring it, you will have to spend precious time waiting to get verified
Syllabus:
All videos, slides, and code linked on the course discussion page (link below) till 22 Jan, 2024
https://www.cse.iitk.ac.in/users/purushot/courses/ml/2023-24-w/discussion.html
See GitHub for practice questions
Authentication by Secret Questions
Server: “Give me your device ID and answer the following questions: 1. 10111100, 2. 00110010, 3. 10001110, 4. 00010100, …”
Device (ID TS271828182845) replies with one bit per question: 1. 1, 2. 0, 3. 1, 4. 0, …
[Figure: a SERVER exchanging these challenge-response pairs with a DEVICE]
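To make the protocol concrete, here is a minimal sketch (hypothetical code of my own, reusing the challenge-answer pairs shown above) of authentication by secret questions: the server keeps a table of expected one-bit answers per device ID and accepts a device only if it answers every challenge correctly.

```python
# Hypothetical sketch of the slide's challenge-response authentication.
# The server stores the expected one-bit answer for each challenge string.
EXPECTED = {
    "TS271828182845": {"10111100": 1, "00110010": 0,
                       "10001110": 1, "00010100": 0},
}

def authenticate(device_id, answers):
    """Accept the device only if it answers every stored challenge correctly."""
    expected = EXPECTED.get(device_id)
    if expected is None:
        return False
    return all(answers.get(challenge) == bit for challenge, bit in expected.items())

# The device from the slide, answering 1, 0, 1, 0, is accepted
print(authenticate("TS271828182845",
                   {"10111100": 1, "00110010": 0, "10001110": 1, "00010100": 0}))
```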
Physically Unclonable Functions
[Figure: two nominally identical circuits produce slightly different signal delays, e.g. 0.50 ms vs 0.55 ms]
These tiny differences are difficult to predict or clone
Then these could act as the fingerprints for the devices!
Arbiter PUFs
If the top signal reaches the finish line first,
the “answer” to this question is 0, else if the
bottom signal reaches first, the “answer” is 1
Question: 1011 → answer: 1?
[Figure: the challenge bits 1, 0, 1, 1 configure the stages of the delay race between the two signals]
Arbiter PUFs
(Same setup, different question) Question: 0110 → answer: 0?
[Figure: the race repeated with challenge bits 0, 1, 1, 0]
Linear Models
We have
$$\Delta_{63} = w_0 \cdot x_0 + w_1 \cdot x_1 + \dots + w_{63} \cdot x_{63} + \beta_{63} = \mathbf{w}^\top \mathbf{x} + b$$
where $x_i = d_i \cdot d_{i+1} \cdots d_{63}$, $w_0 = \alpha_0$, $w_i = \alpha_i + \beta_{i-1}$ for $i > 0$, and $b = \beta_{63}$
If $\Delta_{63} < 0$, the upper signal wins and the answer is 0
If $\Delta_{63} > 0$, the lower signal wins and the answer is 1
Thus, the answer is simply $\frac{\operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b) + 1}{2}$
“This is nothing but a linear classifier!”
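As a sanity check, here is a minimal sketch (a toy simulation of my own, assuming the usual arbiter-PUF convention $d_i = 1 - 2c_i$ for challenge bits $c_i$, which the slide leaves implicit) of the response computed exactly as this linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.normal(size=64)  # per-stage delay parameters (toy values)
beta = rng.normal(size=64)

# w_0 = alpha_0, w_i = alpha_i + beta_{i-1} for i > 0, b = beta_63 (from the slide)
w = np.concatenate(([alpha[0]], alpha[1:] + beta[:-1]))
b = beta[63]

def respond(c):
    """Answer bit for a 64-bit challenge c in {0,1}^64."""
    d = 1 - 2 * c                      # assumed convention: d_i = 1 - 2 c_i
    x = np.cumprod(d[::-1])[::-1]      # x_i = d_i * d_{i+1} * ... * d_63
    delta = w @ x + b                  # Delta_63 = w^T x + b
    return int(delta > 0)              # 0 if the top signal wins (Delta < 0), else 1

challenge = rng.integers(0, 2, size=64)
print(respond(challenge))
```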
The “best” Linear Classifier
“It seems infinitely many classifiers perfectly classify the data. Which one should I choose?”
“Indeed! Such models would be very brittle and might misclassify test data (i.e. predict the wrong class), even test data that look very similar to the train data. It is better to not select a model whose decision boundary passes very close to a training data point.”
Large Margin Classifiers
Fact: the distance of the origin from the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ is $\frac{|b|}{\|\mathbf{w}\|_2}$
Fact: the distance of a point $\mathbf{p}$ from this hyperplane is $\frac{|\mathbf{w}^\top \mathbf{p} + b|}{\|\mathbf{w}\|_2}$
Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ for a binary classification problem, where $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \{-1, 1\}$, we want two things from a classifier
Demand 1: classify every point correctly – how to ask this politely?
One way: demand that for all $i = 1 \dots n$, $\operatorname{sign}(\mathbf{w}^\top \mathbf{x}^i + b) = y^i$
Easier way: demand that for all $i = 1 \dots n$, $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 0$
Demand 2: not let any data point come close to the boundary
Demand that $\min_{i=1\dots n} \frac{|\mathbf{w}^\top \mathbf{x}^i + b|}{\|\mathbf{w}\|_2}$ be as large as possible
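A small numeric check of the two distance facts and of the quantity Demand 2 asks us to maximize (the hyperplane and points below are toy values of my own):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 2.0              # hypothetical hyperplane w.x + b = 0
print(abs(b) / np.linalg.norm(w))             # distance of origin: 2/5 = 0.4

X = np.array([[1.0, 1.0], [0.0, -1.0], [-2.0, 1.0]])   # toy points
dists = np.abs(X @ w + b) / np.linalg.norm(w)           # |w.p + b| / ||w||_2
print(dists)                                  # [1.8, 0.4, 0.0]
print(dists.min())                            # the margin Demand 2 wants large
```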
Support Vector Machines
Just a fancy way of saying: “Please find me a linear classifier that perfectly classifies the train data while keeping data points as far away from the hyperplane as possible”
The mathematical way of writing this request is the following
Objective: $\max_{\mathbf{w},b} \min_{i=1\dots n} \frac{|\mathbf{w}^\top \mathbf{x}^i + b|}{\|\mathbf{w}\|_2}$
Constraints: such that $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 0$ for all $i = 1 \dots n$
This is known as an optimization problem, with an objective and lots of constraints
“This looks so complicated, how will I ever find a solution to this optimization problem?”
“Let us simplify this optimization problem”
Constrained Optimization 101
HOW WE SPEAK TO A HUMAN:
“I want to find an unknown $x$ that gives me the best value according to this function $f$. Btw, not any $x$ would do! $x$ must satisfy these conditions. All I am saying is, of the values of $x$ that satisfy my conditions, find me the one that gives the best value according to $f$.”
HOW WE MUST SPEAK TO MS:
Objective: $\min_x f(x)$
Constraints: such that $p(x) < 0$ and $q(x) > 0$ etc. etc.
Constraints are usually specified using math equations. The set of points that satisfy all the constraints is called the feasible set of the optimization problem.
Example: $\min_x x^2$ s.t. $x \leq 6$ and $x \geq 3$. The feasible set is the interval $[3, 6]$. “For your specified optimization problem, the optimal (least) value of $f$ is 9, and it is achieved at $x = 3$.”
Example: $\min_x x^2$ s.t. $x \geq 6$ and $x \leq 3$. The feasible set is empty! “Oh! Your optimization problem has no solution since no point satisfies all your constraints.”
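A minimal sketch of solving the slide's first toy problem numerically, assuming SciPy is available (the bounds encode the feasible set $[3, 6]$):

```python
from scipy.optimize import minimize_scalar

# min x^2 subject to 3 <= x <= 6; the feasible set is the interval [3, 6]
res = minimize_scalar(lambda x: x**2, bounds=(3, 6), method="bounded")
print(res.x, res.fun)  # ~3.0 and ~9.0, matching the slide's answer
```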
Back to SVMs
Assume that there exist params that perfectly classify all the train data
Consider one such set of params $(\mathbf{w}, b)$ which classifies the train data perfectly
Now, $\frac{|\mathbf{w}^\top \mathbf{x}^i + b|}{\|\mathbf{w}\|_2} = \frac{y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b)}{\|\mathbf{w}\|_2}$ as $y^i \in \{-1, 1\}$
Thus, the geometric margin $\min_{i=1\dots n} \frac{|\mathbf{w}^\top \mathbf{x}^i + b|}{\|\mathbf{w}\|_2}$ is the same as $\min_{i=1\dots n} \frac{y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b)}{\|\mathbf{w}\|_2}$ since the model has perfect classification!
We will use this useful fact to greatly simplify the optimization problem
(We will remove this assumption later)
“What if the train data is non-linearly separable, i.e. no linear classifier can perfectly classify it? For example…”
Support Vector Machines
Let $i_0$ be the data point that comes closest to the hyperplane, i.e.
$$\min_{i=1\dots n} y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) = y^{i_0} \cdot (\mathbf{w}^\top \mathbf{x}^{i_0} + b)$$
Recall that all this discussion holds only for a perfect classifier $(\mathbf{w}, b)$
Let $\epsilon = y^{i_0} \cdot (\mathbf{w}^\top \mathbf{x}^{i_0} + b)$ and consider $\hat{\mathbf{w}} = \mathbf{w}/\epsilon$, $\hat{b} = b/\epsilon$
Note this gives us $y^i \cdot (\hat{\mathbf{w}}^\top \mathbf{x}^i + \hat{b}) \geq 1$ for all $i = 1 \dots n$, as well as $\min_{i=1\dots n} \frac{y^i \cdot (\hat{\mathbf{w}}^\top \mathbf{x}^i + \hat{b})}{\|\hat{\mathbf{w}}\|_2} = \frac{1}{\|\hat{\mathbf{w}}\|_2}$ (as $y^{i_0} \cdot (\hat{\mathbf{w}}^\top \mathbf{x}^{i_0} + \hat{b}) = 1$)
Thus, instead of searching for $(\mathbf{w}, b)$, it is easier to search for $(\hat{\mathbf{w}}, \hat{b})$:
$$\max_{\mathbf{w},b} \frac{1}{\|\mathbf{w}\|_2} \;\equiv\; \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad \text{such that } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 1 \text{ for all } i = 1 \dots n$$
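This is now a standard quadratic program. A hedged sketch of solving it on toy data of my own, assuming the cvxpy package (any convex solver would do):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))   # min (1/2) ||w||_2^2
constraints = [cp.multiply(y, X @ w + b) >= 1]     # y_i (w.x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```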
The C-SVM Technique
For linearly separable cases where we suspect a perfect classifier exists, we solve
$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad \text{s.t. } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 1 \text{ for all } i \in [n]$$
If a linear classifier cannot perfectly classify the data, then find a model using
$$\min_{\mathbf{w},b,\xi_i} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t. } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0 \text{ for all } i \in [n]$$
The $\xi_i$ terms are called slack variables (recall the English phrase “cut me some slack”). They allow some data points to come close to the hyperplane or be misclassified altogether. Having the constraint $\xi_i \geq 0$ prevents us from misusing slack to artificially inflate the margin.
“What prevents me from misusing the slack variables to learn a model that misclassifies every data point?”
“The $C$ term prevents you from doing so. If we set $C$ to be a large value (it is a hyper-parameter), then it will penalize solutions that misuse slack too much.”
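In practice one rarely codes this quadratic program by hand. A minimal sketch using scikit-learn's SVC, whose C argument plays exactly the role of the hyper-parameter above (the toy data is my own):

```python
from sklearn.svm import SVC
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Large C penalizes slack heavily (closer to hard-margin behaviour);
# small C tolerates margin violations in exchange for a larger margin.
clf = SVC(kernel="linear", C=10.0).fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.predict([[0.0, 1.0]]))
```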
From C-SVM to Loss Functions
We can further simplify the previous optimization problem
Note that $\xi_i$ basically allows us to have $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) < 1$ (even $< 0$)
Thus, the amount of slack we want is just $\xi_i = 1 - y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b)$
However, recall that we must also satisfy $\xi_i \geq 0$
This is another way of saying that if you already have $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 1$, then you don't need any slack, i.e. you should have $\xi_i = 0$ in this case
Thus, we need only set $\xi_i = [1 - y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b)]_+$, where $[x]_+ = \max(x, 0)$
The above is nothing but the popular hinge loss function!
Hinge Loss
Captures how well a classifier classified a data point
Suppose on a data point $(\mathbf{x}, y)$, $y \in \{-1, 1\}$, a model gives a prediction score of $s$ (for a linear model $(\mathbf{w}, b)$, we have $s = \mathbf{w}^\top \mathbf{x} + b$)
We obviously want $s \cdot y \geq 0$ for correct classification, but we also want $s \cdot y \gg 0$ for a large margin. The hinge loss function $\ell_{\text{hinge}}(s, y) = [1 - s \cdot y]_+$ captures both.
Note that hinge loss not only penalizes misclassification, but also penalizes correct classification if the data point gets too close to the hyperplane!
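A tiny NumPy sketch illustrating this behaviour (the scores are toy values of my own):

```python
import numpy as np

def hinge(s, y):
    """Hinge loss [1 - s*y]_+ for score s and label y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - s * y)

print(hinge(-0.5, +1))  # misclassified: loss 1.5
print(hinge(+0.3, +1))  # correct but close to the hyperplane: loss 0.7 > 0
print(hinge(+2.0, +1))  # correct with a comfortable margin: loss 0.0
```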
Final Form of C-SVM
Recall that the C-SVM optimization finds a model by solving
$$\min_{\mathbf{w},b,\xi_i} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t. } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0 \text{ for all } i \in [n]$$
Using the previous discussion, we can rewrite the above very simply as
$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \ell_{\text{hinge}}(\mathbf{w}^\top \mathbf{x}^i + b, y^i)$$
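Since this rewritten objective is unconstrained, it can be minimized directly. Below is a hedged sketch (my own illustration, not from the slides) of doing so by sub-gradient descent in NumPy; the toy data, step size, and step count are all assumptions.

```python
import numpy as np

def csvm_subgradient(X, y, C=1.0, steps=2000, eta=0.01):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_i [1 - y_i (w.x_i + b)]_+."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        active = y * (X @ w + b) < 1   # points with positive hinge loss
        # Sub-gradient of the hinge term is -y_i x_i on active points, 0 otherwise
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = csvm_subgradient(X, y, C=10.0)
print(np.sign(X @ w + b))  # should match y on this toy set
```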