Learning a Linear Classifier

Quiz 1: Jan 24 (Wed), 6:30 PM, in L18, L19, L20
Only for registered students (regular + audit)
Assigned seating – will be announced soon
Open notes (handwritten only); no mobile phones, tablets, etc.
Bring your institute ID card – if you don't bring it, you will have to spend precious time waiting to get verified
Syllabus: all videos, slides, and code linked on the course discussion page (link below) till 22 Jan, 2024
https://www.cse.iitk.ac.in/users/purushot/courses/ml/2023-24-w/discussion.html
See GitHub for practice questions

Authentication by Secret Questions
(Figure: a server tells a device with ID TS271828182845, "Give me your device ID and answer the following questions", sending challenges such as 10111100, 00110010, 10001110, 00010100, …; the device replies with one-bit answers such as 1, 0, 1, 0, …)

Physically Unclonable Functions
(Figure: two nominally identical circuits respond to the same challenge with slightly different delays, e.g. 0.50 ms vs 0.55 ms.)
These tiny differences are difficult to predict or clone.
Then these could act as the fingerprints for the devices!

Arbiter PUFs
If the top signal reaches the finish line first, the "answer" to this question is 0; if the bottom signal reaches first, the "answer" is 1.
(Figures: two example challenges, 1011 and 0110, fed bit by bit into the arbiter circuit.)

Linear Models
We have $\Delta_{63} = w_0 x_0 + w_1 x_1 + \dots + w_{63} x_{63} + \beta_{63} = \mathbf{w}^\top \mathbf{x} + b$, where $x_i = d_i \cdot d_{i+1} \cdots d_{63}$, $w_0 = \alpha_0$, and $w_i = \alpha_i + \beta_{i-1}$ for $i > 0$.
If $\Delta_{63} < 0$, the upper signal wins and the answer is 0; if $\Delta_{63} > 0$, the lower signal wins and the answer is 1.
Thus, the answer is simply $\frac{\mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b) + 1}{2}$.
This is nothing but a linear classifier!

The "best" Linear Classifier
It seems infinitely many classifiers perfectly classify the data. Which one should I choose?
It is better not to select a model whose decision boundary passes very close to a training data point.
Indeed! Such models would be very brittle and might misclassify test data (i.e. predict the wrong class), even test data that look very similar to the train data.

Large Margin Classifiers
Fact: the distance of the origin from the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ is $|b| / \|\mathbf{w}\|_2$.
Fact: the distance of a point $\mathbf{p}$ from this hyperplane is $|\mathbf{w}^\top \mathbf{p} + b| / \|\mathbf{w}\|_2$.
Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ for a binary classification problem, where $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \{-1, 1\}$, we want two things from a classifier.
Demand 1: classify every point correctly – how do we ask this politely?
One way: demand that for all $i = 1 \dots n$, $\mathrm{sign}(\mathbf{w}^\top \mathbf{x}^i + b) = y^i$.
Easier way: demand that for all $i = 1 \dots n$, $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 0$.
Demand 2: do not let any data point come close to the boundary.
Demand that $\min_{i=1 \dots n} |\mathbf{w}^\top \mathbf{x}^i + b| / \|\mathbf{w}\|_2$ be as large as possible.

Support Vector Machines
"Support Vector Machines" is just a fancy way of saying: please find me a linear classifier that perfectly classifies the train data while keeping data points as far away from the hyperplane as possible.
The mathematical way of writing this request is the following:
$$\max_{\mathbf{w}, b}\ \min_{i=1 \dots n} \frac{|\mathbf{w}^\top \mathbf{x}^i + b|}{\|\mathbf{w}\|_2} \quad \text{(objective)} \qquad \text{such that } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 0 \text{ for all } i = 1 \dots n \quad \text{(constraints)}$$
This is known as an optimization problem, with an objective and lots of constraints.
This looks so complicated – how will I ever find a solution to this optimization problem?
Let us simplify this optimization problem.

Constrained Optimization 101
How we must speak to machines:
$$\min_x f(x) \qquad \text{such that } g(x) < 0 \text{ and } h(x) > 0, \text{ etc.}$$
Example: $\min_x x^2$ s.t. $x \ge 3$ and $x \le 6$ – the feasible set is the interval $[3, 6]$.
Example: $\min_x x^2$ s.t. $x \ge 6$ and $x \le 3$ – the feasible set is empty!
Constraints are usually specified using math equations.
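To make the first toy problem concrete, here is a minimal sketch (not part of the original slides; the starting point and solver choice are arbitrary assumptions) that solves $\min_x x^2$ s.t. $3 \le x \le 6$ numerically with SciPy.

```python
# Minimal sketch (not from the slides): solve  min x^2  s.t.  x >= 3 and x <= 6
# with SciPy. We expect the optimum x = 3 with objective value 9.
import numpy as np
from scipy.optimize import minimize

objective = lambda x: x[0] ** 2                       # f(x) = x^2
constraints = [
    {"type": "ineq", "fun": lambda x: x[0] - 3.0},    # x >= 3  <=>  x - 3 >= 0
    {"type": "ineq", "fun": lambda x: 6.0 - x[0]},    # x <= 6  <=>  6 - x >= 0
]

result = minimize(objective, x0=np.array([5.0]), constraints=constraints)
print(result.x, result.fun)                           # roughly [3.], 9.0
```

With the reversed constraints ($x \ge 6$ and $x \le 3$), the feasible set is empty and no valid point exists, matching the second example.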
The set of points that satisfy all the constraints is called the feasible set of the optimization problem.

How we speak to a human:
"I want to find an unknown x that gives me the best value according to this function f."
"Oh! Not any x would do – x must satisfy these conditions."
"All I am saying is: of the values of x that satisfy my conditions, find me the one that gives the best value according to f."
"For your optimization problem, the optimal (least) value of f is 9 and it is achieved at x = 3."
"By the way, your other problem has no solution, since no point satisfies your constraints."

Back to SVMs
Assume there do exist parameters that perfectly classify all the train data (we will remove this assumption later).
Consider one such set of parameters $(\mathbf{w}, b)$ that classifies the train data perfectly.
Now, $|\mathbf{w}^\top \mathbf{x}^i + b| / \|\mathbf{w}\|_2 = y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) / \|\mathbf{w}\|_2$, since $y^i \in \{-1, 1\}$.
Thus, the geometric margin satisfies $\min_{i=1 \dots n} |\mathbf{w}^\top \mathbf{x}^i + b| / \|\mathbf{w}\|_2 = \min_{i=1 \dots n} y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) / \|\mathbf{w}\|_2$, since the model achieves perfect classification!
We will use this useful fact to greatly simplify the optimization problem.
But what if the train data is not linearly separable, i.e. no linear classifier can perfectly classify it? (The slide shows an example of such a data set.)

Support Vector Machines
Let $i_0$ be the data point that comes closest to the hyperplane, i.e. $\min_{i=1 \dots n} y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) = y^{i_0} \cdot (\mathbf{w}^\top \mathbf{x}^{i_0} + b)$.
Recall that all of this discussion holds only for a perfect classifier $(\mathbf{w}, b)$.
Let $m = y^{i_0} \cdot (\mathbf{w}^\top \mathbf{x}^{i_0} + b)$ and consider $\hat{\mathbf{w}} = \mathbf{w} / m$, $\hat{b} = b / m$.
Note this gives us $y^i \cdot (\hat{\mathbf{w}}^\top \mathbf{x}^i + \hat{b}) \ge 1$ for all $i = 1 \dots n$, as well as $\min_{i=1 \dots n} y^i \cdot (\hat{\mathbf{w}}^\top \mathbf{x}^i + \hat{b}) / \|\hat{\mathbf{w}}\|_2 = 1 / \|\hat{\mathbf{w}}\|_2$ (as $y^{i_0} \cdot (\hat{\mathbf{w}}^\top \mathbf{x}^{i_0} + \hat{b}) = 1$).
Thus, instead of searching for $(\mathbf{w}, b)$, it is easier to search for $(\hat{\mathbf{w}}, \hat{b})$:
$$\max_{\mathbf{w}, b} \frac{1}{\|\mathbf{w}\|_2} \;\equiv\; \min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|_2^2 \qquad \text{such that } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 1 \text{ for all } i = 1 \dots n$$

The C-SVM Technique
For linearly separable cases, where a perfect classifier exists:
$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|_2^2 \qquad \text{s.t. } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 1 \text{ for all } i = 1 \dots n$$
If a linear classifier cannot perfectly classify the data, then find a model using
$$\min_{\mathbf{w}, b, \{\xi_i\}} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \xi_i \qquad \text{s.t. } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i = 1 \dots n$$
The $\xi_i$ terms are called slack variables (recall the English phrase "cut me some slack"). They allow some data points to come close to the hyperplane or be misclassified altogether.
What prevents me from misusing the slack variables to learn a model that misclassifies every data point?
The $C$ term prevents you from doing so: if we set $C$ to a large value (it is a hyper-parameter), then it will penalize solutions that misuse slack too much. Having the constraint $\xi_i \ge 0$ prevents us from misusing slack to artificially inflate the margin.

From C-SVM to Loss Functions
We can further simplify the previous optimization problem.
Note that $\xi_i$ basically allows us to have $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) < 1$ (even $< 0$).
Thus, the amount of slack we want is just $\xi_i = 1 - y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b)$.
However, recall that we must also satisfy $\xi_i \ge 0$. In other words, if you already have $y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 1$, then you do not need any slack, i.e. you should have $\xi_i = 0$ in this case.
Writing $[x]_+ = \max(x, 0)$, we need only set $\xi_i = [1 - y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b)]_+$.
The above is nothing but the popular hinge loss function!
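As a quick sanity check of this equivalence, here is a small sketch (the model and data below are made up for illustration, not taken from the slides) that computes the optimal slack values and shows they are exactly the hinge losses of the prediction scores.

```python
# Sketch with made-up numbers: the optimal slack  xi_i = [1 - y^i (w^T x^i + b)]_+
# is exactly the hinge loss of the i-th training point.
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5                      # an illustrative linear model
X = np.array([[1.0, 0.0], [0.2, 0.3], [-1.0, 2.0]])    # three training points
y = np.array([+1, +1, -1])                             # labels in {-1, +1}

scores = X @ w + b                        # prediction scores  w^T x^i + b
margins = y * scores                      # y^i * (w^T x^i + b)
slack = np.maximum(0.0, 1.0 - margins)    # xi_i = [1 - y^i (w^T x^i + b)]_+

print(margins)    # points with margin >= 1 need no slack
print(slack)      # points with margin < 1 pay slack equal to their hinge loss
```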
Hinge Loss
Captures how well a classifier classified a data point.
Suppose on a data point $(\mathbf{x}, y)$, with $y \in \{-1, 1\}$, a model gives a prediction score of $s$ (for a linear model $(\mathbf{w}, b)$, we have $s = \mathbf{w}^\top \mathbf{x} + b$).
We obviously want $s \cdot y \ge 0$ for correct classification, but we also want $s \cdot y \gg 0$ for a large margin – the hinge loss function captures both.
Note that the hinge loss not only penalizes misclassification but also penalizes correct classification if the data point gets too close to the hyperplane!

Final Form of C-SVM
Recall that the C-SVM optimization finds a model by solving
$$\min_{\mathbf{w}, b, \{\xi_i\}} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \xi_i \qquad \text{s.t. } y^i \cdot (\mathbf{w}^\top \mathbf{x}^i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i = 1 \dots n$$
Using the previous discussion, we can rewrite the above very simply as
$$\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \ell_{\mathrm{hinge}}(\mathbf{w}^\top \mathbf{x}^i + b,\; y^i)$$
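To tie the final form together, here is a minimal training sketch (not the course's reference implementation; the hyper-parameter values, step size, and toy data are arbitrary assumptions) that minimizes the regularized hinge-loss objective by subgradient descent.

```python
# Minimal sketch (illustrative only): minimize
#     0.5 * ||w||^2 + C * sum_i [1 - y_i * (w^T x_i + b)]_+
# by subgradient descent on a tiny made-up data set.
import numpy as np

def hinge(score, label):
    """Hinge loss [1 - y * s]_+ for one prediction score."""
    return max(0.0, 1.0 - label * score)

def train_csvm(X, y, C=1.0, lr=0.01, epochs=200):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                 # points whose hinge loss is positive
        # Subgradient: w from the regularizer, -C * y_i * x_i from each active point
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy separable data (made up for illustration)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
C = 1.0
w, b = train_csvm(X, y, C=C)
objective = 0.5 * w @ w + C * sum(hinge(s, yi) for s, yi in zip(X @ w + b, y))
print(np.sign(X @ w + b), objective)   # predictions should match y on this toy data
```

A dedicated SVM solver (e.g. a quadratic-programming or coordinate-descent method) would be used in practice; the subgradient loop above is only meant to show that the final unconstrained form can be optimized directly.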