From Linear Classifiers to Neural Networks

Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear cost function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks

Linear Classifier: Recap & Notation
• We focus on two-class classification throughout this lecture.
• $x_n = (x_n(1), x_n(2), \ldots, x_n(D))^T$ – input feature vector
• $n$ – index of training tokens; $D$ – feature dimension
• $w = (w_1, w_2, \ldots, w_D)^T$ – weight vector
• $a_n = b + w^T x_n$ – linear output
• $y_n$ – predicted class; $t_n$ – labelled class
• Prediction: $y = 1$ if $a \geq 0$, and $y = -1$ otherwise.

Discriminant function
• $a_n = b + w^T x_n$ – linear output
• $y_n = f(a_n)$, where
$$f(a) = \begin{cases} 1 & \text{if } a \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
• $f(\cdot)$ – nonlinear discriminant function (a step jumping from $-1$ to $1$ at $a = 0$)

Loss function
• To evaluate the performance of the classifier:
$$E(y_{1:N}, t_{1:N}) = \sum_{n=1}^{N} e(y_n, t_n)$$
• 0–1 loss function: $e(y, t) = [y \neq t]$
[figure: plots of $e(y, 1)$ and $e(y, -1)$ against $y$]

Structure of linear classifier
[figure: inputs $x_n(1), \ldots, x_n(D)$ (plus a constant 1) feed the linear output $a_n = b + w^T x_n$; the discriminant $f$ gives $y_n$, which enters the loss $e(\cdot, t_n)$]

Alternatively
• Nonlinear function: the identity, $f(a) = a$
• Loss function: $e(y, t) = u(-yt)$, where $u(\cdot)$ is the unit step
[figure: plots of $e(y, 1)$ and $e(y, -1)$; the classifier structure is unchanged, with the step absorbed into the loss]

Ideal case – problem?
• The step function (whether in the discriminant or in the loss) is not differentiable, so we cannot train using gradient methods.
• We need a differentiable proxy for both.

Nonlinear Discriminant Function
• Our goal: find a function that is
  • differentiable, and
  • approximates the step function.
• Solution: the sigmoid function.
• Definition: a bounded, differentiable and monotonically increasing function.
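Before turning to sigmoids, the recap's classifier can be sketched in a few lines. This is a minimal sketch, not the slides' code; the toy data, the hand-picked weights, and the helper names `predict` and `zero_one_loss` are my illustrative choices.

```python
import numpy as np

# Minimal sketch of the recap's linear classifier: a_n = b + w.x_n,
# step discriminant f(a) in {-1, +1}, and the 0-1 loss e(y, t) = [y != t].
# Toy data and hand-picked weights below are illustrative, not from the slides.

def predict(w, b, X):
    """Apply the step discriminant: +1 where b + w.x >= 0, else -1."""
    a = X @ w + b                      # linear outputs a_n, one per row of X
    return np.where(a >= 0, 1, -1)

def zero_one_loss(y, t):
    """E = sum_n [y_n != t_n], the count of misclassified tokens."""
    return int(np.sum(y != t))

X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.5, 0.3], [-0.5, -2.0]])
t = np.array([1, 1, -1, -1])           # labels follow the sign of feature 1
w, b = np.array([1.0, 0.0]), 0.0       # hand-picked, not trained
y = predict(w, b, X)
print(zero_one_loss(y, t))             # 0: every token classified correctly
```

Note that the 0–1 loss here is piecewise constant in $w$, which is exactly why the slides introduce differentiable proxies next.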
Sigmoid Function - Examples
• Logistic function:
$$f(a) = \frac{1}{1 + e^{-a}}, \qquad f'(a) = \frac{e^{-a}}{\left(1 + e^{-a}\right)^2}$$
[figure: the logistic function rises from 0 to 1, passing through 0.5 at $a = 0$; its derivative peaks at 0.25]

Sigmoid Function - Examples
• Hyperbolic tangent:
$$f(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}, \qquad f'(a) = 1 - \left(\frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}\right)^2 = 1 - \tanh^2(a)$$
[figure: $\tanh$ rises from $-1$ to $1$; its derivative peaks at 1]

Nonlinear Loss Function
• Our goal: find a function that is
  • differentiable, and
  • an UPPER BOUND of the 0–1 loss.
• Why an upper bound?
  • For training: minimizing the loss guarantees the error is not large.
  • For test: generalization error < generalization loss < its upper bound.

Loss function example
• Square loss:
$$e(y, t) = (y - t)^2, \qquad \frac{\partial e}{\partial y} = 2(y - t)$$
[figure: parabolic plots of $e(y, 1)$ and $e(y, -1)$]
• Advantage: easy to solve; common for regression.
• Disadvantage: punishes 'right' tokens (predictions already well on the correct side).

Loss function example
• Hinge loss:
$$e(y, t) = \begin{cases} 1 - ty & \text{if } ty < 1 \\ 0 & \text{otherwise} \end{cases}, \qquad \frac{\partial e}{\partial y} = -t \, u(1 - ty)$$
[figure: hinge plots of $e(y, 1)$ and $e(y, -1)$]
• Advantage: easy to solve; good for classification.

Linear classifiers example
• Pick a nonlinear discriminant function (linear, i.e. identity, or sigmoid) and a loss function (square or hinge):
  • Linear + square: MSE classifier
  • Sigmoid + square: nonlinear MSE classifier
  • Linear + hinge + regularization: SVM
[figure: classifier structure with $a_n = w^T x_n$, the bias folded in by appending a constant feature]

MSE Classifier
• $e(y_n, t_n) = (y_n - t_n)^2 = (w^T x_n - t_n)^2$
• $E(w) = \sum_n (w^T x_n - t_n)^2$
• Setting the gradient to zero:
$$\nabla_w E = 2 \sum_n x_n (w^T x_n - t_n) = 0$$
• With $X = (x_1, \ldots, x_N)$ and $t = (t_1, \ldots, t_N)^T$, this gives the normal equations $X X^T w = X t$, hence
$$w = (X X^T)^{-1} X t$$
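The closed-form solution above can be sketched numerically. This is a minimal sketch under my own conventions: the toy 1-D data, the helper name `fit_mse`, and folding the bias into a constant-1 feature row are illustrative choices, not from the slides.

```python
import numpy as np

# Sketch of the MSE classifier's closed-form solution: solve the normal
# equations X X^T w = X t, with X = (x_1, ..., x_N) as columns.
# Toy data and the constant-1 bias row are my illustrative choices.

def fit_mse(X_cols, t):
    """X_cols: D x N feature matrix (columns are tokens); t: +/-1 labels."""
    return np.linalg.solve(X_cols @ X_cols.T, X_cols @ t)

x = np.array([-2.0, -1.0, 1.0, 2.0])         # 1-D features
X_cols = np.vstack([x, np.ones_like(x)])     # append constant-1 row (bias)
t = np.array([-1.0, -1.0, 1.0, 1.0])
w = fit_mse(X_cols, t)                       # w = (X X^T)^{-1} X t
y = np.sign(w @ X_cols)                      # step discriminant on top
print(y.tolist())                            # [-1.0, -1.0, 1.0, 1.0]
```

Using `np.linalg.solve` on the normal equations, rather than forming the inverse explicitly, is the standard numerically safer way to evaluate $(X X^T)^{-1} X t$.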
Nonlinear MSE Classifier
• $y_n = f(w^T x_n)$, so
$$e(y_n, t_n) = (y_n - t_n)^2 = \left(f(w^T x_n) - t_n\right)^2$$
• $E(w) = \sum_n \left(f(w^T x_n) - t_n\right)^2$

Training a Nonlinear MSE Classifier
• Chain rule:
$$\nabla_w E = \sum_n 2 \left(f(w^T x_n) - t_n\right) f'(w^T x_n) \, x_n$$
• Disadvantage: training can be stagnant (where the sigmoid saturates, $f'$ is close to zero and so is the gradient).

Introduction to neural networks
• $y_n$ is a function of $x_n$: $y_n = F(x_n)$.
• For a linear classifier, this function takes a simple form.
• What if we need more complicated functions?
[figure: the linear classifier drawn as a one-layer network from $x_n(1), \ldots, x_n(D)$ to $y_n$]

Introduction to neural networks
[figure: layers are stacked, so the outputs of one layer become the inputs of the next: $x_0 \to z_1$, then $x_0 \to z_1 \to z_2$, and so on up to an output layer producing $y$]

Notation
• Input: $z_0 = x$.
• Layer $l$ (hidden layers $l = 1, \ldots, L-1$): $a_l = b_l + W_l z_{l-1}$ and $z_l = f_l(a_l)$.
• Output layer: $a_L = b_L + W_L z_{L-1}$ and $y = f_L(a_L)$.
[figure: the layered network annotated with these symbols]

Question
• $y$ is a function of the input: $y = F(x)$. How many kinds of functions can be represented by a neural net?
• Are sigmoid functions good candidates for $f(\cdot)$?
• Answer: given enough nodes, a 3-layer network with sigmoid or linear activation functions can approximate ANY function with bounded support sufficiently accurately.
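The layered notation can be sketched as a forward pass. This is a minimal sketch; the layer sizes, the random weights, and the helper names `logistic` and `forward` are my illustrative choices, not the slides' definitions.

```python
import numpy as np

# Minimal forward pass for the layered notation: z_0 = x, then
# a_l = b_l + W_l z_{l-1} and z_l = f_l(a_l), ending with y = f_L(a_L).
# Layer sizes and random weights are illustrative choices.

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, layers):
    """layers: list of (W_l, b_l, f_l) triples, applied in order."""
    z = x                              # z_0 = x (the input)
    for W, b, f in layers:
        z = f(b + W @ z)               # a_l = b_l + W_l z_{l-1}; z_l = f_l(a_l)
    return z                           # y = z_L

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 2)), np.zeros(3), logistic),      # sigmoid hidden layer
    (rng.normal(size=(1, 3)), np.zeros(1), lambda a: a),   # linear output layer
]
y = forward(np.array([0.5, -1.0]), layers)
print(y.shape)                         # (1,): a single scalar output
```

This 2-3-1 network with a sigmoid hidden layer and a linear output is exactly the architecture the answer above refers to, before any training is considered.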
Proof: representation power
• For simplicity, we only consider functions of one variable (real input, real output). In this case, two layers are enough.
• Staircase approximation with step width $h$:
$$F(x) \approx \sum_{k=1}^{D} \left[F\big((k+1)h\big) - F(kh)\right] u(x - kh)$$

Proof: representation power
• First layer: $D$ hidden nodes; node $k$ represents
$$z_1(k) = u(x_0 - kh)$$
  • $f_1(\cdot)$ – logistic function, $w_1(k) = 1$, $b_1(k) = -kh$
• Second (output) layer: $w_2(k) = F\big((k+1)h\big) - F(kh)$
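The staircase construction can be checked numerically. This is a sketch under my own assumptions, not the slides' proof verbatim: the target $F$, the step width, the steep input slope (the slides use unit input weights; a large slope just makes the logistic closer to a true step), and the constant base term $F(h)$ are all illustrative choices.

```python
import numpy as np

# Sketch of the staircase construction: hidden unit k approximates the step
# u(x - k*h) with a logistic, and output weight w2(k) = F((k+1)h) - F(kh).
# Target F, step width h, steep slope, and base term F(h) are my choices.

def F(x):
    return np.sin(2 * np.pi * x)       # function to approximate on [0, 1]

h, D = 0.05, 20                        # step width and number of hidden nodes
k = np.arange(1, D + 1)

def net(x, slope=200.0):
    z = 1.0 / (1.0 + np.exp(-slope * (x[:, None] - k * h)))  # ~ u(x - kh)
    w2 = F((k + 1) * h) - F(k * h)     # output-layer weights
    return F(h) + z @ w2               # staircase approximation of F

xs = np.linspace(0.1, 0.9, 50)
err = np.max(np.abs(net(xs) - F(xs)))
print(err < 0.35)                      # True: within roughly one step's rise
```

Shrinking $h$ (and increasing $D$ to match) tightens the approximation, which is the informal content of the representation-power claim above.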