From Linear Classifiers to Neural Networks
Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear loss function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks
Linear Classifier: Recap & Notation
• We focus on two-class classification for the entire class.
• $\boldsymbol{x}^n = \left(x^n_1, x^n_2, \ldots, x^n_D\right)^T$ – input feature vector
• $n$ – index of training tokens, $D$ – feature dimension
• $\boldsymbol{w} = \left(w_1, w_2, \ldots, w_D\right)^T$ – weight vector
• $a^n = b + \boldsymbol{w}^T \boldsymbol{x}^n$ – linear output
• $y^n$ – predicted class
• $t^n$ – labelled class

$$y^n = \begin{cases} 1 & \text{if } a^n \ge 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{or} \qquad y^n = \begin{cases} 1 & \text{if } a^n \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
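As a concrete reference, here is a minimal NumPy sketch of this prediction rule with the $\pm 1$ convention; the function name and data layout are illustrative choices:

```python
import numpy as np

def predict(w, b, X):
    """Linear classifier with the +/-1 label convention.

    w : (D,) weight vector, b : scalar bias, X : (N, D) inputs.
    Returns +1 where a^n = b + w^T x^n >= 0, else -1.
    """
    a = b + X @ w                 # linear outputs a^n for all tokens
    return np.where(a >= 0, 1, -1)
```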
Discriminant function
•π‘Ž
𝑛
= 𝑏 + π’˜π‘‡ 𝒙
•π‘¦
𝑛
=𝑔 π‘Ž
𝑛
- Linear output
𝑛
1
if
π‘Ž
≥0
𝑛
𝑦 =
−1 otherwise
𝑛
𝑔 π‘Ž =
1
−1
if π‘Ž ≥ 0
otherwise
• 𝑔 π‘Ž - Nonlinear discriminant function
1
-1
Loss function
• To evaluate the performance of the classifier:

$$\mathcal{L}\!\left(y^{1:N}, t^{1:N}\right) = \sum_{n=1}^{N} \ell\!\left(y^n, t^n\right)$$

• Loss function (0–1 loss):

$$\ell(y, t) = \left[y \ne t\right]$$

[Plots: the 0–1 losses $\ell(y, 1)$ and $\ell(y, -1)$ as step functions of $y$.]
Structure of linear classifier
[Diagram: inputs $x^n_1, \ldots, x^n_D$ (plus a constant $1$ for the bias) feed the linear unit $a^n = b + \boldsymbol{w}^T \boldsymbol{x}^n$, then the discriminant $g(\cdot)$ produces $y^n$, which enters the loss $\ell(\cdot, t)$; the step losses $\ell(y, 1)$ and $\ell(y, -1)$ are plotted at the output.]
Alternatively
• Nonlinear function: $g(a) = a$ (the identity)
• Loss function: $\ell(y, t) = e^{-yt}$

[Plots: the exponential losses $\ell(y, 1) = e^{-y}$ and $\ell(y, -1) = e^{y}$ as functions of $y$.]
Structure of linear classifier
[Diagram: the same network structure as before, now with the exponential losses $\ell(y, 1) = e^{-y}$ and $\ell(y, -1) = e^{y}$ at the output.]
Ideal case - Problem?
• Nonlinear function: the step function taking values $1$ and $-1$
• Loss function: the 0–1 losses $\ell(y, 1)$ and $\ell(y, -1)$
• Neither is differentiable, so we cannot train using gradient methods.
• We need a differentiable proxy for both.
Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear loss function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks
Nonlinear Discriminant function
• Our Goal: Find a function that is
• Differentiable
• Approximates the step function
• Solution: Sigmoid function
• Definition: a bounded, differentiable and monotonically increasing function.
Sigmoid Function - Examples
• Logistic Function:

$$g(a) = \frac{1}{1 + e^{-a}}, \qquad g'(a) = \frac{e^{-a}}{\left(1 + e^{-a}\right)^2}$$

[Plots: the logistic function rising from $0$ to $1$ with $g(0) = 0.5$; its derivative, bell-shaped with maximum $g'(0) = 0.25$.]
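A minimal NumPy sketch of the logistic function and its derivative; the identity $g'(a) = g(a)\,(1 - g(a))$ is equivalent to the quotient form above:

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logistic_grad(a):
    """g'(a) = e^{-a} / (1 + e^{-a})^2, equivalently g(a) * (1 - g(a))."""
    g = logistic(a)
    return g * (1.0 - g)

print(logistic(0.0))       # 0.5
print(logistic_grad(0.0))  # 0.25, the maximum of the derivative
```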
Sigmoid Function - Examples
• Hyperbolic Tangent:

$$g(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$

$$g'(a) = \left(\frac{2}{e^{a} + e^{-a}}\right)^{2} = 1 - \tanh^{2}(a)$$

[Plots: $\tanh(a)$ rising from $-1$ to $1$ through the origin; its derivative, bell-shaped with maximum $g'(0) = 1$.]
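A quick numerical check of the identity $g'(a) = 1 - \tanh^2(a)$; this is a sketch, with an arbitrary finite-difference step:

```python
import numpy as np

a = np.linspace(-5.0, 5.0, 21)
eps = 1e-6
finite_diff = (np.tanh(a + eps) - np.tanh(a - eps)) / (2 * eps)
identity = 1.0 - np.tanh(a) ** 2
print(np.max(np.abs(finite_diff - identity)))  # agreement up to floating-point noise
```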
Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear loss function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks
Nonlinear Loss function
• Our Goal: Find a function that is
• Differentiable
• An UPPER BOUND of the step function (the 0–1 loss)
• Why?
• For training: minimizing the loss keeps the error from being large.
• For testing: generalization error ≤ generalization loss ≤ its upper bound.
Loss function example
• Square loss:

$$\ell(y, t) = (y - t)^2, \qquad \frac{\partial \ell}{\partial y} = 2\,(y - t)$$

[Plots: the parabolas $\ell(y, 1) = (y - 1)^2$ and $\ell(y, -1) = (y + 1)^2$.]

• Advantage: easy to solve; common for regression
• Disadvantage: it also penalizes tokens that are already classified correctly (e.g. $\ell(y, 1)$ grows again for $y > 1$)
Loss function example
• Hinge loss:

$$\ell(y, t) = \begin{cases} -t y + 1 & \text{if } t y < 1 \\ 0 & \text{otherwise} \end{cases}, \qquad \frac{\partial \ell}{\partial y} = -t\, u\!\left(-t y + 1\right)$$

where $u(\cdot)$ denotes the unit step function.

[Plots: the hinge losses $\ell(y, 1)$ and $\ell(y, -1)$, decreasing linearly until the margin $t y = 1$ and zero beyond it.]

• Advantage: easy to solve; good for classification
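A small sketch comparing the losses from the last few slides against the 0–1 loss they are meant to upper-bound; the helper names and the grid of $y$ values are illustrative:

```python
import numpy as np

def zero_one(y, t):
    """0-1 loss with the convention y_hat = 1 if y >= 0, else -1."""
    return (np.where(y >= 0, 1, -1) != t).astype(float)

def square(y, t):      return (y - t) ** 2                   # square loss
def exponential(y, t): return np.exp(-y * t)                  # exponential loss
def hinge(y, t):       return np.maximum(0.0, 1.0 - t * y)    # hinge loss

y = np.linspace(-2.0, 2.0, 9)
for loss in (square, exponential, hinge):
    assert np.all(loss(y, 1) >= zero_one(y, 1))   # each sits above the 0-1 loss
    print(loss.__name__, np.round(loss(y, 1), 2))
```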
Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear loss function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks
Linear classifiers example
• Choosing the nonlinear discriminant function $g(\cdot)$ and the loss function $\ell(\cdot, t)$ gives different classifiers:
• Linear + Square: MSE classifier
• Sigmoid + Square: nonlinear MSE classifier
• Linear + Hinge + Regularization: SVM

[Diagram: the augmented input $\boldsymbol{z}^n = \left(1, x^n_1, \ldots, x^n_D\right)^T$ feeds $a^n = \boldsymbol{w}^T \boldsymbol{z}^n$ (the bias absorbed into $\boldsymbol{w}$), then $y^n = g\!\left(a^n\right)$ and the loss $\ell\!\left(y^n, t^n\right)$.]
MSE Classifier
$$\sum_n \ell\!\left(y^n, t^n\right) = \sum_n \left(\boldsymbol{w}^T \boldsymbol{z}^n - t^n\right)^2$$

$$\nabla_{\boldsymbol{w}} \sum_n \left(\boldsymbol{w}^T \boldsymbol{z}^n - t^n\right)^2 = 2 \sum_n \left(\boldsymbol{z}^{nT} \boldsymbol{w} - t^n\right) \boldsymbol{z}^n = 0 \quad\Rightarrow\quad \sum_n \boldsymbol{z}^n \boldsymbol{z}^{nT} \boldsymbol{w} = \sum_n t^n \boldsymbol{z}^n$$

With $\boldsymbol{Z} = \left[\boldsymbol{z}^1, \ldots, \boldsymbol{z}^N\right]$ and $\boldsymbol{t} = \left(t^1, \ldots, t^N\right)^T$:

$$\boldsymbol{Z} \boldsymbol{Z}^T \boldsymbol{w} = \boldsymbol{Z} \boldsymbol{t} \quad\Rightarrow\quad \boldsymbol{w} = \left(\boldsymbol{Z} \boldsymbol{Z}^T\right)^{-1} \boldsymbol{Z} \boldsymbol{t}$$
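A minimal sketch of this closed-form solution on synthetic data; the toy labels are an assumption, and np.linalg.solve replaces the explicit inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 2
X = rng.normal(size=(N, D))                        # raw features x^n
t = np.where(X[:, 0] + X[:, 1] >= 0, 1.0, -1.0)    # toy +/-1 labels

Z = np.vstack([np.ones(N), X.T])                   # augmented inputs z^n as columns, (D+1, N)
w = np.linalg.solve(Z @ Z.T, Z @ t)                # solve Z Z^T w = Z t

y = np.where(w @ Z >= 0, 1, -1)                    # predicted classes
print("training accuracy:", np.mean(y == t))
```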
Nonlinear MSE Classifier
$$\sum_n \ell\!\left(y^n, t^n\right) = \sum_n \left(y^n - t^n\right)^2 = \sum_n \left(g\!\left(\boldsymbol{w}^T \boldsymbol{x}^n\right) - t^n\right)^2$$
Training a Nonlinear MSE Classifier
$$\sum_n \ell\!\left(y^n, t^n\right) = \sum_n \left(y^n - t^n\right)^2, \qquad y^n = g\!\left(\boldsymbol{w}^T \boldsymbol{x}^n\right)$$

Chain rule:

$$\nabla_{\boldsymbol{w}} = \sum_n 2 \left(g\!\left(\boldsymbol{w}^T \boldsymbol{x}^n\right) - t^n\right) g'\!\left(\boldsymbol{w}^T \boldsymbol{x}^n\right) \boldsymbol{x}^n$$

• Disadvantage: training can stagnate, since $g'$ vanishes for large $|a|$.
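A gradient-descent sketch of this update, assuming $g = \tanh$ so the outputs match the $\pm 1$ labels; the data, learning rate, and iteration count are illustrative, and the bias is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 2
X = rng.normal(size=(N, D))
t = np.where(X[:, 0] - X[:, 1] >= 0, 1.0, -1.0)

w = np.zeros(D)
lr = 0.1
for _ in range(500):
    a = X @ w                          # a^n = w^T x^n
    y = np.tanh(a)                     # y^n = g(a^n)
    g_prime = 1.0 - y ** 2             # g'(a^n) = 1 - tanh^2(a^n)
    grad = 2 * ((y - t) * g_prime) @ X # chain rule from the slide
    w -= lr * grad / N                 # averaged gradient step

print("accuracy:", np.mean(np.where(X @ w >= 0, 1, -1) == t))
```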
Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear loss function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks
Introduction to neural networks
• $y^n$ is a function of $\boldsymbol{x}^n$: $y^n = F\!\left(\boldsymbol{x}^n\right)$
• For the linear classifier, this function takes a simple form.
• What if we need more complicated functions?

[Diagram: inputs $x^n_1, \ldots, x^n_D$ (plus a constant $1$) feeding several units $a^n_i$ with outputs $y^n_i$.]
Introduction to neural networks

[Diagram: one hidden layer. The input $\boldsymbol{x}^n_0$ (dimension $D_0$, plus a constant $1$) feeds units $\boldsymbol{a}^n_1$ with activations $\boldsymbol{x}^n_1$ (dimension $D_1$).]

Introduction to neural networks

[Diagram: stacking layers: $\boldsymbol{x}^n_0 \to \boldsymbol{a}^n_1 \to \boldsymbol{x}^n_1 \to \boldsymbol{a}^n_2 \to \boldsymbol{x}^n_2 \to \boldsymbol{a}^n_3$, with dimensions $D_0, D_1, D_2, D_3$.]
Introduction to neural networks

[Diagram: the output layer. The last hidden activations $x^n_{L-1,1}, \ldots, x^n_{L-1,D_{L-1}}$ feed a single output unit $a^n_L$, which gives $y^n$.]
Notation
$$\boldsymbol{a}_1^n = \boldsymbol{b}_1 + \boldsymbol{W}_1 \boldsymbol{x}_0^n, \qquad \boldsymbol{x}_1^n = g_1\!\left(\boldsymbol{a}_1^n\right) \qquad \text{(hidden layer 1)}$$

$$\boldsymbol{a}_2^n = \boldsymbol{b}_2 + \boldsymbol{W}_2 \boldsymbol{x}_1^n, \qquad \boldsymbol{x}_2^n = g_2\!\left(\boldsymbol{a}_2^n\right)$$

• $\boldsymbol{x}_0^n$ – input, dimension $D_0$; $\boldsymbol{a}_1^n, \boldsymbol{x}_1^n$ – dimension $D_1$; $\boldsymbol{a}_2^n, \boldsymbol{x}_2^n$ – dimension $D_2$.
Notation
$$y^n = g_L\!\left(a_L^n\right), \qquad a_L^n = b_L + \boldsymbol{w}_L^T \boldsymbol{x}_{L-1}^n$$

$$\boldsymbol{x}_{L-1}^n = g_{L-1}\!\left(\boldsymbol{a}_{L-1}^n\right), \qquad \boldsymbol{a}_{L-1}^n = \boldsymbol{b}_{L-1} + \boldsymbol{W}_{L-1} \boldsymbol{x}_{L-2}^n$$

[Diagram: the activations $x^n_{L-1,1}, \ldots, x^n_{L-1,D_{L-1}}$ of the last hidden layer feed the scalar output $a^n_L$, giving $y^n$.]
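A forward-pass sketch of this recursion; the layer sizes, tanh activations, and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [3, 5, 4, 1]   # D_0, D_1, D_2 and a scalar output layer L
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
bs = [np.zeros(dims[l + 1]) for l in range(len(dims) - 1)]

def forward(x0):
    """x_l = g_l(a_l), a_l = b_l + W_l x_{l-1}; tanh at every layer here."""
    x = x0
    for W, b in zip(Ws, bs):
        a = b + W @ x     # a_l = b_l + W_l x_{l-1}
        x = np.tanh(a)    # x_l = g_l(a_l)
    return x              # the network output y

print(forward(rng.normal(size=3)))
```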
Question
• $y^n$ is a function of $\boldsymbol{x}_0^n$: $y^n = F\!\left(\boldsymbol{x}_0^n\right)$. How many kinds of function can be represented by a neural net?
• Are sigmoid functions good candidates for $g(\cdot)$?
• Answer:
• Given enough nodes, a 3-layer network with sigmoid or linear activation functions can approximate ANY function with bounded support sufficiently accurately.
Skeleton
• Recap – linear classifier
• Nonlinear discriminant function
• Nonlinear loss function
• Example linear classifiers
• Introduction to neural networks
• Representation power of sigmoidal neural networks
Proof: representation power
• For simplicity, we only consider functions of one variable (real input, real output). In this case, two layers are enough.

$$F(x) \approx \sum_{d=1}^{D} \left[F\!\left((d+1)\Delta\right) - F\!\left(d\Delta\right)\right] u\!\left(x - d\Delta\right)$$

where $u(\cdot)$ is the unit step function and $\Delta$ the grid spacing.
Proof: representation power
• First Layer: $D$ hidden nodes; node $d$ represents

$$x_1[d] = u\!\left(x_0 - d\Delta\right)$$

• $g_1(\cdot)$ – logistic function, $w_1[d] = 1$, $b_1[d] = -d\Delta$ (scaling $w_1$ and $b_1$ up together makes the logistic arbitrarily close to the step $u$)
• Second (Output) layer:

$$w_2[d] = F\!\left((d+1)\Delta\right) - F\!\left(d\Delta\right)$$

$$F(x) \approx \sum_{d=1}^{D} \left[F\!\left((d+1)\Delta\right) - F\!\left(d\Delta\right)\right] u\!\left(x - d\Delta\right)$$
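A sketch of this construction: approximating a bounded 1-D function by a weighted sum of steep logistics. The target sin function, grid size, and steepness are illustrative choices; the slide's $w_1 = 1$ is scaled up so each logistic approximates the unit step:

```python
import numpy as np

def target(x):
    return np.sin(2 * np.pi * x)             # illustrative bounded target on [0, 1]

D, delta, steep = 100, 1.0 / 100, 1000.0     # grid size, step width, sigmoid steepness
d = np.arange(1, D + 1)

def approx(x):
    # First layer: D steep logistics, soft versions of the steps u(x - d*delta)
    z = np.clip(steep * (x[:, None] - d * delta), -500, 500)  # avoid exp overflow
    h = 1.0 / (1.0 + np.exp(-z))
    # Second layer: weights w2[d] = F((d+1)*delta) - F(d*delta)
    w2 = target((d + 1) * delta) - target(d * delta)
    return h @ w2

x = np.linspace(0.0, 1.0, 200)
err = np.max(np.abs(approx(x) - (target(x) - target(delta))))
print(err)   # shrinks as D grows; the staircase matches F up to the constant F(delta)
```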