Neural Network Theory and Applications (《神经网络理论与应用》), Lecture 4
Instructors: 吕宝粮 (Bao-Liang Lu), 郑伟龙 (Wei-Long Zheng)
Teaching assistants: 马天放, 刘佳雯, 姜卫邦, 蓝宇霆
Department of Computer Science and Engineering, Shanghai Jiao Tong University
bllu@sjtu.edu.cn
weilong@sjtu.edu.cn
http://bcmi.sjtu.edu.cn
March 9, 2022
Radial-Basis Function Networks (RBF)
(A radial basis function is a scalar function that is radially symmetric about its center.)
Radial-Basis Function (RBF) Network
[Figure: architecture of an RBF network — an input layer of source nodes x1, ..., xm; a hidden layer of m1 radial-basis functions φ, plus a fixed unit φ = 1 whose weight is the bias w0 = b; and a linear output layer with weights w1, ..., wm1.]
Multilayer Perceptron with Two Hidden Layers
[Figure: a multilayer perceptron with two hidden layers — the input signal (stimulus) passes through the input layer, the first hidden layer, the second hidden layer, and the output layer, which produces the output signal (response).]
Introduction to RBF Network

A basic radial-basis function (RBF) network consists of three layers with entirely different roles:
1. The input layer is made up of source nodes (sensory units).
2. The hidden layer applies a nonlinear transformation from the input space to the hidden space.
   - RBF networks have only one hidden layer, which is often high-dimensional.
3. A linear output layer.

The hidden space is usually chosen to be high-dimensional for two reasons:
1. Pattern vectors are more likely to be linearly separable in a high-dimensional space.
2. The approximation ability of the network improves as the number of hidden units increases.
Radial Basis Function
f(x) = Σ_{i=1}^{m} w_i φ_i(x),   where φ_i(x) = φ(||x − x_i||)

Three parameters define a radial basis function:
• Center: x_i
• Distance measure: r = ||x − x_i||
• Shape: φ
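As an illustration only (not from the slides), here is a minimal NumPy sketch of the mapping f(x) = Σ w_i φ(||x − x_i||); the function name rbf_output and the toy centers, weights, and Gaussian shape are hypothetical choices.

    import numpy as np

    def rbf_output(x, centers, weights, phi, bias=0.0):
        # f(x) = b + sum_i w_i * phi(||x - x_i||), with a user-supplied shape phi
        r = np.linalg.norm(centers - x, axis=1)   # distances r_i = ||x - x_i||
        return bias + weights @ phi(r)

    # Example with a Gaussian shape and arbitrary toy centers/weights
    gaussian = lambda r: np.exp(-r**2 / 2.0)
    centers = np.array([[0.0, 0.0], [1.0, 1.0]])
    weights = np.array([0.5, -0.3])
    print(rbf_output(np.array([0.5, 0.5]), centers, weights, gaussian))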
Typical Radial Functions

Gaussian:
  φ(r) = exp(−r² / (2σ²)),   σ > 0 and r ∈ ℝ

Multiquadrics:
  φ(r) = √(r² + c²),   c > 0 and r ∈ ℝ

Inverse multiquadrics:
  φ(r) = c / √(r² + c²),   c > 0 and r ∈ ℝ
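As a small illustration (an assumption, not part of the slides), the three shape functions above written as NumPy helpers; the parameter values sigma = 1.0 and c = 1.0 are arbitrary defaults.

    import numpy as np

    def gaussian(r, sigma=1.0):
        return np.exp(-r**2 / (2.0 * sigma**2))

    def multiquadric(r, c=1.0):
        return np.sqrt(r**2 + c**2)

    def inverse_multiquadric(r, c=1.0):
        return c / np.sqrt(r**2 + c**2)

    r = np.linspace(-10, 10, 5)
    print(gaussian(r), multiquadric(r), inverse_multiquadric(r), sep="\n")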
Gaussian Basis Function
φ(r) = exp(−r² / (2σ²)),   σ > 0 and r ∈ ℝ

[Figure: the Gaussian basis function plotted for σ = 0.5, 1.0, and 1.5 — a larger σ gives a wider bell curve.]
Inverse Multiquadrics

φ(r) = c / √(r² + c²),   c > 0 and r ∈ ℝ

[Figure: the inverse multiquadric plotted for c = 1, 2, 3, 4, 5 over r ∈ [−10, 10]; the function peaks at φ(0) = 1 and decays toward 0 as |r| grows, more slowly for larger c.]
Cover’s Theorem

Consider the use of a RBF network for a complex pattern classification task.

The problem is basically solved by transforming it into a high-dimensional
space in a nonlinear manner.

Justification: Cover’s theorem on the separability of patterns:

A complex pattern classification problem cast in a high-dimensional space
nonlinearly is more likely to be linearly separable than in a low-dimensional
space.

If the patterns are linearly separable, the classification problem is fairly easy
to solve.
Cover’s Theorem -2

• Consider a family of surfaces. Each surface divides an input space into two regions.
• ℋ = {x1, x2, ..., xN} is a set of N pattern vectors.
• Each pattern vector belongs to one of two classes, ℋ1 or ℋ2.
• This kind of binary partition is called a dichotomy.
• A dichotomy is called separable with respect to a family of surfaces if there exists a surface in the family that separates the points in class ℋ1 from those in class ℋ2.
• Let φ1(x), φ2(x), ..., φm1(x) be a set of m1 real-valued functions.
• Using these functions, we can define for each pattern x ∈ ℋ the vector
  f(x) = [φ1(x), φ2(x), ..., φm1(x)]ᵀ
Cover’s Theorem -3

• Assume now that x is an m0-dimensional vector.
• Then the function f(x) maps points in the m0-dimensional input space into corresponding points in a new space of dimension m1.
• φi(x) is referred to as a hidden function.
• The space spanned by the functions φ1(x), ..., φm1(x) is called the hidden space or feature space.
• The hidden functions play a similar role to the hidden units in an MLP network.
Cover’s Theorem -4

• A dichotomy [ℋ1, ℋ2] of ℋ is said to be φ-separable if there exists an m1-dimensional vector w satisfying the conditions
  wᵀf(x) > 0,  x ∈ ℋ1
  wᵀf(x) < 0,  x ∈ ℋ2
• The hyperplane defined by the equation
  wᵀf(x) = 0
  describes the separating surface in the hidden φ-space.
• The inverse image of this hyperplane, that is,
  {x : wᵀf(x) = 0}
  defines the separating surface in the input space.
Cover’s Theorem -5

• Summarizing, Cover's theorem on the separability of patterns has two basic ingredients:
  1. Nonlinear formulation of the hidden functions φi(x), i = 1, 2, ..., m1.
  2. High dimensionality of the hidden space compared to the input space.
• Sometimes the use of a nonlinear mapping alone, without increasing the dimensionality, is sufficient to produce linear separability.
Two Mappings for Pattern Classification
XOR Problem



• To illustrate the importance of φ-separability, consider again the simple yet important XOR problem.
• There are four points (patterns), (1,1), (0,1), (0,0), and (1,0), in a two-dimensional input space.
• Requirement: construct a binary classifier whose output is
  - 0 for the inputs (1,1) and (0,0),
  - 1 for the inputs (1,0) and (0,1).
• Recall that the XOR problem is not linearly separable in the original input space.
• Define a pair of Gaussian hidden functions (a numerical check is sketched below):
  φ1(x) = exp(−||x − t1||²),   t1 = [1, 1]ᵀ
  φ2(x) = exp(−||x − t2||²),   t2 = [0, 0]ᵀ
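As an illustrative check (not from the slides), the following sketch maps the four XOR patterns through the two Gaussian hidden functions above and shows that the images become linearly separable.

    import numpy as np

    t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
    phi = lambda x, t: np.exp(-np.sum((x - t)**2))

    patterns = [((1, 1), 0), ((0, 1), 1), ((0, 0), 0), ((1, 0), 1)]
    for p, label in patterns:
        x = np.array(p, dtype=float)
        print(p, label, np.round([phi(x, t1), phi(x, t2)], 4))
    # (1,1) and (0,0) map near (1, 0.135) and (0.135, 1), while (0,1) and (1,0) both
    # map to about (0.368, 0.368), so the line phi1 + phi2 = 1 separates the two classes.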
Solution of the XOR Problem
Approximation Properties




• Multilayer perceptrons have the universal approximation property.
• Also, the family of RBF networks can uniformly approximate any continuous function on a compact set.
• Formally, let G : ℝⁿ → ℝ be an integrable, continuous, and bounded function satisfying the condition
  ∫_{ℝⁿ} G(x) dx ≠ 0
• Let ℱ_G denote the family of RBF networks consisting of functions F : ℝⁿ → ℝ of the form
  F(x) = Σ_{i=1}^{m1} ωi G((x − ti) / σ)
  Here σ > 0, ωi ∈ ℝ, and ti ∈ ℝⁿ for i = 1, 2, ..., m1.
Approximation Properties -2

• The universal approximation theorem for RBF networks:
  For any continuous input-output mapping function f(x) there is an RBF network with a set of centers ti, i = 1, 2, ..., m1, and a common width σ > 0 such that the input-output mapping function F(x) realized by the RBF network is close to f(x) in the Lp norm, p ∈ [1, ∞).
• Note that the kernel G : ℝⁿ → ℝ need not be radially symmetric.
• The theorem provides a theoretical basis for using RBF networks in practical applications.
Learning Strategies

In RBF networks, learning proceeds differently for different layer.

The linear output layer’s weights are learned rapidly through a linear
optimization strategy.

The hidden layer’s activation functions evolve slowly using some
nonlinear optimization strategy.

The layers of a RBF network perform different tasks.

It is reasonable to use different optimization techniques for the hidden and
output layers.

Learning strategies for the RBF networks differ in the method used for
specifying the centers of the RBF network.
4: 20
What to Learn?
[Figure: an RBF network with inputs x1, ..., xn, hidden units φ1, φ2, ..., φm, and outputs y1, ..., yl connected through weights wij.]

What must be learned:
• Weights wij
• Centers μj of the φj
• Widths σj of the φj
• Number of φj (model selection)
Two-Stage Training
[Figure: the same network, annotated with the two training stages.]

Step 1 determines:
• the centers μj of the φj,
• the widths σj of the φj,
• the number of φj.

Step 2 determines the weights wij, e.g., using batch learning.
Fixed Centers Selected at Random

• The simplest approach is to assume fixed radial-basis functions.
• The locations of the centers may be chosen randomly from the training data set.
• This is a sensible approach provided that the training data are representative of the problem.
• The radial-basis functions are typically chosen to be isotropic Gaussian functions:
  G(||x − ti||) = exp(−(m1 / d_max²) ||x − ti||²),   i = 1, 2, ..., m1
  where m1 is the number of centers (basis functions) and d_max is the maximum distance between the chosen centers. A sketch of this procedure follows below.
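The following is a minimal sketch of the two-stage procedure with fixed random centers, written as an assumption for illustration (the helper name train_rbf and the toy data are hypothetical): Step 1 picks m1 centers at random and sets a common Gaussian width from d_max; Step 2 solves for the linear output weights by least squares.

    import numpy as np

    def train_rbf(X, d, m1, rng=np.random.default_rng(0)):
        centers = X[rng.choice(len(X), size=m1, replace=False)]
        d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
        def hidden(X):
            r2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            return np.exp(-(m1 / d_max**2) * r2)          # G(||x - t_i||)
        Phi = hidden(X)
        w, *_ = np.linalg.lstsq(Phi, d, rcond=None)       # linear output layer
        return centers, w, hidden

    # Usage on toy data
    X = np.random.rand(50, 2)
    d = np.sin(X[:, 0]) + X[:, 1]
    centers, w, hidden = train_rbf(X, d, m1=10)
    print(hidden(X) @ w - d)   # training residuals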
Comparison of RBF with MLP

Both RBF and MLP networks are nonlinear layered networks having
universal approximation properties.

The most important differences between them are:
1. An RBF network has a single hidden layer, while an MLP can have several
hidden layers.
2. The computational nodes in the MLP network are similar in various layers,
while in the RBF network they are quite different in the output and hidden
layers.
Comparison of RBF with MLP -2
3. In the RBF network, the output layer is linear, while it is usually nonlinear in an MLP network.
4. In each hidden node, the activation function of an RBF network computes a Euclidean distance, while in MLP networks an inner product between the input and the weight vector is computed.
5. MLPs construct global approximations, while RBF networks construct local approximations to nonlinear input-output mappings.
• As a consequence, an MLP may require fewer parameters than an RBF network to achieve the same accuracy.
From the Perceptron to the Support Vector Machine

[Timeline figure, 1949–2006 (a span of roughly 50 years): Hebbian learning rule (1949), perceptron (1956), LMS learning algorithm (1960), Cover's theorem (1965), Fukushima's Neocognitron (1980), Kohonen's SOM (1982), RBF networks (1985), BP algorithm (1986), LeCun's CNN, support vector machine (1992).]
Support Vector Machines and Multilayer Perceptrons

Drawbacks of the MLP:
• The BP algorithm is not guaranteed to converge.
• The learning rate must be chosen by trial and error.
• The number of hidden neurons is also chosen by trial and error.

Advantages of the support vector machine:
• The quadratic programming algorithm is guaranteed to converge.
• Quadratic programming solvers are efficient.
• The support vectors are determined by the algorithm itself.
Support Vector Machine
Empirical Risk

• We want to estimate a function F(X, W) using training data
  T = {(Xi, di)}, i = 1, ..., N
• Loss between the desired response and the actual response:
  L(d, F(X, W)) = (d − F(X, W))²
• Expected risk (the risk functional):
  R(W) = ½ ∫ L(d, F(X; W)) dF_{X,D}(X, d)
• Empirical risk (the empirical risk functional):
  R_emp(W) = (1/N) Σ_{i=1}^{N} L(di, F(Xi, W))
Empirical risk minimization principle

• The true expected risk is approximated by the empirical risk
  R_emp(W) = (1/N) Σ_{i=1}^{N} L(di, F(Xi, W))
• Learning based on the empirical risk minimization principle is defined as
  W* = arg min_W R_emp(W)
• Examples of algorithms: perceptron, back-propagation, etc. (A toy computation of R_emp is sketched below.)
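As an illustration only (an assumption, not from the slides), a toy computation of the empirical risk for a linear model F(X, W) = XW under the squared loss:

    import numpy as np

    def empirical_risk(W, X, d):
        predictions = X @ W
        return np.mean((d - predictions) ** 2)

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    d = np.array([1.0, -1.0, 0.5])
    print(empirical_risk(np.array([0.8, -0.9]), X, d))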
Overfitting and underfitting

• Problem: how rich a class of classifiers F(X, W) should be used?
[Figure: three fits of the same data — underfitting, a good fit, and overfitting.]
• Problem of generalization: a small empirical risk R_emp(W) does not imply a small true expected risk R(W).
Structural Risk Minimization
Statistical learning theory: Vapnik & Chervonenkis
• An upper bound on the expected risk of a classification rule:
  R(W) ≤ R_emp(W) + √( (h [log(2N/h) + 1] − log α) / N )
  where N is the number of training samples and h is the VC dimension of the class of functions.
• SRM principle: find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.
Perceptron Revisited: Linear Separators

• Binary classification can be viewed as the task of separating classes in feature space:
  f(x) = sign(wᵀx + b)
[Figure: a separating hyperplane wᵀx + b = 0, with wᵀx + b > 0 on one side and wᵀx + b < 0 on the other.]
Linear Separators

• Which of the linear separators is optimal?
Classification Margin
• The distance from an example xi to the separator is
  r = (wᵀxi + b) / ||w||
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors.
[Figure: separating hyperplane with margin ρ and point-to-hyperplane distance r.]
Maximum Margin Classification
• Maximizing the margin is good according to intuition and PAC theory.
• It implies that only support vectors matter; other training examples are ignorable.
Linear SVM Mathematically

• Let the training set {(xi, yi)}, i = 1..n, xi ∈ Rᵈ, yi ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):
  wᵀxi + b ≤ −ρ/2  if yi = −1
  wᵀxi + b ≥  ρ/2  if yi = 1
  which is equivalent to yi(wᵀxi + b) ≥ ρ/2.
• For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is
  r = ys(wᵀxs + b) / ||w|| = 1 / ||w||
• Then the margin can be expressed through the (rescaled) w and b as:
  ρ = 2r = 2 / ||w||
[Figure: a one-dimensional example — targets y (= d) of ±1 plotted against feature X1 ∈ [1, 5], with the linear discriminant g(x, w, b) = −2x + 5.]
Linear SVMs Mathematically (cont.)

• Then we can formulate the quadratic optimization problem:
  Find w and b such that ρ = 2/||w|| is maximized,
  and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1.
• Which can be reformulated as:
  Find w and b such that Φ(w) = ||w||² = wᵀw is minimized,
  and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1.
Solving the Optimization Problem
Find w and b such that Φ(w) = wᵀw is minimized,
and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1.

• Need to optimize a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
• The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every inequality constraint of the primal (original) problem:
  Find α1 … αn such that
  Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiᵀxj is maximized and
  (1) Σ αi yi = 0
  (2) αi ≥ 0 for all αi
The Optimization Problem Solution

• Given a solution α1 … αn to the dual problem, the solution to the primal is:
  w = Σ αi yi xi
  b = yk − Σ αi yi xiᵀxk   for any αk > 0
• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function is (note that we do not need w explicitly):
  f(x) = Σ αi yi xiᵀx + b
• Notice that it relies on an inner product between the test point x and the support vectors xi.
• Also keep in mind that solving the optimization problem involved computing the inner products xiᵀxj between all training points.
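As an illustration only (an assumption, not the lecture's code), a minimal scikit-learn sketch that fits a nearly hard-margin linear SVM on toy points and reads off the quantities above: the support vectors, the products αi·yi, w = Σ αi yi xi, and b.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 2.0]])
    y = np.array([-1, -1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C approximates a hard margin
    alpha_y = clf.dual_coef_[0]                      # these are alpha_i * y_i
    w = alpha_y @ clf.support_vectors_               # w = sum_i alpha_i y_i x_i
    b = clf.intercept_[0]
    print("support vectors:", clf.support_vectors_)
    print("w from dual:", w, " w from sklearn:", clf.coef_[0], " b:", b)
    print("margin 2/||w|| =", 2 / np.linalg.norm(w))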
Soft Margin Classification


• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
[Figure: a soft-margin separator with slack variables ξi for points lying inside the margin or on the wrong side.]
Soft Margin Classification Mathematically

• The old formulation:
  Find w and b such that Φ(w) = wᵀw is minimized,
  and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1.
• The modified formulation incorporates slack variables:
  Find w and b such that Φ(w) = wᵀw + C Σ ξi is minimized,
  and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0.
• The parameter C can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
Soft Margin Classification – Solution

• The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables, C Σ ξi², were used in the primal objective; then we would need additional Lagrange multipliers for the slack variables):
  Find α1 … αN such that
  Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiᵀxj is maximized and
  (1) Σ αi yi = 0
  (2) 0 ≤ αi ≤ C for all αi
• Again, the xi with non-zero αi are the support vectors.
• The solution to the dual problem is:
  w = Σ αi yi xi
  b = yk(1 − ξk) − Σ αi yi xiᵀxk   for any k such that αk > 0
• Again, we do not need to compute w explicitly for classification:
  f(x) = Σ αi yi xiᵀx + b
SVM Boundaries with different C
Find w and b such that Φ(w) = wᵀw + C Σ ξi is minimized,
and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0.

Find α1 … αN such that Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiᵀxj is maximized and
(1) Σ αi yi = 0
(2) 0 ≤ αi ≤ C for all αi

[Figure: decision boundaries obtained with different values of C.]
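As a brief illustration (an assumption, not from the slides), a scikit-learn sketch of how C affects the soft margin; the two Gaussian blobs and the C values are arbitrary toy choices.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2.5, 1, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    for C in [0.01, 1.0, 100.0]:
        clf = SVC(kernel="linear", C=C).fit(X, y)
        margin = 2 / np.linalg.norm(clf.coef_[0])
        print(f"C={C:7.2f}  margin={margin:.3f}  #support vectors={len(clf.support_)}")
    # A smaller C typically yields a wider margin with more support vectors;
    # a larger C yields a narrower margin that fits the training data more tightly.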
Theoretical Justification for Maximum Margins

• Vapnik has proved the following:
  The class of optimal linear separators has VC dimension h bounded from above as
  h ≤ min( ⌈D²/ρ²⌉, m0 ) + 1
  where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.
• Intuitively, this implies that regardless of the dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ.
• Thus, the complexity of the classifier is kept small regardless of dimensionality.
Linear SVMs: Overview




• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors, i.e., those with non-zero Lagrange multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  Find α1 … αN such that Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiᵀxj is maximized and
  (1) Σ αi yi = 0
  (2) 0 ≤ αi ≤ C for all αi
  f(x) = Σ αi yi xiᵀx + b
Non-linear SVMs

• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about mapping the data to a higher-dimensional space?
[Figure: a one-dimensional dataset on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²).]
Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
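As an aside (not from the slides), a minimal scikit-learn sketch of this idea using an explicit feature map; the data, labels, and the map φ(x) = (x, x²) are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC

    # A 1-D dataset that is not linearly separable in x (inner vs. outer points)
    x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
    y = np.array([1, 1, -1, -1, -1, 1, 1])

    phi = np.column_stack([x, x**2])                 # explicit feature map phi(x) = (x, x^2)
    clf = SVC(kernel="linear", C=1e6).fit(phi, y)
    print("training accuracy:", clf.score(phi, y))   # 1.0: separable after the mapping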
The “Kernel Trick”
What Functions are Kernels?
• A symmetric function K(xi, xj) is a valid kernel if it corresponds to an inner product in some feature space; by Mercer's theorem, this holds if and only if the Gram matrix

  K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) |
      | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) |
      | …         …         …         …  …        |
      | K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) |

  is positive semi-definite for every finite set of points x1, ..., xn.
Examples of Kernel Functions
Common examples include:
• Linear kernel: K(xi, xj) = xiᵀxj
• Polynomial kernel of power p: K(xi, xj) = (1 + xiᵀxj)^p
• Gaussian (RBF) kernel: K(xi, xj) = exp(−||xi − xj||² / (2σ²))
Non-linear SVMs Mathematically

• Dual problem formulation:
  Find α1 … αn such that
  Q(α) = Σαi − ½ ΣΣ αi αj yi yj K(xi, xj) is maximized and
  (1) Σ αi yi = 0
  (2) αi ≥ 0 for all αi
• The solution is:
  f(x) = Σ αi yi K(xi, x) + b
• The optimization techniques for finding the αi's remain the same!
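As an illustration only (an assumption, not from the slides), a scikit-learn sketch showing that the same machinery solves a non-linear problem once the inner product is replaced by a kernel; the circular toy data and the gamma value are arbitrary choices.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # circular decision boundary

    linear = SVC(kernel="linear").fit(X, y)
    rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)            # K(xi,xj) = exp(-gamma*||xi-xj||^2)
    print("linear kernel accuracy:", linear.score(X, y))
    print("RBF kernel accuracy:   ", rbf.score(X, y))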
Examples of Kernel Functions
SVM Examples
Key Points
• Learning depends only on dot products of sample pairs.
• This exclusive reliance on dot products enables the approach to handle non-linearly separable problems (via kernels).
• The classifier depends only on the support vectors, not on all the training points.
• A maximum margin lowers the hypothesis variance.
• The optimal classifier is defined uniquely: there are no "local maxima" in the search space.
• Training is polynomial in the number of data points and the dimensionality.
The Structure of the SVM from the Point of View of the MLP

A support vector machine maps the input space into a high-dimensional feature space and then constructs an optimal hyperplane in that feature space.
Two Mappings for Pattern Classification
Architecture of SVM
SVMs for Multi-class Classification Problems
Two task decomposition methods:
• One-versus-rest
• One-versus-one
One-Versus-Rest method

• This method requires one classifier per category. The i-th SVM is trained with all of the examples of the i-th class as positive labels and all other examples as negative labels.
[Figure: a K-class problem decomposed into K binary SVMs — SVM 1 recognizes Category 1, SVM 2 recognizes Category 2, ..., SVM K recognizes Category K.]
• The number of training data for each classifier is N.
One-Versus-One Method

• This method constructs K(K−1)/2 classifiers, each trained on data from two out of the K classes.
[Figure: a K-class problem decomposed into pairwise SVMs — SVM 1,2; SVM 1,3; ...; SVM 1,K; SVM 2,3; SVM 2,4; ...; SVM 2,K; ...; SVM K−1,K — whose outputs are combined by voting ("max wins").]
• On average, the number of training data for each classifier is 2N/K.
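As an illustration only (an assumption, not from the slides), a scikit-learn sketch of the two decomposition schemes on a 3-class toy problem; OneVsRestClassifier builds K binary SVMs and OneVsOneClassifier builds K(K−1)/2 of them and combines their votes.

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 30)

    ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
    ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
    print("one-versus-rest:", len(ovr.estimators_), "classifiers, accuracy", ovr.score(X, y))
    print("one-versus-one: ", len(ovo.estimators_), "classifiers, accuracy", ovo.score(X, y))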
SVM software packages

• LibSVM
  http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  Chih-Chung Chang and Chih-Jen Lin
• SVMlight
  http://svmlight.joachims.org/
  Thorsten Joachims
LibSVM

• Various language versions
  C++, C#, Java, MATLAB, etc.
• The C++ version is recommended:
  the source code is readable and the interface is clear.
LibSVM

• Two executable files:
  Train.exe — compiled from svm.cpp, svm.h, and svm-train.c
  Test.exe — compiled from svm.cpp, svm.h, and svm-predict.c
LibSVM

• Description of svmtrain.exe
  "One versus one" is implemented as the solution to the multi-class problem.
  Several frequently used parameters:
  -s : SVM type (0 for classification)
  -t : kernel type (2 for the RBF kernel)
  -g : gamma value
  -c : penalty cost
  e.g.,
  svmtrain -s 0 -t 2 -g 0.5 -c 2 train_file model_file
LibSVM

• Description of svmpredict.exe
  e.g.,
  svmpredict test_file model_file result_file
LibSVM

• If you want to directly modify the source code to do your homework:
  The source code has several interface functions, and you can write code to call these functions:
  svm_train(), svm_predict_values(), svm_save_model(), ...
  This is not recommended unless you have a strong understanding of SVMs.
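The slide above refers to LibSVM's C interface functions. As an illustration only (an assumption, not part of the lecture), here is a minimal sketch using LibSVM's bundled Python interface (svmutil); the file names train_file, test_file, model_file and the exact import path depend on how LibSVM is installed.

    # Sketch assuming the 'libsvm' Python package; older installs use 'from svmutil import ...'
    from libsvm.svmutil import svm_read_problem, svm_train, svm_predict, svm_save_model

    y, x = svm_read_problem("train_file")                 # data in LibSVM format
    model = svm_train(y, x, "-s 0 -t 2 -g 0.5 -c 2")      # same options as svmtrain.exe
    svm_save_model("model_file", model)

    y_test, x_test = svm_read_problem("test_file")
    p_labels, p_acc, p_vals = svm_predict(y_test, x_test, model)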
Thank you! See you next week!