Neural Network Theory and Applications (神经网络理论与应用), Lecture 4
Instructors: 吕宝粮, 郑伟龙
Teaching assistants: 马天放, 刘佳雯, 姜卫邦, 蓝宇霆
Department of Computer Science and Engineering, Shanghai Jiao Tong University
bllu@sjtu.edu.cn   weilong@sjtu.edu.cn   http://bcmi.sjtu.edu.cn
March 9, 2022

Radial-Basis Function (RBF) Networks
A radial-basis function is a scalar function that is symmetric along the radial direction.

Radial-Basis Function (RBF) Network
[Figure: RBF network architecture with inputs x1, ..., xm, a hidden layer of m1 radial-basis functions φ plus a bias unit φ = 1 with weight w0 = b, and weights w1, ..., wm1 connecting the hidden units to a single output node.]

Multilayer Perceptron with Two Hidden Layers
[Figure: the input signal (stimulus) enters the input layer, passes through the first and second hidden layers, and leaves the output layer as the output signal (response).]

Introduction to RBF Networks
A basic radial-basis function (RBF) network consists of three layers having entirely different roles:
1. The input layer is made up of source nodes (sensory units).
2. The hidden layer applies a nonlinear transformation from the input space to the hidden space. RBF networks have only one, often high-dimensional, hidden layer.
3. The output layer is linear.
The hidden space is usually chosen to be high-dimensional for two reasons:
1. Pattern vectors are more likely to be linearly separable in a high-dimensional space.
2. The more hidden units there are, the greater the representational ability of the network.

Radial Basis Function
f(x) = Σ_{i=1}^{m} wi φi(x),   where φi(x) = φ(||x − xi||)
A radial basis function is specified by three things:
- Center: xi
- Distance measure: r = ||x − xi||
- Shape: φ

Typical Radial Functions
Gaussian: φ(r) = exp(−r²/(2σ²)), σ > 0 and r ∈ ℝ
Multiquadric: φ(r) = √(r² + c²), c > 0 and r ∈ ℝ
Inverse multiquadric: φ(r) = c/√(r² + c²), c > 0 and r ∈ ℝ

Gaussian Basis Function
[Figure: the Gaussian basis function φ(r) = exp(−r²/(2σ²)) plotted for σ = 0.5, 1.0, and 1.5.]

Inverse Multiquadrics
[Figure: the inverse multiquadric φ(r) = c/√(r² + c²) plotted over r ∈ [−10, 10] for c = 1, 2, 3, 4, 5.]

Cover's Theorem
Consider the use of an RBF network for a complex pattern-classification task. The problem is basically solved by transforming it into a high-dimensional space in a nonlinear manner. The justification is Cover's theorem on the separability of patterns:
A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Once the patterns are linearly separable, the classification problem is fairly easy to solve.

Cover's Theorem -2
Consider a family of surfaces (for example, hyperplanes), each of which divides an input space into two regions. Let ℋ = {x1, x2, ..., xN} be a set of N pattern vectors, each belonging to one of the two classes ℋ1 or ℋ2. This kind of binary partition is called a dichotomy. A dichotomy is called separable with respect to a family of surfaces if there exists a surface in the family that separates the points in class ℋ1 from those in class ℋ2.
Let φ1(x), φ2(x), ..., φm1(x) be a set of m1 real-valued functions. Using these functions, we can define for each pattern x ∈ ℋ the vector
f(x) = [φ1(x), φ2(x), ..., φm1(x)]ᵀ

Cover's Theorem -3
Assume now that x is an m0-dimensional vector. The function f(x) then maps points in the m0-dimensional input space into corresponding points in a new space of dimension m1. Each φi(x) is referred to as a hidden function, and the space spanned by the functions φ1(x), ..., φm1(x) is called the hidden space or feature space. The hidden functions play a role similar to that of the hidden units in an MLP network.

Cover's Theorem -4
A dichotomy [ℋ1, ℋ2] of ℋ is said to be φ-separable if there exists an m1-dimensional vector w satisfying
wᵀf(x) > 0 for x ∈ ℋ1,   wᵀf(x) < 0 for x ∈ ℋ2.
The hyperplane defined by the equation wᵀf(x) = 0 describes the separating surface in the hidden (φ) space. The inverse image of this hyperplane, {x : wᵀf(x) = 0}, defines the separating surface in the input space.

Cover's Theorem -5
Summarizing, Cover's theorem on the separability of patterns has two basic ingredients:
1. Nonlinear formulation of the hidden functions φi(x), i = 1, 2, ..., m1.
2. High dimensionality of the hidden space compared to the input space.
Sometimes the nonlinear mapping alone, without increasing the dimensionality, is sufficient to produce linear separability.
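To make the radial-function definitions above concrete, here is a minimal NumPy sketch; the function names and the toy centers and weights are illustrative assumptions, not part of the lecture. It evaluates the Gaussian and multiquadric radial functions and the RBF network output f(x) = Σ wi φ(||x − xi||).

import numpy as np

def gaussian(r, sigma=1.0):
    # Gaussian radial function: phi(r) = exp(-r^2 / (2 sigma^2))
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    # Multiquadric radial function: phi(r) = sqrt(r^2 + c^2)
    return np.sqrt(r**2 + c**2)

def rbf_output(x, centers, weights, bias=0.0, sigma=1.0):
    # f(x) = b + sum_i w_i * phi(||x - x_i||), using the Gaussian phi
    r = np.linalg.norm(centers - x, axis=1)  # distances ||x - x_i|| to all centers
    return bias + weights @ gaussian(r, sigma)

# Toy example: two centers in the plane and arbitrary weights
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, -0.3])
print(rbf_output(np.array([0.5, 0.5]), centers, weights))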
4: 12 Cover’s Theorem -4 A dichotomy [ℋ1 , ℋ2 ] of ℋ is said to be φ - separable if there exists an 𝑚𝑚1 -dimensional vector w satisfying the condition 𝐰𝐰 𝑇𝑇 𝐟𝐟(𝐱𝐱) > 0, 𝐰𝐰 𝑇𝑇 𝐟𝐟(𝐱𝐱) > 0, 𝐱𝐱 ∈ ℋ1 𝐱𝐱 ∈ ℋ2 The hyperplane defined by the equation 𝐰𝐰 𝑇𝑇 𝐟𝐟(𝐱𝐱) = 0 describes the separating surface in the hidden φ-space. 4: 13 The inverse image of this subspace, that is, 𝐱𝐱: 𝐰𝐰 𝑇𝑇 𝐟𝐟(𝐱𝐱) = 0 Defines the separating surface in the input space. Cover’s Theorem -5 Summarizing, Cover’s theorem on the separability of patterns has two basic ingredients: 1. Nonlinear formulation of the hidden functions 𝜑𝜑𝑖𝑖 (x), i = 1, 2,..., 𝑚𝑚1 . 2. High dimensionality of the hidden space compared to the input space. Sometimes the use of nonlinear mapping alone without increasing the dimensionality is sufficient for producing linear separability. 4: 14 Two Mappings for Pattern Classification 4: 15 XOR Problem For illustrating the importance of φ-separability, consider again the simple yet important XOR problem. Foour points (patterns) (1,1), (0,1), (0,0), and (1,0) in a two-dimensional input space. Requirement: construct a binary classifier with output: - 0 for the inputs (1,1) or (0,0) - 1 for the inputs (1,0) or (0,1). Recall that the XOR problem is not linearly separable in the original input space. Define a pair of Gaussian hidden functions 𝜑𝜑1 (𝐱𝐱) = exp(−||𝐱𝐱 − t1 ||2 ), 𝜑𝜑2 (𝐱𝐱) = exp(−||𝐱𝐱 − t 2 ||2 ), 4: 16 t1 = [1,1]𝑇𝑇 t 2 = [0,0]𝑇𝑇 Solution of the XOR Problem 4: 17 Approximation Properties Multilayer perceptrons have the universal approximation property. Also the family of RBF networks can uniformly approximate any continuous function on a compact set. Formally, let G : ℛ 𝑛𝑛 → ℛ be integrable, continuous, and bounded function satisfying the condition � 𝐺𝐺(𝐱𝐱)𝑑𝑑𝐱𝐱 ≠ 𝟎𝟎 ℛ 𝑛𝑛 Let ℱ𝐺𝐺 denote the family of RBF networks consisting of functions F : ℛ 𝑛𝑛 → ℛ 𝑚𝑚1 4: 18 𝐱𝐱 − 𝐭𝐭 𝑖𝑖 ) 𝐹𝐹(𝐱𝐱) = � 𝜔𝜔𝑖𝑖 𝐺𝐺( 𝜎𝜎 𝑖𝑖=1 Here 𝜎𝜎 > 0, 𝜔𝜔𝑖𝑖 ∈ ℛ and 𝐭𝐭 𝑖𝑖 ∈ ℛ 𝑛𝑛 for i=1, 2,..., 𝑚𝑚1 . Approximation Properties -2 The universal approximation theorem for RBF networks: For any continuous input-output mapping function f(x) there is an RBF network with a set of centers 𝐭𝐭 𝑖𝑖 , i=1, 2,..., 𝑚𝑚1 and a common width 𝜎𝜎 such that the input-output mapping functions F(x) realized by the RBF network is close to f(x) in the 𝐿𝐿𝑝𝑝 norm 𝑝𝑝 ∈ [1, ∞). Note that the kernel G : ℛ 𝑛𝑛 → ℛ need not be radially symmetric. The theorem provides a theoretical basis for using RBF networks in practical applications. 4: 19 Learning Strategies In RBF networks, learning proceeds differently for different layer. The linear output layer’s weights are learned rapidly through a linear optimization strategy. The hidden layer’s activation functions evolve slowly using some nonlinear optimization strategy. The layers of a RBF network perform different tasks. It is reasonable to use different optimization techniques for the hidden and output layers. Learning strategies for the RBF networks differ in the method used for specifying the centers of the RBF network. 4: 20 What to Learn? y1 yl w11 w12 w1m wl1 wl2 φ2 φ1 x1 4: 21 wlm φm x2 xn Weights: wij’s Centers: µj’s of φj’s Widths: σj’s of φj’s Number of φj’s Model Selection Two-Stage Training y1 yl Step 2 w11 w12 w1m wl1 wl2 φ2 φ1 x1 4: 22 wlml φm x2 xn Determines wij’s. E.g., using batch-learning. Step 1 Determines Centers µj’s of φj’s. Widths σj’s of φj’s. Number of φj’s. Fixed Centers Selected at Random The simplest approach is to assume fixed radial-basis functions. 
Fixed Centers Selected at Random
The simplest approach is to assume fixed radial-basis functions. The locations of the centers may be chosen randomly from the training data set. This is a sensible approach provided that the training data are representative of the problem. The radial-basis functions are typically chosen to be isotropic Gaussian functions:
G(||x − ti||) = exp(−(m1/d²max) ||x − ti||²),   i = 1, 2, ..., m1,
where m1 is the number of centers (basis functions) and dmax is the maximum distance between the chosen centers.

Comparison of RBF with MLP
Both RBF and MLP networks are nonlinear layered networks with universal approximation properties. The most important differences between them are:
1. An RBF network has a single hidden layer, while an MLP can have several hidden layers.
2. The computational nodes of an MLP are similar across layers, while in an RBF network the hidden and output nodes are quite different.

Comparison of RBF with MLP -2
3. In an RBF network the output layer is linear, while in an MLP it is usually nonlinear.
4. In each hidden node, the activation function of an RBF network computes a Euclidean distance between the input vector and the center, while in an MLP an inner product between the input and the weight vector is computed.
5. MLPs construct global approximations to nonlinear input-output mappings, while RBF networks construct local approximations.
An MLP may require fewer parameters than an RBF network to achieve the same accuracy.

From the Perceptron to Support Vector Machines
[Timeline figure spanning roughly 50 years (1949-2006): Hebbian learning rule, Perceptron, least-mean-square learning algorithm, Cover's theorem, Neocognitron (Fukushima), SOM (Kohonen), RBF networks, BP algorithm, CNN (LeCun), and support vector machines.]

Support Vector Machines versus Multilayer Perceptrons
Drawbacks of the MLP:
- The BP algorithm is not guaranteed to converge.
- The learning rate has to be found by trial and error.
- The number of hidden neurons also has to be found by trial and error.
Advantages of the SVM:
- The quadratic-programming algorithm is guaranteed to converge.
- Quadratic-programming solvers are efficient.
- The support vectors are determined by the algorithm itself.

Support Vector Machine

Empirical Risk
We want to estimate a function F(X, W) using training data T = {(Xi, di)}_{i=1}^{N}.
Loss between the desired response and the actual response:
L(d, F(X, W)) = (d − F(X, W))²
Expected risk (risk functional):
R(W) = ½ ∫ L(d, F(X; W)) dF_{X,D}(X, d)
Empirical risk (empirical risk functional):
Remp(W) = (1/N) Σ_{i=1}^{N} L(di, F(Xi, W))

Empirical Risk Minimization Principle
The true expected risk is approximated by the empirical risk
Remp(W) = (1/N) Σ_{i=1}^{N} L(di, F(Xi, W)).
Learning based on the empirical risk minimization principle is defined as
W* = argmin_W Remp(W).
Examples of such algorithms: Perceptron, back-propagation, etc.

Overfitting and Underfitting
Problem: how rich a class of classifiers F(X, W) to use.
[Figure: three fits of the same data illustrating underfitting, a good fit, and overfitting.]
Problem of generalization: a small empirical risk Remp(W) does not imply a small true expected risk R(W).

Structural Risk Minimization
Statistical learning theory: Vapnik & Chervonenkis. An upper bound on the expected risk of a classification rule:
R(W) ≤ Remp(W) + √( (h[log(2N/h) + 1] − log(α)) / N ),
where N is the number of training examples, h is the VC dimension of the class of functions, and α is a confidence parameter.
SRM principle: find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.

Perceptron Revisited: Linear Separators
Binary classification can be viewed as the task of separating classes in feature space:
f(x) = sign(wᵀx + b)
[Figure: a separating hyperplane wᵀx + b = 0, with wᵀx + b > 0 on one side and wᵀx + b < 0 on the other.]

Linear Separators
Which of the linear separators is optimal?
[Figure: several hyperplanes that all separate the two classes.]

Classification Margin
The distance from an example xi to the separator is
r = (wᵀxi + b) / ||w||.
Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the distance between the support vectors.
[Figure: the margin ρ and the distance r of a point from the separating hyperplane.]
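A quick numerical check of these definitions; the hyperplane and the points below are arbitrary illustrative choices, not taken from the lecture. The sketch computes the signed distance r = (wᵀx + b)/||w|| of each example from a separator and the resulting margin, taken as the distance between the closest examples on either side.

import numpy as np

# An arbitrary separating hyperplane w^T x + b = 0 and a few labeled points
w = np.array([2.0, -1.0])
b = -1.0
X = np.array([[2.0, 1.0], [3.0, 0.0], [0.0, 1.0], [-1.0, 2.0]])
y = np.array([1, 1, -1, -1])

# Signed distance of each example from the separator: r = (w^T x + b) / ||w||
r = (X @ w + b) / np.linalg.norm(w)
print("signed distances:", r)
print("correctly classified:", np.all(np.sign(X @ w + b) == y))

# Margin of this separator: distance between the closest examples on either side
rho = r[y == 1].min() + (-r[y == -1]).min()
print("margin:", rho)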
Maximum Margin Classification
Maximizing the margin is good according to intuition and PAC theory. It implies that only the support vectors matter; the other training examples are ignorable.

Linear SVM Mathematically
Let the training set {(xi, yi)}_{i=1..n}, xi ∈ ℝᵈ, yi ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):
wᵀxi + b ≤ −ρ/2 if yi = −1
wᵀxi + b ≥ ρ/2 if yi = 1
or equivalently yi(wᵀxi + b) ≥ ρ/2.
For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is
r = ys(wᵀxs + b)/||w|| = 1/||w||.
The margin can then be expressed through the (rescaled) w and b as
ρ = 2r = 2/||w||.

[Figure: a one-dimensional example with feature x1 on the horizontal axis, target y (= d) on the vertical axis, and decision function g(x, w, b) = −2x + 5.]

Linear SVMs Mathematically (cont.)
We can then formulate the quadratic optimization problem:
Find w and b such that ρ = 2/||w|| is maximized and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1.
This can be reformulated as:
Find w and b such that Φ(w) = ||w||² = wᵀw is minimized and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1.

Solving the Optimization Problem
Find w and b such that Φ(w) = wᵀw is minimized and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1.
We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every inequality constraint of the primal (original) problem:
Find α1 ... αn such that Q(α) = Σαi − ½ΣΣ αiαj yiyj xiᵀxj is maximized and
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi

The Optimization Problem Solution
Given a solution α1 ... αn to the dual problem, the solution to the primal is:
w = Σ αiyixi
b = yk − Σ αiyixiᵀxk for any αk > 0
Each non-zero αi indicates that the corresponding xi is a support vector. The classifying function is then (note that we do not need w explicitly):
f(x) = Σ αiyixiᵀx + b
Notice that it relies on an inner product between the test point x and the support vectors xi. Also keep in mind that solving the optimization problem involved computing the inner products xiᵀxj between all training points.

Soft Margin Classification
What if the training set is not linearly separable? Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
[Figure: a soft-margin separator with margin-violating points and their slack variables ξi.]

Soft Margin Classification Mathematically
The old formulation:
Find w and b such that Φ(w) = wᵀw is minimized and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1.
The modified formulation incorporates slack variables:
Find w and b such that Φ(w) = wᵀw + C Σξi is minimized and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0.
The parameter C can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin against fitting the training data.

Soft Margin Classification – Solution
The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for the slack variables, C Σξi², were used in the primal objective; then we would need additional Lagrange multipliers for the slack variables):
Find α1 ... αN such that Q(α) = Σαi − ½ΣΣ αiαj yiyj xiᵀxj is maximized and
(1) Σ αiyi = 0
(2) 0 ≤ αi ≤ C for all αi
Again, the xi with non-zero αi are support vectors. The solution to the dual problem is:
w = Σ αiyixi
b = yk(1 − ξk) − Σ αiyixiᵀxk for any k such that αk > 0
Again, we do not need to compute w explicitly for classification:
f(x) = Σ αiyixiᵀx + b
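The following sketch trains a soft-margin linear SVM with scikit-learn (not a package used in the lecture; it is assumed here purely for illustration, and the toy data are made up) and inspects the quantities discussed above: the support vectors, the dual coefficients αi yi, and w and b.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two slightly overlapping Gaussian blobs as a toy two-class problem
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"w={clf.coef_[0].round(2)}, b={clf.intercept_[0]:.2f}")
    # clf.dual_coef_ holds the products alpha_i * y_i for the support vectors

Varying C shows the trade-off described above: a small C tolerates more margin violations (and keeps more support vectors), while a large C fits the training data more tightly.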
SVM Boundaries with Different C
[Figure: decision boundaries obtained with different values of C for the soft-margin primal and dual formulations given above.]

Theoretical Justification for Maximum Margins
Vapnik has proved the following: the class of optimal linear separators has VC dimension h bounded from above as
h ≤ min(⌈D²/ρ²⌉, m0) + 1,
where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.
Intuitively, this implies that regardless of the dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ. Thus, the complexity of the classifier is kept small regardless of dimensionality.

Linear SVMs: Overview
The classifier is a separating hyperplane. The most "important" training points are the support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points xi are support vectors, i.e., have non-zero Lagrange multipliers αi. Both in the dual formulation of the problem and in the solution, the training points appear only inside inner products:
Find α1 ... αN such that Q(α) = Σαi − ½ΣΣ αiαj yiyj xiᵀxj is maximized and (1) Σαiyi = 0, (2) 0 ≤ αi ≤ C for all αi
f(x) = Σ αiyixiᵀx + b

Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space?
[Figure: a one-dimensional dataset on the x-axis that is not linearly separable becomes separable after mapping each point x to (x, x²).]

Non-linear SVMs: Feature Spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
[Figure: a nonlinear mapping Φ from the input space to a feature space in which the classes become linearly separable.]

The "Kernel Trick"
A kernel function computes the inner product of two points in the feature space directly from the original inputs, K(xi, xj) = φ(xi)ᵀφ(xj), so the mapping φ never has to be carried out explicitly.

What Functions Are Kernels?
The n × n kernel (Gram) matrix on the training points is
K = [ K(x1,x1)  K(x1,x2)  ...  K(x1,xn)
      K(x2,x1)  K(x2,x2)  ...  K(x2,xn)
      ...
      K(xn,x1)  K(xn,x2)  ...  K(xn,xn) ]
A function K(·,·) is a valid kernel if this matrix is symmetric and positive semi-definite for every choice of training points (Mercer's condition).

Examples of Kernel Functions
[Slide: examples of kernel functions, such as the linear, polynomial, and Gaussian (RBF) kernels.]

Non-linear SVMs Mathematically
Dual problem formulation:
Find α1 ... αn such that Q(α) = Σαi − ½ΣΣ αiαj yiyj K(xi, xj) is maximized and
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi
The solution is:
f(x) = Σ αiyi K(xi, x) + b
The optimization techniques for finding the αi's remain the same.

Examples of Kernel Functions
[Slide: further examples of kernel functions.]

SVM Examples
[Figure: example decision boundaries produced by SVMs.]

Key Points
Learning depends only on dot products of sample pairs, and this exclusive reliance on dot products makes the approach applicable to non-linearly separable problems. The classifier depends only on the support vectors, not on all the training points. The maximum margin lowers the variance of the hypothesis. The optimal classifier is defined uniquely: there are no "local maxima" in the search space. Training is polynomial in the number of data points and in the dimensionality.

SVM Structure from the Point of View of the MLP
A support vector machine maps the input space into a high-dimensional feature space and then constructs an optimal hyperplane in that feature space.

Two Mappings for Pattern Classification
[Figure: two mappings for pattern classification.]

Architecture of SVM
[Figure: the architecture of a support vector machine.]
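Here is a small sketch of the non-linear decision function above; scikit-learn, the Gaussian kernel parameterization, and the toy data are assumptions made for illustration, not part of the lecture. It fits an RBF-kernel SVM and then recomputes f(x) = Σ αi yi K(xi, x) + b by hand from the fitted support vectors, which should reproduce the library's decision values.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# A toy problem that is not linearly separable: class +1 inside a ring, -1 outside
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.2, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

def rbf_kernel(A, B, gamma):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Recompute f(x) = sum_i alpha_i y_i K(x_i, x) + b from the fitted support vectors
x_test = np.array([[0.0, 0.0], [2.0, 2.0]])
K = rbf_kernel(clf.support_vectors_, x_test, gamma)
f_manual = clf.dual_coef_ @ K + clf.intercept_
print(f_manual.ravel())                  # manual decision values
print(clf.decision_function(x_test))     # should match the values above

Because only the support vectors enter this sum, the kernel never has to be evaluated against the full training set at prediction time.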
SVMs for Multi-class Classification Problems
Two task decomposition methods:
- One-versus-rest
- One-versus-one

One-Versus-Rest Method
This method requires one classifier per category. The i-th SVM is trained with all of the examples of the i-th class given positive labels and all other examples given negative labels.
[Figure: a K-class problem decomposed into SVM 1, SVM 2, ..., SVM K, one per category.]
The number of training data for each classifier is N.

One-Versus-One Method
This method constructs K(K − 1)/2 classifiers, each trained on data from two out of the K classes.
[Figure: classifiers SVM 1,2, SVM 1,3, ..., SVM 1,K, SVM 2,3, ..., SVM K−1,K, whose outputs are combined by max-wins voting.]
On average, the number of data for training each classifier is 2N/K.

SVM Software Packages
LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (Chih-Chung Chang and Chih-Jen Lin)
SVMlight: http://svmlight.joachims.org/ (Thorsten Joachims)

LibSVM
Versions in various languages: C++, C#, Java, MATLAB, etc. The C++ version is recommended: the source code is readable and the interface is clear.

LibSVM
Two executable files:
- svmtrain.exe, compiled from svm.cpp, svm.h, and svm-train.c
- svmpredict.exe, compiled from svm.cpp, svm.h, and svm-predict.c

LibSVM
Description of svmtrain.exe: "one versus one" is implemented as the solution to the multi-class problem. Several frequently used parameters:
-s : SVM type (0 for classification)
-t : kernel type (2 for the RBF kernel)
-g : gamma value
-c : penalty cost
e.g., svmtrain -s 0 -t 2 -g 0.5 -c 2 train_file model_file

LibSVM
Description of svmpredict.exe:
e.g., svmpredict test_file model_file result_file

LibSVM
If you want to modify the source code directly and do your homework that way: the source code has several interface functions, and you can write code that calls them, e.g., svm_train(), svm_predict_values(), svm_save_model(), ... This is not recommended unless you have a strong understanding of SVMs.

Thank you! See you next week!