Supplementary Material
S1. Support Vector Regression
Consider a set of data $G = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a vector of the model inputs, $y_i$ is the actual value representing the corresponding scalar output, and $n$ is the total number of data patterns. The objective of the regression analysis is to determine a function $f(x)$ that accurately predicts the desired outputs ($y$). Thus, the typical regression function can be formulated as $y_i = f(x_i) + \delta$, where $\delta$ is the random error with distribution $N(0, \sigma^2)$. Regression problems can be classified as linear or nonlinear. Because the nonlinear case is more difficult to deal with, support vector regression (SVR) was mainly developed for tackling nonlinear regression problems (Lu et al., 2009). In SVR, the nonlinear regression problem in the lower-dimensional input space ($x$) is transformed into a linear regression problem in a high-dimensional feature space ($F$). As a result, the SVR formalism considers the following linear estimation function (Vapnik, 1995):
f  x   w  φ  x   b ,
(S1)
where $w$ is the weight vector, $b$ is a constant, $\varphi(x)$ denotes a mapping function into the feature space, and $w \cdot \varphi(x)$ describes the dot product in the feature space $F$. Moreover, the weight vector $w$ and the constant $b$ are estimated by minimizing the following regularized risk function (Vapnik, 1998; 1999):

$$ R_C = C\,\frac{1}{n}\sum_{i=1}^{n} L(y, f) + \frac{1}{2}\|w\|^2, \qquad \text{(S2)} $$
where $C$ is a regularization constant determining the trade-off between model flatness and training error; $\frac{1}{2}\|w\|^2$ is a regularizer term that controls the model complexity; and $L(y, f)$ adopts the $\varepsilon$-insensitive loss function, defined as follows:

$$ L(y, f) = \begin{cases} |y - f| - \varepsilon, & |y - f| \geq \varepsilon, \\ 0, & \text{otherwise}, \end{cases} \qquad \text{(S3)} $$

where $\varepsilon$ ($\varepsilon \geq 0$) is a precision parameter indicating that errors smaller than $\varepsilon$ are not taken into consideration.
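For illustration, the sketch below evaluates the $\varepsilon$-insensitive loss (S3) and the empirical regularized risk (S2) in NumPy; the function names and the default values of $C$ and $\varepsilon$ are hypothetical choices, not values from this study.

```python
# A minimal NumPy sketch of the epsilon-insensitive loss (S3) and the
# regularized risk (S2); names and default parameter values are
# illustrative only.
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    """L(y, f) = |y - f| - eps when |y - f| >= eps, otherwise 0."""
    return np.maximum(np.abs(np.asarray(y) - np.asarray(f)) - eps, 0.0)

def regularized_risk(w, y, f, C=1.0, eps=0.1):
    """R_C = C * (1/n) * sum_i L(y_i, f_i) + 0.5 * ||w||^2."""
    return C * eps_insensitive_loss(y, f, eps).mean() + 0.5 * np.dot(w, w)

# Errors smaller than eps contribute nothing to the risk, e.g.
# eps_insensitive_loss([1.0, 2.0], [1.05, 2.5], eps=0.1) -> [0.0, 0.4]
```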
SVR introduces the slack variables $\xi$ and $\xi^*$, which turn equation (S2) into the following constrained problem (Wang and Xu, 2004):

$$ \min\; R(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right), \qquad \text{(S4a)} $$

subject to

$$ w \cdot \varphi(x_i) + b - y_i \leq \varepsilon + \xi_i, \qquad \text{(S4b)} $$

$$ y_i - w \cdot \varphi(x_i) - b \leq \varepsilon + \xi_i^*, \qquad \text{(S4c)} $$

$$ \xi_i,\ \xi_i^* \geq 0, \qquad \text{(S4d)} $$
where  i and  i are slack variables representing upper and lower constraints on the outputs of
the system. By using Lagrangian multipliers (i.e.  i and  i ) and Karush Kuhn Tucker
conditions to model (S4), it thus yield the following dual Lagrangian form (Lu et al., 2009):
max Q  i ,  i      i   i    yi  i   i 
n
n
i 1
i 1
1 n n
    i   i   j   j  K  xi , x j ,
2 i 1 j 1
(S5a)
subject to
n
n
i 1
i 1

 i   i ,
(S5b)
0  i , i  C, i.
(S5c)
The Lagrangian multipliers ($\alpha_i$ and $\alpha_i^*$) in model (S5) are then calculated, and the optimal weight vector of the regression hyperplane is $w^* = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right)\varphi(x_i)$. Thus, the solution of the SVR approach can be expressed in terms of kernel functions as follows (Vapnik, 1995):

$$ f(x, \alpha, \alpha^*) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right)K(x, x_i) + b, \qquad \text{(S6)} $$

where $K(x, x_i)$ is called the kernel function. The value of the kernel equals the inner product of the two vectors $x_i$ and $x_j$ in the feature space, that is, of $\varphi(x_i)$ and $\varphi(x_j)$. Thus,

$$ K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j). \qquad \text{(S7)} $$
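Equation (S6) can be checked numerically against a fitted solver: scikit-learn exposes the combined multipliers $(\alpha_i - \alpha_i^*)$ as dual_coef_ and the offset $b$ as intercept_, so the kernel expansion can be evaluated by hand. The sketch below is a hypothetical illustration with an RBF kernel (S8c); the data and parameter values are not from this study.

```python
# A sketch verifying equation (S6): the fitted prediction equals the
# kernel expansion sum_i (alpha_i - alpha_i*) K(x, x_i) + b over the
# support vectors. Data and hyperparameters are illustrative only.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(100)

gamma = 2.0                                   # gamma = 1 / (2 * sigma^2)
model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=gamma).fit(X, y)

x_new = rng.uniform(0.0, 1.0, size=(5, 1))
K = rbf_kernel(x_new, model.support_vectors_, gamma=gamma)   # K(x, x_i)
manual = K @ model.dual_coef_.ravel() + model.intercept_     # equation (S6)
assert np.allclose(manual, model.predict(x_new))
```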
The kernel function determines the smoothness properties of the solution and should reflect prior knowledge of the data (Jeng and Chuang, 2004). Any function that satisfies Mercer's condition can be employed as a kernel function (Mercer, 1909). In the literature, the popular kernel functions in machine learning are as follows (Campbell, 2002; Tzeng, 2002):

Linear: $K(x_i, x_j) = x_i^T x_j$  (S8a)

Polynomial: $K(x_i, x_j) = \left(x_i^T x_j + t\right)^d$  (S8b)

Radial basis function (RBF): $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / \left(2\sigma^2\right)\right)$  (S8c)

Multi-layer perceptron: $K(x_i, x_j) = \tanh\left(x_i^T x_j + b\right)$  (S8d)

where $x_i$ and $x_j$ are input space vectors; $\sigma^2$ denotes the variance in the RBF kernel; $t$ is the intercept constant and $d$ is the degree in the polynomial kernel; and $b$ is a constant used only in the multi-layer perceptron kernel.
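For reference, the four kernels (S8a)-(S8d) can be written directly in NumPy as below; the default parameter values are illustrative placeholders, not values used in this work.

```python
# Minimal NumPy sketches of the kernels (S8a)-(S8d) for two input
# vectors x_i and x_j; default parameter values are placeholders.
import numpy as np

def linear_kernel(xi, xj):                       # (S8a)
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, t=1.0, d=3):       # (S8b)
    return (np.dot(xi, xj) + t) ** d

def rbf_kernel(xi, xj, sigma2=1.0):              # (S8c)
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma2))

def mlp_kernel(xi, xj, b=1.0):                   # (S8d)
    return np.tanh(np.dot(xi, xj) + b)
```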
References
Campbell, C., 2002. Kernel methods: A survey of current techniques. Neurocomputing 48, 63–84.
Jeng, J.T., Chuang, C.C., 2004. Selection of initial structures with support vector regression for fuzzy neural networks. International Journal of Fuzzy Systems 6 (2), 63–70.
Levis, A.A., Papageorgiou, L.G., 2005. Customer demand forecasting via support vector regression analysis. Chemical Engineering Research and Design 83, 1009–1018.
Lu, C.J., Lee, T.S., Chiu, C.C., 2009. Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems 47 (2), 115–125.
Mercer, J., 1909. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London A 209, 415–446.
Shieh, H.L., Kuo, C.C., 2010. A reduced data set method for support vector regression. Expert Systems with Applications 37, 7781–7787.
Tzeng, Y.H., 2002. Discussion on the efficiency of automatic classification based on file subject. Journal of China Society for Library Science 68, 62–83.
Vapnik, V.N., 1995. The nature of statistical learning theory. Springer-Verlag.
Vapnik, V.N., 1998. Statistical learning theory. John Wiley & Sons, New York.
Vapnik, V.N., 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10, 988–999.
Wang, W., Xu, Z., 2004. A heuristic training for support vector regression. Neurocomputing 61, 259–275.
S2. List of Figure Captions:
Figure S1. Samples of $x_{ij}$ by using the Latin hypercube sampling method
Figure S2. Outputs of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10
Figure S3. Training and prediction results of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10
Figure S4. Relative errors of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10
[Figure S1. Samples of $x_{ij}$ by using the Latin hypercube sampling method. Eight panels (x11, x12, x21, x22, x31, x32, x41, x42) plot Coal (tonne/day) against Sample (1-200) for the Training and Testing sets.]
[Figure S2. Outputs of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10. Each panel plots PHQ (× 10⁻³) against Sample (1-200) for the Training and Testing sets.]
[Figure S3. Training and prediction results of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10. Each panel plots PHQ (× 10⁻³) against Sample (1-200) for the Training and Testing sets.]
[Figure S4. Relative errors of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10. Each panel plots the relative error (%) of the Training and Testing sets.]