Supplementary Material

S1. Support Vector Regression

Consider a data set $G = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a vector of the model inputs, $y_i$ is the actual value representing the corresponding scalar output, and $n$ is the total number of data patterns. The objective of the regression analysis is to determine a function $f(x)$ that accurately predicts the desired outputs ($y$). Thus, the typical regression function can be formulated as $y_i = f(x_i) + \delta$, where $\delta$ is a random error with distribution $N(0, \sigma^2)$. Regression problems can be classified into linear and nonlinear problems. Because the nonlinear case is more difficult to deal with, support vector regression (SVR) was mainly developed for tackling nonlinear regression problems (Lu et al., 2009). In SVR, the nonlinear regression problem in the lower-dimensional input space ($x$) is transformed into a linear regression problem in a high-dimensional feature space ($F$). As a result, the SVR formalism considers the following linear estimation function (Vapnik, 1995):

$$f(x) = w \cdot \varphi(x) + b, \qquad (S1)$$

where $w$ is the weight vector, $b$ is a constant, $\varphi(x)$ denotes a mapping function into the feature space, and $w \cdot \varphi(x)$ describes the dot product in the feature space $F$. The weight vector $w$ and constant $b$ are estimated by minimizing the following regularized risk function (Vapnik, 1998; 1999):

$$R(C) = C\,\frac{1}{n}\sum_{i=1}^{n} L(y, f) + \frac{1}{2}\lVert w \rVert^{2}, \qquad (S2)$$

where $C$ is a regularization constant determining the trade-off between model flatness and the training error, and $\frac{1}{2}\lVert w \rVert^{2}$ is a regularizer term that controls the model complexity. $L(y, f)$ adopts the $\varepsilon$-insensitive loss function, defined as follows:

$$L(y, f) = \begin{cases} \lvert y - f \rvert - \varepsilon, & \lvert y - f \rvert \ge \varepsilon, \\ 0, & \text{otherwise}, \end{cases} \qquad (S3)$$

where $\varepsilon$ ($\varepsilon \ge 0$) is a precision parameter indicating that errors smaller than $\varepsilon$ are not taken into consideration. SVR introduces the slack variables $\xi_i$ and $\xi_i^{*}$ and recasts equation (S2) as the following constrained optimization problem (Wang and Xu, 2004):

$$\min\; R(w, \xi) = \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}), \qquad (S4a)$$

subject to

$$w \cdot \varphi(x_i) + b - y_i \le \varepsilon + \xi_i, \qquad (S4b)$$
$$y_i - w \cdot \varphi(x_i) - b \le \varepsilon + \xi_i^{*}, \qquad (S4c)$$
$$\xi_i,\, \xi_i^{*} \ge 0, \qquad (S4d)$$

where $\xi_i$ and $\xi_i^{*}$ are slack variables representing the upper and lower constraints on the outputs of the system. Applying Lagrange multipliers ($\alpha_i$ and $\alpha_i^{*}$) and the Karush-Kuhn-Tucker conditions to model (S4) yields the following dual Lagrangian form (Lu et al., 2009):

$$\max\; Q(\alpha_i, \alpha_i^{*}) = \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^{*}) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^{*}) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})\, K(x_i, x_j), \qquad (S5a)$$

subject to

$$\sum_{i=1}^{n} \alpha_i = \sum_{i=1}^{n} \alpha_i^{*}, \qquad (S5b)$$
$$0 \le \alpha_i,\, \alpha_i^{*} \le C, \quad \forall i. \qquad (S5c)$$

The Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$ in model (S5) are calculated, and the optimal weight vector of the regression hyperplane is $w^{*} = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, \varphi(x_i)$. Thus, the solution of the SVR approach can be expressed in terms of kernel functions as follows (Vapnik, 1995):

$$f(x, \alpha, \alpha^{*}) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, K(x, x_i) + b, \qquad (S6)$$

where $K(x, x_i)$ is called the kernel function. The value of the kernel equals the inner product of the two vectors $x_i$ and $x_j$ in the feature space, $\varphi(x_i)$ and $\varphi(x_j)$; thus,

$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j). \qquad (S7)$$

The kernel function determines the smoothness properties of the solution and should reflect prior knowledge of the data (Jeng and Chuang, 2004). Any function that satisfies Mercer's condition can be employed as a kernel function (Mercer, 1909).
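To make the roles of $C$, $\varepsilon$, and the kernel in (S2)-(S6) concrete, a minimal sketch is given below. It assumes scikit-learn and synthetic data, neither of which is part of the original study, and the parameter values are purely illustrative.

```python
# Minimal SVR sketch (assumption: scikit-learn is available; data are synthetic).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 2))                          # input vectors x_i
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0.0, 0.1, size=200)  # y_i = f(x_i) + delta

# C is the regularization constant in (S2)/(S4a), epsilon is the precision
# parameter of the epsilon-insensitive loss (S3), and the RBF kernel plays
# the role of K(x_i, x_j) in (S5)-(S7).
model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=0.5)
model.fit(X, y)

# dual_coef_ holds the nonzero differences (alpha_i - alpha_i^*) of the
# support vectors and intercept_ corresponds to b, so predictions follow
# the kernel expansion f(x) = sum_i (alpha_i - alpha_i^*) K(x, x_i) + b in (S6).
y_hat = model.predict(X[:5])
print(model.dual_coef_.shape, model.intercept_, y_hat)
```

In this sketch, samples whose residual lies inside the $\varepsilon$-tube incur no loss under (S3) and therefore do not become support vectors, which is why only part of the training set contributes to the expansion in (S6).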
In the literature, the popular kernel functions in machine learning are as follows (Campbell, 2002; Tzeng, 2002):

$$\text{Linear:} \quad K(x_i, x_j) = x_i^{T} x_j, \qquad (S8a)$$
$$\text{Polynomial:} \quad K(x_i, x_j) = \left(x_i^{T} x_j + t\right)^{d}, \qquad (S8b)$$
$$\text{Radial basis function (RBF):} \quad K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right), \qquad (S8c)$$
$$\text{Multi-layer perceptron:} \quad K(x_i, x_j) = \tanh\left(x_i^{T} x_j + b\right), \qquad (S8d)$$

where $x_i$ and $x_j$ are input-space vectors; $\sigma^{2}$ denotes the variance in the RBF kernel; $t$ is the intercept constant and $d$ is the degree in the polynomial kernel; and $b$ is a constant used only in the multi-layer perceptron kernel.
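The four kernels in (S8a)-(S8d) can be written directly as functions of two input vectors; the short sketch below does so with NumPy. The parameter values for $t$, $d$, $\sigma^{2}$, and $b$ are arbitrary illustrative choices, not values used in the study.

```python
# Illustrative NumPy versions of the kernels in (S8a)-(S8d); parameter values are arbitrary.
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                             # (S8a): x_i^T x_j

def polynomial_kernel(xi, xj, t=1.0, d=3):
    return (xi @ xj + t) ** d                                  # (S8b): (x_i^T x_j + t)^d

def rbf_kernel(xi, xj, sigma2=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma2))    # (S8c): exp(-||x_i - x_j||^2 / (2 sigma^2))

def mlp_kernel(xi, xj, b=0.0):
    return np.tanh(xi @ xj + b)                                # (S8d): tanh(x_i^T x_j + b)

# Gram matrix K with K[i, j] = K(x_i, x_j), the quantity entering (S5a) and (S6).
X = np.random.default_rng(1).normal(size=(5, 2))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(K.shape, np.allclose(K, K.T))                            # symmetric, as Mercer's condition requires
```

The RBF Gram matrix computed this way is symmetric and positive semi-definite, consistent with Mercer's condition; the multi-layer perceptron (sigmoid) kernel, by contrast, satisfies Mercer's condition only for certain parameter choices.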
References

Campbell, C., 2002. Kernel methods: a survey of current techniques. Neurocomputing 48, 63–84.
Jeng, J.T., Chuang, C.C., 2004. Selection of initial structures with support vector regression for fuzzy neural networks. International Journal of Fuzzy Systems 6(2), 63–70.
Levis, A.A., Papageorgiou, L.G., 2005. Customer demand forecasting via support vector regression analysis. Chemical Engineering Research and Design 83, 1009–1018.
Lu, C.J., Lee, T.S., Chiu, C.C., 2009. Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems 47(2), 115–125.
Mercer, J., 1909. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London A 209, 415–446.
Shieh, H.L., Kuo, C.C., 2010. A reduced data set method for support vector regression. Expert Systems with Applications 37, 7781–7787.
Tzeng, Y.H., 2002. Discussion on the efficiency of automatic classification based on file subject. Journal of China Society for Library Science 68, 62–83.
Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vapnik, V.N., 1998. Statistical Learning Theory. John Wiley & Sons, New York.
Vapnik, V.N., 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10, 988–999.
Wang, W., Xu, Z., 2004. A heuristic training for support vector regression. Neurocomputing 61, 259–275.

S2. List of Figure Captions:

Figure S1. Samples of $x_{ij}$ generated using the Latin hypercube sampling method.
Figure S2. Outputs of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10.
Figure S3. Training and prediction results of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10.
Figure S4. Relative errors of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10.

[Figure S1 (image). Samples of $x_{ij}$ generated using the Latin hypercube sampling method; eight panels ($x_{11}$, $x_{12}$, $x_{21}$, $x_{22}$, $x_{31}$, $x_{32}$, $x_{41}$, $x_{42}$); y-axis: Coal (tonne/day); x-axis: Sample (1–200); series: Training, Testing.]

[Figure S2 (image). Outputs of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10; y-axis: PHQ (×10⁻³); x-axis: Sample (1–200); series: Training, Testing.]

[Figure S3 (image). Training and prediction results of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10; y-axis: PHQ (×10⁻³); x-axis: Sample (1–200); series: Training, Testing.]

[Figure S4 (image). Relative errors of (a) SVR-SO2, (b) SVR-NOx, and (c) SVR-PM10; y-axis: Relative error (%), range −100 to 100; series: Training, Testing.]