Statistical methods used in the long-range forecast at JMA
By Shyunji Takahashi

1 Introduction

History of the long-range forecast and the statistical methods used in Japan

There had long been demand for long-range weather forecasts, mainly from farmers and central/local government agencies. After severe damage to rice crops in the unusually cool summer of 1934, the call for long-range forecasts became stronger, in order to avoid or prepare against damage from unusual weather conditions, and research was carried out to develop forecast methods. In June 1942, a section in charge of long-range weather forecasting was established in the Weather Forecasting Department of the Central Meteorological Observatory, and the issuing of one-month, three-month and warm/cold season forecasts was started one after another. In February 1945, the Long-range Forecast Division was established in the Operational Department of the Central Meteorological Observatory. The forecasts, however, did not necessarily satisfy users' demands because of their poor accuracy. After the failure of the forecast for the record-breaking warm winter of 1948/49, doubt and criticism arose over the accuracy of long-range forecasts. Operational forecasting ceased in February 1949 and the division was closed in November 1949. After that, research activities were strengthened at the Meteorological Research Institute, and as a result the issuing of one-month, three-month and warm/cold season forecasts was restarted in February 1953.

In those days the forecasting method was based on the periodicity of the global atmospheric circulation, and harmonic analysis was the main tool. Harmonic analysis detects a few dominant periods in the time sequence of temperature or circulation indices up to the present time and predicts the future state by extrapolating the dominant periodic changes.

In the early 1970s unusual weather occurred worldwide. This drew considerable attention to the necessity of long-range weather forecasts as well as to the monitoring of global weather conditions. In April 1974, JMA re-established the Long-range Forecast Division in the Forecast Department and at the same time started to monitor unusual weather around the world. The division has since been responsible for long-range weather forecasting and climate monitoring activities. In those days, in addition to harmonic analysis, the analog method was introduced as a forecasting tool. The analog method is based on the assumption that the future evolution of the atmospheric circulation, and the weather conditions realized, will be similar to those in a historical analog year. The analog year is identified by comparing the currently observed atmospheric circulation pattern with historical ones.

In the early stage, long-range forecasts were required almost exclusively by farmers and related public agencies. In the course of industrial development, however, their uses diversified into various fields such as construction, water resource control, electric power plants, manufacturers of electronic appliances, food companies, tourism, and so on. Occasional appearances of unusual weather in the 1980s also increased interest in long-range forecasts. Recently the social as well as economic impacts of the long-range forecast have become more significant.

In March 1996, JMA started one-month forecasting based on dynamical ensemble prediction instead of statistical and empirical methods. At the same time, probabilistic expression was introduced into the long-range forecast to indicate explicitly the uncertainty of the forecasts.
In accordance with this change, the statistical tools in one-month forecasting were scrapped, except for an analog method retained as a reference. On the other hand, the statistical tools used in the three-month and cold/warm season forecasts were improved in the early 1990s in order to take advantage of bulk data compiled from worldwide sources and of greatly increased computer power. Five kinds of statistical methods were available in 1996: an analog method using grid point values of the 500 hPa height in the Northern Hemisphere, an analog/anti-analog method using SST sequences in the Nino3 region, the optimal climate normal (OCN), a method of periodicity using several circulation indices, and a multiple regression method using 500 hPa patterns. The analog/anti-analog method is an extension of the analog method and is based on the idea that a negatively correlated field is expected to yield the weather conditions opposite to those expected from an analogous field. The OCN is an optimal estimator of the forthcoming climate state; the average of the past 10 years or so is found to be a better estimator than the usual normal, which is calculated from the past 30 years. The OCN used in the current forecast at JMA is the latest 10-year mean.

JMA was reorganized in July 1996. The Climate Prediction Division (CPD) was newly established in the Climate and Marine Department and took over the responsibilities of the former Long-range Forecast Division. Climate information services were added to the CPD services. Furthermore, El Nino monitoring and prediction were transferred from the Oceanographical Division to CPD. Unifying the climate-related services was an important reason for the reorganization. CPD has since extended its responsibilities to the development of a numerical prediction model for long-range forecasting, a reanalysis project of historical atmospheric circulation data, and the assessment and projection of global warming.

In March 2003, JMA started three-month forecasting based on dynamical ensemble prediction. At the same time, the statistical methods were reconstructed and rearranged into two methods: the OCN and the canonical correlation analysis (CCA) technique.

In this lecture, two statistical methods are described. The first is the multiple regression model and the second is canonical correlation analysis (CCA). Both are currently used in many forecasting centers, and basic knowledge of these two methods should be useful.

2 Multiple Regression Method

2.1 Multiple regression model and its solution

Our situation is that we have a time series of a meteorological variable to be forecast and a set of time series of other variables clearly related to it. The former is called the predictand and the latter are the predictors. Our purpose is to predict the future value of the predictand using the relationship between the predictand and the predictors together with the present values of the predictors.

The multiple regression model assumes that the predictand vector is the sum of a linear combination of the predictors and a noise term. Suppose that $x_i(t)$ ($i = 1, \dots, m$; $t = 1, \dots, n$) are the predictors, $y(t)$ is the predictand, and $\epsilon(t)$ is the noise; the multiple regression model is written as

$y(t) = \beta_1 x_1(t) + \dots + \beta_m x_m(t) + \epsilon(t)$.

This expression can be rewritten using matrix-vector notation as

$y = X\beta + \epsilon$, $\quad X = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix} = (x_1(t), \dots, x_m(t))$, $\quad \beta = (\beta_1, \dots, \beta_m)^t$.

The vector $\beta$ is called the regression coefficient vector and has to be estimated from the data matrix $X$ and the predictand vector $y$. The coefficient vector is usually estimated so as to minimize the sum of squared errors of the forecast vector $\hat{y}$.
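Before turning to the least-squares solution, a small hypothetical sketch may make this setup concrete. The following Python/NumPy lines assemble a data matrix X from three made-up predictor time series and a placeholder predictand series y; the index names, sizes and random numbers are illustrative assumptions only, not the operational JMA data set.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 40, 3                          # e.g. 40 years of data, 3 candidate predictors (placeholders)
    z500_index = rng.normal(size=n)       # stand-in for a 500 hPa circulation index
    nino3_sst = rng.normal(size=n)        # stand-in for a Nino3 SST anomaly series
    blocking_index = rng.normal(size=n)   # stand-in for a blocking index

    # Data matrix X (n x m): column i is the time series of predictor x_i(t)
    X = np.column_stack([z500_index, nino3_sst, blocking_index])

    # Predictand vector y (length n): e.g. an observed seasonal mean temperature anomaly
    y = rng.normal(size=n)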
The least-squares criterion is

$S = \| y - \hat{y} \|^2 = \| y - Xa \|^2 \rightarrow \mathrm{minimum}$,

where $a$ is the estimator of the vector $\beta$. The solution of this minimization problem is

$a = (X^t X)^{-1} X^t y$,
$S = y^t (I - X(X^t X)^{-1} X^t) y = y^t y - a^t X^t y$ (A1),

where $A^t$ and $A^{-1}$ denote the transpose and the inverse of a matrix $A$, respectively, and $I$ is the unit matrix. The estimator of the regression coefficient is determined uniquely by the given data matrix and predictand vector, but it should be regarded as a random variable whose distribution depends on the stochastic noise. This means that the estimated coefficient vector has some uncertainty, which affects the accuracy of the forecast.

2.2 Stochastic properties of the sum of squared errors and the regression coefficient

In the usual regression model, the stochastic noise is assumed to be Gaussian with zero mean and a certain variance, and each component of the noise vector is assumed to be independent of the others. That is,

$\epsilon_i \sim N(0, \sigma^2)$, $\quad E(\epsilon_i \epsilon_j) = \sigma^2 \delta_{ij}$, $\quad E(\epsilon \epsilon^t) = \sigma^2 I$,

where the symbol $N(\mu, \sigma^2)$ represents the Gaussian probability density function with mean $\mu$ and variance $\sigma^2$, the notation $x \sim N(\mu, \sigma^2)$ indicates that $x$ is a random variable with that normal distribution, and $E(x)$ represents the expected value of $x$.

This situation can be represented by the following geometric picture.

[Figure: the predictand vector $y = \vec{op}$ in the vector space $V^n$; the true regression $X\beta = \vec{or}$ in the subspace $S(X)$; the noise vector $\epsilon = \vec{rp}$; the orthogonal projection $\vec{oq}$ of $y$ onto $S(X)$; and the decomposition of $\epsilon$ into $\vec{rq}$ and $\vec{qp}$.]

Define $S(X)$ as the subspace spanned by the column vectors of the matrix $X$. The assumed true regression $X\beta$ lies on the $S(X)$ plane and is represented by the vector $\vec{or}$ in the figure. Let $\vec{rp}$ be the noise vector $\epsilon$; then $y = X\beta + \epsilon$ is represented by the vector $\vec{op}$. Under this situation we can see that, among the vectors on the $S(X)$ plane, the one whose distance to $y$ (the length of $\vec{qp}$) becomes minimum is exactly the orthogonal projection vector $\vec{oq}$. This vector is, of course, the same as the previous solution of the normal equation (A2). The vector $\vec{qp}$ is the error vector of the regression, whose squared length is the sum of squared errors $S$; it is the projection of the original noise vector $\vec{rp}$ onto the space $S(X)^{\perp}$. On the other hand, the vector $\vec{rq}$ lies in the subspace $S(X)$ and is perpendicular to $\vec{qp}$. In consequence, the original noise vector can be decomposed into two components that are perpendicular to each other. The matrix of the orthogonal projection onto $S(X)$ is $X(X^t X)^{-1} X^t$ (A2), so that

$\vec{oq} = X(X^t X)^{-1} X^t y$,
$\epsilon = \vec{rq} + \vec{qp} = X(X^t X)^{-1} X^t \epsilon + (I - X(X^t X)^{-1} X^t)\epsilon$.

Applying Cochran's theorem to this relation, it can be shown that the sum of squared errors $S$ has a $\chi^2$ distribution with $n - m$ degrees of freedom (A3):

$S/\sigma^2 \sim \chi^2(n-m)$, $\quad E(S) = (n-m)\sigma^2$.

This means that an unbiased estimator of $\sigma^2$ is

$\hat{\sigma}^2 = \dfrac{S}{n-m}$.

Next, the probability distribution of the estimated regression coefficient $a$ is derived. The expression for $a$ can be rewritten as

$a = (X^t X)^{-1} X^t y = (X^t X)^{-1} X^t (X\beta + \epsilon) = \beta + (X^t X)^{-1} X^t \epsilon$.

As $a$ is a linear combination of $\epsilon$, which is assumed to be normally distributed, $a$ is also normally distributed. Its mean and covariance are as follows (A4):

$E(a) = \beta$, $\quad E\{(a - \beta)(a - \beta)^t\} = \sigma^2 (X^t X)^{-1}$.

2.3 Stochastic properties of the prediction for an independent input

For a predictor vector $x_0 = (x_1, \dots, x_m)^t$ that is independent of the data used to determine the regression coefficients, the predicted value is represented as $\hat{y} = x_0^t a$. This predicted value $\hat{y}$ is a linear combination of $a$, which is normally distributed. Calculating its expected value and variance,

$E(x_0^t a) = x_0^t \beta$, $\quad E\{(x_0^t a - x_0^t \beta)^2\} = \sigma^2 x_0^t (X^t X)^{-1} x_0$.

That is,

$\hat{y} \sim N(x_0^t \beta,\ \sigma^2 x_0^t (X^t X)^{-1} x_0)$.

The value actually realized, $y^*$, is also normally distributed,

$y^* \sim N(x_0^t \beta,\ \sigma^2)$.

Hence,

$y^* - \hat{y} \sim N(0,\ \sigma^2 (1 + x_0^t (X^t X)^{-1} x_0))$.

This relation can be used for probabilistic forecasts. Although the population variance $\sigma^2$ is unknown, its unbiased estimator is available, and if $n$ is large enough the final probability distribution of the realized value is approximately (A5)

$y^* \sim N(\hat{y},\ \hat{\sigma}^2 (1 + x_0^t (X^t X)^{-1} x_0))$, $\quad \hat{\sigma}^2 = S/(n-m)$.
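As a self-contained illustration of sections 2.1-2.3 (a minimal sketch with synthetic placeholder data of the kind shown in section 2.1, not the operational JMA code), the following NumPy fragment estimates the coefficient vector a, the unbiased variance estimate, and the approximate forecast distribution for an independent predictor vector x0:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 40, 3
    X = rng.normal(size=(n, m))                                    # predictor data matrix (placeholder anomalies)
    y = X @ np.array([0.8, 0.5, -0.3]) + 0.5 * rng.normal(size=n)  # synthetic predictand

    a, *_ = np.linalg.lstsq(X, y, rcond=None)   # a = (X^t X)^{-1} X^t y
    resid = y - X @ a
    S = resid @ resid                           # sum of squared errors
    sigma2_hat = S / (n - m)                    # unbiased estimate of sigma^2

    x0 = np.array([1.0, -0.5, 0.2])             # independent predictor vector (placeholder)
    y_hat = x0 @ a                              # point forecast
    var_fcst = sigma2_hat * (1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)
    print(f"forecast {y_hat:.2f} +/- {1.645 * np.sqrt(var_fcst):.2f} (approx. 90% range)")

np.linalg.lstsq is used for the fit instead of forming (X^t X)^{-1} explicitly, which is numerically safer; the explicit inverse appears only in the forecast-variance factor.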
2.4 Significance test of the regression

Suppose an $m$-term regression with sum of squared errors $S_m$ has already been set up and, by adding a new predictor variable to the data matrix, an $(m+1)$-term regression with sum of squared errors $S_{m+1}$ is obtained. $S_m \ge S_{m+1}$ is trivial. For testing statistically whether the new variable is significant or not, the following relations are very useful; the stepwise technique, which is one of the model selection methods, employs the same relations. $S_m$, $S_{m+1}$ and $S_m - S_{m+1}$ follow $\chi^2$ distributions:

$\dfrac{S_m}{\sigma^2} \sim \chi^2(n-m)$, $\quad \dfrac{S_{m+1}}{\sigma^2} \sim \chi^2(n-m-1)$, $\quad \dfrac{S_m - S_{m+1}}{\sigma^2} \sim \chi^2(1)$.

Because $S_{m+1}$ and $S_m - S_{m+1}$ are independent of each other (A6), the following is derived:

$\dfrac{(S_m - S_{m+1})(n-m-1)}{S_{m+1}} \sim F(1,\ n-m-1)$,

where $F(k, l)$ means the F-distribution with degrees of freedom $k$, $l$.

2.5 Model selection

The multiple regression model using all predictors in the given data matrix has the minimum $S$, but the optimal model we need is the one that minimizes the prediction error for independent cases. As the number of terms increases, $S$ decreases, but the uncertainty due to the regression coefficients increases. The final prediction error (FPE) accounts for both effects. FPE is defined as the expected value of the sum of squared errors calculated on independent data; that is,

$FPE = E\{(y' - Xa)^t (y' - Xa)\}$,

where $y'$ is an independent predictand vector, $y' = X\beta + \epsilon'$. After a little calculation, the following equation is obtained (A7):

$FPE = (n+m)\sigma^2$.

The usual form of FPE is obtained by substituting $\hat{\sigma}^2 = S/(n-m)$:

$FPE = \dfrac{n+m}{n-m} S$.

The effect of the increasing uncertainty of the regression coefficients, which accompanies the increasing number of terms, is represented by the factor $(n+m)/(n-m)$.

One approach to obtaining the optimal regression equation is the method of all-subsets regression. Regressions are calculated for all possible subsets of predictors, and the subset giving the minimum FPE or AIC is selected as the optimal set. The all-subsets regression method is secure and had been used in JMA's long-range forecast with the AIC criterion, although the number of subsets to check is $2^m - 1$, which increases rapidly as $m$ increases. The AIC criterion gives almost the same results as the FPE criterion (A8); it is

$AIC = n \log_e S_m + 2m$.

The other way to obtain the optimal equation is the stepwise method. The method consists of two processes: adding the outside variable that has the highest and statistically significant contribution to the regression, judged by the F-test described in section 2.4, and eliminating a variable already in the regression that has become insignificant. Both processes are repeated until there is no significant variable outside the regression and no insignificant variable inside it. This method is very convenient and has been used in making guidance from numerical model output at JMA.
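As a minimal sketch of the selection criteria of this subsection (an illustration with assumed synthetic data, not the operational JMA procedure), the following Python fragment evaluates AIC = n log_e S_m + 2m for every non-empty subset of a small set of candidate predictors and keeps the subset with the minimum value:

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 40, 4                                    # placeholder sample size and number of candidates
    X_all = rng.normal(size=(n, m))                 # candidate predictors (columns)
    y = 0.9 * X_all[:, 0] - 0.6 * X_all[:, 2] + 0.5 * rng.normal(size=n)

    def sse(X, y):
        """Sum of squared errors of the least-squares fit of y on the columns of X."""
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ a
        return r @ r

    best = None
    for k in range(1, m + 1):                       # all 2^m - 1 non-empty subsets
        for subset in itertools.combinations(range(m), k):
            S_k = sse(X_all[:, list(subset)], y)
            aic = n * np.log(S_k) + 2 * k           # AIC = n log_e S_m + 2m
            if best is None or aic < best[0]:
                best = (aic, subset)

    print("selected predictors:", best[1], "AIC:", round(best[0], 2))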
2.6 Single regression model

To help the understanding of the multiple regression method, let us consider the simplest case, that is, simple regression. Simple regression is not used now in the long-range forecasting at JMA, but prediction using the simple relation between the width of the polar vortex in autumn and the winter mean temperature had been one of the bases of the cold season outlook until recent years.

The single regression model is written as

$y(t) = \beta_0 + \beta_1 x(t) + \epsilon(t)$.

Rearranging this into the notation of this note,

$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$, $\quad y = X\beta + \epsilon$, $\quad \beta = (\beta_0, \beta_1)^t$.

Calculating $a = (X^t X)^{-1} X^t y$ and rearranging,

$a_1 = \dfrac{\sigma_{xy}}{\sigma_{xx}}$, $\quad a_0 = \bar{y} - a_1 \bar{x}$,
$\sigma_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, $\quad \sigma_{xy} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$.

$S = y^t y - a^t X^t y$ is rewritten in this case as

$S = \sigma_{yy}\left(1 - \dfrac{\sigma_{xy}^2}{\sigma_{xx}\sigma_{yy}}\right) = \sigma_{yy}(1 - r^2)$, $\quad \sigma_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$,

where $r$ is the correlation coefficient between $x$ and $y$.

The stochastic properties are also modified. Writing only the results,

$\dfrac{S}{\sigma^2} \sim \chi^2(n-2)$, $\quad \hat{\sigma}^2 = \dfrac{S}{n-2}$,
$a_1 - \beta_1 \sim N\left(0,\ \dfrac{\sigma^2}{\sigma_{xx}}\right)$, $\quad a_0 - \beta_0 \sim N\left(0,\ \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sigma_{xx}}\right)\right)$.

Combining the first two of these, the following relation is derived:

$(a_1 - \beta_1)\sqrt{\dfrac{(n-2)\sigma_{xx}}{S}} \sim t(n-2)$.

Using this relation, a confidence interval for the slope parameter $\beta_1$ is given as

$a_1 - t_\alpha(n-2)\sqrt{\dfrac{S}{(n-2)\sigma_{xx}}} \;\le\; \beta_1 \;\le\; a_1 + t_\alpha(n-2)\sqrt{\dfrac{S}{(n-2)\sigma_{xx}}}$,

where $t_\alpha(k)$ is the $\alpha$-quantile of the t-distribution with $k$ degrees of freedom. Similarly, a confidence interval for the intercept parameter $\beta_0$ is

$a_0 - t_\alpha(n-2)\sqrt{\dfrac{S}{n-2}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sigma_{xx}}\right)} \;\le\; \beta_0 \;\le\; a_0 + t_\alpha(n-2)\sqrt{\dfrac{S}{n-2}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sigma_{xx}}\right)}$.

A confidence interval for the regression value $\beta_0 + \beta_1 x$ is slightly more difficult to derive because of the covariance of $a_0$ and $a_1$; the result is

$a_0 + a_1 x - t_\alpha(n-2)\sqrt{\dfrac{S}{n-2}\left(\dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{\sigma_{xx}}\right)} \;\le\; \beta_0 + \beta_1 x \;\le\; a_0 + a_1 x + t_\alpha(n-2)\sqrt{\dfrac{S}{n-2}\left(\dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{\sigma_{xx}}\right)}$.

Finally, a confidence interval for an independently predicted value is

$a_0 + a_1 x - t_\alpha(n-2)\sqrt{\dfrac{S}{n-2}\left(1 + \dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{\sigma_{xx}}\right)} \;\le\; y^* \;\le\; a_0 + a_1 x + t_\alpha(n-2)\sqrt{\dfrac{S}{n-2}\left(1 + \dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{\sigma_{xx}}\right)}$.
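The slope confidence interval above can be illustrated with a short self-contained Python sketch. The data loosely mimic the polar-vortex-width versus winter-temperature relation mentioned at the beginning of this subsection, but the numbers are synthetic placeholders and the t-quantile is taken from scipy.stats:

    import numpy as np
    from scipy.stats import t as t_dist

    rng = np.random.default_rng(2)
    n = 30
    x = rng.normal(size=n)                  # placeholder predictor, e.g. an autumn polar vortex index
    y = 0.7 * x + 0.4 * rng.normal(size=n)  # placeholder predictand, e.g. winter mean temperature anomaly

    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)           # sigma_xx
    sxy = np.sum((x - xbar) * (y - ybar))   # sigma_xy
    syy = np.sum((y - ybar) ** 2)           # sigma_yy

    a1 = sxy / sxx                          # slope
    a0 = ybar - a1 * xbar                   # intercept
    S = syy - a1 * sxy                      # sum of squared errors = sigma_yy (1 - r^2)

    # Two-sided 95% confidence interval for the slope:
    # a1 +/- t_{0.975}(n-2) * sqrt(S / ((n-2) * sigma_xx))
    half = t_dist.ppf(0.975, n - 2) * np.sqrt(S / ((n - 2) * sxx))
    print(f"slope {a1:.3f}, 95% CI [{a1 - half:.3f}, {a1 + half:.3f}]")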
3 Canonical correlation analysis

3.1 Disadvantages of the multiple regression method in long-range forecast

Canonical correlation analysis (CCA) can be thought of as an extension of multiple regression analysis. Multiple regression analysis has the following two major disadvantages when it is applied to the long-range forecast.

(1) The covariance matrix is singular. In the data matrices used in the long-range forecast, a predictand such as a monthly mean temperature has one value per year and the data length is usually a few tens. On the other hand, when global grid point values are taken as predictors, the number of variables can be a few hundred or a few thousand. That is, usually $n \ll m$ in our case. Considering the fact that $\mathrm{rank}(X) = \mathrm{rank}(X^t) = \mathrm{rank}(X^t X) = \mathrm{rank}(X X^t)$, the $m \times m$ covariance matrix $X^t X$ is then singular (A9) and its inverse $(X^t X)^{-1}$ cannot be obtained in the usual way.

(2) The correlations among the predictands are not taken into account. In the long-range forecast at JMA there are four quantities to predict, namely the mean temperatures in northern, eastern, western and southern Japan. If a multiple regression equation is made for each region separately, the correlations of the temperatures among the regions are not taken into account explicitly in each equation.

CCA treats the predictands as a predictand data matrix and directly considers the relation between the predictor data matrix and the predictand data matrix. In the case where the predictand matrix reduces to a vector, CCA gives a result identical to that of multiple regression analysis.

3.2 EOF analysis as pre-processing

The CCA used in the meteorological field is slightly different from the usual CCA described in standard textbooks on multivariate analysis. To avoid the first problem mentioned above, both the predictand and predictor matrices are analyzed by Empirical Orthogonal Function (EOF) analysis, and the two data matrices are transformed into new full-rank matrices.

Let $Y$ and $X$ be the predictand and predictor matrices and $m_y$ and $m_x$ the numbers of their variables, respectively. The data length $n$ is the same for both matrices. After EOF analysis has been performed on both matrices, they are transformed into matrices of full rank. Let $F$ and $E$ be the eigenvector matrices of $Y^t Y$ and $X^t X$, and let $D_x = \mathrm{Diag}(\lambda_1, \dots, \lambda_p)$ and $D_y = \mathrm{Diag}(\kappa_1, \dots, \kappa_q)$ be the diagonal matrices of the corresponding eigenvalues. The numbers of retained eigenvalues of the two data sets, $p$ and $q$, are usually limited to be less than the data length $n$ for meteorological data (A10). The new data matrices are represented with these notations as

$A = X E D_x^{-1/2}$, $\quad B = Y F D_y^{-1/2}$.

Both matrices are orthogonal, as shown below:

$A^t A = D_x^{-1/2} E^t X^t X E D_x^{-1/2} = D_x^{-1/2} D_x D_x^{-1/2} = I_p$,

and $A^t A$, $B^t B$ are of course full-rank matrices (A11).

In the routine CCA models at JMA and in the USA, a further reduction of the matrix dimension is made: eigenvectors with relatively small eigenvalues are ignored and are not used to represent the matrices $A$ and $B$. Four or five eigenvectors are used in JMA's model, and probably also in the US model. This reduction is based on the idea that the eigen components with larger eigenvalues, such as the famous AO pattern, which is the first component of the 500 hPa height field, carry much of the information for predicting future phenomena, while those with small eigenvalues are mostly noise; strictly speaking, however, it has no rigorous meteorological or mathematical justification.

3.3 Main part of CCA

Consider linear combinations of the two matrices $A$ and $B$,

$u = Ar$, $\quad v = Bs$.

The two new variables are determined under the condition that the correlation between them becomes maximum. The variances and covariance are

$u^t u = (Ar)^t Ar = r^t A^t A r = r^t r$,
$v^t v = (Bs)^t Bs = s^t B^t B s = s^t s$,
$u^t v = (Ar)^t Bs = r^t A^t B s$.

Using these relations, the vectors $u$, $v$ are found by making the following Lagrangian stationary under the constraints $r^t r = 1$, $s^t s = 1$:

$L = r^t A^t B s - \dfrac{\mu}{2}(r^t r - 1) - \dfrac{\nu}{2}(s^t s - 1)$,

where $\mu$, $\nu$ are Lagrange multipliers. The following equations are obtained after some calculation, in which the two multipliers turn out to be equal (A12):

$A^t B s - \mu r = 0$,
$B^t A r - \mu s = 0$.

Now, setting $C = A^t B$ and rearranging the above, we obtain

$C^t C s = \mu^2 s$,
$C C^t r = \mu^2 r$.

These are eigenvalue problems, but they are not independent of each other. A pair of eigenvectors is related by

$C s_j = \mu_j r_j$, $\quad C^t r_j = \mu_j s_j$,

and the number of non-zero eigenvalues is limited to the minimum of $p$ and $q$ (A13). Sorting the eigenvalues and eigenvectors in descending order, the pair $u$, $v$ corresponding to the maximum eigenvalue is called the first pair of canonical variables. The correlation coefficient between them is just the corresponding $\mu_j$, from the relation

$u_j^t v_j = r_j^t A^t B s_j = \mu_j r_j^t r_j = \mu_j$.

The second pair of canonical variables is defined in the same way, and so on. The canonical variables satisfy the following orthogonality relations:

$u_j^t u_k = \delta_{jk}$, $\quad v_j^t v_k = \delta_{jk}$, $\quad u_j^t v_k = \mu_k \delta_{jk}$ $\quad (j, k = 1, \dots, q)$.

Defining the new matrices

$U = (u_1, \dots, u_q)$, $\quad V = (v_1, \dots, v_q)$, $\quad R = (r_1, \dots, r_q)$, $\quad S = (s_1, \dots, s_q)$,

the above orthogonality relations of the canonical variables are expressed as

$U^t U = I$, $\quad V^t V = I$, $\quad U^t V = V^t U = \Lambda = \mathrm{Diag}(\mu_1, \dots, \mu_q)$.
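The EOF pre-processing of section 3.2 and the canonical pair computation of section 3.3 can be sketched compactly in Python. The two eigenvalue problems for C = A^t B are solved here through the singular value decomposition of C, which delivers exactly the pairs (r_j, s_j) and the canonical correlations mu_j; all sizes, anomaly fields and the number of retained EOF modes below are placeholder assumptions, not the operational JMA configuration:

    import numpy as np

    rng = np.random.default_rng(3)
    n, mx, my = 40, 200, 4           # placeholders: 40 years, 200 grid points, 4 regions
    X = rng.normal(size=(n, mx))     # predictor anomalies (e.g. 500 hPa height)
    Y = rng.normal(size=(n, my))     # predictand anomalies (e.g. regional temperatures)
    X -= X.mean(axis=0)              # work with anomalies (centered columns)
    Y -= Y.mean(axis=0)

    def eof_prewhiten(Z, k):
        """EOF pre-processing: return the n x k matrix A = Z E D^{-1/2} with A^t A = I_k."""
        lam, E = np.linalg.eigh(Z.T @ Z)          # eigenvalues/vectors of Z^t Z
        order = np.argsort(lam)[::-1][:k]         # keep the k leading modes
        return (Z @ E[:, order]) / np.sqrt(lam[order])

    p, q = 5, 3                      # numbers of retained EOFs (4 or 5 in the JMA model)
    A = eof_prewhiten(X, p)          # from the predictor matrix
    B = eof_prewhiten(Y, q)          # from the predictand matrix

    # Canonical vectors and correlations: C^t C s = mu^2 s and C C^t r = mu^2 r,
    # obtained here from the SVD C = R diag(mu) S^t
    C = A.T @ B
    R, mu, St = np.linalg.svd(C, full_matrices=False)
    U = A @ R                        # canonical variables on the predictor side
    V = B @ St.T                     # canonical variables on the predictand side
    print("canonical correlations:", np.round(mu, 3))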
3.4 Prediction form of CCA

The matrix $U$ is the new matrix corresponding to the predictor data matrix $X$, and $V$ is the one corresponding to the predictand data matrix $Y$. Consider a multiple regression prediction of $V$ using $U$, that is, suppose the regression relation

$V \approx U D$,

where $D$ is the regression coefficient matrix to be determined. The normal equation in this case is

$U^t U D = U^t V$.

Rearranging the above by use of the orthogonality relations,

$D = \Lambda$;

that is, the coefficient matrix is the diagonal matrix consisting of the canonical correlations. The resulting regression is

$\hat{V} = U \Lambda$,

where $\hat{V}$ is the predicted matrix of $V$.

This predictive equation avoids the two disadvantages mentioned in section 3.1. First, $\hat{V}$ can be calculated using the $n \times q$ matrix $U$ and the $q \times q$ diagonal matrix $\Lambda$, avoiding the rank deficiency. Second, the matrix $\hat{V}$ represents time series of patterns formed by the predictands, so $\hat{V}$ includes the correlations between the predictand variables, which were expected to be taken into account.

Our purpose in the CCA is not to obtain $\hat{V}$ itself but the original predictand variables, so a transformation of the output variables is needed in the routine CCA model. In the routine model at JMA, a transform matrix $H$ is estimated which satisfies

$Y \approx V H$, $\quad H = (V^t V)^{-1} V^t Y = V^t Y$,

and the final prediction equation is represented as

$\hat{Y} = \hat{V} H = U \Lambda H$.

[Figure: flow chart of the CCA prediction, from the predictor and predictand matrices in real space ($X$, $Y$), through their representations in EOF space ($A$, $B$), to the canonical variables in CCA space ($U$, $V$); for comparison, the ordinary multiple regression prediction $\hat{Y} = X(X^t X)^{-1} X^t Y$ is also shown.]

Appendix

A1)

$S = \| y - Xa \|^2 = (y - Xa)^t (y - Xa) = y^t y - y^t X a - a^t X^t y + a^t X^t X a$.

The solution must satisfy the relation $\partial S / \partial a = 0$. The partial derivatives of the individual terms with respect to $a_k$ are

$\dfrac{\partial}{\partial a_k}(y^t X a) = \dfrac{\partial}{\partial a_k}\sum_{i,j} y_i X_{ij} a_j = \sum_i y_i X_{ik} = (X^t y)_k$,

$\dfrac{\partial}{\partial a_k}(a^t X^t y) = \dfrac{\partial}{\partial a_k}\sum_{i,j} a_i (X^t)_{ij} y_j = \sum_j (X^t)_{kj} y_j = (X^t y)_k$,

$\dfrac{\partial}{\partial a_k}(a^t X^t X a) = \dfrac{\partial}{\partial a_k}\sum_{i,j,l} a_l X_{il} X_{ij} a_j = 2\sum_{i,j} X_{ik} X_{ij} a_j = 2 (X^t X a)_k$.

The final equation, the so-called "normal equation", is

$X^t X a - X^t y = 0$,

and its solution is

$a = (X^t X)^{-1} X^t y$.

A2)

$y$ can be uniquely decomposed into two vectors,

$y = y_1 + y_2$, $\quad y_1 \in S(X)$, $\quad y_2 \in S(X)^{\perp}$.

There always exists a vector $\gamma$ which satisfies $y_1 = X\gamma$, and $y_2 \in S(X)^{\perp}$ means $X^t y_2 = 0$. From the above, $X^t y = X^t y_1 = X^t X \gamma$, so $\gamma = (X^t X)^{-1} X^t y$. In consequence we can derive the relation

$y_1 = X(X^t X)^{-1} X^t y$,

that is, $X(X^t X)^{-1} X^t$ is the matrix of the orthogonal projection onto $S(X)$.

A3)

The dimensions of $S(X)$ and $S(X)^{\perp}$ are $m$ and $n - m$, respectively, and orthonormal bases exist in both spaces, with $m$ and $n - m$ members. Two matrices can be composed from these bases:

$C_1 = (c_1, \dots, c_m)$, $\quad C_2 = (c_{m+1}, \dots, c_n)$, $\quad S(X) = S(C_1)$, $\quad S(X)^{\perp} = S(C_2)$.

The orthogonal projection matrices can be rewritten using these two matrices, and the decomposition of the error vector becomes

$\epsilon = C_1 C_1^t \epsilon + C_2 C_2^t \epsilon$.

Multiplying by the matrix $C^t$, where $C = (C_1, C_2)$,

$C^t \epsilon = (\epsilon'_1, \dots, \epsilon'_n)^t$, $\quad C^t C_1 C_1^t \epsilon = (\epsilon'_1, \dots, \epsilon'_m, 0, \dots, 0)^t$, $\quad C^t C_2 C_2^t \epsilon = (0, \dots, 0, \epsilon'_{m+1}, \dots, \epsilon'_n)^t$.

As the matrix $C$ is an orthogonal matrix, the new error vector $\epsilon' = C^t \epsilon$ also has an independent normal distribution. In consequence,

$\dfrac{S}{\sigma^2} = \dfrac{1}{\sigma^2} \sum_{i=m+1}^{n} \epsilon_i'^2 \sim \chi^2(n-m)$.

A4)

$E(a - \beta) = (X^t X)^{-1} X^t E(\epsilon) = 0$,
$E\{(a - \beta)(a - \beta)^t\} = (X^t X)^{-1} X^t E(\epsilon \epsilon^t) X (X^t X)^{-1} = \sigma^2 (X^t X)^{-1}$.

The independence of $a$ and $S$ follows from

$E\{(a - \beta)\,\epsilon^t (I - X(X^t X)^{-1} X^t)\} = \sigma^2 (X^t X)^{-1} X^t (I - X(X^t X)^{-1} X^t) = 0$.

A5)

The precise distribution can be represented using the t-distribution. Since

$y^* - \hat{y} \sim N(0,\ \sigma^2(1 + x_0^t (X^t X)^{-1} x_0))$ and $S/\sigma^2 \sim \chi^2(n-m)$

have already been shown, and the two are independent, the following formula can be derived:

$\dfrac{y^* - \hat{y}}{\sqrt{\dfrac{S}{n-m}\left(1 + x_0^t (X^t X)^{-1} x_0\right)}} \sim t(n-m)$,

where the symbol $t(k)$ represents the t-distribution with $k$ degrees of freedom.

A6)

Let $P_m$, $P_{m+1}$ be the orthogonal projection matrices onto $S(X_m)$, $S(X_{m+1})$, respectively. Using these notations, the sums of squared errors are given as

$S_m = \epsilon^t (I - P_m)\epsilon$, $\quad S_{m+1} = \epsilon^t (I - P_{m+1})\epsilon$, $\quad S_m - S_{m+1} = \epsilon^t (P_{m+1} - P_m)\epsilon$.

Considering the relation $S(X_m) \subset S(X_{m+1})$, $P_{m+1} P_m = P_m$ is obvious, and therefore

$(I - P_{m+1})(P_{m+1} - P_m) = P_{m+1} - P_m - P_{m+1} + P_{m+1} P_m = 0$.

That is, $I - P_{m+1}$ and $P_{m+1} - P_m$ are projections onto mutually orthogonal subspaces, so the corresponding components of the Gaussian noise vector are independent of each other. Hence $S_{m+1} = \epsilon^t (I - P_{m+1})\epsilon$ and $S_m - S_{m+1} = \epsilon^t (P_{m+1} - P_m)\epsilon$ are independent.
A7)

Substituting the equations

$y' = X\beta + \epsilon'$, $\quad a = (X^t X)^{-1} X^t y$, $\quad y = X\beta + \epsilon$

into $(y' - Xa)^t (y' - Xa)$,

$y' - Xa = \epsilon' - X(X^t X)^{-1} X^t \epsilon$,
$(y' - Xa)^t (y' - Xa) = \epsilon'^t \epsilon' - 2\,\epsilon'^t X(X^t X)^{-1} X^t \epsilon + \epsilon^t X(X^t X)^{-1} X^t \epsilon$.

Observing the independence of the two noise vectors, that is $E(\epsilon' \epsilon^t) = 0$, and also noting the relations $E(\epsilon'^t \epsilon') = n\sigma^2$ and $E\{\epsilon^t X(X^t X)^{-1} X^t \epsilon\} = m\sigma^2$, the FPE formula is obtained:

$FPE = E\{(y' - Xa)^t (y' - Xa)\} = (n + m)\sigma^2$.

A8)

For $n \gg m$,

$n \log_e FPE = n \log_e \dfrac{n+m}{n-m} S_m = n \log_e S_m + n \log_e\left(1 + \dfrac{2m}{n-m}\right) \approx n \log_e S_m + \dfrac{2mn}{n-m} \approx n \log_e S_m + 2m = AIC$.

A9)

Let $S(X)$ and $\mathrm{Ker}(X)$ be the column space of the matrix $X$ and the null space of $X$, respectively:

$S(X) = \{X\alpha \ ;\ \alpha \in V^m\}$, $\quad \mathrm{Ker}(X) = \{\alpha \ ;\ X\alpha = 0,\ \alpha \in V^m\}$.

The rank of the matrix $X$ is defined as the dimension of $S(X)$: $\mathrm{rank}(X) = \dim S(X)$.

First, the facts $\mathrm{Ker}(X^t) = S(X)^{\perp}$ and $\mathrm{Ker}(X) = S(X^t)^{\perp}$ will be shown. Considering the inner product of $\alpha_1 \in \mathrm{Ker}(X^t)$ and $\alpha_2 = X\beta_2 \in S(X)$,

$\alpha_1^t \alpha_2 = \alpha_1^t X \beta_2 = (X^t \alpha_1)^t \beta_2 = 0 \ \Rightarrow\ \alpha_1 \in S(X)^{\perp}$, i.e. $\mathrm{Ker}(X^t) \subset S(X)^{\perp}$.

Also, considering the inner product of $\alpha_1 \in S(X)^{\perp}$ and $\alpha_2 = X\beta_2 \in S(X)$,

$0 = \alpha_1^t \alpha_2 = \alpha_1^t X \beta_2 = (X^t \alpha_1)^t \beta_2$ for every $\beta_2$ $\ \Rightarrow\ X^t \alpha_1 = 0$, i.e. $S(X)^{\perp} \subset \mathrm{Ker}(X^t)$.

In consequence $\mathrm{Ker}(X^t) = S(X)^{\perp}$, and the similar result $\mathrm{Ker}(X) = S(X^t)^{\perp}$ follows.

Next, $\mathrm{rank}(X) = \mathrm{rank}(X^t) = \mathrm{rank}(X^t X) = \mathrm{rank}(X X^t)$ will be shown. For $\alpha \in V^m$, the following decomposition is possible from the above results:

$\alpha = \alpha_1 + \alpha_2$, $\quad \alpha_1 \in S(X^t)$, $\quad \alpha_2 \in \mathrm{Ker}(X)$.

Multiplying the matrix $X$ to the above and writing $\alpha_1 = X^t \gamma_1$ with $\gamma_1 \in V^n$,

$X\alpha = X\alpha_1 + X\alpha_2 = X\alpha_1 = X X^t \gamma_1$.

This indicates $S(X) \subset S(X X^t)$. On the other hand, $S(X X^t) \subset S(X)$ always holds, because $X X^t \gamma = X(X^t \gamma) \in S(X)$. Hence

$S(X) = S(X X^t)$, $\quad \dim S(X) = \dim S(X X^t)$, $\quad \mathrm{rank}(X) = \mathrm{rank}(X X^t)$.

In the same way, $S(X^t) = S(X^t X)$ and $\mathrm{rank}(X^t) = \mathrm{rank}(X^t X)$. Finally, using the general relation $\mathrm{rank}(XY) \le \min\{\mathrm{rank}(X), \mathrm{rank}(Y)\}$,

$\mathrm{rank}(X) = \mathrm{rank}(X X^t) \le \mathrm{rank}(X^t)$ and $\mathrm{rank}(X^t) = \mathrm{rank}(X^t X) \le \mathrm{rank}(X)$,

so $\mathrm{rank}(X) = \mathrm{rank}(X^t)$. In consequence,

$\mathrm{rank}(X) = \mathrm{rank}(X^t) = \mathrm{rank}(X^t X) = \mathrm{rank}(X X^t)$.

A10)

In general, the eigenvalue problem $X^t X e = \lambda e$ has $m_x$ solutions; the eigenvectors are perpendicular to each other and the eigenvalues are real and non-negative. If $\lambda \neq 0$, then $e = X^t X e / \lambda \in S(X^t X)$. But the number of mutually orthogonal vectors in $S(X^t X)$ cannot exceed

$\dim S(X^t X) = \mathrm{rank}(X) \le n$,

so the number of non-zero eigenvalues is limited to $\mathrm{rank}(X)$ in the given case.

The non-negativity of the eigenvalues and the orthogonality of the eigenvectors of the problem $X^t X e = \lambda e$ can be shown as below.

(1) Orthogonality. For two different eigenvalues,

$X^t X e_i = \lambda_i e_i$, $\quad X^t X e_j = \lambda_j e_j$.

Multiplying each equation on the left by the transposed eigenvector of the other,

$e_j^t X^t X e_i = \lambda_i e_j^t e_i$, $\quad e_i^t X^t X e_j = \lambda_j e_i^t e_j$.

Rearranging the two equations (the left-hand sides are equal),

$(\lambda_i - \lambda_j) e_j^t e_i = 0$,

which indicates orthogonality.

(2) Non-negativity. Multiplying the eigenvalue equation on the left by the transposed eigenvector, and considering that the inner product of a vector with itself is a squared length,

$e_i^t X^t X e_i = (X e_i)^t (X e_i) = \lambda_i e_i^t e_i \ \Rightarrow\ \lambda_i \ge 0$.

A11)

The column vectors of $A$ lie in the subspace $S(X)$ and are orthogonal to each other, and also $\dim S(X) = \mathrm{rank}(A)$. Hence

$S(X) = S(A)$.

From this, the orthogonal projection onto $S(X)$ is identical to that onto $S(A)$; that is,

$\hat{y} = X(X^t X)^{-1} X^t y = A(A^t A)^{-1} A^t y = A A^t y$.

The last expression for $\hat{y}$ can be calculated while avoiding the rank deficiency of $(X^t X)^{-1}$. Additionally, substituting $A = X E D_x^{-1/2}$ into the equation, we get

$\hat{y} = X E D_x^{-1/2} A^t y = X a$, $\quad a = E D_x^{-1/2} A^t y = E D_x^{-1} E^t X^t y = X^{+} y$,

where $X^{+} = E D_x^{-1} E^t X^t$ is identical to the so-called Moore-Penrose generalized inverse of $X$.

A12)

$L = \sum_{k,i,j} r_k A_{ik} B_{ij} s_j - \dfrac{\mu}{2}\left(\sum_k r_k r_k - 1\right) - \dfrac{\nu}{2}\left(\sum_k s_k s_k - 1\right)$.

The partial derivative with respect to $r_l$ is

$\dfrac{\partial L}{\partial r_l} = \sum_{i,j} A_{il} B_{ij} s_j - \mu r_l$.

Setting the derivative to zero and rearranging into matrix notation,

$A^t B s - \mu r = 0$.

The partial derivative with respect to $s$ is treated in the same way, and the following equations are obtained:
$A^t B s - \mu r = 0$,
$B^t A r - \nu s = 0$.

Multiplying by $r^t$ and $s^t$, respectively, on the left-hand side, and considering $r^t r = 1$ and $s^t s = 1$,

$\mu = r^t A^t B s = s^t B^t A r = \nu$,

so the two Lagrange multipliers are equal.

A13)

Multiplying $C^t$ to $C C^t r = \mu^2 r$, we obtain

$C^t C (C^t r) = \mu^2 (C^t r)$.

This means that $C^t r$ is proportional to the eigenvector $s$. Using the relation $r^t C C^t r = \mu^2$, we obtain

$s = \dfrac{C^t r}{\mu}$.