Statistical methods used in the long-range forecast at JMA
By Shyunji Takahashi
1 Introduction
History of long-range forecast and the statistical methods used in Japan
There had long been demand for long-range weather forecasts, mainly from farmers and central and local government agencies. After severe damage to rice crops in the unusually cool summer of 1934, the request for long-range forecasts became stronger, in order to avoid or prepare against damage from unusual weather conditions, and research was carried out to develop forecast methods.
In June 1942, a section in charge of long-range weather forecasting was established in the Weather Forecasting Department of the Central Meteorological Observatory, and the issuing of one-month, three-month and warm/cold season forecasts started one after another. In February 1945, the Long-range Forecast Division was established in the Operational Department of the Central Meteorological Observatory. The forecasts, however, did not necessarily satisfy users' demands because of their poor accuracy.
After the failure of the forecast for the record-breaking warm winter of 1948/49, doubt and criticism arose concerning the accuracy of long-range forecasts. Operational forecasting ceased in February 1949 and the division was closed in November 1949. After that, research activities were strengthened at the Meteorological Research Institute, and as a result the issuing of one-month, three-month and warm/cold season forecasts was restarted in February 1953.
In those days, the forecasting method was based on the periodicity of the global atmospheric circulation, and harmonic analysis was the main tool. Harmonic analysis detects a few dominant periods in the time sequence of temperature or circulation indices up to the present and predicts the future state by extrapolating these dominant periodic changes.
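As an illustration only (this is not the operational JMA scheme of that era), the following Python sketch fits sinusoids of a few assumed dominant periods to an index series by least squares and extrapolates them forward. The index, the periods (12 and 26 months) and the forecast range are hypothetical.

```python
import numpy as np

def harmonic_extrapolation(series, periods, n_ahead):
    """Fit sinusoids with the given periods (in time steps) by least squares
    and extrapolate the fitted periodic signal n_ahead steps ahead."""
    t = np.arange(len(series))
    # Design matrix: constant term plus a cosine/sine pair for each period.
    cols = [np.ones_like(t, dtype=float)]
    for p in periods:
        cols += [np.cos(2 * np.pi * t / p), np.sin(2 * np.pi * t / p)]
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), series, rcond=None)

    t_future = np.arange(len(series), len(series) + n_ahead)
    cols_f = [np.ones_like(t_future, dtype=float)]
    for p in periods:
        cols_f += [np.cos(2 * np.pi * t_future / p), np.sin(2 * np.pi * t_future / p)]
    return np.column_stack(cols_f) @ coef

# Hypothetical example: a monthly circulation index extrapolated 6 months ahead,
# assuming dominant periods of 12 and 26 months were detected beforehand.
rng = np.random.default_rng(0)
index = np.sin(2 * np.pi * np.arange(120) / 12) + 0.3 * rng.standard_normal(120)
print(harmonic_extrapolation(index, periods=[12, 26], n_ahead=6))
```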
In the early 1970s unusual weather occurred worldwide. This drew considerable attention to the necessity of long-range weather forecasts as well as to the monitoring of global weather conditions. In April 1974, JMA reestablished the Long-range Forecast Division in the Forecast Department and at the same time started to monitor unusual weather around the world. The division was responsible for long-range weather forecasting and climate monitoring activities.
In those days, in addition to harmonic analysis, the analog method was introduced into the forecasting tools. The analog method is based on the assumption that the future evolution of the atmospheric circulation, and the weather conditions realized, will be similar to those of a historical analog year. The analog year is found by comparing the currently observed atmospheric circulation pattern with historical ones.
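A minimal sketch of the analog idea, using entirely synthetic data: the historical year whose circulation anomaly pattern has the highest pattern correlation with the current one is taken as the analog year. The grid size, years and field values below are hypothetical.

```python
import numpy as np

def find_analog_year(current_pattern, historical_patterns, years):
    """Return the historical year whose anomaly pattern correlates best with
    the current one (pattern correlation over grid points)."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float(a @ b / np.sqrt((a @ a) * (b @ b)))

    scores = [corr(current_pattern, hist) for hist in historical_patterns]
    best = int(np.argmax(scores))
    return years[best], scores[best]

# Hypothetical example with random "500-hPa anomaly" fields on 100 grid points.
rng = np.random.default_rng(1)
hist = rng.standard_normal((30, 100))                 # 30 historical years
years = list(range(1971, 2001))
current = hist[12] + 0.5 * rng.standard_normal(100)   # resembles the 1983 field
print(find_analog_year(current, hist, years))
```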
In the early stage, long-range forecasts were required almost exclusively by farmers and related public agencies. In the course of industrial development, however, the uses diversified into various fields such as construction, water resource control, electric power plants, makers of electronic appliances, food companies and tourism. Occasional occurrences of unusual weather in the 1980s also increased interest in long-range forecasts. Recently the social as well as economic impacts of the long-range forecast have become more significant.
In March 1996, JMA started one-month forecasting based on dynamical ensemble prediction instead of statistical and empirical methods. At the same time, probabilistic expression was introduced into the long-range forecast to indicate explicitly the uncertainty of the forecasts. In accordance with this change, the statistical tools in one-month forecasting were scrapped, except for an analog method kept as a reference. On the other hand, the statistical tools used in the three-month forecast and the cold/warm season forecast were improved in the early 1990s to take advantage of the bulk data compiled from worldwide sources and of greatly increased computer power. Five kinds of statistical methods were available in 1996: an analog method using 500-hPa grid point values over the Northern Hemisphere, an analog/anti-analog method using SST sequences in the Nino3 region, the optimal climate normal (OCN), a periodicity method using several circulation indices, and a multiple regression method using 500-hPa patterns. The analog/anti-analog method is an extension of the analog method and is based on the idea that a negatively correlated field is expected to yield weather conditions opposite to those expected from an analogous field. The OCN is an optimal estimator of the forthcoming climate state; the average of the past 10 years or so is found to be a better estimator than the usual normal, which is calculated over the past 30 years. The OCN used in the current forecast at JMA is the mean of the latest 10 years.
JMA was reorganized in July 1996. The Climate Prediction Division (CPD) was newly established in the Climate and Marine Department and took over the responsibilities of the former Long-range Forecast Division. Climate information services were added to CPD's services. Furthermore, El Nino monitoring and prediction were transferred from the Oceanographical Division to CPD. Unifying the climate-related services was an important reason for the reorganization. CPD now extends its responsibilities to the development of a numerical prediction model for long-range forecasting, a reanalysis project of historical atmospheric circulation data, and the assessment and projection of global warming.
In March 2003, JMA started three-month forecasting based on dynamical ensemble prediction. At the same time, the statistical methods were reconstructed and rearranged into two methods, namely OCN and the canonical correlation analysis (CCA) technique.
In this lecture, two statistical methods are described. The first is the multiple regression model and the second is canonical correlation analysis (CCA). Both are currently used at many forecasting centers, and basic knowledge of these two methods should be useful.
2 Multiple Regression Method
2.1 Multiple Regression model and the solution
Our situation is that we have a time series of a meteorological variable to be forecast and a set of time series of other variables obviously related to it. The former is called the predictand and the latter are called the predictors. Our purpose is to predict the future value of the predictand using the relationship between the predictand and the predictors together with the present values of the predictors.
The multiple regression model assumes that the predictand is the sum of a linear combination of the predictors and a noise term. Suppose that $x_i(t)$, $i = 1,\dots,m$, $t = 1,\dots,n$ are the predictors, $y(t)$ the predictand, and $\varepsilon$ the noise; the multiple regression model is written as
$$ y(t) = \beta_1 x_1(t) + \dots + \beta_m x_m(t) + \varepsilon . $$
This expression can be rewritten as follows using matrix-vector notation:
$$ y = X\beta + \varepsilon , \qquad X = \bigl(x_1(t),\dots,x_m(t)\bigr) = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix} , \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_m \end{pmatrix} . $$
The vector $\beta$ is called the regression coefficient vector and has to be estimated using the data matrix $X$ and the predictand vector $y$. The coefficient vector is usually estimated so as to minimize the sum of squared errors of the forecast vector $\hat{y}$. That is,
$$ S = \|y - \hat{y}\|^2 = \|y - Xa\|^2 = \text{minimum}, $$
where $a$ is the estimator of the vector $\beta$.
The solution of the above minimization problem is
$$ a = (X^t X)^{-1} X^t y , \qquad S = y^t\bigl(I - X(X^t X)^{-1}X^t\bigr)y = y^t y - a^t X^t y \quad \text{(A1)}, $$
where $A^t$ and $A^{-1}$ are the transpose and the inverse of a matrix $A$, respectively, and $I$ is the unit matrix.
The estimator of the regression coefficients is determined uniquely by the given data matrix and predictand vector, but it should be regarded as a random variable whose distribution depends on the stochastic noise. This means that the determined coefficient vector has some uncertainty, which affects the accuracy of the forecast.
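A minimal Python sketch of the solution above, using synthetic data (the sizes n, m and the coefficient values are arbitrary choices): the coefficient vector a = (X^tX)^{-1}X^ty and the sum of squared errors S are computed directly.

```python
import numpy as np

# Minimal sketch of the least-squares solution of section 2.1 (synthetic data).
rng = np.random.default_rng(2)
n, m = 40, 3                      # data length and number of predictors
X = rng.standard_normal((n, m))   # data matrix of predictors
beta_true = np.array([0.8, -0.5, 0.3])
y = X @ beta_true + 0.2 * rng.standard_normal(n)

# a = (X^t X)^{-1} X^t y  (np.linalg.lstsq is the numerically safer equivalent)
a = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ a
S = float((y - y_hat) @ (y - y_hat))   # sum of squared errors
print(a, S)
```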
2.2 Stochastic property of sum of squared error and regression coefficient
In the usual regression model, the stochastic noise is assumed to be Gaussian with zero mean and a certain variance, and the components of the noise vector are assumed to be independent of each other. That is,
$$ \varepsilon_i \sim N(0,\sigma^2), \qquad E(\varepsilon_i\varepsilon_j) = \sigma^2\delta_{ij}, \qquad E(\varepsilon\varepsilon^t) = \sigma^2 I , $$
where the symbol $N(\mu,\sigma^2)$ represents the Gaussian probability density function with mean $\mu$ and variance $\sigma^2$, the notation $x \sim N(\mu,\sigma^2)$ indicates that $x$ is a random variable with that normal distribution, and $E(x)$ represents the expected value of $x$.
This situation can be represented visually in the following figure. Define $S(X)$ as the subspace spanned by the column vectors of the matrix $X$. The assumed true regression $X\beta$ lies on the $S(X)$ plane and is represented by the vector $\vec{or}$ in the figure. Letting $\vec{rp}$ be the noise vector $\varepsilon$, $y = X\beta + \varepsilon$ is represented by the vector $\vec{op}$. Among the vectors on the $S(X)$ plane, the one whose distance to $y$ (the length $qp$) becomes minimum is just the orthogonal projection vector $\vec{oq}$, which is of course the same as the previous solution of the normal equation (A2). The vector $\vec{qp}$ is the error vector of the regression, whose squared length is the sum of squared errors $S$; it is the projection of the original noise vector $\varepsilon = \vec{rp}$ onto the space $S(X)^\perp$. On the other hand, the vector $\vec{qr}$ lies in the subspace $S(X)$ and is perpendicular to $\vec{qp}$. In consequence, the original noise vector can be decomposed into two components that are perpendicular to each other. The matrix of the orthogonal projection onto $S(X)$ is $X(X^tX)^{-1}X^t$ (A2), so that the decomposition is represented as
$$ \vec{op} = y, \qquad \vec{or} = X\beta, \qquad \vec{oq} = X(X^tX)^{-1}X^t y, \qquad \vec{rp} = \varepsilon, $$
$$ \varepsilon = X(X^tX)^{-1}X^t\varepsilon + \bigl(I - X(X^tX)^{-1}X^t\bigr)\varepsilon . $$
[Figure: the vector space $V^n$ with the subspace $S(X)$ and the points $o$, $p$, $q$, $r$ illustrating this decomposition.]
Here, applying Cochran's theorem to the above relation, it can be shown that the sum of squared errors $S$ has a $\chi^2$ distribution with $n-m$ degrees of freedom (A3):
$$ \frac{S}{\sigma^2} \sim \chi^2(n-m), \qquad E(S) = \sigma^2(n-m) . $$
This means that an unbiased estimator of $\sigma^2$ is $\hat{\sigma}^2 = S/(n-m)$.
Next, the probability distribution of the estimated regression coefficients $a$ is derived. The expression for $a$ can be rewritten as
$$ a = (X^tX)^{-1}X^t y = (X^tX)^{-1}X^t(X\beta + \varepsilon) = \beta + (X^tX)^{-1}X^t\varepsilon . $$
As $a$ is a linear combination of $\varepsilon$, which is assumed to be normally distributed, $a$ is also normally distributed. Its mean and covariance are as follows (A4):
$$ E(a) = \beta, \qquad E\bigl((a-\beta)(a-\beta)^t\bigr) = \sigma^2(X^tX)^{-1} . $$


2.3 Stochastic property of the prediction for an independent input
For a predictor vector $x_0 = (x_1,\dots,x_m)^t$ that is independent of the data used to determine the regression coefficients, the predicted value is represented as $\hat{y} = x_0^t a$. This predicted value $\hat{y}$ is a linear combination of $a$, which is normally distributed, so $\hat{y}$ is also normally distributed. Calculating its mean and variance,
$$ E(x_0^t a) = x_0^t\beta, \qquad E\bigl((x_0^t(a-\beta))^2\bigr) = \sigma^2 x_0^t(X^tX)^{-1}x_0 . $$
That is,
$$ \hat{y} \sim N\bigl(x_0^t\beta,\ \sigma^2 x_0^t(X^tX)^{-1}x_0\bigr) . $$
The value actually realized, $y^*$, is also normally distributed:
$$ y^* \sim N(x_0^t\beta,\ \sigma^2) . $$
Hence,
$$ y^* - \hat{y} \sim N\bigl(0,\ \sigma^2(1 + x_0^t(X^tX)^{-1}x_0)\bigr) . $$
This relation can be used for probability forecasts. Although the population variance $\sigma^2$ is unknown, its unbiased estimator is available, and if $n$ is large enough the probability distribution of the realized value is approximately (A5)
$$ y^* \sim N\bigl(\hat{y},\ \hat{\sigma}^2(1 + x_0^t(X^tX)^{-1}x_0)\bigr), \qquad \hat{\sigma}^2 = S/(n-m) . $$
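A sketch of how the last relation can be turned into a probabilistic forecast interval: the central 90% interval for y* is ŷ plus or minus a normal quantile times the square root of σ̂²(1 + x₀^t(X^tX)^{-1}x₀). It uses the normal approximation of this section (the exact version uses the t-distribution, see A5); the data are synthetic.

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x0, alpha=0.9):
    """Central (alpha*100)% interval for a new observation y* given predictors x0,
    following y* ~ N(y_hat, sigma_hat^2 (1 + x0^t (X^t X)^{-1} x0))."""
    n, m = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    a = XtX_inv @ X.T @ y
    S = float((y - X @ a) @ (y - X @ a))
    sigma2_hat = S / (n - m)
    y_hat = float(x0 @ a)
    half = stats.norm.ppf(0.5 + alpha / 2) * np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0))
    return y_hat - half, y_hat + half

# Hypothetical use with synthetic data:
rng = np.random.default_rng(4)
X = rng.standard_normal((40, 3))
y = X @ np.array([0.8, -0.5, 0.3]) + 0.2 * rng.standard_normal(40)
print(prediction_interval(X, y, x0=np.array([0.1, -0.2, 0.5])))
```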



2.4 Significance test of the regression
Suppose an $m$-term regression with sum of squared errors $S_m$ has already been set up, and that adding a new predictor variable to the data matrix gives an $(m+1)$-term regression with sum of squared errors $S_{m+1}$. $S_m \ge S_{m+1}$ is trivial.
For the statistical test of whether the new variable is significant or not, the following relations are very useful; the stepwise technique, one of the model selection methods, employs the same relations. $S_m$, $S_{m+1}$, and $S_m - S_{m+1}$ follow $\chi^2$-distributions:
$$ \frac{S_m}{\sigma^2} \sim \chi^2(n-m), \qquad \frac{S_{m+1}}{\sigma^2} \sim \chi^2(n-m-1), \qquad \frac{S_m - S_{m+1}}{\sigma^2} \sim \chi^2(1) . $$
Because $S_{m+1}$ and $S_m - S_{m+1}$ are independent of each other (A6), the following is derived:
$$ \frac{S_m - S_{m+1}}{S_{m+1}}\,(n-m-1) \sim F(1,\ n-m-1), $$
where $F(k,l)$ denotes the F-distribution with degrees of freedom $k$ and $l$.
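A sketch of the F-test above on synthetic data: the statistic (S_m − S_{m+1})(n−m−1)/S_{m+1} is compared with the F(1, n−m−1) distribution. The predictors and sample size are hypothetical.

```python
import numpy as np
from scipy import stats

def f_test_new_predictor(X_m, x_new, y):
    """F statistic and p-value for adding one predictor column x_new, following
    (S_m - S_{m+1}) / S_{m+1} * (n - m - 1) ~ F(1, n - m - 1)."""
    def sse(X):
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ a
        return float(r @ r)

    n, m = X_m.shape
    S_m = sse(X_m)
    S_m1 = sse(np.column_stack([X_m, x_new]))
    F = (S_m - S_m1) / S_m1 * (n - m - 1)
    p_value = stats.f.sf(F, 1, n - m - 1)
    return F, p_value

# Hypothetical example: does a third predictor significantly reduce the error?
rng = np.random.default_rng(5)
X2 = rng.standard_normal((40, 2))
x3 = rng.standard_normal(40)
y = X2 @ np.array([0.8, -0.5]) + 0.4 * x3 + 0.2 * rng.standard_normal(40)
print(f_test_new_predictor(X2, x3, y))
```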
2.5 Model selection
The multiple regression model using all the predictors in the given data matrix has the minimum $S$, but the optimal model we need is the one that minimizes the prediction error for independent cases. As the number of terms increases, $S$ decreases, but the uncertainty due to the regression coefficients increases. The final prediction error (FPE) accounts for both effects.
FPE is defined as the expected value of $S$ calculated on independent data. That is,
$$ \mathrm{FPE} = E\bigl((y' - Xa)^t(y' - Xa)\bigr), $$
where $y'$ is an independent predictand vector with $y' = X\beta + \varepsilon'$. After a little calculation, the following equation is obtained (A7):
$$ \mathrm{FPE} = (n+m)\sigma^2 . $$
The usual form of FPE is obtained by substituting $\hat{\sigma}^2 = S/(n-m)$:
$$ \mathrm{FPE} = \frac{n+m}{n-m}\,S . $$
The effect of the increasing uncertainty of the regression coefficients as the number of terms increases is represented by the factor $(n+m)/(n-m)$.
One approach to obtaining the optimal regression equation is the all-subsets regression method. Calculating regressions for all possible subsets of the predictors, the subset giving the minimum FPE or AIC is selected as the optimal set. The all-subsets regression method is reliable and had been used in JMA's long-range forecast with the AIC criterion, although the number of subsets to check is $2^m - 1$, which increases rapidly as $m$ increases. The AIC criterion gives almost the same results as the FPE criterion (A8), and it is
$$ \mathrm{AIC} = n\log_e S_m + 2m . $$
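A sketch of all-subsets selection with the FPE and AIC criteria above; the 2^m − 1 subsets are enumerated explicitly. The candidate predictors are synthetic and only illustrate the mechanics, not the operational JMA implementation.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, criterion="aic"):
    """Exhaustive all-subsets regression; returns the predictor subset minimizing
    AIC = n ln S + 2m or FPE = (n + m)/(n - m) * S."""
    n, m_all = X.shape
    best = (np.inf, None)
    for k in range(1, m_all + 1):
        for subset in combinations(range(m_all), k):
            Xs = X[:, list(subset)]
            a, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            r = y - Xs @ a
            S = float(r @ r)
            m = len(subset)
            score = n * np.log(S) + 2 * m if criterion == "aic" else (n + m) / (n - m) * S
            if score < best[0]:
                best = (score, subset)
    return best

# Hypothetical example: only the first two of five candidate predictors matter.
rng = np.random.default_rng(6)
X = rng.standard_normal((40, 5))
y = X[:, 0] - 0.6 * X[:, 1] + 0.2 * rng.standard_normal(40)
print(best_subset(X, y, "aic"))
print(best_subset(X, y, "fpe"))
```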
The other way to obtain the optimal equation is the stepwise method. The method consists of two processes: adding the outside variable that has the highest and statistically significant contribution to the regression, as measured by the F-test above, and eliminating a variable already in the regression that has become insignificant. Both processes are repeated until there is no significant variable outside the regression and no insignificant variable inside it. This method is very convenient and has been used in making guidance from numerical model output at JMA.
2.6 Single regression model
To help in understanding the multiple regression method, let us consider the simplest case, that is, simple regression. Simple regression is no longer used in the long-range forecasting of JMA, but prediction using the simple relation between the width of the polar vortex in autumn and the winter mean temperature had been one of the bases of the cold season outlook until recent years.
The single regression model is written as
$$ y(t) = \beta_0 + \beta_1 x(t) + \varepsilon . $$
Rearranging the above into the notation of this note,
$$ X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \qquad y = X\beta + \varepsilon . $$
Calculating $a = (X^tX)^{-1}X^ty$ and rearranging,
$$ a_0 = \bar{y} - a_1\bar{x}, \qquad a_1 = \frac{\Sigma_{xy}}{\Sigma_{xx}}, \qquad \Sigma_{xx} = \sum_{i=1}^n x_i^2 - n\bar{x}^2, \qquad \Sigma_{xy} = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} . $$
$S = y^ty - a^tX^ty$ is rewritten in this case as
$$ S = \Sigma_{yy}\Bigl(1 - \frac{\Sigma_{xy}^2}{\Sigma_{xx}\Sigma_{yy}}\Bigr) = \Sigma_{yy}(1 - r^2), \qquad \Sigma_{yy} = \sum_{i=1}^n y_i^2 - n\bar{y}^2, $$
where $r$ is the correlation coefficient between $x$ and $y$.
The stochastic properties are also modified accordingly. Writing only the results,
$$ \frac{S}{\sigma^2} \sim \chi^2(n-2), \qquad \hat{\sigma}^2 = \frac{S}{n-2}, \qquad a_1 - \beta_1 \sim N\Bigl(0,\ \frac{\sigma^2}{\Sigma_{xx}}\Bigr), \qquad a_0 - \beta_0 \sim N\Bigl(0,\ \sigma^2\Bigl(\frac{1}{n} + \frac{\bar{x}^2}{\Sigma_{xx}}\Bigr)\Bigr) . $$
Combining the above two results, the following relation is derived:
$$ \sqrt{\frac{(n-2)\,\Sigma_{xx}}{S}}\,(a_1 - \beta_1) \sim t(n-2) . $$
Using this relation, a confidence interval for the slope parameter $\beta_1$ is given as
$$ a_1 - t_\alpha(n-2)\sqrt{\frac{S}{(n-2)\Sigma_{xx}}} \ \le\ \beta_1 \ \le\ a_1 + t_\alpha(n-2)\sqrt{\frac{S}{(n-2)\Sigma_{xx}}} , $$
where $t_\alpha(k)$ is the $\alpha$-quantile of the t-distribution with $k$ degrees of freedom.
A confidence interval for the intercept parameter $\beta_0$ is, likewise,
$$ a_0 - t_\alpha(n-2)\sqrt{\frac{S}{n-2}\Bigl(\frac{1}{n} + \frac{\bar{x}^2}{\Sigma_{xx}}\Bigr)} \ \le\ \beta_0 \ \le\ a_0 + t_\alpha(n-2)\sqrt{\frac{S}{n-2}\Bigl(\frac{1}{n} + \frac{\bar{x}^2}{\Sigma_{xx}}\Bigr)} . $$
A confidence interval for the regression line $a_0 + a_1 x$ is slightly more difficult to derive because of the covariance between $a_0$ and $a_1$. The result is
$$ a_0 + a_1 x - t_\alpha(n-2)\sqrt{\frac{S}{n-2}\Bigl(\frac{1}{n} + \frac{(x-\bar{x})^2}{\Sigma_{xx}}\Bigr)} \ \le\ \beta_0 + \beta_1 x \ \le\ a_0 + a_1 x + t_\alpha(n-2)\sqrt{\frac{S}{n-2}\Bigl(\frac{1}{n} + \frac{(x-\bar{x})^2}{\Sigma_{xx}}\Bigr)} . $$
Finally, a confidence interval for an independently predicted value $y^*$ is as follows:
$$ a_0 + a_1 x - t_\alpha(n-2)\sqrt{\frac{S}{n-2}\Bigl(1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\Sigma_{xx}}\Bigr)} \ \le\ y^* \ \le\ a_0 + a_1 x + t_\alpha(n-2)\sqrt{\frac{S}{n-2}\Bigl(1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\Sigma_{xx}}\Bigr)} . $$
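A sketch collecting the simple-regression formulas of this subsection: slope, intercept, and the t-based prediction interval for a new observation. The autumn-index/winter-temperature pairing mentioned above is only mimicked here with synthetic numbers.

```python
import numpy as np
from scipy import stats

def simple_regression_intervals(x, y, x_new, alpha=0.9):
    """Slope/intercept estimates and the (alpha*100)% prediction interval at x_new,
    using the t(n-2) formulas of section 2.6."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = float(((x - xbar) ** 2).sum())
    Sxy = float(((x - xbar) * (y - ybar)).sum())
    a1 = Sxy / Sxx
    a0 = ybar - a1 * xbar
    S = float(((y - a0 - a1 * x) ** 2).sum())          # sum of squared errors
    t = stats.t.ppf(0.5 + alpha / 2, n - 2)
    # Prediction interval for an independent new observation at x_new.
    half = t * np.sqrt(S / (n - 2) * (1 + 1 / n + (x_new - xbar) ** 2 / Sxx))
    y_pred = a0 + a1 * x_new
    return a0, a1, (y_pred - half, y_pred + half)

# Hypothetical example: autumn polar-vortex index vs. winter mean temperature.
rng = np.random.default_rng(7)
x = rng.standard_normal(30)
y = 0.6 * x + 0.3 * rng.standard_normal(30)
print(simple_regression_intervals(x, y, x_new=1.0))
```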
3 Canonical correlation analysis
3.1 Disadvantages of the multiple regression method in long-range forecasting
Canonical correlation analysis (CCA) can be thought of as an extension of multiple regression analysis. Multiple regression analysis has the following two major disadvantages when applied to long-range forecasting.
1 The covariance matrix is singular
In the data used in long-range forecasting, there is one value of a predictand such as monthly mean temperature per year, and the data length is usually a few tens. On the other hand, taking global grid point values as predictors, the number of variables can be a few hundred or a few thousand; that is, usually $n \ll m$ in our case. Considering the fact that $\mathrm{rank}(X) = \mathrm{rank}(X^t) = \mathrm{rank}(X^tX) = \mathrm{rank}(XX^t)$, the $m \times m$ covariance matrix $X^tX$ is singular (A9) and its inverse $(X^tX)^{-1}$ cannot be obtained in the usual way.
2 Correlations among the predictands are not taken into account
In the long-range forecast at JMA, there are four quantities to predict, namely the mean temperatures of northern, eastern, western and southern Japan. If a multiple regression equation is made for each region, the correlations of the temperatures among the regions are not explicitly taken into account in each equation.
CCA treats the predictands as a predictand data matrix and directly considers the relation between the predictor data matrix and the predictand data matrix. In the case where the predictand matrix reduces to a vector, CCA gives a result identical to that of multiple regression analysis.
3.2 EOF analysis as a pre-processing
The CCA used in the meteorological field is slightly different from the usual CCA described in standard textbooks on multivariate analysis. To avoid the first problem mentioned above, both the predictand and predictor matrices are analyzed by Empirical Orthogonal Function (EOF) analysis and are transformed into new full-rank matrices.
Let $Y$ and $X$ be the predictand and predictor matrices and $m_y$ and $m_x$ their numbers of variables, respectively; the data length $n$ is the same for both matrices. After EOF analysis is performed on both matrices, both are transformed into matrices whose ranks are full. Let $F$ and $E$ be the eigenvector matrices of $Y^tY$ and $X^tX$, and $D_x = \mathrm{Diag}(\lambda_1,\dots,\lambda_p)$, $D_y = \mathrm{Diag}(\mu_1,\dots,\mu_q)$ the diagonal matrices of their eigenvalues. The numbers of retained eigenvalues, $p$ and $q$, are usually limited to less than the data length $n$ for meteorological data (A10).
The new data matrices are represented using the above notation as follows:
$$ A = XED_x^{-1/2}, \qquad B = YFD_y^{-1/2} . $$
Both matrices have orthonormal columns, as shown below:
$$ A^tA = D_x^{-1/2}E^tX^tXED_x^{-1/2} = D_x^{-1/2}E^tED_xD_x^{-1/2} = I_p , $$
and similarly $B^tB = I_q$; both are of course full-rank matrices (A11).
In the routine CCA models at JMA and in the USA, a reduction of the matrix dimension is also made. That is, eigenvectors with relatively small eigenvalues are ignored and are not used to represent the matrices $A$, $B$; four or five eigenvectors are used in JMA's model, and probably in the USA's as well. This reduction is based on the idea that eigencomponents with larger eigenvalues, such as the well-known AO pattern, which is the first component of the 500-hPa height field, carry much of the information for predicting future phenomena, while the components with small eigenvalues are rather noise, although strictly speaking this has no meteorological or mathematical basis.
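A sketch of the EOF pre-processing described above: the eigenvectors E and eigenvalues D_x of X^tX are computed, truncated to a few leading modes, and A = XED_x^{-1/2} is formed so that A^tA = I. The grid size and the number of retained modes below are arbitrary.

```python
import numpy as np

def eof_whiten(X, n_modes):
    """EOF pre-processing of section 3.2: eigenvectors E and eigenvalues of X^t X,
    truncated to the leading n_modes, giving A = X E D^{-1/2} with A^t A = I."""
    evals, evecs = np.linalg.eigh(X.T @ X)      # ascending eigenvalue order
    order = np.argsort(evals)[::-1][:n_modes]   # keep the largest eigenvalues
    E = evecs[:, order]
    D_inv_sqrt = np.diag(1.0 / np.sqrt(evals[order]))
    A = X @ E @ D_inv_sqrt
    return A, E, evals[order]

# Hypothetical example: 30 years x 500 grid points, truncated to 5 EOF modes.
rng = np.random.default_rng(8)
X = rng.standard_normal((30, 500))
X -= X.mean(axis=0)                             # work with anomalies
A, E, lams = eof_whiten(X, n_modes=5)
print(A.shape, np.allclose(A.T @ A, np.eye(5)))
```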
3.3 Main part of CCA
Consider linear combinations of the columns of the matrices $A$ and $B$:
$$ u = Ar, \qquad v = Bs . $$
The two new variables are determined under the condition that the correlation between them becomes maximum. Their variances and covariance are
$$ u^tu = (Ar)^tAr = r^tA^tAr = r^tr, \qquad v^tv = (Bs)^tBs = s^tB^tBs = s^ts, \qquad u^tv = (Ar)^tBs = r^tA^tBs . $$
Using these relations, the vectors $u$, $v$ are found by making the following Lagrange function stationary:
$$ L = r^tA^tBs - \frac{\lambda}{2}(r^tr - 1) - \frac{\mu}{2}(s^ts - 1), $$
where $\lambda$, $\mu$ are Lagrange multipliers. The following equations are obtained after some calculation (A12):
$$ A^tBs - \lambda r = 0, \qquad B^tAr - \lambda s = 0 . $$
Now setting $C = A^tB$ and rearranging the above, we obtain
$$ C^tCs - \lambda^2 s = 0, \qquad CC^tr - \lambda^2 r = 0 . $$
These are eigenvalue problems, but they are not independent of each other. A pair of eigenvectors is related by
$$ r_j = \frac{Cs_j}{\lambda_j}, \qquad s_j = \frac{C^tr_j}{\lambda_j}, $$
and the number of non-zero eigenvalues is limited to the minimum of $q$ and $p$ (A13).
Sorting the eigenvalues and eigenvectors in descending order, the pair $u$, $v$ corresponding to the maximum eigenvalue is called the first pair of canonical variables. The correlation coefficient between them is just the corresponding $\lambda_j$, from the relation
$$ u_j^tv_j = r_j^tA^tBs_j = \lambda_j r_j^tr_j = \lambda_j . $$
The second pair of canonical variables is defined in the same way, and so on. The canonical variables satisfy the following orthogonality relations:
$$ u_j^tu_k = \delta_{jk}, \qquad v_j^tv_k = \delta_{jk}, \qquad u_j^tv_k = \lambda_k\delta_{jk} \qquad (j,k = 1,\dots,q) . $$
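A sketch of this eigenvalue formulation on synthetic orthonormal matrices A and B (stand-ins for the EOF-whitened data): C^tC s = λ²s is solved, r_j = Cs_j/λ_j is formed, and the canonical variables are collected into the matrices U and V defined just below, for which U^tV is the diagonal matrix of the λ_j.

```python
import numpy as np

def cca_core(A, B):
    """Core CCA step of section 3.3: with orthonormal A, B, solve the eigenvalue
    problem of C^t C and return canonical correlations and weight matrices R, S."""
    C = A.T @ B
    lam2, S = np.linalg.eigh(C.T @ C)           # C^t C s = lambda^2 s
    order = np.argsort(lam2)[::-1]              # descending order
    S = S[:, order]
    lam = np.sqrt(np.clip(lam2[order], 0.0, None))
    R = (C @ S) / lam                           # r_j = C s_j / lambda_j
    U, V = A @ R, B @ S                         # canonical variables
    return lam, R, S, U, V

# Hypothetical example with random orthonormal stand-ins for whitened data.
rng = np.random.default_rng(9)
A = np.linalg.qr(rng.standard_normal((30, 5)))[0]
B = np.linalg.qr(rng.standard_normal((30, 4)))[0]
lam, R, S, U, V = cca_core(A, B)
print(lam)                                          # canonical correlations
print(np.allclose(U.T @ V, np.diag(lam), atol=1e-8))
```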
Defining the new matrices
$$ U = (u_1,\dots,u_q), \qquad V = (v_1,\dots,v_q), \qquad R = (r_1,\dots,r_q), \qquad S = (s_1,\dots,s_q), $$
the above orthogonality relations of the canonical variables are expressed as
$$ U^tU = I, \qquad V^tV = I, \qquad U^tV = V^tU = \Lambda = \mathrm{Diag}(\lambda_1,\dots,\lambda_q) . $$
3.4 Prediction form of CCA
The matrix $U$ is the new matrix corresponding to the predictor data matrix $X$, and $V$ corresponds to the predictand data matrix $Y$. Consider the multiple regression prediction of $V$ using $U$, that is, suppose the following regression equation:
$$ V = UD, $$
where $D$ is the regression coefficient matrix to be determined. The normal equation in this case is
$$ U^tUD = U^tV . $$
Rearranging the above by use of the orthogonality relations,
$$ D = \Lambda . $$
That is, the coefficient matrix is the diagonal matrix consisting of the eigenvalues. The resultant regression is
$$ \hat{V} = U\Lambda , $$
where $\hat{V}$ is the predicted matrix of $V$. This predictive equation avoids the two disadvantages mentioned in Section 3.1: $\hat{V}$ can be calculated using the $n \times q$ matrix $U$ and the $q \times q$ diagonal matrix $\Lambda$, so the rank deficiency is avoided, and $\hat{V}$ represents a time series of patterns made up of the predictands, so the correlations between the predictand variables are taken into account, as expected.
Our purpose in CCA is not to obtain $\hat{V}$ itself but the original predictand variables, so a transformation of the output variables is needed in the routine CCA model. In the routine model at JMA, a transform matrix $H$ is estimated which satisfies
$$ Y = VH, \qquad H = V^tY . $$
The final prediction equation is then
$$ \hat{V} = U\Lambda, \qquad \hat{Y} = \hat{V}H = U\Lambda H , $$
which is the CCA counterpart of the multiple regression prediction $\hat{Y} = X(X^tX)^{-1}X^tY$.
[Figure: flow chart of the CCA prediction. The predictor matrix $X$ (real space) is transformed to $A$ (EOF space) and then to the canonical variables $U$ (CCA space); the predictand matrix $Y$ is transformed to $B$ and then to $V$. The multiple regression prediction $\hat{Y} = X(X^tX)^{-1}X^tY$ corresponds to the CCA prediction $\hat{V} = U\Lambda$, $\hat{Y} = U\Lambda H$.]
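A sketch of the prediction form above, with synthetic stand-ins for all matrices. The singular value decomposition of C = A^tB is used, which is equivalent to the pair of eigenvalue problems of Section 3.3; H = V^tY and the forecast Ŷ = UΛH are formed for a new predictor sample given in EOF space.

```python
import numpy as np

def cca_predict(Y, A, B, a_new):
    """Prediction step of section 3.4: build the canonical variables from the
    orthonormal (EOF-whitened) matrices A, B, form D = Lambda and H = V^t Y, and
    forecast the predictands for a new predictor sample a_new given in EOF space."""
    C = A.T @ B
    R, lam, St = np.linalg.svd(C, full_matrices=False)   # C = R Lambda S^t
    S = St.T
    U, V = A @ R, B @ S                                  # canonical variables
    H = V.T @ Y                                          # back-transform, Y ~ V H
    u_new = a_new @ R                                    # new sample in CCA space
    return (u_new * lam) @ H                             # y_hat = u_new Lambda H

# Hypothetical example: 30 training years, 5 predictor modes, 4 predictand regions.
rng = np.random.default_rng(10)
A = np.linalg.qr(rng.standard_normal((30, 5)))[0]   # whitened predictor matrix
B = np.linalg.qr(rng.standard_normal((30, 4)))[0]   # whitened predictand matrix
Y = rng.standard_normal((30, 4))                    # predictand anomalies (4 regions)
a_new = rng.standard_normal(5)                      # new year, predictor EOF scores
print(cca_predict(Y, A, B, a_new))
```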
Appendix
A1)
$$ S = \|y - Xa\|^2 = (y - Xa)^t(y - Xa) = y^ty - y^tXa - a^tX^ty + a^tX^tXa . $$
The solution must satisfy the relation $\partial S/\partial a_i = 0$. With summation over repeated indices,
$$ \frac{\partial}{\partial a_k}\bigl(y^tXa\bigr) = \frac{\partial}{\partial a_k}\bigl(y_iX_{ij}a_j\bigr) = y_iX_{ij}\delta_{jk} = X_{ik}y_i = (X^t)_{ki}y_i , $$
$$ \frac{\partial}{\partial a_k}\bigl(a^tX^ty\bigr) = \frac{\partial}{\partial a_k}\bigl(a_i(X^t)_{ij}y_j\bigr) = \delta_{ik}(X^t)_{ij}y_j = (X^t)_{kj}y_j , $$
$$ \frac{\partial}{\partial a_k}\bigl(a^tX^tXa\bigr) = \frac{\partial}{\partial a_k}\bigl(a_lX_{il}X_{ij}a_j\bigr) = \delta_{kl}X_{il}X_{ij}a_j + a_lX_{il}X_{ij}\delta_{jk} = X_{ik}X_{ij}a_j + X_{ik}X_{il}a_l = 2(X^t)_{ki}X_{ij}a_j . $$
The final equation, the so-called normal equation, is
$$ X^tXa - X^ty = 0 , $$
and its solution is
$$ a = (X^tX)^{-1}X^ty . $$
A2)
$y$ can be uniquely decomposed into two vectors $y = y_1 + y_2$ with $y_1 \in S(X)$ and $y_2 \in S(X)^\perp$. There always exists a vector $\gamma$ satisfying $y_1 = X\gamma$, and $y_1^ty_2 = \gamma^tX^ty_2 = 0$. From the above, $X^ty_2 = 0$ and $X^ty = X^tX\gamma$. In consequence, we can derive the relation $y_1 = X\gamma = X(X^tX)^{-1}X^ty$.
A3)
The dimensions of $S(X)$ and $S(X)^\perp$ are $m$ and $n-m$, respectively, and orthonormal bases exist in both spaces, with $m$ and $n-m$ members. Two matrices can be composed from these bases as follows:
$$ C_1 = (c_1,\dots,c_m), \qquad C_2 = (c_{m+1},\dots,c_n), \qquad S(X) = S(C_1), \qquad S(X)^\perp = S(C_2) . $$
The orthogonal projection matrices can be rewritten using these two matrices, and the decomposition of the noise vector becomes
$$ \varepsilon = C_1C_1^t\varepsilon + C_2C_2^t\varepsilon . $$
Multiplying by the matrix $C^t$ with $C = (C_1, C_2)$,
$$ C^t\varepsilon = (\varepsilon'_1,\dots,\varepsilon'_n)^t = C^tC_1C_1^t\varepsilon + C^tC_2C_2^t\varepsilon = (\varepsilon'_1,\dots,\varepsilon'_m,0,\dots,0)^t + (0,\dots,0,\varepsilon'_{m+1},\dots,\varepsilon'_n)^t . $$
As the matrix $C$ is orthogonal, the components of the new noise vector $C^t\varepsilon$ are also independently normally distributed. In consequence,
$$ \frac{S}{\sigma^2} = \frac{\varepsilon'^2_{m+1} + \dots + \varepsilon'^2_n}{\sigma^2} \sim \chi^2(n-m) . $$
A4)
$$ E(a) = \beta + (X^tX)^{-1}X^tE(\varepsilon) = \beta , $$
$$ E\bigl((a-\beta)(a-\beta)^t\bigr) = (X^tX)^{-1}X^tE(\varepsilon\varepsilon^t)X(X^tX)^{-1} = \sigma^2(X^tX)^{-1} . $$
The independence of $a$ and $S$ can be shown from
$$ E\bigl((a-\beta)\,\varepsilon^t(I - X(X^tX)^{-1}X^t)\bigr) = (X^tX)^{-1}X^tE(\varepsilon\varepsilon^t)(I - X(X^tX)^{-1}X^t) = 0 . $$
A5)
The precise distribution can be represented using the t-distribution. Since
$$ y^* - \hat{y} \sim N\bigl(0,\ \sigma^2(1 + x_0^t(X^tX)^{-1}x_0)\bigr), \qquad \frac{S}{\sigma^2} \sim \chi^2(n-m) $$
have already been shown, the following formula can be derived:
$$ \frac{y^* - \hat{y}}{\sqrt{\dfrac{S}{n-m}\bigl(1 + x_0^t(X^tX)^{-1}x_0\bigr)}} \sim t(n-m) . $$
The symbol $t(k)$ represents the t-distribution with $k$ degrees of freedom.
A6)
Let $P_m$, $P_{m+1}$ be the orthogonal projection matrices onto $S(X_m)$, $S(X_{m+1})$, respectively. Using this notation, the sums of squared errors are given as
$$ S_m = \varepsilon^t(I - P_m)\varepsilon, \qquad S_{m+1} = \varepsilon^t(I - P_{m+1})\varepsilon, \qquad S_m - S_{m+1} = \varepsilon^t(P_{m+1} - P_m)\varepsilon . $$
The covariance between the two projected noise vectors that generate $S_{m+1}$ and $S_m - S_{m+1}$ is
$$ E\bigl((I - P_{m+1})\varepsilon\,\varepsilon^t(P_{m+1} - P_m)\bigr) = \sigma^2(I - P_{m+1})(P_{m+1} - P_m) = \sigma^2\bigl(P_{m+1} - P_m - P_{m+1} + P_{m+1}P_m\bigr) . $$
Considering the relation $S(X_m) \subset S(X_{m+1})$, $P_{m+1}P_m = P_m$ is obvious, so the right-hand side is zero. The two projected Gaussian vectors are therefore independent of each other, which implies the independence of $S_{m+1}$ and $S_m - S_{m+1}$.
A7)
Substituting the equations $y = X\beta + \varepsilon$, $a = (X^tX)^{-1}X^ty$ and $y' = X\beta + \varepsilon'$,
$$ (y' - Xa)^t(y' - Xa) = \varepsilon'^t\varepsilon' - 2\varepsilon'^tX(X^tX)^{-1}X^t\varepsilon + \varepsilon^tX(X^tX)^{-1}X^t\varepsilon . $$
Observing the independence of the two noise vectors, that is $E(\varepsilon'^t\varepsilon) = 0$, and noting the relations
$$ E\bigl(\varepsilon^tX(X^tX)^{-1}X^t\varepsilon\bigr) = \sigma^2 m, \qquad E(\varepsilon'^t\varepsilon') = \sigma^2 n , $$
the FPE formula is obtained:
$$ \mathrm{FPE} = (n+m)\sigma^2 . $$
A8)
$$ n\log_e\mathrm{FPE} = n\log_e\Bigl(\frac{n+m}{n-m}S_m\Bigr) = n\log_e S_m + n\log_e\Bigl(1 + \frac{m}{n}\Bigr) - n\log_e\Bigl(1 - \frac{m}{n}\Bigr) \approx n\log_e S_m + n\,\frac{2m}{n} = n\log_e S_m + 2m = \mathrm{AIC} . $$
A9)
Let $S(X)$ and $\mathrm{Ker}(X)$ be the column space of the matrix $X$ and the null space of $X$, respectively. That is,
$$ S(X) = \{X\alpha;\ \alpha \in V^m\}, \qquad \mathrm{Ker}(X) = \{\alpha;\ X\alpha = 0,\ \alpha \in V^m\} . $$
The rank of the matrix $X$ is defined as the dimension of $S(X)$: $\mathrm{rank}(X) = \dim S(X)$.
First, the facts $\mathrm{Ker}(X^t) = S(X)^\perp$ and $\mathrm{Ker}(X) = S(X^t)^\perp$ will be shown.
Considering the inner product of $\alpha_1 \in \mathrm{Ker}(X^t)$ and $\alpha_2 = X\gamma_2 \in S(X)$,
$$ \alpha_1^t\alpha_2 = \alpha_1^tX\gamma_2 = \gamma_2^t(X^t\alpha_1) = 0 \ \Rightarrow\ \alpha_1 \in S(X)^\perp \ \Rightarrow\ \mathrm{Ker}(X^t) \subset S(X)^\perp . $$
Also, considering the inner product of $\alpha_1 \in S(X)^\perp$ and $\alpha_2 = X\gamma_2 \in S(X)$, which vanishes for every $\gamma_2$,
$$ \alpha_1^t\alpha_2 = \alpha_1^tX\gamma_2 = \gamma_2^t(X^t\alpha_1) = 0 \ \Rightarrow\ X^t\alpha_1 = 0 \ \Rightarrow\ \alpha_1 \in \mathrm{Ker}(X^t) \ \Rightarrow\ S(X)^\perp \subset \mathrm{Ker}(X^t) . $$
In consequence, $\mathrm{Ker}(X^t) = S(X)^\perp$, and a similar argument gives $\mathrm{Ker}(X) = S(X^t)^\perp$.
Next, $\mathrm{rank}(X) = \mathrm{rank}(X^t) = \mathrm{rank}(X^tX) = \mathrm{rank}(XX^t)$ will be shown.
Any $\alpha \in V^m$ can be decomposed, from the above results, as
$$ \alpha = \alpha_1 + \alpha_2, \qquad \alpha_1 \in S(X^t), \qquad \alpha_2 \in \mathrm{Ker}(X) . $$
Multiplying the above by $X$,
$$ X\alpha = X\alpha_1 + X\alpha_2 = X\alpha_1 = XX^t\gamma_1, \qquad \gamma_1 \in V^n , $$
which indicates $S(X) \subset S(XX^t)$. On the other hand, $S(XX^t) \subset S(X)$ always holds, so that
$$ S(X) = S(XX^t) \ \Rightarrow\ \dim S(X) = \dim S(XX^t) \ \Rightarrow\ \mathrm{rank}(X) = \mathrm{rank}(XX^t) . $$
In the same way, $S(X^t) = S(X^tX)$ and $\mathrm{rank}(X^t) = \mathrm{rank}(X^tX)$, and, using the general relation $\mathrm{rank}(XY) \le \min\bigl(\mathrm{rank}(X), \mathrm{rank}(Y)\bigr)$, $\mathrm{rank}(X) = \mathrm{rank}(X^t)$ can also be shown. In consequence,
$$ \mathrm{rank}(X) = \mathrm{rank}(X^t) = \mathrm{rank}(X^tX) = \mathrm{rank}(XX^t) . $$
A10)
In general, there are $m_x$ solutions of the eigenvalue problem $X^tXe = \lambda e$; the eigenvectors are perpendicular to each other and the eigenvalues are real and non-negative. If $\lambda \neq 0$, then $e = X^tXe/\lambda \in S(X^tX)$, but the number of orthogonal vectors in $S(X^tX)$ cannot exceed $\dim S(X^tX) = \mathrm{rank}(X) \le n$. So the number of non-zero eigenvalues is limited to $\mathrm{rank}(X)$ in the given case.
The non-negativity of the eigenvalues and the orthogonality of the eigenvectors of the problem $X^tXe = \lambda e$ can be shown as below.
1 Orthogonality
For two different eigenvalues,
$$ X^tXe_i = \lambda_ie_i, \qquad X^tXe_j = \lambda_je_j . $$
Multiplying each on the left by the transpose of the other eigenvector,
$$ e_j^tX^tXe_i = \lambda_ie_j^te_i, \qquad e_i^tX^tXe_j = \lambda_je_i^te_j . $$
Subtracting the two equations,
$$ (\lambda_i - \lambda_j)\,e_j^te_i = 0 , $$
which indicates orthogonality.
2 Non-negativity
Multiplying by the transposed eigenvector on the left and noting that the inner product of a vector with itself is a squared length,
$$ e_i^tX^tXe_i = (Xe_i)^t(Xe_i) = \lambda_ie_i^te_i \ \Rightarrow\ \lambda_i \ge 0 . $$
A11)
The column vectors of $A$ lie in the subspace $S(X)$ and are orthonormal, and also $\dim S(X) = \mathrm{rank}(A)$. Hence $S(X) = S(A)$. From this, the orthogonal projection onto $S(X)$ is identical to that onto $S(A)$. That is,
$$ \hat{y} = X(X^tX)^{-1}X^ty = A(A^tA)^{-1}A^ty = AA^ty . $$
The last expression for $\hat{y}$ can be calculated while avoiding the rank deficiency of $(X^tX)^{-1}$. Additionally, substituting $A = XED_x^{-1/2}$ into the equation, we get
$$ \hat{y} = XED_x^{-1/2}A^ty \ \Rightarrow\ a = ED_x^{-1/2}A^ty = X^+y . $$
Here $X^+$ is identical to the so-called Moore-Penrose generalized inverse.
A12)
$$ L = r_kA_{ik}B_{ij}s_j - \frac{\lambda}{2}(r_kr_k - 1) - \frac{\mu}{2}(s_ks_k - 1) . $$
The partial derivative with respect to $r_l$ is
$$ \frac{\partial L}{\partial r_l} = A_{ik}B_{ij}s_j\delta_{kl} - \lambda r_k\delta_{kl} = A_{il}B_{ij}s_j - \lambda r_l . $$
Setting the derivative to zero and rearranging into matrix notation,
$$ A^tBs - \lambda r = 0 . $$
The partial derivative with respect to $s$ is treated in the same way, and the following are obtained:
$$ A^tBs - \lambda r = 0, \qquad B^tAr - \mu s = 0 . $$
Multiplying on the left by $r^t$ and $s^t$, respectively, and considering $r^tr = 1$, $s^ts = 1$,
$$ \lambda = \mu = (Ar)^tBs = (Bs)^tAr . $$
A13)
Multiplying $CC^tr = \lambda^2r$ on the left by $C^t$, we obtain $C^tCC^tr = \lambda^2C^tr$. This means that $C^tr$ is proportional to the eigenvector $s$. Using the relation $r^tCC^tr = \lambda^2$, we obtain
$$ s = \frac{C^tr}{\lambda} . $$