Some Different Perspectives on Linear Least Squares
A standard problem in statistics is to measure a response or dependent variable, $y$, at fixed values of one or more independent variables. Sometimes there exists a deterministic model $y = f(x, \theta)$, where $x$ stands collectively for all of the independent variables of the system and $\theta$ stands collectively for all of the model parameters. In the probabilistic model it is often assumed that each measured $y$ value can be expressed as $y_i = f(x_i, \theta) + \varepsilon_i$, where $i$ is an index that labels the fixed input condition $x_i$ and $\varepsilon_i$ represents the random error associated with the given measurement. For a valid model the mean random error should be zero. For a given set of $n$ data points the problem of choosing the "best" values for the parameters is typically solved via the method of least squares: the parameters are chosen to minimize the function
$$F(\theta) = \sum_{i=1}^{n} \left[ y_i - f(x_i, \theta) \right]^2 .$$
If the function $f(x_i, \theta)$ is linear in $\theta$, the procedure is referred to as a linear least squares or linear regression problem. The purpose of these notes is to examine the least squares solution from several different points of view.
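To make the objective concrete, here is a minimal Python/NumPy sketch that evaluates $F(\theta)$ for a straight-line model $f(x, \theta) = \theta_1 x + \theta_0$; the data points and trial parameter values are invented purely for illustration.

    import numpy as np

    # Invented data: fixed inputs x_i and measured responses y_i (hypothetical values).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    def F(theta, x, y):
        """Sum of squared residuals F(theta) = sum_i [y_i - f(x_i, theta)]^2
        for the straight-line model f(x, theta) = theta[0]*x + theta[1]."""
        residuals = y - (theta[0] * x + theta[1])
        return np.sum(residuals ** 2)

    # Evaluate the objective at two trial parameter vectors; the least squares
    # solution is the theta that makes F as small as possible.
    print(F(np.array([2.0, 1.0]), x, y))
    print(F(np.array([1.5, 0.5]), x, y))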
The Least Squares Approximation to a Vector in an n-Dimensional Euclidean Space
Consider an $n$-dimensional Euclidean space $V$ with the inner product of vectors $u$ and $v$ denoted as $(u, v)$. Let $W$ be a subspace of $V$; then $V = W \oplus W^{\perp}$, where $W^{\perp}$ is the orthogonal complement of $W$. Let $v$ be any element of $V$; then $v = w + u$, where $w \in W$ and $u \in W^{\perp}$. The vector $w$ is called the projection of $v$ onto $W$. Let $g_v(\omega) = \| v - \omega \|^2 = (v - \omega, v - \omega)$. The following argument shows that $\omega = w$ is the unique minimum of this function on $W$. First, by expanding the inner product, $g_v(\omega)$ can be expressed as
$$g_v(\omega) = \| (v - w) - (\omega - w) \|^2 = \| v - w \|^2 - 2\,(v - w, \omega - w) + \| \omega - w \|^2 .$$
If $\omega \in W$, then $\omega - w \in W$ and $(v - w, \omega - w) = (u, \omega - w) = 0$. So on $W$, $g_v(\omega) = \| u \|^2 + \| \omega - w \|^2 \geq \| u \|^2$, with equality only at $\omega = w$. If $V = \mathbb{R}^n$ and $W$ has the orthonormal basis $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m$ with $m \leq n$, then the column vector which is the projection of $v$ onto $W$ is given by $w = \sum_{j=1}^{m} (\hat{C}_j, v)\, \hat{C}_j$. Define the $n \times m$ matrix $Q$ as $Q = [\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m]$, that is, $Q_{i,j} = (\hat{C}_j)_{i,1}$. The $n \times n$ matrix $Q Q^T$, called the "outer product", projects $v$ onto $W$:
$$w_{i,1} = \sum_{j=1}^{m} (\hat{C}_j, v)\, (\hat{C}_j)_{i,1} = \sum_{j=1}^{m} (\hat{C}_j)_{i,1}\, (Q^T v)_{j,1} = \sum_{j=1}^{m} Q_{i,j}\, (Q^T v)_{j,1} = (Q Q^T v)_{i,1} .$$
Thus, the vector $Q Q^T v$ minimizes the squared distance from $v$ to $W$.
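As a numerical illustration of this projection property, the following sketch (assuming NumPy; the subspace and the vector $v$ are arbitrary choices, and numpy.linalg.qr is used only to manufacture an orthonormal basis) verifies that $v - QQ^Tv$ is orthogonal to $W$ and that the projection is the closest point of $W$ to $v$.

    import numpy as np

    # Columns of A span a 2-dimensional subspace W of R^4 (arbitrary example vectors).
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    Q, _ = np.linalg.qr(A)      # columns of Q: an orthonormal basis for W

    v = np.array([2.0, -1.0, 0.5, 3.0])
    w = Q @ (Q.T @ v)           # projection of v onto W via the outer product Q Q^T
    u = v - w                   # remaining component, which should lie in W-perp

    print(Q.T @ u)              # ~ [0, 0]: u is orthogonal to every basis vector of W
    # Any other element of W is farther from v than the projection w is:
    other = Q @ np.array([1.0, -2.0])
    print(np.linalg.norm(v - w) <= np.linalg.norm(v - other))   # True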
Linear Least Squares in Matrix Notation: The Generalized Inverse
If a model is linear in its $m$ parameters designated by the vector $b$ in $\mathbb{R}^m$, then all the $n$ estimated $y$ responses can be generated as a vector in $\mathbb{R}^n$ given by $Ab$, where $A$ is an $n \times m$ matrix. We will assume that $A$ has rank $m$ with $m \leq n$. If the rank of $A$ is less than $m$, then the model parameters are not all independent and need to be reduced in number so as to make a "design matrix" with all of its columns linearly independent. Denote the $k$'th column of $A$ as $A_k$, i.e., $A = [A_1, A_2, \ldots, A_m]$. Since the columns of $A$ are linearly independent vectors in $\mathbb{R}^n$, $Ab = \sum_{k=1}^{m} b_k A_k = 0_n$ if and only if each $b_k = 0$. Thinking of $A$ as a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}^n$, the image of $A$, $W$, is the span of $\{A_1, A_2, \ldots, A_m\}$ and has dimension $m$, while the kernel of $A$ has dimension zero and consists of only the zero vector $0_m$. Every vector in $W^{\perp}$ is orthogonal to each column of $A$; stated differently, if $u \in W^{\perp}$, then $A^T u = 0_m$. Conversely, if $A^T u = 0_m$, then every column of $A$ is perpendicular to $u$, so $u \in W^{\perp}$. Thus, the kernel of $A^T$ is the orthogonal complement of the column space of $A$. Let $y \in \mathbb{R}^n$; then $y = w + u$, where $w \in W$ and $u \in W^{\perp}$. Unless $u = 0_n$, the system of equations $Ab = y$ has no solutions. However, $Ab = w$ has a solution, $\beta$, and since $\{A_1, A_2, \ldots, A_m\}$ is a linearly independent set, this solution is unique. Furthermore, $w$ is the closest vector in the column space of $A$ to $y$, and the minimum value of $F_y(b) = \| y - Ab \|^2$ as $b$ varies over $\mathbb{R}^m$ is $F_y(\beta) = \| y - w \|^2 = \| y - A\beta \|^2$. Now, $A\beta = w$, so
$$A^T A \beta = A^T w = A^T (y - u) = A^T y - A^T u = A^T y, \quad \text{since } \ker A^T = W^{\perp} .$$
Suppose that for $b \in \mathbb{R}^m$, $A^T A b = 0_m$; then $b^T A^T A b = (Ab)^T (Ab) = (Ab, Ab) = \| Ab \|^2 = 0$. So, if $A^T A b = 0_m$, then $Ab = 0_n$, which implies that $b = 0_m$ since the kernel of $A$ consists of only $0_m$. Therefore, $A^T A$ is nonsingular and the least squares solution minimizing $F_y(b)$ is given by $\beta = (A^T A)^{-1} A^T y$. The $m \times n$ matrix $(A^T A)^{-1} A^T$ is sometimes called the generalized inverse of the $n \times m$ matrix $A$.
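A minimal sketch of the generalized-inverse solution on invented data follows, with NumPy's own least squares routine used only as an independent check.

    import numpy as np

    # Hypothetical n x m design matrix (n = 5, m = 2) with linearly independent columns,
    # and a hypothetical response vector y.
    A = np.array([[1.0, 1.0],
                  [2.0, 1.0],
                  [3.0, 1.0],
                  [4.0, 1.0],
                  [5.0, 1.0]])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Generalized inverse (A^T A)^{-1} A^T applied to y gives the least squares solution.
    beta = np.linalg.inv(A.T @ A) @ A.T @ y

    # Independent check with NumPy's least squares routine.
    beta_check, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(beta, beta_check)            # the two solutions agree

    # The residual y - A beta lies in the orthogonal complement of col(A): A^T r ~ 0.
    print(A.T @ (y - A @ beta))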
An Alternate Formulation of Linear Least Squares Using Multivariable Calculus
The least squares solution is the vector $\beta$ in $\mathbb{R}^m$ that minimizes the function of $m$ variables
$$F_y(b) = \| y - Ab \|^2 = (y - Ab, y - Ab) = (y, y) - 2\,(Ab, y) + (Ab, Ab)$$
as $b$ varies over $\mathbb{R}^m$. Rewriting this function using sigma notation gives the following expression.
$$F_y(b) = (y, y) - 2 \sum_{i=1}^{n} \sum_{j=1}^{m} A_{i,j}\, b_j\, y_i + \sum_{\ell=1}^{m} \sum_{j=1}^{m} b_\ell\, b_j\, (A^T A)_{\ell,j}$$
Now a necessary condition for $F_y(b)$ to be at a minimum is that all of the partial derivatives $\dfrac{\partial F_y}{\partial b_k}$ vanish for every integer $k$ with $1 \leq k \leq m$. Since $\dfrac{\partial b_j}{\partial b_k} = \delta_{j,k}$, the following simplification results.
$$\frac{\partial F_y}{\partial b_k} = 0 - 2 \sum_{i=1}^{n} \sum_{j=1}^{m} A_{i,j}\, \frac{\partial b_j}{\partial b_k}\, y_i + \sum_{\ell=1}^{m} \sum_{j=1}^{m} \left( \frac{\partial b_\ell}{\partial b_k}\, b_j + b_\ell\, \frac{\partial b_j}{\partial b_k} \right) (A^T A)_{\ell,j} = -2 \sum_{i=1}^{n} A_{i,k}\, y_i + \sum_{j=1}^{m} (A^T A)_{k,j}\, b_j + \sum_{\ell=1}^{m} b_\ell\, (A^T A)_{\ell,k} = -2\, (A^T y)_{k,1} + 2\, (A^T A b)_{k,1}$$
The condition $\dfrac{\partial F_y}{\partial b_k} = 0$ for every integer $k$ with $1 \leq k \leq m$ requires that $A^T A b = A^T y$. This has the unique solution $\beta = (A^T A)^{-1} A^T y$, which is identical to the result obtained in the last section.
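The gradient formula $\partial F_y / \partial b_k = -2 (A^T y)_k + 2 (A^T A b)_k$ can be checked numerically against finite differences; the data and trial point below are invented for illustration.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [2.0, 1.0],
                  [3.0, 1.0],
                  [4.0, 1.0]])
    y = np.array([1.9, 4.1, 6.0, 8.2])

    def F(b):
        r = y - A @ b
        return r @ r                      # F_y(b) = ||y - A b||^2

    b = np.array([0.7, -0.3])             # arbitrary trial point
    grad_formula = -2 * A.T @ y + 2 * A.T @ A @ b

    # Central finite-difference approximation of each partial derivative.
    h = 1e-6
    grad_numeric = np.array([
        (F(b + h * e) - F(b - h * e)) / (2 * h)
        for e in np.eye(2)
    ])
    print(grad_formula, grad_numeric)     # the two gradients agree closely

    # Setting the gradient to zero gives the normal equations A^T A b = A^T y.
    beta = np.linalg.solve(A.T @ A, A.T @ y)
    print(-2 * A.T @ y + 2 * A.T @ A @ beta)   # ~ zero vector at the minimizer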
A Formulation of Linear Least Squares Using QR Factorization
Since the columns of $A$ are linearly independent, a Gram-Schmidt procedure on the columns yields an orthonormal basis for $W$. Designate this basis as $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m$. Specifically, this orthonormal basis is defined recursively for $j \geq 2$ as
$$\hat{C}_j = \frac{C_j}{\| C_j \|}, \quad \text{where} \quad C_j = A_j - \sum_{k=1}^{j-1} (A_j, \hat{C}_k)\, \hat{C}_k \quad \text{and} \quad \hat{C}_1 = \frac{A_1}{\| A_1 \|} .$$
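A short sketch of this recursion (classical Gram-Schmidt, as written above, applied to an invented matrix; for ill-conditioned columns a modified Gram-Schmidt or a library QR routine would be numerically preferable):

    import numpy as np

    def gram_schmidt(A):
        """Return Q with orthonormal columns spanning the column space of A,
        built column by column: C_j = A_j - sum_k (A_j, C_k_hat) C_k_hat."""
        n, m = A.shape
        Q = np.zeros((n, m))
        for j in range(m):
            c = A[:, j].copy()
            for k in range(j):
                c -= (A[:, j] @ Q[:, k]) * Q[:, k]   # remove components along earlier basis vectors
            Q[:, j] = c / np.linalg.norm(c)          # normalize
        return Q

    A = np.array([[1.0, 1.0],
                  [2.0, 1.0],
                  [3.0, 1.0],
                  [4.0, 1.0]])
    Q = gram_schmidt(A)
    print(np.round(Q.T @ Q, 12))     # identity matrix: the columns are orthonormal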
The column vector which is the projection of $y$ onto $W$ is given by $w = Q Q^T y$, where the $n \times m$ matrix $Q$ is defined as $Q_{i,j} = (\hat{C}_j)_{i,1}$, i.e., $Q = [\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m]$. Now, $A = Q Q^T A$, since the projection of each column of matrix $A$ onto $A$'s column space is just that original column of $A$. Let the $m \times m$ matrix $R$ be defined as $R = Q^T A$. Now, $R_{i,j} = (\hat{C}_i, A_j)$. By the Gram-Schmidt procedure, if $j \leq m$ the span of $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_j$ is equal to the span of $A_1, A_2, \ldots, A_j$. So $A_j = \sum_{k=1}^{j} \hat{C}_k\, (\hat{C}_k, A_j)$. Now, if $i > j$,
$$R_{i,j} = (\hat{C}_i, A_j) = \sum_{k=1}^{j} (\hat{C}_i, \hat{C}_k)(\hat{C}_k, A_j) = \sum_{k=1}^{j} 0 \cdot (\hat{C}_k, A_j) = 0,$$
i.e., if $i > j$, $\hat{C}_i$ is orthogonal to the subspace spanned by $A_1, A_2, \ldots, A_j$. Therefore, the matrix $R$ is upper triangular. Consider, for $m$ real numbers $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_m$, the sum
$$\sum_{j=1}^{m} \alpha_j\, R_{i,j} = \sum_{j=1}^{m} \alpha_j\, (\hat{C}_i, A_j) = (\hat{C}_i, z), \quad \text{where} \quad z = \sum_{j=1}^{m} \alpha_j A_j \in W .$$
If $\sum_{j=1}^{m} \alpha_j R_{i,j} = 0$ for every integer $i$, $1 \leq i \leq m$, then $z = \sum_{i=1}^{m} (\hat{C}_i, z)\, \hat{C}_i = 0$. But since the columns of $A$ are linearly independent, this requires that $\alpha_1 = \alpha_2 = \alpha_3 = \cdots = \alpha_m = 0$. Hence, the columns of the matrix $R$ are linearly independent and $R$ is therefore nonsingular. This factorization of $A$, $A = Q Q^T A = Q R$, is generically called the QR factorization. Since the unique solution of the linear least squares problem solves $A\beta = w = Q Q^T y$, we have $Q R \beta = Q Q^T y$. Now, $Q^T Q = I_m$, i.e.,
$$(Q^T Q)_{i,j} = \left( Q^T [\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m] \right)_{i,j} = \hat{C}_i^{\,T} \hat{C}_j = (\hat{C}_i, \hat{C}_j) = \delta_{i,j} .$$
So $R\beta = Q^T Q R \beta = Q^T Q Q^T y = Q^T y$, and the unique linear least squares solution is given by $\beta = R^{-1} Q^T y$.
The following steps show that this solution is the same as that obtained by the generalized inverse.
$$A^T = (Q R)^T = R^T Q^T$$
$$A^T A = R^T Q^T Q R = R^T I_m R = R^T R$$
$$(A^T A)^{-1} = (R^T R)^{-1} = R^{-1} (R^T)^{-1}$$
$$(A^T A)^{-1} A^T = R^{-1} (R^T)^{-1} R^T Q^T = R^{-1} Q^T$$
Calculating the Parameter Variances
From the theory of the probability distributions of random variables we have the following fundamental result. If $y_1, y_2, y_3, \ldots, y_n$ are $n$ statistically independent random variables with the variance of $y_j$ being $\sigma_j^2$, and $h = \sum_{j=1}^{n} \alpha_j\, y_j$, then $\sigma_h^2 = \sum_{j=1}^{n} \alpha_j^2\, \sigma_j^2$. Now, using the generalized inverse, the linear least squares solution for parameter $j$ is given by
$$\beta_j = \left[ (A^T A)^{-1} A^T y \right]_{j,1} = \sum_{k=1}^{n} \left[ \sum_{i=1}^{m} \left[ (A^T A)^{-1} \right]_{j,i} A^T_{i,k} \right] y_k = \sum_{k=1}^{n} \left[ (A^T A)^{-1} A^T \right]_{j,k} y_k .$$
Hence,
$$\sigma_{\beta_j}^2 = \sum_{k=1}^{n} \sigma_k^2 \left[ \sum_{i=1}^{m} A_{k,i} \left[ (A^T A)^{-1} \right]_{j,i} \right]^2 .$$
In the special case where all of the random variables have a common variance, $\sigma^2$, this simplifies to
$$\sigma_{\beta_j}^2 = \sigma^2 \sum_{k=1}^{n} \left[ \sum_{i=1}^{m} A_{k,i} \left[ (A^T A)^{-1} \right]_{j,i} \right]^2 = \sigma^2 \sum_{k=1}^{n} \left\{ \left[ (A^T A)^{-1} A^T \right]_{j,k} \right\}^2 .$$
In the QR factorization solution of the linear least squares problem,
$$\beta_j = \left( R^{-1} Q^T y \right)_{j,1} = \sum_{k=1}^{n} \left( R^{-1} Q^T \right)_{j,k} y_k = \sum_{k=1}^{n} y_k \sum_{i=1}^{m} R^{-1}_{j,i}\, Q^T_{i,k} = \sum_{k=1}^{n} y_k \sum_{i=1}^{m} R^{-1}_{j,i}\, (\hat{C}_i)_{k,1}, \quad \text{so that} \quad \sigma_{\beta_j}^2 = \sum_{k=1}^{n} \sigma_k^2 \left[ \sum_{i=1}^{m} R^{-1}_{j,i}\, (\hat{C}_i)_{k,1} \right]^2 .$$
In the special case where all of the random variables have a common variance, this simplifies as follows.
$$\sigma_{\beta_j}^2 = \sigma^2 \sum_{k=1}^{n} \left[ \sum_{i=1}^{m} R^{-1}_{j,i}\, (\hat{C}_i)_{k,1} \right]^2 = \sigma^2 \sum_{i=1}^{m} \sum_{\ell=1}^{m} R^{-1}_{j,i}\, R^{-1}_{j,\ell} \sum_{k=1}^{n} (\hat{C}_i)_{k,1} (\hat{C}_\ell)_{k,1} = \sigma^2 \sum_{i=1}^{m} \sum_{\ell=1}^{m} R^{-1}_{j,i}\, R^{-1}_{j,\ell}\, (\hat{C}_\ell, \hat{C}_i) = \sigma^2 \sum_{i=1}^{m} \sum_{\ell=1}^{m} R^{-1}_{j,i}\, R^{-1}_{j,\ell}\, \delta_{\ell,i} = \sigma^2 \sum_{i=1}^{m} \left( R^{-1}_{j,i} \right)^2$$
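Both variance formulas can be checked against each other numerically; the design matrix and the value of $\sigma$ below are arbitrary choices. Each reduces to the diagonal of the parameter covariance matrix $\sigma^2 (A^T A)^{-1}$, which is also computed directly.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [2.0, 1.0],
                  [3.0, 1.0],
                  [4.0, 1.0],
                  [5.0, 1.0]])
    sigma = 0.3                                   # common standard deviation of the y_i (arbitrary)

    G = np.linalg.inv(A.T @ A) @ A.T              # generalized inverse
    var_gi = sigma**2 * np.sum(G**2, axis=1)      # sigma^2 * sum_k G[j, k]^2

    Q, R = np.linalg.qr(A)
    Rinv = np.linalg.inv(R)
    var_qr = sigma**2 * np.sum(Rinv**2, axis=1)   # sigma^2 * sum_i (R^{-1})[j, i]^2

    # Both match the diagonal of the parameter covariance matrix sigma^2 (A^T A)^{-1}.
    var_direct = sigma**2 * np.diag(np.linalg.inv(A.T @ A))
    print(var_gi, var_qr, var_direct)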
i 1
An Example of the Methods: The Linear Model in One Independent Variable x
 
If the model is that $y_{\text{estimate}} = \beta_1 x + \beta_0$, and the $2 \times 1$ parameter vector is defined as $\beta = \begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}$, then the $n \times 2$ "design" matrix is given by
$$A = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix} . \quad \text{Hence,} \quad A^T A = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix} \quad \text{and} \quad (A^T A)^{-1} = \frac{1}{n \left( \langle x^2 \rangle - \langle x \rangle^2 \right)} \begin{pmatrix} 1 & -\langle x \rangle \\ -\langle x \rangle & \langle x^2 \rangle \end{pmatrix} = \frac{1}{SS_{xx}} \begin{pmatrix} 1 & -\langle x \rangle \\ -\langle x \rangle & \langle x^2 \rangle \end{pmatrix} .$$
Here the average value of a variable is denoted as $\langle \alpha \rangle = \frac{1}{n} \sum_{i=1}^{n} \alpha_i$, while the covariation of two variables, $SS_{\alpha\gamma}$, is defined as
$$SS_{\alpha\gamma} = \sum_{i=1}^{n} \left( \alpha_i - \langle \alpha \rangle \right) \left( \gamma_i - \langle \gamma \rangle \right) = \sum_{i=1}^{n} \alpha_i \gamma_i - n \langle \alpha \rangle \langle \gamma \rangle = \sum_{i=1}^{n} \alpha_i \gamma_i - \frac{1}{n} \left( \sum_{i=1}^{n} \alpha_i \right) \left( \sum_{i=1}^{n} \gamma_i \right), \quad \text{so that} \quad SS_{xx} = n \left( \langle x^2 \rangle - \langle x \rangle^2 \right) .$$
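For a small invented data set, these closed forms for $A^T A$, $(A^T A)^{-1}$, and $SS_{xx}$ can be verified directly (a sketch assuming NumPy).

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])       # invented x values
    n = len(x)

    A = np.column_stack([x, np.ones(n)])          # n x 2 design matrix [x, 1]
    SS_xx = np.sum(x**2) - (np.sum(x)**2) / n     # = n * (<x^2> - <x>^2)

    AtA = A.T @ A
    print(AtA)                                    # [[sum x^2, sum x], [sum x, n]]

    inv_closed = (1.0 / SS_xx) * np.array([[1.0,         -np.mean(x)],
                                           [-np.mean(x),  np.mean(x**2)]])
    print(np.allclose(np.linalg.inv(AtA), inv_closed))   # True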
Also,
$$A^T y = \begin{pmatrix} \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} y_i \end{pmatrix} ,$$
so the least squares solution is given by the following expression.
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix} = (A^T A)^{-1} A^T y = \frac{1}{SS_{xx}} \begin{pmatrix} \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right) \left( \sum_{i=1}^{n} x_i \right) \\[1ex] \langle x^2 \rangle \sum_{i=1}^{n} y_i - \langle x \rangle \sum_{i=1}^{n} x_i y_i \end{pmatrix}$$
These are the "regression" equations, which are sometimes expressed as
$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{SS_{xx}} = \frac{SS_{xy}}{SS_{xx}}$$
and
$$\beta_0 = \frac{\left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i=1}^{n} y_i \right) - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i y_i \right)}{n\, SS_{xx}} = \frac{\left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \frac{1}{n} \sum_{i=1}^{n} y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left[ \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) \right]}{SS_{xx}} = \langle y \rangle - \langle x \rangle \frac{SS_{xy}}{SS_{xx}} = \langle y \rangle - \beta_1 \langle x \rangle .$$
Assuming a common variance $\sigma^2$ for each independent random variable $y_i$, the variances of the model parameters are computed as follows.
$$\sigma_{\beta_1}^2 = \sigma^2 \sum_{k=1}^{n} \left\{ \left[ (A^T A)^{-1} A^T \right]_{1,k} \right\}^2 = \frac{\sigma^2}{SS_{xx}^2} \sum_{k=1}^{n} \left( x_k - \langle x \rangle \right)^2 = \frac{\sigma^2\, n \left( \langle x^2 \rangle - \langle x \rangle^2 \right)}{SS_{xx}^2} = \frac{\sigma^2\, SS_{xx}}{SS_{xx}^2} = \frac{\sigma^2}{SS_{xx}}$$
$$\sigma_{\beta_0}^2 = \sigma^2 \sum_{k=1}^{n} \left\{ \left[ (A^T A)^{-1} A^T \right]_{2,k} \right\}^2 = \frac{\sigma^2}{SS_{xx}^2} \sum_{k=1}^{n} \left( \langle x^2 \rangle - \langle x \rangle x_k \right)^2 = \frac{\sigma^2}{SS_{xx}^2} \sum_{k=1}^{n} \left[ \left( \langle x^2 \rangle - \langle x \rangle^2 \right) - \langle x \rangle \left( x_k - \langle x \rangle \right) \right]^2$$
$$= \frac{\sigma^2}{SS_{xx}^2} \left[ n \left( \langle x^2 \rangle - \langle x \rangle^2 \right)^2 - 2 \left( \langle x^2 \rangle - \langle x \rangle^2 \right) \langle x \rangle \sum_{k=1}^{n} \left( x_k - \langle x \rangle \right) + \langle x \rangle^2 \sum_{k=1}^{n} \left( x_k - \langle x \rangle \right)^2 \right] = \frac{\sigma^2}{SS_{xx}^2} \left[ \frac{SS_{xx}^2}{n} - 0 + \langle x \rangle^2\, SS_{xx} \right] = \sigma^2 \left( \frac{1}{n} + \frac{\langle x \rangle^2}{SS_{xx}} \right)$$


Using the QR formulation to calculate the parameters requires an orthonormal basis for the column space of $A$. A Gram-Schmidt procedure results in the following vectors.
$$\hat{C}_1 = \frac{1}{\sqrt{\sum_{i=1}^{n} x_i^2}} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \frac{1}{\sqrt{n \langle x^2 \rangle}} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
$$C_2 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} - \left( A_2, \hat{C}_1 \right) \hat{C}_1 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} - \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \frac{1}{\sum_{i=1}^{n} x_i^2} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 - x_1 \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i^2 - x_2 \sum_{i=1}^{n} x_i \\ \vdots \\ \sum_{i=1}^{n} x_i^2 - x_n \sum_{i=1}^{n} x_i \end{pmatrix} = \frac{1}{\langle x^2 \rangle} \begin{pmatrix} \langle x^2 \rangle - x_1 \langle x \rangle \\ \langle x^2 \rangle - x_2 \langle x \rangle \\ \vdots \\ \langle x^2 \rangle - x_n \langle x \rangle \end{pmatrix}$$
Normalizing $C_2$ requires the previously encountered sum
$$\sum_{i=1}^{n} \left( \langle x^2 \rangle - x_i \langle x \rangle \right)^2 = n \langle x^2 \rangle \left( \langle x^2 \rangle - \langle x \rangle^2 \right) = \langle x^2 \rangle\, SS_{xx}, \quad \text{so that} \quad \left( \hat{C}_2 \right)_i = \frac{\langle x^2 \rangle - x_i \langle x \rangle}{\sqrt{\langle x^2 \rangle\, SS_{xx}}} .$$
Thus,
$$Q = \begin{pmatrix} \dfrac{x_1}{\sqrt{n \langle x^2 \rangle}} & \dfrac{\langle x^2 \rangle - x_1 \langle x \rangle}{\sqrt{\langle x^2 \rangle\, SS_{xx}}} \\ \vdots & \vdots \\ \dfrac{x_n}{\sqrt{n \langle x^2 \rangle}} & \dfrac{\langle x^2 \rangle - x_n \langle x \rangle}{\sqrt{\langle x^2 \rangle\, SS_{xx}}} \end{pmatrix}$$
and
$$R = Q^T A = \begin{pmatrix} (\hat{C}_1, A_1) & (\hat{C}_1, A_2) \\ (\hat{C}_2, A_1) & (\hat{C}_2, A_2) \end{pmatrix} = \begin{pmatrix} \sqrt{n \langle x^2 \rangle} & \dfrac{n \langle x \rangle}{\sqrt{n \langle x^2 \rangle}} \\ 0 & \sqrt{\dfrac{SS_{xx}}{\langle x^2 \rangle}} \end{pmatrix}, \qquad R^{-1} = \begin{pmatrix} \dfrac{1}{\sqrt{n \langle x^2 \rangle}} & \dfrac{-\langle x \rangle}{\sqrt{\langle x^2 \rangle\, SS_{xx}}} \\ 0 & \sqrt{\dfrac{\langle x^2 \rangle}{SS_{xx}}} \end{pmatrix} .$$
Also,
$$Q^T y = \begin{pmatrix} \dfrac{1}{\sqrt{n \langle x^2 \rangle}} \sum_{i=1}^{n} x_i y_i \\[1ex] \dfrac{1}{\sqrt{\langle x^2 \rangle\, SS_{xx}}} \left( \langle x^2 \rangle \sum_{i=1}^{n} y_i - \langle x \rangle \sum_{i=1}^{n} x_i y_i \right) \end{pmatrix} .$$
Proceeding, computing the parameter vector gives
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix} = R^{-1} Q^T y = \begin{pmatrix} \dfrac{1}{n \langle x^2 \rangle} \sum_{i=1}^{n} x_i y_i - \dfrac{\langle x \rangle}{\langle x^2 \rangle\, SS_{xx}} \left( \langle x^2 \rangle \sum_{i=1}^{n} y_i - \langle x \rangle \sum_{i=1}^{n} x_i y_i \right) \\[1ex] \dfrac{1}{n\, SS_{xx}} \left[ \left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i=1}^{n} y_i \right) - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i y_i \right) \right] \end{pmatrix} .$$
The expression for $\beta_0$ agrees with the earlier derivation, while the previous expression for $\beta_1$ is recovered as shown by the following steps.
$$\beta_1 = \frac{1}{n \langle x^2 \rangle} \sum_{i=1}^{n} x_i y_i - \frac{\langle x \rangle}{\langle x^2 \rangle\, SS_{xx}} \left( \langle x^2 \rangle \sum_{i=1}^{n} y_i - \langle x \rangle \sum_{i=1}^{n} x_i y_i \right) = \frac{SS_{xx} \sum_{i=1}^{n} x_i y_i - n \langle x \rangle \left( \langle x^2 \rangle \sum_{i=1}^{n} y_i - \langle x \rangle \sum_{i=1}^{n} x_i y_i \right)}{n \langle x^2 \rangle\, SS_{xx}}$$
$$= \frac{\left( \sum_{i=1}^{n} x_i^2 - n \langle x \rangle^2 \right) \sum_{i=1}^{n} x_i y_i - n \langle x \rangle \langle x^2 \rangle \sum_{i=1}^{n} y_i + n \langle x \rangle^2 \sum_{i=1}^{n} x_i y_i}{n \langle x^2 \rangle\, SS_{xx}} = \frac{\left( \sum_{i=1}^{n} x_i^2 \right) \left[ \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) \right]}{\left( \sum_{i=1}^{n} x_i^2 \right) SS_{xx}} = \frac{SS_{xy}}{SS_{xx}}$$
Assuming a common variance $\sigma^2$ for each independent random variable $y_i$, in the QR factorization the variances of the model parameters are computed as follows.
$$\sigma_{\beta_1}^2 = \sigma^2 \sum_{i=1}^{2} \left( R^{-1}_{1,i} \right)^2 = \sigma^2 \left[ \frac{1}{n \langle x^2 \rangle} + \frac{\langle x \rangle^2}{\langle x^2 \rangle\, SS_{xx}} \right] = \frac{\sigma^2 \left( SS_{xx} + n \langle x \rangle^2 \right)}{n \langle x^2 \rangle\, SS_{xx}} = \frac{\sigma^2\, n \langle x^2 \rangle}{n \langle x^2 \rangle\, SS_{xx}} = \frac{\sigma^2}{SS_{xx}}$$
$$\sigma_{\beta_0}^2 = \sigma^2 \sum_{i=1}^{2} \left( R^{-1}_{2,i} \right)^2 = \sigma^2 \left[ 0^2 + \frac{\langle x^2 \rangle}{SS_{xx}} \right] = \sigma^2\, \frac{\left( \langle x^2 \rangle - \langle x \rangle^2 \right) + \langle x \rangle^2}{SS_{xx}} = \sigma^2 \left( \frac{1}{n} + \frac{\langle x \rangle^2}{SS_{xx}} \right)$$
Both of these results agree with the previous analysis that used the generalized inverse.
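As a final sketch, the explicit $Q$, $R$, and $R^{-1}$ derived above for the straight-line model can be built directly and compared with the regression formulas; the data and $\sigma$ are invented, and in practice a library QR routine would normally be used instead of these closed forms.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 5.0, 7.0])            # invented data
    y = np.array([1.8, 4.2, 5.9, 10.1, 14.2])
    n = len(x)

    mx, mx2 = x.mean(), np.mean(x**2)
    SS_xx = n * (mx2 - mx**2)

    # Orthonormal columns and upper-triangular R from the closed forms derived above.
    C1 = x / np.sqrt(n * mx2)
    C2 = (mx2 - mx * x) / np.sqrt(mx2 * SS_xx)
    Q = np.column_stack([C1, C2])
    R = np.array([[np.sqrt(n * mx2), n * mx / np.sqrt(n * mx2)],
                  [0.0,              np.sqrt(SS_xx / mx2)]])

    A = np.column_stack([x, np.ones(n)])
    print(np.allclose(Q @ R, A))                        # A = Q R holds

    beta = np.linalg.solve(R, Q.T @ y)                  # beta = R^{-1} Q^T y
    slope = np.sum((x - mx) * (y - y.mean())) / SS_xx   # SS_xy / SS_xx
    print(beta, (slope, y.mean() - slope * mx))         # slope and intercept agree

    # Parameter variances sigma^2 * sum_i (R^{-1})_{j,i}^2, with an arbitrary sigma.
    sigma = 0.25
    print(sigma**2 * np.sum(np.linalg.inv(R)**2, axis=1))
    print(sigma**2 / SS_xx, sigma**2 * (1/n + mx**2 / SS_xx))   # same two numbers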