Some notes on linear modelling

Prof. P. Lewis
UCL
plewis@geog.ucl.ac.uk
01/12/2010
Introduction
The purpose of these notes is to introduce a little of the maths you may need or like
to know concerning what we might quite generally call ‘linear modelling’. These
notes are aimed at final year Geography undergraduate students taking geog3052
(Computing for Image Analysis), and provide supporting material for that course. It is
not vital that the undergraduates fully comprehend this material to be able to complete
the programming exercise for that course. At the very least, they should be interested
to read the section ‘Applications’. Some brief notes on more ‘advanced’ concepts are
mentioned in a section ‘further maths’ at the end, for those with a keen interest in
these matters (or possibly for future reference when you hit a problem which needs
such concepts).
Background
Model
We can’t avoid using some mathematical notations and some technical terms in
these notes, but you need to have a clear understanding of what is meant by them.
First, what do we mean in this context by a model? We suppose that there is some mathematical function that we might call f(x) that gives an estimate of something that we might measure, this ‘thing’ that we might measure being called here y. The quantity x is something that might affect y. We can say then that f(x) is a model of the effect of x on y.

Now, we don’t expect f(x) to be perfect, so we say that it produces an estimate of y, and we might call this estimate ŷ (pronounced y-hat). One reason it may not be perfect is that the model itself may be in error (we call this ‘model error’, not surprisingly). This could be because y depends on more things than just x, or it could be that our understanding of the form of the relationship between x and y is far from perfect, or many similar reasons. Knowing something about model error is important, but often quite difficult to quantify, so we will not worry about it further here.
In any case, we can write the following:

\hat{y} = f(x)                                        (1)

A simple form of model might be:

f(x) = p_0 + p_1 x                                    (2)
where these terms p0 and p1 are the parameters of the model (sometimes called
state variables). So, if we happened to know (or think we knew) the ‘correct’ values of
the parameters p0 and p1, we could use our model in equation 2 to produce an
estimate of y from equation 1.
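For instance, a minimal sketch of this in Python/numpy (purely illustrative – the course code itself is written in IDL, and the parameter values here are made up):

    import numpy as np

    p0, p1 = 1.5, 0.3                        # assumed 'known' parameter values
    x = np.array([0.0, 1.0, 2.0, 3.0])       # values of x at which we want predictions
    y_hat = p0 + p1 * x                      # equation 2 used as in equation 1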
So far so good – if we ‘know’ the model and ‘know’ its parameters, we can predict what some measure y will be for some particular value (or set of values) of x. That is a powerful concept in all sorts of applications. In many ways it is ‘obvious’, but it is best to be clear what we mean by these terms.


Measurement
Now, if we took some set of measurements of this quantity y for some set of known values of x, we could compare these values y with what our model predicts, ŷ. There will likely be some discrepancy between these that we might call e. Since we are referring to a set of measurements, we need some way of referring to some member of the set, and might use a subscript i for this purpose. Then:

\hat{y}_i = f(x_i)

and

e_i = y_i - \hat{y}_i

so e_i is the ‘error’ between our prediction and our measurement for the ith sample of the set. We might recognise e_i as being the ‘residual’.


Vectors
In fact, it can be rather tedious to write all of these subscripts, and the equations we are interested in can often be more neatly and concisely written using vectors and matrices (all of these subscripted terms, e_i etc., are what we call scalars, i.e. they have only a single value). Also, if we are writing computer code for this model, we might find it convenient to group everything together in vectors (or arrays, as we might call them in that context). There are different forms of notation that are used, but here we will write a vector with an underscore, e.g. ŷ. These vectors will just be a ‘grouping’ of all of the samples we were discussing above into a convenient ‘array’ (as we might say in programming).
So, we can put all of the observations into a vector y and all of the x values at which these observations were made into a vector x, and we can write:

\hat{y} = f(x)

and

e = y - \hat{y}

where ŷ is obviously the vector of model predictions (i.e. of estimates of y) and e is the vector of residuals. If the ‘set’ consists of n values of x_i and therefore n values of y_i, then clearly each vector has dimension n, i.e. it is an array of dimension 1 x n (the 1 is perhaps implicit and redundant).
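Continuing the small Python/numpy sketch from earlier (again purely illustrative, with invented observation values):

    y = np.array([1.4, 1.9, 2.0, 2.5])       # the vector of observations
    y_hat = p0 + p1 * x                      # the vector of model predictions
    e = y - y_hat                            # the vector of residuals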





Matrices
Now, if our model is still of the form of equation 2, we can write, in vector form:

\hat{y} = p_0 + p_1 x                                 (3)

That’s all very well when we only have two parameters, but if the model were e.g.:

\hat{y} = p_0 + p_1 x + p_2 x^2 + p_3 x^3             (4)

we have four parameters and this is again getting a little tedious in the amount of notation we have to write.

In this second example, we have written some terms such as x^2: clearly that is just the same as taking each element in the vector and replacing it by its square. If a computer language allows you to directly manipulate vectors, this will normally be achieved simply as x*x (similarly for x^3 etc.).

What we can do in the above example is to group the parameter terms p_j into a vector p. Here, I am using j as the subscript for the jth element of p, because there are clearly a different number of elements of p to the number of elements in x or y. We have said there are n elements in x or y (i.e. n observations), so we might say there are m model parameters. So, the vector p, as an array, is of dimension 1 x m. Just to be clear, we could write p out explicitly for the model in equation 4:

p = \begin{pmatrix} p_0 \\ p_1 \\ p_2 \\ p_3 \end{pmatrix}          (5)

so here, m is 4.
To be able to make full use of this concept, we need to ‘stack up’ the terms x, x^2 etc. Since these are already vectors, we end up with a matrix. Since each x etc. is of length n and there are m model parameters, we need a matrix M that will be of dimensions m x n. Note that here we use a double underscore to represent matrices. This will look something like:

M = \begin{pmatrix} 1 & x_0 & x_0^2 & x_0^3 \\ 1 & x_1 & x_1^2 & x_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & x_{n-1}^3 \end{pmatrix}          (6)

Note also that here, the first column is all 1s. That is because the model in equation 4 can be thought of as \hat{y} = p_0 \cdot 1 + p_1 x + p_2 x^2 + p_3 x^3.
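As an aside, a minimal sketch of how such a matrix might be built in Python/numpy (illustrative only – the course code itself is written in IDL, and the x values here are invented):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])            # n sample locations (made up)
    n = x.size
    M = np.column_stack([np.ones(n), x, x**2, x**3])   # one column per model parameter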
To write our model now, in vector-matrix form, we need the concept of multiplying a matrix by a vector. Then we can write simply:

\hat{y} = M p

In some computing languages, this operation may be represented by # (e.g. we might write: yhat=M#p). To understand what multiplying a matrix by a vector does, we might expand our notation a little:

\begin{pmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \vdots \\ \hat{y}_{n-1} \end{pmatrix} = \begin{pmatrix} 1 & x_0 & x_0^2 & x_0^3 \\ 1 & x_1 & x_1^2 & x_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & x_{n-1}^3 \end{pmatrix} \begin{pmatrix} p_0 \\ p_1 \\ p_2 \\ p_3 \end{pmatrix}          (7)

We multiply rows in the matrix by the column in p to get entries in the rows of ŷ. That sounds more complicated than it is … looking at equation 7, you should be able to visualise:

\hat{y}_0 = p_0 \cdot 1 + p_1 x_0 + p_2 x_0^2 + p_3 x_0^3
\hat{y}_1 = p_0 \cdot 1 + p_1 x_1 + p_2 x_1^2 + p_3 x_1^3          (8)

etc.
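Continuing the illustrative numpy sketch (not the course IDL code), the whole of equation 7 is a single operation:

    p = np.array([1.5, 0.3, -0.1, 0.02])   # invented parameter values
    y_hat = M @ p                           # matrix-vector product, as in equation 7
    # y_hat[0] is p[0]*1 + p[1]*x[0] + p[2]*x[0]**2 + p[3]*x[0]**3, and so on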


Linear Models
It turns out that equation 7 is the general form of a linear model. We can use it to express equation 3 or 4 or anything of this form. That’s a useful concept when writing computer code in particular: we have the same equation, no matter how big m or n are.
A linear model is one in which the output (y) can be written as a linear combination of the model parameters (p). The term obviously covers the ‘traditional’ case of fitting a straight line through some points (which you might have seen phrased y = mx + c, so here c = p_0, m = p_1, and the number of parameters is 2), but it also covers polynomials of arbitrary order (we used a cubic expression above) as well as a wide range of other cases. If you find that confusing, remember that the ‘linear’ name relates to the parameters p, not to however x might be constructed.
In fact, even if a model is non-linear, we will often approximate it for some sorts of problems as being locally linear, through what you might have come across as a Taylor Series1.

1 http://mathworld.wolfram.com/TaylorSeries.html
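For reference, the first-order Taylor expansion of a function about some point a is:

f(x) \approx f(a) + \left.\frac{df}{dx}\right|_{a} (x - a)

In the model-fitting context the expansion is usually taken with respect to the parameters p about some initial guess, so that the model becomes (locally) linear in the parameter increments and the machinery described below can be applied.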
Method of Least Squares
Formulating the problem as an optimisation
Now we have got to grips with using vectors and matrices, we can start to ask more complex questions. Above, we said that, if we know p, we can estimate y as \hat{y} = M p. We also noted that the ‘error’, expressed as a vector of residuals, is given by:

e = y - \hat{y}

so

e = y - M p

That’s all very well if we know p. In the more general case, we simply don’t know it. In such a case, we want to use our set of observations y to give the ‘best’ estimate (an optimal estimate in some mathematical sense) of the model parameters p. We might term this process ‘parameter estimation’ or often ‘model calibration’. When you have fitted the model \hat{y} = mx + c to datasets in the past, you will have been doing a form of this: trying to estimate the parameters m and c from a set of observations y.

Before going into the maths, there are some obvious points to consider that we can
think through from the linear regression example.

Perhaps the most significant of these is that, in the absence of further information, if we have fewer than two points through which to fit the line, both m and c do not have unique values (i.e. we can’t draw a line unless we have at least two points).
Second, if our measurement or model is likely to contain significant error, then
generally, the more observations we have, the ‘better’ (more robust) our estimate of
the parameters will be.
Third, if we ‘calibrate’ the model only using a range of x values between say x0 and
x1, then the calibrated model is probably more unreliable for values of x outside of
this range.
These are all ‘intuitive’ issues we can think through with the linear regression example, and all of these generalise to the m-dimensional case. You may have come across the m-dimensional case of linear regression before under the name ‘multi-linear’ regression.
In the general case then, without further constraints, you need a minimum of m
observations to be able to estimate m parameters. As a rule of thumb, you probably
want more than twice as many observations as there are parameters, i.e. n>2m. The
larger the number of observations, the more robust the estimate of the model
parameters, as above, provided the form of model is appropriate. Actually, the total
number of samples is rather less important than the ‘information content’ of the
observations, which relates to where the samples fall. This also impacts the regions of
parameter space for which the model will become unreliable. These may seem quite
difficult concepts, but it’s useful to try to gain an intuitive feel for the ideas at least.
So, now onto the Method of least squares. To proceed, we need to define some
function that we can mathematically optimise (i.e. ‘find the best’). We want to find
the ‘best’ value of the model parameters that is consistent with our experience (i.e.
our observations here). Optimisation involves calculating the rate of change of some
function with respect to each of the model parameters, and finding the value of the
parameters for which this rate of change is zero (for all parameters). When the slope
is zero, we will have reached the minimum or maximum of some function (or perhaps
a saddle point, but we won’t in this case).
The core of this should normally be based on the difference between what your
model says an observation ought to be (if your setting of the parameters is correct)
and what you observe, i.e. based on the residual vector. One of the easier
mathematical operations here is the sum of the square of the residuals.
For a vector, this can be found using the vector dot product, denoted · (dot). So, if:

e = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{n-1} \end{pmatrix}

then we can show the dot product of e with itself is:

e \cdot e = e_0 e_0 + e_1 e_1 + \ldots + e_{n-1} e_{n-1} = e_0^2 + e_1^2 + \ldots + e_{n-1}^2 = \epsilon^2

i.e. the sum of the square of the residuals, as we wanted. We use the Greek symbol \epsilon^2 (epsilon) here to represent this sum of squares term (having already used the symbol e).
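In the illustrative numpy sketch, this is simply:

    y = np.array([1.3, 1.8, 2.1, 2.4, 3.0])   # invented observations at the x values above
    e = y - M @ p                              # residual vector
    eps2 = np.dot(e, e)                        # sum of squared residuals, equivalently (e**2).sum()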
From above, we can write:

\epsilon^2 = (y - M p) \cdot (y - M p)

The dot product works much the same way as normal (scalar) multiplication, so we can expand this to:

\epsilon^2 = y \cdot y - y \cdot M p - p^T M^T y + p^T M^T M p

Here, T is what is called the transpose operation. This involves changing the rows and columns of a matrix or vector around, so:

M^T = \begin{pmatrix} 1 & 1 & \ldots & 1 \\ x_0 & x_1 & \ldots & x_{n-1} \\ x_0^2 & x_1^2 & \ldots & x_{n-1}^2 \\ x_0^3 & x_1^3 & \ldots & x_{n-1}^3 \end{pmatrix}          (9)

and

p^T = \begin{pmatrix} p_0 & p_1 & p_2 & p_3 \end{pmatrix}
Formally, we define a function J = \epsilon^2:

J = y \cdot y - y \cdot M p - p^T M^T y + p^T M^T M p

and find the minimum of this function. Again, formally, this involves finding the rate of change of J with respect to the model parameters p. This involves calculating partial derivatives of J with respect to p, and solving for

\frac{\partial J}{\partial p} = 0.




i.e. find the values of p which minimise the sum of the square of the residuals. If you have not dealt with calculus before you may not have come across this next part, but I will show an easier way to arrive at the solution below. If you can follow the calculus, then:

\frac{\partial J}{\partial p} = -y \cdot M - M^T y + 2 M^T M p = 0

It can be shown that y \cdot M = M^T y for the case considered here, so:

0 = -2 M^T y + 2 M^T M p

or, rearranging and dividing both sides by 2:

M^T M p = M^T y                                       (10)

With equation 10, we are most of the way to the solution. We could also have found our way to this point by considering the model:

M p = y

and multiplying both sides by M^T, but that does not prove that this gives the route to the optimal estimation of p (it might be easier to comprehend and remember though).


Finding a solution to the problem
Next, we need the concept of the inverse of a matrix. We denote the inverse of some matrix M by the superscript -1 (i.e. raising to the power of -1). If you think about it, this is the same as for scalars: the inverse of x is x^{-1} = 1/x. Also from considering scalars, we might note that:

x x^{-1} = x^{-1} x = 1

i.e. the inverse of something, times itself, is one (‘unity’). The equivalent concept for matrices is the identity matrix, usually denoted I:

I = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix}

It is a square matrix full of zero values, except along the leading diagonal, where it is one (unity). So, if we have some square matrix S:

S S^{-1} = S^{-1} S = I

Anything times the (equivalently dimensioned) identity matrix is then itself – in the same way that multiplying some scalar by 1 leaves it unchanged.
Now, looking back at equation 10, we remember:

M^T M p = M^T y

so, we calculate the inverse of M^T M, which is (M^T M)^{-1}, and pre-multiply both sides of the equation by this term:

(M^T M)^{-1} M^T M p = (M^T M)^{-1} M^T y

Since (M^T M)^{-1} M^T M = I:

I p = (M^T M)^{-1} M^T y

so

p = (M^T M)^{-1} M^T y                                (11)
This is where we wanted to be: equation 11 is an expression that allows us to
estimate p from some set of observations y .
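Continuing the illustrative numpy sketch (not the course IDL code), equation 11 is one line:

    p_hat = np.linalg.inv(M.T @ M) @ (M.T @ y)          # equation 11
    # in practice, np.linalg.lstsq(M, y, rcond=None) solves the same problem
    # and is numerically safer than forming the inverse explicitly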


Applications
Application 1
In the computing exercise ‘IDL part 2’2 you came across a problem (4.2) where we had three columns of data, representing samples of x, x^2, and y. You were told that the ‘observations’ in y had come from a quadratic function of x, with some noise added to it. The purpose of the exercise was to estimate the parameters of the quadratic equation.
A quadratic model can be written:

\hat{y} = p_0 + p_1 x + p_2 x^2
From above (equation 7), we can write this in full vector-matrix notation as:

\hat{y} = M p

where

p = \begin{pmatrix} p_0 \\ p_1 \\ p_2 \end{pmatrix}

which is of dimension 1 x 3 (i.e. m = 3) and

M = \begin{pmatrix} 1 & x_0 & x_0^2 \\ 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 \end{pmatrix}

which is of dimensions 3 x n, where n is the number of samples. We will also need a vector for the observations:

y = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{pmatrix}

In the computer code, we read the second and third columns of M from the first
and second columns of the data file. We read the information in y as the final
(third) column in the data file. Initially, we read these data into a matrix called
data.


2 http://www2.geog.ucl.ac.uk/~plewis/teaching/unix/idl/idl2.html
Inside the function linearRegress, we set up the matrix M (called X in the
programme), put values of 1 in the first column, and loaded the subsequent
columns from the array data. We also loaded y (called Y in the programme)
from this array.

Equation 11 gives us:

p = (M^T M)^{-1} M^T y

so, we find the transpose of M (XT), multiply this by M (Matrix = X # XT), noting that the operation seems to be reversed in the code as we have swapped the meaning of rows and columns, and this gives us M^T M. We then find the inverse of this (M1 = invert(Matrix)), giving (M^T M)^{-1}. Now we find M^T y (V = Y # XT) and multiply (M^T M)^{-1} by this, giving p (or, in the code, A = V # M1), and return p (A) from the function.
This function is quite general, in that it solves equation 11, assuming that the first model parameter is an ‘offset’ (as in the p_0 \cdot 1 term above). We could perhaps make it more general by making the user explicitly pass the vectors X and Y, rather than ‘loading’ them up in the function, although this mechanism is convenient if we have a model with an offset, as here.
Any other enhancements you might make to the code could flow from a discussion of some of the issues above, e.g. what if n was only 2? In that case, there isn’t enough information to solve for the parameters.
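For comparison only, here is a rough Python/numpy equivalent of that procedure (the course function linearRegress is written in IDL; the file name here is invented and the details are a sketch, not the assessed code):

    import numpy as np

    data = np.loadtxt('quadratic.dat')                   # columns: x, x^2, y (file name made up)
    x, x2, y = data[:, 0], data[:, 1], data[:, 2]
    M = np.column_stack([np.ones(x.size), x, x2])        # first column of 1s is the offset p_0
    p_hat = np.linalg.inv(M.T @ M) @ (M.T @ y)           # equation 11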
Application 2
In the ‘assessed practical’3 you need to estimate the parameters of a linear model
for each pixel in an image. In this case, the samples for each pixel come from a time
series of MODIS observations. The method you would normally use is exactly the
same as above. The model is of the form:
\hat{r} = f_0 + f_1 k_1 + f_2 k_2

where r̂ is a set of reflectance data for a particular waveband for a particular pixel (i.e. over time) and the model parameters are p = (f_0  f_1  f_2)^T for a particular waveband for that pixel. In the MODIS image data, you are given k_1, k_2, and r̂ for some different wavebands for each pixel. Whilst it might be of more general value to you to understand the maths above, you do not need to in order to apply the function linearRegress to solve this problem.
 
One computing complexity in this problem is that sometimes, data are missing

in the images (they appear as zero values). If you imagine loading the data for a
particular waveband and pixel into the data array of the function
linearRegress, then you should probably remove these missing data points
from the array before applying linearRegress.
An alternative strategy might be to simply use all of the data as they come, but to
set the first column of M to zero wherever there is a missing data point (see if you
can work out why that comes to the same thing).
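As a minimal sketch of the first strategy in numpy (purely to illustrate the masking idea; k1, k2 and r here stand for the time series of kernel values and reflectances for one pixel and waveband, and are not part of the course code):

    valid = r != 0                                                       # missing observations appear as zeros
    M = np.column_stack([np.ones(valid.sum()), k1[valid], k2[valid]])
    p_hat = np.linalg.inv(M.T @ M) @ (M.T @ r[valid])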
No more hints! Good luck. Once you start the assessed practical, I cannot
answer questions on that, other than on points of clarification of the problem or
assessment.
3 http://www2.geog.ucl.ac.uk/~plewis/teaching/unix/idl/IDLLewispart4.html
Further maths
Dealing with uncertainty in the observations
If you can easily work your way around the maths above, probably the next
level in complexity is to consider the case when the uncertainty in each
observation is known and may be different. If we assume the observational
uncertainty to be Normally distributed (i.e. Gaussian), we can represent this by a
variance-covariance matrix, C . The leading diagonal of this is simply the variance
associated with each observation. If available, any off-diagonal elements of the matrix
will express covariances between observational uncertainties.
In this case, we write equation 11 as:

p = (M^T C^{-1} M)^{-1} M^T C^{-1} y                  (12)

Now, we can find the uncertainty structure of the parameter estimates, C_p:

C_p = (M^T C^{-1} M)^{-1}

If we then want to know the uncertainty associated with modelling a particular linear combination of model parameters (e.g. an observation) then, where that combination is modelled by

K^T = (k_0  k_1  k_2)

so that

y = K^T p

then

\sigma = \sqrt{K^T C_p K}

is the standard deviation associated with y. You can use this to work out the uncertainty in the model parameters themselves (in that case, just the square root of the leading diagonal terms), or the observations, or any other linear combination of the parameters.
For an interesting application of this, consider the case where all observations have the same uncertainty, \sigma_{obs}. Then

p = (M^T M)^{-1} M^T y

i.e. the amount of uncertainty does not affect the estimate, but:

\sigma = \sigma_{obs} \sqrt{K^T (M^T M)^{-1} K}

The term (M^T M)^{-1} is only dependent on the way in which the observation set samples the domain of the model. In Lucht and Lewis (2000)4, for example, this is used to examine the impact of particular satellite sensor angular sampling regimes on the determination of biophysical quantities.
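A numpy sketch of equation 12 and of C_p (illustrative only; the per-observation standard deviations here are invented, and the covariance matrix is assumed diagonal, which is the simplest case):

    sigma = np.full(y.size, 0.1)                         # invented per-observation standard deviations
    C_inv = np.diag(1.0 / sigma**2)                      # inverse of a diagonal covariance matrix
    A = M.T @ C_inv @ M
    p_hat = np.linalg.inv(A) @ (M.T @ C_inv @ y)         # equation 12
    C_p = np.linalg.inv(A)                               # covariance of the parameter estimates
    p_std = np.sqrt(np.diag(C_p))                        # standard deviation of each parameter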

Further constraints
Quite often, we find that the estimate of the parameters is only poorly constrained from observational datasets. In such cases, we say that the problem is ill-conditioned. The condition number5 of the matrix M^T M (formally, the ratio of the largest to smallest singular value in the singular value decomposition of a matrix) relates to the linear independence of the set of simultaneous equations expressed by the matrix. If this is too large, the matrix is ill-conditioned.
In essence this means that you cannot solve for as many parameters as you have specified in the problem. This may be because some of the parameters are formally linearly dependent (i.e. there is a linear transformation between one parameter and another … e.g. if you specified the model as \hat{y} = p_0 + p_1 x + p_2 x^1 this would be the case), or it may be that there is simply not enough information in the observations to solve the problem as you have stated it.
The options then are either to rephrase the problem (if appropriate) or to consider what other information might be brought to bear on finding the parameter estimate. One useful example of constraints in many problems, but especially linear problems, is what may be considered as Regularisation methods. In essence, this means that you assume that there is some degree of smoothness between model parameters. In a geographic or temporal estimation, this might make a lot of sense: one way of phrasing this is that model parameters close together in space and/or time are more likely to be similar than those spaced widely apart. This has the effect of improving (lowering) the condition number of the matrix.
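If you are experimenting in Python/numpy, the condition number can be inspected directly (illustrative, for some design matrix M as above):

    cond = np.linalg.cond(M.T @ M)    # large values indicate an ill-conditioned problem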
We will not go into detail on how to solve such problems here, but see e.g. Quaife
and Lewis (2010)6 for an example application. There are various excellent text books
which cover this and related concepts in detail, although they are not aimed
specifically at Geography undergraduates. My current favourites are Twomey (1996)7
and Rodgers (2000)8.
4 Lucht, W. and P. Lewis (2000) Theoretical noise sensitivity of BRDF and albedo retrieval from the EOS-MODIS and MISR sensors with respect to angular sampling. International Journal of Remote Sensing 21(1) 81-89.
5 http://mathworld.wolfram.com/ConditionNumber.html
6 T. Quaife and P. Lewis (2010) Temporal constraints on linear BRF model parameters. IEEE Transactions on Geoscience and Remote Sensing, doi:10.1109/TGRS.2009.2038901
7 S. Twomey (1996) Introduction to the mathematics of inversion in remote sensing and indirect measurements, Dover Publications, 243 pages.
8 Clive D. Rodgers (2000) Inverse Methods for Atmospheric Sounding: Theory and Practice, Series on Atmospheric, Oceanic and Planetary Physics - Vol. 2, ISBN: 978-981-02-2740-1.