Some notes on linear modelling
Prof. P. Lewis, UCL
plewis@geog.ucl.ac.uk
01/12/2010

Introduction

The purpose of these notes is to introduce a little of the maths you may need, or like to know, concerning what we might quite generally call 'linear modelling'. These notes are aimed at final year Geography undergraduate students taking geog3052 (Computing for Image Analysis), and provide supporting material for that course. It is not vital that the undergraduates fully comprehend this material to be able to complete the programming exercise for that course. At the very least, they should be interested to read the section 'Applications'. Some brief notes on more 'advanced' concepts are mentioned in a section 'Further maths' at the end, for those with a keen interest in these matters (or possibly for future reference when you hit a problem which needs such concepts).

Background

Model

We can't avoid using some mathematical notation and some technical terms in these notes, but you need to have a clear understanding of what is meant by them. First, what do we mean in this context by a model? We suppose that there is some mathematical function that we might call $f(x)$ that gives an estimate of something that we might measure, this 'thing' that we might measure being called here $y$. The quantity $x$ is something that might affect $y$. We can say then that $f(x)$ is a model of the effect of $x$ on $y$.

Now, we don't expect $f(x)$ to be perfect, so we say that it produces an estimate of $y$, and we might call this estimate $\hat{y}$ (pronounced y-hat). One reason it may not be perfect is that the model itself may be in error (we call this 'model error', not surprisingly). This could be because $y$ depends on more things than just $x$, or it could be that our understanding of the form of the relationship between $x$ and $y$ is far from perfect, or many similar reasons. Knowing something about model error is important, but often quite difficult to quantify, so we will not worry about it further here. In any case, we can write the following:

$$\hat{y} = f(x)$$   (1)

A simple form of model might be:

$$f(x) = p_0 + p_1 x$$   (2)

where the terms $p_0$ and $p_1$ are the parameters of the model (sometimes called state variables). So, if we happened to know (or think we knew) the 'correct' values of the parameters $p_0$ and $p_1$, we could use our model in equation 2 to produce an estimate of $y$ from equation 1.

So far so good – if we 'know' the model and 'know' its parameters, we can predict what some measure $y$ will be for some particular value (or set of values) of $x$. That is a powerful concept in all sorts of applications. In many ways it is 'obvious', but it is best to be clear what we mean by these terms.

Measurement

Now, if we took some set of measurements of this quantity $y$ for some set of known values of $x$, we could compare these values $y$ with what our model predicts, $\hat{y}$. There will likely be some discrepancy between these that we might call $e$. Since we are referring to a set of measurements, we need some way of referring to some member of the set, and might use a subscript $i$ for this purpose. Then:

$$\hat{y}_i = f(x_i)$$

and

$$e_i = y_i - \hat{y}_i$$

so $e_i$ is the 'error' between our prediction and our measurement for the ith sample of the set. We might recognise $e_i$ as being the 'residual'.

Vectors

In fact, it can be rather tedious to write all of these subscripts, and the equations we are interested in can often be more neatly and concisely written using vectors and matrices (all of these subscripted terms, $e_i$ etc., are what we call scalars, i.e. they have only a single value).
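To make that element-by-element book-keeping concrete before we do that, here is a minimal sketch (in Python rather than the IDL used in the course, with made-up numbers) of equation 2 and the residual calculation for each sample:

```python
# A minimal sketch (Python, made-up numbers) of the scalar form of the model
# in equation 2, and of the residuals e_i = y_i - y_hat_i for each sample.
x = [0.0, 1.0, 2.0, 3.0]           # the known values of x
y = [0.9, 3.2, 4.8, 7.1]           # the measurements of y (made-up)
p0, p1 = 1.0, 2.0                  # parameter values we 'know' (assumed)

y_hat = []                         # model predictions, one per sample
e = []                             # residuals, one per sample
for i in range(len(x)):
    y_hat.append(p0 + p1 * x[i])   # equation 2 for the ith sample
    e.append(y[i] - y_hat[i])      # e_i = y_i - y_hat_i

print(y_hat)   # [1.0, 3.0, 5.0, 7.0]
print(e)       # approximately [-0.1, 0.2, -0.2, 0.1]
```

All of the subscripting and looping here is exactly the sort of thing that vector notation (and vector-capable languages) lets us avoid.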
Also, if we are writing computer code for this model, we might find it convenient to group everything together in vectors (or arrays, as we might call them in that context). There are different forms of notation in use, but here we will write a vector with an underscore, e.g. $\underline{\hat{y}}$. These vectors are just a 'grouping' of all of the samples we were discussing above into a convenient 'array' (as we might say in programming). So, we can put all of the observations into a vector $\underline{y}$ and all of the $x$ values at which these observations were made into a vector $\underline{x}$, and we can write:

$$\underline{\hat{y}} = f(\underline{x})$$

and

$$\underline{e} = \underline{y} - \underline{\hat{y}}$$

where $\underline{\hat{y}}$ is obviously the vector of model predictions (i.e. of estimates of $y$) and $\underline{e}$ is the vector of residuals. If the 'set' consists of n values of $x_i$, and therefore n values of $y_i$, then clearly each vector has dimension n, i.e. it is an array of dimension 1 x n (the 1 is perhaps implicit and redundant). Now, if our model is still of the form of equation 2, we can write, in vector form:

$$\underline{\hat{y}} = p_0 + p_1 \underline{x}$$   (3)

Matrices

That's all very well when we only have two parameters, but if the model were e.g.:

$$\underline{\hat{y}} = p_0 + p_1 \underline{x} + p_2 \underline{x}^2 + p_3 \underline{x}^3$$   (4)

we have four parameters and this is again getting a little tedious in the amount of notation we have to write. In this second example, we have written terms such as $\underline{x}^2$: clearly that is just the same as taking each element in the vector and replacing it by its square. If a computer language allows you to directly manipulate vectors, this will normally be achieved simply as x*x (similarly for $\underline{x}^3$ etc.).

What we can do in the above example is to group the parameter terms $p_j$ into a vector $\underline{p}$. Here, I am using j as the subscript for the jth element of $\underline{p}$, because there are clearly a different number of elements in $\underline{p}$ to the number of elements in $\underline{x}$ or $\underline{y}$. We have said there are n elements in $\underline{x}$ or $\underline{y}$ (i.e. n observations), so we might say there are m model parameters. So the vector $\underline{p}$, as an array, is of dimension 1 x m. Just to be clear, we could write $\underline{p}$ out explicitly for the model in equation 4:

$$\underline{p} = \begin{bmatrix} p_0 \\ p_1 \\ p_2 \\ p_3 \end{bmatrix}$$   (5)

so here, m is 4. To be able to make full use of this concept, we need to 'stack up' the terms $\underline{x}$, $\underline{x}^2$ etc. Since these are already vectors, we end up with a matrix. Since each $\underline{x}$ etc. is of length n and there are m model parameters, we need a matrix $\underline{\underline{M}}$ that will be of dimensions m x n. Note that here we use a double underscore to represent matrices. This will look something like:

$$\underline{\underline{M}} = \begin{bmatrix} 1 & x_0 & x_0^2 & x_0^3 \\ 1 & x_1 & x_1^2 & x_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & x_{n-1}^3 \end{bmatrix}$$   (6)

Note also that here the first column is all 1s. That is because the model in equation 4 can be thought of as $\underline{\hat{y}} = p_0 \underline{1} + p_1 \underline{x} + p_2 \underline{x}^2 + p_3 \underline{x}^3$, where $\underline{1}$ is a vector of ones. To write our model now, in vector-matrix form, we need the concept of multiplying a matrix by a vector. Then we can write simply:

$$\underline{\hat{y}} = \underline{\underline{M}} \, \underline{p}$$

In some computing languages, this operation may be represented by # (e.g. we might write: yhat=M#p). To understand what multiplying a matrix by a vector does, we might expand our notation a little:

$$\begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \vdots \\ \hat{y}_{n-1} \end{bmatrix} = \begin{bmatrix} 1 & x_0 & x_0^2 & x_0^3 \\ 1 & x_1 & x_1^2 & x_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & x_{n-1}^3 \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ p_2 \\ p_3 \end{bmatrix}$$   (7)

We multiply rows in the matrix by the column in $\underline{p}$ to get entries in the rows of $\underline{\hat{y}}$. That sounds more complicated than it is: looking at equation 8, you should be able to visualise how it works:

$$\hat{y}_0 = p_0 \cdot 1 + p_1 x_0 + p_2 x_0^2 + p_3 x_0^3$$
$$\hat{y}_1 = p_0 \cdot 1 + p_1 x_1 + p_2 x_1^2 + p_3 x_1^3$$
etc.   (8)

Linear Models

It turns out that equation 7 is the general form of a linear model. We can use it to express equation 3 or 4, or anything of this form. A short sketch of this calculation in code is given below.
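As an illustration (a sketch in Python/numpy rather than the IDL used in the course, with made-up numbers), equations 6 to 8 amount to stacking up the columns and doing a single matrix-vector multiplication:

```python
import numpy as np

# A sketch (Python/numpy, made-up numbers) of equations 6-8: build the matrix M
# for the cubic model of equation 4, then form y_hat = M p in one operation.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])        # n = 5 sample positions
p = np.array([1.0, 0.5, -0.2, 0.05])           # m = 4 parameters p0..p3

# Stack the columns 1, x, x^2, x^3 to form M (n rows by m columns here).
M = np.column_stack([np.ones_like(x), x, x**2, x**3])

y_hat = M @ p        # the matrix-vector product of equation 7
                     # (numpy's version of the '#' operation mentioned above)
print(M.shape)       # (5, 4)
print(y_hat)         # one prediction per row of M, as in equation 8
```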
That's a useful concept when writing computer code in particular: we have the same equation, no matter how big m or n are. A linear model is one in which the output ($y$) can be written as a linear combination of the model parameters ($\underline{p}$). The term obviously covers the 'traditional' case of fitting a straight line through some points (which you might have seen phrased as $y = mx + c$, so here $c = p_0$, $m = p_1$ and the number of model parameters is 2), but it also covers polynomials of arbitrary order (we used a cubic expression above) as well as a wide range of other cases. If you find that confusing, remember that the name 'linear' relates to the parameters $\underline{p}$, not to however $x$ might be constructed. In fact, even if a model is non-linear, we will often approximate it for some sorts of problems as being locally linear, through what you might have come across as a Taylor series¹.

¹ http://mathworld.wolfram.com/TaylorSeries.html

Method of Least Squares

Formulating the problem as an optimisation

Now we have got to grips with using vectors and matrices, we can start to ask more complex questions. Above, we said that, if we know $\underline{p}$, we can estimate $\underline{y}$ as $\underline{\hat{y}} = \underline{\underline{M}} \, \underline{p}$. We also noted that the 'error', expressed as a vector of residuals, is given by:

$$\underline{e} = \underline{y} - \underline{\hat{y}}$$

so

$$\underline{e} = \underline{y} - \underline{\underline{M}} \, \underline{p}$$

That's all very well if we know $\underline{p}$. In the more general case, we simply don't know it. In such a case, we want to use our set of observations $\underline{y}$ to give the 'best' estimate (an optimal estimate in some mathematical sense) of the model parameters $\underline{p}$. We might term this process 'parameter estimation' or often 'model calibration'. When you have fitted the model $\hat{y} = mx + c$ to datasets in the past, you will have been doing a form of this: trying to estimate the parameters m and c from a set of observations $\underline{y}$.

Before going into the maths, there are some obvious points to consider that we can think through from the linear regression example. Perhaps the most significant of these is that, in the absence of further information, if we have fewer than two points through which to fit the line, m and c do not have unique values (i.e. we can't draw a line unless we have at least two points). Second, if our measurement or model is likely to contain significant error, then generally, the more observations we have, the 'better' (more robust) our estimate of the parameters will be. Third, if we 'calibrate' the model only using a range of x values between, say, $x_0$ and $x_1$, then the calibrated model is probably more unreliable for values of x outside of this range.

These are all 'intuitive' issues we can think through with the linear regression example, and all of them generalise to the m-dimensional case. You may have come across the m-dimensional case of linear regression before under the name 'multilinear' regression. In the general case then, without further constraints, you need a minimum of m observations to be able to estimate m parameters. As a rule of thumb, you probably want more than twice as many observations as there are parameters, i.e. n > 2m. The larger the number of observations, the more robust the estimate of the model parameters, as above, provided the form of the model is appropriate. Actually, the total number of samples is rather less important than the 'information content' of the observations, which relates to where the samples fall. This also affects the regions of parameter space for which the model will become unreliable. These may seem quite difficult concepts, but it is useful to try to gain an intuitive feel for the ideas at least.

So, now onto the method of least squares. As a lead-in, the short sketch below shows the quantity we are about to minimise: the sum of the squared residuals.
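This is a sketch (Python/numpy, made-up numbers, not part of the course code) that simply evaluates the sum of the squared residuals for two candidate parameter vectors; the 'least squares' estimate is the $\underline{p}$ that makes this number as small as possible:

```python
import numpy as np

# A sketch (Python/numpy, made-up numbers): the quantity that 'least squares'
# minimises is the sum of the squared residuals, sum(e_i^2), for a given p.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])            # noisy 'observations'
M = np.column_stack([np.ones_like(x), x])          # straight-line model, m = 2

def sum_squared_residuals(p):
    e = y - M @ p          # the residual vector e = y - M p
    return e @ e           # dot product of e with itself: sum of e_i^2

print(sum_squared_residuals(np.array([1.0, 1.0])))   # a good guess: about 0.11
print(sum_squared_residuals(np.array([0.0, 2.0])))   # a poor guess: about 15.11
```

Finding the minimising $\underline{p}$ by trial and error like this would be hopeless in general; the calculus that follows gives it directly.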
To proceed, we need to define some function that we can mathematically optimise (i.e. 'find the best' of). We want to find the 'best' values of the model parameters that are consistent with our experience (i.e. our observations here). Optimisation involves calculating the rate of change of some function with respect to each of the model parameters, and finding the values of the parameters for which this rate of change is zero (for all parameters). When the slope is zero, we will have reached the minimum or maximum of the function (or perhaps a saddle point, but not in this case). The core of this should normally be based on the difference between what your model says an observation ought to be (if your setting of the parameters is correct) and what you observe, i.e. based on the residual vector. One of the easier mathematical operations to use here is the sum of the squares of the residuals. For a vector, this can be found using the vector dot product, denoted $\cdot$ (dot). So, if:

$$\underline{e} = \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{n-1} \end{bmatrix}$$

then we can show that the dot product of $\underline{e}$ with itself is:

$$\underline{e} \cdot \underline{e} = e_0 e_0 + e_1 e_1 + \dots + e_{n-1} e_{n-1} = e_0^2 + e_1^2 + \dots + e_{n-1}^2$$

i.e. the sum of the squares of the residuals, as we wanted. We use the Greek symbol $\varepsilon$ (epsilon) here, writing this sum of squares term as $\varepsilon^2$ (having already used the symbol $e$). From above, we can write:

$$\varepsilon^2 = (\underline{y} - \underline{\underline{M}} \, \underline{p}) \cdot (\underline{y} - \underline{\underline{M}} \, \underline{p})$$

The dot product works much the same way as normal (scalar) multiplication, so we can expand this to:

$$\varepsilon^2 = \underline{y} \cdot \underline{y} - \underline{y} \cdot \underline{\underline{M}} \, \underline{p} - \underline{p}^T \underline{\underline{M}}^T \underline{y} + \underline{p}^T \underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p}$$

Here, $T$ is what is called the transpose operation. This involves swapping the rows and columns of a matrix or vector around, so:

$$\underline{\underline{M}}^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_{n-1} \\ x_0^2 & x_1^2 & \cdots & x_{n-1}^2 \\ x_0^3 & x_1^3 & \cdots & x_{n-1}^3 \end{bmatrix}$$   (9)

and

$$\underline{p}^T = \begin{bmatrix} p_0 & p_1 & p_2 & p_3 \end{bmatrix}$$

Formally, we define a function $J = \varepsilon^2$:

$$J = \underline{y} \cdot \underline{y} - \underline{y} \cdot \underline{\underline{M}} \, \underline{p} - \underline{p}^T \underline{\underline{M}}^T \underline{y} + \underline{p}^T \underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p}$$

and find the minimum of this function. Again, formally, this involves finding the rate of change of $J$ with respect to the model parameters $\underline{p}$. This involves calculating partial derivatives of $J$ with respect to $\underline{p}$, and solving for

$$\frac{\partial J}{\partial \underline{p}} = 0$$

i.e. finding the values of $\underline{p}$ which minimise the sum of the squares of the residuals. If you have not dealt with calculus before you may not have come across this next part, but I will show an easier way to arrive at the solution below. If you can follow the calculus, then:

$$\frac{\partial J}{\partial \underline{p}} = -\underline{y} \cdot \underline{\underline{M}} - \underline{\underline{M}}^T \underline{y} + 2 \underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p} = 0$$

It can be shown that $\underline{y} \cdot \underline{\underline{M}} = \underline{\underline{M}}^T \underline{y}$ for the case considered here, so:

$$0 = -2 \underline{\underline{M}}^T \underline{y} + 2 \underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p}$$

or, rearranging and dividing both sides by 2:

$$\underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p} = \underline{\underline{M}}^T \underline{y}$$   (10)

With equation 10, we are most of the way to the solution. We could also have found our way to this point by considering the model $\underline{\underline{M}} \, \underline{p} = \underline{y}$ and multiplying both sides by $\underline{\underline{M}}^T$, but that does not prove that this gives the route to the optimal estimate of $\underline{p}$ (it might be easier to comprehend and remember though).

Finding a solution to the problem

Next, we need the concept of the inverse of a matrix. We denote the inverse of some matrix $\underline{\underline{M}}$ by the superscript -1 (i.e. raising to the power of -1). If you think about it, this is the same as for scalars: the inverse of $x$ is $x^{-1} = 1/x$. Also from considering scalars, we might note that:

$$x x^{-1} = x^{-1} x = 1$$

i.e. the inverse of something, times itself, is one ('unity'). The equivalent concept for matrices is the identity matrix, usually denoted $\underline{\underline{I}}$:

$$\underline{\underline{I}} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$

It is a square matrix full of zero values, except along the leading diagonal, where it is one (unity). So, if we have some square matrix $\underline{\underline{S}}$:

$$\underline{\underline{S}} \, \underline{\underline{S}}^{-1} = \underline{\underline{S}}^{-1} \underline{\underline{S}} = \underline{\underline{I}}$$

Anything times the (equivalently dimensioned) identity matrix is itself – in the same way that multiplying some scalar by 1 leaves it unchanged. A quick numerical check of this idea, using the matrix $\underline{\underline{M}}^T \underline{\underline{M}}$ from equation 10, is sketched below.
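This sketch (Python/numpy, made-up numbers) forms $\underline{\underline{M}}^T \underline{\underline{M}}$ for a small quadratic design matrix, inverts it, and checks that the product with the original really is the identity matrix:

```python
import numpy as np

# A quick numerical check (Python/numpy, made-up numbers): the inverse of the
# square matrix M^T M, multiplied by M^T M, gives the identity matrix I
# (to within floating-point rounding).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
M = np.column_stack([np.ones_like(x), x, x**2])   # quadratic design matrix

MTM = M.T @ M                      # M^T M: a square (3 x 3) matrix
MTM_inv = np.linalg.inv(MTM)       # its inverse, (M^T M)^-1

print(np.round(MTM_inv @ MTM, 10)) # prints the 3 x 3 identity matrix
```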
Now, looking back at equation 10, we remember:

$$\underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p} = \underline{\underline{M}}^T \underline{y}$$

so we calculate the inverse of $\underline{\underline{M}}^T \underline{\underline{M}}$, which is $(\underline{\underline{M}}^T \underline{\underline{M}})^{-1}$, and pre-multiply both sides of the equation by this term:

$$(\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{\underline{M}} \, \underline{p} = (\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{y}$$

Since $(\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{\underline{M}} = \underline{\underline{I}}$ and $\underline{\underline{I}} \, \underline{p} = \underline{p}$:

$$\underline{p} = (\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{y}$$   (11)

This is where we wanted to be: equation 11 is an expression that allows us to estimate $\underline{p}$ from some set of observations $\underline{y}$.

Applications

Application 1

In the computing exercise 'IDL part 2'² you came across a problem (4.2) where we had three columns of data, representing samples of $x$, $x^2$ and $y$. You were told that the 'observations' in $y$ had come from a quadratic function of $x$, with some noise added to it. The purpose of the exercise was to estimate the parameters of the quadratic equation. A quadratic model can be written:

$$\hat{y} = p_0 + p_1 x + p_2 x^2$$

From above (equation 7), we can write this in full vector-matrix notation as $\underline{\hat{y}} = \underline{\underline{M}} \, \underline{p}$, where

$$\underline{p} = \begin{bmatrix} p_0 \\ p_1 \\ p_2 \end{bmatrix}$$

which is of dimension 1 x 3 (i.e. m = 3), and

$$\underline{\underline{M}} = \begin{bmatrix} 1 & x_0 & x_0^2 \\ 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 \end{bmatrix}$$

which is of dimensions 3 x n, where n is the number of samples. We will also need a vector for the observations:

$$\underline{y} = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{bmatrix}$$

In the computer code, we read the second and third columns of $\underline{\underline{M}}$ from the first and second columns of the data file. We read the information in $\underline{y}$ from the final (third) column in the data file. Initially, we read these data into a matrix called data.

² http://www2.geog.ucl.ac.uk/~plewis/teaching/unix/idl/idl2.html

Inside the function linearRegress, we set up the matrix $\underline{\underline{M}}$ (called X in the programme), put values of 1 in the first column, and loaded the subsequent columns from the array data. We also loaded $\underline{y}$ (called Y in the programme) from this array. Equation 11 gives us:

$$\underline{p} = (\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{y}$$

so we find the transpose of $\underline{\underline{M}}$ (XT), multiply this by $\underline{\underline{M}}$ (Matrix = X # XT, noting that the operation seems to be reversed in the code as we have swapped the meaning of rows and columns), and this gives us $\underline{\underline{M}}^T \underline{\underline{M}}$. We then find the inverse of this (M1 = invert(Matrix)), giving $(\underline{\underline{M}}^T \underline{\underline{M}})^{-1}$. Now we find $\underline{\underline{M}}^T \underline{y}$ (V = Y # XT) and multiply $(\underline{\underline{M}}^T \underline{\underline{M}})^{-1}$ by this, giving $\underline{p}$ (or, in the code, A = V # M1), and return $\underline{p}$ (A) from the function.

This function is quite general, in that it solves equation 11, assuming that the first model parameter is an 'offset' (the $p_0 \underline{1}$ term above). We could perhaps make it more general by making the user explicitly pass the arrays X and Y, rather than 'loading' them up in the function, although this mechanism is convenient if we have a model with an offset, as here. Any other enhancements you might make to the code could flow from a discussion of some of the issues above, e.g. what if n was only 2? In that case, there isn't enough information to solve for the parameters.

Application 2

In the 'assessed practical'³ you need to estimate the parameters of a linear model for each pixel in an image. In this case, the samples for each pixel come from a time series of MODIS observations. The method you would normally use is exactly the same as above. The model is of the form:

$$\hat{r} = f_0 + f_1 k_1 + f_2 k_2$$

where $\hat{r}$ is a set of reflectance data for a particular waveband for a particular pixel (i.e. over time) and the model parameters are $\underline{p} = \begin{bmatrix} f_0 & f_1 & f_2 \end{bmatrix}^T$ for a particular waveband for that pixel. In the MODIS image data, you are given $k_1$, $k_2$ and the reflectances for some different wavebands for each pixel. Both applications come down to the same calculation (equation 11); a minimal sketch of that calculation on synthetic data for the Application 1 quadratic is given below.
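This is a sketch (Python/numpy with synthetic, made-up data; it is not the course's IDL linearRegress function) of equation 11 applied to the Application 1 problem: 'observations' generated from a quadratic with added noise, and the parameters then recovered from them:

```python
import numpy as np

# A minimal sketch (Python/numpy, synthetic data; not the IDL linearRegress
# function from the course) of equation 11 for the Application 1 problem.
rng = np.random.default_rng(0)
true_p = np.array([2.0, -1.0, 0.5])                 # the 'true' p0, p1, p2
x = np.linspace(0.0, 5.0, 50)                       # n = 50 sample positions
y = true_p[0] + true_p[1]*x + true_p[2]*x**2 \
    + rng.normal(0.0, 0.2, x.size)                  # quadratic plus noise

M = np.column_stack([np.ones_like(x), x, x**2])     # the matrix M (m = 3)
p = np.linalg.inv(M.T @ M) @ (M.T @ y)              # equation 11

print(p)     # should be close to [2.0, -1.0, 0.5]
```

In practice a library routine such as numpy's np.linalg.lstsq would normally be preferred to forming the inverse explicitly, for numerical stability, but the direct form above mirrors equation 11 (and the structure of linearRegress).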
Whilst it might be of more general value to you to understand the maths above, you do not need to in order to apply the function linearRegress to solve this problem. One computing complexity in this problem is that sometimes data are missing in the images (they appear as zero values). If you imagine loading the data for a particular waveband and pixel into the data array of the function linearRegress, then you should probably remove these missing data points from the array before applying linearRegress. An alternative strategy might be to simply use all of the data as they come, but to set the first column of $\underline{\underline{M}}$ to zero wherever there is a missing data point (see if you can work out why that comes to the same thing). No more hints! Good luck. Once you start the assessed practical, I cannot answer questions on that, other than on points of clarification of the problem or assessment.

³ http://www2.geog.ucl.ac.uk/~plewis/teaching/unix/idl/IDLLewispart4.html

Further maths

Dealing with uncertainty in the observations

If you can easily work your way around the maths above, probably the next level in complexity is to consider the case when the uncertainty in each observation is known and may be different. If we assume the observational uncertainty to be Normally distributed (i.e. Gaussian), we can represent it by a variance-covariance matrix, $\underline{\underline{C}}$. The leading diagonal of this is simply the variance associated with each observation. If available, any off-diagonal elements of the matrix express covariances between observational uncertainties. In this case, we write equation 11 as:

$$\underline{p} = (\underline{\underline{M}}^T \underline{\underline{C}}^{-1} \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{\underline{C}}^{-1} \underline{y}$$   (12)

Now, we can find the uncertainty structure of the parameter estimates, $\underline{\underline{C}}_p$:

$$\underline{\underline{C}}_p = (\underline{\underline{M}}^T \underline{\underline{C}}^{-1} \underline{\underline{M}})^{-1}$$

If we then want to know the uncertainty associated with modelling a particular linear combination of the model parameters (e.g. an observation), where that combination is modelled by $\underline{K} = \begin{bmatrix} k_0 & k_1 & k_2 \end{bmatrix}^T$ so that $y = \underline{K}^T \underline{p}$, then $\sqrt{\underline{K}^T \underline{\underline{C}}_p \underline{K}}$ is the standard deviation associated with $y$. You can use this to work out the uncertainty in the model parameters themselves (in that case, just the square root of the leading diagonal terms of $\underline{\underline{C}}_p$), or in the observations, or in any other linear combination of the parameters.

For an interesting application of this, consider the case where all observations have the same uncertainty, $\sigma_{obs}$. Then

$$\underline{p} = (\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{\underline{M}}^T \underline{y}$$

i.e. the amount of uncertainty does not affect the estimate, but the standard deviation associated with $y$ becomes

$$\sigma_{obs} \sqrt{\underline{K}^T (\underline{\underline{M}}^T \underline{\underline{M}})^{-1} \underline{K}}$$

The term $(\underline{\underline{M}}^T \underline{\underline{M}})^{-1}$ is only dependent on the way in which the observation set samples the domain of the model. In Lucht and Lewis (2000)⁴, for example, this is used to examine the impact of particular satellite sensor angular sampling regimes on the determination of biophysical quantities.

Further constraints

Quite often, we find that the estimate of the parameters is only poorly constrained by observational datasets. In such cases, we say that the problem is ill conditioned. The condition number⁵ of the matrix $\underline{\underline{M}}^T \underline{\underline{M}}$ (formally, the ratio of the largest to smallest singular value in the singular value decomposition of a matrix) relates to the linear independence of the set of simultaneous equations expressed by the matrix. If this is too large, the matrix is ill-conditioned. In essence this means that you cannot solve for as many parameters as you have specified in the problem. This may be because some of the parameters are formally linearly dependent (i.e. there is a linear transformation between one parameter and another; e.g. if you specified the model as $\hat{y} = p_0 + p_1 x + p_2 (x + 1)$, this would be the case), or it may be that there is simply not enough information in the observations to solve the problem as you have stated it. Both the weighted estimate of equation 12 and the condition number are illustrated in the short sketch below.
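This sketch (Python/numpy, with made-up numbers; not part of the course code) shows the weighted estimate of equation 12 together with the parameter covariance $\underline{\underline{C}}_p$, and then the condition number of $\underline{\underline{M}}^T \underline{\underline{M}}$ for a sensible straight-line model and for the linearly dependent model just described:

```python
import numpy as np

# A sketch (Python/numpy, made-up numbers) of equation 12 and of the
# condition number of M^T M for well- and ill-conditioned model choices.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = 1.0 + 0.5*x + rng.normal(0.0, 0.1, x.size)      # synthetic observations

M = np.column_stack([np.ones_like(x), x])           # straight-line model
C = np.diag(np.full(x.size, 0.1**2))                # variance-covariance matrix
Cinv = np.linalg.inv(C)

Cp = np.linalg.inv(M.T @ Cinv @ M)                  # parameter covariance C_p
p = Cp @ (M.T @ Cinv @ y)                           # equation 12
print(p)                     # estimates, close to [1.0, 0.5]
print(np.sqrt(np.diag(Cp)))  # standard deviations of the parameter estimates

# Condition number of M^T M: modest for columns (1, x), but enormous for
# columns (1, x, x+1), because the third column is the sum of the other two.
M_bad = np.column_stack([np.ones_like(x), x, x + 1.0])
print(np.linalg.cond(M.T @ M))            # a modest number
print(np.linalg.cond(M_bad.T @ M_bad))    # huge (or infinite): ill-conditioned
```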
The options then are either to rephrase the problem (if appropriate) or to consider what other information might be brought to bear on finding the parameter estimate. One useful class of constraints in many problems, but especially in linear problems, is what may be termed regularisation methods. In essence, this means that you assume that there is some degree of smoothness between model parameters. In a geographic or temporal estimation, this might make a lot of sense: one way of phrasing it is that model parameters close together in space and/or time are more likely to be similar than those spaced widely apart. This has the effect of improving (lowering) the condition number of the matrix. We will not go into detail on how to solve such problems here, but see e.g. Quaife and Lewis (2010)⁶ for an example application. There are various excellent text books which cover this and related concepts in detail, although they are not aimed specifically at Geography undergraduates. My current favourites are Twomey (1996)⁷ and Rodgers (2000)⁸.

⁴ Lucht, W. and P. Lewis (2000) Theoretical noise sensitivity of BRDF and albedo retrieval from the EOS-MODIS and MISR sensors with respect to angular sampling. International Journal of Remote Sensing 21(1), 81-89.
⁵ http://mathworld.wolfram.com/ConditionNumber.html
⁶ Quaife, T. and P. Lewis (2010) Temporal constraints on linear BRF model parameters. IEEE Transactions on Geoscience and Remote Sensing, doi:10.1109/TGRS.2009.2038901.
⁷ Twomey, S. (1996) Introduction to the Mathematics of Inversion in Remote Sensing and Indirect Measurements. Dover Publications, 243 pages.
⁸ Rodgers, C. D. (2000) Inverse Methods for Atmospheric Sounding: Theory and Practice. Series on Atmospheric, Oceanic and Planetary Physics, Vol. 2. ISBN 978-981-02-2740-1.