1.017/1.010 Class 22
Linear Regression

Regression Models

Fluctuations in measured (dependent) variables can often be attributed (in part) to other (independent) variables. ANOVA identifies likely independent variables; regression methods quantify the relationship between dependent and independent variables.

Consider a problem with one random dependent variable y and one independent variable x related by a regression model:

    y = g(x, a1, a2, ..., am) + e

where:

    g(..)          = known function (e.g. a polynomial)
    a1, a2, ..., am = m unknown regression parameters
    e              = random residual, with E[e] = 0, Var[e] = σe^2, CDF = Fe(e)

[Figure: scatter plot of data (xi, yi) for 1872-1986]

We illustrate the basic concepts with the following special case, where g(..) is quadratic in x and linear in the ai's:

    y(x) = g(x, a1, a2, a3) + e = a1 + a2 x + a3 x^2 + e

The mean of y(x) is:

    E[y(x)] = a1 + a2 x + a3 x^2

The objective is to estimate the ai's from a set of y measurements [y1 y2 ... yn] taken at different known x values [x1 x2 ... xn]. The complete set of measurement equations is:

    yi = y(xi) = a1 + a2 xi + a3 xi^2 + ei ;  i = 1, ..., n

The residual errors [e1 e2 ... en] are all assumed to be independent with identical distributions.

Matrix Notation

Regression models and calculations are most easily expressed in terms of matrix operations. Suppose:

    A    = matrix (MATLAB array) with m rows and n columns
    B    = matrix with n rows and m columns
    C, D = matrices with m rows and m columns
    V    = vector with n rows and 1 column

Vectors are special cases with only 1 row or column.
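The measurement equations above can be sketched numerically. The notes use MATLAB; the sketch below uses Python/NumPy instead, and the parameter values and x locations are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical true parameters of the quadratic model (not from the notes).
a1, a2, a3 = 1.0, 0.5, -0.1

# Known x values at which measurements are taken.
x = np.array([0.5, 1.0, 2.0, 3.0, 4.0])

# Independent, identically distributed residuals e_i with E[e] = 0.
rng = np.random.default_rng(0)
e = rng.normal(loc=0.0, scale=0.2, size=x.size)

# Measurement equations: y_i = a1 + a2*x_i + a3*x_i^2 + e_i
y = a1 + a2 * x + a3 * x**2 + e

# The mean of y(x) is the residual-free part: E[y(x)] = a1 + a2*x + a3*x^2
Ey = a1 + a2 * x + a3 * x**2
print(y - Ey)  # the differences are exactly the residuals e_i
```

The simulated y values scatter around the mean curve E[y(x)] by exactly the residuals e_i, which is the structure the regression model assumes.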
Operation                      Matrix      Indexed                                              MATLAB
-----------------------------  ----------  ---------------------------------------------------  --------
Matrix product of A and B      C = AB      C_ik = Σ_{j=1}^{n} A_ij B_jk ; i = 1...m, k = 1...m   C=A*B
Transpose of B                 A = B'      A_ij = B_ji ; i = 1...m, j = 1...n                    A=B'
Matrix inverse of C            D = C^(-1)  D defined by CD = DC = I, i.e.                        D=inv(C)
                                           Σ_{j=1}^{m} C_ij D_jk = Σ_{j=1}^{m} D_ij C_jk = I_ik ;
                                           i, k = 1...m, where I_ik = 1 if i = k, 0 if i ≠ k
Sum-of-squares of elements     SSV = V'V   SSV = Σ_{j=1}^{n} V_j^2                               SSV=V'*V
of V

It is convenient for MATLAB computations to write the set of measurement equations in matrix form:

    Y = HA + E

where:

        [ y1 ]          [ 1  x1  x1^2 ]          [ a1 ]          [ e1 ]
    Y = [ y2 ]      H = [ 1  x2  x2^2 ]      A = [ a2 ]      E = [ e2 ]
        [ .. ]          [ .. ..  ..   ]          [ a3 ]          [ .. ]
        [ yn ]          [ 1  xn  xn^2 ]                          [ en ]

Least-Squares Estimates of Regression Parameters

The estimated ai's are selected to give the best fit between measurements and predictions. The predicted Y is computed from the ai estimates (predictions and estimates are indicated by ^ symbols):

                        [ â1 ]
    Ŷ = HÂ ;        Â = [ â2 ]
                        [ â3 ]

The measurement/prediction fit is described by the sum of squared prediction errors:

    SSE(Â) = [Y - HÂ]'[Y - HÂ] = Σ_{i=1}^{n} [yi - (â1 + â2 xi + â3 xi^2)]^2

SSE is minimized when:

    [H'H] Â = H'Y

This matrix equation is a concise way to represent three simultaneous equations in the three unknown ai estimates. The formal solution is:

    Â = [H'H]^(-1) H'Y

Note that H'H is a 3 by 3 matrix and H'Y is a 3 by 1 vector for the example. The estimation equations can be solved (for any particular set of measurements Y) with the MATLAB backslash \ operator:

    >> ahat = (H'*H)\(H'*y)

These equations have a unique solution only if n >= m (i.e. if there are at least as many measurements as unknowns).

The predicted y(x) is obtained by substituting the ai estimates for the true ai values in the regression function g(x, a1, a2, a3):

    ŷ(x) = h(x)Â = â1 + â2 x + â3 x^2 ;    h(x) = [1  x  x^2]

The estimates and predictions are random variables. The same approach extends to any model with a g(x, a1, a2, ..., am) that depends linearly on the ai's.
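The normal-equation solution above can be cross-checked outside MATLAB. A Python/NumPy sketch of the same computation (the data here are made up for illustration; `np.linalg.solve` plays the role of the MATLAB backslash applied to H'H):

```python
import numpy as np

# Hypothetical measurements (made up for illustration).
x = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.1, 1.6, 1.4, 0.9])

# Build H with columns [1, x, x^2], as in the notes.
H = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations [H'H] Ahat = H'Y, the analogue of
# MATLAB's ahat = (H'*H)\(H'*y):
ahat = np.linalg.solve(H.T @ H, H.T @ y)

# The same answer from a dedicated least-squares routine:
ahat_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.allclose(ahat, ahat_lstsq))  # True
```

For well-conditioned problems the two routes agree; `lstsq` (like the MATLAB backslash applied directly to H) is numerically preferable when H'H is nearly singular.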
Simply redefine H and h(x).

Example -- Regression Model of Soil Sorption

A laboratory experiment provides measurements of organic solvent y sorbed onto soil particles (in mg of solvent sorbed/kg of soil) for different aqueous concentrations x of the solvent (in mg of dissolved solvent/liter of water). Assume that the regression model proposed above applies. Suppose the specified (controlled) x values and corresponding y values are:

    [x1 x2 x3 x4] = [0.5  2.0  3.0  4.0]
    [y1 y2 y3 y4] = [0.4134  2.1453  1.7466  3.0742]

so that:

        [ 1  x1  x1^2 ]   [ 1  0.5   0.25 ]        [ 0.4134 ]
    H = [ 1  x2  x2^2 ] = [ 1  2.0   4.0  ]    Y = [ 2.1453 ]
        [ 1  x3  x3^2 ]   [ 1  3.0   9.0  ]        [ 1.7466 ]
        [ 1  x4  x4^2 ]   [ 1  4.0  16.0  ]        [ 3.0742 ]

MATLAB gives:

    >> ahat = (H'*H)\(H'*y)
    ahat =
        0.0924
        0.8829
       -0.0471

So the prediction equation is:

    ŷ(x) = 0.0924 + 0.8829 x - 0.0471 x^2

Plot this equation on the same axes as the measurements.

Copyright 2003 Massachusetts Institute of Technology
Last modified Oct. 8, 2003
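The soil sorption example can be reproduced without MATLAB. A Python/NumPy sketch using the data from the example (only the `yhat` helper name is new here):

```python
import numpy as np

# Data from the soil sorption example.
x = np.array([0.5, 2.0, 3.0, 4.0])
y = np.array([0.4134, 2.1453, 1.7466, 3.0742])

# H with columns [1, x, x^2]; normal equations [H'H] Ahat = H'Y.
H = np.column_stack([np.ones_like(x), x, x**2])
ahat = np.linalg.solve(H.T @ H, H.T @ y)
print(ahat)  # approximately [0.0924, 0.8829, -0.0471], matching MATLAB

# Prediction equation yhat(x) = h(x) Ahat = a1hat + a2hat*x + a3hat*x^2
def yhat(xq):
    return ahat[0] + ahat[1] * xq + ahat[2] * xq**2
```

With four measurements and three parameters the system is overdetermined by one equation, so the fitted curve passes near, but not through, the data points.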