Chapter Four: Linear Least-Squares Regression

Contents:
Variables and Residual Error
    Introduction to Residuals
    Residuals and Sample Statistics
Introduction to Regression Analysis
    Regression Models
    Procedure and Purpose
Least-Squares Linear Regression
    Obtaining Estimates of Regression Parameters
    Advanced Topic: Partitioning Variability in Regression Analysis
    Calculation of Least-Squares Estimates
    Estimating Measurement Noise: Homogeneous Variance
    Properties of Least-Squares Estimates
    Least-Squares Regression with Excel
    Assumptions and Common Violations
Prediction using Linear Regression
    Prediction of Dependent Variable
    "Inverse" Regression: Prediction of Independent Variable
Chapter Checkpoint

Variables and Residual Error

Introduction to Residuals

In studying linear regression, we move into a field that is in many ways both familiar and new. Regression analysis is the study of the relationship between two different classes of variables: dependent variables and independent variables. Before studying the relationship between variables, it will be very helpful to introduce a new way of looking at random variables.

Let y be a random variable with mean µy and standard deviation σy. Any single observation (i.e., measurement) of y can be written as

yi = µy + εi   [4.1]

where yi is the observation and εi is the "true error" in the observation, the difference between yi and µy. We will call this error the residual error, or simply a residual. Now, rather than thinking of y as the variable, we can think of the residual as a random variable ε, with a mean of zero and a standard deviation of σy.

Of course, the value of µy is not always known. If we obtain a series of n measurements of y, we can calculate the sample mean ȳ of the values, and represent each of the measurements as

yi = ȳ + ei   [4.2]

where ei is the observed residual value in the ith measurement.
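For readers who like to experiment, the following short Python sketch mimics the situation described by eqns. 4.1 and 4.2 (and illustrated in figures 4.1 and 4.2 below): it draws five measurements from a population with µy = 10 and σy = 1 and computes both the true and the observed residuals. The random seed and variable names are arbitrary choices for illustration, not part of the chapter's data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mu_y, sigma_y = 10.0, 1.0                 # population mean and standard deviation
y = mu_y + rng.normal(0.0, sigma_y, 5)    # five measurements: y_i = mu_y + eps_i  (eqn. 4.1)

true_resid = y - mu_y                     # eps_i, known only because we know mu_y
obs_resid = y - y.mean()                  # e_i = y_i - y_bar  (eqn. 4.2)

print(true_resid)
print(obs_resid)                          # close to eps_i, but sums exactly to zero
```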
This residual, ei, will not be the same as the true residual value, εi, because the sample mean is not likely to be exactly equal to the true mean. Let's imagine that we obtain 5 measurements of a variable y with µy = 10 and σy = 1. The following figure shows these measurements (obtained from a random number generator), and shows how each measurement of y can be broken down into the mean and the residual values.

Figure 4.1: Difference between the true and observed residuals for a measurement. The dotted line is the true mean, µy, of the measurements, while the solid line is the sample mean, ȳ, of the five measurements. Each measurement value yi can be described as either µy + εi or ȳ + ei.

Note that, if we wanted, instead of plotting the actual values of the measurements, we could plot the observed residuals for each observation. A comparison of plots of the original observations and the observed residuals is shown in the next figure.

Figure 4.2: Comparison of (a) the original measurements and (b) the residuals.

As we can see, the two plots look identical, except for the shift in mean (the residuals are "mean-centered," so that the mean of the residuals is zero). In a way, we can now think of the residuals as our random variable, rather than y.

Thus, we see that this new way of looking at a random variable y is not really so different from our previous method. Instead of thinking of y as a simple random variable, we can break it up into two parts: a constant value µy and a variable part ε. The variable ε has the same probability distribution as the original variable y, except that the mean of ε is zero. Thus, if y is normally distributed with a mean of µy and a standard deviation of σy, then ε is also normally distributed with a mean of 0 and a standard deviation of σy.

Residuals and Sample Statistics

At this point, let's consider the relationship of the residuals to two sample statistics: the sample mean ȳ and the sample variance, s²y. We have defined the observed residual of a measurement as the difference between the measurement and the sample mean of all the collected measurements: ei = yi − ȳ. For the moment, let's consider a more general definition of the observed residuals:

ei = yi − k

where k is any fixed value, not necessarily the sample mean. It turns out that when k is equal to the sample mean, the residuals have an important property: the sum of the squared residuals is at the minimum possible value. In other words, setting k = ȳ minimizes the sum of the squared residuals, Σei². For this reason, the mean is sometimes referred to as the least-squares estimate of µy. In addition, when k = ȳ, the sum of the observed residuals must add up to zero: Σei = 0.

The sample variance (and the sample standard deviation) of a group of measurements is actually calculated using the residuals. If the value of µy is known, then we may use the true residuals:

s²y = (1/n) Σεi²   [4.3]

You should convince yourself that this formula is identical to eqn. 2.4. Usually, however, the value of µy is not known, so we must use the observed residuals in an equation that is derived from eqn. 2.5:
s²y = (1/(n − 1)) Σei²

A more general form of this equation is obtained by using the degrees of freedom:

s²y = (1/ν) Σei²   [4.4]

where ν is the number of degrees of freedom of the residuals.

Introduction to Regression Analysis

Figures 4.1 and 4.2 are concerned with five measurements of a variable y. Each of these measurements is taken from a single population, as described by the population mean µy and standard deviation σy. Now, however, let's change one important aspect of this experiment: let's consider what would happen if the population mean µy changed for each measurement. For example, let's say that the value of µy for a particular measurement depends on the value of some other property, x, of our system:

µy = f(x)

Thus, if the value of x is different for each measurement, then the measurements are no longer taken from the same population, but from different populations, each with a different population mean µy. If we consider that, as before,

y = µy + ε

then we see that the measurements will exhibit two sources of variability:

1. variability due to the change in µy with each measurement, and
2. variability due to the random nature of ε.

This is exactly the situation examined by regression analysis. A number of observations of a random variable y are obtained; for each observation, the value of some property x is also known. It is the relationship between y and x that is studied in regression analysis.

In quantitative chemical analysis, we are very often interested in the relationship between the analyte concentration and the measured response of an analytical instrument. The signal will depend in some way on the analyte concentration (hopefully!); however, any measured value will also contain random error. Thus, in quantitative analysis, we can think of the analyte concentration as the "x" variable and the measurement as the "y" variable. The population mean µy depends on the analyte concentration, and the residual variable ε reflects the presence of measurement noise.

In these examples, the y variables are the dependent variables, while the x variables are the independent variables. Although x is a variable, its value for each observation is presumably known exactly – i.e., without error. Regression assumes that all measurement error resides in the dependent variable(s).

Aside: causal relationships and variable correlation

It is tempting to think that the value of the dependent variable y is "caused" by the value of the variable x. In chemical analysis, it certainly seems reasonable that an increase in analyte concentration will cause an increase in signal (in fact, quantitative analysis is based on this very principle). Such relationships are called causal relationships. Regression analysis, however, can never prove a causal relationship; it can only demonstrate some kind of correlation between the values of variables. A correlation between variables simply means that they "change together"; for example, an increase in one variable may be accompanied by an increase in the other. However, both variables may be "caused" by some other factor entirely. The scientific literature is filled with examples in which correlation is mistaken for a causal relationship.

Regression Models

If we suppose that the population mean, µy, depends on some variable x, then we can postulate a relationship between the two.
The postulated relationship is the regression model that we will use in our analysis. The simplest regression model assumes that the value of µy varies linearly with x:

linear model:   µy = β1x + β0   [4.5]

where β1 is the slope of the line and β0 is a constant offset. The values β1 and β0 are the regression model parameters. We will have a lot to say about these parameters.

A first-order linear model (such as in eqn. 4.5) is the most common regression model, but there are many others. Any functional relationship can be used as a model. For example, you may wish to describe your observations by assuming a polynomial dependence of y on x:

2nd-order polynomial model:   µy = β2x² + β1x + β0   [4.6]

In this case, the model parameters are β2, β1 and β0. Even though a 2nd-order polynomial is not a line, this model is still linear in the model parameters. In other words, there is a linear dependence of µy on the parameters; changes in the values of any of the parameters (β2, β1, or β0) will produce linear changes in µy. Thus, both of the above models are linear regression models; the linear relationship (eqn. 4.5) is sometimes called a first-order linear model, while the model specified in eqn. 4.6 is a second-order linear model. Higher-order polynomials are likewise considered to be linear regression models.

Nonlinear regression is also common. For example, we might wish to assume an exponential dependence of µy on x:

exponential model:   µy = β0·exp(β1x)

The exponential model is nonlinear because µy depends nonlinearly on the regression parameter β1. Linear models are by far the most common regression models; we will henceforth restrict ourselves to first-order linear regression. The same general principles apply to all least-squares regression methods, but the equations are different.

We must remember that y is a random variable, just like those we have studied in previous chapters. The only difference is that the population mean µy can now change between measurements. The following figure shows the situation when µy depends linearly on x.

Figure 4.3: Illustration of regression analysis. The dependent variable, y, is a random variable whose population mean, µy, depends on the value of the independent variable, x. The line in the figure shows the functional dependence of µy on x; at any fixed value of x, there is a probability distribution that governs the values of y that will be observed in an experiment.

From the figure, we can see that if we obtain measurements of the dependent variable y, we would expect the data points to be scattered about the line due to the random nature of the variable. To use an example from quantitative analysis, imagine that we are obtaining absorbance measurements of a set of samples. According to Beer's Law, we would expect a linear relationship between absorbance and analyte concentration. However, if we plot the absorbance values as a function of concentration, it is extremely unlikely that they will all fall exactly on a line, even when measured under conditions in which Beer's Law holds. The reason that the points do not all lie along the line is that each measurement is subject to random error. If we were to obtain repeated measurements on a single sample (i.e., with the concentration fixed at a given value), then we would see random error due to the measurement noise.
This measurement noise is present in all measurements, even those obtained at different concentrations.

Procedure and Purpose

Regression analysis consists of the following steps.

1. Choose a regression model.
2. Collect measurements to determine estimates of the regression parameters.
3. Calculate estimates of the model parameters. We will study the least-squares procedure for calculating parameter estimates.
4. Check the appropriateness of the model. This step may simply involve a quick visual inspection of plots of the data or the residuals, or may involve more extensive analysis. If there is a problem with the chosen model, then an alternate regression model is chosen and step 3 is repeated.
5. Use the regression parameter estimates for the desired purpose.

The goal of regression analysis is usually one or more of the following:

• Prediction. The analyst desires the ability to predict the values of some of the variables from the values of the other variables. This is usually the goal of applying regression to chemical analysis.
• Model specification. The analyst is interested in the regression model that best explains the observed variation in the dependent variables. The idea is to investigate system behavior.
• Independent variable screening. It is possible to specify a model in which there is more than one independent variable. In such cases, the analyst may wish to determine which independent variables are truly significant in terms of their effect on the dependent variables, and which can be ignored.
• Parameter estimation. Sometimes the values of the parameters that appear in the regression model are themselves of primary interest. For example, the slope of a Beer's Law plot is related to the molar extinction coefficient, and it may be this value that is of interest.

Although all of these applications are common in chemistry, in this chapter we are concerned mostly with the first application, the prediction of variables.

Least-Squares Linear Regression

Obtaining Estimates of Regression Parameters

During a regression experiment, the system response yi is measured for various values xi of the possible controlling variables. Thus, we end up with a set of number pairs, (xi, yi): two "columns" of numbers, one for the independent variable, whose value is known exactly, and the other for the dependent variable, a random variable. For example, during the calibration step in quantitative analysis we might collect the data shown in the following figure:

Figure 4.4: Data collected during the calibration step in quantitative analysis (instrument response plotted against concentration).

The data in the figure were generated using the first-order linear model: the population mean of each measurement is given by

µy = β1x + β0

Since the regression parameters β1 and β0 are (presumably) not known to us, we must obtain estimates and use them to estimate the population mean of a measurement:

predicted response:   ŷ = b1x + b0   [4.7]

where b1 and b0 are our estimates of the corresponding regression parameters β1 and β0, respectively. Thus, ŷ represents our best guess of the population mean, µy, for a given value of x; it serves exactly the same function in regression as the sample mean ȳ did in previous chapters, except that now µy is a function of x.
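The two sources of variability in calibration data are easy to see in a small simulation. The sketch below is a hedged illustration (not the procedure actually used to generate the chapter's figures): it builds a calibration data set from a first-order linear model plus random noise, using the parameter values that are stated later for the data of figure 4.4.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

beta1, beta0, sigma = 2.5, 15.0, 1.0       # true model parameters and noise level
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])    # analyte concentrations (ppm)

mu_y = beta1 * x + beta0                   # source 1: mu_y changes with x (eqn. 4.5)
y = mu_y + rng.normal(0.0, sigma, x.size)  # source 2: random residual on each measurement

for xi, mi, yi in zip(x, mu_y, y):
    print(f"x = {xi:3.1f}   mu_y = {mi:5.2f}   y = {yi:6.3f}")
```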
The most common method for obtaining the estimates b1 and b0 is the least-squares procedure, which will be explained in more detail shortly. The following figure shows the data collected, the values ŷ predicted using the least-squares estimates, and the true values of µy.

Figure 4.5: Comparison of the least-squares estimate ("best fit") and the actual response. The solid line is the function ŷ = b1x + b0, and the dotted line is the function µy = β1x + β0.

Any observation yi in figure 4.5 can be described in terms of the true residual error, εi, or the observed residual error, ei:

yi = µy + εi = β1xi + β0 + εi   [4.8]
yi = ŷi + ei = b1xi + b0 + ei   [4.9]

You should compare these equations to eqns. 4.1 and 4.2. The values of the observed residuals, ei, are the differences between the data points and the solid line in figure 4.5.

Advanced Topic: Partitioning Variability in Regression Analysis

Although the material in this section is not strictly necessary in order to perform least-squares linear regression, an understanding of it will help you to "appreciate" regression analysis a little better. As mentioned previously, there are two types of variation in the calibration data:

1. variability of the measurement y about µy, due to the random nature of y; and
2. variability that is explained by the change in µy with x.

Since these two types of variability are independent of one another, we may write

σ²tot = σ²reg + σ²y

where σ²tot is the overall variance of the measurements and σ²reg is the portion of the variance that is explained by the regression model. A convenient way of specifying the "fit" of the regression is the coefficient of determination, R²:

R² = σ²reg/σ²tot = 1 − σ²y/σ²tot

Note that R² is simply the square of the correlation coefficient, which was introduced in chapter 2. R² is a value between 0 and 1; it represents the fraction of the total variance that is explained by the regression model. Thus, if R² = 0, then the variance in the measurements is entirely due to the random nature of y; the mean µy does not change at all between measurements. On the other hand, if R² = 1, then the regression model explains all of the observed variance; for a first-order linear model, all the data points would lie exactly on a line.

In practice, R² can be used as a crude way to compare two different regression models: a model that results in a value of R² closer to one has smaller residuals (a "better fit") than a model that gives a smaller R² value. However, caution must be used when using R² to compare different calibration models. For example, a second-order linear model (a "polynomial" fit) will always give at least as large an R² value as a first-order linear model. This improved fit, however, does not mean that the second-order model will result in better predictions of the dependent or independent variables. It is generally better to choose models with as few parameters as possible; in quantitative chemical analysis, it is best to begin with the simplest possible model (usually a first-order linear model) and only go to more complicated models when the simple model is obviously inadequate.
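As a concrete illustration of this partitioning, the short sketch below computes R² as one minus the fraction of the total variability left unexplained by a fitted line. The helper function and its name are an illustrative assumption; applied to the calibration data tabulated in example 4.1 below, with the fitted line of figure 4.6, it returns approximately 0.875.

```python
import numpy as np

def r_squared(x, y, b1, b0):
    """Coefficient of determination for the fitted line y_hat = b1*x + b0."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    resid = y - (b1 * x + b0)                 # observed residuals about the fitted line
    ss_res = np.sum(resid ** 2)               # unexplained (residual) sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares about the mean
    return 1.0 - ss_res / ss_tot

x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [15.658, 17.773, 20.155, 19.745, 21.115]
print(r_squared(x, y, b1=2.577, b0=15.024))   # roughly 0.875, as in figure 4.6
```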
Calculation of Least-Squares Estimates

We will now discuss the philosophy of least-squares estimation and give the equations necessary to calculate the least-squares estimates of the regression model. To illustrate the calculation of least-squares estimates, we will use the data previously presented in figure 4.4, which was generated using a first-order linear model with β1 = 2.5 and β0 = 15:

µy = 2.5x + 15

where x is the analyte concentration and y is the instrument response. This equation determines the population mean of each measurement; in addition, for all the measurements, σy = 1. A random number generator was used to obtain the data points according to these population parameters. The following figure shows the least-squares fit to the line, as well as the observed residuals.

Figure 4.6: (a) Least-squares best-fit line (y = 2.577x + 15.024, R² = 0.8751) with n = 5 observations. Associated with each measurement is an observed residual error, which is the distance of the data point from the best-fit line; (b) plot of the residuals. The sum of the residuals is zero; the least-squares fit minimizes the sum of the squared residuals.

The least-squares estimates are shown in part (a) of the figure: b1 = 2.577 and b0 = 15.024. These estimates are sample statistics, estimators of the corresponding regression model parameters (just as the sample mean and sample standard deviation are estimators of the corresponding population parameters). Thus, we have

least-squares prediction:   ŷ = 2.577x + 15.024

The plot of the residuals is very interesting; if you compare it to figure 4.2(b), you will find the plots to be very similar. (Actually, they are identical, since the same base data was used to generate both sets of measurements.) Recall that when we discussed the sample mean of a set of measurements, all with constant µy, two important properties of the sample mean were mentioned:

1. The sum of the residuals is zero: Σei = 0.
2. The sum of the squares of the residuals, Σei², is the minimum possible value.

The least-squares estimate ŷ possesses these same qualities! The least-squares estimation procedure is such that the values of b1 and b0 minimize the sum of the squared residuals. In fact, this very property is what gives the name "least-squares" to these estimates. The derivation of the equations for the least-squares estimates is beyond the scope of this chapter; what is more important is to understand the philosophy behind the equations (i.e., that they minimize the sum of the squared residuals). For first-order linear regression, the least-squares estimates are calculated according to the following formulas:

b1 = Sxy/Sxx        b0 = ȳ − b1x̄   [4.10]

where Sxy ≡ Σ(xi − x̄)(yi − ȳ) and Sxx ≡ Σ(xi − x̄)², and ȳ and x̄ are the sample means of the y and x observations, respectively.

Aside: using calculators for first-order linear regression

Most scientific calculators can provide least-squares regression parameters for a first-order linear model. If not, then the following formulas are a little easier to use than the previous expressions:

Sxy = Σxiyi − n·x̄·ȳ   [4.11(a)]
Sxx = (n − 1)·s²x   [4.11(b)]

where s²x is the sample variance of the x-values of the regression data.
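For those who would rather follow the arithmetic in code than on a calculator, here is a minimal Python sketch of eqns. 4.10 and 4.11. The function name is an arbitrary choice; applied to the data of example 4.1 below it gives b1 ≈ 2.577 and b0 ≈ 15.02, the values shown in figure 4.6.

```python
import numpy as np

def least_squares_fit(x, y):
    """First-order least-squares estimates (slope b1, intercept b0); eqns. 4.10 and 4.11."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    s_xy = np.sum(x * y) - n * x.mean() * y.mean()   # eqn. 4.11(a)
    s_xx = (n - 1) * x.var(ddof=1)                   # eqn. 4.11(b): (n-1) times the sample variance of x
    b1 = s_xy / s_xx                                 # eqn. 4.10
    b0 = y.mean() - b1 * x.mean()
    return b1, b0
```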
The data in the next example are the same as were used in the last figure; see if you can obtain the least-squares estimates with your calculator (it's not much fun by hand!).

Example 4.1
The following is the set of data used to construct figure 4.6. Consider the independent variable to be the analyte concentration (in ppm) and the dependent variable to be the instrument response (in arbitrary units):

x (conc in ppm)    y (instrument response)
0.5                15.658
1.0                17.773
1.5                20.155
2.0                19.745
2.5                21.115

Assume a first-order linear model, and calculate the least-squares estimates of the slope (b1) and intercept (b0) of the line (you should obtain the same values as shown in the figure).

Using eqns. 4.10 and 4.11 with n = 5:

Sxy = Σxiyi − 5·x̄·ȳ = 6.4430
Sxx = 4·s²x = 2.5000
b1 = Sxy/Sxx = 2.5772
b0 = ȳ − b1·x̄ = 15.0234

These are the least-squares estimates.

When we use the least-squares procedure, we see from figure 4.6 that the coefficient of determination, R², is 0.8751. This means that 87.51% of the variance observed in the dependent variable is accounted for by the regression model; the remainder (the "scatter" about the fitted line) is presumably due to measurement noise.

Estimating Measurement Noise: Homogeneous Variance

In regression analysis, the value of µy can change between observations. Before proceeding, we must make another important assumption. Although it may seem that we have made this assumption all along, we must now state it explicitly: we assume that the magnitude of the measurement noise does not change between observations. Using statistical terminology, we say that the dependent variable y exhibits homogeneous variance, σ², for all the values of x in the data set. We will use the symbol σ (rather than σy) to emphasize that we assume a common standard deviation for all measurements.

As we will see, it is important to be able to estimate this common measurement noise. One way to obtain an estimate would be to obtain repeated measurements at a fixed value of x; since the noise remains constant for all values of x, this estimate would be valid for all the measurements in the calibration data. For example, in instrumental analysis, you might take a number of measurements on one of the calibration standards. The sample standard deviation of the measurements on this standard is an estimate of the measurement noise.

However, there is actually a way to estimate the measurement noise without taking repeated measurements. If our postulated regression model is accurate, then we can obtain an unbiased estimate of the homogeneous variance from the residuals, using eqn. 4.4:

homogeneous variance for linear model:   s² = (1/ν) Σei²

where the degrees of freedom, ν, of our estimate is equal to n − p, with n being the number of data points and p the number of regression parameters. For first-order linear regression, then, there are n − 2 degrees of freedom. Thus, to estimate the homogeneous variance, we must first obtain the least-squares regression estimates, then calculate the observed residuals:

ei = yi − b1xi − b0

and then use eqn. 4.4. The following example illustrates how this is done.

Example 4.2
Estimate the measurement noise in the data from the previous example; assume that the regression model is correct and that the random variance is homogeneous.
Recall that b0 = 15.0234 and b1 = 2.5772. Let's calculate the y-residuals of the data points, ei = yi − b1xi − b0:

e = (−0.654  0.172  1.266  −0.433  −0.351)

The sum of the squared residuals is easily calculated:

SSres = Σei² = 2.3705   (remember that the least-squares estimates minimize this value)

sres = √(SSres/3) = 0.8889

This is the standard deviation of the residuals (n − 2 = 3 degrees of freedom), which is an estimate of the homogeneous measurement noise.

Thus, we estimate that the common measurement noise on all the measurements is 0.889. This value is probably most properly referred to as the standard deviation of the residuals, since it only estimates the homogeneous noise when certain assumptions are valid, as we will see. In this case, since the data were actually generated by a random number generator with σ = 1, the estimate seems reasonable. One final note: the regression routines in most spreadsheet programs (such as Excel or Quattro Pro) will provide the standard deviation of the residuals – although they won't call it by that name.

Properties of Least-Squares Estimates

Our estimates of the regression parameters, b1 and b0, are sample statistics; as such, there will be some error in the estimates, due to the random error present in the data used to calculate them. In other words, the estimates are variables: if we repeated the entire calibration procedure, we would almost certainly obtain different values for b1 and b0, even though the regression parameters β1 and β0 are the same. We are interested in characterizing the variability of the least-squares estimates of the regression parameters.

The variance of the least-squares estimates can be derived using propagation of error. If we assume that the measurement variance is homogeneous, then the variance of the estimates is given by

σ²(b1) = σ²/Sxx        σ²(b0) = σ²·(1/n + x̄²/Sxx)   [4.12]

where σ² is the magnitude of the homogeneous variance, n is the number of measurements, and Sxx is given by eqn. 4.11(b). The standard errors (i.e., the standard deviations) of these estimates are the square roots of the variances calculated using this equation.

If the true value of the measurement variance is not known, then we can only estimate the standard error of the regression parameter estimates. For the least-squares estimates, we may use the following expressions:

s(b1) = sres/√Sxx        s(b0) = sres·√(1/n + x̄²/Sxx)   [4.13]

where sres is the estimate of the homogeneous noise, σ, obtained from the residuals. The number of degrees of freedom in these estimates of the standard error is the same as the number of degrees of freedom in sres: i.e., ν = n − p.

Example 4.3
Calculate confidence intervals for the slope, β1, and intercept, β0, for the regression data in example 4.1 using the least-squares procedure, and assuming a first-order linear model and homogeneous variance.

Remember that Sxx = 2.5000 and sres = 0.8889. Then

s(b1) = sres/√Sxx = 0.5622   (standard error of the least-squares estimate of the slope)
s(b0) = sres·√(1/5 + x̄²/Sxx) = 0.9323   (standard error of the least-squares estimate of the intercept)

Important point: if the measurements are normally distributed, then the least-squares estimates also follow normal probability distributions. This allows us to construct confidence intervals for these estimates. To illustrate, let's construct 95% CIs for the slope and intercept calculated in example 4.1.
The standard errors of these estimates were calculated in the last example; all we need is to find the proper tν value. Since we have 3 degrees of freedom, we must use t3,0.025, which is 3.182. Thus, we see that

b1 = 2.6 ± 1.8   [95%, n = 5]
b0 = 15.0 ± 3.0   [95%, n = 5]

Least-Squares Regression with Excel

Linear regression in Excel is performed in the following manner. Choose "Tools", then "Data Analysis", then "Regression" [note that the "Analysis ToolPak" add-in must be activated to do this]. Next, choose the independent variable (e.g., analyte concentration) and the dependent variable (e.g., instrument response). Choose the output region, if you want. Note that you may choose more than one column for your independent variable; in this manner, you can obtain least-squares estimates for higher-order ("polynomial") linear regression models.

I used Excel 97 for Windows 95 to analyze the data in Example 4.1. The output is shown in the following table, along with a brief explanation of each line.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.93546
R Square             0.87509    (coefficient of determination)
Adjusted R Square    0.83345
Standard Error       0.88879    (std dev of residuals)
Observations         5

ANOVA
              df    SS        MS        F         Significance F
Regression    1     16.6024   16.6024   21.0170   0.0195
Residual      3     2.3699    0.7900              (df and SS of residuals)
Total         4     18.9723

               Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept      15.0238        0.9322           16.1169   0.0005    12.0572     17.9903
X Variable 1   2.5770         0.5621           4.5844    0.0195    0.7881      4.3659

(The "Coefficients" column gives the least-squares estimates of β0 and β1, the "Standard Error" column gives the standard deviations of those estimates, and the "Lower 95%" and "Upper 95%" columns give the 95% confidence intervals for β0 and β1.)

You should compare the values in the table to the values we calculated in examples 4.1−4.3, to familiarize yourself with Excel's somewhat terse terminology. If you use another spreadsheet, you should try to interpret the output in terms of the material we cover in this chapter.

Assumptions and Common Violations

Let's summarize the skills we have obtained up to this point:

• we have learned how to obtain the least-squares estimates for a first-order linear regression model;
• we have learned how to calculate the standard error of the least-squares regression estimates;
• we have learned how to construct confidence intervals for the least-squares estimates.

At one level, of course, these abilities are not difficult to obtain: just learn how to use the appropriate equations. Of course, there is much more to linear regression than just using the equations; you must understand regression analysis in order to appreciate what the formulas are telling you. An important part of this understanding is knowing the assumptions that are typically made in regression analysis. Violation of these assumptions is by no means a rare occurrence, so it is important to understand the nature of the assumptions in order to appreciate the limitations of regression analysis. The common assumptions, roughly in the order they were made in this chapter, are:

1. No error in x values. Throughout this chapter, we have assumed that there is no random error in the independent variable(s), and that all such random error is in the dependent variable. Violation of this assumption is common. In quantitative analysis, it is rarely true that the concentrations of the calibration standards are known without error; in fact, error in the standard concentrations can be the main contribution to error in the estimated analyte concentration!
In quantitative analysis, violation of this assumption is actually not very serious. It has been shown that random error in the x variable can be described as an inflation of the random error in the y variable. In other words, random error in x serves to increase the standard deviation of the residuals, and we can think of this error as being due to "measurement noise" (i.e., noise in y). What this means is that the scatter about the fitted regression line is greater than would be predicted by obtaining repeated measurements on any one standard. Practically speaking, the consequences of violating this assumption can be minimized or even eliminated if you

• take the same number of measurements on all the standards and all the samples (unknowns). If you take more than one measurement on each of these, use the averages when performing the regression;
• estimate the measurement noise from the standard deviation of the residuals. "Concentration error" inflates the scatter about the fitted line, and should be included in our estimate of the measurement noise. The "concentration error" will not be included in any estimate of measurement noise that is calculated from repeated measurements on a single standard.

Naturally, these precautions do not correct for the presence of bias in the x values; this type of error will result in biased estimates of the analyte concentration (as would measurement bias in the y values, of course).

2. Homogeneous variance. The assumption of homogeneous error allows easy (or at least relatively easy!) calculation of the measurement noise and the standard error of the regression parameter estimates. In quantitative analysis, this assumption is suspect whenever the calibration standards cover a wide range of concentrations. In such cases, the measurement noise σy may well depend on the value of x; generally, as the value of y (and x) increases, the noise increases as well. This is an example of inhomogeneous error.

Least-squares estimates can still be used in the presence of inhomogeneous error; however, the least-squares estimates are no longer as "attractive" as they are in the case of homogeneous error. In statistical parlance, the least-squares estimates are no longer the "best linear unbiased estimates" (don't worry too much about what this phrase means). More seriously, the standard error of the regression estimates can no longer be calculated using formulas such as those in eqn. 4.12. The practical consequence of this violation is that the confidence levels are not correct – we may think that we have calculated 95% confidence intervals, for example, but the actual confidence level will be somewhat different than 95%.

If the measurement noise of every data point is known, then the weighted least-squares procedure can be used to obtain regression estimates. Weighted least-squares is a modification of the least-squares procedure in which each data point is given a "weight" that is inversely proportional to its population variance. The standard error of these estimates can be calculated.

3. Normal distribution of measurements. If the measurements are normally distributed, then so too are the least-squares estimates, which allows us to use the standard errors of these estimates to construct confidence intervals. If this assumption is violated, then the confidence intervals cannot be calculated using either z or t tables; fortunately, this assumption is often a reasonable one.
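Before moving on to prediction, it may help to see the calculations of examples 4.1 through 4.3 collected in one place. The sketch below is simply a restatement of eqns. 4.10, 4.4, and 4.13 in Python, with SciPy supplying the t value; it reproduces the intervals quoted above (b1 = 2.6 ± 1.8, b0 = 15.0 ± 3.0).

```python
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])                 # calibration data of example 4.1
y = np.array([15.658, 17.773, 20.155, 19.745, 21.115])

n = x.size
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = s_xy / s_xx                                        # eqn. 4.10
b0 = y.mean() - b1 * x.mean()

resid = y - (b1 * x + b0)
s_res = np.sqrt(np.sum(resid ** 2) / (n - 2))           # eqn. 4.4 with nu = n - 2

se_b1 = s_res / np.sqrt(s_xx)                           # eqn. 4.13
se_b0 = s_res * np.sqrt(1.0 / n + x.mean() ** 2 / s_xx)

t = stats.t.ppf(0.975, df=n - 2)                        # t value for a 95% confidence interval
print(f"b1 = {b1:.3f} +/- {t * se_b1:.1f}")
print(f"b0 = {b0:.3f} +/- {t * se_b0:.1f}")
```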
Prediction using Linear Regression

Prediction of Dependent Variable

Let's assume that a first-order linear regression model with homogeneous noise applies to the measurements obtained during a regression experiment:

µ(yi) = β1xi + β0

where µ(yi) is the mean of the measurements when x = xi. Of course, the values of the parameters, β1 and β0, are not known, and we must calculate the regression estimates b1 and b0 from the calibration data. Using these values, the best estimate of the population mean µ(yi) when x = xi is given by

ŷi = b1xi + b0

where ŷi is our estimate of µ(yi), exactly like the sample mean. And, just like the sample mean, this predicted value ŷi is a variable, due to the error in the regression estimates. In the last section, we calculated the standard error in our estimates b1 and b0; what about the standard error in the value ŷi predicted at a fixed value xi? If we assume homogeneous variance, then the variability in the predicted value at x = xi is given by the following equation:

variance of predicted value:   σ²(ŷi) = σ²·(1/n + (xi − x̄)²/Sxx)

where n is the number of measurements used to determine b1 and b0, xi is the value of x at which we want to find the variance of ŷ, x̄ is the mean of the x values in the calibration data, and Sxx can be calculated using eqn. 4.11(b). The square root of this variance is the standard error of our predicted response; in other words, it is the standard error of the fitted regression line. Note that the magnitude of the standard error depends on how far xi is from the mean of the x values used to obtain the regression estimates b1 and b0. There is greater random error near the "ends" of the fitted line than in the "middle."

If we know the standard error of the predicted value ŷi at some value xi, then it is a simple matter to construct a confidence interval (assuming a normal pdf); this is an interval estimate of the true response µ(yi) at xi:

CI for µ(yi), σ unknown:   ŷi ± tν,α/2 · s·√(1/n + (xi − x̄)²/Sxx)   [4.14]

At the edges of the calibration curve, this confidence interval will "flare," because the standard error of our point estimate increases.

Figure 4.7: A plot of the confidence interval for µy as a function of x. Note that the CI becomes wider at the edges of the calibration curve; the CI is narrowest at x̄, the mean value of the independent variable for the calibration data.

The confidence interval in eqn. 4.14 gives an interval that contains the true mean response (µy) at a given value of x. However, imagine that we wish to obtain a new measurement yi at a value of x = xi; can we find an interval that will (probably) contain the new measurement? This interval is a prediction interval for a future measurement. We can calculate the residual error of this hypothetical future measurement:

ei = yi − ŷi

The residual is simply the distance that the measurement will be from the regression line. What is the variance of this distance? From propagation of error, we can calculate

σ²(ei) = σ²(ŷi) + σ²(yi)

We see that the variance of the future residual is due to the measurement noise in yi and the uncertainty in ŷi; in other words, random error in both the regression line and in the future measurement itself. If we assume homogeneous variance, then σ(yi) = σ, and we can combine this with the expression for σ²(ŷi) above to obtain the variance of a "future" residual:
σ²(ei) = σ² + σ²·(1/n + (xi − x̄)²/Sxx)

This expression is used to construct the desired prediction interval:

PI for single measurements:   ŷi ± tn−2,α/2 · s·√(1 + 1/n + (xi − x̄)²/Sxx)   [4.15]

To summarize, equation 4.14 gives an interval (a confidence interval) around ŷ that contains µy, the true (mean) response, while equation 4.15 gives an interval (a prediction interval) that will contain a future measurement. The first interval is always smaller than the second. The following figure shows the two intervals for a data set; these intervals were actually calculated from the data set. The narrower interval is a 90% confidence interval for µy, while the wider interval is a 90% prediction interval for single measurements. Notice that, while a number of the data points lie outside the narrower interval, all of them are contained within the wider interval. This behavior is expected; the CI is only supposed to contain the true value of µy, while the PI is supposed to contain the individual measurements (with 90% probability, anyway).

Figure 4.8: Intervals associated with a regression line. The narrower interval is the 90% confidence interval for µy, while the wider interval is a 90% prediction interval for future measurements.

Example 4.4
Imagine that the data given in example 4.1 are for the calibration curve of an analytical technique, with the x values being the concentration of an analyte in ppm and the y values being the instrument response. Now imagine that another chemical sample, with a true analyte concentration of 1.20 ppm, is analyzed. Using the calibration curve data:
(a) Calculate a 95% confidence interval for the (true) mean instrument response to this sample.
(b) Calculate a prediction interval within which a single measurement of the sample will fall with 95% probability.

First the preliminaries, all of which have been calculated in the previous examples:

Sxy = 6.4430, Sxx = 2.5000, b1 = 2.5772, b0 = 15.0234, sres = 0.8889

(sres, the standard deviation of the residuals, is our estimate of the homogeneous measurement noise.)

Now we need to understand the distinction between the intervals asked for in parts (a) and (b). If we obtain measurements on the sample, which contains 1.20 ppm analyte, these measurements will be distributed (presumably according to a normal distribution) around a measurement mean, µy = β1x + β0. In part (a), we are asked to provide a confidence interval for µy when x = 1.20 ppm. Equation 4.14 gives us the desired confidence interval. At x = 1.2 ppm,

ŷ = b1·(1.2) + b0 = 18.1160

s(ŷ) = sres·√(1/5 + (1.2 − x̄)²/Sxx) = 0.4318

This is the standard error of the value predicted by the regression line at x = 1.2 ppm. We can use this value to calculate the desired confidence interval. At the 95% confidence level with 3 degrees of freedom, t = 3.1820, so the half-width of the interval is t·s(ŷ) = 1.3741.

Thus, the confidence interval for the mean response is 18.1 ± 1.4 units [95%].
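As a cross-check on part (a), eqn. 4.14 can be evaluated directly in a few lines of Python; the calibration quantities below are those computed in the earlier examples, and the variable names are arbitrary.

```python
import numpy as np
from scipy import stats

n, x_bar, s_xx = 5, 1.5, 2.5                 # calibration summary values (examples 4.1-4.2)
b1, b0, s_res = 2.5772, 15.0234, 0.8889

x_i = 1.2                                    # concentration of interest (ppm)
y_hat = b1 * x_i + b0                        # predicted mean response
se_y_hat = s_res * np.sqrt(1.0 / n + (x_i - x_bar) ** 2 / s_xx)
half_width = stats.t.ppf(0.975, df=n - 2) * se_y_hat      # eqn. 4.14
print(f"{y_hat:.1f} +/- {half_width:.1f}")   # 18.1 +/- 1.4
```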
In part (b) we are asked to provide a range of values that will contain a single measurement of the sample with 95% probability. Equation 4.15 can be used to calculate this interval. The standard deviation of a future residual at x = 1.2 ppm is given by

s(e) = √(sres² + s(ŷ)²) = 0.9883

Note that this gives the same result as

sres·√(1 + 1/5 + (1.2 − x̄)²/Sxx) = 0.9883

so the half-width of the prediction interval is t·s(e) = 3.1446.

Thus, the 95% prediction interval for a future measurement is 18.1 ± 3.1 units: if we measure a sample that contains an analyte concentration of 1.2 ppm, there is a 95% probability that the measurement value will fall within 3.1 units of 18.1 units, the response predicted by the regression equation.

"Inverse" Regression: Prediction of Independent Variable

The previous section described two intervals that are used to predict the properties of the dependent variable, y, at any value of x: the confidence interval is used to estimate µy, and the prediction interval is used to predict future values of y. In analytical chemistry, however, the most common use of the regression equation is to predict the value of the independent variable, x, given a measurement value. For first-order linear regression, the analyte concentration xu is estimated from the instrumental response yu on a sample as follows:

x̂u = (yu − b0)/b1

where xu is the true analyte concentration in the sample, and x̂u is our best estimate (a point estimate) of this concentration. This use of regression analysis has been termed inverse regression.

It is important to realize that x̂u is a random variable, even though xu is not. The reason that the concentration estimate contains error is that it is calculated from terms that all contain some random error. Since the predicted value x̂u is a random variable, we should have a way to construct a confidence interval that contains the true analyte concentration. Before trying to find an expression that can be used to calculate such a confidence interval, it is worthwhile to understand the factors that contribute to the variability of the predicted concentration, x̂u. For a given sample, the two sources of random error are the measurement noise in the dependent variable, and calibration noise due to the random error in the regression estimates.

Figure 4.9: (a) Random error in the measurement (the yu variable) results in error in the estimated value x̂u, even if there is no error in the regression line; (b) even if there were no error in the measured value (yu), random error in the calibration line would cause random error to be present in the estimated concentration (x̂u).

In part (a) of the figure, we imagine a situation in which the values of β1 and β0 are known exactly; however, the measurement noise present in the dependent variable yu will still cause random error to be present in the value x̂u. Part (b) shows the situation when µ(yu) (the mean of yu) is known, but the regression parameters are not. In this situation, the random nature of the statistics b1 and b0 will still cause random error in x̂u.

If we knew the true concentration, xu, of the analyte in a sample, we would be able to construct a prediction interval for the possible values of measurements that would be obtained when the sample is analyzed (from eqn. 4.15).
Corresponding to this range of possible measurement values, yu, is a range of x̂u values that would be calculated from the measurements, as shown in the following figure:

Figure 4.10: Construction of the prediction interval for x̂u from the calibration curve. The range of predicted responses at the true concentration corresponds to a range of calculated concentrations.

Based on this type of reasoning, we can write the following for the true standard error of the estimated concentration (assuming homogeneous noise σ):

σ(x̂u) ≈ (σ/b1)·√(1 + 1/n + (xu − x̄)²/Sxx)   [4.16]

For reasons we will not go into, this equation is only a good approximation for the standard error. There are three terms under the square root. The first term is due to the measurement noise. The other two terms are due to the uncertainty in the least-squares estimates of the slope and intercept, the so-called calibration noise. As can be seen, the calibration error contribution is at a minimum when xu = x̄.

Now, this last equation contains the true value xu of the independent variable (i.e., the analyte concentration). Normally, of course, the values of xu and σ are not known; in such cases, the following expression must be used:

s(x̂u) ≈ (sres/b1)·√(1 + 1/n + (x̂u − x̄)²/Sxx)   [4.17]

The number of degrees of freedom for this standard error is n − 2 (this equation is specifically for first-order linear regression).

Example 4.5
Assume again that the data given in example 4.1 are for a calibration curve, with the x values being the concentration in ppm and the y values being the instrument response. A sample is analyzed, giving a response of 18.41; report the estimated analyte concentration in the form of a confidence interval.

From previous examples, we know that

Sxx = 2.5000, sres = 0.8889, b1 = 2.5772, b0 = 15.0234

The measured response of the "unknown" is yu = 18.41, so the point estimate of the analyte concentration in the sample is

x̂u = (yu − b0)/b1 = 1.3141 ppm

The standard error of the predicted analyte concentration (eqn. 4.17) is

s(x̂u) = (sres/b1)·√(1 + 1/5 + (x̂u − x̄)²/Sxx) = 0.3800

The half-width of the 95% confidence interval (t = 3.1820) is t·s(x̂u) = 1.2092. Thus, the analyte concentration is 1.3 ± 1.2 ppm [95%].
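Example 4.5 can be reproduced in the same style; the sketch below is just eqn. 4.17 written out in Python, using the calibration summary values from the earlier examples (variable names are arbitrary).

```python
import numpy as np
from scipy import stats

n, x_bar, s_xx = 5, 1.5, 2.5                 # calibration summary values (examples 4.1-4.2)
b1, b0, s_res = 2.5772, 15.0234, 0.8889

y_u = 18.41                                  # measured response of the unknown sample
x_hat = (y_u - b0) / b1                      # point estimate of the analyte concentration
se_x = (s_res / b1) * np.sqrt(1.0 + 1.0 / n + (x_hat - x_bar) ** 2 / s_xx)   # eqn. 4.17
half_width = stats.t.ppf(0.975, df=n - 2) * se_x
print(f"{x_hat:.2f} +/- {half_width:.2f} ppm (95% CI)")    # about 1.31 +/- 1.21 ppm
```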
Chapter Checkpoint

The following terms/concepts were introduced in this chapter:

calibration
coefficient of determination (R²)
dependent variables
homogeneous variance
independent variables
inverse least-squares
least-squares estimates
regression model parameters
regression model
regression analysis
residuals
weighted least-squares

In addition to being able to understand and use these terms, after mastering this chapter, you should

• be able to calculate least-squares estimates of regression model parameters for a first-order linear model
• be able to estimate the measurement noise σ from the residuals in data used for regression
• be able to determine the coefficient of determination, R², for regression data
• be able to calculate the standard error of the least-squares estimates of the first-order linear model parameters, and use these values to construct confidence intervals
• understand the major assumptions made in deriving the equations used in linear least-squares regression
• be able to construct a confidence interval for µ(yi), the population mean of the dependent variable at any value xi of the independent variable
• be able to construct a prediction interval for future measurements obtained at any value xi of the independent variable
• be able to calculate the standard error in x̂u, our best estimate of the analyte concentration in a sample
• be able to construct a confidence interval for xu, the true value of the independent variable, from the measured value yu of the dependent variable.