Chapter Four
Linear Least-Squares Regression
Variables and Residual Error
    Introduction to Residuals
    Residuals and Sample Statistics
Introduction to Regression Analysis
    Regression Models
    Procedure and Purpose
Least-Squares Linear Regression
    Obtaining Estimates of Regression Parameters
    Advanced Topic: Partitioning Variability in Regression Analysis
    Calculation of Least-Squares Estimates
    Estimating Measurement Noise: Homogeneous Variance
    Properties of Least-Squares Estimates
    Least-Squares Regression with Excel
    Assumptions and Common Violations
Prediction using Linear Regression
    Prediction of Dependent Variable
    "Inverse" Regression: Prediction of Independent Variable
Chapter Checkpoint
Variables and Residual Error
Introduction to Residuals
In studying linear regression, we move into a field that is in many ways both familiar and new.
Regression analysis is the study of the relationship between two different classes of variables:
dependent variables and independent variables. Before studying the relationship between
variables, it will be very helpful to introduce a new way of looking at random variables.
Let y be a random variable with mean µy and standard deviation σy. Any single observation (i.e.,
measurement) of y can be written as
yi = µy + εi   [4.1]
where yi is the observation and εi is the “true error” in the observation, the difference between yi
and µy. We will call this error the residual error, or simply a residual. Now, rather than thinking
of y as the variable, we can think of the residual as a random variable ε, with a mean of zero and
a standard deviation of σy.
Of course, the value of µy is not always known. If we obtain a series of n measurements of y, we
can calculate the sample mean of the values, and represent each of the measurements as
yi = ȳ + ei   [4.2]
where ei is the observed residual value in the ith measurement. This residual, ei, will not be the
same as the true residual value, εi, because the sample mean is not likely to be exactly equal to
the true mean.
Let’s imagine that we obtain 5 measurements on a variable y with µy = 10 and σy = 1. The
following figure shows these measurements (obtained from a random number generator), and
shows how each measurement of y can be broken down into the mean and the residual values.
Figure 4.1: Difference between the true and observed residuals for a measurement. The
dotted line is the true mean, µy, of the measurements, while the solid line is the sample
mean, ȳ, of the five measurements. Each measurement value yi can be described as either
µy + εi or ȳ + ei.
Note that, if we wanted, instead of plotting the actual values of the measurements, we could plot
the observed residuals for each observation. A comparison of plots of the original observations
and the observed residuals is shown in the next figure.
Figure 4.2: Comparison of (a) the original measurements and (b) the observed residuals.
As we can see, the two plots look identical, except for the shift in mean (the residuals are
“mean-centered,” so that the mean of the residuals is zero). In a way, we can now think of the
residuals as our random variable, rather than y.
Thus, we see that this new way of looking at a random variable y is not really so different from
our previous approach. Instead of thinking of y as a simple random variable, we can break it up
into two parts: a constant value µy and a variable part ε. The variable ε has the same probability
distribution as the original variable y, except that the mean of ε is zero. Thus, if y is normally
distributed with a mean of µy and a standard deviation of σy, then ε is also normally distributed
with a mean of 0 and a standard deviation of σy.
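To make this concrete, here is a minimal simulation sketch (assuming NumPy is available; the sample size and seed are arbitrary choices for illustration) showing that the residual variable ε = y − µy has a mean near zero and the same standard deviation as y.

import numpy as np

rng = np.random.default_rng(seed=0)
mu_y, sigma_y = 10.0, 1.0             # population parameters, as in Figure 4.1
y = rng.normal(mu_y, sigma_y, size=100_000)

eps = y - mu_y                        # "true" residuals
print(eps.mean())                     # approximately 0
print(eps.std(ddof=1))                # approximately sigma_y = 1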
Residuals and Sample Statistics
At this point, let's consider the relationship of the residuals to two sample statistics: the
sample mean, ȳ, and the sample variance, sy². We have defined the observed residual of a
measurement as the difference between the measurement and the sample mean of all the
collected measurements: ei = yi − ȳ. For the moment, let's consider a more general definition of
the observed residuals:

ei = yi − k

where k is any fixed value, not necessarily the sample mean. It turns out that when k is equal to
the sample mean, the residuals have an important property: the sum of the squared residuals is at
its minimum possible value. In other words, setting k = ȳ minimizes the sum of the squared
residuals, Σei². For this reason, the mean is sometimes referred to as the least-squares estimate
of µy.
In addition, when k = ȳ, the observed residuals must sum to zero:

Σei = 0
The sample variance (and the sample standard deviation) of a group of measurements is actually
calculated using the residuals. If the value of µy is known, then we may use the actual residuals:
sy² = (1/n) Σεi²   [4.3]

You should convince yourself that this formula is identical to eqn. 2.4. Usually, however, the
value of µy is not known, so we must use the observed residuals in an equation that is derived
from eqn. 2.5:

sy² = (1/(n−1)) Σei²
A more general form of this equation is obtained by using the degrees of freedom:
sy² = (1/ν) Σei²   [4.4]

where ν is the number of degrees of freedom of the residuals.
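As a quick check of these properties, the following sketch (NumPy assumed; the five measurement values are made up for illustration) verifies that the observed residuals about the sample mean sum to zero, that the sample mean minimizes the sum of squared residuals over candidate values of k, and that eqn. 4.4 with ν = n − 1 reproduces the usual sample variance.

import numpy as np

y = np.array([9.1, 10.4, 11.2, 9.8, 10.6])     # hypothetical measurements
ybar = y.mean()
e = y - ybar                                   # observed residuals

print(np.isclose(e.sum(), 0.0))                # True: residuals sum to zero

# The sample mean minimizes sum((y - k)^2) over all fixed values k:
k = np.linspace(8.0, 12.0, 401)
ss = ((y[:, None] - k[None, :]) ** 2).sum(axis=0)
print(k[ss.argmin()], ybar)                    # both about 10.22

# Sample variance from the residuals (eqn 4.4 with nu = n - 1):
s2 = (e ** 2).sum() / (y.size - 1)
print(np.isclose(s2, np.var(y, ddof=1)))       # True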
Introduction to Regression Analysis
Figures 4.1 and 4.2 are concerned with five measurements of a variable y. Each of these
measurements is taken from a single population, as described by the population mean µy and
standard deviation σy. Now, however, let’s change one important aspect of this experiment: let’s
consider what would happen if the population mean µy changes for each measurement. For
example, let’s say that the value of µy for a particular measurement will depend on the value of
some other property, x, of our system:
µy = f(x)
Thus, if the value of x is different for each measurement, then the measurements are no longer
taken from the same population, but from different populations, each with a different population
mean µy. If we consider that, as before,
y = µy + ε
then we see that the measurements will exhibit two sources of variability:
1. Variability due to the change in µy with each measurement, and
2. Variability due to the random nature of ε.
This situation is exactly the one that is examined by regression analysis. A number of
observations of a random variable y are obtained; for each observation, the value of some
property x is also known. It is the relationship between y and x that is studied in regression
analysis.
In quantitative chemical analysis, we are very often interested in the relationship between the
analyte concentration and the measured response of an analytical instrument. The signal will
depend in some way on the analyte concentration (hopefully!); however, any measured value will
also contain random error. Thus, in quantitative analysis, we can think of the analyte
concentration as the “x” variable, and the measurement as the “y” variable. The population mean
µy will depend on the analyte concentration, and the residual variable ε reflects the presence of
measurement noise.
In these examples, the y variables are the dependent variables, while the x variables are the
independent variables. Although x is a variable, its value for each observation is presumably
known exactly, i.e., without error. Regression assumes that all measurement error resides in the
dependent variable(s).
Aside: causal relationships and variable correlation
It is tempting to think that the value of the dependent variable y is “caused” by the value of the
variable x. In chemical analysis, it certainly seems reasonable that an increase in analyte
concentration will cause an increase in signal (in fact, quantitative analysis is based on this very
principle). Such relationships are called causal relationships.
Regression analysis, however, can never prove a causal relationship; it can only demonstrate
some kind of correlation between the values of variables. A correlation between variables simply means
that they “change together”; for example, an increase in one variable may be accompanied by an
increase in the other. However, both variables may be “caused” by some other factor entirely.
The scientific literature is filled with examples in which correlation is mistaken for a causal
relationship.
Regression Models
If we suppose that the population mean, µy, depends on some variable x, then we can postulate a
relationship between the two. The postulated relationship is the regression model that we will
use in our analysis. The simplest regression model assumes that the value of µy will vary linearly
with x:
linear model:   µy = β1x + β0   [4.5]
where β1 is the slope of the line, and β0 is a constant offset. The values β1 and β0 are the
regression model parameters. We will have a lot to say about these parameters.
A first-order linear model (such as in eqn. 4.5) is the most common regression model, but there
are many others. Any functional relationship can be used as a model. For example, you may wish
to describe your observations by assuming a polynomial dependence of y on x:
2nd-order polynomial model:   µy = β2x² + β1x + β0   [4.6]
In this case, the model parameters are β2, β1 and β0. Even though a 2nd-order polynomial is not a
line, this model is still linear in the model parameters. In other words, there is a linear
dependence of µy on the parameters; changes in the values of any of the parameters (β2, β1, or β0)
will produce linear changes in µy. Thus, both of the above models are linear regression models;
the linear relationship (eqn. 4.5) is sometimes called a first-order linear model, while the model
specified in eqn. 4.6 is a second-order linear model. Higher-order polynomials are likewise
considered to be linear regression models.
Nonlinear regression is also common. For example, we might wish to assume an exponential
dependence of µy on x:
exponential model:   µy = β0·e^(β1x)
The exponential model is nonlinear because µy depends nonlinearly on the regression parameter
β1. Linear models are by far the most common regression models; we will henceforth restrict
ourselves to first-order linear regression. The same general principles apply to all least-squares
regression methods, but the equations are different.
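To illustrate what "linear in the parameters" means in practice, here is a brief sketch (NumPy assumed; the data are simulated for illustration) that fits the second-order polynomial model of eqn. 4.6 with ordinary linear least squares, simply by building a design matrix whose columns multiply β2, β1, and β0.

import numpy as np

rng = np.random.default_rng(seed=1)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = 1.2 * x**2 + 2.5 * x + 15.0 + rng.normal(0.0, 0.3, size=x.size)   # simulated data

# Because mu_y is a linear combination of the parameters, the fit is an ordinary
# linear least-squares problem on the design matrix [x^2, x, 1]:
X = np.column_stack([x**2, x, np.ones_like(x)])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)    # estimates of beta2, beta1, beta0

# An exponential model, mu_y = beta0 * exp(beta1 * x), is nonlinear in beta1 and
# cannot be written as X @ beta; it requires a nonlinear fitting routine instead.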
We must remember that y is a random variable, just like those we have studied in previous
chapters. The only difference is that the population mean µy is now changeable between
measurements. The following figure shows the situation when µy depends linearly on x.
Figure 4.3: Illustration of regression analysis. The dependent variable, y, is a random variable
whose population mean, µy, depends on the value of the independent variable, x. The line in
the figure shows the functional dependence of µy on x; at any fixed value of x, there is a
probability distribution that governs the values of y that will be observed in an experiment.
From the figure, we can see that if we obtain measurements of the dependent variable y, we
would expect that the data points will be scattered about the line due to the random nature of the
variable.
To use an example from quantitative analysis, imagine that we are obtaining absorbance
measurements of a set of samples. According to Beer’s Law, we would expect a linear
relationship between absorbance and analyte concentration. However, if we plot the absorbance
values as a function of concentration, it is extremely unlikely that they will all fall exactly on a
line, even when measured under conditions in which Beer’s Law is true.
The reason that the points are not all along the line is that each measurement is subject to random
error. If we were to obtain repeated measurements on a single sample (i.e., the concentration is
fixed at a given value), then we would see random error due to the measurement noise. This
measurement noise is present in all measurements, even those obtained at different
concentrations.
Procedure and Purpose
Regression analysis consists of the following steps.
1. Choose a regression model.
2. Collect measurements to determine estimates of the regression parameters.
3. Calculate estimates of the model parameters. We will study the least-squares procedure for
calculating parameter estimates.
4. Check the appropriateness of the model. This step may simply involve a quick visual
inspection of plots of data or the residuals, or may involve more extensive analysis. If there is
a problem with the chosen model, then an alternate regression model is chosen and step #3 is
repeated.
5. Use the regression parameter estimates for the desired purpose.
The goal of regression analysis is usually one or more of the following:
• Prediction. The analyst desires the ability to predict the values of some of the variables from
the values of the other variables. This is usually the goal of applying regression to chemical
analysis.
• Model specification. The analyst is interested in the regression model that best explains the
observed variation in the dependent variables. The idea is to investigate system behavior.
• Independent variable screening. It is possible to specify a model in which there is more than
one independent variable. In such cases, the analyst may wish to determine which independent
variables are truly significant in terms of their effect on the dependent variables, and which can
be ignored.
• Parameter estimation. Sometimes the values of the parameters which appear in the regression
model are themselves of primary interest. For example, the slope of a Beer’s Law plot is
related to the molar extinction coefficient, and it may be this value that is of interest.
Although all of these applications are common in chemistry, in this chapter we are concerned
mostly with the first application, that of prediction of variables.
Least-Squares Linear Regression
Obtaining Estimates of Regression Parameters
During a regression experiment, the system response yi is measured for various sets of values of
the possible controlling variables xi. Thus, we end up with a set of number pairs, (xi, yi): two
“columns” of numbers, one for the independent variable whose value is known exactly, and the
other for the dependent variable, a random variable. For example, during the calibration step in
quantitative analysis we might collect the data shown in the following figure:
Figure 4.4: Data collected during the calibration step in quantitative analysis (instrument
response plotted against concentration).
The data in the figure was generated using the first-order linear model: the population mean of
each measurement is given by
µy = β1x + β0
Since the regression parameters β1 and β0 are (presumably) not known to us, we must obtain
estimates, and use them to estimate the population mean of a measurement:
predicted response:   ŷ = b1x + b0   [4.7]
where b1 and b0 are our estimates for the corresponding regression parameters β1 and β0,
respectively. Thus, ŷ represents our best guess of the population mean, µy, for a given value of x;
it serves exactly the same function in regression that the sample mean ȳ served in previous
chapters, except that now µy is a function of x.
The most common method for obtaining the estimates b1 and b0 is the least-squares procedure,
which will be explained in more detail shortly. The following figure shows the data collected, the
value ŷ predicted using the least-squares estimates, and the true values of µy.
Figure 4.5: Comparison of the least-squares estimate ("best fit") and the actual response. The
solid line is the function ŷ = b1x + b0, and the dotted line is the function µy = β1x + β0.
Any observation yi in figure 4.5 can be described in terms of the true residual error, εi, or the
observed residual error, ei:

yi = µy + εi = β1xi + β0 + εi   [4.8]

yi = ŷ + ei = b1xi + b0 + ei   [4.9]
You should compare these equations to eqns 4.1 and 4.2. The values of the observed residuals, ei,
are the differences between the data points and the solid line in figure 4.5.
Advanced Topic: Partitioning Variability in Regression Analysis
Although the material in this section is not completely necessary in order to perform
least-squares linear regression, an understanding of this material will help you to “appreciate”
regression analysis a little better.
As mentioned previously, there are two types of variation in the calibration data:
1. variability of the measurement y about µy, due to the random nature of y; and
2. variability that is explained by the change in µy with x.
Since these two types of variability are independent of one another, we may write
σ²tot = σ²reg + σ²y

where σ²tot is the overall variance of the measurements, and σ²reg is the portion of the variance
that is explained by the regression model. A convenient method of specifying the "fit" of the
regression is by using the coefficient of determination, R2:
R2 = σ²reg / σ²tot = 1 − σ²y / σ²tot
Note that R2 is simply the square of the correlation coefficient, which was introduced in chapter
2.
Now, R2 is a value between 0 and 1; it represents the fraction of the total variance that is
explained by the regression model. Thus, if R2 = 0, then the variance in the measurements is
entirely due to the random nature of y; the mean µy does not change at all between measurements.
On the other hand, if R2 = 1, then the regression model explains all of the observed variance; for
a first-order linear model, all the data points would lie exactly on a line.
In practice, R2 can be used as a crude way to compare two different regression models: a model
that results in a value of R2 closer to one has smaller residuals (a “better fit”) than a model that
gives a smaller R2 value. However, caution must be used when using R2 to compare different
calibration models. For example, a second-order linear model (a “polynomial” fit) will always
give a larger R2 value than a first-order linear model. This improved fit, however, doesn’t mean
that the second-order model will result in better predictions of dependent or independent
variables.
It is generally better to choose models with as few parameters as possible; in quantitative
chemical analysis, it is best to begin with the simplest possible model (usually a first-order linear
model) and only go to more complicated models when the simple model is obviously inadequate.
Calculation of Least-Squares Estimates
We will now discuss the philosophy of least-squares estimation, and give the equations necessary
to calculate the least-squares estimates of the regression model. To illustrate the calculation of
least-squares estimates, we will use the data previously presented in figure 4.4, which was
generated using a first-order linear model with β1 = 2.5 and β0 = 15:
µy = 2.5x + 15
where x is analyte concentration and y is the instrument response. This equation determines the
population mean of each measurement; in addition, for all the measurements, σy = 1. A random
number generator was used to obtain the data points according to these population parameters.
The following figure shows the least-squares fit to the line, as well as the observed residuals.
Figure 4.6: (a) Least-squares best-fit line with n = 5 observations (y = 2.577x + 15.024,
R2 = 0.8751). Associated with each measurement is an observed residual error, which is the
distance of the data point from the best-fit line; (b) plot of the residuals against concentration.
The sum of the residuals is zero. The least-squares fit minimizes the sum of the squared
residuals.
The least-squares estimates are shown in part (a) of the figure: b1 = 2.577 and b0 = 15.024. These
estimates are sample statistics, estimators of the corresponding regression model parameters (just
as the sample mean and sample standard deviation are estimators of the corresponding
population parameters). Thus, we have
least-squares prediction
ŷ = 2.577x + 15.024
The plot of the residuals is very interesting; if you compare to figure 4.2(b), you will find the
plots to be very similar. (Actually, they are identical, since the same base data was used to
generate both sets of measurements). Recall that when we discussed the sample mean of a set of
measurements, all with constant µy, two important properties of the sample mean were
mentioned:
1. The sum of the residuals is zero: Σei = 0
2. The sum of the squared residuals, Σei², is the minimum possible value.
The least-squares estimate ŷ possesses these same properties! The least-squares estimation
procedure is such that the values of b1 and b0 minimize the sum of the squared residuals. In fact,
this very property is what gives the name “least-squares” to these estimates.
The derivation of the equations of the least-squares estimates is beyond the scope of this chapter;
what is more important is to understand the philosophy behind the equations (i.e., that they
minimize the sum of the squared residuals). For first-order linear regression, the least-squares
estimates are calculated according to the following formulas:
b1 = Sxy / Sxx        b0 = ȳ − b1x̄   [4.10]

where Sxy ≡ Σ(xi − x̄)(yi − ȳ) and Sxx ≡ Σ(xi − x̄)², and ȳ and x̄ are the sample means of the y
and x observations, respectively.
Aside: using calculators for first-order linear regression
Most scientific calculators can provide least-squares regression parameters for a first-order linear
model. If not, then the following formulas are a little easier to use than the previous expressions.
Sxy = Σxiyi − n·x̄·ȳ   [4.11(a)]

Sxx = (n − 1)·sx²   [4.11(b)]

where sx² is the sample variance of the x-values of the regression data.
The data in the next example is the same as was used in the last figure; see if you can obtain the
least-squares estimates with your calculator (it’s not very much fun by hand!).
Example 4.1
The following is the set of data used to construct figure 4.6. Consider the independent
variable to be analyte concentration (in ppm) and the dependent variable to be the instrument
response (in arbitrary units):
x (conc in ppm)    y (instrument response)
0.5                15.658
1.0                17.773
1.5                20.155
2.0                19.745
2.5                21.115
Assume a first-order linear model, and calculate the least-squares estimates of the slope (b1)
and intercept (b0) of the line (you should obtain the same values as shown in the figure).
x = (0.5  1.0  1.5  2.0  2.5)
y = (15.658  17.773  20.155  19.745  21.115)

Sxy = Σxiyi − 5·x̄·ȳ = 6.4430
Sxx = 4·sx² = 2.5000

b1 = Sxy / Sxx = 2.5772
b0 = ȳ − b1·x̄ = 15.0234

These are the least-squares estimates.
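The same calculation is easy to script; the following sketch (NumPy assumed) applies eqns. 4.10 and 4.11 to the Example 4.1 data.

import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([15.658, 17.773, 20.155, 19.745, 21.115])
n = x.size

Sxy = np.sum(x * y) - n * x.mean() * y.mean()    # eqn 4.11(a): 6.4430
Sxx = (n - 1) * np.var(x, ddof=1)                # eqn 4.11(b): 2.5000

b1 = Sxy / Sxx                                   # slope estimate: 2.5772
b0 = y.mean() - b1 * x.mean()                    # intercept estimate: 15.0234
print(b1, b0)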
When we use the least-squares procedure, we see from figure 4.6 that the coefficient of
determination, R2, is 0.8751. This means that 87.51% of the variance observed in the dependent
variable is accounted for by the regression model; the remainder (the “scatter” about the fitted
line) is presumably due to measurement noise.
Estimating Measurement Noise: Homogeneous Variance
In regression analysis, the value of µy can change between observations. Before proceeding, we
must make another important assumption. Although it may seem that we have made this
assumption all along, we must now state it explicitly: we assume that the magnitude of the
measurement noise does not change between observations. Using statistical terminology, we say
that the dependent variable y exhibits homogeneous variance, σ2, for all the values of x in the
data set. We will use the symbol σ (rather than σy) to emphasize that we assume a common
standard deviation for all measurements.
As we will see, it is important to be able to estimate this common measurement noise. One way
to obtain an estimate would be to obtain repeated measurements at a fixed value of x; since the
noise remains constant for all values of x, this estimate would be valid for all the measurements
in the calibration data. For example, in instrumental analysis, you might want to take a number of
measurements on one of the calibration standards. The sample standard deviation of the
measurements on this standard is an estimate of the measurement noise.
However, there is actually a way to estimate the measurement noise without taking repeated
measurements. If our postulated regression model is accurate, then we can obtain an unbiased
estimate of the homogeneous variance from the residuals, using eqn. 4.4:
homogeneous variance for linear model:   s² = (1/ν) Σei²
where the degrees of freedom, ν, of our estimate is equal to n – p, with n being the number of
data points and p the number of regression parameters. For first-order linear regression, then,
there are n – 2 degrees of freedom.
Thus, to estimate the homogeneous variance, we must first obtain the least-squares regression
estimates, then calculate the observed residuals:
ei = yi − b1xi − b0
and then use eqn. 4.4. The following example illustrates how this is done.
Example 4.2
Estimate the measurement noise in the data from the previous example; assume that the
regression model is correct and that the random variance is homogeneous.
Recall that b1 = 2.5772 and b0 = 15.0234. First, calculate the y-residuals of the data points,
e = y − (b1·x + b0):

e = (−0.654  0.172  1.266  −0.433  −0.351)

The sum of the squared residuals is easily calculated:

SSres = Σei² = 2.3705   (remember that the least-squares estimates minimize this value)

sres = √(SSres / 3) = 0.8889

This is the standard deviation of the residuals (n − 2 = 3 degrees of freedom), which is an
estimate of the homogeneous measurement noise.
Thus, we estimate that the common measurement noise on all the measurements is 0.889. This
value is probably most properly referred to as the standard deviation of the residuals, since it
only estimates the homogeneous noise when certain assumptions are valid, as we will see. In this
case, since the data was actually generated by a random generator with σ = 1, the estimate seems
reasonable.
One final note: the regression routines in most spreadsheet programs (such as Excel or Quattro
Pro) will provide the standard deviation of the residuals – although they won’t call it by that
name.
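Continuing the scripted sketch from Example 4.1 (NumPy assumed), the residual standard deviation follows directly from eqn. 4.4 with ν = n − 2:

import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([15.658, 17.773, 20.155, 19.745, 21.115])
b1, b0 = 2.5772, 15.0234                  # least-squares estimates from Example 4.1

e = y - (b1 * x + b0)                     # observed residuals
ss_res = np.sum(e ** 2)                   # about 2.3705
s_res = np.sqrt(ss_res / (x.size - 2))    # about 0.8889, the residual std deviation
print(e, ss_res, s_res)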
Properties of Least-Squares Estimates
Our estimates of the regression parameters, b1 and b0, are sample statistics; as such, there will be
some error in the estimates, due to the random error present in the data used to calculate them. In
other words, the estimates are variables: if we repeated the entire calibration procedure, we
would almost certainly obtain different values for b1 and b0, even though the regression
parameters β1 and β0 are the same. We are interested in characterizing the variability of the
least-squares estimates of the regression parameters.
The variance of the least-squares estimates can be derived using propagation of error. If we
assume that the measurement variance is homogeneous, then the variance of the estimates is
given by
σ²(b1) = σ² / Sxx        σ²(b0) = σ²·(1/n + x̄²/Sxx)   [4.12]

where σ² is the magnitude of the homogeneous variance, n is the number of measurements, and
Sxx is given by eqn. 4.11(b). The standard errors (i.e., the standard deviations) of these estimates
are the square roots of the variances calculated using these equations.
If the true value of the measurement variance is not known, then we can only estimate the
standard error of the regression parameter estimates. For the least-squares estimates, we may use
the following expressions.
s(b1) = sres / √Sxx        s(b0) = sres·√(1/n + x̄²/Sxx)   [4.13]
where sres is the estimate of the homogeneous noise, σ, obtained from the residuals. The number
of degrees of freedom in these estimates of the standard error is the same as in sres: ν = n − p.
Example 4.3
Calculate confidence intervals for the slope, β1, and intercept, β0, for the regression data in
example 4.1 using the least-squares procedure, and assuming a first-order linear model and
homogeneous variance.
Remember that Sxx = 2.5000 and sres = 0.8889.

s(b1) = sres / √Sxx = 0.5622        (standard error of the least-squares estimate of the slope)

s(b0) = sres·√(1/5 + x̄²/Sxx) = 0.9323        (standard error of the least-squares estimate of the intercept)
Important point: if the measurements are normally distributed, then the least-squares estimates
also follow normal probability distributions. This allows us to construct confidence intervals for
these estimates. To illustrate, let’s construct 95% CI’s for the slope and intercept calculated in
example 4.1. The standard errors of these estimates have been calculated in the last example; all
we need is to find the proper tν value. Since we have 3 degrees of freedom, we must use t3,0.025,
which is 3.182. Thus, we see that
b1 = 2.6 ± 1.8 [95%, n = 5]
b0 = 15.0 ± 3.0 [95%, n = 5]
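These standard errors and confidence intervals are also easy to reproduce in a short script; the sketch below (NumPy and SciPy assumed) applies eqn. 4.13 and the t3,0.025 critical value.

import numpy as np
from scipy import stats

b1, b0 = 2.5772, 15.0234
s_res, Sxx, n, xbar = 0.8889, 2.5, 5, 1.5

s_b1 = s_res / np.sqrt(Sxx)                          # about 0.5622
s_b0 = s_res * np.sqrt(1.0 / n + xbar**2 / Sxx)      # about 0.9323

t = stats.t.ppf(0.975, df=n - 2)                     # 3.182 for 3 degrees of freedom
print(f"b1 = {b1:.1f} +/- {t * s_b1:.1f}  [95%]")    # 2.6 +/- 1.8
print(f"b0 = {b0:.1f} +/- {t * s_b0:.1f}  [95%]")    # 15.0 +/- 3.0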
Least-Squares Regression with Excel
Linear regression in Excel is performed in the following manner. Choose “Tools”, then
“Data Analysis”, then “Regression” [note that the “Analysis ToolPak” add-in must be activated
to do this]. Next, choose the independent variable (e.g., analyte concentration) and the dependent
variable (e.g. instrument response). Choose the output region, if you want. Note that you may
choose more than one column for your independent variable; in this manner, you can obtain
least-squares estimates for higher-order (“polynomial”) linear regression models.
I used Excel97 for Windows95 to analyze the data in Example 4.1. The output is shown in the
following table, along with a brief explanation of each line.
SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.93546
  R Square             0.87509
  Adjusted R Square    0.83345
  Standard Error       0.88879
  Observations         5

ANOVA
                 df        SS         MS          F      Significance F
  Regression      1     16.6024    16.6024    21.0170        0.0195
  Residual        3      2.3699     0.7900
  Total           4     18.9723

                 Coefficients   Standard Error    t Stat    P-value   Lower 95%   Upper 95%
  Intercept         15.0238          0.9322      16.1169     0.0005    12.0572     17.9903
  X Variable 1       2.5770          0.5621       4.5844     0.0195     0.7881      4.3659

In this output, "R Square" is the coefficient of determination, "Standard Error" (under Regression
Statistics) is the standard deviation of the residuals, and the "Residual" row of the ANOVA table
contains the degrees of freedom of the residuals. In the last table, the "Coefficients" column
contains the least-squares estimates of β0 ("Intercept") and β1 ("X Variable 1"), the "Standard
Error" column contains the standard deviations of these estimates, and the "Lower 95%" and
"Upper 95%" columns give the 95% confidence intervals for β0 and β1.
You should compare the values in the table to the values we have calculated in examples
4.1−4.3, to familiarize yourself with Excel’s somewhat terse terminology. If you use another
spreadsheet, you should try to interpret the output in terms of the material we cover in this
chapter.
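If you work in Python rather than a spreadsheet, a comparable summary table can be produced with the statsmodels package (an assumption here: statsmodels and NumPy must be installed); it reports R2, an ANOVA-style F test, the coefficient estimates, their standard errors, and 95% confidence intervals, much like the Excel output above.

import numpy as np
import statsmodels.api as sm

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([15.658, 17.773, 20.155, 19.745, 21.115])

X = sm.add_constant(x)            # adds a column of ones for the intercept
fit = sm.OLS(y, X).fit()          # ordinary least squares
print(fit.summary())              # summary table analogous to Excel's output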
Assumptions and Common Violations
Let’s summarize the skills we have obtained up to this point:
• we have learned how to obtain the least-squares estimates for a first-order linear regression
model;
• we have learned how to calculate the standard error of the least-squares regression estimates
• we have learned how to construct confidence intervals for the least-squares estimates.
At one level, of course, these abilities are not difficult to obtain: just learn how to use the
appropriate equations. Of course, there is much more to linear regression than just using the
equations; you must understand regression analysis in order to appreciate what the formulas are
telling you. An important part of this understanding is knowing the assumptions that are typically
made in regression analysis. Violation of these assumptions is by no means a rare occurrence, so
it is important to understand the nature of the assumptions in order to appreciate the limitations
of regression analysis.
The common assumptions, roughly in the order they were made in this chapter, are:
1. No error in x values. Throughout this chapter, we have assumed that there is no random
error in the independent variable(s), and that all such random error is in the dependent
variable. Violation of this assumption is common. In quantitative analysis, it is rarely true
that the concentrations of the calibration standards are known without error; in fact, error in
the standard concentrations can be the main contribution to error in the estimated analyte
concentration!
In quantitative analysis, violation of this assumption is actually not very serious. It has been
proven that random error in the x variable can be described by inflated random error of the y
value. In other words, random error of x serves to increase the standard deviation of the
residuals, and we can think of this error as being due to “measurement noise” (i.e., noise in
y). What this means is that the scatter about the fitted regression line is greater than would be
predicted by obtaining repeated measurements on any one standard.
Practically speaking, the consequences of violation of this assumption can be minimized or
even eliminated if you
• take the same number of measurements on all the standards and all the samples
(unknowns). If you take more than one measurement on each of these, use the averages when
performing regression;
• estimate the measurement noise from the standard deviation of the residuals.
“Concentration error” inflates the scatter about the fitted line, and should be included in our
estimate of the measurement noise. The “concentration error” will not be included in any
estimate of measurement noise that is calculated from repeated measurements on a single
standard.
Naturally, these precautions do not correct for the presence of bias in the x values; this type
of error will result in biased estimates of the analyte concentration (as would measurement
bias in the y values, of course).
2. Homogeneous variance. The assumption of homogeneous error allows easy (or at least,
relatively easy!) calculation of measurement noise and the standard error of the regression
parameter estimates. In quantitative analysis, this assumption is suspect whenever the
calibration standards cover a wide range of concentrations. In such cases, the measurement
noise σy may well depend on the value of x; generally, as the value of y (and x) increases, the
noise increases as well. This is an example of inhomogeneous error.
Least-squares estimates can still be used in the presence of inhomogeneous error; however,
the least-squares estimates are no longer as “attractive” as they are in the case of
homogeneous error. In statistical parlance, the least-squares estimates are no longer the “best
linear unbiased estimates” (don’t worry too much about what this phrase means). More
seriously, the standard error of the regression estimates can no longer be calculated using
formulas such as those in eqn. 4.12. The practical consequence of this violation is that the
confidence levels are not correct: we may think that we have calculated 95% confidence intervals,
for example, but the actual confidence level will be somewhat different from 95%.
If the measurement noise of every data point is known, then the weighted least-squares
procedure can be used to obtain regression estimates. Weighted least-squares is a
modification of the least-squares procedure in which each data point is given a "weight" that is
inversely proportional to its population variance (a small sketch of this idea follows this list). The
standard error of these estimates can be calculated.
3. Normal distribution of measurements. If the measurements are normally distributed, then
so too are the least-squares estimates, which allows us to construct confidence intervals from
the standard errors of these estimates. If this assumption is violated, then the confidence
intervals cannot be calculated using either z or t tables; fortunately, this assumption is often a
reasonable one.
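The following is a minimal sketch of the weighted least-squares idea mentioned under assumption 2 (NumPy assumed; the per-point noise levels are hypothetical). Each point receives a weight wi = 1/σi², and the estimates come from the weighted normal equations b = (XᵀWX)⁻¹XᵀWy.

import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([15.658, 17.773, 20.155, 19.745, 21.115])
sigma = np.array([0.3, 0.4, 0.6, 0.8, 1.0])       # hypothetical per-point noise levels

X = np.column_stack([x, np.ones_like(x)])         # columns for slope and intercept
W = np.diag(1.0 / sigma**2)                       # weights inversely proportional to variance

b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # weighted least-squares estimates
print(b)                                          # [slope, intercept]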
Prediction using Linear Regression
Prediction of Dependent Variable
Let’s assume that a first-order linear regression model with homogeneous noise applies to the
measurements obtained during a regression experiment:
µ(yi) = β1xi + β0
where µ(yi) is the mean of the measurements when x = xi. Of course, the values of the parameters,
β1 and β0, are not known, and we must calculate regression estimates b1 and b0 from the
calibration data. Using these values, the best estimate of the population mean µ(yi) when x = xi is
given by:
ŷi = b1xi + b0

where ŷi is our estimate of µ(yi), exactly like the sample mean. And, just like the sample mean,
this predicted value ŷi is a variable, due to the error in the regression estimates. In the last
section, we calculated the standard error in our estimates b1 and b0; what about the standard error
in the value ŷi predicted at a fixed value xi?
If we assume homogeneous variance, then the variability in the dependent value at x = xi is given
by the following equation:
variance of predicted value:   σ²(ŷi) = σ²·(1/n + (xi − x̄)²/Sxx)
where n is the number of measurements used to determine b1 and b0, xi is the value of x at which
we want to find the variance of ŷ, x̄ is the mean of the x variable for the calibration data, and
Sxx can be calculated using eqn. 4.11(b).
The square root of this variance is the standard error of our predicted response. In
other words, this is the standard error of the fitted regression line. Note that the magnitude of the
standard error depends on how far xi is from the mean of the values used to obtain the regression
estimates b1 and b0. There is greater random error near the “ends” of the fitted line than in the
“middle.”
If we know the standard error of the predicted value ŷi at some value xi, then it is a simple matter
to construct a confidence interval (assuming a normal pdf); this is an interval estimate of the true
response µ(yi) at xi.
CI for µ(yi), σ unknown:   ŷ ± tν,α/2 · s·√(1/n + (xi − x̄)²/Sxx)   [4.14]
At the edges of the calibration curve, this confidence interval will “flare,” because the standard
error of our point estimate increases.
Figure 4.7: A plot of the confidence interval for µy as a function of x. Note that the CI
becomes wider at the edges of the calibration curve; the CI is narrowest at x̄, the mean value
of the independent variable for the calibration data.
The confidence interval in eqn 4.14 gives an interval that contains the true mean response (µy) at
a given value of x; however, imagine that we wish to obtain a new measurement yi at a value of
x = xi; can we find an interval that will (probably) contain the new measurement? This interval is
a prediction interval of a future measurement.
We can calculate the residual error of this hypothetical future measurement:

ei = yi − ŷi

The residual is simply the distance that the measurement will be from the regression line. What is
the variance of this distance? From propagation of error, we can calculate:

σ²(ei) = σ²(ŷi) + σ²(yi)
We see that the variance of the future residual is due to the measurement noise on yi and to the
uncertainty in ŷi; in other words, to random error in both the regression line and in the future
measurement itself. If we assume homogeneous variance, then σ(yi) = σ, and we can use the
expression above for σ²(ŷi) to obtain

variance of "future" residuals:   σ²(ei) = σ² + σ²·(1/n + (xi − x̄)²/Sxx)
This expression is used to construct the desired prediction interval:

PI for single measurements:   ŷi ± tn−2,α/2 · s·√(1 + 1/n + (xi − x̄)²/Sxx)   [4.15]
To summarize, equation 4.14 gives an interval (a confidence interval) around ŷ that contains µy,
the true (mean) response, while equation 4.15 gives an interval (a prediction interval) that will
contain a future measurement. The first interval is always going to be smaller than the second.
The following figure shows the two intervals for a data set; both intervals were calculated from
that data set. The smaller interval is a 90% confidence interval for µy, while the wider interval is a
90% prediction interval for single measurements. Notice that, while a number of the data points
lie outside the smaller interval, all of them are contained within the wider interval. This behavior
is expected; the CI is only supposed to contain the true value of µy; the PI, however, is supposed
to contain all individual measurements (with 90% probability, anyway).
Figure 4.8: Intervals associated with a regression line. The narrower interval is the 90%
confidence interval for µy, while the wider interval is a 90% prediction interval for future
measurements.
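The "flare" of both intervals away from x̄ is easy to see numerically; this sketch (NumPy and SciPy assumed) evaluates the half-widths of eqns. 4.14 and 4.15 for the Example 4.1 calibration over a range of x values.

import numpy as np
from scipy import stats

xbar, s_res, Sxx, n = 1.5, 0.8889, 2.5, 5
t = stats.t.ppf(0.975, df=n - 2)

xi = np.linspace(0.0, 3.0, 7)
lever = 1.0 / n + (xi - xbar)**2 / Sxx            # the term inside the square root

ci_half = t * s_res * np.sqrt(lever)              # CI half-width for mu_y (eqn 4.14)
pi_half = t * s_res * np.sqrt(1.0 + lever)        # PI half-width for a new y (eqn 4.15)
print(np.column_stack([xi, ci_half, pi_half]))    # both are narrowest near xbar = 1.5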
Example 4.4
Imagine that the data given in example 4.1 is for the calibration curve of an analytical
technique, with the x values being the concentration of an analyte in ppm and the y values
being the instrument response. Now imagine that another chemical sample, with a true
analyte concentration of 1.20 ppm, is analyzed. Using the calibration curve data:
(a) Calculate a 95% confidence interval for the (true) mean instrument response to this
sample.
(b) Calculate a prediction interval within which a single measurement of the sample will fall
with 95% probability.
First the preliminaries, all of which have actually been calculated in the previous examples:
x = (0.5  1.0  1.5  2.0  2.5)
y = (15.658  17.773  20.155  19.745  21.115)

Sxy = Σxiyi − 5·x̄·ȳ = 6.4430
Sxx = 4·sx² = 2.5000

b1 = Sxy / Sxx = 2.5772        b0 = ȳ − b1·x̄ = 15.0234        (least-squares estimates)

sres = √[(1/3)·Σei²] = 0.8889        (standard deviation of the residuals, which is an estimate
of the homogeneous measurement noise)
Now we need to understand the distinction between the intervals asked for in parts (a) and (b). If
we obtain measurements on the sample, which contains 1.20 ppm analyte, these measurements
will be distributed (presumably according to a normal distribution) around a measurement mean,
µy = β1x + β0. In part (a), we are asked to provide a confidence interval for µy when
x = 1.20 ppm. Equation 4.14 gives us the desired confidence interval.
At x = 1.2 ppm,

ŷpred = b1·1.2 + b0 = 18.1160

s(ŷpred) = sres·√(1/5 + (1.2 − x̄)²/Sxx) = 0.4318

This is the standard error in the value predicted by the regression line at x = 1.2 ppm. We can use
this value to calculate the desired confidence interval. For a 95% confidence level and 3 degrees
of freedom, t = 3.1820, so the half-width of the interval is

t·s(ŷpred) = 1.3741

Thus, the confidence interval for the mean response is 18.1 ± 1.4 units [95%].
In part (b) we are asked to provide a range of values that will contain a single measurement of the
sample with 95% probability. Equation 4.15 can be used to calculate this interval. The standard
deviation of the "future" residuals at x = 1.2 ppm is given by

s(future) = √(sres² + s(ŷpred)²) = 0.9883

Note that this gives the same result as

sres·√(1 + 1/5 + (1.2 − x̄)²/Sxx) = 0.9883

The half-width of the prediction interval is then

t·s(future) = 3.1446

Thus, the 95% prediction interval for a future measurement response is 18.1 ± 3.1 units.
Thus, if we measure a sample that contains an analyte concentration of 1.2 ppm, there is a 95%
probability that the measurement value will fall within 3.1 units of 18.1 units, the response
predicted by the regression equation.
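As a numeric check of Example 4.4 (NumPy and SciPy assumed), the sketch below recomputes both intervals at x = 1.2 ppm.

import numpy as np
from scipy import stats

b1, b0 = 2.5772, 15.0234
s_res, Sxx, n, xbar = 0.8889, 2.5, 5, 1.5
t = stats.t.ppf(0.975, df=n - 2)                            # 3.182

y_pred = b1 * 1.2 + b0                                      # 18.116
s_pred = s_res * np.sqrt(1.0 / n + (1.2 - xbar)**2 / Sxx)   # 0.4318
s_future = np.sqrt(s_res**2 + s_pred**2)                    # 0.9883

print(y_pred, t * s_pred)      # 95% CI for the mean response: 18.1 +/- 1.4
print(y_pred, t * s_future)    # 95% PI for a single measurement: 18.1 +/- 3.1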
“Inverse” Regression: Prediction of Independent Variable
The previous section has described two intervals that are used to predict the properties of the
dependent variable, y, at any value of x: the confidence interval is used to estimate µy, and the
prediction interval is used to predict future values of y. In analytical chemistry, however, the
most common use of the regression equation is to predict the value of the independent variable,
x, given a measurement value. For first-order linear regression, the analyte concentration xu is
estimated from the instrumental response yu on a sample as follows:
x̂u = (yu − b0) / b1

where xu is the true analyte concentration in the sample, and x̂u is our best estimate (a point
estimate) of this concentration.
This use of regression analysis has been termed inverse regression. It is important to realize that
x̂u is a random variable, even though x is not. The reason that the concentration estimate contains
error is that it is calculated from terms that all contain some random error. Since the predicted
value x̂u is a random variable, we should have a way to construct a confidence interval that
contains the true analyte concentration.
Before trying to find an expression that can be used to calculate such a confidence interval, it is
worthwhile to understand the factors that contribute to the variability of the predicted
concentration, x̂u. For a given sample, the two sources of random error are the measurement
noise in the dependent variable, and calibration noise due to the random error in the regression
estimates.
Figure 4.9: (a) Random error in the measurement (the yu variable) results in error in the
estimated value x̂u, even if there is no error in the regression line; (b) even if there were no
error in the measured value (yu), random error in the calibration line will cause random error
to be present in the estimated concentration (x̂u).
In part (a) of the figure, we imagine a situation in which the values of β1 and β0 are known
exactly; however, the measurement noise present in the dependent variable yu will still cause
random error to be present in the value x̂u. Part (b) shows the situation when µ(yu) (the mean of
yu) is known, but the regression parameters are not known. In this situation, the random nature of
the statistics b1 and b0 will still cause random error in x̂u.
If we know the true concentration, xu, of the analyte in a sample, we would be able to construct a
prediction interval for possible values for measurements that would be obtained when the sample
is analyzed (from eqn. 4.15). Corresponding to this range of possible measurement values, yu, is a
range of x̂u values that would be calculated from the measurements, as shown in the following
figure:
Figure 4.10: Construction of the prediction interval for x̂u from the calibration curve (the
range of predicted responses at the true concentration corresponds to a range of calculated
concentrations).
Based on this type of reasoning, we can write the following for the true standard error of
estimated concentrations (assuming homogeneous noise σ):

σ(x̂u) ≈ (σ/b1)·[1 + 1/n + (xu − x̄)²/Sxx]^(1/2)   [4.16]
For reasons we will not go into, this equation is only a good approximation for the standard error.
There are three terms in this expression. The first term in the parentheses is due to the
measurement noise. The second two terms are due to the uncertainty in the least-squares
estimates of the slope and intercept terms, the so-called calibration noise. As can be seen, the
calibration error contribution is at a minimum when xu = x̄.
Now, this last equation contains the true value xu of the independent variable (i.e., the analyte
concentration). Normally, of course, the values of xu and σ are not known; in such cases, the
following expression must be used:
s(x̂u) ≈ (sres/b1)·[1 + 1/n + (x̂u − x̄)²/Sxx]^(1/2)   [4.17]
The number of degrees of freedom for the standard error is n − 2 (this equation is specifically for
first-order linear regression).
Example 4.5
Assume again that the data given in example 1 is for a calibration curve, with the x values
being the concentration in ppm and the y values being the instrument response. A sample is
analyzed, giving a response of 18.41; report the estimated analyte concentration in the form
of a confidence interval.
From previous examples, we know that

Sxx = 2.5000        sres = 0.8889        b1 = 2.5772        b0 = 15.0234

We can calculate a point estimate for the analyte concentration in the sample from the measured
response of the "unknown", yu = 18.41:

x̂u = (yu − b0) / b1 = 1.3141

Now we estimate the standard error of the predicted analyte concentration using eqn. 4.17:

s(x̂u) = (sres/b1)·√(1 + 1/5 + (x̂u − x̄)²/Sxx) = 0.3800

The half-width of the 95% confidence interval (t = 3.1820 for 3 degrees of freedom) will be

t·s(x̂u) = 1.2092
Thus, the analyte concentration is 1.3 ± 1.2 ppm [95%].
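A scripted version of Example 4.5 (NumPy and SciPy assumed), using the approximation in eqn. 4.17:

import numpy as np
from scipy import stats

b1, b0 = 2.5772, 15.0234
s_res, Sxx, n, xbar = 0.8889, 2.5, 5, 1.5

y_u = 18.41                                   # measured response of the unknown
x_hat = (y_u - b0) / b1                       # point estimate: 1.3141 ppm

s_x = (s_res / b1) * np.sqrt(1.0 + 1.0 / n + (x_hat - xbar)**2 / Sxx)   # about 0.380

t = stats.t.ppf(0.975, df=n - 2)              # 3.182 for 3 degrees of freedom
print(f"{x_hat:.1f} +/- {t * s_x:.1f} ppm [95%]")    # 1.3 +/- 1.2 ppm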
Chapter Checkpoint
The following terms/concepts were introduced in this chapter:
calibration
least-squares estimates
coefficient of determination (R2)
regression model parameters
dependent variables
regression model
homogeneous variance
regression analysis
independent variables
residuals
inverse least-squares
weighted least-squares
In addition to being able to understand and use these terms, after mastering this chapter, you
should
• be able to calculate least-squares estimates of regression model parameters for a first-order
linear model
• be able to estimate the measurement noise σ from the residuals in data used for regression
• be able to determine the coefficient of determination, R2, for regression data
• be able to calculate the standard error of the least-squares estimates of the first-order linear
model parameters, and use these values to construct confidence intervals
• understand the major assumptions made in deriving the equations used in linear least-squares
regression
• be able to construct a confidence interval for µ(yi), the population mean of the dependent
variable at any value xi of the independent variable
• be able to construct a prediction interval for future measurements obtained at any value xi of
the independent variable
• be able to calculate the standard error in x̂u, our best estimate of the analyte concentration in a
sample.
• be able to construct a confidence interval for xu, the true value of the independent variable,
from the measured value yu of the dependent variable.