Regression and Correlation

Introduction
One of the most important applications of statistics involves estimating the value of a response variable y, or predicting some future value of y, based upon knowledge of a set of independent variables, x₁, x₂, …, xₖ.
For example, an engineer might want to relate the rate of
malfunction y of a mechanical assembler to such variables as the
speed of operation and the assembler operator.
The objective would be to develop a prediction equation relating
the dependent variable y to the independent variables and to use
the prediction equation to predict the value of the rate of
malfunction y for various combinations of speed of operation and
operator.
The models used to relate a dependent variable y to the independent variables x₁, x₂, …, xₖ are termed regression models because they express the mean value of y for given values of x₁, x₂, …, xₖ as a function of a set of unknown parameters.
These parameters are estimated from sample data using a process
to be described.
This basic approach is applicable to situations ranging from simple
linear regression to more complex nonlinear multiple regression.
A Simple Linear Regression Model
Suppose that the developer of a new insulation material wants to determine the amount of compression that would be produced on a 2-inch-thick specimen of material when subjected to different amounts of pressure.
Five experimental pieces of the material were tested under
different pressures.
The values of x (in units of 10 psi) and the resulting amounts of
compression y (in units of 0.1 inch) are given in Table 1.
Table 1. Compression Versus Pressure for an Insulation Material

  Specimen   Pressure, x (10 psi)   Compression, y (0.1 inch)
     1                1                         1
     2                2                         1
     3                3                         2
     4                4                         2
     5                5                         4
A plot of the data, called a scattergram, is shown in Figure 1.
[Figure: scattergram of compression (0.1 inch) versus pressure (10 psi) for the five specimens.]
Figure 1. Scattergram for data in Table 1.
Suppose we believe that the value of y tends to increase in a linear
manner as x increases.
Then we could select a model relating y to x by drawing a line
through the points in the figure.
Such a deterministic model – one that does not allow for
errors of prediction – might be adequate if all of the points in
the figure fell on the fitted line.
However, this will not occur for the data of Table 1.
No matter how the line is drawn through the points, at least
some of the points will deviate substantially from the fitted
line.
The solution to this problem is to construct a probabilistic model
relating y to x – one that acknowledges the random variation of the
data points about a line.
One type of probabilistic model, a simple linear regression model, makes the assumption that the mean value of y for a given value of x can be represented by a straight line and that points deviate about this line of means by a random (positive or negative) amount equal to ε, i.e.,

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where the first two terms represent the mean value of y for a given value of x (and are unknown parameters of the deterministic (nonrandom) portion of the model) and the last term represents the random error.

If we assume that the points deviate above and below the line of means, with some deviations positive, some negative, and with E(ε) = 0, then the mean value of y is

\[ E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x \]

Therefore, the mean value of y for a given value of x, represented by E(y), is given by a straight line with a y-intercept equal to β₀ and slope equal to β₁.
[Figure: the Table 1 scattergram with the hypothetical line of means E(y) = β₀ + β₁x drawn through the points.]
Figure 2. Hypothetical line of means for the data of Table 1.
In order to fit a simple linear regression model to a set of data, we must find estimators for the unknown parameters β₀ and β₁.

Valid inferences about β₀ and β₁ will depend on the probability distribution of the random error ε; therefore, we must first make specific assumptions about ε.
These assumptions, summarized below, are basic to every statistical regression analysis.

1. The mean of the probability distribution of ε is 0. That is, the average of the errors over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y, E(y), for a given value of x is E(y) = β₀ + β₁x.

2. The variance of the probability distribution of ε is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of ε is equal to a constant, say σ², for all values of x.

3. The probability distribution of ε is normal.

4. The errors associated with any two different observations are independent. That is, the error associated with one value of y has no effect on the errors associated with other y values.

The implications of the first three assumptions can be observed in Figure 3, which shows distributions of errors for three particular values of x, namely x₁, x₂, and x₃.

Note that the relative frequency distributions of the errors are normal, with a mean of 0 and a constant variance σ² (each distribution has the same degree of spread or variability).

In practice, regression analysis is reasonably robust to modest departures from these assumptions.
[Figure 3: normal error distributions at x₁, x₂, and x₃, each with mean 0 and the same variance σ², centered on the line of means.]
The Method of Least Squares
In order to choose the “best fitting” line for a set of data, we must estimate the unknown parameters β₀ and β₁ of the simple linear regression model.
These estimators can be found using the method of least squares.
The reasoning behind the method of least squares can be seen by
considering Figure 4 which shows a scattergram of the data points of
Table 1.
[Figure 4: the Table 1 scattergram with a fitted line and vertical segments from each data point to the line.]
The vertical line segments represent deviations of the points from
the line.
Although there are many lines for which the sum of deviations (or
errors) is equal to 0, there is one and only one line for which the
sum of squares of the deviations is a minimum.
The sum of squares of the deviations is called the sum of
squares for error and is denoted by the symbol SSE.
The line is termed the least squares or regression line.
To find the least squares line for a set of data, assume that we have a sample of n data points which can be identified by corresponding values of x and y, i.e., (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ).

For the straight-line model, the response y in terms of x is

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

The line of means is

\[ E(y) = \beta_0 + \beta_1 x \]

The fitted line, which we hope to find, is represented as

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
Thus, ŷ is an estimator of the mean value of y, E(y), and a predictor of some future value of y; β̂₀ and β̂₁ are estimators of β₀ and β₁, respectively.

For a given data point, e.g., (x₁, y₁), the observed value of y is y₁ and the predicted value of y would be obtained by substituting x₁ into the prediction equation

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

The deviation of the i-th value of y from its predicted value is

\[ y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \]

Then the sum of squares of the deviations of the y values about their predicted values for all of the n data points is

\[ SSE = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2 \]
The quantities β̂₀ and β̂₁ that make the SSE a minimum are called the least squares estimates of the population parameters β₀ and β₁, and the prediction equation ŷ = β̂₀ + β̂₁x is called the least squares line.

The values of β̂₀ and β̂₁ that minimize the SSE are obtained by setting the two partial derivatives, ∂SSE/∂β̂₀ and ∂SSE/∂β̂₁, equal to 0 and solving the resulting simultaneous linear system of least squares equations.
We then obtain

Slope:

\[ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} \]

and y-intercept:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

where n equals the sample size and

\[ SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \]

\[ SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]
In summary, we have defined the best-fitting straight line to be the one
that satisfies the least-squares criterion, i.e. the sum of the squared
errors will be smaller than for any other straight-line model.
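To make these formulas concrete, here is a minimal Python sketch (assuming numpy is available; the variable names are illustrative, not from the original notes) that computes the least squares estimates for the Table 1 data:

```python
import numpy as np

# Table 1 data: pressure x (10 psi) and compression y (0.1 inch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

# Sums of squares, using the shortcut forms of the formulas above
ss_xy = np.sum(x * y) - x.sum() * y.sum() / n   # = 7.0
ss_xx = np.sum(x ** 2) - x.sum() ** 2 / n       # = 10.0

beta1_hat = ss_xy / ss_xx                       # slope = 0.7
beta0_hat = y.mean() - beta1_hat * x.mean()     # y-intercept = -0.1

print(f"y_hat = {beta0_hat:.2f} + {beta1_hat:.2f} x")
```

These values match the coefficients in the Excel output shown later in this section.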
The Least Squares Estimators
An examination of the formulas for the least squares estimators reveals that they are linear functions of the observed y values, y₁, y₂, …, yₙ.

Since we have assumed that the random errors associated with these y values, ε₁, ε₂, …, εₙ, are independent, normally distributed random variables with mean 0 and variance σ², it follows that the y values will be normally distributed with mean E(y) = β₀ + β₁x and variance σ².
An Estimator of σ²

In most practical situations, the variance σ² of the random error ε will be unknown and must be estimated from the sample data.

Since σ² measures the variation of the y values about the line E(y) = β₀ + β₁x, it seems reasonable to estimate σ² by dividing SSE by an appropriate number.
Estimation of σ²

\[ s^2 = \frac{SSE}{\text{degrees of freedom for error}} = \frac{SSE}{n - 2} \]

where

\[ SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - \hat{\beta}_1 SS_{xy} \quad \text{and} \quad SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} \]
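Continuing the Python sketch above (with x, y, n, ss_xy, and beta1_hat assumed to be already defined), the variance estimate follows directly:

```python
# Total sum of squares of y about its mean, shortcut form
ss_yy = np.sum(y ** 2) - y.sum() ** 2 / n   # = 6.0

# Sum of squares for error and the estimate of sigma^2
sse = ss_yy - beta1_hat * ss_xy             # = 6.0 - 0.7 * 7.0 = 1.1
s2 = sse / (n - 2)                          # = 0.3667, with n - 2 = 3 degrees of freedom
s = np.sqrt(s2)                             # = 0.6055
```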
Inferences About the Slope β₁

What could be said about the values of β₀ and β₁ in the hypothesized probabilistic model, y = β₀ + β₁x + ε, if x contributes no information for the prediction of y?

The implication is that the mean of y, i.e., the deterministic part of the model E(y) = β₀ + β₁x, does not change as x changes.

Regardless of the value of x, the same value of y is always predicted.

In the straight-line model, this means that the true slope β₁ is equal to 0.

Therefore, to test the null hypothesis that x contributes no information for the prediction of y against the alternative hypothesis that these variables are linearly related with a slope differing from 0, we test

H₀: β₁ = 0
Hₐ: β₁ ≠ 0
If the data support the alternative hypothesis, we will conclude
that x does contribute information for the prediction of y using the
straight-line model (although the relationship between E(y) and x
could be more complex than a straight line).
Thus, to some extent, this is a test of the utility of the
hypothetical model.
Since σ will usually be unknown, the appropriate test statistic is a Student's t statistic, so that the test for model utility of a simple linear regression is given by

One-tailed test:  H₀: β₁ = 0;  Hₐ: β₁ < 0 (or Hₐ: β₁ > 0)
Two-tailed test:  H₀: β₁ = 0;  Hₐ: β₁ ≠ 0

Test statistic:

\[ t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}} \]

Rejection region (one-tailed): t < −t_α (or t > t_α)
Rejection region (two-tailed): |t| > t_{α/2}

Note that t_α and t_{α/2} are based upon n − 2 degrees of freedom.
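As an illustration, the two-tailed slope test can be carried out on the running sketch (assuming scipy is installed and beta1_hat, s, ss_xx, and n are defined as above):

```python
from scipy import stats

# Two-tailed test of H0: beta1 = 0 against Ha: beta1 != 0
t_stat = beta1_hat / (s / np.sqrt(ss_xx))         # = 3.6556
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # = 0.0354
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # = 3.182 for alpha = 0.05

reject_h0 = abs(t_stat) > t_crit                  # True: the slope differs from 0
```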
EXAMPLE
We can use the data in Table 1 to show the results of the necessary calculations, using the linear regression analysis package in Excel. The slope test statistic is

\[ t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}} = \frac{0.7}{0.19} \approx 3.7 \]
EXCEL: LINEAR REGRESSION ANALYSIS FOR THE DATA IN TABLE 1

  Pressure, x:      1  2  3  4  5
  Compression, y:   1  1  2  2  4

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.903696114
  R Square             0.816666667
  Adjusted R Square    0.755555556
  Standard Error       0.605530071
  Observations         5

ANOVA
              df    SS     MS            F             Significance F
  Regression   1    4.9    4.9           13.36363636   0.035352847
  Residual     3    1.1    0.366666667
  Total        4    6

                Coefficients   Standard Error   t Stat          P-value       Lower 95%      Upper 95%
  Intercept       -0.1           0.635085296     -0.157459164    0.88488398    -2.12112675    1.92112675
  X Variable 1     0.7           0.191485422      3.655630775    0.035352847    0.090607356   1.309392644
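The same summary numbers can be reproduced outside Excel. A minimal sketch using scipy.stats.linregress (the intercept_stderr attribute assumes scipy 1.7 or later):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)

res = stats.linregress(x, y)
print(res.slope)              # 0.7       (X Variable 1 coefficient)
print(res.intercept)          # -0.1      (Intercept)
print(res.rvalue ** 2)        # 0.81667   (R Square)
print(res.pvalue)             # 0.035353  (two-tailed P-value for the slope)
print(res.stderr)             # 0.191485  (standard error of the slope)
print(res.intercept_stderr)   # 0.635085  (standard error of the intercept)
```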
Unfortunately, the spreadsheet does not indicate that the critical t value for a two-tailed test at α = 0.05 (tail probability 0.025) is 3.182.

However, the information provided in the table (p = 0.035) provides the necessary information for a rejection of the null hypothesis and warrants a conclusion that the slope is not zero.

Therefore, the sample evidence indicates that x contributes information for the prediction of y using a linear model for the relationship between compression and pressure.
The Coefficient of Correlation

The least squares slope β̂₁ provides useful information on the linear relationship, or association, between two variables y and x.

Another way to measure association is to compute the Pearson product moment correlation coefficient r.

The correlation coefficient, defined as

\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} \]

provides a quantitative measure of the strength of the linear relationship between x and y in the sample, just as β̂₁ does.

However, unlike the slope, the correlation coefficient r is scaleless, i.e., the value of r is always between −1 and +1 no matter the units of x and y.
Since both r and β̂₁ provide information about the utility of the model, it is not surprising that there is a similarity in their computational formulas.

In particular, note that SS_xy appears in the numerators of both expressions and, since both denominators are always positive, r and β̂₁ will always be of the same sign.
A value of r near or equal to 0 implies little or no linear
relationship between y and x.
In contrast, the closer r is to 1 or –1, the stronger the linear
relationship between y and x.
And if r = 1 or r = -1, all the points fall exactly on the least
squares line.
Positive values of r imply that y increases as x increases; negative
values imply that y decreases as x increases.
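For the Table 1 data, r follows from the same sums of squares (continuing the earlier sketch, with ss_xy, ss_xx, and ss_yy assumed defined):

```python
# Pearson product moment correlation coefficient
r = ss_xy / np.sqrt(ss_xx * ss_yy)   # = 7 / sqrt(60) = 0.9037

# The built-in np.corrcoef(x, y)[0, 1] gives the same value; its sign
# always agrees with the sign of the slope estimate beta1_hat.
```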
It is important to note that high correlation does not imply causality.
If a large positive or negative value of the sample correlation
coefficient r is observed, it is incorrect to conclude that a change
in x causes a change in y.
The only valid conclusion is that a linear trend may exist between
x and y.
The population correlation coefficient is denoted by the symbol ρ.

As expected, ρ is estimated by the corresponding sample statistic r.

It is easy to show that r = β̂₁ √(SS_xx / SS_yy).

Thus, β̂₁ = 0 implies r = 0, and vice versa.

Consequently, the null hypothesis H₀: ρ = 0 is equivalent to the hypothesis H₀: β₁ = 0.

The only real difference between the least-squares slope β̂₁ and r is the measurement scale.

Therefore, the information that they provide about the utility of the least-squares model is to some extent redundant.

However, the slope β̂₁ provides additional information on the amount of increase (or decrease) in y for every 1-unit increase in x.

For this reason, the slope is the preferred parameter for making inferences about the existence of a positive or negative linear relationship between two variables.
Test of Hypothesis for Linear Correlation

One-tailed test:  H₀: ρ = 0;  Hₐ: ρ < 0 (or Hₐ: ρ > 0)
Two-tailed test:  H₀: ρ = 0;  Hₐ: ρ ≠ 0

Test statistic:

\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \]

Rejection region (one-tailed): t < −t_α (or t > t_α)
Rejection region (two-tailed): |t| > t_{α/2}

Note that t_α and t_{α/2} are based upon n − 2 degrees of freedom.
The correlation coefficient r describes only the linear relationship
between x and y.
For nonlinear relationships, the value of r may be misleading, and
other methods must be used for describing and testing such
relationships.
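Returning to the Table 1 example, a short sketch (same assumed variables as before) shows that this statistic reproduces the slope test numerically:

```python
# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # = 3.6556

# Identical to the slope test statistic, since r = beta1_hat * sqrt(ss_xx / ss_yy)
assert np.isclose(t_r, beta1_hat / (s / np.sqrt(ss_xx)))
```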
The Coefficient of Determination
Another way to measure the contribution of x in predicting y is to
consider how much the errors of prediction of y can be reduced by
using the information provided by x.
Suppose a sample of data has the scattergram shown in Figure 6a.
[Figure 6: (a) scattergram of the sample data; (b) deviations of the points about the mean ȳ; (c) deviations of the points about the least-squares line.]
If we assume that x contributes no information for the prediction of y, the best prediction for a value of y is the sample mean ȳ, which is represented by the horizontal line in Figure 6b.

The vertical line segments in Figure 6b are the deviations of the points about the mean ȳ.
Note that the sum of squares of deviations for the model ŷ = ȳ is

\[ SS_{yy} = \sum (y_i - \bar{y})^2 \]
Now suppose that a least-squares line is fitted to the same set of data
and the deviations of the points about the line are determined as
indicated in Figure 6c.
Comparison of the deviations about the prediction lines in parts b and c
indicates that:
1. If x contributes little or no information for the prediction of y, then the sums of squares of deviations for the two lines will be nearly equal, i.e.,

\[ SS_{yy} = \sum (y_i - \bar{y})^2 \approx SSE = \sum (y_i - \hat{y}_i)^2 \]

2. If x does contribute information for the prediction of y, then SSE will be smaller than SS_yy. In fact, if all the points fall on the least squares line, then SSE = 0.
A convenient way of measuring how well the least-squares equation ŷ = β̂₀ + β̂₁x performs as a predictor of y is to compute the reduction in the sum of squares of deviations that can be attributed to x, expressed as a proportion of SS_yy.

This quantity, termed the coefficient of determination, is given by

\[ r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \]

In simple linear regression it can be shown that this quantity is equal to the square of the simple linear coefficient of correlation r.

Note that r² is always between 0 and 1 because r is between −1 and +1.
Thus r² = 0.60 means that the sum of squares of deviations of the y values about their predicted values has been reduced by 60% by the use of ŷ, instead of ȳ, to predict y.

Or, more practically, r² = 0.60 implies that the straight-line model relating y to x can explain or account for 60% of the variation present in the sample of y values.
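Both routes to r² can be checked numerically (continuing the running sketch, with sse, ss_yy, and r assumed defined):

```python
# Coefficient of determination, two equivalent ways
r2_from_sse = 1 - sse / ss_yy   # = 1 - 1.1 / 6.0 = 0.8167
r2_from_r = r ** 2              # same value

# Interpretation: the straight-line model accounts for roughly 82%
# of the sample variation in the compression values.
```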
Model Estimation and Prediction
If we are satisfied that a useful model has been identified, we are ready
to accomplish the original objectives for building the model, i.e. using
the model to estimate or predict.
In our example, we might predict or estimate the amount of
compression for a particular level of pressure.
The most common uses of a probabilistic model can be divided into
two categories.
The first is the use of the model for estimating the mean value of
y, E(y), for a specific value of x.
For example, we may want to estimate the mean amount of
compression for all specimens of insulation subjected to a
pressure of 40 (x = 4) psi.
The second use of the model entails predicting a particular value
of y for a given value of x.
If we decide to install insulation in a particular piece of
equipment in which we believe it will be subjected to a
pressure of 40 psi, we will want to predict the compression
for this particular specimen of insulation material.
In the case of estimating a mean value of y, we are attempting to
estimate the mean result of a very large number of experiments at the
given x value.
In the second case, we are trying to predict the outcome of a single
experiment at the given x value.
Sampling Errors for the Estimator of the Mean of y

The standard deviation of the sampling distribution of the estimator ŷ of the mean value of y at a particular value of x, say x_p, is

\[ \sigma_{\hat{y}} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]

where σ is the standard deviation of the random error ε.

Sampling Errors for the Predictor of an Individual Value of y

The standard deviation of the prediction error for the predictor ŷ of an individual y value for x = x_p is

\[ \sigma_{(y - \hat{y})} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]

where σ is the standard deviation of the random error ε.

The true value of σ will rarely be known; therefore, we estimate σ by s.
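A sketch of both standard errors for the Table 1 fit, with σ replaced by its estimate s and an illustrative value x_p = 4 (variables as in the running example):

```python
x_p = 4.0   # illustrative value of x

# Standard error for estimating the mean of y at x = x_p
se_mean = s * np.sqrt(1 / n + (x_p - x.mean()) ** 2 / ss_xx)       # = 0.332

# Standard error for predicting an individual y at x = x_p
se_pred = s * np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / ss_xx)   # = 0.690

# se_pred always exceeds se_mean, and both are smallest when x_p = mean(x).
```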
The error in estimating the mean value of y, E(y), for a given value of x, say x_p, is the distance between the least squares line and the true line of means, E(y) = β₀ + β₁x.

This error, ŷ − E(y), is shown in the figure below.
[Figure: the least squares line and the true line of means, with the error of estimation ŷ − E(y) indicated at x = x_p.]
In contrast, the error (y_p − ŷ) in predicting some future value of y is the sum of two errors: the error of estimating the mean of y, E(y), plus the random error that is a component of the value of y to be predicted, as indicated in the figure below.
[Figure: the prediction error (y_p − ŷ) decomposed into the error of estimating the mean of y plus the random error ε.]
Consequently, the error of predicting a particular value of y will usually
be larger than the error of estimating the mean value of y for a
particular value of x.
Note from their respective formulas that both the error of estimation and the error of prediction take their smallest values when x_p = x̄.
Using the least-squares prediction equation to estimate a mean value
of y or to predict a particular value of y for values of x that lie outside
the range of values of x contained in the sample data may lead to
errors of estimation or prediction that are much larger than expected.
EXAMPLE
Suppose a fire insurance company wants to relate the amount of fire
damage in major residential fires to the distance between the residence
and the nearest fire station. The study is to be conducted in a large
suburb of a major city; a sample of 15 recent fires in this suburb is
selected. The amount of damage y and the distance x between the fire
and the nearest fire station are recorded for each fire.
The data and the regression analysis are incorporated in the
following Excel spreadsheet.
LINEAR REGRESSION ANALYSIS

Fire Damage Data

  Distance x (miles)   Damage y (k$)
        3.4                26.2
        1.8                17.8
        4.6                31.3
        2.3                23.1
        3.1                27.5
        5.5                36.0
        0.7                14.1
        3.0                22.3
        2.6                19.6
        4.3                31.3
        2.1                24.0
        1.1                17.3
        6.1                43.2
        4.8                36.4
        3.8                26.1
SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.960977715
  R Square             0.923478169
  Adjusted R Square    0.917591874
  Standard Error       2.316346184
  Observations         15

ANOVA
              df    SS            MS            F             Significance F
  Regression   1    841.766358    841.766358    156.8861596   1.2478E-08
  Residual    13    69.75097535   5.365459643
  Total       14    911.5173333

                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
  Intercept      10.27792855    1.420277811      7.236562082   6.58556E-06   7.209605476   13.34625162
  X Variable 1    4.919330727   0.392747749     12.52542054    1.2478E-08    4.070850963    5.767810491
[Figure: scattergram of Fire Damage (k$) versus Distance (miles), 0 to 7 miles, with the fitted regression line.]
Application of the Methodology
Step 1
First, we hypothesize a model to relate fire damage y to the
distance x from the nearest fire station.
We will hypothesize a straight-line probabilistic model:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Step 2

Next, we use statistical software to perform a linear regression.

We find that the estimate of the slope is β̂₁ = 4.919331 and the estimate of the y-intercept is β̂₀ = 10.277929.

Thus, the least-squares equation is

\[ \hat{y} = 10.278 + 4.919x \]

The data and the prediction equation are shown in the figure accompanying the ANOVA table.
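A minimal numpy-only sketch (variable names illustrative) that reproduces these estimates from the fire damage data:

```python
import numpy as np

# Fire damage data: distance (miles) and damage (k$)
dist = np.array([3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0,
                 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8])
damage = np.array([26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3,
                   19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1])

# A degree-1 polynomial fit is a least squares line; coefficients are
# returned highest degree first, i.e., [slope, intercept]
slope, intercept = np.polyfit(dist, damage, 1)
print(intercept, slope)   # approx. 10.2779 and 4.9193
```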
Step 3
Now we specify the probability distribution of the random error component ε.

Although we know that the assumptions that we previously considered are not completely satisfied (they seldom are for any practical problem), we are willing to assume that they are approximately satisfied for this example.

The estimate of the variance σ² of ε is given in the table as the MS for error (residual), i.e., s² = MSE = 5.36546.

The estimated standard deviation of ε is s = √5.36546 = 2.31635.

The value of s implies that most of the observed fire damage (y) values will fall within approximately 2s ≈ 4.63 k$ of their respective predicted values.
Step 4: Test of Model Utility

We can now check the utility of the hypothesized model, i.e., whether x really contributes information for the prediction of y using the straight-line model.

First, test the null hypothesis that the slope β₁ is equal to 0, i.e., that there is no linear relationship between the fire damage and the distance from the nearest station, against the alternative that x and y are positively linearly related, i.e.,

H₀: β₁ = 0;  Hₐ: β₁ > 0.
The value of the t-test statistic is given in the row marked x
variable 1; t = 12.525 with an associated probability p =
1.2478E-08.
This small p value leaves little doubt that x contributes
information for the prediction of y.
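The test decision can be confirmed with a small sketch (assuming scipy; the slope and its standard error are taken from the table). Note that the spreadsheet reports a two-tailed P-value, so the one-tailed p for Hₐ: β₁ > 0 is half of it:

```python
from scipy import stats

t_stat = 4.919330727 / 0.392747749             # = 12.525
p_two_tailed = 2 * stats.t.sf(t_stat, df=13)   # approx. 1.25e-08
p_one_tailed = p_two_tailed / 2                # for Ha: beta1 > 0
```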
Step 5: Numerical Descriptive Measures of Model Adequacy

The coefficient of determination is given as R Square, where r² = 0.9235.

This value implies that about 92% of the sample variation in fire damage (y) is explained by the distance x.

The coefficient of correlation r, which measures the strength of the linear relationship between y and x, is given as Multiple R, with a value r = 0.96.

The high value of r confirms our conclusion that β₁ differs from 0.
Step 6:
We are now prepared to use the least-squares model for prediction.
Suppose the insurance company wants to predict fire damage if a major residential fire were to occur 3.5 miles from the nearest fire station, i.e., x_p = 3.5.

The predicted value can be calculated using our model with the coefficients from the ANOVA table, i.e., ŷ = 10.278 + 4.919 x_p.

The result is ŷ = 27.4956, i.e., a predicted fire damage of about 27.5 k$.
Note that we would not use this prediction model to make
predictions for homes < 0.7 or > 6.1 miles from the nearest
station.
A straight-line model might not be appropriate for the
relationship between the mean value of y and the value of x
when stretched over a wider range of x values.
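A sketch of the point prediction and a 95% prediction interval, continuing the numpy example above (s and SS_xx computed from the earlier formulas; 2.160 is t₀.₀₂₅ with 13 degrees of freedom):

```python
n = len(dist)
x_p = 3.5

y_hat = intercept + slope * x_p   # approx. 27.496 k$

# s = sqrt(MSE) from the residuals, and SS_xx for the distance values
resid = damage - (intercept + slope * dist)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # approx. 2.3163
ss_xx = np.sum((dist - dist.mean()) ** 2)

# Standard error for predicting an individual y at x = x_p
se_pred = s * np.sqrt(1 + 1 / n + (x_p - dist.mean()) ** 2 / ss_xx)

# 95% prediction interval
lower, upper = y_hat - 2.160 * se_pred, y_hat + 2.160 * se_pred
```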
Summary of Linear Regression Methodology
1. Hypothesize a probabilistic model – for our use, a straight-line model whereby y = β₀ + β₁x + ε.

2. Use the method of least squares to estimate the unknown parameters in the deterministic component, β₀ + β₁x. The least squares estimates yield a model ŷ = β̂₀ + β̂₁x with a sum of squared errors (SSE) that is smaller than the SSE for any other straight-line model.

3. Specify the probability distribution of the random error component ε.

4. Assess the utility of the hypothesized model. Included here are making inferences about the slope β₁ and calculating r and r².

5. If we are satisfied with the model, we are prepared to use it to estimate the mean y value, E(y), for a given x, as well as to predict an individual y value for a specific value of x.
Summary: Appropriate Use of Regression and Correlation

                                            Nature of the two variables
  Purpose of investigator              Y random, X fixed         Y₁, Y₂ both random
  -----------------------------------  ------------------------  ---------------------
  Establish and estimate dependence    Model I regression:       Model II regression
  of one variable upon another         ŷ = β̂₀ + β̂₁x             (not described)
  Establish and estimate association   Meaningless for           Correlation
  between two variables                this case                 coefficient