Statistics 2014, Fall 2001

Chapter 11 – Regression Analysis
Definition: When the values of two variables are measured for each
member of a population or sample, the resulting data is called
bivariate.
When both variables are quantitative, we may represent the data set
as a set of ordered pairs of numbers, (x, y). The variable x is called
the input (or independent) variable; the variable y is called the
response (or dependent) variable. We may examine the relationship
between the two variables graphically using a scatter diagram, or
scatterplot.
The simplest type of model relating two quantitative variables is
called a simple linear regression model, in which there is an
assumed linear relationship between two variables. One variable is
called the independent variable, or predictor variable. The other
variable is called the dependent variable, or the response variable.
Simple Linear Regression Model
The response variable is assumed to be related to the predictor
variable according to the following equation:
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where
$Y_i$ = the value of the response variable for the ith member of the
sample,
$\beta_0$ = a parameter, called the intercept of the line of best fit, or the
regression line,
$\beta_1$ = a parameter, called the slope of the line of best fit, or the
regression line,
$x_i$ = the value of the predictor variable for the ith member of the
sample,
$\varepsilon_i$ = a random error variable associated with the ith member of the
sample; it is assumed that the random errors are independent and
identically distributed, with $\varepsilon_i \sim \text{Normal}(0, \sigma^2)$.
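To make the model concrete, here is a minimal sketch that simulates responses from it. The function name `simulate_responses` and the parameter values below are hypothetical, chosen only for illustration; they do not come from any data set in these notes.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate_responses(xs, beta0=2.0, beta1=0.5, sigma=1.0)
print(ys)
```

Setting `sigma=0.0` removes the random errors, so every simulated point falls exactly on the line.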


A picture of the model is shown on p. 309.
Since it is assumed that a linear trend relationship exists between the
predictor variable and the response variable, before we proceed to
use the model, we must do a scatterplot to see whether the
assumption of linearity is reasonable.
We need to use sample data to estimate the three parameters,
$\beta_0$, $\beta_1$, and $\sigma^2$. The estimation will be done using the method of
least squares. Given a sample of size n, the data consist of ordered
pairs, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
We will find the best estimators of the slope and intercept by
minimizing the residual sum of squares (also called the error sum of
squares):
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,$$
with respect to the two parameters.
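As a rough illustration of the quantity being minimized, the sketch below evaluates SSE for two candidate lines. It uses the stress and time-to-fracture data from the stainless steel example later in this chapter; the function name `sse` and the two candidate lines are assumptions made for this illustration.

```python
def sse(xs, ys, b0, b1):
    """Residual sum of squares for the candidate line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Applied stress (kg/mm^2) and time to fracture (hours) from the example data.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

# Candidate 1: a horizontal line at the sample mean of y (48.4 hours).
sse_flat = sse(stress, hours, 48.4, 0.0)
# Candidate 2: a downward-sloping line picked by eye (hypothetical values).
sse_sloped = sse(stress, hours, 66.0, -0.9)
print(sse_flat, sse_sloped)
```

The sloped candidate gives a much smaller SSE than the horizontal line, which is the sense in which it fits the data better.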
In doing this, we are simultaneously minimizing the squared vertical
distances of the data points from the line of best fit to the data. A
concrete example is useful here.
Example: p. 302
Imagine constructing this scatterplot concretely as follows:
1) Draw the coordinate axes on a sheet of plywood.
2) Hammer nails into the board at each data point.
3) Obtain a thin wooden dowel and six rubber bands.
4) Place each rubber band around the dowel and one of the nails.
5) Wait until the dowel comes to rest.
The rest position of the dowel will be the minimum energy
configuration of the system, the configuration for which there will
be the least total stretching of the rubber bands. This position will
also be the least squares regression line relating thermal conductivity
and density.
We differentiate SSE w.r.t. each parameter, and set each derivative
equal to 0, obtaining
$$\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0,$$ and
$$\frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0.$$
This gives us two equations in two unknowns, called the normal
equations:
$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i,$$ and
$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.$$
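To see how the normal equations lead to closed-form estimators, one route (a sketch, with all sums running over i = 1 to n) is to divide the first equation by n and substitute the result into the second:

```latex
\begin{align*}
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}
  && \text{(first normal equation divided by } n\text{)} \\
(\bar{y} - \hat{\beta}_1 \bar{x}) \sum x_i + \hat{\beta}_1 \sum x_i^2 &= \sum x_i y_i
  && \text{(substituted into the second)} \\
\hat{\beta}_1 \Bigl( \sum x_i^2 - n\bar{x}^2 \Bigr) &= \sum x_i y_i - n\bar{x}\bar{y}
  && \text{(using } \textstyle\sum x_i = n\bar{x}\text{)}
\end{align*}
```

The bracketed quantity on the left equals $\sum (x_i - \bar{x})^2$, and the right side equals $\sum (x_i - \bar{x})(y_i - \bar{y})$.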
The solution is
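$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} = \frac{SS_{xy}}{SS_{xx}},$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$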
Then the estimated regression line, or line of best fit to the data, is
given by:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
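The closed-form estimators can be checked numerically. The sketch below applies them to the stress and time-to-fracture data from the example later in this chapter; the function name `fit_line` is an assumption made for this illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line via SS_xy / SS_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Applied stress (kg/mm^2) and time to fracture (hours) from the example data.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

b0_hat, b1_hat = fit_line(stress, hours)
print(round(b0_hat, 4), round(b1_hat, 4))  # 66.4177 -0.9009
```

These values agree with the coefficients in the Excel output for the example.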
The estimate of the error variance is found from the error sum of
squares to be $\hat{\sigma}^2 = MSE = \frac{SSE}{n - 2}$. There are only $n - 2$ degrees of
freedom associated with the error sum of squares because two
parameters, the slope and the intercept, have already been estimated.
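As a sketch of this calculation, the code below computes MSE for the stress and time-to-fracture data from the example later in this chapter. The function name `mean_squared_error` is an assumption made for this illustration.

```python
def mean_squared_error(xs, ys):
    """Estimate the error variance as SSE / (n - 2) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sse / (n - 2)  # two degrees of freedom are used by b0 and b1

# Applied stress (kg/mm^2) and time to fracture (hours) from the example data.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

mse = mean_squared_error(stress, hours)
print(mse, mse ** 0.5)
```

The square root of MSE is the quantity that Excel labels "Standard Error" in its Regression Statistics block.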
To do inference, we need to know the distributional properties of the
estimators, $\hat{\beta}_1$ and $\hat{\beta}_0$. One of the basic assumptions of the model
is that the random error terms, $\varepsilon_i$, are i.i.d. normal with mean 0 and
common variance $\sigma^2$. Then $Y_i \sim \text{Normal}(\beta_0 + \beta_1 x_i, \sigma^2)$.
Furthermore, the Y's are independent of each other. From the
normal equations, it is clear that $\hat{\beta}_1$ is a linear function of the Y's,
and that $\hat{\beta}_0$ is also a linear function of the Y's. We know that a
statistic that is a linear function of independent normal random
variables also has a normal distribution.
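Specifically, it can be shown that both estimators are unbiased, and
that
$$\hat{\beta}_1 \sim \text{Normal}\left(\beta_1, \frac{\sigma^2}{SS_{xx}}\right),$$ and that
$$\hat{\beta}_0 \sim \text{Normal}\left(\beta_0, \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right)\right).$$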

We may use these facts to do hypothesis testing and interval
estimation about the slope and intercept. The standard error of $\hat{\beta}_1$
is given by
$$SE(\hat{\beta}_1) = \sqrt{\frac{MSE}{SS_{xx}}}.$$
The standard error of $\hat{\beta}_0$ is given by
$$SE(\hat{\beta}_0) = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right)}.$$
Therefore, we find that
$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{MSE / SS_{xx}}} \sim t(n - 2),$$ and that
$$\frac{\hat{\beta}_0 - \beta_0}{\sqrt{MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right)}} \sim t(n - 2).$$
We want to test whether there is a
linear trend relationship between the predictor and the response
variable. Our hypotheses are H0: $\beta_1 = 0$
v. Ha: $\beta_1 \neq 0$.
We may use the distributional properties of the estimated slope to
find a test statistic.
We may do the hypothesis test using the t-distribution of the
estimator.
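The standard errors and the t statistic for the slope can be sketched in a few lines. The code below uses the stress and time-to-fracture data from the example that follows. The critical value `T_CRIT` is an assumption taken from a standard t table (upper 0.025 point with 8 degrees of freedom), not something computed here.

```python
import math

# Applied stress (kg/mm^2) and time to fracture (hours) from the example data.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

n = len(stress)
x_bar = sum(stress) / n
y_bar = sum(hours) / n
ss_xx = sum((x - x_bar) ** 2 for x in stress)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(stress, hours))

b1 = ss_xy / ss_xx                      # estimated slope
b0 = y_bar - b1 * x_bar                 # estimated intercept
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(stress, hours)) / (n - 2)

se_b1 = math.sqrt(mse / ss_xx)
se_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / ss_xx))

t_b1 = (b1 - 0) / se_b1                 # test statistic for H0: beta1 = 0
T_CRIT = 2.306                          # tabulated t(0.025; 8), assumed from a t table
reject_h0 = abs(t_b1) > T_CRIT          # two-sided test at the 0.05 level
print(se_b1, se_b0, t_b1, reject_h0)
```

The standard errors and t statistic should match the "Standard Error" and "t Stat" columns of the Excel coefficient table for the example.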
Example: The paper “A study of stainless steel stress-corrosion
cracking by potential measurements” (Corrosion, 1962, pp. 425-432) reported on the relationship between applied stress (the
predictor variable, x, in kg/mm²) and time to fracture (the response
variable, y, in hours) for 18-8 stainless steel under uniaxial tensile
stress in a 40% CaCl₂ solution at 100°C. Ten different settings of
applied stress were used, and the resulting data values (as read from
a graph which appeared in the paper) are given in the table below:
x_i (kg/mm²)   2.5   5    10   15   17.5   20   25   30   35   40
y_i (hours)    63    58   55   61   62     37   38   45   46   19
We want to 1) determine whether there is a linear trend relationship
between applied tensile stress and time to fracture, and 2) estimate
the relationship.
We first do a scatterplot, using Excel:
[Figure: Scatterplot of Time to Fracture v. Tensile Stress. Horizontal axis: Tensile Stress (kg/square mm), 0 to 50. Vertical axis: Time to Fracture (Hours), 0 to 70.]
It appears that there is a moderately strong negative linear trend
relationship between time to fracture and tensile stress.
Next we want to test whether this relationship generalizes to the
entire population of 18-8 stainless steel samples.
Step 1: H0: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$.
Step 2: n = 10, $\alpha = 0.05$.
Step 3: The test statistic that will be used is $F = \frac{MSR}{MSE}$, which under
the null hypothesis has an F(1, 8) distribution, since the error sum of
squares has n - 2 = 8 degrees of freedom.
Step 4: We will reject the null hypothesis if the value of the test
statistic is greater than $F_{1, 8, 0.05} = 5.32$.
Step 5: We enter the data in Excel. We choose Tools, Data
Analysis, and Regression. Excel produces the following ANOVA
table.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.79531017
R Square            0.632518266
Adjusted R Square   0.58658305
Standard Error      9.124307466
Observations        10

ANOVA
             df   SS            MS            F             Significance F
Regression   1    1146.376106   1146.376106   13.76978954   0.005949788
Residual     8    666.0238938   83.25298673
Total        9    1812.4

               Coefficients    Standard Error   t Stat         P-value
Intercept      66.41769912     5.648129399      11.75923822    2.50156E-06
X Variable 1   -0.900884956    0.242775962      -3.710766706   0.005949788
Step 6: The observed test statistic, F = 13.77, has a p-value of 0.0059,
which is less than 0.05, so we reject the null hypothesis at the 0.05 level of
significance. We have sufficient evidence to conclude that $\beta_1 \neq 0$;
i.e., there is a linear trend relationship between tensile stress and
time to fracture.
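The ANOVA quantities can be reproduced directly from the data. The sketch below also checks that, in simple linear regression, the F statistic equals the square of the t statistic for the slope. The critical value `F_CRIT` is an assumption taken from a standard F table (upper 0.05 point of F(1, 8)), not something computed here.

```python
# Applied stress (kg/mm^2) and time to fracture (hours) from the example data.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

n = len(stress)
x_bar = sum(stress) / n
y_bar = sum(hours) / n
ss_xx = sum((x - x_bar) ** 2 for x in stress)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(stress, hours))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in stress]

ssr = sum((f - y_bar) ** 2 for f in fitted)             # regression sum of squares
sse = sum((y - f) ** 2 for y, f in zip(hours, fitted))  # error sum of squares
msr = ssr / 1                                           # 1 regression degree of freedom
mse = sse / (n - 2)                                     # n - 2 = 8 error degrees of freedom
f_stat = msr / mse

t_stat = b1 / (mse / ss_xx) ** 0.5                      # t statistic for H0: beta1 = 0
F_CRIT = 5.32                                           # tabulated F(0.05; 1, 8), assumed from an F table
reject_h0 = f_stat > F_CRIT
print(ssr, sse, f_stat, t_stat ** 2, reject_h0)
```

Because the two statistics are related in this way, the F test and the two-sided t test for the slope lead to the same decision.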
Defn: The coefficient of determination is defined by
$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}.$$
This quantity is the proportion of the variation
of the response variable that is explained by the linear relationship
between the predictor variable and the response variable.
In our example, $R^2$ = 0.6325. Hence 63.25% of the variation in time
to fracture is explained by the linear relationship between tensile
stress and time to fracture.
A large value for $R^2$ (near 1) indicates that the model has good
explanatory power. A value for $R^2$ near 0 indicates that the model
does not have good explanatory power.
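As a check on the value reported by Excel, the sketch below computes the coefficient of determination from the example data. The function name `r_squared` is an assumption made for this illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination, 1 - SSE / SST, for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Applied stress (kg/mm^2) and time to fracture (hours) from the example data.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

r2 = r_squared(stress, hours)
print(round(r2, 4))  # 0.6325
```

The square root of this value is the "Multiple R" entry in the Excel output.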
The estimated regression equation (line of best fit) may also be read
from the last table in the Excel output. We have
$\hat{Y} = 66.4177 - 0.9009x$. This says that for every 1 kg/mm²
increase in tensile stress, the time to fracture decreases by 0.9009
hours, on average.
If the applied tensile stress is 12 kg/mm², then the predicted time to
fracture is $\hat{Y} = 66.4177 - (0.9009)(12) = 55.6069$ hours.
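The prediction can be written as a small function. This is a minimal sketch: the function name `predict_hours` is an assumption, and the default coefficients are the rounded values from the Excel output.

```python
def predict_hours(stress_kg_mm2, b0=66.4177, b1=-0.9009):
    """Predicted time to fracture (hours) at a given applied stress (kg/mm^2)."""
    return b0 + b1 * stress_kg_mm2

print(round(predict_hours(12), 4))  # 55.6069
```

Using the unrounded coefficients from the Excel output instead would change the prediction only in the fourth decimal place.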