STAT355 - Probability & Statistics
Chapter 12: Simple Linear Regression and Correlation
Fall 2011
Chapter 12: Simple Linear Regression and Correlation
1. 12.1 The Simple Linear Regression Model
2. 12.2 Estimating Model Parameters
3. 12.3 Inferences About the Slope Parameter β1
The Simple Linear Regression Model
The simplest deterministic mathematical relationship between two
variables x and y is a linear relationship
y = β0 + β1 x.
The set of pairs (x, y ) for which y = β0 + β1 x determines a straight line
with slope β1 and y-intercept β0 . The objective of this section is to
develop a linear probabilistic model.
If the two variables are not deterministically related, then for a fixed value
of x, there is uncertainty in the value of the second variable.
The Simple Linear Regression Model
For example, if we are investigating the relationship between age of child
and size of vocabulary and decide to select a child of age x = 5.0 years,
then before the selection is made, vocabulary size is a random variable Y .
After a particular 5-year-old child has been selected and tested, a
vocabulary of 2000 words may result. We would then say that the
observed value of Y associated with fixing x = 5.0 was y = 2000.
More generally, the variable whose value is fixed by the experimenter will
be denoted by x and will be called the independent, predictor, or
explanatory variable.
For fixed x, the second variable will be random; we denote this random
variable and its observed value by Y and y , respectively, and refer to it as
the dependent or response variable.
The Simple Linear Regression Model
Usually observations will be made for a number of settings of the
independent variable.
Let x1 , x2 , ..., xn denote values of the independent variable for which
observations are made, and let Yi and yi , respectively, denote the random
variable and observed value associated with xi . The available bivariate
data then consists of the n pairs (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ).
A picture of this data, called a scatter plot, gives preliminary impressions
about the nature of any relationship. In such a plot, each (xi, yi) is
represented as a point in a two-dimensional coordinate system.
Examples: scatter plots of such bivariate data (figures omitted).
A Linear Probabilistic Model
For the deterministic model y = β0 + β1 x, the actual observed value of y
is a linear function of x.
The appropriate generalization of this to a probabilistic model assumes
that the expected value of Y is a linear function of x, but that for fixed x
the variable Y differs from its expected value by a random amount.
A Linear Probabilistic Model
Definition
There are parameters β0 , β1 , and σ 2 , such that for any fixed value of the
independent variable x, the dependent variable is a random variable
related to x through the model equation
Y = β0 + β1 x + ε     (1)
The quantity ε in the model equation is a random variable, assumed to be
normally distributed with E(ε) = 0 and V(ε) = σ².
A Linear Probabilistic Model
The variable ε is usually referred to as the random deviation or random
error term in the model.
Without ε, any observed pair (x, y) would correspond to a point falling
exactly on the line y = β0 + β1 x, called the true (or population)
regression line.
The inclusion of the random error term allows (x, y) to fall either above
the true regression line (when ε > 0) or below the line (when ε < 0).
The slope β1 of the true regression line is interpreted as the expected
change in Y associated with a 1-unit increase in the value of x.
Estimating Model Parameters
Principle of Least Squares
The vertical deviation of the point (xi , yi ) from the line y = b0 + b1 x is
height of point-height of line = yi − (b0 + b1 xi )
The sum of squared vertical deviations from the points (x1 , y1 ), ..., (xn , yn )
to the line is then
f(b0, b1) = Σ [yi − (b0 + b1 xi)]²   (sum over i = 1, ..., n)
The point estimates of β0 and β1 , denoted by β̂0 and β̂1 and called the
least squares estimates, are those values that minimize f (b0 , b1 ). That is,
β̂0 and β̂1 are such that f (β̂0 , β̂1 ) ≤ f (b0 , b1 ) for any b0 and b1 .
Estimating Model Parameters
The least squares estimate of the intercept β0 of the true regression line is
b0 = β̂0 = (Σ yi − β̂1 Σ xi)/n = ȳ − β̂1 x̄
The least squares estimate of the slope coefficient β1 of the true
regression line is
b1 = β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Sxy /Sxx
Remarks: Computing formulas for the numerator Sxy and the denominator
Sxx of β̂1 are
Sxy = Σ xi yi − (Σ xi)(Σ yi)/n   and   Sxx = Σ xi² − (Σ xi)²/n
Example
Consider the following data:

x:   2     3.1   3.9   5.3    6      7.2    8
y:   5.3   6.3   9.0   12.2   11.5   16.7   16.9

Fit a linear model to the data and obtain the regression line (using your
calculator).
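As a cross-check on the calculator fit, here is a minimal Python sketch that applies the least-squares computing formulas to the example data above (all variable names are mine):

```python
# Least-squares fit of the example data using the computing formulas
# Sxy = Σxiyi - (Σxi)(Σyi)/n  and  Sxx = Σxi² - (Σxi)²/n.
x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

b1 = Sxy / Sxx                      # slope estimate β̂1 = Sxy/Sxx
b0 = sum(y) / n - b1 * sum(x) / n   # intercept estimate β̂0 = ȳ - β̂1·x̄

print(f"yhat = {b0:.4f} + {b1:.4f} x")  # yhat = 0.6612 + 2.0640 x
```

The fitted line is ŷ ≈ 0.661 + 2.064x, i.e. each 1-unit increase in x is associated with an expected increase of about 2.06 in y.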
Fitted Values and Residuals
The parameter σ 2 determines the amount of variability inherent in the
regression model.
A large value of σ 2 will lead to observed (xi , yi )s that are quite spread out
about the true regression line, whereas when σ 2 is small the observed
points will tend to fall very close to the true line.
Definition
The fitted (or predicted) values ŷ1 , ŷ2 , ..., ŷn are obtained by successively
substituting x1 , ..., xn into the equation of the estimated regression line:
ŷ1 = β̂0 + β̂1 x1 , ŷ2 = β̂0 + β̂1 x2 , ..., ŷn = β̂0 + β̂1 xn .
The residuals are y1 − ŷ1 , y2 − ŷ2 , ..., yn − ŷn , the differences between the
observed and fitted y values.
Estimating σ²
Definition
The error sum of squares (also called residual sum of squares), denoted by
SSE , is
SSE = Σ(yi − ŷi)² = Σ[yi − (β̂0 + β̂1 xi)]²
and the estimate of σ² is
σ̂² = s² = SSE/(n − 2) = Σ(yi − ŷi)²/(n − 2)
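These definitions can be traced numerically on the earlier example data; a minimal Python sketch (variable names are mine) computing the fitted values, residuals, SSE, and s²:

```python
# Fitted values ŷi, residuals yi - ŷi, SSE, and s² = SSE/(n-2)
# for the example data fitted by least squares.
x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = Sxy / Sxx
b0 = sum(y) / n - b1 * sum(x) / n

y_hat = [b0 + b1 * xi for xi in x]              # fitted (predicted) values
resid = [yi - yh for yi, yh in zip(y, y_hat)]   # residuals
sse = sum(e ** 2 for e in resid)                # error sum of squares
s2 = sse / (n - 2)                              # estimate of σ²

print(f"SSE = {sse:.3f}, s^2 = {s2:.3f}")  # SSE = 5.131, s^2 = 1.026
```

Note that the residuals sum to zero, as always for a least-squares fit with an intercept term.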
The Coefficient of Determination
Examples using R
The Coefficient of Determination
Definition
The coefficient of determination, denoted by r 2 , is given by
r² = 1 − SSE/SST
where SST = Syy = Σ(yi − ȳ)². It is interpreted as the proportion of
observed y variation that can be explained by the simple linear regression
model.
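Applied to the earlier example data, this definition gives r² close to 1; a short Python sketch (names are mine):

```python
# Coefficient of determination r² = 1 - SSE/SST for the example data.
x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1, ybar = Sxy / Sxx, sum(y) / n
b0 = ybar - b1 * sum(x) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)   # total sum of squares Syy
r2 = 1 - sse / sst

print(f"r^2 = {r2:.4f}")  # r^2 = 0.9597
```

About 96% of the observed y variation is explained by the linear model for these data.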
The Coefficient of Determination - Remarks
- The higher the value of r², the more successful is the simple linear
regression model in explaining y variation.
- When regression analysis is done by a statistical computer package,
either r² or 100r² (the percentage of variation explained by the regression)
is a prominent part of the output.
- If r² is small, an analyst will usually want to search for an alternative
model (either a nonlinear model or a multiple regression model that
involves more than a single independent variable) that can more effectively
explain y variation.
Inferences About the Slope Parameter β1
The estimators (statistics, and thus random variables) for β0, β1, and σ²
are obtained by replacing yi by Yi in the expressions for β̂0, β̂1, and σ̂²:
β̂1 = Σ(xi − x̄)(Yi − Ȳ) / Σ(xi − x̄)²
β̂0 = (Σ Yi − β̂1 Σ xi)/n
σ̂² = S² = [Σ Yi² − β̂0 Σ Yi − β̂1 Σ xi Yi] / (n − 2)
Inferences About the Slope Parameter β1
Proposition
1. The mean value of β̂1 is E(β̂1) = β1, so β̂1 is an unbiased estimator
of β1.
2. The variance and standard deviation of β̂1 are
V(β̂1) = σβ̂1² = σ²/Sxx   and   σβ̂1 = σ/√Sxx
where Sxx = Σ(xi − x̄)². Replacing σ by its estimate s gives an estimate
for σβ̂1 (the estimated standard error of β̂1):
σ̂β̂1 = sβ̂1 = s/√Sxx
3. The estimator β̂1 has a normal distribution (because it is a linear
function of independent normal rv's).
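For the earlier example data, the estimated standard error of the slope can be computed directly from s and Sxx; a minimal Python sketch (names are mine):

```python
# Estimated standard error of the slope, s_b1 = s / sqrt(Sxx),
# for the example data.
import math

x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = Sxy / Sxx
b0 = sum(y) / n - b1 * sum(x) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))        # estimate of σ
se_b1 = s / math.sqrt(Sxx)          # estimated standard error of β̂1

print(f"s_b1 = {se_b1:.4f}")  # s_b1 = 0.1890
```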
Inferences About the Slope Parameter β1
Theorem
The assumptions of the simple linear regression model imply that the
standardized variable
T = (β̂1 − β1)/Sβ̂1 = (β̂1 − β1)/(S/√Sxx)
has a t distribution with n − 2 df.
Proposition
A 100(1 − α)% confidence interval for the slope β1 of the true regression
line is
β̂1 ± tα/2,n−2 · sβ̂1
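For the example data (n = 7, so df = 5), a 95% interval uses the standard t-table value t.025,5 = 2.571; a minimal Python sketch (names are mine, with the critical value hard-coded from the table):

```python
# 95% confidence interval for β1: b1 ± t_{.025, n-2} · s_b1.
import math

x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = Sxy / Sxx
b0 = sum(y) / n - b1 * sum(x) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(Sxx)

t_crit = 2.571                       # t_{α/2, n-2} for α = .05, df = 5
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"95% CI for beta1: ({lo:.3f}, {hi:.3f})")  # (1.578, 2.550)
```

The interval excludes 0, which is consistent with a clear linear relationship in these data.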
Inferences About the Slope Parameter β1
Hypothesis-Testing Procedures
Null hypothesis: H0: β1 = β10
Test statistic value: t = (β̂1 − β10)/sβ̂1

Alternative Hypothesis   Rejection Region for Level α Test
Ha: β1 > β10             t ≥ tα,n−2
Ha: β1 < β10             t ≤ −tα,n−2
Ha: β1 ≠ β10             either t ≥ tα/2,n−2 or t ≤ −tα/2,n−2

A P-value based on n − 2 df can be calculated just as was done previously
for t tests.
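The most common special case is the model utility test H0: β1 = 0 vs Ha: β1 ≠ 0. A minimal Python sketch for the example data (names are mine; the two-tailed critical value t.025,5 = 2.571 is from the t table):

```python
# Model utility test H0: β1 = 0 vs Ha: β1 ≠ 0 for the example data:
# reject at level α = .05 when |t| ≥ t_{.025, n-2}.
import math

x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = Sxy / Sxx
b0 = sum(y) / n - b1 * sum(x) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(Sxx)

t_stat = (b1 - 0) / se_b1            # β10 = 0 under H0
reject = abs(t_stat) >= 2.571        # two-tailed, df = n - 2 = 5
print(f"t = {t_stat:.2f}, reject H0: {reject}")  # t = 10.92, reject H0: True
```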
Inferences About the Slope Parameter β1
Regression and ANOVA
The total sum of squares Σ(yi − ȳ)² decomposes into a part SSE, which
measures unexplained variation, and a part SSR, which measures variation
explained by the linear relationship.

Source of Variation   df      Sum of Squares   Mean Square        F
Regression            1       SSR              SSR/1              SSR/s²
Error                 n − 2   SSE              s² = SSE/(n − 2)
Total                 n − 1   SST
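The ANOVA decomposition can be verified on the example data; a minimal Python sketch (names are mine). For simple linear regression the F statistic equals the square of the model utility t statistic:

```python
# ANOVA decomposition SST = SSR + SSE and the F statistic F = SSR/s²
# for the example data.
x = [2, 3.1, 3.9, 5.3, 6, 7.2, 8]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = Sxy / Sxx
b0 = sum(y) / n - b1 * sum(x) / n
ybar = sum(y) / n

sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse                       # variation explained by regression
f_stat = (ssr / 1) / (sse / (n - 2))  # F = MSR / MSE = SSR / s²

print(f"SSR = {ssr:.2f}, F = {f_stat:.1f}")  # SSR = 122.32, F = 119.2
```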