Regression Summary
Fall 2015
Steve Vardeman
Iowa State University
October 12, 2015
Abstract
This outline summarizes the main points of basic Simple and Multiple
Linear Regression analysis.
Contents

1 Simple Linear Regression
  1.1 SLR Model
  1.2 Descriptive Analysis of Approximately Linear (x, y) Data
  1.3 Parameter Estimates for SLR
  1.4 Interval-Based Inference Methods for SLR
  1.5 Hypothesis Tests and SLR
  1.6 ANOVA and SLR
  1.7 Standardized Residuals and SLR

2 Multiple Linear Regression
  2.1 Multiple Linear Regression Model
  2.2 Descriptive Analysis of Approximately Linear (x_1, x_2, ..., x_k, y) Data
  2.3 Parameter Estimates for MLR
  2.4 Interval-Based Inference Methods for MLR
  2.5 Hypothesis Tests and MLR
  2.6 ANOVA and MLR
  2.7 "Partial F Tests" in MLR
  2.8 Standardized Residuals in MLR
  2.9 Intervals and Tests for Linear Combinations of \beta's in MLR
1 Simple Linear Regression

1.1 SLR Model
The basic (normal) "simple linear regression" model says that a response/output
variable y depends on an explanatory/input/system variable x in a "noisy but
linear" way. That is, one supposes that there is a linear relationship between
x and mean y,

\mu_{y|x} = \beta_0 + \beta_1 x

and that (for fixed x) there is around that mean a distribution of y that is
normal. Further, the model assumption is that the standard deviation of the
response distribution is constant in x. In symbols it is standard to write

y = \beta_0 + \beta_1 x + \epsilon

where \epsilon is normal with mean 0 and standard deviation \sigma. This describes one
y. Where several observations y_i with corresponding values x_i are under
consideration, the assumption is that the y_i (the \epsilon_i) are independent. (The
\epsilon_i are conceptually equivalent to unrelated random draws from the same fixed
normal continuous distribution.) The model statement in its full glory is then

y_i = \beta_0 + \beta_1 x_i + \epsilon_i   for i = 1, 2, ..., n

\epsilon_i  for i = 1, 2, ..., n  are independent normal (0, \sigma^2) random variables
The model statement above is a perfectly theoretical matter. One can begin
with it and, for specific choices of \beta_0, \beta_1, and \sigma, find probabilities for y at given
values of x. In applications, the real mode of operation is instead to take n
data pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and use them to make inferences about
the parameters \beta_0, \beta_1, and \sigma and to make predictions based on the estimates
(based on the empirically fitted model).
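(As an illustration not in the original outline, the short Python sketch below shows what "finding probabilities for y at given values of x" means for one made-up choice of \beta_0, \beta_1, and \sigma; the parameter values and the cutoff 7.5 are assumptions chosen purely for the example.)

    # Minimal sketch (hypothetical parameter values): a normal probability for y
    # at a fixed x under the SLR model y = beta0 + beta1*x + epsilon.
    from scipy import stats

    beta0, beta1, sigma = 1.0, 2.0, 0.5   # assumed "true" parameters
    x = 3.0
    mean_y = beta0 + beta1 * x            # mu_{y|x}
    p = stats.norm.cdf(7.5, loc=mean_y, scale=sigma)   # P(y <= 7.5) when x = 3
    print(mean_y, p)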
1.2 Descriptive Analysis of Approximately Linear (x, y) Data

After plotting (x, y) data to determine that the "linear in x mean of y" model
makes some sense, it is reasonable to try to quantify "how linear" the data look
and to find a line of "best fit" to the scatterplot of the data. The sample
correlation between y and x,

r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}

is a measure of the strength of linear relationship between x and y.
Calculus can be invoked to find a slope \beta_1 and intercept \beta_0 minimizing the
sum of squared vertical distances from data points to a fitted line,
\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2. These "least squares" values are

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}   and   b_0 = \bar{y} - b_1 \bar{x}
It is further common to refer to the value of y on the "least squares line"
corresponding to x_i as a fitted or predicted value

\hat{y}_i = b_0 + b_1 x_i

One might take the difference between what is observed (y_i) and what is
"predicted" or "explained" (\hat{y}_i) as a kind of leftover part or "residual"
corresponding to a data value

e_i = y_i - \hat{y}_i
The sum \sum_{i=1}^n (y_i - \bar{y})^2 is most of the sample variance of the n values y_i. It
is a measure of raw variation in the response variable. People often call it the
"total sum of squares" and write

SSTot = \sum_{i=1}^n (y_i - \bar{y})^2
The sum of squared residuals \sum_{i=1}^n e_i^2 is a measure of variation in response
remaining unaccounted for after fitting a line to the data. People often call it
the "error sum of squares" and write

SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2
One is guaranteed that SSTot \geq SSE. So the difference SSTot - SSE is a
non-negative measure of variation accounted for in fitting a line to (x, y) data.
People often call it the "regression sum of squares" and write

SSR = SSTot - SSE
The coefficient of determination expresses SSR as a fraction of SSTot and is

R^2 = \frac{SSR}{SSTot}

which is interpreted as "the fraction of raw variation in y accounted for in the
model fitting process." As it turns out, this quantity is the squared correlation
between the y_i and \hat{y}_i, and thus in SLR (since the \hat{y}_i are perfectly correlated
with the x_i) also the squared correlation between the y_i and x_i.
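(As an illustration not in the original outline, the following Python sketch computes each descriptive quantity of this subsection for a small made-up data set; the numbers in x and y are assumptions chosen purely for the example.)

    # Minimal sketch (hypothetical data): the descriptive statistics of Section 1.2.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.8, 4.1, 5.9, 8.2, 9.9])

    r = np.corrcoef(x, y)[0, 1]                                   # sample correlation
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()                                 # least squares line
    y_hat = b0 + b1 * x                                           # fitted values
    e = y - y_hat                                                 # residuals
    SSTot = np.sum((y - y.mean()) ** 2)
    SSE = np.sum(e ** 2)
    SSR = SSTot - SSE
    R2 = SSR / SSTot                                              # equals r**2 in SLR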
1.3 Parameter Estimates for SLR

The descriptive statistics for (x, y) data can be used to provide "single number
estimates" of the (typically unknown) parameters of the simple linear regression
model. That is, the slope of the least squares line can serve as an estimate of \beta_1,

\hat{\beta}_1 = b_1

and the intercept of the least squares line can serve as an estimate of \beta_0,

\hat{\beta}_0 = b_0

The variance of y for a given x can be estimated by a kind of average of
squared residuals

s^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2 = \frac{SSE}{n-2}

Of course, the square root of this "regression sample variance" is s = \sqrt{s^2} and
serves as a single number estimate of \sigma.
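(Continuing the illustrative sketch from Section 1.2, the single number estimate of \sigma is then computed as below.)

    # Continuing the sketch (hypothetical data): the SLR estimate of sigma.
    n = len(x)
    s2 = SSE / (n - 2)        # "regression sample variance"
    s = s2 ** 0.5             # single number estimate of sigma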
1.4 Interval-Based Inference Methods for SLR

The normal simple linear regression model provides inference formulas for model
parameters. Confidence limits for \sigma are

s \sqrt{\frac{n-2}{\chi^2_{upper}}}   and   s \sqrt{\frac{n-2}{\chi^2_{lower}}}

where \chi^2_{upper} and \chi^2_{lower} are upper and lower percentage points of the
\chi^2 distribution with \nu = n - 2 degrees of freedom. And confidence limits for \beta_1
(the slope of the line relating mean y to x ... the rate of change of average y with
respect to x) are

b_1 \pm t \frac{s}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}

where t is a quantile of the t distribution with \nu = n - 2 degrees of freedom.
One might call the ratio s / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} the standard error of b_1
and use the symbol SE_{b_1} for it. In these terms, the confidence limits for \beta_1 are

b_1 \pm t SE_{b_1}

Confidence limits for \mu_{y|x} = \beta_0 + \beta_1 x (the mean value of y at a given value
x) are

(b_0 + b_1 x) \pm t s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}

One can abbreviate (b_0 + b_1 x) as \hat{y}, and call s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}
the standard error of the fitted mean and use the symbol SE_{\hat{y}} for it. (JMP calls this the Std
Error of Predicted.) In these terms, the confidence limits for \mu_{y|x} are

\hat{y} \pm t SE_{\hat{y}}

(Note that by choosing x = 0, this formula provides confidence limits for \beta_0,
though this parameter is rarely of independent practical interest.)

Prediction limits for an additional observation y at a given value x are

(b_0 + b_1 x) \pm t s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}

People sometimes call s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} the prediction
standard error for predicting an individual y and use the symbol SE_{y_{new}} for it. (JMP calls this
Std Error of Individual.) It is on occasion useful to notice that

SE_{y_{new}} = \sqrt{s^2 + SE_{\hat{y}}^2}

Using the SE_{y_{new}} notation, prediction limits for an additional y at x are

\hat{y} \pm t SE_{y_{new}}
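(Continuing the illustrative sketch, the block below computes 95% versions of each of these intervals with scipy; the particular value x = 2.5 is an assumption chosen for the example.)

    # Continuing the sketch (hypothetical data): 95% intervals from Section 1.4.
    from scipy import stats

    t_q = stats.t.ppf(0.975, df=n - 2)
    Sxx = np.sum((x - x.mean()) ** 2)

    # confidence limits for sigma
    chi2_up = stats.chi2.ppf(0.975, df=n - 2)
    chi2_lo = stats.chi2.ppf(0.025, df=n - 2)
    ci_sigma = (s * ((n - 2) / chi2_up) ** 0.5, s * ((n - 2) / chi2_lo) ** 0.5)

    # confidence limits for beta_1
    SE_b1 = s / Sxx ** 0.5
    ci_beta1 = (b1 - t_q * SE_b1, b1 + t_q * SE_b1)

    # confidence limits for the mean of y, and prediction limits, at x = 2.5
    x0 = 2.5
    y0 = b0 + b1 * x0
    SE_fit = s * (1 / n + (x0 - x.mean()) ** 2 / Sxx) ** 0.5      # SE of fitted mean
    SE_new = (s2 + SE_fit ** 2) ** 0.5                            # prediction SE
    ci_mean = (y0 - t_q * SE_fit, y0 + t_q * SE_fit)
    pred_limits = (y0 - t_q * SE_new, y0 + t_q * SE_new)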
1.5 Hypothesis Tests and SLR
The normal simple linear regression model supports hypothesis testing.
H_0: \beta_1 = # can be tested using the test statistic

T = \frac{b_1 - \#}{s / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} = \frac{b_1 - \#}{SE_{b_1}}

and a t_{n-2} reference distribution. H_0: \mu_{y|x} = # can be tested using the test
statistic

T = \frac{(b_0 + b_1 x) - \#}{s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} = \frac{\hat{y} - \#}{SE_{\hat{y}}}

and a t_{n-2} reference distribution.
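(Continuing the illustrative sketch, the t test of H_0: \beta_1 = 0 looks as below; the two-sided p-value uses the t_{n-2} reference distribution.)

    # Continuing the sketch (hypothetical data): the test of H0: beta_1 = 0.
    T = (b1 - 0.0) / SE_b1
    p_value = 2 * stats.t.sf(abs(T), df=n - 2)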
1.6 ANOVA and SLR
The breaking down of SSTot into SSR and SSE can be thought of as a kind
of "analysis of variance" in y. That enterprise is often summarized in a special
kind of table. The general form is as below.

ANOVA Table (for SLR)
Source       SS      df       MS                    F
Regression   SSR     1        MSR = SSR/1           F = MSR/MSE
Error        SSE     n - 2    MSE = SSE/(n - 2)
Total        SSTot   n - 1
In this table the ratios of sums of squares to degrees of freedom are called
"mean squares." The mean square for error is, in fact, the estimate of \sigma^2 (i.e.
MSE = s^2).

As it turns out, the ratio in the "F" column can be used as a test statistic
for the hypothesis H_0: \beta_1 = 0. The appropriate reference distribution is the
F_{1, n-2} distribution. As it turns out, the value F = MSR/MSE is the square
of the t statistic for testing this hypothesis, and the F test produces exactly the
same p-value as a two-sided t test.
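(Continuing the illustrative sketch, the ANOVA mean squares and the overall F test are computed as below.)

    # Continuing the sketch (hypothetical data): the SLR ANOVA F statistic.
    MSR = SSR / 1
    MSE = SSE / (n - 2)            # equals s2
    F = MSR / MSE                  # equals T**2 from the t test of beta_1 = 0
    p_value_F = stats.f.sf(F, 1, n - 2)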
1.7 Standardized Residuals and SLR

The theoretical variances of the residuals turn out to depend upon their
corresponding x values. As a means of putting these residuals all on the same
footing, it is common to "standardize" them by dividing by an estimated standard
deviation for each. This produces standardized residuals

e_i^* = \frac{e_i}{s \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}}

These (if the normal simple linear regression model is a good one) "ought" to
look as if they are approximately normal with mean 0 and standard deviation
1. Various kinds of plotting with these standardized residuals (or with the raw
residuals) are used as means of "model checking" or "model diagnostics."
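(Continuing the illustrative sketch, the standardized residuals are computed as below.)

    # Continuing the sketch (hypothetical data): SLR standardized residuals.
    leverage = 1 / n + (x - x.mean()) ** 2 / Sxx
    e_std = e / (s * np.sqrt(1 - leverage))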
2 Multiple Linear Regression

2.1 Multiple Linear Regression Model
The basic (normal) "multiple linear regression" model says that a response/
output variable y depends on explanatory/input/system variables x_1, x_2, ..., x_k
in a "noisy but linear" way. That is, one supposes that there is a linear
relationship between x_1, x_2, ..., x_k and mean y,

\mu_{y|x_1, x_2, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k

and that (for fixed x_1, x_2, ..., x_k) there is around that mean a distribution
of y that is normal. Further, the standard assumption is that the standard
deviation of the response distribution is constant in x_1, x_2, ..., x_k. In symbols
it is standard to write

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon

where \epsilon is normal with mean 0 and standard deviation \sigma. This describes one
y. Where several observations y_i with corresponding values x_{1i}, x_{2i}, ..., x_{ki}
are under consideration, the assumption is that the y_i (the \epsilon_i) are independent.
(The \epsilon_i are conceptually equivalent to unrelated random draws from the same
fixed normal continuous distribution.) The model statement in its full glory is
then

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i   for i = 1, 2, ..., n

\epsilon_i  for i = 1, 2, ..., n  are independent normal (0, \sigma^2) random variables
The model statement above is a perfectly theoretical matter. One can begin
with it and, for specific choices of \beta_0, \beta_1, \beta_2, ..., \beta_k and \sigma, find probabilities for
y at given values of x_1, x_2, ..., x_k. In applications, the real mode of operation
is instead to take n data vectors (x_{11}, x_{21}, ..., x_{k1}, y_1), (x_{12}, x_{22}, ..., x_{k2}, y_2), ...,
(x_{1n}, x_{2n}, ..., x_{kn}, y_n) and use them to make inferences about the parameters
\beta_0, \beta_1, \beta_2, ..., \beta_k and \sigma and to make predictions based on the estimates (based
on the empirically fitted model).
2.2 Descriptive Analysis of Approximately Linear (x_1, x_2, ..., x_k, y) Data

Calculus can be invoked to find coefficients \beta_0, \beta_1, \beta_2, ..., \beta_k minimizing the sum
of squared vertical distances from data points in (k + 1)-dimensional space to
a fitted surface, \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}))^2. These "least
squares" values DO NOT have simple formulas (unless one is willing to use
matrix notation). And in particular, one can NOT simply somehow use the
formulas from simple linear regression in this more complicated context. We will
call these minimizing coefficients b_0, b_1, b_2, ..., b_k and need to rely upon JMP to
produce them for us.

It is further common to refer to the value of y on the "least squares surface"
corresponding to x_{1i}, x_{2i}, ..., x_{ki} as a fitted or predicted value

\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki}
Exactly as in SLR, one takes the difference between what is observed (y_i) and
what is "predicted" or "explained" (\hat{y}_i) as a kind of leftover part or "residual"
corresponding to a data value

e_i = y_i - \hat{y}_i
The total sum of squares, SSTot = \sum_{i=1}^n (y_i - \bar{y})^2, is (still) most of the
sample variance of the n values y_i and measures raw variation in the response
variable. Just as in SLR, the sum of squared residuals \sum_{i=1}^n e_i^2 is a measure
of variation in response remaining unaccounted for after fitting the equation to
the data. As in SLR, people call it the error sum of squares and write

SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2

(The formula looks exactly like the one for SLR. It is simply the case that now
\hat{y}_i is computed using all k inputs, not just a single x.) One is still guaranteed
that SSTot \geq SSE. So the difference SSTot - SSE is a non-negative measure
of variation accounted for in fitting the linear equation to the data. As in SLR,
people call it the regression sum of squares and write

SSR = SSTot - SSE
The coefficient of (multiple) determination expresses SSR as a fraction of
SSTot and is

R^2 = \frac{SSR}{SSTot}

which is interpreted as "the fraction of raw variation in y accounted for in the
model fitting process." This quantity can also be interpreted in terms of a
correlation, as it turns out to be the square of the sample linear correlation
between the observations y_i and the fitted or predicted values \hat{y}_i.
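(The outline relies on JMP for these computations. As an illustration of what such software produces, the sketch below fits a k = 2 predictor model to a small made-up data set using numpy's least squares solver; the data values are assumptions chosen purely for the example.)

    # Minimal MLR sketch (hypothetical data, k = 2): least squares coefficients,
    # fitted values, residuals, and sums of squares.
    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
    y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 10.9])

    X = np.column_stack([np.ones_like(x1), x1, x2])    # model matrix with intercept
    b, *_ = np.linalg.lstsq(X, y, rcond=None)          # b0, b1, b2
    y_hat = X @ b                                      # fitted values
    e = y - y_hat                                      # residuals
    SSTot = np.sum((y - y.mean()) ** 2)
    SSE = np.sum(e ** 2)
    SSR = SSTot - SSE
    R2 = SSR / SSTot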
2.3 Parameter Estimates for MLR

The descriptive statistics for (x_1, x_2, ..., x_k, y) data can be used to provide
"single number estimates" of the (typically unknown) parameters of the multiple
linear regression model. That is, the least squares coefficients b_0, b_1, b_2, ..., b_k
serve as estimates of the parameters \beta_0, \beta_1, \beta_2, ..., \beta_k. The first of these is a
kind of high-dimensional "intercept" and, in the case where the predictors are
not functionally related, the others serve as rates of change of average y with
respect to a single x, provided the other x's are held fixed.

The variance of y for a fixed set of values x_1, x_2, ..., x_k can be estimated by
a kind of average of squared residuals

s^2 = \frac{1}{n - k - 1} \sum_{i=1}^n e_i^2 = \frac{SSE}{n - k - 1}

The square root of this "regression sample variance" is s = \sqrt{s^2} and serves as a
single number estimate of \sigma.
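(Continuing the illustrative MLR sketch, the estimate of \sigma now uses n - k - 1 degrees of freedom.)

    # Continuing the MLR sketch (hypothetical data, k = 2): the estimate of sigma.
    n, k = len(y), 2
    s2 = SSE / (n - k - 1)
    s = s2 ** 0.5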
2.4 Interval-Based Inference Methods for MLR
The normal multiple linear regression model provides inference formulas for
model parameters. Confidence limits for \sigma are

s \sqrt{\frac{n-k-1}{\chi^2_{upper}}}   and   s \sqrt{\frac{n-k-1}{\chi^2_{lower}}}

where \chi^2_{upper} and \chi^2_{lower} are upper and lower percentage points of the
\chi^2 distribution with \nu = n - k - 1 degrees of freedom. Confidence limits for \beta_j (the
rate of change of average y with respect to x_j with other x's held fixed) are

b_j \pm t SE_{b_j}

where t is a quantile of the t distribution with \nu = n - k - 1 degrees of freedom
and SE_{b_j} is a standard error of b_j. There is no simple formula for SE_{b_j}, and
in particular, one can NOT simply somehow use the formula from simple linear
regression in this more complicated context. It IS the case that this standard
error is a multiple of s, but we will have to rely upon JMP to provide it for us.

Confidence limits for \mu_{y|x_1, x_2, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k
(the mean value of y at a particular choice of the x_1, x_2, ..., x_k) are

\hat{y} \pm t SE_{\hat{y}}

where SE_{\hat{y}} depends upon which values of x_1, x_2, ..., x_k are under consideration.
There is no simple formula for SE_{\hat{y}} and in particular one can NOT simply
somehow use the formula from simple linear regression in this more complicated
context. It IS the case that this standard error is a multiple of s, but we will
have to rely upon JMP to provide it for us.

Prediction limits for an additional observation y at a given vector (x_1, x_2, ..., x_k)
are

\hat{y} \pm t SE_{y_{new}}

where it remains true (as in SLR) that SE_{y_{new}} = \sqrt{s^2 + SE_{\hat{y}}^2} but otherwise there
is no simple formula for it, and in particular one can NOT simply somehow use
the formula from simple linear regression in this more complicated context.
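(The outline defers these standard errors to JMP. For readers comfortable with matrix notation, the continuation below illustrates the usual formulas SE_{b_j} = s \sqrt{[(X'X)^{-1}]_{jj}} and SE_{\hat{y}} = s \sqrt{x_0' (X'X)^{-1} x_0}; this is only a sketch of what the software computes, and the particular point (x_1, x_2) = (2.5, 1.5) is an assumption for the example.)

    # Continuing the MLR sketch (hypothetical data): standard errors and 95%
    # intervals from the matrix-notation formulas.
    from scipy import stats

    XtX_inv = np.linalg.inv(X.T @ X)
    SE_b = s * np.sqrt(np.diag(XtX_inv))           # standard errors of b0, b1, b2

    t_q = stats.t.ppf(0.975, df=n - k - 1)
    ci_betas = np.column_stack([b - t_q * SE_b, b + t_q * SE_b])

    x0 = np.array([1.0, 2.5, 1.5])                 # 1 (intercept), x1 = 2.5, x2 = 1.5
    y0 = x0 @ b
    SE_fit = s * np.sqrt(x0 @ XtX_inv @ x0)        # SE of the fitted mean
    SE_new = np.sqrt(s2 + SE_fit ** 2)             # prediction SE
    ci_mean = (y0 - t_q * SE_fit, y0 + t_q * SE_fit)
    pred_limits = (y0 - t_q * SE_new, y0 + t_q * SE_new)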
2.5 Hypothesis Tests and MLR

The normal multiple linear regression model supports hypothesis testing.
H_0: \beta_j = # can be tested using the test statistic

T = \frac{b_j - \#}{SE_{b_j}}

and a t_{n-k-1} reference distribution. H_0: \mu_{y|x_1, x_2, \ldots, x_k} = # can be tested using
the test statistic

T = \frac{\hat{y} - \#}{SE_{\hat{y}}}

and a t_{n-k-1} reference distribution.
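(Continuing the illustrative MLR sketch, t statistics and two-sided p-values for the individual hypotheses H_0: \beta_j = 0 follow directly.)

    # Continuing the MLR sketch (hypothetical data): tests of H0: beta_j = 0.
    T = b / SE_b
    p_values = 2 * stats.t.sf(np.abs(T), df=n - k - 1)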
2.6 ANOVA and MLR
As in SLR, the breaking down of SSTot into SSR and SSE can be thought of
as a kind of "analysis of variance" in y, and be summarized in a special kind of
table. The general form for MLR is as below.

ANOVA Table (for MLR Overall F Test)
Source       SS      df          MS                        F
Regression   SSR     k           MSR = SSR/k               F = MSR/MSE
Error        SSE     n - k - 1   MSE = SSE/(n - k - 1)
Total        SSTot   n - 1

(Note that as in SLR, the mean square for error is, in fact, the estimate of \sigma^2
(i.e. MSE = s^2).)

As it turns out, the ratio in the "F" column can be used as a test statistic
for the hypothesis H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0. The appropriate reference
distribution is the F_{k, n-k-1} distribution.
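(Continuing the illustrative MLR sketch, the overall F test is computed as below.)

    # Continuing the MLR sketch (hypothetical data): overall F test of
    # H0: beta_1 = ... = beta_k = 0.
    MSR = SSR / k
    MSE = SSE / (n - k - 1)        # equals s2
    F = MSR / MSE
    p_value_F = stats.f.sf(F, k, n - k - 1)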
2.7 "Partial F Tests" in MLR

It is possible to use ANOVA ideas to invent F tests for investigating whether
some whole group of \beta's (short of the entire set) are all 0. For example, one
might want to test the hypothesis

H_0: \beta_{l+1} = \beta_{l+2} = \cdots = \beta_k = 0

(This is the hypothesis that only the first l of the k input variables have any
impact on the mean system response ... the hypothesis that the first l of the
x's are adequate to predict y ... the hypothesis that after accounting for the
first l of the x's, the others do not contribute "significantly" to one's ability to
explain or predict y.)
If we call the model for y in terms of all k of the predictors the "full model"
and the model for y involving only x_1 through x_l the "reduced model," then an
F test of the above hypothesis can be made using the statistic

F = \frac{(SSR_{Full} - SSR_{Reduced}) / (k - l)}{MSE_{Full}} = \frac{n - k - 1}{k - l} \cdot \frac{R^2_{Full} - R^2_{Reduced}}{1 - R^2_{Full}}

and an F_{k-l, n-k-1} reference distribution. SSR_{Full} \geq SSR_{Reduced}, so the
numerator here is non-negative.

Finding a p-value for this kind of test is a means of judging whether R^2 for the
full model is "significantly"/detectably larger than R^2 for the reduced model.
(Caution here: statistical significance is not the same as practical importance.
With a big enough data set, essentially any increase in R^2 will produce a small
p-value.) It is reasonably common to expand the basic MLR ANOVA table to
organize calculations for this test statistic. This is
(Expanded) ANOVA Table (for MLR)
Source                               SS                   df          MS                                F
Regression                           SSR_Full             k           MSR_Full = SSR_Full/k             F = MSR_Full/MSE_Full
  x_1, ..., x_l                      SSR_Red              l
  x_{l+1}, ..., x_k | x_1, ..., x_l  SSR_Full - SSR_Red   k - l       (SSR_Full - SSR_Red)/(k - l)      [(SSR_Full - SSR_Red)/(k - l)] / MSE_Full
Error                                SSE_Full             n - k - 1   MSE_Full = SSE_Full/(n - k - 1)
Total                                SSTot                n - 1
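(Continuing the illustrative MLR sketch, the partial F test below asks whether x_2 contributes beyond x_1, i.e. l = 1 and k = 2; the reduced model is refit with numpy.)

    # Continuing the MLR sketch (hypothetical data): partial F test of
    # H0: beta_2 = 0, comparing the full model to the reduced model with x1 only.
    X_red = X[:, :2]                                    # intercept and x1 columns
    b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)
    SSE_red = np.sum((y - X_red @ b_red) ** 2)
    SSR_red = SSTot - SSE_red

    l = 1
    F_partial = ((SSR - SSR_red) / (k - l)) / MSE
    p_partial = stats.f.sf(F_partial, k - l, n - k - 1)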
2.8 Standardized Residuals in MLR
As in SLR, people sometimes wish to standardize residuals before using them
to do model checking/diagnostics. While it is not possible to give a simple
formula for the "standard error of e_i" without using matrix notation, most
MLR programs will compute these values. The standardized residual for data
point i is then (as in SLR)

e_i^* = \frac{e_i}{standard error of e_i}

If the normal multiple linear regression model is a good one, these "ought" to
look as if they are approximately normal with mean 0 and standard deviation 1.
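(Continuing the illustrative MLR sketch, the "matrix notation" route alluded to above uses the leverages h_{ii} from the hat matrix X(X'X)^{-1}X'; this is an illustration of the kind of computation MLR programs carry out, not a formula from the outline.)

    # Continuing the MLR sketch (hypothetical data): standardized residuals
    # from hat matrix leverages.
    H = X @ XtX_inv @ X.T              # hat matrix
    h = np.diag(H)                     # leverages h_ii
    e_std = e / (s * np.sqrt(1 - h))   # standardized residuals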
2.9 Intervals and Tests for Linear Combinations of \beta's in MLR
It is sometimes important to do inference for a linear combination of MLR
model coefficients

L = c_0 \beta_0 + c_1 \beta_1 + c_2 \beta_2 + \cdots + c_k \beta_k

(where c_0, c_1, ..., c_k are known constants). Note, for example, that
\mu_{y|x_1, x_2, \ldots, x_k} is of this form for c_0 = 1, c_1 = x_1, c_2 = x_2, ..., and c_k = x_k.
Note too that a difference in mean responses at two sets of predictors, say
(x_1, x_2, ..., x_k) and (x_1', x_2', ..., x_k'), is of this form for c_0 = 0,
c_1 = x_1 - x_1', c_2 = x_2 - x_2', ..., and c_k = x_k - x_k'.

An obvious estimate of L is

\hat{L} = c_0 b_0 + c_1 b_1 + c_2 b_2 + \cdots + c_k b_k
Confidence limits for L are

\hat{L} \pm t (standard error of \hat{L})

There is no simple formula for the "standard error of \hat{L}". This standard error
is a multiple of s, but we will have to rely upon JMP to provide it for us.
(Computation of \hat{L} and its standard error is under the "Custom Test" option
in JMP.)
H_0: L = # can be tested using the test statistic

T = \frac{\hat{L} - \#}{standard error of \hat{L}}

and a t_{n-k-1} reference distribution. Or, if one thinks about it for a while, it
is possible to find a reduced model that corresponds to the restriction that the
null hypothesis places on the MLR model and to use an "l = k - 1" partial F
test (with 1 and n - k - 1 degrees of freedom) equivalent to the t test for this
purpose.
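(Continuing the illustrative MLR sketch, the block below carries out this inference for one such linear combination, the difference in mean responses between (x_1, x_2) = (3, 2) and (x_1, x_2) = (1, 1); the matrix-notation standard error s \sqrt{c'(X'X)^{-1} c} stands in for the value that software such as JMP would report.)

    # Continuing the MLR sketch (hypothetical data): inference for
    # L = c0*beta_0 + c1*beta_1 + c2*beta_2 with c = (0, 3 - 1, 2 - 1).
    c = np.array([0.0, 2.0, 1.0])
    L_hat = c @ b
    SE_L = s * np.sqrt(c @ XtX_inv @ c)
    ci_L = (L_hat - t_q * SE_L, L_hat + t_q * SE_L)
    T_L = L_hat / SE_L                                 # test of H0: L = 0
    p_L = 2 * stats.t.sf(abs(T_L), df=n - k - 1)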