Stat 328 Formulas

Quantiles
Roughly speaking, the p quantile of a distribution is a number Q(p) such that a fraction p of the distribution is to the left and a fraction (1 − p) is to the right of the number. There are various possible conventions for making this notion precise for an empirical distribution/data set. One is that for ordered data values y₁ ≤ y₂ ≤ … ≤ y_n the ith ordered value is the (i − .5)/n quantile, i.e.

y_i = Q((i − .5)/n)

(and other quantiles are gotten by interpolation).
A particularly important quantile is Q(.5), the distribution median. This is a number that puts half of the distribution to its left and half to its right.
Mean, Variance, and Standard Deviation
For a data set y₁, y₂, …, y_n the sample mean is

ȳ = (1/n) Σ_{i=1}^n y_i

and if the data set constitutes an entire population of interest, the population mean is

μ = (1/N) Σ_{i=1}^N y_i

Further, the sample variance is

s² = (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ)²

and the corresponding population variance is

σ² = (1/N) Σ_{i=1}^N (y_i − μ)²

The square root of the variance is in the original units and is called the standard deviation.
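As a small numerical illustration of these formulas (a sketch assuming Python with NumPy, with made-up data):

    import numpy as np

    y = np.array([9.8, 10.1, 10.3, 9.9, 10.4])     # hypothetical data set

    n = len(y)
    ybar = y.sum() / n                              # sample mean
    s2 = ((y - ybar) ** 2).sum() / (n - 1)          # sample variance (divisor n - 1)
    s = s2 ** 0.5                                   # sample standard deviation

    # np.var/np.std with ddof=1 give the same sample versions
    assert np.isclose(s2, np.var(y, ddof=1))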
Inference Based on the Extremes of a Single Sample
If one adopts a probability model that says that observations are independent random draws from
a fixed continuous distribution/population/universe it is easy to see how to make some simple
inferences based on the sample extremes. For data y₁ ≤ y₂ ≤ … ≤ y_n, the interval (y₁, y_n) can serve as an interval meant to bracket the distribution median or as an interval meant to bracket a single additional value drawn from the distribution.
As a confidence interval for Q(.5), the interval (y₁, y_n) has associated confidence/reliability

1 − 2(.5)^n

As a prediction interval for y_{n+1} (a single additional observation from this distribution) the appropriate associated confidence/reliability is

1 − 2/(n + 1)
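For example, with n = 10 these two reliabilities can be evaluated directly (a quick sketch in Python):

    n = 10
    conf_median = 1 - 2 * (0.5) ** n      # confidence that (y_min, y_max) brackets Q(.5)
    conf_prediction = 1 - 2 / (n + 1)     # confidence that (y_min, y_max) brackets y_(n+1)
    print(conf_median)                    # 0.998...
    print(conf_prediction)                # 0.818...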
Calculation of Normal Probabilities
Areas under the normal curve with μ = 0 and σ = 1 ("standard normal probabilities'' between 0 and z) are given in Table 1, Appendix C, page 812 of the text. The symmetry of the standard normal curve about 0 then allows one to find arbitrary standard normal probabilities using the table.
Areas under the normal curve with mean μ and standard deviation σ are available by converting values on the original scale to values on the "z scale'' via

z = (y − μ)/σ

and then using the standard normal table. Many handheld calculators will compute normal probabilities.
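Software will also evaluate these areas directly in place of the tables. A minimal sketch, assuming Python with SciPy is available (the mean, standard deviation and value below are made up for illustration):

    from scipy.stats import norm

    mu, sigma = 100.0, 15.0                    # hypothetical mean and standard deviation
    y = 120.0
    z = (y - mu) / sigma                       # convert to the z scale
    print(norm.cdf(z))                         # P(Y <= 120) = P(Z <= 1.33...) ~ 0.909
    print(norm.cdf(120, loc=100, scale=15))    # same area without the hand conversion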
Normal Plotting
It is extremely useful to be able to model a data-generating mechanism as "independent random
draws from a normal distribution.'' A means of investigating the extent to which this is sensible is
to make a so-called "normal plot'' of a data set. This is a plot that facilitates comparison of data
quantiles and normal quantiles. If these are roughly linearly related, the normal model is
plausible. If they are not, the normal model is implausible.
For data y₁ ≤ y₂ ≤ … ≤ y_n and Q_z(p) the standard normal quantile function (Q_z(p) is the number that puts standard normal probability p to its left), a normal plot can be made by plotting the points

(y_i, Q_z((i − .5)/n))
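Such a plot is easy to construct in software. A minimal sketch, assuming Python with NumPy, SciPy and matplotlib (the data values are hypothetical):

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    y = np.sort(np.array([4.2, 5.1, 3.8, 4.9, 5.6, 4.4, 5.0]))   # ordered data
    n = len(y)
    p = (np.arange(1, n + 1) - 0.5) / n       # (i - .5)/n plotting positions
    qz = norm.ppf(p)                          # standard normal quantiles Q_z(p)

    plt.plot(y, qz, "o")                      # roughly linear => normal model plausible
    plt.xlabel("data quantiles"); plt.ylabel("standard normal quantiles")
    plt.show()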
One- and Two-Sample Intervals for Normal Distributions
If it is plausible to think of y₁, y₂, …, y_n as random draws from a single normal distribution, then probability theory can be invoked to support quantitative inference. Several useful estimation and prediction results (based on theory for derived distributions of appropriate statistics) are as follows.
Confidence limits for the distribution mean μ are

ȳ ± t s/√n

where t is a quantile of the so-called "t distribution with ν = n − 1 degrees of freedom.'' (The t distributions are tabled in Table 2, Appendix C, page 813 of the text.) Prediction limits for y_{n+1}, a single additional observation from the distribution, are

ȳ ± t s √(1 + 1/n)
Confidence limits for the distribution standard deviation σ are

s √((n − 1)/χ²_upper)  and  s √((n − 1)/χ²_lower)

where χ²_upper and χ²_lower are upper and lower quantiles of the so-called "χ² distribution with ν = n − 1 degrees of freedom.'' (The χ² distributions are tabled in Table 10, Appendix C, pages 827 and 828 of the text.)
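A sketch of 95% two-sided versions of these one-sample limits, assuming Python with NumPy and SciPy (the data are hypothetical):

    import numpy as np
    from scipy.stats import t, chi2

    y = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])
    n, ybar, s = len(y), y.mean(), y.std(ddof=1)

    tq = t.ppf(0.975, df=n - 1)                                  # t quantile, nu = n - 1
    ci_mu = (ybar - tq * s / np.sqrt(n), ybar + tq * s / np.sqrt(n))
    pi_new = (ybar - tq * s * np.sqrt(1 + 1 / n),
              ybar + tq * s * np.sqrt(1 + 1 / n))                # prediction limits

    chi2_up, chi2_lo = chi2.ppf(0.975, df=n - 1), chi2.ppf(0.025, df=n - 1)
    ci_sigma = (s * np.sqrt((n - 1) / chi2_up), s * np.sqrt((n - 1) / chi2_lo))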
If it is plausible to think of two samples as independently drawn from two normal distributions
with a common variance (this might be investigated by normal plotting both samples on a single
set of axes, looking for two linear plots with comparable slopes), probability theory can again be
invoked to support quantitative inferences for the difference in means. That is, for
s_P = √(((n₁ − 1)s₁² + (n₂ − 1)s₂²)/((n₁ − 1) + (n₂ − 1)))

a "pooled estimate'' of σ (based on a weighted average of the two sample variances), confidence limits for μ₁ − μ₂ are

ȳ₁ − ȳ₂ ± t s_P √(1/n₁ + 1/n₂)

where t is a quantile of the t_{n₁+n₂−2} distribution.
If one drops the equal variance assumption (but maintains that two samples are independently
drawn from two normal distributions) probability theory can again be invoked to support
quantitative inferences for the ratio of standard deviations. That is, confidence limits for σ₁/σ₂ are

(s₁/s₂) · 1/√(F_{n₂−1,n₁−1})  and  (s₁/s₂) · √(F_{n₁−1,n₂−1})

where F_{ν₁,ν₂} is an upper quantile of the so-called "F distribution with ν₁ numerator degrees of freedom and ν₂ denominator degrees of freedom.'' (The F distributions are tabled in Tables 3-6, Appendix C, pages 814-821 of the text.)
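A sketch of the pooled-t limits for μ₁ − μ₂ and the F-based limits for σ₁/σ₂, assuming Python with NumPy and SciPy (the two samples are made up):

    import numpy as np
    from scipy.stats import t, f

    y1 = np.array([12.1, 11.8, 12.6, 12.0, 11.9])
    y2 = np.array([11.2, 11.5, 10.9, 11.4, 11.1, 11.3])
    n1, n2 = len(y1), len(y2)
    s1, s2 = y1.std(ddof=1), y2.std(ddof=1)

    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 - 1 + n2 - 1))
    tq = t.ppf(0.975, df=n1 + n2 - 2)
    half = tq * sp * np.sqrt(1 / n1 + 1 / n2)
    ci_diff = (y1.mean() - y2.mean() - half, y1.mean() - y2.mean() + half)

    # 95% limits for sigma1/sigma2 using upper F quantiles
    ci_ratio = ((s1 / s2) / np.sqrt(f.ppf(0.975, n2 - 1, n1 - 1)),
                (s1 / s2) * np.sqrt(f.ppf(0.975, n1 - 1, n2 - 1)))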
One- and Two-Sample Testing for Normal Distributions
If it is plausible to think of y₁, y₂, …, y_n as random draws from a single normal distribution, then probability theory can be invoked to support hypothesis testing. That is, H₀: μ = # (for some hypothesized value #) can be tested using the test statistic

T = (ȳ − #)/(s/√n)

and a t_{n−1} reference distribution. Further, H₀: σ = # can be tested using the test statistic

X² = (n − 1)s²/#²

and a χ²_{n−1} reference distribution.
If it is plausible to think of two samples as independently drawn from two normal distributions
with a common variance, probability theory can again be invoked to support testing for the
difference in means. That is, H₀: μ₁ − μ₂ = # can be tested using the test statistic

T = (ȳ₁ − ȳ₂ − #)/(s_P √(1/n₁ + 1/n₂))

and a t_{n₁+n₂−2} reference distribution.
And if one drops the equal variance assumption (but maintains that two samples are independently drawn from two normal distributions) probability theory can again be invoked to support testing for the ratio of standard deviations. That is, H₀: σ₁/σ₂ = 1 can be tested using the statistic

F = s₁²/s₂²

and an F_{n₁−1,n₂−1} reference distribution.
Simple Linear Regression Model
The basic (normal) "simple linear regression" model says that a response/output variable y depends on an explanatory/input/system variable x in a "noisy but linear" way. That is, one supposes that there is a linear relationship between x and mean y,

μ_{y|x} = β₀ + β₁x

and that (for fixed x) there is around that mean a distribution of y that is normal. Further, the standard assumption is that the standard deviation of the response distribution is constant in x. In symbols it is standard to write

y = β₀ + β₁x + ε

where ε is normal with mean 0 and standard deviation σ. This describes one y. Where several observations y_i with corresponding values x_i are under consideration, the assumption is that the y_i (the ε_i) are independent. (The ε_i are conceptually equivalent to unrelated random draws from the same fixed normal continuous distribution.) The model statement in its full glory is then

y_i = β₀ + β₁x_i + ε_i  for i = 1, 2, …, n
ε_i for i = 1, 2, …, n are independent normal (0, σ²) random variables

The model statement above is a perfectly theoretical matter. One can begin with it, and for specific choices of β₀, β₁ and σ find probabilities for y at given values of x. In applications, the real mode of operation is instead to take n data pairs (x₁, y₁), (x₂, y₂), …, (x_n, y_n) and use them to make inferences about the parameters β₀, β₁ and σ and to make predictions based on the estimates (based on the empirically fitted model).
Descriptive Analysis of Approximately Linear (x, y) Data
After plotting (x, y) data to determine that the "linear in x mean of y" model makes some sense, it is reasonable to try to quantify "how linear" the data look and to find a line of "best fit" to the scatterplot of the data. The sample correlation between y and x

r = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √(Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)²)

is a measure of strength of linear relationship between x and y.
Calculus can be invoked to find a slope β₁ and intercept β₀ minimizing the sum of squared vertical distances from data points to a fitted line, Σ_{i=1}^n (y_i − (β₀ + β₁x_i))². These "least squares" values are

b₁ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²

and

b₀ = ȳ − b₁x̄
It is further common to refer to the value of y on the "least squares line" corresponding to x_i as a fitted or predicted value

ŷ_i = b₀ + b₁x_i

One might take the difference between what is observed (y_i) and what is "predicted" or "explained" (ŷ_i) as a kind of leftover part or "residual" corresponding to a data value

e_i = y_i − ŷ_i
The sum Σ_{i=1}^n (y_i − ȳ)² is most of the sample variance of the n values y_i. It is a measure of raw variation in the response variable. People often call it the "total sum of squares" and write

SSTot = Σ_{i=1}^n (y_i − ȳ)²

The sum of squared residuals Σ_{i=1}^n e_i² is a measure of variation in response remaining unaccounted for after fitting a line to the data. People often call it the "error sum of squares" and write

SSE = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)²

One is guaranteed that SSTot ≥ SSE. So the difference SSTot − SSE is a non-negative measure of variation accounted for in fitting a line to (x, y) data. People often call it the "regression sum of squares" and write

SSR = SSTot − SSE

The coefficient of determination expresses SSR as a fraction of SSTot and is

R² = SSR/SSTot

which is interpreted as "the fraction of raw variation in y accounted for in the model fitting process."
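A sketch of this least squares and sums-of-squares arithmetic, assuming Python with NumPy (the (x, y) data are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2, 5.8])

    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()

    yhat = b0 + b1 * x                   # fitted values
    e = y - yhat                         # residuals
    SSTot = ((y - y.mean()) ** 2).sum()
    SSE = (e ** 2).sum()
    SSR = SSTot - SSE
    R2 = SSR / SSTot
    r = np.corrcoef(x, y)[0, 1]          # sample correlation; r**2 equals R2 here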
Parameter Estimates for SLR
The descriptive statistics for (x, y) data can be used to provide "single number estimates" of the (typically unknown) parameters of the simple linear regression model. That is, the slope of the least squares line can serve as an estimate of β₁,

β̂₁ = b₁

and the intercept of the least squares line can serve as an estimate of β₀,

β̂₀ = b₀

The variance of y for a given x can be estimated by a kind of average of squared residuals

s² = (1/(n − 2)) Σ_{i=1}^n e_i² = SSE/(n − 2)

Of course, the square root of this "regression sample variance" is s = √s² and serves as a single number estimate of σ.
Interval-Based Inference Methods for SLR
The normal simple linear regression model provides inference formulas for model parameters.
Confidence limits for σ are

s √((n − 2)/χ²_upper)  and  s √((n − 2)/χ²_lower)

where χ²_upper and χ²_lower are upper and lower quantiles of the χ² distribution with ν = n − 2 degrees of freedom. And confidence limits for β₁ (the slope of the line relating mean y to x ... the rate of change of average y with respect to x) are

b₁ ± t s/√(Σ_{i=1}^n (x_i − x̄)²)

where t is a quantile of the t distribution with ν = n − 2 degrees of freedom.
Confidence limits for μ_{y|x} = β₀ + β₁x (the mean value of y at a given value x) are

(b₀ + b₁x) ± t s √(1/n + (x − x̄)²/Σ_{i=1}^n (x_i − x̄)²)

(Note that by choosing x = 0, this formula provides confidence limits for β₀, though this parameter is rarely of independent practical interest.)
Prediction limits for an additional observation y at a given value x are

(b₀ + b₁x) ± t s √(1 + 1/n + (x − x̄)²/Σ_{i=1}^n (x_i − x̄)²)
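A sketch of 95% confidence limits for the mean response and prediction limits at a new value x₀, assuming Python with NumPy and SciPy (the data and the value x₀ = 3.5 are hypothetical):

    import numpy as np
    from scipy.stats import t

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2, 5.8])
    n = len(x)

    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(((y - (b0 + b1 * x)) ** 2).sum() / (n - 2))
    Sxx = ((x - x.mean()) ** 2).sum()
    tq = t.ppf(0.975, df=n - 2)

    x0 = 3.5
    fit = b0 + b1 * x0
    se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
    ci_mean = (fit - tq * se_mean, fit + tq * se_mean)
    se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
    pi_new = (fit - tq * se_pred, fit + tq * se_pred)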
Hypothesis Tests and SLR
The normal simple linear regression model supports hypothesis testing. H₀: β₁ = # can be tested using the test statistic

T = (b₁ − #)/(s/√(Σ_{i=1}^n (x_i − x̄)²))

and a t_{n−2} reference distribution. H₀: μ_{y|x} = # can be tested using the test statistic

T = ((b₀ + b₁x) − #)/(s √(1/n + (x − x̄)²/Σ_{i=1}^n (x_i − x̄)²))

and a t_{n−2} reference distribution.
ANOVA and SLR
The breaking down of SSTot into SSR and SSE can be thought of as a kind of "analysis of variance" in y. That enterprise is often summarized in a special kind of table. The general form is as below.

ANOVA Table (for SLR)
Source       SS      df       MS                  F
Regression   SSR     1        MSR = SSR/1         F = MSR/MSE
Error        SSE     n − 2    MSE = SSE/(n − 2)
Total        SSTot   n − 1
In this table the ratios of sums of squares to degrees of freedom are called "mean squares." The mean square for error is, in fact, the estimate of σ² (i.e. MSE = s²).
As it turns out, the ratio in the "F" column can be used as a test statistic for the hypothesis H₀: β₁ = 0. The appropriate reference distribution is the F_{1,n−2} distribution. As it turns out, the value F = MSR/MSE is the square of the t statistic for testing this hypothesis, and the F test produces exactly the same p-values as a two-sided t test.
Standardized Residuals and SLR
The theoretical variances of the residuals turn out to depend upon their corresponding x values. As a means of putting these residuals all on the same footing, it is common to "standardize" them by dividing by an estimated standard deviation for each. This produces standardized residuals

e*_i = e_i / (s √(1 − 1/n − (x_i − x̄)²/Σ_{i=1}^n (x_i − x̄)²))

These (if the normal simple linear regression model is a good one) "ought" to look as if they are approximately normal with mean 0 and standard deviation 1. Various kinds of plotting with these standardized residuals (or with the raw residuals) are used as means of "model checking" or "model diagnostics."
Multiple Linear Regression Model
The basic (normal) "multiple linear regression" model says that a response/output variable y depends on explanatory/input/system variables x₁, x₂, …, x_k in a "noisy but linear" way. That is, one supposes that there is a linear relationship between x₁, x₂, …, x_k and mean y,

μ_{y|x₁,x₂,…,x_k} = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_k x_k

and that (for fixed x₁, x₂, …, x_k) there is around that mean a distribution of y that is normal. Further, the standard assumption is that the standard deviation of the response distribution is constant in x₁, x₂, …, x_k. In symbols it is standard to write

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_k x_k + ε

where ε is normal with mean 0 and standard deviation σ. This describes one y. Where several observations y_i with corresponding values x_{1i}, x_{2i}, …, x_{ki} are under consideration, the assumption is that the y_i (the ε_i) are independent. (The ε_i are conceptually equivalent to unrelated random draws from the same fixed normal continuous distribution.) The model statement in its full glory is then

y_i = β₀ + β₁x_{1i} + β₂x_{2i} + ⋯ + β_k x_{ki} + ε_i  for i = 1, 2, …, n
ε_i for i = 1, 2, …, n are independent normal (0, σ²) random variables

The model statement above is a perfectly theoretical matter. One can begin with it, and for specific choices of β₀, β₁, β₂, …, β_k and σ find probabilities for y at given values of x₁, x₂, …, x_k. In applications, the real mode of operation is instead to take n data vectors (x₁₁, x₂₁, …, x_{k1}, y₁), (x₁₂, x₂₂, …, x_{k2}, y₂), …, (x_{1n}, x_{2n}, …, x_{kn}, y_n) and use them to make inferences about the parameters β₀, β₁, β₂, …, β_k and σ and to make predictions based on the estimates (based on the empirically fitted model).
Descriptive Analysis of Approximately Linear (x₁, x₂, …, x_k, y) Data
Calculus can be invoked to find coefficients β₀, β₁, β₂, …, β_k minimizing the sum of squared vertical distances from data points in (k + 1)-dimensional space to a fitted surface, Σ_{i=1}^n (y_i − (β₀ + β₁x_{1i} + β₂x_{2i} + ⋯ + β_k x_{ki}))². These "least squares" values DO NOT have simple formulas (unless one is willing to use matrix notation). And in particular, one can NOT simply somehow use the formulas from simple linear regression in this more complicated context. We will call these minimizing coefficients b₀, b₁, b₂, …, b_k and need to rely upon JMP to produce them for us.
It is further common to refer to the value of y on the "least squares surface" corresponding to x_{1i}, x_{2i}, …, x_{ki} as a fitted or predicted value

ŷ_i = b₀ + b₁x_{1i} + b₂x_{2i} + ⋯ + b_k x_{ki}

Exactly as in SLR, one takes the difference between what is observed (y_i) and what is "predicted" or "explained" (ŷ_i) as a kind of leftover part or "residual" corresponding to a data value

e_i = y_i − ŷ_i
The total sum of squares, SSTot = Σ_{i=1}^n (y_i − ȳ)², is (still) most of the sample variance of the n values y_i and measures raw variation in the response variable. Just as in SLR, the sum of squared residuals Σ_{i=1}^n e_i² is a measure of variation in response remaining unaccounted for after fitting the equation to the data. As in SLR, people call it the error sum of squares and write

SSE = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − ŷ_i)²

(The formula looks exactly like the one for SLR. It is simply the case that now ŷ_i is computed using all k inputs, not just a single x.) One is still guaranteed that SSTot ≥ SSE. So the difference SSTot − SSE is a non-negative measure of variation accounted for in fitting the linear equation to the data. As in SLR, people call it the regression sum of squares and write

SSR = SSTot − SSE

The coefficient of (multiple) determination expresses SSR as a fraction of SSTot and is

R² = SSR/SSTot

which is interpreted as "the fraction of raw variation in y accounted for in the model fitting process." This quantity can also be interpreted in terms of a correlation, as it turns out to be the square of the sample linear correlation between the observations y_i and the fitted or predicted values ŷ_i.
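Although JMP is the tool used in this course, any least squares routine produces the fit. A sketch assuming Python with NumPy (the data are made up, with k = 2 predictors):

    import numpy as np

    # hypothetical data: n = 6 observations on k = 2 predictors
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
    y = np.array([3.1, 3.0, 6.8, 6.9, 10.7, 10.9])

    A = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)       # least squares b0, b1, b2

    yhat = A @ b
    SSE = ((y - yhat) ** 2).sum()
    SSTot = ((y - y.mean()) ** 2).sum()
    R2 = 1 - SSE / SSTot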
Parameter Estimates for MLR
The descriptive statistics for (x₁, x₂, …, x_k, y) data can be used to provide "single number estimates" of the (typically unknown) parameters of the multiple linear regression model. That is, the least squares coefficients b₀, b₁, b₂, …, b_k serve as estimates of the parameters β₀, β₁, β₂, …, β_k. The first of these is a kind of high-dimensional "intercept" and, in the case where the predictors are not functionally related, the others serve as rates of change of average y with respect to a single x, provided the other x's are held fixed.
The variance of y for fixed values x₁, x₂, …, x_k can be estimated by a kind of average of squared residuals

s² = (1/(n − k − 1)) Σ_{i=1}^n e_i² = SSE/(n − k − 1)

The square root of this "regression sample variance" is s = √s² and serves as a single number estimate of σ.
Interval-Based Inference Methods for MLR
The normal multiple linear regression model provides inference formulas for model parameters. Confidence limits for σ are

s √((n − k − 1)/χ²_upper)  and  s √((n − k − 1)/χ²_lower)

where χ²_upper and χ²_lower are upper and lower quantiles of the χ² distribution with ν = n − k − 1 degrees of freedom.
Confidence limits for β_j (the rate of change of average y with respect to x_j) are

b_j ± t (standard error of b_j)

where t is a quantile of the t distribution with ν = n − k − 1 degrees of freedom. There is no simple formula for "standard error of b_j" and in particular, one can NOT simply somehow use the formula from simple linear regression in this more complicated context. It IS the case that this standard error is a multiple of s, but we will have to rely upon JMP to provide it for us.
Confidence limits for μ_{y|x₁,x₂,…,x_k} = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_k x_k (the mean value of y at a particular choice of the x₁, x₂, …, x_k) are

ŷ ± t (standard error of ŷ)

There is no simple formula for "standard error of ŷ" and in particular one can NOT simply somehow use the formula from simple linear regression in this more complicated context. It IS the case that this standard error is a multiple of s, but we will have to rely upon JMP to provide it for us.
Prediction limits for an additional observation y at a given vector (x₁, x₂, …, x_k) are

ŷ ± t √(s² + (standard error of ŷ)²)
Hypothesis Tests and MLR
The normal multiple linear regression model supports hypothesis testing. H₀: β_j = # can be tested using the test statistic

T = (b_j − #)/(standard error of b_j)

and a t_{n−k−1} reference distribution. H₀: μ_{y|x₁,x₂,…,x_k} = # can be tested using the test statistic

T = (ŷ − #)/(standard error of ŷ)

and a t_{n−k−1} reference distribution.
ANOVA and MLR
As in SLR, the breaking down of SSTot into SSR and SSE can be thought of as a kind of "analysis of variance" in y, and summarized in a special kind of table. The general form for MLR is as below.

ANOVA Table (for MLR Overall F Test)
Source       SS      df           MS                        F
Regression   SSR     k            MSR = SSR/k               F = MSR/MSE
Error        SSE     n − k − 1    MSE = SSE/(n − k − 1)
Total        SSTot   n − 1

(Note that as in SLR, the mean square for error is, in fact, the estimate of σ² (i.e. MSE = s²).)
As it turns out, the ratio in the "F" column can be used as a test statistic for the hypothesis H₀: β₁ = β₂ = ⋯ = β_k = 0. The appropriate reference distribution is the F_{k,n−k−1} distribution.
"Partial F Tests" in MLR
It is possible to use ANOVA ideas to invent F tests for investigating whether some whole group
of β's (short of the entire set) are all 0. For example one might want to test the hypothesis

H₀: β_{p+1} = β_{p+2} = ⋯ = β_k = 0

(This is the hypothesis that only the first p of the k input variables x_i have any impact on the mean system response ... the hypothesis that the first p of the x's are adequate to predict y ... the hypothesis that after accounting for the first p of the x's, the others do not contribute "significantly" to one's ability to explain or predict y.)
If we call the model for y in terms of all k of the predictors the "full model" and the model for y involving only x₁ through x_p the "reduced model" then an F test of the above hypothesis can be made using the statistic

F = ((SSR_Full − SSR_Reduced)/(k − p)) / MSE_Full

and an F_{k−p,n−k−1} reference distribution. SSR_Full ≥ SSR_Reduced so the numerator here is non-negative.
Finding a p-value for this kind of test is a means of judging whether R² for the full model is "significantly"/detectably larger than R² for the reduced model. (Caution here: statistical significance is not the same as practical importance. With a big enough data set, essentially any increase in R² will produce a small p-value.) It is reasonably common to expand the basic MLR ANOVA table to organize calculations for this test statistic. This is
(Expanded) ANOVA Table (for MLR)
Source                               SS                    df           MS                                F
Regression                           SSR_Full              k            MSR_Full = SSR_Full/k             F = MSR_Full/MSE_Full
  x₁, …, x_p                         SSR_Red               p
  x_{p+1}, …, x_k | x₁, …, x_p       SSR_Full − SSR_Red    k − p        (SSR_Full − SSR_Red)/(k − p)      ((SSR_Full − SSR_Red)/(k − p))/MSE_Full
Error                                SSE_Full              n − k − 1    MSE_Full = SSE_Full/(n − k − 1)
Total                                SSTot                 n − 1
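A sketch of the partial F computation, assuming Python with NumPy and SciPy (made-up data with k = 3 predictors, testing whether the last k − p = 2 predictors can be dropped):

    import numpy as np
    from scipy.stats import f

    X = np.array([[1.0, 2.0, 0.2], [2.0, 1.0, 1.7], [3.0, 4.0, 0.9], [4.0, 3.0, 2.8],
                  [5.0, 6.0, 1.1], [6.0, 5.0, 3.3], [7.0, 8.0, 2.4], [8.0, 7.0, 0.6]])
    y = np.array([3.0, 3.4, 6.9, 7.2, 10.8, 11.1, 14.6, 15.2])
    n, k, p = len(y), 3, 1            # H0: beta_2 = beta_3 = 0 (keep only x1)

    def sse(Adesign):
        b, *_ = np.linalg.lstsq(Adesign, y, rcond=None)
        return ((y - Adesign @ b) ** 2).sum()

    SSE_full = sse(np.column_stack([np.ones(n), X]))
    SSE_red = sse(np.column_stack([np.ones(n), X[:, :p]]))
    MSE_full = SSE_full / (n - k - 1)

    # SSR_full - SSR_red = SSE_red - SSE_full, since SSTot is fixed
    F = ((SSE_red - SSE_full) / (k - p)) / MSE_full
    p_value = f.sf(F, k - p, n - k - 1)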
Standardized Residuals in MLR
As in SLR, people sometimes wish to standardize residuals before using them to do model checking/diagnostics. While it is not possible to give a simple formula for the "standard error of e_i" without using matrix notation, most MLR programs will compute these values. The standardized residual for data point i is then (as in SLR)

e*_i = e_i / (standard error of e_i)

If the normal multiple linear regression model is a good one these "ought" to look as if they are approximately normal with mean 0 and standard deviation 1.
Intervals and Tests for Linear Combinations of β's in MLR
It is sometimes important to do inference for a linear combination of MLR model coefficients

L = c₀β₀ + c₁β₁ + c₂β₂ + ⋯ + c_kβ_k

(where c₀, c₁, …, c_k are known constants). Note, for example, that μ_{y|x₁,x₂,…,x_k} is of this form for c₀ = 1, c₁ = x₁, c₂ = x₂, …, and c_k = x_k. Note too that a difference in mean responses at two sets of predictors, say (x₁, x₂, …, x_k) and (x′₁, x′₂, …, x′_k), is of this form for c₀ = 0, c₁ = x₁ − x′₁, c₂ = x₂ − x′₂, …, and c_k = x_k − x′_k.
An obvious estimate of L is

L̂ = c₀b₀ + c₁b₁ + c₂b₂ + ⋯ + c_kb_k

Confidence limits for L are

L̂ ± t (standard error of L̂)

There is no simple formula for "standard error of L̂". This standard error is a multiple of s, but we will have to rely upon JMP to provide it for us. (Computation of L̂ and its standard error is under the "Custom Test" option in JMP.)
H₀: L = # can be tested using the test statistic

T = (L̂ − #)/(standard error of L̂)

and a t_{n−k−1} reference distribution. Or, if one thinks about it for a while, it is possible to find a reduced model that corresponds to the restriction that the null hypothesis places on the MLR model and to use a "p = k − 1" partial F test (with 1 and n − k − 1 degrees of freedom) equivalent to the t test for this purpose.
MLR Model-Building
The MLR model and inference formulas above form the core of a set of tools in common use for building reliable predictions of y from a large set of predictors x₁, …, x_k. There are a number of extensions and ways of using these tools (and their extensions) that combine to form a full model-building technology. Some of the additional ideas are summarized below.
Diagnostic Tools
Residual Plotting
The residuals e_i = y_i − ŷ_i from MLR are meant to be empirical approximations of the "random errors"

ε_i = y_i − μ_{y|x₁,x₂,…,x_k}

in the MLR model. The MLR model says that the ε_i are normal (with mean 0 and standard deviation σ) and independent. So one should expect the residuals to be describable in approximately these terms. They should be essentially "patternless normal random noise" and if they aren't, a problem with the corresponding MLR model is indicated. (This possibility then causes one to be skeptical of the appropriateness of any probability-based inferences based on the MLR model.)
Common ways of looking at MLR residuals (or their standardized/Studentized versions e*_i) are to
1) normal-plot them, hoping to see an approximately linear plot, and
2) plot them against any variables of interest (like, for example, x₁, …, x_k, ŷ or y, time order of observation, values of any variable potentially of importance but not included in the model, etc.) looking for a pattern that can simultaneously suggest a problem with a current model and identify possible remedial measures.
Chapter 7 of the text discusses residual plotting in some detail, and we'll return to this topic later.
Diagnostic Measures/Statistics
A Pooled Sample Standard Deviation and a Lack-of-Fit F Test
In problems where there are one or more (x₁, x₂, …, x_k) vectors that have multiple responses y, it is possible to make an estimate of σ² that doesn't depend for its appropriateness on the particular form of the relationship between x₁, x₂, …, x_k and mean y used in the MLR model. That is, if there are g groups of y's, each coming from a single (x₁, x₂, …, x_k) combination and having a group sample variance s_j², then one can make a kind of "pooled standard deviation" from these as

s_Pooled = √(((n₁ − 1)s₁² + (n₂ − 1)s₂² + ⋯ + (n_g − 1)s_g²)/((n₁ − 1) + (n₂ − 1) + ⋯ + (n_g − 1)))

Provided that σ doesn't change with x₁, x₂, …, x_k, this is a legitimate estimate of σ, regardless of whether or not one has an appropriate form for μ_{y|x₁,x₂,…,x_k}. On the other hand, the MLR sample standard deviation s (= √MSE) will tend to overestimate σ if μ_{y|x₁,x₂,…,x_k} ≠ β₀ + β₁x₁ + β₂x₂ + ⋯ + β_kx_k. So informal comparison of s to s_Pooled is a means of doing model diagnosis (a large difference being indicative of poor model fit).
This can be made more formal by inventing a related F test statistic. That is, the MLR error sum of squares is sometimes broken down as

SSE = (n − g)s²_Pooled + (SSE − (n − g)s²_Pooled)

and the (nonnegative) terms on the right of this equation are given the names SSPE ("pure error" sum of squares) and SSLoF ("lack of fit" sum of squares). That is

SSPE = (n − g)s²_Pooled

and

SSLoF = SSE − SSPE

In this notation, the statistic

F = (SSLoF/((n − k − 1) − (n − g))) / (SSPE/(n − g))

is an index of whether the two estimates of σ are detectably different, with large values corresponding to cases where s is much larger than s_Pooled. Then, as it turns out, under the MLR model this statistic has an F_{g−k−1,n−g} reference distribution and there is a formal p-value to be associated with a disparity between s and s_Pooled (the F_{g−k−1,n−g} probability to the right of the observed value). People sometimes even go so far as to add a pair of lines to the MLR ANOVA table, breaking down the "Error" source (and corresponding sum of squares and df) into "LoF" and "Pure Error" components.
R² and s = √MSE
As one searches through myriad possible MLR models potentially describing the relationship between predictors x and a response y, one generally wants a model with a "small" number of predictors (a simple or parsimonious model), a "large" value of R² and a "small" value of s. The naive "solution" to the model search problem of just "picking the biggest possible model" is in fact no solution, in light of the potential for "overfitting" a data set (adopting a model with too many "wiggles" that can nicely reproduce the y's in the data set, but that does a very poor job when used to produce mild extrapolations or interpolations).
Other Functions of SSE
For a given prediction problem (and therefore fixed SSTot), R² and s = √MSE are "equivalent" in the sense that one could be obtained from the other (and SSTot). They don't, however, necessarily produce the same ordering of reduced models of some grand MLR model in terms of "best looking" values of the criteria (e.g. the full model has the largest R² but may not have the smallest s). There are several other functions of SSE that have been suggested as possible statistics for model diagnostics/selection. Among them are Mallows' C_p and Akaike's Information Criterion (the AIC).
Mallows' C_p is based on the fact that the average total squared difference between the n values ŷ_i from a MLR fit and their (real) means μ_i can be worked out theoretically. When divided by σ², this quantity is a quantity Γ_p that is (p + 1) when there is a choice of β's in a p-predictor MLR model that produces correct means for all n data points and is otherwise larger. Mallows' suggestion for comparing reduced versions of a full (k-predictor) MLR model is that for a reduced model with p ≤ k predictors, one compute

C_p = SSE_Red/MSE_Full + 2(p + 1) − n

(an estimate of Γ_p if the full k-variable MLR model is correct) and look for small p and C_p no more than about (p + 1). The thinking is that such a reduced model is simpler than the full model and appears to produce predictions comparable to the full model predictions at the (x₁, x₂, …, x_k) vectors in the data set.
Akaike's Information Criterion is based on considerations beyond the scope of this exposition. For a MLR model with k predictors, it is

AIC = n ln(SSE/n) + 2(k + 1)

and people look for small values of AIC.
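A sketch of these two criteria as small helper functions, assuming Python with NumPy (SSE_red, MSE_full, n, p and k would come from fits like those sketched earlier):

    import numpy as np

    def mallows_cp(sse_red, mse_full, n, p):
        # C_p = SSE_red / MSE_full + 2(p + 1) - n ; compare to about p + 1
        return sse_red / mse_full + 2 * (p + 1) - n

    def aic(sse, n, k):
        # AIC = n ln(SSE/n) + 2(k + 1) ; smaller is better
        return n * np.log(sse / n) + 2 * (k + 1)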
Press Statistic
This is a diagnostic measure built on the notion that a model should not be terribly
sensitive to individual data points used to fit it, or equivalently that one ought to be able
to predict a response even without using that response to fit the model. Beginning with a
particular form for a MLR model and n data points, let

ŷ_(i) = the value of y_i predicted by a model fit to the other (n − 1) data points

(note that this is not necessarily ŷ_i). The "prediction sum of squares" is

PRESS = Σ_{i=1}^n (y_i − ŷ_(i))²

and one wants small values of this.
Model Search Algorithms
Given a particular full model (a particular set of k predictors), to do model selection one needs some computer-assisted means of "poking around" in the (often very large) set of possible reduced models looking for good reduced models. Statistical packages must offer some methods for this. Probably the best such methodology is of the "all possible regressions" variety. This kind of routine will, for a given set of k predictors and a choice of a number # of models to report, produce a list of the top # models with 1 predictor, the top # models with 2 predictors, ..., the top # models with (k − 1) predictors. (Slick computational schemes make this possible despite the fact that the number of models to be checked grows astronomically with the size of the full model, k.)
An older (and really, inferior and completely ad hoc) methodology attempts to "add in" or "drop out" variables in regression models one at a time on the basis of p-values for (t or F) tests of hypotheses that their regression coefficients are 0. (An algorithm adds in or drops out the most obvious predictor at each step.) This kind of "stepwise regression" methodology can be run in a purely "backwards elimination" mode (that begins with the full model and successively drops single variables), in a purely "forward selection" mode (that begins with a model containing only the predictor most highly correlated with y) or in a "mixed" mode that at any step can either add or drop a predictor variable depending upon the p-values for adding and dropping. JMP has implemented stepwise regression as its model searching tool.
The reason that stepwise searching is inferior to all-possible-regressions searches is that when one by some stepwise means or another gets to a particular p-variable model, one is NOT guaranteed that such a model is even the "best" one available of size p according to an R² (or any other) criterion. Only the exhaustive search provided by an all-possible-regressions algorithm produces such a guarantee.
Creating New Variables from Existing Ones
Where neither a full MLR model for y in terms of all available predictors x₁, …, x_k, nor any reduction of it, is satisfactory, one is far from "out of tricks to pull" in the quest to find a useful means of predicting y. One obvious possibility is to replace y and/or one or more of the x's with "transformed" versions of themselves. One might take square roots or logarithms (or ?????) of the responses and/or one or more of the predictors, do the modeling and inference and "untransform" back to original scales of measurement in order to interpret the inferences (by squaring or exponentiating, or ?????).
Another possibility is to fit models that are not simply linear in the predictor(s) but quadratic, or cubic, or ... For example, the full quadratic MLR regression model for y in the k = 2 predictors x₁ and x₂ is

y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + β₅x₁x₂ + ε
For a single predictor x, JMP will do the fitting of a polynomial for y under the Fit Y by X menu. For k ≥ 2 different x's, one needs to create powers of individual predictors and cross product terms by using the "cross" button to add them to the "Effects in Model" portion of the dialog box (or to use the "Response Surface" macro to fill that portion of the box after highlighting the original predictors in the list of columns) under the JMP Fit Model menu.
The appearance of the cross product term in the quadratic model above raises the possibility of using predictors that are functions of more than one of a set of basic variables x₁, …, x_k. Such terms are often called "interaction" terms. A model without interaction terms is sometimes called "additive" in that a mean response is gotten by simply adding to an overall β₀ the separate contributions due to each of the k predictors. For an additive model (one without interactions), plots of mean y against any one of the predictors, say x_j, are parallel for different sets of the other x's. Models with interactions have plots of mean y versus an x_j involved in an interaction that are NOT parallel.
Qualitative Factors/Inputs and Dummy Variables
At first look, it would seem that MLR has nothing to say about problems where some or all of the
basic system inputs that determine the nature of a response are qualitative rather than
quantitative. But as a matter of fact, with the proper amount of cleverness, it's possible to put
even qualitative factors into the MLR framework.
Consider a factor, call it A, that has I possible "levels" or settings. (A could, for example, be something like employee gender with I = 2.) It is then possible to represent A in MLR notation through the creation of I − 1 dummy variables. That is, one defines
x_A1 = 1 if the observation is from level 1 of A, and 0 otherwise
x_A2 = 1 if the observation is from level 2 of A, and 0 otherwise
⋮
x_{A,I−1} = 1 if the observation is from level I − 1 of A, and 0 otherwise

Then, the model

y = β₀ + β₁x_A1 + β₂x_A2 + ⋯ + β_{I−1}x_{A,I−1} + ε

says that observations are normal with standard deviation σ and mean

β₀ + β₁        if the observation is from level 1 of A
β₀ + β₂        if the observation is from level 2 of A
⋮
β₀ + β_{I−1}   if the observation is from level I − 1 of A
β₀             if the observation is from level I of A

All of the MLR machinery is available to do inference for the β's and sums and differences thereof (that amount to means and differences in mean responses under various levels of A).
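A sketch of building the I − 1 dummies for a factor, assuming Python with pandas (the factor levels here are hypothetical); the resulting columns can be handed to any MLR routine:

    import pandas as pd

    A = pd.Series(["low", "medium", "high", "medium", "low", "high"], name="A")

    # 0/1 indicators for all I levels, then keep I - 1 of them (here the last level is dropped)
    dummies = pd.get_dummies(A)[["low", "medium"]]
    # rows from the dropped level ("high") are coded 0 on both retained columns
    print(dummies)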
The approach above is the one taken in the textbook. But other (equivalent) versions of this business are possible. Two are of special interest because of the way JMP does its automatic coding for qualitative "nominal" and "ordinal" variables.
That is, a first alternative to the method above is to do what JMP seems to do for "nominal" variables. Define I − 1 variables

x′_A1 = 1 if the observation is from level 1 of A, −1 if it is from level I of A, and 0 otherwise
x′_A2 = 1 if the observation is from level 2 of A, −1 if it is from level I of A, and 0 otherwise
⋮
x′_{A,I−1} = 1 if the observation is from level I − 1 of A, −1 if it is from level I of A, and 0 otherwise
The model

y = β₀ + β₁x′_A1 + β₂x′_A2 + ⋯ + β_{I−1}x′_{A,I−1} + ε

then says that observations are normal with standard deviation σ and mean

β₀ + β₁                     if the observation is from level 1 of A
β₀ + β₂                     if the observation is from level 2 of A
⋮
β₀ + β_{I−1}                if the observation is from level I − 1 of A
β₀ − Σ_{i=1}^{I−1} β_i      if the observation is from level I of A

With this coding, the sum of the means is Iβ₀ and thus β₀ is the arithmetic average of the I means. The other β's are then deviations of the "first" I − 1 means from this arithmetic average of the I means.
A third version of this is what JMP seems to do for "ordinal" variables. Define I − 1 variables

x″_A2 = 1 if the observation is from level 2, 3, …, or I of A, and 0 otherwise
x″_A3 = 1 if the observation is from level 3, 4, …, or I of A, and 0 otherwise
⋮
x″_{A,I} = 1 if the observation is from level I of A, and 0 otherwise

Then, the model

y = β₀ + β₂x″_A2 + β₃x″_A3 + ⋯ + β_I x″_{A,I} + ε

says that observations are normal with standard deviation σ and mean

β₀                     if the observation is from level 1 of A
β₀ + β₂                if the observation is from level 2 of A
β₀ + β₂ + β₃           if the observation is from level 3 of A
⋮
β₀ + Σ_{i=2}^I β_i     if the observation is from level I of A

The "intercept" here is the mean for the first level of A and the other β's are the differences between means for successive levels of A (in the order 1 through I).
Once one has seen this idea of using I − 1 dummies to represent a single qualitative factor with I levels, it is easy to go on and include more than one qualitative factor in a model (through the use of a second set of dummies) and to create interactions involving qualitative factors (by taking products with all of its dummies), etc.
It is worth considering in detail what one gets from using dummy variables where there are two qualitative factors and all possible combinations of levels of those factors are represented in the data set. (The standard jargon for "all possible combinations of I levels of A and J levels of B represented in a data set" is that the data contain a "(full) two-way factorial in the factors A and B.") The figure below shows the I · J different combinations of levels of the factors laid out in a table, with "cell mean" responses filling the cells.

                         Factor B
                    1      2     ⋯     J
            1     μ_11   μ_12   ⋯    μ_1J
Factor A    2     μ_21   μ_22   ⋯    μ_2J
            ⋮      ⋮      ⋮            ⋮
            I     μ_I1   μ_I2   ⋯    μ_IJ

It is reasonable to ask what dummy variables can provide in terms of modeling a response in this context. Since it is the most sensible coding provided automatically by your software, let us consider the x′ coding above, instead of the x coding discussed in Chapter 5 of your text, in what follows.
For i = 1, 2, …, I − 1 let

x′_Ai = 1 if the observation is from level i of A, −1 if it is from level I of A, and 0 otherwise

and for j = 1, 2, …, J − 1 let

x′_Bj = 1 if the observation is from level j of B, −1 if it is from level J of B, and 0 otherwise

A MLR regression model for response y (first) involving only the dummies themselves is

y = β₀ + β_A1 x′_A1 + β_A2 x′_A2 + ⋯ + β_{A,I−1} x′_{A,I−1} + β_B1 x′_B1 + β_B2 x′_B2 + ⋯ + β_{B,J−1} x′_{B,J−1} + ε     (*)

This (no-interactions) model says that for i ≤ I − 1 and j ≤ J − 1

μ_ij = β₀ + β_Ai + β_Bj

for j ≤ J − 1

μ_Ij = β₀ − (Σ_{i=1}^{I−1} β_Ai) + β_Bj

for i ≤ I − 1

μ_iJ = β₀ + β_Ai − (Σ_{j=1}^{J−1} β_Bj)

and that

μ_IJ = β₀ − (Σ_{i=1}^{I−1} β_Ai) − (Σ_{j=1}^{J−1} β_Bj)

The no-interactions model says that with level of A (B) held fixed, as one moves across levels of B (A), the mean responses are changed by adding different β's to β₀, and that the same addition would be done on every fixed level of A (B). There are "parallel traces of means" as one moves across levels of B (A) for the different levels of A (B).
Consider too what one gets from averaging means across rows and down columns in the two-way table. Letting a dot subscript indicate that one has averaged out over the missing subscript, one can extend the table above to get

                         Factor B
                    1      2     ⋯     J
            1     μ_11   μ_12   ⋯    μ_1J    μ_1.
Factor A    2     μ_21   μ_22   ⋯    μ_2J    μ_2.
            ⋮      ⋮      ⋮            ⋮      ⋮
            I     μ_I1   μ_I2   ⋯    μ_IJ    μ_I.
                  μ_.1   μ_.2   ⋯    μ_.J    μ_..

Adding the above expressions for the μ_ij in terms of the β's across a row and dividing by J, it becomes clear that for i ≤ I − 1,

β_Ai = μ_i. − μ_..

the difference between the row average mean and the grand mean. Similarly, adding the above expressions for the μ_ij in terms of the β's down a column and dividing by I, it becomes clear that for j ≤ J − 1,

β_Bj = μ_.j − μ_..

the difference between the column average mean and the grand mean. These functions of the means μ_ij are common summaries of a complete two-way table of I · J means and standard jargon is that

the "main effect of A at its ith level" = μ_i. − μ_..

while

the "main effect of B at its jth level" = μ_.j − μ_..

Our exposition here says that the regression coefficients β with the JMP "nominal" coding of qualitative factors are (at least in the no-interaction model) exactly the factor main effects. Note that for the last level of the factors, the fact that the μ_i. − μ_.. (and the μ_.j − μ_..) sum to zero means that the main effect for the last level of a factor is the negative sum of the other main effects.
Now consider what happens when one adds to the no-interaction model (*) all cross products of x′_Ai and x′_Bj terms. One then has a model with number of predictors

"k" = (I − 1) + (J − 1) + (I − 1)(J − 1) = IJ − 1

It should not then be completely surprising that such a model then allows for any possible choice of the IJ means μ_ij. That is, the model with all the A dummies, the B dummies and the products of A and B dummies in it, is really equivalent to starting with IJ levels of a single omnibus factor and making up IJ − 1 dummies from scratch. The advantage of using the present coding (instead of starting all over and making up a new coding for a single omnibus factor) is that the cross product terms are interpretable. That is, as it turns out, a model extending (*) by including all possible cross product terms still ends up implying that for i ≤ I − 1

β_Ai = μ_i. − μ_..

and for j ≤ J − 1,

β_Bj = μ_.j − μ_..

and then that for i ≤ I − 1 and j ≤ J − 1

μ_ij = β₀ + β_Ai + β_Bj + β_ABij

so that for such i and j

β_ABij = μ_ij − (β₀ + β_Ai + β_Bj)
       = μ_ij − (mean from the no-interaction model)
       = μ_ij − (grand mean + ith A main effect + jth B main effect)

It is common to define for all i and j

interaction of A at level i and B at level j = μ_ij − (mean from the no-interaction model)
                                             = μ_ij − (grand mean + ith A main effect + jth B main effect)

so that the (I − 1)(J − 1) cross product β's can be interpreted as "interactions" measuring departure from additivity/parallelism. As it turns out, the "interactions" in a given row or column sum to 0, so that one can get them for the last level I of A or the last level J of B as the negative sum of the others in the corresponding column or row.
Notice that considering a full model including all A dummies, all B dummies and all A·B dummy products, various reduced models have sensible interpretations. The reduced model

y = β₀ + β_A1 x′_A1 + β_A2 x′_A2 + ⋯ + β_{A,I−1} x′_{A,I−1} + β_B1 x′_B1 + β_B2 x′_B2 + ⋯ + β_{B,J−1} x′_{B,J−1} + ε

is obviously one of "no A × B interactions." (This model says that there are parallel traces on a plot of mean y versus level of one of the factors.) The reduced model

y = β₀ + β_A1 x′_A1 + β_A2 x′_A2 + ⋯ + β_{A,I−1} x′_{A,I−1} + ε

is one of "A main effects only." (This model says that all means in a given row are the same.) And the reduced model

y = β₀ + β_B1 x′_B1 + β_B2 x′_B2 + ⋯ + β_{B,J−1} x′_{B,J−1} + ε

is one of "B main effects only." (This model says that all means in a given column are the same.)
Of course, all of the MLR regression machinery is then available for doing everything from making confidence intervals for the factorial effects (that are β's or linear combinations thereof), to looking at plots of mean responses in JMP, to predicting new responses at particular combinations of levels of the factors, to testing reduced models against the full model, etc.
Dummy Variables and Piece-Wise Regressions
Dummy variables are very useful objects. We saw above that they can be used to incorporate
basically qualitative information into a MLR analysis. They can also be used to allow one to
"piece together" different fitted curves defined over different regions. To illustrate what is
possible, suppose that one wants to model y so that μ_{y|x} is a continuous function of x, having the properties that for k₁ < k₂ (the known locations of so-called "knots"), μ_{y|x} is a linear function of x for x ≤ k₁, a possibly different linear function of x for k₁ ≤ x ≤ k₂, and a yet possibly different linear function of x for k₂ ≤ x. This can be done as follows.
Define two dummy variables

x₁ = 0 if x < k₁ and 1 if k₁ ≤ x

and

x₂ = 0 if x < k₂ and 1 if k₂ ≤ x

A model with the target properties is then

y = β₀ + β₁x + β₂(x − k₁)x₁ + β₃(x − k₂)x₂ + ε

Notice that this model has

μ_{y|x} = β₀ + β₁x  for x ≤ k₁

and

μ_{y|x} = (β₀ − β₂k₁) + (β₁ + β₂)x  for k₁ ≤ x ≤ k₂

and

μ_{y|x} = (β₀ − β₂k₁ − β₃k₂) + (β₁ + β₂ + β₃)x  for k₂ ≤ x

The same kind of thing can be done with other numbers of knots and with higher order polynomials (like, e.g., quadratics or cubics). For higher order polynomials, it can even be done in a way that forces the curve defined by μ_{y|x} to not have "sharp corners" at any knot. All of this is still in the framework provided by MLR.
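A sketch of building the extra regressors and fitting the piecewise line by least squares, assuming Python with NumPy (the knot locations and simulated data are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 40)
    k1, k2 = 4.0, 7.0                          # knot locations (assumed known)

    x1 = (x >= k1).astype(float)               # dummy: turns on at k1
    x2 = (x >= k2).astype(float)               # dummy: turns on at k2

    # simulate data whose mean is continuous and piecewise linear
    mu = 1.0 + 0.5 * x + 1.5 * (x - k1) * x1 - 2.0 * (x - k2) * x2
    y = mu + rng.normal(scale=0.3, size=x.size)

    A = np.column_stack([np.ones_like(x), x, (x - k1) * x1, (x - k2) * x2])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)  # estimates of beta0, beta1, beta2, beta3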
Diagnostic Plots and More Diagnostic Measures
There are various kinds of residuals, ways of plotting them, and measures of "influence" on a regression that are meant to help in the black art of model building. We have already alluded to the fact that under the MLR model, we expect ordinary residuals

e_i = y_i − ŷ_i

to look like mean 0 normal random noise and that standardized residuals

e*_i = e_i / (standard error of e_i)

should look like standard normal random noise. In the context of defining the PRESS statistic we alluded to the notion of deleted residuals

e_(i) = y_i − ŷ_(i)

and the hope that if a model is a good one and not overly sensitive to the exact data vectors used to fit it, these shouldn't be ridiculously larger in magnitude than the regular residuals, e_i. This does not exhaust the ways in which people have suggested using the residual idea. It is possible to invent standardized/Studentized deleted residuals

e*_(i) = e_(i) / (standard error of e_(i))

and there are yet other possibilities.
Partial Residual Plots (JMP "Effect Leverage Plots")
In somewhat nonstandard language, SAS/JMP makes what it calls "effect leverage plots"
that accompany its "effect tests." These are based on another kind of residuals,
sometimes called partial residuals. With k predictor variables, I might think about understanding the importance of variable j by considering residuals computed using only the other k − 1 predictor variables to do prediction (i.e. using a reduced model not including x_j). Although it is nearly impossible to see this from their manual and help functions or how the axes of the plots are labeled, the effect leverage plot in JMP for variable j is a plot of

e_(j)(y_i) = the ith y residual from regressing on all predictor variables except x_j

versus

e_(j)(x_ji) = the ith x_j residual from regressing on all predictor variables except x_j

On these plots there is a horizontal line drawn (ostensibly at ȳ) that really represents a y partial residual equal to 0 (y perfectly predicted by all predictors excepting x_j). (The vertical axis IS in the original y units, but should not really be labeled as y, but rather as partial residual.) The sum of squared vertical distances from the plotted points to this line is then SSE for a model without predictor j.
The horizontal plotting positions of the points are in the original x_j units, but are partial residuals of the x_j's, NOT the x_j's themselves. The horizontal center of the plot is at an x_j partial residual of 0, not at x̄_j as JMP (inaccurately) represents things. The non-horizontal line on the plots is in fact the least squares line through the plotted points. What is interesting is that the usual residuals from that least squares line are the residuals for the full MLR fit to the data. So the sum of the squared vertical distances from points to the sloped line is then SSE for the full model. The larger the reduction in SSE from the horizontal line to the sloped one, the smaller the p-value for testing H₀: β_j = 0.
Highlighting a point on a JMP partial residual plot makes it bigger on the other plots and highlights it in the data table (for examination or, for example, potential exclusion). We can at least on these plots see which points are fit poorly in a model that excludes a given predictor and the effect the addition of that last predictor has on the prediction of that y. (Note that points near the center of the horizontal scale are ones that have x_j that can already be predicted from the other x's and so addition of x_j to the prediction equation does not much change the residual. Points far to the right or left of center have values of predictor j that are unlike their predictions from the other x's. They both tend to more strongly influence the nature of the change in the model predictions as x_j is added to the model, and tend to have their residuals more strongly affected than points in the middle of the plot, where x_j might be predicted from the other x's.)
Leverage
The notion of how much potential influence a single data point has on a fit is an important one. The JMP partial residual plot/"effect leverage" plot is aimed at addressing this issue by highlighting points with large x_j partial residuals. Another notion of the same kind is based on the fact that there are n² numbers h_ii′ (i = 1, …, n and i′ = 1, …, n) depending upon the n vectors (x_1i, x_2i, …, x_ki) only (and not the y's) so that each ŷ_i is

ŷ_i = h_i1 y₁ + h_i2 y₂ + ⋯ + h_i,i−1 y_{i−1} + h_ii y_i + h_i,i+1 y_{i+1} + ⋯ + h_in y_n

h_ii is then somehow a measure of how heavily y_i is counted in its own prediction and is usually called the leverage of the corresponding data point. It is a fact that 0 < h_ii < 1 and Σ_{i=1}^n h_ii = k + 1. So the h_ii's average to (k + 1)/n, and a plausible rule of thumb is that when a single h_ii is more than twice this average value, the corresponding data point has an important (x_1i, x_2i, …, x_ki).
It is not at all obvious, but as it turns out, the PRESS statistic has the formula

PRESS = Σ_{i=1}^n (e_i/(1 − h_ii))²

involving these leverage values. This shows that big values of PRESS occur when big leverages are associated with large ordinary residuals.
Cook's D
The leverage h_ii involves only predictors and no y's. A proposal by Cook to measure the overall effect that point i has on the regression is the statistic

D_i = (h_ii/((k + 1)MSE)) · (e_i/(1 − h_ii))²

where large values of this supposedly identify points that by virtue of either their leverage or their large ordinary residual are "influential." D_i is Cook's Distance.
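A sketch of computing leverages, PRESS and Cook's distances directly from the hat matrix, assuming Python with NumPy (the design matrix and responses are made up):

    import numpy as np

    # hypothetical design: intercept column plus k = 2 predictors
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [9.0, 2.0]])
    y = np.array([3.1, 3.0, 6.8, 6.9, 10.7, 9.5])
    A = np.column_stack([np.ones(len(y)), X])
    n, kp1 = A.shape                            # kp1 = k + 1

    H = A @ np.linalg.inv(A.T @ A) @ A.T        # hat matrix; leverages are its diagonal
    h = np.diag(H)
    e = y - H @ y                               # ordinary residuals
    MSE = (e ** 2).sum() / (n - kp1)

    PRESS = ((e / (1 - h)) ** 2).sum()
    cooks_D = (h / (kp1 * MSE)) * (e / (1 - h)) ** 2
    high_leverage = h > 2 * kp1 / n             # rule-of-thumb flag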
0-1 Responses
Sometimes the response, y, is an indicator of whether or not some event of interest has occurred

y = 1 if the event occurs and 0 if the event does not occur

It is possible to think about assessing the impact of some predictor variables x₁, x₂, …, x_k "on y." But ordinary regression analysis is not the right vehicle for doing so. The standard regression models have normal y's, not 0-1 y's. Here the mean of y is

μ_{y|x₁,x₂,…,x_k} = P[y = 1] = p

and is constrained to be between 0 and 1. The most commonly available technology for doing inference here is so-called "logistic regression," which says that the "log odds ratio is linear in x₁, x₂, …, x_k." That is, the assumption is that

ln(p/(1 − p)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_k x_k     (**)

and the β's become the increase in log odds ratio for a unit increase in a predictor, the other predictors held fixed. (Note that as the log odds ratio increases, p increases, a log odds ratio of 0 being p = .5.)
The actual technology required to fit the relationship (**) is more complicated than least squares, and the methods of inference are based on different mathematics. But using a package like JMP, these differences are largely invisible to a user, and one can reason from the JMP report mostly by analogy to ordinary regression. Both "Fit Y by X" and "Fit Model" in JMP will automatically fit (**) if one gives it a nominal response variable. One bit of confusion that is possible here concerns the fact that JMP knows numerical order and alphabetical order. So if you ask it to do logistic regression, it will do so for p corresponding to what it considers to be the "first" category. The 0-1 coding that most people use thus ends up giving the fit for (1 − p) instead of p. So with JMP and 0's and 1's as above and wishing to do inference for p, one needs to use y′ = 1 − y as the response in order to get the signs of the fitted regression coefficients, b, correct.
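Outside JMP, many packages fit (**) directly. A sketch assuming Python with statsmodels (made-up data; note that statsmodels' Logit models P[y = 1] directly, so the reverse-coding issue described above for JMP does not arise):

    import numpy as np
    import statsmodels.api as sm

    # hypothetical data: one predictor and a 0-1 response
    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
    y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

    X = sm.add_constant(x)            # adds the intercept column
    fit = sm.Logit(y, X).fit()        # maximum likelihood fit of ln(p/(1-p)) = b0 + b1 x
    print(fit.params)                 # estimated b0, b1
    print(fit.predict(X))             # fitted probabilities p-hat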
Some Ideas About Looking at Data Over Time
A fundamental issue in business is the detection and prediction of change over time. What
follows as a wrap-up of this course is a brief introduction to some statistical methods useful in this enterprise.
Shewhart "Control" Charts
Working over 70 years ago at Bell Labs (and originally interested in manufacturing),
Walter Shewhart invented the so-called "control chart" as a process monitoring device,
aimed at change detection. (He correctly reasoned that unnecessary process variation
over time degrades performance of both the process and any product it produces. He
therefore sought to invent means of detecting unnecessary/removable process change and
avoiding ill-advised reaction to purely "random/inherent" fluctuation.) His method was to
plot summary statistics from samples taken over time, and to compare them to "control
limits" separating "common values" from "unusual" values of the same. His "control
limits" for a statistic W are

LCL_W = μ_W − 3σ_W  and  UCL_W = μ_W + 3σ_W

For various W, probability theory is called upon to supply the appropriate means and standard deviations. Some standard types of charts and their corresponding means and standard deviations (for use in control limits) are given in the following table.
Charted Statistic                            Mean     Standard Deviation
x̄ (sample mean of n measurements)            μ        σ/√n
s (sample std dev of n measurements)         c₄σ      σ√(1 − c₄²)
R (sample range of n measurements)           d₂σ      d₃σ
û (sample mean "defects per unit")           λ        √(λ/k)
p̂ (sample fraction "defective")              p        √(p(1 − p)/n)

In the table, μ and σ are the mean and standard deviation of a supposedly normal "stable process distribution"; c₄, d₂ and d₃ are "control chart constants" derived from normal distribution assumptions and depending upon n; λ is a "stable process mean rate of occurrence of 'defects'" and û is a rate based on a count from k units; p is the "stable process rate of producing 'defectives'" and p̂ is based on n items.
Where past experience supplies values for the process parameters (μ and/or σ, λ or p) one
can apply the control limits to data as they are generated and signal alarms/the need for
intervention/corrective action when out-of-control points are detected, on-line. Where
there is no such past experience available, one must take several samples of data in hand,
temporarily assume process stability and estimate the parameters, and apply the limits
only "retrospectively" to the data in hand. There is an entire "SPC" culture built around
various variations on exactly how these estimates are made, etc.
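A sketch of retrospective x̄ chart limits computed from several samples in hand, assuming Python with NumPy; the data are made up, and using the average of the sample variances to estimate σ is just one of the several estimation conventions alluded to above:

    import numpy as np

    # hypothetical data: 5 samples of n = 4 measurements each
    samples = np.array([[10.1,  9.9, 10.0, 10.2],
                        [10.3, 10.0,  9.8, 10.1],
                        [ 9.9, 10.2, 10.1, 10.0],
                        [10.0, 10.4, 10.2, 10.1],
                        [ 9.8, 10.0,  9.9, 10.1]])
    n = samples.shape[1]

    xbar_bar = samples.mean()                                 # overall mean, estimates mu
    sigma_hat = np.sqrt(samples.var(axis=1, ddof=1).mean())   # a simple estimate of sigma

    LCL = xbar_bar - 3 * sigma_hat / np.sqrt(n)
    UCL = xbar_bar + 3 * sigma_hat / np.sqrt(n)
    out_of_control = (samples.mean(axis=1) < LCL) | (samples.mean(axis=1) > UCL)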
Time Series Ideas
A time series, y_t, is (cleverly enough) a series of numbers collected with a time order, t, attached. There is a huge statistical literature on the modeling and prediction of time
series of measurements. To finish off Stat 328 we will talk about only the most
elementary ideas of the subject, ones that are supported by JMP-In 3.2.6 and discussed in
Chapter 16 of the JMP Start Statistics book. (JMP 4.0 offers far more than JMP 3 in this
area.)
The default implicit assumption in (Shewhart control charting and) one sample inference
is that a data generating mechanism is producing what look like "iid" successive
measurements. In many business contexts, this is clearly NOT a sensible description of
reality, and other tools are needed beyond simple parameter estimates for a single
distribution.
To begin with, many real time series have a clear long-term trend in them. A first step
in their analysis is to account for that trend (and sometimes to "remove" it by subtraction).
There are a variety of old-style smoothing methods that have been employed in this
enterprise. JMP-IN offers a very nice modern facility in its "Fit Spline" option in "Fit Y
by X." It is also possible to fit low order polynomials to the y_t using ordinary regression, i.e. to fit equations like

y_t = b₀ + b₁t + b₂t²

in the "Fit Y by X" routine. One advantage of fitting a function of t like this (rather than using another more flexible smoother) is that one then has an equation for limited extrapolation/prediction into the future.
After accounting for a trend, there often remain medium term seasonal or business cycle
effects in business time series. One way to try to model these is using dummy variables.
That is, if I invent a nominal variable "month" or "quarter" and use it in a regression for y_t (or a detrended version of y_t obtained by subtracting from it a smoothed series) I can often capture an adjustment for "period" in terms of estimated regression coefficients for the dummies JMP creates.
Another useful way of operating with either raw time series or ones that have had trends and/or seasonal effects removed is through the use of "autoregressions." That is, plots of

e_t versus e_{t−1} (or e_{t−2}, etc.)

(residual series developed by removing trend and seasonal components) will often show substantial correlations. That suggests using the "lagged values" e_{t−1}, e_{t−2}, … as "predictor variables" in a regression fit like

ê_t = b₀ + b₁e_{t−1} + b₂e_{t−2}

This way of operating is not only often effective, it has the virtue of allowing one-step-ahead forecasts to be made from the present.
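A sketch of fitting this lag-1 and lag-2 autoregression by ordinary least squares, assuming Python with NumPy (the residual series here is simulated):

    import numpy as np

    rng = np.random.default_rng(1)
    e = rng.normal(size=120)
    for t in range(2, e.size):                      # build in some autocorrelation
        e[t] += 0.6 * e[t - 1] - 0.3 * e[t - 2]

    y_t = e[2:]                                     # response: e_t
    A = np.column_stack([np.ones(y_t.size), e[1:-1], e[:-2]])   # 1, e_{t-1}, e_{t-2}
    b, *_ = np.linalg.lstsq(A, y_t, rcond=None)     # b0, b1, b2

    e_hat_next = b[0] + b[1] * e[-1] + b[2] * e[-2]  # one-step-ahead forecast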