Lecture 1: Overview of the Model Comparison Approach (Part 1)
Psych 5510/6510
Spring, 2009
The Source of this Approach
Judd, C. M., & McClelland, G. H. (1989). Data Analysis: A Model-Comparison Approach. San Diego, CA: Harcourt Brace Jovanovich.
The Process of Modeling
Model creation involves an interplay between three elements:
1) Our conceptual model.
2) Our mathematical model.
3) Our data.
The focus of data analysis is on the mathematical model, which consists of a variable measuring
what we are trying to model (the ‘dependent’ or ‘criterion’ variable), and variables (the
‘independent’ or ‘predictor’ variables) that can be used to predict the dependent variable.
But the selection of variables to test for inclusion in our model is based upon our conceptual
understanding of what we are modeling, and our understanding in turn is influenced by the data
and its effect on the mathematical model.
The Process of Data Analysis
DATA = MODEL + ERROR
The DATA consists of a set of scores. We will be using the scores of a sample to help construct
the model. The goal of modeling, however, is to arrive at a model that will handle all the scores
in the population from which we sampled (and thus create a model that generalizes beyond just
our sample).
Our MODEL consists of a formula for estimating or predicting the data. We will be using the
data from our sample to arrive at a MODEL that best describes the data in the population from
which we sampled.
ERROR = DATA – MODEL
ERROR is simply the amount by which the model fails to represent the data accurately (i.e. the
difference between what the model predicts each score will be and what those scores actually
are).
The ERROR of the model in the sample is often referred to as the residual—that which is left
after the model has done what it can to account for the data.
Goals of Modeling
1) To make ERROR as small as possible
2) To accomplish #1 with as simple a model as possible (principle of parsimony).
3) The ultimate goal--to which the other two are subservient--is to arrive at a better
conceptual understanding of what we are modeling.
Parsimony
We want as simple a model as possible that has the least amount of error. Given a choice
between two models, science tends to favor the simpler of the two. Complexity often arises in a
model when it is fundamentally flawed. If the model is flawed it won’t fit the data easily but
with enough additions and qualifications it can still be made to work. We want to avoid that.
The mathematical model that has the earth as the center of the solar system can still be made to
work--it can model the movement of the planets in the night sky--but it is an incredibly complex
model (circles inside of circles inside of circles...). The model that has the sun as the center of
the solar system works at least as well and is much, much simpler. The complexity of the former
model is due to trying to make a fundamentally flawed model handle the data. Science prefers
the simpler model.
Conceptual Understanding
While the focus will be on how to reduce error in the model we don’t want to lose track of the
paramount importance of developing a conceptual understanding of the variable we are
modeling.
For example: A very simple model of people’s heights in inches, one that contains no error at all,
would be to measure their height in centimeters and then divide that by 2.54, but that ‘perfect’
model (i.e. it has no error) is perfectly bankrupt in terms of leading to any real understanding. A
model of height that looks at daily caloric intake and the height of parents is less accurate but of
more interest. If you find that the model becomes more accurate (loses error) when you include
the altitude at which they live then the model starts to become even more interesting.
Symbols
If the general approach is: DATA = MODEL + ERROR
For any particular score we have: Yi = Ŷi + ei
Yi is a particular score in the data set (the ‘ith’ score).
Ŷi (sometimes symbolized as Y’i) is what the model predicts the ‘ith’ score will be.
ei is the error of that prediction, in other words ei = Yi – Ŷi. Again this is called a
‘residual’.
Reducing ERROR
1) Use reliable and accurate measuring techniques (a matter of experimental design).
2) Make the Model conditional upon additional information about each observation (e.g.
including information about a person’s height in a model designed to predict their
weight). This will involve adding independent (predictor) variables to the Model.
Example DATA from Text
Auto Fatality Rates: Deaths per 100 million vehicle miles (1978)
Alabama        4.2      Louisiana       4.0      Ohio           2.9
Alaska         4.7      Maine           3.0      Oklahoma       3.5
Arizona        5.3      Maryland        2.0      Oregon         3.9
Arkansas       3.6      Massachusetts   2.0      Penn           3.0
California     3.4      Michigan        3.0      Rhode Is.      1.7
Colorado       3.4      Minnesota       3.0      S. Carolina    3.8
Connecticut    2.3      Mississippi     4.0      S. Dakota      3.4
Delaware       3.3      Missouri        3.0      Tennessee      3.3
Florida        3.2      Montana         4.0      Texas          4.0
Georgia        3.3      Nebraska        2.0      Utah           3.9
Hawaii         4.0      Nevada          5.0      Vermont        3.4
Idaho          4.8      N. Hampshire    2.0      Virginia       2.7
Illinois       3.0      N. Jersey       2.0      Washington     3.5
Indiana        2.9      N. Mexico       5.0      W. Virginia    4.0
Iowa           3.0      N. York         3.0      Wisconsin      3.1
Kansas         3.5      No. Carolina    3.0      Wyoming        5.4
Kentucky       3.2      No. Dakota      3.0
MODELS
Model: Ŷi = some formula for predicting the value of Yi
Simplest Model: A Constant
A constant is some value that is included in the model and is not computed from the sample.
Constants usually come from some theory or from a previous experiment (they are specifically not
calculated from our current data). For example, let’s say some model of traffic fatalities predicts
that the auto fatality rate in each state should be 3.7. We use the capital Roman letter ‘B’
to represent a constant in the model.
MODEL: Ŷi = B0 where B0=3.7 (a constant)
A Somewhat More Sophisticated Model (Using an Unconditional Parameter)
It is much more common for our models to consist of ‘parameters’ rather than constants.
A parameter is an attribute of our population that can be estimated from our sample. For
example, the mean of a population is a parameter. We rarely know the actual value of the
parameters; we usually have to estimate their value based upon our sample data.
We will use the Greek letter ‘β’ (‘beta’) to represent a parameter, and we will use the small
Roman letter ‘b’ to represent an estimate of a parameter calculated from our sample.¹
Keeping the b’s Straight
‘B’: a constant
‘β’: a parameter
‘b’: an estimate of a parameter
A step up from using a constant in a model is to use a parameter instead. For example, the mean
of the population (i.e. ‘μ’) is a parameter; we could use the mean of the population as our model.
In this case we are predicting that everyone’s score equals the mean of the population.
MODEL (referring to applying the model to the population):
Ŷi = β0 where β0= μ
If we don’t know the value of μ then we can use an unbiased estimate of μ for our model.
MODEL (referring to applying the model to the scores in the sample):
Ŷi = b0 where b0 = est. μ = Ȳ (i.e. the mean of our sample, from last semester)
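To make this concrete, here is a minimal Python sketch (not part of the original lecture, using made-up scores) of the idea that the sample mean is the estimate b0 that gives this simple model the smallest possible error:

```python
import numpy as np

# Hypothetical sample of five scores (made-up values, for illustration only)
Y = np.array([2.3, 3.3, 3.2, 4.0, 4.8])

def sse(b0, y):
    # Sum of squared errors for the simple model: Y-hat_i = b0 for every score
    return np.sum((y - b0) ** 2)

# Compare SSE at the sample mean with SSE at two other candidate constants
for b0 in (np.mean(Y), 3.0, 4.0):
    print(f"b0 = {b0:.2f}   SSE = {sse(b0, Y):.3f}")
# The smallest SSE occurs at b0 = mean(Y), the estimate of mu from the sample.
```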
A More Sophisticated Model (Using Conditional Parameters)
So far our models have predicted the same score for everyone in the population. A ‘conditional’
parameter is one whose contribution to the model is dependent upon the person’s score on
another variable (X). For example we may want to incorporate each state’s legal minimum
drinking age (X1) into our model of traffic fatalities.
MODEL: Ŷi = β0 + (β1)(X1i) where X1i is the legal drinking age for State ‘i’.
MODEL: Ŷi = β0 + (β1)(X1i)
(referring to the population)
MODEL: Ŷi = b0 + (b1)(X1i)
(referring to the sample)
¹ If you have studied multiple regression before, you may have run into ‘beta weights’, which are used when all the
variables are first transformed into standard scores. That is not the way this model comparison approach uses ‘β’
(here it simply represents a parameter).
This is the regression formula we used last semester (Y’=a+bX) but with new symbols. We are
modeling Y by creating a formula for a linear relationship between X and Y, and that model calls
for both an ‘intercept’ (β0) and a ‘slope’ (β1) of the regression line.
Even More Sophisticated Models
We can keep adding predictor variables (each with its own parameter) to help improve the ability
of our model to predict traffic fatalities (and subsequently reduce the error of the model):
X1=drinking age in the State
X2=maximum speed limit in the State
X3=mean annual snowfall
MODEL:
Ŷi = β0 + β1X1i + β2X2i + β3X3i (in the population)
or MODEL: Ŷi = b0 + b1X1i + b2X2i + b3X3i (in the sample)
General Linear Model
Expressing the relationship between the dependent variable (Y) and the predictor variables (the
X’s) in this way is an example of what is known as the general linear model. When there is
only one dependent variable (as will be the case this semester) then it is also known as the
multiple regression model.
Model Creation
The formulas we will be using will calculate the values of ‘b’ that will lead to the smallest
possible amount of error for a model to predict the scores in our sample. These values of ‘b’ also
serve as estimates of the values of ‘β’ for the model that would lead to the smallest amount of
error in predicting the scores in the population.
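As an illustration of how these ‘b’ values can be computed, here is a hedged Python sketch using invented data (the predictor values are not from the text); it finds the least-squares estimates for a model with two predictors:

```python
import numpy as np

# Made-up illustrative data (not the textbook's): Y = fatality rate,
# X1 = minimum drinking age, X2 = mean annual snowfall
Y  = np.array([4.2, 4.7, 5.3, 3.6, 3.4, 2.3])
X1 = np.array([19.0, 21.0, 19.0, 21.0, 21.0, 21.0])
X2 = np.array([5.0, 60.0, 3.0, 8.0, 2.0, 40.0])

# Design matrix: a column of 1's for the intercept (b0) plus the predictors
X = np.column_stack([np.ones(len(Y)), X1, X2])

# b0, b1, b2: the estimates that minimize SSE for this sample
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ b
SSE = np.sum((Y - Y_hat) ** 2)
print("b =", np.round(b, 3), "  SSE =", round(SSE, 3))
```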
ERROR
DATA = MODEL + ERROR
For any particular score: Yi = Ŷi + ei
Remember that one of the criteria for our model is that its error be as small as possible (thus
demonstrating it has a good fit with the reality of our data). For an individual score the error
(residual) of the model would be:
ei = Yi - Ŷi
Total Error in the Model
We will need a measure of the total amount of error in the model. We can’t simply add up all of
the errors...i.e. total model error = Σ(Yi - Ŷi)...because that would have the errors of being too
low in the prediction (leading to positive error scores) cancel out the errors of being too high in
the prediction (leading to negative error scores). In fact, in most of our models Σ(Yi - Ŷi) will
equal exactly zero.
SSE
The solution for how to measure the total amount of error in a model is a familiar one (we did
something similar last semester in measuring variance), we will use SSE (sum of squared errors),
which involves squaring each error term before adding them up.
SSE = Σ(Yi - Ŷi)²
Let’s say we have a simple model: Ŷi = 5, and our data are Y = 1, 4, 5, 7, 8
Yi    -  Ŷi  =  error    error²
1     -  5   =   -4        16
4     -  5   =   -1         1
5     -  5   =    0         0
7     -  5   =    2         4
8     -  5   =    3         9
                        SSE = 30
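A quick Python check of this computation, using the same five scores and the same constant model:

```python
import numpy as np

Y = np.array([1, 4, 5, 7, 8])
Y_hat = 5                       # the simple model: predict 5 for every score
e = Y - Y_hat                   # residuals: -4, -1, 0, 2, 3
SSE = np.sum(e ** 2)            # 16 + 1 + 0 + 4 + 9
print(SSE)                      # prints 30
```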
If our data were Y= 5, 5, 5, 5, 5 then the model Ŷi = 5 would have no error (SSE=0)
Yi    -  Ŷi  =  error    error²
5     -  5   =    0         0
5     -  5   =    0         0
5     -  5   =    0         0
5     -  5   =    0         0
5     -  5   =    0         0
                        SSE = 0
The Model-Comparison Approach
We have looked at the types of models we might use (varying in sophistication) and how we will
measure the amount of error in each model (SSE). The final step is to look at the ‘model-comparison approach’, which is the context in which we will use these concepts to analyze and
make sense of our data.
The model-comparison approach involves (gasp) comparing models. Specifically in every
analysis we will have two models to compare: a ‘compact model’ called ‘Model C’, and an
‘augmented model’ called ‘Model A’. Model A will always consist of the same parameters as
Model C plus one or more additional parameters.
The analysis will involve determining whether it is worthwhile to include the additional
parameters of Model A in our model (i.e. determining whether the additional parameters
significantly reduce the error of the model). Everything we do this semester will follow this
basic concept.
For example, let’s say we want to model the number of traffic fatalities in the various States of
the U.S., and that we have previously established that including in our model the drinking age
(X1) in the state was worthwhile. Now we want to know if it would be worthwhile to also
include the state’s mean annual snowfall (X2) in the model.
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i
If including X2 in the model significantly lowers the error of the model then we will want to
include it.
Moving on with this example, let’s say that we do find it is worthwhile to include X2 in our
model, that now becomes our new Model C. Now we want to know if adding the maximum
speed limit in the State (X3) to our model is worthwhile.
Model C: Ŷi = β0 + β1X1i + β2X2i
Model A: Ŷi = β0 + β1X1i + β2X2i + β3X3i
And so on...
There will be times when we will want Model A to include more than one new parameter. For
example:
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i + β3X3i
In this case we would be testing to see if it is worth adding both X2 and X3 at the same time to
the model.
Utility of the Model Comparison Approach
This approach (testing to see whether the additional parameters of Model A are worth adding to
Model C) accomplishes two things:
1) It supports an integrative way of doing research. The goal of the model comparison
approach is to gradually develop more and more sophisticated models of the phenomena
of interest. It is a theoretically valuable alternative to the piecemeal approach, where we
study one question at a time outside the context of creating a general model. As we
develop the model over many experiments we automatically tie all of the research
together. The replication of previous experiments is also built into this approach. This
will all become apparent as you learn more about the approach.
2) Even if you do not want to follow the integrative approach to research that is
supported by the model comparison approach, the use of multiple regression that we will
be learning for the model comparison approach can be used to perform virtually all of the
statistical analyses we covered last semester. The advantage of doing the analyses with
multiple regression is that it is a one-way-does-it-all approach (you are essentially
learning just one analytic procedure rather than several different ones) and a more
flexible approach (opening up additional ways of running your experiment).
Worthwhileness
The authors of the text phrase the decision of whether or not to adopt Model A as a question of
‘worthwhileness’: are the additional parameters of Model A worth adding to our model? That
decision will depend upon how much error was eliminated by adding the extra parameters of
Model A, and whether that reduction is worth the extra complexity that including those
parameters would add to the model.
SSE(C) and SSE(A)
Let’s look at a Model C and a Model A again, to remind us that Model C is the ‘compact model’
while Model A is the ‘augmented model’. For example:
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i + β3X3i
and remember that SSE = Σ(Yi - Ŷi)²
To measure the amount of error in Model C we use Model C to predict the scores, then use the
formula SSE = Σ(Yi - Ŷi)² to see how much error was in Model C, then we do the same thing
using Model A, again using SSE = Σ(Yi - Ŷi)² but this time putting in the predicted values based
upon Model A. This gives us the following two measures:
SSE(C): the error in Model C: SSE = Σ(Yi - ŶCi)²
SSE(A): the error in Model A: SSE = Σ(Yi - ŶAi)²
SSR
The measure of how much the additional parameters of Model A reduced error is called the ‘sum
of squares reduced’ or ‘SSR’:
SSR = SSE(C) – SSE(A)
The nature of our formulas is such that Model A will never lead to more error than Model C.
Consider the following models:
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i
If variable X2 has absolutely no ability to contribute to the prediction of Y then the value of β2
will be zero, consequently SSE(A) will equal SSE(C) and SSR will be zero.
Proportional Reduction in Error (PRE)
We will find that it is more important to consider what proportion of the error is reduced by
adopting Model A rather than simply the amount of error reduced. Consider the following two
scenarios:
Scenario One: SSE(C) = 100 and SSE(A) = 90, so SSR = 10.
Scenario Two: SSE(C)= 20 and SSE(A) = 10, so SSR = 10.
In both scenarios error was reduced by 10 (SSR=10) when moving from Model C to Model A,
but in scenario one the error was reduced by 10% (from 100 down to 90) while in scenario two
error was reduced by 50% (from 20 down to 10). The reduction in error of 10 in scenario two is
more impressive because it takes care of a higher proportion of the error found in Model C. We
will be using the ‘proportional reduction in error’ or ‘PRE’ as our measure of how much error
the extra parameters in Model A took care of.
PRE = [SSE(C) - SSE(A)] / SSE(C) = SSR / SSE(C)
In scenario one: PRE = SSR/SSE(C) = 10/100 = 0.10
In scenario two: PRE = SSR/SSE(C) = 10/20 = 0.50
An equivalent, alternative formula is:
PRE = 1 - [SSE(A) / SSE(C)]
With this second formula it is easy to see that if Model A does not reduce error at all (which
would mean that SSE(A)=SSE(C)) then PRE equals zero. While if Model A is perfect and
completely gets rid of all the error (i.e. SSE(A)=0) then PRE equals one (i.e. Model A provides a
100% reduction in error).
Error never goes up when we add more parameters (useless predictors don’t change the
prediction equation), so PRE is always between 0 and 1.00.
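A tiny Python sketch of these definitions, reproducing the two scenarios above:

```python
def pre(sse_c, sse_a):
    # Proportional reduction in error when moving from Model C to Model A
    ssr = sse_c - sse_a         # SSR: sum of squares reduced
    return ssr / sse_c          # equivalently: 1 - sse_a / sse_c

print(pre(100, 90))             # scenario one: 0.10
print(pre(20, 10))              # scenario two: 0.50
```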
PC and PA
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i
PC = the number of parameters in Model C (in the example above PC = 2)
PA = the number of parameters in Model A (in the example above PA = 3)
Note that the number of parameters in a model equals the number of predictor (X) variables plus
one (for the intercept).
N = the number of scores in the sample.
Worthwhileness and PRE
PRE measures the proportional reduction of error from Model C accomplished by the extra
parameter(s) of Model A, the question is how big does PRE have to be to say that the extra
parameters would be worth adding to our model?
It would seem that if we want to keep improving the accuracy of our model (i.e. make error as
small as possible) we would want to keep adding more predictor variables (X’s) to our model.
Remember, however, that in addition to reducing error we want to have as simple a model as
possible. So we want to strike a balance of creating the most accurate model possible with the
fewest parameters (each predictor variable X added to the model adds another
parameter). The model comparison approach uses null hypothesis testing to determine the
worthwhileness of adding the additional predictor variables. If the proportional reduction in
error is statistically significant then we will use that as our criterion for adding the additional
parameter(s) to the model.
In determining the statistical significance of the PRE there are a couple of things we need to take
into account.
1) If there is no correlation between the extra parameter(s) of Model A and the dependent
variable then adding them to the model won’t accomplish anything. But even if there is no
correlation between the variables in the population there will probably still be at least a slight
correlation between them in the sample just due to chance, which would lead to some reduction
of error in the sample (i.e. would lead to PRE>0).
2) Due to an artifact of our formulas, as the number of parameters in our model starts
approaching the number of scores in our sample the model approaches perfection at predicting
the scores in the sample. When the number of parameters equals the N of our sample then
Model A has no error, it will exactly predict each score in the sample without error. This
happens even when all the scores are randomly determined (see ‘PRE and N’ below). This
artificial reduction in error, however, only applies to that particular sample, not to the population
or to other samples pulled from the population. We don’t want this artifact to fool us into
thinking that the PRE is large enough to be statistically significant.
PRE and N
To demonstrate the relationship between PRE and N I will begin with a small sample where
N=5. Y is the dependent (criterion) variable, X1 through X4 are the independent (predictor)
variables. Remember that because the intercept is also a parameter the number of parameters
equals the number of predictor variables plus one, thus with four predictor variables the number
of parameters equals five, which is also the N of the sample. In each example the scores for Y
and the predictor variables were randomly generated from populations with various means and
standard deviations. The data for example one are given below, again the data are randomly
generated so there should be no real predictive ability in the independent variables.
Y       X1      X2      X3      X4
45.00   41.00   67.00   25.00   50.00
50.00   48.00   58.00   25.00   46.00
49.00   58.00   84.00   33.00   58.00
48.00   49.00   61.00   32.00   49.00
41.00   48.00   66.00   32.00   50.00
I now want to see the PRE of using 1, then 2, then 3, then 4 predictor variables.
Model C: Ŷi = β0
Model A: Ŷi = β0 + β1X1 (one predictor variable)
         Ŷi = β0 + β1X1 + β2X2 (two predictor variables)
         Ŷi = β0 + β1X1 + β2X2 + β3X3 (three predictor variables)
         Ŷi = β0 + β1X1 + β2X2 + β3X3 + β4X4 (four predictor variables)
Example One
Predictors in Model A:   X1      X1,X2    X1,X2,X3    X1,X2,X3,X4
PRE:                     .072    .726     .887        1.00
As can be seen, even though the data are random the PRE of a Model A using X1 was .072, a
Model A that used both X1 and X2 was 0.726, and by the time all four predictor variables were
put into Model A it could account for all of the error left over from Model C (i.e. each Y score in
the sample was predicted exactly by Model A).
I tried this with three more samples, each time randomly generating the values for all of the
variables. When I averaged across the four examples I found the following PRE’s.
Predictors in Model A:   X1     X1,X2   X1,X2,X3   X1,X2,X3,X4
Average PRE:             .33    .60     .85        1.00
As the data are randomly generated this demonstrates that when N=5 having one predictor
variable will reduce the error of the model by about 33%, adding a randomly generated X2 will
reduce error by about 60% (compared to no predictor variables), adding X3 will reduce error by
about 85% (compared to no predictor variables), and adding X4 will remove all error from the
model. The reduction in error cited above from having one, two, or three predictors is simply the
average of these particular four samples and thus may vary somewhat if other examples were
generated, but a PRE of 100% when the number of parameters equals the size of the sample will
always occur.
Now let’s see what happens when we greatly increase N. We will look at the PRE’s from one to
four predictors when N=100.
Average across three examples where N=100.
Predictors in Model A:   X1      X1,X2    X1,X2,X3    X1,X2,X3,X4
Average PRE:             .004    .019     .032        .040
If we had 99 predictor variables (i.e. 100 parameters) then the PRE of the model would be 1.00.
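The demonstration can be approximated with a short Python sketch; the means and standard deviations below are arbitrary choices for generating random data, not the ones used for the lecture’s examples:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def pre_with_k_random_predictors(n, k, rng):
    # PRE of Model A (intercept + k random predictors) relative to
    # Model C (intercept only, i.e. Y-hat_i = mean of Y)
    Y = rng.normal(50, 5, n)
    sse_c = np.sum((Y - Y.mean()) ** 2)
    X = np.column_stack([np.ones(n)] + [rng.normal(50, 5, n) for _ in range(k)])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sse_a = np.sum((Y - X @ b) ** 2)
    return (sse_c - sse_a) / sse_c

for n in (5, 100):
    pres = [round(pre_with_k_random_predictors(n, k, rng), 3) for k in (1, 2, 3, 4)]
    print(f"N = {n}: PRE for 1-4 random predictors = {pres}")
# With N = 5 the PRE climbs toward 1.00 as the fourth predictor is added
# (5 parameters = 5 scores); with N = 100 the same random predictors
# produce only very small PREs.
```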
There are really two artifacts going on here that can make the PRE of the sample greater than the
PRE of the population. One is the purely chance correlations that occur in the sample between
the predictor variables and the criterion variable, which will tend to cause the PRE in the sample
to increase as more predictor variables are added, this can be seen both when N=5 and when
N=100. The second is the inexorable march toward PRE=1.00 as the number of predictor
variables approaches the size of the sample (much more evident above when N=5 than when
N=100). Our determination of whether the PRE in the sample is evidence that the predictor
variables work in the population as a whole needs to take both of these into account.
Putting all of this together there are two important elements that influence the determination of
whether or not the PRE is statistically significant:
1. PRE per parameter added. E.g. a PRE=.3 is more impressive when accomplished by
adding 2 parameters than when adding 10.
2. PRE taking into account how many parameters could have been added. The PRE is less
impressive as the number of parameters approaches the number of observations.
Testing PRE for Significance
PRE is a statistic, it measures the proportional reduction in error in moving from Model C to
Model A, using our estimates of the parameters and applying the models to the scores in our
sample. It is influenced both by random error leading to chance correlations between our
dependent variable and our predictor variables, and by the artifact of our formulas that causes
error to diminish as the number of parameters in our model approaches the number of scores in
our sample.
η² is the corresponding parameter; it is the proportional reduction in error in moving from Model
C to Model A, when the model consists of the actual values of the parameters (not estimates) and
is applied to all of the scores in the population (not just the sample).
Remembering that in hypothesis testing the hypotheses of interest are always about populations
and their parameters, the general hypotheses that fit everything we will be doing this semester
are:
H0: η² = 0 (i.e. there is no reduction in error when moving from Model C to Model A in our
population)
HA: η² > 0 (i.e. there is a reduction in error when moving from Model C to Model A in our
population)
There are three different ways to test to see if the PRE in our sample is sufficiently large to
conclude that it is worthwhile to move from Model C to Model A. All three approaches will lead
to exactly the same conclusion. It would be simpler to just settle on one of these and learn only
it, but there is an advantage to looking at all three. The names I have given these approaches are:
1) Testing PRE for statistical significance.
2) The ‘PRE to F’ approach.
3) The ‘MS to F’ approach.
1) Testing PRE for Statistical Significance
This approach for testing the statistical significance of our PRE is the most straightforward of the
three, you are simply going to compare the PRE computed from the sample to a table that gives
us the critical values of PRE (or in the case of my statistical tool you will simply get a p value for
that PRE).
As with the statistics we examined last semester, there is a ‘sampling distribution of PRE
assuming H0 is true’. If H0 is true then the PRE in our sample should be small; if HA is true
then the PRE should be large. What we are interested in is what values of PRE would we
consider to be ‘surprising’ if H0 were true, i.e. what values of PRE are there only a .05 chance of
obtaining in our sample if H0 were true. On the web site I provide a table that gives the ‘critical
values’ of PRE and I have also provided a software tool that when given a value of PRE and the
values of PA, PC, and N will return the PRE critical value (and the F critical value and the p
value for that PRE). If the PRE of our sample is equal to or above the PRE critical value then we
‘reject H0’ and conclude that it would be worthwhile to add the extra parameters of Model A to
our model. If the PRE of our sample is less than the PRE critical value then we ‘don’t reject H0’,
and say that we don’t have enough evidence to conclude that the extra parameters of Model A
are worth adding to our model. Not surprisingly, we can see in the table that the critical value of
PRE (and thus our test for the statistical significance of our PRE) depends upon how many more
parameters are in Model A than in Model C (i.e. PA-PC) and how close we are getting to the
maximum number of parameters we can have (N-PA).
[Figure: sampling distribution of PRE assuming H0 is true]
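For readers who would rather compute a critical value of PRE than look it up, here is a hedged Python sketch; it assumes the standard relationship between PRE and F developed in the next section, with PA - PC and n - PA as the degrees of freedom:

```python
from scipy.stats import f

def pre_critical(pa, pc, n, alpha=0.05):
    # Critical PRE derived from the critical F with df = (PA - PC, n - PA),
    # by solving F = [PRE/(PA-PC)] / [(1-PRE)/(n-PA)] for PRE
    dfn, dfd = pa - pc, n - pa
    f_crit = f.ppf(1 - alpha, dfn, dfd)
    return (f_crit * dfn) / (f_crit * dfn + dfd)

# Example: PC = 4, PA = 6, N = 20 (the values used in the example later on)
print(round(pre_critical(pa=6, pc=4, n=20), 3))   # roughly .35
```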
2) The ‘PRE to F’ approach
This approach to testing the statistical significance of our PRE has two advantages over the
previous approach: 1) it provides a better understanding of exactly what we are doing when we
look to see if our PRE is statistically significant; and 2) it uses F critical tables, which are much
easier to find in various textbooks than PRE critical tables.
As has been stated before, the impressiveness of a particular PRE depends upon how many
parameters Model A added to accomplish that PRE (it is more impressive to reduce error with
fewer parameters than with lots of parameters), and how many parameters we have left that we
could add to the model (knowing that the maximum number of parameters we can have is equal
to N).
The ‘PRE to F’ approach is a good way to operationalize these concepts. We are going to
calculate an F ratio. In our text the symbol F* is used in place of F obtained, and I have decided to
use the text’s symbols.
F* = (PRE per parameter added) / (proportion of error remaining per parameter remaining)
Let’s look at the numerator first: PRE per parameter added. We divide the PRE resulting from
moving from Model C to Model A by the number of additional parameters in Model A (i.e. PA - PC).
This gives us the PRE per parameter added; if PRE = .30 and it took two parameters to
accomplish that, then the PRE per parameter added would be 0.15.
PRE per parameter added = PRE / (PA - PC)
Now let’s look at the denominator: proportion of error remaining (i.e. 1-PRE) divided by how
many more parameters we could add after Model A (i.e. N-PA).
Proportion of error remaining per parameter remaining = (1 - PRE) / (n - PA)
As we will see, if the PRE resulting from Model A is exactly what we would expect simply by
adding two parameters at random² then the value of F* will be equal to one. If, however, the
PRE resulting from Model A is more than we would expect just by adding parameters at random,
then the value of F* will be greater than one.
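A small Python sketch of this F* computation (the values of PA, PC, and n below are hypothetical, chosen only to illustrate the arithmetic); it also returns the p value from the F distribution:

```python
from scipy.stats import f

def f_star(pre, pa, pc, n):
    # F* = (PRE per parameter added) /
    #      (proportion of error remaining per parameter remaining)
    dfn, dfd = pa - pc, n - pa
    F = (pre / dfn) / ((1 - pre) / dfd)
    p = f.sf(F, dfn, dfd)          # p value for the obtained F*
    return F, p

# Hypothetical numbers: PRE = .30 gained by adding 2 parameters, with N = 20
print(f_star(pre=0.30, pa=3, pc=1, n=20))
```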
Example
We want to test to determine if the extra parameters of Model A (the ‘augmented model’) add
significantly to our Model compared to the fewer parameters of Model C (the ‘compact model’),
i.e. we are testing to see if the extra parameters of Model A are ‘worthwhile’. The question is
not whether Model A reduces error (it usually will), but whether it reduces error more than
chance would predict.
Basic scenario: let's say we have a set of DATA that consists of 20 observations. The
maximum number of parameters we can have in the model is thus 20. Model C has 4
parameters, and the Sum of Squared Error for Model C equals 64. This error is graphically
represented below; the 64 squares represent the amount of error in SSE(C).
As we can have a maximum of 20 parameters and Model C has 4 of those, we are free to use up
to 16 more parameters in an augmented Model A. If we use all 16 remaining parameters we
should get rid of all of the error that remains after Model C. Thus each remaining parameter
should on average reduce error by 1/16 (a proportional reduction of error of .0625), or in other
words each remaining parameter should reduce error by 4 (i.e. 64/16). This is shown below. The
error of 64 has been divided into 16 pieces, each piece representing 1/16 of the total remaining
error, or how much we can expect each additional parameter to reduce error just due to chance.
² By ‘adding two parameters at random’ I mean adding two predictor variables whose scores are randomly generated
and thus should not actually improve the prediction of our dependent variable except by chance.
For Model A to be ‘worthwhile’ it needs to reduce error more than you would get by simply
selecting some parameters at random. In other words, to be worthwhile Model A needs to reduce
error by more than 4 for every parameter it uses (i.e. its PRE per parameter added should be
greater than 1/16 or .0625).
We will examine a Model A that adds two more predictor variables (and thus two more
parameters) to the model:
Model C: Ŷi = b0 + b1X1i + b2X2i + b3X3i
Model A: Ŷi = b0 + b1X1i + b2X2i + b3X3i + b4X4i + b5X5i
PC=4
PA=6
n=20
SSE(C)=64
SSE(A)=?
SSR=?
Scenario One
In the first scenario Model A reduces error from 64 down to 56.
SSR = SSE(C) – SSE(A) = 64 – 56 = 8
So SSR (the reduction in error when moving from Model C to Model A) is 8. This reduction of
error by Model A (accomplished by adding two parameters) is shown below.
It is obvious that Model A has not accomplished much of anything; it has only reduced the error
by as much as you would expect by selecting any two parameters at random. For conceptual
clarity, however, we will take a closer look.
Computing PRE. First, let’s look at the proportional reduction of error (PRE) accomplished by
Model A. To do this we will look at what proportion of the error of Model C was eliminated by
moving to Model A.
PRE = SSR / SSE(C) = 8 / 64 = 0.125
This just goes to show that doing essentially nothing but adding a couple of randomly selected
parameters can be expected to lead to some amount of reduction in error. Although we know
that the results should not be significant, let’s test the worthwhileness of Model A anyway.
Using the 'PRE to F' method:
F* = (PRE per parameter added) / (proportion of error remaining per parameter remaining)
The top part of the F ratio will be:
PRE per parameter added = PRE / (PA - PC) = 0.125 / 2 = .0625
The bottom part of the F ratio will be:
Proportion of error remaining per parameter remaining = (1 - PRE) / (n - PA) = .875 / 14 = .0625
And so the value of F* is:
F* = [PRE / (PA - PC)] / [(1 - PRE) / (n - PA)] = (.125 / 2) / (.875 / 14) = .0625 / .0625 = 1.00
Remember from last semester that when H0 is true the expected value of F* is 1.00³, and when
H0 is false that the expected value of F* is greater than 1.00. If F* were greater than 1.00 we
would need to look up the critical value of F in a table, with PA-PC as the ‘degrees of freedom
numerator’ and n-PA as the ‘degrees of freedom denominator’. The web site for this class has
such a table. In that table it can be seen that in this experiment F* would have to be greater than
or equal to 3.74 to be statistically significant.
So these results are not statistically significant: Model A does not add significantly to the
reduction of error, it does not perform better than simply selecting two parameters at random, and its
additional parameters are not a ‘worthwhile’ addition to the model. Stick with Model C.
³ Actually it is (df denominator)/(df denominator - 2) for reasons given last semester, but that is close to 1 when the
df denominator is large, and ‘1’ makes better conceptual sense.
Scenario Two
Now let us examine a new and improved Model A. This Model A also adds two parameters to
those used in Model C, but this time it reduces the error from 64 down to 40.
SSR = SSE(C) – SSE(A) = 64 – 40 = 24
So SSR (the reduction in error when moving from Model C to Model A) is 24. This reduction of
error that Model A accomplishes by adding two parameters is shown below. It is obvious that
Model A has accomplished quite a bit by adding only two parameters.
Computing PRE. First, let’s look at the proportional reduction of error (PRE) accomplished by
Model A. To do this we will look at what proportion of the error of Model C was eliminated by
moving to Model A.
PRE = SSR / SSE(C) = 24 / 64 = 0.375
Using the 'PRE to F' method:
F* = (PRE per parameter added) / (proportion of error remaining per parameter remaining)
The top part of the F ratio will be:
PRE per parameter added = PRE / (PA - PC) = .375 / 2 = .1875
The bottom part of the F ratio will be:
Proportion of error remaining per parameter remaining = (1 - PRE) / (n - PA) = .625 / 14 = .045
And so the value of F* is:
F* = [PRE / (PA - PC)] / [(1 - PRE) / (n - PA)] = (.375 / 2) / (.625 / 14) = .1875 / .045 = 4.17
Again, the F critical table tells us that in this experiment F* would have to be greater than or
equal to 3.74 to be statistically significant.
So these results are statistically significant: Model A does add significantly to the reduction of
error, it does perform better than simply selecting two parameters at random, and its additional
parameters are a ‘worthwhile’ addition to the model. Go with Model A over Model C.
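A quick Python check of scenario two, using the values from the text (PRE = .375, PC = 4, PA = 6, n = 20) and the F distribution for the critical value:

```python
from scipy.stats import f

# Values from scenario two: PRE = .375, PC = 4, PA = 6, n = 20
PRE, PA, PC, n = 0.375, 6, 4, 20
dfn, dfd = PA - PC, n - PA                   # 2 and 14

F_star = (PRE / dfn) / ((1 - PRE) / dfd)     # about 4.2
F_crit = f.ppf(0.95, dfn, dfd)               # about 3.74
print(round(F_star, 2), round(F_crit, 2), F_star >= F_crit)
```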
3) The ‘Mean Squares to F’ Approach
The third way of determining whether or not the PRE in your sample is statistically significant is
to transform the SSE’s into mean squares and form an F ratio from them. This approach gives
you the exact same value for F as the ‘PRE to F’ method described above. It is offered as a third
alternative simply because this is how many statistical packages output their analyses (including
SPSS). The ‘PRE to F’ method, I believe, is conceptually clearer.
SSE(A), SSE(C), and SSR are all various ‘sums of squared errors’, what we called ‘SS’ last
semester. Remember that a ‘mean square’ (or ‘MS’) is a sum of squares divided by its degrees
of freedom, and that F* is a ratio of two MS’s.
MSR = SSR / (PA - PC), with df = PA - PC
MSE = SSE(A) / (n - PA), with df = n - PA
F* = MSR / MSE, with df(MSR) in the numerator and df(MSE) in the denominator
Let’s look at how the analysis from scenario two would be represented in a summary table.
Statistical packages, specifically SPSS, use different labels for the SSE’s than those used in our
text, so I have included both how our textbook would label the SSE’s and how SPSS would label
them in the table below. Recall that SSR=24, SSE(A)=40, SSE(C)=64, PC=4, PA=6, and N=20.
Source (textbook)   Source (SPSS)    SS    df    MS      F
SSR                 Regression       24     2    12      4.2
SSE(A)              Residual         40    14    2.86
SSE(C)              Total            64    16
The table above is slightly misleading in that SPSS insists that Model C always consist of exactly
one parameter, so you wouldn’t actually be able to obtain the summary table above in an SPSS
output (from a Model C with 4 parameters). As you will see, we have to do a little number
crunching on our own to get the summary table we want from SPSS.
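A short Python sketch reproducing the summary table’s mean squares and F* from the sums of squares:

```python
# Scenario two, expressed as mean squares (the 'MS to F' route)
SSR, SSE_A = 24, 40
PA, PC, n = 6, 4, 20

MSR = SSR / (PA - PC)           # 24 / 2  = 12, with df = PA - PC = 2
MSE = SSE_A / (n - PA)          # 40 / 14 ≈ 2.86, with df = n - PA = 14
F_star = MSR / MSE              # ≈ 4.2, the same value the 'PRE to F' route gives
print(MSR, round(MSE, 2), round(F_star, 1))
```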
On Statistical Significance
I’d like to return to the idea from last semester that perhaps the null hypothesis is never (or
almost never) true. In the model comparison approach:
H0: η² = 0
HA: η² > 0
Conceptually, however, in the model comparison approach we could agree that everything in the
world is related to everything else, and thus any variable we add to the model can legitimately be
expected to reduce the error of the model, and thus H0 is never true. What we are using null
hypothesis testing for, however, is not to determine whether the independent variable adds to the
prediction of our dependent variable, but whether it adds enough to be worthwhile, i.e. whether it
adds enough to overcome our preference for parsimony. Perhaps it would be better to say:
H0: η² is pretty darn small, too small to justify adding the parameters to our model.
HA: η² is large enough to justify adding the parameters to our model.
This might just be a semantic difference, but I think not. As you will see as we progress through
this approach, we may find that some variables that were worth having in our model when the
model was fairly simple might get dropped from the model later on, when we arrive at a more
elegant model. This doesn’t make the original variables magically ‘no longer related to the
dependent variable’, it simply means we came up with a better model, one that does a better job
of explaining the dependent variable. We aren’t stopping as soon as we reject H0 and can say
that some X variable predicts Y, we are trying to come up with the best possible model of Y.