Lecture 1: Overview of the Model Comparison Approach
Psych 5510/6510, Spring 2009

The Source of this Approach
Judd, C. M., & McClelland, G. H. (1989). Data Analysis: A Model Comparison Approach. San Diego, CA: Harcourt Brace Jovanovich.

The Process of Modeling
Model creation involves an interplay between three elements:
1) Our conceptual model.
2) Our mathematical model.
3) Our data.
The focus of data analysis is on the mathematical model, which consists of a variable measuring what we are trying to model (the 'dependent' or 'criterion' variable) and variables that can be used to predict it (the 'independent' or 'predictor' variables). But the selection of variables to test for inclusion in our model is based upon our conceptual understanding of what we are modeling, and that understanding is in turn influenced by the data and their effect on the mathematical model.

The Process of Data Analysis
DATA = MODEL + ERROR
The DATA consists of a set of scores. We will be using the scores of a sample to help construct the model. The goal of modeling, however, is to arrive at a model that will handle all the scores in the population from which we sampled (and thus create a model that generalizes beyond just our sample).
Our MODEL consists of a formula for estimating or predicting the data. We will be using the data from our sample to arrive at a MODEL that best describes the data in the population from which we sampled.
ERROR = DATA – MODEL
ERROR is simply the amount by which the model fails to represent the data accurately (i.e. the difference between what the model predicts each score will be and what those scores actually are). The ERROR of the model in the sample is often referred to as the residual—that which is left after the model has done what it can to account for the data.

Goals of Modeling
1) To make ERROR as small as possible.
2) To accomplish #1 with as simple a model as possible (the principle of parsimony).
3) The ultimate goal--to which the other two are subservient--is to arrive at a better conceptual understanding of what we are modeling.

Parsimony
We want as simple a model as possible that has the least amount of error. Given a choice between two models, science tends to favor the simpler of the two. Complexity often arises in a model when it is fundamentally flawed. If the model is flawed it won't fit the data easily, but with enough additions and qualifications it can still be made to work. We want to avoid that. The mathematical model that has the earth as the center of the solar system can still be made to work--it can model the movement of the planets in the night sky--but it is an incredibly complex model (circles inside of circles inside of circles...). The model that has the sun as the center of the solar system works at least as well and is much, much simpler. The complexity of the former model is due to trying to make a fundamentally flawed model handle the data. Science prefers the simpler model.

Conceptual Understanding
While the focus will be on how to reduce error in the model, we don't want to lose track of the paramount importance of developing a conceptual understanding of the variable we are modeling. For example: a very simple model of people's heights in inches, one that contains no error at all, would be to measure their height in centimeters and then divide that by 2.54. But that 'perfect' model (i.e. it has no error) is perfectly bankrupt in terms of leading to any real understanding.
A model of height that looks at daily caloric intake and the height of parents is less accurate but of more interest. If you find that the model becomes more accurate (loses error) when you include the altitude at which people live, then the model starts to become even more interesting.

Symbols
If the general approach is: DATA = MODEL + ERROR
then for any particular score we have: Yi = Ŷi + ei
Yi is a particular score in the data set (the 'ith' score). Ŷi (sometimes symbolized as Y'i) is what the model predicts the 'ith' score will be. ei is the error of that prediction, in other words ei = Yi – Ŷi. Again, this is called a 'residual'.

Reducing ERROR
1) Use reliable and accurate measuring techniques (a matter of experimental design).
2) Make the Model conditional upon additional information about each observation (e.g. including information about a person's height in a model designed to predict their weight). This will involve adding independent (predictor) variables to the Model.

Example DATA from Text
Auto Fatality Rates: Deaths per 100 million vehicle miles (1978)

Alabama       4.2    Louisiana      4.0    Ohio          2.9
Alaska        4.7    Maine          3.0    Oklahoma      3.5
Arizona       5.3    Maryland       2.0    Oregon        3.9
Arkansas      3.6    Massachusetts  2.0    Penn          3.0
California    3.4    Michigan       3.0    Rhode Is.     1.7
Colorado      3.4    Minnesota      3.0    S. Carolina   3.8
Connecticut   2.3    Mississippi    4.0    S. Dakota     3.4
Delaware      3.3    Missouri       3.0    Tennessee     3.3
Florida       3.2    Montana        4.0    Texas         4.0
Georgia       3.3    Nebraska       2.0    Utah          3.9
Hawaii        4.0    Nevada         5.0    Vermont       3.4
Idaho         4.8    N. Hampshire   2.0    Virginia      2.7
Illinois      3.0    N. Jersey      2.0    Washington    3.5
Indiana       2.9    N. Mexico      5.0    W. Virginia   4.0
Iowa          3.0    N. York        3.0    Wisconsin     3.1
Kansas        3.5    No. Carolina   3.0    Wyoming       5.4
Kentucky      3.2    No. Dakota     3.0

MODELS
Model: Ŷi = some formula for predicting the value of Yi

Simplest Model: A Constant
A constant is some value that is included in the model and is not computed from the sample. Constants usually come from some theory or from a previous experiment (they are specifically not calculated from our current data). For example, let's say some model of traffic fatalities predicts that the auto fatality rate in each state should be 3.7. We use the capital Roman letter 'B' to represent a constant in the model.
MODEL: Ŷi = B0, where B0 = 3.7 (a constant)

A Somewhat More Sophisticated Model (Using an Unconditional Parameter)
It is much more common for our models to consist of 'parameters' rather than constants. A parameter is an attribute of our population that can be estimated from our sample. For example, the mean of a population is a parameter. We rarely know the actual value of the parameters; we usually have to estimate their value based upon our sample data. We will use the Greek letter 'β' ('beta') to represent a parameter, and we will use the small Roman letter 'b' to represent an estimate of a parameter calculated from our sample.

Keeping the b's Straight
'B': a constant
'β': a parameter
'b': an estimate of a parameter

A step up from using a constant in a model is to use a parameter instead. For example, the mean of the population (i.e. 'μ') is a parameter, and we could use the mean of the population as our model. In this case we are predicting that everyone's score equals the mean of the population.
MODEL (referring to applying the model to the population): Ŷi = β0, where β0 = μ
If we don't know the value of μ then we can use an unbiased estimate of μ for our model.
MODEL (referring to applying the model to the scores in the sample): Ŷi = b0, where b0 = est. μ = Ȳ (i.e. the mean of our sample, from last semester).
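To make the unconditional-parameter model concrete, here is a minimal sketch in Python (my own illustration, not part of the text): it estimates b0 with the sample mean and computes the residuals for a handful of the fatality rates from the table above.

```python
# Minimal sketch (not from the text): the unconditional model Y-hat_i = b0,
# where b0 is estimated by the sample mean (est. of mu).
y = [4.2, 4.7, 5.3, 3.6, 3.4]        # first five fatality rates from the table

b0 = sum(y) / len(y)                  # b0 = sample mean = 4.24
predicted = [b0] * len(y)             # the model predicts the same value for every state
residuals = [yi - b0 for yi in y]     # e_i = Y_i - Y-hat_i  (DATA - MODEL)

print(b0, residuals)
```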
A More Sophisticated Model (Using Conditional Parameters)
So far our models have predicted the same score for everyone in the population. A 'conditional' parameter is one whose contribution to the model depends upon the person's score on another variable (X). For example, we may want to incorporate each state's legal minimum drinking age (X1) into our model of traffic fatalities.
MODEL: Ŷi = β0 + (β1)(X1i), where X1i is the legal drinking age for State 'i'.
MODEL: Ŷi = β0 + (β1)(X1i) (referring to the population)
MODEL: Ŷi = b0 + (b1)(X1i) (referring to the sample)
If you have studied multiple regression before you may have run into 'beta weights', which are used when all the variables are first transformed into standard scores. That is not the way this model comparison approach is using 'β' (here it simply represents a parameter). This is the regression formula we used last semester (Y' = a + bX) but with new symbols. We are modeling Y by creating a formula for a linear relationship between X and Y, and that model calls for both an 'intercept' (β0) and a 'slope' (β1) of the regression line.

Even More Sophisticated Models
We can keep adding predictor variables (each with its own parameter) to help improve the ability of our model to predict traffic fatalities (and subsequently reduce the error of the model):
X1 = drinking age in the State
X2 = maximum speed limit in the State
X3 = mean annual snowfall
MODEL: Ŷi = β0 + β1X1i + β2X2i + β3X3i (in the population)
or
MODEL: Ŷi = b0 + b1X1i + b2X2i + b3X3i (in the sample)

General Linear Model
Expressing the relationship between the dependent variable (Y) and the predictor variables (the X's) in this way is an example of what is known as the general linear model. When there is only one dependent variable (as will be the case this semester) it is also known as the multiple regression model.

Model Creation
The formulas we will be using will calculate the values of 'b' that lead to the smallest possible amount of error for a model predicting the scores in our sample. These values of 'b' also serve as estimates of the values of 'β' for the model that would lead to the smallest amount of error in predicting the scores in the population.

ERROR
DATA = MODEL + ERROR
For any particular score: Yi = Ŷi + ei
Remember that one of the criteria for our model is that its error be as small as possible (thus demonstrating it has a good fit with the reality of our data). For an individual score the error (residual) of the model is: ei = Yi - Ŷi

Total Error in the Model
We will need a measure of the total amount of error in the model. We can't simply add up all of the errors (i.e. total model error = Σ(Yi - Ŷi)) because that would let the errors of predicting too low (positive error scores) cancel out the errors of predicting too high (negative error scores). In fact, in most of our models Σ(Yi - Ŷi) will equal exactly zero.
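A quick check of that last point in Python, using the same five fatality rates as before (my own sketch, not from the text): with the mean as the model, the raw errors cancel out to zero, which is why we square them before summing.

```python
# Sketch: raw residuals from a mean model sum to (essentially) zero,
# so summing them tells us nothing about how much error the model has.
y = [4.2, 4.7, 5.3, 3.6, 3.4]
b0 = sum(y) / len(y)

errors = [yi - b0 for yi in y]
print(sum(errors))                    # ~0 (up to floating-point rounding)
print(sum(e ** 2 for e in errors))    # the squared errors do not cancel out
```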
SSE
The solution for how to measure the total amount of error in a model is a familiar one (we did something similar last semester in measuring variance): we will use SSE (the sum of squared errors), which involves squaring each error term before adding them up.
SSE = Σ(Yi - Ŷi)²
Let's say we have a simple model, Ŷi = 5, and our data are Y = 1, 4, 5, 7, 8:

Yi   Ŷi   error   error²
 1    5     -4      16
 4    5     -1       1
 5    5      0       0
 7    5      2       4
 8    5      3       9
                SSE = 30

If our data were Y = 5, 5, 5, 5, 5 then the model Ŷi = 5 would have no error (SSE = 0):

Yi   Ŷi   error   error²
 5    5      0       0
 5    5      0       0
 5    5      0       0
 5    5      0       0
 5    5      0       0
                SSE = 0
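The same arithmetic as the first table above, as a small Python sketch (mine, not the text's):

```python
# Sketch: SSE = sum of squared errors for the constant model Y-hat_i = 5.
y = [1, 4, 5, 7, 8]
y_hat = 5

sse = sum((yi - y_hat) ** 2 for yi in y)
print(sse)    # 30, matching the worked example above
```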
The Model-Comparison Approach
We have looked at the types of models we might use (varying in sophistication) and how we will measure the amount of error in each model (SSE). The final step is to look at the 'model-comparison approach', which is the context in which we will use these concepts to analyze and make sense of our data.
The model-comparison approach involves (gasp) comparing models. Specifically, in every analysis we will have two models to compare: a 'compact model' called 'Model C', and an 'augmented model' called 'Model A'. Model A will always consist of the same parameters as Model C plus one or more additional parameters. The analysis will involve determining whether it is worthwhile to include the additional parameters of Model A in our model (i.e. determining whether the additional parameters significantly reduce the error of the model). Everything we do this semester will follow this basic concept.
For example, let's say we want to model the number of traffic fatalities in the various States of the U.S., and that we have previously established that including the drinking age (X1) in our model was worthwhile. Now we want to know if it would be worthwhile to also include the mean annual snowfall (X2) in the state in the model.
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i
If including X2 in the model significantly lowers the error of the model then we will want to include it. Moving on with this example, let's say that we do find it is worthwhile to include X2 in our model; that now becomes our new Model C. Now we want to know if adding the maximum speed limit in the State (X3) to our model is worthwhile.
Model C: Ŷi = β0 + β1X1i + β2X2i
Model A: Ŷi = β0 + β1X1i + β2X2i + β3X3i
And so on... There will be times when we will want Model A to include more than one new parameter. For example:
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i + β3X3i
In this case we would be testing to see if it is worth adding both X2 and X3 to the model at the same time.

Utility of the Model Comparison Approach
This approach (testing to see whether the additional parameters of Model A are worth adding to Model C) accomplishes two things:
1) It supports an integrative way of doing research. The goal of the model comparison approach is to gradually develop more and more sophisticated models of the phenomena of interest. It is a theoretically valuable alternative to the piecemeal approach, where we study one question at a time outside the context of creating a general model. As we develop the model over many experiments we automatically tie all of the research together. The replication of previous experiments is also built into this approach. This will all become apparent as you learn more about the approach.
2) Even if you do not want to follow the integrative approach to research that is supported by the model comparison approach, the use of multiple regression that we will be learning for the model comparison approach can be used to perform virtually all of the statistical analyses we covered last semester. The advantage of doing the analyses with multiple regression is that it is a one-way-does-it-all approach (you are essentially learning just one analytic procedure rather than several different ones) and a more flexible approach (opening up additional ways of running your experiment).

Worthwhileness
The authors of the text phrase the decision of whether or not to adopt Model A as a question of 'worthwhileness': are the additional parameters of Model A worth adding to our model? That decision will depend upon how much error is eliminated by adding the extra parameters of Model A, and whether that reduction is worth the extra complexity that including those parameters would add to the model.

SSE(C) and SSE(A)
Let's look at a Model C and a Model A again, to remind us that Model C is the 'compact model' while Model A is the 'augmented model'. For example:
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i + β3X3i
and remember that SSE = Σ(Yi - Ŷi)².
To measure the amount of error in Model C we use Model C to predict the scores and then apply the SSE formula to see how much error is in Model C. We then do the same thing for Model A, again using the SSE formula but this time putting in the predicted values based upon Model A. This gives us the following two measures:
SSE(C): the error in Model C: SSE(C) = Σ(Yi - ŶCi)²
SSE(A): the error in Model A: SSE(A) = Σ(Yi - ŶAi)²

SSR
The measure of how much the additional parameters of Model A reduced error is called the 'sum of squares reduced' or 'SSR':
SSR = SSE(C) – SSE(A)
The nature of our formulas is such that Model A will never lead to more error than Model C. Consider the following models:
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i
If variable X2 has absolutely no ability to contribute to the prediction of Y then the value of β2 will be zero; consequently SSE(A) will equal SSE(C) and SSR will be zero.

Proportional Reduction in Error (PRE)
We will find that it is more important to consider what proportion of the error is reduced by adopting Model A rather than simply the amount of error reduced. Consider the following two scenarios:
Scenario One: SSE(C) = 100 and SSE(A) = 90, so SSR = 10.
Scenario Two: SSE(C) = 20 and SSE(A) = 10, so SSR = 10.
In both scenarios error was reduced by 10 (SSR = 10) when moving from Model C to Model A, but in scenario one the error was reduced by 10% (from 100 down to 90) while in scenario two error was reduced by 50% (from 20 down to 10). The reduction in error of 10 in scenario two is more impressive because it takes care of a higher proportion of the error found in Model C. We will be using the 'proportional reduction in error' or 'PRE' as our measure of how much error the extra parameters in Model A took care of.

PRE
PRE = [SSE(C) - SSE(A)] / SSE(C) = SSR / SSE(C)
In scenario one: PRE = SSR / SSE(C) = 10/100 = 0.10
In scenario two: PRE = SSR / SSE(C) = 10/20 = 0.50
An equivalent, alternative formula is: PRE = 1 - SSE(A)/SSE(C)
With this second formula it is easy to see that if Model A does not reduce error at all (which would mean that SSE(A) = SSE(C)) then PRE equals zero, while if Model A is perfect and completely gets rid of all the error (i.e. SSE(A) = 0) then PRE equals one (i.e. Model A provides a 100% reduction in error). Error never goes up when we add more parameters (useless predictors don't change the prediction equation), so PRE is always between 0 and 1.00.
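As a small Python sketch (my own, not from the text), here are both PRE formulas applied to the two scenarios above; the helper name `pre` is mine.

```python
# Sketch: proportional reduction in error, computed two equivalent ways.
def pre(sse_c, sse_a):
    ssr = sse_c - sse_a                  # sum of squares reduced
    return ssr / sse_c                   # equivalently: 1 - sse_a / sse_c

print(pre(100, 90))   # scenario one: 0.10
print(pre(20, 10))    # scenario two: 0.50
print(1 - 90 / 100)   # the alternative formula gives the same 0.10
```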
PC and PA
Model C: Ŷi = β0 + β1X1i
Model A: Ŷi = β0 + β1X1i + β2X2i
PC = the number of parameters in Model C (in the example above PC = 2)
PA = the number of parameters in Model A (in the example above PA = 3)
Note that the number of parameters in a model equals the number of predictor (X) variables plus one (for the intercept).
N = the number of scores in the sample.

Worthwhileness and PRE
PRE measures the proportional reduction of error from Model C accomplished by the extra parameter(s) of Model A. The question is how big PRE has to be before we say that the extra parameters are worth adding to our model. It would seem that if we want to keep improving the accuracy of our model (i.e. make error as small as possible) we would want to keep adding more predictor variables (X's) to our model. Remember, however, that in addition to reducing error we want to have as simple a model as possible. So we want to strike a balance, creating the most accurate model possible with the fewest parameters (each predictor variable X added to the model adds another parameter).
The model comparison approach uses null hypothesis testing to determine the worthwhileness of adding the additional predictor variables. If the proportional reduction in error is statistically significant then we will use that as our criterion for adding the additional parameter(s) to the model. In determining the statistical significance of the PRE there are a couple of things we need to take into account.
1) If there is no correlation between the extra predictor variable(s) of Model A and the dependent variable then adding them to the model won't accomplish anything. But even if there is no correlation between the variables in the population, there will probably still be at least a slight correlation between them in the sample just due to chance, which would lead to some reduction of error in the sample (i.e. would lead to PRE > 0).
2) Due to an artifact of our formulas, as the number of parameters in our model approaches the number of scores in our sample, the model approaches perfection at predicting the scores in the sample. When the number of parameters equals the N of our sample, Model A has no error: it will exactly predict each score in the sample. This happens even when all the scores are randomly determined (see 'PRE and N' below). This artificial reduction in error, however, only applies to that particular sample, not to the population or to other samples pulled from the population. We don't want this artifact to fool us into thinking that the PRE is large enough to be statistically significant.

PRE and N
To demonstrate the relationship between PRE and N I will begin with a small sample where N = 5. Y is the dependent (criterion) variable; X1 through X4 are the independent (predictor) variables. Remember that because the intercept is also a parameter, the number of parameters equals the number of predictor variables plus one; thus with four predictor variables the number of parameters equals five, which is also the N of the sample. In each example the scores for Y and the predictor variables were randomly generated from populations with various means and standard deviations. The data for example one are given below; again, the data are randomly generated, so there should be no real predictive ability in the independent variables.
  Y      X1     X2     X3     X4
45.00  41.00  67.00  25.00  50.00
50.00  48.00  58.00  25.00  46.00
49.00  58.00  84.00  33.00  58.00
48.00  49.00  61.00  32.00  49.00
41.00  48.00  66.00  32.00  50.00

I now want to see the PRE of using 1, then 2, then 3, then 4 predictor variables.
Model C: Ŷi = β0
Model A: Ŷi = β0 + β1X1 (one predictor variable)
Model A: Ŷi = β0 + β1X1 + β2X2 (two predictor variables)
Model A: Ŷi = β0 + β1X1 + β2X2 + β3X3 (three predictor variables)
Model A: Ŷi = β0 + β1X1 + β2X2 + β3X3 + β4X4 (four predictor variables)

Example One
Predictors in Model A    PRE
X1                       .072
X1, X2                   .726
X1, X2, X3               .887
X1, X2, X3, X4          1.00

As can be seen, even though the data are random, the PRE of a Model A using X1 was .072, a Model A that used both X1 and X2 had a PRE of .726, and by the time all four predictor variables were put into Model A it could account for all of the error left over from Model C (i.e. each Y score in the sample was predicted exactly by Model A). I tried this with three more samples, each time randomly generating the values for all of the variables. When I averaged across the four examples I found the following PREs.

Predictors in Model A    PRE (average of four examples, N = 5)
X1                       .33
X1, X2                   .60
X1, X2, X3               .85
X1, X2, X3, X4          1.00

As the data are randomly generated, this demonstrates that when N = 5 having one predictor variable will reduce the error of the model by about 33%, adding a randomly generated X2 will reduce error by about 60% (compared to no predictor variables), adding X3 will reduce error by about 85% (compared to no predictor variables), and adding X4 will remove all error from the model. The reduction in error cited above from having one, two, or three predictors is simply the average of these particular four samples and thus might change somewhat if other examples were generated, but a PRE of 100% when the number of parameters equals the size of the sample will always be true.
Now let's see what happens when we greatly increase N. We will look at the PREs from one to four predictors when N = 100.

Predictors in Model A    PRE (average of three examples, N = 100)
X1                       .004
X1, X2                   .019
X1, X2, X3               .032
X1, X2, X3, X4           .040

If we had 99 predictor variables (i.e. 100 parameters) then the PRE of the model would be 1.00.
There are really two artifacts going on here that can make the PRE of the sample greater than the PRE of the population. One is the purely chance correlations that occur in the sample between the predictor variables and the criterion variable, which will tend to cause the PRE in the sample to increase as more predictor variables are added; this can be seen both when N = 5 and when N = 100. The second is the inexorable march toward PRE = 1.00 as the number of predictor variables approaches the size of the sample (much more evident above when N = 5 than when N = 100). Our determination of whether the PRE in the sample is evidence that the predictor variables work in the population as a whole needs to take both of these into account.
Putting all of this together, there are two important elements that influence the determination of whether or not the PRE is statistically significant:
1. PRE per parameter added. E.g. a PRE of .3 is more impressive when accomplished by adding 2 parameters than when adding 10.
2. PRE taking into account how many parameters could have been added. The PRE is less impressive as the number of parameters approaches the number of observations.
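The PRE and N demonstration above is easy to reproduce. Below is a rough sketch (my own, in Python with numpy; the exact PRE values will vary with the random seed) that generates random data with N = 5 and shows PRE climbing toward 1.00 as predictors are added, reaching exactly 1.00 once the number of parameters equals N.

```python
# Sketch: with random data and N = 5, PRE approaches 1.00 as the number of
# parameters (intercept + predictors) approaches N.
import numpy as np

rng = np.random.default_rng(0)
n = 5
y = rng.normal(50, 5, n)                      # random criterion scores
X = rng.normal(50, 10, (n, 4))                # four random predictors

sse_c = np.sum((y - y.mean()) ** 2)           # Model C: intercept only

for k in range(1, 5):
    A = np.column_stack([np.ones(n), X[:, :k]])     # intercept + first k predictors
    b, *_ = np.linalg.lstsq(A, y, rcond=None)       # least-squares estimates
    sse_a = np.sum((y - A @ b) ** 2)                # error of Model A
    print(k, "predictors, PRE =", round(1 - sse_a / sse_c, 3))
```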
Testing PRE for Significance
PRE is a statistic: it measures the proportional reduction in error in moving from Model C to Model A, using our estimates of the parameters and applying the models to the scores in our sample. It is influenced both by random error leading to chance correlations between our dependent variable and our predictor variables, and by the artifact of our formulas that causes error to diminish as the number of parameters in our model approaches the number of scores in our sample.
η² is the corresponding parameter: it is the proportional reduction in error in moving from Model C to Model A when the model consists of the actual values of the parameters (not estimates) and is applied to all of the scores in the population (not just the sample). Remembering that in hypothesis testing the hypotheses of interest are always about populations and their parameters, the general hypotheses that fit everything we will be doing this semester are:
H0: η² = 0 (i.e. there is no reduction in error when moving from Model C to Model A in our population)
HA: η² > 0 (i.e. there is a reduction in error when moving from Model C to Model A in our population)
There are three different ways to test whether the PRE in our sample is sufficiently large to conclude that it is worthwhile to move from Model C to Model A. All three approaches will lead to exactly the same conclusion. It would be simpler to settle on one of these and learn only it, but there is an advantage to looking at all three. The names I have given these approaches are:
1) Testing PRE for statistical significance.
2) The 'PRE to F' approach.
3) The 'MS to F' approach.

1) Testing PRE for Statistical Significance
This approach for testing the statistical significance of our PRE is the most straightforward of the three: you simply compare the PRE computed from the sample to a table that gives the critical values of PRE (or, in the case of my statistical tool, you simply get a p value for that PRE). As with the statistics we examined last semester, there is a 'sampling distribution of PRE assuming H0 is true'. If H0 is true then the PRE in our sample should be small; if HA is true then the PRE should be large. What we are interested in is what values of PRE we would consider 'surprising' if H0 were true, i.e. what values of PRE there is only a .05 chance of obtaining in our sample if H0 were true.
On the web site I provide a table that gives the critical values of PRE, and I have also provided a software tool that, when given a value of PRE and the values of PA, PC, and N, will return the PRE critical value (and the F critical value and the p value for that PRE). If the PRE of our sample is equal to or above the PRE critical value then we 'reject H0' and conclude that it would be worthwhile to add the extra parameters of Model A to our model. If the PRE of our sample is less than the PRE critical value then we 'don't reject H0', and say that we don't have enough evidence to conclude that the extra parameters of Model A are worth adding to our model.
Not surprisingly, we can see in the table that the critical value of PRE (and thus our test for the statistical significance of our PRE) depends upon how many more parameters are in Model A than in Model C (i.e. PA - PC) and how close we are getting to the maximum number of parameters we can have (N - PA).
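The table and software tool themselves are on the course web site and are not reproduced here, but the idea behind such a tool is easy to sketch. The Python snippet below is my own illustration (the helper name `pre_test` is mine); it leans on the equivalence between PRE and F that is developed in the next approach, and it takes PRE, PC, PA, and N and returns the critical value of PRE at α = .05 along with the p value.

```python
# Sketch of such a tool: under H0, F* = [PRE/(PA-PC)] / [(1-PRE)/(N-PA)]
# follows an F distribution with (PA-PC, N-PA) degrees of freedom.
from scipy import stats

def pre_test(pre, pc, pa, n, alpha=0.05):
    df_num, df_den = pa - pc, n - pa
    f_star = (pre / df_num) / ((1 - pre) / df_den)
    f_crit = stats.f.ppf(1 - alpha, df_num, df_den)
    pre_crit = (f_crit * df_num) / (f_crit * df_num + df_den)   # F critical converted back to a PRE
    p_value = stats.f.sf(f_star, df_num, df_den)
    return pre_crit, f_crit, p_value

# e.g. PRE = .375 with PC = 4, PA = 6, N = 20 (the numbers used later in the lecture)
print(pre_test(0.375, 4, 6, 20))    # PRE critical ~ .348, F critical ~ 3.74, p ~ .04
```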
[Figure: the sampling distribution of PRE assuming H0 is true]

2) The 'PRE to F' Approach
This approach to testing the statistical significance of our PRE has two advantages over the previous approach: 1) it provides a better understanding of exactly what we are doing when we look to see if our PRE is statistically significant; and 2) it uses F critical tables, which are much easier to find in various textbooks than PRE critical tables.
As has been stated before, the impressiveness of a particular PRE depends upon how many parameters Model A added to accomplish that PRE (it is more impressive to reduce error with fewer parameters than with lots of parameters), and how many parameters we have left that we could add to the model (knowing that the maximum number of parameters we can have is equal to N). The 'PRE to F' approach is a good way to operationalize these concepts. We are going to calculate an F ratio. In our text the symbol F* is used in place of F-obtained, and I have decided to use the text's symbols.
F* = (PRE per parameter added) / (proportion of error remaining per parameter remaining)
Let's look at the numerator first: PRE per parameter added. We divide the PRE resulting from moving from Model C to Model A by the number of additional parameters in Model A (i.e. PA - PC). This gives us the PRE per parameter added; if PRE = .30 and it took two parameters to accomplish that, then the PRE per parameter added would be 0.15.
PRE per parameter added = PRE / (PA - PC)
Now let's look at the denominator: the proportion of error remaining (i.e. 1 - PRE) divided by how many more parameters we could add after Model A (i.e. N - PA).
Proportion of error remaining per parameter remaining = (1 - PRE) / (n - PA)
As we will see, if the PRE resulting from Model A is exactly what we would expect simply by adding two parameters at random, then the value of F* will be equal to one. If, however, the PRE resulting from Model A is more than we would expect just by adding parameters at random, then the value of F* will be greater than one. (By 'adding two parameters at random' I mean adding two predictor variables whose scores are randomly generated and thus should not actually improve the prediction of our dependent variable except by chance.)

Example
We want to test whether the extra parameters of Model A (the 'augmented model') add significantly to our model compared to the fewer parameters of Model C (the 'compact model'), i.e. we are testing to see if the extra parameters of Model A are 'worthwhile'. The question is not whether Model A reduces error (it usually will), but whether it reduces error more than chance would predict.
Basic scenario: let's say we have a set of DATA that consists of 20 observations. The maximum number of parameters we can have in the model is thus 20. Model C has 4 parameters, and the Sum of Squared Error for Model C equals 64. This error can be pictured graphically as 64 squares, the squares together representing the amount of error in SSE(C). As we can have a maximum of 20 parameters and Model C has 4 of those, we are free to use up to 16 more parameters in an augmented Model A. If we use all 16 remaining parameters we should get rid of all of the error that remains after Model C. Thus each remaining parameter should on average reduce error by 1/16 of SSE(C) (a proportional reduction in error of .0625), or in other words each remaining parameter should reduce error by 4 (i.e. 64/16).
Picture that error of 64 divided into 16 pieces, each piece representing 1/16 of the total remaining error, or how much we can expect each additional parameter to reduce error just due to chance. For Model A to be 'worthwhile' it needs to reduce error more than you would get by simply selecting some parameters at random. In other words, to be worthwhile Model A needs to reduce error by more than 4 for every parameter it uses (i.e. its PRE per parameter added should be greater than 1/16, or .0625).
We will examine a Model A that adds two more predictor variables (and thus two more parameters) to the model:
Model C: Ŷi = b0 + b1Xi1 + b2Xi2 + b3Xi3
Model A: Ŷi = b0 + b1Xi1 + b2Xi2 + b3Xi3 + b4Xi4 + b5Xi5
PC = 4, PA = 6, n = 20, SSE(C) = 64, SSE(A) = ?, SSR = ?

Scenario One
In the first scenario Model A reduces error from 64 down to 56.
SSR = SSE(C) – SSE(A) = 64 – 56 = 8
So SSR (the reduction in error when moving from Model C to Model A) is 8. It is obvious that Model A has not accomplished much of anything: it has only reduced the error by as much as you would expect by selecting any two parameters at random. For conceptual clarity, however, we will take a closer look.
Computing PRE. First, let's look at the proportional reduction in error (PRE) accomplished by Model A. To do this we will look at what proportion of the error of Model C was eliminated by moving to Model A.
PRE = SSR / SSE(C) = 8/64 = 0.125
This just goes to show that doing essentially nothing but adding a couple of randomly selected parameters can be expected to lead to some amount of reduction in error. Although we know that the results should not be significant, let's test the worthwhileness of Model A anyway, using the 'PRE to F' method:
F* = (PRE per parameter added) / (proportion of error remaining per parameter remaining)
The top part of the F ratio will be: PRE / (PA - PC) = .125 / 2 = .0625
The bottom part of the F ratio will be: (1 - PRE) / (n - PA) = .875 / 14 = .0625
And so the value of F* is: F* = [PRE / (PA - PC)] / [(1 - PRE) / (n - PA)] = .0625 / .0625 = 1.00
Remember from last semester that when H0 is true the expected value of F* is 1.00 (strictly, it is (df denominator)/(df denominator - 2), for reasons given last semester, but that is close to 1 when the denominator df is large, and '1' makes better conceptual sense), and when H0 is false the expected value of F* is greater than 1.00. If F* were greater than 1.00 we would need to look up the critical value of F in a table, with PA - PC as the 'degrees of freedom numerator' and n - PA as the 'degrees of freedom denominator'. The web site for this class has such a table. In that table it can be seen that in this experiment F* would have to be greater than or equal to 3.74 to be statistically significant. So these results are not statistically significant: Model A does not add significantly to the reduction of error, it does not perform better than simply selecting two parameters at random, and its additional parameters are not a 'worthwhile' addition to the model. Stick with Model C.
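Here is the same Scenario One arithmetic as a short Python sketch (mine, not the text's):

```python
# Sketch: the 'PRE to F' computation for Scenario One.
sse_c, sse_a = 64, 56
pc, pa, n = 4, 6, 20

pre = (sse_c - sse_a) / sse_c                        # 8/64 = 0.125
f_star = (pre / (pa - pc)) / ((1 - pre) / (n - pa))
print(pre, f_star)                                   # 0.125, 1.0  (below the F critical value of 3.74)
```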
Scenario Two
Now let us examine a new and improved Model A. This Model A also adds two parameters to those used in Model C, but this time it reduces the error from 64 down to 40.
SSR = SSE(C) – SSE(A) = 64 – 40 = 24
So SSR (the reduction in error when moving from Model C to Model A) is 24. It is obvious that Model A has accomplished quite a bit by adding only two parameters.
Computing PRE. First, let's look at the proportional reduction in error (PRE) accomplished by Model A. To do this we will look at what proportion of the error of Model C was eliminated by moving to Model A.
PRE = SSR / SSE(C) = 24/64 = 0.375
Using the 'PRE to F' method:
F* = (PRE per parameter added) / (proportion of error remaining per parameter remaining)
The top part of the F ratio will be: PRE / (PA - PC) = .375 / 2 = .1875
The bottom part of the F ratio will be: (1 - PRE) / (n - PA) = .625 / 14 ≈ .0446
And so the value of F* is: F* = [PRE / (PA - PC)] / [(1 - PRE) / (n - PA)] = .1875 / .0446 ≈ 4.2
Again, the F critical table tells us that in this experiment F* would have to be greater than or equal to 3.74 to be statistically significant. So these results are statistically significant: Model A does add significantly to the reduction of error, it does perform better than simply selecting two parameters at random, and its additional parameters are a 'worthwhile' addition to the model. Go with Model A over Model C.

3) The 'Mean Squares to F' Approach
The third way of determining whether or not the PRE in your sample is statistically significant is to transform the SSE's into mean squares and form an F ratio from them. This approach gives you the exact same value of F as the 'PRE to F' method described above. It is offered as a third alternative simply because this is how many statistical packages output their analyses (including SPSS). The 'PRE to F' method, I believe, is conceptually clearer.
SSE(A), SSE(C), and SSR are all various 'sums of squared errors', what we called 'SS' last semester. Remember that a 'mean square' (or 'MS') is a sum of squares divided by its degrees of freedom, and that F* is a ratio of two MS's.
MSR = SSR / (PA - PC), with df = PA - PC
MSE = SSE(A) / (n - PA), with df = n - PA
F* = MSR / MSE, with dfMSR in the numerator and dfMSE in the denominator
Let's look at how the analysis from Scenario Two would be represented in a summary table. Statistical packages, specifically SPSS, use different labels for the SSE's than those used in our text, so I have included both how our textbook would label the SSE's and how SPSS would label them in the table below. Recall that SSR = 24, SSE(A) = 40, SSE(C) = 64, PC = 4, PA = 6, and N = 20.

Source (textbook)   Source (SPSS)     SS    df     MS     F
SSR                 Regression        24     2   12.00   4.2
SSE(A)              Residual          40    14    2.86
SSE(C)              Total             64    16

The table above is slightly misleading in that SPSS insists that Model C always consist of exactly one parameter, so you wouldn't actually be able to obtain the summary table above in an SPSS output (from a Model C with 4 parameters). As you will see, we have to do a little number crunching on our own to get the summary table we want from SPSS.
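A short Python sketch (my own) of the mean-squares route for Scenario Two, confirming it gives the same F as the 'PRE to F' route:

```python
# Sketch: the 'MS to F' computation for Scenario Two.
ssr, sse_a = 24, 40          # SSE(C) = SSR + SSE(A) = 64
pc, pa, n = 4, 6, 20

msr = ssr / (pa - pc)        # 24/2  = 12
mse = sse_a / (n - pa)       # 40/14 ~ 2.86
print(msr / mse)             # ~4.2, the same F* as the 'PRE to F' approach
```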
On Statistical Significance
I'd like to return to the idea from last semester that perhaps the null hypothesis is never (or almost never) true. In the model comparison approach:
H0: η² = 0
HA: η² > 0
Conceptually, however, in the model comparison approach we could agree that everything in the world is related to everything else, and thus any variable we add to the model can legitimately be expected to reduce the error of the model, and thus H0 is never true. What we are using null hypothesis testing for, however, is not to determine whether the independent variable adds to the prediction of our dependent variable, but whether it adds enough to be worthwhile, i.e. whether it adds enough to overcome our preference for parsimony. Perhaps it would be better to say:
H0: η² is pretty darn small, too small to justify adding the parameters to our model.
HA: η² is large enough to justify adding the parameters to our model.
This might just be a semantic difference, but I think not. As you will see as we progress through this approach, we may find that some variables that were worth having in our model when the model was fairly simple get dropped from the model later on, when we arrive at a more elegant model. That doesn't make the original variables magically 'no longer related to the dependent variable'; it simply means we came up with a better model, one that does a better job of explaining the dependent variable. We aren't stopping as soon as we reject H0 and can say that some X variable predicts Y; we are trying to come up with the best possible model of Y.