Solving Linear Regression Problems as a General Linear Model

Homework problems are multiple answer rather than multiple choice. The format for multiple answer questions is shown in the example below. The directions for the problems instruct you to mark the check boxes for all of the statements that are true. One or more answers must be marked for each problem.

Full or partial credit is computed for each question. To receive full credit, you must mark all of the correct answers and not mark any of the incorrect answers. Partial credit is computed by summing the points for each correct response and subtracting points for each incorrect answer. If the computation for partial credit results in a negative number, zero credit is assigned.

Level of Measurement Requirement and Sample Size Requirement

Multiple regression requires that the dependent variable be interval and the independent variables be interval or dichotomous. If one of the variables is ordinal level, we will follow the common convention of treating ordinal variables as interval level, but we should consider noting the use of an ordinal variable as a limitation to our findings.

These problems use the rule of thumb from Tabachnick and Fidell that the required number of cases should be the larger of (the number of independent variables x 8) + 50 or the number of independent variables + 105. If the sample size requirement (along with the level of measurement requirement) is satisfied, the check box “The level of measurement requirement and the sample size requirement are satisfied” should be marked.

In many of the problems we have worked, failing to meet the sample size requirement implied an inappropriate application of the statistic, and we halted all further work on the problem. We will not apply that policy to these problems. If our sample size is less than the minimum requirement, we leave the check box unmarked and continue with the problem, mentioning the sample size issue as a limitation of the analysis.
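The sample size rule of thumb above can be written as a short calculation. The following is a minimal Python sketch, not part of the SPSS workflow; the function names required_cases and sample_size_ok are hypothetical and chosen only for illustration.

```python
def required_cases(n_predictors: int) -> int:
    """Minimum number of cases under the rule of thumb described above:
    the larger of (8 x predictors + 50) and (predictors + 105)."""
    if n_predictors < 1:
        raise ValueError("at least one independent variable is required")
    return max(8 * n_predictors + 50, n_predictors + 105)


def sample_size_ok(n_valid_cases: int, n_predictors: int) -> bool:
    """True when the number of valid cases meets the rule of thumb."""
    return n_valid_cases >= required_cases(n_predictors)


# With two independent variables: 8 * 2 + 50 = 66 and 2 + 105 = 107.
print(required_cases(2))  # 107
```

For small numbers of predictors the second formula (predictors + 105) sets the requirement; the first formula only becomes the larger one when there are eight or more predictors.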
The Assumption of Normality

Regression assumes that the residuals are normally distributed. We will meet this assumption if each of the interval variables is normally distributed, but there is general consensus that violations of this assumption do not seriously affect the probabilities needed for statistical decision making, especially when the sample size is large.

The problems evaluate normality based on the criteria that the skewness and kurtosis of each variable fall within the range from -1.0 to +1.0. If the variables satisfy these criteria for skewness and kurtosis, the check box “The skewness and kurtosis of the variables satisfy the assumption of normality” should be marked. If the criteria for normality are not satisfied, the check box should remain unmarked and we should consider including a statement about the violation of this assumption in the discussion of our results. In these problems we will not test transformations or consider removing outliers to improve the normality of the variables.

The Assumption of Homoscedasticity

Regression assumes that the variance of the residuals is homogeneous across predicted values of the dependent variable. SPSS does not compute Levene’s test for equality of variance when all of the variables are interval (or ordinal treated as interval). The check box “The regression analysis satisfies the assumption of homoscedasticity” will remain unchecked for these problems.

The Assumption of Linearity

The assumption of linearity is tested with the lack of fit test in the Univariate General Linear Model procedure. If the test is significant, it implies that there is a non-linear component that should be added to the model. If the test is not significant, we assume that a linear model is an adequate representation of the relationship between the dependent and independent variables.
If the lack of fit test is not significant at the alpha level for diagnostic statistics, the check box “The regression analysis satisfies the assumption of linearity” is marked.

The Assumption of Independence of Errors

SPSS does not compute the Durbin-Watson statistic in the Univariate General Linear Model procedure. In these problems, we will acknowledge that fact and not mark the check box “The regression analysis satisfies the assumption of independence of errors”.

The Assumption of Independence of Variables

SPSS does not compute tolerance or VIF in the Univariate General Linear Model procedure. In these problems, we will acknowledge that fact and not mark the check box “The regression analysis satisfies the assumption of independence of variables”.

I have included the complete list of assumptions in the list of possible answers even though some will never be marked in this assignment because of limitations in the univariate general linear model procedure. In the future, we will develop a strategy for testing all of the assumptions.

Interpretation of the Overall Relationship

The presence of an overall relationship between the dependent variable and the independent variables is represented by the statement that both predictors together have a relationship to the dependent variable. If the ANOVA test of the overall relationship (“Corrected Model” in the table of “Tests of Between-Subjects Effects”) is not statistically significant, this statement is not marked as a correct finding. If the overall relationship is not statistically significant, we will not interpret the individual relationships.

If the overall relationship is statistically significant, we should examine the adjective describing the strength of the relationship. SPSS computes partial eta squared as a measure of effect size.
We characterize the effect size as trivial, small, moderate, or large, applying Cohen's criteria (less than .01 = trivial; .01 up to .06 = small; .06 up to .14 = moderate; .14 or greater = large). If the adjective describing the strength of the relationship is not correct, the check box for the overall relationship is not marked.

Interpretation of the Individual Relationships

Determining the correctness of statements about individual relationships is a two-stage process. First, the relationship must be statistically significant (the test of the slope in the table of “Parameter Estimates”). Second, the statement must be a correct interpretation of the direction of the relationship with the dependent variable.

The problems also contain statements about which predictor was more important or had the greater impact. This is based on the magnitude of the partial eta squared statistic for each independent variable, provided the variable has a statistically significant individual relationship.

Inappropriate application of the statistic

The only limitation to the use of regression imposed on these problems is that we should not use regression if we violate the level of measurement requirement.

Solving Problems in SPSS

We will demonstrate the use of SPSS to compute a regression analysis with the general linear model procedure using this problem. The introductory statement identifies the variables for the analysis and the significance levels to use. Note that we use a more conservative alpha (.01) for diagnostic statistics than we do for the statistics that answer our research questions.

Level of Measurement – 1

The first statement in the problem asks about level of measurement and sample size. Multiple regression requires that the dependent variable be interval and the independent variables be interval or dichotomous. In these problems, we will limit our analysis to interval independent variables.
Level of Measurement – 2

To determine the level of measurement, we examine the information about variables in the SPSS data editor, specifically the values and value labels. "Occupational prestige score" [prestg80] is interval level, satisfying the requirement for the dependent variable. "Age" [age] is interval level, satisfying the requirement for the independent variable. "Highest year of school completed" [educ] is interval level, satisfying the requirement for the independent variable.

Using Univariate General Linear Model for Linear Regression – 1

To check for compliance with sample size requirements, we run the univariate general linear model procedure. This procedure will give us the correct number of cases used in the analysis, taking into account missing data for all of the variables in the analysis. Select General Linear Model > Univariate from the Analyze menu.

Using Univariate General Linear Model for Linear Regression – 2

First, move prestg80 to the Dependent Variable text box.
Second, move educ and age to the Covariate(s) list box. Interval level variables are treated as covariates rather than factors in the general linear model.
Third, click on the Options button.

Using Univariate General Linear Model for Linear Regression – 3

In the Options dialog box, we mark the statistics we want to include in the output.
First, mark the check boxes for:
  Descriptive statistics
  Estimates of effect size
  Parameter estimates
  Lack of fit
Second, since this is the only output we need for now, click on the Continue button.

Using Univariate General Linear Model for Linear Regression – 4

We have finished entering the specifications we need for our analysis. Click on the OK button to obtain the output.

Descriptive Statistics from the Univariate General Linear Model – 1

The table of Descriptive Statistics contains the number of cases used in the analysis.
Using the rule of thumb from Tabachnick and Fidell that the required number of cases should be the larger of (the number of independent variables x 8) + 50 or the number of independent variables + 105, multiple regression with two independent variables requires 107 cases (the larger of 2 x 8 + 50 = 66 and 2 + 105 = 107). With 254 valid cases, the sample size requirement is satisfied.

Marking the Statement for the Level of Measurement and Sample Size Requirement

Since we satisfied both the level of measurement and the sample size requirements for analysis, we mark the first check box for the problem.

The Assumption of Normality

The next statement in the problem focuses on the assumption of normality, using the skewness and kurtosis criteria that both statistical values should be between -1.0 and +1.0.

Computing Skewness and Kurtosis to Test for Normality – 1

Skewness and kurtosis are calculated in several procedures. We will use Descriptive Statistics. Select Descriptive Statistics > Descriptives from the Analyze menu.

Computing Skewness and Kurtosis to Test for Normality – 2

We add the variables whose normality we are concerned about.
First, move the variables prestg80, age, and educ to the Variable(s) list box.
Second, click on the Options button to specify the statistics we want computed.

Computing Skewness and Kurtosis to Test for Normality – 3

Kurtosis and Skewness are not selected by default, so we mark their check boxes.
First, mark the check boxes for Kurtosis and Skewness.
Second, click on the Continue button to close the dialog box.

Computing Skewness and Kurtosis to Test for Normality – 4

We have finished entering the specifications we need for the evaluation of normality. Click on the OK button to obtain the output.

Evaluating the Assumption of Normality – 1

"Occupational prestige score" [prestg80] satisfied the criteria for a normal distribution. The skewness of the distribution (.401) was between -1.0 and +1.0 and the kurtosis of the distribution (-.630) was between -1.0 and +1.0.

"Age" [age] satisfied the criteria for a normal distribution. The skewness of the distribution (.595) was between -1.0 and +1.0 and the kurtosis of the distribution (-.351) was between -1.0 and +1.0.

Evaluating the Assumption of Normality – 2

"Highest year of school completed" [educ] did not satisfy the criteria for a normal distribution. The skewness of the distribution (-.137) was between -1.0 and +1.0, but the kurtosis of the distribution (1.246) fell outside the range from -1.0 to +1.0.

The variable highest year of school completed violates the assumption of normality. We should either test transformations and consider removing outliers, or include the violation in the limitations for the analysis.

Marking the Statement for the Assumption of Normality

Since the assumption of normality is not satisfied for all variables, the check box is not marked.

The Assumption of Homoscedasticity

The next statement in the problem focuses on the assumption of homoscedasticity. The Univariate General Linear Model only computes the test of homogeneity of variance for categorical variables, so this check box will not be marked.

The Assumption of Linearity

The next statement in the problem focuses on the assumption of linearity. The Univariate General Linear Model computes a lack of fit test that we can use for this assumption.

The Test of Linearity

In the lack of fit test, the probability of the test statistic (F(187, 64) = 1.12, p = .301) was greater than the diagnostic alpha of .01. The null hypothesis that "a linear regression model is appropriate" is not rejected. A linear model is an adequate representation of the relationship among these variables.

Marking the Statement for the Assumption of Linearity

Since the lack of fit test supported the appropriateness of a linear relationship, we mark the check box.

The Assumption of Independence of Errors

The next statement in the problem focuses on the independence of errors.
Since the Univariate General Linear Model does not compute the Durbin-Watson statistic, this check box is not marked.

The Assumption of Independence of Variables

The next statement in the problem focuses on the independence of variables. Since the Univariate General Linear Model does not compute tolerance or VIF values, this check box is not marked.

The Overall Relationship between the Dependent and Independent Variables

The next statement in the problem focuses on the overall relationship: its existence and strength.

Significance and Strength of the Overall Relationship – 1

The overall relationship between the independent variables "age" [age] and "highest year of school completed" [educ] and the dependent variable "occupational prestige score" [prestg80] was statistically significant, F(2, 251) = 46.167, p < .001, partial eta squared = 0.27. We reject the null hypothesis that all of the partial slopes (b coefficients) = 0 and conclude that at least one of the partial slopes (b coefficients) is not equal to 0.

Significance and Strength of the Overall Relationship – 2

If the F-test for the overall relationship had not been statistically significant, we would not interpret the effect size. Applying the criteria for interpreting eta squared (less than .01 = trivial; .01 up to .06 = small; .06 up to .14 = moderate; .14 or greater = large), the partial eta squared value of 0.27 was correctly interpreted as a strong (large) effect.

Marking the Statement for the Overall Relationship

Since the overall relationship was statistically significant and the effect size was correctly interpreted, the check box is marked.

Relationships between the Dependent Variable and Each Individual Independent Variable

The next two statements are possible interpretations of the individual relationships, both in terms of significance and direction of the relationship with the dependent variable.
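As a cross-check on the effect size reported for the overall relationship, partial eta squared can be recovered from the F statistic and its degrees of freedom, since partial eta squared = (F x df effect) / (F x df effect + df error). The Python sketch below applies that identity and the effect size criteria quoted in this handout. It is an illustration only: SPSS reports the value directly, and the function names are hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def effect_size_label(eta_sq: float) -> str:
    """Adjective for an effect size, using the criteria quoted in this handout."""
    if eta_sq < 0.01:
        return "trivial"
    if eta_sq < 0.06:
        return "small"
    if eta_sq < 0.14:
        return "moderate"
    return "large"


# Overall relationship in the worked problem: F(2, 251) = 46.167
eta = partial_eta_squared(46.167, 2, 251)
print(round(eta, 2), effect_size_label(eta))  # 0.27 large
```

The result agrees with the partial eta squared of 0.27 shown in the SPSS output for the corrected model.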
Interpreting the Individual Relationships – 1

The t-test of the slope coefficient for age (b = .13, t(251) = 2.88, p = .004) is statistically significant at alpha = .05. The null hypothesis that the slope = 0 is rejected. The positive sign of the slope indicates a direct relationship. The statement that "survey respondents who were older had more prestigious occupations" is supported.

Interpreting the Individual Relationships – 2

The t-test of the slope coefficient for highest year of school completed (b = 2.55, t(251) = 9.60, p < .001) is statistically significant at alpha = .05. The null hypothesis that the slope = 0 is rejected. The positive sign of the slope indicates a direct relationship. The statement that "survey respondents who completed more years of school had more prestigious occupations" is supported.

Interpreting the Individual Relationships – 3

Since both individual relationships were statistically significant and correctly interpreted, both check boxes are marked.

Relative Importance of Predictors

The next two statements identify each independent variable as being the most important in predicting values of the dependent variable.

Interpreting the Relative Importance of Predictors

Highest year of school completed was the most influential predictor because its partial eta squared (.27) was larger than the partial eta squared for age (.03).

Marking the Relative Importance of Predictors

Since highest year of school completed was the most influential predictor, its check box is marked. We have now finished all of the statements for this problem.

The Problem Graded in BlackBoard

When this assignment was submitted, BlackBoard indicated that all marked answers for this problem were correct, and we received full credit for the question.

Logic Diagram for Linear Regression Problems – 1

Level of measurement ok?
  No: Do not mark check box. Mark: Inappropriate application of the statistic. Stop.
  Yes: Continue.
Ordinal dv?
  Yes: Consider limitation in discussion of findings.
Sample size ok?
  No: Consider limitation in discussion of findings.
  Yes: Mark check box for correct answer.
Normality ok? (skewness and kurtosis +/-1)
  No: Consider limitation in discussion of findings.
  Yes: Mark check box for normality assumption.
Assumption of Homoscedasticity (not tested)
Assumption of Linearity (lack of fit test Sig > α)? Use α for diagnostic statistics.
  No: Consider limitation in discussion of findings.
  Yes: Mark check box for linearity assumption.
Assumption of Independence of Errors (not tested)

Logic Diagram for Linear Regression Problems – 2

Assumption of Independence of Variables (not tested)
Overall relationship (F-test Sig ≤ α)? Use α for statistical tests.
  No: Do not mark check box for overall relationship. Stop.
  Yes: Continue.
Correct adjective used to describe effect size?
  No: Do not mark check box for overall relationship.
  Yes: Mark check box for overall relationship.
Individual relationship (t-test Sig ≤ α)? Use α for statistical tests.
  No: Do not mark check box for individual relationship.
  Yes: Continue.
Correct interpretation of direction of relationship?
  No: Do not mark check box for individual relationship.
  Yes: Mark check box for individual relationship.
Relative importance correctly identified?
  No: Do not mark check box for relative importance.
  Yes: Mark check box for relative importance.
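The decision sequence in the logic diagrams can also be sketched as a short Python function. This is an illustrative reading of the diagrams, not a grading tool: the Predictor class, the mark_boxes function, and the check box labels are hypothetical names, and the values passed in at the end are the ones reported in the worked problem (with p < .001 entered as the upper bound 0.001).

```python
from dataclasses import dataclass


@dataclass
class Predictor:
    name: str
    p_value: float           # Sig. of the t-test of the slope
    slope: float             # b coefficient
    claimed_direction: str   # "direct" or "inverse", as worded in the problem
    partial_eta_sq: float


def mark_boxes(level_ok, n_cases, n_required, skew_kurt, lack_of_fit_p,
               overall_p, adjective_correct, predictors,
               alpha=0.05, diagnostic_alpha=0.01):
    """Return the list of check boxes to mark, following the logic diagrams."""
    if not level_ok:
        return ["inappropriate application of the statistic"]
    marked = []
    if n_cases >= n_required:
        marked.append("level of measurement and sample size")
    if all(-1.0 <= value <= 1.0 for value in skew_kurt):
        marked.append("normality")
    # Homoscedasticity, independence of errors, and independence of
    # variables are not tested in these problems, so they are never marked.
    if lack_of_fit_p > diagnostic_alpha:
        marked.append("linearity")
    if overall_p > alpha:
        return marked  # overall relationship not significant: stop
    if adjective_correct:
        marked.append("overall relationship")
    significant = [p for p in predictors if p.p_value <= alpha]
    for p in significant:
        actual = "direct" if p.slope > 0 else "inverse"
        if actual == p.claimed_direction:
            marked.append(f"individual relationship: {p.name}")
    if significant:
        top = max(significant, key=lambda p: p.partial_eta_sq)
        marked.append(f"most important predictor: {top.name}")
    return marked


result = mark_boxes(
    level_ok=True, n_cases=254, n_required=107,
    skew_kurt=[0.401, -0.630, 0.595, -0.351, -0.137, 1.246],
    lack_of_fit_p=0.301, overall_p=0.001, adjective_correct=True,
    predictors=[Predictor("age", 0.004, 0.13, "direct", 0.03),
                Predictor("educ", 0.001, 2.55, "direct", 0.27)],
)
print(result)
```

With these inputs the normality box is left unmarked because the kurtosis of educ (1.246) falls outside the range, which matches the worked solution.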