SIMPLE LINEAR REGRESSION: PREDICTION

For the bivariate linear regression problem, data are collected on an independent or predictor variable (X) and a dependent or criterion variable (Y) for each individual. Bivariate linear regression computes an equation that relates predicted Y scores (Ŷ) to X scores. The regression equation includes a slope weight for the independent variable, Bslope (b), and an additive constant, Bconstant (a):

Ŷ = Bslope X + Bconstant   (or)   Ŷ = bX + a

Indices are computed to assess how accurately the Y scores are predicted by the linear equation.

We will focus on applications in which both the predictor and the criterion are quantitative (continuous, interval/ratio) variables. However, bivariate regression analysis may be used in other applications. For example, a predictor could have two levels, such as gender, scored 0 for females and 1 for males. A criterion may also have two levels, such as pass-fail performance, scored 0 for fail and 1 for pass.

Linear regression can be used to analyze data from experimental or non-experimental designs. If the data are collected using experimental methods (e.g., a tightly controlled study in which participants have been randomly assigned to different treatment groups), the X and Y variables may appropriately be referred to as the independent and dependent variables, respectively. SPSS uses these terms. However, if the data are collected using non-experimental methods (e.g., a study in which subjects are measured on a variety of variables), the X and Y variables are more appropriately referred to as the predictor and the criterion, respectively.

UNDERSTANDING BIVARIATE LINEAR REGRESSION

A significance test can be conducted to evaluate whether X is useful in predicting Y. This test can be conceptualized as evaluating either of the following null hypotheses: the population slope weight is equal to zero, or the population correlation coefficient is equal to zero.
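The prediction equation Ŷ = bX + a introduced above is simple enough to sketch in a few lines of code. The slope and intercept values below are hypothetical, chosen only for illustration:

```python
def predict_y(x, b_slope, b_constant):
    """Predicted criterion score: Y-hat = b*X + a."""
    return b_slope * x + b_constant

# Hypothetical slope (b) and additive constant (a), not from any real data set
b, a = 0.5, 2.0
print(predict_y(10, b, a))  # Y-hat for an individual with X = 10
```

In practice, b and a are estimated from the sample data (as SPSS does in the output shown later), and the same equation is then applied to each individual's X score.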
The significance test can be derived under two alternative sets of assumptions: those for a fixed-effects model and those for a random-effects model. The fixed-effects model is probably more appropriate for experimental studies, while the random-effects model seems more appropriate for non-experimental studies. If the fixed-effects assumptions hold, linear or nonlinear relationships can exist between the predictor and criterion. On the other hand, if the random-effects assumptions hold, the only type of statistical relationship that can exist between the two variables is a linear one.

Regardless of the choice of assumptions, it is important to examine a bivariate scatterplot of the predictor and criterion variables prior to conducting a regression analysis, both to assess whether a nonlinear relationship exists between X and Y and to detect outliers. If the relationship appears to be nonlinear based on the scatterplot, you should not conduct a simple bivariate regression analysis but should evaluate the inclusion of higher-order terms (variables that are squared, cubed, and so on) in your regression equation. Outliers should be checked to ensure that they were not incorrectly entered in the data set and, if correctly entered, to determine their effect on the results of the regression analysis.

FIXED-EFFECTS MODEL ASSUMPTIONS FOR BIVARIATE LINEAR REGRESSION

Assumption 1: The Dependent Variable Is Normally Distributed in the Population for Each Level of the Independent Variable

In many applications with a moderate or larger sample size, the test of the slope may yield reasonably accurate p values even when the normality assumption is violated. To the extent that population distributions are not normal and sample sizes are small, the p values may be invalid. In addition, the power of this test may be reduced if the population distributions are non-normal.
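The advice above about higher-order terms can be illustrated numerically: if Y is actually a curved function of X, a straight line leaves much more residual error than a fit that includes a squared term. A minimal sketch using NumPy's polynomial fitting on simulated (hypothetical) data with a quadratic trend:

```python
import numpy as np

# Simulated data with a curved (quadratic) trend plus small errors
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = x**2 + np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.2])

# Fit a straight line (degree 1) and a quadratic (degree 2),
# then compare residual sums of squares
linear = np.polyfit(x, y, 1)
quadratic = np.polyfit(x, y, 2)
rss_linear = np.sum((y - np.polyval(linear, x)) ** 2)
rss_quad = np.sum((y - np.polyval(quadratic, x)) ** 2)
print(rss_linear > rss_quad)  # adding the squared term sharply reduces error here
```

This is the same diagnostic logic you apply visually with the scatterplot: a simple bivariate regression is appropriate only when the linear fit leaves no systematic pattern in the residuals.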
Assumption 2: The Population Variances of the Dependent Variable Are the Same for All Levels of the Independent Variable

To the extent that this assumption is violated and the sample sizes differ among the levels of the independent variable, the resulting p value for the overall F test is not trustworthy.

Assumption 3: The Cases Represent a Random Sample from the Population, and the Scores Are Independent of Each Other from One Individual to the Next

The significance test for regression analysis will yield inaccurate p values if the independence assumption is violated.

RANDOM-EFFECTS MODEL ASSUMPTIONS FOR BIVARIATE LINEAR REGRESSION

Assumption 1: The X and Y Variables Are Bivariately Normally Distributed in the Population

If the variables are bivariately normally distributed, each variable is normally distributed ignoring the other variable, and each variable is normally distributed at every level of the other variable. The significance test for bivariate regression yields, in most cases, relatively valid results in terms of Type I errors when the sample is moderate to large in size. If X and Y are bivariately normally distributed, the only type of relationship that can exist between these variables is linear.

Assumption 2: The Cases Represent a Random Sample from the Population, and the Scores on Each Variable Are Independent of Other Scores on the Same Variable

The significance test for regression analysis will yield inaccurate p values if the independence assumption is violated.

REGRESSION PAGE - 2

EFFECT SIZE STATISTICS FOR BIVARIATE LINEAR REGRESSION

Linear regression is a more general procedure that assesses how well one or more independent variables predict a dependent variable. Consequently, SPSS reports strength-of-relationship statistics that are useful for regression analyses with multiple predictors.
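The normality assumptions described above can be screened informally before the analysis. One common sketch (not part of the SPSS workflow shown in this handout) is to apply a Shapiro-Wilk test to each variable separately; the data here are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)    # simulated predictor scores
y = 0.5 * x + rng.normal(scale=5, size=200)   # simulated criterion scores

# Shapiro-Wilk test of normality for each variable separately;
# a small p value suggests departure from normality
for name, v in [("X", x), ("Y", y)]:
    stat, p = stats.shapiro(v)
    print(name, round(stat, 3), p > .05)
```

Marginal normality of X and Y does not by itself establish bivariate normality, so this check is a screen, not a proof; a scatterplot remains the primary diagnostic.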
Four correlational indices are presented in the output for the Linear Regression procedure: the Pearson product-moment correlation coefficient (r), the multiple correlation coefficient (R), its squared value (R2), and the adjusted R2. However, there is considerable redundancy among these statistics for the single-predictor case: R = |r|, R2 = r2, and the adjusted R2 is approximately equal to R2. Accordingly, the only correlational indices we need to report in our manuscript for a bivariate regression are r and r2.

The Pearson product-moment correlation coefficient ranges in value from -1.00 to +1.00. A positive value suggests that as the independent variable X increases, the dependent variable Y increases. A zero value indicates that as X increases, Y neither increases nor decreases. A negative value indicates that as X increases, Y decreases. Values close to -1.00 or +1.00 indicate stronger linear relationships. The interpretation of strength of relationship should depend on the research context.

By squaring r, we obtain an index that directly tells us how well we can predict Y from X: r2 indicates the proportion of Y variance that is accounted for by its linear relationship with X. Alternatively, r2 (the coefficient of determination) can be conceptualized as the proportional reduction in error that we achieve by including X in the regression equation in comparison with not including X in the regression equation.

Other strength-of-relationship indices may be reported for bivariate regression problems. For example, SPSS gives the Standard Error of the Estimate on the output. The standard error of estimate indicates how large the typical error is in predicting Y from X. It is a useful index over and above correlational indices because it indicates how badly we predict the dependent variable scores in the metric of those scores. In comparison, correlational statistics are unit-less indices and, therefore, are abstract and difficult to interpret.
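These indices can also be computed outside SPSS, which makes their definitions concrete. A sketch using scipy.stats.linregress on a small hypothetical data set (the X and Y scores below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical predictor (X) and criterion (Y) scores
x = np.array([2., 4., 5., 7., 8., 10., 11., 13.])
y = np.array([5., 7., 9., 10., 12., 14., 15., 18.])

res = stats.linregress(x, y)
r = res.rvalue          # Pearson product-moment correlation
r2 = r ** 2             # proportion of Y variance accounted for by X

# Standard error of estimate: typical size of a prediction error,
# expressed in the metric of Y (denominator n - 2 for one predictor)
y_hat = res.slope * x + res.intercept
see = np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - 2))
print(round(r, 3), round(r2, 3), round(see, 3))
```

Note that r and r2 are unit-less, while the standard error of estimate carries the units of Y, which is exactly why the handout recommends reporting it alongside the correlational indices.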
CONDUCTING A BIVARIATE LINEAR REGRESSION ANALYSIS

1. Open the data file.
2. Click Analyze, click Regression, then click Linear. You will see the Linear Regression dialog box.
3. Select your dependent variable, then click ► to move it to the Dependent box. (For this example, MATHACH was chosen.)
4. Select your independent variable, then click ► to move it to the Independent box. (For this example, VISUAL was chosen.)
5. Click Statistics. You will see the Linear Regression: Statistics dialog box.
6. Click Confidence intervals and Descriptives. Make sure that Estimates and Model fit are also selected.
7. Click Continue.
8a. (For total sample information) Click OK. For this example, we will look at the total sample information.
8b. (For group information) Click Paste. Make the necessary adjustments to your syntax (i.e., a temporary/select if command), then run the analysis.

SELECTED SPSS OUTPUT FOR BIVARIATE LINEAR REGRESSION

The results of the bivariate linear regression analysis example are shown below. The B's, as labeled on the output in the Unstandardized Coefficients box, are the additive constant, a (8.853), and the slope weight, b (.745), of the regression equation used to predict the dependent variable from the independent variable. The regression or prediction equation is as follows:

Ŷ = Bslope X + Bconstant   (or)   Ŷ = bX + a

Predicted Mathematics Test Score = .745 (Visualization Test Score) + 8.853

Syntax:

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT mathach
  /METHOD=ENTER visual .

Descriptive Statistics

                                   Mean       Std. Deviation   N
MATHACH Mathematics Test Score     13.09803   6.604590         500
VISUAL Visualization Test          5.69700    3.886535         500

Correlations

                                                   MATHACH Mathematics   VISUAL Visualization
                                                   Test Score            Test
Pearson Correlation
  MATHACH Mathematics Test Score                   1.000                 .438
  VISUAL Visualization Test                        .438                  1.000
Sig. (1-tailed)
  MATHACH Mathematics Test Score                   .                     .000
  VISUAL Visualization Test                        .000                  .
N
  MATHACH Mathematics Test Score                   500                   500
  VISUAL Visualization Test                        500                   500

Variables Entered/Removed(b)

Model   Variables Entered               Variables Removed   Method
1       VISUAL Visualization Test(a)    .                   Enter

a. All requested variables entered.
b. Dependent Variable: MATHACH Mathematics Test Score

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .438(a)   .192       .191                5.941865

a. Predictors: (Constant), VISUAL Visualization Test

ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    4184.416         1     4184.416      118.519   .000(a)
Residual      17582.267        498   35.306
Total         21766.682        499

a. Predictors: (Constant), VISUAL Visualization Test
b. Dependent Variable: MATHACH Mathematics Test Score

Coefficients(a)

Model 1                     Unstandardized B   Std. Error   Standardized Beta   t        Sig.   95% CI for B (Lower, Upper)
(Constant)                  8.853              .472                             18.763   .000   (7.926, 9.780)
VISUAL Visualization Test   .745               .068         .438                10.887   .000   (.611, .880)

a. Dependent Variable: MATHACH Mathematics Test Score

Based on the magnitude of the correlation coefficient, we can conclude that the visualization test is moderately related to the mathematics test (r = .438). Approximately 19% (r2 = .192) of the variance of the mathematics test is associated with the visualization test.

The hypothesis test of interest evaluates whether the independent variable predicts the dependent variable in the population. More specifically, it assesses whether the population correlation coefficient is equal to zero or, equivalently, whether the population slope is equal to zero. This significance test appears in two places for a bivariate regression analysis: the F test reported as part of the ANOVA table and the t test associated with the independent variable in the Coefficients table.
They yield the same p value because they are identical tests: F(1, 498) = 118.519, p < .001, and t(498) = 10.887, p < .001. In addition, the fact that the 95% confidence interval for the slope does not contain the value of zero indicates that the null hypothesis should be rejected at the .05 level.

USING SPSS GRAPHS TO DISPLAY THE RESULTS

A variety of graphs have been suggested for interpreting linear regression results. The results of the bivariate regression analysis can be summarized using a bivariate scatterplot. Conduct the following steps to create a simple bivariate scatterplot for our example:

1. Click Graphs (on the menu bar), then click Scatter.
2. Click Simple, then click Define.
3. Click the dependent (criterion) variable and click ► to move it to the Y Axis box.
4. Click the independent (predictor) variable and click ► to move it to the X Axis box.
5. Click OK.

Once you have created a scatterplot showing the relationship between the two variables, you can add a regression line by following these steps:

1. Double-click on the chart to select it for editing, and maximize the chart editor.
2. Click Chart from the menu at the top of the window in the chart editor, then click Options.
3. Click Total in the Fit Line box.
4. Click OK, then close the Chart1 – SPSS Chart Editor.

Your scatterplot would look like the one below:

[Scatterplot: Mathematics Test Score (Y axis, approximately -10 to 30) versus Visualization Test score (X axis, approximately -10 to 20), with the fitted regression line.]

An examination of the plot allows us to assess how accurately the regression equation predicts the dependent variable scores. In this case, the equation offers some predictability, but many points fall far off the line, indicating poor prediction for those points.
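The identity between the F and t tests noted above can be verified numerically: with a single predictor, F = t², and both statistics can be recovered from r² and the error degrees of freedom via F = r²·df_error / (1 − r²). A sketch using the values from the output (r² = .192, df_error = 498; small discrepancies from the printed statistics reflect rounding of r²):

```python
import math

r2, df_error = .192, 498  # values taken from the SPSS output above

# For one predictor: F = r^2 * df_error / (1 - r^2), and t = sqrt(F)
f = r2 * df_error / (1 - r2)
t = math.sqrt(f)
print(round(f, 1), round(t, 2))  # close to the reported F = 118.519 and t = 10.887
```

This relationship is worth knowing when checking output by hand: if the F in the ANOVA table and the squared t for the single predictor disagree, something in the analysis specification is off.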