Linear Regression (Bivariate)

Creating a Model to Predict an Outcome [1]
If we wanted to predict the win/loss record for each team in a league, what variables would we consider? A full answer would require a very complex model, because many variables are involved. For starters, we will look at a two-variable model with one predictor variable and one outcome variable: for example, using the past performance of the quarterbacks to predict the teams' performances.

Creating a Model [2]
How would you operationalize quarterback performance as a single interval-ratio predictor variable? How would you operationalize a team's win/loss record as an interval-ratio variable?

What is the Question?
- Are these two variables related?
- Does knowing the distribution of the predictor variable (IV) allow us to estimate values of the outcome variable (DV)?
- Can we write the equation of a line to represent the relationship?
- Are these estimates any better than just guessing the mean of the DV distribution?

What Type of Variables?
Both variables should be at the interval-ratio level of measurement. Examples?

Examples: Two-Variable Questions
- Can we estimate individuals' weights if we know their heights?
- Can we estimate how long it takes a person to get to class (time) if we know how far away they live (distance)?
- If we know students' high school GPAs, can we estimate (or predict) their college GPAs?

Examples: Places or Organizations as the UNITS OF ANALYSIS (cases)
- Can we estimate countries' infant mortality rates if we know the number of physicians per 1,000 people?
- Are female literacy rates related to male life expectancies, for countries?
- Are cities' unemployment rates related to their homicide rates?
- Is the number of books in the libraries of various colleges a good predictor of the incomes of those colleges' alumni?

First Step: Univariate Analysis
Look at the distribution of each variable separately. Use histograms, boxplots, descriptives, and other SPSS/PASW functions (e.g., Analyze > Descriptive Statistics > Explore). A very skewed or otherwise "non-normal" variable may not be suitable for linear regression.

Second Step: The Scatterplot
Put the IV (predictor variable) on the x-axis and the DV (outcome variable) on the y-axis. Each point is a case located by its X score and its Y score (an ordered pair). Does the cloud of points look "sort of" linear? Add the fit line to the chart. What does it mean for a point (case) to lie above or below the line?

Direction of the Relationship: Positive and Negative Slopes
A positive relationship shows up as a line with positive slope, running from lower left to upper right. A negative or inverse relationship shows up as a line with negative slope, running from upper left to lower right. Whether a relationship is positive or negative is apparent from the sign of R, the correlation coefficient.

Pearson's R: The Correlation Coefficient [1]
If the plot looks "sort of linear," find R (Pearson's correlation coefficient), which ranges from –1 to +1:
- 0 means no relationship.
- –1 means a perfect negative or inverse relationship.
- +1 means a perfect positive relationship.

Correlation Coefficient [2]
The correlation coefficient is NOT a percentage or proportion. R = ∑(ZxZy) / N. R expresses both the strength and the direction of the relationship.

The Direction of the Relationship: Positive
The maximum, +1, is reached if, for every case, Zx = Zy. In a positive relationship, positive Z-scores are multiplied by positive Z-scores for a positive product, and negative Z-scores are multiplied by negative Z-scores, also for a positive product. An example is heights and weights.
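To make the Z-score formula concrete, here is a minimal sketch in Python (the slides themselves use SPSS/PASW, and the height/weight values below are invented for illustration). It computes R as the sum of the products of paired Z-scores divided by N:

    import numpy as np

    # Illustrative data: heights (inches) and weights (pounds) for ten cases.
    heights = np.array([60, 62, 63, 65, 66, 68, 69, 70, 72, 74])
    weights = np.array([115, 120, 130, 138, 142, 155, 160, 165, 180, 190])

    # Convert each distribution to Z-scores. The population standard
    # deviation (ddof=0) is used so that dividing by N reproduces Pearson's R.
    zx = (heights - heights.mean()) / heights.std(ddof=0)
    zy = (weights - weights.mean()) / weights.std(ddof=0)

    r = (zx * zy).sum() / len(heights)   # R = sum(Zx * Zy) / N
    print(round(r, 3))                   # matches np.corrcoef(heights, weights)[0, 1]

If the Z-scores were computed with the sample standard deviation instead, the sum of products would be divided by N – 1; either way, the result is the same R.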
The Direction of the Relationship: Negative
The minimum, –1, is obtained when, for each case, Zx = –Zy. In a negative relationship, each product involves multiplying two Z-scores with opposite signs (one negative, one positive), so the product is negative. An example from the country data set is literacy rate and infant mortality rate.

Interpreting R
Not all texts agree on how to interpret the strength of R. See Garner (2010, p. 173) for a commonly used interpretation.

R²: The Coefficient of Determination
Square R to obtain R², which is called the coefficient of determination. R² ranges from 0 to 1 and can be read as a proportion. (Or move the decimal point two places to the right and read it as a percent.) It reveals what proportion of the variation in the outcome variable was predicted by the predictor variable.

R² is a Proportion between Variances
Three variances:
- Total variance (differences between the mean and the observed Y scores).
- Explained, or regression, variance (differences between the mean and the estimated Y′ scores).
- Unexplained residual or error variance (differences between the estimated Y′ and observed Y scores).
R² is the ratio of explained (regression) variance to total variance. See Figure 19 in Garner (2010, p. 177).

Why is R² So Important?
If R² is 0, the predictor variable is worthless as a predictor: our best estimate of the outcome (DV) variable remains the mean of the DV. In the scatterplot, this situation would appear as a flat total fit line at the mean of the Y distribution. A large R² means that the linear model provides good estimates of the dependent variable, better than guessing the mean.

Why is There an ANOVA in My Regression Analysis?
The ANOVA "box" in the middle of the regression output is an F-test for the significance of R². It tests the null hypothesis that the predictor variable does NOT predict any of the variation we found in the outcome variable's distribution.

R² is a Proportion
R² = regression variance / total variance. It answers these questions: How good is my model (the regression line) for predicting the distribution of the dependent (outcome) variable? How close are the observed Ys to the Y′s estimated by the linear model?

The Regression Coefficients
Our linear model of the relationship between the two variables is written as the equation of a line: Y = a + bX. The y-intercept (constant) is a; the slope of the line is b. We are mostly interested in b.

The Table of Coefficients
In SPSS/PASW output, the table of coefficients shows the constant term and the slope coefficient (under the heading "B"). It shows both the standardized and the unstandardized slope coefficients. The standardized coefficients, called betas or beta-weights, are based on the Z-scores of the two distributions. Each coefficient is tested for significance with a t-test; the null hypothesis is that beta = 0.

The Regression Model [1]
In the real world, the constant term is often meaningless; we are interested in it only for writing the equation. We want to know whether b is significant. If it is not, forget about it: the whole analysis is off (and R and R² will also not be significant). If it is, we can write the equation with either a standardized or an unstandardized coefficient.

The Regression Model [2]
The unstandardized coefficient expresses the relationship in "real world" units of measurement, e.g., feet, kilos, metres, inches, minutes, literacy percentage points, and books in libraries. The standardized coefficient expresses the relationship in terms of the Z-scores of the two variables.
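The variance decomposition and the standardized/unstandardized distinction can both be seen in a short sketch (again in Python rather than the SPSS/PASW used in the slides; the distance-to-campus and travel-time values are invented):

    import numpy as np

    # Illustrative data: distance to campus (miles) and travel time (minutes).
    x = np.array([1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 8.0, 9.0])   # predictor (IV)
    y = np.array([10.0, 14.0, 20.0, 24.0, 28.0, 33.0, 42.0, 46.0])  # outcome (DV)

    # Unstandardized slope (b) and intercept (a) for the line Y' = a + bX.
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x

    # Decompose the variation in Y around its mean.
    ss_total = ((y - y.mean()) ** 2).sum()           # total: mean vs observed Y
    ss_regression = ((y_hat - y.mean()) ** 2).sum()  # explained: mean vs Y'
    ss_residual = ((y - y_hat) ** 2).sum()           # unexplained: Y' vs Y

    r_squared = ss_regression / ss_total             # R^2 = regression / total
    beta = b * x.std(ddof=0) / y.std(ddof=0)         # standardized slope

    print(f"a = {a:.2f}, b = {b:.2f}, R^2 = {r_squared:.3f}, beta = {beta:.3f}")

In the two-variable case the standardized slope (beta) equals R itself, which is one way to see why the coefficient's t-test and the ANOVA F-test for R² must agree here.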
Positive and Negative (Inverse) Relationships [1]
If the slope coefficient is a positive number, it expresses a positive relationship between the variables: more of one is associated with more of the other (for example, more study time, higher GPA). If the slope coefficient is a negative number, it expresses a negative or inverse relationship: more of one is associated with less of the other (for example, more binge drinking, lower GPA).

Positive and Negative [2]
Positive relationships have a positive slope for the line in the graph, a positive slope coefficient, and a positive R. Negative relationships have a negative slope for the line in the graph, a negative slope coefficient, and a negative R. (Warning: this negative R is missing its minus sign in some parts of SPSS/PASW output.) R² is always 0 or positive.

The End
Now we have our linear model BASED ON THE AVAILABLE DATA. If R, R², and the slope coefficient are significant, we have improved our ability to predict the outcome: our estimates, Y′, are better than just guessing the mean of the Y distribution. We can return to our first question and start the analysis, as sketched below: Is the past performance of the quarterbacks a good predictor of their teams' records in the coming season?
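As a closing illustration, here is a minimal sketch of that final question in Python (the course itself uses SPSS/PASW). The quarterback ratings and win totals are entirely hypothetical, and scipy's linregress supplies the slope, intercept, R, and the t-test p-value discussed above:

    import numpy as np
    from scipy import stats

    # Hypothetical data: last season's quarterback passer ratings (IV)
    # and each team's wins this season (DV). Values are made up.
    qb_rating = np.array([72, 78, 81, 85, 88, 90, 94, 97, 101, 105])
    wins = np.array([4, 5, 7, 6, 9, 8, 10, 11, 12, 13])

    res = stats.linregress(qb_rating, wins)

    print(f"Y' = {res.intercept:.2f} + {res.slope:.2f}X")
    print(f"R = {res.rvalue:.3f}, R^2 = {res.rvalue**2:.3f}")
    print(f"p-value for the slope's t-test (H0: slope = 0): {res.pvalue:.4f}")

    # If p < .05, the slope is significant, and the line's estimates (Y')
    # beat simply guessing the mean number of wins for every team.

With real data, the univariate checks and the scatterplot described earlier would of course come before fitting the line.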