Linear Regression
Tuesday, April 21, 2020

Today
  Today’s topic: Linear regression
  Quiz 4 (on correlation) is due tomorrow (Wed) by 8pm
  We will have one more quiz, posted on Tues, 4/28 and due the following Mon, May 4 by 8pm
  We will not have a Final Exam for this course
  Complete SETE course evaluations
    You will receive an email with instructions to complete evaluations for this class and your other classes by May 7

Linear Regression
  Used to describe the relationship between two interval/ratio variables
  How is regression distinct from correlation?
    Correlation is unitless
    Regression has units and tells us how a 1-unit change in the independent variable affects the value of the dependent variable
  Regression relies on scatterplots and a prediction of the “best-fitting” line to describe a trend in the relationship
    The line is described by its slope and y-intercept

Correlation is Unitless
  Formula for correlation coefficient (r): [formula not shown]
    The units for Y and X appear in both the numerator and the denominator, so they cancel each other out
  Regression keeps the units of the independent and dependent variables and tells us how a 1-unit increase in the independent variable affects the predicted value of the dependent variable

Linear Regression
  Do you remember the formula for a line from algebra?
    Y = mX + b
    …where m = slope and b = y-intercept
  The formula for a regression line is nearly identical but with statistical notation:
    Ŷ = a + bX
    …where b = slope and a = y-intercept

Linear Regression
  Y-intercept (or Constant) (a)
    The point at which the regression line crosses the y-axis.
    The expected value of Y when X equals 0.
    Not always meaningful; for some independent variables, 0 is not a valid value (age, education, body mass index)
  Slope (b)
    The steepness of the regression line
    “Rise divided by run”
    Indicates the predicted change in the dependent variable for each 1-unit increase in the independent variable.
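One way to see the difference between a unitless correlation and a slope that carries units is to change the units of X. The sketch below uses a small made-up data set (not the survey data from this course) and two hypothetical helper functions, `pearson_r` and `ols_slope`. It uses a standard textbook form of Pearson's r, which may be laid out differently from the formula on the slide, and measures age first in years and then in months.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson's r: co-variation of X and Y divided by the product of their spreads."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Units of X and Y appear on top and bottom, so they cancel.
    return sxy / sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope b of the regression line: change in Y per 1-unit change in X."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Units here are (units of Y) per (unit of X); they do not cancel.
    return sxy / sxx


# Hypothetical toy data, invented for illustration only.
age_years = [20, 30, 40, 50, 60]
hours = [22, 18, 17, 12, 11]
age_months = [12 * a for a in age_years]  # same people, different units for X

r_years = pearson_r(age_years, hours)
r_months = pearson_r(age_months, hours)
b_years = ols_slope(age_years, hours)    # hours per year of age
b_months = ols_slope(age_months, hours)  # hours per month of age

print("r (years): ", round(r_years, 4))
print("r (months):", round(r_months, 4))
print("slope per year: ", round(b_years, 4))
print("slope per month:", round(b_months, 4))
```

Re-measuring age in months leaves r unchanged but makes the slope 12 times smaller, because the slope is always read as "change in Y per 1 unit of X".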
[Figure: regression line]

Predicting Y with X
  If I asked you to guess a person’s age and you knew nothing about them except that he or she lived in the US, what would be your best guess?
    Ȳ (“Y bar”): the mean age for US residents, Ȳ = 38.2 years in 2018
  Now if I tell you that person is widowed, would you adjust your guess?

Predicting Y with X
  In statistics, we are often trying to predict (make a better guess about) the mean for a variable.
  With bivariate statistics, we use an independent variable (widowhood) to improve our estimate of a dependent variable (age)
    Ŷ (“Y hat”): we use X to predict Y
    Ŷ (a point on the regression line) should give us a better estimate than Ȳ (the mean) because it draws on information from X

How would we find the line that best fits these data points?
  Which line best captures how these two variables co-vary?
  The difficulty is that we have a lot of choices that look about right… right?
  We need to find the single best intercept and the single best slope for our line…
  We need a line that minimizes the distance from the line to the data points
    These distances are called “residuals” or residual errors
    [Figure] Blue lines represent the error between the value predicted by the red regression line and the actual value observed in the data (the circles)

Linear Regression, also called “Ordinary Least Squares (OLS) Regression”
  The logic is similar to variance.
  Some residual errors are negative and some are positive. If we just add the residual errors together, they cancel each other out and sum to zero, so we square them before adding them up.
  The estimates of a and b have the property that the sum of the squared differences between the observed and predicted values, ∑(Y − Ŷ)², is minimized. This is what “ordinary least squares” refers to.
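The least-squares idea can be sketched in a few lines of code. The example below is an illustration only: the data are made up, the function names are hypothetical, and the closed-form formulas for a and b are the standard ones for a regression with a single X (they are not shown on the slides, where SPSS does this calculation). It checks two claims from the text: the residuals of the best-fitting line sum to zero, and another line that "looks about right" has a larger sum of squared errors.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # y-intercept: the line passes through (X-bar, Y-bar)
    return a, b


def residuals(a, b, xs, ys):
    """Observed Y minus predicted Y-hat for every data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared residuals for the line Y-hat = a + b*X."""
    return sum(e ** 2 for e in residuals(a, b, xs, ys))


# Hypothetical toy data, invented for illustration only.
ages = [20, 30, 40, 50, 60]
hours = [22, 18, 17, 12, 11]

a, b = fit_line(ages, hours)
best_residuals = residuals(a, b, ages, hours)
best_sse = sum_squared_errors(a, b, ages, hours)

# A competing line that also "looks about right" (values chosen by eye).
other_sse = sum_squared_errors(27.0, -0.30, ages, hours)

print("a =", round(a, 3), " b =", round(b, 3))
print("residuals sum to ~0:", abs(sum(best_residuals)) < 1e-9)
print("SSE of least-squares line:", round(best_sse, 3))
print("SSE of competing line:   ", round(other_sse, 3))
```

Squaring the residuals is what keeps positive and negative errors from cancelling, so the sum of squared errors can be used to rank candidate lines.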
This relies on the Sum of Squared Errors

Interpreting SPSS Output
  Let’s return to our example from last week about age and time spent using the internet
  Research question: Is there a relationship between a respondent’s age and the number of hours they spend using the internet?
  Both IV and DV are interval/ratio
  Fit regression line
    SPSS will calculate the y-intercept and slope for the “best-fitting” line to tell us how well respondents’ age predicts their hours online per week, on average

Regression with SPSS
  [Screenshots: output provided by SPSS]

Interpreting SPSS Output: Summary
  R = Pearson’s correlation coefficient (the measure of correlation we discussed 4/14)
    R = .259, so there is a weak association between age and hours of internet use
    The Model Summary reports R without a sign; the negative slope in the coefficients table shows that the association is negative
  R² (“R squared”): coefficient of determination
    R² tells us what proportion of prediction error is eliminated by using X to predict Y (Ŷ) versus just using Ȳ
  R² interpretation
    R² = .067 means that 6.7% of the variation in the DV (hours of internet use) can be explained by the IV (age), compared with using the mean of the DV alone

Interpreting SPSS Output
  SPSS uses different notation than we’ve been using
    (Constant) row and B column = y-intercept a
    “Age of respondent” row and B column = slope b
  Our regression equation: Ŷ = 26.395 + (−.262) × X
  Y-intercept a = 26.395
    We can’t interpret this. The y-intercept tells us the predicted value of the DV (hours of internet use) when our IV (age) equals 0. No one in the survey is zero years old.
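To make the R² interpretation from the Model Summary concrete, here is a minimal sketch. The first part only squares the R value reported on the slides (.259). The second part uses made-up data and a hypothetical function `r_squared` to show R² as the share of squared error removed when we guess Ŷ instead of Ȳ; it relies on the standard least-squares formulas for one X, which are an assumption here since the slides leave that calculation to SPSS.

```python
def r_squared(xs, ys):
    """Proportion of squared error removed by predicting Y-hat instead of Y-bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Error if we guess the mean (Y-bar) for everyone.
    total_error = sum((y - y_bar) ** 2 for y in ys)
    # Error left over if we guess the point on the regression line (Y-hat).
    remaining_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (total_error - remaining_error) / total_error


# Part 1: the value from the slides. In a regression with one X, R squared
# is simply the correlation squared.
r_from_slides = 0.259
r2_from_slides = round(r_from_slides ** 2, 3)
print("R squared from R = .259:", r2_from_slides)

# Part 2: hypothetical toy data, invented for illustration only.
ages = [20, 30, 40, 50, 60]
hours = [22, 18, 17, 12, 11]
r2_toy = r_squared(ages, hours)
print("R squared for toy data:", round(r2_toy, 3))
```

The toy data produce a much larger R² than the survey example because they were invented to fall close to a line; an R² of .067 means the regression line removes only a small share of the error left by the mean.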
Interpreting SPSS Output
  Slope b = −.262
    When age increases by 1 year, hours of internet use are predicted to decrease by .262 hours, on average
  “Sig.” is the p-value
    If the p-value is smaller than our alpha value (.05), then the association between the two variables is statistically significant
    In other words, we can be 95% confident that age and hours of internet use are associated at the population level
  Our regression equation: Ŷ = 26.395 − .262 × X
    If someone’s age X equals 65 years, what is our best estimate of their hours online per week (Ŷ)?
    Ŷ = 26.395 − .262 × 65
    Ŷ = 9.365
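The prediction step can also be written as a small function. This sketch simply plugs the y-intercept and slope reported in the SPSS output into Ŷ = a + bX; the function name `predict_hours` is hypothetical and chosen for illustration.

```python
A = 26.395  # y-intercept a, from the (Constant) row, B column
B = -0.262  # slope b, from the "Age of respondent" row, B column


def predict_hours(age):
    """Predicted hours online per week (Y-hat) for a given age in years (X)."""
    return A + B * age


pred_65 = predict_hours(65)
pred_66 = predict_hours(66)

print("Predicted hours at age 65:", round(pred_65, 3))
# One extra year of age changes the prediction by exactly the slope.
print("Change from age 65 to 66:", round(pred_66 - pred_65, 3))
```

Comparing two ages one year apart reproduces the slope, which is the sense in which b is the predicted change in Y for each 1-unit increase in X.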