Linear regression with one independent variable

Linear regression tells us about the relationship between a quantitative dependent variable y and an independent (predictor) variable x. Here are some examples:

- Number of flowers on rose plants versus amount of fertilizer.
- Grade on an exam versus hours of sleep before the exam.
- Mobility after stroke versus hours of rehabilitation.

Suppose we have $100 in the bank, and we work 10 hours at a salary of $10 per hour. After 10 hours, we have a total of 100 + 10*10 = 100 + 100 = 200 dollars. We can create a graph of how much money we have after each hour of work and draw a straight line through the points. This line is the linear regression line. The slope of the line is 10, because we earn 10 dollars per hour. The intercept of the line (the value of y when x = 0) is 100, because at time 0 we have 100 dollars. So the equation for the regression line is

    y = 100 + 10x
    dollars = 100 + 10 * hours worked

The intercept (100) and slope (10) of the line are called coefficients. You may recall from high school algebra that we wrote the equation for a line as y = mx + b. For statistical analysis, we commonly use this form:

    y = b0 + b1*x

where
    b0 = intercept
    b1 = slope for the independent variable x

For multiple regression, where we can have more than one x variable, we'll use this notation:
    b0 = intercept
    b1 = slope for the first independent variable x1
    b2 = slope for the second independent variable x2
    and so on.

Recall the example from correlation where we have a job that pays tips rather than a fixed hourly wage. We get these tips for working 1 to 8 hours:

    Hours   1   1   2   2   3   3   4   4   5   5   6   6   7   7   8   8
    Tips   18  10  19  28  29  45  27  47  52  66  51  60  78  74  81  92

We would like to know the average tips per hour worked, which is the slope of the line through the points. We use linear regression to calculate that slope. As we'll see shortly, the slope of the line is 10.22, meaning that we earn $10.22 per hour on average. So we earn slightly more in tips than in the $10 per hour salary job.

Fitting a regression line

Suppose you are given the job of finding the best line to fit these data, and of calculating the slope of the line:

    Hours worked   0   1   2   3   4   5   6   7   8   9  10
    Dollars        0   5  10  15  20  25  30  35  40  45  50

[Figure: the same points plotted three times, with a different candidate line drawn through them in each panel.]

You would probably draw a line like the one in the middle. Why not the line on the right? The line on the right doesn't pass through the points. We call the vertical distance of each point from the regression line the "residual". The line on the right has large residuals: the points are a long way from the line.

Let's look at an example we saw earlier, average tips earned per hour. Here are the data and two candidate regression lines:

    Hours   1   1   2   2   3   3   4   4   5   5   6   6   7   7   8   8
    Tips   18  10  19  28  29  45  27  47  52  66  51  60  78  74  81  92

[Figure: the tips data plotted twice, with a different candidate line in each panel.]

The line on the right fits the data better than the line on the left. Linear regression finds the line that minimizes the sum of the squared residuals from the line. Because regression minimizes squares, the method for finding the regression line is called least squares. Least squares regression finds the values of the coefficients that minimize the sum of the squared distances of all the points from the regression line. That is, the coefficients fit the line that minimizes the sum of the squared residuals.

For the example of tips versus hours worked, the regression analysis gives us this result.
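If you'd like to reproduce this result yourself, here is a minimal R sketch (an illustration, not part of the original analysis). It enters the tips data from the table above by hand and fits the line with R's lm() function; the variable name hours.worked matches the name in the output that follows, and the name fit is our own choice.

```r
# Tips data from the table above: hours worked and tips earned
hours.worked <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8)
tips <- c(18, 10, 19, 28, 29, 45, 27, 47, 52, 66, 51, 60, 78, 74, 81, 92)

# Fit the least squares line: tips = b0 + b1 * hours.worked
fit <- lm(tips ~ hours.worked)

# Print the fitted coefficients with their standard errors,
# t values, and p-values
summary(fit)
```

summary(fit) prints a coefficient table like the one below (R labels the first column "Estimate"), along with residual and F-statistic summaries.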
                  Coefficient  Std. Error  t value  Pr(>|t|)
    Intercept          0.9464      1.1675    0.811     0.431
    hours.worked      10.2202      0.2312   44.205    <2e-16 ***

The slope of the line is 10.2202, which is the hours.worked coefficient from the table. The intercept is 0.9464. So the equation for the line is

    y = b0 + b1*x
    y = intercept + slope * x
    Tips = intercept + slope * hours.worked
    Tips = 0.9464 + 10.2202 * hours.worked

Statistical inference: is the slope of the line significant?

Recall that test statistics are often of this form:

    effect size / standard error of the effect size

This ratio applies to testing the significance of the regression coefficients. In particular, we are most interested in whether the slope is really different from zero. We test whether the ratio of the slope to its standard error is significantly different from zero:

    T = slope / standard error of the slope

An alternative to the t-test is the F-test. For simple linear regression with one variable, the F-statistic is just the square of the T-statistic (here, 44.205^2 ≈ 1954). The advantage of the F-statistic is that it generalizes to multiple regression (multiple independent variables). The concepts underlying significance testing with the F-statistic for regression are similar to the concepts behind the F-statistic for ANOVA.

Linear regression model assumptions

The linear regression model makes several assumptions about the data that we should check before performing inference: the residuals should be roughly normally distributed and show no pattern, and the data should fall along a straight line. Here's an example of data that show a pattern that indicates a problem for linear regression.

[Figure: scatterplot of data with clear curvature.]

This scatterplot shows clear curvature, so a simple linear model is not appropriate. Transformations of the data (such as log or square root) may help the data meet the assumptions better. In this case, we should add a quadratic (x²) term to the linear regression model. We'll do this later.
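To make these checks concrete, here is a hedged R sketch of two standard diagnostic plots for the tips model (refit here so the block stands alone). resid(), fitted(), qqnorm(), and qqline() are base R functions; what to look for is described in the comments, not printed by the code.

```r
# Refit the tips model so this block stands alone
hours.worked <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8)
tips <- c(18, 10, 19, 28, 29, 45, 27, 47, 52, 66, 51, 60, 78, 74, 81, 92)
fit <- lm(tips ~ hours.worked)

# Residuals vs fitted values: a healthy fit scatters randomly
# around zero, with no curvature or funnel shape
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the reference line suggest the
# residuals are roughly normally distributed
qqnorm(resid(fit))
qqline(resid(fit))
```

If the residual plot showed curvature like the example above, a transformation or a quadratic term (for example, lm(tips ~ hours.worked + I(hours.worked^2))) would be the kind of remedy we'll return to later.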