4.1 Simple Linear Regression

Linear regression with one independent variable
Linear regression tells us about the relationship between a quantitative dependent
variable y and an independent (predictor) variable x.
Here are some examples.
Number of flowers on rose plants versus amount of fertilizer.
Grade on an exam versus hours of sleep before the exam.
Mobility after stroke versus hours of rehabilitation.
Suppose we have $100 in the bank, and we work 10 hours with a salary of $10 per hour.
After 10 hours, we have a total of 100 + 10*10 = 100 + 100 = 200 dollars. We can create
a graph of how much money we have after each hour of work, and draw a straight line
through the points. This line is the linear regression line.
The slope of the line is 10, because we earn 10 dollars per hour.
The intercept of the line (when x = 0) is 100, because at time 0 we have 100 dollars.
So the equation for the regression line is
y = 100 + 10x
dollars = 100 + 10 * hours worked
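The salary example can be checked with a few lines of Python. This is a minimal sketch; the function name `balance` and its parameter names are illustrative choices, not part of the original example.

```python
def balance(hours_worked, start=100, wage=10):
    """Dollars in the bank after working the given number of hours."""
    return start + wage * hours_worked

# Balance after each hour of work, from hour 0 to hour 10.
balances = [balance(h) for h in range(11)]
print(balances)

# The intercept is the balance at hour 0; the slope is the gain per hour.
print(balance(0), balance(1) - balance(0))
```

Every one-hour step adds the same amount, which is why the points fall exactly on a straight line.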
The intercept (100) and slope (10) of the line are called coefficients. You may recall from
high school algebra that we wrote the equation for a line as
y = mx + b.
For statistical analysis, we commonly use this form:
y = b0 + b1*x
b0 = intercept
b1 = slope for the independent variable x
For multiple regression, where we can have more than one x variable, we'll use this form:
y = b0 + b1*x1 + b2*x2 + ...
b0 = intercept
b1 = slope for the first independent variable x1
b2 = slope for the second independent variable x2
and so on
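The same pattern of one intercept plus one slope per x variable can be sketched in Python. The function `predict` is a hypothetical helper written for this illustration, and the two-variable coefficients below are made-up numbers, not results from any data in this section.

```python
def predict(coefficients, xs):
    """Return b0 + b1*x1 + b2*x2 + ... for one observation.

    coefficients is [b0, b1, b2, ...] and xs is [x1, x2, ...].
    """
    b0, slopes = coefficients[0], coefficients[1:]
    if len(slopes) != len(xs):
        raise ValueError("need exactly one slope per x variable")
    return b0 + sum(b * x for b, x in zip(slopes, xs))

# With a single x variable this is the salary line y = 100 + 10x.
y_simple = predict([100, 10], [10])

# Hypothetical two-variable model: y = 1 + 2*x1 + 3*x2.
y_multi = predict([1.0, 2.0, 3.0], [4.0, 5.0])

print(y_simple, y_multi)
```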
Recall the example from correlation where we have a job that pays tips, rather than a
fixed hourly wage. We get these tips for working 1 to 8 hours.
Hours     1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8
Tips ($) 18 10 19 28 29 45 27 47 52 66 51 60 78 74 81 92
We would like to know the average tips per hour worked, which is the slope of the line
through the points. We use linear regression to calculate the slope, to learn the average
tips per hour. As we'll see shortly, the slope of the line is 10.22, meaning that we earn
$10.22 per hour on average. So we earn slightly more in tips than in the $10 per hour
salary job.
Fitting a regression line
Suppose you are given the job of finding the best line to fit these data, and to calculate
the slope of the line.
Hours Worked  0  1  2  3  4  5  6  7  8  9 10
y             0  5 10 15 20 25 30 35 40 45 50
You would probably draw a line like the one in the middle. Why not the line on the right?
The line on the right doesn't pass through the points. We call the vertical distance
from each point to the regression line (the observed y minus the y predicted by the
line) the "residual". The line on the right has large residuals: the points are a long
way from the line.
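To make the idea of a residual concrete, here is a minimal Python sketch using the data above. The line y = 5x passes through every point. The second line, y = 10 + 2x, is a hypothetical poor fit chosen only for illustration; it is not the line from the figure.

```python
hours = list(range(11))        # 0, 1, ..., 10
y = [5 * h for h in hours]     # 0, 5, ..., 50, the values in the table

def residuals(intercept, slope, xs, ys):
    """Observed y minus the y predicted by the line, for each point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(xs, ys)]

good = residuals(0, 5, hours, y)    # line through the points
poor = residuals(10, 2, hours, y)   # hypothetical poor-fitting line

print(good)   # every residual is 0
print(poor)   # every residual is nonzero: the line misses the points
```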
Let's look at an example we saw earlier, average tips earned per hour. Here are the data
and two candidate regression lines.
Hours     1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8
Tips ($) 18 10 19 28 29 45 27 47 52 66 51 60 78 74 81 92
The line on the right fits the data better than the line on the left. Linear regression finds
the line that minimizes the sum of the squared residuals from the line. Because regression
minimizes squares, the method for finding the regression line is called least squares.
In other words, least squares regression chooses the intercept and slope that make the
sum of the squared residuals as small as possible. No other straight line gives a
smaller sum for the same data.
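For a single x variable, the least squares coefficients have a well-known closed form: the slope is the sum of (x - mean of x)*(y - mean of y) divided by the sum of (x - mean of x) squared, and the intercept is the mean of y minus the slope times the mean of x. The sketch below applies that formula to the Hours Worked data from earlier in this section (hours 0 to 10, y from 0 to 50). The function names `least_squares` and `ssr` are illustrative choices, not from any library.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def ssr(intercept, slope, xs, ys):
    """Sum of squared residuals for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

hours = list(range(11))
y = [5 * h for h in hours]

b0, b1 = least_squares(hours, y)
print(b0, b1)   # 0.0 5.0

# A slightly steeper line has a larger sum of squared residuals.
print(ssr(b0, b1, hours, y), ssr(b0, b1 + 0.5, hours, y))
```

Because these points lie exactly on a line, the fitted line has a sum of squared residuals of zero. With noisy data such as the tips example the minimum is greater than zero, but the same formulas apply.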
For the example of tips versus hours worked, the regression analysis gives us this result.
              Coefficient  Std. Error  t value  Pr(>|t|)
Intercept          0.9464
hours.worked      10.2202      0.2312   44.205  <2e-16 ***
The slope of the line is 10.2202, which is the coefficient from the table. The intercept is
0.9464. So the equation for the line is
y = b0 + b1*x
y = intercept + slope * x
Tips = intercept + slope * hours.worked
Tips = 0.9464 + 10.2202 * hours.worked
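Once the coefficients are known, the equation can be used to predict tips for any number of hours. The following is a minimal sketch that plugs the reported intercept and slope into the line; the function name `predicted_tips` is an illustrative choice.

```python
INTERCEPT = 0.9464   # b0, from the regression output above
SLOPE = 10.2202      # b1, from the regression output above

def predicted_tips(hours_worked):
    """Tips predicted by the fitted line for a given number of hours."""
    return INTERCEPT + SLOPE * hours_worked

for h in (1, 4, 8):
    print(h, round(predicted_tips(h), 2))
```

These are predictions from the line, not observed values; the observed tips scatter around them.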
Statistical inference: is the slope of the line significant?
Recall that test statistics are often of this form:
test statistic = effect size / standard error of the effect size
This ratio applies to testing the significance of the regression coefficients. In particular,
we are most interested to know if the slope is really different from zero.
We test whether the ratio of the slope to its standard error is significantly different from zero:
t = slope / standard error of the slope
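As a quick check, the t value in the regression output for the tips example can be reproduced from the reported slope and its standard error. This sketch only does the division; turning t into a p-value also requires the residual degrees of freedom, which are not shown here.

```python
slope = 10.2202      # coefficient for hours worked, from the output
std_error = 0.2312   # standard error of the slope, from the output

t_value = slope / std_error
print(round(t_value, 3))   # 44.205
```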
An alternative to the t-test is the F-test. For the case of simple linear
regression with one independent variable, the F-statistic is just the square of the t-statistic. The
advantage of the F-statistic is that it generalizes to multiple regression (multiple
independent variables). The concepts underlying significance testing with the F-statistic
for regression are similar to the concepts using the F-statistic for ANOVA.
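The relationship between the two statistics can be verified numerically. The sketch below uses a small hypothetical data set (made up for illustration, not taken from this section), computes t as the slope divided by its standard error, computes F as the regression mean square divided by the residual mean square, and compares F with t squared. It assumes the usual formula for the standard error of the slope, the square root of (residual mean square / sum of squared deviations of x).

```python
import math

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

fitted = [intercept + slope * x for x in xs]
ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained
ss_regr = sum((f - y_mean) ** 2 for f in fitted)           # explained

ms_resid = ss_resid / (n - 2)           # residual mean square, n - 2 df
se_slope = math.sqrt(ms_resid / sxx)    # standard error of the slope

t_stat = slope / se_slope
f_stat = ss_regr / ms_resid             # regression mean square has 1 df

print(t_stat ** 2, f_stat)   # equal up to floating-point rounding
```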
Linear regression model assumptions
The linear regression model makes several assumptions about the data that we should
check before performing inference. The residuals should be roughly normally distributed
and should show no pattern, and the data should follow a straight line.
Here's an example of data that show a pattern that indicates a problem for linear
regression. This scatterplot shows clear curvature, so a simple linear model is not
appropriate. Transformations of the data (such as log or square root) may help the data
fit the assumptions better. In this case, we should add a quadratic (x^2) term to the
linear regression model.
We'll do this later.