Chapter 13 – Linear Regression and Correlation

Chapter 4 – Summarizing Bivariate Data
Linear Regression and Correlation
This chapter introduces an important method for making inferences about a linear correlation (or
relationship) between two variables, and for describing such a relationship with an equation that can
be used to predict the value of one variable given the value of the other. We will look only at linear
relationships, which means that, when graphed, the points approximate a straight-line pattern. We
will also introduce a way of measuring the strength of a linear relationship.
Data that consist of ordered pairs are called bivariate data.
A scatterplot is a display of ordered pairs plotted on a set of axes.
4.1 - Correlation
Two variables have a linear relationship if the data tend to cluster around a .............................
when plotted on a scatterplot.
A strong correlation means that the points in the scatterplot are closely clustered around a straight
line.
Two variables are positively associated if large values of one variable are associated with
........................ values of the other.
Two variables are negatively associated if large values of one variable are associated with
......................... values of the other.
The linear correlation coefficient, denoted r, measures the strength of a linear relationship
between the paired x- and y-quantitative values in a sample.
The value of r is always between -1 and +1 inclusive. That is, −1 ≤ r ≤ 1 .
If r = 1, there is a perfect positive linear correlation (or association).
If r is close to 1, there is a strong positive linear correlation.
If r is positive but close to zero, there is a weak positive linear correlation.
If r = 0, there is no linear correlation.
If r is negative but close to zero, there is a weak negative linear correlation.
If r is close to -1, there is a strong negative linear correlation.
If r = -1, there is a perfect negative linear correlation.
Interchanging all x- and y-values will not change r.
A common error is to conclude that correlation implies causality, meaning that one variable causes
the results of the second variable. Correlation is NOT causation. There might be other known or
unknown variables, confounders, which affect both of the variables that we are studying.
Remember that r measures the strength of a linear relationship, not the strength of a relationship
that is non-linear. The graphs below certainly seem to show variables with some sort of relationship,
but not a linear one, so r will not be helpful in these cases.
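As one illustration of this point, here is a minimal Python sketch. The data are hypothetical (an exact parabola, not the graphs from these notes), and the helper name `correlation` is an assumption introduced only for this example. The y-values are completely determined by x, yet r comes out as zero because the relationship is not linear.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y = x^2 on x-values symmetric around zero
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = correlation(xs, ys)
print(r_curve)  # 0.0: no linear correlation, despite an exact relationship
```

This is why a scatterplot should always be examined before r is interpreted.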
In this class, we will not spend time on the tedious calculations required to find a correlation
coefficient or the linear regression equation (next section) by hand, but we will use our calculators.
The emphasis will instead be on the interpretation of the results.
Ex.
Global Warming: The following data set shows the temperatures at different levels of CO2.

CO2 (in parts per million):  314   317   320   326   331   339   346   354   361   369
Temperature (in Celsius):    13.9  14.0  13.9  14.1  14.0  14.3  14.1  14.5  14.5  14.4
(a) Draw a scatterplot, first by hand, then using your calculator, where the amount of CO2 is the
independent variable and the temperature is the dependent variable.
(b) Find the correlation coefficient.
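In class the correlation coefficient comes from the calculator, but the result for part (b) can be cross-checked with a short Python sketch. It uses the standard formula for r; the function name `correlation` is an assumption made for this example, and the ten data pairs are the ones in the table above, paired in the order listed.

```python
import math

# Data from the Global Warming example, paired in the order listed
co2 = [314, 317, 320, 326, 331, 339, 346, 354, 361, 369]             # parts per million
temp = [13.9, 14.0, 13.9, 14.1, 14.0, 14.3, 14.1, 14.5, 14.5, 14.4]  # degrees Celsius

def correlation(xs, ys):
    """Linear correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(co2, temp)
r_swapped = correlation(temp, co2)  # interchange the x- and y-values

print(round(r, 3))          # about 0.892, a strong positive linear correlation
print(round(r_swapped, 3))  # same value: swapping x and y does not change r
```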
4.2, 4.3 - The Least Squares Regression Line
If one can rent a car for $180/week plus $0.25/mile, we can write an equation for the cost of renting
a car for a week: y = 180 + 0.25x, where y represents the cost per week and x represents the
number of miles driven in a week. This linear equation gives an exact value of y for any given x.
However, the variables often don't have an exact relationship, where one variable is determined
completely by the other variable. But if they appear to have a linear relationship, we can find the
graph and equation of the straight line that best represents this specific relationship. This straight
line is called the least squares regression line (or line of best fit).
The general form of this linear regression equation is y = a + bx (compare this equation to the
more familiar form y = mx + b). The regression line is seen as a measure of the mean value of y for
a given value of x.
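For reference, the phrase "least squares" can be written out as follows. This is the standard textbook formulation rather than something derived in these notes; the symbols n (number of data pairs), \bar{x} and \bar{y} (sample means), and s_x and s_y (sample standard deviations) are notation assumed here.

```latex
\[
\text{choose } a \text{ and } b \text{ to minimize } \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\]
\[
b = r \, \frac{s_y}{s_x}, \qquad a = \bar{y} - b \, \bar{x}
\]
```

Two consequences are worth noting: the slope b always has the same sign as r, and the regression line always passes through the point (\bar{x}, \bar{y}).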
Describe the following:
x=
y=
a=
b=
Positive linear relationship:
Negative linear relationship:
Describe how we graph a line y = a + bx:
Requirements for finding a linear regression line and its correlation coefficient:
1. The sample of paired data is a random sample of quantitative data.
2. Visual examination of the scatterplot shows that the points approximate a straight-line
pattern.
3. Any outliers must be removed if they are known to be errors. Consider the effects of any
outliers that are not known errors. (In a scatterplot, an outlier is a point lying far away from
the other data points.)
Describe how to find a linear regression equation on a TI-83/84:
More problems that apply to the Global Warming example above:
(c) Find the equation of the regression line, where the amount of CO2 is the independent variable
and the temperature is the dependent variable.
(d) Draw the regression line in the same coordinate system as the scatterplot.
(e) Mark the errors (residuals) between the actual points and the corresponding points on the
regression line. (In general I will not ask you to draw the residuals, only for this problem.)
(f) Interpret the slope for this particular problem, using correct units.
(g) Interpret the y-intercept for this particular problem, using correct units.
(h) Find the predicted temperature for a recent year in which the concentration of CO2 is 370.9
parts per million. Is the predicted temperature close to the actual temperature of 14.5°C?
(i) If the CO2 increases by 15 parts per million, how much would you expect the temperature to
change?
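The calculator answers to the numerical parts of this problem (the regression equation, the prediction at 370.9 parts per million, and the expected change for a 15 parts per million increase) can be cross-checked with the sketch below. It is only an illustration under the usual least-squares formulas; `fit_line` is a helper name assumed for this example.

```python
# Data from the Global Warming example, paired in the order listed
co2 = [314, 317, 320, 326, 331, 339, 346, 354, 361, 369]             # parts per million
temp = [13.9, 14.0, 13.9, 14.1, 14.0, 14.3, 14.1, 14.5, 14.5, 14.4]  # degrees Celsius

def fit_line(xs, ys):
    """Return (a, b) for the least-squares regression line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b

a, b = fit_line(co2, temp)
predicted = a + b * 370.9  # predicted temperature at 370.9 ppm
change = b * 15            # expected temperature change for +15 ppm

print(f"y = {a:.2f} + {b:.4f}x")                     # roughly y = 10.48 + 0.0109x
print(f"predicted at 370.9 ppm: {predicted:.2f} C")  # roughly 14.53, close to the actual 14.5
print(f"change for +15 ppm: {change:.2f} C")         # roughly 0.16
```

The slope says that each additional part per million of CO2 is associated with a predicted temperature increase of about 0.011°C. The y-intercept, about 10.5°C, would be the predicted temperature at 0 parts per million, which is far outside the sampled range and so has no practical meaning here.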
Regression lines are often useful for predicting the value of one variable, given some particular
value of the other variable. If the regression line fits the data quite well, then it makes sense to use
its equation for predictions (use a scatterplot and correlation coefficient to determine how well the
line fits). However, don't base predictions on values that are far beyond the boundaries of the
known sample data, as the linear relationship may not hold true there.
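A simple guard against extrapolation is to compare the x-value with the range of the sample before predicting. The sketch below assumes a hypothetical helper `is_extrapolation`; it only flags values outside the sampled range, and deciding whether a value is "far" beyond it remains a judgment call.

```python
# CO2 values from the Global Warming example (parts per million)
co2 = [314, 317, 320, 326, 331, 339, 346, 354, 361, 369]

def is_extrapolation(x, xs):
    """True if x lies outside the range of the sample x-values."""
    return x < min(xs) or x > max(xs)

outside = is_extrapolation(370.9, co2)  # just above the largest sampled value, 369
inside = is_extrapolation(340, co2)     # within the sampled range

print(outside)  # True
print(inside)   # False
```

Here 370.9 is only slightly beyond the largest sampled value, so a prediction there is reasonable; a prediction at, say, 600 parts per million would not be.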
An influential point is a point that, when included in a scatterplot, strongly affects the position of
the least-squares regression line.
ex.
When a scatterplot contains outliers:
• Compute the least-squares regression line both with and without each outlier to determine
which outliers are influential.
• Report the equations of the least-squares regression line both with and without each
influential point.
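The with-and-without comparison can be sketched in Python as follows. The CO2 data from the earlier example are reused purely to show the mechanics (that data set has no obvious outlier), and `fit_line` is a helper name assumed for this illustration. Each point is removed in turn and the line is refitted; a point whose removal changes the slope or intercept substantially would be influential.

```python
# Data from the Global Warming example, paired in the order listed
co2 = [314, 317, 320, 326, 331, 339, 346, 354, 361, 369]             # parts per million
temp = [13.9, 14.0, 13.9, 14.1, 14.0, 14.3, 14.1, 14.5, 14.5, 14.4]  # degrees Celsius

def fit_line(xs, ys):
    """Return (a, b) for the least-squares regression line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Regression line with every point included
full_a, full_b = fit_line(co2, temp)
print(f"all points:        y = {full_a:.2f} + {full_b:.4f}x")

# Regression line with each point left out in turn
leave_one_out = []
for i in range(len(co2)):
    xs = co2[:i] + co2[i + 1:]
    ys = temp[:i] + temp[i + 1:]
    a_i, b_i = fit_line(xs, ys)
    leave_one_out.append((co2[i], temp[i], a_i, b_i))
    print(f"without ({co2[i]}, {temp[i]}): y = {a_i:.2f} + {b_i:.4f}x")
```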