Linear regression

Objectives
2.3 Least-squares regression

Regression lines

Prediction and Extrapolation

Correlation and r2

Transforming relationships
Adapted from authors’ slides © 2012 W.H. Freeman and Company
Straight Line Regression




A regression is a formula that describes how a response variable y
changes (on average) as an explanatory variable x changes.
We often use a regression line to predict the value of y for a given
value of x. The predicted value for y is often denoted ŷ.
In regression, the distinction between explanatory and response
variables is important.
A straight line regression has the form
ŷ = b0 + b1x
y is the observed value
ŷ is the predicted y value (y hat)
b1 is the slope
b0 is the y-intercept
The least squares regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) differences between the data points and the
line is as small as possible. (These differences are called residuals.)
This is the line that best predicts y from x (not the other way).
Distances between the points and
line are squared so that they all are
positive values.
How to compute the slope and intercept:
First we calculate the slope of the line, b1, from
statistics we have already computed.
b1 = r × (sy / sx)
r is the correlation.
sy is the standard deviation of the response variable y.
sx is the standard deviation of the explanatory variable x.
Interpretation: b1 is the (average) change in the value of y when x is changed
by 1 unit.
Once we know b1, we can calculate b0,
the y-intercept.
b0 = ȳ − b1·x̄
x̄ and ȳ are the sample means of the x and y variables.
Interpretation: b0 is the predicted value of y when x = 0 (although this value of x is
not always meaningful).
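Not from the original slides: a minimal Python sketch of these two formulas, using made-up x and y arrays as placeholders for any paired data.

import numpy as np

# Hypothetical paired data (placeholders for any x/y sample).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]             # correlation
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations
b1 = r * sy / sx                        # slope: b1 = r * (sy / sx)
b0 = y.mean() - b1 * x.mean()           # intercept: b0 = y-bar - b1 * x-bar

print(f"y-hat = {b0:.3f} + {b1:.3f} x")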

Different scale, different intercept and slope

It is important to note that if we change the scale of either the x or y
axis, the slope and intercept will also change. To see this, look at
the yearly temperature data (which can be downloaded from Dostat) in
StatCrunch and observe the effect of changing year to a
different scale (note that the year scale is rather arbitrary: why use 2012
and not 10008?).

Do not be fooled by the size of the slope. If it is small (whatever that
means), this does not mean it is insignificant, or that the correlation is
small. The size of a slope depends on the scaling that we use. The
significance depends on the standard error (more on this later).
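A quick way to see the scaling effect (a sketch with assumed yearly temperature values, not the Dostat data): rescale x and watch the slope change while the correlation stays the same.

import numpy as np

x = np.array([2008., 2009., 2010., 2011., 2012.])   # e.g., year
y = np.array([14.2, 14.5, 14.1, 14.8, 15.0])        # e.g., yearly temperature

def fit(x, y):
    r = np.corrcoef(x, y)[0, 1]
    b1 = r * y.std(ddof=1) / x.std(ddof=1)
    return b1, r

b1_years, r_years = fit(x, y)
b1_months, r_months = fit(x * 12, y)   # same data, x measured in months

print(b1_years, b1_months)   # slope shrinks by a factor of 12
print(r_years, r_months)     # correlation is unchanged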
Efficiency of a biofilter, by temperature
In StatCrunch: Stat-Regression-Simple Linear
yˆ  97.5  0.0757 x.
For every degree that temperature goes up, the efficiency can be
expected to increase by b1 = 0.0757 units.
The predicted efficiency when temperature equals 10 is
yˆ  97.5  0.0757 10  98.26.
Relationship between ozone and carbon pollutants
In StatCrunch: Stat-Regression-Simple Linear
yˆ  0.0515  0.005708x.
For each unit that carbon goes up, ozone can be expected to increase by
b1 = 0.005708 units. The predicted ozone level when carbon equals 15 is
ŷ = 0.0515 + 0.005708 × 15 = 0.1371.
However, the relationship is not strong so the prediction may not be all
that accurate.
Categorical variables in scatterplots
Often, things are not simple and one-dimensional. We need to group
the data into categories to reveal trends.
What may look like a positive linear
relationship is in fact a series of
negative linear associations.
Here, the habitat for each
observation is a lurking variable.
Plotting data points from different
habitats in different colors allows us
to make that important distinction.
Comparison of racing records over
time for men and women.
Each group shows a very strong
negative linear relationship that
would not be apparent without the
gender categorization.
Relationship between lean body mass
and metabolic rate in men and women.
Both men and women follow the same
positive linear trend, but women show
a stronger association. As a group,
males typically have larger values for
both variables.
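To draw the kind of grouped scatterplot described above, one common approach (a sketch with made-up groups standing in for the real data) is to plot each category in its own color.

import matplotlib.pyplot as plt

# Made-up (x, y, group) triples standing in for mass, rate, and sex.
data = [
    (40, 1100, "women"), (45, 1200, "women"), (50, 1300, "women"),
    (55, 1450, "men"),   (62, 1600, "men"),   (70, 1750, "men"),
]

for group, color in [("women", "tab:orange"), ("men", "tab:blue")]:
    xs = [x for x, y, g in data if g == group]
    ys = [y for x, y, g in data if g == group]
    plt.scatter(xs, ys, color=color, label=group)

plt.xlabel("lean body mass")
plt.ylabel("metabolic rate")
plt.legend()
plt.show()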
Correlation versus regression
The correlation is a measure of spread (scatter) in both the x and y
directions in the linear relationship.
In regression we examine the variation in the response variable (y)
given the explanatory variable (x).
Coefficient of determination, R2
R2 is called the coefficient of
determination.
R2 represents the percentage of
the variation of y that can be
explained by the prediction from x.
(That is, it compares the vertical
scatter around the regression line
with the overall vertical scatter:
the less scatter left around the line,
the closer R2 is to 1.)
R2 is meaningful for any fit of the response variable to one
or more explanatory variables.
In the case of straight line fit only, however, R2 = r2, where r
is the correlation coefficient (positive or negative).
The r-squared and the linear model

The basic idea in statistical modeling is to fit the simplest model
that best explains the data.

In the case of simple regression this means fitting a line through the
points. The only model simpler than that is fitting a flat line (with slope
zero) through the points. The R-squared is a way of measuring the gain
from fitting a sloped line over a flat line.

If the R-squared is 1, then the residuals are zero and y is
totally determined by x (the linear model is best). If the R-squared is zero, then the constant model is best. Usually the R-squared is somewhere in between.

Note that the value of the slope can be very small (say 0.00003), but
the R-squared can still be one.
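One way to make this comparison concrete (a sketch with assumed data, not from the slides): compute the residual sum of squares for the fitted line and for the flat line at ȳ, and take R2 = 1 − SS_line / SS_flat.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)              # least-squares slope and intercept
ss_line = np.sum((y - (b0 + b1 * x))**2)  # scatter around the fitted line
ss_flat = np.sum((y - y.mean())**2)       # scatter around the flat line y-bar

r_squared = 1 - ss_line / ss_flat
print(r_squared)                          # equals corr(x, y)**2 for a line fit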
Efficiency of a biofilter, by temperature
yˆ  97.5  0.0757 x.
R2 = 79.4% is the proportion of the variation in Efficiency that is explained
by the straight line regression on Temperature.
Relationship between ozone and carbon pollutants
In StatCrunch: Stat-Regression-Simple Linear
yˆ  0.0515  0.005708x.
R2 = 44.7% is the proportion of the variation in Ozone Level that is
explained by the straight line regression on Carbon Pollutant Level.
The distinction between explanatory and response variables is crucial in
regression. If you exchange y for x in calculating the regression line, you
will get a different line which is a predictor of x for a given value of y.
This is because the least squares regression of y on x is concerned with
the distance of all points from the line in the y direction only.
Here is a plot of Hubble telescope
data about galaxies moving away
from earth.
The solid line is the best prediction
of y = velocity from x = distance.
The dotted line is the best
prediction of x = distance from
y = velocity.
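To see that exchanging the roles really gives a different line (a sketch with placeholder numbers, not the Hubble data itself): regress y on x, then x on y, and rewrite the second fit in y-on-x form for comparison.

import numpy as np

x = np.array([0.03, 0.26, 0.5, 0.9, 1.1, 1.7, 2.0])        # placeholder "distance"
y = np.array([170., 290., 270., 650., 500., 960., 1090.])  # placeholder "velocity"

# Regression of y on x (minimizes vertical distances).
b1_yx, b0_yx = np.polyfit(x, y, 1)

# Regression of x on y, rearranged into y = ... form for comparison.
c1, c0 = np.polyfit(y, x, 1)            # x-hat = c0 + c1 * y
b1_xy, b0_xy = 1 / c1, -c0 / c1         # equivalent line y = b0_xy + b1_xy * x

print(b0_yx, b1_yx)   # solid line on the slide
print(b0_xy, b1_xy)   # dotted line: a different slope and intercept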
Examples
For example, if you want to predict the girth of an 8-week-old calf from its
8-week weight, then the response variable is the girth (it should be on the
y-axis) and the explanatory variable is the weight (on the x-axis).
Another example is trying to predict your midterm scores. It makes sense
to predict your midterm 3 score based on your midterm 2 score. Thus
midterm 3 is the response variable (on the y-axis) and midterm 2 is the
explanatory variable (on the x-axis).
Residuals
The distances from each point to the least-squares regression line give
us potentially useful information about the contribution of individual data
points to the overall pattern of scatter.
These distances are called
residuals, because they are
what is “left over” after fitting
the line.
Points above the
line have a positive
residual.
The sum of the residuals is
always 0.
Points below the line have a
negative residual.
y − ŷ = residual (observed y minus predicted ŷ).
Residual plots
Residuals are the differences between y-observed and y-predicted. We
plot them in a residual plot, which plots residuals vs. x.
If the data are best predicted simply by using a straight line then the
residuals will be scattered randomly above and below 0.
The x-axis in a residual plot is the
same as on the scatterplot.
Only the y-axis is different.
Constant mean and spread.
Residuals are randomly scattered – good!
Non-constant mean.
A curved pattern means the relationship
you are fitting (e.g., a straight line)
is not the right form.
Non-constant spread.
A change in variability across a plot
indicates that the response variable is
less predictable for some values of x than
for others.
This can affect the accuracy of statistical
inference.
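A residual plot is easy to produce once the fitted values are in hand; a minimal matplotlib sketch with assumed data:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)             # same x-axis as the scatterplot
plt.axhline(0, linestyle="--")        # reference line at residual = 0
plt.xlabel("x")
plt.ylabel("residual")
plt.show()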
Outliers and influential points
Outlier: an observation that lies outside the overall pattern of
observations.
Influential observation: an observation that markedly changes the
regression if removed. This is often an outlier on the x-axis.
Child 19 is an outlier in the y direction: an outlier of the relationship.
Child 18 is an outlier in the x direction only, and thus might be an
influential point.
(Figure: regression lines for all data, without child 18, and without
child 19, showing which of these points is influential.)
Always plot your data
A correlation coefficient and a regression line can be calculated for any
relationship between two quantitative variables. However, outliers
greatly influence the results, and running a linear regression on a
nonlinear association is not only meaningless but misleading.
So make sure to always
plot your data before
you run a correlation or
regression analysis.
Anscombe’s examples:
The four data sets below were constructed so that they each have correlation r
= 0.816, and the regression lines are all approximately ŷ = 3 + 0.5x. For all
four sets, we would predict ŷ = 8 when x = 10.
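Anscombe's quartet ships with the seaborn package; a quick sketch (assuming seaborn and its sample data are available) that verifies each data set gives nearly the same r and fitted line:

import numpy as np
import seaborn as sns

anscombe = sns.load_dataset("anscombe")   # columns: dataset, x, y

for name, grp in anscombe.groupby("dataset"):
    r = np.corrcoef(grp.x, grp.y)[0, 1]
    b1, b0 = np.polyfit(grp.x, grp.y, 1)
    print(name, round(r, 3), round(b0, 2), round(b1, 2))

# Each set prints r ~ 0.816 and a line close to y-hat = 3 + 0.5x,
# even though the scatterplots look completely different.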
Anscombe’s examples:
The four scatterplots show that the correlation or regression analysis is
not appropriate for just any data set with two numerical variables.
A moderate linear
association. A
straight line
regression is OK.
Statistical
inference for SLR
is OK.
An obviously
nonlinear
relationship. A
straight line
regression is not
OK. Fit a different
curve.
One point deviates
from the highly
linear pattern. This
influential outlier
must be examined
closely before
proceeding.
Just one very
influential point; all
other points have
the same x value.
What experiment
was conducted
here?
Vocabulary: lurking vs. confounding

A lurking variable is a variable that is not among the explanatory or
response variables in the analysis and yet, if observed and
considered, may influence the interpretation of relationships among
those variables.

Two variables are confounded when their effects on a response
variable cannot be distinguished (statistically) from each other. The
confounded variables can be explanatory variables or lurking
variables.

Association is not causation. Even if a statistical association is
very strong, this is not by itself good evidence that a change in x will
cause a change in y. The association would be just as strong if we
reversed the roles of x and y.
Cautions before rushing into a correlation or a
regression analysis

Do not use a regression on inappropriate data.

Clear pattern in the residuals

Presence of large outliers

Clustered data falsely appearing linear
Use residual plots for
help in seeing these.

Beware of lurking variables.

Avoid extrapolating (predicting beyond values in the data set).

Recognize when the correlation/regression is being performed
on values that are averages of another variable.

An observed relationship in the data, however strong it is,
does not imply causation just on its own.