Regression analysis

Chapter 5
Regression
Chapter outline
The least-squares regression line
 Facts about least-squares regression
 Residuals
 Influential observations
 Cautions about correlation and regression
 Association does not imply causation

Correlation and Regression

Regression effects are depicted by the slope
of the line.

Correlation can be seen as the spread of
points around the regression line. The
greater the spread of points around the
regression line, the less well X predicts Y
and, consequently, the weaker the
correlation.
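As a rough illustration of the point above, the sketch below computes the correlation r for three invented data sets that share the same upward trend but differ in how far the points scatter around it. The data values and the helper name `correlation` are assumptions made for this example only.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the same upward trend with no, small, and large scatter.
x = [1, 2, 3, 4, 5, 6]
y_exact = [2 * v for v in x]                 # points lie exactly on a line
y_tight = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]   # small spread around the line
y_loose = [6, 0, 10, 2, 14, 8]               # large spread around the line

r_exact = correlation(x, y_exact)
r_tight = correlation(x, y_tight)
r_loose = correlation(x, y_loose)
print(r_exact, r_tight, r_loose)
```

The three values should decrease from 1 toward 0 as the scatter grows, which is the sense in which correlation measures spread around the line.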
Perfect Positive Correlation
[Figure: scatterplot with correlation r = 1]
No Correlation
[Figure: scatterplot; recoverable axis labels: Potato Chip Consumption, Feeling Thermometer for Clinton]
Imperfect Correlation and Relationships

We rarely see perfect correlation.

Because correlation is rarely perfect, we
can draw a line to summarize the trend
in the data points. This is the
regression line.
Regression Line

Regression Line: A straight line that
describes how a response variable y
changes as an explanatory variable x
changes.

It can sometimes be used to predict the
value of y for a given value of x.
Making Predictions
Where do we Draw the Line?
[Figure: Age and Income. Scatterplot of Income in $10,000 (y-axis) against Age (x-axis).]
Minimize the sum of the distances
between the points and the line
[Figure: scatterplot with a fitted line; the distances marked between the points and the line are -.25, +2, +2, -3.5, and -.25.]
Square the Distances
The best-fitting line would minimize the
sum of the squared distance of every
point in the scatterplot from the
regression line:

$$\text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
This best-fitting line is the one that,
compared with any other line you could
plot through the points, produces the
smallest sum of squared deviations.
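A small worked check of why the distances are squared, assuming the five values marked in the scatterplot above (-.25, +2, +2, -3.5, -.25) are signed vertical distances from the points to the line. Added as they stand, positive and negative distances cancel, so their sum cannot measure how badly the line misses.

```python
# Signed distances as marked in the scatterplot (assumed: observed y minus the line).
distances = [-0.25, 2, 2, -3.5, -0.25]

sum_signed = sum(distances)                    # positives and negatives cancel
sum_squared = sum(d ** 2 for d in distances)   # squaring keeps every miss positive

print(sum_signed)   # 0.0
print(sum_squared)  # 20.375
```

The signed sum is zero even though no point lies on the line, while the squared sum grows with every miss, which is why least squares works with the squared distances.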
The equation of the regression line is $\hat{y} = a + bx$.
• The slope b is the amount by which the predicted y changes when x increases by 1.
• The intercept a is the predicted value of y when x = 0.
Finding the equation of the regression line
 Exercise 5.16 (Page 125)
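One way to find the equation by hand or in code is sketched below. It uses the standard least-squares formulas (slope = sum of cross-deviations divided by sum of squared x-deviations, intercept = mean of y minus slope times mean of x). The data are invented and are not the data of Exercise 5.16; the function names are likewise chosen for this sketch.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (not the data from Exercise 5.16).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
best = sum_squared_errors(x, y, a, b)
print(a, b, best)

# Any other line gives a larger sum of squared distances.
for other_a, other_b in [(a + 0.5, b), (a, b + 0.1), (a - 1.0, b + 0.3)]:
    assert sum_squared_errors(x, y, other_a, other_b) > best
```

For these numbers the fitted line is approximately y-hat = 2.2 + 0.6x, and shifting or tilting it in any direction increases the sum of squared distances.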
Facts about least-squares regression line

Fact 1: It is a mathematical model for the data.
Fact 2: The distinction between explanatory and response
variables is essential in regression.
Fact 3: There is a close connection between correlation and the
slope of the least-squares line: the slope is $b = r\,s_y/s_x$, where
$s_x$ and $s_y$ are the standard deviations of x and y, so a change of
one standard deviation in x corresponds to a change of r standard
deviations in the predicted y.
Fact 4: The least-squares regression line always passes through
the point $(\bar{x}, \bar{y})$, where $\bar{x}$ is the mean of the x values
and $\bar{y}$ is the mean of the y values.
Fact 5: The correlation r describes the strength of a straight-line
relationship. In the regression setting, this description takes a
specific form: the square of the correlation, $r^2$, is the fraction of the
variation in the values of y that is explained by the least-squares
regression of y on x.
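Facts 3, 4, and 5 can be checked numerically. The sketch below does so on a small invented data set, using the standard identity that the slope equals r times the ratio of the standard deviations of y and x; the variable names are assumptions of this example.

```python
import math

# Hypothetical data, used only to check the facts numerically.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((v - mean_x) ** 2 for v in x)
syy = sum((v - mean_y) ** 2 for v in y)
sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))

b = sxy / sxx               # least-squares slope
a = mean_y - b * mean_x     # least-squares intercept
r = sxy / math.sqrt(sxx * syy)

# Fact 3: slope = r * (s_y / s_x), with sample standard deviations s_x and s_y.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
slope_from_r = r * s_y / s_x

# Fact 4: the line passes through the point of means.
predicted_at_mean_x = a + b * mean_x

# Fact 5: r squared is the fraction of the variation in y explained by the line.
sse = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
explained_fraction = 1 - sse / syy

print(b, slope_from_r)
print(predicted_at_mean_x, mean_y)
print(r ** 2, explained_fraction)
```

Each printed pair should agree up to floating-point rounding.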
Residual plots

A residual plot is a scatterplot of the regression
residuals against the explanatory variable.
Residual plots help us assess the fit of a
regression line.

A residual is the difference between an
observed value of the response variable and the value
predicted by the regression line. That is,

$$\text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}$$
Outliers and Influential Observations

An outlier is an observation that lies outside the overall
pattern of the other observations.

An observation is influential for a statistical calculation if
removing it would markedly change the result of the
calculation.

Points that are outliers in the x direction of a scatterplot
are often influential for the least-squares regression line.
Such influential observations are outliers in x, although they
need not lie far from the fitted line.
[Figure: Outliers and Influential Observations. Scatterplot of Heart attacks (y-axis) against Wine consumption (x-axis), with one point labeled "Outlier" and another labeled "Influential observation".]
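To see what "influential" means in practice, the sketch below refits a least-squares line after adding one extra point to an invented data set (these are not the wine and heart-attack data from the figure). One added point is an outlier in the x direction; the other is an outlier in the y direction placed at the mean of x.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b

# Hypothetical base data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a_base, b_base = fit_line(x, y)

# Add a point far out in the x direction: (14, 2).
a_x_out, b_x_out = fit_line(x + [14], y + [2])

# Add a point that is unusual only in y, at the mean of x: (3, 9).
a_y_out, b_y_out = fit_line(x + [3], y + [9])

print("base slope:          ", round(b_base, 3))
print("with x-outlier slope:", round(b_x_out, 3))
print("with y-outlier slope:", round(b_y_out, 3))
```

In this example the point far out in x reverses the sign of the slope, so removing it would markedly change the result; the point that is extreme only in y shifts the intercept but leaves the slope unchanged.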
Beware extrapolation

Extrapolation is the use of a regression line for prediction
far outside the range of values of the explanatory variable x
that you used to obtain the line. Such predictions are often
not accurate.
Example

Suppose Angela was 1.20m tall on January 1st 1975, and
1.40m tall on January 1st 1976. By extrapolation, estimate
her height on January 1st 1977.

By extrapolation, it could be estimated that by January 1st
1977 she would have grown another 0.20m to be 1.60m
tall. This however assumes that she continued to grow at
the same rate. This must eventually become a false
assumption, otherwise by January 1st 1980, she would be a
giantess.
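The arithmetic of this example can be sketched as a straight line through the two observations, extended beyond them. The function name `extrapolated_height` is chosen for this illustration.

```python
# Two observations from the example: (year, height in metres).
year_1, height_1 = 1975, 1.20
year_2, height_2 = 1976, 1.40

growth_per_year = (height_2 - height_1) / (year_2 - year_1)

def extrapolated_height(year):
    """Height predicted by extending the straight line through both points."""
    return height_1 + growth_per_year * (year - year_1)

for year in (1977, 1980):
    print(year, round(extrapolated_height(year), 2))
```

The line gives 1.60 m for 1977, which is plausible, but 2.20 m for 1980, which shows how quickly a constant-growth assumption breaks down outside the observed range.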
Lurking variable

A lurking variable is a variable that has an important
effect on the relationship among the variables in a study
but is not included among the variables studied.

Example: Studies of the relationship between treatment of
heart disease and the patients' gender show that women
are in general treated less aggressively than men with
similar symptoms. Women are less likely to undergo
bypass operations.

Question: Might this be discrimination?
Answer: Not necessarily. Be aware of the lurking variable:
although half of heart disease victims are women, they are on
average much older than male victims, so age rather than
gender may explain part of the difference in treatment.
Association does not imply causation
Example: Sales of rum and the number of Methodist ministers
are positively correlated, but a large number of ministers
does not encourage rum drinking.
Is there a lurking variable that influences both rum sales
and the number of Methodist ministers?
In this example, both the sales of rum and the
number of Methodist ministers were correlated with the
number of people in the U.S. As the number of people
increases, demand increases for both
Methodist ministers and rum.
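The rum and ministers example can be imitated with a small simulation. Everything in the sketch below is invented for illustration: the population figures, the coefficients, and the noise levels are assumptions, not real data. Population drives both quantities, and neither affects the other.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation r of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals_after_fit(xs, ys):
    """Residuals of ys after a least-squares regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return [y - (a + b * x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented numbers: population is the lurking variable behind both series.
population = [rng.uniform(10, 100) for _ in range(500)]
rum_sales = [3.0 * p + rng.gauss(0, 15) for p in population]
ministers = [0.5 * p + rng.gauss(0, 3) for p in population]

# Strong association between rum sales and ministers...
r_raw = correlation(rum_sales, ministers)

# ...which largely disappears once the effect of population is removed from both.
r_adjusted = correlation(residuals_after_fit(population, rum_sales),
                         residuals_after_fit(population, ministers))

print(round(r_raw, 2), round(r_adjusted, 2))
```

The raw correlation is high, while the correlation between the two sets of residuals (what is left after regressing each series on population) is close to zero, which is the pattern expected when a lurking variable, rather than a causal link, produces the association.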