Chapter 4
Describing Bivariate
Numerical Data
Created by Kathy Fritz
Correlation
Pearson’s Sample Correlation Coefficient
Properties of r
Interpreting Scatterplots
To interpret a scatterplot, follow the basic strategy of
data analysis from Chapters 1 and 2. Look for patterns
and important departures from those patterns.
How to Examine a Scatterplot
As in any graph of data, look for the overall pattern and for
striking departures from that pattern.
• You can describe the overall pattern of a scatterplot by
the direction, form, and strength of the relationship.
• An important kind of departure is an outlier, an individual
value that falls outside the overall pattern of the
relationship.
Interpreting Scatterplots
Definition:
Two variables have a positive association when above-average
values of one tend to accompany above-average values of the
other, and when below-average values also tend to occur
together.
Two variables have a negative association when above-average
values of one tend to accompany below-average values of the
other.
Consider the SAT example. Interpret the scatterplot by describing its direction, form, and strength.
There is a moderately strong, negative, curved relationship between the percent of students in a state who take the SAT and the mean SAT math score.
Further, there are two distinct clusters of states and two possible outliers that fall outside the overall pattern.
Interpreting Scatterplots
When the points in a scatterplot tend to cluster tightly
around a line, the relationship is described as strong.
Try to order the scatterplots from strongest
relationship to the weakest.
[Four scatterplots, labeled A, B, C, and D.]
These four scatterplots were constructed using data from graphs in Archives of General Psychiatry (June 2010).
Pearson’s Sample Correlation Coefficient
• Usually referred to as just the correlation
coefficient
• Denoted by r
• Measures the strength and direction of a
linear relationship between two numerical
variables
Measuring Linear Association: Correlation
A scatterplot displays the strength, direction, and form of the
relationship between two quantitative variables.
Linear relationships are important because a straight line is a
simple pattern that is quite common. Unfortunately, our eyes
are not good judges of how strong a linear relationship is.
Definition:
The correlation r measures the strength of the linear
relationship between two quantitative variables.
•r is always a number between -1 and 1
•r > 0 indicates a positive association.
•r < 0 indicates a negative association.
•Values of r near 0 indicate a very weak linear relationship.
•The strength of the linear relationship increases as r
moves away from 0 towards -1 or 1.
•The extreme values r = -1 and r = 1 occur only in the case
of a perfect linear relationship.
Properties of r
1. The sign of r matches the direction of the
linear relationship.
r is positive
r is negative
Properties of r
2. The value of r is always greater than or equal
to -1 and less than or equal to +1.
Weak correlation: |r| < 0.5
Moderate correlation: 0.5 ≤ |r| < 0.8
Strong correlation: 0.8 ≤ |r| ≤ 1
Properties of r
3. r = 1 only when all the points in the
scatterplot fall on a straight line that slopes
upward. Similarly, r = -1 when all the
points fall on a downward sloping line.
Properties of r
4. r is a measure of the extent to which x and y
are linearly related.
Find the correlation for these points:

x:   2   4   6   8  10  12  14
y:  40  20   8   4   8  20  40

Sketch the scatterplot, then compute the correlation coefficient: r = 0.

[Scatterplot: the points form a symmetric U-shape, with y from 4 to 40 plotted against x from 2 to 14.]
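This claim is easy to verify numerically. A minimal sketch in Python (standard library only; the variable names are ours, not from the text) computes r for these seven points:

```python
from math import sqrt

x = [2, 4, 6, 8, 10, 12, 14]
y = [40, 20, 8, 4, 8, 20, 40]   # perfectly symmetric U-shape
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sx = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
sy = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Pearson's r as the average product of z-scores (dividing by n - 1)
r = sum(((xi - x_bar) / sx) * ((yi - y_bar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)
print(round(abs(r), 10))  # 0.0: y is completely determined by x, yet r = 0
```

The positive and negative z-score products cancel exactly, which is why r measures only the linear part of a relationship.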
Properties of r
5. The value of r does not depend on the unit
of measurement for either variable.
Mare Weight (in Kg)   Foal Weight (in Kg)   Mare Weight (in lbs)   Foal Weight (in Kg)
556                   129.0                 1223.2                 129.0
638                   119.0                 1403.6                 119.0
588                   132.0                 1293.6                 132.0
550                   123.5                 1210.0                 123.5
580                   112.0                 1276.0                 112.0
642                   113.5                 1412.4                 113.5
568                    95.0                 1249.6                  95.0
642                   104.0                 1412.4                 104.0
556                   104.0                 1223.2                 104.0
616                    93.5                 1355.2                  93.5
549                   108.5                 1207.8                 108.5
504                    95.0                 1108.8                  95.0
515                   117.5                 1111.0                 117.5
551                   128.0                 1212.2                 127.5
594                   127.5                 1306.8                 127.5
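This property can be checked directly. The sketch below (with a helper `pearson_r` written for this illustration, not from the text) computes r for the mare weights in kilograms and again after converting to pounds:

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient computed from z-scores."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return sum(((a - x_bar) / sx) * ((b - y_bar) / sy)
               for a, b in zip(x, y)) / (n - 1)

mare_kg = [556, 638, 588, 550, 580, 642, 568, 642,
           556, 616, 549, 504, 515, 551, 594]
foal_kg = [129.0, 119.0, 132.0, 123.5, 112.0, 113.5, 95.0, 104.0,
           104.0, 93.5, 108.5, 95.0, 117.5, 128.0, 127.5]

mare_lbs = [w * 2.2 for w in mare_kg]  # changing units is a linear rescaling

r_kg = pearson_r(mare_kg, foal_kg)
r_lbs = pearson_r(mare_lbs, foal_kg)
assert abs(r_kg - r_lbs) < 1e-12  # identical: r carries no units
```

Because r is built from z-scores, any linear change of units rescales the deviations and the standard deviation by the same factor, leaving r unchanged.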
Calculating the Correlation Coefficient
The correlation coefficient is calculated using the following formula:

r = (1/(n - 1)) Σ zx·zy

where zx = (x - x̄)/sx and zy = (y - ȳ)/sy.
Correlation
The formula for r is a bit complex. It helps us to see
what correlation is, but in practice, you should use
your calculator or software to find r.
How to Calculate the Correlation r
Suppose that we have data on variables x and y for n individuals.
The values for the first individual are x1 and y1, the values for the
second individual are x2 and y2, and so on.
The means and standard deviations of the two variables are x-bar and
sx for the x-values and y-bar and sy for the y-values.
The correlation r between x and y is:

r = (1/(n - 1)) [ ((x1 - x̄)/sx)((y1 - ȳ)/sy) + ((x2 - x̄)/sx)((y2 - ȳ)/sy) + ... + ((xn - x̄)/sx)((yn - ȳ)/sy) ]

or, more compactly:

r = (1/(n - 1)) Σ ((xi - x̄)/sx)((yi - ȳ)/sy)
The web site www.collegeresults.org (The
Education Trust) publishes data on U.S.
colleges and
universities. The following six-year graduation
rates and student-related expenditures per
full-time student for 2007 were reported for the seven
primarily undergraduate public universities in California with
enrollments between 10,000 and 20,000.
Expenditures:       8810  7780  8112  8149  8477  7342  7984
Graduation rates:   66.1  52.4  48.9  48.1  42.0  38.3  31.3
Here is the scatterplot:
Does the relationship appear
linear?
Explain.
College Expenditures Continued:
To compute the correlation coefficient, first
find the z-scores.
x     y     zx     zy     zx·zy
8810  66.1   1.52   1.74   2.64
7780  52.4  -0.66   0.51  -0.34
8112  48.9   0.04   0.19   0.01
8149  48.1   0.12   0.12   0.01
8477  42.0   0.81  -0.42  -0.34
7342  38.3  -1.59  -0.76   1.21
7984  31.3  -0.23  -1.38   0.32
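Summing the zx·zy column and dividing by n - 1 gives r. A short Python sketch (standard library only) reproduces the calculation from the raw data:

```python
from math import sqrt

expend = [8810, 7780, 8112, 8149, 8477, 7342, 7984]   # x: expenditures per student
grad = [66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3]     # y: six-year graduation rates
n = len(expend)

x_bar, y_bar = sum(expend) / n, sum(grad) / n
sx = sqrt(sum((x - x_bar) ** 2 for x in expend) / (n - 1))
sy = sqrt(sum((y - y_bar) ** 2 for y in grad) / (n - 1))

zx = [(x - x_bar) / sx for x in expend]
zy = [(y - y_bar) / sy for y in grad]

# r is the sum of the z-score products divided by n - 1
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 2))  # about 0.58: a moderate positive linear relationship
```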
How the Correlation Coefficient Measures
the Strength of a Linear Relationship
zx is negative
zy is positive
zxzy is negative
zx is negative
zy is negative
zxzy is positive
zx is positive
zy is positive
zxzy is positive
Will the sum of
zxzy be positive
or negative?
How the Correlation Coefficient Measures
the Strength of a Linear Relationship
Will the sum of zxzy
be positive or
negative or zero?
Facts about Correlation
How correlation behaves is more important than the
details of the formula. Here are some important facts
about r.
1. Correlation makes no distinction between explanatory
and response variables.
2. r does not change when we change the units of
measurement of x, y, or both.
3. The correlation r itself has no unit of measurement.
Cautions:
• Correlation requires that both variables be quantitative.
• Correlation does not describe curved relationships between
variables, no matter how strong the relationship is.
• Correlation is not resistant. r is strongly affected by a few
outlying observations.
• Correlation is not a complete summary of two-variable data.
Does a value of r close to 1 or -1 mean that
a change in one variable causes a change in
the other variable?
Association or
correlation does
NOT imply
causation.
Consider the following examples:

• The relationship between the number of cavities in a child's teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? No: these variables are both strongly related to the age of the child.

• Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink hot chocolate to lower the crime rate? No: both are responses to cold weather.

Causality can only be shown by carefully controlling the values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
Linear Regression
Least Squares Regression Line
Suppose there is a relationship between two
numerical variables.
Let x be the amount spent on advertising and y be
the amount of sales for the product during a given
period.
You might want to predict product sales (y) for a
month when the amount spent on advertising is
$10,000 (x).
The equation of a line is:
y = a + bx
Where:
b – is the slope of the line
– it is the amount by which y increases
when x increases by 1 unit
a – is the intercept (also called y-intercept or
vertical intercept)
– it is the height of the line above x = 0
– in some contexts, it is not reasonable to
interpret the intercept
Regression Line
Linear (straight-line) relationships between two quantitative
variables are common and easy to understand. A regression line
summarizes the relationship between two variables, but only in
settings where one of the variables helps explain or predict the
other.
Definition:
A regression line is a line that describes how a response
variable y changes as an explanatory variable x changes. We
often use a regression line to predict the value of y for a
given value of x.
Shown is a scatterplot of the change in
nonexercise activity (cal) and measured fat
gain (kg) after 8 weeks for 16 healthy young
adults.
• The plot shows a moderately strong, negative, linear association between NEA change and fat gain with no outliers.
• The regression line predicts fat gain from change in NEA.
When nonexercise
activity = 800 cal,
our line predicts a
fat gain of about
0.8 kg after 8
weeks.
Interpreting a Regression Line
A regression line is a model for the data, much like
density curves. The equation of a regression line gives
a compact mathematical description of what this model
tells us about the relationship between the response
variable y and the explanatory variable x.
Definition:
Suppose that y is a response variable (plotted on the
vertical axis) and x is an explanatory variable (plotted on
the horizontal axis). A regression line relating y to x has an
equation of the form
ŷ = a + bx
In this equation,
•ŷ (read “y hat”) is the predicted value of the response
variable y for a given value of the explanatory variable x.
•b is the slope, the amount by which y is predicted to change
when x increases by one unit.
•a is the y intercept, the predicted value of y when x = 0.
The Deterministic Model
Notice, the y-value is
determined by
substituting the
x-value into the
equation of the line.
Also notice that the
points fall on the line.
Interpreting a Regression Line
Consider the regression line from the example “Does
Fidgeting Keep You Slim?” Identify the slope and y-intercept and interpret each value in context.
predicted fat gain = 3.505 - 0.00344(NEA change)
The slope b = -0.00344 tells us
that the amount of fat gained is
predicted to go down by 0.00344
kg for each added calorie of
NEA.
The y-intercept a = 3.505 kg is
the fat gain estimated by this
model if NEA does not change
when a person overeats.
• Prediction
predicted fat gain = 3.505 - 0.00344(NEA change)
Least-Squares Regression
We can use a regression line to predict the response ŷ
for a specific value of the explanatory variable x.
Use the NEA and fat gain regression line to predict the
fat gain for a person whose NEA increases by 400 cal
when she overeats.
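Substituting x = 400 into the fitted equation gives the prediction. A one-line check in Python (the function name is ours, for illustration):

```python
# Fitted line from the example: predicted fat gain = 3.505 - 0.00344(NEA change)
def predicted_fat_gain(nea_change_cal):
    return 3.505 - 0.00344 * nea_change_cal

print(round(predicted_fat_gain(400), 3))  # 2.129 kg of predicted fat gain
```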
Extrapolation
We can use a regression line to predict the response ŷ
for a specific value of the explanatory variable x. The
accuracy of the prediction depends on how much the
data scatter about the line.
While we can substitute any value of x into the equation
of the regression line, we must exercise caution in
making predictions outside the observed values of x.
Definition:
Extrapolation is the use of a regression line for prediction
far outside the interval of values of the explanatory
variable x used to obtain the line. Such predictions are
often not accurate.
Don’t make predictions using values of x that are much larger or much
smaller than those that actually appear in your data.
How do you find an appropriate line for
describing a bivariate data set?
y = 10 + 2x

[Scatterplot of the data with the candidate line y = 10 + 2x drawn through it.]
Residuals
In most cases, no line will pass exactly through all the points in a
scatterplot. A good regression line makes the vertical distances
of the points from the line as small as possible.
Definition:
A residual is the difference between an observed value of the
response variable and the value predicted by the regression line.
That is,
residual = observed y – predicted y
residual = y - ŷ
Positive residuals correspond to points above the line; negative residuals correspond to points below the line.
Least squares regression line
Different regression lines produce different residuals.
The least squares regression line is the line that
minimizes the sum of squared deviations or residuals.
The equation of the least squares regression line is:
ŷ = a + bx
where the y-intercept is:
a = ȳ - b·x̄
Least-Squares Regression Line
We can use technology to find the equation of the least-squares regression line. We can also write it in terms
of the means and standard deviations of the two
variables and their correlation.
Definition: Equation of the least-squares regression line
We have data on an explanatory variable x and a response
variable y for n individuals. From the data, calculate the means
and standard deviations of the two variables and their
correlation. The least squares regression line is the line ŷ = a +
bx with
slope: b = r·(sy/sx)
and y-intercept: a = ȳ - b·x̄
Let’s investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0,0), (3,10), and (6,2).
The least squares line for these three points is:
ŷ = 3 + (1/3)x
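The line ŷ = 3 + (1/3)x can be recovered from the least squares formulas b = Sxy/Sxx and a = ȳ - b·x̄. A short sketch:

```python
pts = [(0, 0), (3, 10), (6, 2)]
n = len(pts)
x_bar = sum(x for x, _ in pts) / n   # 3.0
y_bar = sum(y for _, y in pts) / n   # 4.0

# slope b = Sxy / Sxx, intercept a = y_bar - b * x_bar
b = (sum((x - x_bar) * (y - y_bar) for x, y in pts)
     / sum((x - x_bar) ** 2 for x, _ in pts))
a = y_bar - b * x_bar

print(a, round(b, 4))  # a = 3.0 and b = 1/3: the line y-hat = 3 + (1/3)x
```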
Pomegranate, a fruit native to Persia, has been used in the
folk medicines of many cultures to treat various ailments.
Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer.
In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded at several points in time.
For the mice assigned to plain water, let x = number of days after injection of cancer cells and y = average tumor volume (in mm³).

x:   11   15   19   23   27
y:  150  270  450  580  740

Sketch a scatterplot for this data set.
Pomegranate study continued
Predict the average volume of the tumor for 20
days after injection.
Predict the average volume of the tumor for 5
days after injection.
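The slide asks for these predictions but does not print the fitted equation, so the sketch below fits it from the five tabulated points and then predicts at both x-values; note what happens at x = 5, which lies well outside the observed days:

```python
days = [11, 15, 19, 23, 27]
vol = [150, 270, 450, 580, 740]   # mean tumor volume in mm^3
n = len(days)

x_bar, y_bar = sum(days) / n, sum(vol) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(days, vol))
     / sum((x - x_bar) ** 2 for x in days))
a = y_bar - b * x_bar   # b = 37.25, a = -269.75

print(a + b * 20)  # 475.25 mm^3: x = 20 is inside the observed range (interpolation)
print(a + b * 5)   # -83.5 mm^3: x = 5 is extrapolation; a negative volume is impossible
```

The nonsensical negative prediction at 5 days is exactly the extrapolation danger discussed earlier.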
Assessing the Fit of a Line
Residuals
Residual Plots
Outliers and Influential Points
Coefficient of Determination
Standard Deviation about the Line
Assessing the fit of a line
Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?
3. If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be?
Residuals
Recall that the vertical deviations of points from the least squares regression line are called residuals:
residual = y - ŷ
In a study, researchers were interested in how the distance a
deer mouse will travel for food (y) is related to the distance
from the food to the nearest pile of fine woody debris (x).
Distances were measured in meters.
Distance from Debris (x)   Distance Traveled (y)
6.94                        0.00
5.23                        6.13
5.21                       11.29
7.10                       14.35
8.16                       12.03
5.50                       22.72
9.19                       20.11
9.05                       26.16
9.36                       30.65
Residual Plots
One of the first principles of data analysis is to look for an
overall pattern and for striking departures from the
pattern. A regression line describes the overall pattern
of a linear relationship between two variables. We see
departures from this pattern by looking at the residuals.
Definition:
A residual plot is a scatterplot of the residuals against the
explanatory variable. Residual plots help us assess how well a
regression line fits the data.
Residual plots
• A residual plot is a scatterplot of the
(x, residual) pairs.
• Residuals can also be graphed against the
predicted y-values
• Isolated points or a pattern of points in the
residual plot indicate potential problems.
Interpreting Residual Plots
A residual plot magnifies the deviations of the points
from the line, making it easier to see unusual
observations and patterns.
1) The residual plot should show no obvious patterns
2) The residuals should be relatively small in size.
Pattern in residuals
Linear model not
appropriate
Deer mice continued

Distance from Debris (x)   Distance Traveled (y)   Predicted value (ŷ)   Residual (y - ŷ)
6.94                        0.00                   14.76                 -14.76
5.23                        6.13                    9.23                  -3.10
5.21                       11.29                    9.16                   2.13
7.10                       14.35                   15.28                  -0.93
8.16                       12.03                   18.70                  -6.67
5.50                       22.72                   10.10                  12.62
9.19                       20.11                   22.04                  -1.93
9.05                       26.16                   21.58                   4.58
9.36                       30.65                   22.59                   8.06

Plot the residuals against the distance from debris (x).
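The predicted values and residuals come from the least squares line fit to the nine points. A sketch that reproduces them (values rounded; the data are as tabulated):

```python
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
print(round(a, 2), round(b, 3))  # about -7.69 and 3.234

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(residuals[0], 2))        # about -14.76, matching the first row
assert abs(sum(residuals)) < 1e-9    # least squares residuals always sum to zero
```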
Deer mice continued

[Residual plot: residuals (from about -15 to 15) versus distance from debris.]

[Residual plot: residuals versus predicted distance traveled, showing the same pattern.]
Residual plots continued
Let’s examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).

x:   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72
y:  113  115  118  121  124  128  131  134  137  141  145  150  153  159  164
Let’s examine the data set for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)

x:  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y:    54   40    62    51   55   56   62   50   42    40   59   51

Sketch a scatterplot with the fitted regression line.

[Scatterplot: weight (40-62 kg) versus age (5-30 years).]
Black bears continued

[Scatterplot: weight (kg) versus age (years) with the fitted least squares regression line.]
Coefficient of Determination
• The coefficient of determination is the
proportion of variation in y that can be
attributed to an approximate linear
relationship between x & y
• Denoted by r2
• The value of r2 is often converted to a
percentage.
The Role of r2 in Regression
The standard deviation of the residuals gives us a
numerical estimate of the average size of our
prediction errors. There is another numerical quantity
that tells us how well the least-squares regression line
predicts values of the response y.
Definition:
The coefficient of determination r2 is the fraction of the variation in the
values of y that is accounted for by the least-squares regression line of y on
x. We can calculate r2 using the following formula:
r2 = 1 - SSE/SST

where SSE = Σ(residual)² and SST = Σ(yi - ȳ)²
Let’s explore the meaning of r2 by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x:  6.94  5.23  5.21  7.10  8.16  5.50  9.19  9.05  9.36
y:  0.00  6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

Suppose you didn't know any x-values. What distance would you expect deer mice to travel?
Deer mice continued
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

Now let's find how much variation there is in the distance traveled (y) from the least squares regression line.

[Scatterplot: distance traveled versus distance to debris, with the least squares regression line.]
Deer mice continued
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
Total amount of variation in
the distance traveled (y) is
SSTO = 773.95 m2.
The amount of variation in
y values from the
regression line is
SSResid = 526.27 m2
Approximately what percent of the variation in
distance traveled (y) can be explained by the
linear relationship?
r2 = 32%
Standard Deviation about the Least
Squares Regression Line
The standard deviation about the least squares
regression line is
se = √( SSResid / (n - 2) )
The value of se can be interpreted as the
typical amount an observation deviates from the
least squares regression line.
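These quantities can be computed directly for the deer mouse data (a sketch; the nine observations are as tabulated earlier):

```python
from math import sqrt

x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # about 526.27
ss_total = sum((yi - y_bar) ** 2 for yi in y)                     # about 773.95

r2 = 1 - ss_resid / ss_total     # about 0.32
se = sqrt(ss_resid / (n - 2))    # about 8.67: typical prediction error, in meters
```

These match the values quoted for the deer mouse example: roughly 32% of the variation in distance traveled is explained by the line, and a typical prediction misses by close to 9 meters.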
Standard Deviation about the Least
Squares Regression Line
Definition:
If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (se) is given by

se = √( Σ(residuals)² / (n - 2) ) = √( Σ(yi - ŷi)² / (n - 2) )
Partial output from the regression analysis of deer mouse data:

Predictor            Coef   SE Coef      T      P
Constant            -7.69     13.33  -0.58  0.582
Distance to debris  3.234     1.782   1.82  0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

Analysis of Variance
Source        DF      SS      MS     F      P
Regression     1  247.68  247.68  3.29  0.112
Resid Error    7  526.27   75.18
Total          8  773.95
Interpreting the Values of se and r2
A small value of se indicates that residuals tend to be
small. This value tells you how much accuracy you can
expect when using the least squares regression line to
make predictions.
A large value of r2 indicates that a large proportion of
the variability in y can be explained by the approximate
linear relationship between x and y. This tells you that
knowing the value of x is helpful for predicting y.
A useful regression line will have a reasonably small value
of se and a reasonably large value of r2.
A study (Archives of General Psychiatry [2010]: 570-577)
looked at how working memory capacity was related to
scores on a test of cognitive functioning and to scores on
an IQ test. Two groups were studied – one group
consisted of patients diagnosed with schizophrenia and
the other group consisted of healthy control subjects.
Putting it All Together
Describing Linear Relationships
Making Predictions
Steps in a Linear Regression Analysis
1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.
3. Find the equation of the least squares regression line.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that the line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of se and r2 and interpret them in context.
6. Based on what you have learned from the residual plot and the values of se and r2, decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least squares regression line to make predictions.
Revisit the crime scene DNA data
Recall the scientists were interested in predicting age of
a crime scene victim (y) using the blood test measure (x).
Step 4: A residual plot constructed from these data
showed a few observations with large residuals, but
these observations were not far removed from the
rest of the data in the x direction. The observations
were not judged to be influential. Also there were no
unusual patterns in the residual plot that would
suggest a nonlinear relationship between age and the
blood test measure.
Step 5: se = 8.9 and r2 = 0.835
Approximately 83.5% of the variability in age can be
explained by the linear relationship. A typical
difference between the predicted age
and the actual age would be about
9 years.
Step 6: Based on the residual plot, the large value of r2,
and the relatively small value of se, the scientists
proposed using the blood test measure and the least
squares regression line as a way to estimate ages of
crime victims.
Step 7: To illustrate predicting age, suppose that a blood
sample is taken from an unidentified crime victim and
that the value of the blood test measure is
determined to be -10. The predicted age of the
victim would be
ŷ = -33.65 - 6.74(-10) = 33.75 years
Common Mistakes
Avoid these Common Mistakes
1. Correlation does not imply causation. A
strong correlation implies only that the two
variables tend to vary together in a
predictable way, but there are many possible
explanations for why this is occurring other
than one variable causing change in the other.
Avoid these Common Mistakes
2. A correlation coefficient near 0 does not
necessarily imply that there is no relationship
between two variables. Although the
variables may be unrelated, it is also possible
that there is a strong but nonlinear
relationship.
Be sure to look at a scatterplot!

[Scatterplot: the U-shaped data from earlier, for which r = 0 despite a perfect curved relationship.]
Avoid these Common Mistakes
3. The least squares regression line for
predicting y from x is NOT the same line as
the least squares regression line for
predicting x from y.
The ages (x, in months) and heights (y, in inches) of seven children are given.

x:  16  24  42  60  75  102  120
y:  24  30  35  40  48   56   60
To predict height from age:
predicted height = 20.40 + 0.34(age)
To predict age from height:
predicted age = -58.20 + 2.89(height)
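Both fitted lines can be checked, and the point made concrete, with a short sketch (the helper `ls_line` is ours, for illustration):

```python
def ls_line(x, y):
    """Least squares intercept and slope for predicting y from x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

age = [16, 24, 42, 60, 75, 102, 120]     # months
height = [24, 30, 35, 40, 48, 56, 60]    # inches

a1, b1 = ls_line(age, height)   # about 20.40 + 0.34(age)
a2, b2 = ls_line(height, age)   # about -58.20 + 2.89(height)

# The second line is NOT the algebraic inverse of the first: solving the first
# equation for age would give a slope of 1/b1 (about 2.92), not b2 (about 2.89).
assert abs(b2 - 1 / b1) > 0.01
```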
Avoid these Common Mistakes
4. Beware of extrapolation. Using the least
squares regression line to make predictions
outside the range of x values in the data set
often leads to poor predictions.
Predict the height of a child that is 15 years (180 months) old.
predicted height = 20.4 + 0.34(180) = 81.6 inches
It is unreasonable that a 15-year-old would be 81.6 inches (6.8 feet) tall.
Avoid these Common Mistakes
5. Be careful in interpreting the value of the
intercept of the least squares regression line.
In many instances interpreting the intercept
as the value of y that would be predicted
when x = 0 is equivalent to extrapolating way
beyond the range of x values in the data set.
The ages (x, in months) and heights (y, in inches) of seven children are given.

x:  16  24  42  60  75  102  120
y:  24  30  35  40  48   56   60
Avoid these Common Mistakes
6. Remember that the least squares regression
line may be the “best” line, but that doesn’t
necessarily mean that the line will produce
good predictions.
This has a relatively
large se – thus we can’t
accurately predict IQ
from working memory
capacity.
Avoid these Common Mistakes
7. It is not enough to look at just r2 or just se
when evaluating the regression line.
Remember to consider both values. In
general, you would like to have both a small
value for se and a large value for r2.
Avoid these Common Mistakes
8. The value of the correlation coefficient, as
well as the values for the intercept and slope
of the least squares regression line, can be
sensitive to influential observations in the
data set, particularly if the sample size is
small.
Be sure to always start with a plot to check
for potential influential observations.
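The black bear data from earlier illustrates this nicely: one bear is far removed from the others in the x direction. The sketch below refits the slope with and without that bear (the x-y pairing is reconstructed from the flattened slide table, so treat the exact numbers as approximate):

```python
def slope(x, y):
    """Least squares slope for predicting y from x (illustration helper)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    return (sum((a - xb) * (b - yb) for a, b in zip(x, y))
            / sum((a - xb) ** 2 for a in x))

age = [10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5]
weight = [54, 40, 62, 51, 55, 56, 62, 50, 42, 40, 59, 51]

b_all = slope(age, weight)
# Drop the single 28.5-year-old bear, an outlier in the x direction:
keep = [(a, w) for a, w in zip(age, weight) if a != 28.5]
b_trim = slope([a for a, _ in keep], [w for _, w in keep])

print(round(b_all, 2), round(b_trim, 2))  # with this pairing, the slope even changes sign
```

Removing one high-leverage point changes the fitted slope dramatically, which is exactly what makes that observation influential.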
Correlation and Regression Wisdom
1. The distinction between explanatory and response variables
is important in regression.
Least-Squares Regression
Correlation and regression are powerful tools for
describing the relationship between two variables.
When you use these tools, be aware of their
limitations
Correlation and Regression Wisdom
Definition:
An outlier is an observation that lies outside the overall
pattern of the other observations. Points that are outliers in
the y direction but not the x direction of a scatterplot have
large residuals. Other outliers may not have large residuals.
An observation is influential for a statistical calculation if
removing it would markedly change the result of the calculation.
Points that are outliers in the x direction of a scatterplot are
often influential for the least-squares regression line.
Least-Squares Regression
2. Correlation and regression lines describe only linear
relationships.
3. Correlation and least-squares regression lines are not
resistant.
Correlation and Regression Wisdom
4. Association does not imply causation.
An association between an explanatory variable x and a
response variable y, even if it is very strong, is not by itself
good evidence that changes in x actually cause changes in y.
A serious study once
found that people with
two cars live longer than
people who only own one
car. Owning three cars is
even better, and so on.
There is a substantial
positive correlation
between number of cars
x and length of life y.
Why?
Least-Squares Regression
Association Does Not Imply Causation