Chapter 4
Describing Bivariate
Numerical Data
Created by Kathy Fritz
Forensic scientists must often estimate the age of an unidentified crime victim. Prior to 2010, this was usually done by analyzing teeth and bones, and the resulting estimates were not very reliable. A study described in the paper “Estimating Human Age from T-Cell DNA Rearrangements” (Current Biology [2010]) examined the relationship between age and a measure based on a blood test.
Age and the blood test measure were recorded for 195 people ranging in age from a few weeks to 80 years. A scatterplot of the data appears to the right; a line fit to these data can be used to estimate the age of a crime victim from a blood test.
Do you think there is a relationship? If so, what kind? If not, why not?
Correlation
Pearson’s Sample Correlation Coefficient
Properties of r
For each of the following scatterplots, ask:
 Does it look like there is a relationship between the two variables?
 If so, is the relationship linear?
Scatterplot 1: Yes, there is a relationship, and it looks linear.
Scatterplot 2: Yes, there is a relationship, and it looks linear.
Scatterplot 3: Yes, there is a relationship, but it looks curved rather than linear.
Scatterplot 4: Yes, there is a relationship, but it looks parabolic rather than linear.
Scatterplot 5: No, there does not appear to be a relationship.
Linear relationships can be either positive
or negative in direction.
Are these linear relationships positive or
negative?
Negative
Positive
When the points in a scatterplot tend to cluster tightly
around a line, the relationship is described as strong.
Try to order the scatterplots (A, B, C, D) from strongest relationship to weakest: A, C, B, D
These four scatterplots were constructed using data from graphs in Archives of General Psychiatry (June 2010).
Pearson’s Sample Correlation Coefficient
• Usually referred to as just the correlation coefficient
• Denoted by r
• Measures the strength and direction of a linear relationship between two numerical variables
• The strongest values of the correlation coefficient are r = +1 and r = -1; the weakest value of the correlation coefficient is r = 0.
An important definition!
Properties of r
1. The sign of r matches the direction of the linear relationship: r is positive when the relationship is positive and negative when the relationship is negative.
Properties of r
2. The value of r is always greater than or equal to -1 and less than or equal to +1.
 Values of r beyond ±0.8 (out to ±1): strong correlation
 Values of r between ±0.5 and ±0.8: moderate correlation
 Values of r between 0 and ±0.5: weak correlation
Properties of r
3. r = 1 only when all the points in the
scatterplot fall on a straight line that slopes
upward. Similarly, r = -1 when all the
points fall on a downward sloping line.
Properties of r
4. r is a measure of the extent to which x and y are linearly related.
Find the correlation for these points:
x:  2   4   6   8  10  12  14
y: 40  20   8   4   8  20  40
Compute the correlation coefficient: r = 0
Does this mean that there is NO relationship between these variables? Sketch the scatterplot.
Even though r = 0, the data set has a definite relationship! It is just not linear: the scatterplot shows a U-shaped, curved pattern.
Properties of r
5. The value of r does not depend on the unit of measurement for either variable.
Calculate r for the data set of mares’ weights and the weights of their foals. Then change the mare weights to pounds (multiply kg by 2.2) and calculate r again.

Mare Weight (in kg)   Foal Weight (in kg)   Mare Weight (in lbs)
556                   129.0                 1223.2
638                   119.0                 1403.6
588                   132.0                 1293.6
550                   123.5                 1210.0
580                   112.0                 1276.0
642                   113.5                 1412.4
568                    95.0                 1249.6
642                   104.0                 1412.4
556                   104.0                 1223.2
616                    93.5                 1355.2
549                   108.5                 1207.8
504                    95.0                 1108.8
515                   117.5                 1111.0
551                   128.0                 1212.2
594                   127.5                 1306.8

With mare weight in kilograms: r = -0.00359
With mare weight in pounds:    r = -0.00359
The correlation is unchanged by the change of units.
Calculating the Correlation Coefficient
The correlation coefficient is calculated using the following formula:
r = Σ(zx · zy) / (n - 1)
where zx = (x - x̄)/sx and zy = (y - ȳ)/sy are the z-scores for the x and y values, and n is the number of (x, y) pairs.
The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000.

Expenditures      8810  7780  8112  8149  8477  7342  7984
Graduation rate   66.1  52.4  48.9  48.1  42.0  38.3  31.3

Here is the scatterplot. Does the relationship appear linear? Explain.
College Expenditures Continued:
To compute the correlation coefficient, first find the z-scores.

x      y      zx      zy      zx·zy
8810   66.1    1.52    1.74    2.64
8149   48.1    0.12    0.12    0.01
7780   52.4   -0.66    0.51   -0.34
8112   48.9    0.04    0.19    0.01
8477   42.0    0.81   -0.42   -0.34
7342   38.3   -1.59   -0.76    1.21
7984   31.3   -0.23   -1.38    0.32

Adding the zx·zy products and dividing by n - 1 = 6 gives r ≈ 0.58 (the z-scores shown are rounded).
To interpret the correlation coefficient, use the definition: there is a positive, moderate linear relationship between six-year graduation rates and student-related expenditures.
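As a check, here is a minimal Python sketch (an addition, not part of the original slides) that computes r by standardizing both variables, as in the formula above.

```python
# Sketch: computing r for the expenditure/graduation-rate data by standardizing
# each variable and averaging the z-score products.
import numpy as np

x = np.array([8810, 7780, 8112, 8149, 8477, 7342, 7984])   # expenditures
y = np.array([66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3])   # graduation rates

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # sample standard deviation (divide by n - 1)
zy = (y - y.mean()) / y.std(ddof=1)

r = np.sum(zx * zy) / (n - 1)
print(round(r, 2))   # about 0.58: a positive, moderate linear relationship
```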
How the Correlation Coefficient Measures the Strength of a Linear Relationship
Divide the scatterplot into quadrants at the point (x̄, ȳ):
 Where zx is negative and zy is positive (or zx is positive and zy is negative), the product zx·zy is negative.
 Where zx and zy are both positive, or both negative, the product zx·zy is positive.
For a scatterplot with a positive linear pattern, will the sum of the zx·zy products be positive or negative? What about a scatterplot with a negative linear pattern? For a scatterplot with no linear pattern, will the sum be positive, negative, or near zero?
Does a value of r close to 1 or -1 mean that
a change in one variable causes a change in
the other variable?
Association
does NOT
imply
causation.
Consider the following examples:
• The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? No. These variables are both strongly related to the age of the child.
• Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? No. Both are responses to cold weather.
Causality can only be shown by carefully controlling the values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
Linear Regression
Least Squares Regression Line
Suppose there is a relationship between two
numerical variables.
Let x be the amount spent on advertising and y be
the amount of sales for the product during a given
period.
You might want to predict product sales (y) for a
month when the amount spent on advertising is
$10,000 (x).
The equation of a line is:
y = a + bx
where
b is the slope of the line
 - it is the amount by which y increases when x increases by 1 unit
a is the intercept (also called y-intercept or vertical intercept)
 - it is the height of the line above x = 0
 - in some contexts, it is not reasonable to interpret the intercept
The Deterministic Model
We often say x determines y.
Notice, the y-value is
determined by
substituting the
x-value into the
equation of the line.
Also notice that the
points fall on the line.
But, when we fit a line to data, do
all the points fall on the line?
How do you find an appropriate line for describing a bivariate data set?
To assess the fit of a line, we look at how the points deviate vertically from the line. For example, for the line y = 10 + 2x, the point (15, 44) has a deviation of +4, since the line predicts y = 40 when x = 15.
What is the meaning of this deviation? What is the meaning of a negative deviation?
To assess the fit of a line, we need a way to combine the n deviations into a single measure of fit.
Least squares regression line
The most widely used measure of the fit of a line y = a + bx to bivariate data is the sum of the squared deviations about the line. The least squares regression line is the line that minimizes the sum of squared deviations.
Let’s investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0, 0), (3, 10), and (6, 2).
Use a calculator to find the least squares regression line: ŷ = (1/3)x + 3
Find the vertical deviations from the line: +6 at (3, 10), and -3 at both (0, 0) and (6, 2).
What is the sum of the deviations from the line? Zero. Hmmmmm . . . will the sum always be zero? Why does this seem so familiar?
Find the sum of the squares of the deviations from the line.
Sum of the squares = 54
The line that minimizes the sum of squared deviations is the least squares regression line.
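A short Python sketch (an addition, not from the slides) that fits this three-point data set and verifies both facts:

```python
# Sketch: fitting the least squares line to the points (0, 0), (3, 10), (6, 2).
import numpy as np

x = np.array([0.0, 3.0, 6.0])
y = np.array([0.0, 10.0, 2.0])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                # intercept
print(a, b)                      # 3.0 and 0.333..., i.e. y-hat = 3 + (1/3)x

deviations = y - (a + b * x)     # vertical deviations from the line
print(deviations.sum())          # 0: the deviations from this line always sum to zero
print(np.sum(deviations ** 2))   # 54, the minimized sum of squared deviations
```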
Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate’s antioxidant properties are useful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded at several points in time.
For the mice assigned to plain water, x = number of days after injection of cancer cells and y = average tumor volume (in mm3):

x (days after injection)      11   15   19   23   27
y (average tumor volume)     150  270  450  580  740

Sketch a scatterplot for this data set (average tumor volume against number of days after injection).
Interpretation of slope:
The average volume of the tumor increases by
approximately 37.25 mm3 for each day increase in
the number of days after injection.
Computer software and graphing calculators can calculate the least squares regression line.
Does the intercept have meaning in this context? Why or why not?
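A minimal sketch (added, not from the slides) of the least squares computation for these data; the slope matches the 37.25 mm3 per day interpreted above, and the negative intercept is one reason it has no sensible interpretation here.

```python
# Sketch: least squares slope and intercept for the plain-water tumor data.
import numpy as np

days = np.array([11, 15, 19, 23, 27], dtype=float)          # x
volume = np.array([150, 270, 450, 580, 740], dtype=float)   # y, in mm^3

b = np.sum((days - days.mean()) * (volume - volume.mean())) / np.sum((days - days.mean()) ** 2)
a = volume.mean() - b * days.mean()
print(b)   # 37.25 mm^3 of tumor volume per additional day
print(a)   # -269.75: x = 0 days is outside the data, so the intercept has no
           # useful interpretation here (a negative volume is impossible)
```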
Pomegranate study continued
Predict the average volume of the tumor 20 days after injection.
Predict the average volume of the tumor 5 days after injection. Can a volume be negative?
This is the danger of extrapolation. The least squares line should not be used to make predictions for y using x-values outside the range in the data set, because it is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.
Why is the line used to summarize a linear
relationship called the least squares regression line?
This terminology comes from the
relationship between the least
squares line and the correlation
coefficient.
If r = 1, what do
you know about the
location of the
points?
Why is the line used to summarize a linear
relationship called the least squares regression line?
What would happen if r = 0.4? . . . 0.3? . . . 0.2?
If you want to predict x from y, can you
use the least squares line of y on x?
The regression line of y on x should not be used
to predict x, because it is not the line that
minimizes the sum of the squared deviations in
the x direction.
Assessing the Fit of a Line
Residuals
Residual Plots
Outliers and Influential Points
Coefficient of Determination
Standard Deviation about the Line
Assessing the fit of a line
Once the least squares regression line is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?
3. If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be?
This section will look at graphical and numerical methods to answer these questions.
Residuals
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
Calculate the predicted y and the residuals. If a point is above the line, its residual is positive; if a point is below the line, its residual is negative.

Distance from   Distance        Predicted          Residual
Debris (x)      Traveled (y)    Distance (ŷ)       (y - ŷ)
6.94             0.00           14.76              -14.76
5.23             6.13            9.23               -3.10
5.21            11.29            9.16                2.13
7.10            14.35           15.28               -0.93
8.16            12.03           18.70               -6.67
5.50            22.72           10.10               12.62
9.19            20.11           22.04               -1.93
9.05            26.16           21.58                4.58
9.36            30.65           22.59                8.06
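A Python sketch (added, not from the slides) that reproduces the predicted values and residuals using the fitted line ŷ = -7.69 + 3.234x from the regression output shown later in this section; the last digits may differ slightly from the table because the coefficients are rounded.

```python
# Sketch: predicted values and residuals for the deer mouse data.
import numpy as np

x = np.array([6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36])          # distance to debris
y = np.array([0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65])   # distance traveled

y_hat = -7.69 + 3.234 * x        # predicted distance traveled
residuals = y - y_hat            # positive above the line, negative below
for xi, yi, pi, ri in zip(x, y, y_hat, residuals):
    print(f"{xi:5.2f} {yi:6.2f} {pi:6.2f} {ri:7.2f}")
```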
Residual plots
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be plotted against the predicted y-values.
• Isolated points or a pattern of points in the residual plot indicate potential problems.
A careful look at the residuals can reveal many potential problems.
Deer mice continued
Using the residuals in the table above, plot the residuals against the distance from debris (x).
Deer mice continued
Are there any isolated points? Is there a pattern in the points?
The points in the residual plot appear scattered at random. This indicates that a line is a reasonable way to describe the relationship between the distance from debris and the distance traveled.
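A minimal matplotlib sketch (an addition, not from the slides) of how this residual plot can be constructed; the fitted line ŷ = -7.69 + 3.234x is taken from the regression output shown later.

```python
# Sketch: constructing a residual plot (x, residual) for the deer mouse data.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36])
y = np.array([0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65])
residuals = y - (-7.69 + 3.234 * x)   # residuals from the fitted line

plt.scatter(x, residuals)
plt.axhline(0, linewidth=1)           # reference line at residual = 0
plt.xlabel("Distance from debris (m)")
plt.ylabel("Residual")
plt.show()   # no curvature or isolated points: a line is a reasonable summary
```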
Deer mice continued
Residual plots can be plotted against either the x-values or the predicted y-values. (One plot shows the residuals against distance from debris; the other shows the residuals against predicted distance traveled.)
Residual plots continued
Let’s examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).

x   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72
y  113  115  118  121  124  128  131  134  137  141  145  150  153  159  164

The scatterplot appears rather straight, but the residual plot displays a definite curved pattern. Even though r = 0.99, it is not accurate to say that weight increases linearly with height.
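A short Python sketch (added) that computes r for these data and shows the curved residual pattern numerically; the least squares coefficients are computed directly from the table above.

```python
# Sketch: even with r = 0.99, the residuals for the height/weight data
# show a clear curved pattern.
import numpy as np

height = np.arange(58, 73, dtype=float)   # 58, 59, ..., 72 inches
weight = np.array([113, 115, 118, 121, 124, 128, 131, 134, 137,
                   141, 145, 150, 153, 159, 164], dtype=float)

r = np.corrcoef(height, weight)[0, 1]
print(round(r, 3))                         # about 0.99

b = np.sum((height - height.mean()) * (weight - weight.mean())) / np.sum((height - height.mean()) ** 2)
a = weight.mean() - b * height.mean()
residuals = weight - (a + b * height)
print(np.round(residuals, 2))              # positive at the ends, negative in the
                                           # middle: the curved pattern in the plot
```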
Let’s examine the data set for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)

x  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y  54    40   62    51    55   56   62   50   42   40    59   51

Sketch a scatterplot with the fitted regression line. Do you notice anything unusual about this data set?
One observation has an x-value (age 28.5) that differs greatly from the others in the data set. What would happen to the regression line if this point is removed? If the point affects the placement of the least squares regression line, then the point is considered an influential point.
Black bears continued
An observation is an outlier if it has a large residual. Notice that one observation falls far away from the regression line in the y direction.
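A sketch (added) that checks the influence of the unusually old bear by fitting the line with and without that observation; the age/weight pairing is reconstructed from the slide, so the printed values are only illustrative.

```python
# Sketch: is the x = 28.5 observation influential? Fit the least squares line
# with and without it and compare.
import numpy as np

age = np.array([10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5])
weight = np.array([54, 40, 62, 51, 55, 56, 62, 50, 42, 40, 59, 51], dtype=float)

def fit(x, y):
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b * x.mean(), b        # (intercept, slope)

keep = age != 28.5                    # drop the unusually old bear
print(fit(age, weight))               # intercept and slope using all 12 bears
print(fit(age[keep], weight[keep]))   # the line changes noticeably: influential point
```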
Coefficient of Determination
• The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
• Denoted by r2
• The value of r2 is often converted to a percentage.
Suppose that you would like to predict the price of houses in a particular city from the size of the house (in square feet). There will be variability in house price, and it is this variability that makes accurate price prediction a challenge. If you know that differences in house size account for a large proportion of the variability in house price, then knowing the size of a house will help you predict its price.
Let’s explore the meaning of r2 by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x  6.94  5.23   5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0.00  6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65

Suppose you didn’t know any x-values. What distance would you expect deer mice to travel? To find the total amount of variation in the distance traveled (y), find the sum of the squared deviations from the mean. (Why do we square the deviations?)
Total amount of variation in the distance traveled (y): SSTo = 773.95 m2
Deer mice continued
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
Now let’s find how much variation there is in the distance traveled (y) from the least squares regression line. To find the amount of variation in the distance traveled (y) from the least squares regression line, find the sum of the squared residuals. (Why do we square the residuals?)
SSResid = 526.27 m2
Deer mice continued
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
Total amount of variation in the distance traveled (y): SSTo = 773.95 m2
Amount of variation in the y values from the regression line: SSResid = 526.27 m2
How does the variation in y change when we use the least squares regression line? Approximately what percent of the variation in distance traveled (y) can be explained by the linear relationship?
r2 = (SSTo - SSResid)/SSTo = (773.95 - 526.27)/773.95 ≈ 0.32, so about 32%
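A Python sketch (added) of these computations, using the fitted line from the regression output:

```python
# Sketch: SSTo, SSResid, and r^2 for the deer mouse data.
import numpy as np

x = np.array([6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36])
y = np.array([0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65])

ss_to = np.sum((y - y.mean()) ** 2)       # total variation in y, about 773.95
y_hat = -7.69 + 3.234 * x                 # fitted line from the regression output
ss_resid = np.sum((y - y_hat) ** 2)       # close to 526.27 (coefficients are rounded)
r_sq = 1 - ss_resid / ss_to
print(round(ss_to, 2), round(ss_resid, 2), round(r_sq, 2))   # r^2 is about 0.32
```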
Standard Deviation about the Least Squares Regression Line
The coefficient of determination (r2) measures the extent of variability about the least squares regression line relative to overall variability in y. This does not necessarily imply that the deviations from the line are small in an absolute sense.
Partial output from the regression analysis of deer mouse data:

Predictor            Coef     SE Coef    T       P
Constant            -7.69     13.33     -0.58    0.582
Distance to debris   3.234     1.782     1.82    0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

The standard deviation (s): this is the typical amount by which an observation deviates from the least squares regression line.
The y-intercept (a): This value has no meaning in context, since it doesn't make sense to have a negative distance.
The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter in the distance to the nearest debris pile.
The coefficient of determination (r2): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

Analysis of Variance

Source        DF    SS       MS       F      P
Regression     1    247.68   247.68   3.29   0.112
Resid Error    7    526.27    75.18
Total          8    773.95

(SSResid = 526.27 and SSTo = 773.95 appear in the SS column.)
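As a quick check (an addition, not in the slides), se can be recovered from SSResid in the output:

```python
# Sketch: the standard deviation about the least squares line, se.
import math

ss_resid = 526.27   # from the regression output
n = 9               # number of (x, y) pairs in the deer mouse data
se = math.sqrt(ss_resid / (n - 2))
print(se)           # about 8.67, matching S = 8.67071 in the output
```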
Interpreting the Values of se and r2
A small value of se indicates that residuals tend to be
small. This value tells you how much accuracy you can
expect when using the least squares regression line to
make predictions.
A large value of r2 indicates that a large proportion of
the variability in y can be explained by the approximate
linear relationship between x and y. This tells you that
knowing the value of x is helpful for predicting y.
A useful regression line will have a reasonably small value
of se and a reasonably large value of r2.
A study (Archives of General Psychiatry [2010]: 570-577) looked at how working memory capacity was related to scores on a test of cognitive functioning and to scores on an IQ test. Two groups were studied: one group consisted of patients diagnosed with schizophrenia, and the other group consisted of healthy control subjects.
For the patient group, the typical deviation of the observations from the regression line is about 10.7, which is somewhat large. Approximately 14% (a relatively small amount) of the variation in the cognitive functioning score is explained by the linear relationship.
For the control group, the typical deviation of the observations from the regression line is about 6.1, which is smaller. Approximately 79% (a much larger amount) of the variation in the cognitive functioning score is explained by the regression line.
Thus, the regression line for the control group would produce more accurate predictions than the regression line for the patient group.
Putting it All Together
Describing Linear Relationships
Making Predictions
Steps in a Linear Regression Analysis
1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.
3. Find the equation of the least squares regression line.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of se and r2 and interpret them in context.
6. Based on what you have learned from the residual plot and the values of se and r2, decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least squares regression line to make predictions.
Revisit the crime scene DNA data
Recall the scientists were interested in predicting age of
a crime scene victim (y) using the blood test measure (x).
Step 1: Scientist first constructed a scatterplot of the
data.
Step 2: Based on the scatterplot, it does appear that there is a reasonably strong negative linear relationship between age and the blood test measure.
Step 4: A residual plot constructed from these data
showed a few observations with large residuals, but
these observations were not far removed from the
rest of the data in the x direction. The observations
were not judged to be influential. Also there were no
unusual patterns in the residual plot that would
suggest a nonlinear relationship between age and the
blood test measure.
Step 5: se = 8.9
and
r2 = 0.835
Approximately 83.5% of the variability in age can be
explained by the linear relationship. A typical
difference between the predicted age
and the actual age would be about
9 years.
Step 6: Based on the residual plot, the large value of r2,
and the relatively small value of se, the scientists
proposed using the blood test measure and the least
squares regression line as a way to estimate ages of
crime victims.
Modeling Nonlinear Relationships
Choosing a Nonlinear Function to Describe a Relationship

Function      Equation
Quadratic     y = a + b1x + b2x2
Square root   y = a + b√x
Reciprocal    y = a + b(1/x)

(The original slides also showed a small graph of what each function looks like.)
Choosing a Nonlinear Function to Describe a Relationship (continued)

Function      Equation
Log           y = a + b ln(x)   (the common log, base 10, may also be used)
Exponential   y = a e^(bx)
Power         y = a x^b

While statisticians often use these nonlinear regressions, in AP Statistics we will linearize our data using transformations. Then we can use what we already know about the least squares regression line.
Models that Involve Transforming Only x
If the pattern in the scatterplot of (x, y) pairs looks like one of these curves, an appropriate transformation of the x values should result in transformed data that shows a linear relationship.

Model         Transformation
Square root   x′ = √x
Reciprocal    x′ = 1/x
Log           x′ = ln(x) or log(x)

(Read x′ as “x prime.”)
Let’s look at an example.
Is electromagnetic radiation from phone antennae associated with declining bird populations? The accompanying data are on x = electromagnetic field strength (Volts per meter) and y = sparrow density (sparrows per hectare).

Field Strength   Sparrow Density
0.11             41.71
0.20             33.60
0.29             24.74
0.40             19.50
0.50             19.42
0.61             18.74
0.70             16.29
0.80             14.69
0.90             16.29
1.01             24.23
1.10             22.04
1.20             16.97
1.30             12.83
1.41             13.17
1.50              4.64
1.80              2.11
1.90              0.00
3.01              0.00
3.10             14.69
3.41              0.00

First look at a scatterplot of the data. The data are curved and look similar to the graph of the log model.
Field Strength vs. Sparrow Density Continued
Transform x using x′ = ln(Field Strength) . . . and graph the scatterplot of y on x′. Notice that the transformed data is now linear. We can find the least squares regression line.

Ln(Field Strength)   Sparrow Density
-2.207               41.71
-1.609               33.60
-1.238               24.74
-0.916               19.50
-0.693               19.42
-0.494               18.74
-0.357               16.29
-0.223               14.69
-0.105               16.29
 0.010               24.23
 0.095               22.04
 0.182               16.97
 0.262               12.83
 0.344               13.17
 0.405                4.64
 0.588                2.11
 0.642                0.00
 1.102                0.00
 1.131               14.69
 1.227                0.00

Predictor            Coef      SE Coef    T       P
Constant             14.805    1.238      11.96   0.000
Ln (field strength)  -10.546   1.389      -7.59   0.000

S = 5.50641   R-Sq = 76.2%   R-Sq(adj) = 74.9%

Sparrow Density = 14.8 - 10.5 Ln(Field Strength)
Field Strength vs. Sparrow Density Continued
A residual plot from the least squares regression line fit to the transformed data, shown below, has no apparent patterns or unusual features. It appears that the log model is a reasonable choice for describing the relationship between sparrow density and field strength. From the output above, the value of R2 for this model is 0.762 and se = 5.5.
Field Strength vs. Sparrow Density Continued
This model, Sparrow Density = 14.8 - 10.5 Ln(Field Strength), can now be used to predict sparrow density from field strength. For example, if the field strength is 1.6 Volts per meter, what is the prediction for the sparrow density?
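A one-line calculation (added) of that prediction:

```python
# Sketch: predicted sparrow density at a field strength of 1.6 Volts per meter.
import math

field_strength = 1.6
predicted_density = 14.805 - 10.546 * math.log(field_strength)   # natural log
print(round(predicted_density, 1))   # about 9.8 sparrows per hectare
```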
Models that Involve Transforming y
Let’s consider the remaining nonlinear models, the exponential model and the power model. Using properties of logarithms, it follows that the transformations below linearize the exponential and power models.

Model         Transformation
Exponential   y′ = ln(y)                  (ln y is a linear function of x)
Power         y′ = ln(y) and x′ = ln(x)   (ln y is a linear function of ln x)
In a study of factors that affect the survival of loon
chicks in Wisconsin, a relationship between the pH of lake
water and blood mercury level in loon chicks was observed.
The researchers thought that it is possible that the pH of
the lake could be related to the type of fish that the
loons ate. A scatterplot of the data is shown below.
Ln(blood mercury level) = 1.06 - 0.396 (Lake pH)

Predictor   Coef      SE Coef   T       P
Constant     1.0550   0.5535     1.91   0.065
Lake pH     -0.3956   0.0826    -4.79   0.000

S = 0.6056   R-Sq = 39.6%   R-Sq(adj) = 37.8%
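A sketch (added) of how the fitted equation can be used on the original scale by undoing the log transformation; the pH value 6.0 is just a hypothetical input for illustration.

```python
# Sketch: back-transforming ln(mercury) = 1.06 - 0.396(pH) to predict blood
# mercury level on the original scale.
import math

def predicted_mercury(lake_ph):
    ln_mercury = 1.06 - 0.396 * lake_ph
    return math.exp(ln_mercury)       # undo the natural log

print(round(predicted_mercury(6.0), 3))   # e.g., prediction for a (hypothetical) lake with pH 6.0
```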
Choosing Among Different Possible
Nonlinear Models
Often there is more than one reasonable model that could
be used to describe a nonlinear relationship between two
variables.
How do you choose a model?
1) Consider scientific theory. Does it suggest what form the relationship should take?
2) In the absence of scientific theory, choose a model
that has small residuals (small se) and accounts for a
large proportion of the variability in y (large R 2).
Common Mistakes
Avoid these Common Mistakes
1. Correlation does not imply causation. A strong correlation implies only that the two variables tend to vary together in a predictable way, but there are many possible explanations for why this is occurring other than one variable causing change in the other.
For example, the number of fire trucks at a house that is on fire and the amount of damage from the fire have a strong, positive correlation. So, to avoid a large amount of damage if your house is on fire, don't allow several fire trucks to come to your house? Don't fall into this trap!
Avoid these Common Mistakes
2. A correlation coefficient near 0 does not necessarily imply that there is no relationship between two variables. Although the variables may be unrelated, it is also possible that there is a strong but nonlinear relationship. Be sure to look at a scatterplot! (Recall the earlier example where r = 0 but the scatterplot showed a definite curved relationship.)
Avoid these Common Mistakes
3. The least squares regression line for predicting y from x is NOT the same line as the least squares regression line for predicting x from y.
The ages (x, in months) and heights (y, in inches) of seven children are given.

x  16  24  42  60  75  102  120
y  24  30  35  40  48   56   60
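A sketch (added) that fits both lines to these data, showing that the line for predicting height from age is not the same as the line for predicting age from height:

```python
# Sketch: the y-on-x line differs from the x-on-y line for the age/height data.
import numpy as np

age = np.array([16, 24, 42, 60, 75, 102, 120], dtype=float)     # months
height = np.array([24, 30, 35, 40, 48, 56, 60], dtype=float)    # inches

# slope and intercept for predicting height from age
b_yx = np.sum((age - age.mean()) * (height - height.mean())) / np.sum((age - age.mean()) ** 2)
a_yx = height.mean() - b_yx * age.mean()

# slope and intercept for predicting age from height (a different line)
b_xy = np.sum((age - age.mean()) * (height - height.mean())) / np.sum((height - height.mean()) ** 2)
a_xy = age.mean() - b_xy * height.mean()

print(a_yx, b_yx)   # height-hat = a + b(age)
print(a_xy, b_xy)   # age-hat = a + b(height); not the first line solved for age
```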
Avoid these Common Mistakes
4. Beware of extrapolation. Using the least
squares regression line to make predictions
outside the range of x values in the data set
often leads to poor predictions.
Predict the height of a child that is 15 years (180 months) old.
It is unreasonable that a 15-year-old would be 81.6 inches (about 6.8 feet) tall.
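A sketch (added) of the extrapolated prediction warned about here:

```python
# Sketch: extrapolating the age/height line to 180 months.
import numpy as np

age = np.array([16, 24, 42, 60, 75, 102, 120], dtype=float)
height = np.array([24, 30, 35, 40, 48, 56, 60], dtype=float)

b = np.sum((age - age.mean()) * (height - height.mean())) / np.sum((age - age.mean()) ** 2)
a = height.mean() - b * age.mean()

print(a + b * 180)   # about 82 inches (81.6 with the slide's rounded coefficients):
                     # an unreasonable prediction, because 180 months is far
                     # outside the range of ages in the data
```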
Avoid these Common Mistakes
5. Be careful in interpreting the value of the intercept of the least squares regression line. In many instances, interpreting the intercept as the value of y that would be predicted when x = 0 is equivalent to extrapolating way beyond the range of x values in the data set.
Consider again the ages (x, in months) and heights (y, in inches) of the seven children above: the smallest age in the data set is 16 months, so interpreting the intercept as the predicted height at age 0 would be an extrapolation.
Avoid these Common Mistakes
6. Remember that the least squares regression
line may be the “best” line, but that doesn’t
necessarily mean that the line will produce
good predictions.
This has a relatively
large se – thus we can’t
accurately predict IQ
from working memory
capacity.
Avoid these Common Mistakes
7. It is not enough to look at just r2 or just se when evaluating the regression line. Remember to consider both values. In general, you would like to have both a small value for se and a large value for r2. A small se indicates that deviations from the line tend to be small, and a large r2 indicates that the linear relationship explains a large proportion of the variability in the y values.
Avoid these Common Mistakes
8. The value of the correlation coefficient, as
well as the values for the intercept and slope
of the least squares regression line, can be
sensitive to influential observations in the
data set, particularly if the sample size is
small.
Be sure to always start with a plot to check
for potential influential observations.