Daniel S. Yates
The Practice of Statistics
Third Edition
Chapter 4:
More about Relationships
between Two Variables
Copyright © 2008 by W. H. Freeman & Company
Case Study: A Matter of Life and Death
Age    Monthly Premium
40     $29
45     $46
50     $68
55     $106
60     $157
65     $257
1. How much would a 58-yr-old expect to pay for such a policy?
2. A 68-yr-old?
4.1 Transforming to achieve Linearity:
Today’s Objectives









Explain what is meant by transforming data.
Discuss the advantages of transforming nonlinear data.
Tell where y=log(x) fits into the hierarchy of power
transformations.
Explain the ladder of power transformations.
Explain how linear growth differs from exponential growth.
Identify real-life situations in which a transformation can be
used to linearize data from an exponential growth model.
Use a logarithmic transformation to linearize a data set that
can be modeled by an exponential model.
Identify situations in which a transformation is required to
linearize a power model.
Use a transformation to linearize a data set that can be
modeled by a power model.
Activity 4:
Scatterplot and LSRL of brain weight for 96 species
of mammals.
What do the outliers tell us?
Dolphins and humans are smart, hippos are dumb and elephants
are just big.
Scatterplot of brain weight against body
weight for mammals with outliers removed.
Scatterplot and LSRL of the logarithm of brain weight against
the logarithm of body weight for 96 species of mammals.
Transforming data


Applying a function such as the logarithm or square
root to a quantitative variable is called
transforming or reexpressing the data.
Understanding how simple functions work will help
us choose and use transformations to straighten
nonlinear patterns.
Fishing tournament
Transforming data with powers.

Since weighing live, flopping fish would be difficult,
it is decided to relate the length of the fish to their
weight. Since length is one-dimensional and weight (which
depends on volume) is three-dimensional, it is concluded that
the weight of a fish should be proportional to the cube of its
length. Thus a model of the form
\( \text{weight} = a \times \text{length}^3 \)
should work.
The nearby marine research
laboratory provides the following data
Average length and weight at different ages for Atlantic Ocean rockfish

Age (yr)  Length (cm)  Weight (g)    Age (yr)  Length (cm)  Weight (g)
1         5.2          2             11        28.2         318
2         8.5          8             12        29.6         371
3         11.5         21            13        30.8         455
4         14.3         38            14        32.0         504
5         16.8         69            15        33.0         518
6         19.2         117           16        34.0         537
7         21.3         148           17        34.9         651
8         23.3         190           18        36.4         719
9         25.0         264           19        37.2         726
10        26.7         293           20        37.7         810
Scatterplot of Atlantic Ocean rockfish
weight versus length
The scatterplot of weight versus \( \text{length}^3 \) looks linear.
We perform a least-squares regression on the transformed points
\( (\text{length}^3, \text{weight}) \). The resulting equation is
\( \text{weight} = 4.066 + 0.0147 \times \text{length}^3 \)
with \( r^2 = 0.995 \). So 99.5% of the variation in the weight of Atlantic
Ocean rockfish is accounted for by the linear regression of weight on
\( \text{length}^3 \).
Plot of residuals versus \( \text{length}^3 \)
For smaller fish, the residuals fall below the line. This indicates
that our predictions will be slightly high. For larger fish, our
results should be quite accurate.
Atlantic Ocean rockfish data with the model
\( \text{weight} = 4.066 + 0.0147 \times \text{length}^3 \)
Prediction


Using our model, we can predict the weight of a fish
that is 36 centimeters long.
\( \text{weight} = 4.066 + 0.0147\,(36)^3 \approx 689.9 \) grams
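The cube transformation and the prediction above can be checked with a short script. This is a minimal sketch in plain Python, not the calculator procedure the course uses: the helper `least_squares` is a hypothetical name, and the data are transcribed from the rockfish table. The fitted coefficients should land close to the values quoted on the slide, although small rounding differences are possible.

```python
# Rockfish data transcribed from the table: length in cm, weight in g.
lengths = [5.2, 8.5, 11.5, 14.3, 16.8, 19.2, 21.3, 23.3, 25.0, 26.7,
           28.2, 29.6, 30.8, 32.0, 33.0, 34.0, 34.9, 36.4, 37.2, 37.7]
weights = [2, 8, 21, 38, 69, 117, 148, 190, 264, 293,
           318, 371, 455, 504, 518, 537, 651, 719, 726, 810]


def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)


cubes = [x ** 3 for x in lengths]  # transform: length -> length^3
intercept, slope, r2 = least_squares(cubes, weights)

# Prediction for a 36 cm fish, using the equation quoted on the slide.
slide_prediction = 4.066 + 0.0147 * 36 ** 3

print(round(intercept, 3), round(slope, 4), round(r2, 3))
print(round(slide_prediction, 1))
```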
Assignment




P. 199
4.1
P.219
4.13, 4.16
TRANSFORMING WITH
POWERS
The hierarchy of power functions. The logarithm function
corresponds to p=0.
Facts about the family of power
functions


• The graph of a linear function (power p = 1) is a straight line.
• Powers greater than 1 (like p = 2 and p = 4) give graphs that
bend upward. The sharpness of the bend increases as p
increases.
• Powers less than 1 but greater than 0 (like p = 0.5) give graphs that
bend downward.
• Powers less than 0 (like p = -0.5 and p = -1) give graphs that decrease
as x increases. Greater negative values of p result in graphs that
decrease more quickly.
• Look at the p = 0 graph. This is not the graph of \( y = x^0 \),
because the zero power is just the constant 1. It is the graph of
the logarithm, log x.
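One way to get a feel for the family is to apply several powers to the same values. The sketch below is illustrative only: `power_transform` is a hypothetical helper, the input values are made up (not data from the slides), and the base-10 logarithm is an assumed choice for the p = 0 slot.

```python
import math


def power_transform(x, p):
    """Re-express a positive value x with power p; p == 0 stands for the logarithm."""
    if x <= 0:
        raise ValueError("x must be positive")
    return math.log10(x) if p == 0 else x ** p


values = [100.0, 400.0, 2500.0, 10000.0]  # hypothetical values for illustration
for p in (1, 0.5, 0, -0.5):
    print(p, [round(power_transform(x, p), 4) for x in values])
```

Moving down the ladder (from p = 1 toward negative p) pulls in the large values more and more strongly, which is why these transformations can straighten a pattern that bends.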
A country’s GDP and life expectancy
The hierarchy of power transformations

Life expectancy and gross domestic product for 115
nations
Transformation of Life expectancy and gross domestic
product for 115 nations using √GDP
Transformation of Life expectancy and gross
domestic product for 115 nations using log(GDP)
Transformation of Life expectancy and gross
domestic product for 115 nations using 1/√GDP
Warning:
Don’t just push buttons on your calculator to try to
straighten out the data.
It is more useful to begin with a theory or
mathematical model that we expect to describe a
relationship. The transformation needed to make
the relationship linear is then a consequence of the
model.
One of the most common models is exponential
growth.

Growth of a bacteria population over a 24-hr
period.
The growth of money
Understanding Exponential Growth

A dollar invested at an annual rate of 6% turns into
$1.06 in a year. The original dollar remains and
has earned $0.06 in interest. If the $1.06 remains
invested for a second year, the new amount is
therefore $1.06 × 1.06, or \( (1.06)^2 \) dollars.
After x years, the dollar has become \( 1.06^x \) dollars.
Wealthy Indians

If the Native Americans who sold Manhattan Island
for $24 in 1626 had deposited the $24 in a
savings account earning 6% annual interest, they
would now have almost $80 billion.
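The growth of money can be sketched numerically. The function name `compound` is hypothetical, and because the slide does not say which year "now" refers to, the year 2002 used below is an assumption made only for illustration.

```python
def compound(principal, rate, years):
    """Value of `principal` after `years` of annual compounding at `rate`."""
    return principal * (1 + rate) ** years


after_two_years = compound(1.00, 0.06, 2)  # (1.06)^2 dollars

# Assumption: "now" is taken to be 2002, i.e. 376 years after 1626.
manhattan = compound(24, 0.06, 2002 - 1626)

print(round(after_two_years, 4))
print(f"{manhattan:,.0f}")
```

Under that assumption the result is in the tens of billions of dollars, in line with the figure on the slide.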
Moore’s law and computer chips
Exponential growth
Gordon Moore, one of the founders of Intel,
predicted in 1965 that the number of transistors on
an integrated circuit chip would double
every 18 months. This is “Moore’s Law.”
Data on the dates and number of transistors for Intel microprocessors

Processor  Date  Transistors    Processor    Date  Transistors
4004       1971  2,250          486 DX       1989  1,180,000
8008       1972  2,500          Pentium      1993  3,100,000
8080       1974  5,000          Pentium II   1997  7,500,000
8086       1978  29,000         Pentium III  1999  24,000,000
286        1982  120,000        Pentium 4    2000  42,000,000
386        1985  275,000
Scatterplot showing growth in the number of transistors
on a computer chip from 1971-2000
Is this exponential growth?
The Logarithm Transformation
Assignment
Worksheet on Logarithms.
Scatterplot of ln(number of transistors) versus
years since 1970
Plot of transformed transistor data with
least-squares line.
\( \ln(\text{transistors}) = 7.41 + 0.332 \times (\text{years since 1970}) \)
\( r^2 = 99.5\% \)
Residual plot for the transformed data
Predictions in the exponential
growth model.






To predict the number of transistors on Intel’s Itanium 2
chip, which was released in 2003, we substitute 33 for
“years since 1970” in the regression equation:
\( \ln(\text{transistors}) = 7.41 + 0.332 \times 33 = 18.366 \)
Since ln is the logarithm with base e, this tells us that
\( \text{transistors} = e^{18.366} \approx 94.7 \text{ million} \)
So our model predicts about 95 million transistors
on the Itanium 2 chip.
In fact, the chip had about 410 million transistors!
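The logarithm transformation and the prediction can be reproduced with a few lines of code. This is a sketch rather than the calculator workflow: `least_squares` is a hypothetical helper, and the data are transcribed from the Intel table with dates converted to years since 1970. The fitted values should come out near the equation quoted on the slide.

```python
import math

years = [1, 2, 4, 8, 12, 15, 19, 23, 27, 29, 30]  # years since 1970
transistors = [2250, 2500, 5000, 29000, 120000, 275000,
               1180000, 3100000, 7500000, 24000000, 42000000]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


logs = [math.log(t) for t in transistors]  # natural log of the counts
intercept, slope = least_squares(years, logs)

# Prediction for 2003 (33 years since 1970) from the equation on the slide.
ln_pred = 7.41 + 0.332 * 33
pred = math.exp(ln_pred)  # inverse transformation back to a count

print(round(intercept, 2), round(slope, 3))
print(round(ln_pred, 3), f"{pred:,.0f}")
```

The large gap between the prediction and the actual count is a reminder that extrapolating an exponential model beyond the data is risky.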
Growth of a bacteria population
over a 24 hour period.
Transforming bacteria counts
Exact exponential growth
Logarithms of the bacteria count
Modeling exponential growth with TI-83/84
Enter the data into L1 and L2. Use ZoomStat to draw the
scatterplot.
Define L3 as the natural logarithm of L2. Then make a scatterplot
of L3 = ln(L2) versus L1.
Next, we perform the least-squares regression on the
transformed data.
Here is the scatterplot with the LSRL
Despite the high \( r^2 \) value, you should always inspect the residual plot.
Be sure to plot the residuals (stored in the RESID list) versus L1.
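For readers without a TI-83/84, the same steps (take logs, fit the line, inspect residuals) might look like the sketch below. The function `fit_exponential` is a hypothetical helper, and the bacteria counts are made-up values that double exactly every hour, so the residuals here should be essentially zero; real data would show scatter.

```python
import math


def fit_exponential(xs, ys):
    """Fit ln(y) = a + b*x by least squares; return (a, b, residuals on the ln scale)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mx, ml = sum(xs) / n, sum(logs) / n
    b = (sum((x - mx) * (ly - ml) for x, ly in zip(xs, logs))
         / sum((x - mx) ** 2 for x in xs))
    a = ml - b * mx
    residuals = [ly - (a + b * x) for x, ly in zip(xs, logs)]
    return a, b, residuals


hours = [0, 1, 2, 3, 4, 5]            # hypothetical times
counts = [3 * 2 ** h for h in hours]  # hypothetical counts, exact doubling each hour
a, b, residuals = fit_exponential(hours, counts)

print(round(math.exp(a), 3), round(math.exp(b), 3))  # starting count, growth factor
print([round(r, 6) for r in residuals])
```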
Assignment

P. 203, 4.4
Power Law Models
1. The power model is
\( y = a x^p \)
2. Take the logarithm of both sides of this equation. You see
that
\( \log y = \log a + p \log x \)
That is, taking the logarithm of both variables results in a linear
relationship between log x and log y.
3. Look carefully: the power p in the power law becomes the
slope of the straight line that links log y to log x.
Predicting brain weight from body
weight: Using a power model
\( \log \hat{y} = 1.01 + 0.72 \log x \)
\( \hat{y} = 10^{1.01 + 0.72 \log x} = 10^{1.01} \times 10^{0.72 \log x} = 10.2 \times \left(10^{\log x}\right)^{0.72} \)
Because \( 10^{\log x} = x \), the estimated
power model connecting predicted brain
weight \( \hat{y} \) with body weight x for mammals is
\( \hat{y} = 10.2\, x^{0.72} \)
For x = 127:
\( \hat{y} = 10.2\,(127)^{0.72} \approx 10.2\,(32.7) \approx 333.7 \) grams
What’s a planet anyway?
Planet   Distance from sun (astronomical units)  Period of revolution (Earth years)
Mercury  0.387                                   0.241
Venus    0.723                                   0.615
Earth    1.000                                   1.00
Mars     1.524                                   1.881
Jupiter  5.203                                   11.862
Saturn   9.539                                   29.456
Uranus   19.191                                  84.070
Neptune  30.061                                  164.810
Pluto    39.529                                  248.530
Scatterplot of planetary distance from
the sun and period of revolution.
The scatterplot of ln(period) versus
distance is not linear.
The scatterplot of ln(period) vs.
ln(distance) appears very linear.
Plot of residuals versus ln(distance)
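The last step is to perform an inverse
transformation on the linear regression equation:
\( \ln(\text{period}) = 0.000254 + 1.50 \ln(\text{distance}) \)
\( e^{\ln(\text{period})} = e^{0.000254 + 1.50 \ln(\text{distance})} \)
\( \text{period} = e^{0.000254} \times e^{1.50 \ln(\text{distance})} \)
\( \text{period} = 1.000 \times \left(e^{\ln(\text{distance})}\right)^{1.50} \)
\( \text{period} = 1.000 \times \text{distance}^{1.50} \)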
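The planetary fit can be reproduced from the table. This is a minimal sketch with a hypothetical `least_squares` helper; the data are transcribed from the planet table, and the fitted slope and intercept should come out very close to the 1.50 and 0.000254 shown above.

```python
import math

distance = [0.387, 0.723, 1.000, 1.524, 5.203, 9.539, 19.191, 30.061, 39.529]
period = [0.241, 0.615, 1.00, 1.881, 11.862, 29.456, 84.070, 164.810, 248.530]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


ln_d = [math.log(d) for d in distance]
ln_p = [math.log(p) for p in period]
intercept, slope = least_squares(ln_d, ln_p)

# Inverse transformation: period = a * distance**slope, with a = e**intercept.
a = math.exp(intercept)
pluto_pred = a * 39.529 ** slope

print(round(intercept, 4), round(slope, 3), round(a, 3), round(pluto_pred, 1))
```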
Planetary data with power model
Power law modeling
• Enter the x data (explanatory) into L1 and the y data
into L2.
• Produce a scatterplot of y versus x. Confirm a
nonlinear trend that could be modeled by a power
function of the form \( y = a x^p \).
• Define L3 to be the logarithm of L1 and define L4 to be
the logarithm of L2.
• Plot L4 versus L3. Verify that the pattern is
approximately linear.
• Calculate the regression equation for the transformed
data and store it in Y1. Check the \( r^2 \) value.
Power law modeling cont…
• Construct a residual plot, in the form of either RESID versus x or
RESID versus predicted values. Ideally, the points in a residual plot
should be randomly scattered above and below the y = 0 reference
line.
• Perform the inverse transformation to find the power function \( y = a x^p \)
that models the original data. Define Y2 to be
\( (10^a)(x^b) \) or \( (e^a)(x^b) \), depending on the type of logarithm you used for
the transformation. Here a and b are the intercept and slope of the
regression line for the transformed data; the calculator has stored
their values from the most recent regression performed. Deselect Y1
and plot Y2 and the scatterplot for the original data together.
• To make a prediction for the value x = k, evaluate Y2(k) on the home
screen.
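The calculator steps above correspond roughly to the following sketch. The function `fit_power_law` is a hypothetical helper, base-10 logarithms are an assumed choice, and the data are made up to lie exactly on \( y = 3x^2 \) so the recovered constants are easy to verify.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**p by least squares on (log10 x, log10 y); return (a, p)."""
    lx = [math.log10(x) for x in xs]  # plays the role of L3
    ly = [math.log10(y) for y in ys]  # plays the role of L4
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    p = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = 10 ** (my - p * mx)           # inverse transformation of the intercept
    return a, p


xs = [1, 2, 4, 8, 16]          # hypothetical explanatory values
ys = [3 * x ** 2 for x in xs]  # hypothetical responses on y = 3 x^2
a, p = fit_power_law(xs, ys)

k = 10
prediction = a * k ** p        # analogous to evaluating Y2(k)
print(round(a, 4), round(p, 4), round(prediction, 2))
```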
Assignment 4.1


P. 212
4.6, 4.10, 4.11
4.2 Relationships between Categorical
Variables (Objectives)





Explain what is meant by a two-way table.
Explain what is meant by marginal distributions in a
2-way table.
Describe how changing counts to percents is helpful
in describing relationships between categorical
variables.
Explain what is meant by a conditional distribution.
Define Simpson’s paradox, and give an example of
it.
Organizing categorical variables



This is a two-way table because it describes two categorical
variables.
Age group is the row variable because each row in the table
describes students in one age group.
Sex is the column variable because each column describes
one sex.
                   Sex
Age group (years)  Female  Male   Total
15-17              89      61     150
18-24              5,668   4,697  10,365
25-34              1,904   1,584  3,494
35+                1,660   970    2,630
Total              9,321   7,317  16,639
Marginal Distributions

The distributions that appear at the bottom(sex
alone) and right margins (age alone) of a 2-way
table.
Calculating a marginal distribution

The percent of college students who are 18 to 24 years old is
\( \dfrac{\text{total age 18-24}}{\text{table total}} = \dfrac{10{,}365}{16{,}639} = 0.623 = 62.3\% \)

Age group  15-17  18-24  25-34  35+
Percent    0.9    62.3   21.0   15.8
The total is 100% because everyone is in one of the
four categories.
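The marginal distribution of age can be computed directly from the margin of the table. In this sketch the variable names are hypothetical, and the row totals and table total are used exactly as printed in the two-way table.

```python
# Row totals and table total as printed in the two-way table.
row_totals = {"15-17": 150, "18-24": 10365, "25-34": 3494, "35+": 2630}
table_total = 16639

# Marginal distribution of age: each row total as a percent of the table total.
age_percent = {group: round(100 * count / table_total, 1)
               for group, count in row_totals.items()}

print(age_percent)
print(round(sum(age_percent.values()), 1))
```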
A bar graph of the distribution of age for college students. This
is one of the marginal distributions for the previous table.
Describing Relationships
To describe relationships among
categorical variables, calculate
appropriate percents from the counts
given.
When we compare the percents of
women in two age groups, we are
comparing two conditional distributions.

Conditional distribution of sex given age (percent)
Age group  Female  Male
18-24      54.7    45.3
35+        63.1    36.9
Bar graph comparing the percent of female college
students in four age groups.
There are more women than men in all age groups, but
the percent of women is higher among older students.
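A conditional distribution of sex given age divides each cell by the sum of its own row. The sketch below uses the counts from the two-way table; the function name `conditional_distribution` is hypothetical.

```python
# Counts from the two-way table of college students.
table = {
    "15-17": {"Female": 89, "Male": 61},
    "18-24": {"Female": 5668, "Male": 4697},
    "25-34": {"Female": 1904, "Male": 1584},
    "35+": {"Female": 1660, "Male": 970},
}


def conditional_distribution(row):
    """Percent of each sex within one age group (one row of the table)."""
    total = sum(row.values())
    return {sex: round(100 * count / total, 1) for sex, count in row.items()}


sex_given_age = {age: conditional_distribution(row) for age, row in table.items()}
for age, dist in sex_given_age.items():
    print(age, dist)
```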
Distribution of age given sex
Age group         15-17  18-24  25-34  35+
Percent of women  1.0    60.8   20.4   17.8
Percent of men    0.8    64.2   21.7   13.3
CrunchIt! output of the two-way table of college students by age
and sex along with each entry as a percent of its row total.
The percents in each row give the conditional distribution of sex
for one age group, and the percents in the “Total” row give the
marginal distribution of sex for all college students.
Caution!!!!!


No single graph (such as a scatterplot)
portrays the form of the relationship
between categorical variables.
No single numerical measure(such as
correlation) summarizes the strength of
the association.
Assignment


P. 245
4.53, 4.55
Smiling Faces



• Women smile more than men.
• Women smile more when they think they are being
watched. Men don’t.
• Within professions, there is no difference between
how much women and men smile.
Do medical helicopters save lives?

                 Helicopter  Road
Victim died      64          260
Victim survived  136         840
Total            200         1100

Serious accidents
          Helicopter  Road
Died      48          60
Survived  52          40
Total     100         100

Less serious accidents
          Helicopter  Road
Died      16          200
Survived  84          800
Total     100         1000
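The reversal in these tables (Simpson's paradox) is easy to verify by computing the percent of victims who died in each case. This is a small sketch with a hypothetical helper `death_rate`; the counts are taken from the three tables.

```python
def death_rate(died, total):
    """Percent of victims who died."""
    return 100 * died / total


overall = {"Helicopter": death_rate(64, 200), "Road": death_rate(260, 1100)}
serious = {"Helicopter": death_rate(48, 100), "Road": death_rate(60, 100)}
less_serious = {"Helicopter": death_rate(16, 100), "Road": death_rate(200, 1000)}

for label, rates in [("Overall", overall), ("Serious", serious),
                     ("Less serious", less_serious)]:
    print(label, {mode: round(rate, 1) for mode, rate in rates.items()})
```

Overall, helicopter victims die at a higher rate, yet within each severity group the helicopter rate is lower. The helicopter is sent mostly to serious accidents, so accident severity is the lurking variable.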
Assignment





P. 212
4.6, 4.10, 4.11
250-252
4.59, 4.60,
P.255 4.70
4.3 Establishing Causation
4.3 Objectives







Identify the three ways in which the association between two
variables can be explained.
Explain what process provides the best evidence for causation.
Define what is meant by a common response.
Discuss why establishing a cause-and-effect relationship through
experimentation is not always possible.
Define what it means to say that two variables are confounded.
Explain what it means to say that a lack of evidence for a cause-and-effect
relationship does not necessarily mean that there is no
cause-and-effect relationship.
List five criteria for establishing causation when you cannot conduct
a controlled experiment.
Six interesting relationships
Examining association
1) X = mother’s BMI; Y = daughter’s BMI
2) X = amount of saccharin in a rat’s diet; Y = count of tumors in the rat’s body
3) X = a high school senior’s SAT score; Y = the student’s first-year college GPA
4) X = monthly flow of money into stock market funds; Y = monthly rate of return for the stock market
5) X = whether a person regularly attends religious services; Y = how long the person lives
6) X = the number of years of education a worker has; Y = the worker’s income
Variables x and y show a strong association (dashed line). The
association may be the result of any of several causal
relationships (solid arrow).
(a) Changes in x cause changes in y.
(b) Changes in both x and y are caused by a lurking variable z.
(c) The effect (if any) of x on y is confounded with the effect of a
lurking variable z.
BMI in mothers and daughters;
saccharin in rats
Mothers’ and daughters’ BMI?
What about heredity? Diet and exercise habits?
Even when direct causation is present, it is rarely a
complete explanation of an association between
two variables.
Rats?
Do results with rats translate to people?
Even well-established causal relations may not
generalize to other settings.

Explaining Association: Common
Response


“Beware the lurking variable”
When both x and y change in response to changes
in z, this is called a common response.
1. X = a high school senior’s SAT score
Y = the student’s first-year college GPA
Both can be explained by a common response to
ability and knowledge.
2. X = monthly flow of money into stock market funds
Y = monthly rate of return for the stock market
Both the market and individual investors respond to positive
feelings about the market.
Explaining Association: Confounding
Confounding: Religion and life
span; education and income
1) X = whether a person regularly attends religious services
Y = how long the person lives
People who go to church are also less likely to smoke, more
likely to exercise, and less likely to be overweight.
2) X = the number of years of education a worker has
Y = the worker’s income
• People with high ability are more likely to come from
prosperous homes and therefore have more education.
• Being able and rich leads to both higher education and higher
income.
Establishing Causation



The best way to establish causation is through a
controlled experiment but this is not always possible.
Do power lines increase the risk of leukemia?
A careful study shows no evidence of more than a
chance connection between magnetic fields and
childhood leukemia.
Does smoking cause lung cancer?

Criteria for establishing causation without an experiment:
• The association is strong.
• The association is consistent.
• Larger values of the explanatory variable are associated
with stronger responses.
• The alleged cause precedes the effect in time.
• The alleged cause is plausible.
Assignment 4.3




P. 237
4.37
P.238-240
4.40, 4.43,4.49
1. Case Closed!
Determining Insurance Premiums
a) Would a power model provide an appropriate
description of the relationship between age and
monthly premium? Transform the data and sketch a
graph that will help answer this question.
Answer: Let y = premium and x = age. Scatterplots of the
original data and of the transformed data (after taking the
logarithms of both variables) show that the original
data have a strong nonlinear relationship and the
transformed data show a clear linear trend, so the
power model is appropriate.
Age    Monthly Premium
40     $29
45     $46
50     $68
55     $106
60     $157
65     $257
b) Would an exponential model provide an appropriate
description of the relationship between age and monthly
premium? Transform the data and sketch a graph that will help
answer this question.
Answer: The linear trend in the scatterplot of the logarithm of
premium versus age suggests that the exponential model is
appropriate.
c) Based on your answers to a and b, use LSR to fit the most
appropriate type of model for these data. Perform the inverse
transformation to write monthly premium as a function of age.
Answer: The LSRL for the transformed data is
\( \log \hat{y} = -0.0275 + 0.0373x \).
Using the inverse transformation, the predicted premium is
\( \hat{y} = 10^{-0.0275} \times 10^{0.0373x} = 0.9386 \times 10^{0.0373x} \).
d) Use your model to predict the monthly premium for a 58-year-old
man who wants a $1 million, 10-year term life insurance policy,
and for a 68-year-old.
Answer: $136.74 and $322.76
e) How comfortable do you feel about the predictions in d?
Justify your answer using a residual plot and \( r^2 \).
Answer: You should feel very comfortable with these
predictions. The residual plot shows no clear patterns and
\( r^2 = 99.9\% \), so the exponential model provides an excellent fit.
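The exponential fit and the two predictions can be reproduced from the premium table. This is a sketch under stated assumptions: `least_squares` and `predict_premium` are hypothetical names, base-10 logarithms are used to match the answer in part c, and the predictions use the rounded coefficients quoted there.

```python
import math

ages = [40, 45, 50, 55, 60, 65]
premiums = [29, 46, 68, 106, 157, 257]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


log_premiums = [math.log10(p) for p in premiums]
intercept, slope = least_squares(ages, log_premiums)


def predict_premium(age):
    """Exponential model with the rounded coefficients quoted in part c."""
    return 0.9386 * 10 ** (0.0373 * age)


print(round(intercept, 4), round(slope, 4))
print(round(predict_premium(58), 2), round(predict_premium(68), 2))
```

Note that age 68 lies outside the range of the data (40 to 65), so that prediction is an extrapolation.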
2. Death statistics
Deaths in the US from selected causes in 2003
Cause of death  15-24   25-44    45-64
Accidents       14,966  27,844   23,699
AIDS            171     6,879    5,917
Cancer          1,628   19,041   144,936
Heart disease   1,083   16,283   101,713
Homicide        5,148   7,367    2,756
Suicide         3,921   11,251   10,057
Total deaths    33,022  128,924  437,058
a) Why don’t the entries in each column add to the “total deaths”
count?
Answer: The entries in each column are only from these six
selected causes of deaths.
b) Should you use counts or percents to compare the age groups?
Answer: Percents should be used because there are different
numbers of deaths in each age group.
c) Construct a conditional distribution of cause of death for each
age group. Then make a bar graph to display the results.
Answer:
Cause of death  15-24   25-44   45-64
Accidents       45.32%  21.60%  5.42%
AIDS            0.52%   5.34%   1.35%
Cancer          4.93%   14.77%  33.16%
Heart disease   3.28%   12.63%  23.27%
Homicide        15.59%  5.71%   0.63%
Suicide         11.87%  8.73%   2.30%
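The conditional distributions in part c come from dividing each count by the total deaths in its age group. The sketch below recomputes them from the table of counts; the variable names are hypothetical.

```python
age_groups = ("15-24", "25-44", "45-64")
deaths = {
    "Accidents": (14966, 27844, 23699),
    "AIDS": (171, 6879, 5917),
    "Cancer": (1628, 19041, 144936),
    "Heart disease": (1083, 16283, 101713),
    "Homicide": (5148, 7367, 2756),
    "Suicide": (3921, 11251, 10057),
}
total_deaths = (33022, 128924, 437058)  # all causes, not just the six listed

# Percent of all deaths in each age group due to each selected cause.
percent = {cause: tuple(round(100 * c / t, 2) for c, t in zip(counts, total_deaths))
           for cause, counts in deaths.items()}

# Leading cause (among the six selected) in each age group.
leading = [max(deaths, key=lambda cause: deaths[cause][i])
           for i in range(len(age_groups))]

for cause, row in percent.items():
    print(cause, row)
print(dict(zip(age_groups, leading)))
```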
d) Explain how the leading cause of death changes
as people age.
3. Stay Fitter, Live Longer
A sign at a fitness center says “mortality is halved
for men over 65 who walk at least two miles a day.”
a) Mortality is eventually 100% for everyone. What
do you think “mortality is halved” means?
Answer: The chance of dying for men over 65 who
walk at least 2 miles a day is half that of men over 65
who do not exercise.
b) Assuming that the claim is true, explain why this fact
does not prove that walking causes lower mortality.
Answer: Individuals who exercise regularly may have
other habits that could contribute to longer lives.

What you should have learned

A. Modeling Nonlinear Data
1. Use powers to transform nonlinear data to achieve linearity.
2. Recognize that, when a variable is multiplied by a fixed number
in each equal time period, exponential growth results and that,
when one variable is proportional to a power of a second
variable, a power law model results.
3. In the case of both exponential growth and power functions,
perform a logarithmic transformation and obtain points that lie in
a linear pattern. Then use LSR on the transformed data.
Carry out an inverse transformation to produce a curve that
models the original data.
4. Know that deviations from the overall pattern are most easily
examined by fitting a line to the transformed points and plotting
the residuals from this line against the explanatory variable.
B. Relations in Categorical Data
1. From a two-way table of counts, find the marginal
distributions of both variables by obtaining the row
sums and column sums.
2. Describe the relationship between two categorical
variables by computing and comparing percents. Often
this involves comparing the conditional distributions of
one variable for the different categories of the other
variable. Construct bar graphs when appropriate.
3. Recognize Simpson’s paradox and be able to explain
it.
C. Establishing Causation
1. Recognize possible lurking variables that may help
explain the observed association between two
variables x and y.
2. Determine whether the relationship between two
variables is most likely due to causation, common
response, or confounding.
3. Understand that even a strong association does
not mean that there is a cause-and-effect
relationship between x and y.
Chapter Review
Assignment.


P.257-261
4.72,4.75, 4.77, 4.80,4.83
Monotonic Functions



A monotonic function f(t) moves in one direction as
its argument t increases.
A monotonic increasing function preserves the
order of data. That is, if a > b, then f(a) > f(b).
A monotonic decreasing function reverses the
order of data. That is, if a > b, then f(a) < f(b).
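A quick numerical illustration of the two cases, using made-up positive values: the logarithm is monotonic increasing, and the reciprocal is monotonic decreasing on positive numbers.

```python
import math

data = [1.0, 4.0, 9.0, 25.0]          # hypothetical values in increasing order

logged = [math.log(x) for x in data]  # increasing function: order preserved
reciprocal = [1 / x for x in data]    # decreasing function: order reversed

print(logged == sorted(logged))
print(reciprocal == sorted(reciprocal, reverse=True))
```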