Chapter 10
Re-expressing Data: Get it Straight!
Introduction
All quantitative data come to us measured in some way, with units specified.
But maybe those units aren’t the best choice.
Why bother changing units? Because some expressions of the data may be
easier to think about, and some may be much easier to analyze with
statistical methods.
Straight to the Point
We cannot use a linear model unless the relationship between the two
variables is linear. Often re-expression can save the day, straightening bent
relationships so that we can fit and use a simple linear model.
Some ways to re-express data are with logarithms, reciprocals, square roots,
squaring, etc. Re-expressions can be seen in everyday life—everybody does
it.
The relationship between fuel efficiency (in miles per gallon) and weight (in
pounds) for late model cars looks fairly linear at first:
Discuss what you see here in this
scatterplot:
A look at the residuals plot shows a problem:
The original scatterplot shows a negative direction, roughly linear shape, and
strong relationship. There do not seem to be any outliers or unusual
features.
However, there is a definite bend there in the residuals. Let’s just give up…
No, wait! All is not lost!! We can re-express fuel efficiency as gallons per
hundred miles (a reciprocal) and eliminate the bend in the original
scatterplot:
The bend in the relationship between Fuel Efficiency and Weight is the kind
of failure to satisfy the conditions for an analysis that we can repair by
re-expressing the data. This scatterplot is more nearly linear, but the
re-expression changes the direction of the relationship.
The direction of the association is positive now because we are measuring
gas consumption and heavier cars consume more gas per mile. This new
model makes better predictions than the previous one!
The residuals plot for the new model looks more reasonable:
Gallons per hundred miles – what an absurd way to measure fuel
efficiency!! Who would ever do it that way?!
Answer: everyone except US drivers…  Most of the world says “I’ve got to
go 100 km, how much gas do I need?” But Americans say, “I’ve got 10 gallons
in the tank, how far can I drive?” (we’ll revisit this example in a little bit)
Re-expressions change how we think about the data, but they don’t change
what the data mean.
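As a quick illustration of the reciprocal re-expression used for fuel
efficiency, here is a minimal Python sketch. The function name
gallons_per_100_miles and the sample mpg values are made up for this
illustration; they are not taken from the Consumer Reports data.

```python
def gallons_per_100_miles(mpg):
    """Re-express fuel efficiency (miles per gallon) as fuel consumption."""
    return 100.0 / mpg

# Hypothetical mpg values, chosen only to show the arithmetic.
for mpg in (40, 25, 20, 10):
    print(mpg, "mpg ->", gallons_per_100_miles(mpg), "gallons per 100 miles")
```

The same cars are described either way, but the order flips: the car with
the highest mpg has the lowest gallons per 100 miles. That is why the
direction of the association with Weight changes from negative to positive.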
Goals of Re-expression
Goal 1: Make the distribution of a variable (as seen in its histogram, for
example) more symmetric.
Goal 2: Make the spread of several groups (as seen in side-by-side boxplots)
more alike, even if their centers differ.
Goal 3: Make the form of a scatterplot more nearly linear (because linear
scatterplots are easier to model!)
Goal 4: Make the scatter in a scatterplot spread out evenly rather than
thickening at one end.
This can be seen in the two scatterplots we just saw with Goal 3:
Practice Exercises
• Page 239 #1 – 4
Practice Answers
Homework
• Revisiting some things you learned in Algebra 2…
• Page 239 #5, 6, 7
The Ladder of Powers
There is a family of simple re-expressions that move data toward our goals
in a consistent way. This collection of re-expressions is called the Ladder of
Powers.
The Ladder of Powers orders the effects that the re-expressions have on
data. Where to start? You may be wondering what to do to the data to
re-express it. It turns out that certain kinds of data are more likely to be
helped by particular re-expressions.
Power   Name                          Comment
2       Square of data values         Try with unimodal distributions that are skewed to the left.
1       Raw data                      Data with positive and negative values and no bounds are less likely to benefit from re-expression.
½       Square root of data values    Counts often benefit from a square root re-expression.
“0”     We’ll use logarithms here     Measurements that cannot be negative often benefit from a log re-expression.
–1/2    Reciprocal square root        An uncommon re-expression, but sometimes useful.
–1      The reciprocal of the data    Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
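The rungs of the ladder can be sketched as one small Python function. This
is only an illustration: the name re_express, the keep_direction option, and
the choice of base-10 logarithms for the “0” rung are assumptions made here,
not part of the chapter.

```python
import math

def re_express(y, power, keep_direction=True):
    """Apply one rung of the Ladder of Powers to a single positive value y.

    power 0 stands for the logarithm (base 10 here, an arbitrary choice).
    For negative powers, keep_direction=True returns -(y ** power), so a
    larger y still gives a larger re-expressed value.
    """
    if power == 0:
        return math.log10(y)
    value = y ** power
    if power < 0 and keep_direction:
        return -value
    return value

# Walk one value down the ladder: 2, 1, 1/2, "0", -1/2, -1
for p in (2, 1, 0.5, 0, -0.5, -1):
    print(p, re_express(100.0, p))
```

In practice the function would be applied to every value of the variable
being re-expressed, and the resulting plot checked after each step.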
Just Checking
1. You want to model the relationship between the number of birds
counted at a nesting site and the temperature (in degrees Celsius).
The scatterplot of counts vs. temperature shows an upwardly curving
pattern, with more birds spotted at higher temperatures. What
transformation (if any) of the bird counts might you start with?
2. You want to model the relationship between prices for various items in
Paris and in Hong Kong. The scatterplot of Hong Kong prices vs.
Parisian prices shows a generally straight pattern with a small amount
of scatter. What transformation (if any) of the Hong Kong prices
might you start with?
3. You want to model the population growth of the US over the past 200
years. The scatterplot shows a strongly upwardly curved pattern.
What transformation (if any) of the population might you start with?
Example: Cars 1991
Fuel efficiency (mpg) vs. Weight for 38 cars as reported by Consumer
Reports
Weight (lb)  Fuel Eff. (mpg)    Weight (lb)  Fuel Eff. (mpg)    Weight (lb)  Fuel Eff. (mpg)
1875         32                 2600         27                 3100         15
1875         35                 2620         22                 3375         14
1925         31                 2625         26                 3375         18
1925         34                 2625         28                 3355         21
1940         33                 2620         34                 3500         18
2200         37                 2700         28                 3575         17
2225         29                 2700         26                 3700         16
2230         30                 2725         27                 3800         15
2240         30.5               2800         22                 3790         16
2245         34                 2825         20                 3850         15
2250         31                 2825         23                 3850         16
2255         27                 2850         22                 3900         13
                                3025         21                 4000         14
Write the equation of the linear model for these data.
Make a scatterplot of the data using your calculator.
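For readers working on a computer rather than a calculator, the following
Python sketch fits a least-squares line to the 38 cars twice: once for mpg
and once for the reciprocal re-expression, gallons per 100 miles. The data
are typed in from the table above; the helper name least_squares is
hypothetical, and the code is an illustration rather than the method used in
class.

```python
# Weight (lb) and fuel efficiency (mpg) for the 38 cars in the table above.
weight = [1875, 1875, 1925, 1925, 1940, 2200, 2225, 2230, 2240, 2245, 2250,
          2255, 2600, 2620, 2625, 2625, 2620, 2700, 2700, 2725, 2800, 2825,
          2825, 2850, 3025, 3100, 3375, 3375, 3355, 3500, 3575, 3700, 3800,
          3790, 3850, 3850, 3900, 4000]
mpg = [32, 35, 31, 34, 33, 37, 29, 30, 30.5, 34, 31,
       27, 27, 22, 26, 28, 34, 28, 26, 27, 22, 20,
       23, 22, 21, 15, 14, 18, 21, 18, 17, 16, 15,
       16, 15, 16, 13, 14]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Reciprocal re-expression: fuel consumption in gallons per 100 miles.
gal_per_100mi = [100.0 / m for m in mpg]

a_mpg, b_mpg = least_squares(weight, mpg)
a_gal, b_gal = least_squares(weight, gal_per_100mi)
print(f"predicted mpg       = {a_mpg:.3f} + ({b_mpg:.5f}) * Weight")
print(f"predicted gal/100mi = {a_gal:.3f} + ({b_gal:.5f}) * Weight")
```

The slope is negative for mpg and positive for gallons per 100 miles. To
judge which model is better, plot the residuals of each fit against Weight
and look for a bend.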
Homework
• Page 240 #8, 9, 11, 12
Step-by-Step Example
Standard fishing line comes in a range of strengths, usually expressed
as “test pounds.” Five-pound test line, for example, can be expected to
withstand a pull of up to five pounds without breaking. The convention in
selling fishing line is that the price of a spool doesn’t vary with strength.
Instead, the length of line on the spool varies: higher test pound line is
thicker, so a spool that holds about the same amount of material holds fewer
yards of it. Let’s look at the Length and Strength of spools of fishing line
manufactured by the same company and sold for the same price at one store.
Question
• How are the Length on the spool and the Strength related? And what
re-expression will straighten the relationship?
THINK
• I want to fit a linear model for the length and strength of fishing line.
• I have the length and “pound test” strength of fishing line sold by a
single vendor at a particular store.
• Let Length = the length in yards of fishing line on the spool
• Let Strength = the test strength in pounds
[Scatterplot of Length (yards) against Strength (test pounds)]
The plot shows a negative direction and an association that has little scatter
but is not straight.
SHOW
• Let’s try a re-expression of the data to make it more nearly linear.
• Below is a scatterplot of the square root of Length against the
Strength:
[Scatterplot: Strength vs Sqrt(Length) of Fishing Line; x-axis Strength, y-axis sqrt(Length)]
The plot is less bent, but still not straight.
Let’s try the logarithm of Length against Strength:
[Scatterplot: Strength vs Log(Length) in Fishing Line; x-axis Strength, y-axis log(Length)]
This is much better, but still not straight, so let’s take another step down
the ladder to the reciprocal (the –1 power):
[Scatterplot: Strength vs -1/Length for Fishing Line; x-axis Strength, y-axis -1/Length]
Maybe now I have moved too far along the ladder. A half-step back is the –½
power, the negative reciprocal square root. (We use the negative to preserve
the direction of the relationship.)
[Scatterplot: Strength vs -1/Sqrt(Length) of Fishing Line; x-axis Strength, y-axis -1/sqrt(Length)]
TELL
It’s hard to choose between the last two alternatives; either one is good
enough. I’m going to go with the negative reciprocal of the length.
Now that the re-expressed data satisfy the Straight Enough Condition, we
can fit a linear model by least squares. Using a calculator, I found:
$$\frac{-1}{\text{Length}} = -0.000343 - 0.0000316\,(\text{Strength})$$
We can use this model to predict the length of a spool of, say, 35-pound
test line:
$$\frac{-1}{\text{Length}} = -0.000343 - 0.0000316\,(35) = -0.001449$$
We could leave the result in these units, but here we want to transform the
predicted value back into yards: Length = -1/(-0.001449) ≈ 690 yards.
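The prediction and the back-transformation can be checked with a few lines
of Python. The coefficients are the ones reported in the text; the names B0,
B1 and predict_length are made up for this sketch.

```python
# Coefficients of the fitted model from the text:
#   -1/Length = B0 + B1 * Strength
B0 = -0.000343
B1 = -0.0000316

def predict_length(strength):
    """Predict spool length in yards for a test strength in pounds."""
    neg_reciprocal = B0 + B1 * strength  # model output, in units of -1/yards
    return -1.0 / neg_reciprocal         # undo the re-expression

print(round(predict_length(35)))  # about 690 yards, as in the hand calculation
```

Because the model was fit to -1/Length, every prediction has to be
back-transformed this way before it can be read in yards.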
Example
Plan B: Attack of the Logarithms
When none of the data values is zero or negative, logarithms can be a helpful
ally in the search for a useful model. Try taking the logs of both the x- and
y-variable. Then re-express the data using some combination of x or log(x)
vs. y or log(y).
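One way to try the four combinations is sketched below in Python. The data
are made up to follow a power law exactly (y = 3x²), so they are not from the
chapter, and the helper name correlation is hypothetical. The correlation is
used only as a quick screen for straightness; the scatterplot and residuals
plot should still be checked before settling on a model.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data that follow a power law, y = 3 * x**2 (illustration only).
x = [1, 2, 4, 8, 16, 32]
y = [3 * xi ** 2 for xi in x]

log_x = [math.log10(v) for v in x]
log_y = [math.log10(v) for v in y]

# The four combinations of x or log(x) with y or log(y).
candidates = {
    "y vs x": (x, y),
    "y vs log(x)": (log_x, y),
    "log(y) vs x": (x, log_y),
    "log(y) vs log(x)": (log_x, log_y),
}
results = {name: correlation(xs, ys) for name, (xs, ys) in candidates.items()}
for name, r in results.items():
    print(f"{name:18s} r = {r:.4f}")

straightest = max(results, key=lambda name: abs(results[name]))
print("straightest:", straightest)
```

For power-law data like these, log(y) vs log(x) is exactly straight. Data
that grow by a constant percentage would instead be straightened by log(y)
vs x.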
TI Tips
Let’s revisit the Arizona State tuition data. Recall that when we tried to fit
a linear model to the yearly tuition costs, the residuals plot showed a
distinct curve.
This curved pattern indicates that data re-expression may be in order. If
you have no clue which re-expression to try, the Ladder of Powers may help.
We just used that approach in the fishing line example. Here, though, we
can play a hunch. It is reasonable to suspect that tuition increases at a
relatively consistent percentage year by year. This suggests that using the
logarithm of tuition may help.
Tell the calculator to find the logs of the tuitions and store them as a new
list. Then perform the regression for the logarithm of tuition vs. year.
Do you know what the model’s equation is? Remember, it involves a logarithm!
Can you estimate the tuition for 2001? Make sure you think!!!
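The general shape of a log model can be sketched in Python. The figures
below are hypothetical (a starting value of 5000 growing at exactly 6% per
year) and are NOT the Arizona State tuition data, so this does not answer
the questions above. It only shows how to fit log(tuition) against year and
how to turn a predicted logarithm back into dollars.

```python
import math

# Hypothetical tuition figures (NOT the Arizona State data): 6% growth a year.
years = list(range(10))                      # years since the first record
tuition = [5000 * 1.06 ** t for t in years]  # dollars

log_tuition = [math.log10(v) for v in tuition]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Model: log10(tuition) = intercept + slope * year
intercept, slope = least_squares(years, log_tuition)

def predict_tuition(year):
    """Predict log10(tuition), then raise 10 to that power to get dollars."""
    return 10 ** (intercept + slope * year)

growth_rate = 10 ** slope - 1  # yearly percentage growth implied by the slope
print(f"log10(tuition) = {intercept:.4f} + {slope:.4f} * year")
print(f"implied growth per year: {growth_rate:.1%}")
```

The key step is the last one in predict_tuition: the model predicts a
logarithm, so the prediction must be used as a power of 10 to get back to
the original units.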
Example
Multiple Benefits
We often choose a re-expression for one reason and then discover that it
has helped other aspects of an analysis. For example, a re-expression that
makes a histogram more symmetric might also straighten a scatterplot or
stabilize variance.
Why Not Just Use a Curve?
If there’s a curve in the scatterplot, why not just fit a curve to the data?
The mathematics and calculations for “curves of best fit” are considerably
more difficult than “lines of best fit.” Besides, straight lines are easy to
understand. We know how to think about the slope and the y-intercept.
What Can Go Wrong?
• Don’t expect your model to be perfect.
• Don’t stray too far from the ladder.
• Don’t choose a model based on R² alone.
• Beware of multiple modes.
   • Re-expression cannot pull separate modes together.
• Watch out for scatterplots that turn around.
   • Re-expression can straighten many bent relationships, but not
     those that go up then down, or down then up.
• Watch out for negative data values.
   • It’s impossible to re-express negative values by any power that
     is not a whole number on the Ladder of Powers or to re-express
     values that are zero for negative powers.
• Watch for data far from 1.
   • Data values that are all very far from 1 may not be much
     affected by re-expression unless the range is very large. If all
     the data values are large (e.g., years), consider subtracting a
     constant to bring them back near 1.
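The last warning can be seen numerically. The Python sketch below compares a
list of years with its logarithm, before and after subtracting a constant.
The year range and the constant 1989 are arbitrary choices for illustration,
and the helper name correlation is hypothetical. A correlation very close to
1 between a variable and its re-expression means the re-expression is almost
a straight-line change, so it can do little to bend or unbend a
relationship.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

years = list(range(1990, 2001))        # values far from 1, narrow range
shifted = [yr - 1989 for yr in years]  # the same years re-based to run 1..11

r_years = correlation(years, [math.log10(v) for v in years])
r_shifted = correlation(shifted, [math.log10(v) for v in shifted])
print(f"log(year) vs year:       r = {r_years:.6f}")
print(f"log(shifted) vs shifted: r = {r_shifted:.6f}")
```

Taking the log of the raw years changes almost nothing about their shape,
while the log of the shifted years is visibly curved, which is what gives a
re-expression room to work.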
What have we learned?
• When the conditions for regression are not met, a simple re-expression
of the data may help.
• A re-expression may make the:
   • Distribution of a variable more symmetric.
   • Spread across different groups more similar.
   • Form of a scatterplot straighter.
   • Scatter around the line in a scatterplot more consistent.
• Taking logs is often a good, simple starting point.
   • To search further, the Ladder of Powers or the log-log
     approach can help us find a good re-expression.
• Our models won’t be perfect, but re-expression can lead us to a useful
model.
Homework
• Page 241 #15, 17, 27 and Alligators Problem (above)