16. Transformations
A. Log
B. Tukey's Ladder of Powers
C. Box-Cox Transformations
D. Exercises
The focus of statistics courses is the exposition of appropriate methodology to
analyze data to answer the question at hand. Sometimes the data are given to you,
while other times the data are collected as part of a carefully-designed experiment.
Often the time devoted to statistical analysis is less than 10% of the time devoted
to data collection and preparation. If aspects of the data preparation fail, then the
success of the analysis is in jeopardy. Sometimes errors are introduced into the
recording of data. Sometimes biases are inadvertently introduced in the selection of
subjects or the mis-calibration of monitoring equipment.
In this chapter, we focus on the fact that many statistical procedures work
best if individual variables have certain properties. The measurement scale of a
variable should be part of the data preparation effort. For example, the correlation
coefficient does not require that the variables have a normal shape, but often
relationships can be made clearer by re-expressing the variables. An economist
may choose to analyze the logarithm of prices if the relative price is of interest. A
chemist may choose to perform a statistical analysis using the inverse temperature
as a variable rather than the temperature itself. But note that the inverse of a
temperature will differ depending on whether it is measured in °F, °C, or K.
The introductory chapter covered linear transformations. These
transformations normally do not change statistics such as Pearson’s r, although
they do affect the mean and standard deviation. The first section here is on log
transformations which are useful to reduce skew. The second section is on Tukey’s
ladder of powers. You will see that log transformations are a special case of the
ladder of powers. Finally, we cover the relatively advanced topic of the Box-Cox
transformation.
Log Transformations
by David M. Lane
Prerequisites
• Chapter 1: Logarithms
• Chapter 1: Shapes of Distributions
• Chapter 3: Additional Measures of Central Tendency
• Chapter 4: Introduction to Bivariate Data
Learning Objectives
1. State how a log transformation can help make a relationship clear
2. Describe the relationship between logs and the geometric mean
The log transformation can be used to make highly skewed distributions less
skewed. This can be valuable both for making patterns in the data more
interpretable and for helping to meet the assumptions of inferential statistics.
Figure 1 shows an example of how a log transformation can make patterns
more visible. Both graphs plot the brain weight of animals as a function of their
body weight. The raw weights are shown in the upper panel;
the log-transformed weights are plotted in the lower panel.
Figure 1. Scatter plots of brain weight as a function of body weight in terms
of both raw data (upper panel) and log-transformed data (lower
panel).
It is hard to discern a pattern in the upper panel whereas the strong relationship is
shown clearly in the lower panel.
The comparison of the means of log-transformed data is actually a
comparison of geometric means. This occurs because, as shown below, the anti-log
of the arithmetic mean of log-transformed values is the geometric mean.
Table 1 shows the logs (base 10) of the numbers 1, 10, and 100. The
arithmetic mean of the three logs is
(0 + 1 + 2)/3 = 1
The anti-log of this arithmetic mean of 1 is:
10^1 = 10
which is the geometric mean:
(1 × 10 × 100)^(1/3) = 10.
Table 1. Logarithms.

X       Log10(X)
1       0
10      1
100     2
Therefore, if the arithmetic means of two sets of log-transformed data are equal
then the geometric means are equal.
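The same fact can be checked numerically. Below is a minimal Python sketch (using NumPy; the variable names are ours) that verifies, for the numbers in Table 1, that the anti-log of the mean of the logs equals the geometric mean:

    import numpy as np

    x = np.array([1, 10, 100])

    # Anti-log of the arithmetic mean of the base-10 logs
    antilog_of_mean_log = 10 ** np.log10(x).mean()

    # Geometric mean computed directly: (1 * 10 * 100) ** (1/3)
    geometric_mean = x.prod() ** (1 / len(x))

    print(antilog_of_mean_log, geometric_mean)  # both print 10 (up to floating-point rounding)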
Tukey Ladder of Powers
by David W. Scott
Prerequisites
• Chapter 1: Logarithms
• Chapter 4: Bivariate Data
• Chapter 4: Values of Pearson Correlation
• Chapter 12: Independent Groups t Test
• Chapter 13: Introduction to Power
• Chapter 16: Log Transformations
Learning Objectives
1. Give the Tukey ladder of transformations
2. Find a transformation that reveals a linear relationship
3. Find a transformation to approximate a normal distribution
Introduction
We assume we have a collection of bivariate data
(x1,y1),(x2,y2),...,(xn,yn)
and that we are interested in the relationship between variables x and y. Plotting the
data on a scatter diagram is the first step. As an example, consider the population of
the United States for the 200 years before the Civil War. Of course, the decennial
census began in 1790. These data are plotted two ways in Figure 1. Malthus
predicted that geometric growth of populations coupled with arithmetic growth of
grain production would have catastrophic results. Indeed the US population
followed an exponential curve during this period.
Figure 1. The US population from 1670 to 1860. The Y-axis on the right panel is on a log scale.
Tukey's Transformation Ladder
Tukey (1977) describes an orderly way of re-expressing variables using a power transformation. You may be familiar with polynomial regression (a form of multiple regression) in which the simple linear model y = b0 + b1x is extended with terms such as b2x^2 + b3x^3 + b4x^4. Alternatively, Tukey suggests exploring simple relationships such as

y^λ = b0 + b1x  or  y = b0 + b1x^λ,     (Equation 1)

where λ is a parameter chosen to make the relationship as close to a straight line as possible. Linear relationships are special, and if a transformation of the type x^λ or y^λ works as in Equation (1), then we should consider changing our measurement scale for the rest of the statistical analysis.
There is no constraint on values of λ that we may consider. Obviously
choosing λ = 1 leaves the data unchanged. Negative values of λ are also
reasonable. For example, the relationship
y = b0 + b1/x
would be represented by λ = −1. The value λ = 0 has no special value, since x^0 = 1,
which is just a constant. Tukey (1977) suggests that it is convenient to simply
define the transformation when λ = 0 to be the logarithm function rather than the
constant 1. We shall revisit this convention shortly. The following table gives examples of the Tukey ladder of transformations.

Table 1. Tukey's Ladder of Transformations

λ       -2      -1      -1/2    0       1/2     1       2
Xfm     1/x^2   1/x     1/√x    log x   √x      x       x^2
If x takes on negative values, then special care must be taken so that the transformations make sense, if possible. We generally limit ourselves to variables where x > 0 to avoid these considerations. For some dependent variables such as the number of errors, it is convenient to add 1 to x before applying the transformation.
Also, if the transformation parameter λ is negative, then the transformed variable x^λ is reversed. For example, if x is increasing, then 1/x is decreasing. We choose to redefine the Tukey transformation to be -(x^λ) if λ < 0 in order to preserve the order of the variable after transformation. Formally, the Tukey transformation is defined as

x̃λ = x^λ       if λ > 0
    = log x     if λ = 0
    = -(x^λ)    if λ < 0     (Equation 2)

In Table 2 we reproduce Table 1 but using the modified definition when λ < 0.
Table 2. Modified Tukey's Ladder of Transformations

λ       -2       -1      -1/2     0       1/2     1       2
Xfm     -1/x^2   -1/x    -1/√x    log x   √x      x       x^2
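As a concrete illustration of Equation (2), here is a minimal Python sketch of the modified Tukey transformation (the function name tukey_transform and the use of NumPy are our choices; x is assumed positive):

    import numpy as np

    def tukey_transform(x, lam):
        """Modified Tukey ladder transformation of Equation (2); assumes x > 0."""
        x = np.asarray(x, dtype=float)
        if lam > 0:
            return x ** lam
        elif lam == 0:
            return np.log(x)
        else:
            # Negating preserves the ordering when lam < 0 (e.g., -1/x for lam = -1)
            return -(x ** lam)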
The Best Transformation for Linearity
The goal is to find a value of λ that makes the scatter diagram as linear as possible. For the US population, the logarithmic transformation applied to y makes the relationship almost perfectly linear. The red dashed line in the right frame of Figure 1 has a slope of about 1.35; that is, the US population grew at a rate of about 35% per decade.
The logarithmic transformation corresponds to the choice λ = 0 by Tukey's convention. In Figure 2, we display the scatter diagram of the US population data for λ = 0 as well as for other choices of λ.
Figure 2. The US population from 1670 to 1860 for various values of λ.
The raw data are plotted in the bottom right frame of Figure 2 when λ = 1. The logarithmic fit is in the upper right frame when λ = 0. Notice how the scatter diagram smoothly morphs from concave to convex as λ increases. Thus intuitively there is a unique best choice of λ corresponding to the "most linear" graph.
One way to make this choice objective is to use an objective function for this purpose. One approach might be to fit a straight line to the transformed points and try to minimize the residuals. However, an easier approach is based on the fact that the correlation coefficient, r, is a measure of the linearity of a scatter diagram. In particular, if the points fall on a straight line then their correlation will be r = 1. (We need not worry about the case when r = −1 since we have defined the Tukey transformed variable x̃λ to be positively correlated with x itself.)
In Figure 3, we plot the correlation coefficient of the scatter diagram (x, ỹλ) as a function of λ. It is clear that the logarithmic transformation (λ = 0) is nearly optimal by this criterion.

Figure 3. Graph of US population correlation coefficient as a function of λ.
Is the US population still on the same exponential growth pattern? In Figure 4 we display the US population from 1630 to 2000 using the transformation and fit as in the right frame of Figure 1. Fortunately, the exponential growth (or at least its rate) was not sustained into the Twentieth Century. If it had, the US population in the year 2000 would have been over 2 billion (2.07 to be exact), larger than the population of China.
Figure 4. Graph of US population 1630-2000 with λ = 0.
We can examine the decennial census population figures of individual states as well. In Figure 5 we display the population data for the state of New York from 1790 to 2000, together with an estimate of the population in 2008. Clearly something unusual happened starting in 1970. (This began the period of mass migration to the West and South as the rust belt industries began to shut down.) Thus, we compute the best λ value using the data from 1790-1960 in the middle frame of Figure 5. The right frame displays the transformed data, together with the linear fit for the 1790-1960 period. The value of λ = 0.41 is not obvious and one might reasonably choose to use λ = 0.50 for practical reasons.

Figure 5. Graphs related to the New York state population 1790-2008.
If we look at one of the younger states in the West, the picture is different. Arizona has attracted many retirees and immigrants. Figure 6 summarizes our findings. Indeed, the growth of population in Arizona is logarithmic, and appears to still be logarithmic through 2005.

Figure 6. Graphs related to the Arizona state population 1910-2005.
Reducing Skew
Many statistical methods such as t tests and the analysis of variance assume normal distributions. Although these methods are relatively robust to violations of normality, transforming the distributions to reduce skew can markedly increase their power.
As an example, the data in the "Stereograms" case study is very skewed. A t test of the difference between the two conditions using the raw data results in a p value of 0.056, a value not conventionally considered significant. However, after a log transformation (λ = 0) that reduces the skew greatly, the p value is 0.023, which is conventionally considered significant.
The demonstration in Figure 7 shows distributions of the data from the Stereograms case study as transformed with various values of λ. Decreasing λ makes the distribution less positively skewed. Keep in mind that λ = 1 is the raw data. Notice that there is a slight positive skew for λ = 0 but much less skew than found in the raw data (λ = 1). Values of λ below 0 result in negative skew.
Figure 7. Distribution of data from the Stereogram case study for various
values of λ.
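The comparison described above can be sketched in Python with SciPy's independent-groups t test. The two arrays below are hypothetical positive-valued scores standing in for the two Stereograms conditions (they are not the actual case-study data):

    import numpy as np
    from scipy import stats

    # Hypothetical, positively skewed samples for two independent groups
    group1 = np.array([1.7, 2.1, 3.5, 4.0, 6.3, 12.8, 25.0])
    group2 = np.array([1.1, 1.2, 2.0, 2.4, 3.1, 4.4, 9.5])

    t_raw, p_raw = stats.ttest_ind(group1, group2)                   # t test on the raw, skewed data
    t_log, p_log = stats.ttest_ind(np.log(group1), np.log(group2))   # t test after a log transformation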
Box-Cox Transformations
by David Scott

Prerequisites
This section assumes a higher level of mathematics background than most other sections of this work.
• Chapter 1: Logarithms
• Chapter 3: Additional Measures of Central Tendency (Geometric Mean)
• Chapter 4: Bivariate Data
• Chapter 4: Values of Pearson Correlation
• Chapter 16: Tukey Ladder of Powers

George Box and Sir David Cox collaborated on one paper (Box, 1964). The story is that while Cox was visiting Box at Wisconsin, they decided they should write a paper together because of the similarity of their names (and that both are British). In fact, Professor Box is married to the daughter of Sir Ronald Fisher.
The Box-Cox transformation of the variable x is also indexed by λ, and is defined as

x′λ = (x^λ − 1) / λ.     (Equation 1)
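A minimal Python sketch of Equation (1) (the function name boxcox_transform is ours; x is assumed positive, and the λ = 0 case is handled as the log, as justified below):

    import numpy as np

    def boxcox_transform(x, lam):
        """Box-Cox transformation of Equation (1); the lambda = 0 case is log(x)."""
        x = np.asarray(x, dtype=float)
        if lam == 0:
            return np.log(x)
        return (x ** lam - 1) / lam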
At first glance, although the formula in Equation (1) is a scaled version of the Tukey transformation x^λ, this transformation does not appear to be the same as the Tukey formula in Equation (2). However, a closer look shows that when λ < 0, both x̃λ and x′λ change the sign of x^λ to preserve the ordering.
Of more interest is the fact that when λ = 0, the Box-Cox variable is the indeterminate form 0/0. Rewriting the Box-Cox formula as

x′λ = (e^(λ log x) − 1) / λ ≈ ((1 + λ log x + (1/2) λ^2 (log x)^2 + ···) − 1) / λ → log x

as λ → 0. This same result may also be obtained using l'Hôpital's rule from your calculus course. This gives a rigorous explanation for Tukey's suggestion that the log transformation (which is not an example of a polynomial transformation) may be inserted at the value λ = 0.
Notice with this definition of x′λ that x = 1 always maps to the point x′λ = 0 for all values of λ. To see how the transformation works, look at the examples in Figure 1. In the top row, the choice λ = 1 simply shifts x to the value x − 1, which is a straight line. In the bottom row (on a semi-logarithmic scale), the choice λ = 0 corresponds to a logarithmic transformation, which is now a straight line. We superimpose a larger collection of transformations on a semi-logarithmic scale in Figure 2.
Figure 1. Examples of the Box-Cox transformation x′λ versus x for λ = −1, 0, 1. In the second row, x′λ is plotted against log(x). The red point is at (1, 0).

Figure 2. Examples of the Box-Cox transformation versus log(x) for −2 < λ < 3. The bottom curve corresponds to λ = −2 and the upper to λ = 3.
Transformation to Normality
Another important use of variable transformation is to eliminate skewness and other distributional features that complicate analysis. Often the goal is to find a simple transformation that leads to normality. In the article on q-q plots, we discuss how to assess the normality of a set of data, x1, x2, ..., xn. Data that are normal lead to a straight line on the q-q plot. Since the correlation coefficient is maximized when a scatter diagram is linear, we can use the same approach above to find the most normal transformation.
Specifically, we form the n pairs

( Φ⁻¹((i − 0.5)/n), x(i) ),   for i = 1, 2, ..., n,

where Φ⁻¹ is the inverse CDF of the normal density and x(i) denotes the ith sorted value of the data set.
As an example, consider a large sample of British household incomes taken in 1973, normalized to have mean equal to one (n = 7,125). Such data are often strongly skewed, as is clear from Figure 3. The data were sorted and paired with the 7,125 normal quantiles. The value of λ that gave the greatest correlation (r = 0.9944) was λ = 0.21.

Figure 3. (L) Density plot of the 1973 British income data. (R) The best value of λ is 0.21.
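A minimal Python sketch of this procedure (reusing the boxcox_transform helper sketched earlier, together with SciPy's normal quantile function; the λ grid is our choice):

    import numpy as np
    from scipy.stats import norm

    def best_normality_lambda(x, lambdas=np.arange(-200, 301) / 100):
        """Lambda whose Box-Cox transform is most correlated with normal quantiles."""
        x_sorted = np.sort(np.asarray(x, dtype=float))
        n = len(x_sorted)
        quantiles = norm.ppf((np.arange(1, n + 1) - 0.5) / n)  # Phi^(-1)((i - 0.5)/n)
        correlations = [np.corrcoef(quantiles, boxcox_transform(x_sorted, lam))[0, 1]
                        for lam in lambdas]
        return lambdas[int(np.argmax(correlations))]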
The kernel density plot of the optimally transformed data is shown in the left frame of Figure 4. While this figure is much less skewed than in Figure 3, there is clearly an extra "component" in the distribution that might reflect the poor. Economists often analyze the logarithm of income, corresponding to λ = 0; see Figure 4. The correlation is only r = 0.9901 in this case, but for convenience, the log-transform probably will be preferred.
Figure 4. (L) Density plot of the 1973 British income data transformed with λ = 0.21. (R) The log-transform with λ = 0.

Other Applications
Regression analysis is another application where variable transformation is frequently applied. For the model

y = β0 + β1x1 + β2x2 + ··· + βpxp + ε

and fitted model

ŷ = b0 + b1x1 + b2x2 + ··· + bpxp,

each of the predictor variables xj can be transformed. The usual criterion is the variance of the residuals, given by

(1/n) Σ (ŷi − yi)^2,  summed over i = 1, ..., n.

Occasionally, the response variable y may be transformed. In this case, care must be taken because the variance of the residuals is not comparable as λ varies. Let ḡy represent the geometric mean of the response variables,

ḡy = (y1 · y2 ··· yn)^(1/n).

Then the transformed response is defined as

y′λ = (y^λ − 1) / (λ · ḡy^(λ−1)).

When λ = 0 (the logarithmic case),

y′0 = ḡy · log(y).
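A minimal Python sketch of this rescaled response transformation (the function name transform_response is ours; y is assumed positive):

    import numpy as np

    def transform_response(y, lam):
        """Geometric-mean-scaled Box-Cox transform of the response, so residual
        variances remain comparable across different values of lambda."""
        y = np.asarray(y, dtype=float)
        g = np.exp(np.log(y).mean())                 # geometric mean of the responses
        if lam == 0:
            return g * np.log(y)                     # logarithmic case
        return (y ** lam - 1) / (lam * g ** (lam - 1))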
We can simultaneously transform both the predictor and the response variables. For more examples and discussions, see Kutner, Nachtsheim, Neter, and Li (2004).
Statistical Literacy
by David M. Lane
Prerequisites
• Chapter 16: Logarithms
Many financial web pages give you the option of using a linear or a logarithmic Y-axis. An example from Google Finance is shown below.
What do you think?
To get a straight line with the linear option chosen, the price would have to go up
the same amount every time period. What would result in a straight line with the
logarithmic option chosen?
The price would have to go up the same proportion every time
period. For example, go up 0.1% every day.
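To see why, note that if the price grows by the same proportion r each period, then price_t = price_0 × (1 + r)^t, so log(price_t) = log(price_0) + t·log(1 + r), which is a straight line when plotted against time.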
References
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations, Journal of the
Royal Statistical Society, Series B, 26, 211-252.
Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2004). Applied Linear Statistical
Models, McGraw-Hill/Irwin, Homewood, IL.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Exercises
Prerequisites
All Content in This Chapter
1. When is a log transformation valuable?
2. If the arithmetic mean of log10 transformed data were 3, what would be the
geometric mean?
3. Using Tukey's ladder of transformation, transform the following data using a λ
of 0.5: 9, 16, 25
4. What value of λ in Tukey's ladder decreases skew the most?
5. What value of λ in Tukey's ladder increases skew the most?
6. In the ADHD case study, transform the data in the placebo condition (D0) with
λ's of .5, 0, -.5, and -1. How does the skew in each of these compare to the skew
in the raw data? Which transformation leads to the least skew?