16. Transformations

A. Log Transformations
B. Tukey's Ladder of Powers
C. Box-Cox Transformations
D. Exercises

The focus of statistics courses is the exposition of appropriate methodology to analyze data to answer the question at hand. Sometimes the data are given to you, while other times the data are collected as part of a carefully designed experiment. Often the time devoted to statistical analysis is less than 10% of the time devoted to data collection and preparation. If aspects of the data preparation fail, then the success of the analysis is in jeopardy. Sometimes errors are introduced into the recording of data. Sometimes biases are inadvertently introduced in the selection of subjects or the miscalibration of monitoring equipment.

In this chapter, we focus on the fact that many statistical procedures work best if individual variables have certain properties. The measurement scale of a variable should be part of the data preparation effort. For example, the correlation coefficient does not require that the variables have a normal shape, but often relationships can be made clearer by re-expressing the variables. An economist may choose to analyze the logarithm of prices if the relative price is of interest. A chemist may choose to perform a statistical analysis using the inverse temperature as a variable rather than the temperature itself. Note, however, that the inverse of a temperature will differ depending on whether it is measured in °F, °C, or kelvin.

The introductory chapter covered linear transformations. These transformations normally do not change statistics such as Pearson's r, although they do affect the mean and standard deviation. The first section here is on log transformations, which are useful for reducing skew. The second section is on Tukey's ladder of powers. You will see that log transformations are a special case of the ladder of powers. Finally, we cover the relatively advanced topic of the Box-Cox transformation.

Log Transformations
by David M. Lane

Prerequisites
• Chapter 1: Logarithms
• Chapter 1: Shapes of Distributions
• Chapter 3: Additional Measures of Central Tendency
• Chapter 4: Introduction to Bivariate Data

Learning Objectives
1. State how a log transformation can help make a relationship clear
2. Describe the relationship between logs and the geometric mean

The log transformation can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

Figure 1 shows an example of how a log transformation can make patterns more visible. Both graphs plot the brain weight of animals as a function of their body weight. The raw weights are plotted in the upper panel; the log-transformed weights are plotted in the lower panel.

Figure 1. Scatter plots of brain weight as a function of body weight in terms of both raw data (upper panel) and log-transformed data (lower panel).

It is hard to discern a pattern in the upper panel, whereas the strong relationship is shown clearly in the lower panel.
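In practice, this kind of re-expression is just an element-wise logarithm of both variables before plotting. Below is a minimal sketch, not part of the original text, that assumes NumPy and Matplotlib and uses made-up body and brain weights rather than the actual animal data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up weights spanning several orders of magnitude (illustrative only)
    body = np.array([0.02, 3.3, 62.0, 2547.0, 87000.0])    # body weight
    brain = np.array([0.4, 25.6, 1320.0, 4600.0, 8000.0])  # brain weight

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.scatter(body, brain)                       # raw scale: hard to read
    ax2.scatter(np.log10(body), np.log10(brain))   # log-log: pattern emerges
    plt.show()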
The comparison of the means of log-transformed data is actually a comparison of geometric means. This occurs because, as shown below, the antilog of the arithmetic mean of log-transformed values is the geometric mean. Table 1 shows the logs (base 10) of the numbers 1, 10, and 100. The arithmetic mean of the three logs is

    (0 + 1 + 2)/3 = 1.

The antilog of this arithmetic mean of 1 is

    10^1 = 10,

which is the geometric mean:

    (1 × 10 × 100)^(1/3) = 10.

Table 1. Logarithms.

X       Log10(X)
1       0
10      1
100     2

Therefore, if the arithmetic means of two sets of log-transformed data are equal, then the geometric means are equal.
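A few lines of Python (a sketch added here, not part of the original text) confirm the relationship between the mean of the logs and the geometric mean:

    import math

    x = [1, 10, 100]

    # Arithmetic mean of the base-10 logs: (0 + 1 + 2)/3 = 1
    mean_log = sum(math.log10(v) for v in x) / len(x)

    # Antilog of that mean: 10**1 = 10
    antilog = 10 ** mean_log

    # Geometric mean computed directly: (1 * 10 * 100)**(1/3) = 10
    geometric_mean = math.prod(x) ** (1 / len(x))

    print(mean_log, antilog, geometric_mean)   # 1.0 10.0 ~10.0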
Tukey Ladder of Powers
by David W. Scott

Prerequisites
• Chapter 1: Logarithms
• Chapter 4: Bivariate Data
• Chapter 4: Values of Pearson Correlation
• Chapter 12: Independent Groups t Test
• Chapter 13: Introduction to Power

Learning Objectives
1. Give the Tukey ladder of transformations
2. Find a transformation that reveals a linear relationship
3. Find a transformation to approximate a normal distribution

Introduction
We assume we have a collection of bivariate data

    (x1, y1), (x2, y2), ..., (xn, yn)

and that we are interested in the relationship between the variables x and y. Plotting the data on a scatter diagram is the first step. As an example, consider the population of the United States for the 200 years before the Civil War. (The decennial census began in 1790.) These data are plotted two ways in Figure 1: on the raw scale and on a semi-log scale. Malthus predicted that geometric growth of populations coupled with arithmetic growth of grain production would have catastrophic results. Indeed the US population followed an exponential curve during this period.

Figure 1. The US population from 1670 to 1860. The Y-axis on the right panel is on a log scale.

Tukey's Transformation Ladder
Tukey (1977) describes an orderly way of re-expressing variables using a power transformation. You may be familiar with polynomial regression (a form of multiple regression) in which the simple linear model y = b0 + b1x is extended with terms such as b2x^2 + b3x^3 + b4x^4. Alternatively, Tukey suggests exploring simple relationships such as

    y = b0 + b1x^λ   or   y^λ = b0 + b1x,        (Equation 1)

where λ is a parameter chosen to make the relationship as close to a straight line as possible. Linear relationships are special, and if a transformation of the type x^λ or y^λ works as in Equation (1), then we should consider changing our measurement scale for the rest of the statistical analysis.

There is no constraint on the values of λ that we may consider. Obviously, choosing λ = 1 leaves the data unchanged. Negative values of λ are also reasonable. For example, the relationship

    y = b0 + b1/x

would be represented by λ = −1. The value λ = 0 is of no use as a power, since x^0 = 1 for every x, which is just a constant. Tukey (1977) suggests that it is convenient to simply define the transformation when λ = 0 to be the logarithm function rather than the constant 1. We shall revisit this convention shortly. Table 1 gives examples of the Tukey ladder of transformations.

Table 1. Tukey's Ladder of Transformations

λ       -2       -1      -1/2     0        1/2     1      2
Xfm     1/x^2    1/x     1/√x     log x    √x      x      x^2

If x takes on negative values, then special care must be taken so that the transformations make sense, if possible. We generally limit ourselves to variables where x > 0 to avoid these considerations. For some dependent variables, such as the number of errors, it is convenient to add 1 to x before applying the transformation.

Also, if the transformation parameter λ is negative, then the transformed variable x^λ is reversed. For example, if x is increasing, then 1/x is decreasing. We choose to redefine the Tukey transformation to be −(x^λ) if λ < 0 in order to preserve the order of the variable after transformation. Formally, the Tukey transformation is defined as

    x̃λ = x^λ       if λ > 0
    x̃λ = log x      if λ = 0        (Equation 2)
    x̃λ = −(x^λ)    if λ < 0

In Table 2 we reproduce Table 1, but using the modified definition when λ < 0; a short code sketch of this definition follows the table.

Table 2. Modified Tukey's Ladder of Transformations

λ       -2        -1       -1/2      0        1/2     1      2
Xfm     -1/x^2    -1/x     -1/√x     log x    √x      x      x^2
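Here is a minimal Python sketch of Equation (2); the function name tukey_ladder is our own, not from any standard library, and it assumes x > 0:

    import math

    def tukey_ladder(x, lam):
        """Modified Tukey ladder-of-powers transformation (Equation 2).

        For negative lam the sign is flipped so that the transformed
        variable remains positively ordered with x.  Assumes x > 0.
        """
        if lam > 0:
            return x ** lam
        elif lam == 0:
            return math.log(x)
        else:
            return -(x ** lam)

    # Example: lam = 1/2 sends 1, 4, 9 to 1.0, 2.0, 3.0
    print([tukey_ladder(v, 0.5) for v in [1, 4, 9]])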
The Best Transformation for Linearity
The goal is to find a value of λ that makes the scatter diagram as linear as possible. For the US population, the logarithmic transformation applied to y makes the relationship almost perfectly linear. The red dashed line in the right frame of Figure 1 has a slope of about 1.35; that is, the US population grew at a rate of about 35% per decade. The logarithmic transformation corresponds to the choice λ = 0 by Tukey's convention. In Figure 2, we display the scatter diagram (x, ỹλ) of the US population data for λ = 0 as well as for other choices of λ.

Figure 2. The US population from 1670 to 1860 for various values of λ.

The raw data are plotted in the bottom right frame of Figure 2, where λ = 1. The logarithmic fit is in the upper right frame, where λ = 0. Notice how the scatter diagram smoothly morphs from concave to convex as λ increases. Thus, intuitively, there is a unique best choice of λ corresponding to the "most linear" graph.

One way to make this choice objective is to use an objective function for this purpose. One approach might be to fit a straight line to the transformed points and try to minimize the residuals. However, an easier approach is based on the fact that the correlation coefficient, r, is a measure of the linearity of a scatter diagram. In particular, if the points fall on a straight line, then their correlation will be r = 1. (We need not worry about the case when r = −1, since we have defined the Tukey transformed variable x̃λ to be positively correlated with x itself.)
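In code, the search for the "most linear" λ might look like the following sketch (our own, assuming NumPy; the grid of λ values and the function names are arbitrary choices):

    import numpy as np

    def tukey(x, lam):
        """Modified Tukey transformation of Equation (2); x must be > 0."""
        if abs(lam) < 1e-10:          # treat lambda ~ 0 as the log transform
            return np.log(x)
        return x ** lam if lam > 0 else -(x ** lam)

    def most_linear_lambda(x, y, grid=np.linspace(-1, 1, 201)):
        """Return the lambda in `grid` that makes the scatter of x against
        the transformed y most linear (largest correlation)."""
        corrs = [np.corrcoef(x, tukey(y, lam))[0, 1] for lam in grid]
        return grid[int(np.argmax(corrs))]

    # For data that grow exponentially in x, the result is close to 0 (the log).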
In Figure 3, we plot the correlation coefficient of the scatter diagram (x, ỹλ) as a function of λ. It is clear that the logarithmic transformation (λ = 0) is nearly optimal by this criterion.

Figure 3. Graph of the US population correlation coefficient as a function of λ.

Is the US population still on the same exponential growth pattern? In Figure 4 we display the US population from 1630 to 2000 using the transformation and fit used in the right frame of Figure 1. Fortunately, the exponential growth (or at least its rate) was not sustained into the Twentieth Century. If it had been, the US population in the year 2000 would have been over 2 billion (2.07 billion, to be exact), larger than the population of China.

Figure 4. Graph of the US population 1630-2000 with λ = 0.

We can examine the decennial census population figures of individual states as well. In Figure 5 we display the population data for the state of New York from 1790 to 2000, together with an estimate of the population in 2008. Clearly something unusual happened starting in 1970. (This began the period of mass migration to the West and South as the rust belt industries began to shut down.) Thus we compute the best λ value using the data from 1790-1960, shown in the middle frame of Figure 5. The right frame displays the transformed data, together with the linear fit for the 1790-1960 period. The value λ = 0.41 is not obvious, and one might reasonably choose to use λ = 0.50 for practical reasons.

Figure 5. Graphs related to the New York state population 1790-2008.
If we look at one of the younger states in the West, the picture is different. Arizona has attracted many retirees and immigrants. Figure 6 summarizes our findings. Indeed, the growth of the population in Arizona is well described by the logarithmic transformation (the best value is λ = −0.02, essentially zero), and appears to still be so through 2005.

Figure 6. Graphs related to the Arizona state population 1910-2005.

Reducing Skew
Many statistical methods such as t tests and the analysis of variance assume normal distributions. Although these methods are relatively robust to violations of normality, transforming the distributions to reduce skew can markedly increase their power.

As an example, the data in the Stereograms case study are very skewed. A t test of the difference between the two conditions using the raw data results in a p value of 0.056, a value not conventionally considered significant. However, after a log transformation (λ = 0) that greatly reduces the skew, the p value is 0.023, which is conventionally considered significant.
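To make the effect on skew concrete, here is a small sketch (our own, using SciPy and made-up positively skewed data rather than the actual Stereograms measurements) comparing the sample skewness before and after a log transformation:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    # Made-up positively skewed data standing in for the raw measurements
    raw = rng.lognormal(mean=1.0, sigma=0.8, size=100)

    print("skew of raw data:", round(skew(raw), 2))                      # clearly positive
    print("skew of log-transformed data:", round(skew(np.log(raw)), 2))  # near 0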
The demonstration in Figure 7 shows distributions of the data from the Stereograms case study as transformed with various values of λ. Decreasing λ makes the distribution less positively skewed. Keep in mind that λ = 1 is the raw data. Notice that there is a slight positive skew for λ = 0, but much less skew than found in the raw data (λ = 1). Values of λ below 0 result in negative skew.

Figure 7. Distribution of data from the Stereograms case study for various values of λ.

Box-Cox Transformations
by David Scott

Prerequisites
This section assumes a higher level of mathematics background than most other sections of this work.
• Chapter 1: Logarithms
• Chapter 3: Additional Measures of Central Tendency (Geometric Mean)
• Chapter 4: Bivariate Data
• Chapter 4: Values of Pearson Correlation
• Chapter 16: Tukey Ladder of Powers

George Box and Sir David Cox collaborated on one paper (Box & Cox, 1964). The story is that while Cox was visiting Box at Wisconsin, they decided they should write a paper together because of the similarity of their names (and that both are British). In fact, Professor Box is married to the daughter of Sir Ronald Fisher.

The Box-Cox transformation of the variable x is also indexed by λ, and is defined as

    x′λ = (x^λ − 1) / λ.        (Equation 3)
At first glance, although the formula in Equation (3) is a scaled version of the transformation x^λ, it does not appear to be the same as the Tukey formula in Equation (2). However, a closer look shows that when λ < 0, both x̃λ and x′λ change the sign of x^λ to preserve the ordering. Of more interest is the fact that when λ = 0, the Box-Cox variable is the indeterminate form 0/0. Rewriting the Box-Cox formula as

    x′λ = (e^(λ log x) − 1) / λ
        ≈ (1 + λ log x + (1/2) λ^2 (log x)^2 + ··· − 1) / λ
        → log x

as λ → 0. This same result may also be obtained using l'Hôpital's rule from your calculus course. This gives a rigorous explanation for Tukey's suggestion that the log transformation (which is not an example of a polynomial transformation) may be inserted at the value λ = 0.

Notice with this definition of x′λ that x = 1 always maps to the point x′λ = 0, for all values of λ. To see how the transformation works, look at the examples in Figure 1. In the top row, the choice λ = 1 simply shifts x to the value x − 1, which is a straight line. In the bottom row (on a semi-logarithmic scale), the choice λ = 0 corresponds to a logarithmic transformation, which is now a straight line. We superimpose a larger collection of transformations on a semi-logarithmic scale in Figure 2.

Figure 1. Examples of the Box-Cox transformation x′λ versus x for λ = −1, 0, 1. In the second row, x′λ is plotted against log(x). The red point is at (1, 0).

Figure 2. Examples of the Box-Cox transformation versus log(x) for −2 < λ < 3. The bottom curve corresponds to λ = −2 and the upper to λ = 3.
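A minimal Python sketch of Equation (3); the function name and the tolerance used to treat λ as zero are our own choices:

    import math

    def box_cox(x, lam, eps=1e-10):
        """Box-Cox transformation (Equation 3).  Assumes x > 0.

        Near lam = 0 the formula is the indeterminate form 0/0, so we
        use its limit, log(x), as derived above.
        """
        if abs(lam) < eps:
            return math.log(x)
        return (x ** lam - 1.0) / lam

    # x = 1 maps to 0 for every lambda, as noted in the text
    print(box_cox(1.0, 0.5), box_cox(1.0, 0.0), box_cox(1.0, 2.0))   # 0.0 0.0 0.0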
Transformation to Normality
Another important use of variable transformation is to eliminate skewness and other distributional features that complicate analysis. Often the goal is to find a simple transformation that leads to normality. In the article on q-q plots, we discuss how to assess the normality of a set of data,

    x1, x2, ..., xn.

Data that are normal lead to a straight line on the q-q plot. Since the correlation coefficient is maximized when a scatter diagram is linear, we can use the same approach as above to find the most normal transformation. Specifically, we form the n pairs

    ( Φ⁻¹((i − 0.5)/n), x(i) ),   for i = 1, 2, ..., n,

where Φ⁻¹ is the inverse CDF of the normal density and x(i) denotes the ith sorted value of the data set.

As an example, consider a large sample of British household incomes taken in 1973, normalized to have mean equal to one (n = 7,125). Such data are often strongly skewed, as is clear from Figure 3. The data were sorted and paired with the 7,125 normal quantiles. The value of λ that gave the greatest correlation (r = 0.9944) was λ = 0.21.

Figure 3. (L) Density plot of the 1973 British income data. (R) The best λ is 0.21.

The kernel density plot of the optimally transformed data is shown in the left frame of Figure 4. While this distribution is much less skewed than the one in Figure 3, there is clearly an extra "component" in the distribution that might reflect the poor. Economists often analyze the logarithm of income, corresponding to λ = 0; see Figure 4. The correlation is only r = 0.9901 in this case, but for convenience the log-transform probably will be preferred.

Figure 4. (L) Density plot of the 1973 British income data transformed with λ = 0.21. (R) The log-transform with λ = 0.
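The search for the "most normal" λ can be sketched in a few lines of Python (our own sketch, using SciPy's normal quantile function; the λ grid and function names are arbitrary):

    import numpy as np
    from scipy.stats import norm

    def box_cox(x, lam):
        """Box-Cox transform of a positive array (Equation 3)."""
        return np.log(x) if abs(lam) < 1e-10 else (x ** lam - 1.0) / lam

    def most_normal_lambda(data, grid=np.linspace(-2, 3, 101)):
        """Pick the lambda maximizing the correlation between the sorted,
        transformed data and the normal quantiles (the q-q plot criterion)."""
        n = len(data)
        quantiles = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # Phi^(-1)((i - 0.5)/n)
        x_sorted = np.sort(np.asarray(data, dtype=float))
        corrs = [np.corrcoef(quantiles, box_cox(x_sorted, lam))[0, 1] for lam in grid]
        return grid[int(np.argmax(corrs))]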
Other Applications
Regression analysis is another application where variable transformation is frequently applied. For the model

    y = β0 + β1x1 + β2x2 + ··· + βpxp + ε

and fitted model

    ŷ = b0 + b1x1 + b2x2 + ··· + bpxp,

each of the predictor variables xj can be transformed. The usual criterion is the variance of the residuals, given by

    (1/n) Σ (ŷi − yi)²,   the sum taken over i = 1, ..., n.
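As an illustrative sketch (our own, assuming a single positive predictor and NumPy), one might transform the predictor over a grid of λ values and keep the λ that minimizes this residual variance:

    import numpy as np

    def residual_variance(x, y):
        """Variance of the residuals from the least-squares line of y on x."""
        b1, b0 = np.polyfit(x, y, 1)
        resid = y - (b0 + b1 * x)
        return np.mean(resid ** 2)

    def best_predictor_lambda(x, y, grid=np.linspace(-2, 2, 81)):
        """Box-Cox-transform the predictor x and pick the lambda that
        minimizes the residual variance of the fitted line."""
        def bc(x, lam):
            return np.log(x) if abs(lam) < 1e-10 else (x ** lam - 1.0) / lam
        variances = [residual_variance(bc(x, lam), y) for lam in grid]
        return grid[int(np.argmin(variances))]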
Occasionally, the response variable y may be transformed. In this case, care must be taken because the variance of the residuals is not comparable as λ varies. Let ḡy represent the geometric mean of the response variables,

    ḡy = (y1 y2 ··· yn)^(1/n).

Then the transformed response is defined as

    y′λ = (y^λ − 1) / (λ · ḡy^(λ−1)).

When λ = 0 (the logarithmic case),

    y′0 = ḡy · log(y).

One may also simultaneously transform both the predictor and the response variables. For more examples and discussion, see Kutner, Nachtsheim, Neter, and Li (2004).
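A sketch of this scaled response transformation (our own; it assumes a positive response vector and NumPy):

    import numpy as np

    def scaled_response(y, lam):
        """Box-Cox response transform scaled by the geometric mean so that
        residual variances are comparable across different values of lambda."""
        y = np.asarray(y, dtype=float)
        g = np.exp(np.mean(np.log(y)))        # geometric mean of y
        if abs(lam) < 1e-10:                  # the logarithmic case
            return g * np.log(y)
        return (y ** lam - 1.0) / (lam * g ** (lam - 1.0))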
Statistical Literacy
by David M. Lane

Prerequisites
• Chapter 16: Log Transformations

Many financial web pages give you the option of using a linear or a logarithmic Y-axis. An example from Google Finance is shown below.

What do you think? To get a straight line with the linear option chosen, the price would have to go up by the same amount every time period. What would result in a straight line with the logarithmic option chosen?

The price would have to go up by the same proportion every time period; for example, go up 0.1% every day.

References
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211-252.

Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2004). Applied Linear Statistical Models. McGraw-Hill/Irwin, Homewood, IL.

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.

Exercises
Prerequisites
All Content in This Chapter

1. When is a log transformation valuable?

2. If the arithmetic mean of log10-transformed data were 3, what would be the geometric mean?

3. Using Tukey's ladder of transformations, transform the following data using a λ of 0.5: 9, 16, 25

4. What value of λ in Tukey's ladder decreases skew the most?

5. What value of λ in Tukey's ladder increases skew the most?

6. In the ADHD case study, transform the data in the placebo condition (D0) with λ's of 0.5, 0, −0.5, and −1. How does the skew in each of these compare to the skew in the raw data? Which transformation leads to the least skew?