Re-expressing Data

Chapter 6 - Normal Model
  - What if data do not follow a Normal model?
Chapters 8 & 9 - Linear Model
  - What if a relationship between two variables is not linear?

Re-expressing Data

Re-expression is another name for changing the scale of (transforming) the data.
Usually we re-express the response variable, Y.

Goals of Re-expression

Goal 1 - Make the distribution of the re-expressed data more symmetric.
Goal 2 - Make the spread of the re-expressed data more similar across groups.
Goal 3 - Make the form of a scatter plot more linear.
Goal 4 - Make the scatter in the scatter plot more even across all values of the explanatory variable.

Ladder of Powers

Power: 2
Re-expression: y²
Comment: Use on left skewed data.

Power: 1
Re-expression: y
Comment: No re-expression. Do not re-express the data if they are already well behaved.

Power: ½
Re-expression: √y
Comment: Use on count data or when scatter in a scatter plot tends to increase as the explanatory variable increases.

Power: "0"
Re-expression: log y
Comments: Not really the "0" power. Use on right skewed data. Measurements cannot be negative or zero.

Power: -½, -1
Re-expression: 1/√y, 1/y
Comments: Use on right skewed data. Measurements cannot be negative or zero. Use on ratios.

Goal 1 - Symmetry

Data are obtained on the time between nerve pulses along a nerve fiber.
Time is rounded to the nearest half unit, where a unit is 1/50th of a second.
  - 30.5 represents 30.5/50 = 0.61 sec

[Figure: Normal quantile plot and histogram (Count) of Time (1/50th sec)]

Time - Nerve Pulses

Distribution is skewed right.
Sample mean (12.305) is much larger than the sample median (7.5).
Many potential outliers.
Data not from a Normal model.
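Moving down the ladder of powers can be tried numerically before looking at any plots. The Python sketch below is only an illustration: the sample values are made up (they are NOT the nerve pulse data), and `reexpress` and `skew_gap` are hypothetical helper names. The skew gap, (mean - median) / standard deviation, is a rough indicator that echoes the mean-versus-median comparison above; it is not a formal test.

```python
import math
import statistics


def reexpress(y, power):
    """Re-express one positive measurement with a rung of the ladder of powers.

    Power 0 stands for the natural log; it is not literally the 0 power.
    """
    if power == 0:
        return math.log(y)
    return y ** power


def skew_gap(values):
    """(mean - median) / standard deviation: a rough skewness indicator.

    Positive values suggest right skew; values near zero suggest symmetry.
    """
    gap = statistics.mean(values) - statistics.median(values)
    return gap / statistics.stdev(values)


# Made-up right-skewed sample for illustration only (assumption, not real data).
sample = [1, 2, 2, 3, 4, 5, 7, 10, 16, 30, 60]

# Step down the ladder: power 1 (no re-expression), 1/2 (square root), "0" (log).
gaps = {}
for power, label in [(1, "y"), (0.5, "sqrt(y)"), (0, "log(y)")]:
    transformed = [reexpress(y, power) for y in sample]
    gaps[label] = skew_gap(transformed)
    print(f"{label:8s} skew gap = {gaps[label]:.3f}")
```

For this sample the gap shrinks at each step down the ladder, which is the pattern to look for with right skewed data. A histogram and Normal quantile plot of the re-expressed values should still be checked.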
[Figure: Normal quantile plot and histogram (Count) of Sqrt(Time)]

[Figure: Normal quantile plot and histogram (Count) of Log(Time)]

Summary

Time - Highly skewed to the right.
Sqrt(Time) - Still skewed right.
Log(Time) - Fairly symmetric and mounded in the middle.
  - Could have come from a Normal model.

Goal 3 - Straighten Up

What is the relationship between the temperature of coffee and the time since it was poured?
  - Y, temperature (°F)
  - X, time (minutes)

[Figure: Bivariate Fit of Temp By Time (min)]

Cooling Coffee

There is a general negative association: as time since the coffee was poured increases, the temperature of the coffee decreases.

[Figure: Linear Model, Temp (F) versus Time (min)]

Linear Model

Fit Summary
  - Predicted Temp = 176.7 - 1.56*Time
  - On average, temperature decreases 1.56 °F per minute.
  - R² = 0.99; 99% of the variation in temperature is explained by the linear relationship with time.

[Figure: Plot of Residuals, Residual versus Time (min)]

Curved Pattern

There is a clear pattern in the plot of residuals versus time.
  - Under predict, over predict, under predict.
The linear fit is very good, but we can do better.

[Figure: Bivariate Fit of Log(Temp) By Time (min), with Linear Fit]

Log(Temp) by Time

Summary
  - Predicted Log(Temp) = 5.1946 - 0.0114*Time
  - On average, log temperature decreases 0.0114 log(°F) per minute.

[Figure: Plot of Residuals, Residual versus Time (min)]

Interpretation

There is a random scatter of points around the zero line.
The linear model relating Log(Temp) to Time is the best we can do.

Original Scale?
Predicted Log(Temp) = 5.1946 - 0.0114*Time
Predicted Temp = 180.3*e^(-0.0114*Time)
  - Log here is the natural log, so undoing it means exponentiating both sides: e^5.1946 = 180.3.
  - Predicted temp at time = 0: 180.3 °F.
  - The predicted temp in one more minute is the predicted temp now multiplied by e^(-0.0114) = 0.98866.

JMP Method 1

  - Create a new column in JMP, Log(Temp): Cols - Formula - Transcendental - Log.
  - Fit Y by X
      - Y: Log(Temp)
      - X: Time
  - Fit Linear

JMP Method 2

  - Fit Y by X
      - Y: Temp
      - X: Time
  - Fit Special
      - Transform Y: Log
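The return to the original scale can also be checked outside JMP. The Python sketch below uses only the two fitted coefficients reported on the slides and assumes Log means the natural log (consistent with 180.3 = e^5.1946). The function names are hypothetical, and no coffee data are used or reconstructed.

```python
import math

# Coefficients of the fitted line for Log(Temp), copied from the slides.
INTERCEPT = 5.1946   # predicted Log(Temp) at Time = 0
SLOPE = -0.0114      # change in predicted Log(Temp) per minute


def predicted_log_temp(time_min):
    """Prediction on the log scale: 5.1946 - 0.0114 * Time."""
    return INTERCEPT + SLOPE * time_min


def predicted_temp(time_min):
    """Back-transform to degrees F by exponentiating (assumes natural log)."""
    return math.exp(predicted_log_temp(time_min))


start_temp = predicted_temp(0)        # e^5.1946
per_minute_factor = math.exp(SLOPE)   # e^(-0.0114)

print(f"Predicted temp at time 0: {start_temp:.1f} F")
print(f"Multiplier per additional minute: {per_minute_factor:.5f}")
for t in (10, 30, 60):
    print(f"Predicted temp at {t} min: {predicted_temp(t):.1f} F")
```

Because the model is linear on the log scale, each additional minute multiplies the predicted temperature by the same factor rather than subtracting a fixed number of degrees.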