Re-expressing Data


Chapter 6 – Normal Model
– What if data do not follow a Normal model?

Chapters 8 & 9 – Linear Model
– What if a relationship between two variables is not linear?

Re-expressing Data

Re-expression is another name for changing the scale of (transforming) the data.
Usually we re-express the response variable, Y.

Goals of Re-expression

Goal 1 – Make the distribution of the re-expressed data more symmetric.
Goal 2 – Make the spread of the re-expressed data more similar across groups.
Goal 3 – Make the form of a scatter plot more linear.
Goal 4 – Make the scatter in the scatter plot more even across all values of the explanatory variable.

Ladder of Powers

Power: 2
Re-expression: $y^2$
Comment: Use on left-skewed data.

Power: 1
Re-expression: $y$
Comment: No re-expression. Do not re-express the data if they are already well behaved.

Power: ½
Re-expression: $\sqrt{y}$
Comment: Use on count data or when scatter in a scatter plot tends to increase as the explanatory variable increases.

Power: “0”
Re-expression: $\log y$
Comments: Not really the “0” power. Use on right-skewed data. Measurements cannot be negative or zero.

Power: –½, –1
Re-expression: $1/\sqrt{y}$, $1/y$
Comments: Use on right-skewed data. Measurements cannot be negative or zero. Use on ratios.

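The rungs of the ladder can be sketched as a small Python function. This is only an illustration: the name `re_express` and the tuple `LADDER` are made up here, and the natural log is assumed for the “0” rung (the slides write only "log").

```python
import math

# Rungs of the ladder of powers, from the top (2) to the bottom (-1).
LADDER = (2, 1, 0.5, 0, -0.5, -1)


def re_express(y, power):
    """Re-express a single value y using one rung of the ladder.

    Hypothetical helper. The "0" rung is taken to be the natural log,
    which is an assumption, not something the ladder itself fixes.
    """
    if power not in LADDER:
        raise ValueError("power must be one of the ladder rungs")
    if power <= 0 and y <= 0:
        # Log and reciprocal rungs: measurements cannot be negative or zero.
        raise ValueError("this rung needs y > 0")
    if power == 0.5 and y < 0:
        raise ValueError("square root needs y >= 0")
    if power == 0:
        return math.log(y)
    return y ** power


# Apply every rung to one positive value.
for p in LADDER:
    print(p, re_express(16.0, p))
```

Moving down the ladder pulls in a long right tail more and more strongly, which is why the lower rungs are the ones suggested for right-skewed data.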
Goal 1 - Symmetry

Data are obtained on the time between nerve pulses along a nerve fiber.
Time is rounded to the nearest half unit, where a unit is 1/50th of a second.
– 30.5 represents 30.5/50 = 0.61 sec

[Figure: Normal quantile plot and histogram (Count) of Time, in units of 1/50th sec.]

Time – Nerve Pulses

Distribution is skewed right.
Sample mean (12.305) is much larger than the sample median (7.5).
Many potential outliers.
Data are not from a Normal model.

[Figure: Normal quantile plot and histogram (Count) of Sqrt(Time).]

[Figure: Normal quantile plot and histogram (Count) of Log(Time).]

Summary

Time – Highly skewed to the right.
Sqrt(Time) – Still skewed right.
Log(Time) – Fairly symmetric and mounded in the middle.
– Could have come from a Normal model.

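The same comparison can be tried numerically. The sketch below uses simulated right-skewed values, not the nerve-pulse data, and a simple moment-based skewness (average cubed z-score) as a rough stand-in for judging symmetry by eye; the function name `skewness` and the simulation settings are assumptions made for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness: the average cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


# Hypothetical right-skewed "times": simulated, NOT the nerve-pulse data.
random.seed(0)
times = [random.lognormvariate(2.0, 1.0) for _ in range(2000)]

skews = {
    "Time": skewness(times),
    "Sqrt(Time)": skewness([math.sqrt(t) for t in times]),
    "Log(Time)": skewness([math.log(t) for t in times]),
}

# A mean well above the median is another sign of right skew.
print("mean > median:", statistics.fmean(times) > statistics.median(times))
for label, value in skews.items():
    print(f"{label:>10}: skewness = {value:.2f}")
```

For data like these, the skewness should shrink at each step down the ladder and sit near zero on the log scale, mirroring the pattern described for the histograms.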
Goal 3 – Straighten Up

What is the relationship between the temperature of coffee and the time since it was poured?
– Y, temperature (°F)
– X, time (minutes)

[Figure: Bivariate Fit of Temp By Time (min). Scatter plot of Temp versus Time (min).]

Cooling Coffee

There is a general negative association: as time since the coffee was poured increases, the temperature of the coffee decreases.

Linear Model

[Figure: Linear model plot of Temp (F) versus Time (min).]

Linear Model Fit

Summary
– Predicted Temp = 176.7 – 1.56*Time
– On average, temperature decreases 1.56 °F per minute.
– $R^2 = 0.99$; 99% of the variation in temperature is explained by the linear relationship with time.

Plot of Residuals

[Figure: Residual versus Time (min) for the linear fit of Temp.]

Curved Pattern

There is a clear pattern in the plot of residuals versus time.
– Under predict, over predict, under predict.

The linear fit is very good, but we can do better.

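The under/over/under pattern is what a straight line produces when the true relationship curves the way a cooling curve does. A minimal sketch with made-up, noise-free data (the starting value 180 and rate 0.011 are hypothetical, not the coffee measurements) fits a least-squares line and prints the sign of each residual:

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical exponential cooling curve, NOT the data behind the slides.
times = list(range(0, 61, 5))
temps = [180.0 * math.exp(-0.011 * t) for t in times]

intercept, slope = fit_line(times, temps)
residuals = [y - (intercept + slope * t) for t, y in zip(times, temps)]

# "+" means the line under-predicts, "-" means it over-predicts.
print("".join("+" if r > 0 else "-" for r in residuals))
```

The printed signs are positive at both ends and negative in the middle, even though a straight line explains nearly all of the variation in such data.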
[Figure: Bivariate Fit of Log(Temp) By Time (min), with Linear Fit.]

Log(Temp) by Time

Summary
– Predicted Log(Temp) = 5.1946 – 0.0114*Time
– On average, log temperature decreases 0.0114 log(°F) per minute.

Plot of Residuals

[Figure: Residual versus Time (min) for the linear fit of Log(Temp).]

Interpretation

There is a random scatter of points around the zero line.
The linear model relating Log(Temp) to Time is the best we can do.

Original Scale?

Predicted Log(Temp) = 5.1946 – 0.0114*Time
Predicted Temp = $180.3 \cdot e^{-0.0114 \cdot \text{Time}}$, since $e^{5.1946} \approx 180.3$
– Predicted temp at time = 0 is 180.3 °F.
– The predicted temp one minute later is the predicted temp now multiplied by $e^{-0.0114} = 0.98866$.

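The back-transformation can be checked with a few lines of Python. The coefficients are the ones reported above; the function name `predicted_temp` is made up for this sketch, and Log is taken to be the natural log, which is consistent with the 180.3 figure.

```python
import math

INTERCEPT = 5.1946   # fitted intercept on the Log(Temp) scale
SLOPE = -0.0114      # fitted slope on the Log(Temp) scale, per minute


def predicted_temp(time_min):
    """Predicted temperature (deg F) on the original scale."""
    return math.exp(INTERCEPT + SLOPE * time_min)


start_temp = math.exp(INTERCEPT)      # predicted temp at time = 0
per_minute_factor = math.exp(SLOPE)   # multiplier for each extra minute

print(round(start_temp, 1))           # 180.3
print(round(per_minute_factor, 5))    # 0.98866
```

A constant subtraction on the log scale turns into a constant multiplication on the original scale, so the predicted temperature falls by about 1.1% each minute rather than by a fixed number of degrees.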
JMP

Method 1
– Create a new column in JMP, Log(Temp): Cols – Formula – Transcendental – Log.
– Fit Y by X
  – Y – Log(Temp)
  – X – Time
– Fit Linear

Method 2
– Fit Y by X
  – Y – Temp
  – X – Time
– Fit Special
  – Transform Y – Log

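For readers without JMP, the idea behind Method 1 (create a Log(Temp) column, then fit a line) can be sketched in plain Python. This is not JMP output: the data are generated, noise-free, from the fitted curve reported earlier, so the fit simply recovers those coefficients. The helper `fit_line` is hypothetical, and the natural log is assumed.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Generated values lying exactly on Temp = 180.3 * e^(-0.0114 * Time);
# these are NOT the original coffee measurements.
times = list(range(0, 61, 5))
temps = [180.3 * math.exp(-0.0114 * t) for t in times]

# Step 1: create the Log(Temp) "column".
log_temps = [math.log(temp) for temp in temps]

# Step 2: fit a straight line to Log(Temp) versus Time.
intercept, slope = fit_line(times, log_temps)
print(f"Predicted Log(Temp) = {intercept:.4f} + ({slope:.4f})*Time")
```

With real measurements the points would scatter around the line, and the residuals from this fit are what should be checked for a leftover pattern.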