STATS 330: Lecture 10
Diagnostics 2
Aim of today’s lecture
 To describe some more remedies for non-planar data
 To look at diagnostics and remedies for non-constant scatter.
Remedies for non-planar data (cont)
 Last time we looked at diagnostics for non-planar data
 We discussed what to do if the diagnostics indicate a problem.
 The short answer was: we transform, so that the model fits the transformed data.
 How to choose a transformation?
• Theory
• Ladder of powers
• Polynomials
 We illustrate with a few examples
Example: Using theory – cherry trees
A tree trunk is a bit like a cylinder:
Volume = π × (diameter/2)² × height
so
log(volume) = log(π/4) + 2 log(diameter) + log(height)
and a linear regression using the logged variables should work!
In fact R² increases from 95% to 98%, and the residual plots are better.
Example: cherry trees (cont)

> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

Call:
lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -6.63162    0.79979  -8.292 5.06e-09 ***
log(diameter)   1.98265    0.07501  26.432  < 2e-16 ***
log(height)     1.11712    0.20444   5.464 7.81e-06 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777,     Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF,  p-value: < 2.2e-16

(Previously R² was 94.8%.)
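As a quick check of the cylinder theory (an addition, not on the original slide): the confidence intervals for the logged-variable coefficients should cover the theoretical values, 2 for log(diameter) and 1 for log(height).

# Sketch: do the intervals cover the slopes implied by the cylinder model?
confint(new.reg)   # expect the intervals to cover 2 and 1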
Example: cherry trees (original)

[Figure: Residuals vs Fitted plot for lm(formula = volume ~ diameter + height, data = cherry.df); points 2, 18 and 31 flagged.]
Example: cherry trees (logs)

[Figure: Residuals vs Fitted plot for lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df); points 15, 16 and 18 flagged.]
Tyre abrasion data: gam plots

library(mgcv)   # assumed source of gam(); the slides don't say which package is loaded
rubber.gam = gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)

[Figure: the two fitted smooths, s(hardness,1) against hardness and s(tensile,5.42) against tensile; the hardness effect is linear, the tensile effect is curved.]
Tyre abrasion data: polynomial

The GAM curve is like a polynomial, so fit a polynomial (i.e. include terms tensile², tensile³, …):

lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)

(the 4 is the degree of the polynomial)

Usually a lot of trial and error is involved! We have succeeded when
• R² improves
• Residual plots show no pattern
A 4th degree polynomial works for the rubber data: R² increases from 84% to 94% (see the comparison sketch below).
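A minimal trial-and-error sketch (assuming rubber.df is loaded): fit increasing degrees and compare the nested fits with anova():

# Sketch: compare polynomial degrees by nested F-tests
fit2 <- lm(abloss ~ hardness + poly(tensile, 2), data = rubber.df)
fit3 <- lm(abloss ~ hardness + poly(tensile, 3), data = rubber.df)
fit4 <- lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)
anova(fit2, fit3, fit4)   # does each extra degree significantly improve the fit?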
Why 4th degree?

Try a 5th degree polynomial:

> rubber.lm = lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        615.3617    29.8178  20.637 2.44e-16 ***
poly(tensile,5)1  -264.3933    25.0612 -10.550 2.76e-10 ***
poly(tensile,5)2    23.6148    25.3437   0.932 0.361129
poly(tensile,5)3   119.9500    24.6356   4.869 6.46e-05 ***
poly(tensile,5)4   -91.6951    23.6920  -3.870 0.000776 ***  <- highest significant power
poly(tensile,5)5     9.3811    23.6684   0.396 0.695495
hardness            -6.2608     0.4199 -14.911 2.59e-13 ***

Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427,     Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF,  p-value: 3.931e-13
Ladder of powers
 Rather than fit polynomials in some independent variables, guided by gam plots, we can transform the response using the “ladder of powers” (i.e. use y^p as the response rather than y, for some power p)
 Choose p either by trial and error using R², or use a “Box-Cox plot” – see later in this lecture, and the sketch below
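A minimal trial-and-error sketch (dat, y, x1 and x2 are illustrative placeholder names, not from the slides):

# Sketch: walk the ladder of powers and record R-squared for each fit
powers <- c(-2, -1, -0.5, 0.5, 1, 2)
r2 <- sapply(powers, function(p) {
  fit <- lm(I(y^p) ~ x1 + x2, data = dat)
  summary(fit)$r.squared
})
cbind(powers, r2)   # pick the power with the best-behaved fit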
Checking for equal scatter
 The model specifies that the scatter about the regression plane is uniform
 In practice this means that the scatter doesn’t depend on the explanatory variables or the mean of the response
 All tests and confidence intervals rely on this
Scatter
 Scatter is measured by the size of the residuals
 A common problem is where the scatter increases as the mean response increases
 This means the big residuals happen when the fitted values are big
 Recognize this by a “funnel effect” in the residuals versus fitted values plot (see the sketch below)
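A minimal sketch of the check, using the education data introduced on the next slide (an addition to the slides):

# Sketch: look for a funnel in the residuals vs fitted values plot
fit <- lm(educ ~ urban + percap + under18, data = educ.df)   # as fitted later
plot(fit, which = 1)                      # residuals vs fitted values
plot(fitted(fit), abs(residuals(fit)))    # do |residuals| grow with the mean?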
Example: Education expenditure data
 Data for the 50 states of the USA
 Variables are
• Per capita expenditure on education (response), variable educ
• Per capita income, variable percap
• Number of residents per 1000 under 18, variable under18
• Number of residents per 1000 in urban areas, variable urban
 Fit model educ ~ percap + under18 + urban
[Figure: pairs plot of the response educ against percap, under18 and urban; one point is a clear outlier in the educ panels.]
Outlier, pt 50 (California)
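One quick way to confirm which point is influential (an addition, not on the original slide) is a Cook's distance plot:

# Sketch: Cook's distance flags influential points; point 50 (California)
# stands out. educ.lm is the fit shown on the next slide.
educ.lm <- lm(educ ~ urban + percap + under18, data = educ.df)
plot(educ.lm, which = 4)   # Cook's distance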
Basic fit, outlier in

> educ.lm = lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562  123.46634  -4.503 4.56e-05 ***
urban         -0.00476    0.05174  -0.092    0.927
percap         0.07236    0.01165   6.211 1.40e-07 ***
under18        1.55134    0.31545   4.918 1.16e-05 ***

Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902,     Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF,  p-value: 5.271e-09

R² is 59%.
Basic fit, outlier out

> educ50.lm = lm(educ ~ urban + percap + under18, data = educ.df,
                 subset = -50)        # see how we exclude pt 50
> summary(educ50.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947,     Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF,  p-value: 8.365e-07

R² is now 49%.
> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))

[Figure: Residuals vs Fitted plot, showing a funnel effect, and Scale-Location plot, showing an increasing relationship; points 7, 10 and 15 flagged.]
Remedies
 Either transform the response
 Or estimate the variances of the observations and use “weighted least squares”
Transforming the response

> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18,
                     data = educ.df[-50,])   # transform to the reciprocal
> plot(tr.educ50.lm)

[Figure: Residuals vs Fitted plot for lm(I(1/educ) ~ urban + percap + under18): no funnel effect, better! Points 10, 15 and 47 flagged.]
What power to choose?
 How did we know to use reciprocals?
 Think of a more general model
I(educ^p) ~ percap + under18 + urban
where p is some power
 Then estimate p from the data using a Box-Cox plot (see the next slide, and the sketch below)
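If the R330 package is not to hand, a similar picture comes from boxcox() in the MASS package (a standard alternative, not the function the slides use):

# Sketch: Box-Cox profile log-likelihood via MASS
library(MASS)
boxcox(lm(educ ~ urban + percap + under18, data = educ.df[-50,]),
       lambda = seq(-2, 2, by = 0.1))   # a peak near lambda = -1 points to 1/educ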
Transforming the response (how?)

boxcoxplot(educ ~ urban + percap + under18, educ.df[-50,])   # an R330 function; draws a “Box-Cox plot”

[Figure: Box-Cox plot of the profile likelihood against p, for p from -2 to 2; the minimum is at about p = -1.]
Weighted least squares
 Tests are invalid if observations do not have constant variance
 If the ith observation has variance vi σ², then we can get a valid test by using “weighted least squares”, minimising the sum of the weighted squared residuals Σ ri²/vi rather than the sum of squared residuals Σ ri²
 Need to know the variances vi (see the sketch below)
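A toy sketch of why weights = 1/vi in lm() does the right thing (the data here are simulated, not from the lecture): weighted least squares is just ordinary least squares after rescaling each observation by 1/sqrt(vi).

# Toy sketch: weighted LS equals ordinary LS on rescaled data
set.seed(1)
x <- 1:20
v <- x^2                                   # variance grows with the mean
y <- 2 + 3*x + rnorm(20, sd = sqrt(v))
w.fit <- lm(y ~ x, weights = 1/v)          # minimises sum(r^2 / v)
m.fit <- lm(I(y/sqrt(v)) ~ I(1/sqrt(v)) + I(x/sqrt(v)) - 1)  # same fit by hand
coef(w.fit); coef(m.fit)                   # identical coefficients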
Finding the weights
 Step 1: Plot the squared residuals versus the fitted values
 Step 2: Smooth the plot
 Step 3: Estimate the variance of an observation by the smoothed squared residual
 Step 4: The weight is the reciprocal of the smoothed squared residual
 Rationale: variance is a function of the mean
 Use the R330 function funnel (a hand-rolled version is sketched below)
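A hand-rolled sketch of the four steps (lowess() is one possible smoother here, an assumption rather than necessarily what funnel uses):

# Sketch: estimate weights from smoothed squared residuals
r2 <- residuals(educ50.lm)^2               # Step 1: squared residuals
fv <- fitted(educ50.lm)
sm <- lowess(fv, r2)                       # Step 2: smooth the plot
vars <- approx(sm$x, sm$y, xout = fv)$y    # Step 3: smoothed value estimates the variance
w <- 1/vars                                # Step 4: weights are the reciprocals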
Doing it in R (1)

vars = funnel(educ50.lm)   # an R330 function

[Figure: funnel's two plots: squared residuals versus fitted values, and log std. errors versus log means with a fitted line of slope 1.7, indicating p = 1 - 1.7 = -0.7.]
> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50,])
> summary(educ50.lm)

Coefficients:                              (note the p-values)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947,     Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF,  p-value: 8.365e-07
> weighted.lm <- lm(educ ~ urban + percap + under18,
                    weights = 1/vars, data = educ.df[-50,])   # note the reciprocals!
> summary(weighted.lm)

Coefficients:                              (note how the p-values change!)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363  102.61073  -2.634   0.0115 *
urban          0.01197    0.04030   0.297   0.7677
percap         0.05850    0.01027   5.694 8.88e-07 ***
under18        0.82384    0.27234   3.025   0.0041 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629,     Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF,  p-value: 8.944e-10

Conclusion: unequal variances matter! They can change the results!