Lecture

advertisement
•
•
•
Multiple comparisons
Basics of linear regression
Model selection
Multiple comparison
•
•
•
One comparison - use t-test or equivalent
Few comparisons - use Bonferroni
Many comparisons - Tukey’s honest significant differences, Holm, Scheffe or
others
Bonferroni correction
If there is only one comparison then we can use t-test or intervals based on t
distribution. However if the number of tests increases then probability that
significant effect will be observed when there is no significant effect becomes
very large. It can be calculated using 1-(1-)n, where  is significance level and
n is the number of comparisons. For example if the significance level is 0.05
and the number of comparisons (tests) is 10 then the probability that at least one
significant effect will be detected by chance is 1-(1-0.05)10=0.40. Bonferroni
suggested using /n instead of  for designing simultaneous confidence
intervals. It means that the intervals will be calculated using
i-j  t/(2n)(se of comparison)0.5
Clearly when n becomes very large these intervals will become very conservative.
Bonferroni correction is recommended when only few effects are compared.
pairwise.t.test in R can do Bonferroni correction.
If Bonferoni correction is used then p values are multiplied by the number of
comparisons (Note that of we are testing effects of I levels of factor then the
number of comparisons is I(I-1)/2
Bonferroni correction: Example
Let us take the example dataset - poisons and try Bonferroni correction for each
factor:
pairwise.t.test(poison,treat,”none”,data=poisons)
1
2
2 0.32844 3 3.3e-05 0.00074
pairwise.t.test(poison,treat,”bonferroni”,data=poisons)
1
2
2 0.9853 3 1e-04 0.0022
As it is seen each p-value is multiplied by the number of comparisons 3*2/2 = 3. If
the corresponding adjusted p-value becomes more than one then it is truncated
to one.
It says that there are significant differences between effects of poisons 1 and 3 and
between 2 and 3. Difference between effects of poisons 1 and 2 is not
significant.
Holm correction
Another correction for multiple tests – Holm’s correction is less conservative than
Bonferroni correction. It is a modification of Bonferroni correction. It works in
a sequential manner.
Let us say we need to make n comparisons and significant level is . Then we
calculate p values for all of them and sort them in ascending order and apply
the following procedure:
1)
set i = 1
2)
If (n-i+1) pi<  then it is significant, otherwise it is not.
3)
If a comparison number i is significant then increment i by one and if i  n go
to the step 2
The number of significant effects will be equal to i where the procedure stops.
When reporting p-values Holm correction works similar to Bonferroni but in a
sequential manner. If we have m comparisons then the smallest p value is
multiplied by m, the second smallest is multiplied by m-1, j-th comparison is
multiplied by m-j+1
Holm correction: example
Let us take the example - the data set poisons and try Holm correction for each factor:
pairwise.t.test(poison,treat,”none”,data=poisons)
1
2
2 0.32844 3 3.3e-05 0.00074
pairwise.t.test(poison,treat,”holm”,data=poisons) # this correction is the default in R
1
2
2 0.3284 3 1e-04 0.0015
The smallest is multiplied by 3 the second by 2 and the largest by 1
It says that there is significant differences between effects of poisons 1 and 3 and between 2
and 3. Difference between effects of poisons 1 and 2 is not significant.
Tukey’s honest significant difference
This test is used to calculate simultaneous confidence intervals for differences of all
effects.
Tukey’s range distribution. If we have a random sample of size N from normal
distribution then the distribution of stundentised range - (maxi(xi)-mini(xi))/sd is
called Tukey’s distribution.
Let us say we want to test if i- j is 0. For simultaneous 100% confidence intervals
we need to calculate for all pairs lower and upper limits of the interval using:
difference ± ql, sd (1/Ji+1/Jj)0.5 /2
Where q is the -quantile of Tukey’s distribution, Ji and Jj are the numbers of
observations used to calculate i and j, sd is the standard deviation, l is the
number of levels to be compared and  is the degree of freedom used to
calculate sd.
Tukey’s honest significant difference
R command to perform this test is TukeyHSD. It takes an object derived using aov as an input
and gives confidence intervals for all possible differences. For example for poison data
(if you want to use this command you should use aov for analysis)
lm1 = aov(time~poison+treat,data=poisons)
TukeyHSD(lm1)
$poison
diff
lwr
upr
p adj
2-1 -0.073125 -0.2089936 0.0627436 0.3989657 # insignifacnt
3-1 -0.341250 -0.4771186 -0.2053814 0.0000008 # significant
3-2 -0.268125 -0.4039936 -0.1322564 0.0000606 # significant
$treat
B-A
C-A
D-A
C-B
D-B
D-C
diff
0.36250000
0.07833333
0.22000000
-0.28416667
-0.14250000
0.14166667
lwr
upr
p adj
0.18976135 0.53523865 0.0000083 #sginificant
-0.09440532 0.25107198 0.6221729 #insgnific
0.04726135 0.39273865 0.0076661 #significant
-0.45690532 -0.11142802 0.0004090 #significant
-0.31523865 0.03023865 0.1380432 #insignific
-0.03107198 0.31440532 0.1416151 #insignific
More tests
Mann-Whitney-Wilcoxon test – non-parametric test for median
Kolmogorov-Smirnoff test – test if distribution of two random variables are same
Grubbs test – tests for outliers (more in R package – outliers)
Basics of linear regression analysis
•
•
•
Purpose of linear models
Least-squares solution for linear models
Analysis of diagnostics
Reason for linear models
Purpose of regression is to reveal statistical relations between input and output variables.
Statistics cannot reveal functional relationship. It is purpose of other scientific studies. Statistics
can help validation of various functional relationship (models). Let us assume that we suspect
that functional relationship has a form:
y = f (x, b,e)
where  is a vector of unknown parameters, x=(x1,x2,,,xp) a vector of controllable parameters, and
y is output,  is an error associated with the experiment. Then we can set up experiments for
various values of x and get output (or response) for them. If the number of experiments is n then
we will have n output values. Denote them as a vector y=(y1,y2,,,yn). Purpose of statistics is to
evaluate parameter vector using input and output values. If function f is a linear function of the
parameters and errors are additive then we are dealing with linear model. For this model we can
write
yi = b1 x1i + b2 x2i + ... + b p x pi + e i
Linear model is linearly dependent on parameters but not on input variables. For example
y = b0 + b1 x1 + b2 x12 + e
is a linear model. But
y = b0 + b1 x1 + b12 x2 + e
is not.
Assumptions
Basic assumptions for analysis of linear model are:
1)
the model is linear in parameters
2)
the error structure is additive
3)
Random errors have 0 mean, equal variances and they are uncorrelated.
These assumptions are sufficient to deal with linear models. Uncorrelated with equal
variance assumptions (number 3) can be removed. Then the treatments
becomes a little bit more complicated.
Note that for general solution, normal distribution of errors assumption is not used.
This assumption is necessary to design test statistics. If this assumption breaks
down then we can use bootstrap or other approximate techniques to design test
statistic.
These assumptions can be written in a vector form:
y = Xb + e, E(e) = 0, V(e) = s 2I
where y, 0,  are vectors and I and X are matrices. The matrix X is called a design
matrix, input matrix etc. I is nxn identity matrix.
Solution for the general case
Problem of linear model fitting into the data is reduced to mimimisation:
Its solution is:
bˆ = (XT X)-1 XT y
This estimation is unbiased:
E(bˆ ) = b
Covariance matrix of the estimator is:
V(bˆ ) = s 2 (XT X)-1
And variance of the error is estimated as:
p-1
1 n 2
1 n
sˆ =
ri =
(yi -å bˆ j xij )2
å
å
n - p i=1
n - p i=1
j=0
2
Note that correlation does not depend on the data directly, it only depends on the amount of data
and how it has been collected (design of experiment is important).
Standard diagnostics
Before starting to model
1)
Visualisation of the data:
1)
2)
plotting predictor vs observations. These plots may give a clue about the relationship,
outliers
Smootheners
After modelling and fitting
2)
Fitted values vs residuals. It may help to identify outliers, correctness of the
model
3)
Normal QQ plot of residuals. It may help to check distribution assumptions
4)
Cook’s distance, reveals outliers, checks correctness of the model
5)
Model assumptions - t tests given by the default print of lm
Checking model and designing tests
3)
Cross-validation. If you have a choice of models then cross-validation may help
to choose the “best” model
Visualisation prior to modelling
Different type of datasets may require different visualisation tools. For simple
visualisation either plot(data) or pairs(data,panel=panel.smooth) could be used.
Visualisation prior to modelling may help to propose a model (form of the functional
relationship between input and output, probability distribution of observation etc)
For example for dataset women - where weights and heights for 15 cases have been
measured. plot and pairs commands produce these plots:
After modelling: linear models
After modelling the results should be analysed. For example
attach(women)
lm1 = lm(weight~height)
It means that we want a liner model (we believe that dependence of weight on height is
linear):
weight=0+1*height
Results could be viewed using
lm1
summary(lm1)
The last command will produce significance of various coefficients also. Significance
levels produced by summary should be considered carefully. If there are many
coefficients then the chance that one “significant” effect is observed is very high.
After modelling: linear models
It is a good idea to plot data and fitted model, and differences between fitted and
observed values on the same graph. For linear models with one predictor it can be
done using:
plot(weight,height)
abline(lm1)
segements(weight,fitted(lm1),weight,height)
This plot already shows some systematic
differences. It is an indication that model
may need to be revised.
Checking validity of the model: standard tools
Plotting fitted values vs residual, QQ plot and Cook’s distance can give some insight
into model and how to improve it. Some of these plots can be done using
plot(lm1)
Confidence and prediction bands
Confidence bands are those where true line would belong with 1-α probability
P( f (bˆ, x) - c1(x) < f (b, x) < f (bˆ, x) - c1(x)) =1- a
Prediction bands are bands where future observation would full with probability of 1-α:
P( f (bˆ, x) - c 2 (x) < y future < f (bˆ, x) - c 2 (x)) =1- a
Prediction bands are wider than confidence bands.
Prediction and confidence bands
lm1 = lm(height~weight)
pp = predict(lm1,interval='p')
pc = predict(lm1,interval='c')
plot(weight,height,ylim=range(height,pp))
n1=order(weight)
matlines(weight[n1],pp[n1,],lty=c(1,2,2),col='red')
matlines(weight[n1],pc[n1,],lty=c(1,3,3),col='red’)
These commands produce two sets
of bands: narrow and wide.
Narrow band corresponds to
confidence bands and wide band is
prediction band
Analysis of diagnostics
Residuals and hat matrix: Residuals are differences between observation and fitted
values:
r = y - yˆ = y - Xbˆ = y - X(XT X)-1 XT y = (I - X(XT X)-1 XT )y = (I - H)y
H is called a hat matrix. Diagonal terms hi are leverage of the observations. If these
values are close to one then that fitted value is determined by this observation.
Sometimes hi’=hi/(1-hi) is used to enhance high leverages.
Q-Q plot can be used to check normality assumption. If assumption on distribution is
correct then this plot should be nearly linear. If the distribution is normal then tests
designed for normal distributions can be used. Otherwise bootstrap may be more
useful to derive the desired distributions.
Analysis of diagnostics: Cont.
Other analysis tools include:
ri' = ri /( s(1 - hi )1 / 2 ) standardiz ed (stutendiz ed) residual
ri* = ri /( si (1 - hi )1 / 2 ) externally standardiz ed residual
Ci = (hi ' )1 / 2 ri*
DFFITS
Where hi is leverage, hi’ is enhanced leverage, s2 is unbiased estimator of 2, si2 is
unbiased estimator of 2 after removal of i-th observation
R-squared
One of the statistics used for goodness of fit is R-squared. It is defined as:
SSe
SSt
SSe = å (y i - yˆ i ) 2
R 2 = 1-
SSt = å (y i - y ) 2
yˆ i is fitted values.
Adjusted R-squared:
2
Radj
= 1- (1- R 2 )
n -1
n - p -1
n - the number of observations, p - the number of parameters
Several potential problems with linear model and
simple regression analysis
1) Interpretation
Regression analysis can tell us that there is a relationship between input and output. It does not
say one causes another
2) Outliers
Use robust M-estimators
3) Singular or near singular cases
Ridge regression or SVD.
In underdetermined systems Partial least squares may be helpful
4) Distribution assumptions
Generalised linear models
Random or mixed effect models
Data transformation
5) Non-linear model
Box-Cox transformation
Non-linear model fitting for example using R command nlm
Model Selection
•
•
•
Model selection based on t-distribution
Information criteria
Cross-validation
Introduction
When modelling relationship between input and out variables we need to make decision
about the form of the model. If there are several input parameters then modelling
decision could be: which input variable is important in predicting the output. Let us say
that we have several input variables and the proposed model is linear:
y=β0 + β1 x1+β2 x2 + ε
If some of the coefficients are 0 then corresponding input variable has no effect on
predicting response – y.
Another situation may arise when we want to explain output using polynomials on input
variables.
y=β0 + β1 x1+β2 x12 + ε
In these cases we may want to know if some of the coefficients are 0. E.g. if β2 is 0 then
relationship between input and output is linear. Otherwise relationship may be higher
order polynomials. In general if we have functions of increasing complexity gk then it is
likely that functions of higher complexity will give us better agreement between input
and output. One of the ways of calculating agreement is residual sum of square (RSS):
RSS = Σ(yi-gk(xi))2
Where i is the observation number and k is the function (model) number.
Introduction
One very simple way of testing if a particular coefficient of the linear model is 0 is
testing hypothesis:
H0:βj=0 versus H1:βj≠0
As we already know, to do this test we need to know the distribution of the estimated
parameters. We know that for large samples the distribution of the parameters
asymptotically is normal. We also know that covariance matrix of this distribution is:
cov(b ) = sˆ 2 (X T X)-1
n
n
p-1
1
1
sˆ =
ri2 =
(y i -å bˆ j x ij ) 2
å
å
n - p i=1
n - p i=1
j= 0
2
Where all xi0 are 1. The number of observations is n and the number of parameters is p.
Under H0 the distribution of βj is normal distribution with mean 0 and the variance
equal to the diagonal term of the covariance matrix. Therefore the distribution of
bˆ j
sˆ jj
is t-distribution with n-p degrees of freedom.
Example
160
150
130
120
130
120
64
66
68
70
72
58
60
62
64
66
68
70
58
72
60
62
64
66
height
height
140
130
140
weight
150
150
160
160
height
120
weight
62
130
60
120
58
140
weight
150
140
weight
140
130
120
weight
150
160
160
Let us take an example of weight vs height and look at the results of linear model fitting using
polynomials on input variable (height) up to fifth order. Plots show that second order polynomial
fits sufficiently well. After the second order improvements are marginal. Natural question would
be which model is better, can we justify more complex model based on the observations we have.
58
60
62
64
66
height
68
70
72
58
60
62
64
66
height
68
70
72
68
70
72
Model selection using t-distribution: summary of lm
method
The distribution of estimated parameters are used in the summary of lm command. Let
us say we want to see if some of the coefficients of the model:
y=β0 + β1 x1+β2 x12 + β1 x13 +β2 x14 + β1 x15 + ε
After fitting the model using
lm5 = lm(weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5))
summary(lm5) will produce a table that will have information about significance for each
estimated parameter:
Coefficients:
(Intercept)
height
I(height^2)
I(height^3)
I(height^4)
I(height^5)
Estimate Std. Error t value Pr(>|t|)
9.144e+04 8.300e+04
1.102
0.299
-6.947e+03 6.418e+03 -1.083
0.307
2.107e+02 1.982e+02
1.063
0.315
-3.187e+00 3.058e+00 -1.042
0.325
2.403e-02 2.355e-02
1.020
0.334
-7.224e-05 7.246e-05 -0.997
0.345
Residual standard error: 0.2245 on 9 degrees of freedom
Multiple R-squared: 0.9999,
Adjusted R-squared: 0.9998
F-statistic: 1.335e+04 on 5 and 9 DF, p-value: < 2.2e-16
Model selection using t-distribution
More details of this table will be described in the tutorial. Here we are interested in notrejecting null-hypothesis (that particular parameter is 0). If we are using t-distribution to
to make such decision then we need to take the largest p-value and remove the
corresponding coefficient from the model. In this case it is the coefficient corresponding
to height5. After removing that we can do model fitting using the reduced model. We
use:
lm4 = lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))
summary(lm4)
It will produce significance levels for coefficients
Coefficients:
(Intercept)
height
I(height^2)
I(height^3)
I(height^4)
Estimate Std. Error t value Pr(>|t|)
8.824e+03 4.550e+03
1.939
0.0812 .
-5.552e+02 2.814e+02 -1.973
0.0768 .
1.319e+01 6.515e+00
2.024
0.0705 .
-1.389e-01 6.693e-02 -2.076
0.0646 .
5.507e-04 2.574e-04
2.140
0.0581 .
Now we cannot remove any of the coefficients, p-values are not large enough to say
that particular coefficient is insignificant.
Model selection using t-distribution
summary command after lm command allows a simple way of model selection. It would
work as follows:
1)
Specify full model (it may be a problem. Model always can be made more
complex)
2)
Fit the model using lm
3)
Use summary and remove the coefficient with the highest p-value (Say values
that are more than 0.1. This value is arbitrary and depends how much would one want
to reduce the model)
4)
Repeat 2) and 3) until model can no longer be reduced
Warning: Using multiple hypothesis testing may produce misleading results. For
example if we are working on a significance level α and we want to test 10 hypothesis
then the probability that at least one of the hypothesis will be significant when it is not
is 1-(1-α)10. If α=0.05 then this probability will be 0.4 and if we are testing 100
hypothesis then this probability will be 0.99. When we used summary we deliberately
did not use rejection of null-hypothesis. We used high p-values not to reject nullhypothesis.
Akaike’s information criterion
Another model selection tool is Akaike’s information criterion (AIC). This technique is
based on the value of likelihood at the maximum and uses the number of parameters to
penalise complex models. Let us remind us the maximum likelihood technique.
Likelihood is proportional to the conditional distribution of observations (outputs,
responses) given parameters of the model:
L(y | b ) µ P(y | b )
In maximum likelihood estimation we maximise the likelihood function wrt the
parameters of the model. By making model more and more complex we can get larger
and larger likelihood values. But is more complex model justified?
Once we have maximised the likelihood function we can calculate the value of the
likelihood at the maximum. Akaike’s information criterion as implemented in R uses
the following formula:
AIC = -2log(L(y | bˆ ) + c * p
Where c is a constant (may depend on the number of observations) that defines various
forms of information criterion, p is the effective number of parameters in the model. In
linear model p is equal to the number of parameters. When c=2 then we have the classic
AIC and when c=log(n) then we have Bayesian information criterion (BIC).
Akaike’s information criterion
To apply AIC to the cases of linear models we need to specify likelihood function and
find its value at the maximum. We can consider linear models with least square
minimisation as a special case of maximum likelihood.
n
L(y | b ) = Õ
i=1
1
e
2ps i
-
n
(y i -g(x,b )2
2s i2
1
=
(2p )
n
2
e
n
Õs
-
å
i =1
(y i -g(x, b )2
2s i2
i
i=1
Where σι is the standard deviation of the i-th observation. When all σι are equal to each
other (denote them as σ) and they are unknown then we can write:
Once we have found the maximum wrt β we can carry on and find the maximum wrt σ.
n
1
sˆ = å (y i - g(x, bˆ )) 2
n i=1
2
Note that the maximum likelihood estimator for the variance is not unbiased. Unbiased
estimator is
2
sˆ unbiased
n
1
=
(y i - g(x, bˆ )) 2
å
n - p i=1
Akaike’s information criterion
Let us find the value of the likelihood at the maximum. We will have:
L(y | bˆ ) =
1
n
2
n
2 2
e
-
n
2
(2p ) (sˆ )
If we put it in the expression of AIC we will have:
RSS
AIC1 = -2log(L(y | bˆ ) + cp = n + n log(2p ) + n log(sˆ 2 ) + cp = n + n log(2p ) + n log(
) + cp
n
In model selection the number of observations do not change, so AIC can be written:
AIC = n log(
RSS
) + cp
n
If we use more complex model then RSS will be smaller. On the other hand model with
smaller number of parameters may have better predictive power. AIC attempts to
combine these two conflicting ideas together.
When comparing two models, model with smaller AIC is considered to be better.
Akaike’s information criterion in R
There are number of functions in R that uses AIC. The simplest of them is extractAIC.
This function gives the number of parameters as well as AIC. Another function is step.
In its simplest form this function removes one at a time a parameter and calculates the
AIC. If removal of a parameter decreases AIC then this parameter is removed from
further consideration. The procedure is applied iteratively. In the following slide a result
of step function is shown.
Akaike’s information criterion
Start: AIC=-40.48
weight ~ height + I(height^2) + I(height^3) + I(height^4) + I(height^5)
Df Sum of Sq RSS AIC
Stage 1: Procedure removes one
- I(height^5) 1 0.050 0.504 -40.910
parameter at a time. Removal of height5
- I(height^4) 1 0.052 0.506 -40.840
reduces AIC
- I(height^3) 1 0.055 0.508 -40.773
- I(height^2) 1 0.057 0.510 -40.708
- height
1 0.059 0.513 -40.646
<none>
0.454 -40.482
Step: AIC=-40.91
weight ~ height + I(height^2) + I(height^3) + I(height^4)
Df Sum of Sq RSS AIC
<none>
0.504 -40.910
Stage 2: Procedure tries further reduce
- height
1 0.196 0.700 -37.979
the number of parameters. But the
- I(height^2) 1 0.206 0.710 -37.759
“best” model is the full model.
- I(height^3) 1 0.217 0.721 -37.536
- I(height^4) 1 0.231 0.734 -37.256
Call:
lm(formula = weight ~ height + I(height^2) + I(height^3) + I(height^4))
Coefficients:
Final stage: The “best” model is given
(Intercept)
height I(height^2) I(height^3) I(height^4)
8.824e+03 -5.552e+02 1.319e+01 -1.389e-01 5.507e-04
Akaike’s information criterion: further improvements
There are several additional forms of AIC. They try to correct to small samples. For
example corrected AIC has an expression:
AICc = AIC +
2 p( p + 1)
n - p -1
R does not produce AICc (I am not aware if it produces). However it can be calculated
using (for example)
extractAIC(lm4)+2*p*(p+1)/(n-p-1)
Where n is the number of observations and p is the number of parameters. For example
for fourth order polynomial fit for weight <–> height relationship we can use:
lm4=lm(weight ~ height + I(height^2) + I(height^3) + I(height^4))
extractAIC(lm4)+2*5*6/(15-5-1)
It will produce the value -34.24 and for the reduced model
lm3=lm(weight ~ height + I(height^2) + I(height^3))
extractAIC(lm3)+2*4*5/(15-4-1)
It will produce -33.26. Since more complex model has lower AICc, it is preferred.
AIC, BIC
Both AIC and BIC attempt to produce minimal best model that describes the
observations as best as possible. BIC reduces (prefers simpler model) more than AIC.
In general in applications it is better to use common sense and look at the parameters
very carefully. If you can find interpretation of all terms and reasons for them being
there then they may be valid. Even when the “best” model has been selected it would
be better to come back to model selection again if new observations become available.
Absolute values of AIC (or BIC) are not important. Differences between AIC-s for
different models are important.
AIC (and BIC) can be used to compare models with the same observations. If
observations change (for example few outliers are removed or new observations
become available) then AIC should be applied for all models again.
Cross-validation
If we have a sample of observations, can we use this sample and choose among given models.
Cross validation attempts to reduce overfitting thus helps model selection.
Description of cross-validation: We have a sample of the size n – (yi,xi) .
1)
Divide the sample into K roughly equal size parts.
2)
For the k-th part, estimate parameters using K-1 parts excluding k-th part. Calculate
prediction error for k-th part.
3)
Repeat it for all k=1,2,,,K and combine all prediction errors and get cross-validation
prediction error.
If K=n then we will have leave-one-out cross-validation technique.
Let us denote an estimate at the k-th step by k (it is a vector of parameters). Let k-th subset of
the sample be Ak and number of points in this subset is Nk.. Then prediction error per
observation is:
PE =
1 K 1
å
K k =1 N k
å ( y - g ( x, a
i
k
))2
iÎAk
Then we would choose the function that gives the smallest prediction error. We can expect that in
future when we will have new observations this function will give smallest prediction
error.
This technique is widely used in modern statistical analysis. It is not restricted to least-squares
technique. Instead of least-squares we could could use other techniques such as maximumlikelihood, Bayesian estimation, M-estimation.
Cross-validation in R
There is no cross-validation method for lm in R but there is a cross-validation function
for glm (generalised linear method). This method requires that resoponses are stored in
the output from model fitting. We can change a little bit the output from lm and then
use cross-validation for glm. Following sequence of commands will allow us to do this
(it is just an example):
lm1 = lm(weight~height)
lm1$y = weight # Add responses to the output from lm
cv.glm(lm1)$delta # Do cross-validation
cv.glm function by default uses leave one out cross-validation.
Cross-validation in R
Let us apply cross-valdiation for weight~height relationship.
require(boot)
require(MASS)
lm1=lm(weight~height,data=women)
lm1$y=women$weight
cv.glm(women,lm1)$delta
This sequence of command will produce the following. So prediction error is 3.04.
1
1
3.040776 3.002450
lm1=lm(weight~height+I(height^2),data=women)
lm1$y=women$weight
cv.glm(women,lm1)$delta
It produces
1
1
0.2198339 0.2156849
Prediction error is reduced to 0.22. Further application will produce prediction errors 0.15, 0.13,
0.18. The minimum seems to be when polynomial of fourth order is used. However differences
between third and fourth order is not as substantial as that between first and second order.
Depending on circumstances we would choose models as polynomila of second, third or fourth
order.
R commands
lm – model fit
summary (or summary.lm) – will give summary about the model including pvalues for each model parameter
extractAIC – to extract Akaikes information criterion
extractAIC – with k=log(n) as an argument it will produce Bayesian information
criterion
stepAIC (or step) – will use AIC to step by step model comparision
add1 – will add one parameter at a time and calculate AIC
drop1 – will drop one parameter at a time and calculate AIC
cv.glm – from the library boot. It is suitable for the objects produced by glm,
however slight modification allows it to be used for lm object also.
References
1.
2.
3.
Berthold, M. and Hand, DJ (2003) “Intelligent data analysis”
Stuart, A., Ord, JK, and Arnold, S. (1991) “Kendall’s advanced Theory of
statistics. Volume 2A. Classical Inference and the Linear models” Arnold
publisher, London, Sydney, Auckland
Dalgaard, P (2002) “Introductory statistics with R”, Springer Publishers
Download