Stat 407 Lab 4 Imputing Missing Values with S-Plus Fall 2001 SOLUTION
Often the first step in working with multivariate data is dealing with missing values. In this lab we explore
how S-Plus handles missing values in some of its functions, and work with an S-Plus library that provides
substitute values for missing data through multiple imputation.
When there are missing values in a data set, we need to determine the reasons for the missing information
before analyzing the data. From a statistical perspective this means understanding the distribution of the
missing values: do they occur randomly throughout the data matrix, or do they occur more frequently in certain
combinations of variables? If the data are missing completely at random (MCAR), then multiple imputation might
be an effective way to provide substitute values.
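One way to look at the distribution of missings is to tabulate how often each pattern of missingness occurs across the rows. The following is a minimal sketch in Python (not S-Plus), using a made-up toy data set; the names rows, missing_pattern and pattern_counts are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy rows (None marks a missing value); not the kangaroo data.
rows = [
    (1500, 250.0, 650.0),
    (1400, None, 640.0),
    (1100, None, None),
    (1600, 270.0, 700.0),
    (1150, None, 480.0),
]

def missing_pattern(row):
    """Return a string such as '.M.' with 'M' wherever a value is missing."""
    return "".join("M" if v is None else "." for v in row)

# Count how often each pattern of missingness occurs across the rows.
pattern_counts = Counter(missing_pattern(r) for r in rows)

for pattern, count in sorted(pattern_counts.items()):
    print(pattern, count)
```

If a few patterns dominate, or the missings cluster in particular variables, that is a hint that the data may not be missing completely at random.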
The data set that we use today is from an historical study of kangaroos in Australia:
In 1803 a French research vessel captured 19 live kangaroos from Kangaroo Island (an island near
Adelaide in the Great Australian Bight), Tasmania, and the coast of the mainland. Despite a long
and arduous voyage, and diet consisting sometimes of rum and damper (a bread made with flour and
water), some live specimens reached France. At the end of the nineteenth century, only three of the
original specimens, now preserved, were still existing in Europe, one in London, one in Leiden and
one in Paris, and a great deal of taxonomic confusion arose. In order to classify the preserved skulls,
measurements were made on the skulls of 139 specimens of 3 species of kangaroos, M. giganteus (24
males, 24 females), M. melanops (20 males, 22 females), and M. fuliginosus (25 males, 24 females).
Another point to note is that kangaroo skulls continue to grow throughout their lifetime. The training
sample contains adult kangaroos, but the age of the preserved skulls is unknown.
The data set has the following variables:
LABEL ANIMAL SEX SPECIES SP.SEX.GP BASILAR OCCIPITONASAL PALATILAR
PALATEWIDTH NASALLENGTH NASALWIDTH SQUAMOSAL INTERLAC ZYGOMATIC
POSTORBITAL ROSTRAL SOPD CREST INCFOR MANDLENGTH MANDWIDTH MANDDEPTH
RAMUS
Concentrate just on this selection of variables: BASILAR, OCCIPITONASAL, PALATILAR, PALATEWIDTH,
SOPD, MANDLENGTH. Also ignore the last 3 lines of the data, which are the measurements for the historic skulls.
1. Subset the data to exclude the 3 historic skulls and keep just the columns of interest. Call it roos3.sub.
2. Calculate the summary statistics (mean, std dev, min, max) for each variable, and note the number of missing
values for each variable. Report these. How does S-Plus handle the missing values when calculating these
univariate statistics?
*** Summary Statistics for data in: roos3.sub ***

               BASILAR  OCCIPITONASAL  PALATILAR  PALATEWIDTH   SOPD  MANDLENGTH
Min:              1048           1145        693        185.0  462.0         880
Mean:             1497           1562       1026        258.2  653.5        1254
Max:              1893           1945       1315        332.0  798.0        1568
Total N:           139            139        139        139.0  139.0         139
NA’s:                0              0          0         20.0   10.0          12
Std Dev.:          162            151        121         30.1   66.1         143

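S-Plus calculates the summary statistics omitting the missings on a univariate basis. That is, for the first
3 variables, where there were no missings, all 139 observations were used to calculate the mean. For the
other variables only the non-missing values were used: 119 for PALATEWIDTH, 129 for SOPD and 127 for
MANDLENGTH.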
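The univariate handling described above can be sketched as follows. This is an illustrative Python version (not the S-Plus summary routine), with hypothetical toy columns and a hypothetical helper univariate_summary; each column is summarised from its own non-missing values, so the number of values used can differ from column to column.

```python
import math

# Hypothetical toy columns (None marks a missing value); not the kangaroo data.
columns = {
    "A": [10.0, 12.0, 14.0, 16.0],
    "B": [1.0, None, 3.0, None],
}

def univariate_summary(values):
    """Summarise one column using only its own non-missing values."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    mean = sum(observed) / n
    # Sample variance with the n - 1 divisor.
    variance = sum((v - mean) ** 2 for v in observed) / (n - 1)
    return {
        "min": min(observed),
        "mean": mean,
        "max": max(observed),
        "total_n": len(values),
        "nas": len(values) - n,
        "std_dev": math.sqrt(variance),
    }

summaries = {name: univariate_summary(col) for name, col in columns.items()}
for name, stats in summaries.items():
    print(name, stats)
```

Column A is summarised from all 4 values, while column B is summarised from the 2 values that are present.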
3. Calculate the variance-covariance matrix, and report. Experiment with the different options (fail, omit,
include, available) for handling missing values. (Fail = don’t calculate anything if there are missing values,
omit = exclude all cases with missings, include = report a missing if any values are missing, available
= omit a case only if it has a missing value for one of the two variables used to calculate the covariance in
a given cell.) Are the results different between the omit and available options? If so, explain the differences.
Omit:

               BASILAR  OCCIPITONASAL  PALATILAR  PALATEWIDTH   SOPD  MANDLENGTH
BASILAR          19830          18059      14747         3249   7076       17579
OCCIPITONASAL    18059          18346      13727         3088   6601       15969
PALATILAR        14747          13727      11405         2572   5342       13266
PALATEWIDTH       3249           3088       2572          931   1306        3137
SOPD              7076           6601       5342         1306   2971        6508
MANDLENGTH       17579          15969      13266         3137   6508       16195

Available:

               BASILAR  OCCIPITONASAL  PALATILAR  PALATEWIDTH   SOPD  MANDLENGTH
BASILAR          26144          23477      19261         3217  10178       22308
OCCIPITONASAL    23477          22912      17486         2993   9251       20329
PALATILAR        19261          17486      14644         2555   7622       16598
PALATEWIDTH       3217           2993       2555          904   1333        2988
SOPD             10178           9251       7622         1333   4367        9051
MANDLENGTH       22308          20329      16598         2988   9051       20325

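There are considerable differences between the variances and covariances depending on the method used
for handling missing values. For example, the variances and covariance for BASILAR and OCCIPITONASAL
are much lower for the Omit method than for the Available method. Under Omit, every case with a missing
value on any variable is dropped, so even the fully observed variables are summarised from fewer cases;
under Available, BASILAR and OCCIPITONASAL use all 139 cases.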
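The difference between the two options can be seen on a small example. The sketch below is in Python (not the S-Plus var function) and uses hypothetical toy data; cov_omit and cov_available are illustrative names for casewise and pairwise deletion respectively.

```python
# Hypothetical toy data (None marks a missing value); not the kangaroo data.
data = {
    "X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Z": [None, 1.0, 2.0, 3.0, None],
}

def covariance(xs, ys):
    """Sample covariance (n - 1 divisor) of two equal-length complete lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def cov_omit(data, a, b):
    """Casewise deletion: keep only rows complete on every variable."""
    n_rows = len(data[a])
    keep = [i for i in range(n_rows)
            if all(col[i] is not None for col in data.values())]
    return covariance([data[a][i] for i in keep], [data[b][i] for i in keep])

def cov_available(data, a, b):
    """Pairwise deletion: keep rows complete on the two variables a and b."""
    keep = [i for i in range(len(data[a]))
            if data[a][i] is not None and data[b][i] is not None]
    return covariance([data[a][i] for i in keep], [data[b][i] for i in keep])

print("omit     ", cov_omit(data, "X", "Y"))
print("available", cov_available(data, "X", "Y"))
```

In this toy data Z is missing for the most extreme rows, so dropping those rows entirely shrinks the covariance of X and Y even though X and Y themselves are fully observed. For a pair that involves Z, both methods use the same rows and agree.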
4. Generate a scatterplot matrix. How do you think the missing values are handled?
My guess is that missings are omitted on a pairwise basis. A case is omitted from a panel only if it is
missing on one of the two variables plotted in that panel.
5. Here we will impute missing values using the library called “norm” (available from http://www.stat.psu.edu/∼j
To load it on your PC, select the Load Library item on the File menu, and select the library norm from
the menu that subsequently comes up. (If you also load the description, a help window giving details on the
library and how to use it will pop up.) You will then need to use the scripting window to type in the
following commands.
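# Convert the data frame to a numeric matrix
x<-data.matrix(roos3.sub)
# Preliminary manipulations needed by the other norm functions
s<-prelim.norm(x)
# Number of missing values per variable, and the missingness patterns
s$nmis
s$r
# EM estimate of the mean vector and covariance matrix
thetahat<-em.norm(s)
getparam.norm(s,thetahat,corr=T)
# Seed the random number generator used by da.norm and imp.norm
rngseed(1234567)
# Imputation 1: 50 steps of data augmentation starting from the EM estimate
theta<-da.norm(s,thetahat,steps=50,showits=T)
roos3.imp1<-imp.norm(s,theta,x)
# Imputation 2
theta<-da.norm(s,thetahat,steps=50)
roos3.imp2<-imp.norm(s,theta,x)
# Imputation 3
theta<-da.norm(s,thetahat,steps=50)
roos3.imp3<-imp.norm(s,theta,x)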
Once you have done these commands, you will see 3 new data files in the Objects window, called roos3.imp1,
roos3.imp2, roos3.imp3. These are copies of the full data matrix with substituted values for the missings.
Each should have different substitute values.
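To see why each imputed data set contains different substitute values, it can help to look at a much simpler scheme than the one used by norm. The sketch below is in Python and uses stochastic regression imputation on hypothetical toy data: a missing y is replaced by a regression prediction from x plus random normal noise. This is an assumption-laden stand-in for illustration only; it is not the EM and data augmentation procedure that norm runs, and the names fit_line and impute_once are made up.

```python
import random

# Hypothetical toy data; y has missing values (None), x is fully observed.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 11.8]

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual std dev from complete pairs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5

def impute_once(x, y, seed):
    """Fill each missing y with a regression prediction plus random noise."""
    rng = random.Random(seed)
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    intercept, slope, sigma = fit_line([a for a, _ in pairs],
                                       [b for _, b in pairs])
    return [b if b is not None else intercept + slope * a + rng.gauss(0.0, sigma)
            for a, b in zip(x, y)]

imp1 = impute_once(x, y, seed=1)
imp2 = impute_once(x, y, seed=2)
print(imp1)
print(imp2)
```

The observed values are identical in both completed data sets; only the filled-in cells change from one imputation to the next, because each uses a fresh random draw.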
From the original data set, write down the matrix elements (row, col) where a value is missing, and
from each of the imputed data sets find the substitute value for each. Tabulate these values, e.g.
Row  Col  Imp 1  Imp 2  Imp 3
  1    4    197    240    220
  3    4    224    233    240
  5    4    203    ...    ...
  6    4    169    ...    ...
 19    6   1305    ...    ...
Make sure the list of indices is accurate. The imputed values will vary from simulation to simulation. They
are “random” values.
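Building the list of indices by hand is error-prone, so it may be worth generating it. The following Python sketch (not S-Plus) uses small hypothetical matrices, with None marking a missing value in the original; missing_cells and table are illustrative names. It finds the 1-based (row, col) positions of the missing cells and lines up the substitute value from each imputed matrix.

```python
# Hypothetical toy matrices; None marks a missing value in the original.
original = [
    [10.0, None],
    [12.0, 5.0],
    [None, 6.0],
]
imputations = [
    [[10.0, 4.7], [12.0, 5.0], [11.3, 6.0]],
    [[10.0, 5.4], [12.0, 5.0], [10.6, 6.0]],
]

# 1-based (row, col) positions of the missing cells, as in the table above.
missing_cells = [
    (i + 1, j + 1)
    for i, row in enumerate(original)
    for j, value in enumerate(row)
    if value is None
]

# One table row per missing cell: row, col, then the value from each imputation.
table = [
    (r, c, *[imp[r - 1][c - 1] for imp in imputations])
    for r, c in missing_cells
]

print("Row Col " + " ".join(f"Imp {k + 1}" for k in range(len(imputations))))
for entry in table:
    print(*entry)
```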
Compute the summary statistics and variance-covariance matrices for each of these imputed data sets.
Compare the values to the summary statistics for the complete (non-missing) data, and to the
variance-covariance matrix calculated using the available method.
Imputation 1

               BASILAR  OCCIPITONASAL  PALATILAR  PALATEWIDTH   SOPD  MANDLENGTH
Min:              1048           1145        693        168.6  462.0         880
Mean:             1497           1562       1026        254.5  653.3        1263
Max:              1893           1945       1315        332.0  798.0        1568
Std Dev.:          162            151        121         33.4   64.9         145

               BASILAR  OCCIPITONASAL  PALATILAR  PALATEWIDTH   SOPD  MANDLENGTH
BASILAR          26144          23477      19261         4228   9930       22990
OCCIPITONASAL    23477          22912      17486         3889   9030       20551
PALATILAR        19261          17486      14644         3294   7407       17203
PALATEWIDTH       4228           3889       3294         1117   1799        4059
SOPD              9930           9030       7407         1799   4218        9034
MANDLENGTH       22990          20551      17203         4059   9034       21041

Note that the means and variance/covariances are the same where we have complete data. They differ for
variables where there were missing values. The values are similar to the available method.
Generate a scatterplot matrix for each of the imputed data sets. Are there any oddities in the plots?
Is there anything that looks different from the scatterplot matrix for the complete data? (Hand in these plots.)
You should notice that some of the missing values correspond to values on other variables that are very low,
e.g. BASILAR vs PALATEWIDTH. The imputed values then occur as low values on PALATEWIDTH, but
look consistent with the trend.