Stat 407 Lab 4 Imputing Missing Values with S-Plus Fall 2001
Often the first step in working with multivariate data is dealing with missing values. In this lab we explore
how S-Plus deals with missing values in some of its functions, and work with an S-Plus library that provides
substitute values for missing data through multiple imputation.
When there are missing values in a data set, we need to determine the reasons for the missing information
before analyzing the data. From a statistical perspective this means understanding the distribution of missings: do the
missings occur randomly throughout the data matrix, or do they occur more frequently in certain combinations
of variables? If the data are missing completely at random (MCAR), then multiple imputation might be an effective
way to provide substitute values.
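One informal way to look at the distribution of missings is to tabulate which combinations of variables are missing together. The sketch below does this on a small made-up matrix (the values and the names `x`, `miss`, `pattern` are illustrative assumptions, not the kangaroo data). It is written in R-compatible S syntax and may need small adjustments in S-Plus. It describes the pattern only and is not a formal test of MCAR.

```r
# Toy data (hypothetical values, not the kangaroo measurements)
x <- cbind(a = c(1, NA, 3, 4, NA),
           b = c(2, 5, NA, 8, NA),
           c = c(9, 7, 6, 5, 4))

miss <- is.na(x)                      # TRUE where a value is missing

# Number of missing values in each variable
n.miss.by.var <- apply(miss, 2, sum)

# Collapse each row's missingness pattern to a string such as "110"
# (1 = missing, 0 = observed) and count how often each pattern occurs
pattern <- apply(miss * 1, 1, paste, collapse = "")
pattern.counts <- table(pattern)

print(n.miss.by.var)
print(pattern.counts)
```

If a few patterns account for most of the missing values, the missings are concentrated in particular combinations of variables rather than scattered through the matrix.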
The data set that we use today is from an historical study of kangaroos in Australia:
In 1803 a French research vessel captured 19 live kangaroos from Kangaroo Island (an island near
Adelaide in the Great Australian Bight), Tasmania, and the coast of the mainland. Despite a long
and arduous voyage, and diet consisting sometimes of rum and damper (a bread made with flour and
water), some live specimens reached France. At the end of the nineteenth century, only three of the
original specimens, now preserved, were still existing in Europe, one in London, one in Leiden and
one in Paris, and a great deal of taxonomic confusion arose. In order to classify the preserved skulls,
measurements were made on the skulls of 139 specimens of 3 species of kangaroos, M. giganteus (24
males, 24 females), M. melanops (20 males, 22 females), and M. fuliginosus (25 males, 24 females).
Another point to note is that kangaroo skulls continue to grow throughout their lifetime. The training
sample contains adult kangaroos, but the age of the preserved skulls is unknown.
The data set has the following variables:
LABEL ANIMAL SEX SPECIES SP.SEX.GP BASILAR OCCIPITONASAL PALATILAR
PALATEWIDTH NASALLENGTH NASALWIDTH SQUAMOSAL INTERLAC ZYGOMATIC
POSTORBITAL ROSTRAL SOPD CREST INCFOR MANDLENGTH MANDWIDTH MANDDEPTH
RAMUS
Concentrate just on this selection of variables: BASILAR, OCCIPITONASAL, PALATILAR, PALATEWIDTH,
SOPD, MANDLENGTH. Also ignore the last 3 lines. These are the measurements for the historic skulls.
1. Subset the data to exclude the 3 historic skulls and to keep just the columns of interest. Call it roos3.sub.
2. Calculate the summary statistics (mean, std dev, min, max) for each variable, and note the number of missing
values for each variable. Report these. How does S-Plus handle the missing values when calculating these
univariate statistics?
3. Calculate the variance-covariance matrix, and report it. Experiment with the different options (fail, omit,
include, available) for handling missing values. (Fail = don't calculate anything if there are missing values,
omit = exclude all cases with missing values, include = report a missing result if any values are missing, available
= omit cases only if they are missing one of the two variables used to calculate the covariance for
a given cell.) Are the results different between the omit and available options? If so, explain the differences.
4. Generate a scatterplot matrix. How do you think the missing values are handled?
5. Here we will impute missing values using the library called “norm” (available from http://www.stat.psu.edu/∼j
To load it on your PC, select the Load Library item on the File menu, and select the library norm from
the menu that subsequently comes up. (If you also load the description a help window giving details on the
library and how to use it will pop up.) Now use the scripting window to type in the following
commands.
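# Convert the data frame to a numeric matrix
x<-data.matrix(roos3.sub)
# Preliminary processing of the matrix for the other norm functions
s<-prelim.norm(x)
s$nmis   # number of missing values for each variable
s$r      # missingness patterns found in x
# EM estimates of the means and covariances
thetahat<-em.norm(s)
getparam.norm(s,thetahat,corr=T)
# Seed the random number generator; this must be done before da.norm or imp.norm
rngseed(1234567)
# Data augmentation starting from the EM estimate, then one imputed data set
theta<-da.norm(s,thetahat,steps=50,showits=T)
roos3.imp1<-imp.norm(s,theta,x)
# Repeat for the second and third imputations
theta<-da.norm(s,thetahat,steps=50)
roos3.imp2<-imp.norm(s,theta,x)
theta<-da.norm(s,thetahat,steps=50)
roos3.imp3<-imp.norm(s,theta,x)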
Once you have done these commands, you will see 3 new data files in the Objects window, called roos3.imp1,
roos3.imp2, roos3.imp3. These are copies of the full data matrix with substituted values for the missings.
Each should have different substitute values.
From the original data set, write down the matrix elements (row, col) where a value is missing, and
from each of the imputed data sets find the substitute value for each and tabulate these values, e.g.
Row   Col   Imp 1   Imp 2   Imp 3
1     4     231     240     220
3     4     250     233     240
...
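A table like this can also be assembled programmatically. The sketch below uses a small made-up matrix and two hand-filled stand-ins for imputed copies (`x`, `imp1`, `imp2` and their values are illustrative assumptions, not output from the norm library). It relies only on `row()`, `col()` and `is.na()`, so the same idea should carry over to `roos3.imp1`, `roos3.imp2` and `roos3.imp3`.

```r
# Toy matrix with two missing entries (hypothetical values)
x <- matrix(c(10, NA, 30, 40,
               1,  2, NA,  4), ncol = 2,
            dimnames = list(NULL, c("u", "v")))

# Stand-ins for imputed copies: same matrix with the gaps filled in
imp1 <- x; imp1[is.na(x)] <- c(21, 3.1)
imp2 <- x; imp2[is.na(x)] <- c(19, 2.8)

miss <- is.na(x)

# Row and column index of every missing cell, in column-major order
cells <- cbind(Row = row(x)[miss], Col = col(x)[miss])

# Append the substitute value from each imputed copy
tab <- cbind(cells, Imp1 = imp1[miss], Imp2 = imp2[miss])
print(tab)
```

Because the same logical mask is used to index every copy, the substitute values line up with the (Row, Col) pairs.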
Compute the summary statistics and variance-covariance matrices for each of these imputed data sets.
Compare the values to the summary statistics for the complete (non-missing) data, and to the variance-covariance
matrix calculated using the available method.
Generate a scatterplot matrix for each of the imputed data sets. Are there any oddities in the plots?
Does anything look different from the scatterplot matrix for the complete data? (Hand in these plots.)
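To make the difference between the omit and available options in step 3 concrete, the following sketch computes both by hand on a small made-up matrix. The data and the names `v.omit` and `v.avail` are illustrative assumptions. It reimplements the two definitions given in step 3 with basic operations and does not call S-Plus's own missing-value option.

```r
# Toy data (hypothetical values): one missing value in a, one in b
x <- cbind(a = c(1, 2, 3, 4, NA),
           b = c(2, 4, NA, 8, 10),
           c = c(5, 3, 4, 1, 2))
vars <- dimnames(x)[[2]]

# "omit": drop every case with any missing value, then compute covariances
complete <- !apply(is.na(x), 1, any)
v.omit <- var(x[complete, ])

# "available": for each pair of variables, use all cases where both are observed
p <- ncol(x)
v.avail <- matrix(NA, p, p, dimnames = list(vars, vars))
for (i in 1:p) {
  for (j in 1:p) {
    ok <- !is.na(x[, i]) & !is.na(x[, j])
    v.avail[i, j] <- var(x[ok, i], x[ok, j])
  }
}

print(v.omit)
print(v.avail)
```

Here the omit version uses only the 3 complete cases for every cell, while the available version uses 4 or 5 cases depending on the pair, so the two matrices differ. Note that a matrix built pair by pair from different subsets of cases is not guaranteed to be positive semi-definite.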