Stat 407 Lab 4 Imputing Missing Values with S-Plus Fall 2001

Often the first step in working with multivariate data is dealing with missing values. In this lab we explore how S-Plus deals with missing values in some of its functions, and work with a library for S-Plus that provides substitute values for missing data by multiple imputation.

When there are missing values in a data set, before analyzing the data we need to determine the reasons for the missing information. From a statistical perspective this means understanding the distribution of missings: do the missings occur randomly throughout the data matrix, or do they occur more frequently in certain combinations of variables? If data are missing completely at random (MCAR) then multiple imputation might be an effective way to provide substitute values.

The data set that we use today is from an historical study of kangaroos in Australia:

In 1803 a French research vessel captured 19 live kangaroos from Kangaroo Island (an island near Adelaide in the Great Australian Bight), Tasmania, and the coast of the mainland. Despite a long and arduous voyage, and a diet consisting sometimes of rum and damper (a bread made with flour and water), some live specimens reached France. At the end of the nineteenth century, only three of the original specimens, now preserved, still existed in Europe, one in London, one in Leiden and one in Paris, and a great deal of taxonomic confusion arose. In order to classify the preserved skulls, measurements were made on the skulls of 139 specimens of 3 species of kangaroos, M. giganteus (24 males, 24 females), M. melanops (20 males, 22 females), and M. fuliginosus (25 males, 24 females). Another point to note is that kangaroo skulls continue to grow throughout the animal's lifetime. The training sample contains adult kangaroos, but the age of the preserved skulls is unknown.
The data set has the following variables:

LABEL ANIMAL SEX SPECIES SP.SEX.GP BASILAR OCCIPITONASAL PALATILAR PALATEWIDTH NASALLENGTH NASALWIDTH SQUAMOSAL INTERLAC ZYGOMATIC POSTORBITAL ROSTRAL SOPD CREST INCFOR MANDLENGTH MANDWIDTH MANDDEPTH RAMUS

Concentrate just on this selection of variables: BASILAR, OCCIPITONASAL, PALATILAR, PALATEWIDTH, SOPD, MANDLENGTH. Also ignore the last 3 lines. These are the measurements for the historic skulls.

1. Subset the data to exclude the 3 historic skulls, and keep just the columns of interest. Call it roos3.sub.

2. Calculate the summary statistics (mean, std dev, min, max) for each variable, and note the number of missing values for each variable. Report these. How does S-Plus handle the missing values when calculating these univariate statistics?

3. Calculate the variance-covariance matrix, and report it. Experiment with the different options (fail, omit, include, available) for handling missing values. (Fail = don't calculate anything if there are missings, omit = exclude all cases with missings, include = report a missing if any values are missing, available = for each cell, omit only the cases that are missing a value for one of the two variables used to calculate that covariance.) Are the results different between the omit and available options? If so, explain the differences.

4. Generate a scatterplot matrix. How do you think the missing values are handled?

5. Here we will impute missing values using the library called "norm" (available from http://www.stat.psu.edu/∼j To load it on your PC, select the Load Library item on the File menu, and select the library norm from the menu that subsequently comes up. (If you also load the description, a help window giving details on the library and how to use it will pop up.) Now use the scripting window to type in the following commands.
    # convert to a numeric matrix and summarize the missingness
    x<-data.matrix(roos3.sub)
    s<-prelim.norm(x)
    s$nmis
    s$r
    # EM estimates of the mean and covariance parameters
    thetahat<-em.norm(s)
    getparam.norm(s,thetahat,corr=T)
    # set the random seed, then run data augmentation and impute, 3 times
    rngseed(1234567)
    theta<-da.norm(s,thetahat,steps=50,showits=T)
    roos3.imp1<-imp.norm(s,theta,x)
    theta<-da.norm(s,thetahat,steps=50)
    roos3.imp2<-imp.norm(s,theta,x)
    theta<-da.norm(s,thetahat,steps=50)
    roos3.imp3<-imp.norm(s,theta,x)

Once you have run these commands, you will see 3 new data files in the Objects window, called roos3.imp1, roos3.imp2, roos3.imp3. These are copies of the full data matrix with substituted values for the missings. Each should have different substitute values. From the original data set write down the matrix elements (row, col) where a value is missing, and from each of the imputed data sets find the substitute value for each and tabulate these values, eg

    Row     1     3     ...
    Col     4     4
    Imp 1   231   250
    Imp 2   240   233
    Imp 3   220   240

Compute the summary statistics and variance-covariance matrices for each of these imputed data sets. Compare the values to the summary statistics for the complete (non-missing) data, and to the variance-covariance matrix calculated using the available method.

Generate a scatterplot matrix for each of the imputed data sets. Are there any oddities in the plots? Does anything look different from the scatterplot matrix for the complete data? (Hand in these plots.)
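
When comparing against the matrix from the available method, it can help to see why omit and available generally give different answers. The following is a minimal sketch in plain Python, not S-Plus, assuming that "omit" means dropping every case with any missing value and "available" means pairwise deletion as described in question 3. The function names (sample_cov, cov_matrix) and the toy data are made up for illustration; they are not part of S-Plus or the norm library, and the numbers are not kangaroo measurements.

```python
def sample_cov(pairs):
    """Sample covariance (divisor n - 1) of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (n - 1)


def cov_matrix(rows, method):
    """Covariance matrix of rows (lists with None marking a missing value).

    method="omit":      use only the cases with no missing values at all.
    method="available": for each cell (j, k), use every case that has
                        both variable j and variable k observed.
    """
    if method not in ("omit", "available"):
        raise ValueError("method must be 'omit' or 'available'")
    if method == "omit":
        rows = [r for r in rows if None not in r]
    p = len(rows[0])
    out = [[None] * p for _ in range(p)]
    for j in range(p):
        for k in range(p):
            # After listwise deletion this filter removes nothing.
            pairs = [(r[j], r[k]) for r in rows
                     if r[j] is not None and r[k] is not None]
            out[j][k] = sample_cov(pairs)
    return out


# Hypothetical toy data: 5 cases, 3 variables, 2 missing values.
toy = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 5.0],
    [4.0, None, 8.0],
    [5.0, 10.0, 9.0],
]

cov_omit = cov_matrix(toy, "omit")
cov_avail = cov_matrix(toy, "available")
print(cov_omit[0][0], cov_avail[0][0])
```

In this toy example the first variable has no missing values, yet its variance is 4.0 under omit (3 complete cases) and 2.5 under available (all 5 cases), because omit discards cases that are incomplete only in other variables. One consequence of the available approach is that different cells of the matrix are based on different sets of cases.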