Stat 407 Lab 4
Imputing Missing Values with S-Plus
Fall 2001
SOLUTION

Often the first step in working with multivariate data is dealing with missing values. In this lab we explore how S-Plus handles missing values in some of its functions, and work with an S-Plus library that provides substitute values for missing data by multiple imputation.

When there are missing values in a data set, we need to determine the reasons for the missing information before analyzing the data. From a statistical perspective this means understanding the distribution of the missing values: do they occur randomly throughout the data matrix, or do they occur more frequently in certain combinations of variables? If the data are missing completely at random (MCAR), then multiple imputation might be an effective way to provide substitute values.

The data set that we use today is from a historical study of kangaroos in Australia:

In 1803 a French research vessel captured 19 live kangaroos from Kangaroo Island (an island near Adelaide in the Great Australian Bight), Tasmania, and the coast of the mainland. Despite a long and arduous voyage, and a diet consisting sometimes of rum and damper (a bread made with flour and water), some live specimens reached France. At the end of the nineteenth century, only three of the original specimens, now preserved, still existed in Europe: one in London, one in Leiden and one in Paris. A great deal of taxonomic confusion arose. In order to classify the preserved skulls, measurements were made on the skulls of 139 specimens of 3 species of kangaroos: M. giganteus (24 males, 24 females), M. melanops (20 males, 22 females), and M. fuliginosus (25 males, 24 females). Another point to note is that kangaroo skulls continue to grow throughout the animal's lifetime. The training sample contains adult kangaroos, but the age of the preserved skulls is unknown.
The data set has the following variables:

    LABEL ANIMAL SEX SPECIES SP.SEX.GP BASILAR OCCIPITONASAL PALATILAR
    PALATEWIDTH NASALLENGTH NASALWIDTH SQUAMOSAL INTERLAC ZYGOMATIC
    POSTORBITAL ROSTRAL SOPD CREST INCFOR MANDLENGTH MANDWIDTH MANDDEPTH RAMUS

Concentrate just on this selection of variables: BASILAR, OCCIPITONASAL, PALATILAR, PALATEWIDTH, SOPD, MANDLENGTH. Also ignore the last 3 lines, which are the measurements for the historic skulls.

1. Subset the data to exclude the 3 historic skulls and to keep just the columns of interest. Call it roos3.sub.

2. Calculate the summary statistics (mean, std dev, min, max) for each variable, and note the number of missing values for each variable. Report these. How does S-Plus handle the missing values when calculating these univariate statistics?

    *** Summary Statistics for data in: roos3.sub ***

               BASILAR OCCIPITONASAL PALATILAR PALATEWIDTH  SOPD MANDLENGTH
    Min:          1048          1145       693       185.0 462.0        880
    Mean:         1497          1562      1026       258.2 653.5       1254
    Max:          1893          1945      1315       332.0 798.0       1568
    Total N:       139           139       139       139.0 139.0        139
    NA's:            0             0         0        20.0  10.0         12
    Std Dev.:      162           151       121        30.1  66.1        143

S-Plus calculates the summary statistics omitting the missing values on a univariate basis. That is, for the first 3 variables, where there were no missing values, all 139 observations were used to calculate the mean. For PALATEWIDTH, SOPD and MANDLENGTH the statistics are based on the 119, 129 and 127 non-missing values, respectively.

3. Calculate the variance-covariance matrix, and report it. Experiment with the different options (fail, omit, include, available) for handling missing values. (Fail = don't calculate anything if there are missing values; omit = exclude all cases with missing values; include = report a missing value if any values are missing; available = omit a case only if it is missing on one of the two variables used to calculate the covariance in a given cell.) Are the results different between the omit and available options? If so, explain the differences.
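The difference between the omit and available options can be seen on a small example. The following is a minimal sketch in Python rather than S-Plus, since it only illustrates the arithmetic; the function names `cov` and `cov_matrix` and the toy numbers are assumptions made for illustration, not the S-Plus implementation and not the kangaroo data.

```python
from itertools import product

def cov(xs, ys):
    """Sample covariance (divisor n - 1) of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def cov_matrix(rows, method):
    """Covariance matrix of `rows` (a list of cases; None marks a missing value).

    method="omit": drop every case that has a missing value in any column.
    method="available": for each pair of columns, drop only the cases
    that are missing in one of those two columns.
    """
    p = len(rows[0])
    if method == "omit":
        rows = [r for r in rows if None not in r]
    out = [[None] * p for _ in range(p)]
    for i, j in product(range(p), repeat=2):
        pairs = [(r[i], r[j]) for r in rows
                 if r[i] is not None and r[j] is not None]
        out[i][j] = cov([a for a, _ in pairs], [b for _, b in pairs])
    return out

# Toy data (hypothetical): column 0 is complete, column 1 is missing
# for the two cases with the smallest values in column 0.
toy = [
    [10.0, None],
    [12.0, None],
    [20.0, 5.0],
    [22.0, 6.0],
    [24.0, 8.0],
]

omit = cov_matrix(toy, "omit")
avail = cov_matrix(toy, "available")
print("omit      var(col 0):", omit[0][0])
print("available var(col 0):", avail[0][0])
print("cov(col 0, col 1) under both:", omit[0][1], avail[0][1])
```

In this toy case the variance of the complete column is much smaller under omit, because the two cases dropped for being incomplete are also the two smallest values in that column. The covariance between the two columns is the same under both options, since both use the same three cases for that cell.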
Omit:

                   BASILAR OCCIPITONASAL PALATILAR PALATEWIDTH  SOPD MANDLENGTH
    BASILAR          19830         18059     14747        3249  7076      17579
    OCCIPITONASAL    18059         18346     13727        3088  6601      15969
    PALATILAR        14747         13727     11405        2572  5342      13266
    PALATEWIDTH       3249          3088      2572         931  1306       3137
    SOPD              7076          6601      5342        1306  2971       6508
    MANDLENGTH       17579         15969     13266        3137  6508      16195

Available:

                   BASILAR OCCIPITONASAL PALATILAR PALATEWIDTH  SOPD MANDLENGTH
    BASILAR          26144         23477     19261        3217 10178      22308
    OCCIPITONASAL    23477         22912     17486        2993  9251      20329
    PALATILAR        19261         17486     14644        2555  7622      16598
    PALATEWIDTH       3217          2993      2555         904  1333       2988
    SOPD             10178          9251      7622        1333  4367       9051
    MANDLENGTH       22308         20329     16598        2988  9051      20325

There are considerable differences between the variances and covariances depending on the method used for handling missing values. For example, the variance of BASILAR is 19830 under Omit but 26144 under Available, and its covariance with OCCIPITONASAL is 18059 versus 23477. Omit discards every case that has a missing value on any variable, so even the entries for the complete variables are computed from a reduced set of cases, whereas Available uses all 139 cases for those entries.

4. Generate a scatterplot matrix. How do you think the missing values are handled?

My guess is that missing values are omitted on a pairwise basis. Only cases with a missing value on one of the two variables in a panel are omitted from that panel.

5. Here we will impute missing values using the library called "norm" (available from http://www.stat.psu.edu/~j...). To load it on your PC, select the Load Library item on the File menu, and select the library norm from the menu that subsequently comes up. (If you also load the description, a help window giving details on the library and how to use it will pop up.) Now use the scripting window to type in the following commands, which use the library.
    x <- data.matrix(roos3.sub)
    s <- prelim.norm(x)                  # preliminary bookkeeping on the missing data
    s$nmis                               # number of missing values per variable
    s$r                                  # missingness patterns
    thetahat <- em.norm(s)               # EM estimates of the mean and covariance
    getparam.norm(s, thetahat, corr=T)
    rngseed(1234567)                     # seed the random number generator
    theta <- da.norm(s, thetahat, steps=50, showits=T)   # data augmentation
    roos3.imp1 <- imp.norm(s, theta, x)  # first imputed data set
    theta <- da.norm(s, thetahat, steps=50)
    roos3.imp2 <- imp.norm(s, theta, x)  # second imputed data set
    theta <- da.norm(s, thetahat, steps=50)
    roos3.imp3 <- imp.norm(s, theta, x)  # third imputed data set

Once you have run these commands, you will see 3 new data sets in the Objects window, called roos3.imp1, roos3.imp2 and roos3.imp3. These are copies of the full data matrix with substituted values for the missing values. Each should have different substitute values. From the original data set write down the matrix elements (row, col) where a value is missing, and from each of the imputed data sets find the substitute value for each and tabulate these values, e.g.

    Row       1     3     5     6     19
    Col       4     4     4     4      6
    Imp 1   197   224   203   169   1305
    Imp 2   240   233   ...   ...    ...
    Imp 3   220   240   ...   ...    ...

Make sure the list of indices is accurate. The imputed values will vary from simulation to simulation. They are "random" values.

Compute the summary statistics and variance-covariance matrices for each of these imputed data sets. Compare the values to the summary statistics for the complete (non-missing) data, and to the variance-covariance matrix calculated using the available method.

Imputation 1

               BASILAR OCCIPITONASAL PALATILAR PALATEWIDTH  SOPD MANDLENGTH
    Min:          1048          1145       693       168.6 462.0        880
    Mean:         1497          1562      1026       254.5 653.3       1263
    Max:          1893          1945      1315       332.0 798.0       1568
    Std Dev.:      162           151       121        33.4  64.9        145

                   BASILAR OCCIPITONASAL PALATILAR PALATEWIDTH  SOPD MANDLENGTH
    BASILAR          26144         23477     19261        4228  9930      22990
    OCCIPITONASAL    23477         22912     17486        3889  9030      20551
    PALATILAR        19261         17486     14644        3294  7407      17203
    PALATEWIDTH       4228          3889      3294        1117  1799       4059
    SOPD              9930          9030      7407        1799  4218       9034
    MANDLENGTH       22990         20551     17203        4059  9034      21041

Note that the means, variances and covariances are unchanged for the variables where we have complete data. They differ for the variables where there were missing values.
The values are similar to those from the available method.

Generate a scatterplot matrix for each of the imputed data sets. Are there any oddities in the plots? Does anything look different from the scatterplot matrix for the complete data? (Hand in these plots.)

You should notice that some of the missing values occur in cases that have very low values on other variables, e.g. BASILAR vs PALATEWIDTH. The imputed values then appear as low values on PALATEWIDTH, but look consistent with the trend.
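The bookkeeping asked for in question 5 (locating the missing cells and listing the value that each imputed data set substituted there) can be sketched as follows. This is a minimal Python sketch rather than S-Plus code; the function names and the toy matrix are assumptions made for illustration, and nothing here comes from the norm library.

```python
def missing_cells(rows):
    """Return the 1-based (row, col) positions whose value is None."""
    return [(i + 1, j + 1)
            for i, row in enumerate(rows)
            for j, value in enumerate(row)
            if value is None]

def tabulate_imputations(original, imputations):
    """For each missing cell, list the value each imputed copy put there."""
    return {(r, c): [imp[r - 1][c - 1] for imp in imputations]
            for r, c in missing_cells(original)}

def observed_unchanged(original, imputed):
    """True if the imputed copy agrees with the original at every observed cell."""
    return all(o == m
               for orow, mrow in zip(original, imputed)
               for o, m in zip(orow, mrow)
               if o is not None)

# Toy values (hypothetical, not the kangaroo data):
# a 3 x 2 matrix with two missing cells and three imputed copies.
original = [[1.0, None],
            [2.0, 5.0],
            [None, 6.0]]
imp1 = [[1.0, 4.2], [2.0, 5.0], [2.9, 6.0]]
imp2 = [[1.0, 3.7], [2.0, 5.0], [3.4, 6.0]]
imp3 = [[1.0, 4.9], [2.0, 5.0], [2.5, 6.0]]

table = tabulate_imputations(original, [imp1, imp2, imp3])
for (row, col), values in table.items():
    print("row", row, "col", col, "imputed:", values)

# Observed cells should be identical in every imputed copy.
print(all(observed_unchanged(original, imp) for imp in (imp1, imp2, imp3)))
```

Deriving the index list from the original matrix, rather than writing it down by hand, is one way to make sure the list of indices is accurate, and the final check confirms that only the missing cells differ between copies.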