Missing value imputation methods for multilevel data Antonella Plaia and Anna Lisa Bondı̀ Dipartimento di Scienze Statistiche e Matematiche “S. Vianelli”, University of Palermo plaia@unipa.it, bondi@dssm.unipa.it Summary. Missing information is inevitable in environmental research. In literature there are several methods for handling missing data and the choice of an appropriate one depends, among others, on the missing data pattern and on the missing data mechanism. One approach to the problem is to impute the missing data to yield a complete data set. The goal of this paper is to propose a new imputation method and to compare it to other imputation methods known in literature, with the assumption that the missing data mechanism, as it often happens in air quality data, is MAR (missing at random). Considering missing data in PM10 concentration measured every two hours by 8 monitoring stations distributed over the metropolitan area of Palermo, Sicily, dur- ing 2003, simulated incomplete data have been generated, in order to evaluate the performance of the imputation method by using suitable performance indicators. Key words: missing data, single imputation methods, PM10. 1 Introduction Missing data is a very frequent problem in environmental research, usually due to faults in data acquisition. The choice of an appropriate method for handling missing data depends, among others, on the missing data pattern and on the missing-data mechanism. The standard classification of missing data mechanism ( [LR87]; [S97]) considers data: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). Considering the complete data Y = Yj and the missing data indicator M = mj , where mj = 1 if Yj is missing and mj = 0 otherwise, the missing data mechanism is characterized by the conditional distribution of M given Y, say f (M |Y, φ), where φ denotes unknown parameters. Let Yobs and Ymiss denote the observed component and the missing of Y, respectively. If f (M |Y, φ) = f (M |φ) for all Y, the data are called MCAR; if (1) 1456 Antonella Plaia and Anna Lisa Bondı̀ f (M |Y, φ) = f (M |Yobs , φ) (2) for all Ymiss , φ, the missing data mechanism is MAR; if the distribution of them depends on Ymiss the mechanism is called NMAR. Usually, the data mechanism of air quality data is MAR, as, being due to the monitoring site down, the probability that a value is missing does not depend on the missing value (as it would be if the monitoring site could not measure values below a given threshold). In presence of missing data, a possibility is to discard units whose information is incomplete, considering a listwise deletion, that involves the complete discard of the units with missing data, or a pairwise deletion, where units are excluded from any calculation involving variables with missing data. On the other hand, statistical methods are available that take the missing data into account at the time of analysis. These methods include likelihood-based approaches such as generalized linear models and the expectation-maximization (E-M) algorithm when data are “missing at random”. A different approach is to impute the missing values so that the resulting data set is complete. This third possibility is to be chosen if the data set will be used for many different types of analysis by a number of researchers. Methods available for creating complete data matrices can be divided into two main categories: single imputation and multiple imputation methods. Single imputation methods fill in one value for each missing one; they have many appealing features, because standard complete-data methods can be applied directly and because imputation need to be carried out only once. Multiple imputation methods generate multiple simulated values for each missing value, in order to reflect the uncertainty attached to missing data ( [S97]). Generally a multiple imputation method requires a full specification of the distributional form of Y in order to derive the conditional distribution of the missing data given the observed data. Besides, multiple imputation is generally used to estimate some parameter θ of the distribution of the Y. The aim of this parer is to propose a new single imputation method that, considering the particular structure of the data set, creates a “complete” data set that can be analyzed by any researcher on different occasions and using different techniques. 2 Data and Methods 2.1 Data In [MRZ00] the authors analyzed the space - time variability of PM10 concentration via meteorological variables. In the cited paper, as well as in almost all the researches about the effects of P M10 on human health, daily mean concentrations are considered. But a daily mean can not be computed if more than 25% of bi-hour (or hour) data is missing. Therefore, in the present paper we will consider the bi-hour data set in order to ‘complete’ it by imputing missing values. In particular we will consider P M10 concentration measured every two hours by 8 monitoring stations distributed over the metropolitan area of Palermo, Sicily, during 2003. These longitudinal data show a multilevel structure with monitoring sites as second level units and single measure as first level units. The data set structure in Table 1 shows other temporal variables i.e. week (W), day of week (W-D) and hour of the day (H) that will be Missing value imputation methods for multilevel data 1457 used in the single imputation method proposed in this paper. The data set shows a certain percentage of missing data going from a 3.5% of Station 3 to a 13% of Station 2. About 93% of the sequences of missing values has a length of 3 or smaller, 5% has a length between 3 and 12, and only 2% are longer that 12 (i.e. that is longer than one day). Figure 4 shows the frequency distribution of the gap length for a particular monitoring station. Table 1. Data structure W W-D H St1 St2 St3 St4 St5 St6 St7 St8 1 1 1 1 1 1 ... 12.47 27.09 11.10 24.62 14.84 16.46 ... 108.02 28.17 34.70 31.51 23.91 30.72 ... 30.46 9.34 24.17 27.23 18.50 38.54 ... NA 20.10 29.76 34.96 15.21 35.87 ... 4.29 26.66 21.94 26.20 20.10 16.76 ... 45.05 NA NA NA 20.94 15.20 ... 50.91 26.36 44.46 45.89 NA NA ... 3 3 3 3 3 3 ... 2 4 6 8 10 12 ... 23.32 34.68 26.85 19.80 24.38 22.47 ... 0 50 frequency 100 150 St7 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 gap length of missing values Fig. 1. Distribution of gap length in a monitoring station: St7 2.2 Methods The best methods of estimating missing data, in general, depend upon the statistical properties of the data. It is often difficult to determine and compare the accuracy of different imputation methods. Unless one is able to retrieve the missing data, a method must be devised to create a data set that mimics real life data and missing data patterns, because the imputation performance does not depend only on the amount of missing data but on the characteristics of the missing data mechanism. Since the missing data mechanism of air quality data is generally MAR - missing at random - i.e. the probability that a value is missing does not depend on the missing value ( [R76]) various imputation methods were used to estimate the “missing values”. Among the single imputation methods for longitudinal data we can distinguish 1458 Antonella Plaia and Anna Lisa Bondı̀ methods based on the information on the same subject (e.g. last observation carried forward, next observation carried backward, last & next ( [ED03])), methods that borrow information from other subjects (row mean/median) and methods that use both pieces of information (e.g. conditional mean imputation, hot-deck imputation). Considering the multilevel structure of our data set (with monitoring sites as second level units and single measures as first level units), we suppose to be particularly important to consider both the row and column information (in Table 1) to impute a missing value (that is to consider both the spatial and temporal correlation among data). Actually, differently from what can happen for a climatological data set (that can shows the same structure as our data set) here it results to be a main point to consider site specific effects, for example week, week-day or day-hour site specific effects, that is to distinguish, for example, between a mean hour effect (averaged over the eight monitoring sites) and a site specific effect. The new imputation method we will propose in the next section considers these particular characteristics of the data. SDEM method Denoting the generic element of the data set (Table 1) as xswdh , where s refers to the monitoring site (s = 1, 2, ...S), w to the Week (w = 1, 2, ..., 53), d to the WeekDay (d = 1, 2, ...7) and h to the Hour (h = 2, 4, ...24), we will propose to use the Site-Dependent Effect Method (SDEM) that considers explicitly a week effect, a day effect and an hour effect (all site-dependent), assuming their additivity. A missing value will be estimated by: X x̄ S x̂swdh = x̄.wdh + (x̄sw .. − sw .. S S X x̄ S ) + (x̄s.. h − s.. h S s=1 ) (3) St5 St6 St7 St8 mean St1 St2 St3 St4 St5 St6 St7 St8 mean 60 50 40 30 20 20 30 40 50 PM10 60 70 St3 St4 70 St1 St2 PM10 s. d . 80 s=1 80 s=1 X x̄ S ) + (x̄s. d . − Mon Tue Wed Thu week−day Fri Sat Sun 2 4 6 8 10 12 14 16 18 20 22 24 hour Fig. 2. Week-day and hour site specific and mean effect in 8 monitoring sites Figure 2 justifies the use of a site specific week (not shown here), week-day and hour effect, showing, for each monitoring site, the difference between a mean time-effect and a site specific time effect. Missing value imputation methods for multilevel data 1459 The performance of this method will be compared to other single imputation methods in literature, and in particular to the Hour Mean method x̄s.. h ( [LLZ99]), x +x to Last & Next method ( [ED03]) swd(h−1) 2 swd(h+1) , and to a model-based multiple imputation method (MBMI) ( [S97]). 3 Performance indicators and missing data simulation Two performance indicators have been considered to assess the goodness of imputation. Denoting with Oi the i−th observed data point, with O the average of observed data, with Pi the i − th imputed data point, with P the average of imputed data, with σO the standard deviation of the observed data and σP the standard deviation of the imputed data, and finally with N the number of imputations (that is the number of missing data), we use ( [JNTRK04]): 1. the coefficient of correlation (ρ) between observed and imputed: ρ=[ 1 × N P N i=1 [(Pi − P )(Oi − O)] ] σP σO (4) 2. an index of agreement (d) d=1−[ P P N i=1 (Pi N (|P i − O| i=1 − Oi )2 + |Oi − O|)2 ] (5) The index of agreement, with respect to the coefficient of correlation, is related to the sizes of the discrepancies between predicted and observed values. In order to evaluate the performance of the imputation methods, simulated incomplete data have been generated, after which the methods were applied to the data and the performance indicators computed. The imputation performance does depend on the amount of missing data and on the missing data pattern. Moreover, as already explained, the missing data mechanism of air quality data is usually MAR being due to monitoring station being down. Figure 4 shows a typical frequency distribution of the gap length (of a whole year) in a monitoring station. As it shows, most of the sequences of missing values are very short (from 1 to 10 values long) with only 1 or 2 gaps longer than one day (that is 12 consecutive missing values). Referring to Table 1 we will consider 4 different missing data patterns that differ for the total percentage of missing data in the table, for the distribution of the gap length, and for the maximum number of missing values per row. Two different total amount of missing data have been considered: about 5% and about 15%. Two maximum number of missing values per row have been considered, 4 and 8. For each of the four missing data pattern 100 missing data indicator matrix M have been generated drawing the gap length from a mixture of two distributions, an exponential of parameter λ = 0.5 (that produces the short gaps) and an Uniform with parameters (20, 70) and (40, 120) for the 5% and 15% missing data patterns respectively (to produce the long gaps), the starting point of each gap from a Uniform(1, 4380) and the number of missing data per row (in Table 1) from a Uniform(0, 4) or Uniform(0, 8). Matrix M applied to the observed data set creates “artificially” missing data (actually real values are known) and this allow to compute the value of the performance indicators (4) and (5) to assess the goodness of the imputation methods. 1460 Antonella Plaia and Anna Lisa Bondı̀ Table 2. Index of agreement: Comparing methods versus missing data patterns St1 St2 St3 St4 St5 St6 St7 St8 H-M L&N SDEM MI 0.36 0.75 0.86 0.74 H-M L&N SDEM MI 0.37 0.78 0.87 0.75 H-M L&N SDEM MI 0.37 0.73 0.86 0.75 H-M L&N SDEM MI 0.38 0.74 0.86 0.74 5 % missing data Up to 4 missing values per row 0.26 0.37 0.46 0.45 0.44 0.51 0.73 0.68 0.60 0.62 0.63 0.60 0.69 0.84 0.70 0.78 0.76 0.74 0.66 0.79 0.63 0.69 0.61 0.70 Up to 8 missing values per row 0.26 0.38 0.47 0.45 0.45 0.51 0.75 0.70 0.62 0.62 0.63 0.60 0.69 0.84 0.72 0.78 0.76 0.74 0.67 0.79 0.64 0.68 0.62 0.71 15 % missing data Up to 4 missing values per row 0.26 0.37 0.47 0.45 0.43 0.51 0.71 0.66 0.57 0.59 0.58 0.57 0.69 0.84 0.71 0.77 0.75 0.74 0.66 0.79 0.62 0.68 0.61 0.70 Up to 8 missing values per row 0.26 0.37 0.46 0.45 0.45 0.51 0.69 0.67 0.57 0.61 0.61 0.57 0.69 0.84 0.69 0.77 0.75 0.74 0.66 0.79 0.62 0.68 0.62 0.71 0.38 0.69 0.84 0.80 0.39 0.70 0.85 0.80 0.39 0.65 0.84 0.79 0.39 0.66 0.84 0.80 4 Results and discussion Figure 3 shows the coefficient of correlation and the index of agreement versus the monitoring site for the four methods (H-M, Last & Next, SDEM and MI) and the four missing data patterns (averaged over the 100 simulated incomplete data sets). The Hour-Mean method results to be the worst according both to the coefficient of correlation and to the index of agreement. Last & Next appears to perform better, especially according to the index of agreement, but its performance gets worse increasing the gap length. Multiple imputation too performs better according to the index of agreement rather than to the coefficient of correlation and this independently from the gap length. Anyway SDEM results to be the best method according to both the indices and independently from the gap length. The row-wise missing pattern (that is the maximum number of Stations with missing values at the same time) does not influence the method performance. Table 2 shows the average index of agreement (averaged over the one hundred simulated incomplete data, σd about 0.05) for each monitoring site and each data pattern. Missing value imputation methods for multilevel data 0.8 0.6 0.0 0.2 0.4 0.6 0.4 0.0 0.2 correlation 0.8 1.0 Last & Next 1.0 Hour Mean St1 St2 St3 St4 St5 St6 St7 St8 St1 St2 St3 St4 St5 St6 St7 St8 St6 St7 St8 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 Multiple Imputation 1.0 SDEM correlation 1461 St1 St2 St3 St4 St5 monitoring site St6 St7 St8 St1 St2 St3 St4 St5 monitoring site Fig. 3. Coefficient of correlation for the four methods and the four missing data patterns. 5 Conclusions In a large longitudinal study where many analysis will be performed, it is important to have a complete (imputed) data set. If time information, like hour of the day, day of the week, week of the year (or month) are available, it is important to consider conditional mean imputation methods. But if the data shows a multilevel structure (as in our case) both the time dependent information and the site dependent one are to be considered. The new imputation method proposed in this paper tries to make the best of all the information considering, on the one hand time effects averaged over the monitoring sites, on the other a site specific time effect. Both the correlation coefficient and the index of agreement agree to evaluate SDEM as the best method among the ones compared in this paper, and independently on the gap length and on the number of stations with missing data. 1462 Antonella Plaia and Anna Lisa Bondı̀ 0.8 0.6 0.0 0.2 0.4 0.6 0.4 0.0 0.2 index of agreement 0.8 1.0 Last & Next 1.0 Hour Mean St1 St2 St3 St4 St5 St6 St7 St8 St1 St2 St3 St6 St7 St8 St6 St7 St8 1.0 0.8 0.0 0.2 0.4 0.6 0.8 0.6 0.4 0.0 0.2 index of agreement St5 Multiple Imputation 1.0 SDEM St4 St1 St2 St3 St4 St5 St6 monitoring site St7 St8 St1 St2 St3 St4 St5 monitoring site Fig. 4. Index of agreement for the four methods and the four missing data patterns. 6 Acknowledgements The work was partly supported by University of Palermo research project (ex 60%). We wish to thank Eng. M. Vultaggio, AMIA (Azienda Municipalizzata per l’Igiene Ambientale, Palermo) for providing the data used in the study. References [BP05] Bondı̀, A.L., Plaia, A.: Weather variables and air pollution via hierarchical linear models. In: SIS (ed) Statistics and Environment. Cleup, Messina, 237–240 (2005) [ED03] Engels, J.M., Diehr, P.: Imputation of missing longitudinal data: a comparison of methods. Journal of Clinical Epidemiology, 56, 968–976 (2003) [JNTRK04] Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., Kolehmainen, M.: Methods for imputation of missing values in air quality data sets. Atmospheric Environment, 38, 2895–2907 (2004) [LLZ99] Li, K.H., Le, N.D., Zidek, J.V: Spatial-temporal models for ambient hourly P M10 in Vancouver. Environmetrics, 10, 321–328 (1999) Missing value imputation methods for multilevel data [LR87] [R76] [S97] 1463 Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, New York (1987) Rubin, D.B.: Inference and missing data (with discussion) Biometrika, 63, 581–592 (1976) Schafer, J.L.: Analysis of incomplete multivariate data. Monographs on Statistics and Applied Probability No. 72. Chapman & Hall, London (1997)