Missing value imputation methods for multilevel data

advertisement
Missing value imputation methods for
multilevel data
Antonella Plaia and Anna Lisa Bondı̀
Dipartimento di Scienze Statistiche e Matematiche “S. Vianelli”, University of
Palermo plaia@unipa.it, bondi@dssm.unipa.it
Summary. Missing information is inevitable in environmental research. In literature there are several methods for handling missing data and the choice of an
appropriate one depends, among others, on the missing data pattern and on the
missing data mechanism. One approach to the problem is to impute the missing
data to yield a complete data set. The goal of this paper is to propose a new imputation method and to compare it to other imputation methods known in literature,
with the assumption that the missing data mechanism, as it often happens in air
quality data, is MAR (missing at random). Considering missing data in PM10 concentration measured every two hours by 8 monitoring stations distributed over the
metropolitan area of Palermo, Sicily, dur- ing 2003, simulated incomplete data have
been generated, in order to evaluate the performance of the imputation method by
using suitable performance indicators.
Key words: missing data, single imputation methods, PM10.
1 Introduction
Missing data is a very frequent problem in environmental research, usually due to
faults in data acquisition. The choice of an appropriate method for handling missing
data depends, among others, on the missing data pattern and on the missing-data
mechanism. The standard classification of missing data mechanism ( [LR87]; [S97])
considers data: missing completely at random (MCAR), missing at random (MAR)
and not missing at random (NMAR). Considering the complete data Y = Yj and the
missing data indicator M = mj , where mj = 1 if Yj is missing and mj = 0 otherwise,
the missing data mechanism is characterized by the conditional distribution of M
given Y, say f (M |Y, φ), where φ denotes unknown parameters. Let Yobs and Ymiss
denote the observed component and the missing of Y, respectively. If
f (M |Y, φ) = f (M |φ)
for all Y, the data are called MCAR; if
(1)
1456
Antonella Plaia and Anna Lisa Bondı̀
f (M |Y, φ) = f (M |Yobs , φ)
(2)
for all Ymiss , φ, the missing data mechanism is MAR; if the distribution of them
depends on Ymiss the mechanism is called NMAR.
Usually, the data mechanism of air quality data is MAR, as, being due to the
monitoring site down, the probability that a value is missing does not depend on the
missing value (as it would be if the monitoring site could not measure values below
a given threshold).
In presence of missing data, a possibility is to discard units whose information
is incomplete, considering a listwise deletion, that involves the complete discard
of the units with missing data, or a pairwise deletion, where units are excluded
from any calculation involving variables with missing data. On the other hand,
statistical methods are available that take the missing data into account at the time
of analysis. These methods include likelihood-based approaches such as generalized
linear models and the expectation-maximization (E-M) algorithm when data are
“missing at random”. A different approach is to impute the missing values so that
the resulting data set is complete. This third possibility is to be chosen if the data
set will be used for many different types of analysis by a number of researchers.
Methods available for creating complete data matrices can be divided into two
main categories: single imputation and multiple imputation methods. Single imputation methods fill in one value for each missing one; they have many appealing
features, because standard complete-data methods can be applied directly and because imputation need to be carried out only once. Multiple imputation methods
generate multiple simulated values for each missing value, in order to reflect the uncertainty attached to missing data ( [S97]). Generally a multiple imputation method
requires a full specification of the distributional form of Y in order to derive the conditional distribution of the missing data given the observed data. Besides, multiple
imputation is generally used to estimate some parameter θ of the distribution of the
Y.
The aim of this parer is to propose a new single imputation method that, considering the particular structure of the data set, creates a “complete” data set that can
be analyzed by any researcher on different occasions and using different techniques.
2 Data and Methods
2.1 Data
In [MRZ00] the authors analyzed the space - time variability of PM10 concentration
via meteorological variables. In the cited paper, as well as in almost all the researches
about the effects of P M10 on human health, daily mean concentrations are considered. But a daily mean can not be computed if more than 25% of bi-hour (or hour)
data is missing. Therefore, in the present paper we will consider the bi-hour data set
in order to ‘complete’ it by imputing missing values. In particular we will consider
P M10 concentration measured every two hours by 8 monitoring stations distributed
over the metropolitan area of Palermo, Sicily, during 2003. These longitudinal data
show a multilevel structure with monitoring sites as second level units and single
measure as first level units. The data set structure in Table 1 shows other temporal
variables i.e. week (W), day of week (W-D) and hour of the day (H) that will be
Missing value imputation methods for multilevel data
1457
used in the single imputation method proposed in this paper. The data set shows a
certain percentage of missing data going from a 3.5% of Station 3 to a 13% of Station
2. About 93% of the sequences of missing values has a length of 3 or smaller, 5%
has a length between 3 and 12, and only 2% are longer that 12 (i.e. that is longer
than one day). Figure 4 shows the frequency distribution of the gap length for a
particular monitoring station.
Table 1. Data structure
W W-D H St1
St2
St3
St4
St5
St6
St7
St8
1
1
1
1
1
1
...
12.47
27.09
11.10
24.62
14.84
16.46
...
108.02
28.17
34.70
31.51
23.91
30.72
...
30.46
9.34
24.17
27.23
18.50
38.54
...
NA
20.10
29.76
34.96
15.21
35.87
...
4.29
26.66
21.94
26.20
20.10
16.76
...
45.05
NA
NA
NA
20.94
15.20
...
50.91
26.36
44.46
45.89
NA
NA
...
3
3
3
3
3
3
...
2
4
6
8
10
12
...
23.32
34.68
26.85
19.80
24.38
22.47
...
0
50
frequency
100
150
St7
1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
31
33
gap length of missing values
Fig. 1. Distribution of gap length in a monitoring station: St7
2.2 Methods
The best methods of estimating missing data, in general, depend upon the statistical
properties of the data. It is often difficult to determine and compare the accuracy
of different imputation methods. Unless one is able to retrieve the missing data, a
method must be devised to create a data set that mimics real life data and missing
data patterns, because the imputation performance does not depend only on the
amount of missing data but on the characteristics of the missing data mechanism.
Since the missing data mechanism of air quality data is generally MAR - missing at
random - i.e. the probability that a value is missing does not depend on the missing
value ( [R76]) various imputation methods were used to estimate the “missing values”. Among the single imputation methods for longitudinal data we can distinguish
1458
Antonella Plaia and Anna Lisa Bondı̀
methods based on the information on the same subject (e.g. last observation carried
forward, next observation carried backward, last & next ( [ED03])), methods that
borrow information from other subjects (row mean/median) and methods that use
both pieces of information (e.g. conditional mean imputation, hot-deck imputation).
Considering the multilevel structure of our data set (with monitoring sites as second
level units and single measures as first level units), we suppose to be particularly
important to consider both the row and column information (in Table 1) to impute
a missing value (that is to consider both the spatial and temporal correlation among
data). Actually, differently from what can happen for a climatological data set (that
can shows the same structure as our data set) here it results to be a main point to
consider site specific effects, for example week, week-day or day-hour site specific effects, that is to distinguish, for example, between a mean hour effect (averaged over
the eight monitoring sites) and a site specific effect. The new imputation method
we will propose in the next section considers these particular characteristics of the
data.
SDEM method
Denoting the generic element of the data set (Table 1) as xswdh , where s refers to
the monitoring site (s = 1, 2, ...S), w to the Week (w = 1, 2, ..., 53), d to the WeekDay (d = 1, 2, ...7) and h to the Hour (h = 2, 4, ...24), we will propose to use the
Site-Dependent Effect Method (SDEM) that considers explicitly a week effect, a day
effect and an hour effect (all site-dependent), assuming their additivity. A missing
value will be estimated by:
X x̄
S
x̂swdh = x̄.wdh + (x̄sw .. −
sw ..
S
S
X x̄
S
) + (x̄s.. h −
s.. h
S
s=1
) (3)
St5
St6
St7
St8
mean
St1
St2
St3
St4
St5
St6
St7
St8
mean
60
50
40
30
20
20
30
40
50
PM10
60
70
St3
St4
70
St1
St2
PM10
s. d .
80
s=1
80
s=1
X x̄
S
) + (x̄s. d . −
Mon
Tue
Wed
Thu
week−day
Fri
Sat
Sun
2
4
6
8
10
12
14
16
18
20
22
24
hour
Fig. 2. Week-day and hour site specific and mean effect in 8 monitoring sites
Figure 2 justifies the use of a site specific week (not shown here), week-day
and hour effect, showing, for each monitoring site, the difference between a mean
time-effect and a site specific time effect.
Missing value imputation methods for multilevel data
1459
The performance of this method will be compared to other single imputation
methods in literature, and in particular to the Hour Mean method x̄s.. h ( [LLZ99]),
x
+x
to Last & Next method ( [ED03]) swd(h−1) 2 swd(h+1) , and to a model-based multiple
imputation method (MBMI) ( [S97]).
3 Performance indicators and missing data simulation
Two performance indicators have been considered to assess the goodness of imputation. Denoting with Oi the i−th observed data point, with O the average of observed
data, with Pi the i − th imputed data point, with P the average of imputed data,
with σO the standard deviation of the observed data and σP the standard deviation
of the imputed data, and finally with N the number of imputations (that is the
number of missing data), we use ( [JNTRK04]):
1. the coefficient of correlation (ρ) between observed and imputed:
ρ=[
1
×
N
P
N
i=1 [(Pi
− P )(Oi − O)]
]
σP σO
(4)
2. an index of agreement (d)
d=1−[
P
P
N
i=1 (Pi
N
(|P
i − O|
i=1
− Oi )2
+ |Oi − O|)2
]
(5)
The index of agreement, with respect to the coefficient of correlation, is related
to the sizes of the discrepancies between predicted and observed values.
In order to evaluate the performance of the imputation methods, simulated incomplete data have been generated, after which the methods were applied to the
data and the performance indicators computed. The imputation performance does
depend on the amount of missing data and on the missing data pattern. Moreover, as
already explained, the missing data mechanism of air quality data is usually MAR
being due to monitoring station being down. Figure 4 shows a typical frequency
distribution of the gap length (of a whole year) in a monitoring station. As it shows,
most of the sequences of missing values are very short (from 1 to 10 values long)
with only 1 or 2 gaps longer than one day (that is 12 consecutive missing values).
Referring to Table 1 we will consider 4 different missing data patterns that differ
for the total percentage of missing data in the table, for the distribution of the gap
length, and for the maximum number of missing values per row. Two different total
amount of missing data have been considered: about 5% and about 15%. Two maximum number of missing values per row have been considered, 4 and 8. For each of
the four missing data pattern 100 missing data indicator matrix M have been generated drawing the gap length from a mixture of two distributions, an exponential of
parameter λ = 0.5 (that produces the short gaps) and an Uniform with parameters
(20, 70) and (40, 120) for the 5% and 15% missing data patterns respectively (to
produce the long gaps), the starting point of each gap from a Uniform(1, 4380) and
the number of missing data per row (in Table 1) from a Uniform(0, 4) or Uniform(0,
8). Matrix M applied to the observed data set creates “artificially” missing data (actually real values are known) and this allow to compute the value of the performance
indicators (4) and (5) to assess the goodness of the imputation methods.
1460
Antonella Plaia and Anna Lisa Bondı̀
Table 2. Index of agreement: Comparing methods versus missing data patterns
St1 St2 St3 St4 St5 St6 St7 St8
H-M
L&N
SDEM
MI
0.36
0.75
0.86
0.74
H-M
L&N
SDEM
MI
0.37
0.78
0.87
0.75
H-M
L&N
SDEM
MI
0.37
0.73
0.86
0.75
H-M
L&N
SDEM
MI
0.38
0.74
0.86
0.74
5 % missing data
Up to 4 missing values per row
0.26 0.37 0.46 0.45 0.44 0.51
0.73 0.68 0.60 0.62 0.63 0.60
0.69 0.84 0.70 0.78 0.76 0.74
0.66 0.79 0.63 0.69 0.61 0.70
Up to 8 missing values per row
0.26 0.38 0.47 0.45 0.45 0.51
0.75 0.70 0.62 0.62 0.63 0.60
0.69 0.84 0.72 0.78 0.76 0.74
0.67 0.79 0.64 0.68 0.62 0.71
15 % missing data
Up to 4 missing values per row
0.26 0.37 0.47 0.45 0.43 0.51
0.71 0.66 0.57 0.59 0.58 0.57
0.69 0.84 0.71 0.77 0.75 0.74
0.66 0.79 0.62 0.68 0.61 0.70
Up to 8 missing values per row
0.26 0.37 0.46 0.45 0.45 0.51
0.69 0.67 0.57 0.61 0.61 0.57
0.69 0.84 0.69 0.77 0.75 0.74
0.66 0.79 0.62 0.68 0.62 0.71
0.38
0.69
0.84
0.80
0.39
0.70
0.85
0.80
0.39
0.65
0.84
0.79
0.39
0.66
0.84
0.80
4 Results and discussion
Figure 3 shows the coefficient of correlation and the index of agreement versus the
monitoring site for the four methods (H-M, Last & Next, SDEM and MI) and the
four missing data patterns (averaged over the 100 simulated incomplete data sets).
The Hour-Mean method results to be the worst according both to the coefficient of
correlation and to the index of agreement. Last & Next appears to perform better,
especially according to the index of agreement, but its performance gets worse increasing the gap length. Multiple imputation too performs better according to the
index of agreement rather than to the coefficient of correlation and this independently from the gap length. Anyway SDEM results to be the best method according
to both the indices and independently from the gap length. The row-wise missing
pattern (that is the maximum number of Stations with missing values at the same
time) does not influence the method performance.
Table 2 shows the average index of agreement (averaged over the one hundred
simulated incomplete data, σd about 0.05) for each monitoring site and each data
pattern.
Missing value imputation methods for multilevel data
0.8
0.6
0.0
0.2
0.4
0.6
0.4
0.0
0.2
correlation
0.8
1.0
Last & Next
1.0
Hour Mean
St1
St2
St3
St4
St5
St6
St7
St8
St1
St2
St3
St4
St5
St6
St7
St8
St6
St7
St8
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
Multiple Imputation
1.0
SDEM
correlation
1461
St1
St2
St3
St4
St5
monitoring site
St6
St7
St8
St1
St2
St3
St4
St5
monitoring site
Fig. 3. Coefficient of correlation for the four methods and the four missing data
patterns.
5 Conclusions
In a large longitudinal study where many analysis will be performed, it is important
to have a complete (imputed) data set. If time information, like hour of the day, day
of the week, week of the year (or month) are available, it is important to consider
conditional mean imputation methods. But if the data shows a multilevel structure
(as in our case) both the time dependent information and the site dependent one are
to be considered. The new imputation method proposed in this paper tries to make
the best of all the information considering, on the one hand time effects averaged
over the monitoring sites, on the other a site specific time effect. Both the correlation
coefficient and the index of agreement agree to evaluate SDEM as the best method
among the ones compared in this paper, and independently on the gap length and
on the number of stations with missing data.
1462
Antonella Plaia and Anna Lisa Bondı̀
0.8
0.6
0.0
0.2
0.4
0.6
0.4
0.0
0.2
index of agreement
0.8
1.0
Last & Next
1.0
Hour Mean
St1
St2
St3
St4
St5
St6
St7
St8
St1
St2
St3
St6
St7
St8
St6
St7
St8
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.0
0.2
index of agreement
St5
Multiple Imputation
1.0
SDEM
St4
St1
St2
St3
St4
St5
St6
monitoring site
St7
St8
St1
St2
St3
St4
St5
monitoring site
Fig. 4. Index of agreement for the four methods and the four missing data patterns.
6 Acknowledgements
The work was partly supported by University of Palermo research project (ex 60%).
We wish to thank Eng. M. Vultaggio, AMIA (Azienda Municipalizzata per l’Igiene
Ambientale, Palermo) for providing the data used in the study.
References
[BP05]
Bondı̀, A.L., Plaia, A.: Weather variables and air pollution via hierarchical
linear models. In: SIS (ed) Statistics and Environment. Cleup, Messina,
237–240 (2005)
[ED03]
Engels, J.M., Diehr, P.: Imputation of missing longitudinal data: a comparison of methods. Journal of Clinical Epidemiology, 56, 968–976 (2003)
[JNTRK04] Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., Kolehmainen,
M.: Methods for imputation of missing values in air quality data sets.
Atmospheric Environment, 38, 2895–2907 (2004)
[LLZ99] Li, K.H., Le, N.D., Zidek, J.V: Spatial-temporal models for ambient hourly
P M10 in Vancouver. Environmetrics, 10, 321–328 (1999)
Missing value imputation methods for multilevel data
[LR87]
[R76]
[S97]
1463
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley,
New York (1987)
Rubin, D.B.: Inference and missing data (with discussion) Biometrika, 63,
581–592 (1976)
Schafer, J.L.: Analysis of incomplete multivariate data. Monographs on
Statistics and Applied Probability No. 72. Chapman & Hall, London
(1997)
Download