Final protocol for statistical data analysis to examine mortality and

Particles size and composition in Mediterranean countries:
geographical variability and short-term health effects
MED-PARTICLES Project 2011-2013
Under the Grant Agreement EU LIFE+ ENV/IT/327
ACTION 9.
Development of the statistical modeling strategy for the analysis of the short-term effects
of air pollutants on health endpoints
Summary: The protocol aims to provide a detailed statistical analysis of time series data of PM
measurements and health outcomes (mortality and hospitalizations) in the Mediterranean cities
involved in the project.
--------------------------------------------------------------------------
1. SHORT-TERM ASSOCIATION BETWEEN DAILY HEALTH OUTCOMES AND
DAILY POLLUTANT CONCENTRATIONS
Objective
The objective of this chapter of the statistical protocol is to provide detailed steps for the statistical
analysis of the association between the PM measurements and the health outcome time series
collected in the Mediterranean cities involved in the project. In addition to the statistical guidance,
the specific code will be supplied at each step (when necessary) to ensure comparability. Following
the discussions of the Statistical Group, the Appendix gives a short presentation of the exploratory
analysis undertaken by the Rome team to finalize the control of confounding by long-term trends
and seasonality in the data. The R statistical package (version 2.13.0, package mgcv) will be used
for the statistical analysis.
Data
Exposure
For the epidemiological analysis of the time series data we will use the pollutant measurements
obtained from urban/suburban background sites and fixed monitors located near traffic (only when
they represent the exposure of the population living nearby). The monitor-specific concentrations
will be averaged and missing values from the averaged series will be imputed, as detailed in the
protocol in Action 7. As a result, for each city and for each one of the following pollutants, daily
exposure variables will be used for epidemiological analyses, as described below:
- PM2.5: daily average
- PM10: daily average
- PM2.5-10: daily average
- NO2: daily average
- CO: daily maximum value of the 8-hr running means
- O3: daily maximum value of the 8-hr running means
- Temperature: daily mean temperature
- Humidity: daily mean relative humidity
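As an illustration of the CO and O3 definitions above, the following is a minimal sketch of how a daily maximum of 8-hr running means could be derived from hourly values. The data frame `hourly`, its columns and the values are hypothetical. The use of a trailing window (each hour averaged with the 7 preceding hours, crossing midnight) is an assumption, since the window convention is set in the Action 7 protocol and not here.

```r
# Hypothetical hourly O3 series for two days (48 values, ug/m3).
hourly <- data.frame(
  day = rep(c("2010-07-01", "2010-07-02"), each = 24),
  o3  = c(rep(40, 10), rep(100, 8), rep(40, 6),  # day 1: 8-hour plateau at 100
          rep(30, 24))                            # day 2: flat at 30
)

# Trailing 8-hour running mean: the value at hour h averages hours h-7 .. h.
# (Assumption: trailing window; the first 7 hours of the series are NA.)
hourly$run8 <- as.numeric(stats::filter(hourly$o3, rep(1 / 8, 8), sides = 1))

# Daily maximum of the 8-hour running means.
daily_max8 <- tapply(hourly$run8, hourly$day, max, na.rm = TRUE)
print(daily_max8)
```

Note that with a trailing window the first running means of day 2 still include hours from day 1.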
Outcomes
Daily counts of cause-specific mortality and emergency hospitalizations will be collected under
Action 6 and used in the epidemiological analysis, to investigate the association with daily
concentrations of different air pollutants, with a specific focus on particulate matter (PM)
exposures.
Methods
a) General Strategy and Time-trend control
The associations between the particle indices and each health outcome will be investigated
using Poisson regression models allowing for overdispersion (family quasipoisson in R). The
model is of the form:
log E[Yt] = β0 + b * PMt + s(timet, k) + [dummy variables for day of the week] + [other confounders
such as temp as decided, holiday, influenza, ...]
where E[Yt] is the expected value of the Poisson distributed variable Yt indicating the daily
outcome count on day t with Var(Yt) = φE[Yt], φ being the over-dispersion parameter, timet is a
continuous variable indicating the time of event (with values from 1 to the length of the time
period) and PMt is the particle concentration averaged over the same and the previous day (lag 0-1).
The smooth functions s capture the relationship of the time-varying covariates and calendar time
with the health outcome. We will use penalized regression splines as smoothing functions, as
implemented in R by Wood (2000), for our main analysis model. In this setting, k is the number of
basis functions. We will use natural cubic splines as basis functions for the penalized regression splines.
We will choose the number of basis functions (k) to be 50 for each year of available data for the
time variable. Hence, the term corresponding to time will be of the form: s(trend, k=50*number of
years of data, bs="cr"). We will choose 8 effective degrees of freedom (edfs) per calendar year of
available data to control for seasonality. The choice is based on previous experience that this
yields a rather conservative estimate of the effect (Samoli et al. 2010), as well as on the results of the
exploratory analysis undertaken within the MED-PARTICLES project, shown in the Appendix.
Hence, the relevant R code will be of the form
library(mgcv)
model <- gam(outcome ~ s(trend, k=50*nyears, bs="cr") + as.factor(wday) + …, data=city,
             na.action=na.omit, family=quasipoisson, sp=2.2e+6)
(2.2e+6 is an example of a value of the smoothing parameter chosen to achieve the desired edfs)
where the smoothing parameter sp will be chosen so that the estimated degrees of freedom
for the variable time equal 8*nyears (as reported by summary(model)).
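One way to find a smoothing parameter that gives the target edf is a one-dimensional root search on log(sp). The sketch below is illustrative only. The simulated data, the helper names `fit_with_sp` and `edf_gap`, and the search interval are assumptions, not part of the protocol. It relies on the edf of the time smooth decreasing as sp increases.

```r
library(mgcv)

# Simulated stand-in for a city dataset (hypothetical values).
set.seed(1)
nyears <- 2
n <- 365 * nyears
city <- data.frame(trend = seq_len(n))
city$wday <- factor((city$trend - 1) %% 7)
city$outcome <- rpois(n, lambda = exp(3 + 0.2 * sin(2 * pi * city$trend / 365)))

target_edf <- 8 * nyears  # 8 edf per year of data, as in the protocol

# Fit the model with the smoothing parameter for time fixed at `sp`.
fit_with_sp <- function(sp) {
  gam(outcome ~ s(trend, k = 50 * nyears, bs = "cr") + wday,
      data = city, family = quasipoisson, sp = sp)
}

# Difference between the achieved edf of s(trend) and the target.
edf_gap <- function(log_sp) {
  summary(fit_with_sp(exp(log_sp)))$edf[1] - target_edf
}

# Root search on the log scale (interval is an assumption; widen if needed).
root <- uniroot(edf_gap, interval = c(0, 25), tol = 1e-6)
sp_star <- exp(root$root)

model <- fit_with_sp(sp_star)
print(c(sp = sp_star, edf_time = summary(model)$edf[1]))
```

The resulting sp is specific to the data, so the search would have to be repeated for each city and outcome.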
In the model we will include the dummy variables for day of the week as indicated. Careful
inspection of the trend plots should indicate whether such long-term control is adequate or whether
unexplained peaks remain in the data, which we will try to account for by including extra
period-specific terms (dummy variables or smooth terms depending on the data; see also the section
on other confounders below). In this case, it will be possible to use different DF for different
periods within a year, e.g. smooth terms for the summer, especially if we have general (not
emergency) admissions, or for respiratory infections in the winter.
Sensitivity analysis
We will apply 2 separate models as sensitivity analysis on seasonality control:
1. One model will use the time series approach with penalised splines for seasonality control
(as in the main model) but with the final edfs for long-term trends based on the choice of the
smoothing parameter for time that minimizes the sum of the absolute values of the partial
autocorrelations (PACF) of the residuals from lags 1 to 30.
2. The other model will be a Poisson model based on the case-crossover approach with a
time-stratified strategy for the selection of control days. Modelling the time trend with a
three-way interaction between year, month and day of the week is equivalent to the standard
case-crossover design with time-stratified control selection, in which the control days are
the same days of the week within the same month and year as the event day. The model
specification for time trend and day of the week is:
log E[Yt] = β0 + b * PMt + Σ βi [YY/MM/DOW] + [other confounders such as temp as
decided, holiday, influenza, ...]
where [YY/MM/DOW] represents all possible combinations of year, month and day of
week; hence (number of year/month/day-of-week combinations) - 1 dummy variables are
entered in the model. In this model there will be no further inclusion of day of the week
dummy variables.
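The year/month/day-of-week strata of the time-stratified design can be built as a single factor. The sketch below uses a hypothetical two-month date range and only illustrates the construction of the strata and the implied control days. In the model, the factor would enter a quasi-Poisson regression in place of the smooth of time and the day-of-week dummies.

```r
# Hypothetical study period: two full months.
dates <- seq(as.Date("2010-01-01"), as.Date("2010-02-28"), by = "day")

# One stratum per combination of year, month and day of the week.
lt <- as.POSIXlt(dates)
strata <- interaction(lt$year + 1900, lt$mon + 1, lt$wday, drop = TRUE)

# Control days for an event day: the other days in the same stratum.
event <- as.Date("2010-01-13")  # a Wednesday
same_stratum <- as.character(strata) == as.character(strata[dates == event])
controls <- dates[same_stratum & dates != event]
print(controls)

# In the regression the strata replace the time smooth and weekday dummies, e.g.
#   glm(outcome ~ pm + strata + <other confounders>, family = quasipoisson)
```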
b) Other confounding factors control
Time series data on daily temperature (°C, mean) and relative humidity (%) will be used to control
for the potential confounding effects of weather. We expect weather series to be complete. For
temperature control we will include a natural spline with 3 degrees of freedom for lag 0 and a
natural spline with 3 degrees of freedom for lag 1. When we do distributed lag analyses for the
pollutants, we will include the same distributed lag term for temperature as for the
pollutants. For humidity adjustment, we will include a linear term for relative humidity, since we
believe it is sufficient to capture potential confounding due to this parameter.
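The weather terms described above could be written as follows. This is a sketch on simulated data, assuming the natural splines come from ns() in the splines package. The variable names (`temp`, `temp_lag1`, `rh`) are hypothetical, and the smoothing parameter for time is left to its default here instead of being fixed to reach 8 edf per year.

```r
library(mgcv)
library(splines)

# Simulated stand-in for a city dataset (hypothetical values).
set.seed(2)
nyears <- 2
n <- 365 * nyears
city <- data.frame(trend = seq_len(n))
city$temp <- 15 + 10 * sin(2 * pi * city$trend / 365) + rnorm(n, sd = 2)
city$temp_lag1 <- c(NA, head(city$temp, -1))  # previous-day temperature
city$rh <- runif(n, 40, 90)                   # relative humidity (%)
city$outcome <- rpois(n, lambda = 20)
city <- city[-1, ]                            # drop the day without a lag-1 value

# Natural splines with 3 df for temperature at lag 0 and at lag 1,
# plus a linear term for relative humidity.
fit <- gam(outcome ~ s(trend, k = 50 * nyears, bs = "cr") +
             ns(temp, df = 3) + ns(temp_lag1, df = 3) + rh,
           data = city, family = quasipoisson)

print(names(coef(fit))[grepl("temp|rh", names(coef(fit)))])
```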
External information on influenza epidemics or other unusual events (heat waves, strikes, etc.) will
be used, if available.
The data on influenza should preferably be daily counts, i.e. number of cases. When only
weekly or biweekly counts of cases are available, daily values should be calculated by dividing the
count by the number of days in the period. If the only available information is the existence of an
epidemic, then a variable taking the value 1 for epidemic days and 0 otherwise should be provided.
If other unusual events have taken place during the study period (such as a strike in the health
services, flood, earthquake, heat or cold wave), a dummy variable with value 1 during the unusual
event and 0 otherwise should be included in the file. Furthermore, if in a city there is a sharp
reduction of the population because everyone takes holidays in the same period, additional control
may be needed, especially in the analysis of hospital admissions series.
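As a small worked example of the division step for weekly influenza counts, the sketch below spreads each weekly count evenly over its 7 days. The data frame `weekly` and its values are hypothetical. For biweekly data the divisor and the day offsets would be 14 and 0:13.

```r
# Hypothetical weekly influenza counts, indexed by the first day of each week.
weekly <- data.frame(
  week_start = as.Date(c("2010-01-04", "2010-01-11")),
  cases      = c(70, 140)
)

# Expand to one row per day and assign each day one seventh of the weekly count.
daily <- data.frame(
  date = rep(weekly$week_start, each = 7) + rep(0:6, times = nrow(weekly)),
  flu  = rep(weekly$cases / 7, each = 7)
)
print(head(daily, 8))
```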
If no influenza data are available we will use the APHEA-2 method for influenza
control, including a dummy variable taking the value of 1 when the 7-day moving average of
respiratory mortality is greater than the 90th percentile of its city-specific distribution. In this
case, since the influenza definition is based on the distribution of respiratory mortality, we will
include the influenza dummy variable only when analyzing non-respiratory admissions and
mortality. Under all other influenza definitions, instead, it will be included for all study outcomes,
including respiratory ones.
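The APHEA-2 style indicator described above could be computed along these lines. The respiratory mortality series is simulated. Centring the 7-day window on the index day is an assumption, since the text does not state whether the moving average is centred or trailing.

```r
# Simulated daily respiratory deaths for one year (hypothetical values).
set.seed(3)
resp_deaths <- rpois(365, lambda = 5)

# 7-day moving average (assumption: window centred on the index day;
# the first and last 3 days have no complete window and are NA).
ma7 <- as.numeric(stats::filter(resp_deaths, rep(1 / 7, 7), sides = 2))

# City-specific 90th percentile of the moving average.
cutoff <- quantile(ma7, probs = 0.90, na.rm = TRUE)

# Influenza dummy: 1 when the moving average exceeds the cutoff, 0 otherwise.
flu_dummy <- as.integer(!is.na(ma7) & ma7 > cutoff)
print(table(flu_dummy))
```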
c) Lag structure of exposure
According to the EPIAIR protocol we will include: a) cubic polynomial distributed lag models and
single-lag models from lag 0 to lag 6 to visually examine the lag structure of the association between PM
exposures and health outcomes; b) three cumulative lags chosen a priori to represent immediate effects (lag
0-1), delayed effects (lag 2-5) and prolonged effects (lag 0-5); c) for each combination of exposure/outcome,
choice of one of these three lags as the reference lag, based on the meta-analytic polynomial distributed lag
shape and on the meta-analytic estimates of the a priori cumulative lags. This choice identifies
the lagged exposure to be used for additional analyses such as two-pollutant models and sensitivity analyses (the
same reference lag is used for all cities). This strategy is a good compromise between a priori definitions
and flexibility of lag choice for different exposure/outcome combinations.
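The single-lag and cumulative-lag exposures listed above can be derived from the daily series as in the sketch below. The helper `lag_k` and the short PM series are hypothetical. Cumulative lags are taken as unweighted means of the single lags, consistent with the lag 0-1 average used in the main model.

```r
# Shift a daily series back by k days (the first k values become NA).
lag_k <- function(x, k) if (k == 0) x else c(rep(NA, k), head(x, -k))

pm <- c(10, 20, 30, 40, 50, 60, 70, 80)  # hypothetical daily PM values

# Single lags 0 to 6, one column per lag.
single <- sapply(0:6, function(k) lag_k(pm, k))
colnames(single) <- paste0("lag", 0:6)

# A priori cumulative lags as unweighted means of the single lags.
lag01 <- rowMeans(single[, paste0("lag", 0:1)])  # immediate effects
lag25 <- rowMeans(single[, paste0("lag", 2:5)])  # delayed effects
lag05 <- rowMeans(single[, paste0("lag", 0:5)])  # prolonged effects

print(cbind(pm, lag01, lag25, lag05))
```

A cumulative lag is NA on any day where one of its component lags is missing, so the first days of the series drop out of the corresponding models.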
d) Two-pollutant models
We will run two-pollutant models, adding each of the other routinely measured pollutants in turn
to the particle models. Special care will be given to the inclusion of NO2 in PM2.5 models if
they are highly correlated (r>0.7).
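A simple check of the correlation criterion before fitting a two-pollutant model might look like the sketch below. The two series are simulated, and the use of the Pearson coefficient is an assumption, as the text only gives the threshold r>0.7.

```r
# Simulated daily NO2 and PM2.5 series (hypothetical values).
set.seed(4)
no2  <- rnorm(200, mean = 40, sd = 10)
pm25 <- 0.5 * no2 + rnorm(200, mean = 0, sd = 3)

# Pearson correlation on days where both pollutants are available.
r <- cor(pm25, no2, use = "complete.obs")

# Flag pairs above the protocol threshold for careful interpretation.
high_corr <- r > 0.7
print(c(r = r, high_corr = high_corr))
```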
e) Concentration-response analysis
We will evaluate the shape of the relationship between PM concentrations and daily health
outcomes by:
1) Running threshold models, e.g. restricting the analyses to progressively lower concentrations of
pollutants, and estimating the associations for the so-restricted time series. This will show
whether an association between different PM fractions and different health outcomes exists
even at progressively lower concentrations of the pollutants;
2) Running meta-smoothing models, which consist of estimating non-linear associations between
PM fractions and health outcomes within each city, and then plotting a non-linear meta-analytical
curve for the pooled results.
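The restricted analysis in point 1 can be sketched as a loop over decreasing concentration cut-offs. The data are simulated and the cut-off values are hypothetical. The time trend and the other confounders of the main model are omitted here for brevity, so this shows only the restriction step.

```r
# Simulated daily data with a small log-linear PM effect (hypothetical values).
set.seed(5)
n <- 1000
city <- data.frame(pm = runif(n, min = 5, max = 80))
city$outcome <- rpois(n, lambda = exp(3 + 0.001 * city$pm))

# Hypothetical decreasing cut-offs (ug/m3).
cutoffs <- c(80, 60, 40, 20)

# Re-estimate the PM coefficient using only days at or below each cut-off.
betas <- sapply(cutoffs, function(cut) {
  fit <- glm(outcome ~ pm, family = quasipoisson,
             data = city[city$pm <= cut, ])
  coef(fit)[["pm"]]
})
names(betas) <- paste0("pm<=", cutoffs)
print(betas)
```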
APPENDIX
Results from initial exploratory analysis on the choice of seasonality control on the Rome
dataset
Following discussions of the Statistical Group on the optimal analysis of epidemiological time series
data, the Rome team carried out an exploratory analysis to compare the time series and
case-crossover frameworks, as well as different degrees of seasonality control under each
framework. There was a consensus within the group that the 2 methods are equivalent and differ
only in the amount of seasonality control, which is the fundamental question in this type of analysis.
Several methods proposed in the literature were tested and discussed in order to reach a
consensus. In brief, all analyses were conducted using over-dispersed Poisson regression models,
comparing the time-series approach (smoothed splines for time trend) with the case-crossover
approach with time-stratified selection of control days (indicator variables for time trend).
Data: The Rome dataset (01/01/2006 – 31/12/2010) was analyzed for the association between the daily
count of deaths from natural causes (ICD-9: 001-799) among persons aged 35+ and lag 0-1
concentrations of PM10 averaged from background and traffic monitors.
Methods: A multivariate adjustment model was defined including the other time-varying factors:
day of the week indicator for the time-series approach, holiday indicator, influenza indicator, high
temperature (penalized spline of mean air temperature (lag 0-1) on days with mean air temperature
(lag 0-1) above the city-specific median value), and low temperature (penalized spline of mean air
temperature (lag 1-6) on days with mean air temperature (lag 1-6) below the city-specific median
value).
Eight alternative models for control of long-term trends were implemented, the first five under the
time-series approach and the last three under the case-crossover approach:
1) Model 1 “PS-PACF”: Time trend adjusted for by using a penalized regression spline of time
trend, with smoothing parameter and effective degrees of freedom chosen in order to minimize the
sum of the absolute values of PACF of residuals from lag 3 to lag 30 (with a minimum choice of 3
degrees of freedom per year for seasonality control), setting the number of basis functions equal to
50 for each year of data.
2) Model 2 “PS-8DF”: Time trend adjusted for by using a penalized regression spline of time
trend, with smoothing parameter chosen in order to have 8 effective degrees of freedom per year,
setting the number of basis functions equal to 50 for each year of data.
3) Model 3 “PS-GCV”: Time trend adjusted for by using a penalized regression spline of time
trend, with effective degrees of freedom chosen to minimize the generalized cross-validation (GCV)
function, setting the number of basis functions equal to 50 for each year of data.
4) Model 4 “NS-5DF”: Time trend adjusted for by using natural cubic splines of time trend with 5
degrees of freedom for each year of data.
5) Model 5 “NS-8DF”: Time trend adjusted for by using natural cubic splines of time trend with 8
degrees of freedom for each year of data.
6) Model 6 “CC-TS”: Time trend and day of the week adjusted for by using a three-way interaction
between year, month and day-of-the-week.
7) Model 7 “ATKIN”: Time trend adjusted for by using dummy variables for months by year +
monthly-specific penalized splines of month day.
8) Model 8 “STRICK”: Time trend adjusted for by using a cubic polynomial of day of season +
two-way interaction between year and month (which includes the main effects) + two-way
interaction between month and day-of-the-week (which includes the main effects).
The applied models were compared in terms of: 1) GCV parameter, 2) sum of the absolute values of
partial autocorrelation of residuals (PACF) from lag 3 to lag 30, 3) effective degrees of freedom for
time trend, and 4) estimated association between PM10 and mortality, expressed as beta, standard
error and percent increase of risk corresponding to a 10 μg/m3 increase in PM10.
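The PACF criterion in point 2 can be computed from model residuals with the base R function pacf(). The sketch below is an illustration on simulated residual series. The helper name `pacf_sum` is hypothetical. The lag range is an argument because the main protocol uses lags 1 to 30 while this appendix uses lags 3 to 30.

```r
# Sum of absolute partial autocorrelations of a residual series over given lags.
pacf_sum <- function(res, lags = 3:30) {
  p <- pacf(res, lag.max = max(lags), plot = FALSE)
  sum(abs(p$acf[lags]))  # p$acf[i] holds the partial autocorrelation at lag i
}

# Simulated residual series (hypothetical): white noise and an AR(1) process.
set.seed(6)
white <- rnorm(1000)
ar1   <- as.numeric(arima.sim(list(ar = 0.6), n = 1000))

print(c(white_3_30 = pacf_sum(white),
        ar1_3_30   = pacf_sum(ar1),
        ar1_1_30   = pacf_sum(ar1, lags = 1:30)))
```

For a fitted model the input would be its residuals with missing days removed, since pacf() stops on NA values by default.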
Results: The results are summarised in the following table.
MODEL     NAME      EDF (time)   GCV    PACF   BETA       SE         IR     95%CI
MODEL 1   PS-PACF   15           1.16   0.56   0.001169   0.000288   1.18   (0.61;1.75)
MODEL 2   PS-8DF    40           1.14   0.85   0.000648   0.000305   0.65   (0.05;1.25)
MODEL 3   PS-GCV    84.76        1.13   1.48   0.000652   0.000325   0.65   (0.02;1.30)
MODEL 4   NS-5DF    25           1.15   0.63   0.000568   0.000306   0.57   (-0.03;1.18)
MODEL 5   NS-8DF    40           1.15   0.82   0.000544   0.000310   0.55   (-0.06;1.16)
MODEL 6   CC-TS     -            1.47   1.34   0.000559   0.000340   0.56   (-0.11;1.23)
MODEL 7   ATKIN     -            1.15   0.73   0.000437   0.000321   0.44   (-0.19;1.07)
MODEL 8   STRICK    -            1.94   0.81   0.000531   0.000320   0.53   (-0.10;1.16)
EDF (time): effective degrees of freedom for time trend. PACF: sum of the absolute values of the
partial autocorrelations of residuals from lag 3 to lag 30. IR: percent increase of risk per
10 μg/m3 increase in PM10.
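The IR and 95%CI columns follow from BETA and SE by exponentiating the log-linear coefficient for a 10 μg/m3 increase. The sketch below assumes a normal-approximation (Wald) interval, which reproduces the Model 1 row. The function name `pct_increase` is hypothetical.

```r
# Percent increase in risk, with 95% CI, for a `delta` unit increase in PM,
# from a log-linear coefficient and its standard error (Wald interval assumed).
pct_increase <- function(beta, se, delta = 10) {
  z   <- qnorm(0.975)
  est <- 100 * (exp(delta * beta) - 1)
  ci  <- 100 * (exp(delta * (beta + c(-1, 1) * z * se)) - 1)
  round(c(IR = est, lower = ci[1], upper = ci[2]), 2)
}

# Model 1 (PS-PACF): BETA = 0.001169, SE = 0.000288
print(pct_increase(0.001169, 0.000288))
```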
Conclusions: We decided to apply Model 2, following a time series framework, as the main
analysis model, and to conduct a limited sensitivity analysis by applying a) Model 6, to
accommodate the case-crossover design, and b) a model with PS with 3 dfs/year, to reflect the
application of the PACF criterion, which tends to choose a small number of degrees of freedom for
seasonality control. Fixed dfs were chosen over the PACF criterion for the time series approach
because a fixed number represents an average and rather conservative estimate, whereas PACF
allows more city-specific flexibility in long-term control, which would be favoured if the
exploration of potential heterogeneity were among the main aims of the project. Furthermore, the
limited number of cities involved in MED-PARTICLES limits the exploration of potential
heterogeneity in the second stage of the analysis (during application of meta-analysis techniques).
Finally, the PS smoother was favoured over NS because PS offers greater flexibility, while the time
series framework was preferred over the case-crossover framework because there are cities where
only aggregated data are available, which diminishes the advantage of the case-crossover design for
exploring interactions.
References:
Atkinson RW, et al. Urban ambient particle metrics and health: a time-series analysis.
Epidemiology. 2010 Jul;21(4):501-11.
Strickland MJ, et al. Short-term associations between ambient air pollutants and paediatric asthma
emergency department visits. Am J Respir Crit Care Med. 2010 Aug 1;182(3):307-16.