Particles size and composition in Mediterranean countries: geographical variability and short-term health effects

MED-PARTICLES Project 2011-2013
Under the Grant Agreement EU LIFE+ ENV/IT/327

ACTION 9. Development of the statistical modeling strategy for the analysis of the short-term effects of air pollutants on health endpoints

Summary: This protocol details the statistical analysis of time series data of PM measurements and health outcomes (mortality and hospitalizations) in the Mediterranean cities involved in the project.

1. SHORT-TERM ASSOCIATION BETWEEN DAILY HEALTH OUTCOMES AND DAILY POLLUTANT CONCENTRATIONS

Objective

The objective of this chapter of the statistical protocol is to provide detailed steps for the statistical analysis of the association between the PM measurements and the health outcome time series collected in the Mediterranean cities involved in the project. In addition to the statistical guidance, the specific code will be supplied at each step (when necessary) to ensure comparability. Following the discussions of the Statistical Group, the Appendix briefly presents the exploratory analysis undertaken by the Rome team to finalize the control for confounding by long-term trends and seasonality in the data. R (version 2.13.0) with the mgcv package will be used for the statistical analysis.

Data

Exposure

For the epidemiological analysis of the time series data we will use the pollutant measurements obtained from urban/suburban background sites and from fixed monitors located near traffic (the latter only when they represent the exposure of the population living nearby).
The monitor-specific concentrations will be averaged and missing values in the averaged series will be imputed, as detailed in the protocol for Action 7. As a result, for each city the following daily exposure and meteorological variables will be used in the epidemiological analyses:

- PM2.5: daily average
- PM10: daily average
- PM2.5-10: daily average
- NO2: daily average
- CO: daily maximum value of the 8-hr running means
- O3: daily maximum value of the 8-hr running means
- Temperature: daily mean temperature
- Humidity: daily mean relative humidity

Outcomes

Daily counts of cause-specific mortality and emergency hospitalizations will be collected under Action 6 and used in the epidemiological analysis to investigate the association with daily concentrations of different air pollutants, with a specific focus on particulate matter (PM) exposures.

Methods

a) General Strategy and Time-trend control

The associations between the particle indices and each health outcome will be investigated using Poisson regression models allowing for overdispersion (family quasipoisson in R). The model is of the form:

$$\log E[Y_t] = \beta_0 + b \cdot PM_t + s(\mathrm{time}_t, k) + [\text{dummy variables for day of the week}] + [\text{other confounders such as temperature as decided, holiday, influenza, ...}]$$

where $E[Y_t]$ is the expected value of the Poisson distributed variable $Y_t$ indicating the daily outcome count on day $t$, with $\mathrm{Var}(Y_t) = \varphi E[Y_t]$, $\varphi$ being the over-dispersion parameter; $\mathrm{time}_t$ is a continuous variable indicating the time of the event (with values from 1 to the length of the time period); and $PM_t$ is the particle concentration averaged over the same and the previous day (lag 0-1).

The smooth functions s capture the relationship between time-varying covariates, such as calendar time, and the health outcome. For our main analysis model we will use penalized regression splines as smoothing functions, as implemented in R by Wood (2000). In this setting, k is the number of basis functions.
We will use natural cubic splines as basis functions for the penalized regression splines. For the time variable we will set the number of basis functions (k) to 50 for each year of available data. Hence, the term corresponding to time will be of the form: s(trend, k=50*number of years of data, bs="cr").

We will choose 8 effective degrees of freedom (edf) per calendar year of available data to control for seasonality. The choice is based on previous experience that this yields a rather conservative estimate of the effect (Samoli et al. 2010), as well as on the results of the exploratory analysis undertaken within the MED-PARTICLES project and shown in the Appendix. Hence, the relevant R code will be of the form

    model <- gam(outcome ~ s(trend, k = 50 * nyears, bs = "cr") + as.factor(wday) + …, data = city, na.action = na.omit, family = quasipoisson, sp = 2.2e+6)

where the value of sp shown is only an example of a smoothing parameter that achieves the desired edf. The smoothing parameter sp will be chosen so that the estimated degrees of freedom for the time variable equal 8*nyears (as reported by summary(model)).

In the model we will include the dummy variables for day of the week as indicated. Careful inspection of the trend plots should indicate whether this long-term control is adequate or whether unusual peaks remain in the data. We will try to account for such peaks by including extra period-specific terms (dummy variables or smooth terms depending on the data; see also the section on other confounders below). In this case, it will be possible to use different df for different periods within a year, e.g. smooth terms for the summer, especially if we have general (not emergency) admissions, or for respiratory infections in the winter.

Sensitivity analysis

We will apply 2 separate models as sensitivity analysis on seasonality control:

1.
One model will use the time series approach with penalised splines for seasonality control (as in the main model), but with the final edf for long-term trends based on the smoothing parameter for time that minimizes the sum of the absolute values of the partial autocorrelations (PACF) of the residuals from lags 1 to 30.

2. The other model will be a Poisson model based on the case-crossover approach with a time-stratified strategy for the selection of control days. Modelling the time trend with a three-way interaction between year, month and day of the week is equivalent to the standard case-crossover design with time-stratified selection of control days, in which the control days are the same days of the week within the same month and year as the event day. The model specification for time trend and day of the week is:

$$\log E[Y_t] = \beta_0 + b \cdot PM_t + \sum_i \beta_i \, [YY/MM/DOW]_i + [\text{other confounders such as temperature as decided, holiday, influenza, ...}]$$

where [YY/MM/DOW] represents all possible combinations of year, month and day of the week; hence, one dummy variable fewer than the number of YY/MM/DOW combinations is entered in the model. In this model there will be no further inclusion of day of the week dummy variables.

b) Other confounding factors control

Time series data on daily temperature (°C, mean) and relative humidity (%) will be used to control for the potential confounding effects of weather. We expect the weather series to be complete. For temperature control we will include a natural spline with 3 degrees of freedom for lag 0 and a natural spline with 3 degrees of freedom for lag 1. When we run distributed lag analyses for the pollutants, we will include the same distributed lag term for temperature as for the pollutants. For humidity adjustment, we will include a linear term for relative humidity, since we believe it is sufficient to capture potential confounding due to this parameter.
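The time-stratified model and the weather terms described above can be sketched in R as follows. This is a minimal illustration on simulated data, not the project code: the data frame `d`, its column names (`pm`, `temp`, `rh`, `outcome`, `stratum`) and all simulated values are assumptions made for the example, and `pm` stands in for the lag 0-1 particle concentration.

```r
library(splines)  # ns() for natural cubic splines

set.seed(1)

# Simulated stand-in for one city's daily series (NOT project data).
d <- data.frame(date = seq(as.Date("2006-01-01"), as.Date("2007-12-31"), by = "day"))
n <- nrow(d)
d$temp <- 15 + 10 * sin(2 * pi * (as.numeric(d$date) %% 365.25) / 365.25) + rnorm(n, sd = 3)
d$rh   <- pmin(100, pmax(20, rnorm(n, mean = 65, sd = 12)))
d$pm   <- rgamma(n, shape = 4, scale = 8)                 # hypothetical lag 0-1 PM
d$outcome <- rnbinom(n, mu = exp(4 + 0.0006 * d$pm), size = 20)

# Lag-1 temperature (the first day has no previous value).
d$temp_lag1 <- c(NA, head(d$temp, -1))

# One stratum per year / month / day-of-week combination.
lt <- as.POSIXlt(d$date)
d$stratum <- interaction(lt$year + 1900, lt$mon + 1, lt$wday, drop = TRUE)

# Over-dispersed Poisson model: strata replace the time smooth and the
# day-of-week dummies; temperature enters as ns(3 df) at lag 0 and lag 1,
# relative humidity as a linear term.
fit <- glm(outcome ~ pm + stratum + ns(temp, df = 3) + ns(temp_lag1, df = 3) + rh,
           data = d, family = quasipoisson, na.action = na.omit)

nlevels(d$stratum)       # number of year/month/day-of-week strata
summary(fit)$dispersion  # estimated over-dispersion parameter
coef(fit)["pm"]          # log relative rate per unit of PM
```

Each stratum contains the 4 or 5 days that share a weekday within one calendar month, which is what makes the indicator-variable model equivalent to the time-stratified selection of control days.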
External information on influenza epidemics or other unusual events (heat waves, strikes, etc.) will be used, if available. The data on influenza should preferably be daily counts, i.e. number of cases. When only weekly or biweekly counts of cases are available, daily values should be calculated by dividing the count by the number of days in the reporting period. If the only available information is the existence of an epidemic, then a variable taking the value 1 for epidemic days and 0 otherwise should be provided.

If other unusual events have taken place during the study period (such as a strike in the health services, flood, earthquake, heat or cold wave), a dummy variable with value 1 during the unusual event and 0 otherwise should be included in the file. Furthermore, if a city experiences a sharp reduction of the population because everyone takes holidays in the same period, additional control may be needed, especially in the analysis of the hospital admissions series.

If no influenza data are available we will use the APHEA-2 method for influenza control, including a dummy variable taking the value 1 when the 7-day moving average of respiratory mortality is greater than the 90th percentile of its city-specific distribution. In this case, since the influenza definition is based on the distribution of respiratory mortality, we will include the influenza dummy variable only when analyzing non-respiratory admissions and mortality. Under all other influenza definitions, instead, it will be included for all study outcomes, including respiratory ones.
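The APHEA-2 style influenza indicator described above can be sketched in base R as follows. The series `resp` is made up for illustration, and the use of a centred 7-day window is an assumption of this sketch, since the protocol does not state how the moving average is aligned.

```r
# Hypothetical daily respiratory deaths: background of 2 per day with a
# 10-day cluster of 10 per day (days 51 to 60). NOT project data.
resp <- c(rep(2, 50), rep(10, 10), rep(2, 40))

# 7-day moving average; a centred window is an assumption of this sketch.
ma7 <- as.numeric(stats::filter(resp, rep(1 / 7, 7), sides = 2))

# City-specific 90th percentile of the moving average.
p90 <- quantile(ma7, probs = 0.90, na.rm = TRUE)

# 1 on days above the threshold, 0 otherwise (NA at the series edges,
# where the 7-day window is incomplete).
flu <- as.integer(ma7 > p90)

which(flu == 1)  # days 51 to 60 in this example
```

The first and last three days have no complete window and remain missing; how those days are handled in the real series would need to be agreed across centres.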
c) Lag structure of exposure

According to the EPIAIR protocol we will include:

a) cubic polynomial distributed lag models and single-lag models from lag 0 to lag 6, to visually examine the lag structure of the association between PM exposures and health outcomes;

b) three cumulative lags chosen a priori to represent immediate effects (lag 0-1), delayed effects (lag 2-5) and prolonged effects (lag 0-5);

c) for each combination of exposure/outcome, the choice of one of these three lags as the reference lag, based on the meta-analytic polynomial distributed lag shape and on the meta-analytic estimates of the a priori cumulative lags. This choice identifies the lagged exposure to be used for additional analyses such as bi-pollutant models and sensitivity analyses (the same reference lag is used for all cities).

This strategy seems a good compromise between a priori definitions and flexibility of lag choice for different exposure/outcome combinations.

d) Two pollutant models

We will run two pollutant models, including the rest of the routinely measured pollutants in the particles models. Special care will be given to the inclusion of NO2 in PM2.5 models if there is high correlation (r>0.7).

e) Concentration-response analysis

We will evaluate the shape of the relationship between PM concentrations and daily health outcomes by:

1) Running threshold models, e.g. restricting the analyses to decreasing concentrations of pollutants, and estimating the associations for the so-restricted time series. This will answer the question of whether an association between different PM fractions and different health outcomes exists even at progressively lower concentrations of the pollutants;

2) Running meta-smoothing models, which consist of estimating non-linear associations between PM fractions and health outcomes within each city, and then plotting a non-linear meta-analytical curve for the pooled results.
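The a priori cumulative lags in section c) (lag 0-1, lag 2-5 and lag 0-5) are averages of the exposure over a window of previous days. A minimal base R sketch is given below; the helper names `lag_vec` and `lag_mean` are hypothetical, and leaving the average missing whenever any day in the window is missing is an assumption of this sketch rather than a rule stated in the protocol.

```r
# Shift a daily series back by k days (k = 0 returns the series unchanged).
lag_vec <- function(x, k) {
  if (k == 0) x else c(rep(NA_real_, k), head(x, -k))
}

# Average of lags `from` to `to`; NA if any lag in the window is missing.
lag_mean <- function(x, from, to) {
  lags <- sapply(from:to, function(k) lag_vec(x, k))
  rowMeans(lags)
}

# Toy series: concentrations 1, 2, ..., 10 on ten consecutive days.
pm <- as.numeric(1:10)

pm_lag01 <- lag_mean(pm, 0, 1)  # immediate effects
pm_lag25 <- lag_mean(pm, 2, 5)  # delayed effects
pm_lag05 <- lag_mean(pm, 0, 5)  # prolonged effects

# Values on day 6: 5.5, 2.5 and 3.5 respectively.
c(pm_lag01[6], pm_lag25[6], pm_lag05[6])
```

The same `lag_vec` helper gives the single-lag exposures (lag 0 to lag 6) used to examine the lag structure visually.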
APPENDIX

Results from initial exploratory analysis on the choice of seasonality control on the Rome dataset

Following discussions of the Statistical Group on the optimal analysis of epidemiological time series data, the Rome team carried out an exploratory analysis to compare the time-series and case-crossover frameworks, as well as different degrees of seasonality control under each framework. There was a consensus within the group that the two methods are essentially equivalent and differ only in the amount of seasonality control, which is the fundamental question in this type of analysis. Several methods proposed in the literature were tested and discussed in order to reach a consensus.

In brief, all analyses were conducted using over-dispersed Poisson regression models, comparing the time-series approach (smoothed splines for time trend) with the case-crossover approach with time-stratified selection of control days (indicator variables for time trend).

Data: The Rome dataset (01/01/2006 to 31/12/2010) was analyzed for the association between the daily count of deaths from natural causes (ICD-9: 001-799) among people aged 35+ and lag 0-1 concentrations of PM10 averaged from background and traffic monitors.

Methods: A multivariate adjustment model was defined including the other time-varying factors: day of the week indicator (for the time-series approach), holiday indicator, influenza indicator, high temperature (penalized spline of mean air temperature (lag 0-1) on days with mean air temperature (lag 0-1) above the city-specific median value), and low temperature (penalized spline of mean air temperature (lag 1-6) on days with mean air temperature (lag 1-6) below the city-specific median value).
Eight alternative models for control of long-term trends were implemented, the first five under the time-series approach and the last three under the case-crossover approach:

1) Model 1 "PS-PACF": Time trend adjusted for by using a penalized regression spline of time trend, with smoothing parameter and effective degrees of freedom chosen to minimize the sum of the absolute values of the PACF of residuals from lag 3 to lag 30 (with a minimum of 3 degrees of freedom per year for seasonality control), setting the number of basis functions equal to 50 for each year of data.

2) Model 2 "PS-8DF": Time trend adjusted for by using a penalized regression spline of time trend, with smoothing parameter chosen to give 8 effective degrees of freedom per year, setting the number of basis functions equal to 50 for each year of data.

3) Model 3 "PS-GCV": Time trend adjusted for by using a penalized regression spline of time trend, with effective degrees of freedom chosen to minimize the generalized cross-validation (GCV) function, setting the number of basis functions equal to 50 for each year of data.

4) Model 4 "NS-5DF": Time trend adjusted for by using natural cubic splines of time trend with 5 degrees of freedom for each year of data.

5) Model 5 "NS-8DF": Time trend adjusted for by using natural cubic splines of time trend with 8 degrees of freedom for each year of data.

6) Model 6 "CC-TS": Time trend and day of the week adjusted for by using a three-way interaction between year, month and day of the week.

7) Model 7 "ATKIN": Time trend adjusted for by using dummy variables for each month of each year plus month-specific penalized splines of day of the month.

8) Model 8 "STRICK": Time trend adjusted for by using a cubic polynomial of day of season, plus a two-way interaction between year and month (which includes the main effects), plus a two-way interaction between month and day of the week (which includes the main effects).
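The following R sketch illustrates how a model such as Model 2 (PS-8DF) might be set up, and how the quantities used to compare the models could be computed. It is an illustration under stated assumptions, not the project code: the data are simulated, the names `city`, `fit_with_sp`, `edf_time` and `percent_increase` are hypothetical, the search for the smoothing parameter with `uniroot` is one possible way to reach the target edf, deviance residuals are assumed for the PACF criterion, and the 95% confidence interval uses a normal approximation (1.96 standard errors).

```r
library(mgcv)  # gam() with penalized regression splines

set.seed(42)

# Simulated stand-in for one city's daily series (NOT project data).
nyears <- 2
n      <- 365 * nyears
city <- data.frame(
  trend = seq_len(n),
  wday  = factor((seq_len(n) - 1) %% 7),
  pm    = rgamma(n, shape = 4, scale = 8)   # hypothetical lag 0-1 PM10, ug/m3
)
mu <- exp(4 + 0.15 * cos(2 * pi * city$trend / 365) + 0.0006 * city$pm)
city$outcome <- rpois(n, mu)

# Fit the time-series model for a given smoothing parameter of the time smooth.
fit_with_sp <- function(sp) {
  gam(outcome ~ pm + s(trend, k = 50 * nyears, bs = "cr") + wday,
      data = city, family = quasipoisson, sp = sp, na.action = na.omit)
}

# Effective degrees of freedom of the (single) smooth term for time.
edf_time <- function(fit) sum(summary(fit)$edf)

# Search on the log scale for the sp that gives 8 edf per year.
target <- 8 * nyears
root <- uniroot(function(lsp) edf_time(fit_with_sp(exp(lsp))) - target,
                interval = log(c(1e-4, 1e10)), tol = 1e-6)
fit <- fit_with_sp(exp(root$root))

# PACF criterion: sum of |partial autocorrelation| of residuals, lags 3 to 30.
res      <- residuals(fit, type = "deviance")
pacf_sum <- sum(abs(pacf(res, lag.max = 30, plot = FALSE)$acf[3:30]))

# Percent increase in risk per `delta` ug/m3 with a normal-approximation 95% CI.
percent_increase <- function(beta, se, delta = 10) {
  z <- c(estimate = 0, lower = -1.96, upper = 1.96)
  100 * (exp(delta * (beta + z * se)) - 1)
}

pt <- summary(fit)$p.table
c(edf = edf_time(fit), pacf = pacf_sum)
percent_increase(pt["pm", "Estimate"], pt["pm", "Std. Error"])

# Check against the beta and SE reported for Model 1 in the Results table:
round(percent_increase(0.001169, 0.000288), 2)  # 1.18 (0.61; 1.75)
```

Searching on the log scale is convenient because the edf of a single penalized smooth decreases monotonically as the smoothing parameter grows, so a bracketing root finder is sufficient.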
The applied models were compared in terms of: 1) the GCV parameter, 2) the sum of the absolute values of the partial autocorrelations of residuals (PACF) from lag 3 to lag 30, 3) the effective degrees of freedom for time trend, and 4) the estimated association between PM10 and mortality, expressed as beta, standard error and percent increase of risk (IR) corresponding to a 10 μg/m3 variation in PM10.

Results: The results are summarised in the following table.

MODEL    NAME     EDF (time)  GCV   PACF  BETA      SE        IR (%)  95% CI
Model 1  PS-PACF  15          1.16  0.56  0.001169  0.000288  1.18    (0.61; 1.75)
Model 2  PS-8DF   40          1.14  0.85  0.000648  0.000305  0.65    (0.05; 1.25)
Model 3  PS-GCV   84.76       1.13  1.48  0.000652  0.000325  0.65    (0.02; 1.30)
Model 4  NS-5DF   25          1.15  0.63  0.000568  0.000306  0.57    (-0.03; 1.18)
Model 5  NS-8DF   40          1.15  0.82  0.000544  0.000310  0.55    (-0.06; 1.16)
Model 6  CC-TS    -           1.47  1.34  0.000559  0.000340  0.56    (-0.11; 1.23)
Model 7  ATKIN    -           1.15  0.73  0.000437  0.000321  0.44    (-0.19; 1.07)
Model 8  STRICK   -           1.94  0.81  0.000531  0.000320  0.53    (-0.10; 1.16)

Conclusions: We decided to apply Model 2, within the time-series framework, as the main analysis model, and to conduct a rather limited sensitivity analysis by applying a) Model 6, to accommodate the case-crossover design, and b) a model with PS with 3 df per year, to reflect the application of the PACF criterion, which tends to choose a small number of degrees of freedom for seasonality control.

Fixed df were preferred over the PACF criterion for the time-series approach for two reasons. First, a fixed number of df represents an average and rather conservative estimate. Second, the PACF criterion allows more city-specific flexibility in long-term control, which would be favoured only if the exploration of potential heterogeneity were among the main aims of the project. Furthermore, the limited number of cities involved in MED-PARTICLES limits the exploration of potential heterogeneity in the second stage of the analysis (during application of meta-analysis techniques).
Finally, the PS smoother was favoured over NS because PS offers greater flexibility. The time-series framework was preferred over the case-crossover framework because in some cities only aggregated data are available, which diminishes the advantage of the case-crossover design for exploring interactions.

References:

Atkinson RW, et al. Urban ambient particle metrics and health: a time-series analysis. Epidemiology. 2010 Jul;21(4):501-11.

Strickland MJ, et al. Short-term associations between ambient air pollutants and paediatric asthma emergency department visits. Am J Respir Crit Care Med. 2010 Aug 1;182(3):307-16.