Examining Associations Between the Built Environment and Health


eAppendix for:

Distributed Lag Models

Examining Associations Between the Built Environment and Health

Jonggyu Baek, Brisa N. Sánchez, Veronica J. Berrocal, and Emma V. Sanchez-Vaznaugh

SIMULATION

We performed a small-scale simulation study to improve our understanding of estimation and inference of associations of interest in DLMs. In particular, we wanted to assess how estimates of the associations of interest are affected by the degree of spatial correlation in the built environment and by the shape of the association between measured environment factors and an outcome across distance. Further, we compared results obtained from DLMs to those obtained from traditional approaches based on linear models, whose goal is to estimate the average association between features of the built environment and an outcome up to a priori specified distances.

For our simulations, we used as spatial domain the square $(0, 500) \times (0, 500)$. In the square, we simulated food store locations (e.g., features of the built environment) by simulating realizations from an inhomogeneous Poisson point process. The intensity of the inhomogeneous Poisson process was taken to be a realization of a log Gaussian process with mean $\mu_x$, marginal variance $\sigma_x^2$, and exponential correlation function. In other words, the correlation between the log intensity at two points on the $500 \times 500$ grid is given by $\exp(-d/\phi)$, where $d$ is the distance between two points and $\phi$ is the decay parameter, i.e., the rate at which the correlation decays.

We considered three scenarios for the spatial variability of the intensity function: 1) the marginal variance of the intensity function $\sigma_x^2$ is set equal to 0; this implies that the intensity is constant over space and store locations are realizations of a homogeneous Poisson point process with intensity equal to $\log(\mu_x)$; 2) $\sigma_x^2 = 1$ and $\phi = 5/3$; this corresponds to an intensity with a spatial correlation that is equal to 0.05 when the distance between two points is equal to 5 units, resulting in sampled food stores that display a small amount of clustering; and 3) $\sigma_x^2 = 1$ and $\phi = 20/3$; this corresponds to an intensity function with a correlation that decays to 0.05 at a distance of 20 units, resulting in sampled food stores that display a large amount of clustering. In each case, the mean of the log Gaussian process used to simulate the intensity of the inhomogeneous Poisson process was taken to be equal to 0.15 (see Figure 2 in the manuscript).
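The clustering scenarios can be illustrated with a short Python sketch. This is not the authors' code: it discretizes a small square into cells, draws a Gaussian field with exponential correlation through a Cholesky factor, and treats the exponentiated field as a piecewise-constant intensity. The function names, the reduced domain, the cell size, and the choice to exponentiate the field (so that `mu` is the mean of the log intensity) are all assumptions made for illustration.

```python
import numpy as np

def exp_corr(d, phi):
    """Exponential correlation exp(-d / phi) at distance d."""
    return np.exp(-np.asarray(d, dtype=float) / phi)

def simulate_stores(side=50.0, cell=2.5, mu=0.15, sigma2=1.0, phi=5 / 3, seed=0):
    """Draw store locations from a discretized log-Gaussian Cox process (illustrative)."""
    rng = np.random.default_rng(seed)
    m = int(round(side / cell))
    centers = (np.arange(m) + 0.5) * cell
    gx, gy = np.meshgrid(centers, centers)
    pts = np.column_stack([gx.ravel(), gy.ravel()])
    # Pairwise distances between cell centers and the implied covariance matrix.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    cov = sigma2 * exp_corr(d, phi) + 1e-10 * np.eye(len(pts))  # jitter for stability
    log_intensity = mu + np.linalg.cholesky(cov) @ rng.standard_normal(len(pts))
    # Assumption: intensity = exp(field); expected count per cell = intensity * cell area.
    counts = rng.poisson(np.exp(log_intensity) * cell**2)
    # Scatter each cell's stores uniformly inside that cell.
    base = np.repeat(pts, counts, axis=0)
    return base + rng.uniform(-cell / 2, cell / 2, size=base.shape)
```

Calling `exp_corr(5, 5 / 3)` and `exp_corr(20, 20 / 3)` both give $e^{-3} \approx 0.05$, which is the sense in which the two decay parameters correspond to spatial ranges of 5 and 20 units. Setting `sigma2=0` yields a constant intensity, i.e., the homogeneous case.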

For each of the three built environment settings, we simulated one realization of the built environment; however, given a realization of the built environment, we simulated 1000 datasets with different locations for the health outcomes (e.g., the schools in our motivating application) and different outcome values (e.g., children's BMIz at the various schools).

To simulate schools' locations within the $(0, 500) \times (0, 500)$ region, we proceeded as follows: we sampled $n \in \{1{,}000, 6{,}000\}$ schools' $(x_i, y_i)$ coordinates from a Uniform(0, 500) distribution, for $i = 1, \ldots, n$. After counting the number of locations in the built environment around each outcome location, we obtained, for each location $i$, the counts $X_i(r_{l-1}; r_l)$. We then generated values of the outcome $Y_i$ by sampling from the model

$$Y_i = \sum_{l=1}^{L} \beta(r_l) X_i(r_{l-1}; r_l) + \epsilon_i,$$

where $r_0 = 0$, $r_L = 10$, $L = 100$, and $\epsilon_i \sim N(0, \tau^2)$. We used two function shapes for $\beta(r)$: 1) a step function given by $\beta(r) = 0.1$ if $r \le 5$ and 0 otherwise, which results in the true data generating model $Y_i = 0.1 X_i(0; 5) + \epsilon_i$ (Figure 3A in the manuscript), and 2) a smooth function $\beta(r) = 0.1 f_N(r)/f_N(0)$, where $f_N(r)$ is a normal density function with mean 0 and standard deviation $5/3$ (Figure 3B in the manuscript).

Note that in the traditional models used to study the effect of the built environment on health, the tacit assumption is that the effect of the environment on health can be described by a step function of distance; in other words, the association $\beta(r_{l-1}; r_l)$ is deemed constant up to a specified distance $r_k$ but is zero beyond $r_k$. A step function $\beta(r)$ is likely unrealistic since it is hard to believe that the association abruptly vanishes beyond distance 5, yet this assumption is frequently (implicitly) made in practice. In contrast, the second function used for $\beta(r)$ implies that the association decays smoothly with distance and is near zero by distance 5. We chose the variance $\tau^2$ of the error term so that the model $R^2$ was equal to either 0.2, 0.5 or 0.8 for the three different built environment schemes. In our motivating example the number of available schools is near 6,000, and the model $R^2$ was near 0.2 when the DLM was fitted without adjustment of confounders.
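The data-generating steps above can be sketched in Python as follows. The helper names are hypothetical, and the rule used to pick $\tau^2$ for a target $R^2$ (treating $R^2$ as the share of outcome variance explained by the signal) is an assumption, since the text does not spell out how the error variance was calibrated.

```python
import numpy as np

def beta_step(r):
    """Step function: 0.1 up to distance 5, zero beyond."""
    return np.where(np.asarray(r, dtype=float) <= 5.0, 0.1, 0.0)

def beta_smooth(r, sd=5 / 3):
    """0.1 * f_N(r) / f_N(0), which simplifies to 0.1 * exp(-r^2 / (2 sd^2))."""
    return 0.1 * np.exp(-np.asarray(r, dtype=float) ** 2 / (2 * sd**2))

def ring_counts(schools, stores, r_max=10.0, n_lags=100):
    """Count stores in each ring (r_{l-1}, r_l] around every school."""
    edges = np.linspace(0.0, r_max, n_lags + 1)
    d = np.linalg.norm(schools[:, None, :] - stores[None, :, :], axis=2)
    X = np.stack([np.histogram(row, bins=edges)[0] for row in d])
    return X, edges[1:]          # counts and the outer radii r_1, ..., r_L

def simulate_outcome(X, beta, r2, rng):
    """Generate Y = X beta + eps with tau^2 chosen to hit a target R^2."""
    signal = X @ beta
    # Assumption: R^2 = var(signal) / (var(signal) + tau^2).
    tau2 = signal.var() * (1.0 - r2) / r2
    return signal + rng.normal(0.0, np.sqrt(tau2), size=len(signal)), tau2
```

For example, `X, r = ring_counts(schools, stores)` followed by `simulate_outcome(X, beta_smooth(r), 0.2, rng)` produces one simulated dataset under the smooth association.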

In fitting DLMs, we chose 100 lags, $L = 100$, with $r_L = 10$. We fitted the model within a Bayesian framework and specified the following prior distributions: $\beta_0 \propto 1$, $\boldsymbol{\alpha} \propto \mathbf{1}$, $\boldsymbol{b}_1 \sim N(0, \sigma_b^2 \boldsymbol{I}_{L-2})$, $\sigma_b^2 \sim IG(0.1, 1 \times 10^{-6})$, and $\tau^2 \sim IG(0.1, 1 \times 10^{-6})$. Details on posterior inference and the MCMC algorithm are provided in the next section. For comparison, we also fitted the traditional linear model, $Y_i = \beta_0 + \beta_1 X_i(0; r_k) + \epsilon_i$, which assumes a constant effect of the built environment up to a distance $r_k$. We used $r_k = 2.5$, 5, and 7.5, respectively, and compared the estimate of $\beta_1$ with the estimate of $\bar{\beta}(0; r_k)$ obtained from the DLM for these three distances.

To examine how well DLMs capture true buffer effects at given distance lags, bias, variance, mean squared error (MSE), and coverage rate were calculated at each $r_l$, $l = 1, 2, \ldots, L$, using the formulas:

$$\mathrm{Bias}(r_l) = \sum_{i=1}^{1000} \left( \hat{\beta}_i(r_{l-1}; r_l) - \beta(r_{l-1}; r_l) \right) / 1000,$$

$$\mathrm{Var}(r_l) = \sum_{i=1}^{1000} \mathrm{Var}\left( \hat{\beta}_i(r_{l-1}; r_l) \right) / 1000,$$

$$\mathrm{MSE}(r_l) = \sum_{i=1}^{1000} \left( \hat{\beta}_i(r_{l-1}; r_l) - \beta(r_{l-1}; r_l) \right)^2 / 1000,$$

$$\mathrm{Coverage}(r_l) = \sum_{i=1}^{1000} I\left( \hat{\beta}_{i,2.5\%}(r_l) \le \beta(r_l) \le \hat{\beta}_{i,97.5\%}(r_l) \right) / 1000.$$
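As a minimal sketch, the four pointwise summaries can be computed from arrays holding one row per simulated dataset and one column per lag. The function name and argument layout are assumptions for illustration; `post_var` is taken to hold each dataset's variance estimate of the coefficient, and `lower` and `upper` the 2.5% and 97.5% interval limits.

```python
import numpy as np

def pointwise_summaries(est, post_var, lower, upper, truth):
    """Bias, variance, MSE and coverage per lag.

    est, post_var, lower, upper: arrays of shape (n_datasets, n_lags).
    truth: array of shape (n_lags,) with the true coefficients.
    """
    err = est - truth
    return {
        "bias": err.mean(axis=0),
        "var": post_var.mean(axis=0),
        "mse": (err**2).mean(axis=0),
        "coverage": ((lower <= truth) & (truth <= upper)).mean(axis=0),
    }
```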

To summarize their overall performance and compare DLMs with classical regression models, we calculated the integrated MSE, $\mathrm{IMSE} = \frac{r_L}{L} \sum_{l=1}^{L} \mathrm{MSE}(r_l)$, for both models. In evaluating IMSE for the classical regression models, $\hat{\beta}(r_{l-1}; r_l)$ was set equal to $\hat{\beta}_1$ for $r_l \le r_k$ and zero otherwise.
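A small Python sketch of the IMSE calculation follows, together with the step-shaped coefficient function implied by a traditional model. The function names and the small numerical tolerance used when comparing radii are assumptions for illustration.

```python
import numpy as np

def imse(mse, r_max=10.0):
    """Integrated MSE: (r_L / L) * sum of the pointwise MSE values."""
    mse = np.asarray(mse, dtype=float)
    return r_max / len(mse) * mse.sum()

def traditional_coefficient_function(beta1_hat, r_k, r):
    """Coefficient function implied by a traditional model with buffer r_k:
    beta1_hat for r_l <= r_k and zero otherwise (tolerance guards rounding)."""
    r = np.asarray(r, dtype=float)
    return np.where(r <= r_k + 1e-9, beta1_hat, 0.0)
```

With this convention, the pointwise MSE of a traditional model is computed from `traditional_coefficient_function(...)` in the same way as for the DLM estimates, and both are then passed to `imse`.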

When the true DL coefficient function $\beta(r)$ is a step function, bias occurs around distance lags where the step happens (eFigure 2A and 2B). Since the fitted DLM assumes that the buffer effect is a continuous function of distance, bias at those lags is expected, and that results in low coverage rates as well. When $\beta(r)$ varies continuously in $r$ (eFigure 2C and 2D), much less bias is present, and the bias primarily occurs at the smallest lags because the estimated buffer effects are smoother than the true $\beta(r)$. Some degree of over-smoothing is expected to occur when using random effect variances (vs GCV) to compute smoothing parameters.[1] Also, at the first few lags, there is a relatively smaller amount of information since many DL covariates $X_i(r_{l-1}; r_l)$ in the first few lags have many zero values. Hence bias at the smallest lags is expected.

Additionally, when the degree of clustering in the built environment becomes large, the range of lags at which bias occurs becomes wider and coverage rates tend to be smaller.

For both functions $\beta(r)$, the variance of the estimated buffer effects is larger at the first few distance lags due to less information in the DL covariates, as previously explained. Note also that the variance of the estimated coefficients at both end points ($r_l = 0.1$ and 10) tends to be larger than for other values of $r_l$ because at the end points the coefficients are constrained only in one direction. The estimated buffer effects are more variable when the spatial dependence in the intensity function controlling the spatial distribution of the built environment features decays at a slower rate. This can be anticipated because the amount of independent contributions of built environment covariates $X_i(r_{l-1}; r_l)$ is decreased (compare panels eFigure 2A vs 2B, and 2C vs 2D). The MSE is primarily dominated by bias since the variance is fairly constant across a range of distances, except at the endpoints, as mentioned above.

The comparison of the estimated average association up to distance $r_k$, with $r_k = 2.5$, 5, and 7.5, obtained from DLMs and traditional linear models is reported in Table 1 of the manuscript. The true average association up to distance $r_k$, $\bar{\beta}(0; r_k)$, is calculated using

$$\bar{\beta}(0; r_k) = \sum_{l=1}^{k} \beta(r_{l-1}; r_l)\, \pi (r_l^2 - r_{l-1}^2) \,/\, \pi r_k^2.$$

When locations of food stores are generated from a homogeneous Poisson point process, the estimated associations from the traditional linear models are very close to the true values and their coverage rates are close to 95% (i.e., valid inference) for both functions used for $\beta(r)$. However, if there is clustering of locations in the built environment, the estimated associations from the traditional models are positively biased (away from the null), giving invalid inference unless the model is correctly specified (i.e., when $\beta(r)$ is the step function with $r_k = 5$). In particular, when $r_k = 2.5$, a large amount of bias occurs in the traditional models due to the failure to adjust for the effects at longer lags. When the buffer size selected was greater than the true buffer size in traditional models (e.g., $r_k = 7.5$), the amount of bias in estimates was smaller; however, standard errors of the estimated coefficients were grossly underestimated, yielding invalid inference (e.g., very low coverage). Note that when negative and positive biases cancel up to specified distances in the fitted DLMs, bias in estimating the average buffer effect is close to zero (eFigure 2). In general, compared to the traditional regression models, estimated average buffer effects obtained using DLMs performed better, having much less bias and better coverage rates, except when the fitted traditional models coincide with the true data generating models.
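The area-weighted average above is easy to evaluate numerically. The Python sketch below assumes the simulation grid described earlier ($L = 100$ rings up to $r_L = 10$) and evaluates $\beta$ at the outer radius of each ring; the function names are hypothetical.

```python
import numpy as np

def beta_step(r):
    # Small tolerance guards against rounding at the boundary r = 5.
    return np.where(np.asarray(r, dtype=float) <= 5.0 + 1e-9, 0.1, 0.0)

def beta_smooth(r, sd=5 / 3):
    return 0.1 * np.exp(-np.asarray(r, dtype=float) ** 2 / (2 * sd**2))

def true_average_association(beta, r_k, r_max=10.0, n_lags=100):
    """Area-weighted average of beta over the rings inside the buffer of radius r_k."""
    edges = np.linspace(0.0, r_max, n_lags + 1)
    k = int(round(r_k * n_lags / r_max))          # number of rings inside the buffer
    outer, inner = edges[1:k + 1], edges[:k]
    ring_area = np.pi * (outer**2 - inner**2)
    return float(np.sum(beta(outer) * ring_area) / (np.pi * edges[k] ** 2))
```

For the step function the average equals 0.1 for any $r_k \le 5$ and shrinks by the ratio of areas beyond that, e.g., $0.1 \times 25 / 56.25$ at $r_k = 7.5$. For the smooth function the sketch gives values close to the true values 0.058, 0.021, and 0.010 listed in eTable 3 for $r_k = 2.5$, 5, and 7.5.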

Since both the traditional regression models and the DLMs have some degree of bias, we summarize their relative performance in terms of integrated mean squared error (IMSE) up to distance $r_L = 10$ (eTable 1).[1] When the true form of the $\beta(r)$ function is the step function, the IMSE was minimum for the traditional regression models using the a priori distance lag $r_k = 5$, which is not surprising since the estimated model is the data generating model. However, when $\beta(r)$ decays with distance $r$, the DLMs consistently yield the smallest IMSE.

To conserve space and avoid redundancy, we report here only the results for the simulation setting with $n = 6000$ and $R^2 = 0.2$, since this scenario corresponds to the data in our motivating example. For the smaller sample size, bias and coverage rates of the DLM estimates deteriorate, and the strong confounding bias in the traditional regression models persists. The bias in the DLM is largely attenuated when the model $R^2$ increases, but this does not happen for the traditional regression models.

To further examine assumptions used by the fitted DLMs we conducted additional simulations: 1) we specified different numbers of lags, i.e., $L = 25, 50, 200$, to define ring-shaped areas that differ from the ones ($L = 100$) used in the data generating model; and 2) we assumed different maximum distances $r_L = 3, 20$. As expected, using a smaller number of lags in DLMs ($L = 25$) resulted in smoother estimated DL coefficients because the DL coefficients are estimated in wider ring-shaped areas and thus become coarser. A larger number of lags ($L = 200$) yielded results similar to those with $L = 100$. When the maximum distance was misspecified and $r_L = 3$, we observed bias in the DL coefficients when there is clustering of locations in the built environment. However, the amount of bias in estimates of the average buffer effect at $r_k = 2.5$ was less than that from traditional regression models. Results were consistent with those for $r_L = 10$ when the maximum lag distance used to fit the DLMs was equal to 20.

ESTIMATION

We constrain the coefficients $\beta(r_{l-1}; r_l)$ to vary as a smooth function of distance $r_l$, $l = 1, 2, \ldots, L$, by using splines.[2,3] This ensures that coefficients corresponding to adjacent areas are similar, as we would not typically expect associations to change abruptly across distance. It also alleviates possible numerical problems that may arise when many locations have zero food stores between two given radii $r_{l-1}$ and $r_l$. In particular, we model the association coefficients $\beta(r_{l-1}; r_l)$ using a radial basis function; that is

$$\beta(r_{l-1}; r_l) = \alpha_0 + \alpha_1 r_l + \sum_{k=1}^{L} \tilde{\alpha}_k |r_l - r_k|^3. \qquad \text{(1a)}$$

In a matrix form, (1a) can be written as $\boldsymbol{\beta} = \boldsymbol{C}_0 \boldsymbol{\alpha} + \boldsymbol{C}_1 \tilde{\boldsymbol{\alpha}}$, where

$$\boldsymbol{C}_0 = \begin{bmatrix} 1 & r_1 \\ \vdots & \vdots \\ 1 & r_L \end{bmatrix}, \quad \boldsymbol{C}_1 = \left[ |r_l - r_k|^3 \right]_{1 \le l,k \le L}, \quad \boldsymbol{\alpha} = \begin{bmatrix} \alpha_0 \\ \alpha_1 \end{bmatrix}, \quad \text{and} \quad \tilde{\boldsymbol{\alpha}} = (\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L)^T.$$

The coefficients $\tilde{\boldsymbol{\alpha}}$ are penalized so the squared second derivative of the estimated DL coefficient function is penalized. The objective is to minimize

$$\| \boldsymbol{Y} - \boldsymbol{1}_n \beta_0 - \boldsymbol{X}\boldsymbol{\beta} \|^2 + \lambda \tilde{\boldsymbol{\alpha}}^T \boldsymbol{C}_1 \tilde{\boldsymbol{\alpha}} = \| \boldsymbol{Y} - \boldsymbol{1}_n \beta_0 - \boldsymbol{X}(\boldsymbol{C}_0 \boldsymbol{\alpha} + \boldsymbol{C}_1 \tilde{\boldsymbol{\alpha}}) \|^2 + \lambda \tilde{\boldsymbol{\alpha}}^T \boldsymbol{C}_1 \tilde{\boldsymbol{\alpha}}$$

subject to the constraint $\boldsymbol{C}_0^T \tilde{\boldsymbol{\alpha}} = \boldsymbol{0}$. The latter constraint implies that there are really $L$ free parameters rather than the $L + 2$ implied from the columns of $\boldsymbol{C}_0$ and $\boldsymbol{C}_1$.[1,4]

As is well known, the optimization problem can be re-written as a mixed model by redefining $\tilde{\boldsymbol{\alpha}} = \boldsymbol{M}_1 \boldsymbol{a}_1$, where $\boldsymbol{M}_1$ is an $L \times (L - 2)$ matrix orthogonal to $\boldsymbol{C}_0$. $\boldsymbol{M}_1$ can be determined using the QR decomposition $\boldsymbol{C}_0 = \boldsymbol{Q}_c \boldsymbol{R}_c$ and setting $\boldsymbol{M}_1$ as the 3rd to last columns of $\boldsymbol{Q}_c$.[4] Further, finding $\boldsymbol{M}_2^{1/2}$ that satisfies $\boldsymbol{M}_2^{1/2} \boldsymbol{M}_2^{1/2} = \boldsymbol{M}_2 = \boldsymbol{M}_1^T \boldsymbol{C}_1 \boldsymbol{M}_1$, defining $\boldsymbol{b}_1 = \boldsymbol{M}_2^{1/2} \boldsymbol{a}_1$, and re-structuring the data through the transformations $\boldsymbol{X}^* = \boldsymbol{X}\boldsymbol{C}_0$ and $\boldsymbol{Z}^* = \boldsymbol{X}\boldsymbol{C}_1 \boldsymbol{M}_1 \boldsymbol{M}_2^{-1/2}$, the mixed model becomes

$$\boldsymbol{Y} = \beta_0 \boldsymbol{1}_n + \boldsymbol{X}^* \boldsymbol{\alpha} + \boldsymbol{Z}^* \boldsymbol{b}_1 + \boldsymbol{\epsilon},$$

where $\boldsymbol{b}_1 \sim N_{L-2}(\boldsymbol{0}, \sigma_b^2 \boldsymbol{I})$ and $\boldsymbol{\epsilon} \sim N_n(\boldsymbol{0}, \tau^2 \boldsymbol{I})$. The smoothing parameter is $\lambda = \tau^2 / \sigma_b^2$.
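The reparameterization can be sketched with NumPy as below. This is an illustrative construction under the definitions in this section, not the authors' implementation; the function name is hypothetical, and the symmetric square root of $\boldsymbol{M}_2$ is obtained from an eigendecomposition, which is one of several valid choices.

```python
import numpy as np

def dlm_design(X, r):
    """Build X* = X C0, Z* = X C1 M1 M2^{-1/2} and Omega = [C0, C1 M1 M2^{-1/2}].

    X: (n, L) matrix of ring counts; r: (L,) outer radii r_1, ..., r_L.
    """
    L = len(r)
    C0 = np.column_stack([np.ones(L), r])            # intercept and linear term
    C1 = np.abs(r[:, None] - r[None, :]) ** 3        # cubic radial basis
    Q, _ = np.linalg.qr(C0, mode="complete")         # full L x L orthogonal factor
    M1 = Q[:, 2:]                                    # 3rd to last columns, orthogonal to C0
    M2 = M1.T @ C1 @ M1
    evals, evecs = np.linalg.eigh((M2 + M2.T) / 2)   # symmetrize before decomposing
    M2_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    B = C1 @ M1 @ M2_inv_half
    Omega = np.column_stack([C0, B])
    return X @ C0, X @ B, Omega
```

The returned `X_star` (two columns) and `Z_star` ($L - 2$ columns) can be passed to mixed model software as fixed-effect and random-effect design matrices, and `Omega` maps the stacked coefficients $(\boldsymbol{\alpha}, \boldsymbol{b}_1)$ back to the DL coefficients.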

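The mixed model can be fitted with packaged software for mixed models in the frequentist framework. Once we have the estimates from the fitted regression, the estimates of the DL coefficients can be obtained as

$$\hat{\boldsymbol{\beta}} = \boldsymbol{\Omega} \begin{bmatrix} \hat{\boldsymbol{\alpha}} \\ \hat{\boldsymbol{b}}_1 \end{bmatrix} \quad \text{and} \quad \mathrm{Cov}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\Omega}\, \mathrm{Cov}\left( \begin{bmatrix} \hat{\boldsymbol{\alpha}} \\ \hat{\boldsymbol{b}}_1 \end{bmatrix} \right) \boldsymbol{\Omega}^T, \quad \text{where} \quad \boldsymbol{\Omega} = \begin{bmatrix} \boldsymbol{C}_0 & \boldsymbol{C}_1 \boldsymbol{M}_1 \boldsymbol{M}_2^{-1/2} \end{bmatrix}.$$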
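As a brief sketch, the back-transformation is a linear map applied to the stacked estimates of $(\boldsymbol{\alpha}, \boldsymbol{b}_1)$ and their covariance matrix. The function name is hypothetical, and the inputs are assumed to come from whatever mixed model software was used for fitting.

```python
import numpy as np

def dl_coefficients(Omega, theta_hat, cov_theta):
    """Map estimates of (alpha, b1) to DL coefficients and their covariance.

    Omega: (L, L) transformation matrix; theta_hat: (L,) stacked estimates;
    cov_theta: (L, L) covariance matrix of the stacked estimates.
    """
    beta_hat = Omega @ theta_hat
    cov_beta = Omega @ cov_theta @ Omega.T
    se_beta = np.sqrt(np.diag(cov_beta))     # pointwise standard errors
    return beta_hat, cov_beta, se_beta
```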

Alternatively, the model can be estimated in the Bayesian framework. With prior distributions $\beta_0 \propto 1$, $\boldsymbol{\alpha} \propto \mathbf{1}$, $\boldsymbol{b}_1 \sim N(0, \sigma_b^2 \boldsymbol{I}_{L-2})$, $\sigma_b^2 \sim IG(a_\sigma, b_\sigma)$, and $\tau^2 \sim IG(a_\tau, b_\tau)$, the full conditionals are all available in closed form. Let

$$\boldsymbol{D}^* = [\boldsymbol{1}_n \;\; \boldsymbol{X}^* \;\; \boldsymbol{Z}^*] = [\boldsymbol{1}_n \;\; \boldsymbol{X}\boldsymbol{C}_0 \;\; \boldsymbol{X}\boldsymbol{C}_1 \boldsymbol{M}_1 \boldsymbol{M}_2^{-1/2}];$$

then the full conditional for $(\beta_0, \boldsymbol{\alpha}, \boldsymbol{b}_1)$ is

$$p(\beta_0, \boldsymbol{\alpha}, \boldsymbol{b}_1 \mid \cdot) = N(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\Sigma} = (\boldsymbol{D}^{*T} \boldsymbol{D}^* / \tau^2 + \sigma_b^{-2} \boldsymbol{G})^{-1}$, $\boldsymbol{G} = \mathrm{diag}\{\boldsymbol{0}_3, \boldsymbol{1}_{L-2}\}$, and $\boldsymbol{\mu} = \boldsymbol{\Sigma} \boldsymbol{D}^{*T} \boldsymbol{Y} / \tau^2$. The full conditional distribution for $\sigma_b^2$ is

$$p(\sigma_b^2 \mid \cdot) = IG\left(a_\sigma + (L - 2)/2,\; b_\sigma + \boldsymbol{b}_1^T \boldsymbol{b}_1 / 2\right),$$

while the full conditional distribution of $\tau^2$ is

$$p(\tau^2 \mid \cdot) = IG\left(a_\tau + n/2,\; b_\tau + \boldsymbol{r}^T \boldsymbol{r} / 2\right),$$

where $\boldsymbol{r} = \boldsymbol{Y} - \boldsymbol{D}^* (\beta_0, \boldsymbol{\alpha}^T, \boldsymbol{b}_1^T)^T$. Inference for the DL coefficients $\boldsymbol{\beta}$ is obtained by transforming posterior samples of $(\boldsymbol{\alpha}, \boldsymbol{b}_1)$ by $\boldsymbol{\Omega} [\boldsymbol{\alpha}^T, \boldsymbol{b}_1^T]^T$, with $\boldsymbol{\Omega}$ as described above. Inference for average lag effects, $\bar{\beta}(0; r_k)$, can be easily determined from posterior samples.
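A Gibbs sampler built from these full conditionals might look like the following Python sketch. It is a simplified illustration rather than the authors' MCMC code: the function name, starting values, and number of iterations are assumptions, and the inverse-gamma draws use the shape and scale parameterization, so each is obtained as the reciprocal of a gamma draw with scale equal to the reciprocal of the inverse-gamma scale.

```python
import numpy as np

def gibbs_dlm(Y, D, n_fixed=3, a_sig=0.1, b_sig=1e-6, a_tau=0.1, b_tau=1e-6,
              n_iter=1000, seed=0):
    """Gibbs sampler for Y = D theta + eps, theta = (beta0, alpha, b1).

    D: (n, p) design [1_n, X*, Z*]; the first n_fixed columns get flat priors,
    the remaining columns (b1) get N(0, sig2 I) priors.
    """
    rng = np.random.default_rng(seed)
    n, p = D.shape
    q = p - n_fixed
    g = np.r_[np.zeros(n_fixed), np.ones(q)]       # diagonal of G
    DtD, DtY = D.T @ D, D.T @ Y
    sig2, tau2 = 1.0, 1.0                          # arbitrary starting values
    theta_draws = np.empty((n_iter, p))
    var_draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # theta | rest ~ N(mu, Sigma), with Sigma^{-1} = D'D / tau2 + G / sig2.
        prec = DtD / tau2 + np.diag(g / sig2)
        chol = np.linalg.cholesky(prec)
        mu = np.linalg.solve(prec, DtY / tau2)
        theta = mu + np.linalg.solve(chol.T, rng.standard_normal(p))
        b1 = theta[n_fixed:]
        resid = Y - D @ theta
        # Inverse-gamma draws via reciprocals of gamma draws.
        sig2 = 1.0 / rng.gamma(a_sig + q / 2, 1.0 / (b_sig + b1 @ b1 / 2))
        tau2 = 1.0 / rng.gamma(a_tau + n / 2, 1.0 / (b_tau + resid @ resid / 2))
        theta_draws[t] = theta
        var_draws[t] = sig2, tau2
    return theta_draws, var_draws
```

Posterior draws of the DL coefficients then follow by multiplying the last $L$ columns of `theta_draws` (everything except the intercept) by the transpose of $\boldsymbol{\Omega}$, and draws of an average lag effect are the corresponding area-weighted combinations of those coefficients.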

References

1. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; 2003.

2. Hastie TJ, Tibshirani RJ. Generalized Additive Models. CRC Press; 1990.

3. Zanobetti A, Wand MP, Schwartz J, Ryan LM. Generalized additive distributed lag models: quantifying. Biostatistics. 2000;1(3):279-292.

4. Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press; 1993.

5. Wood S. Generalized Additive Models: An Introduction with R. CRC Press; 2006.

eFigure 1. The estimated DL coefficients up to 7 miles from schools in the student characteristics adjusted DLM. Food environment associations are only for (A) boys or (B) girls, and (C) the difference of association by sex; associations are only for (D) non-Hispanic Whites or (E) Hispanics, and (F) the difference of association by race/ethnicity; associations are only for (G) 5th grade children or (H) 7th grade children, and (I) the difference of association by grade.

eFigure 2. Bias, variance, MSE, and coverage rate at each $r_l$, $l = 1, 2, \ldots, 100$, for the cases when $\beta(r)$ is: (A) the step function under the built environment without clustering; (B) the step function under the built environment with a large amount of clustering; (C) the normal pdf under the built environment without clustering; (D) the normal pdf under the built environment with a large amount of clustering. Reported results are from a simulation case with $n = 6{,}000$ and $R^2 = 0.2$.

eTable 1. Integrated MSE from fitted traditional linear models with distance lag $r_k$ = 2.5, 5, and 7.5 and from fitted DLMs with a maximum distance $r_L$ = 10. Reported results are from a simulation case with $n = 6{,}000$ and $R^2 = 0.2$.

| $\beta(r)$ | Spatial range in the built environment | Traditional linear model ($r_k$ = 2.5) IMSE* | Traditional linear model ($r_k$ = 5) IMSE* | Traditional linear model ($r_k$ = 7.5) IMSE* | DLM IMSE* |
|---|---|---|---|---|---|
| Step | Independence | 2.512 | 0.004 | 2.055 | 0.125 |
| Step | 5 | 7.500 | 0.004 | 1.861 | 0.269 |
| Step | 20 | 12.883 | 0.003 | 1.945 | 0.360 |
| Curve | Independence | 0.205 | 0.790 | 1.113 | 0.006 |
| Curve | 5 | 0.169 | 0.743 | 1.060 | 0.010 |
| Curve | 20 | 0.179 | 0.767 | 1.083 | 0.013 |

* IMSE is multiplied by 100.

eTable 2. Simulation findings regarding the use of DIC for model selection. For each of 1000 datasets simulated for each scenario, we calculated DIC for the pre-specified distances in Table 1 of the manuscript and the DLM. Because selection of traditional vs DL model may depend on the a priori specified distance, as a more comprehensive way to select the best traditional models, we also computed DIC for $L$ traditional models using buffer sizes $r_k \in \{r_1, \ldots, r_L\}$. The distance $r_{(\min \mathrm{DIC})}$ that gave the traditional model with minimum DIC was selected as the "best" buffer size. The bias in coefficients for the pre-specified buffer sizes is given in Table 1. Percent bias in the coefficient from the "best" traditional model, $\hat{\theta}_{1, r_{(\min \mathrm{DIC})}}$, compared to $\bar{\beta}(0, r_{(\min \mathrm{DIC})})$, was calculated, as well as percent bias in the cumulative effect up to $r_{(\min \mathrm{DIC})}$ computed from the DLM.

The minimum DIC value among traditional models was compared with the DIC from the DLM. DIC selected the model that generated the data in almost all cases, except when the curve (Figure 3B in the manuscript) was used to generate data and there was lower power (e.g., n = 1000). However, in these cases, the estimates from even the best traditional model (even when its DIC is lower compared to the DLM) remain more biased. While DIC may select models that fit better, it may not select models that give unbiased effect estimates.

| True $\beta(r)$ | Spatial range in the built environment | N | Mean DIC from DLM | Mean DIC from traditional model with buffer size 2.5 / 5 / 7.5 / $r_{(\min \mathrm{DIC})}$ | P(DLM is selected) via DIC | Mean of $r_{(\min \mathrm{DIC})}$ chosen via DIC in traditional model | Mean %Bias in $\hat{\theta}_{1, r_{(\min \mathrm{DIC})}}$ from traditional model | Mean %Bias in $\hat{\bar{\beta}}(0, r_{(\min \mathrm{DIC})})$ from DLM |
|---|---|---|---|---|---|---|---|---|
| Step | 0 | 1000 | 2410 | 2526 / 2394 / 2491 / 2393 | 0.01 | 5.0 | 5.4 | 9.4 |
| Step | 0 | 6000 | 14391 | 15155 / 14353 / 14939 / 14353 | 0.00 | 5.0 | 2.3 | 3.8 |
| Step | 5 | 1000 | 4340 | 4400 / 4333 / 4368 / 4332 | 0.03 | 5.0 | 7.4 | 12.5 |
| Step | 5 | 6000 | 25988 | 26378 / 25973 / 26181 / 25972 | 0.01 | 5.0 | 2.3 | 7.6 |
| Step | 20 | 1000 | 4694 | 4744 / 4689 / 4710 / 4687 | 0.03 | 5.0 | 8.1 | 14.0 |
| Step | 20 | 6000 | 28146 | 28467 / 28133 / 28256 / 28132 | 0.01 | 5.0 | 2.3 | 8.7 |
| Curve | 0 | 1000 | 186 | 213 / 277 / 314 / 204 | 0.96 | 2.6 | 7.6 | 6.5 |
| Curve | 0 | 6000 | 1082 | 1276 / 1651 / 1877 / 1256 | 1.00 | 2.6 | 3.2 | 2.6 |
| Curve | 5 | 1000 | 1643 | 1652 / 1680 / 1721 / 1643 | 0.50 | 2.9 | 24.2 | 7.2 |
| Curve | 5 | 6000 | 9810 | 9886 / 10054 / 10302 / 9861 | 1.00 | 2.9 | 20.1 | 3.1 |
| Curve | 20 | 1000 | 1793 | 1802 / 1819 / 1845 / 1793 | 0.44 | 3.0 | 26.1 | 9.0 |
| Curve | 20 | 6000 | 10739 | 10813 / 10913 / 11071 / 10785 | 1.00 | 2.9 | 22.1 | 3.4 |

eTable 3. Results from a simulation study that examines whether accounting for spatial patterning in the outcome resolves issues of bias in the traditional model. Results are shown for simulation settings where the true $\beta(r)$ is the curve shown in Figure 3B of the manuscript, $n \in \{1000, 6000\}$, and small or large spatial correlation in the built environment. Results from a traditional linear model including spatial smoothing effects (the model fitted is $Y_i = \beta_0 + \beta_1 X_i(0; r_k) + s(x_i, y_i) + \epsilon_i$, where $s(x_i, y_i)$ is a bivariate smoothing term of longitude (x) and latitude (y) constructed using tensor product basis functions[5]) are compared to the traditional model without spatial terms. Summaries are derived from 1000 simulated datasets. Accounting for the spatial patterning in the outcome does not resolve the bias in the estimated built environment association. Est. beta is the mean of estimates in 1000 datasets. Coverage rate is the percent of 95% confidence intervals including the true association. SD(beta) is the standard deviation of 1000 estimates. Mean(SE) is the mean of 1000 standard error estimates.

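$r_k$ = 2.5

| Model | N | Spatial range in the built environment | Est. beta | Coverage rate | SD(beta)* | Mean(SE)* |
|---|---|---|---|---|---|---|
| True value | | | 0.058 | - | | |
| TLM with spatial smoothing terms | 1000 | 5 | 0.074 | 0.154 | 5.241 | 5.231 |
| TLM with spatial smoothing terms | 6000 | 5 | 0.074 | 0.000 | 2.130 | 2.120 |
| TLM with spatial smoothing terms | 1000 | 20 | 0.076 | 0.252 | 6.901 | 6.471 |
| TLM with spatial smoothing terms | 6000 | 20 | 0.076 | 0.000 | 2.717 | 2.632 |
| TLM | 1000 | 5 | 0.074 | 0.152 | 5.234 | 5.218 |
| TLM | 6000 | 5 | 0.074 | 0.000 | 2.125 | 2.116 |
| TLM | 1000 | 20 | 0.076 | 0.216 | 6.782 | 6.376 |
| TLM | 6000 | 20 | 0.076 | 0.000 | 2.643 | 2.594 |

$r_k$ = 5

| Model | N | Spatial range in the built environment | Est. beta | Coverage rate | SD(beta)* | Mean(SE)* |
|---|---|---|---|---|---|---|
| True value | | | 0.021 | - | | |
| TLM with spatial smoothing terms | 1000 | 5 | 0.024 | 0.723 | 1.842 | 1.799 |
| TLM with spatial smoothing terms | 6000 | 5 | 0.024 | 0.106 | 0.802 | 0.732 |
| TLM with spatial smoothing terms | 1000 | 20 | 0.022 | 0.878 | 2.159 | 1.999 |
| TLM with spatial smoothing terms | 6000 | 20 | 0.022 | 0.658 | 0.870 | 0.813 |
| TLM | 1000 | 5 | 0.024 | 0.727 | 1.839 | 1.792 |
| TLM | 6000 | 5 | 0.024 | 0.104 | 0.798 | 0.729 |
| TLM | 1000 | 20 | 0.022 | 0.888 | 2.093 | 1.957 |
| TLM | 6000 | 20 | 0.022 | 0.662 | 0.839 | 0.796 |

$r_k$ = 7.5

| Model | N | Spatial range in the built environment | Est. beta | Coverage rate | SD(beta)* | Mean(SE)* |
|---|---|---|---|---|---|---|
| True value | | | 0.010 | - | | |
| TLM with spatial smoothing terms | 1000 | 5 | 0.011 | 0.510 | 1.097 | 1.007 |
| TLM with spatial smoothing terms | 6000 | 5 | 0.011 | 0.004 | 0.480 | 0.410 |
| TLM with spatial smoothing terms | 1000 | 20 | 0.010 | 0.813 | 1.085 | 1.000 |
| TLM with spatial smoothing terms | 6000 | 20 | 0.010 | 0.361 | 0.446 | 0.407 |
| TLM | 1000 | 5 | 0.011 | 0.499 | 1.090 | 1.002 |
| TLM | 6000 | 5 | 0.011 | 0.006 | 0.476 | 0.408 |
| TLM | 1000 | 20 | 0.010 | 0.833 | 1.043 | 0.974 |
| TLM | 6000 | 20 | 0.010 | 0.390 | 0.426 | 0.396 |

* SD(beta) and Mean(SE) are multiplied by 1000 for readability.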
