Supplementary Appendix S1

Supplemental Appendix
Model description and robustness checks
We are interested in a statistical modeling approach that is open to exogenous shocks that
might affect the total population, e.g. environmental targets. We therefore stayed away from
the traditional logistic models and described growth by an additive sequence of specific terms.
An additive sequence may provide an adequate statistical representation, as more than 50% of
the human height increment between birth and adulthood is caused by leg growth. The
complex phenomenon of leg growth, i.e. long bone growth, is beyond the scope of this paper, but we
want to stress that long bones essentially grow additively via endochondral ossification. The
specific architecture of the proliferating chondrocytes within the epiphyseal growth plates
results in a mainly unidirectional length increment of the bone. Growth of the chondrocytes
depends on multiple interactions of endocrine signals, nutrition, oxygenation, physical stress,
etc. (28) that all add in a cumulative, i.e. additive, way to long bone length. This was the
reason to anticipate a model consisting of an additive rather than a multiplicative sequence of
specific terms.
To check the robustness of our empirical results and guard against misspecification, we also
investigate the statistical evidence for this model. We consider three archetypical models of
height. These are defined in equations (1), (2), and (3) [equations missing from the source text].
These three specifications characterize height as trend stationary (model 1), as first-order
trend stationary, i.e. stationary in differences (model 2), and as trend stationary in logarithmic
differences (model 3). The evidence for these three baseline specifications is evaluated on the
basis of the marginal likelihood; see below for details. Note that these different models imply
different inference results when regressing height differences or logarithmic growth rates on
past height. If model specification 1 is supported by the data, the corresponding
regression coefficient is expected to be negative for both dependent variables, i.e. height
differences and logarithmic growth rates. This effect is known as regression to the mean.
However, when model specification 2 holds, a significant impact of past height is only
expected when regressing logarithmic growth rates on past height, whilst under model
specification 3 it is to be expected when regressing height differences on past height. Thus,
in order to guard against spurious regression results, we check the evidence for the underlying
different characterizations of stationarity. The evidence according to the marginal likelihood
is strongly in favor of specification 2. The corresponding log marginal likelihoods are -6400.8
(1), -2506.9 (2), and -6652.9 (3) for boys and -5156.6 (1), -1956.7 (2), and -4364.0 (3) for
girls, respectively.
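The spurious regression argument above can be illustrated with a small simulation. The sketch below is not the estimation used in this appendix: it assumes a hypothetical data-generating process in the spirit of model 2 (stationary yearly increments), with made-up values for starting height, mean increment, and noise, and the helper `pooled_slope` is introduced only for illustration. Under these assumptions, regressing logarithmic growth rates on past height yields a clearly negative slope, while regressing height differences on past height yields a slope close to zero.

```python
import numpy as np

def pooled_slope(y, x):
    """OLS slope of y on x (with intercept), pooling all observations."""
    x = x.ravel()
    y = y.ravel()
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
n_children, n_periods = 69, 10   # panel size borrowed from the boys' sample

# Hypothetical process in the spirit of model 2 (all numbers are assumptions):
# yearly height increments are stationary around a constant mean.
start = rng.normal(135.0, 5.0, size=(n_children, 1))                 # height at age 9, cm
increments = rng.normal(5.0, 1.0, size=(n_children, n_periods - 1))  # cm per year
height = np.concatenate([start, start + np.cumsum(increments, axis=1)], axis=1)

past = height[:, :-1]
diff = np.diff(height, axis=1)                 # height differences
log_growth = np.diff(np.log(height), axis=1)   # logarithmic growth rates

slope_diff = pooled_slope(diff, past)
slope_log = pooled_slope(log_growth, past)
print(f"height differences on past height: {slope_diff:.5f}")
print(f"log growth rates on past height:   {slope_log:.5f}")
```

The negative slope for logarithmic growth rates arises although past height has no causal role in the simulated increments, which is why the stationarity characterization is checked before interpreting such regressions.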
The described model can be summarized as follows. Height differences are defined for all
individuals at each time point from the observed heights. These height differences can then be
modelled conditionally (reduced form) as an additive sequence of terms comprising a latent
individual-specific growth component, 'average' growth, a time-specific dynamic transmission
of growth, a time-specific term capturing a backward-looking mechanism, and a term capturing
growth tempo via the difference between bone age and calendar age.
Nonlinearities occurring in growth due to puberty are controlled via indicator variables, each
of which takes the value 1 when a puberty control status is reached at the respective time point
or before. The time point at which a certain puberty control status is reached enters the model as well.
Note that this parameterization allows time-dependent modelling of the influence that puberty
controls exert on growth.
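As a minimal sketch of such a puberty control, the snippet below builds an indicator that equals 1 from the time point at which a status is reached onwards. The function name `puberty_indicator`, the example onset ages, and the use of `nan` for a child who never reaches the status within the observation window are assumptions made for illustration only.

```python
import numpy as np

def puberty_indicator(ages, onset_age):
    """0/1 matrix: entry (i, t) is 1 if child i reached the status at ages[t] or before."""
    ages = np.asarray(ages, dtype=float)
    onset = np.asarray(onset_age, dtype=float).reshape(-1, 1)
    # Comparisons with nan are False, so a child without an onset age keeps a row of zeros.
    return (ages >= onset).astype(int)

ages = np.arange(9, 19)            # observation ages 9 to 18
onset_age = [12, 14, np.nan]       # hypothetical onset ages for three children
D = puberty_indicator(ages, onset_age)
print(D)
```

Each row of `D` can then enter the growth equation as a regressor whose coefficient is allowed to vary over time.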
Model estimation, model comparison and handling of missing data
Bayesian estimation is performed using MCMC techniques (Gibbs sampling). Conjugate
distributions are chosen a priori. Hyperparameters are set as follows. Means and variances of
conjugate normal distributions are set to zero and 1000, respectively. For conjugate gamma
distributions, both hyperparameters are set to 1. Gibbs sampling is conducted via iterative
sampling from the following full conditional distributions, yielding a sample from the joint
posterior distribution. A total of 10,000 iterations are performed, and the first 2,000 draws are
discarded as burn-in. Posterior sample characteristics serve as estimators. The algorithm
involves the following steps.
1. Sample the first block of parameters from its full conditional distribution, given as a
univariate normal distribution whose parameters combine the data with the prior moments of a
conjugate normal density.
2. Sample the second block from a bivariate normal density. Its moments are defined from the
stacked data, a diagonal unit matrix of matching dimension, and the prior moments of a
conjugate normal density.
3. Sample the third block from a multivariate normal distribution whose dimension equals the
number of parameters in that block, with moments defined analogously.
4. Finally, sample the variance parameters from an inverse Gamma distribution whose
parameters depend on the corresponding hyperparameters.
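The structure of such a sampler can be sketched on a deliberately simple toy model. This is not the growth model of this appendix: it assumes independent normal observations with unknown mean and variance, a normal prior with mean zero and variance 1000, and a gamma prior on the precision with both hyperparameters set to 1 (a shape and rate parameterization is assumed). The function name `gibbs_normal` and the simulated data are hypothetical.

```python
import numpy as np

def gibbs_normal(y, n_iter=10_000, burn_in=2_000, prior_mean=0.0, prior_var=1000.0,
                 gamma_shape=1.0, gamma_rate=1.0, seed=0):
    """Gibbs sampler for the mean and variance of i.i.d. normal data (toy model)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    mu, precision = y.mean(), 1.0
    draws = np.empty((n_iter, 2))
    for it in range(n_iter):
        # mean | precision, data: conjugate normal update
        post_var = 1.0 / (1.0 / prior_var + n * precision)
        post_mean = post_var * (prior_mean / prior_var + precision * y.sum())
        mu = rng.normal(post_mean, np.sqrt(post_var))
        # precision | mean, data: conjugate gamma update,
        # so the variance follows an inverse Gamma distribution
        shape = gamma_shape + 0.5 * n
        rate = gamma_rate + 0.5 * np.sum((y - mu) ** 2)
        precision = rng.gamma(shape, 1.0 / rate)   # NumPy expects scale = 1 / rate
        draws[it] = mu, 1.0 / precision
    return draws[burn_in:]                          # discard burn-in draws

rng = np.random.default_rng(1)
y = rng.normal(5.0, 1.0, size=200)    # simulated increments in cm, not study data
kept = gibbs_normal(y)
posterior_mean_mu, posterior_mean_var = kept.mean(axis=0)
print(kept.shape, posterior_mean_mu, posterior_mean_var)
```

Posterior means of the retained draws play the role of the estimators; the sampler described above follows the same pattern with more parameter blocks.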
A total of 69 boys and 60 girls were observed from 9-18 years and 9-17 years, respectively.
Two strategies are employed to deal with missing values. The first strategy uses complete
case analysis, which handles missing values in height and measured Tanner stages.
However, no complete case remains when considering bone age as a tempo control.
Several approaches to deal with missing values in explanatory variables are discussed in the
literature. Building on the comprehensive review provided by (29) of multiple imputation as one
way to deal with missing values in survey data, (30) discuss the use of multiple imputation
by chained equations (MICE) algorithms based on classification and regression trees (CART)
to mimic the full conditional distribution of missing values when the data structure does not
allow for rich and yet computationally feasible fully parametric models. Note that imputation
in a panel model context is of particular relevance, since even a single missing value would cause
the loss of all observations of an individual. However, in the context of missing values in
observed bone ages, parametric models approximating the full conditional distribution are at
hand, making use of the dynamic panel structure of the data. Consider a panel data setup with
individuals i and t = 1, …, T observation periods. Let z_it denote observed bone age
for individual i at period t. For the pattern of missing values, see Table A1.
Note that dealing with missing values is straightforward when Bayesian estimation is
performed via Markov Chain Monte Carlo (MCMC) methodology. In each iteration of the
Gibbs sampler, a new set of imputed values is generated, thus incorporating the uncertainty
concerning the missing values within the parameter estimation of the structural model of
interest. Given the unsystematic pattern of missing values within the data, the algorithm has
the following structure. Define parametric models providing the approximation for the full
conditional distributions of each variable in terms of linear regressions. Bone age in the first
period, z_i1, is thereby regressed on the following two periods, i.e. on z_i2 and z_i3.
For all following periods, regression is performed on one previous and one consecutive period,
i.e. z_it is regressed on z_i,t-1 and z_i,t+1, and thus for the last period the model takes the
corresponding form with regressors from preceding periods only.
These approximations of the full conditional distributions are then incorporated within the
MCMC sampling scheme, where draws of the missing bone ages are obtained from the corresponding
predictive distributions, with the parameters thereof sampled from their sampling
distributions, conditional on the set of regressors involved in the approximations of the
respective conditional distributions.
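One imputation step of this kind can be sketched as follows. The sketch assumes a bone age matrix with `nan` for missing entries and uses simulated data. The function name `impute_period`, the use of two preceding periods for the last period, and the scaled chi-square draw for the residual variance are assumptions made for illustration rather than details confirmed by the text.

```python
import numpy as np

def impute_period(z, t, rng):
    """Draw imputations for missing z[:, t] from a linear regression on neighbouring periods."""
    n_periods = z.shape[1]
    if t == 0:
        cols = [1, 2]              # first period: the following two periods
    elif t == n_periods - 1:
        cols = [t - 2, t - 1]      # last period: preceding periods only (two assumed)
    else:
        cols = [t - 1, t + 1]      # otherwise: one previous and one consecutive period
    X = np.column_stack([np.ones(z.shape[0]), z[:, cols]])
    y = z[:, t]
    has_x = ~np.isnan(X).any(axis=1)
    fit = has_x & ~np.isnan(y)         # rows used to estimate the regression
    target = has_x & np.isnan(y)       # rows to impute
    if not target.any():
        return z
    beta_hat, *_ = np.linalg.lstsq(X[fit], y[fit], rcond=None)
    resid = y[fit] - X[fit] @ beta_hat
    dof = fit.sum() - X.shape[1]
    # Draw the residual variance and the coefficients from their sampling distributions.
    sigma2 = resid @ resid / rng.chisquare(dof)
    cov = sigma2 * np.linalg.inv(X[fit].T @ X[fit])
    beta = rng.multivariate_normal(beta_hat, cov)
    # Draw the missing values from the predictive distribution.
    out = z.copy()
    out[target, t] = X[target] @ beta + rng.normal(0.0, np.sqrt(sigma2), target.sum())
    return out

rng = np.random.default_rng(0)
ages = np.arange(9, 14)
tempo = rng.normal(0.0, 0.8, size=(60, 1))            # simulated individual tempo
z_true = ages + tempo + rng.normal(0.0, 0.2, size=(60, 5))
z = z_true.copy()
missing_rows = rng.choice(60, size=8, replace=False)
z[missing_rows, 2] = np.nan
z_imputed = impute_period(z, 2, rng)
print(np.round(z_imputed[missing_rows, 2] - z_true[missing_rows, 2], 2))
```

Within a Gibbs sampler, a step like this would be repeated for every period in every iteration, using the currently imputed values of the neighbouring periods as regressors.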
The Bayesian framework allows the different specifications to be compared via the marginal
likelihood, which gives the evidence of the sample data under a specific
model. This concept incorporates the parameter uncertainty and the uncertainty stemming
from the missing values and provides a consistent model assessment even for smaller samples,
as it is not based on asymptotic properties. The derivation of the marginal likelihood follows
the approach proposed by (15). The starting point of the derivation is a decomposition of the
log marginal likelihood of all data into its components.
As this identity holds for each point θ within the parameter space of the model, it is calculated
at a point θ* within the highest density region, where θ* is the posterior mean. The first
component gives the log likelihood, which is calculated from draws obtained from special
shortened Gibbs runs iterating through the corresponding conditional distributions.
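A decomposition of this type can be checked numerically on a toy example. The sketch assumes the decomposition has the common form log m(y) = log f(y | θ) + log π(θ) - log π(θ | y), which is an assumption here because the original equation is not available in this text. It uses a normal model with known variance and a conjugate normal prior, so that the posterior ordinate is available in closed form instead of being estimated from shortened Gibbs runs. All names and numbers are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_marginal_likelihood(y, theta, noise_var=1.0, prior_mean=0.0, prior_var=1000.0):
    """Log likelihood + log prior - log posterior ordinate, all evaluated at theta."""
    y = np.asarray(y, dtype=float)
    post_var = 1.0 / (1.0 / prior_var + y.size / noise_var)
    post_mean = post_var * (prior_mean / prior_var + y.sum() / noise_var)
    log_lik = log_normal_pdf(y, theta, noise_var).sum()
    log_prior = log_normal_pdf(theta, prior_mean, prior_var)
    log_post = log_normal_pdf(theta, post_mean, post_var)
    return log_lik + log_prior - log_post, post_mean

rng = np.random.default_rng(0)
y = rng.normal(5.0, 1.0, size=50)                   # simulated data, not study data
_, theta_star = log_marginal_likelihood(y, 0.0)     # posterior mean
at_star, _ = log_marginal_likelihood(y, theta_star)
elsewhere, _ = log_marginal_likelihood(y, theta_star + 0.3)
print(at_star, elsewhere)
```

Both evaluations agree up to rounding error, which illustrates that the identity holds at every point; a high-density point such as the posterior mean is preferred in practice because the posterior ordinate can be estimated more accurately there.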
References
28. Nilsson O, Marino R, De Luca F, Phillip M, Baron J. Endocrine regulation of the
growth plate. Horm Res, 2005;64:157-65.
29. Raghunathan TE, Lepkowski JM, van Hoewyk J, Solenberger P. A multivariate
technique for multiply imputing missing values using a sequence of regression
models. Survey Meth, 2002;27:85-96.
30. Burgette LF, Reiter JP. Multiple imputation for missing data via sequential
regression trees. Am J Epid, 2010;172:1070-1076.
Table A1: Number of missing values in bone ages
year: 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
boys: 14, 10, 15, 16, 10, 11, 9, 11, 8, 8
girls: 7, 10, 9, 10, 7, 3, 10, 8, 4, 13, 7, 8