Dietary assessment and estimation of intake densities

Michael J. Daniels [1, 2] and Alicia Carriquiry [3, 4]

[1] Michael Daniels is corresponding author: 102G Snedecor Hall, Department of Statistics, Iowa State University, Ames, IA 50011-1210. E-mail: mdaniels@iastate.edu.
[2] Michael Daniels is Assistant Professor, Department of Statistics, Iowa State University.
[3] Alicia Carriquiry is Associate Professor, Department of Statistics and Center for Agricultural and Rural Development, Iowa State University.
[4] This work was partially funded through contracts number 0009730470 and 0009830322 between the National Center for Health Statistics, Center for Disease Control and Prevention, and the Department of Statistics, Iowa State University, and by Research Grant No. 1960915 from FONDECYT, Chile.

Summary

The U.S. government has conducted nationwide food consumption surveys since 1936. Information obtained from these surveys is used to design food assistance programs, guide food and nutrition policy, and monitor the dietary status of the population. The distribution of usual intakes of a nutrient in the population is of interest to policy makers. Here, usual intake is defined as the long-run average intake of a nutrient by an individual. Usual intakes are not observable in practice. Instead, we observe daily intakes for a sample of individuals and a small number of days, and assume that observed intakes measure usual intakes with error. The distributions of observed intakes, however, are typically very skewed, and the day-to-day variability in intakes tends to be large relative to the between-individual variance, and can be heterogeneous across individuals (Nusser, Carriquiry, Dodd, and Fuller, 1996). In this paper, we present a Bayesian approach to estimating the distribution of usual intakes of a nutrient in a population. Starting with a sample of dietary intakes, we model a function to map the intakes into the normal scale.
This function combines a power transformation and a cubic spline (constrained to be monotonic) with unknown number and location of knots, and is estimated using reversible jump Markov chain Monte Carlo methods (Green, 1995). From each "draw" of the transformation function we obtain a transformed set of intakes which are approximately normally distributed. We then remove the day-to-day variability in daily intakes by fitting a measurement error model to each set of transformed observations. Each set of estimated individual usual intakes is then mapped back to the original scale using the inverse of the transformation function. Posterior distributions of percentiles and other attributes of the density for each nutrient are estimated accounting for all major sources of uncertainty. We apply these methods to a subset of the 1994-1996 Continuing Survey of Food Intakes by Individuals (CSFII) collected by the USDA (USDA, 1997).

Key words: Dietary data; CSFII; Measurement error models; Splines; Reversible jump Markov chain Monte Carlo

1 Introduction

The United States government has collected dietary intake data since the 1930s. Nationwide food consumption surveys are conducted approximately once a year, in which a large sample of individuals is asked to report their food consumption during the previous 24 hours. Thus, the survey instruments used to collect this information are called 24-hour recalls. Most nationwide food consumption surveys collect replicate 24-hour recalls for at least some of the individuals in the sample. Often, these repeated observations are not collected on consecutive days, so that multiple observations within an individual can be considered to be independent. Other survey instruments, for example food frequency questionnaires (FFQs), can also be used to collect dietary intake data and to estimate usual nutrient intake distributions (e.g., Carroll, Freedman, and Hartman, 1996).
In this paper, we consider only the analyses of intake data collected via 24-hour recalls.

The information obtained from these dietary surveys is used by policy makers to design, implement, monitor and evaluate food assistance programs and other nutrition-related policies. For example, policy makers might be interested in comparing the nutritional status of children from low-income households who are enrolled in the School Lunch program versus that of children who are not. The effectiveness of the Food Stamps program might be evaluated by, for instance, monitoring the proportion of low-income elderly who are consuming enough of some essential nutrient, or the proportion of teenaged girls who consume adequate amounts of folate or calcium. Since food assistance programs managed by the U.S. Department of Agriculture (USDA) alone cost approximately 26 billion dollars a year, it is important that information obtained from dietary data be as accurate as possible, and that measures of uncertainty be available for all estimates.

How do we obtain information about the intake of a nutrient from a dietary intake survey? Individuals participating in food consumption surveys are asked to recall their food (including beverages, snacks, and meals) consumption for the previous day. A database managed by the USDA is then used to "map" foods into their nutrient components. This USDA database contains approximately 6,600 entries, and is updated periodically. For example, we can obtain the content of about 26 different nutrients of a lunch composed of a slice of pepperoni pizza, an 8-ounce can of Diet Coke, and an apple. It is well known (e.g., Schubert, Holden, and Wolf, 1987; Haytowitz, Pehrsson, Smith, Gebhardt, Mathews and Anderson, 1996) that this food database is not error free. We do not, however, address the issue in this work.
The data we obtain for analysis, then, are replicate observations (for at least a subsample of individuals) of daily intakes of a large set of nutrients for individuals in the sample. We use $Y_{ij}$ to denote the observed intake of a nutrient for individual $i$ on day $j$. Because these data are costly to collect, the number of replicate observations is typically no more than two or three for a subsample of the individuals in the survey. We use $d_i$ to denote the number of days of intake information available for each individual in the sample.

The relationship between diet and health underlies much of the government's goal of providing the population with the means to consume an adequate diet. Often, the effect of nutrient consumption on health-related outcomes is chronic, so that researchers are interested in the long-run average intake of a nutrient by an individual. This long-run average intake is known as the usual intake of a nutrient by an individual, and is denoted by $y_i$, $i = 1, \ldots, n$, where $n$ is the number of individuals in the sample. Formally, $y_i = E\{Y_{ij} \mid i\}$. Furthermore, population-level assessments such as those described earlier require that we estimate the distribution of usual intakes $F(y)$ in the group of interest. This usual intake distribution concept was set forth in a report by the National Research Council (NRC, 1986).

The problem of estimating usual nutrient intake distributions from dietary survey data is a challenging one. Usual intakes are not observable in practice, and observed daily intakes measure usual intakes with error. Furthermore, various characteristics of dietary intake data (described in the next section) prevent the use of standard normal-theory methods for analysis. Nusser, Carriquiry, Dodd, and Fuller (1996), Eckert, Carroll, and Wang (1997), Chen (1998) and Carriquiry (1999), among others, have recently proposed approaches for analyzing dietary intake data. In particular, Nusser et al.
(1996) propose a measurement error model approach on transformed intake data that results in estimators of usual intake distributions that perform well in simulation studies. The Nusser et al. methodology, however, is developed from a frequentist viewpoint, and consists of several steps. Thus, it is not possible to obtain expressions for standard errors of various estimates that properly incorporate all uncertainties accumulated along the way. In fact, the estimators of standard errors for percentiles of the usual intake distribution given in Nusser et al. (1996) are obtained under the assumption that the function used to transform the data into the normal scale, and the variance components in the measurement error model, are fixed and known.

We revisit the Nusser et al. (1996) approach to estimating usual intake distributions from dietary intake data, and reformulate it within a Bayesian framework. Our objective is to derive marginal posterior distributions for parameters of the usual intake distribution of a nutrient that are of interest to policy makers and researchers in nutrition. We focus on the marginal posterior distributions of percentiles of the usual intake distribution, and argue that the posterior variances we obtain reflect all uncertainties accrued in the various steps of the procedure. We use Markov chain Monte Carlo (MCMC) methods (e.g., Smith and Roberts, 1993) throughout to perform all computations. As will be described in Section 3, the transformation step involves solving a varying-dimensional problem; thus, we proceed as in Green (1995) and Denison, Mallick, and Smith (1998) and use a reversible-jump MCMC algorithm to obtain the transformation function.

The paper is organized as follows. In Section 2 we briefly discuss the characteristics of dietary intake data, and describe a subset of the Continuing Survey of Food Intakes by Individuals (CSFII, USDA, 1996) that was used for illustration of the procedure.
The model and proposed estimation strategy are given in Section 3. We apply the methodology to a subset of CSFII and present results in Section 4. Finally, Section 5 gives a discussion of the approach we propose and of related problems in nutrition that merit further investigation.

2 Characteristics of dietary intake data

We consider dietary intake data obtained via 24-hour recalls by the CSFII carried out in 1994-1996. The CSFII is a nationwide food consumption survey designed as a multistage stratified area probability sample of the 50 states and the District of Columbia, and is intended to be self-weighting. We consider the subset consisting of males and females aged 14 to 18 years, who were interviewed between 1994 and 1996. Two observations were collected for each individual in the sample. Both observations were obtained by personal interview if possible; otherwise, the second day interview was done over the phone. Within an individual, intakes were collected at least a week apart from each other; thus, we assume that observations within an individual are independent. Because of non-negligible attrition rates, regression weights (e.g., Huang and Fuller, 1978) were constructed to adjust for nonresponse. The analyses we present in Section 4 are performed on weighted data, where weights, once computed, are assumed to be fixed and known.

Observed intake data are affected not only by individual, but also by nuisance effects such as day of the week, month of the year, interview sequence (first or later days) and interview method (in person or by phone). Prior to analysis, we adjust the data to remove these nuisance effects. We proceed as in Nusser et al. (1996) and use a ratio adjustment based on a regression model to partially remove the effects of day of week and interview method from observed intake data.
To avoid carrying the survey weights throughout our analyses, we linearly transform the intake data to obtain a set of "equal weight" observations, as described in Dodd (1997). The unweighted analyses of the equal weight observations are essentially equivalent to the analyses that would be conducted on the original observations and their weights. In the remainder, $Y_{ij}$ denotes the adjusted, equal-weight intake for individual $i$ on day $j$.

Dietary intake data have attributes that make their analysis challenging. Observed daily intakes have skewed distributions, and exhibit both between- and within-individual variability. In fact, the within-individual variance in observed intakes of most nutrients is sometimes larger than (or of the same order of magnitude as) between-individual variation, and is heterogeneous across individuals. Typically, as the mean intake of a nutrient increases, so does the variance of those intakes.

Since our objective is to estimate $F(y)$, the distribution of the usual intakes, we must remove the day-to-day variability from the observed intakes. An additive relationship between observed intake and usual intake in the normal scale is often adopted to model (transformed) observed intakes. A linear measurement error model approach that allows for the incorporation of heterogeneous within-individual measurement error variances is then appropriate for dietary intake data. How to transform observed intakes into the normal scale so that transformed intakes are normally distributed and the additive relationship holds is a matter of ongoing discussion (Nusser et al., 1996; Stefanski and Bay, 1996; Chen, 1998). Here, we adopt the Nusser et al. (1996) approach, and assume that in the normal scale, a linear measurement error model is a reasonable choice to describe the relationship between observed and usual intakes.
3 Model and estimation strategy

We implement a fully Bayesian approach to the problem of estimating the marginal posterior distributions of percentiles of the usual intake distribution of dietary components. The basic approach uses three nested sampling algorithms to properly account for all uncertainties:

1) Transformation of observed dietary intake data to normality
2) Removal of measurement error in the normal scale
3) Back-transformation to the original scale

We now describe each of these steps in detail.

3.1 Transformation to normality

As discussed in Nusser et al. (1996), standard power transformations fail to properly transform intake data to normality for most nutrients. We use cubic splines to improve the transformation. Our data consist of pairs $(Y_{ij}, z_{ij})$, where $Y_{ij}$ is the observed intake for the $i$th individual on the $j$th day raised to the power which provides the best (in terms of minimizing mean squared error) transformation to normality, and the $z_{ij}$ are the corresponding normal scores. We use Blom's (1958) formula to compute the $z_{ij}$. Our goal is to compute a function $g^{-1}(z)$ such that $g(Y_{ij}) = X_{ij}$, where $X_{ij}$ is approximately normal. We postulate a cubic spline for $g^{-1}(z)$, indexed by a vector of unknown parameters $\theta$ and contaminated by normal noise. We use maximum likelihood (ML) to estimate the parameters $\beta$ in the model. The model is:

$$Y_{ij} = g^{-1}(z_{ij}) + \epsilon_{ij} = \beta_0 + \sum_{p=1}^{3} \beta_p z_{ij}^{p} + \sum_{p=1}^{k} \beta_{p+3}\, t_p + \epsilon_{ij} \qquad (1)$$

where $t_p = (z_{ij} - r_p)^3 I\{z_{ij} > r_p\}$ and the $\epsilon_{ij}$, $j = 1, \ldots, d_i$, $i = 1, \ldots, n$, are normal random variables with mean 0. The number of knots in (1) is given by $k$, and their locations are denoted $r_1, r_2, \ldots, r_k$. We define $r^{(k)} = (r_1, r_2, \ldots, r_k)'$ and $\theta = (k, r^{(k)}, \beta)$. The $Y_{ij}$ are the sample quantiles of the power-transformed data, and thus cannot be considered to be iid random variables.
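As an illustration of the ingredients of model (1), the sketch below computes Blom normal scores and builds the rows of a truncated-power cubic spline design matrix with $k + 4$ columns. The function names (`blom_scores`, `spline_row`, `design_matrix`) and the toy values are hypothetical, ties among intakes are ignored, and the power transformation is assumed to have been applied already. It is a minimal sketch under these assumptions, not the implementation used in the paper.

```python
from statistics import NormalDist


def blom_scores(values):
    """Blom (1958) normal scores: Phi^{-1}((rank - 3/8) / (n + 1/4)).

    Assumption: no ties; ranks follow the sort order of the values.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf((rank - 0.375) / (n + 0.25))
    return scores


def spline_row(z, knots):
    """One design-matrix row: 1, z, z^2, z^3, then (z - r)^3 I{z > r} per knot."""
    return [1.0, z, z ** 2, z ** 3] + [max(z - r, 0.0) ** 3 for r in knots]


def design_matrix(z_values, knots):
    """Stack one row per observation; the result has len(knots) + 4 columns."""
    return [spline_row(z, knots) for z in z_values]


# Toy usage: three power-transformed intakes and a single knot at zero.
power_transformed = [3.0, 1.0, 2.0]
z = blom_scores(power_transformed)
Z = design_matrix(z, knots=[0.0])
```

Each added knot contributes one extra column, which is why the dimension of the coefficient vector changes with $k$.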
As a result, the covariance matrix of $\epsilon$ might be modelled as a scale factor $\sigma^2$ times a weight matrix ($W$), which is proportional to the asymptotic variance of the sample quantiles (see, e.g., Schervish, 1996, pp. 404-410). The variance of $\epsilon_{ij}$ will take the form $\sigma^2 p_{ij}(1 - p_{ij}) / f^2(y_{p_{ij}})$, where $y_{p_{ij}}$ is the true sample quantile and $p_{ij}$ corresponds to the $p_{ij}$th percentage point ($0 < p_{ij} < 1$); the covariance between $\epsilon_{ij}$ and $\epsilon_{kl}$ is given by

$$\mathrm{Cov}(\epsilon_{ij}, \epsilon_{kl}) = \sigma^2 \, \frac{\min\{p_{ij}, p_{kl}\} - p_{ij} p_{kl}}{f(y_{p_{ij}})\, f(y_{p_{kl}})}.$$

To approximate these terms, we use kernel density estimation. Given the number of knots $k$ and their location $r^{(k)}$, the ML estimate of $\beta$ is obtained via the generalized least squares equations:

$$Z' W^{-1} Z \hat{\beta} = Z' W^{-1} Y$$

where $Z$ is an $N \times (k + 4)$ design matrix (with $N = \sum_{i=1}^{n} d_i$) and $Y$ is the vector of power-transformed observations. In the remainder, and to keep notation simple, we assume that $d_i = d$ for all individuals, so that $N = nd$.

In most applications, the weight matrix $W$ is very large (its dimension equals the number of observations $N$) and therefore computation of its inverse is impractical. To investigate whether estimates of the parameters in (1) are sensitive to a simplified formulation of the model, we considered an alternative representation for $W$ in our application: a diagonal matrix obtained by setting all off-diagonal elements of $W$ to 0.

We proceed as in Denison et al. (1998) and specify prior distributions for the number of knots, $k$, and the location of the knots, $r^{(k)}$. We chose a discrete uniform prior distribution for the knot location, conditional on $k$, so that $r \mid k \sim$ discrete $U(z_{11}, \ldots, z_{nd})$ (with additional constraints), and a Poisson distribution with rate $\lambda$ for the number of knots $k$, so that $k \sim \mathrm{Poisson}(\lambda)$. In the example given in Section 4, we fix $\lambda$ at some "known" value.

3.1.1 Details of algorithm for transformation

The dimension of the parameter vector changes with $k$, the number of knots in model (1).
As a result, we use reversible jump MCMC as discussed in Green (1995) and Denison et al. (1998) to simulate from the posterior distribution of $\theta$, which specifies the appropriate transformation. The idea is simple. At each iteration $l = 1, \ldots, M_1$, a new knot can be introduced, an old knot can be deleted, or an old knot can be moved to a new location. Consequently, each iteration consists of three steps:

1. Choose type of move:
   - Birth of a new knot, with probability $b_k$.
   - Death of an existing knot, with probability $d_k$.
   - New location for a knot, with probability $\eta_k$.
2. Compute the MLE $\hat{\beta}^{(k,l)}$ and check monotonicity of $g^{-1(l)}(\cdot)$.
3. Accept move, with probability $\alpha^{(l)}$ (defined below).

For $M_1$ large enough, the algorithm "converges". We monitor the behavior of the iterations using a mean squared error criterion computed as

$$MSE^{(l)} = (nd)^{-1} \left(Y - g^{-1(l)}(\cdot)\right)' W^{-1} \left(Y - g^{-1(l)}(\cdot)\right). \qquad (2)$$

Once the algorithm has "converged", we invert draws $l = 1, \ldots, m_1$, with $m_1 < M_1$, of functions $g^{-1(l)}(\cdot)$, and evaluate each draw at the set of $nd$ values of $Y_{ij}$ to obtain a sample of $\{X_{ij}\}^{(l)}$ that are approximately standard normal. That is,

$$\{X_{ij}\}^{(l)} = g^{(l)}(Y_{ij}) \sim N(0, 1).$$

To compute the MLE of $\beta^{(k,l)}$ we use generalized least squares as described in Section 3.1. Because $g^{-1}(\cdot)$ must be monotonic, at each step we check that the $l$th draw satisfies the condition by evaluating the derivative of $g^{-1(l)}(\cdot)$ on a grid of values of $z$ given by the knots and midpoints between the knots. If these function evaluations are not all positive, we obtain an estimate of $\beta$ via linear programming, as the objective function and all constraints are linear in $\beta$. Non-monotonicity may occur between midpoints and knots, and thus our approach does not guarantee that $g^{-1(l)}(\cdot)$ is monotonic. However, we are reasonably confident that non-monotonicity will usually be uncovered by focusing on the grid.

Given $k$, $p(k)$, and $c \le 0.5$, we follow Denison et al.
(1998), and define $b_k = c \min\{1, p(k+1)/p(k)\}$, $d_k = c \min\{1, p(k-1)/p(k)\}$, and $\eta_k = 1 - b_k - d_k$, where $p(k)$ is the prior density for the number of knots. Note that for $k = 0$, $b_k = 1$, and for $k = k_{max}$, $b_k = 0$. With this formulation, the probability of accepting the proposed move has a very simple form:

$$\alpha = \min\{1, \ (\text{likelihood ratio}) \times (\text{prior ratio}) \times (\text{proposal ratio})\}$$

where

$$\alpha(\text{birth}) = \min\{1, \ (\text{likelihood ratio}) \times \omega(k)\}$$
$$\alpha(\text{death}) = \min\{1, \ (\text{likelihood ratio}) \times \omega(k)^{-1}\}$$
$$\alpha(\text{move}) = \min\{1, \ (\text{likelihood ratio})\}$$

and

$$\omega(k) = \frac{nd - 8 - 7k}{nd}.$$

The quantity $\omega(k)$ is the ratio of the number of locations at which a knot may be placed to the number of data points. (The above result is specific to a cubic spline; for additional details, see Denison et al. (1998), p. 338.)

3.2 Measurement error model

We make the assumption that the measurement error is additive in the normal scale. Using $m_1 < M_1$ sets of transformed values, $\{X_{ij}\}^{(l)}$, $l = 1, \ldots, m_1$, we fit an additive measurement error model (MEM) as proposed by Nusser et al. (1996):

$$X_{ij}^{(l)} = x_i^{(l)} + u_{ij}^{(l)} \qquad (3)$$

where $x_i^{(l)}$ is the usual intake of the nutrient for the $i$th individual for the $l$th draw, and $u_{ij}^{(l)}$ is the measurement error for the $i$th individual on the $j$th day, in the normal scale for the $l$th draw. There may be considerable heterogeneity of the measurement error variances across individuals (see, e.g., Nusser et al., 1996), so we formulate our MEM as a hierarchical model with three levels. We omit the superscript that denotes draw to keep the notation simple, but it is important to remember that the hierarchical model is formulated for each draw $\{X_{ij}\}^{(l)}$, $l = 1, \ldots, m_1$.
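To see why the measurement error in model (3) cannot simply be ignored, the following simulation sketch generates data from an additive model and compares the variance of the usual intakes with the variance of the individual $d$-day means. It makes a simplifying assumption that the paper does not: a common within-individual standard deviation `sigma_u` for all individuals. The function name `simulate_mem` and all parameter values are hypothetical.

```python
import random
from statistics import fmean, pvariance


def simulate_mem(n, d, mu_x, sigma_x, sigma_u, seed=0):
    """Simulate X_ij = x_i + u_ij in the normal scale.

    Assumption: homogeneous measurement error SD (sigma_u) across individuals.
    """
    rng = random.Random(seed)
    usual = [rng.gauss(mu_x, sigma_x) for _ in range(n)]
    observed = [[x + rng.gauss(0.0, sigma_u) for _ in range(d)] for x in usual]
    return usual, observed


usual, observed = simulate_mem(n=20000, d=2, mu_x=0.0, sigma_x=1.0, sigma_u=1.0)
day_means = [fmean(days) for days in observed]

var_usual = pvariance(usual)      # estimates sigma_x^2
var_means = pvariance(day_means)  # estimates sigma_x^2 + sigma_u^2 / d
```

With only $d = 2$ days and a within-to-between variance ratio near one, the variance of the two-day means is roughly one and a half times the variance of the usual intakes, so percentiles in the tails of the mean distribution overstate the spread of usual intakes.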
In level 1, the individual's daily intake is modelled as a normally distributed random variable with mean equal to the individual's usual intake and with a subject-specific measurement error variance:

$$X_{ij} \mid x_i, \sigma^2_{u_i} \sim N(x_i, \sigma^2_{u_i}).$$

In level 2, we model the heterogeneity in the usual intakes and in the measurement error variances across individuals:

$$x_i \mid \mu_x, \sigma^2_x \sim N(\mu_x, \sigma^2_x)$$
$$\log(\sigma^2_{u_i}) \mid \mu_A, \sigma^2_A \sim N(\log(\mu_A), \sigma^2_A).$$

Finally, in level 3 we place flat priors on the remaining hyper-parameters:

$$\mu_x, \ \sigma^2_x, \ \log(\mu_A), \ \sigma^2_A \sim \text{Uniform}.$$

We use the Gibbs sampler to draw values from the posterior distribution of the parameters in the hierarchical MEM model. All full conditionals are of standard form, with the exception being the full conditional distribution of $\sigma^2_{u_i}$, which is proportional to

$$\pi(\log(\sigma^2_{u_i}) \mid x_i, \mu_A, \sigma^2_A, X_i) \propto \Big[\prod_{j} (\sigma^2_{u_i})^{-1/2}\Big] \exp\Big\{-\frac{1}{2\sigma^2_{u_i}} \sum_{j} (X_{ij} - x_i)^2\Big\} \exp\Big\{-\frac{1}{2\sigma^2_A} \big(\log(\sigma^2_{u_i}) - \log(\mu_A)\big)^2\Big\}$$

where $X_i = (X_{i1}, \ldots, X_{id})'$. To draw values from $\pi(\log(\sigma^2_{u_i}) \mid x_i, \mu_A, \sigma^2_A, X_i)$, we use a Metropolis-Hastings algorithm (e.g., Smith and Roberts, 1993) with a normal approximation to the full conditional distribution of $\log(\sigma^2_{u_i})$ as a candidate density.

For each transformed sample $\{X_{ij}\}^{(l)}$, $l = 1, \ldots, m_1$, we obtained $M_2$ draws from the joint posterior distribution of $\{\mu_x, \sigma^2_x, \mu_A, \sigma^2_A\}$. For $m_2 < M_2$ of these, we simulated sets of $(x_i^{(s)}, \sigma^{2(s)}_{u_i})$, $i = 1, \ldots, n$, $s = 1, \ldots, m_2$, from $x_i \mid \mu_x^{(s)}, \sigma_x^{2(s)}$ and $\log(\sigma^2_{u_i}) \mid \mu_A^{(s)}, \sigma_A^{2(s)}$, respectively, to transform back to original scale. Note that by sampling from the population as opposed to transforming back the original subjects, we are accounting for the additional variability of only having a finite (incomplete) sample of individuals from the population.

3.3 Transformation back to original scale

As we described earlier, for each of the $m_2$ draws, we obtained a sample of $n$ usual intakes and $n$ measurement error variances $(x_i^{(s)}, \sigma^{2(s)}_{u_i})$ in the normal scale. To make inferences about the quantiles of the intake distribution, we now need to transform the usual intake draws back to the original scale.
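The population-level sampling at the end of Section 3.2 and the Monte Carlo averaging used for the back-transformation can be sketched as follows. The hyperparameter values, the function names `draw_population` and `back_transform`, and the use of `math.exp` as a stand-in for the inverse transformation $g^{-1}$ are all illustrative assumptions; in the paper $g^{-1}$ is the fitted power-plus-spline function.

```python
import math
import random


def draw_population(n, mu_x, sigma_x, mu_a, sigma_a, rng):
    """Draw n pairs (x_i, sigma2_ui) from level 2 for one set of hyperparameters."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(mu_x, sigma_x)
        # log(sigma2_ui) ~ N(log(mu_a), sigma_a^2), so sigma2_ui is always positive.
        sigma2_u = math.exp(rng.gauss(math.log(mu_a), sigma_a))
        pairs.append((x, sigma2_u))
    return pairs


def back_transform(x, sigma2_u, g_inv, q, rng):
    """Monte Carlo mean of g_inv(x + u) over q draws u ~ N(0, sigma2_u)."""
    sd = math.sqrt(sigma2_u)
    return sum(g_inv(x + rng.gauss(0.0, sd)) for _ in range(q)) / q


# Toy usage with a hypothetical inverse transformation g_inv = exp.
rng = random.Random(1)
pairs = draw_population(n=5, mu_x=0.0, sigma_x=1.0, mu_a=0.5, sigma_a=0.3, rng=rng)
usual_original_scale = [
    back_transform(x, s2, math.exp, q=2000, rng=rng) for x, s2 in pairs
]
```

Averaging $g^{-1}(x + u)$ over the error distribution, rather than evaluating $g^{-1}(x)$ directly, matters because $g^{-1}$ is nonlinear: for the exponential stand-in, the mean of $\exp(x + u)$ is $\exp(x + \sigma^2_u / 2)$, not $\exp(x)$.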
By definition,

$$y = E\{Y \mid x = x\} = E\{g^{-1}(x + u) \mid x = x\}.$$

To estimate this expectation, for each $(x_i, \sigma^2_{u_i})$ draw, we generate a large number $q$ of $u_{ij}$ from $u_{ij} \sim N(0, \sigma^2_{u_i})$ and approximate the expectation using a Monte Carlo mean: $y_i \approx q^{-1} \sum_{j=1}^{q} g^{-1}(x_i + u_{ij})$. The number of Monte Carlo replicates $q$ is chosen so as to obtain the required precision for $y_i$.

For $m_1$ transformations and $m_2$ samples of usual intakes from the measurement error model, we get $m_1 m_2$ samples of size $n$: $\{y_i\}^{(t)}$, $t = 1, \ldots, m_1 m_2$, from which we can approximate marginal posterior distributions of interest. For example, we derive the marginal posterior distribution of percentiles of the usual intake distribution of interest, $\Pr\{y^{(t)} \le a\} = \alpha$, for $\alpha = 0.01, 0.05, \ldots, 0.99$. We discuss this further in Section 4.

3.4 Summary of Complete Algorithm

The three stages described in Section 3 can be summarized as follows:

1. Draw transformations $g^{(l)}(Y) = X$, $l = 1, \ldots, M_1$.
2. Obtain transformed intakes $X_{11}^{(l)}, \ldots, X_{nd}^{(l)}$ for $l = 1, \ldots, m_1$ out of $M_1$ draws.
3. Using transformed sample $\{X_{ij}\}^{(l)}$, fit MEM $X_{ij}^{(l)} = x_i^{(l)} + u_{ij}^{(l)}$ via Gibbs, and obtain $m_1 m_2$ samples ($m_2$ out of $M_2$ draws for MEM) $(x_1^{(ls)}, \sigma^{2(ls)}_{u_1}), \ldots, (x_n^{(ls)}, \sigma^{2(ls)}_{u_n})$, $l = 1, \ldots, m_1$, $s = 1, \ldots, m_2$.
4. Backtransform: $y_i^{(ls)} = E_{q^*}\{g^{-1(l)}(x + u) \mid x = x_i^{(ls)}\}$, where $E_{q^*}(\cdot)$ is the MC average over draws $u_v^{(ls)} \sim N(0, \sigma^{2(ls)}_{u_i})$, $v = 1, \ldots, q$.
5. Obtain marginal posterior distributions of percentiles of $f^{(ls)}(y)$ and other relevant quantities.

4 Example

As stated in Section 2, we now illustrate the methodology using a cohort of females and males, ages 14-18, from CSFII 1994-1996. The female cohort consisted of 303 individuals, each of which had dietary data collected on two non-contiguous days. The male cohort consisted of 332 individuals, also with two non-consecutive days of dietary intake data each. We focus on six dietary components: calcium, cholesterol, iron, protein, vitamin A, and vitamin C.
In the case of calcium, iron, protein, vitamin A, and vitamin C, we are interested in estimating the proportion of teen-agers whose usual intakes do not meet recommendations. In the case of cholesterol, we are concerned with excessive intakes, and thus focus on the right tail of the distribution.

4.1 Performance of algorithm

The reversible jump MCMC algorithm worked well. Figures 1 and 2 show two realizations ($l = 2$ and $l = 2{,}000$) from the posterior distribution of $g(\cdot)$, and the pairs $(Y_{ij}, z_{ij})$ for females and males respectively, for each dietary component. We see from these figures that the WLS procedure places more weight on the center of the distribution and less weight on the tails (where there is considerably more variability). The transformation draws shown in the figures correspond to the case where the weight matrix $W$ was taken to be diagonal. The reversible jump MCMC algorithm converges quickly as monitored by the MSE (2) and to the same value based on multiple starting points (not shown in figures).

For the prior distribution on the number of knots, we set $\lambda = 6$. We chose a small value for the mean number of knots as the data had already been power-transformed, and just a few additional knots are likely to be needed to complete the transformation to normality. Results were not sensitive to changes in the value of $\lambda$ in the range 3 to 8. The number of knots drawn from the posterior distribution for the various dietary components ranged from about two to fourteen.

We monitored the convergence of the Markov chain of the parameters of the measurement error model using Gelman and Rubin-type statistics (Gelman and Rubin, 1992) and autocorrelation plots (as suggested in Cowles and Carlin, 1995). The convergence again was rather quick (within about 100 iterations).
For posterior inference, we sampled $m_1 = 25$ transformations (every 40th iteration after a burn-in of 1000, $M_1 = 2000$) and, for each transformation, sampled $m_2 = 20$ iterations (every 10th iteration after a burn-in of 100, $M_2 = 300$) from the measurement error model. This gives a total of 500 back-transformed samples of size 303 for females (332 for males) of the usual intakes, from which we compute posterior medians and 95% credible intervals (using the 2.5th and 97.5th quantiles of the posterior distribution) of the quantiles, and produce density plots.

4.2 Choice of weight matrix

As mentioned earlier, the weight matrix $W$ has dimensions $N \times N$. In our example, $N = 303 \times 2$ for females, and $N = 332 \times 2$ for males. As $N$ can be quite large, the inversion of $W$ can be impractical and very time consuming. Thus, we investigated whether results would be sensitive to using a simplified (diagonal) version of $W$ for computation. We chose to use the diagonal weight matrix for model-fitting as a compromise, since we can account for the extra variability of the quantiles in the tails and yet keep computations manageable. Because a kernel density estimator is used to estimate the density at the quantiles, use of the full weight matrix $W$ may result in a procedure that is not only inconvenient from a computational point of view, but also unstable, as density estimates at the tails get very small.

To decide whether results are sensitive to the choice of a diagonal version of $W$ vis-à-vis the "complete" version, we repeated the analyses using both forms of the weight matrix, for several of the dietary components under consideration. We only show results obtained for vitamin C (females), which appear in Table 1 and Figure 3. Ignoring the off-diagonal elements of $W$ in the computations had very little effect on final results. Estimates of quantities of interest, such as the mean, the standard deviation, and the quantiles of usual intakes are very similar, regardless of the weight matrix chosen.
For example, every 95% credible interval obtained using the diagonal weight matrix covers the corresponding point estimate obtained using the full weight matrix, and in fact, most point estimates are within a standard deviation of each other.

4.3 CSFII 1994-1996

We applied the method we propose to dietary intake data collected in the CSFII during the period 1994-1996, for the two cohorts described in Section 2. Figs. 4 and 5 display two estimates of the usual intake distribution of each dietary component for females and males, respectively. The density estimates drawn in dotted lines correspond to the distribution of individual two-day means. These "observed mean" distributions are skewed for all dietary components except protein, whose empirical mean distribution is almost symmetric but leptokurtic. As a result, it would not be appropriate to fit a normal measurement error model to intake data to remove the within-individual variance. Thus, a different parametric form must be chosen for the distribution of observed intake means, or dietary intake data should be transformed into normality prior to variance estimation.

Following the approach described in Section 3, we obtained the usual intake density estimates shown in solid lines in Figs. 4 and 5. The figures show, as expected, that after removal of measurement error, the estimated distributions of usual intakes have smaller variability than the distributions of two-day means.

Tables 2 and 3 show the mean, standard deviation, and selected percentiles of the distribution of observed individual means for each dietary component, for females and males, respectively. In addition, the tables also show the ratio of within- to between-individual variances for each dietary component. These variance ratios are all close to one, indicating that the measurement error variances are of about the same order of magnitude as the between-individual variances. Therefore, these within-individual variance components cannot be ignored.
Tables 4 and 5 show the mean and the 2.5th and 97.5th percentiles of the posterior distribution of the mean, standard deviation, and selected percentiles of the usual intake distribution for each dietary component, for females and males, respectively. A comparison of the entries in Tables 4 and 5 to those in Tables 2 and 3 confirms what Figs. 4 and 5 show: intake distributions have less variability and lighter tails, which result from the model's removal of the measurement error in the observed daily intakes. The differences between the two estimated densities can be large; 95% credible intervals in Tables 4 and 5 often do not contain the corresponding quantile of the observed individual mean distributions. This is particularly noticeable in the upper tail of the distributions.

Table 6 shows the mean and the 2.5th and 97.5th percentiles of the posterior distribution of the prevalence of nutrient inadequacy or, in the case of cholesterol, the prevalence of excessive intake, for females and males. Here, we estimate the prevalence of nutrient inadequacy as the proportion of individuals whose usual intake of the dietary component is less than 83% of the Recommended Dietary Allowance (RDA; e.g., NRC, 1989, page 285) for the nutrient (see, e.g., Carriquiry, 1998; IOM, 1999). For calcium, iron, protein, vitamin A, and vitamin C, Table 6 shows selected attributes of the posterior distribution of $\Pr(y < 0.83\,\mathrm{RDA})$ for females and males, respectively. In the case of cholesterol, we show the mean, 5th and 95th percentiles of the posterior distribution of $\Pr(y > 300\,\mathrm{mg})$. The interpretation of the entries in the table is the usual one. For example, for females, the point estimate of the prevalence of nutrient inadequacy for calcium is 0.84, and a posteriori, the probability that prevalence is between 0.77 and 0.91 is 95%.
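The prevalence summaries of the kind reported in Table 6 can be computed from the back-transformed samples along the following lines. This is a sketch only: the function names, the interpolation rule used for percentiles, the toy samples, and the `rda` value are hypothetical and are not taken from the paper or from any official RDA table.

```python
def prevalence(usual_intakes, cutoff):
    """Proportion of individuals whose usual intake is below the cutoff."""
    return sum(y < cutoff for y in usual_intakes) / len(usual_intakes)


def percentile(sorted_values, p):
    """Percentile (0 <= p <= 100) by linear interpolation between order statistics."""
    pos = (len(sorted_values) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (pos - lower) * (
        sorted_values[upper] - sorted_values[lower]
    )


def summarize_prevalence(samples, rda, fraction=0.83):
    """Posterior mean and 2.5th/97.5th percentiles of Pr(y < fraction * RDA).

    `samples` holds one list of n usual intakes per posterior draw.
    """
    draws = sorted(prevalence(sample, fraction * rda) for sample in samples)
    mean = sum(draws) / len(draws)
    return mean, percentile(draws, 2.5), percentile(draws, 97.5)


# Toy usage: two posterior draws of four usual intakes each, hypothetical RDA.
toy_samples = [[500, 900, 1100, 1300], [700, 800, 950, 1400]]
post_mean, lower_95, upper_95 = summarize_prevalence(toy_samples, rda=1200)
```

Each posterior draw yields one prevalence value, so the spread of these values across the $m_1 m_2$ draws carries the uncertainty from both the transformation and the measurement error model.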
5 Discussion

The analysis of dietary intake data is challenging, even if we do not take into account the various sources of bias and error that are often present in this type of data. It is recognized (see, e.g., IOM, 1999) that individuals tend to under-report the amount of food they consume. The extent of the under-reporting is known to vary by nutrient, and by gender-age-ethnic group, but little additional information about the direction and size of the biases is available. Attempts have been made to calibrate reported intake using various biochemical markers (see, e.g., IOM, 1999). These methods, however, are still in the experimental stage, are very costly, and are useful to adjust energy intakes at best. Nothing is known about the under-reporting of, for example, trace minerals. It is also known that the USDA databases used to map foods into nutrients are not always error-free (Schubert et al., 1987; Haytowitz et al., 1996). For example, the USDA databases lack precise information on the folate content of foods, as a national fortification effort that adds folate to various food items was implemented only in 1998 (IOM, 1998a). In this work, we do not take into account these potential sources of bias in dietary intake data. Rather, we focus on the problem of developing appropriate methods to analyze the data.

Estimating usual intake distributions of nutrients from dietary intake data can be difficult, as was argued in Section 2. The approach we have chosen consists of transforming the observed intakes into the normal scale, removing the measurement error in the normal scale, and then transforming individual estimated usual intakes back into the original scale. An alternative approach consists of using a parametric model other than the normal to represent the relationship between observed and usual intakes. For example, a Weibull or a Gamma distribution might be an appropriate representation for the distribution of intakes in the population.
This approach has the drawback that each new dietary component would require the identification of the most suitable model, thereby limiting the usefulness of the method for researchers in nutrition and areas other than statistics.

The normal-scale measurement error model we propose in Section 3.2 makes an assumption that is not necessarily satisfied: that once observed individual intake means are transformed into normality, both the usual intake and the measurement error components are also normally distributed. This is not necessarily so, although informal tests suggest that for all the dietary components we investigated, the assumptions of model (3) appear to hold. A deconvolution approach that guarantees that both the usual intakes and the measurement errors are normally distributed has also been proposed (Stefanski and Carroll, 1990; Stefanski and Carroll, 1991; Chen, 1999). For the specific case of dietary intake data, Chen (1999) argues that results obtained using a deconvolution approach are not noticeably different from those obtained by Nusser et al. (1996) using a frequentist version of the method we discuss in this manuscript.

We argue in Section 2 that a Bayesian framework is the most appropriate in this estimation problem, as the method for estimating usual nutrient intake distributions consists of several steps. Because the estimated transformation into normality and the estimated variance components in the measurement error model are used as if they were true values, the standard errors for estimators of the parameters of the usual intake distribution in the Nusser et al. (1996) approach underestimate the true uncertainty about the value of those parameters. An advantage of the Bayesian paradigm is that it permits proper accounting of all uncertainties, so that the posterior variance of, for example, the prevalence of nutrient inadequacy, reflects the uncertainty about all parameters in the model.
Thus, we expect that the 95% credible intervals obtained from the marginal posterior distributions will be wider than the 95% confidence intervals obtained from a frequentist analysis such as that presented by Nusser et al. (1996). Direct comparison of the Bayesian and frequentist approaches is not possible, as the model used for the transformation function in this paper is different from the one used in the Nusser et al. (1996) manuscript. Nonetheless, we carried out the analysis using the frequentist version of the method. Computations were done using C-SIDE (Iowa State University, 1997), software developed to implement the Nusser et al. (1996) method. Results obtained from a frequentist viewpoint are presented in Tables 7 and 8, for females and males, respectively. Point estimates of percentiles are somewhat similar under the two approaches. The 95% credible sets, however, tend to be wider, and need not be symmetric around the posterior means of the percentiles.

In our example, we estimated the prevalence of nutrient inadequacy in the population as the proportion of individuals with usual intakes below 83% of the RDA (NRC, 1989) for the nutrient. It has been argued (e.g., Beaton, 1994; Carriquiry, 1999) that the appropriate cut-off is the median of the distribution of requirements in the population, rather than the RDA. The National Academy of Sciences, however, has not yet published the value of the median requirement for any gender-age group. The exception is calcium, for which the Academy of Sciences has concluded that the median requirement for any group cannot be determined with the information that is currently available about calcium intakes and requirements (IOM, 1998b). Under simple assumptions, 83% of the RDA is approximately equal to the median requirement of the nutrient.

In Section 3.1, we used a generalized least squares approach to estimate the parameters of the function that transforms daily intakes into normality.
An alternative approach is as follows: define $g$ to be the function such that $P(Y_{ij} \le y) = P(Z \le g^{-1}(y))$, where $Z$ is distributed as a standard normal random variable. Again consider a cubic spline form for the function $g$. In this case, an iterative procedure is needed to obtain maximum likelihood estimates of the parameters in the model. If we let the parameter vector be $(k, r(k))$ as before, the likelihood for this model is $\prod_{i=1}^{n} \prod_{j=1}^{d} f(g^{-1}(y_{ij}))$, where $f$ denotes a standard normal density. To estimate the parameters in this model, we obtain an initial value for the parameters using GLS, and then carry out a single Newton-Raphson step to approach the MLE using analytic derivatives.

We have discussed the specific problem of estimating usual nutrient intake distributions, and presented an application consisting of estimating the prevalence of nutrient inadequacy among teenagers using dietary intake data collected between 1994 and 1996. Several related problems still require investigation.

An extension of the methods presented here to the case where the distribution of usual food intakes is of interest is not straightforward. Difficulties arise because in the case of foods, it is important to consider not only the amount of a food consumed, but also the probability that the individual would have consumed the food on the day when the interview was conducted. For many food items, the probability of consumption is not independent of the amount consumed, so estimating the marginal distribution of usual intake of foods can be challenging. Yet, the problem is an important one, as the distribution of usual food intakes is required to assess exposure rates to toxicants found in the food supply in a group.

Ratios of dietary components are also of importance. For example, researchers may be interested in assessing the proportion of individuals in a group who consume, on average, more than 30% of calories from fat, or more than 10% of calories from saturated fat.
The methods presented in this paper for estimating the usual intake distribution for a nutrient cannot be directly applied to ratios of dietary components such as those described above. Typically, both the numerator and the denominator in the ratio are observed subject to measurement error, and cannot be assumed to be independent.

References

Beaton, G.H. (1994) Criteria of an adequate diet. In: Shils, R.E., Olson, J.A., Shike, M., eds. Modern Nutrition in Health and Disease. Lea and Febiger, Philadelphia.
Blom, G. (1958) Statistical Estimates and Transformed Beta Variables. Wiley, New York.
Carriquiry, A.L. (1999) Assessing the prevalence of nutrient inadequacy. Public Health Nutrition. In press.
Carroll, R.J., Freedman, L.S., and Hartman, A.M. (1996) Use of semiquantitative food frequency questionnaires to estimate the distribution of usual intake. American Journal of Epidemiology, 143:392-404.
Chen, C. (1999) Spline estimators of the distribution function of a variable measured with error. Doctoral Thesis, Department of Statistics, Iowa State University.
Cowles, K., and Carlin, B.S. (1996) Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 81:86-98.
Denison, D.G.T., Mallik, B.K., and Smith, A.F.M. (1998) Automatic Bayesian curve fitting. Applied Statistics, 60:333-350.
Dodd, K. (1997) A Technical Guide to C-SIDE. Technical Report 96-TR 32, Dietary Assessment Research Series Report 9, Department of Statistics and Center for Agricultural and Rural Development (CARD), Iowa State University, Ames.
Eckert, R.S., Carroll, R.J., and Wang, N. (1997) Transformations to additivity in measurement error models. Biometrics, 53:262-272.
Gelman, A., and Rubin, D.B. (1992) Inference from iterative simulation using multiple sequences. Statistical Science, 7:457-472.
Green, P.J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711-732.
Haytowitz, D.B., Pehrsson, P.R., Smith, J., Gebhardt, S.E., Mathews, R.H., and Anderson, B.A. (1996) Key foods: setting priorities for nutrient analysis. Journal of Food Composition and Analysis, 9:331-364.
Huang, E.T., and Fuller, W.A. (1978) Nonnegative regression estimation for sample survey data. ASA Proceedings of the Social Statistics Section, 300-305.
Institute of Medicine (1998a) Dietary Reference Intakes: Thiamin, Riboflavin, Niacin, Vitamin B6, Folate, Vitamin B12, Pantothenic Acid, Biotin, and Choline. Preprint, National Academy Press, Washington, DC.
Institute of Medicine (1998b) Dietary Reference Intakes: Calcium, Phosphorus, Magnesium, Vitamin D, and Fluoride. Preprint, National Academy Press, Washington, DC.
Department of Statistics and Center for Agricultural and Rural Development, Iowa State University (1996) A User's Guide to C-SIDE: Software for Intake Distribution, Version 1.0. Technical Report 96-TR 31. Center for Agricultural and Rural Development, Iowa State University, Ames.
National Research Council (1986) Nutrient Adequacy. National Academy Press, Washington, DC.
National Research Council (1989) Recommended Dietary Allowances. 10th ed. National Academy Press, Washington, DC.
Nusser, S.M., Carriquiry, A.L., Dodd, K.W., and Fuller, W.A. (1996) A semiparametric transformation approach to estimating usual daily intake distributions. Journal of the American Statistical Association, 91:1440-1449.
Schubert, A., Holden, J.M., and Wolf, W.R. (1987) Selenium content of a core group of foods based on a critical evaluation of published analytical data. Journal of the American Dietetics Association, 87:285-299.
Smith, A.F.M., and Roberts, G.O. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 55:3-23.
Schervish, M. (1996) Theory of Statistics. Springer-Verlag, New York.
Spiegelhalter, D.J., Best, N.G., Gilks, W.R., and Inskip, H. (1996) Hepatitis B: a case study in MCMC methods.
In: Gilks, W.R., Richardson, S., and Spiegelhalter, D.J., eds. Markov Chain Monte Carlo in Practice. Chapman and Hall, pp. 339-358.
Stefanski, L.A., and Bay, J.M. (1996) Simulation extrapolation deconvolution of finite population cumulative distribution function estimators. Biometrika, 83:407-417.
Stefanski, L.A., and Carroll, R.J. (1990) Deconvoluting kernel density estimators. Statistics, 21:169-184.
Stefanski, L.A., and Carroll, R.J. (1991) Deconvolution-based score tests in measurement error models. The Annals of Statistics, 19:249-259.
U.S. Department of Agriculture, Agricultural Research Service (1997) Continuing Survey of Food Intakes by Individuals, 1994-1996. CSFII Report, Washington, DC: U.S. Government Printing Office.

Figure 1: Power transformed intake data $Y_{ij}$ (points), and two draws of the transformation function, at the 2nd (solid line) and 2,000th iterations (dashed line). Data correspond to females.

Figure 2: Power transformed intake data $Y_{ij}$ (points), and two draws of the transformation function, at the 2nd (solid line) and 2,000th iterations (dashed line). Data correspond to males.

Figure 3: Densities of the usual intake of vitamin C for females aged 14-18, estimated using the diagonal and non-diagonal forms of the weight matrix W in the transformation into normality.

Figure 4: Estimated densities of the usual intake of dietary components in females aged 14-18. The dotted curves correspond to the distribution of two-day means. The solid curves correspond to the Bayesian estimator described in Section 3.

Figure 5: Estimated densities of the usual intake of dietary components in males aged 14-18. The dotted curves correspond to the distribution of two-day means. The solid curves correspond to the Bayesian estimator described in Section 3.

| | Diagonal W | Full W |
|---|---|---|
| Mean | 88.2 (77.4, 99.7) | 86.1 (76.6, 96.9) |
| Std. Dev. | 52.9 (40.8, 66.4) | 52.0 (40.6, 66.6) |
| 1st percentile | 14.4 (7.8, 22.5) | 15.3 (9.1, 23.2) |
| 5th percentile | 23.7 (17.2, 32.5) | 24.3 (16.6, 31.8) |
| 10th percentile | 31.1 (23.7, 41.3) | 31.3 (23.3, 40.6) |
| 50th percentile | 77.0 (64.9, 89.8) | 74.5 (63.3, 85.0) |
| 90th percentile | 158.9 (134.6, 187.4) | 156.0 (134.3, 185.8) |
| 95th percentile | 187.7 (154.5, 228.6) | 186.3 (157.4, 228.1) |
| 99th percentile | 247.1 (194.9, 324.9) | 246.5 (195.9, 317.7) |

Table 1: Mean, standard deviation, and quantiles of the usual intake distribution of vitamin C for females aged 14-18. Values in parentheses are the 95% credible intervals. Estimates were obtained using a diagonal and a full weight matrix for estimation of the transformation into normality.

| | Mean | Std Dev | Ratio | 1st | 5th | 10th | 50th | 90th | 95th | 99th |
|---|---|---|---|---|---|---|---|---|---|---|
| Calcium (mg) | 750 | 348 | .9 | 210 | 273 | 343 | 685 | 1255 | 1405 | 1765 |
| Cholesterol (mg) | 208 | 125 | 1.6 | 36 | 59 | 76 | 180 | 369 | 434 | 668 |
| Iron (mg) | 13.4 | 7.0 | 1.0 | 3.9 | 5.9 | 7.1 | 12.0 | 20.1 | 25.2 | 42.5 |
| Protein (g) | 64.8 | 23.0 | 1.3 | 23.8 | 27.6 | 36.3 | 62.4 | 95.5 | 106.5 | 125.4 |
| Vit A (µg) | 790 | 680 | 1.1 | 112 | 176 | 231 | 577 | 1536 | 2005 | 3419 |
| Vit C (mg) | 88.0 | 83.1 | 1.1 | 7.3 | 11.1 | 15.6 | 61.1 | 191.4 | 251.0 | 405.2 |

Table 2: Mean, standard deviation, and selected percentiles of the distribution of two-day individual means, and ratio of within- to between-individual variance in intakes for females aged 14-18.

| | Mean | Std Dev | Ratio | 1st | 5th | 10th | 50th | 90th | 95th | 99th |
|---|---|---|---|---|---|---|---|---|---|---|
| Calcium (mg) | 1259 | 661 | .8 | 305 | 478 | 600 | 1112 | 2108 | 2631 | 3553 |
| Cholesterol (mg) | 315 | 180 | 1.5 | 59 | 104 | 132 | 281 | 556 | 627 | 961 |
| Iron (mg) | 21.0 | 11.1 | 1.0 | 6.6 | 8.7 | 10.2 | 18.4 | 33.4 | 42.6 | 63.3 |
| Protein (g) | 101.2 | 38.6 | 1.0 | 38.0 | 51.1 | 58.4 | 95.1 | 150.8 | 179.5 | 206.3 |
| Vit A (µg) | 1203 | 881 | 1.1 | 152 | 262 | 359 | 965 | 2322 | 3041 | 4055 |
| Vit C (mg) | 112.6 | 96.5 | 1.1 | 6.7 | 15.2 | 22.5 | 84.8 | 249.0 | 309.2 | 393.1 |

Table 3: Mean, standard deviation, and selected percentiles of the distribution of two-day individual means, and ratio of within- to between-individual variance in intakes for males aged 14-18.

| | Calcium (mg) | Cholesterol (mg) | Iron (mg) | Protein (g) | Vit A (µg) | Vit C (mg) |
|---|---|---|---|---|---|---|
| Mean | 748 (702, 800) | 207 (192, 222) | 13.3 (12.5, 14.3) | 64.8 (61.7, 67.7) | 779 (697, 868) | 88.2 (77.4, 99.7) |
| Std. Dev. | 248 (202, 296) | 62 (35, 84) | 4.2 (3.3, 5.3) | 14.4 (11.1, 18.3) | 406 (317, 522) | 52.9 (40.8, 66.4) |
| 1st | 298 (233, 369) | 95 (66, 137) | 6.2 (4.9, 7.7) | 35.7 (27.9, 42.5) | 208 (144, 289) | 14.4 (7.8, 22.5) |
| 5th | 389 (324, 451) | 117 (93, 156) | 7.7 (6.6, 8.8) | 42.7 (36.6, 48.2) | 295 (229, 368) | 23.7 (17.2, 32.5) |
| 10th | 447 (390, 510) | 133 (110, 165) | 8.6 (7.7, 9.6) | 47.0 (41.6, 51.9) | 355 (292, 428) | 31.1 (23.7, 41.3) |
| 50th | 724 (672, 781) | 200 (184, 217) | 12.7 (11.9, 13.6) | 64.0 (60.4, 67.1) | 687 (610, 774) | 77.0 (64.9, 89.8) |
| 90th | 1077 (988, 1183) | 290 (248, 324) | 18.8 (17.0, 21.0) | 83.5 (77.3, 90.5) | 1323 (1147, 1548) | 158.9 (134.6, 187.4) |
| 95th | 1188 (1072, 1321) | 317 (264, 369) | 21.1 (18.8, 24.1) | 89.4 (81.7, 97.8) | 1551 (1322, 1893) | 187.7 (154.5, 228.6) |
| 99th | 1389 (1216, 1585) | 372 (287, 458) | 25.6 (21.8, 31.1) | 100.6 (89.5, 115.0) | 2032 (1630, 2678) | 247.1 (194.9, 324.9) |

Table 4: Mean, standard deviation, and selected percentiles of the usual intake distribution of dietary components for females aged 14-18. Values in parentheses are the lower and upper bounds of 95% credible intervals.

| | Calcium (mg) | Cholesterol (mg) | Iron (mg) | Protein (g) | Vit A (µg) | Vit C (mg) |
|---|---|---|---|---|---|---|
| Mean | 1252 (1158, 1332) | 311 (244, 335) | 20.9 (19.6, 22.4) | 64.8 (61.7, 67.7) | 1180 (1082, 1290) | 110.8 (100.1, 123.6) |
| Std. Dev. | 461 (374, 563) | 101 (63, 129) | 7.0 (5.5, 8.7) | 14.4 (11.1, 18.3) | 563 (446, 685) | 61.3 (49.0, 73.2) |
| 1st | 478 (362, 589) | 133 (99, 174) | 9.5 (7.9, 11.4) | 35.7 (27.9, 42.5) | 313 (221, 428) | 21.5 (13.8, 32.0) |
| 5th | 627 (530, 735) | 168 (140, 206) | 11.7 (10.0, 13.3) | 42.7 (36.6, 48.2) | 450 (361, 561) | 33.7 (25.1, 44.6) |
| 10th | 726 (633, 828) | 192 (163, 226) | 13.1 (11.7, 14.8) | 47.0 (41.6, 51.9) | 554 (460, 664) | 43.0 (34.5, 55.4) |
| 50th | 1185 (1100, 1275) | 299 (239, 323) | 19.9 (18.3, 21.5) | 64.0 (60.4, 67.1) | 1086 (976, 1203) | 98.6 (86.9, 112.1) |
| 90th | 1869 (1666, 2062) | 445 (327, 504) | 30.1 (27.5, 33.7) | 83.5 (77.3, 90.5) | 1939 (1702, 2215) | 192.9 (167.7, 221.1) |
| 95th | 2102 (1855, 2385) | 493 (352, 567) | 33.9 (30.2, 38.6) | 89.4 (81.7, 97.8) | 2235 (1927, 2580) | 226.7 (193.8, 262.5) |
| 99th | 2569 (2205, 3035) | 588 (422, 728) | 41.3 (34.8, 49.5) | 100.6 (89.5, 115.0) | 2816 (2320, 3374) | 293.2 (240.4, 352.0) |

Table 5: Mean, standard deviation, and selected percentiles of the usual intake distribution of dietary components for males aged 14-18. Values in parentheses are the lower and upper bounds of 95% credible intervals.

| | Females: RDA or cut-point | Females: prevalence | Males: RDA or cut-point | Males: prevalence |
|---|---|---|---|---|
| Calcium | RDA 1,200 mg | .84 (.77, .91) | RDA 1,200 mg | .33 (.24, .41) |
| Iron | RDA 15 mg | .48 (.38, .56) | RDA 12 mg | .02 (.00, .05) |
| Protein | RDA 44 g | .01 (.00, .05) | RDA 59 g | .00 (.00, .02) |
| Vitamin A | RDA 800 µg | .48 (.37, .56) | RDA 1,000 µg | .30 (.22, .38) |
| Vitamin C | RDA 60 mg | .26 (.17, .35) | RDA 60 mg | .14 (.07, .21) |
| Cholesterol | Cut-point 300 mg | .08 (.00, .16) | Cut-point 300 mg | .50 (.18, .59) |

Table 6: Mean of the posterior distribution of prevalence of nutrient inadequacy among females and males aged 14-18, and 5th and 95th posterior percentiles. Here, prevalence is defined as $\Pr(y < 0.83\,\mathrm{RDA})$, where the RDA for each nutrient is the value published in the 1989 NRC report. For cholesterol, we report the mean, 5th and 95th percentiles of the posterior distribution of the prevalence of excessive intakes $\Pr(y > 300\,\mathrm{mg})$.

| | Calcium (mg) | Cholesterol (mg) | Iron (mg) | Protein (g) | Vit A (µg) | Vit C (mg) |
|---|---|---|---|---|---|---|
| 1st | 293 (227, 359) | 105 (78, 132) | 6.1 (4.9, 7.3) | 35.9 (29.7, 42.1) | 210 (140, 280) | 16 (9, 23) |
| 5th | 393 (332, 454) | 130 (107, 153) | 7.8 (6.7, 8.9) | 43.4 (38.2, 48.6) | 315 (243, 387) | 26 (19, 34) |
| 10th | 455 (399, 511) | 144 (123, 165) | 8.8 (7.8, 9.8) | 47.6 (43.1, 52.1) | 386 (315, 456) | 34 (26, 42) |
| 50th | 727 (682, 772) | 204 (188, 220) | 13.0 (12.3, 13.7) | 64.5 (61.3, 67.7) | 729 (660, 798) | 77 (68, 86) |
| 90th | 1092 (990, 1194) | 284 (247, 321) | 19.3 (17.1, 21.5) | 84.2 (77.9, 90.5) | 1362 (1127, 1597) | 158 (131, 185) |
| 95th | 1214 (1083, 1345) | 311 (263, 359) | 21.9 (18.8, 25.0) | 90.4 (82.6, 98.2) | 1639 (1300, 1978) | 191 (153, 229) |
| 99th | 1469 (1271, 1667) | 369 (297, 441) | 28.0 (22.7, 33.3) | 102.6 (91.6, 113.6) | 2328 (1705, 2951) | 266 (200, 332) |

Table 7: Selected percentiles of the usual intake distribution of dietary components for females aged 14-18, estimated using the Nusser et al. (1996) frequentist approach. Values in parentheses are the lower and upper bounds of the approximate 95% confidence intervals computed using a balanced repeated replication method.
| | Calcium (mg) | Cholesterol (mg) | Iron (mg) | Protein (g) | Vit A (µg) | Vit C (mg) |
|---|---|---|---|---|---|---|
| 1st | 464 (372, 556) | 142 (110, 174) | 9.0 (7.5, 10.5) | 55 (47, 63) | 360 (261, 459) | 22 (14, 30) |
| 5th | 620 (532, 708) | 179 (150, 208) | 11.4 (10.0, 12.8) | 66 (59, 73) | 508 (407, 609) | 35 (26, 44) |
| 10th | 720 (636, 804) | 202 (175, 229) | 12.8 (11.5, 14.1) | 73 (67, 80) | 607 (508, 706) | 45 (36, 54) |
| 50th | 1186 (1112, 1260) | 303 (281, 325) | 19.8 (18.6, 21.0) | 99 (94, 103) | 1102 (1004, 1200) | 100 (90, 110) |
| 90th | 1885 (1697, 2073) | 442 (391, 493) | 30.4 (27.2, 33.6) | 133 (123, 143) | 1919 (1649, 2189) | 199 (168, 230) |
| 95th | 2136 (1885, 2387) | 489 (424, 554) | 34.4 (30.1, 38.7) | 144 (131, 157) | 2230 (1862, 2598) | 238 (195, 281) |
| 99th | 2687 (2283, 3091) | 589 (489, 689) | 43.3 (36.1, 50.5) | 166 (147, 185) | 2938 (2319, 3557) | 329 (255, 402) |

Table 8: Selected percentiles of the usual intake distribution of dietary components for males aged 14-18, estimated using the Nusser et al. (1996) frequentist approach. Values in parentheses are the lower and upper bounds of the approximate 95% confidence intervals computed using a balanced repeated replication method.