Modeling Physical Activity Data Dave Osthus and Bryan Stanfill Center for Survey Statistics and Methodology, Iowa State University March 28, 2011 Dave & Bryan PAMS Outline Modeling Usual Energy Expenditure Motivation Potential groupings and influential factors A look at the data Analysis outline Results Modeling Physical Activity Behavior Set up of problem Binomial regression idea Logistic regression idea Dave & Bryan PAMS Motivation Due to the current obesity epidemic there has been heightened interest in energy consumption exceeding expenditure, especially in certain groups of the population. Nutritionists have developed models to better understand energy consumption, but a similar treatment of energy expenditure (EE) is less developed. Dr. Beyler and others developed a model to account for the various biases in self-report questionnaires and result in an estimated distribution of the usual energy expenditure for groups of people. Dave & Bryan PAMS Potential Groupings and Influential Factors Gender Race/Ethnicity Age BMI/Weight Education Number of adults in a household Smoking status Dave & Bryan PAMS Survey Objectives The Physical Activity Measurement Study (PAMS) is a survey designed to obtain information on physical activity patterns of Adult women and men (21-70) Hispanic and African American populations (limited sample size) Rural and non-rural adults Dave & Bryan PAMS Survey Objectives, Continued More specifically, Individuals are sampled from four counties in Iowa: Marshall, Black Hawk, Dallas and Polk. Goal is 1200 participants at the end of the study, spanning two years. Approximately equal number of males and females. Approximately 10% African American and 10% Hispanic. Minorities oversampled. Dave & Bryan PAMS Data Collection Process Data collection was intended to sample individuals uniformly over two years, partitioned into eight quarters. At the individual level: Data were collected on two non-consecutive installments. For each installment: Individual wore SenseWear Monitor for 24 hours. 24-hour activity recall administered via phone the following day. Dave & Bryan PAMS A Look at the Data Below are the daily energy expenditure distributions (in kilocalories per day) by gender and measurment type. The top row corresponds to measured EE from SenseWear Monitor and the bottom row reported EE from the 24-hour recall. The red vertical lines indicate the mean EE for each cell. Female Male 50 40 Measured 30 20 count 10 0 50 40 Reported 30 20 10 0 2000 4000 6000 8000 10000 12000 2000 4000 Energy Expendature Dave & Bryan PAMS 6000 8000 10000 12000 Scatter plots Scatter plots of reported EE vs. measured EE by gender. Female Male 12000 Reported EE 10000 8000 6000 4000 2000 2000 4000 6000 8000 2000 Measured EE Dave & Bryan PAMS 4000 6000 8000 A Look at the Data, Continued Box-plots of the reported EE and measured EE for women in age groupings with ranges of 21-40, 41-50, 51-60 and 61-70 and sample sizes of 69, 76, 103 and 78, respectively. A clear pattern is present in the monitor EE, which indicates that grouping data of this kind is justified. Monitor Energy Expenditure by Age Group 61−70 ●● ● Reported Energy Expenditure by Age Group 61−70 ● ● 51−60 ● ● ● ● Age (years) Age (years) 51−60 ● 41−50 ●●● 21−40 ● ● ●●●● 1500 2000 2500 3000 3500 ● 4000 41−50 ● ● ● ●● 21−40 4500 3000 4000 5000 Reported EE Dave & Bryan PAMS ● ● ● 2000 Monitor EE ● 6000 ● ● 7000 ●● 8000 Procedure for Estimating Usual Daily EE Parameters 1 Transform measured and reported EE to approximate normality 2 Test for nuisance effects (e.g. day of week, season, replicate) 3 Fit a model to the adjusted transformed measurements for each group to account for sources of variation and bias 4 Fit a population model based on group-level estimates to reduce the number of parameters (where appropriate) 5 Estimate a distribution of usual daily EE for each group on the original scale For what follows, we only consider females with two complete replicates, i.e. two sets of monitor and 24-hour recall data, with at least 99% of their monitor data accounted for. Also, all females with either a meaured or reported daily energy expenditure above 8000 kcals/day were considered outliers and removed. Dave & Bryan PAMS Transform EE Measurements to Approximate Normality 150 150 1500 2500 3500 Monitor EE 4500 7.2 7.6 8.0 8.4 100 0 0 0 20 50 40 50 Frequency 100 Frequency 80 60 Frequency 60 40 20 0 Frequency 80 100 100 120 120 The distributions of measured EE and reported EE on the left and right, respectively are plotted on both the original and log scale. The log transformation produces a distribution that is passably normal. 1000 Monitor EE on Log Scale 3000 5000 Reported EE Measured EE 7000 7.5 Reported EE Dave & Bryan PAMS 8.0 8.5 9.0 Reported EE on Log Scale Adjust for Nuisance Effects A weighted linear regression taking into account both survey weights and the stratified sampling design was fit to the log transformed EE data. In particular we’re interested to see if the nuisance effects: Rep, Quarter 1, Quarter 2 or Weekend (day of week) are significant. Reported EE Measured EE Covariate Intercept Rep Age Age2 Quarter1 Quarter2 Hispanic Black College Smoke Weekend Estimate 7.8559767 0.0041662 0.0027932 -0.0000836 0.0029372 0.0157222 -0.0257205 0.0399515 -0.0120230 -0.0342825 0.0263776 p-value <.0001 0.7127 0.6471 0.1707 0.9101 0.5753 0.7194 0.2422 0.5864 0.1428 0.2246 Covariate Intercept Rep Age Age2 Quarter1 Quarter2 Hispanic Black College Smoke Weekend No nuisance effects appear to be significant. Dave & Bryan PAMS Estimate 7.7928713 0.0183775 0.0091680 -0.0001002 0.0412050 -0.0052034 -0.1240973 0.0925985 -0.0274299 0.0024525 -0.0080371 p-value <.0001 0.1954 0.3965 0.3446 0.2965 0.8956 0.2308 0.0274 0.3968 0.9421 0.7660 Measured EE Model xgij = µg + tgi + dgij + ugij where xgij is a measurement of daily EE in normal scale for individual i on day j in group g from an unbiased reference instrument µg is the mean daily EE in the normal scale for group g tgi is individual i’s mean deviation from the mean daily EE of group g, in the normal scale 2 tgi ∼ N(0, σtg ) dgij is individual i’s deviation from his of her mean daily EE on day j in the normal scale 2 dgij ∼ N(0, σdg ) ugij is random measurement error for individual i on day j in group g, in the normal scale 2 ugij ∼ N(0, σug ) and 2 2 2 Var(xgij ) = Var(µg + tgi + dgij + ugij ) = σtg + σdg + σug under the assumptions that Cov(tgi , dgij ) = Cov(tgi , ugij ) = Cov(dgij , ugij ) = 0 Dave & Bryan PAMS Reported EE Model ygij = µyg + β1g (tgi + dgij ) + rgi + egij where ygij is the measurement of daily EE in normal scale for individual i on day j in group g from the self-report instrument µyg is the group mean of self-reported daily EE in the normal scale for group g β1g is the slope that accounts for the systematic error in the relationship between self-report and actual daily EE in group g rgi is individual i’s deviation from the self-reported group-level mean in the normal scale 2 rgi ∼ N(0, σrg ) egij is random measurement error in the self-report on the normal scale for individual i on day j in group g 2 egij ∼ N(0, σeg ) and 2 2 2 2 2 2 Var(ygij ) = Var(µyg + β1g (tgi + dgij ) + rgij + egij ) = β1g σtg + β1g σdg + σrg + σeg under the assumptions that all random effects are uncorrelated. Dave & Bryan PAMS Parameter Estimates for Females in Log Scale The following are age group fixed effect parameter estimates obtained via method of moments. Interpretation (Parameter) Usual PA (µg ) Reported PA (µyg ) Pop Bias (β1g ) Group 1 (<41) n=69 7.878 (0.019) 8.010 (0.035) 0.47 (0.24) Group 2 (41-50) n=76 7.791 (0.018) 7.987 (0.027) 0.89 (0.22) Group 3 (51-60) n=103 7.753 (0.015) 8.016 (0.024) 1.13 (0.14) Group 4 (61-70) n=78 7.700 (0.018) 7.985 (0.026) 0.72 (0.14) The following are age group random effect parameter estimates obtained via method of moments. Interpretation (Parameter) 2 Usual PA (σtg ) 2 Day to day (σdg ) 2 Monitor ME(σug ) 2 Recall ME (σeg ) 2 Person-Bias(σrg ) Group 1 (<41) n=69 0.0271 (0.0056) 0.028 (0.016) -0.012 (0.015) 0.0160 (0.0045) 0.0579 (0.0104) Group 2 (41-50) n=76 0.0147 (0.0045) 0.0082 (0.0025) 0.0019 (0.0024) 0.0060 (0.0021) 0.0275 (0.0045) Dave & Bryan PAMS Group 3 (51-60) n=103 0.0208 (0.0037) 0.0041 (0.0009) 0.0053 (0.0020) 0.0060 (0.0016) 0.0222 (0.0056) Group 4 (61-70) n=78 0.0242 (0.0052) 0.0040 (0.0017) 0.0001 (0.0015) 0.0053 (0.0015) 0.0198 (0.0057) Age Plots Estimated lines relating mean daily EE and reported daily EE by age groups. 7.5 8.0 8.5 9.0 8.5 8.0 24PAR EE (log scale) 7.5 7.5 8.0 8.5 Age Group 4 7.5 8.0 9.0 8.0 8.5 ● ● ● ●● ● ● ●● ● ●●● ●● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●●●●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ●●● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ●● ● 7.5 24PAR EE (log scale) 8.5 ● ● ● 7.0 Monitor EE (log scale) Dave & Bryan 9.0 9.0 Age Group 3 9.0 8.5 8.0 7.0 Monitor EE (log scale) 7.0 7.5 7.0 9.0 Monitor EE (log scale) ● ● ● ●●● ● ● ●●●● ● ● ●● ●● ● ● ●● ● ● ●●●● ● ● ●● ● ●● ●● ● ● ● ●●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ●● ● ●● ● ● ●● ● ● ● ●● ●● ●● ● ●●● ● ● ● ● ● ● 7.0 ● ●● ● ●● ●● ●●● ●●● ●● ● ●● ● ●●● ●● ● ●● ● ● ● ● ●●● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●●● ● ● ● ●●● ● ● ●● ●● ● ●● ● ●● ● ●● ●● ● ●● ● ● ● ● 7.0 9.0 7.5 8.0 8.5 ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ● ●●● ●● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ●● ● ● ● ●● ●● ●● ●● ● ● ●●●●● ● ● ● ●● ● ●● ● ●● ● ● ● ● 7.0 24PAR EE (log scale) Age Group 2 7.0 24PAR EE (log scale) Age Group 1 7.5 8.0 8.5 Monitor EE (log scale) PAMS 9.0 Estimated Distributions of Usual Energy Expenditure 0.0004 0.0008 Age Group 1 Age Group 2 Age Group 3 Age Group 4 0.0000 probability density 0.0012 Estimated distributions of usual daily EE for all four age groups 1500 2000 2500 3000 3500 usual daily EE (kcal/d) Dave & Bryan PAMS 4000 4500 Estimated Distributions of Usual Energy Expenditure Estimated distributions of usual daily EE, individual means of daily monitor EE, and individual means of reported daily EE for age group 1 (age < 41) 1e−03 Age Group 1 (Age 21 − 40) 6e−04 4e−04 2e−04 0e+00 probability density 8e−04 usual daily EE monitor means 24PAR means 2000 4000 6000 daily EE (kcal/d) Dave & Bryan PAMS 8000 Modeling Physical Activity Behavior The Center for Disease Control (CDC) and World Health Organization (WHO) make physical activity recommendations for age groups based on intensity (METs) and type (aerobic or muscle-training) of physical activity. This brings two questions to mind: 1 Are people complying with these recommendations? 2 What covariates are related to compliance? Dave & Bryan PAMS Recommendations in More Detail Recommendations for adults (18-64 years) are of the following form: Aerobic exercise should be in bouts of at least 10 minute durations of the following intensity/duration levels: 1 2 3 At least 150 minutes of moderate-intensity (3-6 METs) aerobic exercise OR At least 75 minutes of vigorous-intensity (over 6 METs) aerobic exercise OR An equivalent combination of moderate and vigorous aerobic exercise Muscle-strengthening activities should be performed two or more days a week. How could compliance with these recommendations be modeled? Dave & Bryan PAMS First Idea: Model Relationship between P(Bout) and Age Binomial Regression exp(β0 + β1 ∗ xi ) 1 + exp(β0 + β1 ∗ xi ) pi is the probability age i adults will participate in a continuous bout of some defined period of time on any given day xi is age i log (pi /(1 − pi )) = β0 + β1 ∗ xi ⇒ pi = y 2 0 7 10 6 4 ● ● 0.8 ● ● ● ● 0.6 ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● 0.4 P(Hour Bout) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.2 n 2 1 11 14 8 6 0.0 Age(x) 20 21 22 23 24 25 1.0 P(Hour Bout) vs. Age ● ● 20 30 40 50 Age Dave & Bryan PAMS 60 70 Fitted First Idea Fitting the previous model we get the following parameter estimates β0 β1 Estimate 1.60326 -0.02416 Std. Error 0.34769 0.00672 z-value 4.611 -3.595 Pr (> |z|) 4.00e-06 0.000325 We can interpret the β1 parameter as follows: All else being equal, the odds of an individual with age x + 1 engaging in 60 minutes of physical activity on a given day are a factor of 0.98 times that of an individual with age x. 1.0 P(Hour Bout) vs. Age ● ● ● 0.8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.6 ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● 0.4 ● ● ● ● ● ● 0.2 0.0 P(Hour Bout) ● ● 20 30 40 50 Age Dave & Bryan PAMS 60 70 A Bayesian Approach to First Idea A Bayesian approach to this model considers β0 and β1 to be random variables which makes pi a random quantity. A general procedure is as follows: Define pi as before with likelihood L(β0 , β1 ) ∝ N Y piyi (1 − pi )ni −yi . i=1 Put a prior on β0 and β1 . Here we put Beta priors on two specific ages which effectively adds two observations to our dataset and is given by a1 a2 g (β0 , β1 ) ∝ page25 (1 − page25 )b1 page65 (1 − page65 )b2 . Simulate from the joint posterior which is given by f (β0 , β1 |y ) ∝ N+2 Y i=1 Dave & Bryan PAMS piyi (1 − pi )ni −yi . Bayesian Results Below is the joint posterior distribution of β0 and β1 along with the fitted line to the data with a 95% credible bands. The Bayesian approach allows us to make statements such as: we are 95% sure the true probability that a 24 year old will engage in one hour of physical activity is between 65% and 80%. Joint Posterior for β0 and β1 ● 0.01 1.0 P(Hour Bout) vs. Age Curve ● ● 0.00 0.8 ● ● ● ● −4.6 −0.01 ● ● ● ● ● ● ● ● ● ● ● P(Hour Bout) −0.02 ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● 0.4 −2.3 0.6 ● ● ● ● ● −0.04 0.2 −0.03 ● −0.05 −6.9 0.5 1.0 1.5 2.0 2.5 0.0 β1 ● ● ● 3.0 ● 20 β0 30 40 50 age Dave & Bryan PAMS 60 70 Second Idea: Model Number of Bouts per Person Let Yij ∼ Poisson(λi ) where Yij is the number of bouts recorded for person i on day j Poisson Regression log(λi ) = β0 + β1 Agei + β2 Genderi + β3 BMIi + · · · Distribution of Number of Bouts 500 400 count 300 200 100 0 0 1 2 3 4 5 Number of Bouts Dave & Bryan PAMS 6 7 Poisson Regression continued We would expect a lot more 0’s than this model allows so we’d make this a zero-inflated Poisson model as follows: ( Yij ∼ Poisson(λi ) with probability θ Yij = 0 with probability 1 − θ In this case we’re interested in the value of θ and identifying the sub-populations (according to the value of λi ) that deflate θ by not getting the recommended level of activity. Dave & Bryan PAMS Join In Ideas? Thoughts? Recommendations? Dave & Bryan PAMS