Calculus for Biologists Lab
Math 1180-002
Spring 2012

Lab #13 - Diabetes, Obesity and Hypothesis Testing

Report due date: Tuesday, April 24, 2012 at 9 a.m.

Goal: To develop one- and two-tailed hypothesis tests in assessing the relationship between obesity and type 2 diabetes. You will use Monte Carlo simulations to compute p-values.

⋆ Create a new script, either in R (laptop) or with a text editor (Linux computers).

Background

Type 2 diabetes (hereafter referred to as ‘diabetes’) is a worldwide epidemic that affects hundreds of millions of people. It’s estimated that more than 25 million Americans have the disease. Very severe complications may accompany the disease, contributing to its ranking as the seventh leading cause of death in the U.S. In addition, there is a strong association between diabetes and obesity. We’ll take a look at the statistics to see what conclusions can be made about this connection.

The plan: hypothesis testing

We will identify two major null hypotheses (H0) and then test them against sample data to determine the significance of the results. It will be your job to formulate these hypotheses.

Part I: Is obesity more common in type 2 diabetics than in the general population?

An estimated 36% of the U.S. adult (> 20 years) population is obese (Centers for Disease Control). Suppose investigators conducting a metabolic study assess 100 diabetics for obesity. They find that exactly 73 of these individuals are obese.

Record your answers to the following questions to include in your assignment. Keep track of all question numbers throughout the remainder of the lab.

1a. State the null hypothesis regarding the proportion of the diabetic population that is obese.
1b. State the alternative hypothesis for this situation, given that you have reason to believe that the true proportion of diabetics with obesity lies on the same side of the null hypothesis as the sample population mentioned above.

To test these hypotheses in R, we need to assign some variables. Define q.ob as the prevalence of obesity in the general population, diabetics as the total number of people tested and diab.with.ob as the number of sampled diabetics with obesity.

    q.ob = ##
    diabetics = ##
    diab.with.ob = ##

To perform the Monte Carlo simulation, we will create the function sim to generate a number of fake populations for us.

    ## each column is one simulated population; 1 = has the condition, 0 = does not
    sim = function(n.people,n.pops,q){
      sim.pops = matrix(sample(c(0,1), n.people*n.pops, replace=TRUE, prob=c(1-q,q)), n.people, n.pops)
      return(sim.pops)
    }

Let’s do 1000 simulations. Define null1 as the proportion you specified in your null hypothesis.

    n.sim = 1000
    null1 = ##

As we did last week, we just need to simulate a bunch of populations and calculate how many diabetics have obesity in each one. The data will be generated according to the frequency of obesity given by the null hypothesis. We’ll call the result monte1.

    monte1 = colSums(sim(diabetics, n.sim, null1))

1c. What probability (in words) will the p-value of the data give us?
1d. Will this be a one- or two-tailed test?
1e. What does it mean to reject the null hypothesis?
1f. What is the minimal number of simulations that need to satisfy the alternative hypothesis, in order for us to reject the null hypothesis?

The following line of code will compute the p-value pval1 for the current data.

    pval1 = length(which(monte1 >= diab.with.ob))/n.sim

The which() command finds the particular elements of its argument that satisfy the given condition. We use length to count how many of these populations there are. We divide by n.sim to get the probability.

1g. Record pval1.
1h. Should you accept or reject the null hypothesis based on this value? Explain.
1i. Based on this information, is obesity more common in type 2 diabetics than in the general population? Explain.

Part II: Is diabetes more common in the obese population than in the general population?

An estimated 11% of the U.S.
adult (> 20 years) population has type 2 diabetes (American Diabetes Association). Suppose the same group of investigators tests 100 obese individuals for diabetes. They find that exactly 8 of these individuals are diabetic.

2a. State the null hypothesis regarding the proportion of the obese population that has type 2 diabetes.
2b. State the alternative hypothesis for this situation, given that you have reason to believe that the true proportion of obese individuals with diabetes lies on the same side of the null hypothesis as the sample population mentioned above.
2c. State a second alternative hypothesis, given you have no information about where the true proportion lies.

We will go through the same process to test these hypotheses as in Part I. Define q.diab as the prevalence of diabetes in the general adult population, obese as the number of people the investigators tested for diabetes and ob.with.diab as the number found to be diabetic.

    q.diab = ##
    obese = ##
    ob.with.diab = ##

To implement the Monte Carlo method, define the appropriate null hypothesis proportion null2, and execute the subsequent lines. We will again run 1000 simulations, so there is no need to redefine n.sim for this part.

    null2 = q.diab  ## freq of diabetes in obese same as in general population
    monte2 = colSums(sim(obese, n.sim, null2))

2d. What probability (in words) will the p-value of a one-tailed test give us?
2e. What probability (in words) will the p-value of a two-tailed test give us?

For the one-tailed test, we need to know the number of simulations that resulted in counts at or below the observed count, which itself lies below the expectation from the general population. The fraction of all simulations satisfying this is our one-tailed p-value.

    x.lo = which(monte2 <= ob.with.diab)
    pval2.onetail = length(x.lo)/n.sim

2f. Record pval2.onetail.
2g. Should you accept or reject the null hypothesis based on this value? Explain.
2h. Based on this information, is diabetes more common in the obese population than in the general population?
Explain.

For the two-tailed test, we need both the simulations in x.lo and those that resulted in a total at least as far away from the expectation, but on the other side of it. The sum of the corresponding fractions gives us the two-tailed p-value.

    ## define expectation (expected count under the null hypothesis)
    null.expect = round(null2*obese)
    ## add the difference to the mean; if subtracted, would get x.lo instead
    x.hi = which(monte2 >= null.expect + abs(null.expect - ob.with.diab))
    ## compute two-tailed p-value
    pval2.twotail = length(x.lo)/n.sim + length(x.hi)/n.sim

2i. Record pval2.twotail.
2j. Should you accept or reject the null hypothesis based on this value? Explain.
2k. Based on this information, does the prevalence of diabetes differ significantly between the obese and general populations? Explain.

⋆ Save your script so that you can use it for your assignment.
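
Optional check: a Monte Carlo p-value is only an estimate, so it can be useful to compare it with the exact binomial tail probability. The sketch below is one way to do that, using made-up numbers rather than the lab data. The values n.people = 40, null.q = 0.5 and observed = 26, the seed, and the names p.monte and p.exact are all assumptions chosen for illustration; pbinom is R's built-in binomial distribution function.

```r
## Hypothetical illustration values (assumed; NOT the lab data)
set.seed(1180)      # fixed seed so the run is repeatable
n.people = 40       # assumed sample size
null.q   = 0.5      # assumed null-hypothesis proportion
observed = 26       # assumed observed count
n.sim    = 1000     # number of simulated populations

## same simulator as defined earlier in the lab
sim = function(n.people, n.pops, q){
  sim.pops = matrix(sample(c(0,1), n.people*n.pops, replace=TRUE, prob=c(1-q,q)),
                    n.people, n.pops)
  return(sim.pops)
}

## count of "1"s in each simulated population
monte = colSums(sim(n.people, n.sim, null.q))

## Monte Carlo upper-tail p-value: fraction of simulations with count >= observed
p.monte = length(which(monte >= observed))/n.sim

## exact upper-tail probability P(X >= observed) for X ~ Binomial(n.people, null.q);
## pbinom with lower.tail=FALSE gives P(X > q), so we pass observed - 1
p.exact = pbinom(observed - 1, n.people, null.q, lower.tail=FALSE)

print(c(monte.carlo = p.monte, exact = p.exact))
```

The two numbers should be close but not identical; increasing n.sim should usually shrink the gap, at the cost of a longer run.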