Calculus for Biologists Lab Math 1180-002 Spring 2012

Lab #13 - Diabetes, Obesity and Hypothesis Testing
Report due date: Tuesday, April 24, 2012 at 9 a.m.
Goal: To develop one- and two-tailed hypothesis tests for assessing the relationship between obesity and type 2
diabetes. You will use Monte Carlo simulations to compute p-values.
⋆ Create a new script, either in R (laptop) or with a text editor (Linux computers).
Background
Type 2 diabetes (hereafter referred to as ‘diabetes’) is a worldwide epidemic that affects hundreds of millions
of people. It’s estimated that more than 25 million Americans have the disease. Very severe complications
may accompany the disease, contributing to its ranking as the seventh leading cause of death in the U.S. In
addition, there is a strong association between diabetes and obesity. We’ll take a look at the statistics to see
what conclusions can be made about this connection.
The plan: hypothesis testing
We will identify two major null hypotheses (H0) and then test each of them against sample data to determine
the significance of the results. It will be your job to formulate these hypotheses.
Part I: Is obesity more common in type 2 diabetics than in the general population?
An estimated 36% of the U.S. adult (> 20 years) population is obese (Centers for Disease Control). Suppose
investigators conducting a metabolic study evaluate the obesity criteria of 100 diabetics. They find that exactly
73 of these individuals are obese.
Record your answers to the following questions to include in your assignment. Keep track of all question
numbers throughout the remainder of the lab.
1a. State the null hypothesis regarding the proportion of the diabetic population that is obese.
1b. State the alternative hypothesis for this situation, given that you have reason to believe that the true
proportion of diabetics with obesity lies on the same side of the null hypothesis as the sample population
mentioned above.
To test these hypotheses in R, we need to assign some variables. Define q.ob as the prevalence of obesity in
the general population, diabetics as the total number of people tested and diab.with.ob as the number of
sampled diabetics with obesity.
q.ob = ##
diabetics = ##
diab.with.ob = ##
To perform the Monte Carlo simulation, we will create the function sim to generate a number of fake populations
for us.
## each column is one simulated population; 1 = has the condition (probability q), 0 = does not
sim = function(n.people, n.pops, q){
  sim.pops = matrix(sample(c(0,1), n.people*n.pops, replace=TRUE, prob=c(1-q,q)),
                    n.people, n.pops)
  return(sim.pops)
}
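As a quick check of what sim returns, the sketch below builds a small set of fake populations. The sizes 5 and 4, the probability 0.5, the seed and the name demo are arbitrary choices for illustration only; they are not part of the lab data.

```r
## sim as defined above, repeated so this snippet runs on its own
sim = function(n.people, n.pops, q){
  sim.pops = matrix(sample(c(0,1), n.people*n.pops, replace=TRUE, prob=c(1-q,q)),
                    n.people, n.pops)
  return(sim.pops)
}

set.seed(1)              ## arbitrary seed so the run is repeatable
demo = sim(5, 4, 0.5)    ## 5 people per population, 4 populations (illustrative values)
dim(demo)                ## rows = people, columns = populations
colSums(demo)            ## number of 1s in each simulated population
```

Each column sum is the count of affected individuals in one fake population, which is the quantity the lab compares against the observed count.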
Let’s do 1000 simulations. Define null1 as the proportion you specified in your null hypothesis.
n.sim = 1000
null1 = ##
As we did last week, we just need to simulate a bunch of populations and calculate how many diabetics have
obesity in each one. The data will be generated according to the frequency of obesity given by the null hypothesis.
We’ll call the result monte1.
monte1 = colSums(sim(diabetics, n.sim, null1))
1c. What probability (in words) will the p-value of the data give us?
1d. Will this be a one- or two-tailed test?
1e. What does it mean to reject the null hypothesis?
1f. What is the minimal number of simulations that need to satisfy the alternative hypothesis, in order for
us to reject the null hypothesis?
The following line of code will compute the p-value pval1 for the current data.
pval1 = length(which(monte1 >= diab.with.ob))/n.sim
The which() command finds the particular elements of its argument that satisfy the given condition. We use
length to count how many of these populations there are. We divide by n.sim to get the probability.
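To see how which() and length() combine, here is a tiny hand-made example. The vector toy.counts and the value toy.obs are made up for illustration; they are not lab data and do not answer any of the questions.

```r
toy.counts = c(3, 8, 5, 7, 2, 9, 4, 6, 7, 1)   ## pretend results of 10 simulations
toy.obs = 7                                     ## pretend observed count

idx = which(toy.counts >= toy.obs)   ## positions of simulations at least as large as toy.obs
idx                                  ## 2 4 6 9
length(idx)/length(toy.counts)       ## 0.4

## mean() of a logical vector gives the same fraction in one step
mean(toy.counts >= toy.obs)          ## 0.4
```

Either form works; the which()/length() version keeps the matching positions available if you want to inspect them.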
1g. Record pval1.
1h. Should you accept or reject the null hypothesis based on this value? Explain.
1i. Based on this information, is obesity more common in type 2 diabetics than in the general population?
Explain.
Part II: Is diabetes more common in the obese population than in the general population?
An estimated 11% of the U.S. adult (> 20 years) population has type 2 diabetes (American Diabetes Association).
Suppose the same group of investigators tests 100 obese individuals for diabetes. They find that exactly 8 of
these individuals are diabetic.
2a. State the null hypothesis regarding the proportion of the obese population that has type 2 diabetes.
2b. State the alternative hypothesis for this situation, given that you have reason to believe that the true
proportion of obese individuals with diabetes lies on the same side of the null hypothesis as the sample
population mentioned above.
2c. State a second alternative hypothesis, given you have no information about where the true proportion
lies.
We will go through the same process to test these hypotheses as in Part I. Define q.diab as the prevalence of
diabetes in the general adult population, obese as the number of people the investigators tested for diabetes and
ob.with.diab as the number found to be diabetic.
q.diab = ##
obese = ##
ob.with.diab = ##
To implement the Monte Carlo method, define the appropriate null hypothesis proportion null2, and execute the
subsequent lines. We will again run 1000 simulations, so there is no need to redefine n.sim for this part.
null2 = q.diab ## null: frequency of diabetes in the obese is the same as in the general population
monte2 = colSums(sim(obese, n.sim, null2))
2d. What probability (in words) will the p-value of a one-tailed test give us?
2e. What probability (in words) will the p-value of a two-tailed test give us?
For the one-tailed test, we need to know the number of simulations that resulted in a count of diabetics no
greater than the observed count, ob.with.diab. The fraction of all simulations satisfying this is our one-tailed p-value.
x.lo = which(monte2 <= ob.with.diab)
pval2.onetail = length(x.lo)/n.sim
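A Monte Carlo p-value of this kind can be sanity-checked against the exact binomial probability. The sketch below does this for made-up values (toy.n, toy.q, toy.obs and toy.sims are illustrative assumptions, not the lab's numbers). It also swaps in rbinom() as a shortcut for colSums(sim(...)): both produce counts of 1s among independent draws, but rbinom() is not the lab's sim function.

```r
set.seed(2)                                ## arbitrary seed so the run is repeatable
toy.n = 50; toy.q = 0.3; toy.obs = 10      ## illustrative values only
toy.sims = 10000                           ## more simulations give a steadier estimate

## rbinom draws the count of 1s among toy.n people directly
toy.monte = rbinom(toy.sims, toy.n, toy.q)

p.mc = mean(toy.monte <= toy.obs)          ## Monte Carlo lower-tail p-value
p.exact = pbinom(toy.obs, toy.n, toy.q)    ## exact P(X <= toy.obs) under the null
c(p.mc, p.exact)                           ## the two values should be close
```

The simulated value fluctuates from run to run, so expect agreement only to within sampling error.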
2f. Record pval2.onetail.
2g. Should you accept or reject the null hypothesis based on this value? Explain.
2h. Based on this information, is diabetes more common in the obese population than in the general population? Explain.
For the two-tailed test, we need both x.lo and the simulations that resulted in a total at
least as far away from the expectation, but on the other side of it. The sum of the corresponding fractions
gives us the two-tailed p-value.
## define expectation
null.expect = round(null2*obese)
## add the difference to the mean; if subtracted, would get x.lo instead
x.hi = which(monte2 >= null.expect + abs(null.expect - ob.with.diab))
## compute two-tailed p-value
pval2.twotail = length(x.lo)/n.sim + length(x.hi)/n.sim
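The two-tailed construction can be illustrated with made-up numbers. In this sketch toy.n, toy.q, toy.obs and toy.sims are illustrative assumptions rather than lab data, and rbinom() stands in for colSums(sim(...)) as a shortcut. The exact binomial sum is included only as a cross-check.

```r
set.seed(3)                                ## arbitrary seed so the run is repeatable
toy.n = 50; toy.q = 0.3; toy.obs = 10      ## illustrative values only
toy.sims = 10000
toy.monte = rbinom(toy.sims, toy.n, toy.q)

toy.expect = round(toy.q*toy.n)            ## expected count under the null: 15
toy.dist = abs(toy.expect - toy.obs)       ## observed distance from the expectation: 5

p.lo = mean(toy.monte <= toy.expect - toy.dist)   ## simulations with at most 10
p.hi = mean(toy.monte >= toy.expect + toy.dist)   ## simulations with at least 20
p.two = p.lo + p.hi                               ## two-tailed p-value

## exact counterpart: P(X <= 10) + P(X >= 20) under the null
p.two.exact = pbinom(toy.expect - toy.dist, toy.n, toy.q) +
  pbinom(toy.expect + toy.dist - 1, toy.n, toy.q, lower.tail = FALSE)
c(p.two, p.two.exact)                      ## the two values should be close
```

Because the binomial distribution is not symmetric unless the null proportion is 0.5, the two tails generally contribute different amounts to the sum.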
2i. Record pval2.twotail.
2j. Should you accept or reject the null hypothesis based on this value? Explain.
2k. Based on this information, does the prevalence of diabetes differ significantly between the obese and
general populations? Explain.
⋆ Save your script so that you can use it for your assignment.