
1180:Lab11
James Moore
April 9th, 2013
1 Simulating Disease Incidence
For many diseases, we use incidence to refer to the number of new diagnoses in a population (or
subpopulation). In contrast the prevalence refers to the total number of living people with the
disease in the (sub)population.
Suppose that there is a disease that you have a 1% chance of contracting every year, independent
of how old you actually are. Once you have it, you always have it, so you can only get it once.
Let A be a random variable that represents the year in which you get this disease. A is a geometrically
distributed random variable with a yearly probability of .01 of contracting the disease (and .99 of
staying disease free). First we plot the pdf.
Age<-seq(100)
# type='h' draws one vertical bar per year
plot(Age,dgeom(Age,prob=.01),type='h',ylim=c(0,.01),
xlab='age',main='Age Incidence of Some Disease')
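As a quick check on what dgeom returns, here is a minimal sketch. R's dgeom(k, prob) is the probability of k failures before the first success, that is (1-prob)^k * prob. The names p, k, by_hand, from_r and total are ours and are not part of the lab. Under this convention, dgeom(Age, prob=.01) reads Age as the number of disease-free years before the year of onset; that reading is our assumption, since the lab does not say whether age counts completed years.

```r
# Compare dgeom against the formula (1 - p)^k * p.
p <- 0.01
k <- 0:5                          # number of disease-free years before onset
by_hand <- (1 - p)^k * p
from_r  <- dgeom(k, prob = p)
print(cbind(k, by_hand, from_r))

# Over a long enough horizon the probabilities add up to (almost exactly) one.
total <- sum(dgeom(0:5000, prob = p))
print(total)
```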
At what age are you most likely to get the disease? How long can you expect to live
disease free on average? What is the median age of incidence? You will need formulas from
section 7.6.
People don’t only get diseases on their birthdays. If we allow time to be continuous instead of
discrete, then A is an exponentially distributed random variable.
lines(Age,dexp(Age,rate=.01),col='red')
Save this plot(1). How do the geometric and exponential distributions differ? Why
does the incidence get lower with age?
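One way to relate the two curves numerically is sketched below. It uses a standard fact: if the waiting time is exponential with a given rate, the probability that it falls in the year [k, k+1) is geometric with yearly probability 1 - exp(-rate). The names rate, p_interval, p_year, p_geom and max_gap are ours, chosen for illustration.

```r
rate <- 0.01
k <- 0:99

# Probability that an exponential waiting time lands in the year [k, k+1).
p_interval <- pexp(k + 1, rate = rate) - pexp(k, rate = rate)

# Geometric distribution whose yearly probability is 1 - exp(-rate).
p_year <- 1 - exp(-rate)
p_geom <- dgeom(k, prob = p_year)

# The two agree up to rounding error.
max_gap <- max(abs(p_interval - p_geom))
print(p_year)
print(max_gap)
```

Note that p_year is slightly smaller than the rate itself, so a continuous rate of .01 and a yearly probability of .01 are close but not identical.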
2 Multistage Cancer Models
Most diseases have slightly more interesting age incidence curves. The incidence of many cancers
actually rises as we age. This led to the hypothesis that cancer involves a sequence of mutations that
have to happen in a certain sequence. To see how this works let’s return to the binomial distribution.
Let M_Y be a binomially distributed random variable that describes how many mutations a person
has after they have been alive for Y years. We’ll suppose the process has a ‘success’ rate of .02 per year.
Here a success refers to a mutation in an oncogene during the prior year, hardly a success at all.
The probability distribution of M_10 should mostly be at zero with a few ones, whereas M_100 should
average two (100 × .02). Because we know the distribution, we’ll use the function plot to make the
picture instead of hist.
mutations=seq(0,3)
par(mfrow=c(1,4))
plot(mutations,dbinom(mutations,size=10,prob=.02),
type='h',xlab='Number of Mutations',ylab='Probability',ylim=c(0,1))
plot(mutations,dbinom(mutations,size=30,prob=.02),
type='h',xlab='Number of Mutations',ylab=' ',ylim=c(0,1))
plot(mutations,dbinom(mutations,size=50,prob=.02),
type='h',xlab='Number of Mutations',ylab=' ',ylim=c(0,1))
plot(mutations,dbinom(mutations,size=90,prob=.02),
type='h',xlab='Number of Mutations',ylab=' ',ylim=c(0,1))
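The four panels can also be produced with a loop, which makes it easier to try other ages. This is only a sketch of an alternative; the names ages and probs and the panel titles are ours and do not appear in the lab.

```r
ages <- c(10, 30, 50, 90)
mutations <- 0:3

# One column per age, one row per mutation count.
probs <- sapply(ages, function(y) dbinom(mutations, size = y, prob = 0.02))
dimnames(probs) <- list(mutations = mutations, age = ages)
print(round(probs, 4))

# Draw one panel per age, labelling the y axis only on the first panel.
par(mfrow = c(1, length(ages)))
for (j in seq_along(ages)) {
  plot(mutations, probs[, j], type = "h", ylim = c(0, 1),
       xlab = "Number of Mutations",
       ylab = if (j == 1) "Probability" else "",
       main = paste("Age", ages[j]))
}
```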
Save this plot(2). These plots make it slightly difficult to see cancer incidence. Suppose cancer
develops once a person has accumulated three mutations. If we want to know how many people have
cancer at age 10, we need to know how many people have at least 3 mutations. To do this we use the
cdf of the binomial to find the fraction of people who have two or fewer and then subtract that from one.
1-pbinom(2,size=10,prob=.02)
We find that less than .1% of people have cancer by this time. Use this procedure to find out
how many people have cancer at age 30, 50 and 90. What is the mean number of
mutations as a function of age? What is the variance? Compute these for each age
10,30,50,90.
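The "one minus the cdf" step can be written in a few equivalent ways. The sketch below checks them against each other for the ten year old only; the names n, p, tail_a, tail_b and tail_c are ours.

```r
n <- 10      # years alive
p <- 0.02    # yearly mutation probability

# Three ways to get P(at least 3 mutations).
tail_a <- 1 - pbinom(2, size = n, prob = p)
tail_b <- pbinom(2, size = n, prob = p, lower.tail = FALSE)
tail_c <- sum(dbinom(3:n, size = n, prob = p))

print(c(tail_a, tail_b, tail_c))
```

The lower.tail=FALSE form avoids subtracting two numbers that are both close to one, which can matter when the tail probability is very small.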
If we want to find prevalences at non-integer ages we have to use the Poisson distribution. This
distribution has a single parameter, which is the rate constant multiplied by the
length of time. For example, for a ten year old
1-ppois(2,lambda=.02*10)
Note that we did not get exactly the same answer as the binomial distribution. The Poisson and the
binomial distribution fundamentally differ. In the binomial distribution each year either contains
a mutation or does not. In the Poisson distribution, each year could potentially contain multiple
mutations. Calculate the probabilities that a 30, 50 and 90 year old has cancer. What is
the mean number of mutations as a function of age? What is the variance? Compute
these for each age 10,30,50,90.
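To see how the two models are related, we can cut each year into smaller pieces and give each piece a proportionally smaller chance of a mutation. As the pieces get smaller, the binomial answer should approach the Poisson one. This is a sketch for the ten year old only; the names age, rate, pieces, binom_tail and pois_tail are ours.

```r
age  <- 10
rate <- 0.02

# Number of pieces each year is cut into: years, half-years, months, days.
pieces <- c(1, 2, 12, 365)

# Binomial with age*k trials, each with mutation probability rate/k.
binom_tail <- sapply(pieces, function(k)
  1 - pbinom(2, size = age * k, prob = rate / k))

# Poisson answer with the same expected number of mutations.
pois_tail <- 1 - ppois(2, lambda = rate * age)

print(data.frame(pieces, binom_tail, pois_tail))
```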
To get a better idea of how the prevalence changes over time, we can plot the probability of
having cancer versus age.
plot(Age,1-pbinom(2,size=Age,prob=.02),xlab='Age',ylab='Prevalence')
Add the prevalence predicted by the Poisson distribution to this plot as a red line.
Save the resulting plot(3). The two prevalences should be close to each other.
What we have plotted is the cumulative distribution function for the time of disease incidence.
We can differentiate to find the incidence. Calculate the derivative of the prevalence curve
that came from the Poisson distribution. This should be the pdf of the random variable I
that describes incidence. The pdf you have derived actually comes from a gamma distribution. We
can verify this by first plotting the cdf of the gamma distribution.
lines(Age,pgamma(Age,shape=3,rate=.02))
It should lie directly on top of the old red line. Save this plot(4).
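The same comparison can be made numerically instead of by eye. The sketch below checks that the Poisson prevalence matches the gamma cdf, and that a finite-difference slope of the prevalence matches the gamma pdf. It does not replace the derivative you are asked to calculate by hand. The names rate, ages, prevalence, h, slope, cdf_gap and pdf_gap are ours.

```r
rate <- 0.02
ages <- seq(1, 100)

# Prevalence from the Poisson model: P(at least 3 mutations by age a).
prevalence <- function(a) 1 - ppois(2, lambda = rate * a)

# Largest difference between the prevalence and the gamma cdf.
cdf_gap <- max(abs(prevalence(ages) - pgamma(ages, shape = 3, rate = rate)))

# Central finite difference as a numerical stand-in for the derivative.
h <- 1e-4
slope <- (prevalence(ages + h) - prevalence(ages - h)) / (2 * h)
pdf_gap <- max(abs(slope - dgamma(ages, shape = 3, rate = rate)))

print(c(cdf_gap = cdf_gap, pdf_gap = pdf_gap))
```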
Now we can efficiently plot the expected incidence curve.
plot(Age,dgamma(Age,shape=3,rate=.02),xlab='Age',ylab='Incidence')
Save this plot(5). Compare this plot to the incidence curve for our original disease
(plot 1). What differences are there? Explain why the incidence initially increases.
Explain why it flattens out. (Hint: What must happen to the incidence rate at very
high ages?)
3 Colorectal Cancer Rates
Now we’re going to try to fit this incidence curve to actual cancer rates. Humans are complicated
and so is cancer. Therefore we’re going to cheat a little and pretend that life begins at 25, which
is as far back as I can remember anyway. Here are the incidence rates of colorectal cancer (per
100,000):
Age     Male    Female
25-29   2.5     2.3
30-34   4       3.9
35-39   5.8     5.1
40-44   11.1    10.6
45-49   23.1    19.6
50-54   46      35.8
55-59   84.3    57.1
60-64   166.7   98.6
65-69   259.1   148.6
70-74   325.3   189.5
75-79   419.1   257
80-84   490.3   309.5
The interesting thing here is that age is clumped into 5 year intervals. This means that we’ll
have to view time as occurring in discrete 5-year chunks. We’ll define a new age vector that sits in
the middle of each interval. We’ll also define a vector called AgeCat, that just counts the number
of 5 year intervals.
Age2=seq(27.5,82.5,5)
AgeCat=seq(12)
Store the incidence rates into two vectors, one for each gender. Then plot the male
incidence rate as a function of age. Now we want to try to fit a model curve to this. We
need something like the gamma distribution, but for discrete time. This is the negative binomial
distribution. The negative binomial describes how many trials it takes to get a particular number
of ‘successes’. In other words, it describes how many five year intervals it takes to get a certain
number of mutations. We’ll add the following prediction curves.
points(Age2,dnbinom(AgeCat,prob=.1,size=5),col='red')
points(Age2,dnbinom(AgeCat,prob=.125,size=6),col='green')
points(Age2,dnbinom(AgeCat,prob=.155,size=7),col='blue')
The prob refers to the probability of getting a mutation during any five year stretch. The size
represents the number of mutations necessary. Save this plot(6). According to this model,
how many mutations are necessary and what is the probability of getting a mutation
during a five year span? Make a similar plot for female incidence rates. Change the
prob and size to find the best fit you can. Save a plot(7) of the data and your best fit
to it. What is different between men and women according to your fit?
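Here is a minimal sketch of the setup for the male data, under two assumptions of ours. First, the tabulated rates are divided by 100000 so that they are on the same per-person scale as the probabilities returned by dnbinom; the lab does not state this step. Second, the vector name Male and the axis label are our choices. The last lines check R's convention for dnbinom(x, size, prob): it is the probability of x failures before the size-th success, choose(x+size-1, x) * prob^size * (1-prob)^x.

```r
Age2   <- seq(27.5, 82.5, 5)   # midpoint of each 5-year age group
AgeCat <- seq(12)              # index of each 5-year interval

# Male rates from the table, converted from "per 100000" to per person
# (the conversion is our assumption, not stated in the lab).
Male <- c(2.5, 4, 5.8, 11.1, 23.1, 46, 84.3, 166.7,
          259.1, 325.3, 419.1, 490.3) / 1e5

plot(Age2, Male, xlab = "Age", ylab = "Incidence (per person)")
points(Age2, dnbinom(AgeCat, prob = 0.1, size = 5), col = "red")

# Check dnbinom against its formula for the red curve.
by_hand <- choose(AgeCat + 5 - 1, AgeCat) * 0.1^5 * 0.9^AgeCat
from_r  <- dnbinom(AgeCat, prob = 0.1, size = 5)
print(cbind(Age2, Male, from_r))
```

The female rates can be stored the same way in a second vector and plotted with different prob and size values.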