1180:Lab11
James Moore
April 9th, 2013

1 Simulating Disease Incidence

For many diseases, we use incidence to refer to the number of new diagnoses in a population (or subpopulation). In contrast, the prevalence refers to the total number of living people with the disease in the (sub)population.

Suppose that there is a disease that you have a 1% chance of contracting every year, independent of how old you actually are. Once you have it, you always have it, so you can only get it once. Let A be a random variable that represents the year in which you get this disease. A is a geometrically distributed random variable with probability .01 of contracting the disease in any given year (so p = .99 is the probability of staying disease free in a given year). First we plot the pdf.

    Age<-seq(100)
    plot(Age,dgeom(Age,prob=.01),type='h',ylim=c(0,.01),
         xlab='age',main='Age Incidence of Some Disease')

At what age are you most likely to get the disease? How long can you expect to live disease free on average? What is the median age of incidence? You will need formulas from section 7.6.

People don’t only get diseases on their birthdays. If we allow time to be continuous instead of discrete, then A is an exponentially distributed random variable.

    lines(Age,dexp(Age,rate=.01),col='red')

Save this plot (1). How do the geometric and exponential distributions differ? Why does the incidence get lower with age?

2 Multistage Cancer Models

Most diseases have slightly more interesting age incidence curves. The incidence of many cancers actually rises as we age. This led to the hypothesis that cancer involves a series of mutations that have to happen in a certain sequence. To see how this works let’s return to the binomial distribution.

Let M_Y be a binomially distributed random variable that describes how many mutations a person has after they have been alive for Y years. We’ll suppose the process has a ‘success’ rate of .02. Here a success refers to a mutation in an oncogene during the prior year, hardly a success at all.
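One way to get a feel for M_Y before plotting its exact distribution is to simulate it. The sketch below is only an illustration: the cohort size of 10000 and the seed are arbitrary assumptions, not part of the lab, and the variable names are made up for this example.

```r
# Simulate the number of mutations M_Y for a hypothetical cohort.
set.seed(1)          # arbitrary seed, only so the run is repeatable
n_people <- 10000    # assumed cohort size (not from the lab)
rate <- .02          # yearly mutation probability used in the lab

M10  <- rbinom(n_people, size = 10,  prob = rate)  # mutations after 10 years
M100 <- rbinom(n_people, size = 100, prob = rate)  # mutations after 100 years

# Fraction of the simulated cohort with 0, 1, 2, ... mutations
table(M10) / n_people
table(M100) / n_people

# Sample means, to compare with size * prob
c(mean(M10), mean(M100))
```

The simulated fractions should sit close to the values returned by dbinom, which is why the lab plots the known distribution directly instead of a histogram of draws.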
The probability distribution of M_10 should mostly be at zero with a few ones, whereas M_100 should average two. Because we know the distribution, we’ll use the function plot to make the picture instead of hist.

    mutations=seq(0,3)
    par(mfrow=c(1,4))
    plot(mutations,dbinom(mutations,size=10,prob=.02),
         type='h',xlab='Number of Mutations',ylab='Probability',ylim=c(0,1))
    plot(mutations,dbinom(mutations,size=30,prob=.02),
         type='h',xlab='Number of Mutations',ylab=' ',ylim=c(0,1))
    plot(mutations,dbinom(mutations,size=50,prob=.02),
         type='h',xlab='Number of Mutations',ylab=' ',ylim=c(0,1))
    plot(mutations,dbinom(mutations,size=90,prob=.02),
         type='h',xlab='Number of Mutations',ylab=' ',ylim=c(0,1))

Save this plot (2). These plots make it slightly difficult to see cancer incidence. If we want to know how many people have cancer at age 10, we need to know how many people have at least 3 mutations. To do this we have to use the cdf of the binomial to find out how many people have two or fewer and then subtract that from one.

    1-pbinom(2,size=10,prob=.02)

We find that less than .1% of people have cancer by this time. Use this procedure to find out how many people have cancer at age 30, 50 and 90. What is the mean number of mutations as a function of age? What is the variance? Compute these for each age 10, 30, 50, 90.

If we want to find prevalences at non-integer ages we have to use the Poisson distribution. This distribution has only a single parameter, which is the rate constant multiplied by the length of time. For example, for a ten year old

    1-ppois(2,lambda=.02*10)

Note that we did not get exactly the same answer as the binomial distribution. The Poisson and the binomial distribution fundamentally differ. In the binomial distribution each year either contains a mutation or does not. In the Poisson distribution, each year could potentially contain multiple mutations. Calculate the probabilities that a 30, 50 and 90 year old has cancer.
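The ten-year-old comparison can be put side by side as in the sketch below. It reuses only the two calls already shown; the names p_binom and p_pois and the age 10.5 are illustrative choices, not part of the lab.

```r
rate <- .02   # yearly mutation rate from the lab

# Probability of at least 3 mutations by age 10 under each model
p_binom <- 1 - pbinom(2, size = 10, prob = rate)
p_pois  <- 1 - ppois(2, lambda = rate * 10)
c(binomial = p_binom, poisson = p_pois)

# The Poisson version also accepts a non-integer age
# (10.5 is just an example value)
1 - ppois(2, lambda = rate * 10.5)
```

The Poisson value comes out slightly larger, which fits the explanation in the text: it allows more than one mutation within a single year.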
What is the mean number of mutations as a function of age? What is the variance? Compute these for each age 10, 30, 50, 90.

To get a better idea of how the prevalence changes over time, we can plot the probability of having cancer versus age.

    plot(Age,1-pbinom(2,size=Age,prob=.02),xlab='Age',ylab='Prevalence')

Add the prevalence predicted by the Poisson distribution to this plot as a red line. Save the resulting plot (3). The two prevalences should be close to each other.

What we have plotted is the cumulative distribution function for the time of disease incidence. We can differentiate to find the incidence. Calculate the derivative of the prevalence curve that came from the Poisson distribution. This should be the pdf of the random variable I that describes incidence. The pdf you have derived actually comes from a gamma distribution. We can verify this by first plotting the cdf of the gamma distribution.

    lines(Age,pgamma(Age,shape=3,rate=.02))

It should lie directly on top of the old red line. Save this plot (4). Now we can efficiently plot the expected incidence curve.

    plot(Age,dgamma(Age,shape=3,rate=.02),xlab='Age',ylab='Incidence')

Save this plot (5). Compare this plot to the incidence curve for our original disease (plot 1). What differences are there? Explain why the incidence initially increases. Explain why it flattens out. (Hint: What must happen to the incidence rate at very high ages?)

3 Colorectal Cancer Rates

Now we’re going to try to fit this incidence curve to actual cancer rates. Humans are complicated and so is cancer. Therefore we’re going to cheat a little and pretend that life begins at 25, which is as far back as I can remember anyway.
Here are the incidence rates of colorectal cancer (per 100000).

Age      Male    Female
25-29     2.5     2.3
30-34     4       3.9
35-39     5.8     5.1
40-44    11.1    10.6
45-49    23.1    19.6
50-54    46      35.8
55-59    84.3    57.1
60-64   166.7    98.6
65-69   259.1   148.6
70-74   325.3   189.5
75-79   419.1   257
80-84   490.3   309.5

The interesting thing here is that age is clumped into 5 year intervals. This means that we’ll have to view time as occurring in discrete 5-year chunks. We’ll define a new age vector that sits in the middle of each interval. We’ll also define a vector called AgeCat, that just counts the number of 5 year intervals.

    Age2=seq(27.5,82.5,5)
    AgeCat=seq(12)

Store the incidence rates into two vectors, one for each gender. Then plot the male incidence rate as a function of age.

Now we want to try to fit a model curve to this. We need something like the gamma distribution, but for discrete time. This is the negative binomial distribution. The negative binomial describes how many trials it takes to get a particular number of ‘successes’. In other words, it describes how many five year intervals it takes to get a certain number of mutations. We’ll add the following prediction curves.

    points(Age2,dnbinom(AgeCat,prob=.1,size=5),col='red')
    points(Age2,dnbinom(AgeCat,prob=.125,size=6),col='green')
    points(Age2,dnbinom(AgeCat,prob=.155,size=7),col='blue')

The prob refers to the probability of getting a mutation during any five year stretch. The size represents the number of mutations necessary. Save this plot (6). According to this model, how many mutations are necessary and what is the probability of getting a mutation during a five year span?

Make a similar plot for female incidence rates. Change the prob and size to find the best fit you can. Save a plot (7) of the data and your best fit to it. What is different between men and women according to your fit?
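The data-entry step for this section might look like the sketch below. The vector names Male and Female are illustrative, and the values are transcribed from the table, so check them against it. Dividing by 100000 is an assumption made here so that the data share a probability scale with dnbinom; the lab does not say how the rates should be scaled.

```r
# Midpoints of the 5-year age intervals and the interval counter, as in the lab
Age2   <- seq(27.5, 82.5, 5)
AgeCat <- seq(12)

# Incidence per 100000, transcribed from the table (verify against the source)
Male   <- c(2.5, 4, 5.8, 11.1, 23.1, 46, 84.3, 166.7, 259.1, 325.3, 419.1, 490.3)
Female <- c(2.3, 3.9, 5.1, 10.6, 19.6, 35.8, 57.1, 98.6, 148.6, 189.5, 257, 309.5)

# Assumption: convert "per 100000" to a proportion so the data and the
# dnbinom() probabilities can be drawn on the same axes
plot(Age2, Male / 100000, xlab = 'Age', ylab = 'Incidence')

# First prediction curve from the lab, overlaid on the data
points(Age2, dnbinom(AgeCat, prob = .1, size = 5), col = 'red')
```

The same two lines with Female in place of Male give the starting point for plot (7).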