Calculus for Biologists Lab
Math 1180-002, Spring 2012

Lab #12 - Confidence Limit Fun

Report due date: Tuesday, April 17, 2012 at 9 a.m.

Goal: To determine the confidence limits of the frequency of DNA mutations in a gene segment. You will compare the results of the exact, Monte Carlo and normal approximation methods.

Create a new script, either in R (laptop) or with a text editor (Linux computers).

DNA damage revisited

We will look in again on the lab conducting experiments on several organisms within an animal species to determine the frequency of mutational alterations in a 250-base-pair gene segment of DNA. Enter the number of base pairs.

    base.pairs = ##

We will define the function B.sim, which simulates the mutated base pairs in a given set of organisms. The function generates a series of 0s and 1s and saves them to a table (a matrix with one row per base pair and one column per organism), based on the number of base pairs we care about, the number of organisms tested, and the probability of finding a mutation in any one base pair. A 1 corresponds to the presence of some mutation, and a 0 to its absence.

    B.sim = function(bp,animal,p){
      ## bp*animal independent draws: 0 with probability 1-p, 1 with probability p
      b.tmp = matrix(sample(c(0,1), bp*animal, replace=TRUE, prob=c(1-p,p)), bp, animal)
      return(b.tmp)
    }

Let’s say that the results from a single gene segment are reported. Our goal is to find the range of mutation frequencies that could plausibly have produced these data. Hence, we need to find the lower and upper confidence limits.

First, we’ll simulate this result. Note that a single gene segment corresponds to one animal, and so the second argument to B.sim will be 1. Instead of pre-defining the mutation frequency p, we will choose a random number between 1/5 and 1/3.

    organism = B.sim(base.pairs,1,runif(1,min=1/5,max=1/3))

We now need to count the total number of damaged base pairs for organism, which we’ll call n.mut. Insert the appropriate code to do this.
    n.mut = ##

Exact confidence limits

In the quest to determine the confidence limits for obtaining n.mut, we need to set the confidence level we want. For now, we will find the limits of the 95% confidence interval. Define α so that 1 − α = 0.95.

    alpha = ##

This means that for any true proportion of damaged base pairs lying below the lower limit of the confidence interval, data like ours (n.mut or more mutations) are generated less than a fraction α/2 of the time. The same holds, with n.mut or fewer mutations, for a true value above the upper limit of the interval. In order to compute these probabilities, we need to know how likely it is that the number of damaged base pairs will be at least (or at most) n.mut given any true proportion p.

As was the case last week, the lab situation may be described by a binomial distribution, where the number of mutated base pairs is k, the total number of base pairs is n, and the probability that any given base pair has a mutation is p. We need to calculate Pr(k ≥ n.mut) and Pr(k ≤ n.mut). Note that if p is too small, the probability of seeing n.mut or more mutations should be really small (as in less than 0.025). Similarly, if p is too large, there should be very few instances in which we see n.mut or fewer mutations.

To calculate the required probabilities, we need to evaluate the binomial distribution for a range of k and p values and then sum over k:

\[
\Pr(k \ge \text{n.mut}) = \sum_{k=\text{n.mut}}^{\text{base.pairs}} b(k;\, \text{base.pairs},\, p)
\quad\text{and}\quad
\Pr(k \le \text{n.mut}) = \sum_{k=0}^{\text{n.mut}} b(k;\, \text{base.pairs},\, p)
\]

To implement this in R, we will evaluate b(k; n, p) for all possible values of k and for 1000 different values of p.

    p = seq(0,1,length=1000) ## all p
    K = matrix(rep(0:base.pairs,length(p)),base.pairs+1,length(p)) ## 1000 repeats of all k
    P = matrix(rep(p,each=base.pairs+1),base.pairs+1,length(p)) ## 251 repeats of all p

Note that K and P are now large matrices with identical dimensions (same number of rows and columns).
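The layout of K and P is easier to see at a small scale. The following is a minimal sketch, not part of the lab script: the names n.toy, p.toy, K.toy, P.toy, b.toy and b.ref are hypothetical, and the sizes are chosen only so the matrices fit on screen. It builds the two matrices the same way, applies the binomial formula b(k; n, p) element by element, and compares the result with R's built-in dbinom.

```r
## Toy sizes (hypothetical, chosen only so the matrices are easy to read)
n.toy = 3                      # stand-in for base.pairs
p.toy = c(0.1, 0.5, 0.9)       # stand-in for the 1000-point grid p

## Same construction as K and P, on the toy sizes
K.toy = matrix(rep(0:n.toy, length(p.toy)), n.toy+1, length(p.toy))
P.toy = matrix(rep(p.toy, each=n.toy+1), n.toy+1, length(p.toy))

print(K.toy)   # every column lists k = 0, 1, 2, 3
print(P.toy)   # every row lists p = 0.1, 0.5, 0.9
print(dim(K.toy)); print(dim(P.toy))

## Element-wise binomial formula; entry [i, j] is b(k = i-1; n.toy, p.toy[j])
b.toy = choose(n.toy, K.toy) * P.toy^K.toy * (1-P.toy)^(n.toy-K.toy)

## Cross-check against R's built-in binomial probabilities
b.ref = sapply(p.toy, function(q) dbinom(0:n.toy, n.toy, q))
print(all.equal(b.toy, b.ref))
print(colSums(b.toy))   # each column is a full distribution over k
```

Each column of b.toy is the whole distribution of k for one value of p, which is why column sums over a subset of rows give the tail probabilities needed here.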
Because K and P have the same dimensions, we can substitute them element by element into the binomial distribution formula from last week:

    binom = choose(base.pairs,K)*(P^K)*(1-P)^(base.pairs-K)

To calculate the sums we need, we’ll use colSums over the appropriate subset of rows in binom. R indexes rows starting at 1, not 0, so row i of binom holds k = i-1. The row for k = n.mut is therefore row 1+n.mut, and the row for k = base.pairs is row base.pairs+1.

    p.low.exact = colSums(binom[(1+n.mut):(base.pairs+1), ]) ## n.mut to base.pairs sum
    p.hi.exact = colSums(binom[1:(n.mut+1), ]) ## 0 to n.mut sum

Now we can plot the results. Finally. The following lines of code will plot p versus p.low.exact on the left-hand side of the figure.

    par(mfrow=c(1,2),oma=c(0,0,2,0))
    plot(p,p.low.exact,type="l", xlab="p", ylab="Probability", main="Determine lower limit")
    abline(h=alpha/2,col="red")

Looking at the graph you just produced, estimate the p value at which the red cutoff line meets the curve. Save this as low.guess, and then execute the abline command below.

    low.guess = ##
    abline(v=low.guess, col="dodgerblue")

Make your guess as accurate as possible. You may need to try several values. Do this by changing low.guess and re-executing all lines (starting from the par command).

Now repeat the process for the right-hand graph. You can try new guesses for hi.guess in the same way. When you are satisfied with your choices, execute all lines from the par command. When you are done, you should have two side-by-side plots, each displaying one black curve (a different curve in each plot), one blue vertical line and one red horizontal line.

    plot(p,p.hi.exact,type="l", xlab="p", ylab="Probability", main = "Determine upper limit")
    abline(h=alpha/2,col="red")
    mtext(side=3,"Summary - Exact Method",outer=TRUE,font=2,cex=1.25)
    hi.guess = ##
    abline(v=hi.guess,col="dodgerblue")

Plot 12.1: Save this plot to include in your assignment. Record the values of low.guess and hi.guess you chose, to include in your assignment. These are (close to) the exact lower and upper confidence limits, p_l and p_h.
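The guesses read off the plot can also be checked by solving the two equations Pr(k ≥ n.mut) = α/2 and Pr(k ≤ n.mut) = α/2 numerically. The sketch below is one possible way to do this, not the method the lab asks for: it uses R's built-in pbinom for the tail probabilities and uniroot as the root finder. The names n.toy, k.obs, alpha.toy, f.low, f.hi, mid.toy, low.root and hi.root are hypothetical, and the count of 60 mutations is made up for illustration.

```r
## Hypothetical inputs for illustration only; use your own n.mut and alpha
n.toy     = 250    # number of base pairs
k.obs     = 60     # stand-in for n.mut
alpha.toy = 0.05

## Pr(k >= k.obs | p = q) minus alpha/2; its root is the lower limit
f.low = function(q) pbinom(k.obs-1, n.toy, q, lower.tail=FALSE) - alpha.toy/2
## Pr(k <= k.obs | p = q) minus alpha/2; its root is the upper limit
f.hi  = function(q) pbinom(k.obs, n.toy, q) - alpha.toy/2

## The observed fraction separates the two search intervals:
## f.low changes sign on (0, mid.toy) and f.hi changes sign on (mid.toy, 1)
mid.toy  = k.obs/n.toy
low.root = uniroot(f.low, c(0, mid.toy), tol=1e-10)$root
hi.root  = uniroot(f.hi,  c(mid.toy, 1), tol=1e-10)$root

print(c(lower=low.root, upper=hi.root))
```

With your own n.mut and alpha substituted, the two roots should land close to the low.guess and hi.guess you chose by eye.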
Note that under normal circumstances, you would use some numerical method to solve the equations Pr(k ≥ n.mut) = α/2 and Pr(k ≤ n.mut) = α/2 for p.

Monte Carlo method

In the case where you cannot solve for the lower and upper confidence limits easily, you can estimate them using the Monte Carlo method. This method generates a large number of simulations and estimates the fraction of them in which the proportion of mutated base pairs in a gene segment is at least (when finding p_l) or at most (when finding p_h) n.mut/base.pairs. The simulations are carried out for a reasonable number of p values.

The following lines of code will implement 250 simulations for the number of mutated base pairs, for each of 30 different assumed probabilities of finding a mutation. These will range from 0 to 1. Instead of using the binomial distribution to directly define the likelihood that any given frequency will lead to a higher or lower range of mutation, we estimate the outcomes through these simulations. Pay close attention to the comments.

    p.sim = seq(0,1,length=30) ## 30 p values to test
    n.sim = 250 ## number of simulations to do for each value in p.sim
    p.low.monte = rep(0,length(p.sim)) ## storage for Pr(no. damaged pairs is AT LEAST n.mut)
    p.hi.monte = rep(0,length(p.sim)) ## storage for Pr(no. damaged pairs is AT MOST n.mut)

    for (j in 1:length(p.sim)){
      ## Run simulations for current p=p.sim[j] and get sum of mutated pairs in each segment
      ## Compute equivalent fractions of total base pairs
      tmp.lo = colSums(B.sim(base.pairs,n.sim,p.sim[j]))/base.pairs
      ## Take those fractions and find out how many are AT LEAST as large as
      ## n.mut/base.pairs, and compute fraction of results satisfying this criterion
      tmp1 = length(which(tmp.lo >= n.mut/base.pairs))/n.sim
      ## Save result to p.low.monte
      p.low.monte[j] = tmp1
      ## Do the same thing for the AT MOST sum, and save results to p.hi.monte
      tmp.hi = colSums(B.sim(base.pairs,n.sim,p.sim[j]))/base.pairs
      tmp2 = length(which(tmp.hi <= n.mut/base.pairs))/n.sim
      p.hi.monte[j] = tmp2
    }

Please let me know if you have any questions at all on this for loop. Now plot the results.

    par(mfrow=c(1,2),oma=c(0,0,2,0))
    plot(p.sim,p.low.monte, col="dodgerblue", pch="x", xlab="p", ylab="Probability", main="Determine lower limit")
    abline(h=alpha/2,col="red")
    plot(p.sim,p.hi.monte, col="dodgerblue", pch="x", xlab="p", ylab="Probability", main = "Determine upper limit")
    abline(h=alpha/2,col="red")
    mtext(side=3,"Summary - Monte Carlo Method",outer=TRUE,font=2,cex=1.25)

Plot 12.2: Save this figure for your assignment.

Normal approximation method

The normal approximation is yet another way to estimate confidence limits. When applicable, it provides a much simpler, and less time-consuming, way of calculating the desired interval. Using your text as a reference, we will define p̂ = n.mut/base.pairs to be the estimated true fraction of mutated base pairs, and σ² = p̂(1 − p̂) to be the variance, according to the normal approximation of the binomial. Save these quantities in R as p.hat and est.var, respectively. Also define est.sd, the estimated standard deviation, appropriately.
    p.hat = ##
    est.var = ##
    est.sd = ##

Recall from the text that the estimated lower and upper 95% confidence limits based on this approximation are

\[
p_l = \hat{p} - 1.96\,\frac{\sigma}{\sqrt{n}}
\qquad\text{and}\qquad
p_h = \hat{p} + 1.96\,\frac{\sigma}{\sqrt{n}}
\]

where n is the total number of base pairs. Use R to define and calculate these limits, named p.l and p.h.

    p.l = ##
    p.h = ##

You will now compare the results of the various methods. Make sure you can identify which lines and symbols correspond to the different methods.

    par(mfrow=c(1,2))
    plot(p,p.low.exact,type="l",xlim=c(0,1),ylim=c(0,1), xlab="p", ylab="Probability", main="Lower limit", lwd=2)
    points(p.sim,p.low.monte, col="dodgerblue", pch="x")
    abline(h=alpha/2,col="red")
    points(p.l,alpha/2,pch=20)
    plot(p,p.hi.exact,type="l",xlim=c(0,1),ylim=c(0,1), xlab="p", ylab="Probability", main="Upper limit",lwd=2)
    points(p.sim,p.hi.monte, col="dodgerblue", pch="x")
    abline(h=alpha/2,col="red")
    points(p.h,alpha/2,pch=20)

Plot 12.3: Save this to include in your assignment.

Save your script so that you can use it for your assignment.
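As a numerical companion to the comparison plot, the exact and normal-approximation limits can be put side by side in a small table. This is only a sketch under stated assumptions: the count of 60 mutations is made up, the names n.toy, k.obs, phat.toy, sd.toy, exact.ci, normal.ci and comparison are hypothetical, and the exact limits come from R's built-in binom.test (which reports a Clopper-Pearson interval) rather than from the plots used in this lab.

```r
## Hypothetical data for illustration only; substitute your own values
n.toy = 250   # number of base pairs
k.obs = 60    # stand-in for n.mut

## Exact (Clopper-Pearson) 95% limits from R's built-in binomial test
exact.ci = as.numeric(binom.test(k.obs, n.toy, conf.level=0.95)$conf.int)

## Normal approximation: p.hat +/- 1.96 * sigma / sqrt(n)
phat.toy  = k.obs/n.toy
sd.toy    = sqrt(phat.toy*(1-phat.toy))
normal.ci = phat.toy + c(-1, 1)*1.96*sd.toy/sqrt(n.toy)

## One row per method, one column per limit
comparison = rbind(exact=exact.ci, normal=normal.ci)
colnames(comparison) = c("lower", "upper")
print(round(comparison, 4))
```

The normal interval is symmetric about the observed fraction by construction, while the exact interval generally is not, so the two rows need not differ by the same amount at each end.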