Calculus for Biologists Lab Math 1180-002 Spring 2012

Lab #12 - Confidence Limit Fun
Report due date: Tuesday, April 17, 2012 at 9 a.m.
Goal: To determine the confidence limits of the frequency of DNA mutations in a gene segment. You will compare
the results of the exact, Monte Carlo and normal approximation methods.
• Create a new script, either in R (laptop) or with a text editor (Linux computers).
DNA damage revisited
We will look in again on the lab conducting experiments on several organisms within an animal species to determine
the frequency of mutational alterations in a 250-base-pair gene segment of DNA. Enter the number of base pairs.
base.pairs = ##
We will define the function B.sim which will simulate a number of mutated base pairs in a given organism. This
function will simulate a series of 0s and 1s, saving them to a table based on the number of base pairs we care
about, the number of organisms tested, and the probability of finding any mutation in a base pair. A 1 will
correspond to the presence of some mutation, and a 0 otherwise.
B.sim = function(bp,animal,p){
b.tmp = matrix(sample(c(0,1), bp*animal, replace=TRUE, prob=c(1-p,p)), bp, animal)
return(b.tmp)
}
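As a quick sanity check on B.sim, the sketch below calls it on a deliberately tiny case. The sizes (5 base pairs, 3 organisms), the probability 0.3, the seed, and the names toy and none are illustrative assumptions, not values from the lab. B.sim is repeated verbatim so that the block runs by itself.

```r
## Same definition as in the lab text, repeated so this block is self-contained
B.sim = function(bp,animal,p){
  b.tmp = matrix(sample(c(0,1), bp*animal, replace=TRUE, prob=c(1-p,p)), bp, animal)
  return(b.tmp)
}

set.seed(1)                 ## arbitrary seed, only for reproducibility
toy = B.sim(5, 3, 0.3)      ## 5 base pairs (rows) by 3 organisms (columns)
dim(toy)                    ## first entry = base pairs, second = organisms
all(toy %in% c(0,1))        ## every entry is 0 (no mutation) or 1 (mutation)
none = B.sim(5, 3, 0)       ## with p = 0 no base pair can be mutated
sum(none)                   ## total number of 1s in the p = 0 table
```

Each column of the returned table is one organism's gene segment, which is why a single segment corresponds to a second argument of 1.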
Let’s say that the results from a single gene segment are reported. Our goal is to find the range of true mutation
frequencies that could plausibly have produced these data. Hence, we need to find the lower and upper
confidence limits.
First, we’ll simulate this result. Note that a single gene segment corresponds to one animal, and so the second
argument to B.sim will be 1. Instead of pre-defining the mutation frequency p, we will choose a random number
between 1/5 and 1/3.
organism = B.sim(base.pairs,1,runif(1,min=1/5,max=1/3))
We now need to count the total number of damaged base pairs for organism, which we’ll call n.mut. Insert the
appropriate code to do this.
n.mut = ##
Exact confidence limits
In the quest to determine the confidence limits for obtaining n.mut, we need to set the limits we want. For now,
we will find the limits of the 95% confidence interval. Define α so that 1 − α = 0.95.
alpha = ##
This means that for any true proportion of damaged base pairs lying below the lower limit of the confidence
interval, a count at least as large as ours would be generated less than a fraction α/2 of the time. Likewise, for a
true value above the upper limit of the interval, a count at most as large as ours would occur less than α/2 of the time.
In order to compute these probabilities, we need to know how likely it is that the number of damaged base pairs
will be at least (or at most) n.mut given any true proportion p. As was the case last week, the lab situation may
be described by a binomial distribution, where the number of mutated base pairs is k, the total number of base
pairs is n, and the probability that any given base pair has a mutation is p.
We need to calculate Pr(k ≥ n.mut) and Pr(k ≤ n.mut). Note that if p is too small, the probability of seeing
at least n.mut mutations should be very small (as in less than 0.025). Similarly, if there’s a high probability
of damaged base pairs, then there should not be very many instances in which we see n.mut or fewer
mutations.
To calculate the required probabilities, we need to evaluate the binomial distribution for a range of k and p values
and then sum over them:
\[ \Pr(k \ge \mathtt{n.mut}) = \sum_{k=\mathtt{n.mut}}^{\mathtt{base.pairs}} b(k;\ \mathtt{base.pairs},\ p) \]
and
\[ \Pr(k \le \mathtt{n.mut}) = \sum_{k=0}^{\mathtt{n.mut}} b(k;\ \mathtt{base.pairs},\ p) \]
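The two sums can be tried out on a small case before building the full matrices. The following is a minimal sketch: the helper b and the values n.toy, m.toy and p.toy are hypothetical and chosen only to keep the arithmetic small.

```r
## b(k; n, p): binomial probability of exactly k mutated pairs out of n
b = function(k, n, p) choose(n, k) * p^k * (1 - p)^(n - k)

n.toy = 10      ## hypothetical number of base pairs
m.toy = 3       ## hypothetical observed number of mutations
p.toy = 0.3     ## hypothetical true mutation probability

at.least = sum(b(m.toy:n.toy, n.toy, p.toy))   ## Pr(k >= m.toy)
at.most  = sum(b(0:m.toy, n.toy, p.toy))       ## Pr(k <= m.toy)

## Both sums include the term k = m.toy, so they overlap in exactly one term;
## removing one copy of it must leave a total probability of 1
total = at.least + at.most - b(m.toy, n.toy, p.toy)
```

Because both tails contain the observed count itself, the two probabilities add up to more than 1. This is expected and is not a sign of an error.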
To implement this in R, we will evaluate b(k; n, p) for all possible values of k and for 1000 different values of p.
p = seq(0,1,length=1000) ## all p
K = matrix(rep(0:base.pairs,length(p)),base.pairs+1,length(p)) ## 1000 repeats of all k
P = matrix(rep(p,each=base.pairs+1),base.pairs+1,length(p)) ## 251 repeats of all p
Note that now K and P are large matrices, both identical in dimension (same number of rows and columns).
Because of this, we can substitute them into the binomial distribution formula from last week:
binom = choose(base.pairs,K)*(P^K)*(1-P)^(base.pairs-K)
To calculate the sums we need, we’ll use colSums over the appropriate subset of rows in binom. Because R indexes
from 1 (there is no 0th row), row i of binom holds k = i-1. The count k = n.mut therefore sits in row 1+n.mut,
and k = base.pairs sits in row base.pairs+1.
p.low.exact = colSums(binom[(1+n.mut):(base.pairs+1), ]) ## n.mut to base.pairs sum
p.hi.exact = colSums(binom[1:(n.mut+1), ]) ## 0 to n.mut sum
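To see how the rows of the matrix line up with values of k, the same construction can be repeated on a toy problem and compared with R's built-in binomial distribution function pbinom. This is only a sketch: every name ending in .toy or .ref and every numeric value are assumptions made for illustration.

```r
n.toy = 4                                   ## hypothetical number of base pairs
m.toy = 1                                   ## hypothetical observed mutations
p.toy = c(0, 0.25, 0.5, 1)                  ## a few trial probabilities
K.toy = matrix(rep(0:n.toy, length(p.toy)), n.toy+1, length(p.toy))
P.toy = matrix(rep(p.toy, each=n.toy+1), n.toy+1, length(p.toy))
binom.toy = choose(n.toy,K.toy)*(P.toy^K.toy)*(1-P.toy)^(n.toy-K.toy)

## Row i holds k = i-1, so k = m.toy..n.toy lives in rows (m.toy+1)..(n.toy+1)
low.toy = colSums(binom.toy[(m.toy+1):(n.toy+1), ])   ## Pr(k >= m.toy) for each p
hi.toy  = colSums(binom.toy[1:(m.toy+1), ])           ## Pr(k <= m.toy) for each p

## Cross-check against the built-in binomial distribution function
low.ref = pbinom(m.toy-1, n.toy, p.toy, lower.tail=FALSE)
hi.ref  = pbinom(m.toy, n.toy, p.toy)
```

Each column of binom.toy is a complete probability distribution over k, so it sums to 1. The last row (k = n.toy) matters most when p is close to 1.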
Now we can plot the results. Finally. The following lines of code will plot on the left-hand side of the figure p
versus p.low.exact.
par(mfrow=c(1,2),oma=c(0,0,2,0))
plot(p,p.low.exact,type="l", xlab="p", ylab="Probability", main="Determine lower limit")
abline(h=alpha/2,col="red")
Looking at the graph you just produced, estimate the value of p at which the red cutoff line meets the curve. Save
this as low.guess, and then execute the abline command below.
low.guess = ##
abline(v=low.guess, col="dodgerblue")
Make your guess as accurate as possible. You may need to try several values. Do this by changing low.guess
and re-executing all lines (starting from the par command).
Now repeat the process for the right-hand graph. You can try new guesses for hi.guess in the same way. When
you are satisfied with your choices, execute all lines from the par command. When you are done, you should
have two side-by-side plots displaying one black curve (should be unique for each plot) and containing one blue
vertical line and one red horizontal line.
plot(p,p.hi.exact,type="l", xlab="p", ylab="Probability",
main = "Determine upper limit")
abline(h=alpha/2,col="red")
mtext(side=3,"Summary - Exact Method",outer=TRUE,font=2,cex=1.25)
hi.guess = ##
abline(v=hi.guess,col="dodgerblue")
Plot 12.1: Save this plot to include in your assignment.
Record the values of low.guess and hi.guess you chose, to include in your assignment. These are (close to)
the exact lower and upper confidence limits, p_l and p_h. Note that under normal circumstances, you would use
some method to solve the equations Pr(k ≥ n.mut) = α/2 and Pr(k ≤ n.mut) = α/2.
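One possible way to solve these two equations numerically is a root finder such as R's uniroot. The sketch below is not part of the lab procedure: the observed count x.ex = 60 is hypothetical, and the names n.ex, alpha.ex, f.low, f.hi and ref are introduced here only for illustration.

```r
n.ex = 250        ## number of base pairs, as in the lab
x.ex = 60         ## hypothetical observed count (not a lab result)
alpha.ex = 0.05   ## corresponds to 95% confidence

## Pr(k >= x.ex) - alpha/2 as a function of p; its root is the lower limit
f.low = function(p) pbinom(x.ex-1, n.ex, p, lower.tail=FALSE) - alpha.ex/2
## Pr(k <= x.ex) - alpha/2 as a function of p; its root is the upper limit
f.hi  = function(p) pbinom(x.ex, n.ex, p) - alpha.ex/2

p.low.root = uniroot(f.low, c(0,1), tol=1e-10)$root
p.hi.root  = uniroot(f.hi,  c(0,1), tol=1e-10)$root

## binom.test reports an exact (Clopper-Pearson) interval for comparison
ref = binom.test(x.ex, n.ex, conf.level=1-alpha.ex)$conf.int
```

The two roots should agree closely with ref, since the Clopper-Pearson interval is defined by the same pair of tail equations.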
Monte Carlo method
In the case where you cannot solve for the lower and upper confidence limits easily, you can estimate them using
the Monte Carlo method. This method is centered on generating a large number of simulations to estimate the
fraction of simulations that result in at least (in the case of finding p_l) or at most (in the case of finding p_h)
n.mut/base.pairs fraction of mutated base pairs in a gene segment. The simulations are carried out for a
reasonable number of p values.
The following lines of code will implement 250 simulations for the number of mutated base pairs, for each of
30 different assumed probabilities of finding a mutation. These will range from 0 to 1. Instead of using the
binomial distribution to compute directly how likely a given frequency is to produce at least (or at most) n.mut
mutations, we estimate these probabilities through simulation. Pay close attention to the comments.
p.sim = seq(0,1,length=30) ## 30 p values to test
n.sim = 250 ## number of simulations to do for each value in p.sim
p.low.monte = rep(0,length(p.sim)) ## storage for Pr(no. damaged pairs is AT LEAST n.mut)
p.hi.monte = rep(0,length(p.sim)) ## storage for Pr(no. damaged pairs is AT MOST n.mut)
for (j in 1:length(p.sim)){
  ## Run simulations for current p=p.sim[j] and get sum of mutated pairs in each segment
  ## Compute equivalent fractions of total base pairs
  tmp.lo = colSums(B.sim(base.pairs,n.sim,p.sim[j]))/base.pairs
  ## Take those fractions and find out how many are AT LEAST as large as
  ## n.mut/base.pairs, and compute fraction of results satisfying this criterion
  tmp1 = length(which(tmp.lo >= n.mut/base.pairs))/n.sim
  ## Save result to p.low.monte
  p.low.monte[j] = tmp1
  ## Do the same thing for the AT MOST sum, and save results to p.hi.monte
  tmp.hi = colSums(B.sim(base.pairs,n.sim,p.sim[j]))/base.pairs
  tmp2 = length(which(tmp.hi <= n.mut/base.pairs))/n.sim
  p.hi.monte[j] = tmp2
}
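To get a feel for how well a simulated fraction tracks the exact binomial tail, the sketch below checks a single assumed probability. Note the swap: it draws counts directly with rbinom rather than using B.sim with colSums, and it uses far more repeats than the lab's 250. The observed count, the trial probability, the seed and all variable names are hypothetical.

```r
set.seed(42)                   ## arbitrary seed for reproducibility
n.bp  = 250                    ## base pairs per segment, as in the lab
m.obs = 60                     ## hypothetical observed count
p.try = 0.25                   ## one assumed mutation probability
n.rep = 20000                  ## many more repeats than the lab's 250, to cut noise

counts = rbinom(n.rep, n.bp, p.try)     ## mutated pairs in each simulated segment
mc.at.least = mean(counts >= m.obs)     ## Monte Carlo estimate of Pr(k >= m.obs)
exact.at.least = pbinom(m.obs-1, n.bp, p.try, lower.tail=FALSE)

## Standard error of a simulated proportion: sqrt(q(1-q)/n.rep)
mc.se = sqrt(exact.at.least*(1-exact.at.least)/n.rep)
```

The standard error shrinks like one over the square root of the number of simulations, so estimates based on only 250 simulations per p value will scatter visibly around the exact curve.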
Please let me know if you have any questions at all on this for loop.
Now plot the results.
par(mfrow=c(1,2),oma=c(0,0,2,0))
plot(p.sim,p.low.monte, col="dodgerblue", pch="x",
xlab="p", ylab="Probability", main="Determine lower limit")
abline(h=alpha/2,col="red")
plot(p.sim,p.hi.monte, col="dodgerblue", pch="x",
xlab="p", ylab="Probability", main = "Determine upper limit")
abline(h=alpha/2,col="red")
mtext(side=3,"Summary - Monte Carlo Method",outer=TRUE,font=2,cex=1.25)
Plot 12.2: Save this figure for your assignment.
Normal approximation method
The normal approximation is yet another way to estimate confidence limits. When applicable, it provides a much
simpler, and less time-consuming, way of calculating the desired interval. Using your text as a reference, we
will define p̂ = n.mut/base.pairs to be the estimated true fraction of mutated base pairs, and σ² = p̂(1 − p̂) to
be the variance, according to the normal approximation of the binomial. Save these quantities in R as p.hat and
est.var, respectively. Also define est.sd, the estimated standard deviation, appropriately.
p.hat = ##
est.var = ##
est.sd = ##
Recall from the text that the estimated lower and upper 95% confidence limits based on this approximation are
\[ p_l = \hat{p} - 1.96\,\frac{\sigma}{\sqrt{n}} \]
\[ p_h = \hat{p} + 1.96\,\frac{\sigma}{\sqrt{n}} \]
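As a worked illustration of these two formulas, the sketch below plugs in a hypothetical count of 60 mutated pairs out of 250. The count and all variable names (x.ex, phat.ex, sd.ex, half.ex, lim.ex) are assumptions for illustration; your own n.mut will differ.

```r
n.ex = 250                           ## number of base pairs, as in the lab
x.ex = 60                            ## hypothetical observed count
phat.ex = x.ex/n.ex                  ## estimated fraction of mutated pairs (0.24)
sd.ex = sqrt(phat.ex*(1-phat.ex))    ## sigma, the square root of the variance
half.ex = 1.96*sd.ex/sqrt(n.ex)      ## half-width of the 95% interval
lim.ex = c(phat.ex - half.ex, phat.ex + half.ex)   ## lower and upper limits
```

For these numbers the limits come out to roughly 0.187 and 0.293, an interval that is symmetric about 0.24 by construction.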
Use R to define and calculate these limits, named p.l and p.h.
p.l = ##
p.h = ##
You will now compare the results of the various methods. Make sure you can identify which lines and symbols
correspond to the different methods.
par(mfrow=c(1,2))
plot(p,p.low.exact,type="l",xlim=c(0,1),ylim=c(0,1),
xlab="p", ylab="Probability", main="Lower limit", lwd=2)
points(p.sim,p.low.monte, col="dodgerblue", pch="x")
abline(h=alpha/2,col="red")
points(p.l,alpha/2,pch=20)
plot(p,p.hi.exact,type="l",xlim=c(0,1),ylim=c(0,1),
xlab="p", ylab="Probability", main="Upper limit",lwd=2)
points(p.sim,p.hi.monte, col="dodgerblue", pch="x")
abline(h=alpha/2,col="red")
points(p.h,alpha/2,pch=20)
Plot 12.3: Save this to include in your assignment.
• Save your script so that you can use it for your assignment.