Calculus for Biologists Lab
Math 1180-002
Spring 2012

Lab #7 - Descriptive Statistics for Simulated Populations

Report due date: Tuesday, March 6, 2012 at 9 a.m.

Goal: To use the geometric mean to describe data, and to visualize the variance and standard deviation of data.

* Create a new script, either in R (laptop) or with a text editor (Linux computers).

Geometric Mean

Suppose we have a population of beings (pick your favorite one) in which there are really good times (years, months, days, etc.) and somewhat bad times. We'll assume (as in lecture) that in any given year the population reproduces at a rate R of 0.7, 0.9, or 1.5, each with equal probability. Of course, these rates are randomly chosen each year.

We will simulate 50 different populations for a total of 30 time steps. Each population will have an initial size of 10. Notice that in the code to follow, the storage matrix for reproduction rates has one fewer row than the one for the population sizes. This is because we only need to choose exactly 30 rates; the population size, on the other hand, will be updated 30 times in addition to the initial condition at time 0.

    R.list = c(0.7,0.9,1.5)    ## save rates to a list

    ## define a function that picks some number C rates from R.list
    rates = function(C){
      sample(R.list,C,replace=TRUE, prob=c(1/3,1/3,1/3))
    }

    total.time = 30    ## time steps
    n.pops = 50        ## number of populations

    N = matrix(0,total.time+1,n.pops)    ## storage table for population sizes
    R = matrix(0,total.time,n.pops)      ## storage table for reproduction rates

    ## Simulate!
    N0 = 10        ## initial population size
    N[1, ] = N0    ## initialize storage table

    for (j in 1:total.time){
      R[j, ] = rates(n.pops)      ## choose CURRENT rate
      N[j+1, ] = R[j, ]*N[j, ]    ## update NEW population size with this rate
    }

Now that we have our populations in hand, let's see what they look like. Remember that the standard form of the population size at time t with reproduction rates R_0, R_1, ..., R_{t-1} is

    N_t = R_{t-1} R_{t-2} \cdots R_2 R_1 R_0 N_0

As an exercise, you should choose one of your populations (1 through 50) and plot the simulated solution and the one obtained from the formula above (using the rates chosen for that same population).

We can determine the expected population trajectory with the true geometric mean, which can be calculated using the following:

    \tilde{R} = e^{\sum_i \ln(r_i) p_i},    (1)

where r_i takes on the values in R.list and p_i is the probability of rate r_i. Use the formula in (1) to calculate the actual geometric mean of this system, and save the result.

    true.geom.mean = # ???

Now we're ready to plot some things. The following code will create two side-by-side plots of the data. The one on the left shows the untransformed population sizes. The one on the right is a plot of the natural log of these populations. The latter plot just makes it easier to see the behavior of the solutions. The thick black line in each plot is the untransformed or log-transformed solution N_t = \tilde{R}^t N_0, where \tilde{R} is the true.geom.mean. The default base for R's log command is e, so log(N) is the natural logarithm of the populations.

    times = 0:total.time
    colors=rainbow(n.pops)

    par(mfrow=c(1,2))
    matplot(times, N, type = "l", col = colors, xlab="Time",lty=1,
            main="Simulated populations")
    lines(times,true.geom.mean^times*N0, type="l", col="black", lwd=2)

    matplot(times, log(N), type = "l", col = colors, xlab="Time",lty=1,
            main="log simulated populations")
    lines(times,log(true.geom.mean^times*N0), type="l", col="black", lwd=2)
    abline(h=0,lty=2)

Note the dashed line in the graph on the right. This tells you when the population gets into trouble. (Think about why this is!)

Plot 7.1: Save this figure to include in your assignment.

Variance

There are other interesting pieces of information you can obtain from the data. One is the spread (or variability). There are two statistical measures that describe this: variance (Var) and standard deviation (σ).
Recall that Var = σ^2, so if we can find the variance, we can easily compute the standard deviation. Now, instead of computing the spread of, say, reproductive rates, we are interested in the spread of the population sizes at each time point between 0 and 30. One way to measure this is the expected value of the squared difference between each population size and the mean over all populations at that time. On a computer, it is easier to calculate the equivalent difference between the expectation of the squared data and the square of the expectation, for each time t:

    \sigma^2 = \sum_{i=1}^{n} x_i^2 p_i - \bar{X}^2    (2)

This means we need several pieces of information:

1. the value of each population size (x_i)
2. the probability for each population size at each time (p_i)
3. the expected value at each time (\bar{X})

Note that in formula (2), i can run over the 50 individual populations, each with p_i = 1/50. The code below instead groups equal population sizes together and uses the fraction of populations at each size, which gives the same sums. In the code, j will reference each time point.

We will first compute the expected population sizes for every time, as well as the expectation of the squared population sizes.

    expect = rep(0,total.time+1)              ## save expected values
    squared.expect = rep(0,total.time+1)      ## save second term of the variance formula
    expect.of.square = rep(0,total.time+1)    ## save first term of the variance formula

    for (j in 1:(total.time+1)){
      x.i = as.numeric(names(table(N[j, ])))    ## all pop. sizes obtained at time j
      p.i = as.numeric(table(N[j, ]))/n.pops    ## the fraction of pops. that were of each size

      expect[j] = sum(x.i*p.i)
      squared.expect[j] = expect[j]^2
      expect.of.square[j] = sum(x.i^2*p.i)
    }

The as.numeric() command in the loop converts the table's names and counts to numbers (the names are stored as character strings). We end with the variance:

    variance = expect.of.square - squared.expect

And plot it:

    plot(times,variance,type="o",xlab="Time",main="Variance of simulated data")

Plot 7.2: Save this figure to include in your assignment.
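To see that the shortcut in formula (2) agrees with the direct definition of variance, here is a minimal sketch on a small made-up data set. The vector x and the names var.direct, var.shortcut, and var.builtin are hypothetical and are not part of the lab's script. Note also that R's built-in var() divides by n - 1 rather than n, so it has to be rescaled before it matches formula (2).

```r
## Hypothetical toy data (NOT the lab's simulation):
## six population sizes observed at a single time point.
x = c(7, 7, 9, 15, 15, 15)
n = length(x)

## Direct definition: expected squared difference from the mean.
var.direct = mean((x - mean(x))^2)

## Shortcut from formula (2): E[X^2] - (E[X])^2,
## grouping equal sizes with table() as the lab code does.
counts = table(x)
x.i = as.numeric(names(counts))    ## distinct sizes
p.i = as.numeric(counts)/n         ## fraction of observations at each size
var.shortcut = sum(x.i^2*p.i) - sum(x.i*p.i)^2

## R's var() divides by n - 1, so rescale it to divide by n.
var.builtin = var(x)*(n - 1)/n

print(c(direct = var.direct, shortcut = var.shortcut, builtin = var.builtin))
```

All three numbers should agree up to rounding. If they do not for your own data, the most likely culprit is the n versus n - 1 convention.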
Standard Deviation

Since the standard deviation is just the positive square root of the variance, we can easily compute it in R, using the sqrt function. This (you guessed it) takes the square root of its argument.

    std.dev = sqrt(variance)

Many scientific data are plotted on graphs with "error bars" at each point. In some cases, these bars represent one standard deviation above and below the plotted mean. That's what you are going to do next. You should get the following warning when you run the lines of code below:

    Warning message:
    In arrows(times, expect - std.dev, times, expect + std.dev, angle = 90,  :
      zero-length arrow is of indeterminate angle and so skipped

You can ignore this warning. It just says that the population sizes at one of the time points have a standard deviation of 0. Can you guess which one this is?

    par(mfrow=c(1,1))

    ## set up plot for data, but don't plot anything yet
    matplot(times,N,type="n",xlab="Time",
            ylim=c(floor(min(expect-std.dev)/10)*10,max(N)),
            main = "Population spread over time")

    ## add grid lines to the plot region
    abline(v=times,h=seq(floor(min(expect-std.dev)/10)*10,max(N),by=25),
           col="lightgray")

    ## now actually plot all the data, as blue dots
    matpoints(times,N,col="deepskyblue",pch=20,cex=0.75)

    ## plot the expected values at each time point, as black diamonds
    points(times,expect,pch=18,col="black",cex=1.25)

    ## plot lines extending from mean - std.dev to the mean + std.dev
    arrows(times, expect-std.dev,    ## first set of points
           times, expect+std.dev,    ## second set of points
           angle=90,                 ## don't plot arrowheads; plot flat heads instead
           code=3,                   ## plot "arrows" at both ends
           length=0.05,              ## arrowhead length
           lty=1, col="black")

Plot 7.3: Save this plot to include in your assignment.

* Save your script so that you can use it for your assignment.
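The arrows() warning discussed in the Standard Deviation section comes from time points whose standard deviation is zero. Below is a minimal sketch of how one might locate such points by computation rather than by guessing. It uses a small hypothetical matrix toy.N (rows are time points, columns are populations), not the lab's N, and the names row.sd and zero.rows are likewise made up for illustration.

```r
## Hypothetical toy data: 4 time points (rows) by 3 populations (columns).
## The third row is constant on purpose, so its spread is zero.
toy.N = matrix(c(10,  8, 12,
                  9, 14,  7,
                  5,  5,  5,
                  6,  3, 11), nrow = 4, byrow = TRUE)

## Standard deviation for each row, dividing by n as in formula (2).
row.sd = apply(toy.N, 1, function(x) sqrt(mean((x - mean(x))^2)))

## Rows with zero standard deviation would give zero-length error bars.
zero.rows = which(row.sd == 0)
print(zero.rows)
```

With the lab's own objects, the analogous check would presumably be times[which(std.dev == 0)]. Keep in mind that rounding in the variance calculation can leave a tiny nonzero value where the exact answer is zero, so an exact comparison with 0 is only a first check.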