Lab4b - Personal Pages

Lab4 – part (b)
while loops let you compute until something occurs. The basic syntax is
while(…) {
indented block
of code
}
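Where the “…” is a Boolean expression, i.e. one that evaluates to TRUE or FALSE (T or F for short). The indented block of code is
executed repeatedly until the Boolean becomes FALSE. Warning: it is very easy to write an infinite loop! Something
inside the block must eventually change the value of the Boolean. For example, the following loop never stops, because 6 > 2 is always TRUE: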
while (6 > 2) {
  print("Still Running")
}
To break an infinite loop, hit esc or STOP.
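By contrast, here is a minimal sketch of a loop that does terminate, because the block changes the variable the condition depends on. The names x and steps are made up for this illustration and are not part of the lab.

```r
x <- 1        # value tested by the loop condition
steps <- 0    # how many times the block has run
while (x < 100) {
  x <- x * 2          # this update eventually makes the condition FALSE
  steps <- steps + 1
}
print(c(x, steps))    # x ends at 128 after 7 doublings
```

If the line x <- x * 2 were removed, the condition x < 100 would stay TRUE forever.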
Consider the following skeleton:
n <- 10
while (n > 0) {
  print(   ) # Countdown
}
print("Blastoff!")
How would you change this second example so that it prints 10,9,8,…,1 then Blastoff?
As a further example, the following conducts a geometric experiment: it rolls a fair die repeatedly
and counts the number of rolls needed to see the first 2.
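counter <- 1           # number of rolls so far, counting the current one
seenH <- FALSE         # flag: becomes TRUE once a 2 has been rolled
while (!seenH) {
  x <- sample(1:6, 1)  # roll one fair die
  print(x)
  if (x == 2) {
    seenH <- TRUE
  } else {
    counter <- counter + 1
  }
}
counter                # number of rolls needed to see the first 2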
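The loop above can be wrapped in a function and repeated many times to see how the number of rolls behaves on average. This is only a sketch: the function name rollsUntilTwo, the seed, and the 10,000 repetitions are assumptions made for illustration, not part of the lab.

```r
# Hypothetical helper: roll a fair die until the first 2, return the number of rolls.
rollsUntilTwo <- function() {
  counter <- 1
  while (sample(1:6, 1) != 2) {   # keep rolling while the roll is not a 2
    counter <- counter + 1
  }
  counter
}

set.seed(1)                                 # arbitrary seed, only for reproducibility
rolls <- replicate(10000, rollsUntilTwo())  # repeat the experiment 10,000 times
mean(rolls)                                 # should be near 1/(1/6) = 6
```

For a geometric experiment with success probability 1/6, the expected number of rolls is 6, so the simulated mean gives a quick sanity check on the loop.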
Exercises:
1. Write a function sumRows that takes no parameters but uses the following line of code
to create a 30x1 matrix filled with NA (R's missing-value marker)
output.mat<-matrix(NA, nrow=30,ncol=1)
then creates a 30x30 matrix filled with 900 random numbers
data.mat<-matrix(runif(900), nrow=30)
then fills output.mat with the sums of entries in each row. Note that nrow(data.mat)
gives the number of rows you need to loop through.
2. Write a function InARow(p,k) that repeatedly conducts a Bernoulli(p) experiment until k
successes in a row are observed.
3. Write a function monteCarloInARow(p,k,n) that conducts a simulation to determine the
probability of the occurrence in Exercise 2 (InARow) by repeating it n times.
Aside: A shortcut to Exercise 1 (sumRows), but one you shouldn’t use for your Lab write-up, is the apply
command. You could do it in one line via
output.mat <- apply(data.mat,1,sum)
For those who have programmed before, this is similar to the map command. The first
parameter is the object to be operated on. The third parameter is the function to be applied, in
this case sum. The second parameter tells R how to operate on the object. Since we want to
take the sum of each row we put a 1 here. If we wanted the sum of each column we’d put 2.
Replacing sum with mean gives the row or column means instead.
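To see what the second parameter does, here is a small sketch on a toy matrix. The matrix m and the variable names are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix: rows are (1, 3, 5) and (2, 4, 6)
rowTotals <- apply(m, 1, sum)       # margin 1: apply sum to each row
colAverages <- apply(m, 2, mean)    # margin 2: apply mean to each column
rowTotals                           # 9 12
colAverages                         # 1.5 3.5 5.5
```

Note that apply returns a plain vector here, with one entry per row (margin 1) or per column (margin 2).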
Exercise 14: Rotten Eggs Problem
The Wooster Voice printed the following account of an exchange between a student worker in
the cafeteria and a health inspector:
“The recipe calls for four fresh eggs for each pound of pasta in the carbonara. During her visit,
the health inspector pointed out that FDA research indicates one in four eggs of the recent batch
carries salmonella bacterium, so the cafeteria should never use more than three eggs when
preparing a dish. The student worker asked if throwing out three eggs from each dozen and using
the remaining nine would serve the same purpose.”
a. Does this idea make any sense at all? Why or why not?
b. Suppose the following argument is made for three-egg dishes rather than four egg dishes:
Let X be the number of eggs carrying salmonella. Then p(0) = P(X=0) = (.75)^3 = .422 for
three-egg dishes while p(0) = P(X=0) = (.75)^4 = .316 for four-egg dishes. What assumption
is being made in order to justify these calculations? Is that reasonable or not?
c. Suppose a dozen eggs happens to contain exactly 3 with salmonella, and that the student
worker does as he proposed to do, and discards 3 at random. Let X be the number of eggs
carrying salmonella among the four eggs selected at random from the remaining nine.
Conduct a simulation in R to determine the probability of getting a healthy dish with four
eggs. Repeat this experiment 100,000 times.
d. Use the result in (c) to approximate the probability distribution of X. Comment on
whether or not the student’s suggestion reduces risk of salmonella exposure.
e. Do you think simulation or theory is an easier way to solve (d) above? Justify your
answer.
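One possible way to set up the simulation in part (c) is sketched below. The names dozen, oneDish, and X, the coding of eggs as 0/1, and the seed are assumptions made for illustration. Other designs (for example, an explicit loop) work equally well.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
dozen <- c(rep(1, 3), rep(0, 9))     # 1 = carries salmonella, 0 = healthy

# Hypothetical helper: one run of the student worker's procedure.
oneDish <- function() {
  remaining <- sample(dozen, 9)      # discarding 3 at random = keeping a random 9
  dish <- sample(remaining, 4)       # choose 4 of the remaining 9 for the dish
  sum(dish)                          # X = number of bad eggs in the dish
}

X <- replicate(100000, oneDish())    # repeat the experiment 100,000 times
table(X) / length(X)                 # approximate distribution of X
mean(X == 0)                         # estimated probability of a healthy dish
```

Both calls to sample draw without replacement, which is the default and matches eggs being physically removed from the carton.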
Lab 4, Exercise 15: Capture re-capture
Suppose there are N fish in some lake, and you don’t know the value N. You capture a number
M of fish and put tags on them, then release them. Now M/N of the fish in the lake have tags.
Now capture n fish at random from the lake (this is called the recapture sample) and let m be the
number of tagged fish in the recapture sample.
Take a moment and write an equation you can use to estimate N based on the sample. Notice
how this problem is very different from how the hypergeometric is used in probability.
Let N = 400, M = 40, and n = 50. For an individual fish, let X=1 if tagged and 0 if untagged. We
are now going to conduct a simulation to estimate N. You should use loops, conditionals,
subsetting, and functions freely, and should try to program in good style.
a. Write a function that conducts a sample of size 50 without replacement from a vector
consisting of 40 tagged fish (that’s forty 1’s) and 360 untagged fish. Your function
should return the number of tagged fish found.
b. Calculate the frequency of tagged fish in this sample and use it to estimate N.
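Hint: To generate estimates of the total population size use floor(40*50/i), where i holds the
sum of the 0s and 1s in the sample of size 50 (R is case-sensitive, so it is floor, not Floor). This
gives truncated values of nM/m. Note that a sample with no tagged fish (i = 0) gives Inf.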
c. Repeat (a) 1000 times and for each trial find the number of tagged fish. Use vectors to
store all 1000 results.
d. Draw a histogram, summary stats, and normal probability plot. Is it approximately
normal? Are the values close to what you’d expect from the hypergeometric distribution?
e. Determine how many of the samples had no tagged fish via bar count.
f. Draw a histogram of these estimates. Describe the shape and give values for its mean and
sd. If values are missing, how many? How does that relate to the questions above?
Since you sampled 1000 times from a distribution of samples of size 50, the histogram you
created in (f) is called a sampling distribution. This lab is our first taste of bootstrapping as a
way to create lots of data from a small amount of data.
In chapter 6 (page 231) we’ll define bias as a random variable. We’ll say an estimator is
unbiased if the expected value of the estimator matches the parameter we are trying to estimate.
The distribution of the estimator is the sampling distribution. In this case, we’re trying to
estimate N and we are using (b), repeated 1000 times. By comparing the average of the results in
(c) with what we know (that N = 400 for our simulations), answer the following:
g. Is the estimator for the true population size unbiased? If not, what is the approximate
magnitude of the bias?
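A minimal sketch of how the pieces of this exercise could fit together is given below. The names lake, countTagged, m, and Nhat and the seed are assumptions made for illustration. It uses replicate in place of an explicit loop, which is a substitution rather than the approach the lab asks for.

```r
set.seed(1)                           # arbitrary seed, only for reproducibility
N <- 400; M <- 40; n <- 50
lake <- c(rep(1, M), rep(0, N - M))   # 1 = tagged, 0 = untagged

# Hypothetical helper for part (a): number of tagged fish in one recapture sample.
countTagged <- function() sum(sample(lake, n))   # sample() is without replacement by default

m <- replicate(1000, countTagged())   # part (c): 1000 recapture samples
Nhat <- floor(n * M / m)              # estimates of N; Inf whenever m == 0

sum(m == 0)                           # samples with no tagged fish
mean(Nhat[is.finite(Nhat)])           # average estimate, ignoring the Inf values
```

Comparing the last line with the true N = 400 is one way to approach the bias question, keeping in mind that dropping the infinite estimates affects the average.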
Exercise 16: Q-Q plots as a goodness of fit test
We will check whether an exponential distribution is a reasonable model for the failure times in
days for air conditioning systems in aircraft. The data are:
97, 51, 11, 4, 141, 18, 142, 68, 77, 80, 1, 16, 106, 206, 82, 54, 31, 216, 46, 111, 39, 63, 18, 191,
18, 163, 24
a. Draw a histogram and boxplot. Find the mean and 5-number summary.
b. Create theoretical quantiles: generate the numbers from 1 to n in a variable i, where n is the
number of data values (27 for the list above).
c. Create a variable j = -ln(1-(i-.5)/n) and you’ll have quantiles for the Exp(1) distribution.
d. Draw the Q-Q plot in R, plotting the sorted data against j. Does it look exponential?
e. The slope of the line in this Q-Q plot is an estimate of Lambda, the mean of the
exponential distribution. Add a regression line to the Q-Q plot to find a numerical value.
How does this value compare to the mean of the original data set?
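The steps above might be sketched as follows. The variable names x, n, and fit are assumptions, and the sketch takes n from the length of the data rather than hard-coding a count, so that the theoretical quantiles and the sorted data always have the same length.

```r
x <- c(97, 51, 11, 4, 141, 18, 142, 68, 77, 80, 1, 16, 106, 206, 82, 54, 31,
       216, 46, 111, 39, 63, 18, 191, 18, 163, 24)   # failure times in days

n <- length(x)                   # number of observations
i <- 1:n
j <- -log(1 - (i - 0.5) / n)     # Exp(1) quantiles; log() is the natural log in R

# Q-Q plot: theoretical quantiles on the x-axis, sorted data on the y-axis
plot(j, sort(x), xlab = "Exp(1) quantiles", ylab = "sorted failure times")

fit <- lm(sort(x) ~ j)           # least-squares line through the Q-Q points
abline(fit)                      # add the line to the plot
coef(fit)[2]                     # slope: an estimate of the mean
mean(x)                          # sample mean, for comparison
```

With the data on the vertical axis, a roughly straight pattern of points supports the exponential model, and the slope plays the role of the mean.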
Lab 5, On Your Own #5
We’ll learn how to plot the normal density curve for user specified mean and standard deviation.
To plot a curve you need to set up the x-axis and y-axis values in two variables (R vectors). Then
the plot command can be used to plot all the (x,y) points in your vectors. This example will
guide you through the process for the standard normal curve, Z ~ N(0,1) . The x-axis should
range from plus/minus 4 standard deviations from the mean:
xaxis<-seq(-4,4,.01) # sets up the xaxis from -4 to 4 in increments of 0.01, or 801 points
For each of the points in xaxis, you need to evaluate the standard normal density function:
yaxis<-dnorm(xaxis) # default values for mean and sd are 0 and 1 respectively
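Now you can plot the (x,y) points. Each plot command below starts a new plot, so run whichever
style you prefer and then add the title:
plot(xaxis,yaxis,type="l",col="blue",lwd=3, xlab="") # solid line
plot(xaxis,yaxis, pch="*",cex=.75,col="tomato3",xlab="") # points drawn as *
plot(xaxis,yaxis, pch="#", cex=.75, col="chocolate" ,xlab="") # points drawn as #
title(main="standard normal distribution",xlab="Z") # adds titles to the most recent plot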
We can also use a plot character (pch) with border and fill color:
xaxis2<-seq(-4,4,.1) #note the different value for by in this sequence
yaxis2<-dnorm(xaxis2)
plot(xaxis2,yaxis2, pch=25, cex=.75, col="tomato3", bg="steelblue", lwd=2)
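# cex and lwd equal 1 by default; other values scale the size relative to that default,
# e.g. lwd=2 doubles the line width (a 100% increase) and cex=.75 shrinks symbols by 25%.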
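The same recipe carries over to other normal curves by passing mean and sd to dnorm and centering the x-axis on the mean. The sketch below uses made-up values (mu = 50, sigma = 10) purely for illustration, and the variable names are assumptions.

```r
mu <- 50; sigma <- 10                          # hypothetical values, for illustration only
x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 801)   # mean plus/minus 4 sd
y <- dnorm(x, mean = mu, sd = sigma)           # normal density with the chosen mean and sd
plot(x, y, type = "l", main = "normal density", xlab = "X", ylab = "density")

# The same heights, obtained by standardizing first and rescaling by 1/sigma
yCheck <- dnorm((x - mu) / sigma) / sigma
all.equal(y, yCheck)                           # TRUE
```

The check at the end shows that a general normal density is the standard normal curve shifted to mu and stretched by sigma.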
(5) Repeat the above process for a normal density curve with a mean of 100 and standard
deviation of 15. Use a plot character and colors in this plot, complete with titles.
Links:
http://www.statmethods.net/advgraphs/parameters.html
http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf