Statistics (OA3102)
Prof. Fricker

Lab #2: Sampling, Sampling Distributions, and the Central Limit Theorem

Goal: Use R to demonstrate sampling methods, sampling distributions, and the Central Limit Theorem (CLT).

Lab type: Interactive lab demonstration followed by hands-on exercises.

Time allotted: Lecture for ~50 minutes followed by ~50 minutes of group exercises.

R libraries: Just the base package.

Data: mk48.down.csv

R HINT OF THE WEEK

1. Keeping a Log of Your R Session

a. While using a script can help you keep a record of the commands you run or the functions you've written, it does not keep track of the details of what happened during a particular session.

b. On Windows-based PCs running the R Console, the easiest way to save what you've done is to use the "Save to file…" option under the File pull-down menu in the R Console. This will save whatever is in your console window to a text file. This is usually sufficient for most sessions, but note that for long sessions, or if you print out large datasets, earlier material may have scrolled out of the R Console's buffer and been lost.

c. On Macs running the R Console, use the "Save As…" option under the File pull-down menu.

DEMONSTRATION

2. Now, before we begin, let's talk a bit about lists, data frames, and matrices in R.

a. The most general R data type is the list. A list can hold objects of any type: numeric, character, matrices, functions, even other lists. Furthermore, the objects don't even have to be of the same mode or the same length.

b. Here's one example of a list:

result <- list(LETTERS[1:5], 1:10, letters[1:3], "by Ron Fricker")

c. What things are in result? You can type result to see. Note that the elements are of different lengths.

i. You can extract items from a list with single square brackets; this gives you another list. For example, what do you get with result[1]? How do you know it's a list? Remember the class() function from last class. Also, note that it's printed out on two lines, the first one containing [[1]]. (If the elements of the list were named, then the name would have printed on this line.)

ii. Usually, instead of this sub-list, you want the contents of the sub-list; for that you can use double square brackets. Try result[[1]] and see what you get.

iii. You can check what mode (i.e., type) each of the elements is with the mode() function. For example, try:

mode(result[[1]])
mode(result[[2]])

What's the difference between mode and class? The mode() function tells you the storage mode of an object (e.g., "numeric," "character," "list," etc.), while class is an object attribute (e.g., "matrix," "array," "data.frame," etc.). Sometimes they're the same, but often they differ.

iv. If you want to assign names to the elements in the list:

names(result) <- c("CAPS", "Ints", "Sm.Ltrs", "Author")

v. Now what do you get when you type result?

vi. With names you can use the dollar-sign syntax to access the contents of a sub-list, as in result$CAPS. This is equivalent to result[[1]]. (The short recap below pulls these access methods together.)
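To tie the three access methods together, here is a minimal recap sketch using only the result list built above; the comments show what each call should return:

result[1]           # a list of length 1 containing the CAPS element
result[[1]]         # the character vector "A" "B" "C" "D" "E" itself
result$CAPS         # same as result[[1]], using the name assigned above
class(result[1])    # "list"
class(result[[1]])  # "character"
mode(result[[2]])   # "numeric"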
d. Data frames are the most frequently used R objects for storing data, at least for statistical analysis, and they are essentially a special kind of list.

i. Data frames have a specific structure: columns are variables and rows are observations. In addition, a data frame must be rectangular, so all columns must be of the same length and all rows must be of the same length.

ii. What is useful about data frames from an analysis point of view is that the columns can be of different types (numeric, character, factor, logical, etc.).

iii. The function data.frame() can be used to create a data frame. In a data frame the variables generally are named. Here's a simple data frame:

First.name <- c("Joe", "Peggy", "Harry", "Joan")
Last.name <- c("Sixpack", "Sue", "Henderson", "Jett")
Age <- c(32, 27, 38, 35)
Active.duty.ind <- c(1, 1, 1, 0)
At.NPS <- c(TRUE, FALSE, TRUE, TRUE)
Fake.data <- data.frame(First.name, Last.name, Age, Active.duty.ind, At.NPS)

iv. Now you can list the names of all the variables in the data frame, as you've done with other data frames, with names(Fake.data).

v. And, as we have been doing in class, you can use the name of a variable to call it. For example: Fake.data$First.name.

3. Matrices most often come into play in R when you want to do numerical calculations.

a. Every element of a matrix has to be of the same mode (for example, all numeric, all character, or all logical). Most commonly, matrices are numeric and used in computations. For example, a matrix A and a vector x can be multiplied using the notation A %*% x.

b. The function for creating a matrix is, not surprisingly, matrix(). See the help for the arguments, particularly nrow and ncol, which define the size of the matrix, and byrow, which defines how the data is read into the matrix. To illustrate, consider the following example and, before running it in R, guess what the matrix will look like:

Fake.matrix <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE)

4. Illustrating the Central Limit Theorem (CLT).

a. The CLT says that sums of iid random variables have an approximately normal distribution. The greater the number of r.v.s summed, the better the approximation. What this means, in particular, is that sample means have an approximately normal distribution. Let's illustrate this with a couple of simulation examples, similar in spirit to the applet we looked at in class.

b. To begin, let's look at a random variable that's uniformly distributed on the unit interval [0,1]: X ~ U[0,1]. It's easy to generate such r.v.s in R using the runif() function, which we will use to create a matrix of 300,000 random draws from a U[0,1] distribution:

Xmatrix <- matrix(runif(10000*30), nrow=10000)

This syntax creates a matrix with 10,000 rows and 30 columns, each entry of which is a random draw from a U[0,1] distribution. Check that the range of the values looks reasonable with summary(Xmatrix) and hist(Xmatrix).

c. Now, let's look at how, as the sample size increases, the sample mean becomes more and more normally distributed. The syntax below creates four new vectors, which are the row means of the first 2, 5, 10, and all 30 columns of the matrix:

um2 <- rowMeans(Xmatrix[,1:2])
um5 <- rowMeans(Xmatrix[,1:5])
um10 <- rowMeans(Xmatrix[,1:10])
um30 <- rowMeans(Xmatrix)

So, each is a vector 10,000 observations long, and we would expect um2 to be the least normally distributed and um30 to be quite close to normal. Check this with:

par(mfrow=c(2,2))
qqnorm(um2); qqnorm(um5); qqnorm(um10); qqnorm(um30)

It doesn't take long for the CLT to "kick in," does it? Even for this example, in which the population distribution is very non-normal, the normal probability plots start to look pretty straight with means of just five observations.
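If you prefer histograms to normal probability plots, here is a minimal sketch that overlays the CLT's predicted normal density on a histogram of the row means for each sample size. It uses only the Xmatrix object created above; since Var(X) = 1/12 for X ~ U[0,1], the mean of n observations has mean 1/2 and standard deviation sqrt(1/(12n)):

par(mfrow=c(2,2))
for (n in c(2, 5, 10, 30)) {
  xbar <- rowMeans(Xmatrix[, 1:n])
  hist(xbar, prob=TRUE, main=paste("n =", n))
  # CLT-predicted normal density: mean 1/2, sd sqrt(1/(12*n))
  curve(dnorm(x, 1/2, sqrt(1/(12*n))), lwd=2, col="red", add=TRUE)
}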
5. Picturing a (Non-Normal) Sampling Distribution: A Real-Data Example.

a. Read in the file named mk48.down.csv:

mk48.down <- read.csv(file.choose())

Of course, once you type the above, you need to find the CSV file on your computer and click on it via the dialog box that pops up. You now have the data in a vector called mk48.down$down.days, whose 9,505 entries give the number of "down days" for the Marines' MK-48 Logistic Vehicle System (LVS). How did I know that the name of the vector is down.days? Remember: names(mk48.down).

b. What sort of distribution does it look like the data came from?

hist(mk48.down$down.days, prob=TRUE)

Answer: it looks a lot like an exponential distribution.

c. What is a good estimate of the exponential distribution's parameter λ? Answer: the method of moments and maximum likelihood techniques (both of which we will learn about in upcoming lectures) give λ̂ = 1/X̄. This is called a point estimate, where we are using the data to estimate the parameter λ. In this case we have λ̂ = 0.01581165, which we find by executing the following command:

1/mean(mk48.down$down.days)

How good is the fit? Let's overlay the parametric exp(0.0158) density on the density histogram:

hist(mk48.down$down.days, prob=TRUE)
curve(dexp(x, 0.0158), lwd=2, col="red", add=TRUE)

Looks pretty good, eh? However, note that plotting a histogram and a probability density curve like this is not the best way to make such a comparison. Your eye is simply not calibrated finely enough to see more than obvious, gross differences. A better approach is to use the qqplot() function, plotting the data versus some random observations from an exp(0.0158) distribution. This is called a quantile-quantile (or Q-Q for short) plot, which is very similar to a normal probability plot (the qqnorm function in R), but instead of comparing the data to the theoretical quantiles of a normal distribution, we now compare two data sets against each other. As with the normal probability plot, if they come from the same distribution, the points on the Q-Q plot should fall close to a straight line. Here are the R commands, where we first generate 9,505 observations from an exp(0.0158) distribution. Then we plot them versus the actual data. Finally, we overlay a straight line to help us see visually what's going on. (Note that we save the coordinates returned by qqplot() so that the line can be added without redrawing the plot.)

rand.exps <- rexp(9505, 0.0158)
qq <- qqplot(mk48.down$down.days, rand.exps)
abline(lm(y ~ x, data=qq))

Here we see that there are some down-days observations that are a lot larger than would be expected if the data did come from an exp(0.0158) distribution. But if we focus on the majority of observations, those of 400 days or fewer, it doesn't look too bad (note that we generate one random exponential for each retained observation):

qqplot(mk48.down$down.days[mk48.down$down.days<=400],
       rexp(sum(mk48.down$down.days<=400), 0.0158))

And almost all of the observations are 400 down days or fewer:

table(mk48.down$down.days<=400)

FALSE  TRUE
  101  9404

So, we'll assume we know the population comes from an exp(0.0158) distribution.
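As an aside, you can avoid the extra noise that comes from comparing the data to random draws by plotting the data against the theoretical quantiles of the fitted exponential directly. Here is a minimal sketch of that alternative, using the base ppoints() function to generate evenly spaced probabilities at which to evaluate the quantile function qexp():

n.obs <- length(mk48.down$down.days)
# Data versus theoretical exp(0.0158) quantiles; a good fit falls
# near the 45-degree line
qqplot(qexp(ppoints(n.obs), rate=0.0158), mk48.down$down.days,
       xlab="Theoretical exp(0.0158) quantiles", ylab="Observed down days")
abline(0, 1)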
d. By the CLT, we know X̄ has an approximately normal distribution, but 1/X̄ does not. What does the distribution of 1/X̄ from an exp(0.0158) distribution look like? Well, that depends on the sample size, since n is involved in the calculation of X̄. For this demonstration, let's imagine we're interested in a sample of size n=10 and we want to know what the sampling distribution of λ̂ looks like. One way to do this is to use simulation, generating lots of samples of size 10 from an exp(0.0158) distribution and then plotting the resulting estimates to get a picture of the sampling distribution:

ee <- matrix(rexp(10000*10, rate=0.0158), nrow=10000)
ee.m10 <- rowMeans(ee)
hist(1/ee.m10)

Here we can clearly see a skewed distribution, which validates our original assertion that the distribution of 1/X̄ is not normal.

e. To be a bit fancier, we can overlay a normal density curve to better show the skew:

hist(1/ee.m10, prob=TRUE, xlim=c(-0.1,0.1))
curve(dnorm(x, mean(1/ee.m10), sd(1/ee.m10)), lwd=2, col="red", add=TRUE)

And, if we want to be a bit more formal, we can use a normal probability plot:

qqnorm(1/ee.m10)
qqline(1/ee.m10)

Thus, what we see here is that not all sampling distributions are normal. Remember, a sampling distribution is just the probability distribution of a statistic, and statistics are often not normally distributed.

6. Another Approach: Using Sampling to Construct an Empirical Estimate of the Sampling Distribution of 1/X̄.

a. In the previous section, we approximated the population distribution of LVS down days with an exponential distribution. That's what would be referred to as a "parametric" approach to estimating the sampling distribution. It's parametric because we chose a particular family of distributions (exponential) and then fit a particular distribution from this family by estimating the parameter of the exponential distribution from the data.

b. An alternative approach, which is "nonparametric," is to use the data itself to estimate the sampling distribution. We will do this by sampling directly from the data, for which the sample() function will be very helpful. The idea is that we will repeatedly draw random samples of 10 observations from the data, calculate the inverse of the mean of each sample, and plot the results on a histogram.

c. To begin, we calculate a vector called resamples that contains 10,000 inverse means of 10 observations. Each sample of 10 is drawn without replacement from the 9,505 observations, but any particular observation in the data can show up in more than one sample of 10. (A more compact alternative to the loop below appears at the end of this section.)

resamples <- vector(length=10000)
for(i in 1:10000){
  resamples[i] <- 1/mean(sample(mk48.down$down.days, size=10))
}

d. Now, let's plot a histogram of these 10,000 resampled estimates and compare it to the histogram that used the random exponentials in the last example:

par(mfrow=c(1,2))
hist(1/ee.m10)
hist(resamples)

They look pretty close, eh? But as we previously discussed, it's actually pretty hard for the human eye to distinguish differences between two or more histograms this way, so let's compare with a Q-Q plot (again saving the coordinates so the line can be added without redrawing):

qq <- qqplot(resamples, 1/ee.m10)
abline(lm(y ~ x, data=qq))

Not a bad fit! It looks like there is just a bit of deviation in the right tails of the distributions, but overall pretty darn good. So, it looks like we get essentially the same sampling distribution estimate whether we use a parametric or a nonparametric approach.
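As promised above, the resampling loop in item 6.c can be written more compactly with the base replicate() function, which evaluates an expression a fixed number of times and collects the results into a vector; this minimal sketch is equivalent to the loop:

# 10,000 inverse means of samples of size 10 drawn from the data
resamples <- replicate(10000, 1/mean(sample(mk48.down$down.days, size=10)))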
GROUP #___ EXERCISES

Members: ______________, ______________, ______________, ______________

1. Illustrate the CLT on sample totals using some very non-normal data. In particular, draw random samples from a gamma distribution (see the rgamma() function) with parameter shape=2. First, using a normal probability plot, demonstrate that the data is not normal.

a. Now, vary the sample size (n) for the sample total from very small (say 2) to quite large (you choose).

i. The apply() and sum() functions will likely be useful if you first create a matrix of gamma-distributed data, as in the earlier demonstration.

ii. Be sure to simulate enough samples that you get relatively smooth histograms and/or normal probability plots.

b. As your output, create a single chart with a sequence of plots showing how the normal approximation gets better and better as n gets large – that is, as the CLT "kicks in."

i. The par(mfrow=c(a,b)) command will be useful for putting multiple plots on one chart, where a is the number of rows and b is the number of columns in the "matrix" of figures.

2. The coefficient of variation (CV) is a normalized measure of the dispersion of a probability distribution. It is defined as the ratio of the standard deviation σ to the mean μ: CV = σ/μ. Empirically estimate the CV sampling distribution for the MK-48 LVS data for various sample sizes as follows.

a. Resample from the data 10,000 times, using samples of size n.

b. For each resample, estimate the CV as the ratio of the sample standard deviation to the sample mean: s/x̄. Note that the sample mean and standard deviation are calculated on the same sample of data.

c. As your output, create a single chart with a sequence of plots showing what happens as n gets bigger. Does the CLT "kick in" for the CV?

Name: _____________________________

INDIVIDUAL EXERCISES

1. Repeat the demonstration of the CLT in the lab on a discrete uniform distribution. In particular, consider the uniform distribution on the integers from 1 to 6, which would simulate a fair die.

(a) You can easily generate 300,000 observations from such a random variable in R using the runif() function combined with the ceiling() function:

ceiling(runif(10000*30, 0, 6))

Here, runif(10000*30, 0, 6) generates 300,000 random observations between 0 and 6, and the ceiling() function rounds each observation up to the next higher integer, thereby simulating 300,000 rolls of a fair die.

(b) So, using the matrix() function, repeat the illustration of the CLT in item 4 of the demonstration, but using the discrete uniform distribution just specified. Turn in a sequence of quantile-quantile plots showing the progression of the sample mean towards normality for increasing sample sizes.

2. For X ~ Gamma(2,1), the mean is μX = E(X) = 2 and the variance is σX² = Var(X) = 2. For various sample sizes, empirically demonstrate that for the total of n iid observations, T0 = X1 + X2 + … + Xn, it follows that E(T0) = nμX = 2n and Var(T0) = nσX² = 2n.

What do I mean by "empirically demonstrate" here? I mean that you should simulate some data and from it compute some totals, then estimate the theoretical quantities using an appropriate statistic on the totals, and then show that the estimates get closer and closer to the theoretical quantities as the number of totals is increased. For example, choose an n, say n=5. Now, using simulation, generate m totals, each of which is the sum of five gamma random variables: T0,i = X1,i + X2,i + X3,i + X4,i + X5,i. Now, estimate E(T0) with the average of the totals, Ê(T0) = T̄0 = (1/m) Σ T0,i (the sum running from i = 1 to m), and show that, as you let m get large, Ê(T0) approaches 10 = 2n. (A minimal sketch of the simulation scaffolding appears below to get you started.)
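To get you started on individual exercise 2 (this is scaffolding only, not a complete solution), here is a minimal sketch for n=5; the choice of m values is arbitrary, and rgamma()'s default scale of 1 is assumed so that the draws are Gamma(2,1):

n <- 5
for (m in c(100, 1000, 10000, 100000)) {
  # m totals, each the sum of n Gamma(shape=2, scale=1) draws
  totals <- rowSums(matrix(rgamma(m*n, shape=2), nrow=m))
  # sample mean and variance of the totals; both should approach 2n = 10
  cat("m =", m, " mean:", mean(totals), " var:", var(totals), "\n")
}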