Sampling Distributions and the CLT

Naval Postgraduate School
Statistics (OA3102), Prof. Fricker
Revision: January 2012

Lab #2: Sampling, Sampling Distributions, and the Central Limit Theorem
Goal: Use R to demonstrate sampling methods, sampling distributions, and the Central
Limit Theorem (CLT).
Lab type: Interactive lab demonstration followed by hands-on exercises.
Time allotted: Lecture for ~50 minutes followed by ~50 minutes of group exercises.
R libraries: Just the base package.
Data: mk48.down.csv
R HINT OF THE WEEK
1. Keeping a Log of Your R Session
a. While using a script can help you keep a record of the commands you run or the
functions you've written, it does not keep track of the details of what happened
during a particular session.
b. On Windows PCs running the R Console, the easiest way to save what
you've done is to use the "Save to file…" option under the File pull-down menu in
the R Console. This will save whatever is in your console window to a text file.
This is usually sufficient for most sessions, but note that for long sessions, or if
you print out large datasets, earlier material may have scrolled out of the R
Console's buffer and been lost.
c. On Macs running the R Console, use the "Save As…" option under the File
pull-down menu.
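A platform-independent alternative, also in the base package, is the sink() function, which diverts console output to a file as the session runs. A minimal sketch (the filename below is just an illustration):

```r
# Divert console output to a text file; split = TRUE echoes it to the console too
sink("lab2-session.txt", split = TRUE)

x <- rnorm(5)
summary(x)   # this output is written to the file as well as the console

sink()       # turn logging off and return output to the console only
```

Note that sink() captures output but not the commands themselves; in an interactive session, savehistory() can save the command history separately.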
DEMONSTRATION
2. Now, before we begin, let's talk a bit about lists, data frames, and matrices in R.
a. The most general R data type is the list. A list can hold objects of any type:
numeric, character, matrices, functions, even other lists. Furthermore, the objects
don't even have to be of the same mode or the same length.
b. Here’s one example of a list:
result <- list(LETTERS[1:5],1:10,letters[1:3],"by Ron Fricker")
c. What things are in result? You can type result to see. Note that each of the
elements is of a different length.
i. You can extract items from a list with the square brackets; this gives you
another list. For example, what do you get with result[1]? How do you
know it’s a list? Remember the class function from last class. Also, note
that it’s printed out on two lines, the first one containing [[1]]. (If the
elements of the list were named, then the name would have printed on this
line.)
ii. Usually instead of this sub-list you want the contents of the sub-list; for that
you can use double square brackets. Try result[[1]] and see what you get.
iii. You can check to see what mode (i.e., type) each of the elements is with the
mode() function. For example, try:
mode(result[[1]])
mode(result[[2]])
What's the difference between mode and class? The mode function tells
you the storage mode of an object (e.g., "numeric," "character," "list," etc.),
while class is an object attribute (e.g., "matrix," "array," "data.frame," etc.).
Sometimes they're the same, but often they differ.
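To see the two disagree, compare them on a matrix; on a plain character vector they coincide. A quick sketch (redefining result from above so the snippet stands on its own):

```r
result <- list(LETTERS[1:5], 1:10, letters[1:3], "by Ron Fricker")

m <- matrix(1:6, nrow = 2)
mode(m)             # "numeric" -- the storage mode of the elements
class(m)            # "matrix" (plus "array" in R >= 4.0) -- the object's class

mode(result[[4]])   # "character"
class(result[[4]])  # "character" -- here mode and class agree
```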
iv. If you want to assign names to the elements in the list:
names(result) <- c("CAPS", "Ints", "Sm.Ltrs", "Author")
v. Now what do you get when you type result?
vi. With names you can use the dollar sign syntax to access contents of a
sublist, as in result$CAPS. This is equivalent to result[[1]].
d. Data frames are the most frequently used R objects for storing data, at least for
statistical analysis, and they are essentially a special kind of list.
i. Data frames have a specific structure: Columns are variables and rows are
observations. In addition, a data frame must be rectangular, so all columns
must be of the same length and all rows must be of the same length.
ii. What is useful about data frames from an analysis point of view is that the
columns can be of different types (numeric, character, factor, logical, etc.).
iii. The function data.frame() can be used to create a data frame. In a data
frame the variables generally are named. Here’s a simple data frame:
First.name <- c("Joe", "Peggy", "Harry", "Joan")
Last.name <- c("Sixpack", "Sue", "Henderson", "Jett")
Age <- c(32, 27, 38, 35)
Active.duty.ind <- c(1,1,1,0)
At.NPS <- c(TRUE, FALSE, TRUE, TRUE)
Fake.data <- data.frame(First.name, Last.name, Age,
Active.duty.ind, At.NPS)
iv. Now, you can list the names of all the variables in the data frame, as you've
done with other data frames, with names(Fake.data).
v. And, as we have been doing in class, you can use the name of a variable to
call it. For example: Fake.data$First.name.
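One way to confirm that the columns really are of different types is to apply class() to each column with sapply(). A sketch (note that in versions of R before 4.0, data.frame() converts character columns to factors unless stringsAsFactors = FALSE is supplied, so it is given explicitly here):

```r
First.name <- c("Joe", "Peggy", "Harry", "Joan")
Last.name <- c("Sixpack", "Sue", "Henderson", "Jett")
Age <- c(32, 27, 38, 35)
Active.duty.ind <- c(1, 1, 1, 0)
At.NPS <- c(TRUE, FALSE, TRUE, TRUE)
Fake.data <- data.frame(First.name, Last.name, Age, Active.duty.ind, At.NPS,
                        stringsAsFactors = FALSE)

# Report the class of each column: character, character, numeric, numeric, logical
sapply(Fake.data, class)
```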
3. Matrices most often come into play in R when you want to do numerical calculations.
a. In R, every element of a matrix has to be of the same mode (for
example, they must all be numeric, or all character, or all logical). Most
commonly, matrices are numeric and used in computations. For example, a
matrix A and a conformable matrix (or vector) x can be multiplied using the
notation A%*%x.
b. The function for creating a matrix is, not surprisingly, matrix(). See the help for
the arguments, particularly nrow and ncol which define the size of the matrix and
byrow which defines how the data is read into the matrix. To illustrate, consider
the following example and, before running it in R, guess what the matrix will look
like:
Fake.matrix <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE)
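As a quick illustration of the %*% notation mentioned above, here is a small made-up matrix-vector product, contrasted with R's element-by-element * operator:

```r
A <- matrix(1:4, nrow = 2, byrow = TRUE)  # rows (1, 2) and (3, 4)
x <- c(1, 1)

A %*% x   # true matrix multiplication: a 2x1 matrix containing 3 and 7
A * x     # contrast: element-by-element multiplication, NOT a matrix product
```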
4. Illustrating the Central Limit Theorem (CLT).
a. The CLT says that sums of iid random variables have an approximately normal
distribution. The greater the number of r.v.s summed, the better the
approximation. What this means, in particular, is that sample means have an
approximate normal distribution. Let’s illustrate this with a couple of simulation
examples, similar in spirit to the applet we looked at in class.
b. To begin, let’s look at a random variable that’s uniformly distributed on the unit
interval [0,1]: X~U[0,1]. It’s easy to generate such r.v.s in R using the runif()
function, which we will use to create a matrix of 300,000 random draws from a
U[0,1] distribution:
Xmatrix <- matrix(runif(10000*30),nrow=10000)
This syntax creates a matrix with 10,000 rows and 30 columns, each entry of
which is a random draw from a U[0,1] distribution. Check that the range of the
values looks reasonable with
summary(Xmatrix)
and
hist(Xmatrix)
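You can also confirm the dimensions, and that the overall mean is close to 1/2, the mean of a U[0,1] random variable:

```r
Xmatrix <- matrix(runif(10000*30), nrow = 10000)

dim(Xmatrix)    # 10,000 rows and 30 columns
mean(Xmatrix)   # should be very close to 0.5
```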
c. Now, let's look at how, as the sample size increases, the sample mean becomes
more and more normally distributed. The syntax below creates four new vectors,
which are the row means of the first 2, 5, 10, and all 30 columns of the matrix.
um2<-rowMeans(Xmatrix[,1:2])
um5<-rowMeans(Xmatrix[,1:5])
um10<-rowMeans(Xmatrix[,1:10])
um30<-rowMeans(Xmatrix)
So, they’re each vectors that are 10,000 observations long, and we would expect
um2 to be the least normally distributed and um30 to be quite close to normal.
Check this with:
par(mfrow=c(2,2))
qqnorm(um2); qqnorm(um5); qqnorm(um10); qqnorm(um30)
It doesn’t take long for the CLT to “kick in,” does it? Even for this example, in
which the population distribution is very non-normally distributed, the normal
probability plots start to look pretty straight with means of just five observations.
5. Picturing a (Non-Normal) Sampling Distribution: A Real-Data Example.
a. Read in the file named mk48.down.csv.
mk48.down <- read.csv(file.choose())
Of course, once you type the above, you need to then find the CSV file on your
computer and click on it via the dialog box that pops up.
You now have the data in a vector called mk48.down$down.days whose 9,505
entries give the number of “down days” for the Marines’ MK-48 Logistic Vehicle
System (LVS).
How did I know that the name of the vector is down.days? Remember:
names(mk48.down)
b. What sort of distribution does it look like the data came from?
hist(mk48.down$down.days,prob=TRUE)
Answer: it looks a lot like an exponential distribution.
c. What is a good estimate of the exponential distribution's parameter λ? Answer:
the method of moments and maximum likelihood techniques (both of which we
will learn about in upcoming lectures) give λ̂ = 1/X̄. This is called a point
estimate, where we are using the data to estimate the parameter λ. In this case we
have λ̂ = 0.01581165, which we find by executing the following command:
1/mean(mk48.down$down.days)
How good is the fit? Let's overlay the parametric distribution exp(0.0158) with
the density histogram:
hist(mk48.down$down.days,prob=TRUE)
curve(dexp(x,0.0158),lwd=2,col="red",add=TRUE)
Looks pretty good, eh? However, note that plotting a histogram and a probability
density curve like this is not the best way to make such a comparison. Your eye is
simply not calibrated finely enough to see more than obvious, gross differences.
A better approach is to use the qqplot function, plotting the data versus some
random observations from an exp(0.0158) distribution. This is called a quantile-
quantile (or Q-Q for short) plot, which is very similar to a normal probability plot
(the qqnorm function in R), but instead of comparing the data to the theoretical
quantiles of a normal distribution, we now compare two data sets against each
other. As with the normal probability plot, if they come from the same
distribution, the points on the qqplot should fall close to a straight line.
Here're the R commands, where we first generate 9,505 observations from an
exp(0.0158) distribution. Then we plot them versus the actual data. Finally, we
overlay a straight line to help us visually see what's going on.
rand.exps <- rexp(9505, 0.0158)
qq <- qqplot(mk48.down$down.days, rand.exps)
abline(lm(y ~ x, data = qq))

(Saving the qqplot result first, rather than calling qqplot a second time inside
abline, avoids drawing the plot twice; qqplot invisibly returns the plotted x and
y coordinates.)
Here we see that there are some down days observations that are a lot larger than
would be expected if the data did come from an exp(0.0158) distribution. But if
we focus in on the majority of observations less than 400 days, it doesn't look too
bad:
qqplot(mk48.down$down.days[mk48.down$down.days<=400],
rexp(400,0.0158))
And almost all of the observations have 400 or fewer down days:
table(mk48.down$down.days<=400)

FALSE  TRUE
  101  9404
So, we'll assume we know the population comes from an exp(0.0158) distribution.
d. By the CLT, we know X̄ has an approximately normal distribution, but 1/X̄ does
not. What does the distribution of 1/X̄ from an exp(0.0158) distribution look like?
Well, that depends on the sample size, since n is involved in the calculation of X̄.
For this demonstration, let's imagine we're interested in a sample of size n=10
and we want to know what the sampling distribution of λ̂ looks like. One way to
do this is to use simulation, generating lots of samples of size 10 from an
exp(0.0158) distribution and then plotting them to get a picture of the sampling
distribution:
ee <- matrix(rexp(10000*10, rate=0.0158), nrow=10000)
ee.m10 <- rowMeans(ee)
hist(1/ee.m10)
Here we can clearly see a skewed distribution, which validates our original
assertion that the distribution of 1/X̄ is not normal.
e. To be a bit fancier, we can overlay a normal density curve to better show the
skew:
hist(1/ee.m10, prob=TRUE, xlim=c(-0.1,0.1))
curve(dnorm(x,mean(1/ee.m10),sd(1/ee.m10)),lwd=2,col="red",add=TRUE)
And, if we want to be a bit more formal we can use a normal probability plot:
qqnorm(1/ee.m10)
qqline(1/ee.m10)
Thus, what we see here is that not all sampling distributions are normal.
Remember, a sampling distribution is just the probability distribution of a
statistic, and often they are not normally distributed.
6. Another Approach: Using Sampling to Construct an Empirical Estimate of the
Sampling Distribution of 1/X̄.
a. In the previous section, we approximated the population distribution of LVS
down days with an exponential distribution. That's what would be referred to as a
"parametric" approach to estimating the sampling distribution. It's parametric
because we chose a particular family of distributions (exponential) and then fit a
particular distribution from this family by estimating the parameter of the
exponential distribution from the data.
b. An alternative approach, which is "nonparametric," is to use the data itself to
estimate the sampling distribution. We will do this by sampling directly from the
data, for which the sample() function will be very helpful. The idea is that we
will repeatedly randomly sample 10 observations from the data, calculate the
inverse of the mean of each sample, and plot the results on a histogram.
c. To begin, we create a vector called resamples that contains 10,000 inverse
means, each computed from a sample of 10 observations. Each sample of 10 is
drawn without replacement from the 9,505 observations, but any particular
observation in the data can show up in more than one sample of 10.
resamples <- vector(length = 10000)
for(i in 1:10000){
  resamples[i] <- 1/mean(sample(mk48.down$down.days, size=10))
}
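The same resampling can be written more compactly with replicate(), also in the base package, and is equivalent to the for loop. (The stand-in exponential data below is only so the snippet runs on its own; with the real data in hand, you would use mk48.down$down.days directly.)

```r
# Use the real data if it has been read in; otherwise fall back to a stand-in
# exponential sample so the snippet is self-contained
down.days <- if (exists("mk48.down")) mk48.down$down.days else rexp(9505, rate = 0.0158)

# 10,000 inverse means of samples of size 10, drawn without replacement
resamples <- replicate(10000, 1/mean(sample(down.days, size = 10)))
```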
d. Now, let’s plot a histogram of these 10,000 resamples and compare it to the
histogram that used the random exponentials in the last example:
par(mfrow=c(1,2))
hist(1/ee.m10)
hist(resamples)
They look pretty close, eh? But as we previously discussed, it’s actually pretty
hard for the human eye to distinguish differences between two or more histograms
this way, so let’s compare with a Q-Q plot:
qq2 <- qqplot(resamples, 1/ee.m10)
abline(lm(y ~ x, data = qq2))
Not a bad fit! It looks like there is just a bit of deviation in the right tails of the
distributions, but overall pretty darn good. So, it looks like we basically get the
same sampling distribution estimate whether we use a parametric or
nonparametric approach.
GROUP #___ EXERCISES
Members: ______________, ______________, ______________, ______________
1. Illustrate the CLT on sample totals using some very non-normal data. In
particular, draw random samples from a gamma distribution (see the rgamma()
function) with parameter shape = 2. First, using a normal probability plot,
demonstrate that the data is not normal.
a. Now, vary the sample size (n) for the sample total from very small (say 2)
to quite large (you choose).
i. The apply() and sum() functions will likely be useful if you first
create a matrix of gamma distributed data, as in the earlier
demonstration.
ii. Be sure to simulate enough samples that you get relatively smooth
histograms and/or normal probability plots.
b. As your output, create a single chart with a sequence of plots showing how
the normal approximation gets better and better as n gets large – that is, as
the CLT "kicks in."
i. The par(mfrow=c(a,b)) command will be useful for putting
multiple plots on one chart, where a is the number of rows and b is
the number of columns in the "matrix" of figures.
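As a starting skeleton for the matrix-plus-apply pattern described in the exercise (the number of replications and the sample sizes below are placeholders for you to adjust and extend):

```r
# 10,000 rows of gamma(shape = 2) draws; 30 columns allows totals up to n = 30
G <- matrix(rgamma(10000*30, shape = 2), nrow = 10000)

# Sample totals for n = 2 and n = 30; apply(..., 1, sum) sums across each row
tot2  <- apply(G[, 1:2], 1, sum)
tot30 <- apply(G, 1, sum)
```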
2. The coefficient of variation (CV) is a normalized measure of the dispersion of a
probability distribution. It is defined as the ratio of the standard deviation σ to the
mean μ: CV = σ/μ. Empirically estimate the CV sampling distribution for the
MK-48 LVS data for various sample sizes as follows.
a. Resample from the data 10,000 times, taking samples of size n.
b. For each resample, estimate the CV as the ratio of the sample standard
deviation to the sample mean: s/x̄. Note that the sample mean and
standard deviation are calculated on the same sample of data.
c. As your output, create a single chart with a sequence of plots showing
what happens as n gets bigger. Does the CLT "kick in" for the CV?
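For a single resample, the CV estimate is just the sample standard deviation over the sample mean. A minimal sketch with stand-in exponential data (with the real data, the sample would come from mk48.down$down.days instead):

```r
# One stand-in resample of size 10 from an exp(0.0158) population
x <- rexp(10, rate = 0.0158)

# Estimated coefficient of variation: s / x-bar
cv.hat <- sd(x) / mean(x)
```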
Name: _____________________________
INDIVIDUAL EXERCISES
1. Repeat the demonstration of the CLT in the lab on a discrete uniform distribution.
In particular, consider the uniform distribution on the integers from 1 to 6, which
would simulate a fair die.
(a) You can easily generate 300,000 observations from such a random variable in
R using the runif() function combined with the ceiling() function:
ceiling(runif(10000*30,0,6))
Here, runif(10000*30,0,6) generates 300,000 random observations
between 0 and 6 and the ceiling() function rounds each observation up to
the next higher integer, thereby simulating 300,000 rolls of a fair die.
(b) So, using the matrix() function, repeat the illustration of the CLT in item 4
of the demonstration, but using the discrete uniform distribution just
specified. Turn in a sequence of quantile-quantile plots showing the
progression of the sample mean towards normality for increasing sample
sizes.
2. For X ~ Gamma(2, 1), the mean is μ_X = E(X) = 2 and the variance is
Var(X) = σ_X² = 2. For the various sample sizes, empirically demonstrate that
for the total of n iid observations, T0 = X1 + X2 + ... + Xn, it follows that
E(T0) = n·μ_X = 2n and Var(T0) = n·σ_X² = 2n.
What do I mean by "empirically demonstrate" here? I mean that you should
simulate some data and from it some totals, then estimate the theoretical
quantities using an appropriate statistic on the totals, and then show that the
estimates get closer and closer to the theoretical quantities as the number of totals
is increased.
For example, choose an n, say n = 5. Now, using simulation, generate m totals,
each of which is the sum of five gamma random variables:
T0,i = X1,i + X2,i + X3,i + X4,i + X5,i.
Now, estimate E(T0) with Ê(T0) = T̄0 = (1/m) Σ_{i=1}^{m} T0,i and show that,
as you let m get large, Ê(T0) → 2n = 10.