As a general explanation of the notation P(X < x) or P(Z < z): the "P" stands for "probability," so P(X < x) is read as "the probability that the variable X takes a value less than x." For example, if X represents car speed and x is 58 (note the difference between upper and lower case), then P(X < 58) is read as "the probability of a car traveling less than 58 miles per hour." Likewise, P(Z < z) is "the probability of getting a Z-score less than some value z." If, say, car speeds followed a normal distribution with mean 60 and standard deviation 6, we would convert x = 58 to z = (58 - 60)/6 = -0.333, and then P(Z < -0.333), "the probability of getting a Z-score less than negative 0.333," could be found from the standard normal table (i.e. the Z table). A different inequality such as ">" has a similar interpretation; we simply substitute "greater than" for "less than."

This Sampling Distribution lesson is similar to the previous lesson for continuous random variables, where we find a z-score and then use the standard normal table (or software) to find probabilities. The difference this week is that we are finding z-scores for, say, a sample mean or a sample proportion, as opposed to last lesson, where our interest was a single observation. For example, last lesson we would have been interested in the probability that a single vehicle travels less than 58 mph, while in this lesson our interest would be the probability that the mean of a sample of 16 vehicles is less than 58 mph.

In that example, car speeds were assumed normal, or approximately normal, allowing us to use the table in both cases. However, if car speeds were NOT normal or approximately normal, then for the single vehicle we could not apply the z method at all, while for the sample mean we still could, provided the sample size were at least 30 (which is obviously not the case when discussing one vehicle, i.e. a sample size of one). At the end of the document I try to illustrate this with data and histograms.

Now if our data is categorical instead of quantitative (e.g. in the activity for this lesson, question 1 is about success/failure of a drug and thus categorical, while question 2 about car speed is quantitative), then our interest is not a mean but a proportion (i.e. the proportion of patients who experience "success"). With proportions we can still find probabilities by calculating z-scores provided certain conditions are met; our text gives them as n*p >= 15 AND n*(1-p) >= 15, and both conditions must be satisfied to apply the z method. NOTE: this cutoff varies; there is no established critical number accepted industry-wide. Some texts use 5 or 10 instead of 15. Since our book uses 15, so will we.

As you go through this lesson you may wonder, "What is the distinction between a standard deviation and a standard error?" The latter is simply the standard deviation of a sampling distribution. It is still a standard deviation, but it refers to the sampling distribution of some statistic, for example the distribution of the sample mean (where it works out to sigma/sqrt(n)) or of the sample proportion. Say I asked each of you to take a random sample from some large data set and calculate the sample mean and/or sample proportion for your sample.
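To make these ideas concrete, here is a minimal Python sketch of the car speed example above: the single-vehicle probability from last lesson, and the sample-mean probability from this lesson, which uses the standard error sigma/sqrt(n). The numbers (mean 60, standard deviation 6, n = 16) come straight from the example; using scipy in place of the printed Z table is my own choice, not part of the lesson.

```python
# A minimal sketch of both calculations, assuming (as in the example above)
# car speeds are normal with mean 60 mph and standard deviation 6 mph.
# scipy's norm.cdf plays the role of the standard normal table.
from math import sqrt
from scipy.stats import norm

mu, sigma = 60, 6

# Last lesson: P(X < 58) for a single randomly selected vehicle.
z_single = (58 - mu) / sigma        # (58 - 60) / 6 = -0.333
p_single = norm.cdf(z_single)       # like looking up -0.33 in the Z table

# This lesson: P(x-bar < 58) for the mean of a sample of n = 16 vehicles.
# The standard error of the mean is sigma / sqrt(n).
n = 16
se = sigma / sqrt(n)                # 6 / 4 = 1.5
z_mean = (58 - mu) / se             # (58 - 60) / 1.5 = -1.333
p_mean = norm.cdf(z_mean)

print(round(p_single, 3))           # approximately 0.369
print(round(p_mean, 3))             # approximately 0.091
```

Notice how much smaller the sample-mean probability is than the single-vehicle probability: averages vary less than individual observations, which is exactly what dividing by sqrt(n) in the standard error captures.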
These sample statistics would vary from student to student, unless someone's random sample were exactly the same as another student's, which is very unlikely. If all of your sample statistics were compiled into one spreadsheet, then the standard deviation of those sample statistics would be considered a standard error. The point of having the two terms is to differentiate what is being discussed: the standard deviation describes the raw data, while the standard error refers to the standard deviation of a sample statistic. By using the term standard error, we make clear that we are talking about the variability of the sample mean (or sample proportion) and not the variability of the raw data.

An attempt at illustrating the central limit theorem as it applies to the sample mean

Please see below a series of histograms with a normal bell curve overlaid to illustrate this sample mean concept. The first histogram is of 2010 baseball player salaries (in millions of dollars; e.g. a player making $8,500,000 is listed as $8.5). As you can see, the shape of these salaries is heavily skewed right (obscene amounts paid, no doubt!). From this data I randomly selected samples of three sizes, 15, 30, and 50, and calculated the mean of each sample, repeating the sampling process 1000 times for each size. For example, the software randomly selected 15 salaries, calculated their mean, put them back, randomly selected another 15 salaries, calculated their mean, and so on, 1000 times; this was then repeated for sample sizes of 30 and 50. At the end of this process there were 1000 sample means for samples of size 15, 1000 for size 30, and 1000 for size 50, and a histogram was drawn for each.

Notice how the histograms of the sample means are vastly different from the histogram of the individual salaries. Theoretically, the "acceptable" sample size at which the histogram becomes approximately normal is 30, and this is the driving point behind the central limit theorem: as the sample size increases, the distribution of the sample mean approaches a normal distribution regardless of the shape of the raw data.

Now consider this data as it applies to the last two lessons. For last lesson on Probability Distributions we could not use this data, as the salaries are obviously not even approximately normal. So if you were asked to find, e.g., the probability that a randomly selected player earned less than 2 million dollars, you could not answer using a z-score and the table, because the data does not satisfy being normal or approximately normal. However, in this lesson on Sampling Distributions, if you were asked to find the probability that the sample mean for 35 randomly selected players is less than 2 million dollars, then you could apply the z-score methods, since a sample size of 35 is at least 30 and thus satisfies the central limit theorem.

[Figure: histogram of the individual 2010 salaries and histograms of the 1000 sample means for N = 15, N = 30, and N = 50, each with a normal curve overlaid.]
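Here is a rough sketch of the resampling procedure just described. The 2010 salary data is not reproduced in this document, so a right-skewed lognormal variable stands in for the skewed salaries (that substitution is mine); the mechanics of drawing 1000 samples of each size and collecting the means are as described above.

```python
# Sketch of the resampling procedure described above. A right-skewed
# lognormal variable stands in for the skewed 2010 salary data.
import numpy as np

rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=0.5, sigma=1.0, size=800)  # stand-in "salaries"

print("SD of the raw data:", round(salaries.std(ddof=1), 3))

for n in (15, 30, 50):
    # Draw a sample of n salaries (without replacement), record its mean,
    # put the salaries back, and repeat 1000 times.
    means = np.array(
        [rng.choice(salaries, size=n, replace=False).mean() for _ in range(1000)]
    )
    # The SD of the 1000 sample means is an empirical standard error; it
    # shrinks roughly like sigma / sqrt(n), and a histogram of `means`
    # looks more and more bell-shaped as n grows.
    print(f"n = {n:2d}: SD of 1000 sample means =", round(means.std(ddof=1), 3))
```

Plotting a histogram of `means` for each n reproduces the pattern in the figure: the raw data is heavily skewed, but the distribution of the sample mean tightens and becomes approximately normal as the sample size increases.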
An attempt at explaining the central limit theorem as it applies to the sample proportion

Let us say that Election Day has come and gone. Leading up to that day, however, various polling groups tried to estimate what people were going to do. They do this using a random sample of registered voters, because asking everyone is too time-consuming and costly; plus, not all registered voters actually vote. Each sample produces a sample proportion; remember this is categorical data, as the polling company asks those sampled whom they are going to vote for and tallies the percentage of responses for each candidate. These random samples vary (i.e. they do not include the same people in each sample) and thus can lead to different sample proportions of people who say they will vote for Candidate X. You may recall that during the last presidential election the various polling agencies had different estimates of the proportion of people who were going to vote for then Senator Obama.

The standard error is then calculated as a "theoretical" estimate of the standard deviation of these various sample proportions. Think of, say, 1000 polling agencies each conducting a poll with the same sample size (say 3000 people), each randomly selecting its 3000 people from the same group of registered voters. Unless one poll had exactly the same sample of people as another, you would expect several different sample proportions; not every sample proportion need be different, but you certainly would not expect all 1000 polls to produce exactly the same one. If you put these 1000 sample proportions into a column in some software and calculated the standard deviation of that column, the result would represent the standard error of the sample proportion.

In reality, however, nobody has time to conduct 1000 polls; each agency does just one (e.g. Gallup conducts one poll, as does the New York Times, etc.). Instead, the theoretical standard error of the sample proportion, sqrt(p*(1-p)/n), is used, as long as the necessary assumptions are met (i.e. np and n(1-p) both being at least 15). Note that since we can only estimate p by p-hat, the sample proportion, we substitute p-hat into this equation. In any event, n*p-hat would be the number in the sample who said Yes to voting for Candidate X and n*(1-p-hat) the number who said they would not. The "p" represents the true proportion, what actually happens on Election Day (the final proportion who vote for Candidate X), while p-hat is the sample statistic, an estimate of what percent of the population will vote for Candidate X on Election Day.
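To tie the pieces together, here is a short sketch of the standard error calculation for a single poll, followed by the 1000-polling-agencies thought experiment. All the specific numbers (3000 respondents, 1620 Yes answers, a true p of 0.54) are made up for illustration.

```python
# Standard error of a sample proportion for one (hypothetical) poll:
# n = 3000 respondents, of whom 1620 say they will vote for Candidate X.
from math import sqrt
import numpy as np

n = 3000
p_hat = 1620 / n                        # sample proportion = 0.54

# The text's conditions, with p-hat substituted for the unknown p.
assert n * p_hat >= 15 and n * (1 - p_hat) >= 15

se = sqrt(p_hat * (1 - p_hat) / n)      # standard error of the proportion
print("p-hat =", p_hat, " SE =", round(se, 4))   # SE is about 0.0091

# The 1000-polling-agencies thought experiment: simulate 1000 polls of
# size 3000 where the true p is 0.54, and compare the SD of the 1000
# sample proportions to the theoretical standard error above.
rng = np.random.default_rng(1)
p_hats = rng.binomial(n, 0.54, size=1000) / n
print("SD of 1000 simulated sample proportions:", round(p_hats.std(ddof=1), 4))
```

The simulated standard deviation of the 1000 sample proportions comes out very close to the theoretical sqrt(p*(1-p)/n), which is exactly why a single poll can report a standard error without anyone actually running 1000 polls.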