SAMPLING: Process of Selecting your Observations (Masoud Hemmasi, Ph.D.)

• QUESTION: During presidential election campaigns, in a typical poll, of the roughly 100 million potential voters, how many would you say are contacted?
• History and Evolution of Political Polling

Types of Probability Sampling:
Simple (Unrestricted) Random Sampling
Complex (Restricted) Probability Sampling: Sometimes offers more efficient alternatives to Simple Random Sampling
a. Systematic Sampling
b. Stratified Random Sampling
c. Cluster Sampling
d. Convenience Sampling (a nonprobability technique)
e. Double Sampling

Simple Random (or Unrestricted) Sampling
A sampling procedure in which every element in the population has a known and equal chance of being selected as a subject (e.g., drawing names out of a hat).
Advantage: Has the least bias and offers the most generalizability.
Disadvantage: At times, can be inefficient/expensive.

Systematic Sampling
If a sample size of n is desired from a population containing N elements, we might sample one element for every N/n elements in the population. First, we randomly select one of the first N/n elements from the population list. We then select every (N/n)th element that follows in the population list. This method has the properties of a simple random sample, especially if the list of the population elements is a random ordering.
Advantage: The sample usually will be easier to identify than it would be if simple random sampling were used.
Example: Selecting every 100th listing in a telephone book after the first randomly selected listing.

Stratified Random Sampling
The population is first divided into groups called strata with respect to salient/relevant characteristics (e.g., gender, age, race, department, location, industry, etc.). Each element in the population belongs to one and only one stratum.
Best results are obtained when the elements within each stratum are as much alike as possible (i.e., a homogeneous group). A simple random sample is taken from each stratum.
Advantage: If strata are homogeneous, this method is as "precise" as simple random sampling but with a smaller total sample size.

Cluster Sampling
The population is first divided into separate groups called clusters. Ideally, each cluster would be a small-scale version (representative) of the population. A simple random sample of the clusters is then taken. All elements within each selected cluster will make up the final sample.
Example: A primary application is area sampling, where clusters are city blocks or other well-defined areas (neighborhoods, precincts, school districts, etc.).
Advantage: The close proximity of elements can be cost and time effective (i.e., many sample observations can be obtained in a short time).
Disadvantage: This method generally requires a larger total sample size than simple or stratified random sampling.

Convenience Sampling
It is a nonprobability sampling technique. Items are included in the sample without known probabilities of being selected. The sample is identified primarily by convenience.
Example: A professor conducting research might use student volunteers to constitute a sample.
Advantage: Sample selection and data collection are relatively easy.
Disadvantage: It is impossible to determine how representative of the population the sample is.

Sample Size Determination

Standard Deviation: What does it measure?
• Variations/differences in scores among members of a group with respect to a given characteristic (e.g., test scores for a class, income).
• Standard deviation represents the average distance of a group of numbers from their mean.
How do we calculate it? Hint: You can think of it as the average deviation from the norm/typical.
For a Population: $\sigma_x = \sqrt{\sum (X - \mu_x)^2 / N}$
For a Sample: $S_x = \sqrt{\sum (X - \bar{X})^2 / (n - 1)}$

Income level for a particular class like this:
Xs = Incomes of students in an MBA Class
$6,000 (Grad Assistants), $6,000, $15,000 (Part-Time Employed), $16,000, $39,000, $38,000 (Part-Time Employed), $50,000, $70,000
ΣX = $240,000
Average = $\bar{X}$ = $240,000 / 8 = $30,000

X          X - $\bar{X}$     (X - $\bar{X}$)^2
6,000      -24,000           576,000,000
6,000      -24,000           576,000,000
15,000     -15,000           225,000,000
16,000     -14,000           196,000,000
39,000       9,000            81,000,000
38,000       8,000            64,000,000
50,000      20,000           400,000,000
70,000      40,000         1,600,000,000
Sum (Σ):   240,000      0      3,718,000,000
Average = $\bar{X}$ = 30,000
Variance = $\sigma^2$ = 3,718,000,000 / 8 = 464,750,000
Std. Dev. = $\sigma$ = $21,558.06

Suppose the frequency distribution of the life of light bulbs is normal.
[Figure: normal frequency distribution of light-bulb life; horizontal axis X = hours (85 to 115), vertical axis = frequency; e.g., 3 bulbs lasted 108 hrs each]
X = life of light bulbs, $\mu_x$ = 100 hrs, $\sigma_x$ = 5 hrs
What can we say about the expected life of a randomly selected bulb ($X_i$)?
Life of a randomly drawn light bulb: $100 - 5Z \le X \le 100 + 5Z$
Formula: $X = \mu_x \pm Z\sigma_x$
Z = 1 for 68% confidence, Z = 1.96 for 95% confidence, Z = 3 for 99% confidence (a rounded-up value; the exact 99% value is about 2.58)
(Where Z is an index that reflects the level of confidence/certainty with which we wish to estimate X.)

Income Distribution for a hypothetical population
$1 $0 $3 $6 $2 $4 $7 $5 $9 $8
True Population Mean = $\mu$ = $\sum x_i / N$ = 45 / 10 = $4.5
Population Standard Deviation: $\sigma_x = \sqrt{\sum (x_i - \mu)^2 / N} = \sqrt{82.5 / 10}$ = 2.87
Income of a randomly drawn person ($X_i$) = ?

This formula: $X = \mu_x \pm Z\sigma_x$ is ONLY applicable when the population distribution is NORMAL.
What is the distribution of our hypothetical population?
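The two standard deviation figures above ($21,558.06 for the MBA class and 2.87 for the hypothetical population) can be re-derived with a few lines of Python. This is only a sketch for checking the arithmetic: the function name `population_std_dev` and the variable names are illustrative assumptions, not part of the slides, and the divide-by-N form is used because the worked table divides by 8.

```python
import math
import statistics

# Incomes of the eight MBA students, taken from the table above.
incomes = [6000, 6000, 15000, 16000, 39000, 38000, 50000, 70000]

# The hypothetical population: one person at each income from $0 to $9.
hypothetical = list(range(10))


def population_std_dev(values):
    """Square root of the mean squared deviation, dividing by N (hypothetical helper)."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / len(values))


mba_mean = sum(incomes) / len(incomes)
mba_sigma = population_std_dev(incomes)

hyp_mean = sum(hypothetical) / len(hypothetical)
hyp_sigma = population_std_dev(hypothetical)

# For contrast: the sample version divides by n - 1 and is slightly larger.
mba_sample_s = statistics.stdev(incomes)

print(f"MBA class:    mean = {mba_mean:,.0f}, sigma = {mba_sigma:,.2f}")
print(f"Hypothetical: mean = {hyp_mean}, sigma = {hyp_sigma:.2f}")
print(f"MBA class treated as a sample: S = {mba_sample_s:,.2f}")
```

The standard library's `statistics.pstdev` gives the same divide-by-N result, so the hand-written helper is only there to mirror the columns of the table.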
Distribution of the Hypothetical Population
[Figure: frequency plot of the hypothetical population, with one person at each income from $0 to $9 (Uniform Distribution)]

$X = \mu_x \pm Z\sigma_x$
NOTE that X is the $\bar{X}$ of a sample of size n = 1.
What is the generic formula for the mean ($\bar{X}$) of samples of any size (any n)? That is, what if instead of a single observation/case (X), we draw a random sample of a particular size from the population? Can we say something about the mean of that sample, $\bar{X}$?
• If (and only if) we know that our sample mean ($\bar{X}$) comes from a normally distributed population, the same formula can be modified and applied.
Rather than $X = \mu_x \pm Z\sigma_x$, use $\bar{X} = \mu_{\bar{x}} \pm Z\sigma_{\bar{x}}$, where $\sigma_{\bar{x}}$ is the Std. Error.
But, what does this statement mean?

Sampling Distribution = Frequency distribution of sample means
Sampling Distribution for Samples of Size n = 2 (from our earlier population)

Sample #    SAMPLE       MEAN ($\bar{X}$)
1           $0 & $1      0.5
2           $0 & $2      1.0
3           $0 & $3      1.5
...         ...          ...
10          $1 & $2      1.5
11          $1 & $3      2.0
12          $1 & $4      2.5
...         ...          ...
18          $2 & $3      2.5
19          $2 & $4      3.0
...         ...          ...
43          $7 & $8      7.5
44          $7 & $9      8.0
45          $8 & $9      8.5

45 possible samples of size n = 2, thus 45 possible sample means. The distribution of these 45 sample means is called the Sampling Distribution! See next slide!
$\sigma_{\bar{x}}$ = Standard Error is the standard dev. of these $\bar{X}$s.
Mean of all the 45 sample means = $\mu_{\bar{x}}$ = $\mu_x$ = 4.5 (i.e., the same as the mean of the original population).
So, the earlier statement means: if these sample means are normally distributed, we can use the related formula.

Sampling Distribution of Samples of Size n = 2
[Figure: histogram of the 45 sample means; e.g., $\bar{X}$ = ($0+$1)/2 = $0.50 occurs once, while $\bar{X}$ = $1.50 occurs twice: ($0+$3)/2 and ($1+$2)/2]

So, if we know that the distribution of our Sample Means (i.e., Sampling Distribution) is NORMAL, as shown below:
[Figure: normal curve of sample means ($\bar{X}$); vertical axis = frequency]
$\mu_{\bar{x}} = \mu_x$
$\sigma_{\bar{x}}$ = Standard Error = $\sigma_x / \sqrt{n}$
We will be able to say the following about the mean ($\bar{X}$) of a randomly selected sample:
$\bar{X} = \mu_{\bar{x}} \pm Z\sigma_{\bar{x}}$
Since $\mu_{\bar{x}} = \mu_x$, substitute $\mu_x$ for $\mu_{\bar{x}}$:
$\bar{X} = \mu_x \pm Z\sigma_{\bar{x}}$

QUESTION: What is the primary purpose of sampling?
Answer: To use sample characteristics (e.g., $\bar{X}$) as estimates of population characteristics (e.g., $\mu_x$).
What is the significance of this formula? $\bar{X} = \mu_x \pm Z\sigma_{\bar{x}}$
Answer: It shows the relationship between $\bar{X}$ and $\mu_x$. So, if $\bar{X}$ comes from a normal distribution, we can rewrite the formula to estimate $\mu_x$ based on the value of $\bar{X}$:
$\bar{X} = \mu_x \pm Z\sigma_{\bar{x}}$ becomes $\mu_x = \bar{X} \pm Z\sigma_{\bar{x}}$
Question: But, is the sampling distribution (i.e., distribution of $\bar{X}$) always normal (so that we can use the above formula)? Let's see it!

Distribution of Sample Means ($\bar{X}$s) for Different Population Distributions
[Figure: four differently shaped population distributions (n = 1), each followed by the distributions of sample means for larger n. Think of the n = 1 panels as the distribution of life of all individual light bulbs (X), and the others as the distribution of average life of samples of n light bulbs ($\bar{X}$).]

Conclusion? As n increases, the sampling distribution (i.e., distribution of $\bar{X}$s) will more and more resemble a normal distribution, so that, as a rule of thumb, for n > 30 the sampling distribution can be treated as normal, regardless of the distribution of the original population.

[Figure: the variable of interest X is NOT normally distributed (Distribution of Xs: Mean of Xs = $\mu_x$, Std. Dev. of Xs = $\sigma_x$). Samples of size n1 > 30, n2 > 30, n3 > 30 yield sample means $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$, ... The distribution of $\bar{X}$s for all samples of the same size is the Sampling Distribution: Mean of $\bar{X}$s = $\mu_x$, Std. Error = $\sigma_{\bar{x}}$ = $\sigma_x / \sqrt{n}$.]
The sampling distribution is guaranteed to be normal only when n ≥ 30 is used.

So, for samples of n ≥ 30:
$\mu_x = \bar{X} \pm Z\sigma_{\bar{x}}$
Standard Error = $\sigma_{\bar{x}} = \sigma_x / \sqrt{n}$
SO, $\mu_x = \bar{X} \pm Z\sigma_x / \sqrt{n}$
Now, let's examine the elements of this formula!
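Before looking at the formula's elements, the n = 2 sampling distribution tabulated earlier can be checked by brute force. The sketch below enumerates all 45 samples from the hypothetical $0 to $9 population and compares the mean and spread of the sample means with the slide formulas. The variable names are illustrative assumptions. One detail goes beyond the slides: because the 45 samples are drawn without replacement from a population of only 10, the observed spread matches $\sigma_x / \sqrt{n}$ only after multiplying by the finite population correction $\sqrt{(N - n)/(N - 1)}$; for large populations that factor is close to 1.

```python
import itertools
import math
import statistics

# Hypothetical population from the slides: incomes of $0 through $9.
population = list(range(10))
N = len(population)
n = 2  # sample size

# Every possible sample of size 2 (order ignored, no repeats): 45 of them.
samples = list(itertools.combinations(population, n))
sample_means = [sum(sample) / n for sample in samples]

mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population std. dev. (divide by N)

# Mean and standard deviation of the 45 sample means.
mean_of_means = statistics.mean(sample_means)
observed_std_error = statistics.pstdev(sample_means)

# Slide formula, which treats the population as effectively infinite.
std_error_simple = sigma / math.sqrt(n)

# Assumption beyond the slides: finite population correction for
# sampling without replacement from a small population.
std_error_corrected = std_error_simple * math.sqrt((N - n) / (N - 1))

print(f"number of samples:        {len(samples)}")
print(f"mean of sample means:     {mean_of_means}")
print(f"observed std. error:      {observed_std_error:.3f}")
print(f"sigma / sqrt(n):          {std_error_simple:.3f}")
print(f"with finite correction:   {std_error_corrected:.3f}")
```

The mean of the sample means equals the population mean exactly, which is the property the slides rely on when substituting $\mu_x$ for $\mu_{\bar{x}}$.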
1) We are interested in estimating $\mu_x$ from $\bar{X}$.
2) Estimation involves a margin of error; that is,
3) Actual Score = Estimate ± Margin of Error

$\mu_x = \bar{X} \pm Z\sigma_x / \sqrt{n}$
(Actual Score = $\mu_x$; Estimate = $\bar{X}$; Margin of Error = $Z\sigma_x / \sqrt{n}$, let's call it "E")

So, when using random samples of size n > 30, the margin of error in estimation would be:
$E = Z\sigma_x / \sqrt{n}$

Square both sides of the equation: $E^2 = Z^2\sigma_x^2 / n$
Rewrite it to solve for n: $n = Z^2\sigma_x^2 / E^2$
• $\sigma_x$ (population Std. Dev.) is often unknown. $S_x$ (Std. Dev. of a sample) is a reasonable estimate (substitute) for it.
• $S_x$ can be estimated based on previous studies or a pilot study.
$n = Z^2 S_x^2 / E^2$

Sample size required for estimating a population mean* ($\mu_x$):
$n = Z^2 S_x^2 / E^2$
n = Sample size required
E = Margin of error we are willing/able to tolerate in estimating the population characteristic (mean)
Z = An index reflecting the degree of confidence/certainty we wish to have in achieving the level of precision/accuracy represented by E above
S = An estimate of the Std. Dev. of the characteristic being estimated/studied
* The case of n for estimating a population proportion will be covered later.

An example: $n = Z^2 S^2 / E^2$
Suppose you were to use a random sample to estimate the average IQ of adult males. Suppose you know, from a pilot study, that the Std. Dev. of males' IQ is about 16 points. What size sample should you use if you wish to be 95% sure that your margin of error in estimating average IQ is no more than 3 points (that is, if you wish to be 95% sure that the estimate you will obtain from the sample would be within ±3 points of the actual/true average IQ of the adult male population)?
Z = ?  S = ?  E = ?
Z = 2 (1.96 rounded), S = 16, E = 3
$n = 2^2 (16)^2 / 3^2 = 1024 / 9 = 113.78$, round up = 114

Assuming worst case scenario when S is unknown: $n = Z^2 S^2 / E^2$
If no information is available on S, a common rule of thumb is to set S = ¼ of the Range (in a roughly normal population, about 95% of values fall within ±2 standard deviations of the mean, so the bulk of the range spans about 4 standard deviations).
An Example: Suppose we were to use a random sample to estimate the average IQ of adult males. Further suppose that we have absolutely no basis for determining the Std. Dev. of males' IQ. But, we know that the IQ of the overwhelming majority of adult males ranges between 80 and 120. What size sample should we use if we wish to be 99% sure that our margin of error in estimating the average IQ is no more than 2 points (that is, if we wish to be 99% sure that the estimate we will obtain from the sample would be within ±2 points of the actual/true average IQ of the adult male population)?
Range = 120 – 80 = 40
S = 40/4 = 10
Z = 3
E = 2
$n = 3^2 (10)^2 / 2^2 = 900 / 4 = 225$

Assessing Resulting Accuracy/Precision of the Estimates, Given a Particular Sample Size:
• Suppose we used a survey with lots of 7-point scale items,
• Collected data from 225 respondents, and
• Descriptive statistics on the data show that the typical Std. Dev. on most items/variables is in the 1.3 to 1.5 range.
• What can we say about the precision/accuracy of our results, say, with 95% confidence/certainty?
$n = Z^2 S^2 / E^2$, so $E^2 = Z^2 S^2 / n$ and $E = Z S / \sqrt{n}$
$E = 2 (1.5) / \sqrt{225} = 3 / 15 = 0.2$
We can be 95% certain that the sample mean for a typical variable is not off from the true population mean by more than two-tenths of a point (e.g., if the reported sample mean on a given variable is 4.7, we can be 95% sure that the actual population mean is between 4.5 and 4.9).

Sample size determination for estimating Proportions (p):
EXAMPLE: Projecting the percentage of people who would be voting for a particular candidate in a presidential election.
In such cases, dispersion is measured by pq (instead of variance, $s^2$), where
p = proportion of the population that is expected to have the attribute under study, and
q = (1 - p), the proportion of the population that is expected NOT to have that attribute.
So, the sample size formula will change to:
$n = Z^2 pq / E^2$
Or: $n = Z^2 p(1-p) / E^2$
NOTE: If we have no basis for judging the expected value of p, we can assume maximum variability (i.e., err on the side of overestimating the required sample size) by setting p = 0.50 (see the example on the next slide).

Sample size determination for Estimating Proportions: $n = Z^2 p(1-p) / E^2$
EXAMPLE: Suppose you are to project the percentage of potential voters who would be expected to vote for the Republican candidate in the upcoming presidential election. Suppose you have no basis for estimating/guessing what the percentage could possibly be. Also, suppose that you want to be 99% confident/certain that your margin of error would be no more than 3% (i.e., 99% certain that your projection/estimate will be within ±3 percentage points of the actual number). What size sample will you need?
$n = Z^2 p(1-p) / E^2$
Z = 3
p = 0.50
E = 0.03
$n = 3^2 (0.5)(0.5) / 0.03^2$
$n = 9 (0.25) / 0.0009 = 2500$

QUESTIONS OR COMMENTS?
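As a recap, the sample size and margin of error formulas from the preceding slides can be collected into a short Python sketch that reproduces the four worked examples. The function names are illustrative assumptions, not part of the slides, and the Z values are the deck's rounded ones (Z = 2 for 95%, Z = 3 for 99%). The small tolerance subtracted before rounding up is also an assumption, added only so that floating-point noise (for example 2500.0000000000005) does not push a whole-number answer up by one.

```python
import math


def _round_up(value, tolerance=1e-9):
    """Round up to the next whole respondent, ignoring floating-point noise."""
    return math.ceil(value - tolerance)


def sample_size_for_mean(z, s, e):
    """n = Z^2 * S^2 / E^2, for estimating a population mean."""
    return _round_up(z ** 2 * s ** 2 / e ** 2)


def sample_size_for_proportion(z, e, p=0.5):
    """n = Z^2 * p * (1 - p) / E^2; p = 0.5 assumes maximum variability."""
    return _round_up(z ** 2 * p * (1 - p) / e ** 2)


def margin_of_error(z, s, n):
    """E = Z * S / sqrt(n), the precision achieved by a given sample size."""
    return z * s / math.sqrt(n)


# IQ example with a pilot-study S of 16, 95% confidence, E = 3 points.
n_iq_pilot = sample_size_for_mean(z=2, s=16, e=3)

# IQ example with S taken as range/4 = 10, 99% confidence, E = 2 points.
n_iq_range = sample_size_for_mean(z=3, s=(120 - 80) / 4, e=2)

# Survey example: 225 respondents, S = 1.5, 95% confidence.
e_survey = margin_of_error(z=2, s=1.5, n=225)

# Election example: no prior guess for p, 99% confidence, E = 3%.
n_election = sample_size_for_proportion(z=3, e=0.03)

print(n_iq_pilot, n_iq_range, round(e_survey, 2), n_election)
```

Keeping the three formulas as separate functions makes it easy to see that they are the same relationship solved for different unknowns: n in the first two, E in the third.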