Chapter 7: Sampling and Sampling Distributions

Learning Objectives
LO1 Contrast sampling with a census, and differentiate among the methods of sampling (simple, stratified, systematic, and cluster random sampling; and convenience, judgment, quota, and snowball nonrandom sampling) by assessing the advantages associated with each.
LO2 Describe the distribution of a sample's mean using the central limit theorem, correcting for a finite population if necessary.
LO3 Describe the distribution of a sample's proportion using the z formula for sample proportions.

Reasons for Sampling
• Sampling is used to gather useful information about a population.
• Sampling can provide information in a timely and convenient form.
• Sampling can save time and money.
• For given resources, sampling is more efficient and can broaden the scope of a study.
• The research process sometimes requires destroying the product; sampling reduces the cost of that destruction.
• If accessing the entire population is impossible, sampling is the only option.

Reasons for Taking a Census
• When it is essential to eliminate the possibility that, by chance, a random sample is not representative of the population.
• When sampling errors could have fatal consequences, a census is required for the safety of the consumer.

Population Frame
• A frame is a list, map, directory, or other source used to identify and locate the members of the population.
• Examples: a school list, a trade association list, a telephone directory, or a list sold by list brokers.
• The frame should ideally be in one-to-one correspondence with the population, but there may be a gap due to over-registration or under-registration.
• Over-registration: the frame contains all members of the target population and some additional elements.
– Example: using the Bell Montreal telephone registry as a listing of residences with Bell telephones in Montreal.
• Under-registration: the frame does not contain all members of the target population.
– Example: using the chamber of commerce membership directory as the frame for a target population of all businesses.

Random Versus Nonrandom Sampling
• Random sampling
– A chance mechanism is used to select units of the population.
– Every unit of the population has the same probability of being included in the sample.
– Eliminates bias in the selection process
– Also called probability sampling
• Nonrandom sampling
– Not every unit of the population has the same probability of being included in the sample.
– Open to selection bias
– Not an appropriate data-gathering technique for most of the statistical methods presented in this text
– Also known as nonprobability sampling

Basic Random Sampling Techniques
• Four basic sampling techniques:
– Simple random sampling
– Systematic random sampling
– Stratified random sampling
– Cluster (or area) sampling

Simple Random Sample
• The most elementary sampling technique
• The basis for developing other random sampling methods
• Use a random number generator to select units.
• Random numbers: a sequence of numbers that lacks any pattern
• Number or code each frame unit from 1 to N.
• Easier to perform for small populations
• Cumbersome for large populations
• Seldom used in practice

Application of the Simple Random Sample Technique
• Uses a random number table or a random number generator.
• Each unit of the frame is numbered from 1 to N.
• Each unit of the frame has an equal chance of being selected for the sample.
• Use the random number table to select n distinct numbers between 1 and N, inclusive.
• Does not guarantee that the sample is representative of the population.

Simple Random Sampling: Random Number Generator Table

Simple Random Sample: Sample Members Selected
01 Acceleware Corp.
02 Apption Software
03 Auctionwire Inc.
04 Audability Inc.
05 b5media Inc.
06 Bond Consulting Group
07 Cadre Staffing Inc.
08 Direct Sales Force Inc.
09 Diversified Brands 2005 Inc.
10 Eagle Wake Ltd./Ticket Gold
11 EFT Canada Inc.
12 Filemobile Inc.
13 Hutton Forest Products Inc.
14 KMA Contracting Inc.
15 League Assets Corp.
16 Lettuce Eatery (Freshii Inc.)
17 LOGiQ3 Inc.
18 MedicLINK Systems Ltd.
19 Mortgagebrokers.com Holdings Inc.
20 Rapido Trains Inc.
21 Pacesetter Directional and Performance Drilling Ltd.
22 PrecisionERP Inc.
23 Scalar Decisions Inc.
24 Siamons International Inc.
25 Simcoe Canada Land Development Inc.
26 Stiris Research Inc.
27 Sweetspot.ca Inc.
28 TAG Recruitment Group Inc.
29 Unity Telecom Corp.
30 Vortex Mobile (Vortxt Interactive Inc.)
• Population size = N = 30
• Sample size = n = 6

Simple Random Sample: Numbered Population Frame
Use Excel's RANDBETWEEN function to generate a random sample of size 6.

Stratified Random Sample
• The population is divided into nonoverlapping subpopulations called strata.
• Internally, the subpopulations should be as homogeneous as possible.
• Externally, they should contrast with each other.
• A random sample is selected from each stratum.
• Potential for reducing sampling error
• Proportionate: the percentage of the sample taken from each stratum is proportionate to the percentage that each stratum represents within the population.
• Disproportionate: the proportions of the strata within the sample differ from the proportions of the strata within the population.

Stratified Random Sample: Population of FM Radio Listeners

Systematic Sampling
• Convenient and relatively easy to administer
• Population elements are an ordered sequence (at least conceptually).
• The first sample element is selected randomly from the first k population elements.
• Thereafter, sample elements are selected at a constant interval, k, from the ordered frame.

Problems With Systematic Sampling
• When used with an alphabetically ordered set, it is no better than simple random sampling and therefore does not guarantee representative samples.
• The sample becomes nonrandom when the data are subject to periodicity.

Systematic Sampling: Example
• Frame: Scott's National Manufacturers of Canada Directory, listing N = 105,000 manufacturers in alphabetical order
• Sample: n = 1,000
• k = 105,000/1,000 = 105
• The first sample element is randomly selected from the first 105 manufacturers.
• Assume the number 5 was selected from a random number table: the first element is the manufacturer coded 5.
• Subsequent sample elements are k + 5, 2k + 5, and so on: 110, 215, 320, . . . until 1,000 manufacturers are selected.

Cluster or Area Sampling
• The population is divided into nonoverlapping and internally heterogeneous clusters or areas.
• Each cluster is a miniature, or microcosm, of the population.
• A subset of the clusters is selected randomly for the sample.
• If the number of elements in the subset of clusters is larger than the desired value of n, these clusters may be subdivided to form a new set of clusters and subjected to a random selection process.
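
The simple random and systematic selection procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, Python's `random` module stands in for the random number table or Excel's RANDBETWEEN, and the frame is assumed to be numbered 1 to N.

```python
import random


def simple_random_sample(N, n, rng=random):
    """Select n distinct unit numbers from a frame numbered 1..N."""
    return sorted(rng.sample(range(1, N + 1), n))


def systematic_sample(N, n, start=None, rng=random):
    """Select every k-th unit from an ordered frame numbered 1..N.

    k = N // n is the sampling interval. The first unit is chosen at
    random from the first k units unless `start` is supplied.
    """
    k = N // n
    if start is None:
        start = rng.randint(1, k)
    return [start + i * k for i in range(n)]


# Simple random sample of n = 6 companies from the frame of N = 30.
print(simple_random_sample(30, 6))

# Directory example: N = 105,000 and n = 1,000 give k = 105.
# With a random start of 5, the selected units are 5, 110, 215, 320, ...
units = systematic_sample(105_000, 1_000, start=5)
print(units[:4], "...", units[-1])
```

Unlike RANDBETWEEN, which can return the same number twice, `random.sample` draws without replacement, so the n selected unit numbers are always distinct.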

Cluster Sampling
• Advantages
– More convenient for geographically dispersed populations
– Reduced travel costs to contact sample elements
– Simplified administration of the survey
– Usable when the unavailability of a sampling frame prohibits other random sampling methods
• Disadvantages
– Statistically less efficient when the cluster elements are similar
– Costs and problems of statistical analysis are greater than for simple random sampling.

Two-Stage Cluster Sampling
• In cluster sampling, the clusters are sometimes too large, and a second set of clusters is taken from each original cluster.
– This technique is called two-stage sampling.
– Canadian example: divide Canada into clusters of cities; then divide the cities into clusters of blocks; and randomly select individual houses from the block clusters.
• Advantages:
– Clusters are usually convenient to obtain.
– The cost of sampling the entire population is reduced because the scope of the study is reduced.

Nonrandom Sampling
• Convenience sampling: sample elements are selected for the convenience of the researcher.
• Judgment sampling: sample elements are selected by the judgment of the researcher.
• Quota sampling: sample elements are selected until the quota controls are satisfied.
• Snowball sampling: survey subjects are selected based on referral from other survey respondents.

Nonsampling Errors
• Data from nonrandom samples are not appropriate for analysis by inferential statistical methods.
• All errors other than sampling errors are nonsampling errors.
• Sampling error occurs when the sample is not representative of the population.
• Sampling errors are unavoidable and usually not measurable.
• Biases may be avoidable and are usually measurable.

Causes of Nonsampling Errors
– Missing data, recording, data entry, and analysis errors
– Poorly conceived concepts, unclear definitions, and defective questionnaires
– Response errors, which occur when people do not know, will not say, or overstate in their answers
– Virtually no statistical method exists to control for nonsampling errors. Diligence in planning and executing the survey is required.

Sampling Distribution of x̄
• Proper analysis and interpretation of a sample statistic requires knowledge of its distribution.
• The sample mean is one of the more common statistics used in inferential statistics, so its underlying probability distribution is central to the inferential process.

Distribution of a Small Finite Population
Suppose a small finite population consists of only N = 8 numbers: 54 55 59 63 64 68 69 70

Sample Space Generated by Taking Samples of Size n = 2 with Replacement

Figure: Excel-Produced Histogram of the 64 Sample Means for n = 2

Figure: Histogram of a Poisson-Distributed Population, λ = 1.25

Figure: Histogram of Sample Means for the Data in the Previous Slide

The Changing Shape of the Distribution of Sample Means Relative to the Sample Size n
• The previous slides illustrate that as the sample size n increases and as the number of samples increases, the histogram of sample means becomes more symmetric and smoother.
• The next set of slides demonstrates this for sampling from a population that has a uniform distribution with a = 10 and b = 30.
• Note that even for small sample sizes, the distribution of sample means begins to pile up in the middle.
• General rule: as sample sizes become much larger, the distribution of sample means approaches a normal distribution and the variation among the means decreases.
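
The behaviour described above can be checked numerically. The sketch below is an illustration rather than the slides' own Excel workflow: it enumerates all 64 samples of size n = 2 drawn with replacement from the eight-number population, then simulates 60 sample means from a uniform distribution on [10, 30] for several sample sizes. The helper name `spread_of_sample_means` and the fixed seed are assumptions made for the example.

```python
import random
import statistics
from itertools import product

population = [54, 55, 59, 63, 64, 68, 69, 70]

# All 8 * 8 = 64 ordered samples of size n = 2 drawn with replacement.
sample_means = [statistics.mean(pair) for pair in product(population, repeat=2)]

print(len(sample_means))                   # number of possible samples
print(statistics.mean(population))         # population mean
print(statistics.mean(sample_means))       # mean of the sample means
print(statistics.pvariance(population))    # population variance
print(statistics.pvariance(sample_means))  # variance of the sample means


def spread_of_sample_means(n, num_samples=60, a=10, b=30, seed=0):
    """Standard deviation of num_samples simulated sample means,
    each computed from n draws of a uniform distribution on [a, b]."""
    rng = random.Random(seed)
    means = [
        statistics.mean([rng.uniform(a, b) for _ in range(n)])
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)


# The variation among the sample means shrinks as n grows.
for n in (2, 5, 30):
    print(n, round(spread_of_sample_means(n), 2))
```

The enumeration shows that the mean of the sample means equals the population mean, while their variance is the population variance divided by n = 2.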

Figure: Means of 90 Samples (n = 2 to n = 30) from a Uniform Distribution

Figure: 1,800 Randomly Selected Values from a Uniform Distribution

Figure: Means of 60 Samples (n = 2) from a Uniform Distribution

Figure: Means of 60 Samples (n = 5) from a Uniform Distribution

Figure: Means of 60 Samples (n = 30) from a Uniform Distribution

Central Limit Theorem
• If samples of size n are drawn randomly from a population with mean μ and standard deviation σ, the sample means x̄ are approximately normally distributed for sufficiently large sample sizes∗, regardless of the shape of the population distribution.
• The mean of the sample means is μ, and their standard deviation is σ/√n.
∗ Note that the central limit theorem itself does not specify what a "large sample size" is. As a guideline, it is assumed to be 30 or more, although this does not follow from the central limit theorem itself. The derivations are beyond the scope of this text and are not shown.

Figure: Shapes of the Distribution of Sample Means for 3 Sample Sizes and the Normal and Uniform Distributions

Figure: Distribution of Sample Means for 3 Sample Sizes and the U-Shaped and Normal Distributions

Sampling from a Normal Population
• The distribution of sample means is normal for any sample size.

Z Formula for Sample Means
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Tire Store Example in Figure 7.6

Graphic Solution to the Tire Store Example

Demonstration Problem 7.1
For this problem, μ = 448, σ = 21, and n = 49. The problem is to determine P(441 ≤ x̄ ≤ 446). The following diagram depicts the problem.

Sampling from a Finite Population without Replacement
• In this case, the standard deviation of the distribution of sample means is smaller than when sampling from an infinite population (or from a finite population with replacement).
• The correct value of this standard deviation is computed by applying a finite correction factor to the standard deviation for sampling from an infinite population.
• If the sample size is less than 5% of the population size, the adjustment is unnecessary.
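
As a rough check on Demonstration Problem 7.1, the probability can be computed directly instead of read from a z table. The sketch below assumes the standard z formula for sample means, z = (x̄ − μ)/(σ/√n), and the standard finite correction factor √((N − n)/(N − 1)). The function names are hypothetical, and the finite population size N = 500 in the last lines is an invented value used only to show the effect of the correction.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def standard_error(sigma, n, N=None):
    """Standard deviation of the sample mean.

    If a finite population size N is given, the finite correction
    factor sqrt((N - n) / (N - 1)) is applied.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se


def prob_mean_between(low, high, mu, sigma, n, N=None):
    """P(low <= sample mean <= high) under a normal sampling distribution."""
    se = standard_error(sigma, n, N)
    return normal_cdf((high - mu) / se) - normal_cdf((low - mu) / se)


# Demonstration Problem 7.1: mu = 448, sigma = 21, n = 49.
p = prob_mean_between(441, 446, mu=448, sigma=21, n=49)
print(round(p, 4))

# Hypothetical finite population of N = 500 (not from the slides):
# the corrected standard error is smaller than sigma / sqrt(n) = 3.
print(round(standard_error(21, 49, N=500), 3))
```

Because the function works with unrounded z values, its result can differ in the third decimal place from an answer obtained with a printed z table.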

Sampling from a Finite Population
• Finite correction factor: $\sqrt{\frac{N - n}{N - 1}}$
• Modified z formula: $z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$

Table: Finite Correction Factor for Selected Sample Sizes

Sampling Distribution of p̂
• If research produces countable rather than measurable items, such as the frequency with which an attribute occurs, the sample proportion is often the statistic of choice.
• Example: Take samples of 3 with replacement from a group of 5 things. The total number of possible samples is 2^5 = 32. If there are only two attributes or countable outcomes possible (defective and non-defective), each sample has a certain proportion of things that are defective or non-defective.
• There will be 32 possible proportions. As in the case of the means of measurable outcomes, these 32 proportions have a distribution, with parameters that differ from those of the original population.
• The sample proportion: $\hat{p} = \frac{x}{n}$, where x is the number of items in the sample that have the characteristic and n is the sample size.
• The sampling distribution is approximately normal if n ∙ p > 5 and n ∙ q > 5, where p is the population proportion and q = 1 − p.
• The mean of the sample proportions for all samples of size n randomly drawn from a population is p (the population proportion), and the standard deviation of the sample proportions is $\sqrt{\frac{p \cdot q}{n}}$, which is sometimes referred to as the standard error of the proportion.

Z Formula for Sample Proportions
$z = \frac{\hat{p} - p}{\sqrt{\frac{p \cdot q}{n}}}$

Solution for Demonstration Problem 7.3

COPYRIGHT
Copyright © 2014 John Wiley & Sons Canada, Ltd. All rights reserved. Reproduction or translation of this work beyond that permitted by Access Copyright (The Canadian Copyright Licensing Agency) is unlawful. Requests for further information should be addressed to the Permissions Department, John Wiley & Sons Canada, Ltd. The purchaser may make back-up copies for his or her own use only and not for distribution or resale.