CHAPTER 5: DATA COLLECTION AND SAMPLING

Outline
• Population and sample
• Sources of data
• Sampling
• Sampling plans
  – Simple random sampling
  – Stratified random sampling
  – Cluster sampling

POPULATION AND SAMPLE
• Parameter: a summary measure about a population, usually unknown or known from some published sources
• Statistic: a summary measure about a sample

SOURCES OF DATA
• Primary data:
  – Data published (in printed form, on data tapes, disks, and the internet) by the same organization that collected the data
  – Some government agencies: http://www.census.gov/ http://www.statcan.ca/
  – Sample data available from the Statistics Canada website
• Secondary data:
  – Data published by an organization different from the one that originally collected and published the data
  – A popular source of secondary data is the Statistical Abstract of the United States
• Observational and experimental studies:
  – Observational study: data are collected and recorded without controlling any factor, unlike in an experimental study
  – Experimental study: if more than one factor may cause the same outcome, it may be desirable to vary one factor at a time and control (keep unchanged) the other factors, e.g.,
    • Aircraft primer paints are applied to improve the adhesion force of the finish paint, which depends on
      – primer application method: dipping and spraying
      – type of primer paint: types 1, 2, 3
    • An experiment was designed in which
      – three specimens were painted with each primer using each application method, a finish paint was applied, and the adhesion force was measured.
The resulting data are shown below:

Adhesion Force Data
Primer Type   Dipping          Spraying
1             4.0, 4.5, 4.3    5.4, 4.9, 5.6
2             5.6, 4.9, 5.4    5.8, 6.1, 6.3
3             3.8, 3.7, 4.0    5.5, 5.0, 5.0

• Surveys:
  – Personal interview
  – Telephone interview
  – Questionnaire survey

SAMPLING
• Target population
  – The population about which inference is desired
• Sampled population
  – The actual population from which the sample has been taken
• Self-selected samples
  – The respondents themselves choose to mail or call in their responses
  – Such samples are usually biased

SAMPLING PLANS
• Simple random sampling
• Stratified random sampling
• Cluster sampling

SIMPLE RANDOM SAMPLING
• Suppose we have data about the annual incomes of 40 families in a spreadsheet file RANDSAMP.XLS.
• We want to choose a simple random sample of size 10 from this frame.
• How can this be done?
• And how do summary statistics of the chosen families compare to the corresponding summary statistics of the population?

The family income data are shown on the right.

• A simple random sample is a sample in which the sampling units are chosen from the population by means of a random mechanism, such as a random number table, so that every possible sample with the same number of observations is equally likely to be chosen.
• For example, let Sample 1 consist of families 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and Sample 2 consist of families 1, 2, 3, 4, 5, 6, 7, 8, 9, 11. If a simple random sample is chosen, then Samples 1 and 2 are equally likely to be chosen.
• Solution: The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows.
• The following procedure produces the results.
  – Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C.
  – Replace with values.
    To enable sorting we must “freeze” the random numbers, that is, replace their formulas with values. To do this, select the range C10:C49, use Edit/Copy, and then use Edit/Paste Special with the Values option.
  – Copy to a new range. Copy the range A10:C49 to the range E10:G49.
  – Sort. Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. The 10 families with the 10 smallest random numbers are then the ones in the sample.
  – Means. Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F.

The results of all the operations are shown on the right.

STRATIFIED RANDOM SAMPLING
• Suppose we can identify various sub-populations within the total population. We call these sub-populations strata.
• It then makes sense to select a simple random sample from each stratum instead of from the entire population. This is called stratified sampling.
• This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum.
• To obtain a stratified random sample we must choose a total sample size n, and we must choose a sample size ni for each stratum i.
• There are many ways to choose these numbers, but the most popular method is proportional sample sizes, where each ni is proportional to the size of stratum i.
• The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata.
• Sears has data on all 1000 people in the city of Smalltown who have Sears credit cards.
• Sears is interested in estimating the average number of other credit cards these people own, as well as other information about their use of credit.
• The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone.
• First, Sears must decide exactly how to stratify by age.
• The reasoning is that different age groups probably have different attitudes and behavior regarding credit.
• After a preliminary investigation they decide to have three age categories: 18-30, 31-62, and 63-80.
• The number of customers in each category is as follows:

Category    Number of Customers
18 to 30    132
31 to 62    766
63 to 80    102
Total       1000

• In stratified random sampling with proportional sample sizes, the total sample size of 100 is distributed among the 3 categories as follows:

Category    Number of Customers    Sample Size
18 to 30    132                    132*100/1000 ≈ 13
31 to 62    766                    766*100/1000 ≈ 77
63 to 80    102                    102*100/1000 ≈ 10
Total       1000                   100

CLUSTER SAMPLING
• Suppose a company is interested in various characteristics of households in a particular city. The sampling units are households.
• We could proceed with the sampling methods already discussed, but another approach would be more convenient.
• We could divide the city into city blocks, treat the blocks as sampling units, and then sample all the households in the chosen blocks.
• In this case the city blocks are called clusters and the sampling is called cluster sampling.
• The advantage of cluster sampling is sampling convenience (and possibly lower cost).
• It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster.
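
The cluster sampling procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name cluster_sample, the block labels, and the household IDs are all hypothetical and do not come from the chapter, which works in a spreadsheet rather than in code.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick a simple random sample of clusters, then keep every member
    of each chosen cluster (one-stage cluster sampling)."""
    rng = random.Random(seed)
    # The sampling units are the clusters themselves (here, city blocks).
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every household in a chosen block enters the sample.
    return {block: list(clusters[block]) for block in chosen}


# Hypothetical frame: six city blocks, each listing its household IDs.
city_blocks = {
    "block_A": [101, 102, 103],
    "block_B": [201, 202],
    "block_C": [301, 302, 303, 304],
    "block_D": [401],
    "block_E": [501, 502, 503],
    "block_F": [601, 602],
}

sample = cluster_sample(city_blocks, n_clusters=2, seed=42)
for block, households in sample.items():
    print(block, households)
```

Because whole blocks are selected, the number of households in the sample depends on which blocks happen to be chosen. That variable sample size is the price of the convenience noted above.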