Sampling Design

Sampling
• The process of obtaining information from a subset (sample) of a larger group (population)
• The results for the sample are then used to make estimates about the larger group
• Faster and cheaper than asking the entire population

Two keys
1. Selecting the right people
   – They have to be selected scientifically so that they are representative of the population
2. Selecting the right number of the right people
   – To minimize sampling error, i.e. choosing the wrong people by chance

Sampling
• Sample: contacting a portion of the population (e.g., 10% or 25%)
   – best with a very large population (N)
   – easiest with a homogeneous population
• Census: contacting the entire population
   – most useful if the population (N) is small
   – or if the cost of making an error is high

Population vs. Sample
• The population of interest is described by parameters; the sample drawn from it is described by statistics.
• We measure the sample using statistics in order to draw inferences about the parameters of the population.

Characteristics of Good Samples
• Representative
• Accessible
• Low cost

[Figure: diagrams of a sample drawn from a population, one labelled "bad" and one labelled "VERY bad"]

Terminology
Population: The entire group of people of interest from whom the researcher needs to obtain information.
Element (sampling unit): One unit from a population
Sampling: The selection of a subset of the population
Sampling Frame: The listing of the population from which a sample is chosen
Census: A polling of the entire population
Survey: A polling of the sample

Terminology
Parameter: The variable of interest
Statistic: The information obtained from the sample about the parameter
Goal: To be able to make inferences about the population parameter from knowledge of the relevant statistic, i.e. to draw general conclusions about the entire body of units
Critical Assumption: The sample chosen is representative of the population

Steps in Sampling Process
1. Define the population
2. Identify the sampling frame
3. Select a sampling design or procedure
4. Determine the sample size
5. Draw the sample

Sampling Design Process
• Define Population
• Determine Sampling Frame
• Determine Sampling Method
   – Non-Probability Sampling: Convenience, Judgmental, Quota
   – Probability Sampling: Simple Random Sampling, Stratified Sampling, Cluster Sampling
• Determine Appropriate Sample Size
• Execute Sampling Design

1. Define the Target Population
Question: “Who, ideally, do you want to survey?”
Answer: Those who have the information sought.
• What are their characteristics?
• Who should be excluded?
   – age, gender, product use, those in the industry
   – geographic area
It involves
   – defining population units
   – setting population boundaries
   – screening (e.g. security questions, product use)

Element: individuals; families; seminar groups
Sampling unit: individuals over 20; families with 2 kids; seminar groups at a “new” university
Extent: individuals who have bought “one”; families who eat fast food; seminar groups doing MR
Timing: bought over the last seven days

The target population for a toy store can be defined as all households with children living in Calgary. What’s wrong with this definition?

2.
Determine the Sampling Frame
Obtaining a “list” of the population (how will you reach the sample?)
Students who eat at McDonald's?
• young people at random in the street?
• phone book
• students' union listing
• university mailing list
Problems with lists
• omissions
• ineligibles
• duplications
Procedures
• E.g. individuals who have spent two or more hours on the internet in the last week

Select “sample units”
• Individuals
• Households
• Streets
• Companies

3. Selecting a Sampling Design
Probability sampling: every unit has a known, nonzero chance of being included in the sample (random)
   – simple random sampling
   – systematic sampling
   – stratified sampling
   – cluster sampling
Non-probability sampling: the chance of being included in the sample is unknown, or zero for some units (non-random)
   – convenience sampling
   – judgement sampling
   – snowball sampling
   – quota sampling

Probability Sampling
• An objective procedure in which the probability of selection is nonzero and is known in advance for each population unit; also called random sampling
• Aims to obtain information from a representative sample of the population
• Sampling error can be computed
• Survey results can be projected to the population
• More expensive than non-probability samples

Simple Random Sampling (SRS)
• Population members are selected directly from the sampling frame
• Equal probability of selection for every member (sample size / population size), e.g. 400/10,000 = .04
• Use a random number table or random number generator

Simple Random Sampling
N = the number of cases in the sampling frame
n = the number of cases in the sample
f = n/N = the sampling fraction
NCn = the number of combinations (subsets) of n from N
If you have a sampling frame of the 10,000 full-time students at the U of L and you want to survey 1 percent of them (f = .01), how would you do it?

3.
Selecting a Sampling Design
Objective: To select n units out of N such that each of the NCn possible samples has an equal chance of being selected
Procedure: Use a table of random numbers, a computer random number generator, or a mechanical device to select the sample

Systematic Sampling
• Order all units in the sampling frame based on some variable and number them from 1 to N
• Choose a random starting place from 1 to k and then sample every kth unit after that
Systematic random sample
1. Number the units in the population from 1 to N
2. Decide on the n (sample size) that you want or need
3. k = N/n = the interval size
4. Randomly select an integer between 1 and k
5. Then take every kth unit

Stratified Sampling (I)
The chosen sample contains a number of distinct categories which are organized into segments, or strata
   – equalizing "important" variables
      • year in school, geographic area, product use, etc.
Steps:
   – The population is divided into mutually exclusive and exhaustive strata based on an appropriate population characteristic (e.g. race, age, gender, etc.)
   – Simple random samples are then drawn from each stratum.

Stratified Random Sampling
• The sample size is usually proportional to the relative size of the strata.
• Ensures that particular groups (e.g. males and females) within a population are adequately represented in the sample
• Has a smaller sampling error than a simple random sample, since a source of variation is eliminated

Stratified Sampling (II)
Direct Proportional Stratified Sampling
   – The sample size in each stratum is proportional to the stratum size in the population
Disproportional Stratified Sampling
   – The sample size in each stratum is NOT proportional to the stratum size in the population
   – Used if
      1) some strata are too small
      2) some strata are more important than others
      3) some strata are more diversified than others

3.
Selecting a Sampling Design
Cluster Sampling
• The population is divided into mutually exclusive and exhaustive subgroups, or clusters, usually based on geography or time period
• Each cluster should be representative of the population, i.e. heterogeneous within itself; the means across clusters should be the same (homogeneous)
• A sample of the clusters is then selected, and some randomly chosen units in the selected clusters are studied.
Cluster or area random sampling
1. Divide the population into clusters (usually along geographic boundaries)
2. Randomly sample clusters
3. Measure units within the sampled clusters

When to use stratified sampling
• If the primary research objective is to compare groups
• Using stratified sampling may reduce sampling errors
When to use cluster sampling
• If there are substantial fixed costs associated with each data collection location
• When there is a list of clusters but not of individual population members

Non-Probability Sampling
• A subjective procedure in which the probability of selection for some population units is zero or unknown before drawing the sample
• Information is obtained from a non-representative sample of the population
• Sampling error cannot be computed
• Survey results cannot be projected to the population
Advantages
• Cheaper and faster than probability sampling
• Reasonably representative if collected in a thorough manner

Types of Non-Probability Sampling (I)
Convenience Sampling
   – A researcher's convenience forms the basis for selecting a sample.
      • people in my classes
      • mall intercepts
      • people with some specific characteristic (e.g. bald)
Judgement Sampling
   – A researcher exerts some effort in selecting a sample that seems to be most appropriate for the study.

Types of Non-Probability Sampling
Snowball Sampling
   – Selection of additional respondents is based on referrals from the initial respondents.
      • friends of friends
   – Used to sample from low-incidence or rare populations.
Quota Sampling
   – The population is divided into cells on the basis of relevant control characteristics.
   – A quota of sample units is established for each cell.
      • e.g. 50 women, 50 men
   – A convenience sample is drawn for each cell until the quota is met (similar to stratified sampling).

Quota Sampling
Let us assume you wanted to interview tourists coming to a community to study their activities and spending. Based on national research you know that 60% come for vacation/pleasure, 20% are VFR (visiting friends and relatives), 15% come for business and 5% for conventions and meetings. You also know that 80% come from within the province, 10% from other parts of Canada, and 10% are international. A total of 500 tourists are to be intercepted at major tourist spots (attractions, events, hotels, convention centre, etc.), as you would in a convenience sample. The number of interviews could therefore be determined based on the proportion a given characteristic represents in the population. For instance, once 300 pleasure travellers have been interviewed, this category would no longer be pursued, and only those who state that one of the other purposes was their reason for coming would be interviewed until these quotas were filled.

              Alberta   Other Canada   International   Totals
Pleasure      .48       .06            .06             .60
Visiting      .16       .02            .02             .20
Business      .12       .015           .015            .15
Convention    .04       .005           .005            .05
Totals        .80       .10            .10             1.00

Probability vs. Non-Probability Sampling
Disadvantages of non-probability sampling
• The probability of selecting one element over another is not known, and therefore the estimates cannot be projected to the population with any specified level of confidence.
• Quantitative generalizations about the population can only be made under probability sampling.
• In practice, however, marketing researchers also apply statistics to study non-probability samples.
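The cell quotas in the tourist example above can be worked out mechanically. The following is a minimal sketch, not part of the original slides: the function name `quota_cells` and the rounding rule (round down, then give the leftover interviews to the cells with the largest fractional parts) are assumptions made for illustration. It also assumes, as the table of cell proportions does, that trip purpose and origin are independent, so each cell share is the row share times the column share.

```python
from fractions import Fraction

# Marginal shares (in percent) from the tourist example above.
PURPOSE = {"Pleasure": 60, "Visiting": 20, "Business": 15, "Convention": 5}
ORIGIN = {"Alberta": 80, "Other Canada": 10, "International": 10}


def quota_cells(total, purpose=PURPOSE, origin=ORIGIN):
    """Return {(purpose, origin): number of interviews}, summing to total.

    Hypothetical helper: cell share = purpose share x origin share,
    which assumes the two characteristics are independent.
    """
    # Exact (possibly fractional) number of interviews per cell.
    exact = {
        (p, o): Fraction(total * p_pct * o_pct, 10_000)
        for p, p_pct in purpose.items()
        for o, o_pct in origin.items()
    }
    # Start from the rounded-down count in every cell.
    quotas = {cell: int(share) for cell, share in exact.items()}
    # Give the leftover interviews to the cells with the largest
    # fractional parts (ties go to the cell listed first).
    leftover = total - sum(quotas.values())
    by_remainder = sorted(
        exact, key=lambda cell: exact[cell] - quotas[cell], reverse=True
    )
    for cell in by_remainder[:leftover]:
        quotas[cell] += 1
    return quotas


quotas = quota_cells(500)
pleasure_total = sum(n for (p, _), n in quotas.items() if p == "Pleasure")
print(quotas[("Pleasure", "Alberta")], pleasure_total, sum(quotas.values()))
# prints: 240 300 500
```

With 500 interviews, the small cells (.015 and .005) work out to 7.5 and 2.5 interviews, so some rounding rule is unavoidable; whichever rule is chosen, the row and column totals of those cells can shift by one interview relative to the exact marginal quotas.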
Generalization
• You can only generalize to the population from which you sampled
   – U of L students, not university students in general
      • geographic, different majors, different jobs, etc.
   – University students, not the Canadian population
      • younger, poorer, etc.
   – Canadians, not people everywhere
      • less traditional, more affluent, etc.

Drawing inferences from samples
• Population estimates
   – % who smoke, buy your product, etc.
      • 25% of sample
      • what % of population?
   – very dangerous with a non-representative sample or with low response rates

Errors in Survey
Random Sampling Error
   – random error: the sample selected is not representative of the population due to chance
   – its level is controlled by sample size
   – a larger sample size leads to a smaller sampling error
Population mean (μ) gross income = $42,300
Sample 1 (400/250,000) mean (x̄) = $41,100
Sample 2 (400/250,000) mean (x̄) = $43,400
Sample 3 (400/250,000) mean (x̄) = $36,400

Non-Sampling Errors (I)
Non-sampling Error
   – systematic error
   – its level is NOT controlled by sample size
The basic types of non-sampling error
   – Non-response error
   – Response or data error
A non-response error occurs when units selected as part of the sampling procedure do not respond in whole or in part
   – If non-respondents are not different from those that did respond, there is no non-response error

Non-Sampling Errors (II)
A response or data error is any systematic bias that occurs during data collection, analysis or interpretation
   – Respondent error (e.g., lying, forgetting, etc.)
   – Interviewer bias
   – Recording errors
   – Poorly designed questionnaires

Data Preparation
Steps in Data Preparation
• Editing
• Coding
• Entering Data
• Data Tabulation
• Reviewing Tabulations
• Statistically adjusting the data (e.g. weighting)

Editing
Carefully checking survey data for
• Completeness (no omissions)
• Non-ambiguity (e.g. two boxes checked instead of one)
• Right informant (e.g. under age, when all are supposed to be over 18)
• Consistency, e.g.
charging something on a credit card when the person does not own a credit card
• Accuracy (e.g. no numbers out of range)
The most important purpose is to eliminate, or at least reduce, the number of errors in the raw data.
Solutions
1. Ideally, re-interview the respondent
2. Eliminate all unacceptable surveys (casewise deletion), if the sample is large and few surveys are unacceptable
3. In calculations, consider only the cases with complete responses (pairwise deletion); this means that some statistics will be based on different sample sizes
4. Code illegible or missing answers into a “no valid response” category
5. Substitute a neutral value, typically the mean response to the variable, so that the mean remains unchanged

Coding
• The process of systematically and consistently assigning each response a numerical score.
• The key to a good coding system is for the coding categories to be mutually exclusive and the entire system to be collectively exhaustive.
• To be mutually exclusive, every response must fit into only one category.
• To be collectively exhaustive, all possible responses must fit into one of the categories.
• Exhaustive means that you have covered the entire range of the variable with your measurement.

Coding
• Coding Missing Numbers: When respondents fail to complete portions of the survey.
   – Whatever the reason for incomplete surveys, you must indicate that there was no response provided by the respondent.
   – For single-digit responses code as “9”; for two-digit responses code as “99”
• Coding Open-Ended Questions: When open-ended questions are used, you must create categories.
   – All responses must fit into a category
   – Similar responses should fall into the same category
   e.g. Who services your car? ______________
   Possible categories: self, garage, husband, wife, friend, relative, etc.
• To make it collectively exhaustive, add an “other” or “none of the above” category
   – Only a few, i.e.
< 10% should fit into this category

Precoded Questionnaires: Sometimes you can place codes on the actual questionnaire, which simplifies data entry.

This…
Are you:   Male   Female
How satisfied are you with our product?
___Very Satisfied ___Somewhat Satisfied ___Somewhat Dissatisfied ___Very Dissatisfied ___No opinion

Becomes this…
Are you:   (1) Male   (2) Female
How satisfied are you with our product?
_1__Very Satisfied _2__Somewhat Satisfied _3__Somewhat Dissatisfied _4__Very Dissatisfied _5__No opinion

1. Are you solely responsible for taking care of your automotive service needs? ___ Yes ___ No
2. If No, who performs the simple maintenance? ___________
3. If scheduled maintenance is done on your automobile, how do you keep track of what has been done?
   • Not tracked
   • Auto dealer records
   • Mental recollection
   • Other
4. How often is your automobile serviced?
   • Once per month
   • Once every three months
   • Once every six months
   • Once per year
   • Other _______________

Code Book
Col. No. | Question No. | Question Des. | Range of permissible values
1 | ID # | N/A | 001-200 (this also means the surveys themselves should be numbered)
2 | 1 | Responsible for maintenance | 0 = no, 1 = yes, 9 = blank
3 | 2 | Perform simple maintenance | 0 = husband, 1 = boyfriend, 2 = father, 3 = mother, 4 = relative, 5 = friend, 6 = other, 9 = blank
4 | 3 | How maintenance tracked | 0 = not tracked, 1 = auto dealer records, 2 = personal records, 3 = mental recollection, 4 = other, 9 = blank
5 | 4 | How often maintenance performed | 1 = once per month, 2 = 3 months, 3 = 6 months, 4 = year, 5 = other, 9 = blank

In questions that permit multiple responses, each possible response option should be assigned a separate column
6. Which magazines do you read? Choose all that apply.
   • Time
   • National Geographic
   • Readers Digest
   • Chatelaine
   • MacLean's

Col. No. | Question No. | Question Des. | Range of permissible values
15 | 6 | Time | 0 = read, 1 = not read
16 | 6 | Readers Dig. | 0 = read, 1 = not read
17 | 6 | MacLean's | 0 = read, 1 = not read
18 | 6 | National Geo. |
0 = read, 1 = not read
19 | 6 | Chatelaine | 0 = read, 1 = not read

For rank order questions, separate columns are also needed
7. Please rank the following brands of toothpaste in order of preference (1-5), with 1 being the most important
   • Crest
   • Aquafresh
   • Aim
   • Colgate
   • Arm & Hammer

Col. No. | Question No. | Question Des. | Range of permissible values
20 | 7 | Crest rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
21 | 7 | Colgate rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
22 | 7 | Aquafresh rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
23 | 7 | A & H rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth
25 | 7 | Pepsodent rank | 0 = blank, 1 = most important, 2 = 2nd most important, 3 = third, 4 = fourth, 5 = fifth

Preparing the Data for Analysis
Variable Re-specification
• Existing data modified to create new variables
• Large number of variables collapsed into fewer variables
   – E.g. if 10 reasons for purchasing a car are given, they might be collapsed into four categories, e.g. performance, price, appearance, and service
• Creates variables that are consistent with research questions

Entering Data
• Problems can occur during data entry, such as transposing numbers and inputting an infeasible code (e.g. out of range)
   – E.g. if scores range from 1 to 5, then 0, 6, 7, and 8 are unacceptable or out of range (might be due to a transcription error)
• Always check the data-entry work.
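One way to check the data-entry work is to test every entered code against the range of permissible values in the code book. The sketch below is illustrative only: the variable names, the function `find_invalid_entries`, and the two sample records are made up for this example. Only the permissible codes come from the automotive code book above.

```python
# Permissible codes per variable, taken from the automotive code book above.
# The variable names are hypothetical labels chosen for this sketch.
CODEBOOK = {
    "responsible": {0, 1, 9},                  # Q1: 0 = no, 1 = yes, 9 = blank
    "who_maintains": {0, 1, 2, 3, 4, 5, 6, 9},  # Q2
    "how_tracked": {0, 1, 2, 3, 4, 9},          # Q3
    "how_often": {1, 2, 3, 4, 5, 9},            # Q4
}


def find_invalid_entries(records, codebook=CODEBOOK):
    """Return (respondent id, variable, value) for every out-of-range code."""
    problems = []
    for record in records:
        for variable, allowed in codebook.items():
            value = record.get(variable)  # a missing entry shows up as None
            if value not in allowed:
                problems.append((record["id"], variable, value))
    return problems


# Made-up records: respondent 2 has two infeasible codes.
sample_records = [
    {"id": 1, "responsible": 1, "who_maintains": 9, "how_tracked": 2, "how_often": 3},
    {"id": 2, "responsible": 7, "who_maintains": 0, "how_tracked": 1, "how_often": 0},
]

problems = find_invalid_entries(sample_records)
for respondent, variable, value in problems:
    print(f"Respondent {respondent}: {variable} = {value} is out of range")
```

A check like this only catches infeasible codes. A transposition that lands on another permissible value still has to be found by comparing the entries with the original questionnaires.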