STAT 421 Week 2 1 Summary of concepts We have focused only on population units Target population Sampling frame Selecting a probability sample We have not discussed any characteristics of a population element 2 Characteristics of the target population We are really interested in making statements that summarize characteristics of the target population, i.e. population parameters The average school loan debt owed by currently enrolled ISU students The total surface area of county parks in the US The fraction of Des Moines households that fall below the poverty line 3 Data value for a target population element We calculate these population summaries from all of the data values associated with each population element, i.e., from yi = the value of characteristic for element i for i = 1, 2, …, N 4 Data value for a target population element A data value is yi = the value of characteristic for element i Examples Element = county in US yi = Element = student enrolled at ISU yi = Element = Des Moines household yi = 5 Population distribution of y y is often referred to as the … Variable of interest There are N data values for y There is one value of y for each of the N elements in the target population There may be fewer than N unique values So y has a DISCRETE distribution How do we summarize discrete distributions? Graphical Summary (Histogram) Numerical Summary (Population Parameters) 6 Population distribution of y Histogram Horizontal axis = all UNIQUE values of y Vertical axis = frequency of elements with a specific value of y To make a histogram, we start with a table with unique values of y and the frequency with which they occur 7 Population distribution of y Histogram for yi = number of courses a 2003 Stat 421 student i was registered for N = 24 Number of Number of courses students y c(y) 1 0 2 2 3 7 4 6 5 3 6 6 Histogram 8 7 Number of students c(y) 6 5 4 3 2 1 0 1 2 3 4 Number of courses y 5 6 8 Population distribution of y Discrete probability distribution of y Horizontal axis = all UNIQUE values of y Vertical axis = RELATIVE frequency of elements with a specific value of y , called P(y) Like a histogram with the frequencies (vertical axis) divided by N Usually start by making a table of Unique values of the variable, y Relative frequency of the unique values of y, P(y) 9 Population distribution of y Number of courses example Probability Distribution Relative frequency 8 P(y) 0.08 0.29 0.25 0.13 0.25 7 Number of students Number of courses c(y) / N y 1 0/24 2 2/24 3 7/24 4 6/24 5 3/24 6 6/24 6 5 4 3 2 1 0 1 2 3 4 5 6 Number of courses y 10 Population distribution of y Numerical Summary: Population Parameters Can also describe distribution of y through summary descriptions called parameters (more later) Total for y = (100 courses for all students) Mean of y = (100/24=4.2 courses per student) 11 Population distribution of y The population distribution of y is what we are trying to describe when we draw a sample, collect data, and calculate an estimate for a summary parameter The population distribution of y is FIXED No matter what sample design we choose Regardless of the sample we draw from a given design We do not assume a parametric distribution Forget normal distributions assumed in other classes (for now) 12 Survey design Survey design involves selecting methods for all phases of the survey process Objectives Sample design Data collection approach Analysis approach 13 Response process The process of collecting data from sampled units, e.g., via a questionnaire or observation form Response process Assume we have selected a probability sample from a frame The next step is to collect data from each sampled element This will lead to values for yi for each element in the sample We will act like we only collect one characteristic from each sampled element, but usually, we are collecting dozens or even 100s of different characteristics from each element 15 Problems in response process We rarely get complete data from all sampled units We may fail to obtain data through some part of the response or data collection process (e.g., nonresponse) or data that are free of error Data obtained may not be accurate (e.g., recall bias) 16 Problems in response process Outcomes for a sample selected from a frame Can not locate/contact a sampled unit (e.g., household) – unreachable Locate/contact a unit, but can’t get any data Sampled person refuses to participate - nonresponse Sampled person is unable to participate (illness)incapable Collect data on some, but not all characteristics Respondent doesn’t answer all questions Data collector forgets to record a variable 17 Response process and eligibility Recall that the frame (and thus sample) may contain units that do not belong in the target population In this case, we need to “screen” the unit to determine if the unit is eligible to be included in the survey Eligible means that the unit belongs to the target population 18 Summary of sampling and response process problems Sampling frames (and hence, the selected sample) may Include ineligible units Exclude eligible units The response process may Get incomplete data due to Unreachable Nonresponse Incapable 19 Summary of sampling and response process problems Sample process: includes only those units that were in the sampling frame Response process: includes only those that would have responded Were available during interview period, Were willing to be interviewed, and Were physically/mentally capable of providing responses 20 Sampled population The sampled population is the collection of all possible units that would be the outcome of the sampling and response process Could have been chosen in a sample, and Would have provided data during the response process if sampled 21 Target pop vs. sampling frame vs. sampled pop (Fig 1.1, p. 4) Ideally, these 3 representations of the population are completely overlapping and nonrespondents and ineligibles do not occur 22 Survey error Sampling error Nonsampling error Survey error model Total Survey Error = in an Estimate Sampling Error Due to selecting a sample instead of surveying the entire population + Nonsampling Error Due to mistakes or systematic deficiencies in sampling, response process, data processing Biemer & Lyberg, 2003, Introduction to Survey Quality 24 Sampling error Sampling error is the difference between an estimate of a population parameter (calculated with data from a sample) and the true population parameter being estimated True population mean from entire distribution Estimated mean calculated from the sample In a sample survey, we are collecting data from a subset of the population, i.e., we do not observe the whole population Estimate for any one sample is unlikely to perfectly match the population parameter 25 Example Population mean of N = 28 students for y = number of textbooks purchased by 2003 Stat 421 students: 4.21 books per student Randomly select a sample of n = 4 students Data on number of text books purchased: 2, 6, 3, 5 Estimated population mean: 4.00 books per student Two other samples yielded estimates of 3.50 and 4.75 books per student 26 Nonsampling error Nonsampling error includes all errors in data collection, processing and estimation except sampling error (1) Selection Errors (mismatch between the target population and the sampled population) Frame error Mismatch between target population and sampling frame Nonresponse error Inability to obtain data from a sampled unit 27 Nonsampling error (2)Specification error Discrepancy between concept of interest and how question is phrased (get incorrect data because of problem in question wording) (3)Measurement error Errors in data during interview or measurement process (respondent provided false/inaccurate info, interviewer made a recording mistake) (4)Processing error Errors in a computer program to process data or calculate estimates that generate an error in the value of the estimate 28 Reducing total survey error Sampling error Choose a sample design that produces precise estimates Nonsampling error Choose survey methods that encourage complete and accurate responses (nonresponse) Choose a frame close to target population (frame) Be very careful with questionnaire development (specification) Use trained and monitored interviewers (measurement) Use quality control for computer programs (processing) 29 Census vs. sample Census: collect data from all N members of a population Sampling error vanishes But nonsampling error for a census is often much larger than total survey error for a sample Sample: devote more effort to collecting high quality data from fewer sampled units Control sampling error via good sample design Implement more expensive, but more accurate data collection methods 30 Ch 1, problem 2 A student wants to estimate the percentage of mutual funds whose shares went up in price last week. She selects every tenth fund listing in the mutual fund pages of the newspaper. She calculates the percentage of those in which the share price increased. Target population Population unit Sampling frame Sampling unit Define yi , the value of characteristic for element i Possible sources of selection errors Possible sources of measurement errors 31 Ch 1, problem 1 The article “What Readers Say About Marijuana” reports that “more than 75% of the readers who took part in an informal PARADE telephone poll say marijuana should be as legal as alcoholic beverages” (Parade, 31 July 1994, 16). The telephone poll was announced on page 5 of the June 12 issue. Readers were instructed to “call 1-900-773-1200, at 75 cents a call, if you would like to answer the following questions. Use touch-tone phones only. To participate, call between 8 a.m. EDT [Eastern Daylight Time] on Saturday, June 11, and midnight EDT on Wednesday, June 15.” 32 Ch 1, problem 1 Target population Population unit Sampling frame Sampling unit Define yi , the value of characteristic for element i Possible sources of selection errors Possible sources of measurement errors 33 Ch 2: Probability Sampling and SRS Establish basic notation and concepts Population distribution of y Sampling distribution of an estimator under a design (this is not the sampled population!!) This is the object of inference Use this to evaluate quality of the estimate and make inference Apply these concepts and learn about estimation through SRS Selecting a SRS sample Estimating population parameters (means, totals, proportions) Estimating standard errors and confidence intervals Determining the sample size 34 Assume ideal setting (until further notice) Only sampling error, no nonsampling error Sampled population = target population Sampling frame is a perfect representation of target pop Data are collected on all sampled units Sample unit = element List of all elements, only those that are in target pop No frame or nonresponse errors Measurement process is perfect All responses and measurements are accurate 35 Class example Suppose we want to make inferences about 2003 Stat 421 students Interested in three population parameters The average course load of students The proportion of students who have a cell phone The total number of text books purchased by students this semester 36 Return to population distribution Indices for elements Each element has a unique label or index U = index set (set of labels) for all elements in the population Usually label an element by i = 1, 2, …, N Alternatively, a label could be name, or SSN U = {1, 2, …, N } Sampling frame is a list of labels or indices for each element in the population Select indices in the sampling process 38 Characteristic of interest y is the variable or characteristic of interest yi = characteristic of interest for unit i Set of y values in population y1 , y2 , …, yN Class example (3 y ’s) id 1 2 3 … 27 28 Number of courses Number of textbooks purchased Whether or not have cell phone courses 4 4 2 … 6 5 texts 5 3 2 … 6 6 cell 1 1 1 … 0 0 39 Population distributions for number of courses, number of books, whether/not have cell phone 40 Population distribution parameters Can also characterize a population distribution with population parameters Mean of y (proportion if y is binary) Total for y Variance of y (need this for sample size determination and expressing precision of estimates) Sometimes standard deviation, quantiles 41 Symbols for population distribution parameters y U = mean or expected value of y p = proportion of population having a particular characteristic Mean of a binary (0, 1) variable t = population total of y S 2 = variance of y S = standard deviation of y = generic parameter for population distribution of y 42 Parameter: population mean Examples Average number of miles driven per week by adults in US Average number of errors per client account Population mean of y (or expected value) N E [Y ] y U yi i 1 N Measure of central tendency (middle of distn) Units for the mean is y-units per element 43 Parameter: population mean What is the population mean number of textbooks purchased by students? N E[Y ] yU yi i 1 N 44 Parameter: population proportion Proportion (p) of population having a particular characteristic Mean of binary (0, 1) variable 1 , if unit i has characteristic yi 0 , if unit i doesn't have characteristic N p yi i 1 N 45 Parameter: population proportion What proportion of students have a cell phone? Data: 18 students have a cell phone N p yi i 1 N 46 Parameter: population total Examples Number of households in a region Number of deer in Iowa yi =number of households in area i N = number of areas in the region yi =number of deer observed in area i N = number of observation areas in Iowa Population total of Y N t y i Ny U i 1 Total number of y-units in the population 47 Parameter: population total What is the total number of books purchased by students in this class? Data: N t yi NyU i 1 48 Relationships for population mean, proportion, total N yU yi i 1 N t or p N N t yi NyU or Np i 1 49 Parameter: population variance Population variance N V [Y ] S 2 2 ( y y ) i U i 1 N 1 Measure of spread or variability in population’s response values S is the standard deviation of y NOT the standard error of an estimate, but is used in the formula for the standard error and confidence interval of an estimate 50 Parameter: population variance id 1 2 3 … 27 28 What is the population variance for the number of courses enrolled in per student? For having a cell phone? courses 4 4 2 … 6 5 texts 5 3 2 … 6 6 cell 1 1 1 … 0 0 N V [Y ] S 2 2 ( y y ) i U i 1 N 1 51 Summary of notation for the population distribution Basic pop unit: element (i ) Number of units or size of pop: N Values of the random variable: yi Parameters: characterize the population distribution Mean y U Proportion (mean of binary variable) p Total t 2 and standard deviation S Variance S Sometimes we will use population parameter to denote a generic 52 Summary of the population distribution Population distribution of characteristic of y is not known, but is the object of inference in conducting a survey Select a probability sample, collect data, and estimate unknown population distribution parameters using data collected from sample Population distribution and its parameters are fixed (constant) Values never change with design, sample, or estimator 53