Statistics 475 Notes 17

Reading: Lohr, Chapter 6.5

Notes:
(1) On Homework 4, Question 6.2, by “representative sample,” the book means a “self-weighting sample,” i.e., a sample in which all sampled units have the same sampling weight.
(2) On Wednesday, Professor Larry Brown will be talking to us about sampling issues in the U.S. Census.

Example of Unequal Probability Sampling: Random Digit Dialing

I. Historical Background

Telephone data collection plays an important role in household surveys. During the earliest years of survey research, most surveys were conducted either face-to-face or by mail. The main impediment to using telephones as the primary method of data collection was the coverage error associated with excluding households (those without telephones and those not listed in telephone directories).

By the late 1960s, several factors had coalesced to increase the role of telephones in survey research:
(1) Higher coverage rates allayed some of the concerns about noncoverage bias. Thornberry and Massey (1988) used data from the National Health Interview Survey (NHIS), a face-to-face cross-sectional survey conducted each year, to show that the percentage of households without telephones decreased from about 20 percent in 1963 to about 10 percent in the early 1970s.
(2) The cost of face-to-face interviewing was escalating; a comparable telephone interview cost less than half as much as a face-to-face interview.
(3) Published methodological research showed that data quality in the telephone mode was competitive with that observed in the face-to-face and mail modes.

A key obstacle to the initial adoption of telephone surveys was the lack of a good method of drawing a probability sample of telephone numbers. Part of the problem was the absence of a frame of telephone numbers that could easily be used for sampling. The only frame available was the telephone directory, and it suffered from several deficiencies.
The most notable problem with the directories was their incompleteness. Another serious operational problem was that telephone directories were difficult to handle in practice, since no electronic or computer files were available. National samples of listed telephone numbers required multistage sampling: geographic areas were sampled first, the directories for the sampled areas were obtained, and then numbers listed in the sampled directories were selected.

Cooper (1964) published one of the first recorded methods for random sampling of telephone numbers. He suggested using a local area directory to identify all the assigned prefixes and then appending a random four-digit suffix to create the seven-digit number that was dialed. This method produced a probability sample, and it covered households that were not in the directory. However, Cooper’s method was not very efficient: only about 20 percent of the sampled numbers were residential. Nevertheless, this was a considerable improvement over the approximately 3 percent of numbers that were residential when the sample was drawn from all possible telephone numbers. Another disadvantage of Cooper’s method was that researchers still had to acquire and search through a local telephone directory to identify eligible prefixes. In this sense, it was a difficult procedure to implement.

II. The Mitofsky-Waksberg Method

Warren Mitofsky was hired by CBS News as the director of the election and survey unit in 1967 and began trying to build survey capabilities for CBS. He faced all the above-mentioned difficulties associated with selecting national samples of telephone numbers. He learned that AT&T had a computer file of all the assigned area code-prefix combinations in the country, and he eventually convinced AT&T to sell the data. Having the list of all assigned area code-prefix combinations eliminated the need to deal with telephone directories.
Despite this important advance, Cooper’s method of randomly adding four digits to assigned area code-prefix combinations continued to pose operational problems. Early on, researchers realized that the telephone companies assigned numbers in such a way that some of the area code-prefix combinations had few residential numbers while others had many. Mitofsky developed a two-stage sampling scheme that took advantage of the way numbers were assigned to greatly increase the efficiency of the sample:

1. Construct a frame of all area codes and three-digit prefixes in the area of interest. To these, add all possible choices for the next two digits, and thus prepare a list of all possible first eight digits of the ten digits in telephone numbers. These eight-digit blocks are treated as primary sampling units (psus).
2. Randomly select an eight-digit psu from the frame, and then randomly select the last two digits to complete the number. Call the number. If the number is not residential, the psu is rejected. If the number is residential, the psu is retained: ask the survey question of the household called, then keep appending random last two digits to the same eight-digit block and calling the resulting numbers until a fixed number k of residential numbers in the psu have been obtained.
3. Repeat step 2 until a fixed number of psus have been retained. The total sample size is the number of psus retained times k.

About 20-25 percent of the psus are typically retained. Using this two-stage procedure, the overall percentage of randomly generated sampled numbers that are residential often exceeds 60 percent, depending on how many second-stage residences are sampled. This efficient procedure reduces data collection cost and increases interviewer morale (interviewers have never enjoyed dialing nonworking and nonresidential numbers).

Waksberg (1978) studied the statistical properties of Mitofsky’s procedure. The theory in Waksberg’s paper was very elegant.
He showed that the method samples first-stage clusters with probabilities exactly proportional to the number of residential numbers in the cluster, without the researcher ever knowing (or needing to know) the number of residences in the cluster. The method also has the very desirable outcome of sampling every residential telephone number with equal probability.

Here are the details. Let $M_i$ be the number of residential telephone numbers in the $i$th cluster, and let $k$ be the number of residential telephone numbers to be sampled from each cluster that is selected in the first stage and retained (i.e., the randomly chosen number from the cluster is found to be residential). Then

$$P(j\text{th residential number in } i\text{th cluster selected into sample}) = P(i\text{th cluster selected and retained}) \, P(j\text{th number selected} \mid i\text{th cluster retained}) \propto \frac{M_i}{K} \cdot \frac{k}{M_i} = \frac{k}{K},$$

where $K$ is the total number of residential telephone numbers in the population. The factor $M_i$ cancels, so the selection probability is the same for every residential number.

To estimate a population total, we would need to know $K$, the total number of residential telephone numbers in the population, and use sampling weights $w_{ij} \propto K/k$. To estimate an average or proportion, the typical goal of telephone surveys, we do not need to know $K$. We only need to know a “relative weight” $w_{ij}$ (i.e., a weight up to a constant) for each response $y_{ij}$ in the sample, and we can estimate the population mean as

$$\hat{\bar{y}} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}},$$

where $S$ is the set of retained psus and $S_i$ is the set of residential numbers sampled in psu $i$. For the Mitofsky-Waksberg method, we have a self-weighting sample and can use a relative weight of $w_{ij} = 1$.

Note that under ideal conditions, the Mitofsky-Waksberg method leads to a self-weighting sample of residential telephone numbers, but it does not give a self-weighting sample of households: some households may have more than one telephone number, and others may not have a telephone. To handle households with more than one telephone number, we make the sampling weight for the household inversely proportional to the number of telephone numbers the household has.
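The equal-probability property can be checked with a small simulation. The following is a minimal sketch, not Waksberg's own calculation: the toy frame `M`, the function names, and the convention that suffixes `0..M[i]-1` are the residential ones are all assumptions made for illustration. The sketch also assumes that every psu has either no residential numbers or at least k of them, and that psus are drawn with replacement.

```python
import random
from collections import Counter


def mitofsky_waksberg(M, n_psus, k, rng, bank_size=100):
    """Draw one Mitofsky-Waksberg sample from a toy frame.

    M[i] is the number of residential numbers in psu i (hypothetical data).
    Within psu i, suffixes 0..M[i]-1 are treated as residential and the
    remaining suffixes as nonresidential. Assumes each M[i] is 0 or >= k.
    Returns (list of (psu, suffix) pairs, total numbers dialed).
    """
    sample, dials, retained = [], 0, 0
    while retained < n_psus:
        i = rng.randrange(len(M))          # pick a psu with equal probability
        first = rng.randrange(bank_size)   # pick a random suffix within it
        dials += 1
        if first >= M[i]:                  # nonresidential: reject the psu
            continue
        retained += 1
        sample.append((i, first))
        # Dial the other suffixes in random order until k residences are found.
        others = [s for s in range(bank_size) if s != first]
        rng.shuffle(others)
        found = 1
        for s in others:
            if found == k:
                break
            dials += 1
            if s < M[i]:
                sample.append((i, s))
                found += 1
    return sample, dials


def selection_counts(M, n_psus, k, reps, seed=0):
    """Repeat the sampling and count how often each residential number is drawn."""
    rng = random.Random(seed)
    counts, total_dials = Counter(), 0
    for _ in range(reps):
        sample, dials = mitofsky_waksberg(M, n_psus, k, rng)
        counts.update(sample)
        total_dials += dials
    return counts, total_dials


def weighted_mean(y, w):
    """Estimate a mean from responses y using relative weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Toy frame: five psus with very different numbers of residences (K = 75).
M = [0, 0, 5, 20, 50]
reps, n_psus, k = 2000, 2, 3
counts, total_dials = selection_counts(M, n_psus, k, reps)
print("expected selections per residential number:", reps * n_psus * k / sum(M))
print("observed min and max:", min(counts.values()), max(counts.values()))
print("share of dialed numbers that were residential:",
      reps * n_psus * k / total_dials)
```

If the derivation holds, the observed counts should be close to the expected value for every residential number, whether it sits in the psu with 5 residences or the one with 50. The `weighted_mean` helper implements the ratio estimator with relative weights; for households with several telephone numbers one would pass a weight of 1 divided by the number of lines, for example `weighted_mean([1.0, 0.0], [1.0, 0.5])`.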
We will discuss how to deal with nonresponse when we cover Chapter 8.

III. Complications Posed by Cell Phones

At the end of 2006, approximately 12% of US adults had only a cell phone and no residential landline phone. This poses several difficulties for telephone surveys:

1. People are less likely to respond to a cell phone survey because they must pay for the incoming call. It is common practice in the survey profession to offer respondents on cell phones a small amount of money to reimburse them for the cost of the incoming call.
2. Federal law prohibits calling cell phones with automatic dialing devices, which are commonly used by both survey organizations and telemarketers. Survey organizations are permitted to call cell phones, however, if the numbers are dialed manually.

IV. How the New York Times Conducts Its Polls

How the Poll Was Conducted
Published: September 17, 2008, New York Times

The latest New York Times/CBS News Poll is based on telephone interviews conducted Sept. 12 through Sept. 16 with 1,133 adults throughout the United States. Of these, 1,004 said they were registered to vote.

The sample of land line telephone exchanges called was randomly selected by a computer from a complete list of more than 42,000 active residential exchanges across the country. The exchanges were chosen so as to ensure that each region of the country was represented in proportion to its population.

Within each exchange, random digits were added to form a complete telephone number, thus permitting access to listed and unlisted numbers alike. Within each household, one adult was designated by a random procedure to be the respondent for the survey.

To increase coverage, this land line sample was supplemented by respondents reached through random dialing of cellphone numbers. The two samples were then combined.
The combined results have been weighted to adjust for variation in the sample relating to geographic region, sex, race, marital status, age and education. In addition, the land line respondents were weighted to take account of household size and number of telephone lines into the residence, while the cellphone respondents were weighted according to whether they were reachable only by cellphone or also by land line.

Because of fluctuations in party identification, this poll was also weighted by averaging in party preferences from three recent past Times/CBS News polls.

Some findings regarding voting were also weighted in terms of an overall “probable electorate,” which uses responses to questions dealing with voting history, attention to the campaign and likelihood of voting in 2008 as a measure of the probability of respondents’ turning out in November.

In theory, in 19 cases out of 20, overall results based on such samples will differ by no more than three percentage points in either direction from what would have been obtained by seeking to interview all American adults. For smaller subgroups, the margin of sampling error is larger. Shifts in results between polls over time also have a larger sampling error.

In addition to sampling error, the practical difficulties of conducting any survey of public opinion may introduce other sources of error into the poll. Variation in the wording and order of questions, for example, may lead to somewhat different results.

Michael R. Kagay of Princeton, N.J., assisted The Times in its polling analysis. Complete questions and results are available at nytimes.com/polls.
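The “19 cases out of 20” and “three percentage points” in the article can be related to the sample size with a rough calculation. The sketch below assumes the usual simple random sampling formula for a proportion, evaluated at the worst case p = 0.5 with a 95 percent normal critical value of 1.96. The function name `margin_of_error` is illustrative, and the formula ignores the weighting and design effects described in the article, so the poll's own calculation may well differ.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


# Sample sizes reported in the article.
all_adults = margin_of_error(1133)
registered = margin_of_error(1004)
print(f"all adults (n = 1,133): +/- {100 * all_adults:.1f} percentage points")
print(f"registered voters (n = 1,004): +/- {100 * registered:.1f} percentage points")
```

Under these assumptions the margin for all 1,133 adults comes out at about 2.9 percentage points, consistent with the three points quoted, and the smaller subgroup of registered voters has a slightly larger margin, as the article says.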