Notes 17 - Wharton Statistics Department

Statistics 475 Notes 17
Reading: Lohr, Chapter 6.5
Notes:
(1) On Homework 4, Question 6.2, by “representative
sample,” the book means a “self weighting sample,” i.e., a
sample in which all sampled units have the same sample
weights.
(2) On Wednesday, Professor Larry Brown will be talking
to us about sampling issues in the U.S. Census.
Example of Unequal Probability Sampling: Random Digit
Dialing.
I. Historical Background
Telephone data collection plays an important role in
household surveys. During the earliest years of survey
research, most surveys were conducted either face-to-face
or by mail. The main impediment to using telephones as
the primary method of data collection was the coverage
error associated with excluding households (those without
telephones and those not listed in telephone directories).
By the late 1960s, several factors had coalesced to increase
the role of telephones in survey research:
(1) Higher coverage rates allayed some of the concerns
about noncoverage bias. Thornberry and Massey
(1988) used data from the National Health
Interview Survey (NHIS), a face-to-face cross-sectional
survey conducted each year, to show that
the percentage of households without telephones
decreased from about 20 percent in 1963 to about
10 percent in the early 1970s.
(2) Escalating cost of face-to-face interviewing; a
comparable telephone interview was less than half
the cost of a face-to-face interview.
(3) Publication of methodological research showing
that data quality in the telephone mode was
competitive with that observed with the face-to-face
and mail modes.
A key obstacle to the initial adoption of telephone surveys
was the lack of a good method of drawing a probability
sample of telephone numbers. Part of the problem was the
absence of a frame of telephone numbers that could easily
be used for sampling. The only frame available was a
telephone directory and it suffered from several
deficiencies. The most notable problem with the
directories was their incompleteness. Another serious
operational problem was that telephone directories were
difficult to handle in practice since no electronic or
computer files were available. National samples of listed
telephone numbers required multistage sampling:
geographic areas were sampled first, the directories for the
sampled areas were obtained and then numbers listed in the
sampled directories were selected.
Cooper (1964) published one of the first recorded methods
for random sampling of telephone numbers. He suggested
using a local area directory to identify all the assigned
prefixes and then appending a random four-digit suffix to
create the seven-digit number that was dialed. This method
produced a probability sample, and it covered households
that were not in the directory. However, Cooper’s method
was not very efficient: only about 20 percent of the
sampled numbers were residential. Nevertheless, this was a
considerable improvement over the approximately 3
percent of numbers that were residential when the sample
was based on sampling all possible telephone numbers.
Another disadvantage of Cooper’s method was that
researchers still had to acquire and search through a local
telephone directory to identify eligible prefixes. In this
sense, it was a difficult procedure to implement.
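Cooper's procedure is simple enough to sketch in a few lines of Python. This is an illustration only: the prefix list is made up, and drawing prefixes uniformly with replacement is an assumption, since the description above does not say how prefixes were chosen.

```python
import random


def cooper_sample(prefixes, n, rng):
    """Generate n seven-digit numbers in the spirit of Cooper (1964).

    prefixes: assigned three-digit prefixes read off a local directory.
    Assumption: each prefix is equally likely and draws are independent.
    """
    numbers = []
    for _ in range(n):
        prefix = rng.choice(prefixes)
        suffix = rng.randrange(10_000)  # any suffix from 0000 through 9999
        numbers.append(f"{prefix}-{suffix:04d}")
    return numbers


# Hypothetical prefixes, for illustration only.
rng = random.Random(475)
dialing_list = cooper_sample(["386", "573", "898"], n=5, rng=rng)
print(dialing_list)
```

Because the suffix is random, unlisted numbers can be reached, but many of the generated numbers will be nonworking or nonresidential.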
II. The Mitofsky-Waksberg Method
Warren Mitofsky was hired by CBS News as the director of
the election and survey unit in 1967 and began trying to
build survey capabilities for CBS. He faced all the above-mentioned
difficulties associated with selecting national
samples of telephone numbers. He learned that AT&T had
a computer file of all the assigned area code-prefixes in the
country. He eventually convinced AT&T to sell the data.
Having the list of all assigned area-code prefixes eliminated
the need for dealing with telephone directories.
Despite this important advance, Cooper’s method of
randomly adding four digits to assigned area code-prefixes
continued to pose operational problems. Early on, researchers
realized that the telephone companies assigned numbers in
such a way that some of the area code-prefix combinations
had few residential numbers while others had many.
Mitofsky developed a two-stage sampling scheme that took
advantage of the way numbers were assigned to greatly
increase the efficiency of the sample:
1. Construct a frame of all area codes and three-digit
   prefixes in the area of interest. To these, add all
   possible choices for the next two digits, and thus
   prepare a list of all possible first eight digits of the
   ten digits in telephone numbers. These eight-digit
   blocks are treated as primary sampling units (psu’s).
2. Randomly select an eight-digit psu and then randomly
   select the last two digits for the number. Call the
   number. If the number is not residential, the psu is
   rejected. If the number is residential, the psu is
   retained. Ask the survey question to the household
   called, then keep randomly adding last two digits to
   the psu’s eight digits and calling the resulting
   numbers until a fixed number k of residential
   numbers in the psu have been obtained.
3. Repeat step 2 until a fixed number of psus have
   been retained. The total size of the sample is the
   number of psus retained times k.
About 20-25 percent of the psus are typically retained.
Using this two-stage procedure, the overall percentage of
randomly generated sampled numbers that are residential
often exceeds 60 percent, depending on how many second
stage residences are sampled. This efficient procedure
reduces data collection cost and increases interviewer
morale (interviewers have never enjoyed dialing
nonworking and nonresidential numbers).
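The two-stage procedure can be sketched as a small simulation. Everything about the toy frame below is an assumption made for illustration: the area code-prefix is made up, 10 of the 50 psus hold 60 residential numbers each and the rest hold none, psus are drawn with replacement, and numbers within a psu are dialed without repeats. The sketch also assumes that every psu with any residential number has at least k of them.

```python
import random


def mitofsky_waksberg(residential, n_psus, k, rng):
    """Simulate the two-stage scheme on a known toy frame.

    residential: dict mapping each eight-digit psu to the set of its
                 residential two-digit endings (integers 0..99).
    Returns the sampled (psu, ending) pairs and the number of calls made.
    """
    psus = list(residential)
    sample, dialed, retained = [], 0, 0
    while retained < n_psus:
        psu = rng.choice(psus)
        order = rng.sample(range(100), 100)  # random dialing order, no repeats
        dialed += 1
        if order[0] not in residential[psu]:
            continue  # first call is not residential: reject the psu
        hits = [order[0]]
        for ending in order[1:]:
            if len(hits) == k:
                break
            dialed += 1
            if ending in residential[psu]:
                hits.append(ending)
        sample.extend((psu, ending) for ending in hits)
        retained += 1
    return sample, dialed


# Toy frame (hypothetical): 10 dense psus, 40 empty ones.
rng = random.Random(17)
frame = {}
for p in range(50):
    psu = f"215898{p:02d}"
    frame[psu] = set(rng.sample(range(100), 60)) if p < 10 else set()

sample, dialed = mitofsky_waksberg(frame, n_psus=5, k=4, rng=rng)
print(len(sample), "residential numbers found in", dialed, "calls")
```

In this toy frame only 12 percent of all possible numbers are residential, so the share of calls that reach a residence under the two-stage scheme can be compared with that baseline by rerunning with different values of k.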
Waksberg (1978) studied the statistical properties of
Mitofsky’s procedure. The theory in Waksberg’s paper
was very elegant. He showed that the method samples first-
stage clusters with probabilities exactly proportional to the number
of residential numbers in the cluster without the researcher
ever knowing (or needing to know) the number of
residences in the cluster. The method also has the very
desirable outcome of sampling every telephone number
with equal probability. Here are the details:
Let $M_i$ be the number of residential telephone numbers in
the ith cluster and let k be the number of residential
telephone numbers in each cluster that are to be selected
into the sample if the cluster is selected in the first stage
and is retained (i.e., the randomly chosen number from the
cluster is found to be residential).
$$
P(\text{jth residential number in ith cluster selected into sample})
= P(\text{ith cluster selected and retained}) \, P(\text{jth number selected} \mid \text{ith cluster retained})
\propto \frac{M_i}{K} \cdot \frac{k}{M_i} = \frac{k}{K},
$$
where K is the total number of residential telephone
numbers in the population. To estimate a population total,
we would need to know K, the total number of residential
telephone numbers in the population and use sampling
weights $w_{ij} = K/k$.
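The cancellation of the cluster size in the derivation above can be checked with exact arithmetic. The sketch below computes, for a single first-stage draw, the probability that a given residential number ends up in the sample. The cluster sizes are made up, each psu is assumed to contain 100 possible numbers, and second-stage numbers are assumed to be drawn without replacement from clusters holding at least k residences.

```python
from fractions import Fraction


def per_draw_probabilities(M, k, slots=100):
    """P(a given residential number in cluster i is sampled) for one draw.

    M: residential counts per cluster (each 0 or at least k).
    slots: possible numbers per cluster (100 two-digit endings).
    """
    N = len(M)
    probs = []
    for M_i in M:
        if M_i == 0:
            probs.append(Fraction(0))  # no residential numbers to sample
            continue
        if M_i < k:
            raise ValueError("each nonempty cluster needs at least k residences")
        # Cluster drawn (1/N), then its first dialed number is residential.
        p_retained = Fraction(1, N) * Fraction(M_i, slots)
        # By symmetry, each residence is among the k selected with prob k/M_i.
        p_within = Fraction(k, M_i)
        probs.append(p_retained * p_within)
    return probs


M = [60, 30, 10, 0]  # hypothetical cluster sizes
probs = per_draw_probabilities(M, k=5)
print(probs)
```

With these sizes every residential number gets the same probability, 1/80, regardless of whether it sits in the cluster with 60 residences or the one with 10. The first-stage probability grows with the cluster size and the second-stage probability shrinks with it by the same factor.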
To estimate an average or proportion, the typical goal of
telephone surveys, we do not need to know K. We only
need to know a “relative weight” $w_{ij}$ (i.e., weight up to a
constant) for each response $y_{ij}$ in the sample and we can
estimate the population mean as
$$
\hat{\bar{y}} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}.
$$
For the Mitofsky-Waksberg method, we have a self-weighting sample and can use a relative weight of $w_{ij} = 1$.
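As a minimal sketch of the weighted mean estimator, the function below takes (weight, response) pairs and returns the ratio of the weighted sum of responses to the sum of weights. The data are invented; the point is only that multiplying all weights by a constant leaves the estimate unchanged, which is why relative weights suffice.

```python
def weighted_mean(pairs):
    """Ratio estimator of a mean from (w_ij, y_ij) pairs."""
    pairs = list(pairs)
    numerator = sum(w * y for w, y in pairs)
    denominator = sum(w for w, _ in pairs)
    return numerator / denominator


# Invented yes/no responses with equal (self-weighting) relative weights.
responses = [1, 0, 1, 1, 0]
estimate_unit = weighted_mean((1.0, y) for y in responses)
# Same data with every weight multiplied by an arbitrary constant.
estimate_scaled = weighted_mean((250.0, y) for y in responses)
print(estimate_unit, estimate_scaled)
```

With equal weights the estimator reduces to the ordinary sample mean of the responses.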
Note that under ideal conditions, the Mitofsky-Waksberg
method leads to a self-weighting sample of residential
telephone numbers, but it does not give a self-weighting
sample of households – some households may have more
than one telephone number; others may not have a
telephone. For the issue of households having more than
one telephone number, we make the sampling weight for
the household inversely proportional to the number of
telephone numbers the household has. We will discuss
how to deal with nonresponse when we cover Chapter 8.
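The household adjustment can be sketched as follows. A household with two telephone numbers is twice as likely to be reached as a household with one, so it receives half the relative weight. The records are invented, and the sketch ignores households without telephones and nonresponse.

```python
def household_weighted_mean(records):
    """Estimate a household-level mean from (response, n_lines) records.

    Each household's relative weight is 1 / n_lines, the inverse of the
    number of telephone numbers through which it could have been reached.
    """
    weights = [1.0 / n_lines for _, n_lines in records]
    numerator = sum(w * y for w, (y, _) in zip(weights, records))
    return numerator / sum(weights)


# Invented data: (response, number of telephone numbers in the household).
records = [(1, 1), (0, 2), (1, 1), (0, 1)]
unweighted = sum(y for y, _ in records) / len(records)
weighted = household_weighted_mean(records)
print(unweighted, weighted)
```

Here the unweighted mean is 0.5, while the weighted mean is 2/3.5, about 0.571, because the two-line household counts for only half a household.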
III. Complications posed by cell phones
At the end of 2006, approximately 12% of US adults had
only a cell phone and no residential landline phone. This
poses several difficulties for telephone surveys:
1. People are less likely to respond to a cell phone survey
because they must pay for the incoming call. It is
common practice in the survey profession to offer
respondents on cell phones a small amount of money
to reimburse them for the costs of the incoming call.
2. Federal law prohibits the calling of cell phones with
the use of automatic dialing devices, which are
commonly used by both survey organizations and
telemarketers. But survey organizations are permitted
to call cell phones if the numbers are dialed manually.
IV. How the New York Times Conducts Its Polls
How the Poll Was Conducted
Published: September 17, 2008, New York Times
The latest New York Times/CBS News Poll is based on telephone interviews conducted
Sept. 12 through Sept. 16 with 1,133 adults throughout the United States. Of these, 1,004
said they were registered to vote.
The sample of land line telephone exchanges called was randomly selected by a computer
from a complete list of more than 42,000 active residential exchanges across the country.
The exchanges were chosen so as to ensure that each region of the country was
represented in proportion to its population.
Within each exchange, random digits were added to form a complete telephone number,
thus permitting access to listed and unlisted numbers alike. Within each household, one
adult was designated by a random procedure to be the respondent for the survey.
To increase coverage, this land line sample was supplemented by respondents reached
through random dialing of cellphone numbers. The two samples were then combined.
The combined results have been weighted to adjust for variation in the sample relating to
geographic region, sex, race, marital status, age and education. In addition, the land line
respondents were weighted to take account of household size and number of telephone
lines into the residence, while the cellphone respondents were weighted according to
whether they were reachable only by cellphone or also by land line.
Because of fluctuations in party identification, this poll was also weighted by averaging
in party preferences from three recent past Times/CBS News polls.
Some findings regarding voting were also weighted in terms of an overall “probable
electorate,” which uses responses to questions dealing with voting history, attention to the
campaign and likelihood of voting in 2008 as a measure of the probability of
respondents’ turning out in November.
In theory, in 19 cases out of 20, overall results based on such samples will differ by no
more than three percentage points in either direction from what would have been
obtained by seeking to interview all American adults. For smaller subgroups, the margin
of sampling error is larger. Shifts in results between polls over time also have a larger
sampling error.
In addition to sampling error, the practical difficulties of conducting any survey of public
opinion may introduce other sources of error into the poll. Variation in the wording and
order of questions, for example, may lead to somewhat different results.
Michael R. Kagay of Princeton, N.J., assisted The Times in its polling analysis. Complete
questions and results are available at nytimes.com/polls.
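The "three percentage points in 19 cases out of 20" quoted in the article can be roughly checked with the usual 95 percent margin of error for a proportion, 1.96 times the square root of p(1 - p)/n, evaluated at p = 0.5. This back-of-the-envelope sketch assumes simple random sampling and ignores the poll's weighting and its combination of land line and cellphone samples, so it is not the Times' own calculation.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion under SRS."""
    return z * math.sqrt(p * (1 - p) / n)


moe_adults = margin_of_error(1133)      # all adults interviewed
moe_registered = margin_of_error(1004)  # those who said they were registered
print(round(100 * moe_adults, 1), round(100 * moe_registered, 1))
```

For the 1,133 adults this gives about 2.9 percentage points, in line with the stated three points; the smaller registered-voter subgroup gives a slightly larger margin, consistent with the article's remark about subgroups.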