Data Acquisition

Overview
GIGO - garbage in, garbage out - is a core principle in statistics. No amount of sophisticated
data analysis can compensate for botched data acquisition. Successful data acquisition
consists of three interrelated activities:
1. deciding what quantities should be measured,
2. specifying exactly how these quantities should be measured, and, finally
3. collecting the data in a manner which supports optimal statistical analysis.
Let’s consider each in turn.
1. Deciding what to measure is outside the scope of this course because it requires subject-matter expertise. For example, deciding what variables need to be measured to successfully characterize a chemical process requires the expertise of a chemist and/or chemical
engineer.
2. Similarly, specifying exactly how the quantities of interest should be measured requires
subject-matter expertise. However, since the measurements should be done so as to
maximize accuracy while minimizing variability, and these are statistical considerations,
statistical expertise is involved. The exact specification of how a quantity should be
measured is called an operational definition. The author discusses these in section 4.1
and provides several examples.
3. Finally, collecting the data in a way which supports optimal statistical analysis requires
knowledge of the statistical methods to be used. Since all the statistical methods in
this course require that the data satisfy, in some form, the IID assumption,
acquiring data so as to satisfy this assumption is the primary goal.
There are two basic data acquisition scenarios: sampling a population and sampling a process. We will discuss sampling a population first, the topic of section 4.2.
Sampling Populations
What is the goal of sampling a population? Suppose we want to describe some population,
that is, we want to make quantitative statements about it. For example, suppose we want to
determine what fraction of IU students are not from Indiana, i.e., we want to say something
like “10% of IU students are not from Indiana.” The ideal way to do this would be to examine
each student’s records and determine if their home address is in Indiana. Suppose, due to
time, money or other constraints this is not possible. In this case we are forced to make our
statement about the IU students (the population) based on examining a fraction (sample)
of all the students. Since in most cases the sample is a minuscule fraction of the population,
making statements (inferences) about a population based on a sample is a risky, error-prone
endeavor.
An interesting example of the danger of inferences based on sample data is provided by
recent developments in the dating of the Shroud of Turin, a 4m linen cloth thought by many
to be the burial shroud of Jesus of Nazareth. In 1988 a sample of cloth was taken from the
Shroud and carbon-dated by three different laboratories. The labs gave consistent results and
the Shroud was dated to about 1260-1390 AD. This appeared to show that the Shroud was not the
burial shroud of Jesus of Nazareth but instead was another medieval forgery. (There were
many such fakes. It’s been said that if you gathered all the wood pieces supposedly from
Jesus’ cross you could build a cathedral!) However, more recently published research argues that
the sample was drawn from an area which is a medieval patch, which would invalidate the carbon
dating. In addition, this research claims the shroud is at least 1300 years old. For details,
see http://www.factsplusfacts.com/carbon-14-now-we-know.htm
The goal of statistics is to make the best of this bad situation. How? By making statistical
inferences, inferences which have a specified degree of reliability. With respect to estimating
the non-Hoosier fraction of IU students, an example of a statistical inference would be “we
are 95% confident that the fraction of non-Hoosier IU students is between 7% and 13%.” It
should be clear that making precise statements like this based on examining only a small
sample of IU students requires that we must be particular about the method used to get
that sample.
What properties should a sampling method have? Although a thorough answer is beyond
the scope of this course, for our purposes a good sampling method satisfies the following:
1. First and foremost, the sampling method must produce samples satisfying
the IID assumption, that is, the samples must consist of observations which are
independent and identically distributed.
2. The sampling method should be mathematically tractable, that is, it should be easy
to analyze the relationship between the sample and the population so as to enable
statistical inferences like the one about IU above.
3. The sampling method should ensure that the larger the sample size, the greater the likelihood that the sample is representative of
the population. Most people have an intuitive grasp of what is meant by representative
but let’s be more rigorous about it. A sample is representative of the population with
respect to a given variable to the extent that the sample distribution resembles the
population distribution (of that variable).
The simplest sampling method possessing these properties is Simple Random Sampling
(SRS’ing) when the following two conditions are met:
1. The sample size n is small relative to the population being sampled, and
2. The population being sampled is stationary, i.e., not changing for the duration of the
sampling.
Simple random sampling is any sampling method which satisfies the following definition:
Any sampling procedure which selects a sample of size n from a population in such
a way that each size n subset of the population has an equal chance of being the
sample is a simple random sampling procedure. The resulting sample is called a
simple random sample (SRS).
Dealing out five-card hands is an example of simple random sampling. If the cards are
thoroughly shuffled then each hand is a simple random sample – each five card subset from
the deck has an equal chance of being the hand. Using this fact plus counting methods, we
can easily compute the probability of getting any particular type of hand, e.g., a full house,
thus demonstrating that simple random sampling is mathematically tractable. Note that
this example violates the first of the two conditions above; the sample size (5) is not that
small relative to the size of the population (52). Because of this the cards comprising the
sample are not independent: the value of the second card is not independent of the value
on the first card. If the first card is an ace, our chances of getting an ace on the second are
3/51 compared to 4/51 if the first card is not an ace.
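The counting argument above can be checked with a short Python sketch. It is only an illustration: the variable names are our own, and the full-house count uses the usual rank-and-suit counting argument rather than anything specific to the text.

```python
from fractions import Fraction
from math import comb

# Under simple random sampling, every five-card subset is equally likely.
total_hands = comb(52, 5)

# Full house: pick the rank of the triple and 3 of its 4 suits,
# then a different rank for the pair and 2 of its 4 suits.
full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)
p_full_house = Fraction(full_houses, total_hands)

# Dependence between the first two cards dealt.
p_ace_given_first_ace = Fraction(3, 51)
p_ace_given_first_not_ace = Fraction(4, 51)

# Averaging over the first card recovers the unconditional chance of an ace.
p_second_is_ace = (Fraction(4, 52) * p_ace_given_first_ace
                   + Fraction(48, 52) * p_ace_given_first_not_ace)

print(total_hands, full_houses, float(p_full_house))
print(p_ace_given_first_ace, p_ace_given_first_not_ace, p_second_is_ace)
```

The two conditional probabilities differ, which is exactly the lack of independence described above, even though the unconditional chance that the second card is an ace is still 4/52.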
It is important to be able to distinguish simple random sampling from other sampling procedures. A common mistake, which our authors make, is to confuse simple random sampling
with procedures in which each individual has an equal chance of being in the sample. While
it is true that if you use simple random sampling then each individual will have an equal
chance of being in the sample, the converse is not true. In other words, the fact that a
sampling procedure ensures that each individual has an equal chance of being in the sample
does NOT imply that the procedure is a simple random sampling procedure. For example,
suppose we select a five card hand from a deck of 52 cards as follows. We split the deck into
half with respect to color: the 26 black cards go into one stack, the 26 red into a second. We
toss a nickel. If it’s heads, we select the top five cards from the black stack; if it’s tails we
select the top five cards from the red stack. Assuming both stacks are thoroughly shuffled
and the nickel toss is fair, each card in the deck has an equal chance of being in the resulting hand. However, this procedure is not simple random sampling. Why? Note that it is
impossible to get a hand containing both black and red cards. Thus each five card subset of
the deck does NOT have an equal chance of being the hand.
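The distinction can be made concrete with exact fractions. The sketch below assumes the two-stack procedure exactly as described (fair coin, thoroughly shuffled stacks); the variable names are ours.

```python
from fractions import Fraction
from math import comb

# Chance that one particular card ends up in the hand.
p_card_srs = Fraction(5, 52)                           # simple random sampling
p_card_two_stack = Fraction(1, 2) * Fraction(5, 26)    # right coin face, then top 5 of 26

# Chance that one particular five-card subset is the hand.
p_hand_srs = Fraction(1, comb(52, 5))
p_one_color_hand = Fraction(1, 2) * Fraction(1, comb(26, 5))  # all black or all red
p_mixed_hand = Fraction(0)                             # black and red together: impossible

print(p_card_srs, p_card_two_stack)
print(p_hand_srs, p_one_color_hand, p_mixed_hand)
```

Every card has the same 5/52 chance of inclusion under both procedures, yet the subset probabilities differ, so the two-stack procedure fails the definition of simple random sampling.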
In summary, the simplest way to get an IID sample from a population is to
1. sample a stationary population,
2. use simple random sampling, and
3. use a sample size n which is small compared to the population size, i.e., n < 5% of the
population size.
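As a rough sketch, the three steps above might look like this in Python. The function names and the population of 40,000 student IDs are hypothetical; random.sample from the standard library draws without replacement so that every size-n subset is equally likely.

```python
import random

def sampling_fraction_is_small(n, population_size, threshold=0.05):
    """Rule of thumb from the summary: n should be under 5% of the population."""
    return n < threshold * population_size

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; each size-n subset has the same chance."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

# Hypothetical stationary population: a fixed list of student ID numbers.
student_ids = range(1, 40_001)
sample = simple_random_sample(student_ids, 400, seed=1)

print(len(sample), sampling_fraction_is_small(400, 40_000))
print(sampling_fraction_is_small(5, 52))  # the five-card hand fails the rule
```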
Sampling a Process
When sampling a process, it is typically neither desirable nor necessary to acquire data using
simple random sampling. Instead systematic random sampling is often used. Systematic
random sampling is a sampling procedure which
1. selects every mth item, and
2. selects the first item randomly from among the first m items.
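A minimal sketch of this two-step procedure follows. The function name and the example of 1,000 numbered items are our own assumptions, chosen only for illustration.

```python
import random

def systematic_random_sample(items, m, seed=None):
    """Pick a random start among the first m items, then take every m-th item."""
    rng = random.Random(seed)
    start = rng.randrange(m)       # 0-based position of the first selected item
    return list(items[start::m])   # every m-th item from that position on

# Hypothetical process output: items numbered 1 to 1000 in production order.
items = list(range(1, 1001))
sample = systematic_random_sample(items, m=10, seed=3)

print(len(sample), sample[:5])
```

Only the starting point is random; once it is chosen, the rest of the sample is determined.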
Under what circumstances will a systematic random sample from a process be IID? It will
be IID if the process is stationary and measurements of the process at all possible sampling
points are independent. In fact, any sample will be IID under these conditions. To see this
consider a simple example of a stationary process with independent measurements, namely,
a student repeatedly tossing a nickel. Assuming the student isn’t a magician, the process is
stationary since the probability of heads will be constant at 0.5. Further, each measurement of
the process (coin toss outcome) will be independent of preceding and following measurements
(tosses). Under these conditions, any sample of n tosses, even n consecutive tosses, will be
IID. So why use systematic random sampling instead of simple random sampling? Because
it provides better information and is often easier to implement in engineering scenarios.
To see why systematic random sampling is preferable to simple random sampling when
sampling a process, consider getting a simple random sample of size 100 from the output of
a production line for a given day in order to estimate the percentage of defectives. Suppose
the line typically produces 10,000 units/day. In order to get a simple random sample of size
100 we use a random number generator to select which 100 of the 10,000 will be selected.
Suppose all the numbers are larger than 5000, i.e., the sample will consist entirely of units
produced in the afternoon. (Recall that this is possible since any size-100 subset of the 10,000 units
must be able to comprise the sample under simple random sampling.) Suppose the line has
to be halted in the middle of the day so only 5,000 units are produced. You can’t get your
sample. Further, even if you could get this simple random sample, would it provide all the
information you want? If the production line is stationary and the occurrence of defects is
totally random, i.e., a defect on item j + 1 occurs independently of a defect on item j, any
100 items from the afternoon will constitute an IID sample. However, suppose the line is not
stationary but goes out of control and produces a high proportion of defectives throughout
the morning. There will be no evidence of this in our sample of items produced in the
afternoon.
Suppose instead we get a sample of 100 using systematic random sampling, i.e., we select
every 100th item starting with item k, where k is a randomly selected number between 1
and 100, inclusive. If the production line is stationary and the occurrence of defects is totally
random, these 100 items will constitute an IID sample. However, if the production line goes
out of control in the morning, half of the units in our sample will be from this time period
so we should be able to detect this event.
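The contrast between the two schemes can be illustrated with a small simulation. This sketch assumes that units 1 to 5,000 are the "morning" units; the function names, the seed, and the 1,000 repetitions are arbitrary choices of ours.

```python
import random

UNITS_PER_DAY = 10_000
SAMPLE_SIZE = 100
M = UNITS_PER_DAY // SAMPLE_SIZE   # spacing between selected units (100)

def systematic_unit_numbers(rng):
    """Every 100th unit, starting at a random k between 1 and 100 inclusive."""
    k = rng.randint(1, M)
    return list(range(k, UNITS_PER_DAY + 1, M))

def srs_unit_numbers(rng):
    """Simple random sample of 100 unit numbers out of 10,000."""
    return rng.sample(range(1, UNITS_PER_DAY + 1), SAMPLE_SIZE)

def morning_count(unit_numbers):
    """How many sampled units were produced in the first half of the day."""
    return sum(1 for u in unit_numbers if u <= UNITS_PER_DAY // 2)

rng = random.Random(2024)
systematic_counts = [morning_count(systematic_unit_numbers(rng)) for _ in range(1000)]
srs_counts = [morning_count(srs_unit_numbers(rng)) for _ in range(1000)]

print(min(systematic_counts), max(systematic_counts))
print(min(srs_counts), max(srs_counts))
```

Every systematic sample contains exactly 50 morning units, whereas the morning count of a simple random sample varies from draw to draw and could in principle be zero.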
An additional advantage of systematic random sampling is that it can produce IID samples
when adjacent measurements from a process are not independent. When there is dependence
among measurements of a process, it is often the case that the dependence is short-range,
that is, the dependence between any two measurements decreases the greater the distance
(in time and/or space) between those two measurements. Thus observations sufficiently
separated are independent. For processes with this type of dependence, systematic random
sampling can produce IID data if the spacing between measurements, m, is made large
enough. The time series plot and lag plot below are of 200 consecutive measurements from
a process. From both plots it is clear that there is dependence, at least between adjacent
observations. The next pair of plots are of 100 observations acquired using systematic random
sampling with m = 2. Note that the dependence between every mth observation is much
less than between adjacent observations. The final pair of plots consists of 40 observations
using systematic random sampling with m = 5. Note that this sample appears to be IID.
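The effect of the spacing m can be illustrated with simulated data. The sketch below does not use the measurements plotted in these notes. It assumes a hypothetical first-order autoregressive process with coefficient 0.6 as a stand-in for short-range dependence, and it uses a much longer series than 200 points so that the estimated correlations are stable.

```python
import random

def simulate_ar1(n, phi, seed=None, burn_in=500):
    """Simulate x[t] = phi * x[t-1] + noise, a process with short-range dependence."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:           # discard the start-up values
            series.append(x)
    return series

def lag1_autocorrelation(series):
    """Sample correlation between each observation and the next one."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    denominator = sum((value - mean) ** 2 for value in series)
    return numerator / denominator

def every_mth(series, m, seed=None):
    """Systematic random sample: random start among the first m, then every m-th."""
    start = random.Random(seed).randrange(m)
    return series[start::m]

process = simulate_ar1(20_000, phi=0.6, seed=7)
for m in (1, 2, 5):
    thinned = every_mth(process, m, seed=m)
    print(m, len(thinned), round(lag1_autocorrelation(thinned), 3))
```

For this kind of process the correlation between successive sampled values shrinks as m grows. A near-zero lag-1 correlation is consistent with an IID sample but does not by itself prove independence.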
Population or Process?
One final word about IID sampling. Whenever you acquire data you should always record
the order in which you acquire the measurements. For example, suppose you get a simple
random sample of water samples from a lake. Clearly the population is stationary and,
if you did the sampling correctly, the sample is IID. However, if you analyze the samples
sequentially using an electronic instrument, you now have process data since the instrument
might drift over time.
An Example of Non-IID Sampling: Stratified Sampling
There are situations in which it is better to use sampling techniques which do not yield IID
data. An important example is stratified sampling, a sampling method which is occasionally
used by engineers and scientists. Since this technique does not produce IID data, we will
not analyze stratified sample data in this course. It is used in situations in which we have
some information about the population we are sampling. Depending on the nature of this
information, we may be able to capitalize on it by using a more sophisticated sampling
procedure. For example, suppose we want to determine the fraction of ISU students who
want a new recreation center. Since building the rec center might require students to pay an
additional fee, seniors may not want it - why pay for something you won’t ever use? - whereas
freshmen may be more favorably disposed since it would be built before they graduate.
Given these differences among the classes we would like our sample to be representative with
respect to the fraction of freshmen, sophomores, juniors, and seniors, i.e., 25% freshmen, 25%
sophomores, etc. Using simple random sampling, we can’t guarantee this because a particular
subset consisting of all freshmen must have the same chance of being the sample as a specific
subset which is perfectly representative. The solution is to use stratified sampling.
To get a stratified sample of ISU students, we
1. partition the population into four groups (strata): freshmen, sophomores, juniors, and
seniors;
2. get simple random samples of size n/4 from each stratum; and then
3. combine these samples to get the stratified sample of size n.
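The three steps above can be sketched as follows. The function name and the student roster are hypothetical, and the sketch assumes n is divisible by the number of strata.

```python
import random
from collections import Counter

def stratified_sample(strata, n, seed=None):
    """Equal-allocation stratified sample.

    strata maps each stratum name to a list of its members.
    Returns (stratum name, member) pairs.
    """
    rng = random.Random(seed)
    per_stratum, remainder = divmod(n, len(strata))
    if remainder:
        raise ValueError("n must be divisible by the number of strata")
    sample = []
    for name, members in strata.items():
        # Step 2: a simple random sample within each stratum.
        chosen = rng.sample(members, per_stratum)
        # Step 3: combine the per-stratum samples.
        sample.extend((name, member) for member in chosen)
    return sample

# Step 1: hypothetical roster already partitioned by class standing.
roster = {
    "freshmen": [f"F{i}" for i in range(3000)],
    "sophomores": [f"So{i}" for i in range(2800)],
    "juniors": [f"J{i}" for i in range(2600)],
    "seniors": [f"Se{i}" for i in range(2500)],
}
sample = stratified_sample(roster, 40, seed=5)
counts = Counter(name for name, _ in sample)
print(counts)
```

Each class contributes exactly n/4 students no matter which individuals are drawn, which simple random sampling cannot guarantee.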
Using stratified sampling, we guarantee that our sample will be representative with respect
to class standing. Note that it is possible to stratify on more than one variable. For example,
it might be that male students are more interested in the rec center than female students.
Thus we would want equal numbers of males and females from each class. Therefore we
could stratify on both variables, getting simple random samples from the 8 strata (female
freshmen, male freshmen, etc.) resulting from partitioning the population with respect
to both gender and class standing. This approach is used by polling agencies to get very
accurate information on populations using small samples. For more information on stratified
sampling, see the authors’ brief discussion on page 166. Note that more complicated (non-IID) sampling procedures require more complicated statistical procedures; see the authors’
discussion on estimating a population mean from a stratified sample on pages 166-168. In
particular, compare the formula for x̄str on page 168 with the formula for x̄.