Data Acquisition

Data Acquisition Overview

GIGO (garbage in, garbage out) is a core principle in statistics. No amount of sophisticated data analysis can compensate for botched data acquisition. Successful data acquisition consists of three interrelated activities:

1. deciding what quantities should be measured,
2. specifying exactly how these quantities should be measured, and, finally,
3. collecting the data in a manner which supports optimal statistical analysis.

Let’s consider each in turn.

1. Deciding what to measure is outside the scope of this course because it requires subject-matter expertise. For example, deciding what variables need to be measured to successfully characterize a chemical process requires the expertise of a chemist and/or chemical engineer.

2. Similarly, specifying exactly how the quantities of interest should be measured requires subject-matter expertise. However, since the measurements should be made so as to maximize precision, that is, to minimize variability, and this is a statistical consideration, statistical expertise is involved as well. The exact specification of how a quantity should be measured is called an operational definition. The authors discuss these in section 4.1 and provide several examples.

3. Finally, collecting the data in a way which supports optimal statistical analysis requires knowledge of the statistical methods to be used. Since all the statistical methods in this course require that the data satisfy, in some form, the IID assumption, acquiring data so as to satisfy this assumption is the primary goal.

There are two basic data acquisition scenarios: sampling a population and sampling a process. We will discuss sampling a population first, the topic of section 4.2.

Sampling Populations

What is the goal of sampling a population? Suppose we want to describe some population, that is, we want to make quantitative statements about it.
For example, suppose we want to determine what fraction of IU students are not from Indiana, i.e., we want to say something like “10% of IU students are not from Indiana.” The ideal way to do this would be to examine each student’s records and determine whether their home address is in Indiana. Suppose that, due to time, money or other constraints, this is not possible. In this case we are forced to base our statement about the IU students (the population) on an examination of a fraction (a sample) of all the students. Since in most cases the sample is a minuscule fraction of the population, making statements (inferences) about a population based on a sample is a risky, error-prone endeavor.

An interesting example of the danger of inferences based on sample data is provided by recent developments in the dating of the Shroud of Turin, a 4m linen cloth thought by many to be the burial shroud of Jesus of Nazareth. In 1988 a sample of cloth was taken from the Shroud and carbon-dated by three different laboratories. The labs gave consistent results and the Shroud was dated to about 1260-1390 AD. This appeared to show that the Shroud was not the burial shroud of Jesus of Nazareth but instead was another medieval forgery. (There were many such fakes. It’s been said that if you gathered all the wood pieces supposedly from Jesus’ cross you could build a cathedral!) However, recently published research reports that the sample was drawn from an area which is a medieval patch, which would invalidate the carbon dating. In addition, this research estimates the Shroud to be at least 1300 years old. For details, see http://www.factsplusfacts.com/carbon-14-now-we-know.htm

The goal of statistics is to make the best of this bad situation. How? By making statistical inferences, that is, inferences which have a specified degree of reliability.
With respect to estimating the non-Hoosier fraction of IU students, an example of a statistical inference would be “we are 95% confident that the fraction of non-Hoosier IU students is between 7% and 13%.” It should be clear that making precise statements like this based on examining only a small sample of IU students requires that we be particular about the method used to get that sample.

What properties should a sampling method have? Although a thorough answer is beyond the scope of this course, for our purposes a good sampling method satisfies the following:

1. First and foremost, the sampling method must produce samples satisfying the IID assumption, that is, the samples must consist of observations which are independent and identically distributed.
2. The sampling method should be mathematically tractable, that is, it should be easy to analyze the relationship between the sample and the population so as to enable statistical inferences like the one about IU above.
3. The larger the sample size, the more likely the method should be to produce a sample that is representative of the population.

Most people have an intuitive grasp of what is meant by representative, but let’s be more rigorous about it. A sample is representative of the population with respect to a given variable to the extent that the sample distribution resembles the population distribution (of that variable).

The simplest sampling method possessing these properties is Simple Random Sampling (SRS’ing), provided the following two conditions are met:

1. The sample size n is small relative to the population being sampled, and
2. The population being sampled is stationary, i.e., not changing for the duration of the sampling.

Simple random sampling is any sampling method which satisfies the following definition: Any sampling procedure which selects a sample of size n from a population in such a way that each size n subset of the population has an equal chance of being the sample is a simple random sampling procedure.
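The definition can be tried out on a toy population. The following is a minimal sketch, not part of the course text: it assumes a hypothetical population of five labelled units, a fixed seed, and a helper we have named simple_random_sample, and it relies on the standard library function random.sample, which draws n distinct items without replacement. Tallying many draws gives a rough empirical check that every size-2 subset turns up about equally often.

```python
import random
from collections import Counter
from itertools import combinations


def simple_random_sample(population, n, rng):
    """Draw n distinct items so that every size-n subset is equally likely."""
    return rng.sample(population, n)


# Hypothetical population of five labelled units.
population = ["A", "B", "C", "D", "E"]
n = 2
rng = random.Random(0)  # fixed seed so the run is repeatable

# Tally how often each unordered pair turns up over many draws.
draws = 100_000
counts = Counter(
    frozenset(simple_random_sample(population, n, rng)) for _ in range(draws)
)

# There are C(5, 2) = 10 possible subsets; each share should be close to 0.1.
for subset in combinations(population, n):
    share = counts[frozenset(subset)] / draws
    print(sorted(subset), round(share, 3))
```

The check is only empirical: equal observed shares are consistent with the definition but do not prove it for a given procedure.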
A sample selected this way is called a simple random sample (SRS).

Dealing out five-card hands is an example of simple random sampling. If the cards are thoroughly shuffled then each hand is a simple random sample: each five-card subset of the deck has an equal chance of being the hand. Using this fact plus counting methods, we can easily compute the probability of getting any particular type of hand, e.g., a full house, thus demonstrating that simple random sampling is mathematically tractable.

Note that this example violates the first of the two conditions above; the sample size (5) is not that small relative to the size of the population (52). Because of this the cards comprising the sample are not independent: the value of the second card is not independent of the value of the first card. If the first card is an ace, our chances of getting an ace on the second are 3/51, compared to 4/51 if the first card is not an ace.

It is important to be able to distinguish simple random sampling from other sampling procedures. A common mistake, which our authors make, is to confuse simple random sampling with procedures in which each individual has an equal chance of being in the sample. While it is true that if you use simple random sampling then each individual will have an equal chance of being in the sample, the converse is not true. In other words, the fact that a sampling procedure ensures that each individual has an equal chance of being in the sample does NOT imply that the procedure is a simple random sampling procedure.

For example, suppose we select a five-card hand from a deck of 52 cards as follows. We split the deck in half by color: the 26 black cards go into one stack, the 26 red cards into a second. We toss a nickel. If it’s heads, we select the top five cards from the black stack; if it’s tails, we select the top five cards from the red stack.
Assuming both stacks are thoroughly shuffled and the nickel toss is fair, each card in the deck has an equal chance of being in the resulting hand. However, this procedure is not simple random sampling. Why? Note that it is impossible to get a hand containing both black and red cards. Thus each five-card subset of the deck does NOT have an equal chance of being the hand.

In summary, the simplest way to get an IID sample from a population is to

1. sample a stationary population,
2. use simple random sampling, and
3. use a sample size n which is small compared to the population size, i.e., n < 5% of the population size.

Sampling a Process

When sampling a process, it is typically neither desirable nor necessary to acquire data using simple random sampling. Instead, systematic random sampling is often used. Systematic random sampling is a sampling procedure which

1. selects every mth item, and
2. selects the first item randomly from among the first m items.

Under what circumstances will a systematic random sample from a process be IID? It will be IID if the process is stationary and measurements of the process at all possible sampling points are independent. In fact, any sample will be IID under these conditions. To see this, consider a simple example of a stationary process with independent measurements, namely, a student repeatedly tossing a nickel. Assuming the student isn’t a magician, the process is stationary since the probability of heads stays constant at 0.5. Further, each measurement of the process (coin toss outcome) will be independent of preceding and following measurements (tosses). Under these conditions, any sample of n tosses, even n consecutive tosses, will be IID.

So why use systematic random sampling instead of simple random sampling? Because it provides better information and is often easier to implement in engineering scenarios.
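The two-step definition of systematic random sampling translates almost directly into code. The sketch below is illustrative only: the function name systematic_random_sample, the 20-item stream, and the seed are our own assumptions, and the items are taken to be stored in production order.

```python
import random


def systematic_random_sample(items, m, rng):
    """Pick a random start among the first m items, then every m-th item after it."""
    start = rng.randrange(m)  # 0-based position of the first selected item
    return items[start::m]


rng = random.Random(1)  # fixed seed, for repeatability only
process_output = list(range(1, 21))  # hypothetical stream of 20 items, in production order
sample = systematic_random_sample(process_output, m=5, rng=rng)
print(sample)  # four items spaced five apart
```

Note that only the starting point is random; once it is chosen, the rest of the sample is determined.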
To see why systematic random sampling is preferable to simple random sampling when sampling a process, consider getting a simple random sample of size 100 from the output of a production line for a given day in order to estimate the percentage of defectives. Suppose the line typically produces 10,000 units/day. To get a simple random sample of size 100, we use a random number generator to choose which 100 of the 10,000 units will be sampled. Suppose all the numbers are larger than 5,000, i.e., the sample will consist entirely of units produced in the afternoon. (Recall that this is possible since any subset of the 10,000 units must be able to comprise the sample under simple random sampling.) Now suppose the line has to be halted in the middle of the day so only 5,000 units are produced. You can’t get your sample.

Further, even if you could get this simple random sample, would it provide all the information you want? If the production line is stationary and the occurrence of defects is totally random, i.e., a defect on item j + 1 occurs independently of a defect on item j, any 100 items from the afternoon will constitute an IID sample. However, suppose the line is not stationary but goes out of control and produces a high proportion of defectives throughout the morning. There will be no evidence of this in our sample of items produced in the afternoon.

Suppose instead we get a sample of 100 using systematic random sampling, i.e., we select every 100th item starting with item k, where k is a randomly selected number between 1 and 100, inclusive. If the production line is stationary and the occurrence of defects is totally random, these 100 items will constitute an IID sample. However, if the production line goes out of control in the morning, half of the units in our sample will be from this time period, so we should be able to detect this event.
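The production-line comparison can be sketched numerically. This is an illustration under stated assumptions rather than a prescribed procedure: we assume units are numbered 1 to 10,000 in production order, that units 1 to 5,000 are the morning's output, and the names and seed below are our own.

```python
import random

UNITS_PER_DAY = 10_000
SAMPLE_SIZE = 100
M = UNITS_PER_DAY // SAMPLE_SIZE  # select every 100th unit
MORNING_CUTOFF = 5_000            # assumption: units 1..5000 are made in the morning


def morning_count(unit_numbers):
    """Number of sampled units that were produced in the morning."""
    return sum(1 for u in unit_numbers if u <= MORNING_CUTOFF)


rng = random.Random(2)  # fixed seed, for repeatability only

# Systematic random sample: start at unit k (1..100), then every 100th unit.
k = rng.randint(1, M)
systematic = list(range(k, UNITS_PER_DAY + 1, M))

# Simple random sample: any 100 of the 10,000 unit numbers.
srs = rng.sample(range(1, UNITS_PER_DAY + 1), SAMPLE_SIZE)

print("systematic: size", len(systematic), "morning units", morning_count(systematic))
print("SRS:        size", len(srs), "morning units", morning_count(srs))
```

Whatever k is drawn, the systematic sample contains exactly 50 morning units, whereas the morning count in the simple random sample varies from draw to draw and could, in principle, be zero.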
An additional advantage of systematic random sampling is that it can produce IID samples when adjacent measurements from a process are not independent. When there is dependence among measurements of a process, it is often the case that the dependence is short-range, that is, the dependence between any two measurements decreases as the distance (in time and/or space) between those two measurements increases. Thus observations sufficiently separated are effectively independent. For processes with this type of dependence, systematic random sampling can produce IID data if the spacing between measurements, m, is made large enough.

The time series plot and lag plot below are of 200 consecutive measurements from a process. From both plots it is clear that there is dependence, at least between adjacent observations. The next pair of plots are of 100 observations acquired using systematic random sampling with m = 2. Note that the dependence between every mth observation is much less than between adjacent observations. The final pair of plots consists of 40 observations acquired using systematic random sampling with m = 5. Note that this sample appears to be IID.

Population or Process?

One final word about IID sampling. Whenever you acquire data you should always record the order in which you acquire the measurements. For example, suppose you get a simple random sample of water samples from a lake. Clearly the population is stationary and, if you did the sampling correctly, the sample is IID. However, if you analyze the samples sequentially using an electronic instrument, you now have process data, since the instrument might drift over time.

An Example of Non-IID Sampling: Stratified Sampling

There are situations in which it is better to use sampling techniques which do not yield IID data. An important example is stratified sampling, a sampling method which is occasionally used by engineers and scientists.
Since this technique does not produce IID data, we will not analyze stratified sample data in this course. Stratified sampling is used in situations in which we have some information about the population we are sampling. Depending on the nature of this information, we may be able to capitalize on it by using a more sophisticated sampling procedure.

For example, suppose we want to determine the fraction of ISU students who want a new recreation center. Since building the rec center might require students to pay an additional fee, juniors may not want it (why pay for something you won’t ever use?), whereas freshmen may be more favorably disposed since it would be built before they graduate. Given these differences among the classes, we would like our sample to be representative with respect to the fraction of freshmen, sophomores, juniors, and seniors, i.e., 25% freshmen, 25% sophomores, etc. Using simple random sampling, we can’t guarantee this, because a particular subset consisting of all freshmen must have the same chance of being the sample as a specific subset which is perfectly representative. The solution is to use stratified sampling.

To get a stratified sample of ISU students, we

1. partition the population into four groups (strata): freshmen, sophomores, juniors, and seniors;
2. get simple random samples of size n/4 from each stratum; and then
3. combine these samples to get the stratified sample of size n.

Using stratified sampling, we guarantee that our sample will be representative with respect to class standing.

Note that it is possible to stratify on more than one variable. For example, it might be that male students are more interested in the rec center than female students. Thus we would want equal numbers of males and females from each class. Therefore we could stratify on both variables, getting simple random samples from the 8 strata (female freshmen, male freshmen, etc.) resulting from partitioning the population with respect to both gender and class standing.
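The three-step recipe for a stratified sample by class standing can be sketched as follows. The roster is made up (1,000 hypothetical student IDs per class), the function name stratified_sample is ours, and the sketch assumes equal allocation, n divided evenly among the strata, as in the steps above.

```python
import random


def stratified_sample(strata, n, rng):
    """Combine simple random samples of equal size n / (number of strata)."""
    per_stratum, remainder = divmod(n, len(strata))
    if remainder:
        raise ValueError("n must be divisible by the number of strata")
    sample = []
    for name, members in strata.items():
        # Simple random sample within this stratum.
        sample.extend((name, student) for student in rng.sample(members, per_stratum))
    return sample


# Hypothetical roster: 1,000 student IDs per class standing.
strata = {
    standing: [f"{standing}-{i}" for i in range(1000)]
    for standing in ("freshman", "sophomore", "junior", "senior")
}

rng = random.Random(3)  # fixed seed, for repeatability only
sample = stratified_sample(strata, n=40, rng=rng)
print(len(sample))  # 40 students, 10 from each class
```

Stratifying on two variables works the same way: the keys of the dictionary would simply be the 8 combinations of gender and class standing.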
This approach is used by polling agencies to get very accurate information on populations using small samples. For more information on stratified sampling, see the authors’ brief discussion on page 166.

Note that more complicated (non-IID) sampling procedures require more complicated statistical procedures; see the authors’ discussion on estimating a population mean from a stratified sample on pages 166-168. In particular, compare the formula for x̄_str on page 168 with the formula for x̄.
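For orientation, the comparison usually looks like the following. The authors' exact notation on page 168 is not reproduced here, so this is the commonly used form of the stratified estimator, written with symbols we have assumed: H strata, N_h units in stratum h, N units in the whole population, and x̄_h the sample mean within stratum h.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\text{versus}\qquad
\bar{x}_{\mathrm{str}} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{x}_h
```

The stratified estimate weights each stratum mean by that stratum's share of the population. When each stratum's share of the sample equals its share of the population, the two formulas give the same number.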
