Statistics, Distributions and the IID Assumption

Being students at an institute of technology, you’ve acquired and analyzed data in order to understand or model various phenomena. So, at some point, you’ve used statistics, at least according to the following definition:

Statistics: The discipline concerned with the optimal acquisition and analysis of data in order to model phenomena in the presence of variability.

So why is statistics important enough that your departments require you to take at least one course on it? The answer, and the operative word in the above definition, is variability:

“What makes the discipline of statistics useful and unique is that it is concerned with the process of getting data and understanding problems in the presence of variability. Statistical methods are designed to deal with variability. They help us make inferences about underlying processes on the basis of often imperfect and incomplete data.

Variability (or dispersion) in measurements and processes is a fact of life. Virtually all processes vary. Take, for example, a few items from a production line and measure a certain characteristic of these items. If your measurement has sufficient resolution, you will find that these measurements vary. Or count the number of flaws on different bolts of fabric, and you will notice variability among these counts from bolt to bolt. Or measure the thickness of certain thin wafers, the diameter of certain knobs, the yields of chemical batch processes, or the percentage of defective items in successive lots of 10,000. In all these cases, measurements will vary.”

As this passage from Applied Statistics for Engineers and Physical Scientists, 2nd ed., by Hogg and Ledolter makes clear, variability is ubiquitous. Because of this and the fact that you will need to control and/or minimize the variability associated with various phenomena, you will find the methods and concepts of statistics useful in your life and career.

In order to control the variability associated with a given measurement or variable, we need a means of describing that variable. For example, consider your age in years. There is (presumably) no variability associated with this variable: assuming you don’t have a birthday tomorrow, your age tomorrow is the same as your age today. You can describe your age completely with a single number. Now consider the age of a randomly selected Rose sophomore. There is variability associated with this variable. If we measure it twice, i.e., we randomly select two Rose sophomores and record their ages, we may get different values.

How can we describe or model this variable? Clearly we can no longer use a single number unless all current sophomores are the same age. To motivate the answer, let’s consider describing this variable by summarizing all the relevant information provided by a given data set. Suppose the data set consists of the ages of all current Rose-Hulman sophomores.

Since we have one number (age) for each student (observation), we have a univariate data set which we can view as simply a collection of numbers: {19, 20, 19, 19, 18, ...}. In order to summarize the relevant information, we need a method which can compress the information provided by 424 numbers into a much smaller set of numbers. Before continuing, pause and think about how you might do it.

We can summarize the relevant information about age in this data set by determining what ages are present and how often they occur. If we do this, we might end up with the following:

Age    Frequency
 18        3
 19      319
 20      100
 21        2

This representation is the Distribution of the variable age. The distribution provides the following information about a variable from a data set:

1. the unique values of that variable which appear in the data set and

2. how often the variable takes on each unique value in the data set.

The above distribution tells us that there are 424 sophomores, that, among all the sophomores, there are only four different ages, and, finally, how often each of these ages occurs.
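
In code, a frequency distribution is just a tally of each unique value. Below is a minimal Python sketch of this idea; the ages list is hypothetical, standing in for the actual 424 observations.

    from collections import Counter

    # Hypothetical ages -- a stand-in for the actual 424 observations.
    ages = [19, 20, 19, 19, 18, 21, 19, 20]

    # Counter tallies how often each unique value occurs, which is exactly
    # the information a frequency distribution provides.
    freq = Counter(ages)

    for age in sorted(freq):
        print(age, freq[age])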

Note that the distribution describes the variable age by compressing the information from 424 numbers into only 8 numbers! By compressing 424 numbers into these 8 numbers, have we lost any information? Put another way, given this distribution, can you reproduce the data set? The answer is no, because instead of the original data, {19, 20, 19, 19, 18, ...}, you might think the data was {20, 19, 19, 19, 18, ...}. In other words, you might get very lucky and reproduce the data exactly except for one minor difference - you switch the first and second numbers, 19 and 20, respectively. Note that both the original data and your reproduction yield the same age distribution, but the ORDER is different. The take-home message is that if the ordering of the data provides important information, then the distribution of a variable does NOT capture all the relevant information about that variable present in the data set.

Before we discuss ways in which the order of the data provides information, we note that there are two types of distributions. The above distribution is called a frequency distribution because it provides the frequency (number of occurrences) of each value in the data. Usually we are not so much interested in the frequency of a value but, rather, in its relative frequency, the proportion of the data equal to this value. Thus the relative frequency distribution of a variable consists of the unique values it assumes in the data set and their corresponding proportions. (Each proportion is simply the number of times the variable assumes that value divided by the number of observations in the data set.) The relative frequency distribution for the sophomore ages is therefore


Age    Frequency    Relative Frequency
 18        3             0.0071
 19      319             0.7524
 20      100             0.2358
 21        2             0.0048

Of course, we can express these proportions as percentages to improve readability:

Age    Percentage
 18        0.71
 19       75.24
 20       23.58
 21        0.48
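
Relative frequencies and percentages follow directly from the counts: divide each frequency by the number of observations. A Python sketch, again with hypothetical ages rather than the real data:

    from collections import Counter

    ages = [19, 20, 19, 19, 18, 21, 19, 20]  # hypothetical, not the real 424 ages
    n = len(ages)
    freq = Counter(ages)

    for age in sorted(freq):
        rel = freq[age] / n  # proportion of observations equal to this value
        print(f"{age}: frequency={freq[age]}, "
              f"relative frequency={rel:.4f}, percentage={100 * rel:.2f}%")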

In most cases where the ordering of the data is relevant, the ordering corresponds to a direction in time and/or space. The typical engineering application in which this occurs is process control. For example, consider pharmaceutical manufacturing. Engineers design and construct a manufacturing process to produce pills providing a certain dose. Clearly it is desirable for each pill to contain the same dose, but this is impossible due to inherent variability in the process. The best the engineers can hope to do is to control the process so that the dose varies within acceptable limits about the desired level. Process engineers do this by measuring the dose of randomly selected pills at regular intervals. If the process is “in control” then the distribution of the dose is constant over time (stationary or stable).

If the process is not “in control” then the process has changed or is changing (was or is nonstationary) and the dose is no longer varying about the desired level and/or the variability has increased to unacceptable limits. To detect undesirable changes in the process and reverse them, engineers use control charts, which plot characteristics of the distribution as a function of time. Although Chapter 6 (Quality and Reliability) provides an introduction to these methods, we will not cover it in this course. However, we will discuss one method for graphically determining whether the data production process is stationary: the time series plot or run chart (as it is called by process engineers).

The time series plot (time sequence plot or run chart) is so called because it plots the data on a variable as a series of measurements taken over time. If the process producing the data is stationary, i.e., the distribution of the variable is constant across time, then the time series plot should consist of points which

1. appear randomly scattered about the center (sample median¹) of the data (depicted by a horizontal line), that is, the points do not exhibit an increasing, decreasing, or otherwise systematic trend about the center over time, and

2. appear to have a constant degree of dispersion or scatter, that is, the amount of variability does not exhibit an increasing, decreasing, or otherwise systematic trend over time.

¹ The sample median is defined on page 62 of our text.


The graph below is a time series plot of data from a stationary process. Note that the points satisfy both properties.
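
If you want to generate a plot like this yourself, the following sketch simulates a stationary process (constant center and spread) and plots it with a horizontal line at the sample median. It assumes numpy and matplotlib are available; the pseudorandom data merely stand in for the figure's data.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.normal(loc=10.0, scale=1.0, size=100)  # constant center and spread

    plt.plot(x, marker="o")
    plt.axhline(np.median(x), color="red", label="sample median")
    plt.xlabel("time order")
    plt.ylabel("measurement")
    plt.title("Time series plot: stationary process")
    plt.legend()
    plt.show()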

The next three plots provide examples of data from nonstationary processes. The first plot displays data violating property 1 – the data possess an inverted-U trend so that points at either end tend to fall below the median line while points in the middle tend to fall above it.

The second plot displays data violating property 2 – the data exhibit a decreasing trend in variability so that points later in time tend to cluster more tightly together. The third plot depicts data violating both properties.


In some cases it is difficult to determine if patterns in the data are systematic or random. One way to distinguish between these alternatives is to compare the time series plot of the data with a random order time series plot or random order run chart (RORC). By presenting the data in random order, the random order time series plot provides an example of data with the same distribution but with no systematic trends - any systematic trends are eliminated by randomizing the order. The graphs below are random order plots of the data for the previous three examples. Comparison of these random order plots with the (nonrandom) time series plots confirms that these data were generated by nonstationary processes. Note that the differences in appearance between the time series and random order plots represent the information that is lost if we (incorrectly) ignore nonstationarity and attempt to summarize all relevant information in a data set using only its distribution.
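
The comparison itself is easy to sketch in code: plot the data in time order, then plot a random permutation of the same values. The example below (numpy and matplotlib assumed) uses made-up data with an inverted-U trend like the first nonstationary example; the permutation destroys the trend but preserves the distribution.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    t = np.arange(100)
    # Nonstationary example: an inverted-U trend plus random noise.
    x = 2.0 - ((t - 50) / 25.0) ** 2 + rng.normal(size=100)

    fig, axes = plt.subplots(1, 2, sharey=True)
    axes[0].plot(x, marker="o")
    axes[0].set_title("time series plot")
    axes[1].plot(rng.permutation(x), marker="o")  # same values, random order
    axes[1].set_title("random order plot")
    plt.show()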


Clearly the order of the data provides information if trends are present. The converse, however, is not true: even if there are no trends, the order CAN still contain relevant information.


Consider the time series and random order plots below. The two graphs differ - indicating there is information in the order of the data - yet there are no apparent trends. In this case the information takes a more subtle form: the value of the process at any point in time, p(t), is related to the preceding value, p(t − 1). Close inspection of the time series plot reveals that the sign of p(t) is always opposite that of p(t − 1). This relationship - and the information contained in it - depends on the order of the data. Thus the relationship is not present in the random order plot and the corresponding information is not captured by the distribution.
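
This sign-alternating behavior is easy to reproduce numerically. The sketch below (numpy assumed; the construction is an illustration, not the handout's actual data) builds a trend-free series whose signs always flip and then checks the lag-1 relationship.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100
    p = np.empty(n)
    p[0] = rng.normal()
    for t in range(1, n):
        # The magnitude is random, but the sign always flips relative to p[t-1].
        p[t] = -np.sign(p[t - 1]) * abs(rng.normal())

    # Adjacent values always have opposite signs ...
    print(np.all(np.sign(p[1:]) == -np.sign(p[:-1])))  # True
    # ... so the lag-1 correlation is strongly negative, even though a
    # random order plot of p would show no trend at all.
    print(np.corrcoef(p[:-1], p[1:])[0, 1])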

If the value of a variable at a given time point is related to past (or future) values, as in the preceding example, the data are said to consist of dependent observations - the value at a given point depends on past (and/or future) values. If the value at a given time point is not related to any previous (and/or future) values, the data are said to consist of independent observations. Thus, as the above graphs show, if the data consist of dependent observations, then the data contain information which the distribution will not capture.

In most real-world scenarios, if the observations are dependent, then the value at a given point will tend to be similar to the preceding value (in contrast to the example above). In this case the time series plot will appear smoother than the random order plots. The time series and random order plots below provide an example of this phenomenon. Although the difference is subtle, we see that for this data the time series plot is smoother than the random order plot, indicating that adjacent observations are similar in value.
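
A standard way to simulate this kind of positive dependence is a first-order autoregressive (AR(1)) process; the choice of AR(1) here is for illustration, not the handout's stated model. The sketch below (numpy assumed) also quantifies the smoothness by comparing adjacent differences in time order versus random order.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    p = np.zeros(n)
    for t in range(1, n):
        p[t] = 0.8 * p[t - 1] + rng.normal()  # each value stays near its predecessor

    # Adjacent differences are smaller in time order than in random order,
    # which is what "smoother" means visually.
    print(np.mean(np.abs(np.diff(p))))                   # smaller
    print(np.mean(np.abs(np.diff(rng.permutation(p)))))  # larger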

Note: Although the data can be dependent when there are no systematic trends about the center of the data, the presence of such trends will certainly make adjacent observations more similar in value than distant observations, i.e., make the data dependent.


Summary

From the preceding we conclude that the order of the data will NOT contain relevant information only if the following three conditions are met:

1. the data are randomly scattered about the center (median) of the data, that is, the points do not exhibit an increasing, decreasing, or otherwise systematic trend about the center over time;

2. the data have a constant degree of scatter, that is, the amount of variability does not exhibit an increasing, decreasing, or otherwise systematic trend over time; and

3. the data consist of independent observations.

If these three conditions are met, the distribution of a given variable, which consists of

1. the unique values of that variable which appear in the data set and

2. how often the variable takes on each unique value in the data set,

summarizes all the relevant information about that variable in the data set. Note that data satisfying these three conditions are said to be Independent and Identically Distributed or IID.

Identically Distributed refers to the fact that if conditions 1 and 2 are met (there are no systematic trends), then the distribution is stationary and the observations are all from the same distribution, i.e., identically distributed.

Independent refers to the fact that the observations are independent.

Checking the IID Assumption using Time Series/Random Order Plots

ALL data analysis procedures in this course require that the data be IID in some manner, i.e., that the data meet the IID assumption. Therefore it is extremely important that you verify, as thoroughly as possible, that your data satisfy this assumption before you analyze them using our methods. To check that your data meet the IID assumption using time series and random order plots, do the following in order:

1. Construct a time series plot of the data and check for nonstationarity, i.e., trends about the median and systematic changes in variability. If the data are not obviously nonstationary then

2. construct one or more random order plots of the data. If these plots appear rougher than the time series plot, you probably have nonstationary and/or dependent data, as sketched in code below.
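
Here is a minimal sketch of this two-step check, assuming numpy and matplotlib; the function name iid_check_plots and the example data are illustrative, not part of the course materials.

    import numpy as np
    import matplotlib.pyplot as plt

    def iid_check_plots(x, n_random=3, seed=0):
        # Step 1: time series plot with the sample median as a center line.
        # Step 2: several random order plots of the same data for comparison.
        rng = np.random.default_rng(seed)
        fig, axes = plt.subplots(1, n_random + 1, sharey=True,
                                 figsize=(4 * (n_random + 1), 3))
        axes[0].plot(x, marker="o")
        axes[0].axhline(np.median(x), color="red")
        axes[0].set_title("time series plot")
        for k in range(1, n_random + 1):
            axes[k].plot(rng.permutation(x), marker="o")
            axes[k].set_title(f"random order plot {k}")
        plt.show()

    # Example: trending (nonstationary) data -- the random order plots
    # should look rougher than the time series plot.
    data = np.linspace(0, 5, 80) + np.random.default_rng(4).normal(size=80)
    iid_check_plots(data)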


Real Data Examples

The preceding examples have consisted of artificial data (pseudorandom numbers) constructed to have the stated properties for illustrative purposes. The following two examples use real data.

Example 1: Yashchin (Technometrics, 37, 311-323) discusses a process for the chemical etching of silicon wafers used in integrated circuits. This process etches the layer of silicon dioxide until the layer of metal beneath is reached. The company monitors the thickness of the silicon dioxide layers because thicker layers require longer etching times. The data consist of the average thickness of 40 consecutive groups of four wafers. Based on the time series and random order plots below, what do you conclude about the stationarity of the wafer production process? Can you conclude anything about whether or not the data are dependent?

The most obvious features of the time series plot are the two shifts in the location of the data: first up, then down. Clearly these shifts are not attributable to the random variation of the data, i.e., they are systematic. Thus the process producing this data, the wafer production process, is nonstationary. Also, because of these shifts, adjacent values tend to be more similar than distant values, so the data are dependent. Note that after taking into account the two shifts, the (local) variability in the data appears more or less constant across time, that is, there don’t appear to be any trends in the scatter of the data.

Example 2: The data for this example consist of the temperature of a town in Indiana at about 8 AM for the first 185 days of 2003, i.e., from January 1st to the middle of July.

From the two plots, what do you conclude about whether or not the data satisfy the IID assumption?

From the time series plot we see two clear trends: there is an increasing trend in temperature accompanied by a decreasing trend in its variability – note how the temperatures cluster more closely later in the year. Due to the increasing trend in temperature, the data are not independent: temperatures close in time, e.g., from the same month, tend to be more similar in value than temperatures more distant, e.g., from different months.

