Analysis of Categorical Data

An experiment in which the observations can best be described as belonging to one of a set of categories is called a multinomial experiment. This description works best for data where the mean and standard deviation are not of interest; instead, we care about the grade or class each observation falls into (for example Prime, Choice, etc., or A, B, C, D). All categories, or classes, in which observations may lie are mutually exclusive. We are interested in the number of data points that fall into each category.

This type of analysis is useful in several situations: evaluating whether population proportions have been altered, determining whether a set of observations comes from a particular distribution, and verifying whether classification variables are independent.

Evaluating Population Proportions

Suppose we know that a particular population is composed of items that fall into several categories, and that we also know the true proportion of items in each category. If we take a random sample from the population, we can compare the proportion of sample observations in each category with the proportion expected from the population, to determine whether there has been a shift in the population proportions. We use the chi-square test statistic and hypothesis testing to make this comparison.

Hypothesis Test on Population Proportions:

$H_0: p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0}$
$H_a:$ at least one $p_i \neq p_{i0}$

T.S.: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(O_i - n p_i)^2}{n p_i}$

R.R.: $\chi^2 > \chi^2_{\alpha,\,k-1}$

where:
$O_i$ = number of actual data points in category/class $i$
$E_i$ = number of data points expected in category/class $i$; under $H_0$, $E_i = n p_i$ with $p_i = p_{i0}$
$n$ = sample size
$k$ = number of categories/classes

Example: In a given area, the population of birds consists of four species.
Historically, the birds have been known to be in the proportions:

p1 = 0.30 of species #1
p2 = 0.30 of species #2
p3 = 0.30 of species #3
p4 = 0.10 of species #4

A random sample of the bird population has been taken in the same area. From the sample of n = 200 birds, the following was observed:

O1 = 40 birds of species #1
O2 = 80 birds of species #2
O3 = 65 birds of species #3
O4 = 15 birds of species #4

We wish to determine, at the 0.05 level of significance, whether recent ecological changes in the area have disturbed the relative sizes of the bird populations (that is, whether the proportions have been altered). The null hypothesis is that the proportions have not changed.

$H_0: p_1 = 0.30,\ p_2 = 0.30,\ p_3 = 0.30,\ p_4 = 0.10$
$H_a:$ at least one $p_i \neq p_{i0}$

If $H_0$ is true, we would expect the following counts in the sample:

E1 = np1 = 200 x 0.30 = 60 birds
E2 = np2 = 200 x 0.30 = 60 birds
E3 = np3 = 200 x 0.30 = 60 birds
E4 = np4 = 200 x 0.10 = 20 birds

We can now calculate the test statistic:

T.S.: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(40-60)^2}{60} + \frac{(80-60)^2}{60} + \frac{(65-60)^2}{60} + \frac{(15-20)^2}{20} = 15$

d.f.: $\nu = k - 1 = 4 - 1 = 3$

Critical value: $\chi^2_{\alpha,\,k-1} = \chi^2_{0.05,\,3} = 7.81$

R.R.: $\chi^2 > 7.81$; here $15 > 7.81$.

Therefore, we reject $H_0$ at the 0.05 level. The data provide evidence that the relative proportions of the bird species have been altered.

CIVL 7012/8012 Probabilistic Methods for Engineers

Example: A manufacturer claims that his production line produces 85% Grade A items, 10% Grade B items, and 5% rejects. A random sample of 100 items from this production line included 80 Grade A's, 9 Grade B's, and 11 rejects. Does this sample contain sufficient evidence to reject the manufacturer's claim at the 0.05 level of significance?

Example: (for students to try) In 200 tosses of a coin, 115 heads were observed. Test $H_0$: the coin is fair vs. $H_a$: not $H_0$.
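As a numerical check on the bird-species example worked earlier in this section, the calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `chi_square_statistic` and the variable names are assumptions chosen for this sketch, and the critical value 7.81 is simply copied from the example rather than computed.

```python
# Chi-square test on population proportions for the bird-species example.
# Standard library only. The critical value 7.81 is the tabulated value
# quoted in the example (alpha = 0.05, 3 degrees of freedom), not computed.

def chi_square_statistic(observed, proportions):
    """Return (statistic, expected counts) for hypothesized proportions."""
    n = sum(observed)                         # sample size
    expected = [n * p for p in proportions]   # E_i = n * p_i under H0
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

observed = [40, 80, 65, 15]        # O_i from the sample of 200 birds
p0 = [0.30, 0.30, 0.30, 0.10]      # historical proportions under H0
CRITICAL_VALUE = 7.81              # chi-square(0.05, 3) from the example

chi2, expected = chi_square_statistic(observed, p0)
dof = len(observed) - 1            # k - 1 degrees of freedom
reject_h0 = chi2 > CRITICAL_VALUE  # rejection region: chi2 > critical value

print(f"expected counts = {expected}")
print(f"chi-square = {chi2:.2f}, d.f. = {dof}, reject H0: {reject_h0}")
```

The same function can be reused for the manufacturer and coin-toss examples by changing `observed` and `p0`; the critical value must then be looked up for the corresponding degrees of freedom.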
Example: The table below gives the numbers of students passed and failed by three instructors: A, B, and C. Test the hypothesis that the proportions of students failed by the three instructors are equal.

Instructor    Passed    Failed
A             50        5
B             47        14
C             56        8

Testing for Goodness of Fit: Chi-Square

Both the previous application and the current one can be called a "goodness of fit" test. In this case, however, we are interested in whether a set of observations follows a particular theoretical distribution.

Procedure for goodness of fit test:
1. Determine the distribution you think the data fit.
2. Break the data into class intervals.
3. Count the number of data points in each interval.
4. Calculate the number of data points that should be in each interval if the data actually fit the assumed distribution.
5. Apply the chi-square test to see whether the null hypothesis is rejected.

$H_0$: The test data follow distribution 'X'.
$H_a$: The test data do not follow distribution 'X'.
(The test is carried out at the α level of significance.)

T.S.: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$

R.R.: $\chi^2 > \chi^2_{\alpha,\,\nu}$

d.f.: $\nu = k - p - 1$

where:
$O_i$ = number of actual data points in category/class $i$
$E_i$ = number of data points expected in category/class $i$ if the data fit the theoretical distribution
$k$ = number of categories/classes
$p$ = number of parameters of the theoretical distribution estimated by sample statistics

*For example, if the hypothesized distribution is the negative exponential, only λ is needed to specify the distribution, so p = 1 when λ is estimated from the data. For each distribution, consider how many parameters it requires and how many of them must be estimated from the data. For the normal distribution, estimates of µ and σ are needed, so p = 2.

*Categories/classifications should be made so that there is a theoretical frequency of at least 5 in each interval.
If not, try combining intervals or omitting intervals with low frequency.

*Cochran (1954) stated that no theoretical frequency should be less than 1, and that no more than 20% of the theoretical frequencies should be less than 5. According to Cochran, this allows a good approximation to be obtained.

Example: A computer scientist developed an algorithm for generating pseudorandom integers over the scale 0-9. He coded the algorithm and generated the 1000 pseudorandom digits summarized in the table below. Is there evidence that the random number generator is working correctly? Use α = 0.05.

Digit    Oi      Ei
0        94      100
1        93      100
2        112     100
3        101     100
4        104     100
5        95      100
6        100     100
7        99      100
8        108     100
9        94      100
Total    1000    1000

Example: The number of defects in printed circuit boards is hypothesized to follow a Poisson distribution. A random sample of n = 60 printed boards has been collected, and the number of defects on each board observed. The following data result:

Number of Defects    Observed Frequency
0                    32
1                    15
2                    9
3                    4

Test at the α = 0.05 level of significance to determine whether the data support the null hypothesis.

Example: A manufacturing engineer is testing a power supply used in a notebook computer. Using α = 0.05, he wishes to determine whether output voltage is adequately described by a normal distribution. From a random sample of n = 100 units, he obtains sample estimates of the mean and standard deviation as 5.04 V and 0.08 V, respectively.
The observed cell frequencies are:

Class Interval         Oi    Ei
x < 4.948              12
4.948 ≤ x < 4.986      14
4.986 ≤ x < 5.014      12
5.014 ≤ x < 5.040      13
5.040 ≤ x < 5.066      12
5.066 ≤ x < 5.094      11
5.094 ≤ x < 5.132      12
5.132 ≤ x              14

Kolmogorov-Smirnov Test for Goodness of Fit

The Kolmogorov-Smirnov (K-S) test, like the chi-square goodness of fit test, compares a set of test data to some hypothesized theoretical distribution. Also like the chi-square test, it is nonparametric and distribution free (no assumption is made concerning the population from which the sample is drawn). One important advantage of the K-S procedure over the chi-square test is that it is not as constrained by small samples. It is also believed to be more sensitive than the chi-square test. However, it should be used for continuous distributions only.

*Nonparametric Methods: Any method of inference (hypothesis testing, CI construction) that does not depend on the form of the underlying distribution of the observations.

Procedure:
1. Obtain the difference between the cumulative distribution of the data and that of the theoretical distribution.
2. Compare the maximum difference between the two cumulative distributions to a table-based K-S statistic.
3. R.R.: K-S_calc > K-S_table

Example: Consecutive time headways for vehicles arriving at a certain point on a roadway were measured for a certain time period with the following results:

Vehicle Number    1      2      3      4      5      6      7      8      9      10
Headway (sec)     11.43  21.70  5.82   6.18   1.58   22.86  4.46   5.92   6.67   2.00

Vehicle Number    11     12     13     14     15     16     17     18     19     20
Headway (sec)     9.22   26.44  3.60   2.78   12.31  19.99  2.27   1.56   34.44  5.27

Use the K-S technique to determine whether these data come from an exponential distribution (use α = 0.05).

Solution: We first need to know λ.
Sum of headways = 206.49 sec
Average headway = 206.49/20 = 10.3245 sec/veh
λ = 1/10.3245 = 0.096857 veh/sec

$F(t) = 1 - e^{-\lambda t}$

We can now set up a table showing the cumulative distribution for the data (headways sorted in ascending order) and for the exponential distribution. The K-S difference in each row is the absolute difference between the cumulative portion and F(t).

Cumulative Veh.    Cum. Portion    Headway (sec)    F(t)        K-S difference
1                  0.05            1.56             0.140236    0.090235634
2                  0.10            1.58             0.141900    0.041899506
3                  0.15            2.00             0.176106    0.026106496
4                  0.20            2.27             0.197373    0.002626832
5                  0.25            2.78             0.236057    0.013942725
6                  0.30            3.60             0.294385    0.005615223
7                  0.35            4.46             0.350779    0.000779080
8                  0.40            5.27             0.399766    0.000233586
9                  0.45            5.82             0.430905    0.019095091
10                 0.50            5.92             0.436390    0.063609615
11                 0.55            6.18             0.450406    0.099593558
12                 0.60            6.67             0.475881    0.124119174
13                 0.65            9.22             0.590583    0.059416813
14                 0.70            11.43            0.669476    0.030524164
15                 0.75            12.31            0.696481    0.053519424
16                 0.80            19.99            0.855745    0.055744815
17                 0.85            21.70            0.877763    0.027763416
18                 0.90            22.86            0.890754    0.009246223
19                 0.95            26.44            0.922765    0.027235269
20                 1.00            34.44            0.964412    0.035587704

Maximum K-S difference = 0.124119 (vehicle 12)

The calculated K-S difference must be compared with the value from a table of critical K-S differences. For a sample size of 20, we obtain K-S_table = 0.294 for the specified level of test. Since K-S_calc = 0.124 < 0.294, the calculated value does not fall in the rejection region, and we cannot conclude that the data differ from the exponential distribution.

[Figure: Actual and Theoretical Cumulative Frequencies. Cumulative frequency versus headway (sec), showing the data and the fitted exponential distribution.]
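The headway calculation above can be reproduced with a short Python sketch using only the standard library. It is an illustration under stated assumptions rather than the notes' own procedure: the names `exp_cdf`, `d_notes`, and `d_two_sided` are invented for this sketch, λ is recomputed from the listed headways (so the last digits may differ slightly from the rounded values in the table), and the critical value 0.294 is copied from the text. The sketch also reports the commonly used two-sided form of the statistic, which additionally compares F(t) with the cumulative portion just before each point, (i - 1)/n; that variant is an addition not used in the table above.

```python
import math

# Headways (sec) in order of observation, from the example.
headways = [11.43, 21.70, 5.82, 6.18, 1.58, 22.86, 4.46, 5.92, 6.67, 2.00,
            9.22, 26.44, 3.60, 2.78, 12.31, 19.99, 2.27, 1.56, 34.44, 5.27]
KS_TABLE = 0.294  # tabulated value quoted in the text (n = 20, alpha = 0.05)

n = len(headways)
lam = n / sum(headways)  # estimated rate (veh/sec) = 1 / average headway

def exp_cdf(t, rate):
    """Exponential CDF: F(t) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)

ordered = sorted(headways)

# Difference as tabulated in the notes: |i/n - F(t_i)| at each sorted headway.
d_notes = max(abs(i / n - exp_cdf(t, lam))
              for i, t in enumerate(ordered, start=1))

# Assumption beyond the notes: the usual two-sided statistic also checks the
# empirical CDF just below each point, (i - 1)/n.
d_two_sided = max(max(i / n - exp_cdf(t, lam),
                      exp_cdf(t, lam) - (i - 1) / n)
                  for i, t in enumerate(ordered, start=1))

print(f"lambda = {lam:.6f} veh/sec")
print(f"max difference (as in the table) = {d_notes:.4f}")
print(f"max difference (two-sided form)  = {d_two_sided:.4f}")
print(f"reject H0 at tabulated value {KS_TABLE}: {d_two_sided > KS_TABLE}")
```

Under either form of the difference, the calculated value stays below the tabulated 0.294, so the conclusion of the example is unchanged.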