Discrete Distributions

Analysis of Categorical Data
An experiment whose observations are best described as belonging to one of a set of
categories is called a multinomial experiment. This description works best for data in
which the mean and standard deviation are not of interest; instead, we care about the
grade or class into which each observation falls (for example, Prime, Choice, etc., or
A, B, C, D). The categories, or classes, in which observations may lie are mutually
exclusive. We are interested in describing the number of data points that fall into each
category.
This type of analysis is useful for several situations, such as evaluating whether or not
population proportions have been altered, determining whether or not a set of
observations comes from a particular distribution, and verifying whether or not
classification variables are independent.
Evaluating Population Proportions
Suppose we know that a particular population is composed of items which fall into
several categories. We also know the true proportion of items that fall into each
category. If we take a random sample of the population, we can compare the proportion
of sample observations in each category with the proportion expected from the
population to determine whether or not there has been a shift in the population
proportions. We will use the chi-square test statistic and hypothesis testing to
accomplish this comparison.
Hypothesis Test on Population Proportions:
Ho: $p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0}$
Ha: at least one $p_i \neq p_{i0}$
T.S.: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(O_i - np_i)^2}{np_i}$
R.R.: $\chi^2 > \chi^2_{\alpha,\,k-1}$
where: Oi = number of actual data points in category/class
Ei = number of data points that are expected to be in the category/class
k = number of categories/classes
Example:
In a given area, the population of birds consists of four species. Historically, the birds
have been known to be in the proportions:
CIVL 7012/8012 Probabilistic Methods for Engineers
P1 = 0.30 of species #1
P2 = 0.30 of species #2
P3 = 0.30 of species #3
P4 = 0.10 of species #4.
A random sample of the population of the birds has been taken in the same area. From
the sample of n = 200 birds, the following was observed:
O1 = 40 birds of species #1
O2 = 80 birds of species #2
O3 = 65 birds of species #3
O4 = 15 birds of species #4.
We wish to determine, at a level of significance of 0.05, whether or not recent ecological
changes in the area have disturbed the relative sizes of the bird populations (that is,
whether or not the proportions have been altered). The null hypothesis is that the
proportions have not changed.
Ho : p1 = 0.30, p2 = 0.30, p3 = 0.30, p4 = 0.10
Ha: At least one pi ≠ pi0
If Ho is true, we would expect the following distribution from the sample:
E1 = np1 = 200 x 0.30 = 60 birds
E2 = np2 = 200 x 0.30 = 60 birds
E3 = np3 = 200 x 0.30 = 60 birds
E4 = np4 = 200 x 0.10 = 20 birds
We can now calculate the test statistic:
$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(40-60)^2}{60} + \frac{(80-60)^2}{60} + \frac{(65-60)^2}{60} + \frac{(15-20)^2}{20} = 15$
$\nu = k - 1 = 4 - 1 = 3$
Critical value: $\chi^2_{\alpha,\,k-1} = \chi^2_{0.05,\,3} = 7.81$
R.R.: $\chi^2 > 7.81$
$15 > 7.81$
Therefore, we reject Ho at the 0.05 level. The data provide evidence that the relative
proportion of birds in the species has been altered.
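The arithmetic in this example can be checked with a short script. The following is a minimal sketch, not part of the original notes; the variable names are illustrative, and the critical value 7.81 is simply the table value quoted in the example.

```python
# Chi-square test of the bird-species proportions from the example above.
observed = [40, 80, 65, 15]          # O_i from the sample of n = 200 birds
p_null = [0.30, 0.30, 0.30, 0.10]    # hypothesized proportions under Ho

n = sum(observed)
expected = [n * p for p in p_null]   # E_i = n * p_i

# Test statistic: sum over categories of (O_i - E_i)^2 / E_i
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1               # nu = k - 1
critical = 7.81                      # table value for alpha = 0.05, nu = 3 (from the text)
reject_h0 = chi_sq > critical

print(f"chi-square = {chi_sq:.2f}, df = {df}, reject Ho: {reject_h0}")
```

Most of the statistic comes from species #1 and #2, each contributing about 6.67 of the total of 15.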
Example:
A manufacturer claims that his production line produces 85% Grade A items, 10% Grade
B items, and 5% rejects. A random sample of 100 items from this production line
included 80 Grade A's, 9 Grade B's, and 11 rejects. Does this sample contain sufficient
evidence to reject the manufacturer's claim at the 0.05 level of significance?
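One possible way to work this example is sketched below. It follows the same steps as the bird example; the critical value 5.99 for 2 degrees of freedom is taken from a standard chi-square table and is not quoted in these notes.

```latex
\begin{aligned}
E_1 &= 100(0.85) = 85, \qquad E_2 = 100(0.10) = 10, \qquad E_3 = 100(0.05) = 5 \\
\chi^2 &= \frac{(80-85)^2}{85} + \frac{(9-10)^2}{10} + \frac{(11-5)^2}{5}
        \approx 0.294 + 0.100 + 7.200 = 7.59 \\
\nu &= k - 1 = 3 - 1 = 2, \qquad \chi^2_{0.05,\,2} = 5.99
\end{aligned}
```

Under these assumptions 7.59 > 5.99, so the claim would be rejected at the 0.05 level, with nearly all of the statistic coming from the rejects category.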
Example (exercise):
In 200 tosses of a coin, 115 heads were observed. Test Ho: the coin is fair vs. Ha: the
coin is not fair.
Example:
The table below gives the numbers of students passed and failed by three instructors: A,
B, and C. Test the hypothesis that the proportions of students failed by the three
instructors are equal.
Instructor     A     B     C
Passed        50    47    56
Failed         5    14     8
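A sketch of how this table might be tested follows. It assumes the usual approach for comparing proportions across groups: expected counts come from the pooled pass/fail proportions, and the degrees of freedom are (rows - 1)(columns - 1). Neither that formula nor the level α = 0.05 and its table value 5.991 is stated in the example, so treat them as assumptions.

```python
# Pass/fail counts by instructor, taken from the table above.
passed = {"A": 50, "B": 47, "C": 56}
failed = {"A": 5, "B": 14, "C": 8}

instructors = list(passed)
col_totals = {k: passed[k] + failed[k] for k in instructors}   # students per instructor
row_totals = {"passed": sum(passed.values()), "failed": sum(failed.values())}
n = sum(col_totals.values())

chi_sq = 0.0
for k in instructors:
    for label, obs in (("passed", passed[k]), ("failed", failed[k])):
        # Expected count if every instructor shared the pooled pass/fail proportions
        exp = row_totals[label] * col_totals[k] / n
        chi_sq += (obs - exp) ** 2 / exp

df = (2 - 1) * (len(instructors) - 1)   # (rows - 1) * (columns - 1), assumed
critical = 5.991                         # standard table value, alpha = 0.05, df = 2 (assumed)
reject_h0 = chi_sq > critical

print(f"chi-square = {chi_sq:.2f}, df = {df}, reject Ho: {reject_h0}")
```

With these counts the statistic is about 4.84, which does not exceed 5.991, so at the assumed 0.05 level the data would not show that the failure proportions differ.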
Testing for Goodness of Fit: Chi-Square
Both the previous application and the current one can be called a “goodness of fit” test.
In this case, however, we are interested in whether or not a set of observations follows
a particular theoretical distribution.
Procedure for goodness of fit test:
1. Determine the distribution you think the data fits.
2. Break the data into class intervals.
3. Count the number of data points in each interval.
4. Calculate the number of data points that should be in each interval if the data
actually did fit the assumed distribution.
5. Apply the Chi-Square test to see whether or not the null hypothesis is rejected.
Ho: The test data follow distribution ‘X.’
Ha: The test data do not follow distribution ‘X.’ (tested at the α level of significance)
T.S.: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
R.R.: $\chi^2 > \chi^2_{\alpha,\,\nu}$
d.f.: $\nu = k - p - 1$
where:
Oi = number of actual data points in category/class
Ei = number of data points that are expected to be in the category/class if
the data fit the theoretical distribution
k = number of categories/classes
p = number of parameters of the theoretical distribution estimated by
sample statistics.
*For example, if the hypothesized distribution is the negative exponential, only λ is
needed to determine the specific distribution, so p = 1 when λ is estimated from the
data. For each distribution, consider how many parameters are required and how many of
them must be estimated from the data. For the normal distribution, estimates of µ and σ
are needed, so p = 2.
*Categories/classifications should be made so that there is a theoretical frequency of at
least 5 in each interval. If not, try combining intervals or omitting intervals with low
frequency.
*Cochran (1954) stated that no theoretical frequency should be less than 1, and that no
more than 20% of the theoretical frequencies should be less than 5. According to
Cochran, this allows a good approximation to be obtained.
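The two rules of thumb above can be expressed as a small check on a list of expected frequencies. This is only an illustrative sketch: the function name `check_expected_frequencies` and the sample inputs are hypothetical, not taken from the notes.

```python
def check_expected_frequencies(expected):
    """Return (strict_ok, cochran_ok) for a list of expected cell counts."""
    # Rule of thumb from the notes: every expected frequency is at least 5
    strict_ok = all(e >= 5 for e in expected)
    # Cochran (1954) as quoted above: none below 1, at most 20% below 5
    share_below_5 = sum(e < 5 for e in expected) / len(expected)
    cochran_ok = min(expected) >= 1 and share_below_5 <= 0.20
    return strict_ok, cochran_ok

# Hypothetical expected counts, for illustration only
print(check_expected_frequencies([28.3, 21.3, 8.0, 2.4]))
print(check_expected_frequencies([10, 10, 10, 10, 3]))
print(check_expected_frequencies([12.5] * 8))
```

When a check fails, the remedy suggested in the notes is to combine neighboring intervals until the expected frequencies are large enough.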
Example:
A computer scientist developed an algorithm for generating pseudorandom integers over
the range 0-9. He coded the algorithm and generated the 1000 pseudorandom digits
summarized in the table below. Is there evidence that the random number generator is
working correctly? Use α = 0.05.
Digit     Oi      Ei
0         94     100
1         93     100
2        112     100
3        101     100
4        104     100
5         95     100
6        100     100
7         99     100
8        108     100
9         94     100
Total   1000    1000
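A possible worked solution is sketched below. It assumes the hypothesized distribution is discrete uniform on 0-9 with no parameters estimated from the data (p = 0); the critical value 16.92 comes from a standard chi-square table rather than from these notes.

```latex
\begin{aligned}
\chi^2 &= \sum_{i=0}^{9} \frac{(O_i - 100)^2}{100}
        = \frac{36 + 49 + 144 + 1 + 16 + 25 + 0 + 1 + 64 + 36}{100}
        = \frac{372}{100} = 3.72 \\
\nu &= k - p - 1 = 10 - 0 - 1 = 9, \qquad \chi^2_{0.05,\,9} = 16.92
\end{aligned}
```

Since 3.72 < 16.92, this calculation gives no evidence at the 0.05 level that the digits depart from a uniform distribution.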
Example:
The number of defects in printed circuit boards is hypothesized to follow a Poisson
distribution. A random sample of n = 60 printed boards has been collected, and the
number of defects on each board recorded. The following data result:
Number of Defects    Observed Frequency
0                    32
1                    15
2                     9
3                     4
Test at the α = 0.05 level of significance to determine whether or not the data support
the null hypothesis.
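The goodness-of-fit procedure for this example can be sketched in a few lines. Several choices here are assumptions rather than statements from the notes: the Poisson mean is estimated by the sample mean, the last class is treated as "3 or more" so the probabilities sum to 1, the last two classes are merged because the final expected count falls below 5, and 3.841 is the standard table value for 1 degree of freedom.

```python
import math

# Observed defect counts from the table above (n = 60 boards).
defects = [0, 1, 2, 3]
observed = [32, 15, 9, 4]
n = sum(observed)

# Estimate the Poisson mean from the sample (one estimated parameter, p = 1).
lam = sum(d * o for d, o in zip(defects, observed)) / n

def poisson_pmf(x, mean):
    """P(X = x) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)

# Probabilities for 0, 1, 2 defects; the last class is taken as "3 or more".
probs = [poisson_pmf(x, lam) for x in (0, 1, 2)]
probs.append(1.0 - sum(probs))
expected = [n * p for p in probs]

# The last expected count is below 5, so merge the last two classes.
obs_merged = observed[:2] + [observed[2] + observed[3]]
exp_merged = expected[:2] + [expected[2] + expected[3]]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs_merged, exp_merged))
df = len(obs_merged) - 1 - 1   # nu = k - p - 1 with k = 3, p = 1
critical = 3.841               # standard table value, alpha = 0.05, df = 1 (assumed)
reject_h0 = chi_sq > critical

print(f"lambda = {lam:.2f}, chi-square = {chi_sq:.2f}, df = {df}, reject Ho: {reject_h0}")
```

Under these assumptions the estimated mean is 0.75 defects per board and the statistic is about 2.96, which does not exceed 3.841, so the Poisson hypothesis would not be rejected.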
Example:
A manufacturing engineer is testing a power supply used in a notebook computer. Using
α = 0.05, he wishes to determine whether output voltage is adequately described by a
normal distribution. From a random sample of n = 100 units, he obtains sample estimates
of the mean and standard deviation as 5.04 V and 0.08 V, respectively. The observed cell
frequencies are:
Class Interval         Oi     Ei
x < 4.948              12
4.948 ≤ x < 4.986      14
4.986 ≤ x < 5.014      12
5.014 ≤ x < 5.040      13
5.040 ≤ x < 5.066      12
5.066 ≤ x < 5.094      11
5.094 ≤ x < 5.132      12
5.132 ≤ x              14
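The empty Ei column can be filled in from the fitted normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and the critical value 11.07 for 5 degrees of freedom is a standard table value that is not quoted in the notes.

```python
from statistics import NormalDist

n = 100
fitted = NormalDist(mu=5.04, sigma=0.08)   # sample estimates quoted in the example

# Interior class boundaries and observed counts from the table above.
boundaries = [4.948, 4.986, 5.014, 5.040, 5.066, 5.094, 5.132]
observed = [12, 14, 12, 13, 12, 11, 12, 14]

# Cumulative probabilities at the cell edges, with 0 and 1 for the open-ended tails.
cdf_edges = [0.0] + [fitted.cdf(b) for b in boundaries] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf_edges, cdf_edges[1:])]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 2 - 1   # nu = k - p - 1 with k = 8, p = 2 (mean and std. dev. estimated)
critical = 11.07             # standard table value, alpha = 0.05, df = 5 (assumed)
reject_h0 = chi_sq > critical

for o, e in zip(observed, expected):
    print(f"O = {o:3d}   E = {e:6.2f}")
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject Ho: {reject_h0}")
```

The class boundaries appear to have been chosen so that each cell has a probability of roughly 1/8, which gives expected counts near 12.5. The statistic comes out near 0.6, well below 11.07, so normality would not be rejected under these assumptions.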
Kolmogorov-Smirnov Test for Goodness of Fit
The Kolmogorov-Smirnov (K-S) test, like the chi-square goodness of fit test, compares a
set of test data to some hypothesized theoretical distribution. Like the chi-square test,
it is nonparametric and distribution free (no assumption is made concerning the
population from which the sample is drawn). One important advantage of the K-S
procedure over the chi-square test is that it is not as constrained by small samples. It is
also believed to be more sensitive than the chi-square test. However, it should be used
for continuous distributions only.
*Nonparametric Methods – Any method of inference (hypothesis testing, CI
construction) that does not depend on the form of the underlying distribution of the
observations.
Procedure:
1. Obtain the difference between the cumulative distribution for the data and that of
the theoretical distribution.
2. Compare the maximum difference between the two cumulative distributions to a
table-based K-S statistic.
3. R.R.: K-S_calc > K-S_table
Example:
Consecutive time headways for vehicles arriving at a certain point on a roadway were
measured for a certain time period with the following results:
Vehicle Number    Headway (sec)
1                 11.43
2                 21.70
3                  5.82
4                  6.18
5                  1.58
6                 22.86
7                  4.46
8                  5.92
9                  6.67
10                 2.00
11                 9.22
12                26.44
13                 3.60
14                 2.78
15                12.31
16                19.99
17                 2.27
18                 1.56
19                34.44
20                 5.27
Use the K-S technique to determine whether or not these data come from an exponential
distribution. (Use α= 0.05).
Solution:
We first need to estimate λ from the data.
Sum of headways = 206.49 sec
Average headway = 206.49/20 = 10.3245 sec/veh
λ = 1/10.3245 = 0.096857 veh/sec
$F(t) = 1 - e^{-\lambda t}$
We can now set up a table showing the cumulative distribution for the data and for the
exponential distribution.
Cumulative Veh.   Cum. Portion   Headway (sec)   F(X)        K-S difference
1                 0.05            1.56           0.140236    0.090235634
2                 0.10            1.58           0.1419      0.041899506
3                 0.15            2.00           0.176106    0.026106496
4                 0.20            2.27           0.197373    0.002626832
5                 0.25            2.78           0.236057    0.013942725
6                 0.30            3.60           0.294385    0.005615223
7                 0.35            4.46           0.350779    0.00077908
8                 0.40            5.27           0.399766    0.000233586
9                 0.45            5.82           0.430905    0.019095091
10                0.50            5.92           0.43639     0.063609615
11                0.55            6.18           0.450406    0.099593558
12                0.60            6.67           0.475881    0.124119174
13                0.65            9.22           0.590583    0.059416813
14                0.70           11.43           0.669476    0.030524164
15                0.75           12.31           0.696481    0.053519424
16                0.80           19.99           0.855745    0.055744815
17                0.85           21.70           0.877763    0.027763416
18                0.90           22.86           0.890754    0.009246223
19                0.95           26.44           0.922765    0.027235269
20                1.00           34.44           0.964412    0.035587704
Maximum K-S difference = 0.124119174 (vehicle 12, headway 6.67 sec)
The calculated K-S difference must be compared with the value from a table of critical
K-S differences. For a sample size of 20, we obtain K-S_table = 0.294 at the specified
level of significance. Because the maximum calculated difference (about 0.124) does not
exceed this value, we cannot conclude that the data differ from the exponential
distribution.
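The table calculation can be reproduced with the sketch below, which is not part of the original notes. It recomputes λ from the rounded headways as listed, so the last digits may differ slightly from the tabulated values. It also evaluates the difference just before each step of the empirical distribution, at (i-1)/n, which the usual two-sided K-S statistic includes but the table above does not; that extra check is an addition for comparison.

```python
import math

headways = [11.43, 21.70, 5.82, 6.18, 1.58, 22.86, 4.46, 5.92, 6.67, 2.00,
            9.22, 26.44, 3.60, 2.78, 12.31, 19.99, 2.27, 1.56, 34.44, 5.27]

n = len(headways)
lam = n / sum(headways)          # lambda = 1 / mean headway

def exp_cdf(t):
    """Exponential cumulative distribution F(t) = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-lam * t)

xs = sorted(headways)

# Difference used in the table above: |i/n - F(x_i)| at each ordered observation
d_table = max(abs((i + 1) / n - exp_cdf(x)) for i, x in enumerate(xs))

# Difference just before each step: |F(x_i) - (i-1)/n|
d_before = max(abs(exp_cdf(x) - i / n) for i, x in enumerate(xs))

d_stat = max(d_table, d_before)  # two-sided K-S statistic
ks_table = 0.294                 # tabulated value for n = 20 (from the text)
reject_h0 = d_stat > ks_table

print(f"lambda = {lam:.6f}")
print(f"max |i/n - F| = {d_table:.4f}, two-sided D = {d_stat:.4f}, reject Ho: {reject_h0}")
```

The first maximum is about 0.124, matching the table; including the (i-1)/n step raises the statistic to about 0.140, at the smallest headway. Both values are below 0.294, so the conclusion is unchanged.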
[Figure: Actual and Theoretical Cumulative Frequencies. Cumulative frequency (0 to 1.2) versus headway in seconds (0 to 40), with one series for the data and one for the exponential distribution.]