Ch15. Goodness of fit

Chapter 15. Goodness of fit
Problem PS287
We construct a normal probability plot of a sample generated from a normally distributed
population.
1. Report the mean 10 of a normal random variable in cell A2, its standard deviation 2
in cell B2. Generate a random sample of 100 observations of this normal variable in
the range A5:A104: =NORM.INV(RAND(); A$2; B$2). Compute the sample mean
and the sample standard deviation in cells D2 and E2. Report the values 1, 2, …, 100
in the range B5:B104. Rank the sample data in the range C5:C104. Example for cell
C5: =SMALL(A$5:A$104; B5) and likewise for cells C6:C104;
2. Compute the ranked expected values of the sample observations assuming a sample
from a normal population with the parameters 10 as mean and 2 as standard
deviation in the range D5:D104. Example for cell D5:
=NORM.INV((B5 - 0.5)/100; $A$2; $B$2) and likewise for cells D6:D104. Compute
the ranked expected values of the sample observations assuming a sample from a
normal population with the sample mean (cell D2) and the sample
standard deviation (cell E2) as parameters in the range E5:E104. Example for cell E5:
=NORM.INV((B5 - 0.5)/100; $D$2; $E$2) and likewise for cells E6:E104;
3. Construct a histogram of the sample data using bin values 6, 7, …, 14;
4. Construct a normal probability plot of the sample using a Scatter with straight lines
and markers. Put the ordered sample data on the horizontal axis and the expected
sample points in both columns D and E on the vertical axis. Use a lower bound of 4
and an upper bound of 16 for both axes. Construct a 45° line through the
point (4, 4), the line of identity. Notice that both broken lines follow this line
of identity closely, with the one based on the sample statistics usually a slightly better fit.
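The spreadsheet steps above can also be sketched outside Excel. The following Python sketch uses only the standard library and mirrors columns A to E. The seed and all variable names are illustrative assumptions, and `random.gauss` stands in for =NORM.INV(RAND(); ...), which draws from the same distribution.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)                      # arbitrary seed (assumption), for reproducibility
mu, sigma, n = 10, 2, 100           # cells A2, B2 and the sample size

# Columns A and C: generate the sample and sort it (SMALL(...; j) for j = 1..n)
sample = sorted(random.gauss(mu, sigma) for _ in range(n))

# Plotting positions (j - 0.5)/n for j = 1..n
probs = [(j - 0.5) / n for j in range(1, n + 1)]

# Column D: expected values under the population parameters
expected_pop = [NormalDist(mu, sigma).inv_cdf(p) for p in probs]

# Column E: expected values under the sample mean and sample standard deviation
m, s = mean(sample), stdev(sample)
expected_fit = [NormalDist(m, s).inv_cdf(p) for p in probs]

# First and last points of both broken lines: (sample, column D, column E)
print(sample[0], expected_pop[0], expected_fit[0])
print(sample[-1], expected_pop[-1], expected_fit[-1])
```

Plotting `sample` against `expected_pop` and `expected_fit` gives the two broken lines of Step 4.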
Assignment PA287
To compute the expected values of the sample points the values (𝑗 − 0.5)⁄𝑛 , 𝑗 = 1, 2, … , 𝑛
are replaced by the somewhat better approximations (𝑗 − 0.375)⁄(𝑛 + 0.25), 𝑗 = 1,2, … , 𝑛.
Use these approximations in the above analysis and construct the normal probability plots.
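Both sets of probabilities can be written as (j - a)/(n + 1 - 2a): a = 0.5 gives (j - 0.5)/n and a = 0.375 gives (j - 0.375)/(n + 0.25). A minimal Python sketch of this comparison follows. The helper `plotting_positions` and its parameter `a` are names introduced here for illustration; the one-parameter form is a convenience, not something the assignment requires.

```python
from statistics import NormalDist

def plotting_positions(n, a):
    """Return the probabilities (j - a)/(n + 1 - 2a) for j = 1..n."""
    return [(j - a) / (n + 1 - 2 * a) for j in range(1, n + 1)]

n = 100
simple = plotting_positions(n, 0.5)      # (j - 0.5)/n
refined = plotting_positions(n, 0.375)   # (j - 0.375)/(n + 0.25)

z = NormalDist()  # standard normal distribution
# The two choices differ mostly at the extreme ranks
print(z.inv_cdf(simple[0]), z.inv_cdf(refined[0]))
print(z.inv_cdf(simple[-1]), z.inv_cdf(refined[-1]))
```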
Problem PS288
1. Construct a normal probability plot as in Problem PS287 but use 8 as value of the
mean to compute the expected sample values in column D (keep the standard
deviation equal to 2). Notice that the broken line lies entirely below the line of
identity but runs more or less parallel to it, indicating that the mean assumed for the
population is too small;
2. Construct a normal probability plot as in Problem PS287 but use 3 as value of the
standard deviation to compute the expected sample values in column D (keep the
mean equal to 10). Notice how the broken line is steeper than the line of identity,
indicating that the standard deviation assumed for the population is too large.
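The shift and the change in slope follow from the fact that a normal quantile equals the mean plus the standard deviation times the standard normal quantile. The Python sketch below illustrates this with the parameter values of the two steps; the function and variable names are illustrative assumptions.

```python
from statistics import NormalDist

n = 100
# Standard normal quantiles at the probabilities (j - 0.5)/n
z = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

def expected(mu, sigma):
    # NORM.INV(p; mu; sigma) equals mu + sigma * (standard normal quantile)
    return [mu + sigma * zj for zj in z]

base = expected(10, 2)       # correct parameters
low_mean = expected(8, 2)    # Step 1: mean too small
wide_sd = expected(10, 3)    # Step 2: standard deviation too large

# Vertical shift between the correct and the low-mean expected values
shift = [b - l for b, l in zip(base, low_mean)]
# Ratio of the overall slopes of the wide-sd and the correct expected values
slope = (wide_sd[-1] - wide_sd[0]) / (base[-1] - base[0])
print(min(shift), max(shift), slope)
```

Every expected value in Step 1 is lower by the difference of the two means, which gives a parallel line, and in Step 2 the expected values are stretched by the ratio of the two standard deviations, which gives a steeper line.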
Assignment PA288
1. Repeat the analysis above using 12 as the mean to compute the expected sample
values while keeping 2 as the standard deviation;
2. Repeat the analysis above using 1 as the standard deviation to compute the expected
sample values while keeping 10 as the mean;
3. Repeat the analysis above using 9 as the mean and 3 as the standard deviation to
compute the expected sample values.
Problem PS289
We construct a normal probability plot of a sample generated from a gamma distributed
population.
1. Report the parameter 𝛼 = 4 of a gamma random variable in cell A2, the parameter
value 𝛽 = 2 in cell B2. Generate a random sample of 100 observations of this gamma
variable in the range A5:A104. Example for cell A5:
=GAMMA.INV(RAND(); $A$2; 1/$B$2) and likewise for cells A6:A104.
Compute the sample mean and the sample standard deviation in cells D2 and E2.
Report the values 1, 2, …, 100 in cells B5 to B104. Rank the sample data in the
range C5:C104. Example for cell C5: =SMALL(A$5:A$104; B5) and likewise for cells
C6:C104;
2. Compute the ranked expected values of the sample observations assuming a sample
from a normal population with mean 𝜇 = 𝛼/𝛽 = 4/2 = 2 and standard deviation
𝜎 = SQRT(𝛼)/𝛽 = 1 in the range D5:D104. Example for cell D5:
=NORM.INV((B5 - 0.5)/100; $A$2/$B$2; SQRT($A$2)/$B$2) and likewise for cells
D6:D104. Compute the ranked expected values of the sample observations assuming
a sample from a normal population with the sample mean (cell D2) and
the sample standard deviation (cell E2) as parameters in the range E5:E104. Example for cell E5:
=NORM.INV((B5 - 0.5)/100; $D$2; $E$2) and likewise for cells E6:E104;
3. Construct a normal probability plot as in Step 4 of Problem PS287 but use -1 and 6
as lower and upper bounds on both axes. Construct the line of identity. Notice the
typical concave shape of the plots for right-skewed data.
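As a rough illustration of why the plot bends, the Python sketch below generates a gamma sample and the expected values under the normal model of Step 2. The seed and names are assumptions made for this sketch. Note that `random.gammavariate` takes a scale parameter, so 1/𝛽 is passed, just as in the GAMMA.INV formula.

```python
import random
from statistics import NormalDist

random.seed(2)                       # arbitrary seed (assumption)
alpha, beta, n = 4, 2, 100           # cells A2 and B2; beta acts as a rate here

# random.gammavariate expects a scale parameter, hence 1/beta
sample = sorted(random.gammavariate(alpha, 1 / beta) for _ in range(n))

mu, sigma = alpha / beta, alpha ** 0.5 / beta     # mean 2 and standard deviation 1
expected = [NormalDist(mu, sigma).inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

# Lower tail: the normal model expects negative values, a gamma sample has none
print(sample[0], expected[0])
# Upper tail: compare the largest observation with its expected normal value
print(sample[-1], expected[-1])
```

The smallest expected normal value is negative while every gamma observation is positive, so the first point of the broken line lies below the line of identity.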
Assignment PA289
Apply the analysis above to data generated from a left-skewed beta distribution with
parameters 𝛼 = 10 and 𝛽 = 2. Use 0.4 and 1 as the lower and upper bounds on the axes.
Problem PS290
We construct a probability plot to test whether a sample is generated from a gamma
distributed population.
1. Report the parameters of a gamma distribution 𝛼 = 4 and 𝛽 = 2 in cells A2 and B2.
Generate a random sample of 100 observations of this gamma variable in the range
A5:A104: =GAMMA.INV(RAND(); $A$2; 1/$B$2).
Compute the sample mean and the sample standard deviation in cells D2 and E2.
Report the values 1, 2, …, 100 in cells B5 to B104. Rank the sample data in the
range C5:C104 as in Problem PS287;
2. Compute the ranked expected values of the sample observations assuming a sample
from a gamma population with parameters 𝛼 = 4 and 𝛽 = 2. Example for cell D5:
=GAMMA.INV((B5 - 0.5)/100; $A$2; 1/$B$2).
Compute the ranked expected values of the sample observations assuming a sample
from a gamma population using the sample to estimate the parameters of the gamma
density. Example for cell E5:
=GAMMA.INV((B5 - 0.5)/100; $D$2*$D$2/($E$2*$E$2); $E$2*$E$2/$D$2);
3. Construct a probability plot as in Step 4 of Problem PS287. Use 0 as the lower bound
on the axes and 6 as upper bound. Construct the line of identity. Notice the slightly
better fit when using the sample information to estimate the parameters of the
gamma density.
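The two parameter expressions in the formula for cell E5 are moment estimates: a gamma distribution with shape k and scale s has mean k·s and variance k·s², so k = mean²/variance and s = variance/mean. A small Python sketch of this step follows; the function name is a hypothetical helper introduced here.

```python
def gamma_moment_estimates(mean, sd):
    """Return (shape, scale) of the gamma distribution with the given mean and sd."""
    shape = mean * mean / (sd * sd)   # $D$2*$D$2/($E$2*$E$2)
    scale = sd * sd / mean            # $E$2*$E$2/$D$2
    return shape, scale

# With the population values mean 2 and standard deviation 1
shape, scale = gamma_moment_estimates(2.0, 1.0)
print(shape, 1 / scale)
```

For the population mean 2 and standard deviation 1 this recovers 𝛼 = 4 and 𝛽 = 2 (the rate 𝛽 is the reciprocal of the scale).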
Assignment PA290
Generate a sample of 100 values of a normal population with expected value 2 and standard
deviation 1. Construct a probability plot to investigate whether this sample is taken from a
gamma population with the same expected value and standard deviation. Also use the
sample data to estimate the parameters of the gamma density.
Problem PS291
Consider the data set “Baguette”.
1. Check Problem PS186 for a histogram of the variable ‘weight’;
2. Construct probability plots testing for the underlying distribution of the variable
‘weight’ to be normal or gamma. Use the sample data to estimate the parameters
required. Construct the line of identity. Which distribution seems to give the best fit?
Assignment PA291
Consider the data set “Baguette”. Add the variable ‘Price/100g’ to the data. Construct
probability plots testing for the underlying distribution of this variable to be normal or
gamma. Use the sample data to estimate the parameters required. Construct the line of
identity. Which distribution seems to give the best fit?
Problem PS292
Consider the data set “Breaking strength”.
1. Construct a probability plot testing for the underlying distribution of the variable
‘strength’ to be normal. Use the sample data to estimate the mean and the standard
deviation. Construct the line of identity. Notice that the sample data are far from
normal;
2. Construct a probability plot testing for the underlying distribution of the natural
logarithm of the variable ‘strength’ to be normal. Use the logarithm of the sample
data to estimate the mean and the standard deviation. Construct the line of identity.
Notice how a log transformation can strongly normalize data.
Assignment PA292
Consider the data set “Breaking strength”. Construct a probability plot testing for the
underlying distribution of the variable ‘strength’ to be gamma. Use the sample data and the
method of maximum likelihood to estimate the parameters of the gamma distribution (see
Problem PS242). Construct the line of identity. Notice that the gamma distribution provides
a better fit than the normal. Can the fit be improved by a probability plot of the natural
logarithm of the data as in Step 2 above?
Problem PS293
We apply a chi-square test to check the underlying distribution (Poisson) of sample data
generated as Poisson data.
1. Generate 100 values of a Poisson random variable with expected value 4 (see
Problem PP115, Step 3). Assume the values in the range B6:B105. Compute the
sample mean in cell D6;
2. Report the values 0, 1, 2, …, 7 in cells D9:D16 and >=8 in cell D17. Compute the
observed frequencies from the sample in cells E9:E17. Example for cell E9:
=COUNTIF(B$6:B$105; D9) and likewise for cells E10:E16. In cell E17:
=COUNTIF(B$6:B$105; ">=8"). Compute the expected frequencies assuming the
sample is generated from a Poisson population with expected value 4 in cells F9:F17.
Example for cell F9: =100*POISSON.DIST(D9; 4; 0) and likewise for cells
F10:F16. In cell F17: =100*(1 - POISSON.DIST(7; 4; 1)). Compute the chi-square
values in cells G9:G17. Example for cell G9: =POWER(E9 - F9; 2)/F9 and likewise
for cells G10:G17. Compute the chi-square sum in cell G18: =SUM(G9:G17);
3. Repeat the computations of columns F and G in columns I and J (same rows) but
using the sample mean as expected value for the Poisson distribution;
4. Test whether the sample is generated from a Poisson distribution with expected
value 4 by computing the p-value in two different ways in cells L9 and L10. In cell L9:
=CHISQ.TEST(E9:E17; F9:F17). In cell L10: =CHISQ.DIST.RT(G18; 8);
5. Test whether the sample is generated from a Poisson distribution with the sample
mean as expected value by computing the p-value in cell L12:
=CHISQ.DIST.RT(J18; 7). Notice that the more direct approach with CHISQ.TEST as in Step 4 is not
possible here because the number of degrees of freedom has to be decreased by 1 for the estimated mean.
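The computations of columns F and G can be sketched in Python as follows. The observed frequencies in this sketch are made up purely for illustration (they are not the result of any sample in the text), and the helper name `poisson_pmf` is an assumption.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poisson probability P(X = k), like POISSON.DIST(k; lam; 0)."""
    return exp(-lam) * lam ** k / factorial(k)

lam, n = 4, 100
# Expected frequencies for the classes 0, 1, ..., 7 (cells F9:F16)
expected = [n * poisson_pmf(k, lam) for k in range(8)]
# Class ">= 8" (cell F17): 100 * (1 - P(X <= 7))
expected.append(n - sum(expected))

# Hypothetical observed frequencies (made up for illustration; they sum to 100)
observed = [2, 7, 15, 20, 19, 16, 10, 6, 5]

terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]   # cells G9:G17
chi_square = sum(terms)                                          # cell G18
print([round(e, 2) for e in expected], round(chi_square, 4))
```

Replacing `lam` by the sample mean gives the counterpart of columns I and J.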
Assignment PA293
Repeat Steps 1, 2 and 4 above testing whether the sample values are generated from a
Poisson random variable with expected value 3.
Problem PS294
Consider the data set ‘Airline’. We test whether the number of no-shows in economy class
can be assumed to be Poisson distributed.
1. Compute the average number of no-shows in cell E2;
2. Report the values 0, 1, …, 6 in cells E5 to E11 and >=7 in cell E12 (no-shows of 7 and 8
will be combined because of their small number). Compute the observed frequencies
in the range F5:F12 and the expected frequencies in the range G5:G12 using the
sample mean as expected value of the Poisson distribution (see Problem PS293).
Compute the chi-square values in cells H5:H12 and their sum in cell H13;
3. Test whether the number of no-shows may be assumed to be Poisson distributed by
computing the p-value in cell J5.
Assignment PA294
Apply a chi-square test to test whether the number of no-shows in business class in the data
set ‘Airline’ may be assumed to be Poisson distributed.
Problem PS295
The number of people killed per month in road accidents on Belgian roads in the year 2011
was as follows:
Month        Number days   Number killed
January      31            70
February     28            64
March        31            73
April        30            61
May          31            80
June         30            66
July         31            66
August       31            89
September    30            68
October      31            72
November     30            77
December     31            76
1. Report the months of the year in cells A1 to L1, the number killed in A2 to L2 and the
number of days in the month in A3 to L3. Compute the total number of deaths in cell
M2 (862), the total number of days in the year in cell M3 (365) and the average
number of people killed per day in cell N2 (2.3616);
2. Assuming the number of people killed per day to be constant throughout the year, compute
the expected number killed in each month in cells A4 to L4. Example for cell A4:
=A3*$N$2 and likewise for the remaining months. Compute chi-square values in
cells A5 to L5. Example for cell A5: =POWER(A2 - A4; 2)/A4 and likewise for the
remaining cells. Compute the chi-square sum in cell M5: =SUM(A5:L5);
3. Test whether the assumption of a constant probability of being killed over the
months of the year can be accepted by computing the p-value in cell O5.
Answer: p-value = 0.6843 and the assumption cannot be rejected.
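The whole computation can be checked with a short Python sketch using the data of the table. The number of degrees of freedom is not stated in the problem; the value 10 used below is an assumption inferred from the printed answer, and `chi2_sf_even` is a hypothetical helper that relies on a closed form valid only for an even number of degrees of freedom.

```python
from math import exp, factorial

killed = [70, 64, 73, 61, 80, 66, 66, 89, 68, 72, 77, 76]
days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

rate = sum(killed) / sum(days)                   # cell N2: killed per day
expected = [d * rate for d in days]              # cells A4:L4
chi_square = sum((o - e) ** 2 / e for o, e in zip(killed, expected))   # cell M5

def chi2_sf_even(x, df):
    """Right-tail chi-square probability; closed form valid for even df only."""
    h = x / 2
    return exp(-h) * sum(h ** k / factorial(k) for k in range(df // 2))

# df = 10 is an assumption inferred from the printed answer
p_value = chi2_sf_even(chi_square, 10)
print(round(chi_square, 4), round(p_value, 4))
```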
Assignment PA295
The frequency distribution of the number of goals scored in the English Premier League in the
2013-2014 season by the home and the visiting teams is as follows (a total of 380 matches were
played):
Number of goals   Home   Visiting
0                 95     137
1                 113    114
2                 85     76
3                 49     49
4                 28     10
5                 5      3
6                 4      1
7                 1      0
Test whether the number of goals scored by the home (visiting) team may be assumed to be
Poisson distributed using a chi-square test.
Problem PS296
Using a chi-square test, we investigate the normality of data generated from a normal
distribution.
1. Report the mean 10 of a normal random variable in cell A2, its standard deviation 2
in cell B2. Generate a random sample of 100 observations of this normal variable in
the range A5:A104: =NORM.INV(RAND(); A$2; B$2). Compute the sample mean
and the maximum likelihood estimate of the standard deviation in cells D2 and E2.
For cell E2: =STDEV.P(A5:A104);
2. Report bin values 6 to 14 in cells C6 to C14, the observed frequencies in cells D6 to
D15. Example for cell D6: =COUNTIF(A$5:A$104; "<="&C6), for cell D7:
=COUNTIF(A$5:A$104; "<="&C7) - COUNTIF(A$5:A$104; "<="&C6) and likewise
for cells D8 to D14. For cell D15: =COUNTIF(A$5:A$104; ">"&C14). Compute the
expected frequencies in cells E6 to E15. Example for cell E6:
=100*NORM.DIST(C6; $A$2; $B$2; 1), for cell E7:
=100*(NORM.DIST(C7; $A$2; $B$2; 1) - NORM.DIST(C6; $A$2; $B$2; 1)) and likewise
for cells E8 to E14. For cell E15: =100*(1 - NORM.DIST(C14; $A$2; $B$2; 1)). Compute
the chi-square terms in cells F6 to F15. Example for cell F6:
=POWER(D6 - E6; 2)/E6 and likewise for cells F7 to F15. Compute the sum of the
chi-square values in cell F16;
3. Compute the p-value in cell H6: =CHISQ.TEST(D6:D15; E6:E15) and an
alternative computation in cell H7: =CHISQ.DIST.RT(F16; 9);
4. Repeat Step 2 using the sample estimates in cells D2 and E2 to compute the expected
frequencies. Use CHISQ.DIST.RT with 7 degrees of freedom.
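The expected frequencies of Step 2 are differences of the normal distribution function at consecutive bin values. A minimal Python sketch of column E follows; the variable names and the helper `chi_square_sum` are illustrative assumptions.

```python
from statistics import NormalDist

mu, sigma, n = 10, 2, 100
bins = list(range(6, 15))                 # bin values 6, 7, ..., 14 (cells C6:C14)
cdf = [NormalDist(mu, sigma).cdf(b) for b in bins]

# Class probabilities for (-inf, 6], (6, 7], ..., (13, 14], (14, inf)
probs = [cdf[0]] + [hi - lo for lo, hi in zip(cdf, cdf[1:])] + [1 - cdf[-1]]
expected = [n * p for p in probs]         # cells E6:E15

def chi_square_sum(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes (cell F16)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(e, 2) for e in expected])
```

Passing the observed frequencies of column D together with `expected` to `chi_square_sum` gives the value of cell F16.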
Assignment PA296
1. Generate normal data as in Step 1 above;
2. Apply Steps 2 and 3 above but use an expected value of 9 and a standard deviation of
2 to compute expected frequencies;
3. Apply Steps 2 and 3 above but use an expected value of 10 and a standard deviation
of 3 to compute expected frequencies.
Problem PS297
Using a chi-square test, we investigate the normality of data generated from a gamma
distribution.
1. Generate a sample from a gamma distributed population as in Step 1 of Problem
PS289 (parameters 4 and 2). Compute the sample mean and the (population) standard
deviation in cells D2 and E2;
2. Work as in Step 2 of Problem PS296. Use bin values 0.4, 0.8, …, 3.6, 4 (11 intervals).
To compute expected frequencies use expected value 2 and standard deviation 1 for
the normal distribution. Compute p-values as in Step 3 in Problem PS296. Use 10
degrees of freedom in the function CHISQ.DIST.RT;
3. Repeat Step 2 using the sample estimates in cells D2 and E2 to compute the expected
frequencies. Use CHISQ.DIST.RT with 8 degrees of freedom.
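The degrees of freedom quoted in Problems PS293, PS296 and PS297 are consistent with the usual rule: the number of classes minus one, minus the number of parameters estimated from the sample. As a worked check (the symbols ν, k and m are notation introduced here, not the text's):

```latex
\nu = k - 1 - m
\qquad
\begin{array}{llll}
\text{PS293:} & k = 9,  & m = 0 \Rightarrow \nu = 8,  & m = 1 \Rightarrow \nu = 7 \\
\text{PS296:} & k = 10, & m = 0 \Rightarrow \nu = 9,  & m = 2 \Rightarrow \nu = 7 \\
\text{PS297:} & k = 11, & m = 0 \Rightarrow \nu = 10, & m = 2 \Rightarrow \nu = 8
\end{array}
```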
Assignment PA297
Use a chi-square test to investigate whether data generated from a gamma distribution are
gamma distributed. Apply the three steps above but use the gamma distribution to compute
expected frequencies.
Problem PS298
Consider the data set ‘Baguette’.
1. Add the price per 100g in column F. Compute its average value in cell G2 and the
maximum likelihood estimate of its standard deviation in cell H2: =STDEV.P(F2:F75);
2. To test the sample of price/100g for normality, use 7 classes: use the value 0.243 as
upper bound of the first class in cell G4 and a class width of 0.07, hence an upper
bound of 0.593 for the last class but one in cell G9. Compute the observed and the
expected frequencies in columns H and I as in Step 2 of Problem PS296.
Compute the chi-square values in cells J4 to J10 and their sum in cell J11.
Answer: 6.3318.
Compute the p-value in cell J13: =CHISQ.DIST.RT(J11; 4).
Answer: 0.1757;
3. Test whether the price/100g may be assumed to be gamma distributed. Use the
same classes as in Step 2. Compute the expected frequencies in the range K4:K10.
Use the sample mean and the sample standard deviation of cells G2 and H2 to
estimate the parameters of the gamma distribution (this is not quite correct in that
these estimates do not result in maximum likelihood estimates of the gamma
parameters). Example for cell K4:
=74*GAMMA.DIST(G4; G$2*G$2/(H$2*H$2); H$2*H$2/G$2; 1).
Compute the chi-square values in cells L4 to L10 and their sum in cell L11.
Answer: sum = 3.7618.
Compute the p-value in cell L13: =CHISQ.DIST.RT(L11; 4).
Answer: 0.4392.
Assignment PA298
Use a chi-square test to investigate whether the variable ‘weight’ in the data set ‘Baguette’ is
normally distributed. Repeat the test for the gamma distribution.
Problem PS299
Consider the data set “Breaking strength”.
1. Compute the mean and the maximum likelihood estimate of the standard deviation of
the variable “Breaking strength” in cells F1 and F2. Test the hypothesis that breaking
strength is normally distributed. Use 8 classes with upper bound of the first class
equal to 6.
Answer: chi-square sum = 41.6843, p-value = 6.8234E-08;
2. Compute the logarithm of breaking strength in column B, its mean in cell J1 and the
maximum likelihood estimate of its standard deviation in cell J2. Test the
hypothesis that the logarithm of breaking strength is normally distributed. Use 8
classes with upper bound of the first class equal to 1.5.
Answer: chi-square sum = 2.0291, p-value = 0.8451.
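The p-values above can be cross-checked without Excel. The sketch below is a standard-library Python stand-in for CHISQ.DIST.RT with an integer number of degrees of freedom; the function name `chi2_sf` is a hypothetical helper. The 5 degrees of freedom (8 classes, minus 1, minus 2 estimated parameters) are an assumption, since the problem does not state them.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Right-tail probability of a chi-square variable with integer df."""
    h = x / 2
    # Start from Q(1, h) = exp(-h) for even df, or Q(1/2, h) = erfc(sqrt(h)) for odd df
    a, q = (1.0, exp(-h)) if df % 2 == 0 else (0.5, erfc(sqrt(h)))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1)
    while a < df / 2:
        q += h ** a * exp(-h) / gamma(a + 1)
        a += 1
    return q

# 8 classes and 2 estimated parameters: df = 8 - 1 - 2 = 5 (assumed, not stated in the problem)
print(chi2_sf(41.6843, 5))   # Step 1: breaking strength
print(chi2_sf(2.0291, 5))    # Step 2: logarithm of breaking strength
```

With 4 degrees of freedom the same function can be applied to the chi-square sums of Problem PS298.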
Assignment PA299
Consider the data set “Decatlon2011”. Use a chi-square test to test the normality of the
times realized to cover the 100 meter dash.