Chapter 15. Goodness of fit

Problem PS287
We construct a normal probability plot of a sample generated from a normally distributed population.
1. Report the mean 10 of a normal random variable in cell A2 and its standard deviation 2 in cell B2. Generate a random sample of 100 observations of this normal variable in the range A5:A104: =NORM.INV(RAND(); A$2; B$2). Compute the sample mean and the sample standard deviation in cells D2 and E2. Report the values 1, 2, …, 100 in the range B5:B104. Rank the sample data in the range C5:C104. Example for cell C5: =SMALL(A$5:A$104; B5) and likewise for cells C6:C104;
2. Compute the ranked expected values of the sample observations, assuming a sample from a normal population with mean 10 and standard deviation 2, in the range D5:D104. Example for cell D5: =NORM.INV((B5-0.5)/100; $A$2; $B$2) and likewise for cells D6:D104. Compute the ranked expected values of the sample observations, assuming a sample from a normal population with the sample mean (cell D2) and the sample standard deviation (cell E2) as parameters, in the range E5:E104. Example for cell E5: =NORM.INV((B5-0.5)/100; $D$2; $E$2) and likewise for cells E6:E104;
3. Construct a histogram of the sample data using bin values 6, 7, …, 14;
4. Construct a normal probability plot of the sample using a Scatter with straight lines and markers. Put the ordered sample data on the horizontal axis and the expected sample points of columns D and E on the vertical axis. Use a lower bound of 4 and an upper bound of 16 for both axes. Construct a 45° line starting at the origin of the plot (4, 4), the line of identity. Notice that both broken lines closely follow this line of identity, with the one based on the sample statistics usually giving a slightly better fit.
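The expected values in columns D and E of Problem PS287 can also be sketched outside the spreadsheet. The fragment below is a minimal Python sketch, assuming the standard library class statistics.NormalDist as a stand-in for NORM.INV. The function name expected_order_values, the seed and the variable names are hypothetical and not part of the worksheet.

```python
import random
from statistics import NormalDist, mean, stdev


def expected_order_values(n, mu, sigma, a=0.5, b=0.0):
    """Quantiles of N(mu, sigma) at the plotting positions (i - a) / (n + b).

    With the defaults a = 0.5 and b = 0 this mirrors
    =NORM.INV((B5-0.5)/100; mu; sigma) filled down for i = 1, ..., n.
    """
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i - a) / (n + b)) for i in range(1, n + 1)]


random.seed(1)  # hypothetical seed, only to make the sketch repeatable
mu, sigma, n = 10.0, 2.0, 100

# Column C: the ranked sample (the role of SMALL in the worksheet).
sample = sorted(random.gauss(mu, sigma) for _ in range(n))

# Column D: expected values under the population parameters.
expected_pop = expected_order_values(n, mu, sigma)

# Column E: expected values under the sample mean and sample standard deviation.
expected_fit = expected_order_values(n, mean(sample), stdev(sample))

# Points of the two broken lines: (ordered data, expected value).
points_pop = list(zip(sample, expected_pop))
points_fit = list(zip(sample, expected_fit))
```

Plotting points_pop and points_fit against the line of identity gives the two broken lines of Step 4. Other plotting positions can be tried by passing different values for a and b.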

Assignment PA287
To compute the expected values of the sample points, the values (i − 0.5)/n, i = 1, 2, …, n, are replaced by the somewhat better approximations (i − 0.375)/(n + 0.25), i = 1, 2, …, n. Use these approximations in the above analysis and construct the normal probability plots.

Problem PS288
1. Construct a normal probability plot as in Problem PS287 but use 8 as the value of the mean to compute the expected sample values in column D (keep the standard deviation equal to 2). Notice that the broken line lies entirely below the line of identity but more or less parallel to it, indicating that the mean estimated for the population is too small;
2. Construct a normal probability plot as in Problem PS287 but use 3 as the value of the standard deviation to compute the expected sample values in column D (keep the mean equal to 10). Notice how the broken line is steeper than the line of identity, indicating that the standard deviation estimated for the population is too large.

Assignment PA288
1. Repeat the analysis above using 12 as the mean to compute the expected sample values while keeping 2 as the standard deviation;
2. Repeat the analysis above using 1 as the standard deviation to compute the expected sample values while keeping 10 as the mean;
3. Repeat the analysis above using 9 as the mean and 3 as the standard deviation to compute the expected sample values.

Problem PS289
We construct a normal probability plot of a sample generated from a gamma distributed population.
1. Report the parameter α = 4 of a gamma random variable in cell A2 and the parameter β = 2 in cell B2. Generate a random sample of 100 observations of this gamma variable in the range A5:A104. Example for cell A5: =GAMMA.INV(RAND(); $A$2; 1/$B$2) and likewise for cells A6:A104. Compute the sample mean and the sample standard deviation in cells D2 and E2. Report the values 1, 2, …, 100 in cells B5 to B104. Rank the sample data in the range C5:C104.
Example for cell C5: =SMALL(A$5:A$104; B5) and likewise for cells C6:C104;
2. Compute the ranked expected values of the sample observations, assuming a sample from a normal population with mean μ = α/β = 4/2 = 2 and standard deviation σ = √α/β = 1, in the range D5:D104. Example for cell D5: =NORM.INV((B5-0.5)/100; $A$2/$B$2; SQRT($A$2)/$B$2) and likewise for cells D6:D104. Compute the ranked expected values of the sample observations, assuming a sample from a normal population with the sample mean (cell D2) and the sample standard deviation (cell E2) as parameters, in the range E5:E104. Example for cell E5: =NORM.INV((B5-0.5)/100; $D$2; $E$2) and likewise for cells E6:E104;
3. Construct a normal probability plot as in Step 4 of Problem PS287 but use -1 and 6 as lower and upper bounds on both axes. Construct the line of identity. Notice the typical concave shape of the plots for right-skewed data.

Assignment PA289
Apply the analysis above to data generated from a left-skewed beta distribution with parameters α = 10 and β = 2. Use 0.4 and 1 as the lower and upper bounds on the axes.

Problem PS290
We construct a probability plot to test whether a sample is generated from a gamma distributed population.
1. Report the parameters α = 4 and β = 2 of a gamma distribution in cells A2 and B2. Generate a random sample of 100 observations of this gamma variable in the range A5:A104: =GAMMA.INV(RAND(); $A$2; 1/$B$2). Compute the sample mean and the sample standard deviation in cells D2 and E2. Report the values 1, 2, …, 100 in cells B5 to B104. Rank the sample data in the range C5:C104 as in Problem PS287;
2. Compute the ranked expected values of the sample observations, assuming a sample from a gamma population with parameters α = 4 and β = 2. Example for cell D5: =GAMMA.INV((B5-0.5)/100; $A$2; 1/$B$2).
Compute the ranked expected values of the sample observations, assuming a sample from a gamma population and using the sample to estimate the parameters of the gamma density. Example for cell E5: =GAMMA.INV((B5-0.5)/100; $D$2*$D$2/($E$2*$E$2); $E$2*$E$2/$D$2);
3. Construct a probability plot as in Step 4 of Problem PS287. Use 0 as the lower bound and 6 as the upper bound on the axes. Construct the line of identity. Notice the slightly better fit when using the sample information to estimate the parameters of the gamma density.

Assignment PA290
Generate a sample of 100 values of a normal population with expected value 2 and standard deviation 1. Construct a probability plot to investigate whether this sample is taken from a gamma population with the same expected value and standard deviation. Also use the sample data to estimate the parameters of the gamma density.

Problem PS291
Consider the data set “Baguette”.
1. Check Problem PS186 for a histogram of the variable ‘weight’;
2. Construct probability plots testing for the underlying distribution of the variable ‘weight’ to be normal or gamma. Use the sample data to estimate the parameters required. Construct the line of identity. Which distribution seems to give the best fit?

Assignment PA291
Consider the data set “Baguette”. Add the variable ‘Price/100g’ to the data. Construct probability plots testing for the underlying distribution of this variable to be normal or gamma. Use the sample data to estimate the parameters required. Construct the line of identity. Which distribution seems to give the best fit?

Problem PS292
Consider the data set “Breaking strength”.
1. Construct a probability plot testing for the underlying distribution of the variable ‘strength’ to be normal. Use the sample data to estimate the mean and the standard deviation. Construct the line of identity. Notice that the sample data are far from normal;
2.
Construct a probability plot testing for the underlying distribution of the natural logarithm of the variable ‘strength’ to be normal. Use the logarithm of the sample data to estimate the mean and the standard deviation. Construct the line of identity. Notice how strongly a log transformation can normalize data.

Assignment PA292
Consider the data set “Breaking strength”. Construct a probability plot testing for the underlying distribution of the variable ‘strength’ to be gamma. Use the sample data and the method of maximum likelihood to estimate the parameters of the gamma distribution (see Problem PS242). Construct the line of identity. Notice that the gamma distribution provides a better fit than the normal. Can the fit be improved by a probability plot of the natural logarithm of the data as in Step 2 above?

Problem PS293
We apply a chi-square test to check the underlying distribution (Poisson) of sample data generated as Poisson data.
1. Generate 100 values of a Poisson random variable with expected value 4 (see Problem PP115, Step 3). Assume the values are in the range B6:B105. Compute the sample mean in cell D6;
2. Report the values 0, 1, 2, …, 7 in cells D9:D16 and >=8 in cell D17. Compute the observed frequencies from the sample in cells E9:E17. Example for cell E9: =COUNTIF(B$6:B$105; D9) and likewise for cells E10:E16. In cell E17: =COUNTIF(B$6:B$105; ">=8"). Compute the expected frequencies, assuming the sample is generated from a Poisson population with expected value 4, in cells F9:F17. Example for cell F9: =100*POISSON.DIST(D9; 4; 0) and likewise for cells F10:F16. In cell F17: =100*(1-POISSON.DIST(7; 4; 1)). Compute the chi-square values in cells G9:G17. Example for cell G9: =POWER(E9-F9; 2)/F9 and likewise for cells G10:G17. Compute the chi-square sum in cell G18: =SUM(G9:G17);
3.
Repeat the computations of columns F and G in columns I and J (same rows) but using the sample mean as expected value for the Poisson distribution;
4. Test whether the sample is generated from a Poisson distribution with expected value 4 by computing the p-value in two different ways in cells L9 and L10. In cell L9: =CHISQ.TEST(E9:E17; F9:F17). In cell L10: =CHISQ.DIST.RT(G18; 8);
5. Test whether the sample is generated from a Poisson distribution with the sample mean as expected value by computing the p-value in cell L12: =CHISQ.DIST.RT(J18; 7). Notice that the more direct approach with CHISQ.TEST as in Step 4 is not possible here because the number of degrees of freedom has to be decreased by 1 for the estimated mean.

Assignment PA293
Repeat Steps 1, 2 and 4 above, testing whether the sample values are generated from a Poisson random variable with expected value 3.

Problem PS294
Consider the data set ‘Airline’. We test whether the number of no-shows in economy class can be assumed to be Poisson distributed.
1. Compute the average number of no-shows in cell E2;
2. Report the values 0, 1, …, 6 in cells E5 to E11 and >=7 in cell E12 (no-shows of 7 and 8 are combined because of their small number). Compute the observed frequencies in the range F5:F12 and the expected frequencies in the range G5:G12 using the sample mean as expected value of the Poisson distribution (see Problem PS293). Compute the chi-square values in cells H5:H12 and their sum in cell H13;
3. Test whether the number of no-shows may be assumed to be Poisson distributed by computing the p-value in cell J5.

Assignment PA294
Apply a chi-square test to test whether the number of no-shows in business class in the data set ‘Airline’ may be assumed to be Poisson distributed.
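The chi-square computations of Problems PS293 and PS294 can be sketched in Python as follows. This is a minimal sketch using only the standard library, not the worksheet itself: the observed frequencies and the sample mean are hypothetical values, and the helper names poisson_class_probs, chi_square_sum and chi2_sf are assumptions introduced for illustration. The function chi2_sf plays the role of CHISQ.DIST.RT and is built from the standard recurrence for the upper incomplete gamma function.

```python
import math


def poisson_class_probs(lam, max_value):
    """Probabilities of the classes 0, 1, ..., max_value and '>= max_value + 1'."""
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(max_value + 1)]
    probs.append(1.0 - sum(probs))  # open-ended last class
    return probs


def chi_square_sum(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf(x, df):
    """Right-tail probability P(X > x) of a chi-square variable, for x >= 0."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # exact tail for 2 degrees of freedom
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # exact tail for 1 degree of freedom
    while a < df / 2.0:                       # each pass adds 2 degrees of freedom
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical observed frequencies for the classes 0, 1, ..., 7 and >= 8.
observed = [2, 7, 15, 19, 20, 16, 10, 6, 5]
n = sum(observed)  # 100 observations

# Expected value fixed in advance (4): 9 classes - 1 = 8 degrees of freedom.
expected = [n * p for p in poisson_class_probs(4.0, 7)]
stat = chi_square_sum(observed, expected)
p_value = chi2_sf(stat, len(observed) - 1)

# Expected value estimated from the sample: one more degree of freedom is lost.
sample_mean = 4.1  # hypothetical stand-in for the sample mean of the raw data
expected_hat = [n * p for p in poisson_class_probs(sample_mean, 7)]
stat_hat = chi_square_sum(observed, expected_hat)
p_value_hat = chi2_sf(stat_hat, len(observed) - 2)
```

The only difference between the two tests is the parameter passed to poisson_class_probs and the number of degrees of freedom, which is why a direct call to CHISQ.TEST cannot be used once the mean is estimated from the data.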

Problem PS295
The number of people killed per month in road accidents on Belgian roads in the year 2011 was as follows:

Month            January    February   March      April      May        June
Number of days   31         28         31         30         31         30
Number killed    70         64         73         61         80         66

Month            July       August     September  October    November   December
Number of days   31         31         30         31         30         31
Number killed    66         89         68         72         77         76

1. Report the months of the year in cells A1 to L1, the number killed in A2 to L2 and the number of days in the month in A3 to L3. Compute the total number of deaths in cell M2 (862), the total number of days in the year in cell M3 (365) and the average number of people killed per day in cell N2 (2.3616);
2. Assuming the number of people killed per day to be constant throughout the year, compute the expected number killed in each month in cells A4 to L4. Example for cell A4: =A3*$N$2 and likewise for the remaining months. Compute the chi-square values in cells A5 to L5. Example for cell A5: =POWER(A2-A4; 2)/A4 and likewise for the remaining cells. Compute the chi-square sum in cell M5: =SUM(A5:L5);
3. Test whether the assumption of a constant probability of being killed over the months of the year can be accepted by computing the p-value in cell O5. Answer: p-value = 0.6843 and the assumption cannot be rejected.

Assignment PA295
The frequency distribution of the number of goals scored in the English Premier League in the season 2013-2014 by the home and the visiting teams is as follows (a total of 380 matches were played):

Number of goals  Home   Visiting
0                95     137
1                113    114
2                85     76
3                49     49
4                28     10
5                5      3
6                4      1
7                1      0

Test whether the number of goals scored by the home (visiting) team may be assumed to be Poisson distributed using a chi-square test.

Problem PS296
Using a chi-square test, we investigate the normality of data generated from a normal distribution.
1. Report the mean 10 of a normal random variable in cell A2 and its standard deviation 2 in cell B2.
Generate a random sample of 100 observations of this normal variable in the range A5:A104: =NORM.INV(RAND(); A$2; B$2). Compute the sample mean and the maximum likelihood estimate of the standard deviation in cells D2 and E2. For cell E2: =STDEV.P(A5:A104);
2. Report the bin values 6 to 14 in cells C6 to C14 and the observed frequencies in cells D6 to D15. Example for cell D6: =COUNTIF(A$5:A$104; "<="&C6), for cell D7: =COUNTIF(A$5:A$104; "<="&C7)-COUNTIF(A$5:A$104; "<="&C6) and likewise for cells D8 to D14. For cell D15: =COUNTIF(A$5:A$104; ">"&C14). Compute the expected frequencies in cells E6 to E15. Example for cell E6: =100*NORM.DIST(C6; $A$2; $B$2; 1), for cell E7: =100*(NORM.DIST(C7; $A$2; $B$2; 1)-NORM.DIST(C6; $A$2; $B$2; 1)) and likewise for cells E8 to E14. For cell E15: =100*(1-NORM.DIST(C14; A$2; B$2; 1)). Compute the chi-square terms in cells F6 to F15. Example for cell F6: =POWER(D6-E6; 2)/E6 and likewise for cells F7 to F15. Compute the sum of the chi-square values in cell F16;
3. Compute the p-value in cell H6: =CHISQ.TEST(D6:D15; E6:E15) and an alternative computation in cell H7: =CHISQ.DIST.RT(F16; 9);
4. Repeat Step 2 using the sample estimates in cells D2 and E2 to compute the expected frequencies. Use CHISQ.DIST.RT with 7 degrees of freedom.

Assignment PA296
1. Generate normal data as in Step 1 above;
2. Apply Steps 2 and 3 above but use an expected value of 9 and a standard deviation of 2 to compute the expected frequencies;
3. Apply Steps 2 and 3 above but use an expected value of 10 and a standard deviation of 3 to compute the expected frequencies.

Problem PS297
Using a chi-square test, we investigate the normality of data generated from a gamma distribution.
1. Generate a sample from a gamma distributed population as in Step 1 of Problem PS289 (parameters 4 and 2).
Compute the sample mean and the (population) standard deviation in cells D2 and E2;
2. Work as in Step 2 of Problem PS296. Use bin values 0.4, 0.8, …, 3.6, 4 (11 intervals). To compute the expected frequencies, use expected value 2 and standard deviation 1 for the normal distribution. Compute p-values as in Step 3 of Problem PS296. Use 10 degrees of freedom in the function CHISQ.DIST.RT;
3. Repeat Step 2 using the sample estimates in cells D2 and E2 to compute the expected frequencies. Use CHISQ.DIST.RT with 8 degrees of freedom.

Assignment PA297
Use a chi-square test to investigate whether data generated from a gamma distribution are gamma distributed. Apply the three steps above but use the gamma distribution to compute the expected frequencies.

Problem PS298
Consider the data set ‘Baguette’.
1. Add the price per 100g in column F. Compute its average value in cell G2 and the maximum likelihood estimate of its standard deviation in cell H2: =STDEV.P(F2:F75);
2. To test the sample of price/100g for normality, use 7 classes: use the value 0.243 as upper bound of the first class in cell G4 and a class width of 0.07, hence an upper bound of 0.593 for the last class but one in cell G9. Compute the observed and the expected frequencies in columns H and I as in Step 2 of Problem PS296. Compute the chi-square values in cells J4 to J10 and their sum in cell J11. Answer: 6.3318. Compute the p-value in cell J13: =CHISQ.DIST.RT(J11; 4). Answer: 0.1757;
3. Test whether the price/100g may be assumed to be gamma distributed. Use the same classes as in Step 2. Compute the expected frequencies in the range K4:K10. Use the sample mean and the sample standard deviation of cells G2 and H2 to estimate the parameters of the gamma distribution (this is not quite correct in that these estimates are not the maximum likelihood estimates of the gamma parameters). Example for cell K4: =74*GAMMA.DIST(G4; G$2*G$2/(H$2*H$2); H$2*H$2/G$2; 1).
Compute the chi-square values in cells L4 to L10 and their sum in cell L11. Answer: sum = 3.7618. Compute the p-value in cell L13: =CHISQ.DIST.RT(L11; 4). Answer: 0.4392.

Assignment PA298
Use a chi-square test to investigate whether the variable ‘weight’ in the data set ‘Baguette’ is normally distributed. Repeat the test for the gamma distribution.

Problem PS299
Consider the data set “Breaking strength”.
1. Compute the mean and the maximum likelihood estimate of the standard deviation of the variable “Breaking strength” in cells F1 and F2. Test the hypothesis that breaking strength is normally distributed. Use 8 classes with the upper bound of the first class equal to 6. Answer: chi-square sum = 41.6843, p-value = 6.8234E-08;
2. Compute the logarithm of breaking strength in column B, its mean in cell J1 and the maximum likelihood estimate of its standard deviation in cell J2. Test the hypothesis that the logarithm of breaking strength is normally distributed. Use 8 classes with the upper bound of the first class equal to 1.5. Answer: chi-square sum = 2.0291, p-value = 0.8451.

Assignment PA299
Consider the data set “Decatlon2011”. Use a chi-square test to test the normality of the times realized in the 100 meter dash.
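
The chi-square test for normality used in Problems PS296 to PS299 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the worksheet itself: the data are simulated with a hypothetical seed, statistics.pstdev stands in for STDEV.P, and the helper names class_counts, normal_class_probs, chi2_sf and chi_square_test are introduced for illustration only.

```python
import math
import random
from bisect import bisect_left
from statistics import NormalDist, mean, pstdev


def class_counts(data, upper_bounds):
    """Observed frequencies for the classes (-inf, b1], (b1, b2], ..., (bk, +inf)."""
    counts = [0] * (len(upper_bounds) + 1)
    for x in data:
        counts[bisect_left(upper_bounds, x)] += 1
    return counts


def normal_class_probs(upper_bounds, mu, sigma):
    """Probability of each class under a normal distribution N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    cdf = [0.0] + [dist.cdf(b) for b in upper_bounds] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def chi2_sf(x, df):
    """Right-tail probability P(X > x) of a chi-square variable, for x >= 0."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # exact tail for 2 degrees of freedom
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # exact tail for 1 degree of freedom
    while a < df / 2.0:                       # each pass adds 2 degrees of freedom
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_test(observed, probs, df):
    """Chi-square sum and p-value for observed counts against class probabilities."""
    n = sum(observed)
    stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))
    return stat, chi2_sf(stat, df)


random.seed(2)  # hypothetical seed, only to make the sketch repeatable
data = [random.gauss(10.0, 2.0) for _ in range(100)]
bounds = list(range(6, 15))            # bin values 6, 7, ..., 14 give 10 classes
observed = class_counts(data, bounds)

# Parameters specified in advance: 10 classes - 1 = 9 degrees of freedom.
stat_known, p_known = chi_square_test(
    observed, normal_class_probs(bounds, 10.0, 2.0), 9)

# Parameters estimated from the sample: 10 classes - 1 - 2 = 7 degrees of freedom.
stat_est, p_est = chi_square_test(
    observed, normal_class_probs(bounds, mean(data), pstdev(data)), 7)
```

Assuming 8 − 1 − 2 = 5 degrees of freedom for the tests of Problem PS299, the value chi2_sf(2.0291, 5) can be compared with the p-value 0.8451 reported there.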