Module H2 Practical 16
Goodness-of-fit tests

Objectives:
By the end of this practical you should be able to:
• carry out a goodness-of-fit test to determine whether a measurement variable follows a Poisson, binomial or normal distribution
• take appropriate action to improve the chi-square approximation
• present and state conclusions emerging from a chi-square test.

1. The aim of this exercise is to demonstrate how goodness-of-fit tests can be carried out using Excel facilities and to provide some revision on the Poisson distribution.

Open the Excel workbook named H2_data.xls and move to the sheet named KiliWomenPoisson. In this worksheet, data on household size for women-headed households from the Kilimanjaro region of Tanzania (see sheet KilijaroWomen) have been summarised into a frequency table. The variables are described below:

X = HHsize – the random variable representing household size, i.e. the number of persons living in the household.
obsfreq – the observed number of households, out of the 94 sampled, that are of the given HHsize.
PoiProb – the probability that a randomly selected household has size X = HHsize, assuming X has a Poisson distribution whose parameter λ is set to the sample mean, λ = 4.4.
expfreq – the expected frequency of households of size HHsize, calculated using the Poisson probability PoiProb.
Chisquare – contributions to the chi-square statistic, i.e. the values $(O - E)^2 / E$, where O is the observed frequency and E the expected frequency.

The Poisson mean (or parameter λ) is given in cell C16 with red borders, currently set at λ = 4.4. By eye, the appropriateness of the Poisson distribution can be judged by how close the bars of observed frequencies in the bar chart are to the bars of expected frequencies calculated under the Poisson assumption. More formally, the “best” value of λ is the one that minimises the chi-square test statistic shown in cell C18 with blue borders.
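For readers who want to cross-check the worksheet outside Excel, the PoiProb, expfreq and Chisquare columns described above can be sketched in Python. This is only an illustrative sketch, not the workbook's own formulae: the function names are hypothetical, and the range of household sizes (0 to 10) is an assumption, since the categories actually used in the sheet are not listed here. Only n = 94 and λ = 4.4 come from the text.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def expected_frequencies(sizes, lam, n):
    """Expected count in each category: n times the Poisson probability."""
    return [n * poisson_pmf(k, lam) for k in sizes]


def chi_square_contributions(observed, expected):
    """Per-category contributions (O - E)^2 / E to the chi-square statistic."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


N_HOUSEHOLDS = 94      # sample size stated in the practical
LAM = 4.4              # sample mean used as the Poisson parameter
SIZES = range(0, 11)   # ASSUMPTION: illustrative range of household sizes

expfreq = expected_frequencies(SIZES, LAM, N_HOUSEHOLDS)

# Print the analogue of the HHsize, PoiProb and expfreq columns.
for k, e in zip(SIZES, expfreq):
    print(f"{k:>2}  {poisson_pmf(k, LAM):.4f}  {e:6.2f}")
```

Once the obsfreq column has been read from the sheet, passing it together with `expfreq` to `chi_square_contributions` and summing the result gives the analogue of the statistic in cell C18, provided the same categories are used.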
SADC Course in Statistics Module H2 Practical 16 – Page 1

(a) First check that you understand how columns C, D and E have been created by clicking on one of the cells in each column and looking at the formula given. Note that the probability mass function of the Poisson distribution with parameter λ is given by

\[ P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, 3, \ldots \]

where k! = (k)(k-1)(k-2)…(3)(2)(1). (Note: Module H1, the prerequisite for Module H2, gives further details concerning this distribution.)

(b) Check also the formulae given in cells E14, C18 and B20. Note down these formulae below and explain how they relate to the goodness-of-fit tests explained in the lecture presentation.

(c) Explain why the degrees of freedom = 10 for the p-value in cell B20, corresponding to the chi-square statistic in cell C18.

(d) Change the mean value and note how the graph and the chi-square value change. Do you get any improvement in the closeness of the yellow and brown bars in the graph?

(e) Now change the mean value by small amounts (no more than 0.1 at a time) and see if you can make the chi-square statistic any smaller than the value χ² = 8.36 obtained when the sample mean λ = 4.4 was used.

Comment below on the ease with which a goodness-of-fit test for a Poisson distribution can be carried out using Excel.

2. The worksheet Tete-Jan-raindays in file H2-data.xls has information on the number of rain days over 5-day and 10-day periods in January (days 1-30) for the years from 1953 to 2005. Also included are data on the total annual rainfall in mm (see last column).

(a) Complete the table below to show an estimate of the chance (probability) of rain per day in each of the specified time periods in January.
Period in January     Estimated probability of rain per day
Days 1-5
Days 6-10
Days 11-15
Days 16-20
Days 21-25
Days 26-30

(b) Set up an Excel spreadsheet with observed and expected frequencies to determine whether the data on the number of rain days in the first five days of January follow a binomial distribution. Carry out a goodness-of-fit test for this purpose and comment on the results obtained.

(c) In doing the above test, you will have needed the number of years in which there were 0, 1, 2, 3, 4 or 5 rain days in the first 5 days of January. These frequencies are shown in the table below. Complete the remainder of the table to show the frequencies in the remaining 5-day periods in January.

Number of     days    days    days    days    days    days
rain days     1-5     6-10    11-15   16-20   21-25   26-30
0             9
1             7
2             14
3             15
4             6
5             1

(d) Now explore, by making appropriate substitutions in the spreadsheet you have set up, whether any of the remaining 5-day totals follow a binomial distribution. Remember that you may need to collapse some cells if the assumptions underlying the chi-square test appear invalid. Note down your conclusions.

3. IF YOU HAVE TIME, TRY ALSO THE FOLLOWING:
In the same data set you used in Question 2, the annual rainfall total was also given (in the column named TotRain). Investigate whether these data follow a normal distribution. You may begin with a normal probability plot, then proceed to carry out a goodness-of-fit test. Remember that you will have to group your data and obtain the observed frequencies in each group in order to perform your chi-square goodness-of-fit test.
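As a cross-check on the spreadsheet approach in Question 2(b), the binomial goodness-of-fit calculation for days 1-5 might be sketched in Python as follows. The observed frequencies are those tabulated in 2(c) (they sum to 52 years). Everything else is an assumption made for illustration: the function names are hypothetical, p is estimated from the data itself (so one extra degree of freedom is deducted), and cells are collapsed using the common rule of thumb that each expected frequency should be at least 5, which may differ from the criterion given in the lecture.

```python
import math

# Years with 0, 1, ..., 5 rain days in days 1-5 of January (table in 2(c)).
observed = [9, 7, 14, 15, 6, 1]
n_days = 5
n_years = sum(observed)

# Estimate the daily probability of rain as total rain days / total days.
total_rain_days = sum(k * f for k, f in enumerate(observed))
p_hat = total_rain_days / (n_days * n_years)


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) distribution."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def collapse_tails(obs, exp, min_expected=5.0):
    """Merge end cells inwards until both end cells have expected >= min_expected.

    ASSUMPTION: the 'expected >= 5' threshold is a rule of thumb.
    """
    obs, exp = list(obs), list(exp)
    while len(exp) > 2 and exp[-1] < min_expected:
        e, o = exp.pop(), obs.pop()
        exp[-1] += e
        obs[-1] += o
    while len(exp) > 2 and exp[0] < min_expected:
        e, o = exp.pop(0), obs.pop(0)
        exp[0] += e
        obs[0] += o
    return obs, exp


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x).

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Gamma(a + 1),
    starting from Q(1, z) = e^(-z) for even df or Q(1/2, z) = erfc(sqrt(z))
    for odd df, with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


expected = [n_years * binomial_pmf(k, n_days, p_hat) for k in range(n_days + 1)]
obs_c, exp_c = collapse_tails(observed, expected)

chi_square = sum((o - e) ** 2 / e for o, e in zip(obs_c, exp_c))
df = len(obs_c) - 1 - 1  # cells - 1, minus 1 for the estimated p
p_value = chi2_sf(chi_square, df)

print(f"p_hat = {p_hat:.4f}, chi-square = {chi_square:.3f}, "
      f"df = {df}, p-value = {p_value:.4f}")
```

Replacing `observed` with the frequencies for another 5-day period repeats the check for Question 2(d). The number of cells remaining after collapsing, and hence the degrees of freedom, can change from one period to the next.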