Module I3 Sessions 14/16

Practical 2: Resolving common problems in data analysis

Resources:
Instat Climatic Guide Chapter 10 (from Instat Help)
Malawi rainfall data (from the Instat library)

This second practical continues your practice in using a statistics package. At the same time you examine and resolve two common problems in statistical analysis. They are:
When should an average be calculated?
How should you analyse data when some values are zero?

Activity 3 involves reading the text, and doing the practical work as you read, to reproduce the tables shown in these sections. They have been adapted from Section 10.4 in the Instat Climatic Guide. The figures keep the same numbers as in the guide to facilitate cross-referencing.

Fig. 10-4a Malawicr data
File Open From Library malawicr.wor
X1-X27 Rainfall totals for 24 dekads (from 1st Oct.) for 1954/5 to 1982/3 (omitting 1962/3 & 1981/2)
X28 Evaporation totals for dekads (from 1st October)
X29 Crop coefficients for groundnut (11 periods)
X30 Crop coefficients for cotton (17 periods)
X31, X32 Numbers 1, 2, ... 17 and 1, 2, ... 24 (for plotting)

SADC Course in Statistics Module I3 Sessions 14/16 – Page 1

Activity 3

1. Average and then calculate, or calculate and then average?

Climatic data from Malawi are used to illustrate two common problems in data analysis, and how they can be resolved. They also show how general skills in data analysis allow users to offer support in new application areas.

Ten-day (dekad) data for 27 years from a site in Malawi are available.

a. Open the file Malawicr.wor, and check the data are as shown in Fig. 10-4a.
b. What month do the data in the figure finish in?

In these data, each column contains the data for one year. So in Fig. 10-4a, X1 shows there was no rain in October, then 18.8mm in the first 10 days (dekad) of November, and so on.
In a full study, the rainfall records would be transposed, so each column holds the data for a single dekad within the year, rather than for a single year.

Fig. 10-4b Calculating summary statistics for Malawi data
Manage Manipulate Row Statistics
Manage Calculate (x35 = x28/2)

c. With the data in its current form, use the Manage Manipulate Row Statistics dialogue, as shown in Fig. 10-4b, to calculate the summary values.
d. Dekad 6 has been marked in Fig. 10-4b. When is this, within the year?
It is the … dekad in the month of ……….., i.e. the period from ……… to ………

The calculation of the mean and standard deviation for each dekad is shown in Fig. 10-4b. A common analysis by climatologists is to compare the mean rainfall with half the evaporation, so that was also calculated in Fig. 10-4b. Comparing x33 with x35 (Fig. 10-4b) shows that dekad 6 is where the mean rainfall first exceeds half the evaporation.

e. Plot the results through the year to give Fig. 10-4c.

Fig. 10-4c Plot of mean rainfall with evaporation/2

f. From the data or the graph, which dekad had the highest mean rainfall?
Dekad number ………., i.e. the ………. dekad in the month of ………….

This plot is sometimes used to give the "average" length of the season, i.e. the season is defined as the period during which the average rainfall (X33) is greater than half the evaporation (X35). Here it is periods 6 to 19. It does give a first indication, though it is deceptively difficult to interpret, as we now show.

It is also useful to have an idea of the variability of the season length from year to year, and that is not available from the analysis as done above. This is the first indication that the analysis above may not be the best way of looking at the season length. There is an alternative.
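The two orders of operation can be compared in a short sketch outside Instat. The Python below is a minimal illustration only: the rainfall and evaporation values are invented (they are not the Malawi data), the names rain_by_year, evaporation and first_dekad are hypothetical, and the start and end rules are the ones used in this activity (start = first dekad with rainfall above half the evaporation; end = first dekad after the dekad of highest mean rainfall with rainfall below half the evaporation).

```python
import statistics

# Invented dekad rainfall totals (mm): one list per year, 8 dekads each.
# These are NOT the Malawi values; they only illustrate the method.
rain_by_year = [
    [0, 25, 50, 60, 30, 5, 0, 0],
    [0, 0, 30, 70, 10, 40, 0, 0],
    [5, 30, 45, 55, 25, 22, 30, 0],
    [0, 10, 35, 65, 12, 0, 0, 0],
]
evaporation = [40] * 8                     # invented dekad evaporation (mm)
half_evap = [e / 2 for e in evaporation]
n_dekads = len(evaporation)

# Approach 1: average first, then calculate.
# Mean rainfall per dekad across years, then the dekads where it exceeds
# half the evaporation.
mean_rain = [statistics.mean(year[d] for year in rain_by_year)
             for d in range(n_dekads)]
wet_dekads = [d + 1 for d in range(n_dekads) if mean_rain[d] > half_evap[d]]
avg_first, avg_last = wet_dekads[0], wet_dekads[-1]

# Approach 2: calculate first (start and end in each year), then average.
peak = mean_rain.index(max(mean_rain))     # 0-based dekad of highest mean


def first_dekad(year, begin, test):
    """Return the 1-based number of the first dekad, at or after 0-based
    index `begin`, whose rainfall satisfies `test`; None if there is none."""
    for d in range(begin, len(year)):
        if test(year[d], half_evap[d]):
            return d + 1
    return None


starts = [first_dekad(y, 0, lambda r, h: r > h) for y in rain_by_year]
ends = [first_dekad(y, peak + 1, lambda r, h: r < h) for y in rain_by_year]
lengths = [e - s for s, e in zip(starts, ends)]

print("Average-then-calculate season: dekads", avg_first, "to", avg_last)
print("Start each year:", starts, "mean", statistics.mean(starts))
print("End each year:  ", ends, "mean", statistics.mean(ends))
print("Length each year:", lengths,
      "mean", statistics.mean(lengths),
      "sd", round(statistics.stdev(lengths), 2))
```

Only the second approach gives a value for each year, so only it can report the year-to-year variability of the start, the end and the length.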
Instead of averaging the rainfall and then comparing with the evaporation, calculate the season length each year, and then average the season lengths. So average the lengths as the last stage in the analysis, rather than averaging the rainfall at the beginning.

To get the length of the season, first get the start, then the end, and subtract the start from the end. The start is easy. For example, if the start of the season is defined as the first dekad with rainfall of more than half the evaporation, then Instat's Start of the Rains dialogue can be used. It is designed for daily data, but can equally be used with dekads.

g. Use the dialogue in Fig. 10-4d to give the dekad of the start, shown in X36.

Fig. 10-4d Calculating the start and end of the season
Climatic Events Start of the Rains
Changes for the end of the season

h. Use the same dialogue to give the end of the season, in X37, as indicated in Fig. 10-4d. (Change the starting dekad to the one with the highest mean rainfall in the season, see Fig. 10-4c. Then find the first dekad after that point where the rainfall is less than half the evapotranspiration. The option to look for "less" is not on the dialogue; add it to the command that generates the analysis, as shown in Fig. 10-4d.)
i. Add a column to give the year numbers, in X38, as shown in Fig. 10-4d. (Or just give a column that goes from 1 to 27. You could just type, or use the Manage Data Regular sequence dialogue.)
j. Summarise X36 and X37 to give the means and standard deviations. What are they?
Mean dekad of the start = …… The standard deviation = ……, i.e. about …… days
Mean dekad of the end = …… The standard deviation = ……., i.e. about ….. days

You should have found that the mean of the starting dekads is very close to the value in Fig. 10-4c. The mean of the ending dekads is, however, a long way from the value of dekad 19 in Fig. 10-4c. The values each year are plotted in Fig.
10-4e, with the supposed average dekads from Fig. 10-4c marked.

Fig. 10-4e The start and end of the season

Comparing Fig. 10-4c and 10-4e shows that, if the value each year is available, then the method shown in Fig. 10-4d should be used. This calculated the starting and ending dekad each year, and then averaged these values.

So, if in doubt, first calculate and then average; don't average and then calculate.

2. Take care when there are zero values in the data.

The second topic involves examining the rainfall data through the season. Fig. 10-4a shows that the dekad totals are often zero at the start and end of the "year". When analysing data with zeros it is usually useful to split the analysis into two parts. The number of zeros is considered first, and then the non-zero values are analysed further.

a. With the data as in Fig. 10-4a, use the Manage Manipulate Row Statistics dialogue, as shown in Fig. 10-4f. In the dialogue "restrict" the data to omit the zeros, and include the count of the number of observations.
b. Use the Manage Calculate dialogue, as shown in Fig. 10-4f, to calculate X42 with the % of years with no rain.

Fig. 10-4f Dealing with zeros in data
Manage Manipulate Row Statistics
Manage Calculate x42 = 100*(27-x41)/27

For example, Fig. 10-4f shows that the 3rd dekad in October had rain in 9 of the 27 years (X41). Hence the calculation shows that 18 of the 27 years, i.e. 2/3, were dry (X42). In the 9 years with rain the mean was 14.9mm and the standard deviation 9.3mm.

c. Report the same results for the 3rd dekad in November.
d. Plot the percentage of years without rain in each dekad, as shown in Fig. 10-4g.

Fig. 10-4g Plot of the percentage of zeros

The coefficient of variation is often calculated in climatic studies. This is defined as
c.v. = 100 * (Standard deviation)/Mean
i.e.
here x43 = 100 * x40 / x39

Fig. 10-4h Coefficient of variation of 10 day rainfall data through the season
Left: Manage Calc x43=100*x40/x39, then Graphics Plot x43 by dekad
Right: Manage Calc x44=100*x34/x33, then Graphics Plot x44 by dekad

e. Use the Manage Calculate dialogue to give the c.v. for each dekad, using the non-zero values, then graph the c.v. as shown on the left side of Fig. 10-4h. How large, roughly, is the c.v. from this graph? Does it vary much through the year?

The coefficient of variation seems roughly constant through the year, at about 80%. Its value is, of course, much less "stable" at the start and end of the year, because there are then fewer non-zero values.

Now repeat the calculation, but without omitting the zero values first. This is shown on the right in Fig. 10-4h and uses the initial calculation of summary statistics, shown earlier in Fig. 10-4b. What do you deduce now?

The c.v. is often reported in the summary of rainfall totals, but usually without omitting the zero values first. Although the shape looks impressive, a comparison with Fig. 10-4g shows that this is effectively a very complicated way of giving the same information, namely that zero values are less likely in the middle of the season!

There are two general messages from this example. The first is that when there is obvious structure in the data, and here that structure includes "dry dekads", i.e. zero values, then take great care with any analysis that ignores the structure.

Data often contain zeros. The usual analysis with such data is to look at the zeros first, i.e. examine what proportion (or percentage) of the values are zero. Then analyse the non-zero data. This is what was done above to produce the summaries in X39 to X42 and the left-hand plot in Fig. 10-4h.

The second message concerns the use of the c.v. itself. Once the zero values have been plotted, as in Fig.
10-4g, check that the c.v. provides a useful summary of the data. Sometimes the c.v. is useful, but often it is easier to interpret the mean, and perhaps the standard deviation, separately. Here, the dekad totals are very skewed, hence using the standard deviation (and therefore the c.v.) is suspect. Often percentiles are more appropriate summaries. As an example, the data in malawicr.wor were transposed, and then plotted as boxplots, Fig. 10-4i.

Fig. 10-4i Boxplots of the 10-day data
File New, Manage Data Copy x1-x27 from malawicr to x1-x24 with transpose, then Graphics Boxplot
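The two-part treatment of zeros described in this section can also be sketched in Python, for a single dekad. This is an illustration under stated assumptions, not the Instat calculation itself: the twelve rainfall values are invented (not the Malawi data), the variable names are hypothetical, and the standard deviation is taken to be the sample standard deviation (divisor n - 1).

```python
import statistics

# Invented rainfall totals (mm) for one dekad in 12 years; NOT Malawi data.
rain = [0, 0, 10, 0, 20, 0, 0, 30, 0, 0, 20, 0]

# Part 1: look at the zeros first.
n_years = len(rain)
wet = [r for r in rain if r > 0]                # non-zero values only
n_wet = len(wet)                                # plays the role of X41
pct_dry = 100 * (n_years - n_wet) / n_years     # same form as X42

# Part 2: summarise the non-zero values.
mean_wet = statistics.mean(wet)
sd_wet = statistics.stdev(wet)                  # sample sd (assumption)
cv_wet = 100 * sd_wet / mean_wet                # c.v. omitting zeros

# For comparison: c.v. without omitting the zeros first.
cv_all = 100 * statistics.stdev(rain) / statistics.mean(rain)

# Percentile-style summaries, often more suitable for skewed totals.
median_wet = statistics.median(wet)
quartiles_all = statistics.quantiles(rain, n=4)  # needs Python 3.8 or later

print("Years with rain:", n_wet, "of", n_years)
print("Percentage of dry years:", round(pct_dry, 1))
print("Non-zero values: mean", mean_wet, "sd", round(sd_wet, 1),
      "c.v.", round(cv_wet, 1))
print("All values: c.v.", round(cv_all, 1))
print("Median of non-zero values:", median_wet)
print("Quartiles of all values:", quartiles_all)
```

With these invented values the c.v. that includes the zeros is several times larger than the c.v. of the non-zero values, simply because two thirds of the years are dry. That is the pattern the right-hand plot in Fig. 10-4h reflects.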