Practical 2: Resolving common problems in data analysis

Module I3 Sessions 14/16
Resources:


Instat Climatic guide Chapter 10 (from Instat Help)
Malawi Rainfall data (from the Instat library)
This second practical continues with your practice in using a statistics package. At the
same time you examine and resolve two common problems in statistical analysis. They are:

When should an average be calculated?

How should you analyse data when some values are zero?
Activity 3 involves reading the text, and doing the practical work as you read, to reproduce
the same tables shown in these sections. They have been adapted from Section 10.4 in the
Instat Climatic Guide. The figures have been left with the same numbers as the guide to
facilitate cross-referencing.
Fig. 10-4a Malawicr data
File => Open From Library => malawicr.wor
X1-X27: Rainfall totals for 24 dekads (from 1st Oct.) for 1954/5 to 1982/3 (omitting 1962/3 & 1981/2)
X28: Evaporation totals for dekads (from 1st October)
X29: Crop coefficients for groundnut (11 periods)
X30: Crop coefficients for cotton (17 periods)
X31, X32: Numbers 1, 2, ... 17 and 1, 2, ... 24 (for plotting)
SADC Course in Statistics
Activity 3
1. Average and then calculate, or calculate and then average?
Climatic data from Malawi are used to illustrate two common problems in data analysis,
and how they can be resolved. They also show how general skills in data analysis allow
users to offer support in new application areas.
Ten-day data for 27 years from a site in Malawi are available.
a. Open the file Malawicr.wor, and check the data are as shown in Fig. 10-4a.
b. What month do the data in the figure finish in?
In these data, each column contains the data for one year. So in Fig. 10-4a, X1 shows
there was no rain in October, then 18.8mm in the first 10 days (dekad) of November, and so on.
In a full study, the rainfall records would be transposed, so that each column holds the data for
a single dekad within the year, rather than for a single year.
Fig. 10-4b Calculating summary statistics for Malawi data
Manage => Manipulate => Row Statistics
Manage => Calculate (x35 = x28/2)
c. With the data in its current form, use the Manage => Manipulate => Row
Statistics dialogue to calculate the summary values, as shown in Fig. 10-4b.
d. Dekad 6 has been marked in the figure above. When is this, within the year?
It is the … dekad in the month of ……….., i.e. the period from ……… to ………
The calculation of the mean and standard deviation for each dekad is shown in Fig. 10-4b.
A common analysis by climatologists is to compare the mean rainfall with half the
evaporation, so that was also calculated, in Fig. 10-4b.
Comparing x33 with x35 (Fig. 10-4b) shows that dekad 6 is where the mean rainfall first
exceeds half the evaporation.
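For readers who want to check the logic outside Instat, the row statistics and the comparison with half the evaporation can be sketched in Python. This is only an illustration under stated assumptions: the numbers are made-up toy values (not the Malawi data), the variable names are hypothetical, and the sample standard deviation (divisor n-1) is assumed, which may or may not match Instat's default.

```python
from statistics import mean, stdev

# Hypothetical toy data: each row is a dekad, each column a year.
# These are NOT the Malawi values.
rain = [
    [0.0, 5.0, 0.0],
    [10.0, 0.0, 20.0],
    [40.0, 60.0, 50.0],
    [30.0, 45.0, 15.0],
]
evap = [50.0, 48.0, 44.0, 40.0]  # hypothetical evaporation total per dekad

row_mean = [mean(row) for row in rain]   # plays the role of X33
row_sd = [stdev(row) for row in rain]    # plays the role of X34 (divisor n-1 assumed)
half_evap = [e / 2 for e in evap]        # plays the role of X35

# First dekad (counting from 1) where the mean rainfall exceeds half the evaporation.
first_wet = next(
    (i + 1 for i, (m, h) in enumerate(zip(row_mean, half_evap)) if m > h),
    None,
)
print(first_wet)
```

With the real data, the same comparison is what identifies dekad 6 in Fig. 10-4b.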
e. Plot the results through the year to give Fig. 10-4c.
Fig. 10-4c Plot of mean rainfall with evaporation/2
f. From the data or the graph, which dekad had the highest mean rainfall?
Dekad number ………., i.e. the ………. dekad in the month of …………. .
This plot is sometimes used to give the "average" length of the season, i.e. the season is
defined as the period during which the average rainfall (X33) is greater than half the
evaporation (X35). Here it is periods 6 to 19. It does give a first indication, though it is
deceptively difficult to interpret, as we now show.
It is also useful to have an idea of the variability of the season length from year to year and
that is not available from the analysis as done above. It is the first indication that the
analysis above may not be the best way of looking at the season length.
There is an alternative. Instead of averaging the rainfall and then comparing with the
evaporation, calculate the season length each year, and then average the season lengths. So
average the lengths as the last stage in the analysis, rather than averaging the rainfall, at the
beginning.
To get the length of the season each year, first get the start, then the end, and subtract them. The
start is easy. For example, if the start of the season is defined as the first dekad with rainfall
of more than half the evaporation, then Instat’s Start of the Rains dialogue can be used. It
is designed for daily data, but can equally be used with dekads.
g. Use the dialogue in Fig. 10-4d to give the dekad of the start, shown in X36.
Fig. 10-4d Calculating the start and end of the season
Climatic => Events => Start of the Rains
Changes for the end of the season
h. Use the same dialogue to give the end of the season, in X37, as indicated in Fig.
10-4d. (Change the starting dekad to the one with the highest mean rainfall in the
season, see Fig. 10-4c. Then find the first dekad after that point where the
rainfall is less than half the evapotranspiration. The option to look for less is not
on the dialogue; add it to the command that generates the analysis, as shown in
Fig. 10-4d.)
i. Add a column to give the year numbers, in X38, as shown in Fig. 10-4d. (Or just
give a column that goes from 1 to 27. You could just type, or use the Manage =>
Data => Regular sequence dialogue.)
j. Summarise X36 and X37 to give the mean and standard deviations. What are
they?
Mean dekad of the start = …… The standard deviation = ……, i.e. about …… days
Mean dekad of the end = …… The standard deviation = ……., i.e. about ….. days
You should have found that the mean of the starting dekads is very close to the value in
Fig. 10-4c. The mean of the ending dekads is however a long way from the value of dekad
19 in Fig. 10-4c. The values each year are plotted in Fig. 10-4e, with the supposed average
dekads from Fig. 10-4c marked.
Fig. 10-4e The start and end of the season
Comparing Fig. 10-4c and 10-4e shows that, if the value each year is available, then the
method shown in Fig. 10-4d should be used. This calculated the starting and ending dekad
each year, and then averaged these values.
So, if in doubt, first calculate and then average, don’t average and then calculate.
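The contrast between the two orders of calculation can be sketched in Python. This is an illustration only: the three "years" of six dekads are toy numbers chosen to show the effect (they are not the Malawi data), the helper `first_dekad` is hypothetical, and a constant half-evaporation threshold is assumed for simplicity.

```python
from statistics import mean, stdev

def first_dekad(profile, threshold, start=1, below=False):
    """First dekad (counting from 1), at or after `start`, where the value is
    above the threshold (or below it, if below=True). None if there is none."""
    for d in range(start, len(profile) + 1):
        r, t = profile[d - 1], threshold[d - 1]
        if (r < t) if below else (r > t):
            return d
    return None

# Toy data: one list per year, six dekads each (assumed values).
half_evap = [20.0] * 6
years = [
    [30.0, 40.0, 50.0, 10.0, 0.0, 0.0],
    [0.0, 30.0, 60.0, 30.0, 10.0, 0.0],
    [0.0, 0.0, 70.0, 40.0, 60.0, 5.0],
]

# Mean rainfall per dekad, and the dekad with the highest mean.
mean_profile = [mean(col) for col in zip(*years)]
peak = mean_profile.index(max(mean_profile)) + 1

# Average, then calculate: find start and end from the mean profile.
avg_start = first_dekad(mean_profile, half_evap)
avg_end = first_dekad(mean_profile, half_evap, start=peak, below=True)

# Calculate, then average: find start and end each year, then summarise.
starts = [first_dekad(y, half_evap) for y in years]
ends = [first_dekad(y, half_evap, start=peak, below=True) for y in years]
mean_start, mean_end = mean(starts), mean(ends)
sd_end = stdev(ends)

print(avg_start, avg_end)    # from the averaged rainfall
print(mean_start, mean_end)  # from the yearly values
```

In this toy example the two methods agree on the start but not on the end, and only the second method gives a standard deviation for the yearly values.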
2. Take care when there are zero values in the data.
The second topic involves examining the rainfall data through the season. Fig. 10-4a
shows the dekad totals are often zero at the start and end of the "year". When analysing
data with zeros it is usually useful to split the analysis into two parts. The number of zeros
is considered first and then the non-zero values are analysed further.
a. With the data as in Fig. 10-4a, use the Manage => Manipulate => Row
Statistics dialogue, as shown in Fig. 10-4f. In the dialogue, "restrict" the data to
omit the zeros and include the count of the number of observations.
b. Use the Manage => Calculate dialogue, as shown in Fig. 10-4f, to calculate X42
with the % of years with no rain.
Fig. 10-4f Dealing with zeros in data
Manage => Manipulate => Row Statistics
Manage => Calculate => x42 = 100*(27-x41)/27
For example, Fig. 10-4f shows that the 3rd dekad in October had rain in 9 of the 27 years
(X41). Hence the calculation shows that 2/3 of the years (66.7%) were dry (X42). In the 9 years
with rain, the mean was 14.9mm and the standard deviation 9.3mm.
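The two-part summary can be sketched in Python as follows. This is a minimal illustration, not Instat's implementation: the function name `summarise_with_zeros` and the toy values are assumptions, and the sample standard deviation (divisor n-1) is assumed.

```python
from statistics import mean, stdev

def summarise_with_zeros(values):
    """Part 1: count the zeros. Part 2: summarise the non-zero values only."""
    nonzero = [v for v in values if v > 0]
    n_rain = len(nonzero)
    return {
        "n_rain": n_rain,                                        # like X41
        "pct_dry": 100 * (len(values) - n_rain) / len(values),  # like X42
        "mean": mean(nonzero) if n_rain >= 1 else None,          # like X39
        "sd": stdev(nonzero) if n_rain >= 2 else None,           # like X40
    }

# The percentage quoted in the text: rain in 9 of the 27 years.
pct_dry_october = 100 * (27 - 9) / 27
print(round(pct_dry_october, 1))

# Toy values for one dekad across six years (assumed, not the Malawi data).
summary = summarise_with_zeros([0.0, 0.0, 12.0, 0.0, 8.0, 0.0])
print(summary)
```

Returning None when there are too few non-zero values is a design choice made here, so that a dekad with 0 or 1 wet years does not produce a misleading mean or standard deviation.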
c. Report the same results for the 3rd dekad in November.
d. Plot the percentage of years without rain in each dekad, as shown in Fig. 10-4g.
Fig. 10-4g Plot of the percentage of zeros
The coefficient of variation is often calculated in climatic studies. This is defined as
c.v. = 100 * (Standard deviation)/Mean
i.e. here, for the non-zero values,
x43 = 100 * x40 / x39
Fig. 10-4h Coefficient of variation of 10 day rainfall data through the season
Left: Manage => Calc => x43=100*x40/x39, then Graphics => Plot x43 by dekad
Right: Manage => Calc => x44=100*x34/x33, then Graphics => Plot x44 by dekad
e. Use the Manage => Calculations dialogue to give the c.v. for each dekad, using
the non-zero values, then graph the c.v. as shown on the left side of Fig. 10-4h.
How large, roughly, is the c.v. from this graph? Does it vary much through the year?
The coefficient of variation seems roughly constant through the year, at about 80%. Its
value is, of course, much less "stable" at the start and end of the year, because there are
then fewer non-zero values.
Now repeat the calculation, but without omitting the zero values first. This is also shown
above, on the right in Fig. 10-4h and uses the initial calculation of summary statistics,
shown earlier in Fig. 10-4b.
What do you deduce now?
The c.v. is often reported in the summary of rainfall totals, but usually without omitting the
zero values first. Although the shape looks impressive, a comparison with Fig. 10-4g
shows that this is effectively a very complicated way of giving the same information,
namely that zero values are less likely in the middle of the season!
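A small Python sketch shows why the c.v. is inflated when the zeros are left in. The values are toy numbers for a single dekad with many dry years (an assumption, not the Malawi data), and the function name `cv` is hypothetical.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation in percent: 100 * standard deviation / mean."""
    return 100 * stdev(values) / mean(values)

# Toy dekad: four dry years and two wet years (assumed values).
dekad = [0.0, 0.0, 0.0, 0.0, 10.0, 30.0]

cv_all = cv(dekad)                               # zeros included, like X44
cv_nonzero = cv([v for v in dekad if v > 0])     # zeros omitted, like X43

print(round(cv_all), round(cv_nonzero))
```

The zeros pull the mean down much more than they reduce the standard deviation, so the c.v. with zeros included mainly reflects how many dry years there were.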
There are 2 general messages from this example. The first is that when there is obvious
structure in the data, and here that structure includes "dry dekads", i.e. zero values, then
take great care with any analysis that ignores the structure. Data often contain zeros. The
usual analysis with such data is to look at the zeros first, i.e. examine what proportion (or
percentage) of the values are zero. Then analyse the non-zero data. This is what was done
above to produce the summaries in X39 to X42 and the left-hand plot in Fig. 10-4h.
The second message concerns the use of the c.v. itself. Once the zero values have been
plotted, as in Fig. 10-4g, check that the c.v. provides a useful summary of the data.
Sometimes the c.v. is useful, but often it is easier to interpret the mean, and perhaps the
standard deviation separately. Here, the dekad totals are very skewed, hence using the
standard deviation (and therefore the c.v.) is suspect. Often percentiles are more
appropriate summaries. As an example, the data in malawicr.wor were transposed, and
then plotted as boxplots, Fig. 10-4i.
Fig. 10-4i Boxplots of the 10-day data
File => New, then Manage => Data => Copy
x1-x27 from malawicr to x1-x24 with transpose, then Graphics => Boxplot
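The percentile summaries that a boxplot displays can also be computed directly. The sketch below uses Python's standard library on toy, deliberately skewed dekad totals (assumed values, not the Malawi data). The "inclusive" quantile method is one of several conventions and may not match the one Instat uses.

```python
from statistics import mean, median, quantiles

# Toy dekad totals over eight years, with one very wet year (assumed values).
totals = [0.0, 0.0, 2.0, 5.0, 8.0, 12.0, 20.0, 95.0]

# Quartiles: the box of a boxplot spans q1 to q3, with a line at the median q2.
q1, q2, q3 = quantiles(totals, n=4, method="inclusive")

print(q1, q2, q3)
print(mean(totals), median(totals))
```

Here the single large value drags the mean above the upper quartile, which is the kind of skewness that makes the mean and standard deviation hard to interpret.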