Module I3 Sessions 6&7

Practical 2: Processing single variables 2

This practical sheet uses both Excel and CAST.

More summary statistics with Excel

Go to the same Excel workbook as in the last practical, Summary statistics.xls. Open the sheet called Rice yields. Column B contains the rice yield data from the workbook survey.xls, arranged in ascending order. Summary statistics are given below the data. They should be familiar by now.

To reinforce the calculations of the mean deviation and the standard deviation, columns D, E and F have been started. Double-click or drag the 2nd value (-16.28) in column D to complete that column. Do the same for columns E and F.

Are the statistics calculated from first principles the same as those given by the Excel functions? Explain briefly any differences:

The quartile deviation is a third measure of variation. Your next tasks are to calculate its value, as well as the median, from first principles.

Cells I22 and I23 give the observation numbers of the lower quartile and the median, using the general formula:

Observation = r * (n+1)/100, where r = 25 for the lower quartile and r = 50 for the median.

Use the same formula, with r = 75, to give the observation number of the upper quartile in cell I24.

The formula and the corresponding value for the lower quartile are then given in cells I10 and I11. Use the observation number for the upper quartile to give its value in cell I12. Then give the inter-quartile range, IQR (the difference between the quartiles), in I13, and the quartile deviation (half this range) in I14.

You now have 3 measures of spread: the quartile deviation, the mean deviation and the standard deviation. Their respective sizes are compared in the table that follows.
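For readers who want to check the first-principles steps outside Excel, here is a minimal Python sketch of the same calculation. It is only an illustration: the data values are made up (they are not the Rice yields data), the helper names observation_number and value_at are hypothetical, and the sketch assumes that a fractional observation number is handled by interpolating linearly between the two neighbouring observations. The standard deviation uses the divisor n - 1.

```python
import statistics

def observation_number(r, n):
    """Position (1-based) of the r-th percentile among n sorted values."""
    return r * (n + 1) / 100

def value_at(sorted_values, position):
    """Value at a fractional 1-based position, interpolating linearly
    between neighbouring observations (an assumption of this sketch)."""
    k = int(position)        # whole part: the observation just below
    frac = position - k      # fractional part: how far towards the next one
    if k < 1:
        return sorted_values[0]
    if k >= len(sorted_values):
        return sorted_values[-1]
    return sorted_values[k - 1] + frac * (sorted_values[k] - sorted_values[k - 1])

# Hypothetical yields, already in ascending order (not the worksheet data).
yields = [20, 24, 28, 30, 33, 35, 38, 40, 45, 52]
n = len(yields)

lower_q = value_at(yields, observation_number(25, n))
median = value_at(yields, observation_number(50, n))
upper_q = value_at(yields, observation_number(75, n))
iqr = upper_q - lower_q
quartile_dev = iqr / 2

mean = statistics.mean(yields)
mean_dev = sum(abs(y - mean) for y in yields) / n
std_dev = statistics.stdev(yields)   # sample standard deviation, divisor n - 1

print(f"Lower quartile {lower_q}, median {median}, upper quartile {upper_q}")
print(f"IQR {iqr}, quartile deviation {quartile_dev}")
print(f"Mean deviation {mean_dev:.2f}, standard deviation {std_dev:.2f}")
```

With these made-up values the observation numbers are 2.75, 5.5 and 8.25, so each quartile lies between two observations. Replacing the list with the values from column B should let you compare the results with cells I10 to I14.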
SADC Course in Statistics Module I3 Sessions 6/7 – Page 1

Complete the table:

Summary statistic  | Value | Reason it is in this order
Quartile deviation |       | I expect it to be less than the standard deviation because……………………
Mean deviation     | 9.6   | I expect it to be less than the standard deviation because……………………
Standard deviation | 11.9  | I expect it to be greater than the quartile deviation because of the 70-95-100 rule of thumb

Quartiles in Excel

First use Excel to get the MIN and MAX. (Hint: Use the arrow next to Σ or use Excel's functions.)

Excel has a function called QUARTILE. It is used in cell J10 as an alternative way of giving the lower quartile value. The function there is =QUARTILE(B5:B40,1), where 1 represents the lower quartile and 3 represents the upper quartile.

Using column J, give the corresponding values from Excel for the quartiles and hence for the IQR and the quartile deviation. (For further explanation refer to fx > QUARTILE > Help on this function.)

Compare the values from Excel and from first principles for the quartiles. Does the difference concern you? Explain why, or why not. (Hint: See CAST page 2.2.2.)

Warnings about means and standard deviations

Read CAST page 3.3.2, which shows possible problems with using the mean and standard deviation in an uncritical way. Hence complete the following table:

Data set    | Problems with mean and s.d. | What to do
Symmetrical | None                        | Use them
Clusters    |                             |
Outlier     |                             |
Skew        |                             |

Read CAST page 3.3.3 on the possible problems with using the mean, and particularly the s.d., when there are outliers. Hence complete the following table:

Data                 | Mean  | Mean (day in year) | Standard deviation
As given             | 126.2 | 5th May            | 18.7 days (about 2.5 weeks)
Low outlier (50)     | 124.8 | 3rd May            | 21.1 days (about 3 weeks)
No planting rain (0) |       |                    |
High outlier (220)   |       |                    |
High outlier (365)   |       |                    |
Missing value (999)  |       |                    |

What conclusion is written in CAST following this exercise?
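The effect of a single odd value on the mean and standard deviation can also be explored with a short Python sketch. This is an illustration only: the day numbers are made up (they are not the data used in CAST), the function name summarise is hypothetical, and the standard deviation is assumed to use the divisor n - 1. Each odd value from the table is added in turn to the same base data.

```python
import statistics

# Hypothetical day-of-year values for the start of the rains (not the CAST data).
start_days = [105, 112, 118, 121, 125, 127, 130, 134, 139, 149]

def summarise(values):
    """Return (mean, sample standard deviation) of a list of values."""
    return statistics.mean(values), statistics.stdev(values)

base_mean, base_sd = summarise(start_days)
print(f"As given: mean {base_mean:.1f}, s.d. {base_sd:.1f} days")

# Add one odd value at a time, as in the table, and recompute the summaries.
odd_values = {
    "Low outlier": 50,
    "No planting rain": 0,
    "High outlier (220)": 220,
    "High outlier (365)": 365,
    "Missing value code": 999,
}
results = {}
for label, odd in odd_values.items():
    results[odd] = summarise(start_days + [odd])
    mean, sd = results[odd]
    print(f"{label} ({odd}): mean {mean:.1f}, s.d. {sd:.1f} days")
```

Running the sketch shows how much more the standard deviation moves than the mean when one extreme value is added, which is the pattern to look for when you complete the table from CAST.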
More practice with outliers

Go back to the Excel workbook Summary statistics.xls. Open the sheet called Samaru. The data in column B are the same as those you examined in CAST on page 3.3.3. Try inserting each of the values from CAST (0, 50, 220, 365, 999) in turn in cell B61. Check that the mean and s.d. in cells B64 and B65 give the same results as CAST.

Now extend the mean and s.d. calculations along that row for all the other columns. Then try the odd values from the table below (200, 0, 365, 999) for the end of the season (Column I) and complete the table.

Data                | Mean  | Mean (day in year) | Standard deviation
As given            | 292.9 | 18 Oct             | 8.9 days (just over 1 week)
Low outlier (200)   | 291.3 | 16 Oct             | 15.1 days (just over 2 weeks)
No ending date (0)  |       |                    |
High outlier (365)  |       |                    |
Missing value (999) |       |                    |

More on the coefficient of variation

Take away the extra observations so that you are using the actual data. Give the coefficient of variation (cv) in cell B66 for the start of the rains.

What is its value? _____________

What is the value of the cv for the end of the rains? ___________

Can you explain why the cv is not a useful summary for this sort of data? No/Maybe/Yes

If "Maybe" or "Yes", then please try to explain.

Here is an exercise to help. The end of the season (the column called End) was the first day after 1st September on which there was no water in the soil. We started counting days from 1st January. But for the end of the rains we could equally well define 1st September (day 245) as day 1.

Create a new column, J, that is the same as End but with 244 subtracted from each value. Work out the mean and standard deviation of these new values. How do they compare? Complete the table below.

Summary statistic  | Original data (Column I) | New data (Column J) | Comment
Mean               | Day 293 (18 October)     | Day ….. (        )  | Same date
Standard deviation | 8.9 days                 |                     |
cv                 |                          |                     |

Now are you able to complete the comment above?
If not, then try the exercise again, making 1st October (day 275) into the new day 1.

Using SSC-Stat

Make the active cell somewhere in the data set. Then use SSC-Stat => Analysis => Descriptive statistics. Complete the dialogue as shown below, with Start1 and End as the selected Variables, to process the 2 columns called Start1 and End and to give all the summary statistics used in this practical.

This shows that it is easy to give any summary statistics you would like. What is more important is to provide summary values that are appropriate and that you and your readers can interpret.

From the work in CAST and in this practical, you should now be able to interpret the mean and standard deviation and know when they are an appropriate summary. Similarly, you should be able to interpret the 5-number summary. If (as here) the coefficient of variation cannot usefully be interpreted, it should not be given.
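The change-of-origin exercise for the end of the rains can be sketched in Python as a check on your Excel results. This is an illustration under stated assumptions: the day numbers are made up (they are not the Samaru data), the function name cv_percent is hypothetical, and the cv is taken to be 100 times the sample standard deviation divided by the mean.

```python
import statistics

def cv_percent(values):
    """Coefficient of variation as a percentage: 100 * s.d. / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical end-of-season day numbers, counted from 1st January.
end_days = [280, 285, 288, 290, 292, 293, 295, 298, 301, 308]

# Same dates, counted from 1st September (day 245 becomes day 1)
# and from 1st October (day 275 becomes day 1).
from_september = [d - 244 for d in end_days]
from_october = [d - 274 for d in end_days]

for label, days in [("From 1 January", end_days),
                    ("From 1 September", from_september),
                    ("From 1 October", from_october)]:
    print(f"{label}: mean {statistics.mean(days):.1f}, "
          f"s.d. {statistics.stdev(days):.1f}, cv {cv_percent(days):.1f}%")
```

The three lines of output describe exactly the same dates, so comparing how the mean, the standard deviation and the cv behave under each choice of day 1 is a direct way to judge which of them carry meaning for this kind of data.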