Module I3 Sessions 6&7 Practical 3: Processing multiple variables This practical uses both CAST and Excel. Overall Summary statistics in Excel Go into Excel and open the workbook called summary statistics.xls. Go to the sheet, called ANOVA. The data are the same as were used in practical 2, in the sheet called Rice Yields. The mean has been given, in cell B39. Use the Excel functions from the last practicals to give the corrected sum of squares in B40, the d.f. (i.e. n-1) in B41, the variance in B42, and the standard deviation in B43. Use the 70-95-100 rule of thumb to state roughly what you know about the rice yields of farmers: Examine the data in columns C and D, that give the same values you have just entered into B40 to B43 - . So far this is just revision of work you did in earlier practicals. Explain for someone new, why the formula SUM(D2:D37) which gives the corrected ssq in cell D40 gives the same value as using the Excel function DEVSQ in cell B40. Explaining the variation In the survey, there was also the information of which variety each farmer used. The column is called Var for variety. It is a factor (or category) column with 3 leavels, namely NEW, OLD and TRAD. It is possible that some of the (currently unknown) variability in the yields can be explained from the knowledge of the variety of rice grown. SADC Course in Statistics Module I3 Sessions 6&7 – Page 1 Module I3 Sessions 6&7 Find how different the rice yields are for each variety. One way is to use SSCStat as follows: Put the cursor within the data. Then use SSCstat => Analysis => Summary Statistics The Variable to be analysed is y and the analysis is to be by a factor, which is Var. Just calculate the means as shown in the table below. Which statement below do you believe to be true? The means are quite similar for the 3 varieties. True/False So I do not expect the knowledge of the variety to explain much of the variation in the data. The means are quite different for the 3 varieties. True/False So I expect the knowledge of the variety to explain quite a lot of the variation in the data. Column F in the Excel worksheet called ANOVA has copied the mean yields from the summary you obtained. In column G the first row of data is 23.6, and this is calculated as follows: (45.4 – 40.6)² = 4.8² = 23.6 That is for variety OLD. NEW Explain how the other 2 values in column G are calculated. They correspond to the variety NEW and TRAD are calculated: (____ - 40.6)² = ___ = ____ TRAD SADC Course in Statistics Module I3 Sessions 6&7 – Page 2 Module I3 Sessions 6&7 Hence explain how the value 3528 is calculated – in cell G40. Also state how much of the overall variation of the data is explained by the variety factor. Calculated as follows: Amount of variation explained Explain in words how the residual column (column I) is calculated and give examples for the observations below. The residual is the data value minus the ____________________ For the first observation it is therefore 53.6 – 45.4 = 8.2 For the 4th observation it is therefore 33.6 For the 5th observation For the 8th observation Explain the values in column J, including the corrected ssq in cell J40 Each value in J2 to J37 is the _______ of the values in ____. For example the value _____ in cell _____ is ______. These are therefore called the squared _________ Hence J40 is the residual ____________ We have chosen to calculate both columns G and J. How could we get the results in cells G40 and J40 by just calculating one of them? When you divide sums of squares by the corresponding degrees of freedom, you get mean squares or variances. That is why this breakdown of the variability in the data is called the Analysis of Variance, or ANOVA. What you noted above – that the sums of squares add up – does not happen with the other measures of variation, like the mean or the quartile deviations. The ONLY one of these measures of variation that is “neat” mathematically, is the variance (and hence the standard deviation). That’s perhaps the main reason that this measure of variation is the most used and hence the most important for you to understand. SADC Course in Statistics Module I3 Sessions 6&7 – Page 3 Module I3 Sessions 6&7 Take the sums of squares from row 40 and the degrees of freedom from row 41 and complete the table below: Term d.f. ssq msq 35 4955 142 ratio Variety Residual Total (A copy is at the bottom of the ANOVA_1 sheet. Check you have calculated this correctly.) The proportion of the variation explained by an explanatory measurement is called the coefficient of determination and denoted by R². It is usually given as a percentage, so multiply the result by 100. Calculate R² for this case. R² = _____%, i.e. _____ % of the variation in the yields can be explained by the different varieties. Finally we consider what this all means for the standard deviation. That is important because standard deviations are easier to interpret than variances. From the ANOVA table above, the overall standard deviation (see also cell B43) is s = √ 142 = 11.9. Now after taking out the variation due to the varieties the unexplained variation is s = √43 = 6.6. So the variance has been roughly divided by 4 and hence the standard deviation has been roughly halved, by including the variety factor. That result could be deduced from the value of R². It is roughly ¾, so ¼ of the variation remains unexplained. The corresponding effect on the standard deviation is therefore to halve it. A second ANOVA table The worksheet called ANOVA_2 in the same workbook has been started. It contains the village information, rather than the varieties. It needs to be completed, in the same way as the information on the worksheet for the varieties. SADC Course in Statistics Module I3 Sessions 6&7 – Page 4 Module I3 Sessions 6&7 It is slightly easier, because the villages are “in order” in the worksheet. Drag down the 4 village means in column F, to complete the column. The original values are in the Summary sheet if you make a mistake and need to start this column again. Then complete this sheet in the same way as the ANOVA_1 sheet. (Do as much as possible without looking at the formulae in the other sheet.) Hence complete the ANOVA table for the villages and give the coefficient of determination. Term d.f. ssq msq 35 4955 142 ratio Village Residual Total R² = _____%, i.e. _____ % of the variation in the yields can be explained by the different villages. Use the same logic as at the end of question 2 to calculate the residual standard deviation, and explain how this could be deduced from the value of R². SADC Course in Statistics Module I3 Sessions 6&7 – Page 5 Module I3 Sessions 6&7 Reading CAST A new section in CAST has been constructed for this SADC course to explain the same ideas as above. It is section 3.4 and is called “Variation and Groups”. It would be good to read this material in pairs, so you can discuss, and relate to the work in Excel, done above. Read pages 3.4.1 and 3.4.2 to review the idea of explained and unexplained variation. In 3.4.2, for the Mooring’s data the overall standard deviation of the October to December rainfall is about 100mm (s=109.2). Use the slider to give the seasonal forecast that is about ¾ of the way from “Worthless” to “Good”. This makes the unexplained standard deviation from the 3 forecasts around 50mm, or about half the value without the forecast. If the unexplained standard deviation is about halved, what roughly would to R² value be? (Hint – see the end of question 2 above) Explain briefly why having a forecast of the seasonal rainfall that has less than the overall variation, would be a good thing. (Hint: The parallel with the example at the bottom of page 3.4.2 may help) Read page 3.4.3. Summarise the 2 ideas in this page, using either (or both) the example of the varieties (question 2) or villages (question 3) above. Variation between groups: SADC Course in Statistics Module I3 Sessions 6&7 – Page 6 Module I3 Sessions 6&7 Variation within groups: Read page 3.4.4 and try the applet, making sure you click within the “Total”, “Between” and “Within” boxes, to see which deviations are then displayed. Also open the Excel workbook as before and use either the sheet ANOVA_1 or ANOVA_2. The text in CAST includes the following statement The following relationship requires some algebra to prove but is important. Which columns in the Excel sheet correspond to the different parts of this formula? And which cells give the 3 values above of the ssq? Copy the appropriate formula from above, into the table below: Terms Columns in Excel Cell giving ssq Formula Between Within Total Move the slider on page 3.4.4 so about the same proportion of the variation is explained as in the examples above for a) varieties and b) villages. What was the residual (SSwithin ) in each case? Same proportion as: Residual ssq (SSwithin ) Varieties Villages Read page 3.4.5. Try the 5 data sets and hence give a similar interpretation of the R² values for the varieties and the villages. Term SADC Course in Statistics Explanation Module I3 Sessions 6&7 – Page 7 Module I3 Sessions 6&7 Varieties Villages Read page 3.4.6, the final page in this section. Try the applet in the section titled “Illustration of calculations” with different amounts of variation explained and different sample sizes. Try a number of samples. For a given position of the slider which mean squares stay roughly the same size when the sample size is changed? Explain why that should be the case. Roughly by how much does the remaining mean square change, as the sample size changes? And what therefore happens to the ratio (called F in the applet)? What is the practical importance of that result? SADC Course in Statistics Module I3 Sessions 6&7 – Page 8