Module I3 Sessions 4&5
Some students may be concerned that their fear of maths formulae will prevent them understanding certain statistical ideas. This used to be a valid concern, because data had to be processed by hand, and this required an understanding of the formulae.
Now you will use a spreadsheet or a statistical package for the calculations, so courses can be more practical and no longer need to include formulae.
But if you can understand and use the simplest formulae, you will have more confidence in using the computer for your analyses. And you may be surprised at how much you already know!
1. The mean
Here is the entry in the SIAC Glossary for the mean.
The mean is a measure of the “middle”, sometimes called the “average”.
It is often given the symbol $\bar{x}$.
To calculate the mean, sum all the data values and divide by the number of them. For example, with the 7 values 12, 15, 11, 18, 13, 14, 18, the mean is:
$\bar{x} = (12 + 15 + 11 + 18 + 13 + 14 + 18) / 7 = 14.4$
As a formula, the mean is given by $\bar{x} = \sum x / n$, where $\sum$ is short for “the sum of”, “$x$” signifies that each value is taken in turn, and $n$ is the number of observations.
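For readers who would like to see the formula outside a spreadsheet, here is a minimal sketch in Python. It is only an illustration: Python is not part of this session, and the names `values`, `total`, `n` and `mean` are our own choices.

```python
# The 7 example values from the glossary entry above.
values = [12, 15, 11, 18, 13, 14, 18]

total = sum(values)   # the sum of all the data values
n = len(values)       # the number of observations
mean = total / n      # the mean: sum divided by n

print(f"n = {n}, mean = {mean:.1f}")
```

Each line corresponds to one part of the formula: the sum, the count, and the division.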
To check you are comfortable with this formula:
Add the 7 numbers above by hand (i.e. without even using a calculator!) to get the total, i.e. Σx and then divide by 7 to give the mean as shown above.
Ans: Σx = ______ hence mean = ______
Go into Excel and enter the 7 numbers into a column. Then (with the cell beneath as the active one) click on the Σ symbol in the toolbar to give the sum, see below.
In another cell, divide this total by n=7 to give the mean.
SADC Course in Statistics Module I3 Sessions 4&5 – Page 1
Instead of clicking on Σ, press the small arrow next to it, to give alternative functions. Use this to give the mean directly, as well as the count, minimum and maximum.
[Figures: “The Σ function in Excel” and “More simple statistical functions in Excel”]
2. More practice with the mean
No: Question | Answer
1 | Calculate $\bar{x} = \sum x / n$ for the first n positive numbers, when n = 6. (Hint: 1 is the first positive number…) | ______
2 | Now (with the calculation performed above in mind) calculate the mean of the numbers 121, 122, 123, 124, 125, 126 without using a calculator. | ______
3 | How did you do it? (See some options below.) | ______
Options for question 3:
a) The new numbers are 120 more than 1, 2, … 6, so I just added 120 to the $\bar{x}$ calculated in the first question.
b) I subtracted 120 from each number to give 1, 2, … 6. So, that must add 120 to the mean.
c) I checked from the formula that
$\bar{x} = \sum (120 + x) / n$
$= ((120 + 1) + (120 + 2) + \dots + (120 + 6)) / 6$
$= 120 + (1 + 2 + 3 + 4 + 5 + 6) / 6$
$= 120 + \text{original mean}$
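The reasoning in option c) can also be checked numerically. The short Python sketch below is an illustration only; the helper `mean` and the variable names are our own and do not come from the course material.

```python
def mean(values):
    """Return the sum of the values divided by how many there are."""
    return sum(values) / len(values)

first_six = [1, 2, 3, 4, 5, 6]
shifted = [120 + x for x in first_six]   # 121, 122, ..., 126

# Adding the same constant to every value adds that constant to the mean.
difference = mean(shifted) - mean(first_six)
print(difference)
```

The printed difference is the constant that was added to every value, which is what the algebra in option c) predicts.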
3. The median
Here is the entry in the SIAC glossary about the median. Read it and complete the questions below.
The median is the "middle value" of a list. If the list has an odd number of entries, the median is the middle entry after sorting the list into increasing order. If the list has an even number of entries, the median is halfway between the two middle numbers after sorting.
For example with the same 7 values shown for the mean and maximum above the sorted data are as follows:
11 12 13 14 15 18 18.
The median is therefore the 4th value in the sorted list, i.e. 14.
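The glossary rule (sort, then take the middle entry, or halfway between the two middle entries) can be sketched in Python as follows. This is an illustration under our own naming; the function `median` and the even-length example values are hypothetical and not taken from the course.

```python
import statistics

def median(values):
    """Middle value after sorting; halfway between the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_example = [12, 15, 11, 18, 13, 14, 18]   # the 7 values used above
even_example = [4, 1, 3, 2]                  # hypothetical values, not from the course

print(median(odd_example))    # the 4th value of the sorted list
print(median(even_example))   # halfway between the 2nd and 3rd sorted values

# Python's standard library gives the same results.
print(median(odd_example) == statistics.median(odd_example))
```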
No. | Question | Answer
1 | Calculate the median of the n = 6 numbers above, i.e. the median of 121, 122, 123, 124, 125, 126. | ______
2 | Change the largest value, 126, to 246 by adding 120 to it. |
3 | Does the median change? If so, by how much? | ______
4 | Does that make the mean change? | ______
5 | If you think the mean changes, then work out the new value, if possible without doing the calculation again. How did you work out the mean? (See the table of alternatives below.) | ______
Answer
a) I did the calculation again using my hand calculator.
b) I did the calculation again using Excel or a statistics package.
c) I noticed that adding 120 to the last number simply adds 120 to the total, and that’s 20 per number. So it must add 20 to the mean.
d) I thought it would add 20 to the mean, but wasn’t sure. So I checked using my calculator.
e) I used my calculator/Excel/Stats Package and then noticed it added 20 to the previous value. Then I understood the question better!
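The same idea can be tried on a different, made-up data set. The Python sketch below is only an illustration: the values in `before` and `after` are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
import statistics

before = [2, 4, 6, 8, 10]   # hypothetical values, not from the course
after = [2, 4, 6, 8, 60]    # the largest value increased by 50

n = len(before)
mean_change = statistics.mean(after) - statistics.mean(before)
median_change = statistics.median(after) - statistics.median(before)

# The total rises by 50, so the mean rises by 50 / n.
print(mean_change, 50 / n)
# The middle value of the sorted list is unaffected.
print(median_change)
```

Changing only the largest value moves the mean by the change divided by n, while the median of this list stays where it was.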
4. A different measure of spread
The mean deviation is sometimes given by the formula
$\text{mean deviation} = \sum |x - \bar{x}| / (n - 1)$
(Here the $|\;|$ symbol is called the “modulus” or “absolute value”, so throw away the minus sign for negative values.)
The mean deviation is, as is shown by the formula, an average (mean) difference from the mean.
Calculate the mean deviation for the same 6 values used above.
121 122 123 124 125 126
Ans: _______.
(Note: some statistics packages, e.g. Instat, have an option to calculate the mean deviation.)
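As an illustration of the formula, here is a minimal Python sketch applied to the 7 values from section 1 rather than to the exercise data. The function name `mean_deviation` is our own, and the sketch assumes at least two observations so that (n-1) is not zero.

```python
def mean_deviation(values):
    """Sum of absolute deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / (n - 1)

values = [12, 15, 11, 18, 13, 14, 18]   # the 7 values from section 1
print(round(mean_deviation(values), 2))
```

The call to `abs` plays the role of the modulus symbol: it discards the sign of each deviation before the deviations are summed.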
5. Dividing by (n-1)
Why divide by (n-1) in the formula above?
Here are some possible answers. Tick all those that help to explain the reason to you:
a) You start with n pieces of information. You use one piece of information to calculate the mean (which you need first). So you have only (n-1) pieces of information left to calculate the mean deviation. The number of pieces of information to calculate the spread is called the “degrees of freedom”.
b) If you only have one observation, e.g. 121, you can give the mean (trivially) but cannot calculate any spread. If you have 2 observations, then the only information about spread is the difference. So you have one piece of information less to calculate the spread.
c) It would be simpler to divide by n, but statisticians always like to complicate things.
d) If you divide by n, then $\sum |x - \bar{x}| / n$ is like the formula $\sum x / n$ for the mean. So I can see why it is called the “mean deviation”. It is still roughly the same formula when we divide by n-1.
6. The variance and standard deviation of the data
The variance is a measure of variability, and is often denoted by $s^2$.
The variance, $s^2$, is given by the formula
$s^2 = \sum (x - \bar{x})^2 / (n - 1)$
a. Calculate the variance for the same values, i.e. 121, 122, 123, 124, 125, 126. Ans: _______
So the variance is roughly the mean value of $(x - \bar{x})^2$.
The standard deviation (s.d.) is a commonly used summary measure of variation or spread of a set of data. It is a “typical” distance from the mean.
Usually, about 70% of the observations are closer than 1 standard deviation from the mean and most (about 95%) are within 2 s.d. of the mean.
The standard deviation is a symmetrical measure of spread, and hence is less useful and more difficult to interpret for data sets that are skew. It is also sensitive to (i.e. its value can be greatly changed by) the presence of outliers in the data.
With the data values 12 15 11 18 13 14 18 the variance, $s^2$, was calculated as 7.62, so the standard deviation is:
$s = \sqrt{7.62} = 2.8$
The mean, $\bar{x}$, was 14.4, so $(\bar{x} - s)$ = (14.4 – 2.8) = 11.6 and $(\bar{x} + s)$ = 17.2
So 4 of the 7 observations are within one standard deviation of the mean, while the other three are outside.
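The worked example above can be reproduced with a short Python sketch. It is an illustration only; the variable names are our own, and the variance uses the (n-1) divisor from the formula in this section.

```python
import math

values = [12, 15, 11, 18, 13, 14, 18]
n = len(values)
mean = sum(values) / n

# Variance: sum of squared deviations from the mean, divided by (n - 1).
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

# Observations that lie strictly between (mean - sd) and (mean + sd).
within_one_sd = [x for x in values if mean - sd < x < mean + sd]

print(f"variance = {variance:.2f}, sd = {sd:.1f}")
print(f"{len(within_one_sd)} of {n} values lie within one s.d. of the mean")
```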
b. Simply put, the standard deviation is s, which is the square root of the variance. Calculate the standard deviation, s. Ans: ______
c. Is the standard deviation larger than the mean deviation? Ans: Yes/No
d. Do you think this will almost always be the case?
a) Yes.
b) No, sometimes it will be larger, and sometimes smaller.
c) I am not sure – it would be good to check with more examples.
d) Yes, because when you square (in calculating the variance) you give the big deviations even more importance.
7. Using Excel for these formulae
a. Go into Excel and follow these instructions.
No: Instruction | Comment | Answer
1 | Type n in cell A1, and x in cell B1 | Naming the variables |
2 | Enter the numbers 1, 2, … 7 below n. Cells (A2:A8) | |
3 | Enter 12, 15, 11, 18, 13, 14, 18 below x. Cells (B2:B8) | |
4 | In cells (A10:A13) type Sum, (n-1), mean, stdev | |
5 | In B10 use the Σ function or type =SUM(B2:B8) | This is $\sum x$ | = ______
6 | In B11 use the ▼ by the Σ, or type =COUNT(B2:B8)-1 | This is (n-1) | = ______
7 | In B12 use the ▼ again, or type =AVERAGE(B2:B8) | This is $\bar{x}$ | = ______
8 | In B13 use f_x and then STDEV, or type =STDEV(B2:B8) | The standard deviation | = ______
9 | Decrease the number of decimals to 2 in B12 and B13 | Tidying |
Now, to reinforce the formulae, the standard deviation is to be calculated from first principles. This has the bonus that you will calculate the mean deviation at the same time.
The Excel function for the mean deviation is called AVEDEV, but it divides by n instead of n-1, so we prefer to calculate it ourselves.
b. Follow the instructions below:
No: Instruction | Comment | Answer
1 | Type (x-mean) in cell D1, │x-mean│ in E1 and (x-mean)² in F1 | Naming the variables |
2 | Enter the formula =(B2-B$12) in D2. Copy down to D8 | Note the $ to fix B12 |
3 | Enter the formula =ABS(D2) into E2. Copy down to E8 | The absolute deviations |
4 | In E10 use the Σ function or type =SUM(E2:E8) | Sum of these deviations | ______
5 | In E11 calculate =E10/B11 | Mean deviation | ______
6 | Cut the decimals back to 2 so the deviations and results are clearer | Also do this below |
7 | In D11 type mdev to remind you what is calculated | |
8 | Enter the formula =D2*D2 into F2. Copy down to F8 | The deviations squared |
9 | In F10 use the Σ function or type =SUM(F2:F8) | This is $\sum (x - \bar{x})^2$, the sum of the squared deviations | ______
10 | In F11 calculate =F10/B11 | The variance | ______
11 | In F13 calculate =SQRT(F11) | The standard deviation | ______
12 | Compare F13 with B13 to check they are the same. (Hint: if they are not, check all your values are in the right cells.) | |
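For comparison, the spreadsheet layout above can be mirrored column by column in Python. This is a sketch under our own naming, not part of the Excel exercise; the cell references in the comments show which part of the worksheet each line is meant to correspond to. `statistics.stdev` divides by (n-1), so it is used here as the counterpart of the worksheet's built-in standard deviation.

```python
import math
import statistics

x = [12, 15, 11, 18, 13, 14, 18]                    # column B (B2:B8)
n_minus_1 = len(x) - 1                              # B11
mean = sum(x) / len(x)                              # B12

deviations = [value - mean for value in x]          # column D
abs_deviations = [abs(d) for d in deviations]       # column E
squared_deviations = [d * d for d in deviations]    # column F

mean_deviation = sum(abs_deviations) / n_minus_1    # E11
variance = sum(squared_deviations) / n_minus_1      # F11
stdev = math.sqrt(variance)                         # F13

# The counterpart of comparing F13 with B13.
print(math.isclose(stdev, statistics.stdev(x)))
```

The final line prints whether the first-principles value agrees with the library value, which is the same check as step 12.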
The aim of this exercise was to reinforce the use of the formulae. They can help you to understand the summary statistics. It has also shown that once you know a formula you can use it in Excel from first principles to construct a new summary. In earlier sessions you saw that you could construct a new graph, a (jittered) dot plot, if you understood the concept. This is the parallel idea for summary statistics.
8. The coefficient of variation
The coefficient of variation, sometimes called the cv, is a summary statistic that Excel does not have a function for, but it is easy to calculate.
The formula is cv = 100*stdev/mean
Its appeal is that it is dimensionless: you do not need to know the units of measurement for it to have meaning.
It measures the variation in a set of data (stdev) as a fraction of the mean, expressed as a percentage.
a. Type cv in cell A14 and, in B14, calculate the cv for the data in B2:B8.
No summary statistic should be used unless it helps you and the reader to interpret the data. The cv is an overused summary statistic and is sometimes used when it is not a sensible summary to calculate.
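The formula cv = 100*stdev/mean can also be sketched in Python. This is an illustration only: the function name `coefficient_of_variation` is our own, and it relies on `statistics.stdev`, which uses the (n-1) divisor.

```python
import statistics

def coefficient_of_variation(values):
    """cv = 100 * stdev / mean, expressed as a percentage."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

values = [12, 15, 11, 18, 13, 14, 18]   # the data in B2:B8
cv = coefficient_of_variation(values)
print(f"cv = {cv:.1f}%")
```

Because the mean appears in the denominator, the result depends heavily on where the mean lies, which is worth keeping in mind when judging whether the cv is a sensible summary.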
b. Looking at the formula, what are the situations when the cv would not be sensible? (See also the hint below.)
Hint: Copy all the values from column B (i.e. B1 to B14) into column H, so you can play without affecting the main data. Then change the last value of 18 to -18. You find that the cv is now 132%.
There is nothing intrinsically wrong with a cv of over 100%, but calculating the cv with negative values (even zeros) is starting to look odd.
c. You can make it a nightmare by replacing the 14 by -44 (say), so the mean is now 1. What is the cv now?
d. And even more exciting would be to use -51! Why is that exciting?
The use of the coefficient of variation (cv) is also discussed in the presentation.