Statistics

The Sample Variance
© Christine Crisp
Edited by Dr Mike Hughes
Can you find the medians and means for the following 3 data sets?

                   Set A   Set B   Set C
                     1       1       1
                     2       1       5
                     3       1       5
                     4       4       5
                     5       5       5
                     6       6       5
                     7       9       5
                     8       9       5
                     9       9       9
Median               5       5       5
Mean, $\bar{x}$      5       5       5

Although the medians and means are the same, the data
sets are not really alike.
The spread or variability of the numbers is quite different.
How can we measure the spread within the data sets?
ANS: The range and inter-quartile range both measure
spread but neither uses all the data items.
We will cover the interquartile range later, with cumulative frequency.
If you had to invent a method of measuring spread that
used all the data items, what could you do?
One thing we could do is find out how far each item is
from the mean and add up these differences. e.g.
Set A ( $\bar{x} = 5$ ):

x                 1    2    3    4    5    6    7    8    9
$x - \bar{x}$    -4   -3   -2   -1    0    1    2    3    4

$$\sum (x - \bar{x}) = -4 - 3 - \dots + 3 + 4 = 0$$
Data sets B and C give the same result. The negative and
positive values have cancelled each other out.
To avoid the effect of the negative values we can either
• ignore the negative signs, or
• square each difference ( since a square is never negative ).
Squaring is more convenient for developing theory, so, e.g.
Set A ( $\bar{x} = 5$ ):

x                     1    2    3    4    5    6    7    8    9
$x - \bar{x}$        -4   -3   -2   -1    0    1    2    3    4
$(x - \bar{x})^2$    16    9    4    1    0    1    4    9   16

$$\sum (x - \bar{x})^2 = 60$$

Let’s do this calculation for all 3 data sets:
Each set has mean $\bar{x} = 5$.

Set A:  1 2 3 4 5 6 7 8 9     $\sum (x - \bar{x})^2 = 60$
Set B:  1 1 1 4 5 6 9 9 9     $\sum (x - \bar{x})^2 = 98$
Set C:  1 5 5 5 5 5 5 5 9     $\sum (x - \bar{x})^2 = 32$

The larger value for set B shows greater variability.
Set C has least variability.
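As a check on the three totals, the calculation can be sketched in a few lines of Python. This is only an illustration: the function name `sum_squared_deviations` and the dictionary `data_sets` are made up for this example and are not part of the course material.

```python
def sum_squared_deviations(data):
    """Return the sum of (x - mean)^2 over every item in data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


# The three data sets from the start of this section.
data_sets = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [1, 1, 1, 4, 5, 6, 9, 9, 9],
    "C": [1, 5, 5, 5, 5, 5, 5, 5, 9],
}

totals = {name: sum_squared_deviations(values) for name, values in data_sets.items()}
for name, total in totals.items():
    print(f"Set {name}: sum of squared deviations = {total}")
```

Because every item contributes to the total, this measure uses all the data, unlike the range or inter-quartile range.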
Can you see a snag with this measurement?
ANS: The calculated value increases if we have more data,
so comparing data sets with different numbers of items
would not be possible.
To allow for this, we need to take n, the number of items,
into account.
There are 2 formulae that can be used,
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$$
the mean square deviation, or
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
the sample variance.
Our data is nearly always a sample from a large unknown
set of data ( the population ) and we take samples to find
out about the population. The 1st formula does not give
the best estimate of the variance of the population so is
not used.
So, there are 2 quantities and their square roots that we
need to be clear about:
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$$
the mean square deviation ( POPULATION VARIANCE ), and
$$\text{rmsd} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
the root mean square deviation ( POPULATION STANDARD DEVIATION ).
Also
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
the sample variance, and
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
the sample standard deviation.
We nearly ALWAYS use these last two formulae.
e.g. Find the rmsd and msd of the following data:
x:  7, 9, 14
Mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{30}{3} = 10$$
(i)
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n} = \frac{(7 - 10)^2 + (9 - 10)^2 + (14 - 10)^2}{3} = \frac{9 + 1 + 16}{3} = 8.67 \text{ ( 3 s.f. )}$$
(ii)
$$\text{msd} = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{49 + 81 + 196}{3} - 10^2 = \frac{326}{3} - 100 = 8.67 \text{ ( 3 s.f. )}$$
Taking the square root of the unrounded msd gives rmsd = 2.94 ( 3 s.f. ).
The 2nd form is algebraically equivalent to the first form but
quicker to use.
e.g. Find the sample SD and variance of the following
data:
x:  7, 9, 14
Mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{30}{3} = 10$$
(i)
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(7 - 10)^2 + (9 - 10)^2 + (14 - 10)^2}{2} = \frac{9 + 1 + 16}{2} = 13.0 \text{ ( 3 s.f. )}$$
(ii)
$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1} = \frac{49 + 81 + 196 - 3 \times 10^2}{2} = \frac{326}{2} - 150 = 13.0 \text{ ( 3 s.f. )}$$
The sample SD is $s = \sqrt{13} = 3.61$ ( 3 s.f. ).
The 2nd form is in general quicker to use.
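To see that the two forms of the sample variance agree, here is a short Python sketch using the data 7, 9, 14 from the worked example. The variable names `s2_definition` and `s2_shortcut` are chosen for this illustration only.

```python
import math

data = [7, 9, 14]
n = len(data)
mean = sum(data) / n

# Form (i): sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Form (ii): (sum of x^2 minus n times mean^2), divided by n - 1.
s2_shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

s = math.sqrt(s2_definition)
print(s2_definition, s2_shortcut, round(s, 2))
```

Form (ii) needs only the running totals of x and x², which is why it is usually quicker by hand.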
This all seems very complicated but help is at hand.
Both the quantities, rmsd and s are given by your
calculator.
e.g. Find the root mean square deviation, rmsd, and
the sample standard deviation, s, for the following data:
x:  7, 9, 14
Use the Statistics function on your calculator and
enter the data. Select the list of calculations.
You will be able to find the following:
2.94392 . . . and 3.60555 . . .
The rmsd is smaller than s ( because we are dividing by
a larger number ). Correct to 3 s.f. we have
$$\text{rmsd} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = 2.94, \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = 3.61$$
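If you are working on a computer rather than a calculator, Python's standard `statistics` module offers the same pair of values. This is a sketch of one possible equivalent, not the method the course expects in an exam: `pstdev` divides by n ( giving the rmsd ) and `stdev` divides by n - 1 ( giving s ).

```python
import statistics

data = [7, 9, 14]

rmsd = statistics.pstdev(data)    # divides by n
s = statistics.stdev(data)        # divides by n - 1

msd = statistics.pvariance(data)  # square of rmsd
s2 = statistics.variance(data)    # square of s

print(f"rmsd = {rmsd:.5f}, s = {s:.5f}")
print(f"msd = {msd:.2f}, sample variance = {s2}")
```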
So, for the data
x:  7, 9, 14
we have
$$\text{rmsd} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = 2.94, \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = 3.61$$
Squaring these gives
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n} = 8.67 \quad \text{( mean square deviation )}$$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = 13 \quad \text{( sample variance )}$$
The part of the formula $\sum (x - \bar{x})^2$ is in your formulae
sheet, labelled $S_{xx}$ ( said as "sum of squares x x" ).
An expanded form of the expression is also given. All you
have to do is divide by the correct quantity.
SUMMARY
• The mean square deviation, msd, and sample variance
both measure the spread or variability in the data.
• If we have raw data we use the statistical functions on
the calculator to find the rmsd or sample standard
deviation.
The sample standard deviation is larger than the
rmsd because we divide by (n – 1) rather than n.
• To find the msd or sample variance, we square the
relevant quantity given by the calculator:
sample variance $= s^2$,  msd $= (\text{rmsd})^2$
• Your formulae sheet gives the formula or its
equivalent:
$$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
Then, we divide by n for the msd or (n – 1) for $s^2$.
Frequency Data
The formula for the variance can be easily adapted
to find the variance of frequency data.
$$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
becomes, for FREQUENCY DATA,
$$S_{xx} = \sum (x - \bar{x})^2 f = \sum x^2 f - \frac{(\sum xf)^2}{\sum f}$$
We usually only use the formulae if we are given
summary data. With raw data we enter the data
into the calculator and use the statistical functions
to get the answers directly.
But note that
$$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} = \sum x^2 - n\bar{x}^2$$
becomes
$$S_{xx} = \sum (x - \bar{x})^2 f = \sum x^2 f - \frac{(\sum xf)^2}{\sum f} = \sum x^2 f - n\bar{x}^2$$
where $n = \sum f$.
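So msd $= S_{xx}/n$ and variance $= S_{xx}/(n - 1)$ become
$$\text{msd} = \frac{\sum x^2 - n\bar{x}^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2$$
$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$
Frequency Data ( with $n = \sum f$ )
$$\text{msd} = \frac{\sum x^2 f - n\bar{x}^2}{n} = \frac{\sum x^2 f}{n} - \bar{x}^2$$
$$\text{Variance} = \frac{\sum (x - \bar{x})^2 f}{n - 1} = \frac{\sum x^2 f - n\bar{x}^2}{n - 1}$$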
e.g.1 Find the mean and sample standard deviation of
the following data:

x               1    2    5   10
Frequency, f    3    5    8    4

Solution:
Using the calculator functions, the mean is $\bar{x} = 4.65$ and the
sample standard deviation is $s = 3.17$ ( 3 s.f. ).
Although we don’t need the formula for this question, let’s
check we have the correct value by using the formula:
$$S_{xx} = \sum x^2 f - \frac{(\sum xf)^2}{\sum f}$$
So,
$$S_{xx} = (1^2 \times 3 + \dots + 10^2 \times 4) - \frac{(1 \times 3 + \dots + 10 \times 4)^2}{20} = 623 - \frac{93^2}{20} = 190.55$$
$$s^2 = \frac{S_{xx}}{n - 1} = \frac{190.55}{19} = 10.029 \quad \Rightarrow \quad s = 3.17 \text{ ( 3 s.f. )}$$
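The same check can be sketched in Python. The lists `xs` and `fs` hold the values and frequencies from e.g.1; the names are chosen for this illustration and do not come from the formulae sheet.

```python
import math

xs = [1, 2, 5, 10]   # values of x
fs = [3, 5, 8, 4]    # frequency of each value

n = sum(fs)                                       # n = sum of f
sum_xf = sum(x * f for x, f in zip(xs, fs))       # sum of xf
sum_x2f = sum(x * x * f for x, f in zip(xs, fs))  # sum of x^2 f

mean = sum_xf / n
sxx = sum_x2f - sum_xf ** 2 / n
s2 = sxx / (n - 1)
s = math.sqrt(s2)

print(f"mean = {mean}, Sxx = {sxx:.2f}, s^2 = {s2:.3f}, s = {s:.2f}")
```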
e.g.2 Find the sample standard deviation of the following
lengths:

Length (cm)      1-9    10-14    15-19     20-29
Frequency, f      2       7       12         9

Solution: We need the class mid-values, x.

Length (cm)      1-9    10-14    15-19     20-29
x                 5      12       17       24.5
Frequency, f      2       7       12         9
xf               10      84      204      220.5
x²               25     144      289     600.25
x²f              50    1008     3468    5402.25

$n = \sum f = 30$,  $\sum xf = 518.5$,  $\sum x^2 f = 9928.25$
$$\bar{x} = \frac{\sum xf}{\sum f} = \frac{518.5}{30} = 17.283$$
$$\text{Sample variance} = \frac{\sum x^2 f - n\bar{x}^2}{n - 1} = \frac{9928.25 - 30(17.283\ldots)^2}{29} = 33.339$$
( using the unrounded mean; rounding the mean to 17.283 first gives 33.351 )
Standard deviation, s = 5.77 ( 3 s.f. )
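For grouped data the only extra step is finding the class mid-values. The sketch below assumes, as the worked solution does, that each mid-value is halfway between the stated class limits; the names `classes` and `midpoints` are illustrative.

```python
import math

# Class limits and frequencies for the grouped lengths in e.g.2.
classes = [(1, 9), (10, 14), (15, 19), (20, 29)]
freqs = [2, 7, 12, 9]

# Mid-value of each class: halfway between its lower and upper limits.
midpoints = [(low + high) / 2 for low, high in classes]

n = sum(freqs)
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))

mean = sum_xf / n
variance = (sum_x2f - n * mean ** 2) / (n - 1)
s = math.sqrt(variance)

print(f"mean = {mean:.3f}, sample variance = {variance:.3f}, s = {s:.2f}")
```

Keeping the mean unrounded inside the calculation avoids the small rounding drift that appears when 17.283 is substituted by hand.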
e.g.3 Find the mean and sample variance of 20 values of
x given the following:
$\sum x = 82$ and $\sum x^2 = 370$
Solution:
Since we only have summary data, we must use the formulae.
$$\text{sample mean, } \bar{x} = \frac{\sum x}{n} = \frac{82}{20} = 4.1$$
$$S_{xx} = \sum x^2 - n\bar{x}^2 = 370 - 20(4.1^2) = 33.8$$
$$\text{sample variance, } s^2 = \frac{S_{xx}}{n - 1} = \frac{33.8}{19} = 1.78 \text{ ( 3 s.f. )}$$
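When only the totals are known, the calculation reduces to a small function. The helper below, `sample_stats_from_sums`, is a hypothetical name used for this sketch; it takes n, the sum of x and the sum of x², and returns the mean and sample variance.

```python
def sample_stats_from_sums(n, sum_x, sum_x2):
    """Return (mean, sample variance) from n, sum of x and sum of x^2."""
    mean = sum_x / n
    sxx = sum_x2 - sum_x ** 2 / n   # Sxx = sum of x^2 - (sum of x)^2 / n
    return mean, sxx / (n - 1)


# Summary data from e.g.3: 20 values with sum 82 and sum of squares 370.
mean, s2 = sample_stats_from_sums(20, 82, 370)
print(f"mean = {mean}, sample variance = {s2:.2f}")
```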
SUMMARY
Raw data
$$\text{msd} = \frac{\sum x^2 - n\bar{x}^2}{n} \qquad \text{Sample variance} = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$
Frequency data
$$\text{msd} = \frac{\sum x^2 f - n\bar{x}^2}{n} \qquad \text{Sample variance} = \frac{\sum x^2 f - n\bar{x}^2}{n - 1}$$
Take the square root for the rmsd and the sample
standard deviation.
The MSD is also called the POPULATION VARIANCE.
The RMSD is also called the POPULATION STANDARD DEVIATION.
Exercise
Find the mean, sample standard deviation and sample
variance for each of the following samples, using calculator
functions where appropriate.
1.
x     1    2    3    4    5
f     7    9   14   12    8
2.
Time ( mins )    1-5   6-10   11-15   16-20   21-25
f                 7     9      14      12       8
3. 10 observations where $\sum x = 432$ and $\sum x^2 = 18912$
Answers
1.
x     1    2    3    4    5
f     7    9   14   12    8
Answer:
mean, $\bar{x} = 3.1$
standard deviation, s = 1.28 ( 3 s.f. )
variance, $s^2 = 1.64$ ( 3 s.f. )
N.B. To find $s^2$ we need to use the full calculator value
for s, not the answer to 3 s.f.
2.
Time ( mins )    1-5   6-10   11-15   16-20   21-25
x                 3     8      13      18      23
f                 7     9      14      12       8
Answer:
mean, $\bar{x} = 13.5$
standard deviation, s = 6.41 ( 3 s.f. )
variance, $s^2 = 41.1$ ( 3 s.f. )
3. 10 observations where $\sum x = 432$ and $\sum x^2 = 18912$
Solution:
$$\text{mean, } \bar{x} = \frac{\sum x}{n} = \frac{432}{10} = 43.2$$
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 18912 - \frac{(432)^2}{10} = 249.6$$
$$\text{variance, } s^2 = \frac{S_{xx}}{n - 1} = \frac{249.6}{9} = 27.7 \text{ ( 3 s.f. )}$$
$$\text{standard deviation, } s = \sqrt{\frac{S_{xx}}{n - 1}} = 5.27 \text{ ( 3 s.f. )}$$
There are 2 formulae that can be used to measure spread:
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$$
the mean square deviation, or
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
the sample variance.
In many books you will find the word variance used for
the 1st of these formulae and you may have used it at
GCSE.
However, our data is nearly always a sample from a large
unknown set of data ( the population ) and we take the
sample to find out about the population. The 1st formula
does not give the best estimate of the variance of the
population so is not used.
So, there are 2 quantities and their square roots that we
need to be clear about:
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$$
the mean square deviation, and
$$\text{rmsd} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
the root mean square deviation.
Also
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
the sample variance, and
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
the sample standard deviation.
e.g. Find the root mean square deviation, rmsd, and
the sample standard deviation, s, for the following data:
x:  7, 9, 14
Use the Statistics function on your calculator and
enter the data. Select the list of calculations.
You will be able to find the following:
2.94392 . . .
3.60555 . . .
Ignore the calculator notation.
The rmsd is smaller than s ( because we are dividing by
a larger number ). Correct to 3 s.f. we have
$$\text{rmsd} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = 2.94, \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = 3.61$$
Squaring these gives
$$\text{msd} = \frac{\sum (x - \bar{x})^2}{n} = 8.67 \quad \text{( mean square deviation )}$$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = 13 \quad \text{( variance )}$$
Using the formulae:
If summary data are given, you will need to use the
formulae instead of the calculator functions.
The part of the formula $\sum (x - \bar{x})^2$ is in your formulae
booklet ( see correlation and regression ), labelled $S_{xx}$.
An expanded form of the expression is also given. All you
have to do is divide by the correct quantity, n or n – 1.
SUMMARY
• The mean square deviation, msd, and sample variance
both measure the spread or variability in the data.
• If we have raw data we use the stats functions on the
calculator to find the rmsd or sample standard deviation.
The sample standard deviation is the larger of these
quantities.
• To find the msd or sample variance, we square the
relevant quantity given by the calculator:
sample variance $= s^2$,  msd $= (\text{rmsd})^2$
• For summary data, we use the formulae book, choosing
the appropriate form:
$$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
Then, we divide by n for the msd or (n – 1) for $s^2$.
e.g.1 For the following sample data, find
(a) the root mean square deviation, rmsd,
(b) the mean square deviation, msd,
(c) the sample standard deviation, s, and
(d) the sample variance $s^2$.
x:  12, 15, 14, 9
Answer: Using the calculator functions,
(a) rmsd = 2.29 ( 3 s.f. )
(b) msd = 5.25
(c) s = 2.65 ( 3 s.f. )
(d) $s^2 = 7$
e.g.2 Given the following summary of data for a sample of
size 5, find
(a) the mean square deviation, msd,
(b) the root mean square deviation, rmsd,
(c) the sample variance $s^2$, and
(d) the sample standard deviation, s.
$n = 5$,  $\sum (x - \bar{x})^2 = 24$
Solution: Using the formulae book, $S_{xx} = \sum (x - \bar{x})^2 = 24$.
(a) msd $= \frac{S_{xx}}{n} = \frac{24}{5} = 4.8$
(b) rmsd $= \sqrt{4.8} = 2.19$ ( 3 s.f. )
(c) $s^2 = \frac{S_{xx}}{n - 1} = \frac{24}{4} = 6$
(d) $s = \sqrt{6} = 2.45$ ( 3 s.f. )
Frequency Data
The formula for the variance can be easily adapted
to find the variance of frequency data.
$$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
becomes
$$S_{xx} = \sum (x - \bar{x})^2 f = \sum x^2 f - \frac{(\sum xf)^2}{\sum f}$$
As before, we only use the formulae if we are given
summary data.
e.g.1 Find the mean and sample standard deviation of
the following data:

x               1    2    5   10
Frequency, f    3    5    8    4

Solution:
$$S_{xx} = \sum (x - \bar{x})^2 f = \sum x^2 f - \frac{(\sum xf)^2}{\sum f}$$
So,
$$S_{xx} = (1^2 \times 3 + \dots + 10^2 \times 4) - \frac{(1 \times 3 + \dots + 10 \times 4)^2}{20} = 190.55$$
$$s^2 = \frac{S_{xx}}{n - 1} = \frac{190.55}{19} = 10.029 \quad \Rightarrow \quad s = 3.17 \text{ ( 3 s.f. )}$$
e.g.2 Find the sample standard deviation of the following
lengths:

Length (cm)      1-9    10-14    15-19    20-29
Frequency, f      2       7       12        9

Solution: We need the class mid-values

x                 5      12       17      24.5
Frequency, f      2       7       12        9

We can now enter the values of x and f on our
calculators.
Standard deviation, s = 5.77 ( 3 s.f. )
SUMMARY
To find the root mean square deviation, rmsd, or the
sample standard deviation, s, using the calculator
functions,
• the values of x ( and f ) are entered and checked,
• the table of calculations gives both values,
• the larger value is the sample standard deviation,
s, and this is the value that is most often used by
statisticians,
• the variance is the square of the standard
deviation.
Outliers
We’ve already seen that an outlier is a data item that
lies well away from the other data. It may be a genuine
observation or an error in the data.
e.g. 1 Consider the following data:
10 12 14 17 19 21 81
With this data set, we would immediately suspect an
error: the value 81 was probably meant to be 18. Such an
error has a large effect on the mean and standard
deviation, although the median is not affected and
there is little effect on the IQR.
The presence of possible outliers is an argument in
favour of using the median and IQR to summarise data.
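The effect of the suspect value can be checked with a short Python sketch using the standard `statistics` module. It assumes, as suggested above, that 81 should have been 18; the list names are illustrative.

```python
import statistics

with_error = [10, 12, 14, 17, 19, 21, 81]
corrected = [10, 12, 14, 17, 18, 19, 21]   # assumes 81 was a mistyped 18

for label, data in (("with 81", with_error), ("with 18", corrected)):
    print(
        f"{label}: median = {statistics.median(data)}, "
        f"mean = {statistics.mean(data):.2f}, "
        f"s = {statistics.stdev(data):.2f}"
    )
```

The median is the same for both lists, while the mean and the sample standard deviation change considerably.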
In an earlier section, we met a method of identifying
outliers using a measure of 1.5 × IQR above or below the
median.
A 2nd method used to identify outliers is to find points that
are further than 2 standard deviations from the mean.
e.g. 2. Consider the following sample:
10 12 14 17 18 19 21 22 24 33
The sample mean and sample standard deviation are:
mean, $\bar{x} = 19$
standard deviation, s = 6.62 ( 3 s.f. )
So, $2s = 13.2$ and $\bar{x} + 13.2 = 32.2$
The point 33 is more than 2 standard deviations above
the mean so, using this measure, it is an outlier.
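The 2 standard deviation rule can be sketched in Python as follows. The threshold of 2 comes from the text above; the variable names are chosen for this example, and the check looks on both sides of the mean.

```python
import statistics

sample = [10, 12, 14, 17, 18, 19, 21, 22, 24, 33]

mean = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation, divides by n - 1

# Flag any point further than 2 standard deviations from the mean.
outliers = [x for x in sample if abs(x - mean) > 2 * s]

print(f"mean = {mean}, s = {s:.2f}, limits = {mean - 2 * s:.1f} to {mean + 2 * s:.1f}")
print("outliers:", outliers)
```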