2.4 Measures of Variation

What is variability in data?
• A measure of how much the group as a whole deviates from the center.
• It indicates how spread out the data are.
• The common measures of variation are the range, deviation, variance and standard deviation.
Range
The range is the simplest measure of variation. It is the difference
between the largest and smallest values in the data set.
Range = Maximum value - Minimum value
Range has the advantage of being easy to compute.
Its disadvantage, however, is that it uses only two entries from the
entire data set.
Age based on class survey data: 26, 25, 35, 35, 40, 41, 21, 19, 20, 20,
30, 25, 24, 47, 36, 16, 23, 48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24,
33, 32, 28, 48, 40, 38.
Range = maximum – minimum = 48 – 16 = 32
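The range calculation above can be checked with a few lines of Python. This is only a minimal sketch: the names `ages` and `data_range` are illustrative and do not come from the slides.

```python
# Ages from the class survey (34 entries).
ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

# Range = maximum value - minimum value.
data_range = max(ages) - min(ages)
print(data_range)  # 32
```

Only two entries (16 and 48) influence the result, which is the disadvantage noted above.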
Deviation, Variance and Standard Deviation
The deviation of an entry $x_i$ in a data set is the difference between
that entry and the mean $\mu$ of the data set, i.e. $x_i - \mu$.
The population variance of a population data set of $N$ entries is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
The population standard deviation is the square root of the
population variance, i.e. $\sigma = \sqrt{\sigma^2}$.
The sample variance of a sample data set of $n$ entries is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the sample mean.
The sample standard deviation is the square root of the
sample variance, i.e. $s = \sqrt{s^2}$.
Age based on class survey: 26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30,
25, 24, 47, 36, 16, 23, 48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33,
32, 28, 48, 40, 38.
Population size N = 34, Population mean μ = 1024/34 = 30.11765
Age (xi)    xi - μ      (xi - μ)²
26          -4.1176     16.9550
25          -5.1176     26.1903
:           :           :
38          7.8824      62.1315
                        Σ = 2797.5294
σ² = Σ(xi - μ)² / N = 2797.5294 / 34 = 82.2803
σ = √82.2803 = 9.0708
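The worked example can be reproduced as follows. This is a sketch under the slide's assumption that the 34 ages form the whole population; the variable names are illustrative. The sample versions are included for comparison and differ only in dividing by n - 1.

```python
import math

# Ages from the class survey, treated here as a population of N = 34.
ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

N = len(ages)
mu = sum(ages) / N                           # population mean
deviations = [x - mu for x in ages]          # x_i - mu
sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations

pop_var = sum_sq / N                         # population variance
pop_sd = math.sqrt(pop_var)                  # population standard deviation

# Sample versions: same sum of squares, divided by n - 1 instead of N.
samp_var = sum_sq / (N - 1)
samp_sd = math.sqrt(samp_var)

print(round(mu, 5), round(sum_sq, 4), round(pop_var, 4), round(pop_sd, 4))
print(round(samp_var, 4), round(samp_sd, 4))
```

The first printed line should agree with the table values up to rounding. The standard library functions `statistics.pvariance`, `statistics.pstdev`, `statistics.variance` and `statistics.stdev` compute the same four quantities directly.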
Variance and standard deviation take all the data into account.
However, both are easily influenced by extreme values, because each
deviation is squared.
Variance is hard to interpret because it is expressed in squared units.
Standard deviation is in the same units as the data and can be read,
roughly, as the typical deviation from the mean.
Interpreting Standard Deviation
When interpreting the standard deviation, remember that it is a
measure of the typical amount an entry deviates from the mean.
The more the entries are spread out, the greater the standard
deviation.
Empirical Rule (the 68-95-99.7 rule):
For a bell-shaped, symmetric distribution, about 68% of the data lie
within one standard deviation of the mean, about 95% lie
within two standard deviations of the mean and about 99.7%
lie within three standard deviations of the mean.
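The three percentages in the rule can be recovered from the normal distribution itself. The sketch below assumes an exactly normal (bell-shaped) distribution and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `prob_within` is illustrative.

```python
from statistics import NormalDist

def prob_within(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Real data are only approximately bell-shaped, so observed proportions will usually be close to these figures rather than equal to them.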
Chebychev's theorem
When the distribution is not bell-shaped or symmetric, this
theorem gives a lower bound on the proportion of data that lies within
k standard deviations of the mean. It states that:
The proportion of any data set lying within k standard deviations
of the mean (k > 1) is at least $1 - \frac{1}{k^2}$.
For k = 2: in any data set, at least $1 - \frac{1}{2^2} = \frac{3}{4}$, i.e. 75%, of the data lies
within 2 standard deviations of the mean.
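As a quick check of the bound, the sketch below compares it with the proportion actually observed in the class survey ages. The function names `chebyshev_bound` and `fraction_within` are illustrative, and the ages are treated as a population, as in the earlier example.

```python
import statistics

ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on the proportion of data within
    k standard deviations of the mean (informative only for k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed proportion of data within k population standard
    deviations of the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)

for k in (1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 4), round(fraction_within(ages, k), 4))
```

The observed proportion is never below the bound. The bound is deliberately conservative because it has to hold for every possible data set.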
Standard Deviation of Grouped Data
The sample standard deviation for a frequency distribution is:
$$s = \sqrt{\frac{\sum_{i=1}^{c} (x_i - \bar{x})^2 f_i}{n - 1}}$$
where $c$ is the number of classes, $x_i$ is the data value (or class
midpoint) of the $i$th class, $f_i$ is the corresponding frequency and
$n = \sum_{i=1}^{c} f_i$ is the sample size.
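To illustrate the frequency form of the formula, the sketch below builds a frequency distribution from the class survey ages, with each distinct age acting as its own class, and applies the formula. Treating the ages as a sample and using `collections.Counter` for the frequencies are choices made for this example, not something the slides prescribe.

```python
import math
from collections import Counter

ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

# Frequency distribution: each distinct value x_i maps to its frequency f_i.
freq = Counter(ages)

n = sum(freq.values())                                  # sample size
x_bar = sum(x * f for x, f in freq.items()) / n         # sample mean
weighted_ss = sum((x - x_bar) ** 2 * f for x, f in freq.items())
s_grouped = math.sqrt(weighted_ss / (n - 1))            # sample std. deviation

print(n, round(x_bar, 4), round(s_grouped, 4))
```

Because every distinct value is its own class here, the result equals the ordinary sample standard deviation of the raw data. With wider classes and midpoints it would only approximate it.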
2.5 Measures of Position
What are measures of position?
• A measure of position indicates where a particular data value would rank in an ordering of the data set.
• It can also indicate where a data value falls relative to the mean of the sample or population.
Quartiles
Quartiles divide the ordered data into 4 equal parts.
We need three quartiles, Q1, Q2 and Q3, to divide any data set into 4 equal
parts.
• About a quarter of the data falls below the first quartile, Q1.
• About a half of the data falls below the second quartile, Q2.
• About three quarters of the data falls below the third quartile, Q3.
The interquartile range (IQR) of a data set is the difference
between the third and first quartiles, Q3 - Q1.
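Quartiles and the IQR can be computed with the Python standard library, as sketched below for the class survey ages. Several quartile conventions exist and they can give slightly different values; `method="inclusive"` is an assumption made here because it interpolates between ordered data points and reproduces the quartile values quoted for the age data.

```python
import statistics

ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

# n=4 requests the three cut points that split the data into 4 parts.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

print(q1, q2, q3)  # 23.25 27.5 37.5
print(iqr)         # 14.25
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so it is less affected by extreme values.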
In essence, five values can be used to describe a data set: the minimum data
value, the three quartiles Q1, Q2 and Q3, and the maximum data value. These five
numbers are called the five-number summary, since together they describe the
central tendency and the spread (variation) of the data.
Drawing a Box-whisker plot
• Find the five-number summary of the data set.
• Construct a horizontal scale that spans the range of the data.
• Plot the five numbers above the horizontal scale.
• Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box at Q2.
• Draw whiskers from the box to the minimum and maximum entries.
For the age data: Min = 16, Q1=23.25, Q2 = 27.5, Q3 = 37.5, Max = 48
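The first step of the procedure, finding the five-number summary, might be sketched as below. The function name `five_number_summary` is hypothetical, and `method="inclusive"` is an assumed quartile convention chosen because it matches the values listed above.

```python
import statistics

ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

print(five_number_summary(ages))  # (16, 23.25, 27.5, 37.5, 48)
```

These five numbers are exactly the positions needed for the plot: the whisker ends, the two box edges and the line inside the box.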
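[Figure: box-and-whisker plot of the age data. The left whisker runs from the minimum entry to Q1, the box spans Q1 to Q3 with a vertical line at Q2 (the median), and the right whisker runs from Q3 to the maximum entry.]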
Percentiles and Other Fractiles
Fractiles are numbers that divide an ordered data set into
equal parts.
Some commonly used fractiles are:
Fractiles     Summary                                  Symbols
Quartiles     Divide a data set into 4 equal parts     Q1, Q2, Q3
Deciles       Divide a data set into 10 equal parts    D1, D2, D3, ..., D9
Percentiles   Divide a data set into 100 equal parts   P1, P2, P3, ..., P99
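All three kinds of fractile come from the same operation with a different number of parts. The sketch below uses `statistics.quantiles` on the class survey ages. The `method="inclusive"` convention is an assumption, and other conventions can shift individual values slightly.

```python
import statistics

ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

quartiles = statistics.quantiles(ages, n=4, method="inclusive")      # Q1..Q3
deciles = statistics.quantiles(ages, n=10, method="inclusive")       # D1..D9
percentiles = statistics.quantiles(ages, n=100, method="inclusive")  # P1..P99

# List indices start at 0, so D5 is deciles[4] and P50 is percentiles[49].
print(len(quartiles), len(deciles), len(percentiles))  # 3 9 99
print(quartiles[1], deciles[4], percentiles[49])       # 27.5 27.5 27.5
```

The second quartile, fifth decile and fiftieth percentile all coincide with the median, and P25 and P75 coincide with Q1 and Q3.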
z-score
The standard score, or z-score, represents the number of standard
deviations a given value x falls from the mean μ. To find the z-score for a
given value, use
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$
A z-score can be positive, negative or zero.
If z is positive, the data point > the mean,
If z is negative, the data point < the mean,
If z = 0, the data point = mean.
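The three cases can be seen on the class survey ages. This is a minimal sketch that treats the ages as a population, as in the earlier worked example; the function name `z_score` is illustrative.

```python
import statistics

ages = [26, 25, 35, 35, 40, 41, 21, 19, 20, 20, 30, 25, 24, 47, 36, 16, 23,
        48, 40, 21, 27, 22, 39, 34, 26, 25, 16, 24, 33, 32, 28, 48, 40, 38]

mu = statistics.fmean(ages)      # population mean
sigma = statistics.pstdev(ages)  # population standard deviation

def z_score(x, mean, stdev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / stdev

print(round(z_score(48, mu, sigma), 4))  # positive: 48 is above the mean
print(round(z_score(16, mu, sigma), 4))  # negative: 16 is below the mean
print(z_score(mu, mu, sigma))            # 0.0: the value equals the mean
```

Because z-scores are unit-free, they allow values from data sets with different means and spreads to be compared on one scale.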