Lecture_04_ch2_222_w05_s34

LESSON 4: MEASURES OF VARIABILITY
AND PROPORTION
Outline
• The range, variance, standard deviation and
coefficient of variation
• Interpretation of standard deviation
• Population and sample variance
• Approximation from the grouped data
• Skewness
• Interquartile range and box plots
• The proportion
1
MEASURES OF VARIABILITY: EXAMPLE
• Heights of players of two teams in inches are as follows:
Team I: 72,73,76,76,78, so mean=75, median=mode=76
Team II: 67,72,76,76,84, so mean=75, median=mode=76
• How about the variation?
2
MEASURES OF VARIABILITY
RANGE
• The first and simplest measure of variability is the
range.
• The range of a set of measurements is the numerical
difference between the largest and smallest
measurements.
Range = Largest value - Smallest value
3
MEASURES OF VARIABILITY
RANGE
• Team I Range = 78 - 72 = ____ inches
• Team II Range = 84 - 67 = ____ inches
• So, Team I
variation is
a. less
b. more
4
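The range comparison above can be checked with a few lines of Python. This is a minimal sketch using the heights from the example slide; the helper name `value_range` is only illustrative.

```python
# Heights (in inches) of the two teams from the example slide.
team_1 = [72, 73, 76, 76, 78]
team_2 = [67, 72, 76, 76, 84]


def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


range_1 = value_range(team_1)
range_2 = value_range(team_2)

print(f"Team I range:  {range_1} inches")
print(f"Team II range: {range_2} inches")

# Compare the two teams by this measure of variability.
if range_1 < range_2:
    print("Team I shows less variation than Team II")
elif range_1 > range_2:
    print("Team I shows more variation than Team II")
else:
    print("Both teams have the same range")
```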
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• A major drawback of the range is that it uses only two
extreme values, ignores all the intermediate values,
and provides no information on the dispersion of the
values between the smallest and largest
observations.
• On the other hand, the variance, standard deviation and
CV use all the values and provide information on
the dispersion of the intermediate values.
• Computing the variance, standard deviation or CV
requires the deviations from the mean.
5
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Team I deviations from the mean:
(72-75) = -3, (73-75) = -2, (76-75) = 1, (__ - __) = __, (__ - __) = __
6
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Team I deviations from the mean:
-3, -2, 1, __, __
• From the property of the mean (see Lesson 3, Slides 10-11), the sum of deviations from the mean is zero. Check:
-3 - 2 + 1 + __ + __ = __
7
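As a quick check of the zero-sum property, the deviations for Team I can be computed directly. This is a small sketch; the variable names are illustrative.

```python
# Team I heights (in inches) from the example slide.
team_1 = [72, 73, 76, 76, 78]

# Mean of the data.
mean_1 = sum(team_1) / len(team_1)

# Deviation of each measurement from the mean.
deviations = [x - mean_1 for x in team_1]

print("Mean:", mean_1)
print("Deviations from the mean:", deviations)
print("Sum of deviations:", sum(deviations))
```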
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• The sum of squared deviations from the mean is not
necessarily 0, e.g., the Team I sum of squared deviations is
$(-3)^2 + (-2)^2 + (1)^2 + (\_\_)^2 + (\_\_)^2 = \_\_\_\_\ \text{inch}^2$
• Although the sum of squared deviations increases as the
dispersion increases, it also depends on the
number of measurements. So, the mean squared
deviation is a preferred measure of dispersion.
8
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Variance is the mean squared deviation
• For example, Team I variance
$= \frac{(-3)^2 + (-2)^2 + (1)^2 + (\_\_)^2 + (\_\_)^2}{5} = \frac{\_\_\_\_}{5} = \_\_\_\_\ \text{inch}^2$
9
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Standard deviation is the root mean squared
deviation i.e., square root of variance.
• So, Team I standard deviation
$= \sqrt{\frac{(-3)^2 + (-2)^2 + (1)^2 + (\_\_)^2 + (\_\_)^2}{5}} = \sqrt{\frac{\_\_\_\_}{5}} = \_\_\_\_\ \text{inch}$
10
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Coefficient of variation is the standard deviation
divided by the mean.
• So, Team I coefficient of variation
$= \frac{\text{standard deviation}}{\text{mean}} = \_\_\_\_$
11
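The three measures can be computed together for both teams. The sketch below follows the definitions used on these slides (variance as the mean squared deviation, i.e., dividing by the number of values); the function name `describe` is an illustrative choice.

```python
import math

# Heights (in inches) from the example slide.
teams = {
    "Team I": [72, 73, 76, 76, 78],
    "Team II": [67, 72, 76, 76, 84],
}


def describe(values):
    """Return (mean, variance, standard deviation, CV).

    Variance here is the mean squared deviation (divide by the number
    of values), as in the Team I example on the slides.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    std_dev = math.sqrt(variance)
    cv = std_dev / mean
    return mean, variance, std_dev, cv


results = {name: describe(values) for name, values in teams.items()}

for name, (mean, variance, std_dev, cv) in results.items():
    print(f"{name}: mean = {mean:.1f} inch, variance = {variance:.2f} inch^2, "
          f"sd = {std_dev:.3f} inch, CV = {cv:.4f}")
```

Both teams have the same mean, so every one of the three measures ranks Team II as the more variable team.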
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Why are there three similar terms?
– In the above example, variance has unit inch^2
– But standard deviation has unit inch, the unit of
the original data. So, standard deviation may
sometimes be preferred over variance.
– Coefficient of variation is dimensionless. Hence,
coefficient of variation is a useful quantity for
comparing the variability in data sets having
different standard deviations and different means.
12
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Interpretation of standard deviation
– It’s difficult to interpret
– A higher standard deviation implies a greater
variability
– Standard deviation is widely used to approximate
the proportion of measurements that fall into
various intervals of values. This is especially true if
the data have a bell-shaped distribution.
13
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Interpretation of standard deviation
– An empirical rule states that if the data have a bell-shaped distribution,
• approximately 68% of the measurements fall within one
standard deviation of the mean, i.e., between
(mean - standard deviation) and (mean + standard
deviation)
• approximately 95% of the measurements fall within two
standard deviations of the mean, and
• virtually all the measurements fall within three
standard deviations of the mean
14
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
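[Figure: bell-shaped curve centered at the mean, with the horizontal axis marked from -3 to +3 standard deviations. Area within ±1 standard deviation: 68.26%; within ±2: 95.44%; within ±3: 99.74%.]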
15
MEASURES OF VARIABILITY
VARIANCE, STANDARD DEVIATION, CV
• Interpretation of standard deviation
– Example: suppose that the final marks have a bell-shaped distribution, with a mean of 75 and a standard
deviation of 7. Then,
• approximately 68% of the marks fall between (75-7)=68
and (75+7)=82,
• approximately 95% of the marks fall between (75-2×7)=61
and (75+2×7)=89, and
• virtually all the marks fall between (75-3×7)=54
and (75+3×7)=96
16
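The intervals in the marks example can be reproduced in Python. The sketch below also computes the share of values within k standard deviations, under the added assumption that "bell-shaped" means exactly normal; in that case the share is erf(k/√2). The function name `coverage_within` is illustrative, and the last digit may differ slightly from percentages read off a rounded table.

```python
import math

# Values from the final marks example.
mean, sd = 75, 7


def coverage_within(k):
    """Share of a normal distribution within k standard deviations of the mean.

    Assumption: the bell-shaped data are treated as exactly normal.
    """
    return math.erf(k / math.sqrt(2))


intervals = {}
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    intervals[k] = (low, high)
    print(f"within {k} sd: {low} to {high}, about {coverage_within(k):.2%} of marks")
```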
POPULATION VARIANCE
• The population variance is the mean squared
deviation from the population mean:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• Where $\sigma^2$ stands for the population variance
• $\mu$ is the population mean
• $N$ is the total number of values in the population
• $X_i$ is the value of the i-th observation
• $\sum$ represents a summation
17
SAMPLE VARIANCE
• The sample variance is defined as follows:
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
• Where $s^2$ stands for the sample variance
• $\bar{X}$ is the sample mean
• $n$ is the total number of values in the sample
• $X_i$ is the value of the i-th observation
• $\sum$ represents a summation
18
SAMPLE VARIANCE
• Notice that the sample variance is defined as the sum
of the squared deviations divided by n-1.
• Sample variance is computed to estimate the
population variance.
• An unbiased estimate of the population variance may
be obtained by defining the sample variance as the
sum of the squared deviations divided by n-1 rather
than by n.
• Defining sample variance as the mean squared
deviation from the sample mean tends to
underestimate the population variance.
19
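The effect of dividing by n-1 rather than n can be seen exactly on a tiny case. The sketch below uses a hypothetical three-value population (chosen only for illustration, not taken from the lesson) and lists every equally likely sample of size 2 drawn with replacement. Averaged over all samples, the n-1 version matches the population variance, while the n version falls short.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, chosen only for illustration.
population = [2, 4, 6]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


n = 2
# All 9 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("Population variance:          ", sigma2)
print("Average of SS/(n-1) estimates:", avg_div_n_minus_1)
print("Average of SS/n estimates:    ", avg_div_n)
```

Exact fractions are used so that the equality is not blurred by floating-point rounding.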
SAMPLE VARIANCE
• A sample of monthly advertising expenses (in 000$)
is taken. The data for five months are as follows: 2.5,
1.3, 1.4, 1.0 and 2.0. Compute the sample variance.
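$\bar{X} = \frac{2.5 + 1.3 + 1.4 + 1.0 + 2.0}{5} = \_\_\_\_$
$s^2 = \frac{(2.5 - \_\_)^2 + (1.3 - \_\_)^2 + (1.4 - \_\_)^2 + (1.0 - \_\_)^2 + (2.0 - \_\_)^2}{5 - 1}$
$= \frac{(\_\_)^2 + (\_\_)^2 + (\_\_)^2 + (\_\_)^2 + (\_\_)^2}{4} = \frac{\_\_\_\_}{4} = \_\_\_\_$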
20
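The steps of this computation can be sketched in Python, with the standard library's `statistics.variance` (which also divides by n-1) as a cross-check. Variable names are illustrative.

```python
import math
import statistics

# Monthly advertising expenses from the example.
expenses = [2.5, 1.3, 1.4, 1.0, 2.0]

n = len(expenses)
x_bar = sum(expenses) / n                          # sample mean
squared_devs = [(x - x_bar) ** 2 for x in expenses]
s2 = sum(squared_devs) / (n - 1)                   # sample variance

print(f"Sample mean:     {x_bar:.2f}")
print(f"Sample variance: {s2:.4f}")

# Cross-check against the standard library (also divides by n - 1).
print("Matches statistics.variance:",
      math.isclose(s2, statistics.variance(expenses)))
```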
SAMPLE VARIANCE
• An alternate formula for the sample variance:
$s^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}$
• Where $s^2$ stands for the sample variance
• $\bar{X}$ is the sample mean
• $n$ is the total number of values in the sample
• $X_i$ is the value of the i-th observation
• $\sum$ represents a summation
21
SAMPLE VARIANCE
• A sample of monthly sales expenses (in 000 units) is
taken. The data for five months are as follows: 264,
116, 165, 101 and 209. Compute the sample
variance using the alternate formula.
$\bar{X} = \frac{264 + 116 + 165 + 101 + 209}{5} = \_\_\_\_$
$s^2 = \frac{(264^2 + 116^2 + 165^2 + 101^2 + 209^2) - 5(\_\_\_\_)^2}{5 - 1} = \frac{164259 - \_\_\_\_}{4} = \frac{\_\_\_\_}{4} = \_\_\_\_$
22
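A short sketch can confirm that the alternate formula and the defining formula agree on the sales data. Variable names are illustrative.

```python
import math

# Monthly sales data from the example.
sales = [264, 116, 165, 101, 209]

n = len(sales)
x_bar = sum(sales) / n
sum_of_squares = sum(x * x for x in sales)

# Alternate formula: (sum of X_i^2 - n * X_bar^2) / (n - 1)
s2_alt = (sum_of_squares - n * x_bar ** 2) / (n - 1)

# Defining formula: sum of squared deviations / (n - 1)
s2_def = sum((x - x_bar) ** 2 for x in sales) / (n - 1)

print("Sample mean:            ", x_bar)
print("Sum of squares:         ", sum_of_squares)
print("Variance (alternate):   ", s2_alt)
print("Variance (definition):  ", s2_def)
```

The alternate formula avoids computing each deviation, which is convenient by hand; the two results should be equal.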
POPULATION/SAMPLE STANDARD DEVIATION
• The standard deviation is the positive square root of
the variance:
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
23
POPULATION/SAMPLE STANDARD DEVIATION
• Compute the sample standard deviation of
advertising data: 2.5, 1.3, 1.4, 1.0 and 2.0
$s = \sqrt{s^2} = \_\_\_\_$
• Compute the sample standard deviation of sales
data: 264, 116, 165, 101 and 209
$s = \sqrt{s^2} = \_\_\_\_$
24
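Both standard deviations can be obtained with the standard library. In this sketch, `statistics.stdev` returns the sample standard deviation, i.e., the square root of the variance with divisor n-1.

```python
import statistics

advertising = [2.5, 1.3, 1.4, 1.0, 2.0]
sales = [264, 116, 165, 101, 209]

# Sample standard deviation = square root of the sample variance.
s_advertising = statistics.stdev(advertising)
s_sales = statistics.stdev(sales)

print(f"Advertising: s = {s_advertising:.4f}")
print(f"Sales:       s = {s_sales:.2f}")
```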
POPULATION/SAMPLE CV
• The coefficient of variation is the standard deviation
divided by the mean
Population coefficient of variation: $V = \frac{\sigma}{\mu}$
Sample coefficient of variation: $v = \frac{s}{\bar{X}}$
25
POPULATION/SAMPLE CV
• Compute the sample coefficient of variation of
advertising data: 2.5, 1.3, 1.4, 1.0 and 2.0
$v = \frac{s}{\bar{X}} = \_\_\_\_$
• Compute the sample coefficient of variation of sales
data: 264, 116, 165, 101 and 209
$v = \frac{s}{\bar{X}} = \_\_\_\_$
26
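Because the two data sets are on very different scales, the coefficient of variation is the natural way to compare them. A minimal sketch follows; the helper name `sample_cv` is illustrative.

```python
import statistics

advertising = [2.5, 1.3, 1.4, 1.0, 2.0]
sales = [264, 116, 165, 101, 209]


def sample_cv(values):
    """Sample coefficient of variation: v = s / X_bar."""
    return statistics.stdev(values) / statistics.mean(values)


cv_advertising = sample_cv(advertising)
cv_sales = sample_cv(sales)

print(f"Advertising: v = {cv_advertising:.4f}")
print(f"Sales:       v = {cv_sales:.4f}")
```

Although the standard deviations differ by a factor of more than 100, the two coefficients of variation come out at a similar size.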
SAMPLE VARIANCE
APPROXIMATED FROM GROUPED DATA
• Sample variance from grouped data:
$s^2 = \frac{\sum f_k X_k^2 - n\bar{X}^2}{n - 1}$
• Where $s^2$ stands for the sample variance
• $\bar{X}$ is the sample mean
• $n$ is the total number of observations $= \sum f_k$
• $X_k$ is the midpoint of the k-th class
• $f_k$ is the frequency of the k-th class
• $\sum$ represents a summation over all classes
27
SAMPLE VARIANCE
APPROXIMATED FROM GROUPED DATA
• Compute the sample variance of days to maturity of
40 investments from the following grouped data:

Days to Maturity    Number of Investments
30-40               3
40-50               1
50-60               8
60-70               10
70-80               7
80-90               7
90-100              4

$\bar{X} = 68.5$ (see Lesson 3, Slides 6 - 7)
$\sum f_k X_k^2 = 3(35)^2 + 1(45)^2 + 8(55)^2 + 10(65)^2 + 7(75)^2 + 7(85)^2 + 4(95)^2 = \_\_\_\_$
$s^2 = \frac{\_\_\_\_ - 40(68.5)^2}{40 - 1} = \frac{\_\_\_\_}{39} = \_\_\_\_$
28
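The grouped-data approximation can be sketched as follows, using the class midpoints and frequencies from the table and n equal to the total frequency (40). Variable names are illustrative.

```python
# (class lower limit, class upper limit, frequency) from the table.
classes = [
    (30, 40, 3),
    (40, 50, 1),
    (50, 60, 8),
    (60, 70, 10),
    (70, 80, 7),
    (80, 90, 7),
    (90, 100, 4),
]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)                                              # total observations
x_bar = sum(f * x for f, x in zip(freqs, midpoints)) / n    # grouped mean
sum_f_x2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))

# Grouped-data formula: (sum f_k X_k^2 - n * X_bar^2) / (n - 1)
s2 = (sum_f_x2 - n * x_bar ** 2) / (n - 1)

print("n =", n)
print("Grouped mean =", x_bar)
print("Sum of f_k * X_k^2 =", sum_f_x2)
print(f"Approximate sample variance = {s2:.2f}")
```

The result is an approximation because every observation in a class is treated as if it sat at the class midpoint.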
SAMPLE COEFFICIENT OF SKEWNESS
• The sample coefficient of skewness:
$SK = \frac{3(\bar{X} - m)}{s}$
• Where $SK$ stands for the coefficient of skewness
• $s$ is the sample standard deviation
• $\bar{X}$ is the sample mean
• $m$ is the sample median
29
SAMPLE COEFFICIENT OF SKEWNESS
• Compute the sample coefficient of skewness of the
advertising data: 2.5, 1.3, 1.4, 1.0 and 2.0
Mean, $\bar{X}$ = 1.64 (see slide 20)
Sample standard deviation, s = 0.6025 (see slides 20, 24)
Median, m = ____
$SK = \frac{3(\bar{X} - m)}{s} = \frac{3(1.64 - \_\_\_\_)}{0.6025} = \_\_\_\_$
30
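The skewness coefficient for the advertising data can be computed directly from the formula SK = 3(mean - median)/s. A minimal sketch with illustrative variable names:

```python
import statistics

advertising = [2.5, 1.3, 1.4, 1.0, 2.0]

x_bar = statistics.mean(advertising)     # sample mean
m = statistics.median(advertising)       # sample median
s = statistics.stdev(advertising)        # sample standard deviation

# Coefficient of skewness as defined on the slide.
sk = 3 * (x_bar - m) / s

print(f"Mean = {x_bar:.2f}, median = {m}, s = {s:.4f}")
print(f"SK = {sk:.3f}")
print("Positively skewed" if sk > 0 else "Negatively skewed or symmetric")
```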
INTERQUARTILE RANGE AND BOX PLOTS
• The interquartile range represents the range of the middle
50% of the observations and is the difference between the third
quartile and the first.
• The interquartile range = Q.75 - Q.25
• The range and interquartile range are combined in a box
plot.
31
INTERQUARTILE RANGE AND BOX PLOTS
• A box plot is used to graphically represent the data set.
These plots involve five values:
– the minimum value, S
– the first quartile, Q.25
– the second quartile or median, Q.50
– the third quartile, Q.75
– and the maximum value, L
32
INTERQUARTILE RANGE AND BOX PLOTS
• Example: Construct a box plot with the following data which
shows the assets of the 15 largest North American banks,
rounded off to the nearest hundred million dollars: 111,
135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98
33
INTERQUARTILE RANGE AND BOX PLOTS
• Sort the data in ascending order (low to high): 51, 56,
57, 64, 65, 75, 75, 85, 93, 98, 98, 108, 111, 135, 217
• Find
S = ____
Q.25 = ____
Q.50 = ____
Q.75 = ____
L = ____
34
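The five values needed for the box plot can be sketched in Python. Quartile conventions differ between textbooks and software; this sketch assumes the (n+1)p position rule, which is what `statistics.quantiles` uses with `method="exclusive"`. Other conventions may give slightly different quartiles.

```python
import statistics

# Assets of the 15 banks from the example.
assets = [111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98]

smallest = min(assets)
largest = max(assets)

# Assumption: quartiles located at position (n + 1) * p in the sorted data.
q25, q50, q75 = statistics.quantiles(assets, n=4, method="exclusive")
iqr = q75 - q25

print("Sorted data:", sorted(assets))
print(f"S = {smallest}, Q.25 = {q25}, Q.50 = {q50}, Q.75 = {q75}, L = {largest}")
print("Interquartile range =", iqr)
```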
Box Plot
[Figure: box plot of the bank assets data. Horizontal axis: Assets (in 100 million dollars), scaled from 0 to 250.]
35
INTERQUARTILE RANGE AND BOX PLOTS
• If the median is near the center of the box, the
distribution is approximately symmetric.
• If the median falls to the left of the center of the box, the
distribution is positively skewed.
• If the median falls to the right of the center of the box, the
distribution is negatively skewed.
• If the lines are about the same length, the distribution is
approximately symmetric.
• If the line segment to the right of the box is larger than
the one to the left, the distribution is positively skewed.
• If the line segment to the left of the box is larger than the
one to the right, the distribution is negatively skewed.
36
SYMMETRIC BOX PLOT
[Figure: symmetric box plot. Horizontal axis: Number of units sold, scaled from 0 to 300.]
37
POSITIVELY SKEWED BOX PLOT
[Figure: positively skewed box plot. Horizontal axis: Number of units sold, scaled from 0 to 300.]
38
NEGATIVELY SKEWED BOX PLOT
[Figure: negatively skewed box plot. Horizontal axis: Number of units sold, scaled from 0 to 300.]
39
THE PROPORTION
• Population proportion is denoted by $\pi$
• The parameter $\pi$ is a number between 0 and 1
• Sample proportion is denoted by P
• P serves as an estimator of $\pi$ and is calculated as
follows:
$P = \frac{\text{Number of observations in category}}{\text{Sample size}}$
40
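A sample proportion is a simple count divided by the sample size. The sketch below reuses the bank assets data from the box plot example; the category "assets above 100" is a hypothetical choice made only for illustration.

```python
# Assets of the 15 banks from the box plot example.
assets = [111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98]

# Hypothetical category, for illustration only: banks with assets above 100.
in_category = [x for x in assets if x > 100]

count = len(in_category)
sample_size = len(assets)
p = count / sample_size   # sample proportion P

print(f"{count} of {sample_size} observations are in the category")
print(f"Sample proportion P = {p:.4f}")
```

As with any proportion, the result lies between 0 and 1.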
READING AND EXERCISES
Lesson 4
Reading:
Section 2-3, pp. 50-61
Exercises:
2-30, 2-37, 2-41
41