MEASURES OF DISPERSION
The measures of central tendency, such as the mean,
median and mode, do not reveal the whole picture of the
distribution of a data set.
Two data sets with the same mean may have completely
different spreads. The variation among the values of
observations for one data set may be much larger or
smaller than for the other data set.
NOTE: the words dispersion, spread and variation have the
same meaning.
MEASURES OF DISPERSION: example
Consider the following two data sets on the ages of all workers in each of two
small companies.
Company 1:  47  38  35  40  36  45  39
Company 2:  70  33  18  52  27
The mean age of workers in both these companies is the same: 40 years.
Knowing only these means, we might conclude that the workers in the two
companies have a similar age distribution. However, the variation in the
workers' ages is very different for the two companies.
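[Figure: dot plots of the workers' ages. Company 1: 35, 36, 38, 39, 40, 45, 47. Company 2: 18, 27, 33, 52, 70.]
The ages of the workers in the second company have a much larger
variation than the ages of the workers in the first company.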
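The comparison of the two companies can be checked with a short Python sketch. The variable names are illustrative, and the "spread" printed here is simply oldest minus youngest, used as a first rough indicator of variation.

```python
from statistics import mean

# Ages of all workers in each company (from the example above)
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    spread = max(ages) - min(ages)  # oldest minus youngest
    print(f"{name}: mean = {mean(ages)}, youngest = {min(ages)}, "
          f"oldest = {max(ages)}, spread = {spread}")
```

Both means come out equal, while the spread of the second company is several times that of the first.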
MEASURES OF DISPERSION
The mean, median or mode is usually not by itself a sufficient measure
to reveal the shape of a distribution of a data set. We also need a
measure that can provide some information about the variation among
data set values.
The measures that help us to know about the spread of a data set are
called measures of dispersion.
The measures of central tendency and dispersion taken together give
a better picture of a data set.
We consider 3 measures of dispersion:
1. Range
2. Variance
3. Standard Deviation
RANGE
Definition
The range is the simplest measure of dispersion. It is
obtained by taking the difference between the largest and
the smallest values in a data set:
RANGE = LARGEST VALUE – SMALLEST VALUE
RANGE: example
The following data set gives the total areas in square miles of the 4
western South-Central states of the United States.
State        Total Area (square miles)
Arkansas          53,182
Louisiana         49,651
Oklahoma          69,903
Texas            267,277
RANGE = LARGEST VALUE – SMALLEST VALUE
= 267,277 – 49,651 = 217,626 square miles
Thus, the total areas of these four states are spread over a range of
217,626 square miles.
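As a minimal sketch, the same calculation in Python. The helper name `value_range` is hypothetical. The sketch also recomputes the range without Texas, the largest value, to show how much a single extreme value contributes.

```python
# Total areas in square miles (from the table above)
areas = {
    "Arkansas": 53_182,
    "Louisiana": 49_651,
    "Oklahoma": 69_903,
    "Texas": 267_277,
}

def value_range(values):
    """Range = largest value - smallest value."""
    values = list(values)
    return max(values) - min(values)

range_all = value_range(areas.values())
range_without_texas = value_range(
    area for state, area in areas.items() if state != "Texas"
)

print("Range, all four states:", range_all)
print("Range, Texas excluded:", range_without_texas)
```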
RANGE: disadvantages
• The range, like the mean, has the disadvantage of being influenced
by outliers. Consequently, it is not a good measure of dispersion to
use for a data set containing outliers.
• The calculation of the range is based on two values only: the largest
and the smallest. All other values in a data set are ignored.
• Thus, the range is not a very satisfactory measure of dispersion and
it is, in fact, rarely used.
VARIANCE
Definition
The variance is a measure of dispersion of values based on
their deviation from the mean. The variance is defined to
be:
$$\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{n}$$ for a population
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n}$$ for a sample
VARIANCE
The difference between an observation and the mean,
$(x - \mu)$ or $(x - \bar{x})$, is called the deviation from the mean.
Consequently, the variance can also be defined as the arithmetic mean
of the squared deviations from the mean.
From the computational point of view, it is easier and more efficient to
use short-cut formulas to calculate the variance:
$$\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \mu^2 \quad \text{and} \quad s^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2$$
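The short-cut formula follows from expanding the square in the definition. The derivation below is a sketch for the sample version, using the divisor $n$ as in these slides and the fact that $\frac{1}{n}\sum_i x_i = \bar{x}$. The population version is identical with $\mu$ in place of $\bar{x}$.

```latex
\begin{aligned}
s^2 &= \frac{1}{n}\sum_i (x_i - \bar{x})^2 \\
    &= \frac{1}{n}\sum_i x_i^2 \;-\; 2\bar{x}\,\frac{1}{n}\sum_i x_i \;+\; \bar{x}^2 \\
    &= \frac{1}{n}\sum_i x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
    &= \frac{1}{n}\sum_i x_i^2 \;-\; \bar{x}^2
\end{aligned}
```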
VARIANCE: example 1
Refer to the data on 2002 total payrolls of 5 Major League Baseball
(MLB) teams.
MLB Team                  2002 Total Payroll (millions of dollars)
Anaheim Angels                  62
Atlanta Braves                  93
New York Yankees               126
St. Louis Cardinals             75
Tampa Bay Devil Rays            34
VARIANCE: example 1
We apply the short-cut formula, hence we need to compute the
squares of the observations, x².
MLB Team                    x        x²
Anaheim Angels             62      3844
Atlanta Braves             93      8649
New York Yankees          126    15,876
St. Louis Cardinals        75      5625
Tampa Bay Devil Rays       34      1156
                      ∑x = 390   ∑x² = 35150
$$\bar{x} = \frac{\sum_i x_i}{n} = \frac{390}{5} = 78$$ ($78 million)
$$s^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2 = \frac{1}{5}(35150) - 78^2 = 946$$
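A minimal Python sketch that checks this result two ways, from the definition and from the short-cut formula. Variable names are illustrative. Note that `statistics.pvariance` divides by n, as the formulas in these slides do, whereas `statistics.variance` divides by n - 1 and would give a different number.

```python
from statistics import pvariance

# 2002 total payrolls in millions of dollars (from the table above)
payrolls = [62, 93, 126, 75, 34]

n = len(payrolls)
mean = sum(payrolls) / n

# Definition: arithmetic mean of the squared deviations from the mean
var_definition = sum((x - mean) ** 2 for x in payrolls) / n

# Short-cut formula: mean of the squares minus the square of the mean
var_shortcut = sum(x * x for x in payrolls) / n - mean ** 2

print("mean:", mean)
print("variance (definition):", var_definition)
print("variance (short-cut):", var_shortcut)
print("statistics.pvariance:", pvariance(payrolls))
```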
VARIANCE: example 2
The following data are the 2002 earnings (in thousands of dollars)
before taxes for all 6 employees of a small company.
48.50   38.40   65.50   22.60   79.80   54.60
      x          x²
    48.50     2352.25
    38.40     1474.56
    65.50     4290.25
    22.60      510.76
    79.80     6368.04
    54.60     2981.16
∑x = 309.40   ∑x² = 17977.02
$$\mu = \frac{\sum_i x_i}{n} = \frac{309.40}{6} = 51.57$$ ($51.57 thousand)
$$\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \mu^2 = \frac{1}{6}(17977.02) - 51.57^2 = 336.71$$
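The value 336.71 is obtained with the mean already rounded to 51.57. The sketch below (illustrative variable names) repeats the short-cut calculation with the rounded and with the unrounded mean. Because the short-cut formula subtracts two large numbers, rounding the mean early shifts the result slightly.

```python
# 2002 earnings before taxes in thousands of dollars (from the example above)
earnings = [48.50, 38.40, 65.50, 22.60, 79.80, 54.60]

n = len(earnings)
sum_of_squares = sum(x * x for x in earnings)

mu_exact = sum(earnings) / n
mu_rounded = round(mu_exact, 2)  # 51.57, as used on the slide

var_rounded_mean = sum_of_squares / n - mu_rounded ** 2
var_exact_mean = sum_of_squares / n - mu_exact ** 2

print("variance with rounded mean:  ", round(var_rounded_mean, 2))
print("variance with unrounded mean:", round(var_exact_mean, 2))
```

With the unrounded mean the variance is about 337.05 rather than 336.71, a difference that comes only from rounding, not from the method.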
VARIANCE: frequency distribution
The formula for variance changes slightly if observations are grouped
into a frequency table. Each squared deviation is multiplied by the
frequency n_i of its value, and then these products are summed.
$$\sigma^2 = \frac{\sum_i (x_i - \mu)^2 n_i}{n}$$ for a population
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2 n_i}{n}$$ for a sample
The short-cut formulas become:
$$\sigma^2 = \frac{1}{n}\sum_i x_i^2 n_i - \mu^2 \quad \text{and} \quad s^2 = \frac{1}{n}\sum_i x_i^2 n_i - \bar{x}^2$$
VARIANCE: example 3
Vehicles Owned (x_i)   Number of Households (n_i)   x_i·n_i   x_i²   x_i²·n_i
        0                        2                      0        0        0
        1                       18                     18        1       18
        2                       11                     22        4       44
        3                        4                     12        9       36
        4                        3                     12       16       48
        5                        2                     10       25       50
       Sum                      40                     74               196
$$\bar{x} = \frac{\sum_i x_i n_i}{n} = \frac{74}{40} = 1.85$$
$$s^2 = \frac{1}{n}\sum_i x_i^2 n_i - \bar{x}^2 = \frac{1}{40}(196) - 1.85^2 = 1.48$$
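The frequency-table version of the short-cut formula can be sketched as a small Python function. The name `grouped_mean_and_variance` is hypothetical, and it divides by n as these slides do.

```python
def grouped_mean_and_variance(values, frequencies):
    """Mean and variance (divisor n) of data given as a frequency table."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    mean_of_squares = sum(x * x * f for x, f in zip(values, frequencies)) / n
    return mean, mean_of_squares - mean ** 2

# Vehicles owned (x_i) and number of households (n_i) from the table above
vehicles = [0, 1, 2, 3, 4, 5]
households = [2, 18, 11, 4, 3, 2]

mean_vehicles, var_vehicles = grouped_mean_and_variance(vehicles, households)
print("mean:", round(mean_vehicles, 2))
print("variance:", round(var_vehicles, 2))
```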
Variance: frequency distribution with classes
Again, when the data set is organized in a frequency distribution with
classes, we are approximating the data set by "rounding" each
value in a given class to the class midpoint. Thus, the variance of a
frequency distribution is given by
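$$\sigma^2 = \frac{\sum_i (m_i - \mu)^2 n_i}{n}$$ for a population
$$s^2 = \frac{\sum_i (m_i - \bar{x})^2 n_i}{n}$$ for a sample
Short-cut formulas
$$\sigma^2 = \frac{1}{n}\sum_i m_i^2 n_i - \mu^2$$ for a population
$$s^2 = \frac{1}{n}\sum_i m_i^2 n_i - \bar{x}^2$$ for a sample
where m_i is the midpoint of each class interval.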
Variance: example 4
The following table gives the frequency distribution of the number
of orders received each day during the past 50 days at the office
of a mail-order company.
Number of Orders   Number of Days (n)    m     m²     m·n    m²·n
    10 – 12                 4           11    121      44     484
    13 – 15                12           14    196     168    2352
    16 – 18                20           17    289     340    5780
    19 – 21                14           20    400     280    5600
                       n = 50                  ∑m·n = 832   ∑m²·n = 14216
$$\bar{x} = \frac{\sum_i m_i n_i}{n} = \frac{832}{50} = 16.64$$ orders.
$$s^2 = \frac{1}{n}\sum_i m_i^2 n_i - \bar{x}^2 = \frac{1}{50}(14216) - 16.64^2 = 7.43$$
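A minimal sketch of the same calculation in Python, with illustrative variable names. Each class is represented by its midpoint, so the result is an approximation of the variance of the underlying daily counts, as the slides point out.

```python
# Class limits (number of orders) and number of days in each class
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
days = [4, 12, 20, 14]

# Every value in a class is "rounded" to the class midpoint
midpoints = [(low + high) / 2 for low, high in classes]

n = sum(days)
mean_orders = sum(m * f for m, f in zip(midpoints, days)) / n
var_orders = sum(m * m * f for m, f in zip(midpoints, days)) / n - mean_orders ** 2

print("midpoints:", midpoints)
print("mean:", round(mean_orders, 2))
print("variance:", round(var_orders, 2))
```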
STANDARD DEVIATION
Definition
The standard deviation is the positive square root of the
variance.
$$\sigma = \sqrt{\sigma^2}$$ for a population
$$s = \sqrt{s^2}$$ for a sample
STANDARD DEVIATION
The standard deviation is the most widely used measure of
dispersion.
The value of the standard deviation tells how closely the
values of a data set are clustered around the mean.
In general, a small value of the standard deviation for a
data set indicates that the values of that data set are
spread over a relatively small range around the mean.
In contrast, a large value of the standard deviation for a
data set indicates that the values of that data set are
spread over a relatively large range around the mean.
STANDARD DEVIATION: example 1
MLB Team                2002 Total Payroll (millions of dollars)
                            x        x²
Anaheim Angels             62      3844
Atlanta Braves             93      8649
New York Yankees          126    15,876
St. Louis Cardinals        75      5625
Tampa Bay Devil Rays       34      1156
                      ∑x = 390   ∑x² = 35150
$$\bar{x} = \frac{390}{5} = 78$$ ($78 million)
$$s^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2 = 946$$
$$s = \sqrt{946} = 30.76$$ ($30.76 million)
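As a quick check, the standard deviation can be computed in Python either as the square root of the variance or with `statistics.pstdev`, which uses the divisor n like these slides. Variable names are illustrative.

```python
import math
from statistics import pstdev

# 2002 total payrolls in millions of dollars
payrolls = [62, 93, 126, 75, 34]

n = len(payrolls)
mean = sum(payrolls) / n
variance = sum(x * x for x in payrolls) / n - mean ** 2  # short-cut formula

std_dev = math.sqrt(variance)  # positive square root of the variance

print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
print("statistics.pstdev:", round(pstdev(payrolls), 2))
```

Unlike the variance, the standard deviation is expressed in the same unit as the data, here millions of dollars.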
STANDARD DEVIATION: example 2
Earnings (thousands of dollars)
      x          x²
    48.50     2352.25
    38.40     1474.56
    65.50     4290.25
    22.60      510.76
    79.80     6368.04
    54.60     2981.16
∑x = 309.40   ∑x² = 17977.02
$$\mu = \frac{309.40}{6} = 51.57$$ ($51.57 thousand)
$$\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \mu^2 = 336.71$$
$$\sigma = \sqrt{336.71} = 18.35$$ ($18.35 thousand)
Variance and Standard Deviation: observations
The values of the variance and the standard deviation are
never negative. That is, the numerator in the formula for
the variance should never produce a negative value.
Usually the values of the variance and standard deviation
are positive, but if a data set has no variation, then the
variance and standard deviation are both zero.
Example: 4 persons in a group are the same age – say 35
years. If we calculate the variance and the standard
deviation, their values are zero.
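A two-line check of this example in Python, using the population functions of the standard `statistics` module (divisor n):

```python
from statistics import pstdev, pvariance

ages = [35, 35, 35, 35]  # four persons, all the same age

print("variance:", pvariance(ages))
print("standard deviation:", pstdev(ages))
```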
CONTINGENCY TABLES AND ELEMENTS OF PROBABILITY
CONTINGENCY TABLES
In many applications the interest is focused on the joint
analysis of two variables (qualitative and/or quantitative)
with the aim of evaluating the relation between them.
The variables are usually presented as a contingency
table (or two-way classification table).
Whereas a frequency distribution provides the distribution
of one variable, a contingency table describes the
distribution of two or more variables simultaneously.
CONTINGENCY TABLES
All 420 employees of a company were asked if they
are smokers or nonsmokers and whether or not
they are college graduates.
                 College Graduate    Not a College Graduate
Smoker                  35                    80
Nonsmoker              130                   175
Each entry of the table is called a cell. For example, 80 is the joint
frequency of category “Smoker” of X and “Not a College Graduate” of Y.
The table gives the distribution of the 420 employees
based on two variables or characters:
X = smoking status (smoker or nonsmoker) and Y = college graduation (graduate or not).
CONTINGENCY TABLES: marginal distributions
X \ Y            College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
Right-hand column (Total): marginal distribution of X
Bottom row (Total): marginal distribution of Y
420: grand total
The right-hand column and the bottom row are called the marginal
distribution of X and the marginal distribution of Y, respectively.
CONTINGENCY TABLES
Marginal distribution X
X                          Total
Smoker                      115
Nonsmoker                   305
                            420
Marginal distribution Y
Y                          Total
College graduate            165
Not a College graduate      255
                            420
$$n_{10} = 115, \qquad f_{10} = \frac{n_{10}}{n} = \frac{115}{420} = 0.27, \qquad p_{10} = f_{10} \times 100 = 27\%$$
$$n_{01} = 165, \qquad f_{01} = \frac{n_{01}}{n} = \frac{165}{420} = 0.39, \qquad p_{01} = f_{01} \times 100 = 39\%$$
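The marginal distributions can be obtained from the joint frequencies by summing over rows and columns. The sketch below stores the table as a nested dictionary, which is an illustrative choice rather than anything prescribed by the slides.

```python
# Joint frequencies: rows are categories of X, columns are categories of Y
table = {
    "Smoker": {"College Graduate": 35, "Not a College Graduate": 80},
    "Nonsmoker": {"College Graduate": 130, "Not a College Graduate": 175},
}

# Marginal distribution of X: row totals
row_totals = {x: sum(columns.values()) for x, columns in table.items()}

# Marginal distribution of Y: column totals
column_totals = {}
for columns in table.values():
    for y, count in columns.items():
        column_totals[y] = column_totals.get(y, 0) + count

n = sum(row_totals.values())  # grand total

# Marginal relative frequencies
marginal_x = {x: total / n for x, total in row_totals.items()}
marginal_y = {y: total / n for y, total in column_totals.items()}

print(row_totals, column_totals, n)
print({x: round(f, 2) for x, f in marginal_x.items()})
print({y: round(f, 2) for y, f in marginal_y.items()})
```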
CONTINGENCY TABLES: conditional distributions
Conditional distribution of X given the category “College Graduate” of Y
X                College Graduate
Smoker                  35
Nonsmoker              130
Total                  165
$$n_{11} = 35, \qquad f_{11|2} = \frac{35}{165} = 0.21, \qquad p_{11|2} = 21\%$$
Conditional distribution of Y given the category “Smoker” of X
Y                          Smoker
College graduate             35
Not a College graduate       80
Total                       115
$$n_{11} = 35, \qquad f_{11|1} = \frac{35}{115} = 0.30, \qquad p_{11|1} = 30\%$$
NOTE
$$f_{11|1} = \frac{n_{11}}{n_{10}}, \qquad f_{11} = \frac{n_{11}}{n}, \qquad f_{11|2} = \frac{n_{11}}{n_{01}}$$
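A conditional distribution is a single row or column of the table divided by its own total. A minimal sketch follows; the function names `conditional_on_row` and `conditional_on_column` are hypothetical.

```python
# Joint frequencies: rows are categories of X, columns are categories of Y
table = {
    "Smoker": {"College Graduate": 35, "Not a College Graduate": 80},
    "Nonsmoker": {"College Graduate": 130, "Not a College Graduate": 175},
}

def conditional_on_row(table, x):
    """Distribution of Y within one category x of X."""
    total = sum(table[x].values())
    return {y: count / total for y, count in table[x].items()}

def conditional_on_column(table, y):
    """Distribution of X within one category y of Y."""
    total = sum(columns[y] for columns in table.values())
    return {x: columns[y] / total for x, columns in table.items()}

x_given_graduate = conditional_on_column(table, "College Graduate")
y_given_smoker = conditional_on_row(table, "Smoker")

print({x: round(f, 2) for x, f in x_given_graduate.items()})
print({y: round(f, 2) for y, f in y_given_smoker.items()})
```

The same cell count, 35, gives two different relative frequencies depending on whether it is divided by the column total (165) or the row total (115).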
Definition of probability
There are three different definitions of probability:
the classical definition, the frequentist definition,
and the subjective (Bayesian) definition.
Frequentist definition of probability:
The relative frequency associated with a category of a
variable (an event) can be interpreted as an
approximation of the probability of that event.
Definition of probability
Example: Ten of the 500 randomly selected cars manufactured at a certain
auto factory are found to be lemons. Assuming that the lemons are
manufactured randomly, what is the probability that the next car
manufactured at this auto factory is a lemon?
Car (x_i)     n_i       Relative frequency (f_i)
Good          490       490/500 = .98
Lemon          10       10/500 = .02
            n = 500     Sum = 1.00
$$P(\text{next car is a lemon}) \approx f_i = \frac{n_i}{n} = \frac{10}{500} = .02$$
NOTE: The relative frequency is an approximation of the probability!
Relative frequencies and probabilities get closer as the number of cars
increases.
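The note above can be illustrated with a small simulation. This is a sketch under a stated assumption: the true probability of a lemon is set to 0.02 purely for illustration (in practice it is unknown), and the function name and seed are arbitrary.

```python
import random

def lemon_relative_frequency(n_cars, p_lemon=0.02, seed=0):
    """Simulate n_cars cars and return the relative frequency of lemons.

    p_lemon is an assumed 'true' probability, chosen only for this illustration.
    """
    rng = random.Random(seed)
    lemons = sum(rng.random() < p_lemon for _ in range(n_cars))
    return lemons / n_cars

for n_cars in (500, 5_000, 50_000, 200_000):
    print(n_cars, lemon_relative_frequency(n_cars))
```

For small samples the relative frequency can differ noticeably from the assumed probability. For large samples it typically settles close to it.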
Marginal Probability
Coming back to the example of the 420 employees, suppose that
one employee is selected at random from the 420 employees. The
employee may be classified on the basis of smoking status alone or
college graduation alone. The employee can be a “smoker”, a
“nonsmoker”, a “graduate” or a “nongraduate”.
The probability of each of these characteristics is called a marginal probability.
                 College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
Marginal Probability
Marginal (Simple) Probability: is the probability (relative
frequency) computed on the marginal distributions:
                 College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
$$P(\text{Smoker}) = f_{10} = \frac{n_{10}}{n} = \frac{115}{420} = 0.27$$
$$P(\text{Nonsmoker}) = f_{20} = \frac{n_{20}}{n} = \frac{305}{420} = 0.73$$
$$P(\text{Graduate}) = f_{01} = \frac{n_{01}}{n} = \frac{165}{420} = 0.39$$
$$P(\text{NonGraduate}) = f_{02} = \frac{n_{02}}{n} = \frac{255}{420} = 0.61$$
Joint Probability
Suppose that one employee is selected at random from these 420. What
is the probability that the employee is a smoker and a College
graduate?
                 College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
It is written as $P(\text{Smoker} \cap \text{College Graduate})$.
The symbol $\cap$ is read as “and”.
Joint Probability
Joint Probability: is the probability (relative frequency) computed on
the joint distributions
                 College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
$$P(\text{Smoker} \cap \text{College Graduate}) = \frac{n_{11}}{n} = \frac{35}{420} = 0.08$$
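A minimal sketch of the joint probabilities for all four cells, with illustrative names. Each cell count is divided by the grand total, so the four joint probabilities add up to 1.

```python
# Joint frequencies keyed by (category of X, category of Y)
counts = {
    ("Smoker", "College Graduate"): 35,
    ("Smoker", "Not a College Graduate"): 80,
    ("Nonsmoker", "College Graduate"): 130,
    ("Nonsmoker", "Not a College Graduate"): 175,
}

n = sum(counts.values())  # grand total
joint = {cell: count / n for cell, count in counts.items()}

for (x, y), p in joint.items():
    print(f"P({x} and {y}) = {p:.2f}")
```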
Conditional Probability
Now suppose that one employee is selected at random from these
420. Assume that it is known that the employee is a Smoker. What is the
probability that the employee selected is a College Graduate?
                 College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
It is written as P(Graduate | Smoker).
It is read as “the probability that the employee is a College Graduate
given that the employee is a Smoker”.
Conditional Probability
Conditional Probability: is the probability (relative frequency)
computed on the conditional distributions:
                 College Graduate    Not a College Graduate    Total
Smoker                  35                    80                 115
Nonsmoker              130                   175                 305
Total                  165                   255                 420
$$P(\text{Graduate} \mid \text{Smoker}) = \frac{n_{11}}{n_{10}} = \frac{35}{115} = 0.30$$
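The conditional probability can also be read as the joint probability divided by the marginal probability of the conditioning event, because the grand total n cancels. The sketch below checks this with exact fractions. The variable names are illustrative, and the relation P(A | B) = P(A and B) / P(B) is the standard one rather than a formula stated on these slides.

```python
from fractions import Fraction

n11 = 35   # smokers who are college graduates
n10 = 115  # all smokers
n = 420    # all employees

p_joint = Fraction(n11, n)     # P(Smoker and Graduate)
p_smoker = Fraction(n10, n)    # P(Smoker)

p_graduate_given_smoker = p_joint / p_smoker

print(p_graduate_given_smoker, "=", round(float(p_graduate_given_smoker), 2))
```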