Section 2.4
Numerical Measures of Central Tendency
2.4.1 Definitions
Mean: The Mean of a quantitative dataset is the sum of
the observations in the dataset divided by the number of
observations in the dataset.
Median: The Median (m) of a quantitative dataset is
the middle number when the observations are arranged
in ascending order.
Mode: The Mode of a dataset is the observation that
occurs most frequently in the dataset.
2.4.2 How to calculate these
Mean: There are two means, the Population Mean μ
and the Sample Mean x̄. Both are calculated in the
same way, except that μ is calculated for the entire
population while x̄ is calculated for a sample taken
from that population.
We will work with x̄ from now on, since in practice we
never calculate μ; after all, estimating μ rather than
calculating it is the whole point of inferential statistics.
Dataset: x1, x2, x3, x4, x5, ..., xn
so there are n observations in this dataset.

Sample Mean:

    x̄ = (Σ xi)/n, where the sum runs over i = 1, ..., n
Median: Arrange the n observations in order from
smallest to largest, then:
if n is odd, the median (m) is the middle number,
if n is even, the median is the mean of the middle two
numbers
Given a histogram the median is the point on the X-axis
such that half the area under the histogram lies to the
left of the median and half lies to the right.
An example of finding the median from a histogram
with Class Intervals is shown in Example D below.

[Figure: histogram with the median marked on the x-axis,
50% of the area lying on each side]
Mode: If given a dataset, the mode is easily chosen as
the value with the highest relative frequency. If given a
relative frequency distribution with class intervals then
the mode is chosen to be the mid point of the class
interval which has the highest relative frequency.
This class interval which has the highest relative
frequency is called the Modal Class.
The mode measures data concentration and so can be
used to locate the region in a large dataset where much
of the data is concentrated.
NOTE: unlike the mean and median the mode must be
an element of the original dataset.
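These three definitions map directly onto Python's standard statistics module; a minimal sketch using a small illustrative dataset:

```python
import statistics

data = [5, 3, 8, 5, 6]  # small illustrative dataset

mean = statistics.mean(data)      # sum of the observations divided by n
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)  # → 5.4 5 5
```

Note that the mean (5.4) is not an element of the dataset, while the mode (5) must be.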
2.4.3 Example: Calculate the Mean, Median and Mode
for the following datasets:
Example A:
Dataset: 5, 3, 8, 5, 6
    x̄ = (5 + 3 + 8 + 5 + 6)/5 = 27/5 = 5.4
Mode = 5
Median: 3, 5, 5, 6, 8 so m = 5
Note: 5.4 is not one of the original values in the dataset
B: 11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165
n = 12
    x̄ = (Σ xi)/n = 1046/12 ≈ 87.17
Median: 11, 14, 23, 45, 56, 78, 93, 98, 123, 140, 165,
200
m = (78 + 93)/2 = 85.5
C: Generate a dataset containing 9 numbers using the
Day, Month and Year of your birth and that of the
people sitting to your left and right, i.e. DD/MM/YY.
D:

Class Interval    Frequency
2 -< 4            3
4 -< 6            18
6 -< 8            9
8 -< 10           7

Modal Class is 4 -< 6, as its frequency of 18 is the
highest; the mode is taken to be the midpoint of this
interval, so mode = 5.
Mean (using the class midpoints 3, 5, 7, 9) =
(3×3 + 5×18 + 7×9 + 9×7)/(3 + 18 + 9 + 7) = 225/37 ≈ 6.08
Median: There are 37 observations in this dataset so the
median is the 19th observation. There are 3
observations in the first Class Interval 2 -<4 and as
19 - 3 =16 we need to find the 16th observation in the
Class Interval 4 -< 6.
Assuming the observations are distributed uniformly
within each Class Interval we find that the 16th
observation in the second interval should lie 16/18 =
0.89 of the way between 4 and 6.
The distance between 4 and 6 is 2 units, 2*.89 = 1.78,
and so we find:
median (m) = 4 + 1.78 = 5.78
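The interpolation above can be written as a small function; a sketch assuming, as in the example, that observations are uniform within each class and that n is odd (the function name and interval representation are my own):

```python
def grouped_median(intervals):
    """Median from class intervals given as (lower, upper, frequency)
    tuples, in order. Assumes observations are spread uniformly within
    each interval and that n is odd, so the median is the
    ((n + 1)/2)-th ordered observation."""
    n = sum(freq for _, _, freq in intervals)
    target = (n + 1) // 2            # e.g. the 19th of 37 observations
    cumulative = 0
    for lower, upper, freq in intervals:
        if cumulative + freq >= target:
            # the target-th observation falls in this interval:
            # interpolate within it
            return lower + (upper - lower) * (target - cumulative) / freq
        cumulative += freq

intervals = [(2, 4, 3), (4, 6, 18), (6, 8, 9), (8, 10, 7)]
print(round(grouped_median(intervals), 2))  # → 5.78
```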
2.4.4 Mean vs Median vs Mode
- which measures the centre best?
Choosing which of these three measures to use in
practice can sometimes seem like a difficult task.
However if we understand a little about the relative
merits of each we should at least be able to make an
informed decision.
If the distribution is symmetric then
Mean = Median
If the distribution is Positively Skewed (to the right)
then
Mean > Median
If the distribution is Negatively Skewed (to the left)
then
Median > Mean
So the difference between the mean and median can be
used to measure the skewness of a dataset.
[Diagrams: symmetric, positively skewed and negatively skewed distributions]
Note: The presence of outliers affects the mean but not
the median. This can be seen from the diagrams and
from the following example:
2.4.5 Example
Ten statistics graduates who are now working as
statisticians are surveyed for their annual salary. The
survey produced the following dataset:
£60,000 £20,000 £19,000 £22,000 £21,500
£21,000 £18,000 £16,000 £17,500 £20,000
Calculate the Mode, Median and Mean:
Mode = £20,000
Median = £20,000
Mean = £23,500
Notice that the distribution is positively skewed, the
presence of the one high earner has affected the Mean
causing it to be £1,500 higher than the highest of all the
salaries excluding £60,000. For this dataset the Mean is
therefore not a good measure of the centre of the
dataset.
Notice also that the median would be unaffected if the
£60,000 was changed to a value like £23,000 which is
more in line with the rest of the data.
Because of this sensitivity of the mean to outliers, and
because the median is completely insensitive to outliers,
a revised version of the mean, called the trimmed mean,
is sometimes used.
7
2.4.6 Definition: Trimmed Mean
NOTE: This definition is NOT in the textbook
A trimmed mean is computed by first ordering the data
values from smallest to largest, then deleting a selected
number of values from each end of the ordered list, and
finally averaging the remaining values.
The trimming percentage is the percentage of values
deleted from EACH end of the ordered list.
So if a dataset contained 10 observations and we
wanted to find a 20% trimmed mean we would delete 2
observations from the top of the ordered dataset and 2
from the bottom leaving 6 remaining values. The mean
is then calculated for these 6 remaining values and this
is the 20% Trimmed Mean.
Example: Compute a 10% trimmed mean for the
dataset in Example 2.4.5, compare with previous
measures.
There are 10 observations in the dataset, 10% of 10 is 1
so we delete the largest and smallest observations
ie the values £60,000 and £16,000 are deleted.
The mean of the remaining values is then calculated:
10% Trimmed Mean =
(£17,500 + £18,000 + £19,000 + £20,000+ £20,000 +
£21,000 + £21,500 + £22,000)/8 = £19,875
This is very similar to the median and mode for this data.
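A sketch of the trimmed-mean calculation in Python (the function name is my own), checked against the salary data above:

```python
def trimmed_mean(data, trim_percent):
    """Delete trim_percent% of the values from EACH end of the
    ordered data, then average what remains."""
    ordered = sorted(data)
    k = int(len(ordered) * trim_percent / 100)  # values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [60000, 20000, 19000, 22000, 21500,
            21000, 18000, 16000, 17500, 20000]
print(trimmed_mean(salaries, 10))  # → 19875.0
```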
2.4.7 Some more Examples
Sometimes we are presented not with a dataset but with
a Histogram or a Stem and Leaf Diagram. It is still
possible to measure the centre of the dataset from these
graphs.
[Figures: MPG Histogram and Stem-and-Leaf Diagram]
2.4.8 Example
Measurements were taken of the pulses of a number of
UCD students; the observations are listed below. Find
the median and mode of this dataset.
What is the best way to present this data so that the
median and mode can be calculated more easily?
2.4.9 Examples
Would you expect the datasets described below to
possess relative frequency distributions which are
symmetric, skewed to the right or skewed to the left?
A. The salaries of people employed by UCD
B. The grades on an easy exam
C. The grades on a difficult exam
D. The amount of time spent by students in a difficult 3
hour exam.
E. The amount of time students in this class studied last
week.
F. The age of cars on a used car lot
2.4.10 Example:
The median age of the population in Ireland is now 32
years; the median age of the Irish population in 1986
was 27. Interpret these values and explain the trend.
What implications does this have for Irish society?
What are the consequences for the entertainment
industry in Ireland?
Section 2.5 Numerical Measures of Variability
When we want to describe a dataset providing a
measure of the centre of that dataset is only part of the
story. Consider the following two distributions:

[Figure: two symmetric distributions A and B with the
same centre, A much more spread out than B]
Both of these distributions are symmetric and
meanA = meanB, modeA=modeB and
medianA=medianB. However these two distributions
are obviously different, the data in A is quite spread out
compared to the data in B.
This spread is technically called variability and in this
section we will examine how best to measure it.
2.5.1 Definitions
Range: The Range of a quantitative dataset is equal to
the largest value minus the smallest value.
Sample Variance: The Sample Variance is equal to the
sum of the squared distances from the mean divided by
n-1.
    s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

An easier formula to use when calculating the
variance is:

    s² = [ Σᵢ₌₁ⁿ xᵢ² − (Σᵢ₌₁ⁿ xᵢ)²/n ] / (n − 1)
Sample Standard Deviation: The Sample Standard
Deviation, s, is defined as the positive square root of
the Sample Variance, s².
2.5.2 Which is best?
The meaning of the Range is easily seen from its
definition. It is a very crude measure of the variability
contained in a dataset as it is only interested in the
largest and smallest values and does not measure the
variability of the rest of the dataset.
ExampleA: These two datasets have the same range,
but do they have the same variability?
Dataset1: 1, 5, 5, 5, 9
Dataset2: 1, 2, 5, 8, 9
NO, Dataset2 is obviously more spread out than
Dataset1, which has three values clustered at 5.
The Sample Variance is a much better measure of the
variability in the whole dataset.
This is because the term (xᵢ − x̄) in s² calculates the
distance of each observation in the dataset from the
centre of the dataset (as measured by the Sample
Mean).
As some of the xᵢ are smaller than x̄ and some are
larger, they tend to cancel each other out. For this
reason we square each (xᵢ − x̄) term before adding them
together and dividing by n − 1 to get an average measure
of the squared distance of each observation from the
mean.
The Sample Variance therefore will be small if all
observations are close to the Sample Mean but will be
large if the observations are far away from the mean.
This is best illustrated by comparing the calculation of
s² for the two datasets in ExampleA above.

Dataset1: 1, 5, 5, 5, 9, with x̄ = 5
s² = [(1−5)² + (5−5)² + (5−5)² + (5−5)² + (9−5)²]/4
   = [(−4)² + 0² + 0² + 0² + 4²]/4
   = [16 + 0 + 0 + 0 + 16]/4
   = 8

Dataset2: 1, 2, 5, 8, 9, with x̄ = 5
s² = [(1−5)² + (2−5)² + (5−5)² + (8−5)² + (9−5)²]/4
   = [(−4)² + (−3)² + 0² + 3² + 4²]/4
   = [16 + 9 + 0 + 9 + 16]/4
   = 12.5
So the increased spread contained in Dataset2 is indeed
measured by s².
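The defining formula for s² translates directly into a short function; a sketch, checked against the two datasets above:

```python
def sample_variance(data):
    """Sum of squared distances from the sample mean,
    divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

print(sample_variance([1, 5, 5, 5, 9]))  # → 8.0
print(sample_variance([1, 2, 5, 8, 9]))  # → 12.5
```

Python's statistics.variance implements the same n − 1 divisor.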
2.5.3 Samples and Populations
You will have noticed that although we described s²
as an average of the squared distances from the sample
mean, we in fact divided the sum of squares not by n
but by n − 1. There were n observations in the dataset,
so surely the correct thing would be to divide by n and
not n − 1?
The reason we divide by n − 1 is that we are, as always,
interested in Inferential Statistics: we want to use s²
(the Sample Variance) to estimate the Population
Variance, which we denote by σ² (sigma squared). We
will find later that s², with the n − 1, provides a more
accurate estimator of σ².
So again we have a Sample and a Population, and two
Population Characteristics estimated by two Sample
Statistics.

Population Characteristic          Sample Statistic
Population Variance σ²             Sample Variance s²
Population Standard Deviation σ    Sample Standard Deviation s
2.5.4 Example
Two samples are chosen from a population:
Sample1: 10, 0, 1, 9, 10, 0, 8, 1, 1, 9
Sample2: 0, 5, 10, 5, 5, 5, 6, 5, 6, 5
Answer the following questions based on these two
samples:
A. Examine both samples and identify which has the
greater variability.
B. Calculate the Range for each sample; does your
result agree with your answer in A?
C. Calculate the Standard Deviation for each sample;
does this result agree with your answer to part A?
D. Which of the two, Range or Standard Deviation,
provides the better measure of variability?
Answers:
Range1 = 10, Range2 = 10
s1 = 4.5814, s2 = 2.3944
2.5.5 Example
Once upon a time there were two lecturers, A and B,
each of whom delivered the same course to two
different classes. When exam time came, both classes
had the same average mark of 70%. The marks for
Lecturer A's class, however, had a standard deviation
of 25%, whereas the standard deviation for Lecturer
B's class was 5%. Whose class would you rather be in?
Section 2.6 Interpreting the Standard Deviation Chebyshev’s Rule and the Empirical Rule
We have seen that the Variance, and hence the Standard
Deviation, of a dataset provides us with a relative
measure of the variability it contains: if we are given
two datasets, the one with the larger Standard Deviation
is the dataset which exhibits the greater variability.
Is it possible for the Standard Deviation to give more
than a relative measure of variability?
Can we actually say how spread out the data is?
The answer is yes, we will see later how to give
detailed answers for particular distributions. In the
meantime there are two rules which will provide us
with a good deal of information about some general
datasets.
2.6.1 Chebyshev’s Rule
This rule applies to any dataset (population or sample)
regardless of the shape or frequency distribution of the
data.
For k > 1 the proportion of observations which are
within k Standard Deviations of the mean is
at least 1- 1/k2.
Computing this for several values of k gives:

k (Number of           Proportion of observations within
Standard Deviations)   k Standard Deviations of the Mean
2                      at least 1 − 1/4 = 0.75
3                      at least 1 − 1/9 ≈ 0.89
4                      at least 1 − 1/16 ≈ 0.94
4.472                  at least 1 − 1/20 = 0.95
5                      at least 1 − 1/25 = 0.96
10                     at least 1 − 1/100 = 0.99
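These bounds come straight from the 1 − 1/k² formula; a quick sketch:

```python
def chebyshev_lower_bound(k):
    """'At least' proportion of observations within k standard
    deviations of the mean -- valid for ANY distribution (k > 1)."""
    return 1 - 1 / k ** 2

for k in (2, 3, 4, 4.472, 5, 10):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}")
```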
Note: Chebyshev’s Rule provides us with an idea of the
spread of distributions. Because it is meant to work for
all distributions regardless of their shape it doesn’t give
definite specific results. Instead it tells us that “at
least” a certain proportion of observations lie in a
specified interval. The proportions in Chebyshev's Rule
are therefore very conservative, and for certain
distributions we may find a much higher proportion of
observations within these intervals.
The Empirical Rule provides us with some definite
statements about the proportion of observations in a
specified interval. It only works for Symmetric
Bell-Shaped (mound-shaped) distributions. Also, this
rule is an approximation: more or less data than the
rule indicates may lie in each interval.
2.6.2 The Empirical Rule
For a Symmetric Bell-Shaped distribution;
• Approximately 68% of the observations are within 1
Standard Deviation of the Mean
• Approximately 95% of the observations are within 2
Standard Deviations of the Mean
• Approximately 99.7% of the observations are within 3
Standard Deviations of the Mean
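The Empirical Rule can itself be checked empirically; a sketch drawing a large bell-shaped (normal) sample and counting observations within k standard deviations of its mean (the mean of 100 and SD of 15 here are arbitrary illustrative values):

```python
import random
import statistics

random.seed(1)
sample = [random.gauss(100, 15) for _ in range(20_000)]  # bell-shaped data

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

for k in (1, 2, 3):
    within = sum(1 for x in sample if abs(x - x_bar) <= k * s)
    print(f"within {k} SD: {within / len(sample):.1%}")
```

The printed proportions come out close to 68%, 95% and 99.7%, as the rule predicts.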
2.6.3 Some Examples
ExampleA
The following is a list of the times it takes 12 UCD
students to get to college in the morning :
12, 23, 56, 14, 17, 21, 33, 42, 45, 38, 51, 29
Calculate x̄ and s, and calculate the percentage of the
data between x̄ − 2s and x̄ + 2s, and also between
x̄ − 3s and x̄ + 3s. Compare these results with the
predictions of Chebyshev's Rule. Assuming that the
data is distributed in an approximate bell shape, use the
Empirical Rule to calculate the percentage of the data
within 2 standard deviations of the mean and within 3
standard deviations of the mean. Comment on your results.
x̄ = 31.75
s = 14.78
2s = 2 × 14.78 = 29.56
3s = 44.34

x̄ − 2s = 31.75 − 29.56 = 2.19
x̄ + 2s = 31.75 + 29.56 = 61.31
x̄ − 3s = 31.75 − 44.34 = −12.59, truncated to 0 (times cannot be negative)
x̄ + 3s = 31.75 + 44.34 = 76.09
Interval                           Actual   Chebyshev's    Empirical
x̄ − 2s to x̄ + 2s (2.19 to 61.31)   100%     at least 75%   approx. 95%
x̄ − 3s to x̄ + 3s (0 to 76.09)      100%     at least 89%   approx. 99.7%
This table illustrates very clearly how Chebyshev’s rule
generally underestimates the amount of data in each
interval. The empirical rule provides, in this case, more
accurate results.
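The "Actual" column can be verified directly from the commute-time data; a sketch:

```python
times = [12, 23, 56, 14, 17, 21, 33, 42, 45, 38, 51, 29]
n = len(times)
x_bar = sum(times) / n                                       # 31.75
s = (sum((x - x_bar) ** 2 for x in times) / (n - 1)) ** 0.5  # about 14.78

for k in (2, 3):
    inside = sum(1 for x in times if abs(x - x_bar) <= k * s)
    print(f"within {k} standard deviations: {inside}/{n}")  # → 12/12 for both
```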
ExampleB:
A lecturer in UCD has assigned some problems to be
done by the 120 students in her class. When it comes
time to collect the problems 9 students inform her that
“The dog ate my homework”.
From many years of teaching classes this size she has
observed that the mean for homeworks actually eaten
by pets of all kinds is 3 homeworks and the standard
deviation is 0.8 homeworks.
Should the lecturer believe that the homeworks of all 9
students were eaten by their dogs or not?
By Chebyshev's rule at least 1 − 1/k² of the observations
should lie in the interval (x̄ − ks, x̄ + ks).
This gives the following table:
k (# of Standard    Interval      "At least" Percentage of
Deviations)                       observations in interval
2                   (1.4, 4.6)    75%
3                   (0.6, 5.4)    89%
4                   (0, 6.2)      93%
5                   (0, 7)        96%
6                   (0, 7.8)      97%
7                   (0, 8.6)      98%
8                   (0, 9.4)      98.4%
From this table we can see that at most about 2% of
observations should lie as far from the mean as 9 eaten
homeworks does (more than 7 standard deviations
above it). Remembering that Chebyshev's rule is
extremely conservative, we can conclude that the
chances are very high that some of the students just
didn't do their homework.
Example C:
In Tombstone, Arizona Territory people used Colt .45
revolvers. However people used different ammunition.
Wyatt Earp knew that his brothers and Doc Holliday
were the only ones in the territory who used Colt .45s
with Winchester ammunition.
The Earp brothers conducted tests on many different
combinations of weapons and ammunition.
They found that the dataset of observations produced by
the combination of the Colt .45 with Winchester shells
showed a Mean velocity of 936 feet/second and a
Standard Deviation of 10 feet/second.
The measurements were taken at a distance of 15 feet
from the gun.
When Wyatt examined the body of a cowboy shot in
the back in cold blood he concluded that he was shot at
a distance of 15 feet and that the velocity of the bullet
at impact was 1,000 feet/second.
The dastardly Ike Clanton claimed that this cowboy
was shot by the Earp brothers or Doc Holliday. Was
Wyatt able to clear his good name using the Empirical
Rule?
The distribution of this bullet velocity data should be
approximately bell-shaped. This implies that the
empirical rule should give a good estimation of the
percentages of the data within each interval.
k (# of Standard    Interval       Chebyshev's     Empirical
Deviations)                        "At least" %    approximate %
2                   (916, 956)     75%             95%
3                   (906, 966)     89%             99.7%
4                   (896, 976)     93%             ~100%
5                   (886, 986)     96%             ~100%
6                   (876, 996)     97%             ~100%
7                   (866, 1006)    98%             ~100%
This table quite clearly demonstrates that since the
bullet velocity in the shooting was 1000 ft/sec and since
this lies more than 6 Standard Deviations away from
the mean the probability is extremely high that the
Earps were not responsible for this shooting. This is
especially evident from looking at the column showing
percentages from the empirical rule. Practically 100%
of bullet velocities should be between 896 and 976
ft/sec.
Example C2:
During “The Troubles” in Northern Ireland both
Republicans and Loyalists used 9mm handguns
however they used different brands of handgun and
ammunition.
The security forces in NI knew that the republicans
used Heckler and Koch 9mm handguns with
Winchester ammunition.
The security forces conducted tests on many different
combinations of weapons and ammunition.
They found that the dataset of observations produced by
the combination of an H&K 9mm with Winchester shells
showed a Mean velocity of 936 feet/second and a
Standard Deviation of 10 feet/second.
The measurements were taken at a distance of 15 feet
from the gun.
Forensic scientists examining the body of a shooting
victim concluded that he was shot at a distance of 15
feet and that the velocity of the bullet at impact was
1,000 feet/second.
Describe the distribution of the bullet velocities.
Did they conclude that the shooter was a member of a
Republican terrorist organisation or a Loyalist
organisation?
The distribution of this bullet velocity data should be
approximately bell-shaped. This implies that the
empirical rule should give a good estimation of the
percentages of the data within each interval.
k (# of Standard    Interval       Chebyshev's     Empirical
Deviations)                        "At least" %    approximate %
2                   (916, 956)     75%             95%
3                   (906, 966)     89%             99.7%
4                   (896, 976)     93%             ~100%
5                   (886, 986)     96%             ~100%
6                   (876, 996)     97%             ~100%
7                   (866, 1006)    98%             ~100%
This table quite clearly demonstrates that since the
bullet velocity in the shooting was 1000 ft/sec and since
this lies more than 6 Standard Deviations away from
the mean the probability is extremely high that
Republicans were not responsible for this shooting.
This is especially evident from looking at the column
showing percentages from the empirical rule.
Practically 100% of bullet velocities should be between
896 and 976 ft/sec.
2.6.4 Example to illustrate the difference between
Chebyshev's Rule, the Empirical Rule and some actual
data.
A survey was conducted to measure the heights of
14-year-olds; a sample of 1052 children was measured
and it was found that:
x̄ = 62.484 inches
s = 2.390 inches
A bell-shaped symmetric distribution provided a good
fit to the data, applying Chebyshev’s and the Empirical
rule we get:
A bell-shaped symmetric distribution provided a good
fit to the data; applying Chebyshev's Rule and the
Empirical Rule we get:

k (number    Interval           Actual % of   Empirical     Chebyshev's
of SDevs)    (x̄ − ks, x̄ + ks)   Obs. in       Rule: % of    Rule
                                Interval      Obs.
1            60.094 - 64.874    72.1%         68%           >= 0%
2            57.704 - 67.264    96.2%         95%           >= 75%
3            55.314 - 69.654    99.2%         99.7%         >= 89%
Clearly in this instance Chebyshev’s Rule
underestimates the proportions very severely.
2.6.5 Estimating the Standard Deviation from the
Range
According to the Empirical Rule, for Bell-Shaped
distributions almost all of the data should lie in the
interval (x̄ − 3s, x̄ + 3s). So the Range should be
approximately (x̄ + 3s) − (x̄ − 3s) = 6s.
This gives us a crude but useful measure of the
Standard Deviation.
Standard Deviation ≈ Range/6
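A sketch of this crude estimate, tried on a simulated bell-shaped sample (the mean of 50 and true SD of 10 are arbitrary illustrative values):

```python
import random

def estimate_sd_from_range(data):
    """Crude estimate: for bell-shaped data nearly all observations lie
    within 3 standard deviations of the mean, so Range is roughly 6s."""
    return (max(data) - min(data)) / 6

random.seed(2)
sample = [random.gauss(50, 10) for _ in range(1000)]
print(estimate_sd_from_range(sample))  # roughly 10 for this sample
```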
Section 2.7
Numerical Measures of Relative Standing
While it is useful to know how to measure the centre of
a dataset and the variability of a dataset, many times we
want to be able to compare one observation with the
rest of the observations in the dataset. Is one
observation larger than many others?
For example, suppose you get 35% on the exam for this
course. You will probably feel quite bad about your
performance, but what if 90% of the class actually did
worse than you? Then you might feel a bit better about
your 35%.
So in some cases knowing how one observation
compares with others can be more useful than just
knowing the value of that observation.
This section will introduce some different ways of
measuring Relative Standing.
2.7.1 Definitions
Percentile: For any dataset the pth percentile is the
observation which is greater in value than p% of all the
numbers. Consequently this observation will be smaller
than (100 − p)% of the data.
Z-Score: The Z-Score of an observation is the distance
between that observation and the mean expressed in
units of standard deviations. So:
Sample Z-score for an observation x:

    Z = (x − x̄)/s

Population Z-score for an observation x:

    Z = (x − μ)/σ
The numerical value of the Z-score reflects the relative
standing of the observation.
A large positive Z-score implies that the observation is
larger than most of the other observations.
A large negative Z-score indicates that the observation
is smaller than almost all the other observations.
A Z-score of zero, or close to 0, means that the
observation is located close to the mean of the dataset.
2.7.2 Examples
ExampleA: The 50th percentile of a dataset is the
median (the median, remember, is the value which is
larger than half of the data).
ExampleB: Dataset 15, 3, 1, 7, 5, 17, 19, 11, 9, 13
In this dataset the 80th percentile is the value 15 as 15
is greater than or equal to 80% of the data.
This is easily seen if we arrange the data in ascending
order: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19
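A sketch of this simple percentile rule in Python, checked against ExampleB (the function name is my own; library routines such as statistics.quantiles use finer interpolation than the crude definition used here):

```python
import math

def percentile(data, p):
    """The ordered value which is greater than or equal to p% of the
    data, following the simple definition used above."""
    ordered = sorted(data)
    n = len(ordered)
    # smallest rank covering at least p% of the observations
    rank = max(1, math.ceil(n * p / 100))
    return ordered[rank - 1]

data = [15, 3, 1, 7, 5, 17, 19, 11, 9, 13]
print(percentile(data, 80))  # → 15
```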
Exercise 2.79 in textbook
The distribution of scores on a nationally administered
college achievement test has a median of 520 and a
mean of 540.
a. Explain how it is possible for the mean to exceed the
median for this distribution.
b. Suppose that you are told that the 90th percentile is
660, what does this mean?
c. Suppose you are told that you scored at the 94th
percentile, what does this mean?
Answers:
a. Distribution is positively skewed (to the right)
b. 90% of the test scores are below 660 and 10% are
above.
c. 94% of the test scores were below yours and only 6%
were above.
Example D. A sample of 120 statistics students was
chosen and their exam results summarised; the mean
and standard deviation were found to be:
x̄ = 53% and s = 7%
Eric and Kenny are two students in this class. Eric's
exam result was 47%; what was his Z-score? If Kenny's
Z-score is 2, what was his percentage on the exam?
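A sketch of the two computations, using the class figures given in the example and treating the marks as plain numbers:

```python
x_bar, s = 53, 7  # class mean and standard deviation (in %)

# Eric: convert his mark to a Z-score
eric_mark = 47
eric_z = (eric_mark - x_bar) / s
print(round(eric_z, 2))  # → -0.86

# Kenny: invert the formula, x = x_bar + Z * s
kenny_z = 2
kenny_mark = x_bar + kenny_z * s
print(kenny_mark)  # → 67
```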
2.7.3 Z-scores and the Empirical Rule
For a bell shaped distribution the Empirical Rule tells
us the following about Z-scores:
1. Approximately 68% of the observations have a
Z-Score between -1 and 1.
2. Approximately 95% of the observations have a
Z-Score between -2 and 2.
3. Approximately 99.7% of the observations have a
Z-Score between -3 and 3.
Example 2.14 in the textbook:
Suppose a female bank employee believes that her
salary is low as a result of sex discrimination. To
substantiate her belief, she collects information on the
salaries of her male counterparts. She finds that their
salaries have a mean of $34,000 and a standard
deviation of $2,000. Her salary is $27,000; does this
information support her claim of sex discrimination?
Answer:
Calculate her Z-score with respect to her male
counterparts:

    Z = (x − x̄)/s = ($27,000 − $34,000)/$2,000 = −3.5
So the woman’s salary is 3.5 Standard Deviations
below the mean of the male salary distribution. If the
male salaries are distributed in a bell shape then the
empirical rule tells us that very few salaries in this
distribution should have a z-score below -3.
Therefore a Z-score of −3.5 represents either a highly
unusual observation from the male salary distribution,
or an observation from a different distribution.
Do you think her claim of sex discrimination is
justified?
Answer: We need more data: on the collection technique
the woman used, the length of time she has been in her
job, her competence at her job, etc.
If she truly chose a representative sample, if she had
been employed there as long as others and if she was
good at her job then one might conclude that she was
discriminated against.