Descriptive Statistics

Chapter
4
Descriptive Statistics
• Statistics are descriptive measures derived from a
sample (n items).
• Parameters are descriptive measures derived from
a population (N items).
Numerical Description
Central Tendency
Dispersion
Standardized Data
Percentiles and Quartiles
Box Plots
Grouped Data
Skewness and Kurtosis (optional)
Numerical Description
• Three key characteristics of numerical data:
Characteristic | Interpretation
Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion | How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Numerical Description
¯ Example: Vehicle Quality
• Consider the data set of vehicle defect rates from
J. D. Power and Associates.
• Defect rate = (total no. of defects / no. inspected) × 100
• Number of defects per 100 vehicles, 1004 models.
• Numerical statistics can be used to summarize this
random sample of brands.
• Must allow for sampling error since the analysis is
based on sampling.
• To begin, sort the data in Excel.
Numerical Description
• Sorted data provides insight into central tendency
and dispersion.
Numerical Description
¯ Visual Displays
• The dot plot offers a visual impression of the data.
Numerical Description
¯ Visual Displays
• Histograms with 5 bins (suggested by Sturges'
Rule) and 10 bins are shown below.
• Both are symmetric with no extreme values and
show a modal class toward the low end.
Descriptive Statistics in Excel
• Go to Tools | Data Analysis and select
Descriptive Statistics.
• Highlight the data range, specify a cell for the
upper-left corner of the output range, check
Summary Statistics and click OK.
• Here is the resulting analysis.
Descriptive Statistics in MegaStat
• Here is the resulting MegaStat analysis:
Central Tendency
• The central tendency is the middle or typical
values of a distribution.
• Central tendency can be assessed using a dot
plot, histogram or more precisely with numerical
statistics.
Central Tendency
¯ Six Measures of Central Tendency
Statistic | Formula | Excel Formula | Pro | Con
Mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | =AVERAGE(Data) | Familiar and uses all the sample information. | Influenced by extreme values.
Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | Ignores extremes and can be affected by gaps in data values.
Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data.
Midrange | $\frac{x_{\min} + x_{\max}}{2}$ | =0.5*(MIN(Data)+MAX(Data)) | Easy to understand and calculate. | Influenced by extreme values and ignores most data values.
Geometric mean (G) | $\sqrt[n]{x_1 x_2 \cdots x_n}$ | =GEOMEAN(Data) | Useful for growth rates and mitigates high extremes. | Less familiar and requires positive data.
Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data, %) | Mitigates effects of extreme values. | Excludes some data values that could be relevant.
Central Tendency
¯ Mean
• A familiar measure of central tendency.
Population Formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Central Tendency
¯ Mean
• For the sample of n = 37 car brands:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{87 + 93 + 98 + \cdots + 159 + 164 + 173}{37} = \frac{4639}{37} = 125.38$
• In Excel, use function =AVERAGE(Data) where
Data is an array of data values.
Central Tendency
¯ Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
Central Tendency
¯ Characteristics of the Mean
• Regardless of the shape of the distribution,
the signed deviations of the data points from the
mean always sum to zero:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz
scores whose mean = 65.
$\sum_{i=1}^{n} (x_i - \bar{x})$ = (42 – 65) + (60 – 65) + (70 – 65) + (75 – 65) + (78 – 65)
= (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0
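The zero-sum property holds for signed deviations, not absolute ones. A minimal Python sketch using the five quiz scores above (the variable names are illustrative only):

```python
# Quiz scores from the example above (mean = 65).
scores = [42, 60, 70, 75, 78]
mean = sum(scores) / len(scores)

# Signed deviations cancel: the negatives offset the positives.
signed_sum = sum(x - mean for x in scores)

# Absolute deviations do not cancel; their sum is positive
# unless every value equals the mean.
absolute_sum = sum(abs(x - mean) for x in scores)

print(mean, signed_sum, absolute_sum)
```

The absolute deviations are what the mean absolute deviation averages later in the chapter.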
Central Tendency
¯ Median
• The median is the value of the variate that divides the
ordered data into two equal halves.
• The median is not sensitive to extreme values.
• It depends on the rank order of the data, not on the
magnitudes of the values away from the middle.
• The median value is unique.
Central Tendency
¯ Median
• The median ( M) is the 50th percentile or midpoint
of the sorted sample data.
• M separates the upper and lower half of the sorted
observations.
• If n is odd, the median is the middle observation in
the data array.
• If n is even, the median is the average of the
middle two observations in the data array.
Central Tendency
¯ Median
• For n = 8, the median is between the fourth and
fifth observations in the data array.
• For n = 9, the median is the fifth observation in the
data array.
Central Tendency
¯ Median
• Consider the following n = 6 data values:
11 12 15 17 21 32
• What is the median?
For even n, $\text{Median} = \frac{x_{n/2} + x_{(n/2+1)}}{2}$
n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4
M = (x_3 + x_4)/2 = (15 + 17)/2 = 16
Central Tendency
¯ Median
• Consider the following n = 7 data values:
12 23 23 25 27 34 41
• What is the median?
For odd n, $\text{Median} = x_{(n+1)/2}$
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = x_4 = 25
Central Tendency
¯ Median
• Use Excel's function =MEDIAN(Data) where Data
is an array of data values.
• For the 37 vehicle quality ratings (odd n) the
position of the median is
(n+1)/2 = (37+1)/2 = 19.
• So, the median is x_19 = 121.
• When there are several duplicate data values, the
median does not provide a clean "50-50" split in
the data.
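The odd/even position rule for the median can be sketched in Python as follows. The function name `median_by_position` is an illustrative choice, not something from the slides.

```python
def median_by_position(values):
    """Median via the sorted-position rule for odd and even n."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: the middle observation, 1-based position (n + 1) / 2.
        return x[(n + 1) // 2 - 1]
    # Even n: average of 1-based positions n/2 and n/2 + 1.
    return (x[n // 2 - 1] + x[n // 2]) / 2

even_case = median_by_position([11, 12, 15, 17, 21, 32])
odd_case = median_by_position([12, 23, 23, 25, 27, 34, 41])
print(even_case, odd_case)
```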
Central Tendency
¯ Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for
3 students:
Student | Scores | Mean | Median | Total
Tom | 20, 40, 70, 75, 80 | 57 | 70 | 285
Jake | 60, 65, 70, 90, 95 | 76 | 70 | 380
Mary | 50, 65, 70, 75, 90 | 70 | 70 | 350
• What does the median for each student tell you?
Central Tendency
¯ Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur
often near the center of sorted data.
• May have multiple modes or no mode.
Central Tendency
¯ Mode
• For example, consider the following quiz scores for
4 students:
Student | Scores | Mean | Median | Mode
Lee | 60, 70, 70, 70, 80 | 70 | 70 | 70
Pat | 45, 45, 70, 90, 100 | 70 | 70 | 45
Sam | 50, 60, 70, 80, 90 | 70 | 70 | none
Xiao | 50, 50, 70, 90, 90 | 70 | 70 | 50, 90
• What does the mode for each student tell you?
Central Tendency
¯ Mode
• Easy to define, not easy to calculate in large
samples.
• Generally isn't useful for continuous data since
data values rarely repeat.
Central Tendency
¯ Mode
• Use Excel's function =MODE(Array)
- will return #N/A if there is no mode.
- will return first mode found if multimodal.
• Best for attribute data or a discrete variable with a
small range (e.g., Likert scale).
• May be far from the middle of the distribution and
not at all typical.
Central Tendency
¯ Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random
sample of 68 Standard & Poor's 500 stocks.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• What is the mode?
Central Tendency
¯ Example: Price/Earnings Ratios and Mode
• Excel's descriptive statistics results are:
Mean | 22.7206
Median | 19
Mode | 13
Range | 84
Minimum | 7
Maximum | 91
Sum | 1545
Count | 68
• The mode 13 occurs 7 times, but what does the
dot plot show?
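As a cross-check on the Excel output, the same summary can be reproduced with Python's standard `statistics` module. This is a sketch, and the list name `pe` is an illustrative choice.

```python
import statistics

# The 68 P/E ratios from the example, in sorted order.
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

summary = {
    "Mean": statistics.mean(pe),
    "Median": statistics.median(pe),
    "Mode": statistics.mode(pe),   # most frequent value
    "Range": max(pe) - min(pe),
    "Minimum": min(pe),
    "Maximum": max(pe),
    "Sum": sum(pe),
    "Count": len(pe),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```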
Central Tendency
¯ Example: Price/Earnings Ratios and Mode
• The dot plot shows local modes (a peak with
valleys on either side) at 10, 13, 15, 19, 23, 26, 29.
• These multiple modes suggest that the mode is
not a stable measure of central tendency.
Central Tendency
¯ Mode
• A bimodal distribution refers to the shape of the
histogram rather than the mode of the raw data.
• Occurs when dissimilar populations are combined
in one sample. For example,
Central Tendency
¯ Skewness
• Compare mean and median or look at histogram to
determine degree of skewness.
• Mean, median and mode are approximately related.
For a symmetric unimodal distribution all three are
equal. For a moderately skewed unimodal distribution,
Mode ≈ Mean – 3(Mean – Median)
Central Tendency
¯ Symptoms of Skewness
Distribution's Shape | Histogram Appearance | Statistics
Skewed left (negative skewness) | Long tail of histogram points left (a few low values but most data on right) | Mean < Median
Symmetric | Tails of histogram are balanced (low/high values offset) | Mean ≈ Median
Skewed right (positive skewness) | Long tail of histogram points right (most data on left but a few high values) | Mean > Median
Central Tendency
¯ Skewness
• For the sample of J.D. Power quality ratings, the
mean (125.38) exceeds the median (121). What
does this suggest?
Central Tendency
¯ Geometric Mean
• The geometric mean (G) is a multiplicative average.
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• For the J. D. Power quality data (n = 37):
$G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
• In Excel use =GEOMEAN(Array)
• The geometric mean tends to mitigate the effects
of high outliers.
Central Tendency
¯ Growth Rates
• A variation on the geometric mean used to find the
average growth rate for a time series.
$G = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
• For example, from 1998 to 2002, Spirit Airlines
revenues are:
Year | Revenue (mil)
1998 | 131
1999 | 227
2000 | 311
2001 | 354
2002 | 403
Central Tendency
¯ Growth Rates
• The average growth rate is given by taking the
geometric mean of the ratios of each year's
revenue to the preceding year.
• The n = 5 annual values span n – 1 = 4 growth
periods, so there are four ratios and the root is of
order 4.
• Due to cancellations, only the first and last years
are relevant:
$G = \sqrt[4]{\left(\frac{227}{131}\right)\left(\frac{311}{227}\right)\left(\frac{354}{311}\right)\left(\frac{403}{354}\right)} - 1 = \sqrt[4]{\frac{403}{131}} - 1$
= 1.324 – 1 = .324 or 32.4% per year
• In Excel use =(403/131)^(1/4)-1
Central Tendency
¯ Geometric Mean
• Suppose $100 is growing by 10% each year, then
• Year 0 (current) $100
• Year 1 $110
• Year 2 $121
• Year 3 $133.1
• Year 4 $146.41
• Mean = 122.1
• Geometric mean = 121.0
• Now let us take the average growth rate:
• The amount grew by $46.41 in 4 years.
• Average growth (using AM) = 11.6025
• Average growth (using GM) = 10.0 (which is correct).
• GM is a better measure of central tendency when the
data show proportionate change.
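The cancellation argument can be checked numerically. The Python sketch below assumes the five Spirit Airlines revenues listed in the example. Five annual values give four year-over-year ratios, so it takes a fourth root. Variable names are illustrative.

```python
import math

# Spirit Airlines revenues (millions), 1998 to 2002, from the example.
revenue = [131, 227, 311, 354, 403]
periods = len(revenue) - 1  # five values span four growth periods

# Year-over-year ratios: each year's revenue over the preceding year's.
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]

# Average growth rate = geometric mean of the ratios, minus 1.
growth_from_ratios = math.prod(ratios) ** (1 / periods) - 1

# Intermediate years cancel in the product, so the endpoints suffice.
growth_from_endpoints = (revenue[-1] / revenue[0]) ** (1 / periods) - 1

# Check: compounding the first value at this rate reproduces the last.
reconstructed_last = revenue[0] * (1 + growth_from_endpoints) ** periods

print(round(growth_from_endpoints, 3), round(reconstructed_last, 1))
```

Compounding the 1998 value forward at the computed rate lands on the 2002 value, which is a quick way to confirm the order of the root.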
Central Tendency
¯ Midrange
• The midrange is the point halfway between the
lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
$\text{Midrange} = \frac{x_{\min} + x_{\max}}{2}$
• For the J. D. Power quality data (n = 37):
$\text{Midrange} = \frac{x_{\min} + x_{\max}}{2} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$
• Here, the midrange (130) is higher than the mean
(125.38) or median (121).
Central Tendency
¯ Trimmed Mean
• To calculate the trimmed mean, first remove the
highest and lowest k percent of the observations.
• For example, for the n = 68 P/E ratios, we want a 5
percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim,
multiply k × n = 0.05 × 68 = 3.4, which rounds down
to 3 observations.
• So, we would remove the three smallest and three
largest observations before averaging the
remaining values.
Central Tendency
¯ Trimmed Mean
• Here is a summary of all the measures of central
tendency for the n = 68 P/E values.
Statistic | Value | Excel Formula
Mean | 22.72 | =AVERAGE(PERatio)
Median | 19.00 | =MEDIAN(PERatio)
Mode | 13.00 | =MODE(PERatio)
Geometric Mean | 19.85 | =GEOMEAN(PERatio)
Midrange | 49.00 | =(MIN(PERatio)+MAX(PERatio))/2
5% Trim Mean | 21.10 | =TRIMMEAN(PERatio,0.1)
• The trimmed mean mitigates the effects of very
high values, but still exceeds the median.
Central Tendency
¯ Trimmed Mean
• The Federal Reserve uses a 16% trimmed mean to
mitigate the effects of extremes in its analysis of
the Consumer Price Index.
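A trimmed mean following the steps described in this section might look like the Python sketch below. The helper name `trimmed_mean` is hypothetical. Here `k` is the fraction removed from each end, whereas Excel's TRIMMEAN takes the total fraction removed (0.1 for a 5% trim per end).

```python
def trimmed_mean(values, k):
    """Mean after dropping the lowest and highest k fraction of values.

    The count trimmed from each end is k * n rounded down, as in the
    worked example (0.05 * 68 = 3.4, so 3 observations per end).
    """
    x = sorted(values)
    n = len(x)
    cut = int(k * n)
    kept = x[cut:n - cut]
    if not kept:
        raise ValueError("k trims away every observation")
    return sum(kept) / len(kept)

# The 68 P/E ratios from the example.
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

trim_5 = trimmed_mean(pe, 0.05)
midrange = (min(pe) + max(pe)) / 2
print(round(trim_5, 2), midrange)
```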
Dispersion
• Variation is the "spread" of data points about the
center of the distribution in a sample. Consider the
following measures of dispersion:
Dispersion
¯ Measures of Variation
Statistic | Formula | Excel | Pro | Con
Range | $x_{\max} - x_{\min}$ | =MAX(Data)-MIN(Data) | Easy to calculate. | Sensitive to extreme data values.
Variance ($s^2$) | $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ | =VAR(Data) | Plays a key role in mathematical statistics. | Nonintuitive meaning.
Standard deviation ($s$) | $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ | =STDEV(Data) | Most common measure. Uses same units as the raw data ($, £, ¥, etc.). | Nonintuitive meaning.
Coefficient of variation (CV) | $100 \times \frac{s}{\bar{x}}$ | None | Measures relative variation in percent so can compare data sets. | Requires nonnegative data.
Mean absolute deviation (MAD) | $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ | =AVEDEV(Data) | Easy to understand. | Lacks "nice" theoretical properties.
Dispersion
¯ Range
• The difference between the largest and smallest
observation.
Range = xmax – xmin
• For example, for the n = 68 P/E ratios,
Range = 91 – 7 = 84
Dispersion
¯ Variance
• The population variance ($\sigma^2$) is
defined as the sum of squared
deviations around the mean µ
divided by the population size.
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we
divide by n – 1 instead of n,
otherwise $s^2$ would tend to
underestimate the unknown
population variance $\sigma^2$.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Dispersion
¯ Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary
from the mean.
• Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Dispersion
¯ Standard Deviation
• Excel's built in functions are
Statistic | Excel population formula | Excel sample formula
Variance | =VARP(Array) | =VAR(Array)
Standard deviation | =STDEVP(Array) | =STDEV(Array)
Dispersion
¯ Calculating a Standard Deviation
• Consider the following five quiz scores for
Stephanie.
• Now, calculate the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{2380}{5-1}} = \sqrt{595} = 24.39$
• Somewhat easier, the two-sum formula can also
be used:
$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{28300 - \frac{(360)^2}{5}}{5-1} = \frac{28300 - 25920}{5-1} = 595$
$s = \sqrt{595} = 24.39$
Dispersion
¯ Calculating a Standard Deviation
• The standard deviation is nonnegative because
deviations around the mean are squared.
• When every observation is exactly equal to the
mean, the standard deviation is zero.
• Standard deviations can be large or small,
depending on the units of measure.
• Compare standard deviations only for data sets
measured in the same units and only if the means
do not differ substantially.
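The definitional and two-sum formulas should agree, which a short Python sketch can confirm. The five scores below are an assumption: the slide's data table is not reproduced in the text, so these are hypothetical values chosen only to match the sums quoted in the worked example (n = 5, sum = 360, sum of squares = 28300).

```python
import math

# Hypothetical scores (assumption): chosen to match the quoted sums,
# since the original data table is not available in the text.
scores = [40, 55, 75, 95, 95]
n = len(scores)
mean = sum(scores) / n

# Definitional formula: squared deviations about the mean over n - 1.
ss_dev = sum((x - mean) ** 2 for x in scores)
var_definitional = ss_dev / (n - 1)

# Two-sum formula: needs only the sum and the sum of squares.
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)
var_two_sum = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(var_definitional)
print(var_definitional, var_two_sum, round(s, 2))
```

The two-sum form is convenient by hand but can lose precision in floating point when the mean is large relative to the spread, so the definitional form is often the safer choice in code.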
Dispersion
¯ Coefficient of Variation
• Useful for comparing variables measured in
different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean.
$CV = 100 \times \frac{s}{\bar{x}}$
Dispersion
¯ Coefficient of Variation
• For example:
Data set | s | $\bar{x}$ | CV
Defect rates (n = 37) | 22.89 | 125.38 | 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100) | 280.80 | 233.89 | 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68) | 14.08 | 22.72 | 100 × (14.08)/(22.72) = 62%
• Only appropriate for nonnegative data. It is
undefined if the mean is zero or negative.
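The three CV comparisons can be reproduced with a small Python helper. The function name and the guard against a nonpositive mean are illustrative choices reflecting the caveat above.

```python
def coefficient_of_variation(s, mean):
    """CV in percent; defined here only for a positive mean."""
    if mean <= 0:
        raise ValueError("CV is undefined for a zero or negative mean")
    return 100 * s / mean

cv_defects = coefficient_of_variation(22.89, 125.38)
cv_atm = coefficient_of_variation(280.80, 233.89)
cv_pe = coefficient_of_variation(14.08, 22.72)
print(round(cv_defects), round(cv_atm), round(cv_pe))
```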
Dispersion
¯ Mean Absolute Deviation
• The Mean Absolute Deviation (MAD) reveals the
average distance from an individual data point to
the mean (center of the distribution).
• Uses absolute values of the deviations around the
mean.
$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
• Excel's function is =AVEDEV(Array)
Dispersion
¯ Central Tendency vs. Dispersion:
Manufacturing
• Consider the histograms of hole diameters drilled in
a steel plate during manufacturing.
• The desired distribution is outlined in red.
Machine A: Acceptable variation but mean is less than 5 mm.
Machine B: Desired mean (5 mm) but too much variation.
• Take frequent samples to monitor quality.
Dispersion
¯ Central Tendency vs. Dispersion:
Job Performance
• Consider student ratings of four professors on eight
teaching attributes (10-point scale).
Dispersion
¯ Central Tendency vs. Dispersion:
Job Performance
• Jones and Wu have identical means but different
standard deviations.
• Smith and Gopal have different means but identical
standard deviations.
• A high mean (better rating) and low standard
deviation (more consistency) is preferred. Which
professor do you think is best?
Standardized Data
¯ Chebyshev's Theorem
• Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean µ and standard
deviation σ, the percentage of observations that lie
within k standard deviations of the mean must be at
least $100[1 - 1/k^2]$.
Standardized Data
¯ Chebyshev's Theorem
• For k = 2 standard deviations,
$100[1 - 1/2^2] = 75\%$
• So, at least 75.0% will lie within µ ± 2σ
• For k = 3 standard deviations,
$100[1 - 1/3^2] = 88.9\%$
• So, at least 88.9% will lie within µ ± 3σ
• Although applicable to any data set, these limits
tend to be too wide to be useful.
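Chebyshev's lower bound is simple to tabulate. A minimal Python sketch follows, with a hypothetical function name. For k at or below 1 the formula gives a bound of zero or less, so the sketch floors it at 0.

```python
def chebyshev_bound(k):
    """Minimum percent of observations within k standard deviations."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula is non-positive, i.e. no useful bound.
    return max(0.0, 100 * (1 - 1 / k ** 2))

for k in (1, 2, 3):
    print(k, round(chebyshev_bound(k), 1))
```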
Standardized Data
¯ The Empirical Rule
• The normal or Gaussian distribution was named for
Karl Gauss (1777-1855).
• The normal distribution is symmetric and is also
known as the bell-shaped curve.
• The Empirical Rule states that for data from a
normal distribution, we expect that for
k = 1 about 68.26% will lie within µ ± 1σ
k = 2 about 95.44% will lie within µ ± 2σ
k = 3 about 99.73% will lie within µ ± 3σ
Standardized Data
¯ The Empirical Rule
• Distance from the mean is measured in terms of
the number of standard deviations.
Note: no upper bound is given. Data values outside
µ ± 3σ are rare.
Standardized Data
¯ Example: Exam Scores
• If 80 students take an exam, how many will score
within 2 standard deviations of the mean?
• Assuming exam scores follow a normal distribution,
the empirical rule states
about 95.44% will lie within µ ± 2σ
so 95.44% × 80 ≈ 76 students will score
within ± 2σ of µ.
• How many students will score more than 2
standard deviations from the mean?
Standardized Data
¯ Unusual Observations
• Unusual observations are those that lie beyond
µ ± 2σ.
• Outliers are observations that lie beyond
µ ± 3σ.
Standardized Data
¯ Unusual Observations
• For example, the P/E ratio data contains several
large data values. Are they unusual or outliers?
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Standardized Data
¯ The Empirical Rule
• If the sample came from a normal distribution, then
the Empirical Rule states
$\bar{x} \pm 1s$ = 22.72 ± 1(14.08) = (8.6, 36.8)
$\bar{x} \pm 2s$ = 22.72 ± 2(14.08) = (-5.4, 50.9)
$\bar{x} \pm 3s$ = 22.72 ± 3(14.08) = (-19.5, 65.0)
Standardized Data
¯ The Empirical Rule
• Are there any unusual values or outliers?
7 8 . . . 48 55 68 91
[Figure: number line marked at -19.5, -5.4, 8.6, 22.72, 36.8, 50.9, and 65.0; values beyond the ± 2s limits are labeled Unusual and values beyond the ± 3s limits are labeled Outliers.]
Standardized Data
¯ Defining a Standardized Variable
• A standardized variable (Z) redefines each
observation in terms of the number of standard
deviations from the mean.
Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
Standardized Data
¯ Defining a Standardized Variable
• $z_i$ tells how far away the observation is from the
mean.
• For example, for the P/E data, the first value $x_1$ = 7.
The associated z value is
$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
• A negative z value means the observation is below
the mean.
• Positive z means the observation is above the
mean. For $x_{68}$ = 91,
$z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
Standardized Data
¯ Defining a Standardized Variable
• Here are the standardized z values for the P/E
data:
• What do you conclude for these four values?
Standardized Data
¯ Defining a Standardized Variable
• MegaStat calculates standardized values as well
as checks for outliers.
• In Excel, use =STANDARDIZE(Array, Mean,
STDev) to calculate a standardized z value.
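Standardizing and then applying the 2-sigma and 3-sigma conventions from this section can be sketched in Python as below. The function names and the "typical" label are illustrative assumptions. The mean and standard deviation are the rounded sample values quoted in the text.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / s

def classify(x, mean, s):
    """Label a value: beyond 3s is an outlier, beyond 2s is unusual."""
    z = abs(z_score(x, mean, s))
    if z > 3:
        return "outlier"
    if z > 2:
        return "unusual"
    return "typical"  # hypothetical label for everything else

mean, s = 22.72, 14.08  # P/E sample statistics quoted in the text
labels = {x: classify(x, mean, s) for x in (7, 48, 55, 68, 91)}

for x, label in labels.items():
    print(x, round(z_score(x, mean, s), 2), label)
```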
Standardized Data
¯ Outliers
• What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely outside
of an expected range) is certainly invalid.
• Recognize unusual data points and outliers and
their potential impact on your study.
• Research books and articles on how to handle
outliers.
Standardized Data
¯ Estimating Sigma
• For a normal distribution, nearly all values fall within
a range of 6σ (from µ – 3σ to µ + 3σ).
• If you know the range R (high – low), you can
estimate the standard deviation as σ ≈ R/6.
• Useful for approximating the standard deviation
when only R is known.
• This estimate depends on the assumption of
normality.
Percentiles and Quartiles
¯ Percentiles
• Percentiles divide the sorted data into 100 groups.
• For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the
test-takers scored below you.
• Deciles divide the sorted data into 10 groups.
• Quintiles divide the sorted data into 5 groups.
• Quartiles divide the sorted data into 4 groups.
Percentiles and Quartiles
¯ Percentiles
• Percentiles are used to establish benchmarks for
comparison purposes (e.g., health care,
manufacturing and banking industries use 5, 25,
50, 75 and 90 percentiles).
• Quartiles (25, 50, and 75 percent) are commonly
used to assess financial performance and stock
portfolios.
• Percentiles are used in employee merit evaluation
and salary benchmarking.
Percentiles and Quartiles
¯ Quartiles
• Quartiles are scale points that divide the sorted
data into four groups of approximately equal size.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• The three values that separate the four groups are
called Q1, Q2, and Q3, respectively.
Percentiles and Quartiles
¯ Quartiles
• The second quartile Q2 is the median, an important
indicator of central tendency.
Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the
interquartile range Q3 – Q1 measures the degree of
spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
Percentiles and Quartiles
¯ Quartiles
• The first quartile Q1 is the median of the data
values below Q2, and the third quartile Q3 is the
median of the data values above Q2.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
For first half of data, 50% above, 50% below Q1.
For second half of data, 50% above, 50% below Q3.
Percentiles and Quartiles
¯ Quartiles
• Depending on n, the quartiles Q1, Q2, and Q3 may
be members of the data set or may lie between
two of the sorted data values.
Percentiles and Quartiles
¯ Method of Medians
• For small data sets, find quartiles using the method
of medians:
Step 1. Sort the observations.
Step 2. Find the median Q 2.
Step 3. Find the median of the data values that lie
below Q 2.
Step 4. Find the median of the data values that lie
above Q 2.
Percentiles and Quartiles
¯ Excel Quartiles
• Use Excel function =QUARTILE(Array, k) to return
the kth quartile.
• Excel treats quartiles as a special case of
percentiles. For example, to calculate Q3
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
Position of Q1 | 0.25n + 0.75
Position of Q2 | 0.50n + 0.50
Position of Q3 | 0.75n + 0.25
Percentiles and Quartiles
¯ Example: P/E Ratios and Quartiles
• Consider the following P/E ratios for 68 stocks in a
portfolio.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• Use quartiles to define benchmarks for stocks that
are low-priced (bottom quartile) or high-priced (top
quartile).
Percentiles and Quartiles
¯ Example: P/E Ratios and Quartiles
• Using Excel's method of interpolation, the quartile
positions are:
Quartile | Position Formula | Interpolate Between
Q1 | 0.25(68) + 0.75 = 17.75 | X17 and X18
Q2 | 0.50(68) + 0.50 = 34.50 | X34 and X35
Q3 | 0.75(68) + 0.25 = 51.25 | X51 and X52
Percentiles and Quartiles
¯ Example: P/E Ratios and Quartiles
• The quartiles are:
Quartile | Formula
First (Q1) | Q1 = X17 + 0.75 (X18 – X17) = 14 + 0.75 (14 – 14) = 14
Second (Q2) | Q2 = X34 + 0.50 (X35 – X34) = 19 + 0.50 (19 – 19) = 19
Third (Q3) | Q3 = X51 + 0.25 (X52 – X51) = 26 + 0.25 (26 – 26) = 26
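Both quartile methods from this section can be sketched in Python. The function names are illustrative. The interpolated version follows the position formulas given for Excel (0.25n + 0.75 and so on) rather than calling Excel itself. The method-of-medians version assumes that the middle observation is excluded from both halves when n is odd.

```python
from statistics import median

def quartile_interpolated(values, k):
    """k-th quartile (k = 1, 2, 3) by linear interpolation."""
    x = sorted(values)
    n = len(x)
    pos = 1 + (n - 1) * k / 4   # 1-based; equals 0.25n + 0.75 for k = 1
    lower = int(pos)            # observation at or just below the position
    frac = pos - lower
    if lower >= n:
        return x[-1]
    return x[lower - 1] + frac * (x[lower] - x[lower - 1])

def quartiles_by_medians(values):
    """(Q1, Q2, Q3) via the method of medians; needs n >= 2."""
    x = sorted(values)
    n = len(x)
    half = n // 2               # middle value is left out when n is odd
    return median(x[:half]), median(x), median(x[n - half:])

pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

pe_interpolated = [quartile_interpolated(pe, k) for k in (1, 2, 3)]
pe_medians = quartiles_by_medians(pe)
print(pe_interpolated, pe_medians)
```

For the P/E data the two methods agree because the neighboring sorted values at each quartile position are identical.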
Percentiles and Quartiles
¯ Example: P/E Ratios and Quartiles
• So, to summarize (quarters of the P/E ratios):
Lower 25% | Q1 = 14 | Second 25% | Q2 = 19 | Third 25% | Q3 = 26 | Upper 25%
• These quartiles express central tendency and
dispersion. What is the interquartile range?
• Because of clustering of identical data values,
these quartiles do not provide clean cut points
between groups of observations.
Percentiles and Quartiles
¯ Tip
Whether you use the method of
medians or Excel, your quartiles will be
about the same. Small differences in
calculation techniques typically do not
lead to different conclusions in
business applications.
Percentiles and Quartiles
¯ Caution
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut points
in the sorted data, especially in small samples with
repeating data values.
Data set A: 1, 2, 4, 4, 8, 8, 8, 8 | Q1 = 3, Q2 = 6, Q3 = 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15 | Q1 = 3, Q2 = 6, Q3 = 8
• Although they have identical quartiles, these two
data sets are not similar. The quartiles do not
represent either data set well.
Percentiles and Quartiles
¯ Dispersion Using Quartiles
• Some robust measures of central tendency and
dispersion using quartiles are:
Statistic | Formula | Excel | Pro | Con
Midhinge | $\frac{Q_1 + Q_3}{2}$ | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.
Midspread | $Q_3 - Q_1$ | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.
Coefficient of quartile variation (CQV) | $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$ | None | Relative variation in percent so we can compare data sets. | Less familiar to nonstatisticians.
Percentiles and Quartiles
¯ Midhinge
• The mean of the first and third quartiles.
$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
• For the 68 P/E ratios,
$\text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$
• A robust measure of central tendency since
quartiles ignore extreme values.
Percentiles and Quartiles
¯ Midspread (Interquartile Range)
• A robust measure of dispersion
Midspread = Q3 – Q1
• For the 68 P/E ratios,
Midspread = Q3 – Q1 = 26 – 14 = 12
Percentiles and Quartiles
¯ Coefficient of Quartile Variation (CQV)
• Measures relative dispersion: expresses the
midspread as a percent of the sum of the quartiles
(twice the midhinge).
$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
• For the 68 P/E ratios,
$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$
• Similar to the CV, CQV can be used to compare
data sets measured in different units or with
different means.
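The three quartile-based measures reduce to simple arithmetic on Q1 and Q3. Here is a Python sketch using the P/E quartiles from the example (the function names are illustrative):

```python
def midhinge(q1, q3):
    """Average of the first and third quartiles."""
    return (q1 + q3) / 2

def midspread(q1, q3):
    """Interquartile range."""
    return q3 - q1

def cqv(q1, q3):
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)

q1, q3 = 14, 26  # P/E quartiles from the example
print(midhinge(q1, q3), midspread(q1, q3), cqv(q1, q3))
```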
Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary:
Xmin, Q1, Q2, Q3, Xmax
Box Plots
• Consider the five-number summary for the
68 P/E ratios:
Xmin, Q1, Q2, Q3, Xmax
7 14 19 26 91
[Figure: box plot of the P/E ratios, right-skewed, with labels Minimum, Q1, Median (Q2), Q3, Maximum, Box, and Whiskers; the center of the box is the midhinge.]
Box Plots
¯ Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found
using the following formulas:
Fence | Inner fences | Outer fences
Lower fence | Q1 – 1.5 (Q3 – Q1) | Q1 – 3.0 (Q3 – Q1)
Upper fence | Q3 + 1.5 (Q3 – Q1) | Q3 + 3.0 (Q3 – Q1)
• Values outside the inner fences are unusual while
those outside the outer fences are outliers.
Box Plots
¯ Fences and Unusual Data Values
• For example, consider the P/E ratio data:
Fence | Inner fences | Outer fences
Lower fence | 14 – 1.5 (26 – 14) = −4 | 14 – 3.0 (26 – 14) = −22
Upper fence | 26 + 1.5 (26 – 14) = +44 | 26 + 3.0 (26 – 14) = +62
• Ignore the lower fence since it is negative and P/E
ratios are only positive.
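The fence calculation and the resulting classification of the P/E data can be sketched in Python as follows. The function name is hypothetical, and the quartiles are the values found earlier (Q1 = 14, Q3 = 26).

```python
def fences(q1, q3):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

inner, outer = fences(14, 26)

# Outliers lie beyond the outer fences; unusual values lie beyond the
# inner fences but not beyond the outer ones.
outliers = [x for x in pe if x < outer[0] or x > outer[1]]
unusual = [x for x in pe
           if (x < inner[0] or x > inner[1]) and x not in outliers]

print(inner, outer, unusual, outliers)
```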
Box Plots
¯ Fences and Unusual Data Values
• Truncate the whisker at the fences and display
unusual values and outliers as dots.
[Figure: box plot of the P/E ratios with the inner and outer fences marked; points beyond them are labeled Unusual and Outliers.]
• Based on these fences, there are three unusual
P/E values and two outliers.
Grouped Data
¯ Nature of Grouped Data
• Although some information is lost, grouped data
are easier to display than raw data.
• When bin limits are given, the mean and standard
deviation can be estimated.
• Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies
Grouped Data
¯ Mean and Standard Deviation
• Consider the frequency distribution for prices of
Lipitor® for three cities:
• Estimate the mean and standard deviation by
$\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92553$
$s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n-1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$
• Where
$m_j$ = class midpoint
$f_j$ = class frequency
k = number of classes
n = sample size
• Note: don't round off too soon.
• Now estimate the coefficient of variation
$CV = 100\,(s / \bar{x})$ = 100 (6.74293 / 72.92553) = 9.2%
Grouped Data
¯ Accuracy Issues
• How accurate are grouped estimates compared to
ungrouped estimates?
• For the previous example, we can compare the
grouped data statistics to the ungrouped data
statistics.
• For this example, very little information was lost
due to grouping.
• However, accuracy could be lost due to the nature
of the grouping (i.e., if the data values were not
evenly spaced within bins).
Grouped Data
¯ Accuracy Issues
• The dot plot shows a relatively even distribution
within the bins.
• Effects of uneven distributions within bins tend to
average out unless there is systematic skewness.
Grouped Data
¯ Accuracy Issues
• Accuracy tends to improve as the number of bins
increases.
• If the first or last class is open-ended, there will be
no class midpoint (no mean can be estimated).
• Assume a lower limit of zero for the first class
when the data are nonnegative.
• You may be able to assume an upper limit for
some variables (e.g., age).
• Median and quartiles may be estimated even with
open-ended classes.
Skewness and Kurtosis
¯ Skewness
• Generally, skewness may be indicated by looking
at the sample histogram or by comparing the mean
and median.
• This visual indicator is imprecise and does not take
into consideration sample size n.
Skewness and Kurtosis
¯ Skewness
• Skewness is a unit-free statistic.
• The coefficient compares two samples measured
in different units or one sample with a known
reference distribution (e.g., symmetric normal
distribution).
• Calculate the sample's skewness coefficient as:
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$
Skewness and Kurtosis
¯ Skewness
• In Excel, go to
Tools | Data Analysis |
Descriptive Statistics or
use the function
=SKEW(array)
• Consider the following table showing the 90%
range for the sample skewness coefficient.
Skewness and Kurtosis
¯ Skewness
• Coefficients within the 90% range may be
attributed to random variation.
• Coefficients outside the range suggest the sample
came from a nonnormal population.
• As n increases, the range of chance variation
narrows.
Skewness and Kurtosis
¯ Kurtosis
• Kurtosis is the relative length of the tails and the
degree of concentration in the center.
• Consider three kurtosis prototype shapes.
Skewness and Kurtosis
¯ Kurtosis
• A histogram is an unreliable guide to kurtosis since
scale and axis proportions may differ.
• Excel and MINITAB calculate kurtosis as:
$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
Skewness and Kurtosis
¯ Kurtosis
• Consider the following table of expected 90%
range for sample kurtosis coefficient.
• A sample coefficient within the ranges may be
attributed to chance variation.
• Coefficients outside the range would suggest the
sample differs from a normal population.
• As sample size increases, the chance range
narrows.
• Inferences about kurtosis are risky for n < 50.
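The skewness and kurtosis coefficients defined in this chapter can be computed directly from their formulas. The Python sketch below uses illustrative function names and a small made-up data set, not data from the slides. It assumes the values are not all identical, since s = 0 would make the coefficients undefined.

```python
import math

def _mean_and_s(values):
    """Sample mean and sample standard deviation (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return mean, s

def sample_skewness(values):
    """Skewness coefficient as defined in the chapter; needs n >= 3."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean, s = _mean_and_s(values)
    total = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def sample_kurtosis(values):
    """Kurtosis coefficient as defined in the chapter; needs n >= 4."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    mean, s = _mean_and_s(values)
    total = sum(((x - mean) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * total - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

symmetric = [1, 2, 3, 4, 5]       # made-up illustration data
right_skewed = [1, 2, 3, 4, 10]   # one high value stretches the tail

print(sample_skewness(symmetric), sample_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

A symmetric sample gives a skewness of zero, and a long right tail gives a positive coefficient, matching the symptoms of skewness described earlier.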