Week 2, Lecture 2, Measures of variability

QBM117
Business Statistics
Descriptive Statistics
Numerical Descriptive Measures
Objectives
• To introduce numerical measures for describing the
central location of data
• To introduce numerical measures for describing the
variability of data
Numerical Descriptive Methods
• We have looked at tabular and graphical methods for
presenting data.
• Although these methods help us to highlight
important features of the data, they do not tell the
whole story.
• Numerical descriptive measures allow us to be more
precise in describing the characteristics of the data.
Numerical Descriptive Methods for
Quantitative Data
• Most numerical descriptive measures are obtained
through arithmetic operations on the data.
• Arithmetic calculations can only be applied to
quantitative data.
• Consequently most of the numerical descriptive
measures we will discuss are for quantitative data.
Parameters and Statistics
• Recall the terms introduced in lecture 2 week 1:
population, sample, parameter, statistic
• Numerical measures calculated from sample data are
called sample statistics.
• Numerical measures calculated from population data
are called population parameters.
• We will look at a number of descriptive statistics and
for each we will learn how to calculate both the
population parameter and the sample statistic.
• In practice we usually collect data from a sample and
calculate sample statistics to use as estimates of
population parameters.
Notation
• Statistics are usually represented by Roman letters:
sample mean $\bar{x}$
sample standard deviation $s$
• Parameters are usually represented by Greek letters:
population mean $\mu$
population standard deviation $\sigma$
Properties of numerical data
• Three major properties that describe quantitative data
are
- measures of central tendency
- measures of dispersion
- measures of shape
Measures of Central Tendency
• In most sets of data there is a tendency for the data
to group about a central point.
• This phenomenon is referred to as central tendency.
• We will look at three measures of central tendency:
mean, median and mode
The Mean
• The most popular and useful measure of central
tendency is the arithmetic mean, widely known as the
average.
• The mean is calculated by summing all the
observations and dividing by the number of
observations.
• It can easily be calculated using the statistics function
on your calculator.
• The mean of a sample of n measurements $x_1, x_2, \dots, x_n$ is defined as
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• The mean of a population of N measurements $x_1, x_2, \dots, x_N$ is defined as
$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
• I have shown you the formulas so that you
understand how the mean is calculated.
• However it is expected that you will calculate the
mean using the statistics function on your calculator.
• If you are unsure of how to use the statistics
functions on your calculator refer to your calculator
manual.
• The population mean and the sample mean are
calculated using the same button on your calculator.
Example 1
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Calculate the mean of the data.
$$\bar{x} = \frac{4 + 20 + \dots + 29}{15} = \frac{349}{15} \approx 23.27$$
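The calculation in Example 1 can be checked with a short Python sketch. It applies the definition directly (sum of the observations divided by their count); the variable names are illustrative and not part of the lecture.

```python
# Price-earnings ratios from Example 1.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

# Mean = sum of the observations divided by the number of observations.
n = len(pe_ratios)
mean = sum(pe_ratios) / n

print(f"n = {n}, sum = {sum(pe_ratios)}, mean = {mean:.2f}")
# prints: n = 15, sum = 349, mean = 23.27
```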
The Median
• The median is the middle value when the data are
arranged in order.
• To calculate the median
- Order the data from smallest to largest
- If the number of observations is odd, the median is
the middle value.
- If the number of observations is even, the median is
the mean of the two middle observations.
Example 1 revisited
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Calculate the median.
Order the data.
4 10 15 16 18 20 21 23 28 29 29 31 33 35 37
There are 15 observations, so the median is the
middle value, which is the 8th value.
median = 23
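The two-case rule for the median (odd or even number of observations) can be sketched in Python as follows. The function name `median` is an illustrative choice, and the sketch assumes a non-empty list of numbers.

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)      # order the data from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the middle pair

# Price-earnings ratios from Example 1.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]
pe_median = median(pe_ratios)
print(pe_median)  # prints: 23
```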
Stem and Leaf Display
• A useful tool for ordering data is the stem and leaf
display.
• To construct a stem and leaf display separate each
observation into
a stem, consisting of all but the last digit
and a leaf, the final digit.
• Write the stems in a vertical column (smallest at top) .
• Write each leaf in the row to the right of the stem.
• Redraw, ordering the leaves.
Example 1 revisited
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Construct a stem and leaf display and calculate the
median.
Unordered
0 | 4
1 | 6 0 5 8
2 | 0 8 3 9 1 9
3 | 1 7 3 5
Ordered
0 | 4
1 | 0 5 6 8
2 | 0 1 3 8 9 9
3 | 1 3 5 7
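The construction steps for a stem and leaf display can be sketched in Python. This is a minimal illustration that assumes non-negative whole numbers; the function name `stem_and_leaf` is hypothetical, and stems with no observations are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its ordered leaves (last digit)."""
    display = defaultdict(list)
    for value in sorted(values):          # sorting first gives ordered leaves
        stem, leaf = divmod(value, 10)    # e.g. 37 -> stem 3, leaf 7
        display[stem].append(leaf)
    return dict(display)

# Price-earnings ratios from Example 1.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]
display = stem_and_leaf(pe_ratios)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
# prints:
# 0 | 4
# 1 | 0 5 6 8
# 2 | 0 1 3 8 9 9
# 3 | 1 3 5 7
```

Counting leaves from the smallest, the 8th leaf is the 3 on stem 2, which gives the median of 23.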
The Mode
• The mode is the value that occurs most frequently.
• The mode doesn’t necessarily lie in the middle.
• Its claim to be a measure of central tendency is
based on the fact that it indicates the location of
greatest concentration of values.
• The mode is a measure of central tendency that can
be used for qualitative data.
Example 1 revisited
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Calculate the mode.
mode = 29
• If no data value occurs more than once then there is
no mode.
• A data set may have more than one mode.
• If there are two modes then the data are bimodal.
• If there are more than two modes the data are
multimodal.
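The rules above (no mode, one mode, several modes) can be sketched in Python. The function name `modes` is illustrative, and returning an empty list when no value repeats follows the convention stated in this lecture; it assumes a non-empty data set.

```python
from collections import Counter

def modes(values):
    """Return every most frequent value, or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # lecture convention: no value occurs more than once, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

# Price-earnings ratios from Example 1.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

print(modes(pe_ratios))        # prints: [29]
print(modes([1, 2, 3]))        # prints: []
print(modes([1, 1, 2, 2, 3]))  # prints: [1, 2]  (bimodal)
```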
Example 2
A survey of television-viewing habits among
university students provided the following data on
viewing time in hours per week:
14 9 12 18 15 10 4 20
26 17 15 6 16 15 8 5
Calculate the mean, median and mode.
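mean = 13.125
Ordered data: 4 5 6 8 9 10 12 14 15 15 15 16 17 18 20 26
There are 16 observations, so the median is the mean
of the 8th and 9th values: (14 + 15)/2.
median = 14.5
mode = 15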
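As a cross-check on Example 2, the same three measures can be obtained from Python's standard `statistics` module, which plays the role of the calculator's statistics functions here. The variable names are illustrative.

```python
import statistics

# Weekly television viewing times (hours) from Example 2.
viewing_hours = [14, 9, 12, 18, 15, 10, 4, 20, 26, 17, 15, 6, 16, 15, 8, 5]

tv_mean = statistics.mean(viewing_hours)
tv_median = statistics.median(viewing_hours)  # even n: mean of the two middle values
tv_mode = statistics.mode(viewing_hours)

print(tv_mean, tv_median, tv_mode)  # prints: 13.125 14.5 15
```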
Mean, Median or Mode
• There are several factors to consider when making
our choice of measure of central tendency.
• The mean is generally our first selection.
• However, there are circumstances when the median
is better.
• The mode is seldom the best measure of central
tendency.
• The mean is a popular measure because it is simple
to calculate and interpret, and lends itself to
mathematical manipulation.
• However the mean is sensitive to skewness and
outliers.
• The mean can be thought of as the balance point of
the data.
• If there are a few data points that are far from the
bulk of the data, the mean moves towards them in
order to maintain balance.
• The mean is the preferred measure of central
tendency.
• However, if the data are skewed or contain outliers
then the median is the preferred measure of central
tendency.
• If the data are qualitative, the mode must be used.
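The sensitivity of the mean to outliers can be illustrated with a small Python sketch. It takes the Example 1 data and appends a hypothetical extreme value of 200, which is not part of the lecture's data, then compares how far the mean and the median move.

```python
import statistics

# Price-earnings ratios from Example 1.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

# Hypothetical outlier added for illustration only (not in the lecture's data).
with_outlier = pe_ratios + [200]

mean_before = statistics.mean(pe_ratios)
median_before = statistics.median(pe_ratios)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
# prints:
# mean:   23.27 -> 34.31
# median: 23 -> 25.5
```

A single extreme value shifts the mean by about 11 but the median by only 2.5, which is why the median is preferred for skewed data or data with outliers.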
Relationship between Mean, Median
and Mode
• If the data are unimodal and symmetric, the mean,
median and mode coincide.
• If the data are unimodal and positively skewed, the
mean is greater than the median, which is greater
than the mode.
• If the data are unimodal and negatively skewed, the
mean is less than the median, which is less than the
mode.
Measures of Dispersion
• In addition to knowing the central location of the data
values, it is important to know how the values vary
about this point.
• We are now going to look at measures of dispersion,
also referred to as
- measures of spread
- measures of variability
• We will look at three measures of dispersion:
range, standard deviation and coefficient of variation
The Range
• The range is the difference between the largest and
smallest observations in a data set.
• The range measures the total spread of the data set.
• Although the range is a simple measure of variability,
it does not take into account how the data are
distributed between the smallest and largest values.
• Hence the range is seldom used as the only
measure.
Example 1 revisited
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Calculate the range.
range = 37 – 4 = 33
Variance and Standard Deviation
• The variance and the standard deviation are the two
most widely accepted measures of dispersion.
• The standard deviation is the square root of the
variance.
• Both measures take into account how far each data
value is away from the mean.
Population Variance
• The variance of a population of N measurements
$x_1, x_2, \dots, x_N$ having mean $\mu$ is defined as
$$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Sample Variance
• The variance of a sample of n measurements
$x_1, x_2, \dots, x_n$ having mean $\bar{x}$ is defined as
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
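To make the difference between the two variance definitions concrete, here is a minimal Python sketch that follows them term by term. The only difference is the divisor: N for a population, n - 1 for a sample. The function names are illustrative, and the sample version assumes at least two observations.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Price-earnings ratios from Example 1.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

print(f"treated as a population: {population_variance(pe_ratios):.2f}")
print(f"treated as a sample:     {sample_variance(pe_ratios):.2f}")
# prints:
# treated as a population: 85.40
# treated as a sample:     91.50
```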
Standard Deviation
• Calculating the variance involves squaring the
original measurements and hence the unit attached
to the variance is the square of the unit attached to
the original measurements.
• Taking the square root of the variance gives us a
measure of variability that is in the same units as the
data.
• This measure is the standard deviation.
Population Standard Deviation
• The standard deviation of a population of N
measurements $x_1, x_2, \dots, x_N$ having mean $\mu$ is defined
as
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample Standard Deviation
• The standard deviation of a sample of n
measurements $x_1, x_2, \dots, x_n$ having mean $\bar{x}$ is defined
as
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Calculating the Standard Deviation
and Variance
• As with the mean, you are expected to calculate the
standard deviation and variance using the statistics
functions on your calculator.
• You are not to use the formulae; they have been
provided to help you understand what the standard
deviation and variance are.
• Note that the population standard deviation and
sample standard deviation are calculated using
different buttons on your calculator.
Important Points about the Standard
Deviation
• The standard deviation cannot be negative.
• The standard deviation is zero if, and only if, all of the
observations have the same value.
• Like the mean, the standard deviation is not resistant.
Strong skewness or a few outliers can greatly
increase the standard deviation.
Example 1 revisited
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Calculate the standard deviation and the variance.
$s = 9.57$ (2 d.p.)
$s^2 = 91.50$ (2 d.p.)
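The values above can be reproduced with Python's standard `statistics` module. Just as a calculator has different buttons for the two standard deviations, the module has separate functions: `stdev` and `variance` for a sample, `pstdev` and `pvariance` for a population. The variable names are illustrative.

```python
import statistics

# Price-earnings ratios from Example 1, treated as a sample.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

s = statistics.stdev(pe_ratios)       # sample standard deviation (divisor n - 1)
s2 = statistics.variance(pe_ratios)   # sample variance
sigma = statistics.pstdev(pe_ratios)  # population standard deviation (divisor N)

print(f"s = {s:.2f}, s^2 = {s2:.2f}")
# prints: s = 9.57, s^2 = 91.50
print(f"population version for comparison: {sigma:.2f}")
```

The population version is always a little smaller than the sample version for the same data, because it divides by N rather than n - 1.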
Coefficient of Variation
• In some situations we may be interested in a
measure of variability that indicates how large the
standard deviation is in relation to the mean.
• This measure is called the coefficient of variation
(CV) and is calculated by dividing the standard
deviation of a data set by the mean.
• The CV allows us to compare the variability of two
data sets having different units of measurement.
• A standard deviation of 1mm would be considered
very large for the measured thickness of CDs on a
production line.
• However a standard deviation of 1mm would be
considered small for the height of a telephone pole.
• When the means for data sets differ greatly we do not
get an accurate picture of the relative variability in the
two data sets by comparing the standard deviations.
Calculating the Coefficient of
Variation
• The sample coefficient of variation is calculated by
$$cv = \frac{s}{\bar{x}}$$
• The population coefficient of variation is calculated by
$$CV = \frac{\sigma}{\mu}$$
Example 1 revisited
The following data are the price-earnings ratios for a
set of stocks whose prices are quoted by NASDAQ
4 20 16 28 31
10 23 37 29 15
33 21 18 35 29
Calculate the coefficient of variation.
$$cv = \frac{9.57}{23.27} = 0.41 \text{ (2 d.p.)}$$
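The same ratio can be computed in Python, shown here as a short sketch using the standard `statistics` module and illustrative variable names.

```python
import statistics

# Price-earnings ratios from Example 1, treated as a sample.
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

# Sample coefficient of variation: standard deviation relative to the mean.
cv = statistics.stdev(pe_ratios) / statistics.mean(pe_ratios)

print(f"cv = {cv:.2f}")  # prints: cv = 0.41
```

Because the standard deviation and the mean share the same units, the ratio is unit-free, which is what allows comparisons across data sets measured in different units.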
Example 2 revisited
A survey of television-viewing habits among
university students provided the following data on
viewing time in hours per week:
14 9 12 18 15 10 4 20
26 17 15 6 16 15 8 5
Calculate the range, standard deviation, variance and
coefficient of variation.
range = 26 – 4 = 22
standard deviation: s = 5.92 (2 d.p.)
variance: $s^2$ = 35.05 (2 d.p.)
coefficient of variation: cv = 0.45 (2 d.p.)
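All four measures for Example 2 can be checked together in a few lines of Python. This sketch treats the data as a sample and uses illustrative variable names.

```python
import statistics

# Weekly television viewing times (hours) from Example 2.
viewing_hours = [14, 9, 12, 18, 15, 10, 4, 20, 26, 17, 15, 6, 16, 15, 8, 5]

data_range = max(viewing_hours) - min(viewing_hours)  # largest minus smallest
tv_s = statistics.stdev(viewing_hours)                # sample standard deviation
tv_s2 = statistics.variance(viewing_hours)            # sample variance
tv_cv = tv_s / statistics.mean(viewing_hours)         # coefficient of variation

print(f"range = {data_range}")
print(f"s = {tv_s:.2f}, s^2 = {tv_s2:.2f}, cv = {tv_cv:.2f}")
# prints:
# range = 22
# s = 5.92, s^2 = 35.05, cv = 0.45
```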
Interpreting the Standard Deviation
• The standard deviation, as a measure of average
deviation around the mean, helps you understand
how the observations are distributed above and
below the mean.
• A data set with a large standard deviation has much
dispersion with values widely scattered around its
mean.
• A data set with a small standard deviation has little
dispersion with the values tightly clustered about the
mean.
Chebyshev’s Theorem
• More than a century ago, the Russian mathematician
Pafnuty Chebyshev found that, regardless of how a
data set is distributed, the proportion of observations
that lie within k standard deviations of the mean
is at least $1 - 1/k^2$, for any k > 1.
• This is known as Chebyshev’s theorem.
Regardless of the shape of the distribution,
Chebyshev’s theorem states:
• At least 75% of the observations must lie within 2
standard deviations of the mean
• At least 88.9% (about 89%) of the observations must
lie within 3 standard deviations of the mean
• At least 93.75% (about 94%) of the observations must
lie within 4 standard deviations of the mean
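The three percentages above come straight from the bound 1 - 1/k². A short Python sketch, with an illustrative function name, shows the arithmetic:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2%}")
# prints:
# k = 2: at least 75.00%
# k = 3: at least 88.89%
# k = 4: at least 93.75%
```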
Example 3.11 from text (pg 86)
The durations (in minutes) of a sample of 30
long-distance telephone calls placed by a firm in
Melbourne in a given week are given in Table 3.2 on
page 86 of the text.
The 30 telephone-call durations have a mean of
10.26 and a standard deviation of 4.29.
Chebyshev’s theorem states that at least 75% of the
call durations lie within 2 standard deviations of the
mean.
$$\bar{x} - 2s = 10.26 - 2(4.29) = 1.68$$
$$\bar{x} + 2s = 10.26 + 2(4.29) = 18.84$$
When we look at the data we find that all but the
largest of the 30 durations fall within this interval.
That is, the interval actually contains 96.7% of the
call durations.
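Table 3.2 is not reproduced in these notes, so the same kind of check is sketched below on the Example 1 price-earnings data instead. The code counts how many observations fall within 2 sample standard deviations of the mean and compares that proportion with Chebyshev's 75% lower bound. The variable names are illustrative.

```python
import statistics

# Price-earnings ratios from Example 1 (used because Table 3.2 is not available here).
pe_ratios = [4, 20, 16, 28, 31, 10, 23, 37, 29, 15, 33, 21, 18, 35, 29]

x_bar = statistics.mean(pe_ratios)
s = statistics.stdev(pe_ratios)

k = 2
lower, upper = x_bar - k * s, x_bar + k * s
inside = [x for x in pe_ratios if lower <= x <= upper]
proportion = len(inside) / len(pe_ratios)

print(f"interval: ({lower:.2f}, {upper:.2f})")
print(f"{len(inside)} of {len(pe_ratios)} observations inside ({proportion:.1%})")
print(f"Chebyshev lower bound: {1 - 1 / k**2:.0%}")
```

Only the smallest value, 4, falls just outside the interval, so 14 of the 15 observations are inside, comfortably above the guaranteed minimum.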
Empirical Rule
• A more exact rule applies if the distribution of the
data is bell-shaped.
• The empirical rule has evolved from empirical studies
that have produced samples possessing bell-shaped
distributions.
The empirical rule states that for data with a bell-shaped distribution:
• About 68% of all observations lie within 1 standard
deviation of the mean
• About 95% of all observations lie within 2 standard
deviations of the mean
• Almost all (about 99.7%) of the observations lie
within 3 standard deviations of the mean
Example 3.12 from text (pg 87)
The data in the sample of telephone-call durations in
Table 3.2 have a mean of 10.26, a standard deviation
of 4.29, and the durations have an approximately
bell-shaped distribution (see Figure 3.5).
According to the empirical rule, approximately 68% of
the observations should lie in the interval
$$(\bar{x} - s, \bar{x} + s) = (10.26 - 4.29, 10.26 + 4.29) = (5.97, 14.55)$$
If we look at the data we see that 21 out of the 30
durations are contained in this interval, i.e. 70%.
This is very close to the empirical rule’s
approximation.
According to the empirical rule, approximately 95% of
the observations should lie in the interval
$$(\bar{x} - 2s, \bar{x} + 2s) = (10.26 - 2(4.29), 10.26 + 2(4.29)) = (1.68, 18.84)$$
If we look at the data we see that 29 out of the 30
durations are contained in this interval, i.e. 96.7%.
This is very close to the empirical rule’s
approximation.
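The interval arithmetic in this example can be sketched in Python from the summary figures alone. The raw data in Table 3.2 are not reproduced here, so the counts of 21 and 29 observations are taken from the example as reported rather than recomputed; the variable names are illustrative.

```python
# Summary figures quoted in the example.
x_bar, s, n = 10.26, 4.29, 30

# Counts of durations inside each interval, as reported in the example.
reported_counts = {1: 21, 2: 29}
# Approximate proportions given by the empirical rule.
rule_of_thumb = {1: 0.68, 2: 0.95}

intervals = {}
for k in (1, 2):
    intervals[k] = (round(x_bar - k * s, 2), round(x_bar + k * s, 2))
    observed = reported_counts[k] / n
    print(f"k = {k}: interval {intervals[k]}, "
          f"observed {observed:.1%}, rule says about {rule_of_thumb[k]:.0%}")
```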
Reading for next lecture
• Chapter 3 Sections 3.5 - 3.6
Exercises
• 3.7
• 3.20
• 3.25a
• 3.31