Uploaded by George Toma

2.1 Descriptive statistics Note1

Numerical Descriptive
Measures
Definitions
The central tendency
locates the central value in a data set.
The variation measures how close to the centre
or how dispersed (scattered) the observations are
from the centre.
The shape is the pattern of the distribution of
values from the lowest value to the highest value.
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode
Dispersion: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
Calculating the Mean, Median and Mode
Purpose: To determine the “centre” of the data values.
The Mean
The mean is also known as the average.
Calculating the Sample Mean from raw data
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where
$\bar{x}$ (pronounced x-bar) is the sample mean,
$x_i$ is the ith observation (the values taken by x),
$n$ is the sample size, i.e. the number of observations.
Example 1
The number of work days lost due to illness in a
business per week, over a 10-week period, is given below:
36, 28, 33, 29, 28, 32, 33, 33, 34, 32
Calculate the mean number of days lost per week
during this period.
Sample mean,
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{36 + 28 + 33 + \dots + 32}{10} = \frac{318}{10} = 31.8$$
Exercise 1
The following are the ages (in years) of all eight
employees of a small company
53, 32, 61, 27, 39, 44, 49, 57
Find the mean age of these employees.
Answer: 45.25 years
Properties of the Sample Mean

Uniqueness ‐‐ For a given set of data there is one
and only one mean.

Affected (distorted) by extreme values (outliers)
[Figure: two number lines from 0 to 10]
Observations 1, 2, 3, 4, 5: Mean = $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Observations 1, 2, 3, 4, 10: Mean = $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$
Properties of the Sample Mean

The mean may be better replaced by the median when
the distribution of the data is ‘skewed’.

An important property of the mean is that it
includes every value in your data set as part of
the calculation.
The Median
The median is the value of the middle observation
in a dataset.
Calculating the Median
from raw data
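Step 1: First, arrange the observations in ascending
order.
Step 2: Then, find the middle position using the
following formula:
$$\text{Median position} = \frac{n + 1}{2}$$
If n is odd, this position is a whole number. If n is even, it falls
half way between two ranks, and the median is taken as the average
of the two middle observations.
Step 3: The median value is the value in the median position.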
Example 1
Find the median for the following data set.
27 38 12 34 42 40 24 40 23

The ordered set becomes

Rank         1   2   3   4   5   6   7   8   9
Observation  12  23  24  27  34  38  40  40  42

The median position is $\frac{9 + 1}{2} = 5$, i.e. the 5th rank (observation).

Therefore the median = 34
Exercise 1
Sambiri Silicon manufactures computer monitors.
The following data are the numbers of computer
monitors produced at the company for a sample of
10 days. Find the median.
24 31 27 25 35 33 26 40 25 28
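The median steps can be sketched in Python as follows. The function name `median_from_raw` is made up for illustration. For an even number of observations the sketch averages the two middle values, which is the usual convention; it is applied here to the data of Example 1 and of this exercise.

```python
def median_from_raw(values):
    """Median of raw data: sort, then read off the middle position (n + 1) / 2."""
    ordered = sorted(values)              # Step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole rank (0-based index mid).
        return ordered[mid]
    # Even n: average the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


example_1 = [27, 38, 12, 34, 42, 40, 24, 40, 23]        # n = 9
monitors = [24, 31, 27, 25, 35, 33, 26, 40, 25, 28]     # n = 10

print(median_from_raw(example_1))
print(median_from_raw(monitors))
```

With n = 10 the median position is 5.5, so the result lies between the 5th and 6th ordered values.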
Properties of the Median



In an ordered array, the median is the “middle”
number (50% above, 50% below)
Uniqueness -- There is only one median for each
set of data.
Not affected by extreme values
[Figure: two number lines from 0 to 10; Median = 3 in both]
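The contrast between the mean and the median can be illustrated with a short Python sketch. It assumes the median figure uses the same two datasets as the earlier mean illustration (1, 2, 3, 4, 5 and 1, 2, 3, 4, 10); the text does not confirm this.

```python
from statistics import mean, median

# Assumed datasets: the same ones used to illustrate the mean.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

for data in (without_outlier, with_outlier):
    # The mean shifts when 5 is replaced by 10; the median stays put.
    print(data, "mean =", mean(data), "median =", median(data))
```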
The Mode
The mode is the most frequently occurring value
in a dataset.
Calculating the Mode
from raw data
Step 1: First, arrange the observations in ascending
order
Step 2: The mode is the most frequently occurring
value in the dataset.
Example 1

Find the mode for the data below
 7.00 11.00 14.25 15.00 15.00 15.50
19.00 19.00 19.00 19.00 21.00 22.00
23.00 24.00 25.00 27.00 27.00 28.00
34.22 43.25
The mode is 19.00 because it occurs most often,
i.e. four (4) times.
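One way to find the mode programmatically is to count how often each value occurs. The sketch below uses Python's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

values = [7.00, 11.00, 14.25, 15.00, 15.00, 15.50,
          19.00, 19.00, 19.00, 19.00, 21.00, 22.00,
          23.00, 24.00, 25.00, 27.00, 27.00, 28.00,
          34.22, 43.25]

counts = Counter(values)              # value -> number of occurrences
highest = max(counts.values())        # largest frequency in the data
# Keep every value that reaches the largest frequency (there may be ties).
modes = [value for value, count in counts.items() if count == highest]

print(modes, highest)
```

Because `modes` is a list, the same code also reports several modes when there are ties. If every value occurs exactly once it returns all of them, which corresponds to the "no mode" case.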
Properties of the Mode

Normally, the mode is used for categorical
data where we wish to know which is the
most common category

Not affected by extreme values

The mode is not unique
[Figure: two number lines; a dataset on a 0 to 14 scale with Mode = 9, and a dataset on a 0 to 6 scale with No Mode]
Properties of the Mode

There can be one mode

There can be several modes

With several modes, it is unclear which one best
describes the central tendency of the data.

This is particularly problematic with continuous data,
where it is likely that no single value occurs more
often than the others.
Properties of the Mode

For example, consider measuring the weights of 30
people (to the nearest 0.1 kg). How likely is it
that we will find two or more people with
exactly the same weight (e.g., 67.4 kg)? The
answer is that it is probably very unlikely: many
people might be close, but with such a small
sample (30 people) and a large range of
possible weights, you are unlikely to find two
people with exactly the same weight, that is,
to the nearest 0.1 kg. This is why the mode is
very rarely used with continuous data.
Question
When re‐ordering, the most common hat or
jeans size is what you would like to know, not
the average hat or jeans size.
The Shape: Skewness
The shape is the pattern of the distribution of
values from the lowest value to the highest value.
[Figures: a symmetric histogram and two skewed histograms]
Measures of skewness

Pearson’s coefficient

Bowley’s coefficient (Galton’s coefficient)
Basic Business Statistics, 11e © 2009
Prentice-Hall, Inc..
Measures of Central Tendency: Summary
Central Tendency
Sample Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Median: middle value in the ordered array
Mode: most frequently observed value
Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \dots \times X_n)^{1/n}$, used for the rate of change of a variable over time
Measures of Dispersion
Which dataset has the larger variation?
[Figure: Dataset 1 and Dataset 2]

[Figure: Population 1 and Population 2; same centre, different variation]
Population 1: narrow range, smaller variation, smaller deviation, observations clustered
Population 2: wide range, larger variation, larger deviation, observations spread out
Measures of Dispersion
The measures of central tendency, the mean, median
and mode, do not reveal the whole picture of the
distribution of the dataset.
Two datasets with the same mean may have
completely different spreads.
The amount or degree of spread is known as variation.
Measures of Dispersion
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
Measures of variation give information on the spread
(variability, dispersion) of the data values.
[Figure: two distributions with the same centre but different variation]
Measures of Dispersion:
The Range
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$
Example:
[Figure: number line from 0 to 14 with observations running from 1 to 13]
Range = 13 - 1 = 12
Measures of Dispersion:
Why The Range Can Be Misleading

Ignores the way in which data are distributed
[Figure: two datasets plotted on a scale from 7 to 12, with different distributions]
Range = 12 - 7 = 5 for both datasets
Measures of Dispersion:
Why The Range Can Be Misleading

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
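The effect of a single outlier on the range can be reproduced with a short Python sketch. The helper name `value_range` is made up for illustration.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# The two datasets differ only in their last observation (5 versus 120).
data_a = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
data_b = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(data_a))
print(value_range(data_b))
```

Changing one of the 25 observations moves the range from 4 to 119, although the other 24 values are identical.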
The Sample Variance
Variance is used to measure the dispersion of
values relative to the mean.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Where
$\bar{x}$ = arithmetic mean
$n$ = sample size
$x_i$ = ith observation of the variable X
The Sample Standard Deviation
Most commonly used measure of variation
Tells us how much observations in our sample
differ from the mean value within our sample.
Has the same units as the original data, making
it easier to interpret.
$$s = \sqrt{s^2}$$
Example
For this sample data Xi:
2, 3, 5, 1, 4, 3, 2, 4, find the
1. Sample variance
2. Sample standard deviation
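A possible way to work this example is sketched below in Python. It evaluates both forms of the sample variance formula (the definitional form and the computational shortcut) so they can be compared; the variable names are illustrative.

```python
import math

x = [2, 3, 5, 1, 4, 3, 2, 4]
n = len(x)
x_bar = sum(x) / n                     # arithmetic mean

# Definitional form: sum of squared deviations from the mean, over n - 1.
s2_def = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Computational form: (sum of squares - n * mean squared), over n - 1.
s2_short = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

s = math.sqrt(s2_def)                  # sample standard deviation

print(x_bar, round(s2_def, 4), round(s2_short, 4), round(s, 4))
```

Here the mean is 3 and the squared deviations sum to 12, so the sample variance is 12/7 (about 1.71) and the standard deviation is about 1.31.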
The variation or dispersion in a set of values refers to
how spread out the values are from each other.
• The variation is small when the values are close together.
• There is no variation if the values are the same.
Smaller variation
Larger variation
The Coefficient of Variation
The variance and the standard deviation are useful
as measures of variation of the values of a single
variable for a single population (or sample).
If we want to compare the variation of two
variables we cannot use the variance or the
standard deviation because:
1. The variables might have different means.
2. The variables might have different units.
The Coefficient of Variation

Measures variation relative to the mean

Expressed as a percentage (%)
$$CV = \left(\frac{s}{\bar{x}}\right) \times 100\%$$
The Coefficient of Variation
The coefficient of variation compares the
variability of two different datasets even if they
have different units of measurement.
Example 1
Spot, the dog, weighs 65 pounds. Spot’s weight
fluctuates 5 pounds depending on Spot’s
exercise level.
Sea Biscuit, the horse, weighs 1200 pounds.
Sea Biscuit’s weight fluctuates 125 pounds
depending on the number of rides Sea
Biscuit goes on.
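The comparison can be sketched in Python. The sketch assumes that each quoted "fluctuation" plays the role of the standard deviation and each weight the role of the mean; the example does not state this explicitly. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (s / x-bar) * 100, returned as a percentage."""
    return std_dev / mean * 100


# Assumption: fluctuation is treated as the standard deviation.
cv_spot = coefficient_of_variation(5, 65)             # dog, pounds
cv_sea_biscuit = coefficient_of_variation(125, 1200)  # horse, pounds

print(f"Spot: {cv_spot:.2f}%  Sea Biscuit: {cv_sea_biscuit:.2f}%")
```

Under that assumption Sea Biscuit's weight varies slightly more relative to its mean (about 10.4%) than Spot's does (about 7.7%).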
Coefficient of Variation
Some financial investors use the
coefficient of variation as a measure of
risk.
What does the Coefficient of
Variation tell us about the risk of a
stock that the standard deviation
does not?
Relative to the amount invested in a
stock, the coefficient of variation reveals
the risk of a stock in terms of the size of
the standard deviation relative to the
size of the mean (in percentage).
Example 2
Relative to the amount of money invested in the
stock, which stock, A or B, is riskier?

                     Stock A   Stock B
Average price        $50       $100
Standard deviation   $5        $5
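Comparing Coefficients of Variation
$$CV_A = \left(\frac{s}{\bar{x}}\right) \times 100\% = \frac{5}{50} \times 100\% = 10\%$$
$$CV_B = \left(\frac{s}{\bar{x}}\right) \times 100\% = \frac{5}{100} \times 100\% = 5\%$$
Comparing the coefficients of variation, the relative variation is
twice as high in stock A as in stock B, so stock A is the riskier
stock relative to the amount invested.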
Example 3
The yearly salaries of all employees who work
for a company have a mean of $62,350 and a
standard deviation of $6820.
The years of experience for the same
employees have a mean of 15 years and a
standard deviation of 2 years.
Is the relative variation in the salaries larger or
smaller than that in the years of experience for
these employees?
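One way to answer the question is to compute the two coefficients of variation and compare them, as in the Python sketch below. The data cover all employees, so the figures are population values; the ratio of standard deviation to mean is computed the same way. Variable names are illustrative.

```python
# Salaries (dollars) and experience (years) for the same employees.
salary_mean, salary_sd = 62350, 6820
experience_mean, experience_sd = 15, 2

# CV = (standard deviation / mean) * 100, a unit-free percentage.
cv_salary = salary_sd / salary_mean * 100
cv_experience = experience_sd / experience_mean * 100

print(f"Salaries: {cv_salary:.2f}%  Experience: {cv_experience:.2f}%")
```

The salaries have the much larger standard deviation in absolute terms, yet their relative variation (about 10.9%) is smaller than that of the years of experience (about 13.3%).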
Interpretation

A low (%) value shows low variability
implying tight clustering of observations
about the mean.

A middle to high (%) value shows high
variability implying that observations are
widely spread.
Measures of Position for
ungrouped data
(Quartile Measures)
Quartile Measures

Quartiles split the ranked data into 4 equal
segments.

| 25% | Q1 | 25% | Q2 | 25% | Q3 | 25% |

The first quartile (lower quartile), Q1: 25% of the
observations lie below it.
The second quartile (middle quartile), Q2, is the same as the
median: 50% of the observations lie below it.
The third quartile (upper quartile), Q3: 75% of the
observations lie below it.
Quartile Measures

Q1 = 25th percentile = P25

Q2 = 50th percentile = P50

Q3 = 75th percentile = P75
Locating Quartiles Positions
Step 1: First, arrange the observations in
ascending order
Step 2: Find the quartile positions using the
following formulas.
$$Q_1 \text{ position} = 0.25(n + 1)$$
$$Q_2 \text{ position} = 0.5(n + 1)$$
$$Q_3 \text{ position} = 0.75(n + 1)$$
Step 3: Determine the quartile values.
The Interquartile Range (IQR)
Remember that the range can be distorted by
outliers.
The IQR excludes these outliers and focuses on the
spread of the middle 50% of the data values.
The IQR is also called the 50% mid‐spread range.
$$IQR = Q_3 - Q_1$$
The Interquartile Range (IQR)
Weakness
The IQR, like the range, provides no
information on the clustering of observations
within the dataset, as it uses only two
values (Q1 and Q3) in its computation.
Example 1
Given Sample Data in Ordered Array:
11 12 13 16 16 17 18 21 22
Find
1. Q1 and Q3
2. IQR
Locating First quartile, Q1
11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the 0.25(9 + 1) = 2.5th position of the ranked
data, so use the value half way between the 2nd and 3rd values:
$$Q_1 = \frac{12 + 13}{2} = 12.5 \quad \text{or} \quad Q_1 = 12 + \frac{13 - 12}{2} = 12.5$$
Locating Third Quartile, Q3
11 12 13 16 16 17 18 21 22
(n = 9)
Q3 is in the 0.75(9 + 1) = 7.5th position of the ranked
data, so use the value half way between the 7th and 8th values:
$$Q_3 = \frac{18 + 21}{2} = 19.5 \quad \text{or} \quad Q_3 = 18 + \frac{21 - 18}{2} = 19.5$$
The Interquartile Range (IQR)
$$IQR = Q_3 - Q_1 = 19.5 - 12.5 = 7.0$$
Example 2
Given Sample Data in Ordered Array:
7 8 9 10 11 12 13 13 14 17 17 45
Find
1. Q1 and Q3
2. IQR
Locating First quartile, Q1
7 8 9 10 11 12 13 13 14 17 17 45
(n = 12) Q1 is in the 0.25(12 + 1) = 3.25th position of the ranked data.
So find the value half way between the 3rd and 4th values,
which is $\frac{9 + 10}{2} = 9.5$. Q1 lies half way between the 3rd value and this midpoint:
$$Q_1 = \frac{9 + 9.5}{2} = 9.25 \quad \text{or} \quad Q_1 = 9 + \frac{10 - 9}{4} = 9.25$$
Locating Third Quartile, Q3
7 8 9 10 11 12 13 13 14 17 17 45
(n = 12) Q3 is in the 0.75(12 + 1) = 9.75th position of the ranked data.
So find the value half way between the 9th and 10th values,
which is $\frac{14 + 17}{2} = 15.5$. Q3 lies half way between this midpoint and the 10th value:
$$Q_3 = \frac{15.5 + 17}{2} = 16.25 \quad \text{or} \quad Q_3 = 17 - \frac{17 - 14}{4} = 16.25$$
The Interquartile Range (IQR)
$$IQR = Q_3 - Q_1 = 16.25 - 9.25 = 7.0$$
End of Chapter
Grouped data

Mean

Variance

CV