Unit 3
Measures for Describing Data
3.0 Introduction
Statistics is an area of Science concerned with the extraction of information from numerical
data. Individual values, when taken together in their entirety, form a distribution or a
population. Summary statistics are ways of characterising that distribution: saying whether
the values are very similar; whether there are some exceptionally large or small values; what
a typical value is like, and so on. In this unit, we are going to discuss various statistics that are
used to describe the distributions from which data are obtained.
3.1 Objectives
By the end of this unit, you should be able to:
calculate the mean, median and mode for discrete and continuous data
discuss the advantages and disadvantages of each of the measures of central tendency
estimate quartiles and percentiles for given data sets
use a box plot to summarise a given data set
calculate the range, inter-quartile range, variance and standard deviation for given
samples of discrete and continuous data
calculate coefficient of variation for given data sets
3.2 Measures of Central Tendency
Measures of central tendency are also called measures of central location, averages or
averages of the first order. In this module, we shall use the terms measure of central tendency
or average interchangeably to mean the representative value around which all the values of
the variable cluster or concentrate. The three averages are: the arithmetic mean - commonly
known as the mean, the median and the mode.
3.2.1 The arithmetic mean
In everyday usage it is simply called "the average", as if it were the only average, probably
because it is the most commonly used one.
The mean of a small set of discrete data
The mean of a set of n measurements \(x_1, x_2, x_3, \ldots, x_n\) is equal to the sum of the measurements
divided by n. Denoting the mean by \(\bar{x}\) we have:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \] [3.1]
where \(\sum\) is the summation sign.
Example 3.1
The following data set is a record of the amount of money (in $) spent by a sample of 8
customers on groceries in a shop on a particular Saturday. The figures were rounded-off to
whole numbers.
13, 8, 21, 4, 23, 16, 11, 15
Calculate the mean amount spent by customers on groceries.
Solution 3.1
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{111}{8} = 13.875 \]
Based on the sample, a customer spent on average about $14 on groceries in the shop on that
particular Saturday.
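The calculation in Solution 3.1 can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` and the variable names are illustrative choices, not part of this unit.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by their count (formula 3.1)."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Amounts ($) spent by the 8 customers in Example 3.1
spending = [13, 8, 21, 4, 23, 16, 11, 15]
mean_spending = arithmetic_mean(spending)
print(mean_spending)  # 13.875
```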
The mean of discrete frequency data
When a data set is large with some observations appearing several times each, the mean is
found by multiplying each observed value by the corresponding frequency, adding up and
then dividing the sum by the number of observations.
The mean for discrete frequency data is obtained using formula 3.2.
\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} \] [3.2]
where \(n = \sum_{i=1}^{k} f_i\). Note that n is the total frequency, k is the number of categories of
observations and \(f_i\) is the frequency of category i.
Example 3.2
In Example 3.1, suppose there are: 5 customers who spent $13 each; 3 customers who spent
$8 each; 1 customer who spent $21; 6 customers who spent $4 each; 2 customers who spent
$23 each; 3 customers who spent $16 each; 4 customers who spent $11 each; and 7 customers
who spent $15 each. Find the mean amount spent in the shop.
Solution 3.2
This is an example of discrete frequency data. We usually record such data in the form of a
frequency table as shown below.
x      f
13     5
8      3
21     1
4      6
23     2
16     3
11     4
15     7
There are a total of 31 individual observations. We multiply each observed value by its
corresponding frequency, add up the products and then divide by 31.
\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{377}{31} = 12.1613 \]
Thus on average, the customers spent $12.16 each.
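The same weighted calculation can be sketched in Python. The dictionary layout (observed value mapped to its frequency) and the name `frequency_mean` are assumptions made for this illustration.

```python
def frequency_mean(freq_table):
    """Mean of discrete frequency data given as {value: frequency} (formula 3.2)."""
    n = sum(freq_table.values())                        # total frequency
    total = sum(x * f for x, f in freq_table.items())   # sum of f_i * x_i
    return total / n


# Frequency table from Example 3.2: amount spent ($) -> number of customers
purchases = {13: 5, 8: 3, 21: 1, 4: 6, 23: 2, 16: 3, 11: 4, 15: 7}
mean_purchase = frequency_mean(purchases)
print(round(mean_purchase, 4))  # 12.1613
```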
Activity 3.1
1. A sample of ten university students had the following weekly expenditure, in dollars.
23, 15, 18, 35, 24, 45, 35, 28, 40, 32
Calculate the mean weekly expenditure for a student.
2. There were 5 categories of cash prizes in a road-show promotion of a product. The
following frequency distribution table shows the number of people who won the various
categories of prizes.
Prize ($)    No. of winners
25           20
40           12
60           5
75           3
120          1
Calculate the mean cash prize won at the road-show.
The mean of grouped continuous data
Continuous variables such as mass, height and distance can take any value within an interval.
Large volumes of such data are usually presented in the form of grouped frequency
tables. Once data are presented in this form, some information is lost as it is no longer possible
to retrieve the original raw data. The mean can therefore only be estimated.
Suppose that data are grouped into k classes with frequencies \(f_1, f_2, \ldots, f_k\). Let the
classes have the midpoints \(x_1, x_2, \ldots, x_k\) respectively. Then the mean is estimated by:
\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \] [3.3]
where \(x_i\) is the class midpoint or class mark.
The class midpoint is a representative value for all the observations falling in the particular class. It is
obtained by adding the lower and upper class boundaries and dividing the result by 2.
Example 3.3
The following data show monthly salaries (in dollars) of 50 employees of a non-governmental organisation.
Salary ($00)         No. of employees
0 to less than 10    10
10 to less than 20   23
20 to less than 30   12
30 to less than 40   3
40 to less than 50   2
Calculate the mean monthly salary of the employees.
Solution 3.3
Salary ($00)   Number of employees, \(f_i\)   Midpoint, \(x_i\)   \(f_i x_i\)
0 - 10         10                             5                   50
10 - 20        23                             15                  345
20 - 30        12                             25                  300
30 - 40        3                              35                  105
40 - 50        2                              45                  90
Total          \(\sum f_i = 50\)                                  \(\sum f_i x_i = 890\)
\[ \text{Mean, } \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{890}{50} = 17.8 \]
The mean monthly salary is $1 780.00.
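A short Python sketch of the grouped-data estimate is given below. It assumes each class is supplied as a (lower boundary, upper boundary, frequency) triple; that layout and the name `grouped_mean` are illustrative only.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) triples (formula 3.3)."""
    n = sum(f for _, _, f in classes)
    # Each class contributes its frequency times its midpoint (lower + upper) / 2.
    weighted = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return weighted / n


# Salary classes ($00) from Example 3.3
salaries = [(0, 10, 10), (10, 20, 23), (20, 30, 12), (30, 40, 3), (40, 50, 2)]
mean_salary = grouped_mean(salaries)
print(mean_salary)  # 17.8, that is $1 780 because salaries are in $00
```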
Example 3.4
An organisation recorded monthly medical expenses incurred by families of 30 randomly
selected employees.
Amount ($00)   Number of employees
1 - 10         3
11 - 20        7
21 - 30        11
31 - 40        5
41 - 50        4
Calculate the mean monthly expenditure per family.
Solution 3.4
Note that there are gaps between the classes. Amounts spent may take values
between, say, 10 and 11, so a continuity correction is needed to turn the stated class
limits into real limits (class boundaries). The gaps are one unit each, so we obtain the boundaries by
adding 0.5 to each upper class limit and subtracting 0.5 from each lower class limit.
Amount ($00)   Frequency, \(f_i\)    Midpoint, \(x_i\)   \(f_i x_i\)
0.5 - 10.5     3                     5.5                 16.5
10.5 - 20.5    7                     15.5                108.5
20.5 - 30.5    11                    25.5                280.5
30.5 - 40.5    5                     35.5                177.5
40.5 - 50.5    4                     45.5                182
Total          \(\sum f_i = 30\)                         \(\sum f_i x_i = 765\)
\[ \text{Mean, } \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{765}{30} = 25.5 \]
The mean monthly medical expense per family was $2 550.00.
Activity 3.2
1. The data show the mass, in kg, of a sample of people who had applied to train as horse-riders.
Mass (kg)   Number of applicants
15 - 20     9
21 - 25     5
26 - 30     3
31 - 35     4
36 - 40     2
Calculate the sample mean.
2. The heights of the applicants were recorded as shown in the table below.
Height (cm)   Number of applicants
130 - 135     7
136 - 140     9
141 - 145     4
146 - 160     1
Calculate the mean height of the applicants.
3.2.2 The median
The median is a positional average. It is the value such that half the observations in the data
set are larger than it and half are smaller than it. The median is the central value after the
observations are ranked according to size.
Median for small set of discrete data
Let the ordered values of a data set be \(y_1, y_2, y_3, \ldots, y_n\) where n is the number of observations.
The median is given by
\[ y_{\frac{n+1}{2}} \quad \text{if n is odd} \] [3.4]
\[ \frac{1}{2}\left(y_{\frac{n}{2}} + y_{\frac{n+2}{2}}\right) \quad \text{if n is even.} \] [3.5]
Example 3.5
The data set shows scores for university students who wrote a management course.
36, 67, 41, 52, 73, 61, 58, 76, 33, 48, 68
Find the median score.
Solution 3.5
Rearranging in order of size
y: 33 36 41 48 52 58 61 67 68 73 76
Since n = 11 is odd, the median corresponds to
\[ y_{\frac{n+1}{2}} = y_{\frac{11+1}{2}} = y_{\frac{12}{2}} = y_6 = 58 \]
Example 3.6
Suppose in Example 3.5, the student who obtained a score of 68 was disqualified for some
reason. Find the median of the remaining scores.
Solution 3.6
Since n = 10 is even, the median is given by
\[ \frac{1}{2}\left(y_{\frac{n}{2}} + y_{\frac{n+2}{2}}\right) = \frac{1}{2}(y_5 + y_6) = \frac{1}{2}(52 + 58) = 55 \]
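The odd and even cases of formulas 3.4 and 3.5 can be sketched in Python as follows. The function name `median` is an illustrative choice; Python lists are indexed from 0, so the 1-based ranks in the formulas are shifted down by one.

```python
def median(values):
    """Median of a small data set, following formulas 3.4 and 3.5."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # y_{(n+1)/2} in 1-based terms
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of y_{n/2} and y_{(n+2)/2}


scores = [36, 67, 41, 52, 73, 61, 58, 76, 33, 48, 68]   # Example 3.5
print(median(scores))                                   # 58

remaining = [s for s in scores if s != 68]              # Example 3.6
print(median(remaining))                                # 55.0
```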
Activity 3.3
1. The PXP bank issued loans to farmers towards the rainy season. The loans, in thousand
dollars, are listed below.
7.3, 5.4, 14.2, 9.1, 15.0, 8.6, 24.5, 3.7, 6.3, 16.4, 12.5, 18.2
Find the median of the data set.
2. An omnibus operator expects the bus crew to cash in $100 every day. Due to
various reasons, the crew may sometimes fail to meet the target. On 11 randomly selected
days, the following amounts were cashed in:
84, 93, 75, 88, 55, 69, 96, 100, 74, 80, 58
Find the median for the cash remittances.
Median for discrete frequency data
We shall proceed by giving an example to demonstrate how to locate the median of discrete
frequency data. This approach is used in order to avoid a cumbersome task of having to list a
large number of values in the data set.
Example 3.7
In a survey to assess attitude to a new product, a random sample of 41 potential customers
was obtained. A 60-point rating scale was used to measure the potential to become loyal to
the product. The frequency table shows the distribution of scores that were obtained in the
survey.
x      Frequency, f
18     4
24     7
27     5
35     12
42     7
53     6
Find the median score.
Solution 3.7
\[ \text{Rank for median} = \frac{n+1}{2} = \frac{41+1}{2} = 21 \]
The median has a rank of 21, i.e. it is the 21st in the set of ascending values.
We then construct a cumulative frequency table to help us identify the median value.
x      Cumulative frequency
18     4
24     11
27     16
35     28
42     35
53     41
We note from the table that there are 16 values that are below or equal to 27, that is, the 16th
value is 27. Similarly, the 28th value is 35. A list of values from the 16th to the 28th will
include the 21st. The list will be a 27 followed by a chain of 35s up to the 28th value. Thus the
21st value is 35, the median.
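The cumulative-frequency search can be sketched in Python. The helper names `value_at_rank` and `frequency_median` are hypothetical, and the sketch assumes the observed values are supplied in ascending order alongside their frequencies.

```python
from itertools import accumulate


def value_at_rank(values, freqs, rank):
    """Return the observation holding a given 1-based rank in frequency data."""
    for x, cumulative in zip(values, accumulate(freqs)):
        if cumulative >= rank:      # first value whose cumulative frequency reaches the rank
            return x
    raise ValueError("rank exceeds the total frequency")


def frequency_median(values, freqs):
    """Median of discrete frequency data (values must be in ascending order)."""
    n = sum(freqs)
    lower_rank = (n + 1) // 2       # both ranks coincide when n is odd
    upper_rank = (n + 2) // 2
    return (value_at_rank(values, freqs, lower_rank)
            + value_at_rank(values, freqs, upper_rank)) / 2


# Rating scores and frequencies from Example 3.7
ratings = [18, 24, 27, 35, 42, 53]
counts = [4, 7, 5, 12, 7, 6]
print(list(accumulate(counts)))            # [4, 11, 16, 28, 35, 41]
print(frequency_median(ratings, counts))   # 35.0
```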
Activity 3.4
During the first quarter of the year EST Department Stores conducted promotions for
various products in which prizes worth various amounts of dollars were won. The frequency table
shows the distribution of prizes that were won.
Prize worth   No. of prizes
20            83
35            26
60            48
75            52
120           15
Find the median value of the prizes.
Median for grouped continuous data
To find the median you start by identifying the median class. The median class is the class
that contains the \((n/2)\)th observation, where n is the total frequency. The class containing the
\((n/2)\)th observation can easily be identified using the "less than" cumulative frequencies of the
data.
The median is given by
\[ \text{Median} = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} \] [3.6]
where \(L_m\) = lower class boundary of the median class, \(f_m\) = frequency of the median class, \(F_{m-1}\)
= cumulative frequency up to (but excluding) the median class, \(C_m\) = width of the median
class, n = total frequency and m = subscript used to denote the median class.
Example 3.8
Calculate the median of data in Example 3.4.
Solution 3.8
Class interval   Class boundaries   Frequency, \(f_i\)   Cumulative frequency, F
1 - 10           0.5 - 10.5         3                    3
11 - 20          10.5 - 20.5        7                    10
21 - 30          20.5 - 30.5        11                   21
31 - 40          30.5 - 40.5        5                    26
41 - 50          40.5 - 50.5        4                    30
The median class contains the \((30/2)\)th observation, that is, the 15th observation. This is
contained in the class 20.5 - 30.5. Therefore, \(L_m = 20.5\), \(C_m = 10\), \(f_m = 11\) and \(F_{m-1} = 10\).
You now substitute these values into the formula.
\[ \text{Median} = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} = 20.5 + \frac{10(15 - 10)}{11} = 20.5 + \frac{50}{11} = 25.04545455 \]
Since the amounts are in $00, the median monthly medical expense is approximately $2 504.55.
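The interpolation in formula 3.6 can be sketched in Python. The sketch assumes classes are given as (lower boundary, upper boundary, frequency) triples in ascending order; the name `grouped_median` is illustrative.

```python
def grouped_median(classes):
    """Estimate the median of grouped data by interpolation (formula 3.6)."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("no observations")
    target = n / 2          # rank of the median for continuous data
    cumulative = 0          # F_{m-1}: total frequency below the current class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            width = upper - lower                       # C_m
            return lower + width * (target - cumulative) / f
        cumulative += f
    raise ValueError("median class not found")


# Class boundaries and frequencies from Example 3.4 (amounts in $00)
medical = [(0.5, 10.5, 3), (10.5, 20.5, 7), (20.5, 30.5, 11),
           (30.5, 40.5, 5), (40.5, 50.5, 4)]
print(round(grouped_median(medical), 4))  # 25.0455
```

Unequal class widths need no special handling here, because the width is taken from the median class itself.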
Example 3.9
The table below shows the distribution of the sales ($) made by vendors at Mbare Musika one
particular morning.
Sales, x     Frequency, f
40 - 60      8
61 - 80      5
81 - 100     15
101 - 150    9
151 - 200    13
201 - 250    6
Estimate the median of the sales.
Solution 3.9
Note that there are some gaps in between the classes. For instance, the lowest class ends at 60
and the next class starts from 61. Sales may assume values between 60 and 61, so there is
need to do some continuity correction to the class boundaries to obtain the real limits.
Sales ($)       Frequency, f   Cumulative frequency, F
39.5 - 60.5     8              8
60.5 - 80.5     5              13
80.5 - 100.5    15             28
100.5 - 150.5   9              37
150.5 - 200.5   13             50
200.5 - 250.5   6              56
The median has a rank of \(56/2 = 28\), that is, it is the 28th value of the ordered data set. This is in the
class 80.5 - 100.5, which is therefore the median class. The median is then interpolated using
the formula:
\[ \text{Median} = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} = 80.5 + \frac{20(28 - 13)}{15} = 80.5 + \frac{300}{15} = 100.5 \]
The median is $100.50.
Activity 3.5
1. The frequency table shows the distribution of bank deposits, in thousand dollars, made by
companies over the month of February.
Deposits     No. of banks
60 - 80      5
80 - 100     12
100 - 200    5
200 - 250    15
250 - 400    8
Estimate the median for the company deposits.
2. The distribution of investors by value of shares (thousand dollars) in Earthly Limited
company is shown in the frequency table.
Value         Investors
2.5 - 4.5     12
5.0 - 7.5     9
8.0 - 12.5    24
13.0 - 18.5   8
19.0 - 25.5   5
Estimate the median share value.
3.2.3 The mode
The mode of a data set is the observation that occurs most often. Because it identifies the most
popular item, the mode is often used in business.
Example 3.10
A cross-border trader is deciding which shoes to order for resale. She will be guided by shoe sales
recorded by a colleague in the same business in order to determine what proportion to order
of each size. The sales (shoe sizes) recorded by the colleague on her last visit were:
4, 7, 8, 8, 9, 3, 8, 8, 7, 9, 5, 6, 7, 5, 8
Determine the modal shoe size.
Solution 3.10
The shoe size which appears most often is 8. This is the shoe size with the highest demand, and the
cross-border trader should order more size 8 shoes.
Mode for discrete frequency data
Example 3.11
Suppose in Example 3.10 the sales for the last 30 visits were as follows:
Size
3
4
5
6
7
8
9
Frequency 11
7
16
10
23
47
7
10
1
What was the modal shoe size?
Solution 3.11
Size 8 has the highest frequency of 47, hence it is the mode. This shoe size must constitute
the biggest proportion of the new order.
Activity 3.6
Mukwe Lodge recorded the following number of bookings per week for accommodation in
the first quarter of the year.
15, 23, 9, 15, 25, 17, 18, 12, 15, 26, 13, 21
Find the modal number of bookings.
The mode for grouped continuous data
The mode for discrete data can be found easily by inspection. However, when raw data are
put into classes, it is no longer possible to tell exactly how many times each value occurs; you can
only tell how many observations fall in each class. The class with the highest frequency
is the modal class. The actual mode lies in the modal class and can be
estimated by calculation or graphically using a histogram.
The mode is calculated using the formula:
\[ \text{Mode} = L_m + \frac{C_m(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}} \] [3.7]
where \(L_m\) = lower class boundary of the modal class, \(C_m\) = class width of the modal class, \(f_m\) = frequency of the modal class, \(f_{m-1}\) = frequency of the class one step below the modal class,
\(f_{m+1}\) = frequency of the class one step above the modal class and m = subscript used to denote the
modal class.
Example 3.12
Calculate the mode of the data in Example 3.4.
Solution 3.12
The modal class is 20.5 - 30.5.
\[ \text{Mode} = L_m + \frac{C_m(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}} = 20.5 + \frac{10(11 - 7)}{2(11) - 7 - 5} = 20.5 + \frac{40}{10} = 24.5 \]
The modal expense was $2 450.00.
Example 3.13
The frequency table shows the distribution of loans (in thousand dollars) that were issued to
small businesses by PXP bank.
Amount     Frequency
2 - 8      15
9 - 15     8
16 - 20    23
21 - 35    14
36 - 40    5
Find the mode.
Solution 3.13
We need to present the table using real limits as shown below. It is important to note that
successive classes have small gaps of 1 unit between them. To close these gaps we subtract
half the distance (0.5) from lower limits, and add the same to the upper class limits.
Amount        Frequency
1.5 - 8.5     15
8.5 - 15.5    8
15.5 - 20.5   23
20.5 - 35.5   14
35.5 - 40.5   5
The modal class is 15.5 - 20.5 since it has the highest frequency.
\[ \text{Mode} = L_m + \frac{C_m(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}} = 15.5 + \frac{5(23 - 8)}{2(23) - 8 - 14} = 15.5 + \frac{75}{24} = 18.625 \]
The mode is $18 625.00
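Formula 3.7 can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the unit: when the modal class is the first or last class, the missing neighbour is treated as having frequency 0, and when several classes tie for the highest frequency the first one is used.

```python
def grouped_mode(classes):
    """Estimate the mode of grouped data (formula 3.7).

    classes: (lower_boundary, upper_boundary, frequency) triples in ascending order.
    """
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                 # first class with the highest frequency
    lower, upper, f_m = classes[m]
    f_prev = freqs[m - 1] if m > 0 else 0       # assumption: no class below -> 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0   # assumption: no class above -> 0
    denominator = 2 * f_m - f_prev - f_next
    if denominator == 0:
        raise ValueError("mode is not defined: neighbouring classes tie with the modal class")
    return lower + (upper - lower) * (f_m - f_prev) / denominator


# Example 3.4 (medical expenses, $00) and Example 3.13 (loans, thousand dollars)
medical = [(0.5, 10.5, 3), (10.5, 20.5, 7), (20.5, 30.5, 11),
           (30.5, 40.5, 5), (40.5, 50.5, 4)]
loans = [(1.5, 8.5, 15), (8.5, 15.5, 8), (15.5, 20.5, 23),
         (20.5, 35.5, 14), (35.5, 40.5, 5)]
print(grouped_mode(medical))  # 24.5
print(grouped_mode(loans))    # 18.625
```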
Activity 3.7
The table shows monthly tobacco sales, in tons, made over the last 32 months at BW Tobacco
Auction Floors.
Sales in tons   No. of months
5.5 - 9.5       6
10 - 15.5       3
16 - 20.5       12
21 - 24.5       8
25 - 28.5       3
Estimate the mode for the monthly sales of tobacco by the company.
3.2.4 Estimating the mode using a histogram
The use of a histogram to estimate the mode requires that the bars be of uniform width. The
method is illustrated in Figure 3.1 using the data of Example 3.14.
Example 3.14
The monthly salaries earned by a sample of 20 salespersons employed in the motor insurance
industry are:
Salary ($)             Number of employees
200 to less than 300   2
300 to less than 400   4
400 to less than 500   8
500 to less than 600   5
600 to less than 700   1
Estimate the mode using a histogram.
Solution 3.14
You start by identifying the modal class. The modal class is the one with the tallest bar. You
then estimate the position of the mode within the modal class by drawing diagonals as shown
in Figure 3.1. Where the diagonals intersect you now draw a straight vertical line downwards
to meet the horizontal axis.
[Histogram: Number of employees (vertical axis, 0 to 8) against Salary ($) (horizontal axis, 200 to 700)]
Figure 3.1 Estimation of Mode
The arrow indicates the position of the mode which can be read off from the horizontal scale.
3.2.5 Choosing the appropriate average
The suitability of the mode, median or mean as an average for a given situation largely
depends on the advantages and disadvantages of the particular measure.
Advantages of the median
Using the median in describing a distribution has advantages in that it:
is easy to calculate;
eliminates the effect of extreme values;
is capable of further algebraic use in analysing other measures;
can be estimated graphically using an ogive.
Disadvantages of the median
Using the median has disadvantages in that it:
may not be representative of all the items as it ignores the extreme values;
cannot be determined precisely when it falls between two middle values;
has no use when items are weighted according to size;
requires ranking of items, which may be tedious for large data sets.
Advantages of the arithmetic mean
The mean has the following advantages:
it is easy to calculate;
it is based on all the observations;
it has further algebraic use in calculating other measures;
it is easily understood.
Disadvantages of the arithmetic mean
The mean has the following disadvantages:
it is affected by extreme values (outliers), if any, in a data set;
it does not give information on composition of the data;
it does not depict the entire picture of the data;
it does not always represent the characteristics of individual items;
it is usually not one of the observed values.
Advantages of the mode
The advantages of using the mode are that:
it is easy to find;
it is easy to understand;
it is usually one of the observed values of the data set.
Disadvantages of the mode
The disadvantages of mode are that:
the mode may not exist;
it may not be unique;
its use in further statistical analysis is limited;
it ignores all values other than the most frequent one.
Activity 3.8
The planning department of a Building Society would like to estimate the average household
size of workers at a particular company for which they are to develop a housing project. The
Society gathered the following data pertaining to household sizes from a random sample of
20 workers at the company.
2, 1, 3, 6, 2, 5, 3, 5, 1, 7, 1, 2, 5, 3, 3, 4, 5, 5, 7, 15
(a) Find the mean, median and mode of the data.
(b) Which average is most suitable to estimate household size? Justify your answer by
saying why the other two are not suitable.
3.3 Measures of Position
These measures provide the position of a value in an ordered set of data. The median, for
instance, is a measure of position which divides the distribution into halves. Other commonly
used measures of position are the lower quartile (Q1), the upper quartile (Q3), and the
percentiles. The quartiles (Q1, Q2 and Q3) are positions which divide the entire distribution
into four portions of equal frequency. The lower quartile (Q1) is the value below which lies
25% of the distribution. The median (Q2) has 50% of the distribution lying below it while the
upper quartile (Q3) is the value below which lies 75% of the distribution.
Rank for the quartiles
In finding or estimating the quartiles, the data is first arranged in ascending order. We then
need to know the rank for the quartiles, since these will be used in making estimation. Table
3.1 summarises the rank for the quartiles for discrete and continuous data in which the
number of observations is n .
Table 3.1 Rank for the Quartiles
Quartile   Rank in discrete data     Rank in continuous data
Q1         \(\frac{n+1}{4}\)         \(\frac{n}{4}\)
Q2         \(\frac{n+1}{2}\)         \(\frac{n}{2}\)
Q3         \(\frac{3(n+1)}{4}\)      \(\frac{3n}{4}\)
3.3.1 Quartiles for discrete data
We demonstrate how the quartiles are estimated for given sets of discrete data, using the
ranks.
Example 3.15
A phone-shop operator recorded the daily revenue she received, in dollars, over 14 days as
shown below.
12, 18, 23, 27, 14, 17, 25, 43, 16, 37, 22, 28, 10, 36
(a) Estimate Q1 and Q3 from the data.
(b) Based on the calculated values, what is the probability that on a given day her revenue
exceeds Q3?
Solution 3.15
(a) There are 14 observations, hence n = 14. We first arrange these values in ascending
order to obtain the ordered data set below.
10, 12, 14, 16, 17, 18, 22, 23, 25, 27, 28, 36, 37, 43
The rank for Q1 is \(\frac{14+1}{4} = 3.75\)
Thus the rank is 3.75, that is, 3 + 0.75. We therefore consider Q1 to be the third value plus
0.75 of the distance between this and the fourth value. Put mathematically, this is
\[ Q_1 = 14 + 0.75(16 - 14) = 15.5 \]
The rank for Q3 is \(\frac{3(14+1)}{4} = 11.25\)
The upper quartile is, therefore, the 11th value plus 0.25 of the difference between this and the
12th value.
\[ Q_3 = 28 + 0.25(36 - 28) = 30 \]
(b) The upper quartile, Q3, has 0.25 of the distribution lying above it. The probability that her
revenue exceeds $30 on a given day is therefore estimated as 0.25.
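The rank-and-interpolate procedure of Solution 3.15 can be sketched in Python. The helper names are hypothetical. Statistical packages use several slightly different quartile conventions, so other software may return different values from the (n + 1)/4 rule used in this unit.

```python
def value_at_fractional_rank(ordered, rank):
    """Interpolate linearly between neighbours for a 1-based fractional rank."""
    whole = int(rank)           # e.g. 3 for rank 3.75
    fraction = rank - whole     # e.g. 0.75
    if whole < 1 or whole > len(ordered):
        raise ValueError("rank lies outside the data set")
    if fraction == 0 or whole == len(ordered):
        return ordered[whole - 1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q3) for discrete data using ranks (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_fractional_rank(ordered, (n + 1) / 4)
    q3 = value_at_fractional_rank(ordered, 3 * (n + 1) / 4)
    return q1, q3


# Daily revenue ($) from Example 3.15
revenue = [12, 18, 23, 27, 14, 17, 25, 43, 16, 37, 22, 28, 10, 36]
print(quartiles(revenue))  # (15.5, 30.0)
```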
Activity 3.9
Find the lower and upper quartiles for the following data sets
(a) 15, 23, 9, 15, 25, 17, 18, 12, 15, 26, 13, 21
(b) 7.3, 5.4, 14.2, 9.1, 15.0, 8.6, 24.5, 3.7, 6.3, 16.4, 12.5, 18.2
Percentiles are found in a very similar way to quartiles. The 25th percentile and the 75th
percentile are in fact Q1 and Q3 respectively.
3.3.2 Quartiles for continuous data
Quartiles can be obtained from grouped data in a similar way as was used for the median.
You begin by identifying the appropriate quartile class. The lower quartile class is the class
that contains the \((n/4)\)th observation while the upper quartile class contains the \((3n/4)\)th
observation. The following computational formulae are then used to estimate the
quartiles:
\[ \text{Lower quartile, } Q_1 = L_q + \frac{C_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q} \] [3.8]
\[ \text{Upper quartile, } Q_3 = L_q + \frac{C_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q} \] [3.9]
where \(L_q\) = lower class boundary of the quartile class, \(C_q\) = class width of the quartile class, \(f_q\) =
frequency of the quartile class and \(F_{q-1}\) = cumulative frequency up to (but excluding)
the quartile class.
Example 3.16
Calculate the lower quartile, Q1 and upper quartile, Q3 for the data of Example 3.4.
Solution 3.16
The lower quartile class contains the \((30/4)\)th observation, that is, the 7.5th observation. The
lower quartile class is therefore 10.5 - 20.5.
\[ Q_1 = L_q + \frac{C_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q} = 10.5 + \frac{10(7.5 - 3)}{7} = 10.5 + 6.428571429 = 16.9286 \]
The upper quartile class contains the \((3n/4)\)th observation, that is, the 22.5th observation. This
class is 30.5 - 40.5.
\[ Q_3 = L_q + \frac{C_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q} = 30.5 + \frac{10(22.5 - 21)}{5} = 30.5 + 3 = 33.5 \]
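Formulas 3.8 and 3.9 differ from the grouped median formula only in the target rank, so one function can cover all three. The sketch below assumes classes are given as (lower boundary, upper boundary, frequency) triples in ascending order; the name `grouped_quantile` and its parameter `p` are illustrative.

```python
def grouped_quantile(classes, p):
    """Estimate the quantile at proportion p (0.25 for Q1, 0.5 for the median,
    0.75 for Q3) from grouped data by interpolation."""
    n = sum(f for _, _, f in classes)
    if n == 0 or not 0 < p < 1:
        raise ValueError("need observations and a proportion strictly between 0 and 1")
    target = p * n              # n/4, n/2 or 3n/4
    cumulative = 0              # F_{q-1}
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            return lower + (upper - lower) * (target - cumulative) / f
        cumulative += f
    raise ValueError("quantile class not found")


# Class boundaries and frequencies from Example 3.4 (amounts in $00)
medical = [(0.5, 10.5, 3), (10.5, 20.5, 7), (20.5, 30.5, 11),
           (30.5, 40.5, 5), (40.5, 50.5, 4)]
q1 = grouped_quantile(medical, 0.25)
q3 = grouped_quantile(medical, 0.75)
print(round(q1, 4), q3)  # 16.9286 33.5
```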
Activity 3.10
Using the data of Example 3.13, calculate the lower and upper quartiles of monthly
salaries of the employees.
3.4 Measures of Dispersion
Measures of dispersion give an indication of how widely scattered the observations are
around their mean. When values in a sample or population are close to the mean, they exhibit
less dispersion. The measures of dispersion we are going to look at are the range, inter-quartile range, semi inter-quartile range, variance and standard deviation.
3.4.1 Range
The range gives a simple indicator of the variability of a set of observations. The range of a
set of observations is the difference between the largest observation and the smallest
observation.
Range = highest observed value – lowest observed value
[3.10]
Example 3.17
Find the range of the following data
13 12 16 19 26 20 14 21 15 18 22 36
Solution 3.17
Range = highest observed value – lowest observed value
Range = 36 - 12 = 24
The range for grouped data is found by subtracting the real lower limit of the lowest class
interval from the real upper limit of the highest class interval. Although it is very easy to use
and understand, the range is not a reliable way of measuring the spread of data because it is
based on only two observations, the highest and lowest values. If one of these
two values is an outlier, then the spread of data is rather exaggerated. Moreover, it is not
applicable where class intervals are open-ended.
Inter-quartile range
The inter-quartile range (IQR) is the range between quartiles. More specifically, it is the
difference between the upper quartile and the lower quartile, that is:
IQR = Q3 - Q1
[3.11]
In turn, the semi inter-quartile range (SIQR) is half the inter-quartile range and is obtained
from the formula:
\[ \text{SIQR} = \frac{Q_3 - Q_1}{2} \] [3.12]
The SIQR is limited in that, just like the range, it is based on selected observations in a
distribution so it cannot always detect dispersion in data. However, it is more resistant to
extreme observations compared to the range.
Activity 3.11
Find the range, inter-quartile range and the semi inter-quartile range for the following data
12 19 19 26 20 14 21 15 17 22 36 12 18 33 15 21 18 19 11
3.4.2 Variance and standard deviation of ungrouped data
The variance and standard deviation allow us to avoid the shortcomings of the range and
inter-quartile range as measures of dispersion because they take into account all the
observations in the data set as opposed to just selecting a few.
The variance of a set of data is the average squared deviation of the data points from their
mean. Computationally, the variance of a sample of n observations \(x_1, x_2, \ldots, x_n\) is obtained by
the formula:
\[ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \] [3.13]
The formula for population variance \(\sigma^2\) is given by:
\[ \sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right] \] [3.14]
The standard deviation of a set of observations is the positive square root of the variance of
the set. The variance is a squared quantity and its units, which are (units)², often have no
practical meaning. For example, the variance of sales data in dollars is in (dollars)², which is
practically meaningless. By taking the square root of the variance, we "unsquare" the units
and get the standard deviation which has the same units as those of the quantity being
measured and thus easier to interpret compared to the variance. When calculating variance or
standard deviation, you should verify whether the data relate to a population or a sample.
Example 3.18
The numbers of vehicles stopping to refuel at a service station on 20 randomly selected days
are:
32 37 29 40 35 26 45 37 34 29 30 34 56 74 40 48 45 43 32 35
Find the variance and standard deviation of the data.
Solution 3.18
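\[ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] = \frac{1}{19}\left[32821 - \frac{(781)^2}{20}\right] = 122.2605263 \]
The standard deviation is then obtained by finding the positive square root of the variance.
\[ s = \sqrt{122.2605263} = 11.0571 \]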
The variance is 122.2605 and the standard deviation is 11.0571.
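The computational formula for the sample variance can be sketched in Python. The function name `sample_variance` is illustrative. The standard library's `statistics.variance` also uses the n - 1 divisor and can serve as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance via the computational formula (n - 1 divisor)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    sum_x = sum(values)                       # sum of x_i
    sum_x2 = sum(x * x for x in values)       # sum of x_i squared
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Vehicles refuelling per day, Example 3.18
vehicles = [32, 37, 29, 40, 35, 26, 45, 37, 34, 29,
            30, 34, 56, 74, 40, 48, 45, 43, 32, 35]
variance = sample_variance(vehicles)
std_dev = math.sqrt(variance)
print(round(variance, 4), round(std_dev, 4))  # 122.2605 11.0571
```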
Activity 3.12
The commissions (in dollars) earned by a sample of 15 ice cream vendors in one month were:
78 50 65 79 97 80 102 45 54 75 98 86 92 69 72 75 80
Find the variance and standard deviation of the data.
3.4.3 Variance and standard deviation of grouped data
Suppose that data were put into k classes. Let \(x_1, x_2, \ldots, x_k\) be the midpoints of the class
intervals and \(f_1, f_2, \ldots, f_k\) be the respective class frequencies, then the population variance is
given by:
\[ \sigma^2 = \frac{1}{N}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{N}\right) \] [3.15]
where \(N = \sum_{i=1}^{k} f_i\) is the population size.
The sample variance is given by:
\[ s^2 = \frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right) \] [3.16]
where \(n = \sum_{i=1}^{k} f_i\) is the sample size.
The standard deviation is found by taking the square root of the variance.
Example 3.19
Calculate the variance and standard deviation of the following data.
Class interval   Frequency
2 - 9            2
10 - 17          6
18 - 25          12
26 - 33          5
34 - 41          3
42 - 49          2
Solution 3.19
Class boundaries   Frequency, \(f_i\)   Class midpoint, \(x_i\)   \(f_i x_i\)   \(f_i x_i^2\)
1.5 - 9.5          2                    5.5                       11            60.5
9.5 - 17.5         6                    13.5                      81            1 093.5
17.5 - 25.5        12                   21.5                      258           5 547
25.5 - 33.5        5                    29.5                      147.5         4 351.25
33.5 - 41.5        3                    37.5                      112.5         4 218.75
41.5 - 49.5        2                    45.5                      91            4 140.5
Total              \(\sum f_i = 30\)                              \(\sum f_i x_i = 701\)   \(\sum f_i x_i^2 = 19411.5\)
\[ \text{Variance, } s^2 = \frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right) = \frac{1}{29}\left(19411.5 - \frac{(701)^2}{30}\right) = \frac{1}{29}(19411.5 - 16380.03333) = 104.5333334 \approx 104.5333 \]
\[ \text{Standard deviation, } s = \sqrt{104.5333334} = 10.22415441 \approx 10.2242 \]
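The grouped sample variance of formula 3.16 can be sketched in Python. As before, the (lower boundary, upper boundary, frequency) layout and the name `grouped_sample_variance` are assumptions made for this illustration.

```python
import math


def grouped_sample_variance(classes):
    """Sample variance of grouped data using class midpoints (formula 3.16)."""
    n = sum(f for _, _, f in classes)
    if n < 2:
        raise ValueError("at least two observations are needed")
    midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
    freqs = [f for _, _, f in classes]
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))        # sum of f_i x_i
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))   # sum of f_i x_i^2
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Class boundaries and frequencies from Example 3.19
data = [(1.5, 9.5, 2), (9.5, 17.5, 6), (17.5, 25.5, 12),
        (25.5, 33.5, 5), (33.5, 41.5, 3), (41.5, 49.5, 2)]
variance = grouped_sample_variance(data)
print(round(variance, 4), round(math.sqrt(variance), 4))  # 104.5333 10.2242
```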
Example 3.20
Calculate the variance and standard deviation of the data in Example 3.4.
Solution 3.20
Amount ($00)   Frequency, \(f_i\)   Midpoint, \(x_i\)   \(f_i x_i\)   \(f_i x_i^2\)
0.5 - 10.5     3                    5.5                 16.5          90.75
10.5 - 20.5    7                    15.5                108.5         1681.75
20.5 - 30.5    11                   25.5                280.5         7152.75
30.5 - 40.5    5                    35.5                177.5         6301.25
40.5 - 50.5    4                    45.5                182           8281
Total          \(\sum f_i = 30\)                        \(\sum f_i x_i = 765\)   \(\sum f_i x_i^2 = 23507.5\)
\[ \text{Variance, } s^2 = \frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right) = \frac{1}{29}\left(23507.5 - \frac{(765)^2}{30}\right) = \frac{1}{29}(23507.5 - 19507.5) = \frac{1}{29}(4000) = 137.9310345 \]
\[ \text{Standard deviation} = \sqrt{137.9310345} = 11.74440439 \]
Since the amounts are in $00, the standard deviation was $1 174.44.
Activity 3.13
The annual profits made by a random sample of 40 companies in the textiles industry are
shown in the table below.
Profit ($00)           Number of companies
10 but less than 20    3
20 but less than 30    7
30 but less than 40    12
40 but less than 50    10
50 but less than 60    5
60 but less than 100   3
Calculate the:
i. Mean
ii. Median
iii. Mode
iv. Semi inter-quartile range
v. Variance
vi. Standard deviation
3.4.4 Coefficient of variation
The coefficient of variation is the standard deviation given as a percentage of the mean. It is
calculated using the following formula:
\[ \text{Coefficient of variation (CV)} = \frac{s}{\bar{x}} \times 100 \] [3.17]
The coefficient of variation is a relative measure and it is used to compare variability of two
or more distributions especially where the units of measurement differ.
Example 3.21
A German-based firm would like to purchase stock in one of two companies (A and B) listed
on the Zimbabwe Stock Exchange. The firm considered the monthly returns of the two
companies over the last 10 months.
A: 34 42 36 38 45 40 32 34 39 41
B: 21 24 32 64 50 35 28 30 42 55
Compare the variability in returns between the two companies. In which company should the
firm invest?
Solution 3.21
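A: mean = 38.1
variance = 16.7667, so standard deviation = \(\sqrt{16.7667}\) = 4.0947
\[ \text{CV} = \frac{4.0947}{38.1} \times 100 = 10.75\% \]
B: mean = 38.1
standard deviation = 14.2162
\[ \text{CV} = \frac{14.2162}{38.1} \times 100 = 37.31\% \]
The returns of company B are more variable, and therefore more risky, than those of company A.
Since the two companies have the same mean return, the German-based firm should invest in company A.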
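A comparison of this kind is easy to check by computing the coefficient of variation directly from the raw returns. The sketch below uses the standard library's `statistics.mean` and `statistics.stdev` (the sample standard deviation, with the n - 1 divisor); the function name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean * 100


# Monthly returns of companies A and B, Example 3.21
returns_a = [34, 42, 36, 38, 45, 40, 32, 34, 39, 41]
returns_b = [21, 24, 32, 64, 50, 35, 28, 30, 42, 55]

cv_a = coefficient_of_variation(returns_a)
cv_b = coefficient_of_variation(returns_b)
print(round(statistics.stdev(returns_a), 4), round(cv_a, 2))  # 4.0947 10.75
print(round(statistics.stdev(returns_b), 4), round(cv_b, 2))  # 14.2162 37.31
```

The company with the smaller coefficient of variation has the more consistent returns relative to its mean.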
Activity 3.14
Sekai and Sam stay in the same suburb and are employed by the same company in town.
Sekai travels to work by bus and Sam cycles. The times (in minutes) taken by each to get to
work on a sample of 10 days were:
Sekai: 35 26 41 38 36 48 37 30 35 24
Sam: 24 28 24 21 27 26 24 28 22 23
Calculate the coefficient of variation for each set of times. Whose travel time is more
consistent? Justify your answer.
3.5 Coefficient of Skewness
Pearson’s coefficient of skewness, denoted Skp, is a measure of the degree of departure from
symmetry which is based on the difference between the mean and the median. It is calculated
using the formula
\[ Sk_p = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \] [3.18]
A symmetrical distribution has a coefficient of skewness which is equal to zero. A coefficient
of skewness which is close to zero indicates moderate skewness. A positive coefficient of
skewness shows that data are positively skewed whilst a negative coefficient means data are
negatively skewed.
Example 3.22
Calculate the coefficient of skewness for the data in Example 3.18
Solution 3.22
You are now capable of finding the mean, median and standard deviation of ungrouped data.
Show that mean = 39.05, median = 36 and standard deviation = 11.0571
\[ \text{Coefficient of skewness} = \frac{3(39.05 - 36)}{11.0571} = 0.8275 \]
Since the coefficient of skewness is positive, the data is positively skewed.
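The three ingredients and the coefficient itself can be verified with Python's standard `statistics` module. The function name `pearson_skewness` is an illustrative choice; `statistics.stdev` is the sample standard deviation.

```python
import statistics


def pearson_skewness(values):
    """Pearson's coefficient of skewness: 3(mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    std_dev = statistics.stdev(values)      # sample standard deviation
    return 3 * (mean - median) / std_dev


# Vehicles refuelling per day, Example 3.18
vehicles = [32, 37, 29, 40, 35, 26, 45, 37, 34, 29,
            30, 34, 56, 74, 40, 48, 45, 43, 32, 35]
print(statistics.mean(vehicles), statistics.median(vehicles))  # 39.05 36.0
print(round(pearson_skewness(vehicles), 4))                    # 0.8275
```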
3.6 The Box-and-Whisker Plot
A box-and-whisker plot is useful in comparing distributions. It highlights five summary
measures of a distribution which are: the median, lower quartile, upper quartile, the smallest
observation and the largest observation.
The middle half of the values in a distribution is represented by a box which has the lower
quartile at one end and the upper quartile at the other. The median is shown by a line inside
the box. Observations in the top and bottom quarters are represented by straight lines called
whiskers which extend from each end of the box, one from the lower quartile to the smallest
observation and the other from the upper quartile to the largest observation. Because of these
features, a box plot makes it easier to determine skewness, spread, central tendency and
possible outliers of a distribution.
Example 3.23
Draw a box-and-whisker plot of the following sales data
10 5 14 11 16 24 21 12 16 20 22 15 24 18 10 14 19 8 12 20
Solution 3.23
The smallest observation is 5 while the largest observation is 24. By now you should be able
to show that the lower quartile is 11.25; the median is 15.5 and the upper quartile is 20.
[Box-and-whisker plot: Sales on the vertical axis, scale 5 to 20]
Figure 3.2 Box-and-Whisker Plot of Sales Data
Note: The length of the box (showing inter-quartile range) and that of the whiskers (showing
the range) give an indication of the spread of the data.
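The five summary measures behind a box-and-whisker plot can be computed with a short Python sketch. It uses the discrete-data ranks (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 from Table 3.1 with linear interpolation; the function names are hypothetical, and plotting libraries may apply slightly different quartile conventions.

```python
def value_at_fractional_rank(ordered, rank):
    """Interpolate linearly between neighbours for a 1-based fractional rank."""
    whole = int(rank)
    fraction = rank - whole
    if whole < 1 or whole > len(ordered):
        raise ValueError("rank lies outside the data set")
    if fraction == 0 or whole == len(ordered):
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a set of discrete data."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_fractional_rank(ordered, (n + 1) / 4)
    q2 = value_at_fractional_rank(ordered, (n + 1) / 2)
    q3 = value_at_fractional_rank(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, q2, q3, ordered[-1]


# Sales data from Example 3.23
sales = [10, 5, 14, 11, 16, 24, 21, 12, 16, 20,
         22, 15, 24, 18, 10, 14, 19, 8, 12, 20]
print(five_number_summary(sales))  # (5, 11.25, 15.5, 20.0, 24)
```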
3.7 Summary
We looked at three broad categories of measures for describing data, namely measures of
central tendency, measures of position and measures of dispersion. The measures of central
tendency locate the centre of data; these are the mean, median and mode. Quartiles and
percentiles, which are classified as measures of position, provide the position of a value in an
ordered set of data. Measures of dispersion give an indication of how widely scattered the
observations are around their mean. When values in a sample or population are close to the
mean, they exhibit less dispersion. The measures of dispersion we looked at are the range,
inter-quartile range, semi inter-quartile range, variance and standard deviation. We also
considered the advantages and disadvantages of these measures.
Further Reading
Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.