Week 2, Lecture 3, Descriptive measures for grouped data

QBM117
Business Statistics
Descriptive Statistics
Descriptive Measures for Grouped Data
Percentiles and Box Plots
Objectives
• To learn how to calculate the approximate mean and
standard deviation for grouped data.
• To introduce percentiles as another descriptive
measure.
• To introduce the box plot as another graphical
technique.
Descriptive Measures for Grouped
Data
• In most cases, measures of location and variability
are computed by using the individual data values.
• Sometimes we only have data that have been
grouped into a frequency distribution, and we do not
have access to the raw data.
• It is therefore useful to be able to calculate
approximate descriptive measures directly from a
frequency distribution.
Approximate Mean and Standard
Deviation for Grouped Data
• The mean and the standard deviation are the most
widely used descriptive measures.
• We will therefore look at how to calculate the
approximate mean and standard deviation for
grouped data.
• Keep in mind that by grouping the data, we have lost
information, and the descriptive measures obtained
from the grouped data will only approximate those of
the ungrouped data.
Calculating the Approximate Mean and
Standard Deviation for Grouped Data
• You can calculate the approximate mean and
standard deviation for grouped data using the
statistics mode on your calculator.
• We start by calculating the midpoint for each of the
classes of the frequency distribution.
• We then assume that each observation in a class is
equal to the midpoint of that class.
• We then enter the midpoints, with their frequencies,
into the calculator and obtain the mean and standard
deviation.
• This will be demonstrated by example.
Example 1
Revisit example 5 from week 1 lecture 3 (Exercise
2.41 from text).
The number of items returned to a leading Brisbane
retailer by its customers was recorded on each of 25 days.
The frequency distribution for the data is given below:
Number of items                 Frequency
>5 up to and including 10       5
>10 up to and including 15      3
>15 up to and including 20      9
>20 up to and including 25      7
>25 up to and including 30      1
We now need to calculate the midpoint of each class.
Number of items                 Midpoint    Frequency
>5 up to and including 10       7.5         5
>10 up to and including 15      12.5        3
>15 up to and including 20      17.5        9
>20 up to and including 25      22.5        7
>25 up to and including 30      27.5        1
We now need to enter the data into the calculator.
Enter the value 7.5 into stats mode on your calculator
5 times.
New Casios: 7.5 SHIFT ; 5 M+
Older Casios: 7.5 X 5 M+
New Sharps: 7.5 2ndF , 5 M+
Then enter the value 12.5 three times.
Then enter the value 17.5 nine times.
Then enter the value 22.5 seven times.
Then enter the value 27.5 once.
Once you have entered the data, check that
you have 25 data values.
Then obtain the mean and standard deviation of
these values.
mean = 16.7
standard deviation = 5.89 (2 d.p.)
Hence the approximate mean and standard deviation
for the grouped data are 16.7 and 5.89 respectively.
Note that the true mean and standard deviation for
the ungrouped data are 17 and 6.20 (2 d.p.)
respectively.
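The calculator steps in Example 1 can also be reproduced in code. The following is a minimal Python sketch, not a method from the text: it repeats each class midpoint as many times as its frequency and then uses the standard library's statistics module. It assumes the calculator reports the sample standard deviation (dividing by n - 1), which is consistent with the value 5.89 above. The variable names are chosen here for illustration.

```python
import statistics

# Class midpoints and frequencies from Example 1.
midpoints = [7.5, 12.5, 17.5, 22.5, 27.5]
frequencies = [5, 3, 9, 7, 1]

# Treat every observation in a class as equal to the class midpoint,
# which mirrors entering each midpoint into the calculator f times.
expanded = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]

n = len(expanded)                        # check: should be 25 data values
approx_mean = statistics.mean(expanded)
# Assumption: sample standard deviation (divides by n - 1).
approx_sd = statistics.stdev(expanded)

print(n, round(approx_mean, 2), round(approx_sd, 2))
```

Run as written, this should print 25, 16.7 and 5.89, matching the calculator results.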
Measures of Relative Standing
• Measures of central tendency and dispersion are
important.
• However, they are not the only numerical measures
that can be used to describe a data set.
• Measures of relative standing, or order statistics, give
information about the position of an observation in
the sample.
Median
• We have already looked at a measure of relative
standing, the median, which is also a measure of
central tendency.
• Recall that the median is the middle value when the
data are arranged in order.
• Hence the median divides the data set into halves.
Percentiles
• It is useful in some situations to know what data
value has a certain percentage of the observations
above or below it.
• This measure is known as a percentile of the data.
• The pth percentile is the value that has at most p% of
the observations less than that value, and at most
(100-p)% of the observations greater than that value.
Quartiles
• We have special names for the 25th, 50th and 75th
percentiles.
• These three measures divide the data into quarters
and hence are called quartiles.
• The 25th percentile is known as the lower quartile, Q1.
• The 50th percentile is known as the middle quartile,
Q2, but is more commonly called the median, M.
• The 75th percentile is known as the upper quartile, Q3.
Calculating Percentiles
• Arrange the data in ascending order.
• We find the position of the pth percentile by
calculating i = (p/100) x n, where n is the number of
observations.
• If i is not an integer, round up. The next integer
greater than i denotes the position of the pth
percentile.
• If i is an integer, the pth percentile is the average of
the data values in positions i and i+1.
Example 3.14 from text
Calculate the quartiles for the set of measurements
7 18 12 17 29 18 4 27 30 2 4 10 21 5 8
First we need to order the data
2 4 4 5 7 8 10 12 17 18 18 21 27 29 30
The lower quartile is the 25th percentile.
p = 25
n = 15
i = (p/100) x n
= (25/100) x 15
= 3.75
i = 3.75 is not an integer and so we round up to 4.
The lower quartile is the 4th value.
2 4 4 [5] 7 8 10 12 17 18 18 21 27 29 30
Hence the lower quartile is 5.
The median (middle quartile) is the 50th percentile.
p = 50
n = 15
i = (p/100) x n
= (50/100) x 15
= 7.5
i = 7.5 is not an integer and so we round up to 8.
The median is the 8th value.
2 4 4 5 7 8 10 [12] 17 18 18 21 27 29 30
Hence the median is 12.
The upper quartile is the 75th percentile.
p = 75
n = 15
i = (p/100) x n
= (75/100) x 15
= 11.25
i = 11.25 is not an integer and so we round up to 12.
The upper quartile is the 12th value.
2 4 4 5 7 8 10 12 17 18 18 [21] 27 29 30
Hence the upper quartile is 21.
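The position rule and the three quartile calculations above can be checked in code. The following is a minimal Python sketch of that rule, not a routine from the text. The function name percentile is chosen here for illustration, and the sketch assumes 0 < p < 100 and a non-empty data set.

```python
def percentile(data, p):
    """Return the p-th percentile using the position rule i = (p/100) x n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)                      # arrange in ascending order
    if not ordered:
        raise ValueError("data must not be empty")
    # whole is the integer part of i; remainder is non-zero when i is not an integer
    whole, remainder = divmod(p * len(ordered), 100)
    whole = int(whole)
    if remainder:
        # i is not an integer: round up to position whole + 1 (index whole, 0-based)
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


measurements = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
quartiles = [percentile(measurements, p) for p in (25, 50, 75)]
print(quartiles)
```

For the 15 measurements this should print [5, 12, 21]. The averaging branch is used only when i is an integer, for example the median of four values, where i = 2.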
Calculating Percentiles in Excel
• To calculate percentiles in Excel go to
Tools > Data Analysis > Descriptive Statistics.
• To produce the median select Summary Statistics.
• To produce the lower quartile select Kth Smallest and
enter the position of the lower quartile.
• To produce the upper quartile select Kth Largest and
enter the position of the upper quartile, counted from
the largest value.
Five-Number Summary
• In a five-number summary, the following five numbers
are used to summarise the data:
- Smallest data value
- Lower quartile
- Median
- Upper quartile
- Largest data value
Example 3.14 revisited
The five-number summary for the set of
measurements in Example 3.14 is
Min = 2
Q1 = 5
M = 12
Q3 = 21
Max = 30
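Software does not always use the same quartile rule as the one described in this lecture, so it is worth checking a five-number summary by hand. The sketch below uses statistics.quantiles from the Python standard library (Python 3.8 or later), which offers two interpolation methods. Neither is the lecture's position rule. For these 15 measurements the "exclusive" method happens to agree with the hand calculation, while the "inclusive" method does not. The variable names are chosen here for illustration.

```python
import statistics

measurements = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]

# "exclusive" interpolates between positions based on n + 1.
# It is not the lecture's rule, but it agrees with it for these data.
q1, median, q3 = statistics.quantiles(measurements, n=4, method="exclusive")
five_number = (min(measurements), q1, median, q3, max(measurements))

# "inclusive" interpolates between positions based on n - 1
# and gives different quartiles for the same data.
inclusive = statistics.quantiles(measurements, n=4, method="inclusive")

print(five_number)
print(inclusive)
```

The first line of output should be the summary 2, 5, 12, 21, 30 (with the quartiles shown as decimals), and the second should be 6.0, 12.0, 19.5.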
Interquartile Range (IQR)
• The interquartile range is the difference between the
upper and lower quartiles.
IQR = Q3 - Q1
• The interquartile range is the range of the middle
50% of the data.
• It is a measure of dispersion that is not sensitive to
outliers.
Example 3.14 revisited
Calculate the interquartile range for the set of
measurements in Example 3.14.
Q1 = 5
Q3 = 21
IQR = Q3 - Q1
= 21 – 5
= 16
Box Plots
• Now that we have introduced quartiles, we can
present one more graphical technique for quantitative
data.
• A box plot is a graphical display of the five-number
summary.
• It can be used to identify the central location, spread
and shape of the data, and to identify any possible
outliers.
Constructing a Box Plot
• Order the data. The most efficient way to do this is to
construct a stem and leaf display.
• Calculate the five-number summary.
• Draw a box with the ends of the box located at the
lower and upper quartiles.
• Draw a vertical line in the box at the location of the
median.
• Identify any outliers. An outlier is any value that lies
more than 1.5 x IQR below the lower quartile or more
than 1.5 x IQR above the upper quartile.
• Draw lines extending from the box to the smallest
and largest values within 1.5 x IQR of the box, i.e. the
most extreme values that are not outliers. These lines
are called whiskers.
• Plot any outliers individually.
Example 3.14 revisited
Construct a box plot for the set of measurements
7 18 12 17 29 18 4 27 30 2 4 10 21 5 8
The five-number summary is
Min = 2 Q1 = 5 M = 12 Q3 = 21 Max = 30
The interquartile range is IQR = 16
1.5 x IQR = 1.5 x 16
= 24
Q1 – 1.5 x IQR = 5 – 24
= -19
Q3 + 1.5 x IQR = 21 + 24
= 45
There are no data values less than -19 or greater
than 45.
Therefore there are no outliers.
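The outlier check above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a routine from the text: the quartiles Q1 = 5 and Q3 = 21 are taken as given from Example 3.14, and the names lower_limit, upper_limit and whisker_ends are chosen here for illustration.

```python
measurements = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
q1, q3 = 5, 21                      # quartiles from Example 3.14

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr        # values below this are outliers
upper_limit = q3 + 1.5 * iqr        # values above this are outliers

outliers = [x for x in measurements if x < lower_limit or x > upper_limit]

# Whiskers extend to the most extreme values that are not outliers.
not_outliers = [x for x in measurements if lower_limit <= x <= upper_limit]
whisker_ends = (min(not_outliers), max(not_outliers))

print(iqr, lower_limit, upper_limit, outliers, whisker_ends)
```

For these data the limits are -19 and 45, the list of outliers is empty, and the whiskers end at 2 and 30.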
[Figure: Boxplot for Data from Example 3.14, horizontal axis from 0 to 35]
Constructing Box Plots in Excel
• There are instructions for constructing a box plot in
Excel on page 96 of the text (pg 94 abridged).
• You will need to use Data Analysis Plus – the macros
that come on the disk that accompanies the text.
Example 3.14 revisited
Construct a box plot in Excel for the set of
measurements in Example 3.14 .
[Figure: BoxPlot produced in Excel, horizontal axis from 2 to 32]
Using the Box Plot to Identify
Skewness
If the data set is perfectly symmetric then the box plot
will be symmetric.
• The length of the left whisker will equal the length
of the right whisker.
• The median will divide the box in half.
[Figure: Boxplot showing data which are symmetric]
If the data are positively skewed,
• the length of the right whisker will be greater than
the length of the left whisker,
and/or
• the portion of the box to the right of the median will
be greater than the portion of the box to the left of
the median.
[Figure: Boxplot showing data which are positively skewed]
If the data are negatively skewed,
• the length of the left whisker will be greater than
the length of the right whisker,
and/or
• the portion of the box to the left of the median will
be greater than the portion of the box to the right
of the median.
[Figure: Boxplot showing data which are negatively skewed]
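The whisker and box comparisons described above can be applied directly to a five-number summary. The following Python sketch is a rough illustration only. The function name describe_shape and the label "mixed" (used when the whiskers and the box point in opposite directions) are assumptions made here, and the sketch assumes there are no outliers, so that the whiskers reach the smallest and largest values.

```python
def describe_shape(minimum, q1, median, q3, maximum):
    """Judge skewness from whisker lengths and the two parts of the box."""
    left_whisker = q1 - minimum
    right_whisker = maximum - q3
    left_box = median - q1
    right_box = q3 - median

    # "and/or": either the whiskers or the box may signal the skew.
    points_right = right_whisker > left_whisker or right_box > left_box
    points_left = left_whisker > right_whisker or left_box > right_box

    if points_right and points_left:
        return "mixed"                    # the two signals disagree
    if points_right:
        return "positively skewed"
    if points_left:
        return "negatively skewed"
    return "symmetric"


# Five-number summary from Example 3.14.
shape = describe_shape(2, 5, 12, 21, 30)
print(shape)
```

For Example 3.14 the right whisker (9) is longer than the left (3), and the right part of the box (9) is longer than the left (7), so the sketch reports positive skew.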
Outliers
• As well as providing a graphical summary of a data
set, a box plot is useful for identifying outliers.
• When presenting and analysing data it is important to
identify and review outliers.
• An outlier may be an observation that has been
incorrectly recorded. If so, it needs to be corrected
before further analysis.
• An outlier may also be an observation that was
incorrectly included in the data set. If so, it can be
removed.
• An outlier may just be an unusual observation that
has been recorded correctly and does belong to the
data set. In such cases the observation should
remain.
Using Box Plots to Compare Data Sets
• We can use box plots to compare several data sets
by constructing a box plot for each data set and
displaying the box plots on the same scale.
• We can then compare the centre, spread and shape
of the distributions of the different data sets.
• If the box plots are not on the same scale, more care
needs to be taken when comparing the distributions.
Example
In automobile mileage and gasoline-consumption testing,
13 automobiles were road tested for 300 miles in both
city and country driving conditions. The following data
were recorded for miles-per-gallon performance.
City 16.2 16.7 15.9 14.4 13.2 15.3 16.8 16.0 16.1 15.3 15.2 15.3 16.2
Country 19.4 20.6 18.3 18.6 19.2 17.4 17.2 18.6 19 21.1 19.4 18.5 18.7
Construct box plots for both data sets and compare the
performance for city and country driving.
[Figures: BoxPlot of City Data (axis from 13.2 to 17.2) and BoxPlot for Country Data (axis from 17.2 to 22.2), drawn on separate scales]
[Figures: Box Plot for City Data and Boxplot for Country Data, drawn on a common scale from 12 to 22]
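The comparison of city and country driving can also be supported numerically. The following Python sketch is one possible approach, not a method from the text. It applies the lecture's position rule for quartiles and the 1.5 x IQR rule for outliers to both data sets. The helper names position_rule and summarise are chosen here for illustration.

```python
city = [16.2, 16.7, 15.9, 14.4, 13.2, 15.3, 16.8, 16.0, 16.1, 15.3, 15.2, 15.3, 16.2]
country = [19.4, 20.6, 18.3, 18.6, 19.2, 17.4, 17.2, 18.6, 19, 21.1, 19.4, 18.5, 18.7]


def position_rule(ordered, p):
    """p-th percentile of already sorted data, with i = (p/100) x n."""
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:
        # i is not an integer: round up to position whole + 1
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


def summarise(data):
    """Five-number summary plus any outliers under the 1.5 x IQR rule."""
    ordered = sorted(data)
    q1, median, q3 = (position_rule(ordered, p) for p in (25, 50, 75))
    reach = 1.5 * (q3 - q1)
    outliers = [x for x in ordered if x < q1 - reach or x > q3 + reach]
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3,
            "max": ordered[-1], "outliers": outliers}


city_summary = summarise(city)
country_summary = summarise(country)
print("City:   ", city_summary)
print("Country:", country_summary)
```

With these data the sketch gives a median of 15.9 miles per gallon for city driving and 18.7 for country driving, and it flags 13.2 (city) and 21.1 (country) as outliers under the 1.5 x IQR rule.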
Reading for next lecture
• Chapter 4 Sections 4.1 – 4.3
Exercises
• 3.47
• 3.54
• 3.57
• 3.59
• 3.61