QBM117 Business Statistics

Descriptive Statistics
Descriptive Measures for Grouped Data
Percentiles and Box Plots

Objectives
• To learn how to calculate the approximate mean and standard deviation for grouped data.
• To introduce percentiles as another descriptive measure.
• To introduce the box plot as another graphical technique.

Descriptive Measures for Grouped Data
• In most cases, measures of location and variability are computed using the individual data values.
• Sometimes we only have data that have been grouped into a frequency distribution, and we do not have access to the raw data.
• It is therefore useful to be able to calculate approximate descriptive measures directly from a frequency distribution.

Approximate Mean and Standard Deviation for Grouped Data
• The mean and the standard deviation are the most widely used descriptive measures.
• We will therefore look at how to calculate the approximate mean and standard deviation for grouped data.
• Keep in mind that by grouping the data we have lost information, so the descriptive measures obtained from the grouped data will only approximate those of the ungrouped data.

Calculating the Approximate Mean and Standard Deviation for Grouped Data
• You can calculate the approximate mean and standard deviation for grouped data using the statistics mode on your calculator.
• We start by calculating the midpoint of each class of the frequency distribution.
• We then assume that each observation in a class is equal to the midpoint of that class.
• We then enter the midpoints, together with their frequencies, into the calculator and obtain the mean and standard deviation.
• This will be demonstrated by example.

Example 1
Revisit Example 5 from week 1 lecture 3 (Exercise 2.41 from text). The number of items returned to a leading Brisbane retailer by its customers was recorded for 25 days.
The frequency distribution for the data is given below:

Number of items                  Frequency
>5 up to and including 10        5
>10 up to and including 15       3
>15 up to and including 20       9
>20 up to and including 25       7
>25 up to and including 30       1

We now need to calculate the midpoint of each class.

Number of items                  Midpoint   Frequency
>5 up to and including 10        7.5        5
>10 up to and including 15       12.5       3
>15 up to and including 20       17.5       9
>20 up to and including 25       22.5       7
>25 up to and including 30       27.5       1

We now need to enter the data into the calculator.
Enter the value 7.5 into stats mode on your calculator 5 times.
New Casios: 7.5 SHIFT ; 5 M+
Older Casios: 7.5 X 5 M+
New Sharps: 7.5 2ndF , 5 M+
Then enter the value 12.5 3 times, the value 17.5 9 times, the value 22.5 7 times and the value 27.5 once.
Once you have entered the data, check that you have 25 data values. Then obtain the mean and standard deviation of these values.
mean = 16.7
standard deviation = 5.89 (2 d.p.)
Hence the approximate mean and standard deviation for the grouped data are 16.7 and 5.89 respectively.
Note that the true mean and standard deviation for the ungrouped data are 17 and 6.20 (2 d.p.) respectively.

Measures of Relative Standing
• Measures of central tendency and dispersion are important.
• However, they are not the only numerical measures that can be used to describe a data set.
• Measures of relative standing, or order statistics, give information about the position of an observation in the sample.

Median
• We have already looked at a measure of relative standing, the median, which is also a measure of central tendency.
• Recall that the median is the middle value when the data are arranged in order.
• Hence the median divides the data set into halves.

Percentiles
• It is useful in some situations to know what data value has a certain percentage of the observations above or below it.
• This measure is known as a percentile of the data.
• The pth percentile is the value that has at most p% of the observations less than that value, and at most (100-p)% of the observations greater than that value.

Quartiles
• We have special names for the 25th, 50th and 75th percentiles.
• These three measures divide the data into quarters and hence are called quartiles.
• The 25th percentile is known as the lower quartile, Q1.
• The 50th percentile is known as the middle quartile, Q2, but is more commonly called the median, M.
• The 75th percentile is known as the upper quartile, Q3.

Calculating Percentiles
• Arrange the data in ascending order.
• Find the position of the pth percentile by calculating i = (p/100) x n, where n is the number of observations.
• If i is not an integer, round up. The next integer greater than i denotes the position of the pth percentile.
• If i is an integer, the pth percentile is the average of the data values in positions i and i+1.

Example 3.14 from text
Calculate the quartiles for the set of measurements
7 18 12 17 29 18 4 27 30 2 4 10 21 5 8

First we need to order the data:
2 4 4 5 7 8 10 12 17 18 18 21 27 29 30

The lower quartile is the 25th percentile.
p = 25, n = 15
i = (p/100) x n = (25/100) x 15 = 3.75
i = 3.75 is not an integer and so we round up to 4. The lower quartile is the 4th value.
2 4 4 [5] 7 8 10 12 17 18 18 21 27 29 30
Hence the lower quartile is 5.

The median (middle quartile) is the 50th percentile.
p = 50, n = 15
i = (p/100) x n = (50/100) x 15 = 7.5
i = 7.5 is not an integer and so we round up to 8. The median is the 8th value.
2 4 4 5 7 8 10 [12] 17 18 18 21 27 29 30
Hence the median is 12.

The upper quartile is the 75th percentile.
p = 75, n = 15
i = (p/100) x n = (75/100) x 15 = 11.25
i = 11.25 is not an integer and so we round up to 12. The upper quartile is the 12th value.
2 4 4 5 7 8 10 12 17 18 18 [21] 27 29 30
Hence the upper quartile is 21.
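
The position rule above can also be sketched in code. The following is a minimal Python illustration, not part of the lecture or the text: the function name `percentile` is hypothetical, and it assumes p is a whole number strictly between 0 and 100 so that the positions i and i+1 always exist. Statistical software often uses different percentile conventions, so its answers may differ slightly from this rule.

```python
def percentile(data, p):
    """Return the pth percentile using the position rule i = (p/100) x n.

    Illustrative sketch only. Assumes p is a whole number with 0 < p < 100.
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")

    values = sorted(data)          # arrange the data in ascending order
    n = len(values)

    # i = (p/100) x n, split into whole part and remainder so that the
    # "is i an integer?" check does not depend on floating-point rounding.
    whole, remainder = divmod(p * n, 100)

    if remainder:
        # i is not an integer: round up to position whole + 1,
        # which is index `whole` in a zero-based list.
        return values[whole]

    # i is an integer: average the values in positions i and i + 1
    # (zero-based indices i - 1 and i).
    return (values[whole - 1] + values[whole]) / 2


# The measurements from Example 3.14
measurements = [7, 18, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8]
q1, median, q3 = (percentile(measurements, p) for p in (25, 50, 75))
print(q1, median, q3)  # 5 12 21
```

Working with the whole part and remainder of p x n divided by 100 is a design choice made here to keep the integer test exact; it is equivalent to the rule as stated on the slide.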

Calculating Percentiles in Excel
• To calculate percentiles in Excel go to Tools > Data Analysis > Descriptive Statistics.
• To produce the median select Summary Statistics.
• To produce the lower quartile select Kth Smallest and enter the position of the lower quartile.
• To produce the upper quartile select Kth Largest and enter the position of the upper quartile counted from the largest value.

Five-Number Summary
• In a five-number summary, the following five numbers are used to summarise the data:
  - Smallest data value
  - Lower quartile
  - Median
  - Upper quartile
  - Largest data value

Example 3.14 revisited
The five-number summary for the set of measurements in Example 3.14 is
Min = 2   Q1 = 5   M = 12   Q3 = 21   Max = 30

Interquartile Range (IQR)
• The interquartile range is the difference between the upper and lower quartiles.
  IQR = Q3 - Q1
• The interquartile range is the range of the middle 50% of the data.
• It is a measure of dispersion that is not sensitive to outliers.

Example 3.14 revisited
Calculate the interquartile range for the set of measurements in Example 3.14.
Q1 = 5   Q3 = 21
IQR = Q3 - Q1 = 21 - 5 = 16

Box Plots
• Now that we have introduced quartiles, we can present one more graphical technique for quantitative data.
• A box plot is a graphical display of the five-number summary.
• It can be used to identify the central location, spread and shape of the data, and to identify any possible outliers.

Constructing a Box Plot
• Order the data. The most efficient way to do this is to construct a stem and leaf display.
• Calculate the five-number summary.
• Draw a box with the ends of the box located at the lower and upper quartiles.
• Draw a vertical line in the box at the location of the median.
• Identify any outliers. An outlier is any value located at a distance of more than 1.5 x IQR from the box.
• Draw lines extending from the box to the smallest and largest values within 1.5 x IQR of the box, i.e. the most extreme values that are not outliers.
  These lines are called whiskers.
• Plot any outliers individually.

Example 3.14 revisited
Construct a box plot for the set of measurements
7 18 12 17 29 18 4 27 30 2 4 10 21 5 8

The five-number summary is
Min = 2   Q1 = 5   M = 12   Q3 = 21   Max = 30
The interquartile range is IQR = 16
1.5 x IQR = 1.5 x 16 = 24
Q1 - 1.5 x IQR = 5 - 24 = -19
Q3 + 1.5 x IQR = 21 + 24 = 45
There are no data values less than -19 or greater than 45. Therefore there are no outliers.

[Figure: Boxplot for Data from Example 3.14]

Constructing Box Plots in Excel
• There are instructions for constructing a box plot in Excel on page 96 of the text (pg 94 abridged).
• You will need to use Data Analysis Plus, the macros that come on the disk that accompanies the text.

Example 3.14 revisited
Construct a box plot in Excel for the set of measurements in Example 3.14.

[Figure: BoxPlot produced in Excel for the data from Example 3.14]

Using the Box Plot to Identify Skewness
If the data set is perfectly symmetric then the box plot will be symmetric.
• The length of the left whisker will equal the length of the right whisker.
• The median will divide the box in half.

[Figure: Boxplot showing data which are symmetric]

If the data are positively skewed,
• the length of the right whisker will be greater than the length of the left whisker, and/or
• the portion of the box to the right of the median will be greater than the portion of the box to the left of the median.

[Figure: Boxplot showing data which are positively skewed]

If the data are negatively skewed,
• the length of the left whisker will be greater than the length of the right whisker, and/or
• the portion of the box to the left of the median will be greater than the portion of the box to the right of the median.

[Figure: Boxplot showing data which are negatively skewed]

Outliers
• As well as providing a graphical summary of a data set, a box plot is useful for identifying outliers.
• When presenting and analysing data it is important to identify and review outliers.
• An outlier may be an observation that has been incorrectly recorded. If so, it needs to be corrected before further analysis.
• An outlier may also be an observation that was incorrectly included in the data set. If so, it can be removed.
• An outlier may just be an unusual observation that has been recorded correctly and does belong to the data set. In such cases the observation should remain.

Using Box Plots to Compare Data Sets
• We can use box plots to compare several data sets by constructing a box plot for each data set and displaying the box plots on the same scale.
• We can then compare the centre, spread and shape of the distributions of the different data sets.
• If the box plots are not on the same scale, more care needs to be taken when comparing the distributions.

Example
In automobile mileage and gasoline-consumption testing, 13 automobiles were road tested for 300 miles in both city and country driving conditions. The following data were recorded for miles-per-gallon performance.

City:     16.2  16.7  15.9  14.4  13.2  15.3  16.8  16.0  16.1  15.3  15.2  15.3  16.2
Country:  19.4  20.6  18.3  18.6  19.2  17.4  17.2  18.6  19.0  21.1  19.4  18.5  18.7

Construct box plots for both data sets and compare the performance for city and country driving.

[Figure: BoxPlot of City Data and BoxPlot for Country Data, each drawn on its own scale]

[Figure: Box Plot for City Data and Boxplot for Country Data, drawn on a common scale from 12 to 22]

Reading for next lecture
• Chapter 4 Sections 4.1 – 4.3

Exercises
• 3.47
• 3.54
• 3.57
• 3.59
• 3.61
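
Worked sketch for the city and country example
As a supplement to the city and country mileage example, the five-number summary, whisker ends and possible outliers for each data set can be computed with a short Python sketch. This is an illustration only and is not taken from the text: the function names `percentile` and `box_plot_summary` are hypothetical, the quartiles follow the lecture's position rule i = (p/100) x n, and an outlier is taken to be any value more than 1.5 x IQR from the box.

```python
def percentile(values, p):
    """pth percentile of sorted values by the rule i = (p/100) x n.

    Assumes p is a whole number with 0 < p < 100.
    """
    n = len(values)
    whole, remainder = divmod(p * n, 100)
    if remainder:
        return values[whole]                       # round up to position whole + 1
    return (values[whole - 1] + values[whole]) / 2  # average positions i and i + 1


def box_plot_summary(data):
    """Return the numbers needed to draw a box plot for one data set."""
    values = sorted(data)
    q1, median, q3 = (percentile(values, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Values beyond the limits are plotted individually as outliers.
    outliers = [x for x in values if x < lower_limit or x > upper_limit]
    # Whiskers extend to the most extreme values that are not outliers.
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    return {
        "min": values[0], "q1": q1, "median": median, "q3": q3,
        "max": values[-1], "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


city = [16.2, 16.7, 15.9, 14.4, 13.2, 15.3, 16.8,
        16.0, 16.1, 15.3, 15.2, 15.3, 16.2]
country = [19.4, 20.6, 18.3, 18.6, 19.2, 17.4, 17.2,
           18.6, 19.0, 21.1, 19.4, 18.5, 18.7]

city_summary = box_plot_summary(city)
country_summary = box_plot_summary(country)
for name, summary in (("City", city_summary), ("Country", country_summary)):
    print(name, summary)
```

Under these assumptions the sketch gives Q1 = 15.3, M = 15.9 and Q3 = 16.2 for the city data, with 13.2 flagged as a possible outlier, and Q1 = 18.5, M = 18.7 and Q3 = 19.4 for the country data, with 21.1 flagged as a possible outlier. Software that uses a different quartile convention may give slightly different values.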