CHAPTER TWO
DESCRIPTIVE STATISTICS

Content
2.1 Data Organization and Frequency Distribution
2.2 Types of Graph
2.3 Summary Statistics (Data Description)
• Measures of Central Tendency
• Measures of Variation
• Measures of Position

Objectives
At the end of this chapter, you should be able to
• Organize data using frequency distributions.
• Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.
• Represent data using Pareto charts, time series graphs, and pie graphs.
• Draw and interpret a stem and leaf plot.
• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
• Describe data using measures of variation, such as the range, variance, and standard deviation.
• Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.

2.1 Data Organization & Frequency Distribution

A. The raw data
– Data freshly collected from any source, before they have been organized in any way.

Example of the Raw Data
The Slimline Beverage Company makes and sells a line of dietetic soft drink products. These products are sold in bottles and cans. In addition, soft drink syrups are sold to restaurants, theaters, and other outlets that mix small amounts of the syrup with carbonated water and sell the result in cups. The sales manager wants to see how the new Fizzy Cola syrup is selling, so the raw sales data on gallons of syrup sold were gathered, as shown in the table below.
Raw data: Gallons of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month

Employee  Gallons sold    Employee  Gallons sold    Employee  Gallons sold    Employee  Gallons sold
PP         95             RN         95             GH        135.5           IT        135.5
SM        100.75          SG        100.75          RI        115.25          NI        115.25
PT        126             AD        126             OS        128.75          GC        128.75
PU        114             RO        114             US        113.25          AS        113.25
MS        134             EY        134             PO        132             NC        132
FK        116.75          YO        116.75          OR        105             YA        105
LZ         97.5           OU         97.5           FT        118.25          TN        118.25
FE        102.25          US        102.25          WO        121.75          HB        121.75
AN        110             LT        110             OF        109.25          IE        109.25
RJ        125             EA        125             RT        136             NF        136
OO        144             AT        144             KH        124             GU        124
UY        112             RI        112             EI         91             XN         91
TT         82.5           NS         82.5

B. The data array
An arrangement of data items in either ascending (lowest to highest) or descending (highest to lowest) order.

Example of Data Array
Array data: Gallons of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month (ascending, read down each column)

 82.5     105      116.75   128.75
 82.5     109.25   116.75   128.75
 91       109.25   118.25   132
 91       110      118.25   132
 95       110      121.75   134
 95       112      121.75   134
 97.5     112      124      135.5
 97.5     113.25   124      135.5
100.75    113.25   125      136
100.75    114      125      136
102.25    114      126      144
102.25    115.25   126      144
105       115.25

The lowest data: 82.5
The highest data: 144
Range: 144 - 82.5 = 61.5

Advantages
- We can see the range of the data.
- We can see how the data are distributed between the lowest and highest values.
- An array can show the presence of large concentrations of items at particular values, and it makes outliers (data that are much larger or smaller than the rest of the data) easy to spot.

Disadvantages
- The array is still a rather awkward data organization tool, especially when the number of data items is large.
- There is often a need to arrange the data into a more compact form for analysis and communication purposes.

C. Frequency distribution (frequency table)
Groups data items into classes and then records the number of items that appear in each class.

The purpose
To organize the data items into a compact form without obscuring essential facts.

General procedure
1. Determine the number of classes that will be used to group the data.
   a. Minimum – 5, maximum – 20
   b.
      The actual number depends on factors such as:
      i. The number of observations being grouped
      ii. The purpose of the distribution
      iii. The arbitrary preferences of the analyst
   c. Use classes that give a good view of the data pattern and enable you to gain insight into the information that is there.
   d. All data items, from the smallest to the largest, must be included.
   e. Each item must be assigned to one and only one class.
2. Determine the width (class interval) of these classes.
   a. The widths should be equal.
   b. Width = range / number of classes
   c. Whenever possible, an open-ended class interval (one with an unspecified upper or lower class limit) should be avoided.
3. Determine the number of observations (frequency) in each class.

Types of Frequency Distribution
• Categorical Frequency Distribution
  – Used for data that can be placed in specific categories, such as nominal or ordinal level data
• Ungrouped Frequency Distribution
  – Used for numerical data
  – The range of the data is small
• Grouped Frequency Distribution
  – Used for numerical data too
  – The range of the data is large

Example : Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to determine their blood type. The data set is

A   O   B   A   AB
B   O   B   O   A
B   B   O   O   O
AB  AB  A   O   B
O   B   O   AB  A

Construct a frequency distribution for the data.

Constructing an Ungrouped or Grouped Frequency Distribution
STEP 1 Determine the classes.
- Find the highest and lowest value.
- Find the range.
- Select the number of classes desired.
- Find the width by dividing the range by the number of classes and rounding up.
- Select a starting point (usually the lowest value or any convenient number less than the lowest value); add the width to get the lower limits.
- Find the upper class limits.
- Find the boundaries.
STEP 2 Tally the data.
STEP 3 Find the numerical frequencies from the tallies.
STEP 4 Find the cumulative frequencies.
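The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the function name `grouped_frequency` and the variable `temps` are made up for the example, the data are the 50 record high temperatures used in this chapter's grouped example, and the code assumes whole-number data so that upper limit = lower limit + width - 1 and boundaries lie 0.5 on either side of the limits.

```python
from itertools import accumulate


def grouped_frequency(data, num_classes):
    """Grouped frequency distribution for whole-number data (illustrative sketch)."""
    low, high = min(data), max(data)
    data_range = high - low
    # STEP 1: round the width up to the next whole number so that
    # num_classes classes starting at the lowest value reach the highest value.
    width = data_range // num_classes + 1
    rows = []
    for k in range(num_classes):
        lower = low + k * width          # lower class limit
        upper = lower + width - 1        # upper class limit (whole-number data)
        # STEPS 2-3: tally the values that fall inside the class limits.
        freq = sum(lower <= x <= upper for x in data)
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": freq,
        })
    # STEP 4: running total of the frequencies.
    for row, cum in zip(rows, accumulate(r["frequency"] for r in rows)):
        row["cumulative"] = cum
    return rows


temps = [
    112, 110, 107, 116, 120, 100, 118, 112, 108, 113,
    127, 117, 114, 110, 120, 120, 116, 115, 121, 117,
    134, 118, 118, 113, 105, 118, 122, 117, 120, 110,
    105, 114, 118, 119, 118, 110, 114, 122, 111, 112,
    109, 105, 106, 104, 114, 112, 109, 110, 111, 114,
]

table = grouped_frequency(temps, 7)
for row in table:
    lo, hi = row["limits"]
    print(f"{lo}-{hi}  f = {row['frequency']:2d}  cf = {row['cumulative']:2d}")
```

With 7 classes the range is 134 - 100 = 34, so the width rounds up to 5 and the classes run from 100–104 to 130–134, with frequencies 2, 8, 18, 13, 7, 1, 1.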
• The lower class limit represents the smallest data value that can be included in the class.
• The upper class limit represents the largest value that can be included in the class.
• The class boundaries are used to separate the classes so that there are no gaps in the frequency distribution.
• Rule of Thumb: Class limits should have the same decimal place value as the data, but the class boundaries have one additional place value and end in a 5.
• The class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
• The class midpoint is found by adding the lower and upper boundaries (or limits) and dividing by 2.

Class Rules
• There should be between 5 and 20 classes.
• The classes must be mutually exclusive.
• The classes must be continuous.
• The classes must be exhaustive.
• The classes must be equal in width.

Example : Ungrouped Frequency Distribution
The data shown here represent the number of miles per gallon that 30 selected four-wheel-drive sports utility vehicles obtained in city driving. Construct a frequency distribution.

12 16 15 12 19 17 18 16 14 13
12 12 12 15 16 14 16 15 12 18
16 17 16 15 16 18 15 16 15 14

Example : Grouped Frequency Distribution
These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes.

112 110 107 116 120 100 118 112 108 113
127 117 114 110 120 120 116 115 121 117
134 118 118 113 105 118 122 117 120 110
105 114 118 119 118 110 114 122 111 112
109 105 106 104 114 112 109 110 111 114

Why Construct Frequency Distributions?
To organize the data in a meaningful, intelligible way.
To enable the reader to make comparisons among different data sets.
To facilitate computational procedures for measures of average and spread.
To enable the reader to determine the nature or shape of the distribution.
To enable the researcher to draw charts and graphs for the presentation of data.

2.2 Types of Graph
The purpose of graphs in statistics is to convey the data to the viewer in pictorial form. Graphs are useful in getting the audience's attention in a publication or a presentation.

A. Histogram, Frequency Polygon, Ogive
• Histogram – A graph that displays the data by using vertical bars of various heights to represent the frequencies.
• Frequency Polygon – A graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
• Ogive (Cumulative Frequency Graph) – A graph that represents the cumulative frequencies for the classes in a frequency distribution.

Procedure to construct Histogram, Frequency Polygon & Ogive
• STEP 1 Draw and label the x and y axes.
• STEP 2 Choose a suitable scale for the frequencies or cumulative frequencies, and label it on the y axis.
• STEP 3 Represent the class boundaries for the histogram or ogive, or the midpoints for the frequency polygon, on the x axis.
• STEP 4 Plot the points and then draw the bars or lines.

Example
These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes. Then, construct a histogram, frequency polygon and ogive for these data.

112 110 107 116 120 100 118 112 108 113
127 117 114 110 120 120 116 115 121 117
134 118 118 113 105 118 122 117 120 110
105 114 118 119 118 110 114 122 111 112
109 105 106 104 114 112 109 110 111 114

Distribution Shapes

B. Pareto Chart
Used to represent a frequency distribution for a categorical variable. The frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.

Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is

A   O   B   A   AB
B   O   B   O   A
B   B   O   O   O
AB  AB  A   O   B
O   B   O   AB  A

Construct a Pareto chart for the data.

C.
Time Series Graph
• Represents data that occur over a specified period of time.
• STEP 1 Draw and label the x and y axes.
• STEP 2 Label the x axis for years and the y axis for the number of theaters.
• STEP 3 Plot each point according to the table.
• STEP 4 Draw line segments connecting adjacent points. Do not try to fit a smooth curve through the data points.
• We look for a trend or pattern that occurs over the time period (ascending, descending) and at the slope or steepness of the line (how fast the increase or decrease is).
• Two data sets can be compared on the same graph (compound time series graph).

Example
In 1958, there were more than 4000 outdoor drive-in theaters. The number of these theaters has changed over the years. Draw a time series graph for the data and summarize the findings.

Year     1988   1990   1992   1994   1996   1998   2000
Number   1497    910    870    859    826    750    637

D. Pie Chart
A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution. The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the sizes of the sectors. Percentages or proportions can be used. The variable is nominal or categorical.

Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is

A   O   B   A   AB
B   O   B   O   A
B   B   O   O   O
AB  AB  A   O   B
O   B   O   AB  A

Construct a pie chart for the data.

Stem-and-Leaf Plots
• A stem-and-leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.
• It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphic form.

Stem | Leaf

Example
An insurance company researcher conducted a survey on the number of car thefts in a large city for a period of 30 days last summer. The raw data are shown below. Construct a stem and leaf plot.
52 58 75 79 57 65 62 77 56 59
51 53 51 66 55 68 63 78 50 53
67 65 69 66 69 57 73 72 75 55

Conclusions (2.1 & 2.2)
• Data can be organized in some meaningful way using frequency distributions. Once the frequency distribution is constructed, the representation of the data by graphs is a simple task.

2.3 Summary Statistics (Data Description)
• Statistical methods can be used to summarize data.
• Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.
• Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.
• Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values. The most common measures of position are percentiles, deciles, and quartiles.
• The measures of central tendency, variation, and position are part of what is called traditional statistics. This type of statistics is typically used to confirm conjectures about the data.

Measures of Central Tendency: Mean
The mean is the sum of the values divided by the total number of values.

Population Mean
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N}, \qquad N = \text{population size} \]

Sample Mean
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad n = \text{sample size} \]

Arithmetic Mean – Individual Data
Example 1
• Calculate the arithmetic mean for the following: 3, 5, 8, 12, 15

The Arithmetic Mean – Ungrouped Frequency Distribution
Example 2
• Number of defects in a sample of 50 products

No. of defects    0    1    2    3    4    5
No. of products   5    7   15   13    6    4

The Arithmetic Mean – Grouped Frequency Distribution
Example 3
• A radar speed recorder was set up on a stretch of road to which a legal speed limit was applied. The results are summarized in the table below:

Speed (mph)   No. of cars observed
15 – 20          5
20 – 25         39
25 – 30        112
30 – 35        295
35 – 40        242
40 – 45         89
45 – 50          8

Mean
• One computes the mean by using all the values of the data.
• The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as variance.
• The mean for the data set is unique, and not necessarily one of the data values.
• The mean cannot be computed for an open-ended frequency distribution.
• The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.

Measures of Central Tendency: Median
The median is the middle number of n ordered data values (smallest to largest).

If n is odd:
\[ \text{Median} = x_{\frac{n+1}{2}} \]

If n is even:
\[ \text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} \]

Median
• The median is used when one must find the center or middle value of a data set.
• The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.
• The median is used to find the average of an open-ended distribution.
• The median is affected less than the mean by extremely high or extremely low values.

The Median – Individual Data
Example 4
• The following data relate to the marks obtained by 15 students in a course
• Progress test 1: marks obtained
  30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72
• When the number of observations is even, there is no definite middle item.
• The median is then taken to be the average of the two middle items.

The Median – Locating the Median Graphically
Example 5
• Given below is the frequency distribution of marks obtained by 50 students in a certain college

Marks      No. of Students
10 – 20     3
20 – 30     7
30 – 40    10
40 – 50    20
50 – 60     7
60 – 70     3

The Median – Ungrouped Frequency Distribution
Example 6
• Tests for defects are carried out in a textile factory on a lot comprising 400 pieces of cloth.
The results of the tests are tabulated below

No. of faults per piece    0     1     2    3    4    5    6
No. of pieces             92   142    96   46   18    6    0

Measures of Central Tendency: Mode
The mode is the most commonly occurring value in a data series.
• The mode is used when the most typical case is desired.
• The mode is the easiest average to compute.
• The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.
• The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.

The Mode – Individual Data
Example 7
• Determine the mode from the following data:
• Marks obtained by 10 students
  10, 27, 24, 12, 27, 27, 20, 18, 15, 20

The Mode – Grouped Frequency Distribution
Example 8
• A client company of your firm is a horticultural shop selling a wide variety of products to its customers. The analysis of weekly sales of plants throughout the year is summarized in the following frequency distribution

Weekly sales of plants ($)   No. of weeks
1255 – 1280                    9
1280 – 1305                   19
1305 – 1330                   10
1330 – 1355                    8
1355 – 1380                    6

Measures of Central Tendency: Midrange
The midrange is a rough estimate of the middle and also a very rough estimate of the average. It can be affected by one extremely high or low value.

\[ MR = \frac{\text{lowest value} + \text{highest value}}{2} \]

Types of Distribution
• Symmetric
• Positively skewed or right-skewed
• Negatively skewed or left-skewed

Measures of Variation / Dispersion
• Used when a measure of central tendency alone is not informative or not needed (e.g., two data sets have the same mean)
• Gauge the variability that exists in a data set
• Help form a judgment about how well the average value depicts the data
• Show the extent of the scatter so that steps may be taken to control the existing variation

Measures of Variation / Dispersion: Range
The range is the difference between the highest value and the lowest value in a data set. The symbol R is used for the range.
R = highest value - lowest value

Measures of Variation / Dispersion: Variance
The variance is the average of the squares of the distance each value is from the mean.

Population Variance
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \qquad \mu = \frac{\sum_{i=1}^{N} x_i}{N}, \qquad N = \text{population size} \]

Sample Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad n = \text{sample size} \]

Standard Deviation
The standard deviation is the square root of the variance.
Population standard deviation: \( \sigma = \sqrt{\sigma^2} \)
Sample standard deviation: \( s = \sqrt{s^2} \)

Variance
• The variance is the average of the squared deviations from the arithmetic mean
• Calculation of Variance
• The following data relate to the marks obtained by 15 students in an Accounting examination
• 50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70

Standard Deviation
• Calculation of Standard Deviation – grouped frequency distribution
• The following data relate to the sales of electronic calculators in the South of England

Sales per week (thousand)   No. of weeks
 4 – 6                        2
 6 – 8                        5
 8 – 10                      12
10 – 12                       9
12 – 14                       3

Variance & Standard deviation
• Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.
• The measures of variance and standard deviation are used to determine the consistency of a variable.
• The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.
• The variance and standard deviation are used quite often in inferential statistics.

Measures of Position
Describe the position of a data value within the data set.

Quartile
\[ Q_i = L + C \left( \frac{\frac{i\,n}{4} - F}{f} \right), \qquad i = 1, 2, 3 \]
where;
L = Lower limit of the interval containing Q_i
C = Width of the interval containing Q_i
F = Cumulative frequency before the class containing Q_i
f = Frequency of the class containing Q_i

Quartile Deviation - Individual Data
• The following are the marks of 9 students in a certain examination.
Student No.   1    2    3    4    5    6    7    8    9
Marks        20   28   40   12   30   15   50   45   60

Quartile Deviation Example – Grouped Frequency Distribution
• The following grouped frequency table describes the weight of 95 packages selected for a QC test.

Weight (grams)   No. of Packages
450 – 452         11
452 – 454         26
454 – 456         34
456 – 458         24

The measures of central tendency, variation, and position for grouped data

Measures of central tendency

Mean
\[ \bar{x} = \frac{\sum f_i x_i}{N} \]
where;
f_i = frequency
x_i = class midpoint
N = total number of observations (sum of the frequencies)

Median
\[ \text{Median} = L + C \left( \frac{\frac{n}{2} - F}{f} \right) \]
where;
L = Lower limit of the interval containing the median
C = Width of the interval containing the median
F = Cumulative frequency before the median class
f = Frequency of the median class

Mode
\[ \text{Mode} = L + C \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \]
where;
L = Lower limit of the interval containing the mode
C = Width of the interval containing the mode
\Delta_1 = Frequency of the modal class - frequency of the class before the modal class
\Delta_2 = Frequency of the modal class - frequency of the class after the modal class

Measures of variation

Population variance
\[ \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2 - \frac{\left( \sum f_i x_i \right)^2}{N}}{N} \]
where;
f_i = frequency
x_i = class midpoint
N = population size
\mu = mean of the grouped data

Sample variance
\[ s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum f_i x_i^2 - \frac{\left( \sum f_i x_i \right)^2}{n}}{n - 1} \]
where;
f_i = frequency
x_i = class midpoint
n = sample size
\bar{x} = mean of the grouped data

Conclusions
• By combining all of the techniques discussed in this chapter, the student is now able to collect, organize, summarize and present data.
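As a closing worked check, the grouped-data formulas for the mean, median, quartiles, mode and sample variance can be applied to the package-weight table (95 packages) with a short Python sketch. The names `classes` and `grouped_quantile` are made up for this illustration, and the code assumes the 95 packages are a sample, so it divides by n - 1.

```python
from itertools import accumulate
from math import sqrt

# (lower limit, upper limit, frequency) for the package weights in grams
classes = [(450, 452, 11), (452, 454, 26), (454, 456, 34), (456, 458, 24)]

freqs = [f for _, _, f in classes]
mids = [(lo + hi) / 2 for lo, hi, _ in classes]   # class midpoints x_i
n = sum(freqs)                                    # total frequency

# Mean: sum(f_i * x_i) / n
sum_fx = sum(f * x for f, x in zip(freqs, mids))
mean = sum_fx / n


def grouped_quantile(position):
    """L + C * (position - F) / f for the class that contains `position`."""
    cumulative = list(accumulate(freqs))
    for k, cum in enumerate(cumulative):
        if cum >= position:
            lower, upper, f = classes[k]
            before = cumulative[k - 1] if k > 0 else 0   # F
            return lower + (upper - lower) * (position - before) / f
    raise ValueError("position exceeds the total frequency")


median = grouped_quantile(n / 2)
q1 = grouped_quantile(n / 4)
q3 = grouped_quantile(3 * n / 4)

# Mode: L + C * d1 / (d1 + d2), using the class with the highest frequency
m = freqs.index(max(freqs))
d1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
d2 = freqs[m] - (freqs[m + 1] if m < len(freqs) - 1 else 0)
lower, upper, _ = classes[m]
mode = lower + (upper - lower) * d1 / (d1 + d2)

# Sample variance (computational form) and standard deviation
sum_fx2 = sum(f * x * x for f, x in zip(freqs, mids))
sample_var = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sample_sd = sqrt(sample_var)

print(f"mean = {mean:.2f}, median = {median:.2f}, mode = {mode:.2f}")
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}")
print(f"s^2 = {sample_var:.2f}, s = {sample_sd:.2f}")
```

For this table the sketch gives a mean of about 454.49 g, a median of about 454.62 g, a mode of about 454.89 g, Q1 of about 452.98 g, Q3 of about 456.02 g, and a sample variance of about 3.74.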