M227 Chapter 2 Frequency Distributions and Graphs Sections 1,2 OBJECTIVES Organize data using frequency distributions. Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives. Represent data using Pareto charts, time series graphs, and pie graphs. Draw and interpret a stem and leaf plot. Draw and Interpret a scatter plot for a set of paired data. This chapter will show how to organize data and then construct appropriate graphs to represent the data in a concise, easy-to-understand form. ORGANIZING DATA When data are collected in original form, they are called raw data. A frequency distribution is the organization of raw data in table form, using classes and frequencies. The most common distributions are o categorical frequency distributions o ungrouped frequency distributions o grouped distributions PRESENTING DATA Organized data is presented in pictorial form, called Graphs and Charts, to facilitate its analysis and interpretation. The most common graphs and charts are: o Histograms o Frequency Polygons o Cumulative Frequencies or Ogives o Pareto Charts o Time Series and Compound Time Series Graphs o Pie Graphs o Stem and Leaf Plots o Scatter Plots Sections 1,2 1 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Sections 1,2 A. Categorical Frequency Distributions Categorical frequency distributions are used when data can be placed in specific categories, such as nominal or ordinal level data: for example political affiliations, major field of study, or color will use categorical frequency distributions. Example: The data below (raw data) represent the blood types of 25 people: A B B AB O O O B AB B B B O A O A O O O AB AB A O B A The four categories are: A, B, O, and AB. We count how many concurrencies each category has, and we form the following frequency table: Class A B O AB Tally Frequency Cumulative Frequency Percent Cumulative Percent 5 7 9 4 25 5 12 21 25 20 28 36 16 100 20 48 84 100 ///// ///// // ///// //// //// Table 2.1 B. Ungrouped Frequency Distributions Ungrouped Frequency Distributions are used when we have few distinct numerical data to organize. Example: The data below (raw form) represents the number of incoming calls per day over the first 20 days of a month: 4 4 1 2 5 6 4 6 6 3 3 2 4 6 4 7 We just count how many times each distinct value has occurred: Class limits 1 2 3 4 5 6 7 Tally 3 1 1 1 Frequency //// // /// ///// / //// / 4 2 3 5 1 4 x 1x 20 Table 2.2 Sections 1,2 2 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Sections 1,2 C. Grouped Frequency Distributions Grouped Frequency Distributions are used when we need to organize large amounts of numerical data. Example: The data below (raw form) represents the miles that the 50 employees of a company travel to work every day. 1 17 5 8 2 4 2 8 6 17 9 4 7 1 11 14 12 14 16 7 13 5 1 3 2 4 9 2 6 16 2 6 9 4 10 5 5 11 18 8 4 7 6 10 3 5 9 15 18 18 We organize these data by grouping them into subintervals, called classes: 1-3 miles, 4-6 miles, 7-9 miles, 10-12 miles, 13-15 miles, 16-18 miles, where 1 is the lowest data value and 18 is the highest. Then, we count how many values fall in each class; this number is called the frequency of the class: Class limits 1-3 4-6 7-9 10-12 13-15 16-18 Tally Frequency ///// ///// ///// ///// //// ///// ///// ///// //// ///// // 10 14 10 5 4 7 ___ 50 Table 2.3 Attributes of Frequency Distributions 1. Classes: Grouping of data into subdivisions of fixed width 2. Lower Class Limit: the smallest number of each class: numbers 1, 4, 7, 10, 13, 16 of the classes in Table 1. 3. Upper Class Limit: the largest number of each class: number 3, 6, 9, 12, 15, 18 of the classes in Table 1. 4. Class Width for a class is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class. 5. Class Boundaries: close the gaps between classes. For example classes 1-3 and 4-6 have a gap between 3 and 4. In this case we define the boundary of each class to be .5 less than the lower limit and .5 more than the upper limit: .5 - 3.5, 3.5 - 6.5, … See Table 1A below. It means all values greater or equal to .5, 3.5, … and less than 3.5, 6.5, … In general, the class boundaries have an additional place value than the class limits and they end with a 5. Class limit: 3 – 6, Boundary: 2.5 – 6.5 Class limit: 3.2 – 6.2 Boundary: 3.15 – 6.25 Class limit: 3.46 – 6.46, Boundary: 3.455 - 6.465 Sections 1,2 3 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Sections 1,2 lower class lim it + upper class lim it 2 7. Cumulative Frequency: the sum of the frequencies accumulated up to the upper boundary of a class. 6. Class Mark: The midpoint of each class: X m = 8. Relative Frequency: the frequency of each class divided by the sum of all the f frequencies: Relative frequency = . ∑f 9. Percent Frequency: the Relative Frequency expressed as a percent: f ∑f i100 Relative Class Limit Class Class Mark Boundaries (Midpoint) Frequency Cumulative Frequency Percent Frequency Cumulative Frequency Frequency Cumulative Frequency 1-3 0.5-3.5 2 10 10 0.20 0.20 20% 20 4-6 3.5-6.5 5 14 24 0.28 0.48 28% 48 7-9 6.5-9.5 8 10 34 0.20 0.68 20% 68 10 - 12 9.5-2.5 11 5 39 0.10 0.78 10% 78 13 - 15 12.5-15.5 14 4 43 0.08 0.86 8% 86 16 - 18 15.5-18.5 17 7 50 0.14 1.00 14% 100 50 1.00 100% Table 2.4 Rules to Construct Frequency Distributions 1. There should be between 5 and 20 classes 2. The class width should be an odd number. This insures that the midpoint has the same H −L place value as the data. Ex: Width = ; Round-up result. number of classes 3. The classes must be mutually exclusive – nonoverlapping class limits 4. The classes must be continuous. 5. The classes must be exhaustive – enough classes to accommodate all the data. 6. The classes must be equal in width (exception: “open-ended distributions”, no specific beginning value or no specific ending value.) Sections 1,2 4 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Sections 1,2 Procedure for Constructing a Quantitative Frequency Distribution Step 1: Determine the Classes o Find the lowest and highest value o Find the range o Select the number of classes desired (often given) o Find the width: Divide the range by the number of classes and round UP. o Select starting point (normally the lowest data value) o Set lowest class limit equal to the starting point. o Calculate lower limits for all classes. o Calculate upper limits for all classes. o Find the boundaries. o Calculate the midpoints . Step 1: Tally the data Step 3: Find the numerical frequencies. Step 4: Find cumulative frequencies. Step 5: Find relative (and / or percent) frequencies. Step 6: Find cumulative relative and/or percent frequencies. Reasons for constructing frequency distributions To organize the data in a meaningful, intelligible way. To enable the reader to make comparisons among different data sets. To facilitate computational procedures for measures of average and spread. To enable the reader to determine the nature or shape of the distribution To enable the researcher to draw charts and graphs for the presentation of data. Sections 1,2 5 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Section 2-3 Histograms, Frequency Polygons, and Ogives The purpose of graphs in statistics is to convey the data to the viewer in pictorial form. Graphs are useful in getting the audience’s attention in a publication or a presentation. The three most common graphs: 1. Histogram: displays the data by using vertical bars of various heights to represent the frequencies. The class boundaries are represented on the x-axis and the frequencies on the y-axis. Histogram Histogram 16 Frequencies 12 10 10 Frequencies 14 14 10 7 8 5 6 4 4 16 14 12 10 8 6 4 2 0 14 10 7 5 2 0 0.5 3.5 6.5 9.5 12.5 15.5 5 0. 18.5 10 . -3 5 5 3. . -6 5 5 6. Boundaries 0.20 0.28 0.20 0.20 0.14 0.15 0.10 0.10 0.08 0.05 0.00 5 3. 50. 5 6. 53. 5 9. 5 2. -1 .5 .5 15 18 55. . 12 15 Histogram - Percent Frequencies Percent Frequencies Relative Frequencies 0.25 5 Boundaries Histogram - Relative Frequencies 0.30 . -9 4 5 5 5 .5 9. 8. 5. 12 -1 -1 55.5 .5 6. 5 2 9. 1 1 28% 30% 20% 14% 10% 10% 8% 0% 5 0. Boundaries 20% 20% . -3 5 5 3. . -6 5 5 6. . -9 5 5 5 5 5. 8. 2. -1 -1 -1 5 5 5 . . 9. 12 15 Boundaries Relative frequencies and percent frequencies histograms are constructed the same way as the frequency histograms, except that instead of frequencies for the data values, we use the relative or percent frequencies. Section 2-3 6 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Section 2-3 2. Frequency polygon: displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. Frequencies are represented on the y-axis. Frequency Polygon Relative Frequency Polygon Relative Frequrncies Frequrncies 16 14 12 10 8 6 4 2 0 0.30 0.25 0.20 0.15 0.10 0.05 0.00 -1 2 5 8 11 14 17 20 -1 2 5 8 Midpoints 11 14 17 20 Midpoints Relative frequency and percent frequency polygons are constructed the same way as frequency polygons, except that we use the relative or percent frequencies in place of the frequencies. 3. Cumulative frequency or Ogive: represents the cumulative frequencies for the classes in a frequency distribution. Class boundaries are located on the x-axis and frequencies on the y-axis. Realtive Ogive 60 1.2 Relative Cum Freq Cumulative Frequrncies Ogive 50 40 30 20 10 0 1 0.8 0.6 0.4 0.2 0 0.5 3.5 6.5 9.5 12.5 15.5 18.5 0.5 Boundaries 3.5 6.5 9.5 12.5 15.5 18.5 Boundaries Realtive Ogive Relative Cum Freq 120% 100% 80% 60% 40% 20% 0% 0.5 3.5 6.5 9.5 12.5 15.5 18.5 Boundaries Relative Ogives can be constructed by replacing the frequency values with the relative (or percent) values. Ogive charts are useful in answering questions such as how many or what percent of values are below a certain upper class boundary. Section 2-3 7 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Section 2-4 Other Types of Graphs • • • • Pareto Time Series Pie Stem and Leaf Plots 1. Pareto chart: is used to represent a frequency distribution for categorical variable; the frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest. (Graph below from data in table 2-1 above) Pareto Chart of Blood Types Frequencies 10 9 8 7 6 5 4 4 2 0 O B A AB Blood Types 2. Pie graph: is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution Each wedge has an angle that is proportional to the ratio of its frequency to the sum of all the frequencies of the distribution: Pie AB 16% - 57.60 O 36% - 129.60 A 20% - 720 B 28% - 100.80 Section 2-4 8 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Degree = Section 2-4 f ⋅ 360 , where f is the frequency and n = sum of all the frequencies. n Percentages are calculated as before: % = f ⋅ 100% n Class Frequency Percent Angle O B A AB 9 7 5 4 25 36 28 20 16 100 129.60 100.80 72.00 57.60 3600 3. A stem-and-leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes. It has the advantage over grouped frequency distribution of retaining the actual data while showing them in graphic form Example: Consider the following data from example 2-12: 25 31 32 44 20 32 32 52 13 44 14 51 43 45 02 57 23 36 32 33 A. Arrange the data in ascending order: 02,13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44, 44, 45, 51, 52, 57 B. Separate the data according to the first digit: 02 13,14 20,23,25 31,32,32,32,32,33,36 43,44,44,45 51,52,57 C. Use the first digit as the stem and the trailing digits as the leaf: 0 1 2 3 4 5 2 3 0 1 3 1 4 3 2 4 2 5 2 2 2 3 6 4 5 7 3. Time series graph. It is a line graph that represents data occurring over a specific period of time For example, the data below represents the number (in millions) of vehicles sold in California during the last 7 years: 2000 2001 2002 2003 2004 2005 2006 Section 2-4 1.05 0.95 1.45 1.62 1.63 1.96 2.22 9 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Compound Time Series 2.5 2.5 2 2 New Vehicles New Vehicles Time Series Section 2-4 1.5 1 0.5 1.5 1 0.5 0 0 2000 2001 2002 2003 2004 2005 2006 Year 2000 2001 2002 2003 2004 2005 2006 Year Time series graphs are very useful when two or more sets of values are plotted for the same period of time. For example, assume that we have another set of values for the years above, representing the number of vehicles “retired” for each year. 2000 2001 2002 2003 2004 2005 2006 1.05 0.95 1.45 1.62 1.63 1.96 2.22 0.20 0.15 0.21 0.41 0.51 0.68 0.75 We graph both data values with the same x-axis, as shown above. A trend can be discerned from this graph that there is an increase over the years of the number of “active” vehicles. Section 2-4 10 2/6/2007 M227 Chapter 2 Frequency Distributions and Graphs Section 2-5 4. Scatter Plots Many times researchers are interested in determining if a relationship between two variables exist. To do this, the researcher collects data consisting of two measures that are paired with each other. The variable first mentioned is called the independent variable; the second variable is the dependent variable. A scatter plot is a graph of ordered pairs of data values that is used to determine if a relationship exists between the two variables. Typically, the independent variable is plotted on the x-axis and the dependent variable is plotted on the y-axis. Example: 2-15 No. of Accidents, x No. of fatalities, y 376 5 650 20 884 20 1162 28 1513 26 165 34 2236 35 3002 56 4028 68 Scatter Plot 80 70 # Fatalities 60 50 40 30 20 10 0 0 1000 2000 3000 4000 5000 # Accidents Analyzing a Scatter Plot A positive linear relationship exists when the points fall approximately in an ascending straight line and both the x and y values increase at the same time. A negative linear relationship exists when the points fall approximately in a straight line descending from left to right. A nonlinear relationship exists when the points fall along a curve. No relationship exists when there is no discernable pattern of the points. Section 2-5 11 2/6/2007 4010 55 M227 Chapter 2 Frequency Distributions and Graphs Section 2-5 SUMMARY Histograms, frequency polygons, and ogives are used when the data are contained in a grouped frequency distribution. Pareto charts are used to show frequencies for nominal variables Time series graphs are used to show a pattern or trend that occurs over time. Pie graphs are used to show the relationship between the parts and the whole. When data are collected in pairs, the relationship, if one exists, can be determined by looking at a scatter plot Section 2-5 12 2/6/2007