Chapter 2: Frequency Distributions and Graphs
Dr. Janet Winter, jmw11@psu.edu, Stat 200

Learning Objectives
Upon completion of Chapter 2, you will be able to:
– Organize the data into a table or chart (called a frequency distribution)
– Construct a graph from the chart

I. Basic Vocabulary
• Raw data is data in its original form.
• A frequency distribution is the organization of raw data into a table, with the categories for the data in one column and the frequency of each category in the second column.
• Frequency (f) is the tally or count of the number of data values in each class.
• Relative frequency (f/n) is the frequency of a class divided by the total number of data values (n).
• Cumulative frequency is the frequency of a class plus the frequencies of all lower classes.
• Cumulative relative frequency is the cumulative frequency divided by the total number of data values.

II. Frequency Distributions

A. Types of Frequency Distributions
   I. Qualitative Data:
      • Categorical frequency distribution is a two-column chart with a list of all possible attributes or categories for the data in the first column and the count of the data values in each category in the second column.
   II. Quantitative Data:
      a) Grouped frequency distribution (for data with a large range) is a chart with classes several units wide in the first column and the count of the data values falling in each class in the second column.
      b) Ungrouped frequency distribution (for data with a small range) is a chart of each possible individual value of the data in the first column and the count of the data values equal to that value in the second column.

B. Examples of Frequency Distributions
   I. Qualitative or Categorical Frequency Distributions
      • Create a table with gender (Male/Female) in the first column and the count of the number of men and women in the class in the second column.
      • Create a table with level of employment (none, part time, full time) in the first column and the count of the number of students in the class in each category in the second column.
   II. Ungrouped Quantitative Frequency Distributions
      • In the first column, list the numbers 0, 1, 2, 3, 4… representing the number of keys a student is carrying. In the second column, list the count of the number of students with that many keys.
      • In the first column, list the numbers 0, 1, 2, 3, 4… representing the number of cars in your family. In the second column, list the count of the number of students with that many cars in their family.

C. Why Construct Frequency Distributions?
• To organize the data for interpretation
• To compare different data sets
• To simplify the computation of measures of average and spread
• To determine the shape of the distribution
• To draw charts and graphs for data

III. Grouped Frequency Distributions
For data with a "large" range, place the data in groups or classes that are several units in width.

A. Terms for Grouped Frequency Distributions
• The lower class limit represents the smallest data value that can be included in the class.
• The upper class limit represents the largest data value that can be included in the class.
• Range (R): largest data value minus the smallest data value.
• Class boundaries are the numbers used to separate classes without the gaps created by class limits.

B. Characteristics of Classes or Groups
• There should be between 5 and 20 classes.
• The class width should be an odd number. (Suggested by Bluman)
• The classes must be mutually exclusive.
• The classes must be continuous.
• The classes must be equal width.
• The classes must be exhaustive, that is, able to include all data.

C. Finding the Class Midpoint
The class midpoint is the average of either:
a) The 2 class boundaries for each individual class, OR
b) The 2 class limits for each individual class
Both give the same value.

D. Finding the Class Width
There are several ways to find the class width (all with the same answer). The class width is either:
a) The difference between 2 sequential lower class limits (2 different classes),
b) The difference between 2 sequential upper class limits (2 different classes), OR
c) The difference between the lower and upper boundaries of the same class
Note: The class width is constant throughout the frequency distribution.

E. Procedure to Find the Class Limits from Data
The process to find class limits from data is as follows:
1. Find the range. Range = maximum value - minimum value
2. To find the class width, divide the range by the number of classes and round up to the next whole odd number. The width has the same number of decimal places as the data.
3. Select the lowest data value as the starting point or lowest class limit.
4. Add the width to find the next lower class limit.
5. Upper limits are 1 unit less than the next class's lower limit.
6. Continue this process until an upper class limit is greater than or equal to the highest data value, so that every data value falls in a class.
Note: The last class should not be empty; that is, its frequency should not equal zero.

F. Suggested Number of Classes Based on Sample Size

   Sample Size (n)     Suggested Number of Classes
   Less than 16        Not enough information
   16 - 31             5
   32 - 63             6
   64 - 127            7
   128 - 255           8
   256 - 511           9
   512 - 1023          10

G. Finding Class Boundaries from Class Limits
Use class limits to find class boundaries:
I. Find the class limits (same number of decimal places as the data).
II. Find the upper class boundaries by adding ½ unit to the upper class limit of each class.
III. Find the lower class boundaries by subtracting ½ unit from the lower class limit of each class.

H. Decimal Place Rule
Class limits have the same number of decimal places as the data, but class boundaries have one more decimal place than the data and end in a "5".

IV. Graphs

A. The Role of Graphs
• Presents the data in pictorial form.
• Attracts attention in a publication or a presentation.

B. Types of Graphs
• Bar graph – graph of the frequency distribution for qualitative or categorical data.
• Histogram – graph of the frequency distribution for quantitative data.
• Ogive – graph of the cumulative frequency for quantitative data.
• Frequency polygon – graph of the frequency for quantitative data.

C. Histogram
Scale: class boundaries or class midpoints
Vertical (or horizontal) bars are proportional to the frequencies for each class.

   Class Boundaries     Frequency
   0.5 – 20.5           4
   20.5 – 40.5          9
   40.5 – 60.5          20
   60.5 – 80.5          40
   80.5 – 100.5         24

Note: The scale on the non-frequency axis is either the class boundaries or the class midpoints.
• Class midpoints are located in the middle of the bars and class boundaries are located at the ends of the bars.

D. Frequency Polygon
Scale: class midpoints
• Plot the frequency of each class at its midpoint, i.e., (class midpoint, class frequency).
• The scale is sequential midpoints.
• Extend the midpoint scale once below the first class midpoint and once above the last class midpoint. Label the extensions.
• Plot a point at each extension with a frequency of zero (extension, 0).
• Connect all of the points with line segments, forming a polygon.
Note: Remember that a polygon is a many-sided closed figure. The extension points and the axis make the figure closed.

Things to Remember About Frequency Polygons
• The spacing of the scale is the difference between two sequential class midpoints.
• Extend the scale and graph once above the largest class midpoint and once below the smallest class midpoint.
• Use a frequency of zero with both extensions.

   Class             Midpoint     Frequency
   0.5 – 20.5        10.5         4
   20.5 – 40.5       30.5         9
   40.5 – 60.5       50.5         20
   60.5 – 80.5       70.5         40
   80.5 – 100.5      90.5         24

E. Ogive or Cumulative Frequency Graph
Scale: class boundaries
Start at the lowest class boundary with a frequency of zero (lowest lower boundary, 0), then plot the cumulative frequency at the upper class boundary of each class. End at the highest upper boundary (highest upper boundary, n).
Note: The cumulative frequency at each upper boundary is the frequency of that class plus the frequencies of all lower classes.

   Class             Frequency     Cumulative Frequency
   0.5 – 20.5        4             4
   20.5 – 40.5       9             13
   40.5 – 60.5       20            33
   60.5 – 80.5       40            73
   80.5 – 100.5      24            97

[Figure: ogive. x-axis: Scores for final exam (class boundaries 0.5 to 100.5); y-axis: Number of Students (0 to 100)]

Note: The line segments connect at (0.5, 0), (20.5, 4), (40.5, 13), (60.5, 33), (80.5, 73), (100.5, 97), which are the (lowest lower boundary, 0), (first upper boundary, cumulative frequency through the first class), (second upper boundary, cumulative frequency through the second class), … (last upper boundary, total frequency).

F. Relative Frequency Graphs
A relative frequency graph uses the frequencies divided by the total of all frequencies instead of the frequencies themselves. Use it with any graph when proportions are more meaningful than the actual count or frequency.

G. Other Graphs of Interest
   I. A dot plot is a graph with a point or dot for each data value above a scaled horizontal line.
   II. A Pareto chart is a bar graph (for categorical data) with the categories arranged from the highest to the lowest frequency.

   [Figure: Pareto chart, "How People Get to Work". Categories: Auto, Bus, Trolley, Train, Walk; y-axis: Frequency (0 to 30)]

   III. A time series graph is used for data that occur over a specific period of time. Time is on the x-axis and the measured quantity is on the y-axis; the points (time, quantity) are connected with line segments.

   [Figure: time series graph, "Temperature Over a 5-hour Period". x-axis: Time (12, 1, 2, 3, 4, 5); y-axis: Temperature (35 to 55)]

   IV. A pie graph is a circle divided into sections proportional to the percentage in each category.
   Favorite American Snacks

      Snack              Percentage
      Snack Nuts         8%
      Popcorn            13%
      Pretzels           14%
      Potato Chips       38%
      Tortilla Chips     27%

   Note: The angle (in degrees) of a segment is the relative frequency of the segment times 360°.

   V. A stem-and-leaf plot
      • Use for quantitative data
      • Vertically ordered list of the left part of the data digits (the stem)
      • The rightmost digit of each data value (the leaf) listed horizontally and sequentially to the right
      • Retains the actual data while showing it in graphic form

      a) Process:
         1. Split the digits of each number into the rightmost digit, called the leaf, and the remaining digits to the left, called the stem
         2. List all possible stem values once in increasing order
         3. Draw a vertical line to the right of the stems
         4. List the leaves sequentially and horizontally to the right of the vertical line, next to their respective stems, as often as they occur
         Note: A stem value is listed once, while a leaf is listed as often as it occurs in the data

      b) Example:
         Data: 123 125 131 113 101 102 104 111 114 111 132 133 141 142 143 132
         Stem Plot:
            10 | 1 2 4
            11 | 1 1 3 4
            12 | 3 5
            13 | 1 2 2 3
            14 | 1 2 3

      c) Other types of stem plots:
         Split stem-and-leaf
         • Each stem value is recorded twice
         • The first line is for trailing digits 0-4
         • The second line is for trailing digits 5-9
         Back-to-back stem-and-leaf
         • Separate the data into two categories by listing the leaves for one category to the left of the stem and the leaves for the other category to the right of the stem

V. Chapter Review Questions
1. One of the early steps a researcher must do when conducting a statistical study is to
   a) gather and collect data.
   b) use a computer or a calculator to analyze the data.
   c) draw conclusions from the data.
2. A statistics professor gives a very easy 100 point test, with the highest score being 98 and the lowest score being 71. We want to divide this data into categories. Then, a reasonable width of categories could be
   a) 1
   b) 5
   c) 10
3. The manager of a computer store wishes to track how many computer monitors of different screen sizes are sold during the week. He tallies the sales by the following categories: less than 15”, 15-15.9”, 16-16.9”, 17-17.9”, 18-18.9”, 19-19.9”, and 20” and above. The best way to represent the data is using a
   a) Histogram.
   b) Frequency polygon.
   c) Ogive.
   d) All of the above.
4. Which presents more information, a frequency polygon or an ogive?
   a) The frequency polygon presents more information.
   b) The ogive presents more information.
   c) They have equal amounts of information.
5. If we would like to display the areas of all the states in the United States and we only care about the states with the largest areas, then an appropriate graph would be a
   a) Pareto chart.
   b) Time series graph.
   c) Pie graph.
6. The dean of engineering at a school wishes to track the number of students with engineering majors over the past 10 years. An appropriate graph would be a
   a) Pareto chart.
   b) Time series graph.
   c) Pie graph.

VI. Summary
• Histograms, frequency polygons and ogives are used for quantitative data in a grouped frequency distribution.
• Pareto charts and bar graphs are frequency graphs for qualitative variables.
• Time series graphs are used to show a pattern or trend that occurs over time.
• Pie graphs are used to show the relationship between the parts and the whole for qualitative or categorical data.
• Data can be organized in meaningful ways using frequency distributions and graphs.

VII. ANSWERS: Chapter Review Questions
1. One of the early steps a researcher must do when conducting a statistical study is to
   a) gather and collect data.
2. A statistics professor gives a very easy 100 point test, with the highest score being 98 and the lowest score being 71. We want to divide this data into categories. Then, a reasonable width of categories could be
   b) 5
3. The manager of a computer store wishes to track how many computer monitors of different screen sizes are sold during the week. He tallies the sales by the following categories: less than 15”, 15-15.9”, 16-16.9”, 17-17.9”, 18-18.9”, 19-19.9”, and 20” and above. The best way to represent the data is using a
   d) All of the above.
4. Which presents more information, a frequency polygon or an ogive?
   c) They have equal amounts of information.
5. If we would like to display the areas of the states in the United States and we only care about the states with the largest areas, then an appropriate graph would be a
   a) Pareto chart.
6. The dean of engineering at a school wishes to track the number of students with engineering majors over the past 10 years. An appropriate graph would be a
   b) Time series graph.
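
Worked sketch (not part of the original notes): the procedure for finding class limits (III.E), the class boundary and midpoint rules (III.C and III.G), and the cumulative frequencies used for an ogive (IV.E) can be tried out in a few lines of Python. This is a minimal sketch under stated assumptions: the data are whole numbers (so 1 unit = 1 and ½ unit = 0.5), "round up to the next whole odd number" is read as "round up to a whole number, then add 1 if the result is even", and the function names are hypothetical helpers chosen for illustration. The data are the 16 values from the stem-and-leaf example, with 5 classes as suggested for n = 16.

```python
import math
from itertools import accumulate


def class_limits(data, num_classes):
    """Follow steps 1-6 of III.E for whole-number data (1 unit = 1)."""
    lowest, highest = min(data), max(data)
    data_range = highest - lowest                 # step 1: range
    width = math.ceil(data_range / num_classes)   # step 2: round up
    if width % 2 == 0:                            # assumption: even widths go
        width += 1                                # up to the next odd number
    limits = []
    lower = lowest                                # step 3: start at the minimum
    while True:
        upper = lower + width - 1                 # step 5: 1 unit below next lower limit
        limits.append((lower, upper))
        if upper >= highest:                      # step 6: stop once the maximum is covered
            break
        lower += width                            # step 4: next lower class limit
    return width, limits


def class_boundaries(limits):
    """Subtract 1/2 unit from lower limits, add 1/2 unit to upper limits."""
    return [(lo - 0.5, hi + 0.5) for lo, hi in limits]


def class_midpoints(limits):
    """Average of the two class limits of each class."""
    return [(lo + hi) / 2 for lo, hi in limits]


def frequencies(data, limits):
    """Count the data values that fall inside each class."""
    return [sum(lo <= x <= hi for x in data) for lo, hi in limits]


def cumulative(freqs):
    """Running total of the class frequencies (the ogive heights)."""
    return list(accumulate(freqs))


# Data from the stem-and-leaf example (n = 16, so 5 classes are suggested).
data = [123, 125, 131, 113, 101, 102, 104, 111,
        114, 111, 132, 133, 141, 142, 143, 132]

width, limits = class_limits(data, 5)
freqs = frequencies(data, limits)
rows = zip(limits, class_boundaries(limits), class_midpoints(limits),
           freqs, cumulative(freqs))
print("limits     boundaries       midpoint  f  cum f")
for (lo, hi), (lb, ub), mid, f, cf in rows:
    print(f"{lo}-{hi}    {lb}-{ub}    {mid:7.1f}  {f}  {cf:4d}")
```

With these data the range is 143 - 101 = 42, so 42/5 = 8.4 rounds up to a width of 9 and the classes are 101-109, 110-118, 119-127, 128-136 and 137-145. Calling cumulative([4, 9, 20, 40, 24]) on the exam-score frequencies from IV.E gives the cumulative frequency column of the ogive table.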