Econ 5
Introduction to Statistics
Lectures on Chapter 2

Asatar Bair, Ph.D.
Department of Economics
City College of San Francisco
abair@ccsf.edu

DESCRIPTIVE STATISTICS: Summarizing Qualitative Data
• Frequency Distribution
• Relative Frequency
• Percent Frequency Distribution
• Bar Graph
• Pie Chart

Frequency Distribution
• A frequency distribution is a tabular summary of a set of data showing the frequency (or number) of items in each of several non-overlapping classes.
• The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.

Example: Marada Inn
Guests staying at Marada Inn were asked to rate the quality of their accommodations. The ratings provided by a sample of 20 guests are shown below.

Below Average   Above Average   Above Average   Average         Above Average
Average         Above Average   Average         Above Average   Below Average
Poor            Excellent       Above Average   Average         Above Average
Above Average   Below Average   Poor            Above Average   Average

Example: Marada Inn
Frequency Distribution

Quality rating   Frequency
Poor                 2
Below average        3
Average              5
Above average        9
Excellent            1
Total               20

Relative Frequency and Percent Frequency Distributions
• The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
• A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.
• The percent frequency of a class is the relative frequency multiplied by 100.
• A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.
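The definitions above can be checked against the Marada Inn sample with a few lines of Python. This is only a sketch: the variable names (`ratings`, `classes`, `frequency`, `relative`, `percent`) are illustrative choices, not something the course materials prescribe.

```python
from collections import Counter

# The 20 guest ratings from the Marada Inn example, in the order listed.
ratings = [
    "Below Average", "Above Average", "Above Average", "Average", "Above Average",
    "Average", "Above Average", "Average", "Above Average", "Below Average",
    "Poor", "Excellent", "Above Average", "Average", "Above Average",
    "Above Average", "Below Average", "Poor", "Above Average", "Average",
]

# Classes listed from worst to best so the table prints in a sensible order.
classes = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

frequency = Counter(ratings)   # frequency: number of items in each class
total = len(ratings)

# Relative frequency = class frequency / total number of items.
relative = {c: frequency[c] / total for c in classes}
# Percent frequency = relative frequency * 100 (computed as count * 100 / total).
percent = {c: frequency[c] * 100 / total for c in classes}

print(f"{'Quality rating':<15}{'Freq':>6}{'Rel.':>8}{'Pct.':>7}")
for c in classes:
    print(f"{c:<15}{frequency[c]:>6}{relative[c]:>8.2f}{percent[c]:>7.0f}")
```

The relative frequencies should sum to 1 and the percent frequencies to 100, which is a quick sanity check on any frequency table.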
Example: Marada Inn
Relative Frequency and Percent Frequency Distributions

Quality rating   Relative Frequency   Percent Frequency
Poor                   0.10                  10
Below average          0.15                  15
Average                0.25                  25
Above average          0.45                  45
Excellent              0.05                   5
Total                  1.00                 100

Bar Graph
A bar graph is a graphical device for depicting qualitative data that have been summarized in a frequency, relative frequency, or percent frequency distribution.
On the horizontal axis we specify the labels that are used for each of the classes.
A frequency, relative frequency, or percent frequency scale can be used for the vertical axis.
Using a bar of fixed width drawn above each class label, we extend the height appropriately.
The bars are separated to emphasize the fact that each class is a separate category.

Example: Marada Inn
[Figure: bar graph of the Marada Inn ratings, drawn tall and thin. Horizontal axis: quality rating (Poor to Excellent); vertical axis: Frequency (0 to 9.0). Tall thin bar graphs emphasize difference.]
[Figure: the same bar graph drawn wide and short. Wide short bar graphs emphasize similarity.]

Pie Chart
The pie chart is a commonly used graphical device for presenting relative or percentage frequency distributions for qualitative data.
First draw a circle; then use the relative or percentage frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a relative frequency of 0.25 would consume 0.25(360) = 90 degrees of the circle.
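The 360-degree rule above can be applied to every class of the Marada Inn data at once. The sketch below uses the frequencies from the example; the dictionary and variable names are illustrative only.

```python
# Class frequencies from the Marada Inn frequency distribution.
frequency = {
    "Poor": 2,
    "Below average": 3,
    "Average": 5,
    "Above average": 9,
    "Excellent": 1,
}
total = sum(frequency.values())

# Sector angle = relative frequency * 360.
# Written as count * 360 / total so the arithmetic stays exact for these counts.
angles = {label: count * 360 / total for label, count in frequency.items()}

for label, angle in angles.items():
    print(f"{label:<14}{angle:6.1f} degrees")
```

The "Average" class, with a relative frequency of 0.25, comes out at 90 degrees, matching the calculation in the text, and the five sectors together fill the full circle.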
Example: Marada Inn
[Figure: pie chart of the Marada Inn ratings. Poor 10%, Below average 15%, Average 25%, Above average 45%, Excellent 5%.]

Use of color in presenting pie charts
[Figure: the same pie chart, colored to highlight the positive features of this data.]
[Figure: the same pie chart, colored to highlight the negative features of this data.]

Use of flashy 3D pie charts
[Figure: 3D pie chart, used to highlight a certain slice of the pie.]
[Figure: pie chart with an exploded wedge. Exploded wedges also draw attention.]

Summarizing Quantitative Data
• Frequency Distribution
• Relative Frequency and Percent Frequency Distributions
• Dot Plot
• Histogram
• Cumulative Distribution

Example: Hudson Auto Repair
The manager of Hudson would like to get a better picture of the distribution of costs for engine tune-up parts. A sample of 50 customer invoices has been taken and the costs of parts, rounded to the nearest dollar, are listed below.
[Data table: costs of parts for the 50 invoices]

Frequency Distribution
Guidelines for Selecting Number of Classes
• Use between 5 and 20 classes.
• Data sets with a larger number of elements usually require a larger number of classes.
• Smaller data sets usually require fewer classes.

Frequency Distribution
Guidelines for Selecting Width of Classes
• Use classes of equal width.
• Class Width = (Largest Data Value - Smallest Data Value) / Number of Classes

Frequency Distribution

Cost ($)   Frequency
50-59          2
60-69         13
70-79         16
80-89          7
90-99          7
100-109        5
Total         50

Relative and Percent Frequency Distribution

Cost ($)   Relative Frequency   Percent Frequency
50-59            0.04                   4
60-69            0.26                  26
70-79            0.32                  32
80-89            0.14                  14
90-99            0.14                  14
100-109          0.10                  10
Total            1.00                 100

Dot Plot
• One of the simplest graphical summaries of data is a dot plot.
• A horizontal axis shows the range of data values.
• Then each data value is represented by a dot placed above the axis.

Histogram
• Another common graphical presentation of quantitative data is a histogram.
• The variable of interest is placed on the horizontal axis and the frequency, relative frequency, or percent frequency is placed on the vertical axis.
• A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

[Figure: Dot Plot of the parts costs. Horizontal axis marked from 44 to 110. Annotation: "Most of the data is in this range."]
[Figure: Histogram. Horizontal axis: Cost ($), classes 50-59 through 100-109; vertical axis: frequency (0 to 20).]
[Figure: Relative Frequency Histogram. Horizontal axis: Cost ($), classes 50-59 through 100-109; vertical axis: relative frequency (0 to 0.40).]

Cumulative Distribution
• The cumulative frequency distribution shows the number of items with values less than or equal to the upper limit of each class.
• The cumulative relative frequency distribution shows the proportion of items with values less than or equal to the upper limit of each class.
• The cumulative percent frequency distribution shows the percentage of items with values less than or equal to the upper limit of each class.

Cumulative Distribution

Cost ($)   Cumulative Frequency   Cumulative Relative Frequency
≤ 59               2                       0.04
≤ 69              15                       0.30
≤ 79              31                       0.62
≤ 89              38                       0.76
≤ 99              45                       0.90
≤ 109             50                       1.00

Ogive
• An ogive is a graph of a cumulative distribution.
• The data values are shown on the horizontal axis.
• The vertical axis can be cumulative frequencies, cumulative relative frequency, or cumulative percent frequency.

[Figure: Ogive. Horizontal axis: cost ($), 49 to 109; vertical axis: cumulative frequency, 0 to 50.]

Exploratory Data Analysis
Exploratory Data Analysis: techniques to quickly summarize data
Crosstabulations
Scatter Diagrams

Stem-and-Leaf Display
This display shows both the rank order and shape of the distribution of the data.
It’s similar to a histogram, but it has the advantage of showing the actual data values.
The first digit(s) of each data item are arranged to the left of a vertical line.
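As a rough check on the cumulative distribution and the stem-and-leaf idea described above, the sketch below rebuilds both from the Hudson Auto Repair parts costs. The 50 values are read off the stem-and-leaf display for this example, so they appear in sorted order rather than in the original invoice order. The function names `stem_and_leaf` and `cumulative_frequencies` are illustrative.

```python
from collections import defaultdict

# Parts costs ($) read off the Hudson Auto Repair stem-and-leaf display,
# so they are sorted rather than in the original invoice order.
costs = [
    52, 57,
    62, 62, 62, 62, 65, 66, 67, 68, 68, 68, 69, 69, 69,
    71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 76, 77, 78, 79, 79, 79,
    80, 80, 82, 83, 85, 88, 89,
    91, 93, 97, 97, 97, 98, 99,
    101, 104, 105, 105, 109,
]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its leaves (last digits) in rank order."""
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    return dict(display)

def cumulative_frequencies(values, upper_limits):
    """Count the values less than or equal to each class upper limit."""
    return [sum(v <= limit for v in values) for limit in upper_limits]

# Stems to the left of the vertical line, leaves to the right.
for stem, leaves in stem_and_leaf(costs).items():
    print(f"{stem:>2} | {' '.join(map(str, leaves))}")

upper_limits = [59, 69, 79, 89, 99, 109]
cum = cumulative_frequencies(costs, upper_limits)
for limit, c in zip(upper_limits, cum):
    # Cumulative relative frequency = cumulative frequency / number of items.
    print(f"<= {limit:>3}  {c:>2}  {c / len(costs):.2f}")
```

Because the last cumulative frequency must equal the sample size (50 here), it doubles as a check that no value was missed.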
Hudson Auto Repair
Stem-and-Leaf Display for Cost of Parts

 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9

Crosstabulations and Scatter Diagrams
Thus far we have focused on methods that are used to summarize the data for one variable at a time.
Next we explore methods of understanding the relationship between two variables.

Crosstabulation
The number of Finger Lakes homes sold for each style and price for the past two years is shown below.

                           Home Style
Price                 Colonial   Ranch   Split   A-Frame   Total
less than $100,000       18        6      19       12        55
$100,000+                12       14      16        3        45
Total                    30       20      35       15       100

Problem with crosstabulation
Crosstabulation data are often combined to form an aggregate crosstabulation;
this presents a possible danger;
relationships that appear in the aggregate may be contradicted by the unaggregated data;
this is called Simpson’s Paradox.

Crosstabulation: Simpson’s Paradox

                   Judge
Verdict     Luckett      Kendall      Total
Upheld      129 (86%)    110 (88%)     239
Reversed     21 (14%)     15 (12%)      36
Total       150          125           275

It looks like Kendall’s doing a better job.

Crosstabulation: Simpson’s Paradox

Judge Luckett
Verdict     Common Pleas   Municipal Court   Total
Upheld       29 (91%)        100 (85%)        129
Reversed      3 (9%)          18 (15%)         21
Total        32              118              150

Judge Kendall
Verdict     Common Pleas   Municipal Court   Total
Upheld       90 (90%)         20 (80%)        110
Reversed     10 (10%)          5 (20%)         15
Total       100               25              125

But Luckett actually has a better record in both courts.

Example: Panthers Football Team
Scatter Diagram
The Panthers football team is interested in investigating the relationship, if any, between interceptions made and points scored.

Interceptions   Points scored
      1              14
      3              24
      2              18
      1              17
      3              27

[Figure: Panthers Football Team Scatter Diagram. Horizontal axis: Interceptions (0 to 3); vertical axis: Points scored (0 to 30).]

[Figure: Scatter diagram of weight and speed of NFL players. Horizontal axis: Weight (lb), 0 to 400; vertical axis: Time in the 40 yard dash (sec), 0 to 6. From the Excel data, Chapter 2, “NFL”.]
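Returning to the Simpson’s Paradox tables for Judges Luckett and Kendall, the reversal can be verified directly from the counts. The sketch below is illustrative only: the `cases` layout and the helper `upheld_rate` are assumptions of this example, while the counts come from the tables above.

```python
# (upheld, reversed) counts per judge and court, taken from the tables above.
cases = {
    "Luckett": {"Common Pleas": (29, 3), "Municipal Court": (100, 18)},
    "Kendall": {"Common Pleas": (90, 10), "Municipal Court": (20, 5)},
}

def upheld_rate(upheld, reversed_):
    """Proportion of verdicts upheld on appeal."""
    return upheld / (upheld + reversed_)

# Unaggregated view: one rate per judge per court.
by_court = {
    judge: {court: upheld_rate(*counts) for court, counts in courts.items()}
    for judge, courts in cases.items()
}

# Aggregate view: pool both courts for each judge before computing the rate.
aggregate = {
    judge: upheld_rate(
        sum(u for u, _ in courts.values()),
        sum(r for _, r in courts.values()),
    )
    for judge, courts in cases.items()
}

for judge in cases:
    print(f"{judge}: aggregate upheld rate {aggregate[judge]:.0%}")
    for court, rate in by_court[judge].items():
        print(f"    {court}: {rate:.0%}")
```

Kendall has the higher aggregate rate while Luckett has the higher rate in each court separately. The reversal comes from the mix of cases: most of Luckett's cases are in Municipal Court, where rates are lower for both judges.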
Tabular and Graphical Procedures (p. 56)

Qualitative Data
  Tabular Methods: Frequency Distribution(s); Crosstabulation
  Graphical Methods: Bar graph; Pie chart

Quantitative Data
  Tabular Methods: Frequency Distribution(s); Cumulative Frequency Distribution(s); Crosstabulation
  Graphical Methods: Dot plot; Stem-and-Leaf Display; Histogram; Ogive; Scatter diagram

Microsoft Excel
MS Excel (and other statistical / spreadsheet programs like it) makes many tasks in statistics much, much easier.
Appendix 2.2 describes how to perform some operations in Excel.

Histograms
You need the Analysis ToolPak.
Go to “Tools”, then “Add-ins”, then select “Analysis ToolPak” and hit “OK”.
Then go to “Tools” and hit “Data Analysis”.
A list of options will come up; select “Histogram”.

Histogram
Highlight all your data.
Bin range (more on this in a minute).
Highlight where you want the frequency distribution to go.

Bin range
What you do here is define the class widths you want to use.
Excel can do this automatically, but it has very bad judgement and the results will be worthless.
Look at the data and decide what the upper bounds for each class should be.

Bin range
For this example, I’m using the “Norris” data on the CD.
If you want your classes to look like this:
40-49, 50-59, 60-69, 70-79, 80-89, 90-99, 100-109, 110-119
you enter just the upper boundary in each cell:
49, 59, 69, 79, 89, 99, 109, 119
Then go to the bin range field and highlight these cells.

Frequency distribution
Hit “OK”, and Excel gives you this.
[Figure: Excel output table with “Bin” and “Frequency” columns]
I like to rename the “Bin” fields “40-49”, “50-59”, etc.
I also rename the “More” field “Total” and add up the column above so the whole thing looks like this.
[Figure: the renamed frequency distribution table]

Histogram
Now go to “Insert” and select “Chart”, or hit the chart button.
Select the “Clustered column”.
Enter the titles for the x and y axes, and for the whole chart, then hit OK.
To get rid of the gap between the bars, double-click on one of the bars on the finished chart, then go to “Options” and enter zero under “Gap width”.

[Figure: Frequency Histogram: Norris Electronics. Horizontal axis: Hours until Burnout, classes 40 to 49 through 110 to 119; vertical axis: Frequency (0 to 70).]
Be sure to label the axes and give it a title. Another masterpiece of statistics!
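The bin-range idea above (each bin is defined only by its upper boundary) can also be sketched outside Excel. In the example below each value is counted in the first bin whose upper bound is at least as large as the value, and anything above the last bound goes into a final "More" slot. The `hours` values are hypothetical, made up for illustration, since the actual "Norris" data are not reproduced in these slides. The function name `bin_counts` is likewise an assumption of this sketch.

```python
from bisect import bisect_left

# Upper class boundaries, as entered in the bin-range cells in the slides.
upper_bounds = [49, 59, 69, 79, 89, 99, 109, 119]

# HYPOTHETICAL burnout times in hours (not the real "Norris" data).
hours = [43, 58, 61, 67, 70, 72, 75, 79, 81, 88, 94, 102, 118]

def bin_counts(values, bounds):
    """Count values per bin; a value lands in the first bin whose upper bound is >= the value."""
    counts = [0] * (len(bounds) + 1)  # the extra last slot plays the role of "More"
    for v in values:
        # bisect_left returns the index of the first bound that is >= v.
        counts[bisect_left(bounds, v)] += 1
    return counts

counts = bin_counts(hours, upper_bounds)
labels = [f"<= {b}" for b in upper_bounds] + ["More"]
for label, count in zip(labels, counts):
    print(f"{label:>6}  {count}")
```

One consequence of defining bins by upper bounds alone is that the first bin catches every value at or below 49, not just 40 to 49, so it is worth checking the smallest data value before labeling that class "40-49".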