Chapter 2: Summarizing and Graphing Data

Recall: the two types of data variables are qualitative (categorical) and quantitative.

2.1 Graphs for qualitative variables
• Bar graphs (frequency and relative frequency)
• Pie charts
• Pareto charts

Graphs for qualitative variables
The values of a qualitative (categorical) variable are labels. The distribution of a categorical variable lists the count or percentage of individuals in each category.

Wireless surfers by age (a sample of 400 wireless internet users)

Age            Count   Percent
18-34            212       53%
35-54            168       42%
55 and over       20        5%

[Figures: bar chart and pie chart of wireless surfers by age]

Wireless surfers by gender

Gender    Count   Percent
Male        288       72%
Female      112       28%
Total       400      100%

[Figure: bar chart of wireless surfers by gender]

Frequency Distribution (or Frequency Table)
A frequency distribution lists each category of data and the number of occurrences for each category of data.

Frequency Distribution: Ages of Best Actresses
[Tables: original data and the resulting frequency distribution]

• Lower class limits are the smallest numbers that can actually belong to the different classes.
• Upper class limits are the largest numbers that can actually belong to the different classes.
• Class midpoints are found by adding the lower class limit to the upper class limit and dividing the sum by two. For the actresses data the midpoints are 25.5, 35.5, 45.5, 55.5, 65.5 and 75.5.
• Class width is the difference between two consecutive lower class limits or two consecutive lower class boundaries. For the actresses data every class has width 10.

EXAMPLE: Organizing Qualitative Data into a Frequency Distribution
The data on the next slide represent the color of M&Ms in a bag of plain M&Ms. Construct a frequency distribution of the color of plain M&Ms.
[Table: frequency table of M&M colors]

The relative frequency is the proportion (or percent) of observations within a category and is found using the formula:

relative frequency = frequency / (sum of all frequencies)

A relative frequency distribution lists the relative frequency of each category of data.
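The relative frequency formula can be checked on the wireless-surfer age counts given earlier (212, 168 and 20 out of 400). The following is a minimal sketch; the function name relative_frequencies and the category labels are illustrative choices, not part of the slides.

```python
def relative_frequencies(counts):
    """Return {category: frequency / sum of all frequencies}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("frequencies must not sum to zero")
    return {category: freq / total for category, freq in counts.items()}


# Counts from the wireless-surfer age table (400 users in total).
age_counts = {"18-34": 212, "35-54": 168, "55 and over": 20}
age_rel = relative_frequencies(age_counts)

# Print a small relative frequency distribution.
for category, rel in age_rel.items():
    print(f"{category:12s} {age_counts[category]:4d} {rel:.2f}")
```

The printed proportions should agree with the percentages in the table (53%, 42%, 5%), and they always sum to 1.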
EXAMPLE: Organizing Qualitative Data into a Relative Frequency Distribution
Use the frequency distribution obtained in the prior example to construct a relative frequency distribution of the color of plain M&Ms.

The bag contains 45 candies, so a color with frequency 12 has relative frequency 12/45 = 0.2667. The relative frequencies of the six colors are 0.2667, 0.2222, 0.2, 0.1333, 0.0667 and 0.1111.

Bar Graphs
A bar graph is constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis.

EXAMPLE: Constructing a Frequency and Relative Frequency Bar Graph
Use the M&M data to construct (a) a frequency bar graph and (b) a relative frequency bar graph.
[Figures: frequency bar graph and relative frequency bar graph of M&M colors]

Actresses example
Total frequency = 76, so the relative frequencies are 28/76 = 37%, 30/76 = 39%, etc.

Frequency bar graph
The horizontal scale represents the classes of data values; the vertical scale represents the frequencies.
[Figure: frequency bar graph of the actresses' ages, horizontal axis from 20 to 80]

Relative Frequency Graph
Has the same shape and horizontal scale as the bar graph, but the vertical scale is marked with relative frequencies instead of actual frequencies.

Interpreting Frequency Distributions
In later chapters, there will be frequent reference to data with a normal distribution. One key characteristic of a normal distribution is that it has a “bell” shape:
• The frequencies start low, then increase to some maximum frequency, then decrease to a low frequency.
• The distribution should be approximately symmetric.
[Figure: example of a “bell” shape]

EXAMPLE: Comparing Two Data Sets
The following data represent the marital status (in millions) of U.S. residents 18 years of age or older in 1990 and 2006. Draw a side-by-side relative frequency bar graph of the data.

Marital Status     1990    2006
Never married      40.4    55.3
Married           112.6   127.7
Widowed            13.8    13.9
Divorced           15.1    22.8

[Figure: side-by-side bar graph, Marital Status in 1990 vs. 2006; horizontal axis: marital status, vertical axis: relative frequency]

Another Example:
On the morning of April 10, 1912, the Titanic sailed from the port of Southampton (UK), bound for New York.
Altogether there were 2,201 passengers and crew members on board. The table below shows how many survived and how many died in the famous accident.

                  Survived           Dead
                Male   Female    Male   Female
First class       62      141     118        4
Second class      25       93     154       13
Third class       88       90     422      106
Crew members     192       20     670        3

Define the categorical variables.

[Figure: bar chart of the data in the table above, in percentages; one bar each for first class, second class, third class and crew, grouped by Survived/Dead and Male/Female]

Pareto Chart
A Pareto chart is a bar graph where the bars are drawn in decreasing order of frequency or relative frequency.
[Figure: Pareto chart]

Pie Chart
A pie chart is a circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.

EXAMPLE: Constructing a Pie Chart
The following data represent the marital status (in millions) of U.S. residents 18 years of age or older in 2006. Draw a pie chart of the data.

Marital Status    Frequency
Never married          55.3
Married               127.7
Widowed                13.9
Divorced               22.8

[Figure: pie chart of marital status in 2006]

Other example: a graph depicting qualitative data as slices of a pie.
[Figure: pie chart]

2.2 Graphs for quantitative variables
• Histograms (discrete data and continuous data)
• Stem-and-leaf plots
• Time series
• Dot plots
• Distributions

Histogram
Example: CEO salaries
Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data extracted are the age and annual salary of the chief executive officer for the first 60 ranked firms.
(Data at http://lib.stat.cmu.edu/DASL/DataArchive.html )

Salary of chief executive officer (including bonuses), in $ thousands:

145  621  262  208  362  424  339  736  291   58
498  643  390  332  750  368  659  234  396  300
343  536  543  217  298 1103  406  254  862  204
206  250   21  298  350  800  726  370  536  291
808  543  149  350  242  198  213  296  317  482
155  802  200  282  573  388  250  396  572

Drawing a histogram
1. Construct a distribution table:
   i. Define class intervals or bins (choose intervals of equal width!).
   ii. Count the percentage of observations in each interval.
   iii. End-point convention: the left endpoint of the interval is included and the right endpoint is excluded, i.e. [a, b).
2. Draw the horizontal axis.
3. Construct the blocks: with intervals of equal width, the height of each block is the percentage of observations in that interval. The total area under a histogram must be 100%.

Distribution table (left endpoint included):

Class interval   Frequency   Percentage = (frequency/total) x 100
0-100                    2   2/59 x 100 = 3.39
100-200                  4   4/59 x 100 = 6.78
200-300                 18   30.51
300-400                 14   23.73
400-500                  4   6.78
500-600                  6   10.17
600-700                  3   5.08
700-800                  3   5.08
800-900                  4   6.78
900-1000                 0   0
1000-1100                1   1.69
Total                   59   100% (up to rounding)

Note: the largest salary, 1103, lies slightly above 1100; the table counts it in the last class.

[Figure: histogram of CEO salaries; the tallest blocks are 30.51% (200-300) and 23.73% (300-400), with 3.39% for 0-100 and 1.69% for the last class]

The area of each block represents the percentage of cases in the corresponding class interval (or bin).

Remarks
• A histogram represents percent by area.
• The total area under a histogram is 100%.
• There is no fixed choice for the number of classes in a histogram. If class intervals are too small, the histogram will have spikes; if class intervals are too large, some information will be missed. Use your judgment!
• Typically statistical software will choose the class intervals for you, but you can modify them.
• Let's try various binning levels.
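One way to try various binning levels without statistical software is sketched below in plain Python. It applies the left-endpoint convention [a, b) to the 59 CEO salaries listed above and prints a rough text histogram for a few bin widths. The helper name bin_percentages and the chosen widths (50, 100, 200) are illustrative assumptions. Note that under a strict [a, b) rule the salary 1103 lands in [1100, 1200), whereas the slide's table counts it in its last class.

```python
from collections import Counter

# CEO salaries (in $ thousands) as listed in the example: 59 values.
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58,
    498, 643, 390, 332, 750, 368, 659, 234, 396, 300,
    343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291,
    808, 543, 149, 350, 242, 198, 213, 296, 317, 482,
    155, 802, 200, 282, 573, 388, 250, 396, 572,
]


def bin_percentages(values, width):
    """Percent of non-negative integer values in each [k*width, (k+1)*width)."""
    counts = Counter(v // width for v in values)  # bin index per value
    n = len(values)
    return {
        (k * width, (k + 1) * width): 100 * counts.get(k, 0) / n
        for k in range(max(counts) + 1)  # include empty bins
    }


# Compare a fine, a medium and a coarse binning.
for width in (50, 100, 200):
    print(f"bin width {width}")
    for (lo, hi), pct in bin_percentages(salaries, width).items():
        print(f"  [{lo:4d}, {hi:4d})  {pct:5.2f}%  " + "#" * round(pct))
```

With width 50 the picture becomes spiky, with width 200 the peak between 200 and 400 is merged into a single block; this is the trade-off described in the remarks.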
Example: Smoking
In a Public Health Service study, a histogram was plotted showing the number of cigarettes smoked per day by each subject (male current smokers), as shown below. The density is marked in parentheses. The class intervals include the left endpoint, but not the right.

[Figure: histogram of cigarettes smoked per day, with the density of each block in parentheses]

Questions (note: there are 20 cigarettes in a pack):
1. The percentage who smoked less than two packs a day but at least a pack is around: 1.5%, 15%, 30%, 50%
2. The percentage who smoked at least a pack a day is around: 1.5%, 15%, 30%, 50%
3. The percentage who smoked at least 3 packs a day is around: 0.25 of 1%, 0.5 of 1%, 10%
4. The percentage who smoked 20 cigarettes a day is around: 0.35 of 1%, 0.5 of 1%, 1.5%, 3.5%, 10%

Answers:
1. The percentage who smoked less than two packs a day but at least a pack is given by the area of the third block: 1.5 x (40 - 20) = 1.5 x 20 = 30%.
2. The percentage who smoked at least a pack a day is given by the area of the third and fourth blocks: 30 + 0.5 x 40 = 50%.
3. The percentage who smoked at least 3 packs a day is the area of the block for 60 or more cigarettes. This is half of the fourth block: 0.5 x 20 = 10%.
4. The percentage who smoked 20 cigarettes a day: by the left-endpoint convention, 20 belongs to the third block, where the density is 1.5% per cigarette. A strip one cigarette wide has area 1.5 x 1, so the answer is 1.5%.

Using histograms for comparisons
Fuel economy for model year 2001 compact and two-seater cars (Table 1.8, pg 38).
[Figures: histograms of city consumption and highway consumption]

Stemplot (or Stem-and-Leaf Plot)
Represents data by separating each value into two parts: the stem (the leftmost digits) and the leaf (the rightmost digit).
Example: a data value of 147 would have 14 as the stem and 7 as the leaf.

To make a stemplot:
[Figure: steps for making a stemplot, with an example]

Advantage of Stem-and-Leaf Diagrams over Histograms
Once a frequency distribution or histogram of continuous data is created, the raw data are lost (unless they are reported along with the frequency distribution). The raw data can, however, be retrieved from a stem-and-leaf plot.
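The stem/leaf split and the "raw data can be retrieved" property can be illustrated with a short sketch. The data values below are hypothetical (only 147 comes from the slide's example), and the function names stemplot and raw_data are illustrative; the sketch assumes non-negative whole numbers with the last digit as the leaf.

```python
from collections import defaultdict


def stemplot(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 147 -> stem 14, leaf 7
        stems[stem].append(leaf)
    return dict(stems)


def raw_data(stems):
    """Rebuild the sorted raw data from a stemplot."""
    return sorted(stem * 10 + leaf for stem, leaves in stems.items() for leaf in leaves)


# Hypothetical data for illustration; 147 is the value used in the slide's example.
values = [147, 152, 139, 147, 161, 158, 143, 150]
plot = stemplot(values)

# Print every stem between the smallest and largest, even if it has no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:3d} | {leaves}")
```

Calling raw_data(plot) returns the original values in sorted order, which is exactly what a histogram of the same data could not give back.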
Dot plots
A dot plot is drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it is observed.

EXAMPLE: Drawing a Dot Plot
The following data represent the number of available cars in a household based on a random sample of 50 households. Draw a dot plot of the data.

3 4 1 3 2 0 2 1 3 3
1 2 3 2 2 2 2 2 1 1
1 1 4 2 2 1 2 1 2 2
1 2 2 0 1 2 0 1 3 1
0 2 2 2 3 2 4 2 2 5

Data based on results reported by the United States Bureau of the Census.

[Figure: dot plot of the number of available cars per household]

Examining distributions
Purpose of a graph: to understand the data better. Histograms and stemplots display the main features of a distribution similarly.
Features to be observed:
• Modes (how many?)
• Symmetry vs. skewness
• Outliers

EXAMPLE: Identifying the Shape of the Distribution
Identify the shape of the following histogram, which represents the time between eruptions at Old Faithful.
[Figure: histogram of time between eruptions at Old Faithful]

Time-Series Graphs
Data that have been collected at different points in time.
Example:
[Figures: time series graph; time series graph with seasonal variation]

Other types of graphs:
• Frequency polygon
• Ogive (cumulative frequencies)
• Scatter plot (to relate two variables)

Frequency polygons
The class midpoint is found by adding consecutive lower class limits and dividing the result by 2.
A frequency polygon is drawn by plotting a point above each class midpoint on a horizontal axis at a height equal to the frequency of the class. After the points for each class are plotted, draw straight lines between consecutive points.
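As a rough sketch of this construction, the snippet below computes the class midpoints and the points of the frequency polygon for the Old Faithful classes tabulated in this section (lower class limits 670 to 730, frequencies 2, 0, 7, 9, 9, 11, 7). The helper name class_midpoints is illustrative, and the sketch assumes equal class widths so that the last midpoint can be computed from an implied next lower limit of 740.

```python
# Lower class limits and frequencies from the Old Faithful table in this section.
lower_limits = [670, 680, 690, 700, 710, 720, 730]
frequencies = [2, 0, 7, 9, 9, 11, 7]

# Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]


def class_midpoints(limits, width):
    """Average consecutive lower class limits (assumes equal-width classes)."""
    extended = limits + [limits[-1] + width]  # implied next lower limit
    return [(a + b) / 2 for a, b in zip(extended, extended[1:])]


midpoints = class_midpoints(lower_limits, class_width)

# Points of the frequency polygon: (class midpoint, frequency).
points = list(zip(midpoints, frequencies))
total = sum(frequencies)

for midpoint, freq in points:
    print(f"{midpoint:6.1f}  {freq:3d}  {freq / total:.4f}")
```

Joining the points in order with straight line segments gives the polygon; plotting the relative frequencies instead of the counts changes only the vertical scale.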
Time between Eruptions (seconds)   Class Midpoint   Frequency   Relative Frequency
670 - 679                                     675           2               0.0444
680 - 689                                     685           0               0
690 - 699                                     695           7               0.1556
700 - 709                                     705           9               0.2
710 - 719                                     715           9               0.2
720 - 729                                     725          11               0.2444
730 - 739                                     735           7               0.1556

[Figure: frequency polygon, Time between Eruptions; horizontal axis: time (seconds), 665 to 735; vertical axis: frequency, 0 to 12]

Practice
CO2 emission levels in the world: Burning fuel in power plants or motor vehicles emits carbon dioxide (CO2), which contributes to global warming. The table on the next slide displays CO2 emissions per person for countries with populations of at least 20 million.

Questions:
(a) Why do you think we choose to measure emissions per person rather than total CO2 emissions for each country?
(b) Display the data of the table in a graph. Describe the shape, center, and spread of the distribution. Which countries are outliers?
    1. Make a stemplot, then
    2. a histogram.

Answer:
(a) Total emissions would almost certainly be higher for very large countries; for example, we would expect that even with great attempts to control emissions, China (with over 1 billion people) would have higher total emissions than the smallest countries in the data set.

Answer: (stemplot)
(b) Graph representation of the data:
1) Stemplot (stem = metric tons per person, leaf = tenths):

 0 | 001122223357899
 1 | 02478
 2 | 3558
 3 | 67899
 4 | 68
 5 | 1
 6 | 18
 7 | 36
 8 | 018
 9 | 017
10 | 02
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 9

Answer: (histogram)
(b), continued: Graph representation of the data:
2) Histogram (for example, using Excel; note that in Excel the convention is "right endpoint belongs in the bin, left endpoint out"). (Demo in class)

Summary of steps:
- Find the min and max of the data.
- Choose the binning.
- From the menus: Tools, Data Analysis, Histograms.
- Define: Input range, Bin range, Output range.
- Check Chart output.
- Click OK.
- Adjust the width between bars (right-click on the bars, Format Data Series, Options, set gap width to zero).
Answer: (histogram)
(b), continued: Histogram: min 0, max 19.9

Bin   Frequency
  0           2
  2          18
  4           9
  6           3
  8           5
 10           6
 12           2
 14           0
 16           1
 18           1
 20           1
 22           0

(Each bin is labeled by its right endpoint, following the Excel convention.)

[Figure: histogram of CO2 emissions per person; horizontal axis: bin, vertical axis: frequency, 0 to 20]

Interpretation of graphs:
The graph is not symmetric. There is a strong right skew, with a high peak at low metric tons per person. The three highest countries (the U.S., Canada, and Australia) appear to be outliers; apart from those countries, the distribution is spread from 0 to 11 metric tons per person (see table).