Chapter 1 Picturing Distributions with Graphs Adapted from the DS1000B lectures by Dr. Yifan Li Revisions made by Ashley McAlpine, FSA, FCIA © 2021 W. H. Freeman and Company Chapter 1 Overview Individuals and variables Categorical variables: pie charts and bar graphs Quantitative variables: histograms Interpreting histograms Quantitative variables: stemplots Time plots 2 Data Science Data science is a multidisciplinary field (Statistics, Computer Science, Mathematics) with the goals of extracting insight/information from data and making data-driven decisions. Data Science encompasses: data collection, storage, preprocessing, analysis and visualisation. 3 Statistics Statistics is the science of data. The first step in dealing with data is to organize your thinking about the data. Individual: an object described by a set of data Variable: a characteristic of the individual Individuals Variables 4 When planning a study … … or simply exploring data from someone else’s work, ask yourself these questions: Who? What individual does the data describe? What? How many and what are the exact definitions of the variables in the data? In what unit of measurement is each variable recorded? Where? The context of the data collection is always important. When? (see previous point.) Why? Were the data collected to describe just those individuals or to represent a larger group? 5 Types of Variables A categorical variable places individuals into one of several groups or categories. A quantitative variable takes numerical values for which arithmetic operations make sense (usually recorded in a unit of measurement). Most data tables follow this format—the data here appear in a spreadsheet program. 6 Exploratory Data Analysis An exploratory data analysis is the process of using statistical tools and ideas to examine data in order to describe their main features. EXPLORING DATA Begin by examining each variable by itself. Then move on to studying the relationships among the variables. Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data. 7 Exercise Suppose we have a dataset that describes the fuel economy in miles per gallon (mpg) of model year 2019 motor vehicles. What is the individual in this dataset? A. mile per gallon (mpg) B. a year C. the driver D. a vehicle 8 Categorical variables: Quantitative variables: 9 Distribution of a Variable To examine a single variable, we usually want to display its distribution. DISTRIBUTION OF A VARIABLE The distribution of a variable tells us what values it takes and how often it takes these values. 10 Displaying Categorical Data The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall into each category. Pie charts show the distribution of a categorical variable as a “pie” where the sizes of the slices reflect the counts or percents for the categories. Bar graphs represent each category as a bar whose height shows the category count or percent. 11 Pie Chart or Bar Graph EXAMPLE: What do the 1.5 million full-time first-year students plan to study? Here are data on the percents of post-secondary first-year students who plan to major in several discipline areas. Field of Study Percent of Student Biological sciences 15.5 Business 13.8 Health professions 11.7 Engineering 11.5 Social sciences 11 Arts and humanities 8.8 Math and computer science 6.2 Education 4.4 Physical sciences 2.7 Other majors and undeclared 13.1 Total 98.7 PERCENT OF STUDENTS Social sciences Arts and humanities Physical sciences Biological sciences Other majors and undeclared Math and computer science rounding error Business Health professions Education Engineering 12 Pie Chart or Bar Graph Field of Study Percent of Student Biological sciences 15.5 Business 13.8 Health professions 11.7 Engineering 11.5 Social sciences 11 Arts and humanities 8.8 Math and computer science 6.2 Education 4.4 Physical sciences 2.7 Other majors and undeclared 13.1 Percent of students who plan to major EXAMPLE (cont’d): The same example now shown as a bar graph 18 16 14 12 10 8 6 4 2 0 Field of Study 13 Bar Graphs Only Brand Percent of 12-34s Who Have Listened to Each Brand Pandora 36 Spotify 46 iHeartRadio 14 Apple Music 20 Amazon Music 10 SoundCloud 23 Google Play All Access 8 Percent of 12-34 YO’s Who Have Listened to Each Brand EXAMPLE: What sources do Americans aged 12–34 years use to keep up to date and50 learn about music? 45 40 35 30 25 20 15 10 5 0 Spotify Pandora SoundCloud Apple Music iHeartRadio Amazon Music Google Play All Access Audio Brand Note: For bar graphs, percents don’t necessarily add to 100. 14 Bar Graphs vs Pie Charts They both describe categorical variables, but we need to choose them wisely in different cases: Pie charts must include all categories that make up a whole. Use a pie chart only when you emphasize each category’s relationship to the whole. Bar graphs are more flexible than pie charts (e.g. the categories do not need to make up a whole; when there are many categories, pie charts can be hard to see but bar graphs still work.) 15 Can we use a pie chart? Here are top brands chosen by Americans aged 12-34 who are using social media. A. Yes B. No 16 Can we use a pie chart? Here are the average numbers of babies born on each day of the week in 2017 A. Yes B. No 17 Quantitative Data The distribution of a quantitative variable tells us what values the variable takes on and how often it takes on those values. Histograms show the distribution of a quantitative variable by using bars where the height of each bar represents the number of individuals who take on a value within a particular class/interval. Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution, while maintaining the original values of the variable. 18 Histograms Are appropriate for quantitative variables that take on many values and/or for large datasets. Divide the possible values into classes/intervals (equal widths). Count how many observations fall into each interval (may change to percents). Draw a picture representing the distribution ― bar heights are equivalent to the number (percent) of observations in each interval. 19 Histograms (Example) Freshman Graduation Rate, or FGR, Data for 2016– 2017 20 Histograms vs Bar Graphs Histogram Bar graphs Type of variable Quantitative Categorical Space between bars No space Equal space The height of bars (both mean count/frequency/ percent) Changes with the setup of intervals or break points Once the unit of measurement is fixed, it does not change 1. Why is there usually no space between the bars of a histogram? 2. Sometimes there is a gap between bars in a histogram. What does this mean? 21 Interpreting Histograms EXAMINING A HISTOGRAM In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall pattern by its shape, center, and variability. You will sometimes see variability referred to as spread. An important kind of deviation is an outlier, an individual that falls outside the overall pattern. 22 Overall Shape of Distributions SYMMETRIC AND SKEWED DISTRIBUTIONS A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other. A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side. A distribution is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side. Symmetric Right-skewed Left-skewed 23 Direction to the long tail (not the direction where most points are clustered) 1000 0 500 Frequency 1500 2000 Symmetric 6 8 10 12 14 Right-skewed 100 Frequency 50 150 100 0 50 40 50 60 70 80 90 100 0 Frequency 200 150 250 200 Left-skewed 0 5 10 15 20 24 What is the shape of this distribution? A. Right-skewed B. Left-skewed C. Symmetric 25 What is the shape of this distribution? A. Right-skewed B. Left-skewed C. Symmetric 26 Stemplots (Stem-and-Leaf Plots) To make a stemplot (not suitable for a large dataset): 1. Separate each observation into a stem (consisting of all but the final, or rightmost, digit) and a leaf (the final digit). Stems may have as many digits as needed, but each leaf contains only a single digit. 2. Write the stems in a vertical column with the smallest at the top and draw a vertical line at the right of this column. Be sure to include all the stems needed to span the data, even when some stems have no leaves. 3. Write each leaf in the row to the right of its stem, in increasing order out from the stem. 27 Example of Stemplot 5.3, 5.6, 5.3, 5.0, 5.5, 5.9, 5.9, 3.3, 5.0, 3.4, 5.0, 3.4, 4.6, 6.9, 6.0 • First, order the data from smallest to largest. • For this dataset, the stems are the non-decimals and leaves are the decimals (recall: leaves can only be one digit) 3.3, 3.4, 3.4, 4.6, 5.0, 5.0, 5.0, 5.3, 5.3, 5.5, 5.6, 5.9, 5.9, 6.0, 6.9 Stemplot: 3 | 344 4|6 5 | 000335699 6 | 09 28 If there are very few stems (that is, when the data covers only a very small range of values), then we may want to create more stems by splitting the original stems. EXAMPLE: 3.3, 3.4, 3.4, 4.6, 5.0, 5.0, 5.0, 5.3, 5.3, 5.5, 5.6, 5.9, 5.9, 6.0, 6.9 3 | 344 decimals 0 to 4 3| decimals 5 to 9 4| 4|6 5 | 00033 5 | 5699 6|0 6|9 29 Exercise The shape of the distribution below is: a. Skewed to the left. b. Skewed upward. c. Skewed to the right. 30 Stemplot of Median Income for Families of Four Median incomes range from $46,596 (New Mexico) to $82,879 (Maryland). Stemplot of Median Incomes: 4|66789 5|11344 Stems: 4 to 8, reusing two times 5|56666688899999 6|011112334 (0-5 then 5-9) with leaves truncated to $1,000s. 6|556666789 Note leaves have been ordered. 7|01223 7| Example: 8|0022 $46,596 would be truncated Example: 4|6 = $46,xxx to 46,000 and shown as 4|6 Source: Federal Registry, April 15, 2003 31 Obtaining Info from the Stemplot Determine shape, identify outliers, locate center. Pulse Rates: 5|4 5|789 6|023344 6|55567789 7|00124 7|58 Exam Scores 3|2 4| 5|5 6|024418 7|5659839 8|5430820 9|532087 Bell-shape Centered mid 60’s no outliers Outlier of 32. Apart from 55, rest uniform from the 60’s to 90’s. Median Incomes: 4|66789 5|11344 5|56666688899999 6|011112334 6|556666789 7|01223 7| 8|0022 Wide range with 4 unusually high values. Rest bell-shape around high $50,000s. 32 Time Plots A time plot shows behavior over time. Time is always on the horizontal axis, and the variable being measured is on the vertical axis. Look for an overall pattern (trend) and for deviations from this trend. Connecting the data points by lines may emphasize this trend. Look for patterns that repeat at known regular intervals (seasonal variations or cycles). 33 Time Plots (Example 1) Time Plot of Average Gauge Height, Everglades National Park Monitoring Station 34 Time Plots (Example 2) Time Plot of the Mean Atmospheric CO2 Concentration (ppm) 35 Practice Textbook Problems: Odd problems from 1.13 to 1.45 (except 1.29) 36 The Basic Practice of Statistics Ninth Edition David S. Moore William I. Notz Chapter 1 Picturing Distributions with Graphs Clicker Questions © 2021 W. H. Freeman and Company Question 1 In a study of commuting patterns of people in a large metropolitan area, respondents were asked to report the time they took to travel to their workplace on a specific day of the week. What is the individual? a) b) c) d) travel time a person the day of the week the city in which they lived Question 2 In a study of commuting patterns of people in a large metropolitan area, respondents were asked to report the time they took to travel to their workplace on a specific day of the week. What is the variable of interest? a) b) c) d) travel time a person the day of the week the city in which the person lives Question 3 If we asked people to report their “weight,” would that variable be considered a categorical variable or a quantitative variable? a) b) categorical quantitative Question 4 We ask people whether their weight is underweight, normal, overweight, or obese. Is this variable categorical or quantitative? a) b) categorical quantitative Question 5 A survey involving 35 questions was given to a sample of 200 students attending a university with an enrollment of 10,000. How many individuals do the data describe? a) b) c) d) 35 200 10,000 It is impossible to say. Question 6 A survey involving 35 questions was given to a sample of 200 students attending a university with an enrollment of 10,000. How many variables do the data contain? a) b) c) d) 35 200 35 × 200 35 × 10,000 Question 7 What type of data is produced by the answer choices for the following question? How many times have you accessed the Internet this week? 1) 2) 3) 4) a) b) never once or twice three or four times more than four times categorical quantitative Question 8 Which statement is correct regarding the distribution of a variable? a) b) c) d) The distribution of a quantitative variable can be displayed using a graphical display like a histogram. The distribution of a categorical variable can be displayed using a table of counts or percents. The distribution of a variable tells us what values the variable takes and how often it takes these values. All of these choices are correct. Question 9 The purpose of pie charts and bar charts is to _______________. a) b) c) d) hide the data from the audience impress the audience with interesting colors and shapes help the audience grasp the distribution of the data quickly display the distribution of quantitative variables Question 10 Married respondents, in a survey, were asked how they met their spouses. According to the bar chart below, approximately how many respondents met their spouses through a date (dating) service? a) b) c) d) 2 20 30 70 Question 11 Bar graphs are more flexible than pie charts. Both types of graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities measured in the same units. a) b) true false Question 12 Look at the following histogram of salaries of baseball players. What shape would you say the data take? a) b) c) d) e) bi-modal left-skewed right-skewed symmetric uniform Question 13 Look at the following histogram of percent urban population in 50 U.S. states. What shape would you say the data take? a) b) c) d) e) bi-modal left-skewed right-skewed symmetric uniform Question 14 Wine bottles at a small winery are sampled and tested for quality. One measurement is volume: the bottles should be filled to 750 ml, with some variation expected. Based on this histogram, does 750 ml seem to be a reasonable center for the data? a) b) c) Yes, the center is approximately 750 ml. No, the center is too variable. No, the data range from 744 to 755 ml. Question 15 Considering this histogram, about how many bottles of wine have more than 752 ml? a) b) c) d) 17 44 6 5 Question 16 In the dataset represented by the following stemplot, in which the leaf unit = 1.0, how many times does the number “28” occur? 0 9 1 246999 2 111134567888999 3 000112222345666699 a) b) c) d) 0 1 3 4 4 001445 5 0014 6 7 7 3 Question 17 Which of these plots is a time plot? a) b) c) d) Plot A Plot B Both of these answers are correct. None of these choices is correct. Question 18 What type of trend does the following time plot show? a) b) c) downward upward no trend Question 19 What would be the correct interpretation of the following graph? a) b) c) There is an upward trend in the data. There is a downward trend in the data. The data show seasonal variation. Question 20 An individual value that falls outside the overall pattern is called ____________. a) b) c) d) the center a shape an outlier a spread