Welcome to MATH171!
Overview of Syllabus
Technology Overview
Basic Skills Quiz
Start Chapter 1!

Displaying data with graphs
BPS chapter 1
© 2006 W. H. Freeman and Company
With Modifications by Dr. M. Leigh Lunsford

What is Statistics?
Statistics is the Science of Learning from Data
The Collection and Analysis of Data
  Sampling and Experimental Design (Chapters 8 & 9)
  Probability & Sampling Distributions (Chapters 10 & 11)
  Descriptive Statistics (Data Exploration) (Chapters 1 - 5)
  Inferential Statistics (Chapters 14 - 21)

Objectives for Chapter 1
Picturing Distributions with Graphs
  Individuals and variables
  Two types of data: categorical and quantitative
  Ways to chart categorical data: bar graphs and pie charts
  Ways to chart quantitative data: histograms and stemplots
  Interpreting histograms
  Time plots

Individuals and variables (page 3)
Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things.
  Example: Freshmen, 6-week-old babies, golden retrievers, fields of corn, cells
A variable is any characteristic of an individual. A variable can take different values for different individuals.
  Example: Age, height, blood pressure, ethnicity, leaf length, first language

Two types of variables (page 4)
A variable can be either
quantitative: Something that can be counted or measured for each individual and then added, subtracted, averaged, etc., across individuals in the population.
  Example: How tall you are, your age, your blood cholesterol level, the number of credit cards you own.
OR
categorical: Something that falls into one of several categories. What can be counted is the count or proportion of individuals in each category.
  Example: Your blood type (A, B, AB, O), your hair color, your ethnicity, whether you paid income tax last tax year or not.

Example 1.1 (pages 4-5)
How do you determine if a variable is categorical or quantitative?
Identify individuals, variables and types of variables.
Ways to graph categorical data
Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).
Bar graphs: Each category is represented by a bar.
Pie charts: Use when you want to emphasize each category’s relation to the whole.

Example: Top 10 causes of death in the United States, 2001

Rank  Causes of death      Counts    Percent of top 10s  Percent of total deaths
1     Heart disease        700,142   37%                 29%
2     Cancer               553,768   29%                 23%
3     Cerebrovascular      163,538   9%                  7%
4     Chronic respiratory  123,013   6%                  5%
5     Accidents            101,537   5%                  4%
6     Diabetes mellitus    71,372    4%                  3%
7     Flu and pneumonia    62,034    3%                  3%
8     Alzheimer’s disease  53,852    3%                  2%
9     Kidney disorders     39,480    2%                  2%
10    Septicemia           32,238    2%                  1%
      All other causes     629,967                       26%

How did they get these numbers?
For each individual who died in the United States in 2001, we record what was the cause of death. The table above is a summary of that information.

Bar graphs
Each “value” of the categorical variable is represented by one bar. The bar’s height shows the count (or sometimes the percentage) for that particular category.
[Figure: bar graph, "Top 10 causes of death in the U.S., 2001"]
The number of individuals who died of an accident in 2001 is approximately 100,000.
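The "Percent of top 10s" column can be checked directly from the counts. The following is a minimal Python sketch, not part of the course materials; the names `top10_counts` and `percent_of_top10` are made up for illustration. The "Percent of total deaths" column is left out because the overall number of deaths is not listed in the table.

```python
# Counts copied from the 2001 cause-of-death table above.
top10_counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}

# Total number of deaths among the top 10 causes only.
top10_total = sum(top10_counts.values())


def percent_of_top10(cause):
    """Share of the top-10 deaths due to one cause, as a whole percent."""
    return round(100 * top10_counts[cause] / top10_total)


# Sorting by count (largest first) gives the rank order used in the bar graph.
ranked = sorted(top10_counts, key=top10_counts.get, reverse=True)
for rank, cause in enumerate(ranked, start=1):
    count = top10_counts[cause]
    print(f"{rank:2d}  {cause:<20} {count:>8,}  {percent_of_top10(cause):>3}%")
```

Each bar's height in a count-based bar graph is one of these counts; a percent-based bar graph would use the rounded percents instead.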
[Figure: bar graphs of the top 10 causes of death in the U.S., 2001; vertical axis: Counts (x1000), 0 to 800; one bar per cause of death]

Top 10 causes of death in the U.S., 2001
Bar graph sorted by rank: Easy to analyze
Sorted alphabetically: Much less useful

Pie charts
Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.
[Figure: pie charts, "Percent of people dying from top 10 causes of death in the U.S., 2000"; panels: "Percent of deaths from top 10 causes" and "Percent of deaths from all causes"]
Make sure your labels match the data!
Make sure all percents add up to 100!!

Apply Your Knowledge Problem 1.4
Let’s work Problem 1.4 (page 10) together!
Bar graph (in count & percent)
Pie chart?

Births in 2004 by Day of Week (pie chart labels)
Sun 10%, Mon 15%, Tues 16%, Wed 16%, Thurs 16%, Fri 16%, Sat 11%

Number of Babies Born on Each Day of the Week in 2003

Day of Week  Births
Sun          7563
Mon          11733
Tues         13001
Wed          12598
Thurs        12514
Fri          12396
Sat          8605

Ways to chart quantitative data
Histograms and stemplots
These are summary graphs for a single variable. They are very useful to understand the pattern of variability in the data.
Line graphs: time plots
Use when there is a meaningful sequence, like time. The line connecting the points helps emphasize any change over time.
Other graphs to reflect numerical summaries (see Chapter 2)

An Example
Suppose we want to determine the following:
  What percent of all fifth grade students in our district have an IQ score of at least 120?
  What is the average IQ score of all fifth grade students in our district?
It is too expensive to give an IQ test to all fifth grade students in our district. Below are the IQ test scores from 60 randomly chosen fifth graders in our district. (Individuals (subjects)?, Variable(s)?)

Previews of Coming Attractions!
We are interested in questions about a population (all fifth grade students in our district). We want to know the percent (or proportion) of the population in a particular category (IQ score of at least 120) and the average value of a variable for the population (average IQ score).
We have taken a random sample from the population. Eventually we will use the data from the sample to infer about the population. (Inferential Statistics)
For now we will describe the data in the sample. (Descriptive Statistics)
  We will graphically represent the IQ scores for our sample (histogram & stem and leaf).
  We will find the percent of students in our sample with an IQ score of at least 120 and understand how that percent relates to the graph.
  Later (Chapter 2) we will also be able to describe the data with numerical summaries and other types of plots (boxplots).

Stemplots (page 19)
How to make a stemplot:
1) Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2) Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.
3) Write each leaf in the row to the right of its stem, in increasing order out from the stem.
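The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming non-negative whole-number observations; the function name `stemplot` is made up here, and the sample values are the ones used in the practice exercise for this slide.

```python
from collections import defaultdict


def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    # Step 1: split each observation into a stem (all but the last digit)
    # and a leaf (the last digit). Sorting first keeps leaves in order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    # Step 2: list every stem from smallest to largest, including stems
    # that have no leaves, so gaps in the data stay visible.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: write the leaves to the right of the stem, in increasing order.
        row_leaves = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row_leaves}")
    return rows


data = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]
for row in stemplot(data):
    print(row)
```

Keeping the empty stems (here 1 and 6) is a deliberate choice: dropping them would hide gaps in the distribution.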
Let’s try it with this data: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70

Now Let’s Make a Stemplot for Our IQ Data
Stem & Leaf Plot for IQ Data
IQ Test Scores for 60 Randomly Chosen 5th Grade Students

Stem and Leaf plot for IQ Scores
stem unit = 10
leaf unit = 1

Frequency  Stem  Leaf
 3          8    129
 4          9    0467
14         10    01112223568999
17         11    00022334445677788
11         12    22344456778
 9         13    013446799
 2         14    25
60

Now Let’s Make a Histogram (pages 10-12)
Use the Same IQ Data
We will start by hand, using class (bin) widths of 10 starting at 80.
Make a Frequency Table for the data:
Variable: X = IQ score

Frequency Table:
Bins            Frequency  Percent
80 ≤ X < 90      3          5.0%
90 ≤ X < 100     4          6.7%
100 ≤ X < 110   14         23.3%
110 ≤ X < 120   17         28.3%
120 ≤ X < 130   11         18.3%
130 ≤ X < 140    9         15.0%
140 ≤ X < 150    2          3.3%
totals:         60         99.9%

[Figure: histogram, "IQ Scores of Randomly Chosen Fifth Grade Students"; horizontal axis: IQ Score, 80 to 150; vertical axis: Percent; bar heights 5.0, 6.7, 23.3, 28.3, 18.3, 15.0, 3.3]
Percent of What? What is the meaning of this bar?

Back to Our Question:
What percent of the 60 randomly chosen fifth grade students have an IQ score of at least 120?
Numerically?
  18.3% + 15% + 3.3% = 36.6%
  (11 + 9 + 2)/60 = .367 or 36.7%
  (The two answers differ only because the bin percents were rounded before adding.)
How to Represent Graphically?
  Grey Shaded Region corresponds to the 36.6% of students

Another Histogram of the IQ Data!
What is Different From the Histogram we Generated In Class?

How to create a histogram
It is an iterative process: try and try again.
What bin (class) size should you use?
  Not too many bins with either 0 or 1 counts
  Not so summarized that you lose all the information
  Not so detailed that it is no longer a summary
Rule of thumb: Start with 5 to 10 bins.
Look at the distribution and refine your bins. (There isn’t a unique or “perfect” solution.)
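The frequency table and the "at least 120" percent can be reproduced with a short Python sketch. The scores below are read back from the stem-and-leaf plot (stem unit 10, leaf unit 1); the helper name `frequency_table` and its parameters are illustrative choices, not part of the course materials.

```python
# Leaves for each stem, copied from the stem-and-leaf plot of the IQ data.
stem_leaves = {
    8: "129",
    9: "0467",
    10: "01112223568999",
    11: "00022334445677788",
    12: "22344456778",
    13: "013446799",
    14: "25",
}

# Rebuild the 60 scores: score = 10 * stem + leaf.
scores = [10 * stem + int(leaf)
          for stem, leaves in stem_leaves.items()
          for leaf in leaves]


def frequency_table(data, start, width, n_bins):
    """Count observations in bins [lo, lo + width) and convert to percents."""
    rows = []
    for i in range(n_bins):
        lo = start + i * width
        hi = lo + width
        count = sum(lo <= x < hi for x in data)
        rows.append((lo, hi, count, 100 * count / len(data)))
    return rows


# Bins of width 10 starting at 80, as in the hand-made table.
table = frequency_table(scores, start=80, width=10, n_bins=7)
for lo, hi, count, percent in table:
    print(f"{lo:3d} <= X < {hi:3d}  {count:2d}  {percent:4.1f}%")

# Percent of the sample with an IQ score of at least 120.
at_least_120 = sum(x >= 120 for x in scores)
print(f"{at_least_120}/{len(scores)} = {100 * at_least_120 / len(scores):.1f}%")
```

Changing `width` and `n_bins` is a quick way to try other bin choices, in the spirit of the rule of thumb above.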
Same data set
[Figure: histograms of one data set, labeled "Not summarized enough" and "Too summarized"]
GOAL: Capture Overall Pattern

Apply Your Knowledge
Let’s try problem 1.7 (page 14)
What is the difference between a histogram and a bar chart? See pages 12-13

Interpreting histograms
When describing a quantitative variable, we look for the overall pattern and for striking deviations from that pattern. We can describe the overall pattern of a histogram by its shape, center, and spread.
[Figure: histogram with a line connecting each column (too detailed), and histogram with a smoothed curve highlighting the overall pattern of the distribution]

Most common distribution shapes
Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
[Figure: histogram with Alaska and Florida marked as outliers]
The overall pattern is fairly symmetric except for two states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.

IMPORTANT NOTE: Your data are the way they are. Do not try to force them into a particular shape.
It is a common misconception that if you have a large enough data set, the data will eventually turn out nice and symmetrical.

Line graphs: time plots
Time always goes on the horizontal, or x, axis.
The variable of interest, here “retail price of fresh oranges”, goes on the vertical, or y, axis.
This time plot shows a regular pattern of yearly variations. These are seasonal variations in fresh orange pricing most likely due to similar seasonal variations in the production of fresh oranges.
There is also an overall upward trend in pricing over time. It could simply be reflecting inflation trends or a more fundamental change in this industry.

Let’s Start Problem 1.41 on Page 35.

Scales matter
How you stretch the axes and choose your scales can give a different impression.
[Figure: four time plots, each titled "Death rates from cancer (US, 1945-95)"; horizontal axis: Years, 1940 to 2000; vertical axis: Death rate (per thousand); three panels use a 0 to 250 vertical scale with different axis stretches, and one uses a 120 to 220 vertical scale]
A picture is worth a thousand words, BUT there is nothing like hard numbers. Look at the scales.