STAT 2331 Chapter 1: Looking at Data – Distributions (Exploring and Summarizing Data)

A variable is any characteristic we are studying – height, GPA, religious affiliation, income, major, or number of pets. For each of these characteristics, we expect to see a certain amount of variability – some people are tall, others are short; some people have many pets, some have none. Statistical methods provide ways to measure and understand variability.

1.1 Different Types of Data (textbook pages 2-8)

In general, there are two types of variables: categorical and quantitative. A variable is called categorical if each observation belongs to one of a set of categories. A variable is called quantitative if observations are numerical values. Some examples are given in the table below.

Variable Type    Examples
Categorical      Gender (male, female), type of residence (house, apartment, dorm, other), political affiliation (Republican, Democrat, other)
Quantitative     Age, daily high temperature, SAT score, calories consumed per day

For quantitative variables, we distinguish between discrete and continuous. Discrete variables have a countable number of values (for example, number of pets). Continuous variables have an uncountable number of values; they are usually variables that can take on any value in an interval (for example, height).

If you're not sure whether a variable is categorical or quantitative, think about finding the average. If it makes sense to find the average, then the variable is quantitative. If computing the average does not make sense, then the variable is categorical.

Example 1: Variable types
Identify the variable and then determine if it is a categorical or quantitative variable.
1. The proportion of customers currently using Verizon is 43%.
2. The average high temperature in March (spring break month) in Athens is 66 °F.

Using Frequency Tables
Frequency tables are an easy way to summarize data (usually categorical).
A frequency table is a listing of possible values for a variable together with the number of observations for each value. The term frequency means count.

Example 2: Favorite cookie
Thirty college students were randomly selected and asked which of the following is their favorite cookie: Oreo, sugar, peanut butter, chocolate chip, brownie, or oatmeal raisin. The responses are shown below. Organize the data into a frequency table.

Oreo, Sugar, Peanut Butter, Chocolate Chip, Chocolate Chip, Chocolate Chip
Oreo, Oreo, Oatmeal Raisin, Oreo, Oatmeal Raisin, Chocolate Chip
Chocolate Chip, Sugar, Brownie, Sugar, Oatmeal Raisin, Oreo
Brownie, Peanut Butter, Peanut Butter, Oatmeal Raisin, Chocolate Chip, Oreo
Oatmeal Raisin, Brownie, Peanut Butter, Brownie, Chocolate Chip, Oreo

Favorite Cookie    Frequency
Oreo               ____
Chocolate Chip     ____
Oatmeal Raisin     ____
Sugar              ____
Peanut Butter      ____
Brownie            ____

1.2 Graphical Summaries of Categorical Data (textbook pages 8-27)

The two primary graphical displays for a categorical variable are the bar graph and the pie chart. In the bar graphs shown, notice that (1) the categories are on the horizontal axis, (2) frequencies (or proportions) are on the vertical axis, and (3) the height of the rectangle for each category is equal to the category's frequency or proportion.

A pie chart is a circle divided into sectors. Each sector represents a category of data.

NOW ON TO QUANTITATIVE VARIABLES

1.3 Graphs for Quantitative Variables (textbook pages 8-27)

Recall the two different types of data – categorical and quantitative. Now we'll focus on quantitative variables.

A dot plot is as it sounds – a plot with dots. This is best understood through an example. Consider the data set:
3, 7, 6, 8, 5, 5, 7, 5, 9
A dot plot for this data set looks like this:
[Figure: dot plot of the data set]

Do you know what a histogram is? A histogram is a graph that uses bars to portray the frequencies of the possible outcomes of a quantitative variable. The horizontal, x, axis represents the values the variable can take on.
The vertical, y, axis shows how many observations fall within each range of values.

Example 4: SMU on Offense
The table below shows the number of points that the SMU football team will score in the 2018-2019 season. Consider a histogram with bar widths by tens: 10-19, 20-29, 30-39, 40-49 and 50-59.

Opponent       Points    Opponent       Points
North Texas    31        TCU            52
Michigan       13        Navy           17
HBU            45        UCF            31
Tulane         41        Cincinnati     24
Houston        27        Connecticut    51
Memphis        18        Tulsa          30
UCF            48        Florida St.    26

You can use the point totals to create a frequency table, then use the table to create the histogram. Usually, the bars in a histogram are touching (though that's not always the case in the images provided).

Group    Frequency
10-19    3
20-29    3
30-39    3
40-49    3
50-59    2

[Figure: histogram "SMU Offense"; x-axis: Points Scored (10-19 through 50-59); y-axis: Frequency]

Example 5: Sodium
The sodium values of 20 cereals are given in the table below.

Cereal                   Sodium    Cereal                   Sodium
Frosted Mini Wheats      0         Fruit Loops              140
Raisin Bran              340       Honey Bunches of Oats    180
All Bran                 70        Honey Nut Cheerios       190
Apple Jacks              140       Life                     160
Cap'n Crunch             200       Rice Krispies            290
Cheerios                 180       Honey Smacks             50
Cinnamon Toast Crunch    210       Special K                220
Crackling Oat Bran       150       Wheaties                 180
Fiber One                100       Corn Flakes              200
Frosted Flakes           130       Honeycomb                210

Using intervals of 50, connect the information in the frequency table and the histogram.

Interval    Frequency
0-49        1
50-99       2
100-149     4
150-199     6
200-249     5
250-299     1
300-349     1

Example 5: IQ Scores
The histogram shows 7th grade IQ scores.

How many students were sampled?
a. 205
b. 98
c. 149
d. 324

Using the same histogram, which class has the most observations?
a. 60-69
b. 70-79
c. 80-89
d. 90-99
e. 100-109
f. 110-119
g. 120-129
h. 130-139
i. 140-149

Using the same histogram, how many observations are in the bin with the most?
a. 98
b. 100
c. 109
d. 119

The Shape of a Distribution
Looking at the shape of a histogram (or dot plot) allows us to describe the distribution of the data set.
Some common distributions are presented in the table below.

Shape                 Description
Symmetric/Unimodal    One side is a mirror image of the other. The histogram looks symmetric (SAT scores, height of male SMU students).
Skewed left           Left tail is stretched out longer than the right tail (life span of humans).
Skewed right          Right tail is stretched out longer than the left tail (income, number of pets).
Bimodal               Two distinct humps (height of all SMU students – why?).

We'll deal mostly with the first three types of graphs shown above, all of which are unimodal distributions. In picturing features such as skew and symmetry, it's common to use smooth curves to summarize the shape of a histogram.

Example 6: Reading a Histogram
The figure below shows a histogram for 7th grade IQ scores. Answer the questions that follow.

[Figure: histogram "7th Grade IQ Scores"; x-axis: IQ Score, in classes 60-69 through 140-149; y-axis: Frequency, with the count printed on each bar]

1. How many students were sampled? 2 + 4 + ⋯ + 4 + 1 = 205
2. Which class has the highest frequency? How many observations fell within that range?
3. Which class has the fewest observations? How many observations fell within that range?
4. What proportion of students have an IQ between 120 and 129?
5. Describe the shape of the distribution.

Example 6: More . . . Reading Histograms
Match the variable description to the possible histogram:
1. Scores on a fairly easy exam
2. SAT scores of a group of college students
3. Heights of college students
4. Number of medals won by countries in the 2018 Winter Olympics

[Figure: four histograms labeled A, B, C, and D]

Example 7: Hitting streak
The following histogram shows how many hits a baseball player had in the first eight games of the season. Convert the histogram to a data set.

[Figure: histogram "Hits per Game"; x-axis: Hits; y-axis: Frequency]

Time plots
A time plot shows each value plotted against the time at which it was obtained.
Time goes on the horizontal axis.

1.4 Measuring the Center of Quantitative Data (textbook pages 27-51)

It's always a good idea to look at the data with a graph first to get a feel for the distribution – is it symmetric, is it skewed, are there outliers? (We'll talk more about outliers later.) After graphing the data, you can summarize the data with the center and the spread of the data set.

The most frequently used measures of center are the mean and the median. The mean (average) is the sum of the observations divided by the number of observations. When observations are ordered from smallest to largest, the median is the point that splits the data in two – half of the observations are below it, half of the observations are above it.

The formula for the mean is:
$\bar{x} = \dfrac{\sum x}{n}$

Remember, x̄ is a statistic representing the sample average, whereas μ is a parameter representing the population average. We will never calculate μ.

Example 8: Calculating mean and median

           Test 1    Test 2    Test 3    Test 4    Test 5
Student    20        20        80        100       100

Mean = ____    Median = ____

Why do we need different measures of center?

             Test 1    Test 2    Test 3    Test 4    Test 5    Mean    Median
Student A    50        50        50        50        50        ____    ____
Student B    30        40        55        55        70        ____    ____

1.5 Measuring the Variability of Quantitative Data (textbook pages 27-51)

We've discussed a few ways to describe the center of quantitative data (mean, median), and now we'll talk about how to describe the variability. Why do we need to measure the variability of data?

             Test 1    Test 2    Test 3    Test 4    Test 5    Mean    Median
Student A    50        50        50        50        50        ____    ____
Student B    30        40        50        60        70        ____    ____

Example 9: Variability
Which set of numbers (A, B, or C) do you think has the most variability?
A: 13, 17, 1, 18, 4, 22, 11, 8, 6
B: 122, 125, 127, 126, 128, 121, 128
C: 51, 78, 53, 42, 47, 33, 82, 75, 91

The three measures of spread (range, variance, and standard deviation)
Range:
Variance:
Standard deviation:

Example 10: Calculating standard deviation
A data set contains the observations: 6, 1, 4, 3, 1
a. Σx =
b. Σx² =
c.
Σ(x − x̄) =

Observation    (x − x̄)       (x − x̄)²
6
1
4
3
1
Sums           Σ(x − x̄) =    Σ(x − x̄)² =

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{\underline{\qquad}}{5 - 1}} = \sqrt{\underline{\qquad}}$

Example 11: Exam standard deviation
For an exam given to a class, the students' scores ranged from 35 to 98 with a mean of 74. Which of the following is the most realistic value for the standard deviation: -10, 0, 3, 12, 63? What is unrealistic about the other values?

A distribution with a large standard deviation will be wider than a distribution with a smaller standard deviation. In the graphs below, the second distribution is wider and has a larger standard deviation.

[Figure: two distributions; the second is wider and has a larger standard deviation]

You won't EVER have to calculate standard deviation by hand. Any statistical software package, including StatCrunch, and advanced scientific calculators (TI-83 and TI-84) will do this for you.

Related to the standard deviation is the variance. Variance is the standard deviation squared. The symbol for sample variance is s². If you are given the standard deviation and asked to find the variance, all you have to do is square s. The formula is:
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Because the mean is used in calculating both s and s², these statistics will be influenced by extremely large or extremely small observations.

A Quick Note on Symbols
Remember, a statistic is a numerical summary of a sample. A parameter is a numerical summary of a population. In practice, parameters are generally unknown. I have introduced symbols for the sample standard deviation and the sample variance, but not the population standard deviation or the population variance. A summary of the symbols you should be familiar with is in the table below.

Statistic/Parameter    Sample    Population
Mean                   x̄         μ
Proportion             p̂         p
Standard deviation     s         σ
Variance               s²        σ²

1.6 Using Measures of Position to Describe Variability (textbook pages 27-51)

The mean and median describe the center of the data. The range and standard deviation describe the variability. We can use some measures of position to further describe the data.
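As a recap of the measures of center and variability just listed, here is a minimal Python sketch that computes them for the Example 10 observations (6, 1, 4, 3, 1). Python is an assumption made for illustration only; the course materials name StatCrunch and TI calculators, and the variable names below are hypothetical. The variance follows the sample formula above, dividing by n − 1.

```python
import math
import statistics

# Observations from Example 10.
data = [6, 1, 4, 3, 1]

n = len(data)
mean = sum(data) / n                    # x-bar = (sum of x) / n
median = statistics.median(data)        # middle value of the sorted data
data_range = max(data) - min(data)      # largest value minus smallest value

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance.
std_dev = math.sqrt(variance)

print("mean:", mean)
print("median:", median)
print("range:", data_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 3))
```

The standard library function statistics.stdev(data) uses the same n − 1 divisor, so it can serve as a check on the hand calculation.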
The pth percentile is a value such that p percent of the observations fall at or below that value. Suppose your SAT score falls at the 90th percentile. This means 90% of the people who took the SAT scored at or below your score (and so 10% of them scored above you).

Three useful percentiles are the quartiles. Each set of data has three quartiles. The 25th percentile is referred to as the first quartile (Q1). The 50th percentile is the second quartile (Q2), but we just call this the median. The 75th percentile is referred to as the third quartile (Q3). The quartiles split the distribution into four parts, each containing 25% of the observations.

Example 12: Manual Dexterity
A research study of manual dexterity involved determining the time required to complete a task. The time required for each of 40 individuals is:

7.1   7.2   7.2   7.6   7.6   7.9   8.1   8.1   8.1   8.3
8.3   8.4   8.4   8.9   9.0   9.0   9.1   9.1   9.1   9.1
9.4   9.6   9.9   10.1  10.1  10.1  10.2  10.3  10.5  10.7
11.0  11.1  11.2  11.2  11.2  12.0  13.6  14.7  14.9  15.5

Compute the median, Q1, and Q3.
Median = ______    Q1 = ______    Q3 = ______

The Interquartile Range (IQR)
The middle 50% of observations fall between the first quartile and the third quartile – 25% from Q1 to the median and 25% from the median to Q3. The distance from Q1 to Q3 is called the interquartile range, denoted IQR.

IQR = Q3 − Q1

Example 13: Manual Dexterity IQR
Find the IQR for the manual dexterity data.
Q1 = 8.3    Q3 = 10.85    IQR = ______

Detecting Potential Outliers
An outlier is an unusually small or unusually large observation. Outliers can occur due to an error in data entry, but this isn't always the case. Consider the number of home runs Brady Anderson hit per season from 1992 to 2001: 21, 13, 12, 16, 50, 18, 18, 24, 19, 8. Fifty is an unusually large observation and would likely be classified as an outlier.

One of the ways to flag an observation as potentially being an outlier is the 1.5 × IQR Criterion.
This criterion states that if an observation is less than Q1 − 1.5(IQR) or greater than Q3 + 1.5(IQR), it is considered a potential outlier.

Example 14: Manual Dexterity outliers?
Are any of the observations in the manual dexterity data outliers?
Recall: Q1 = 8.3, Med = 9.25, Q3 = 10.85, Min = 7.1, Max = 15.5, IQR = 2.55

The Five-Number Summary and Box Plots
The five-number summary of a data set is the minimum value, the first quartile, the median, the third quartile, and the maximum value. The five-number summary is the basis of a graphical display called the box plot. In other textbooks, the box plot is occasionally called a box-and-whisker plot because of its appearance.

Example 15: Manual Dexterity Boxplot
Using the manual dexterity data from the previous example, construct a box plot.
Recall: Q1 = 8.3, Med = 9.25, Q3 = 10.85, Min = 7.1, Max = 15.5, IQR = 2.55

How to Measure and Report Center and Spread
The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with extreme outliers. Use the mean and the standard deviation only for reasonably symmetric distributions that are free of outliers.

Which measures of center and spread should be used for a distribution?
• If the shape is skewed, the median and IQR should be reported.
• If the shape is unimodal and symmetric, the mean and standard deviation, and possibly the median and IQR, should be reported.
• If there are multiple modes, try to determine if the data can be split into separate groups.
• If there are unusual observations, point them out and report the mean and standard deviation with and without those values.
• Always pair the median with the IQR and the mean with the standard deviation.
• Remember: the median and IQR are resistant to skewness and outliers, but the mean and standard deviation are not.

Side-by-Side Boxplots Help to Compare Groups
Boxplots are particularly useful to compare two or more groups on a quantitative variable.
Some engineers in Germany were investigating ways to improve traffic flow by enabling traffic lights to communicate information about traffic flow with nearby lights. The graph below displays the results of one experiment that simulated buses moving along a street and recorded the delay time (in seconds) for both a fixed time and a flexible system of traffic lights. Compare the two groups.

[Figure: side-by-side box plots of delay time (in seconds) for the fixed time and flexible systems]

Using the Box Plot to Determine the Shape of the Distribution
Like a histogram, a box plot can tell us if the distribution of the data is symmetric or if skew is present.
• If the box plot looks approximately symmetric, then the distribution is approximately symmetric.
• If the median is closer to Q1 and/or the right whisker is much longer than the left whisker, then the distribution is skewed right.
• If the median is closer to Q3 and/or the left whisker is much longer than the right whisker, then the distribution is skewed left.

1.7 Normal Distribution (textbook pages 27-51)

Density Curves
A density curve is a curve that
- is always on or above the horizontal axis
- has an area of exactly 1 underneath it

Comparing the Mean and Median
The mean is sensitive to extreme values in the data set. This means an extremely large value will pull the mean to the right and an extremely small value will pull the mean to the left. This is because the calculation of the mean uses all values in the data set. The median, which is determined by only the values in the middle of the data set, is generally resistant to extreme values.

Normal Distributions
All normal distributions have symmetric, unimodal, and bell-shaped density curves. The mean μ determines the location of the curve and the standard deviation σ determines its spread.

The Empirical Rule (aka 68-95-99.7 Rule)
In a normal distribution, the value of σ has a more precise interpretation.
The Empirical Rule says that if a distribution is bell shaped, then approximately
• 68% of the observations fall within 1 standard deviation of the mean, denoted μ ± σ
• 95% of the observations fall within 2 standard deviations of the mean, μ ± 2σ
• 99.7% of the observations fall within 3 standard deviations of the mean, μ ± 3σ

This graph illustrates this rule.
[Figure: bell-shaped curve marked at 1, 2, and 3 standard deviations on either side of the mean]

Example 12: White Walkers
White walkers have taken over Westeros. Once the white walkers invade, the mean time until every dead body in one city is revived is 150 minutes with a standard deviation of 25 minutes.
1. What percentage of cities are entirely revived in somewhere between 125 and 150 minutes?
2. For the middle 95% of cities, every dead body is revived within what range of times?
3. What percentage of cities are entirely revived in less than 125 minutes?

Standardizing observations: z-score
The z-score of an observation tells us how many standard deviations the observation falls from the mean. A positive z-score indicates the observation is above the mean. A negative z-score indicates the observation is below the mean.

$z = \dfrac{\text{observation} - \text{mean}}{\text{standard deviation}} = \dfrac{\text{observation} - \mu}{\sigma}$

Example 16: Height and z-scores
The heights of men aged 20-29 follow a normal distribution with a mean of 70 inches and a standard deviation of 2.8 inches. The heights of women also follow a normal distribution, with a mean of 64.6 inches and a standard deviation of 2.6 inches.
1. Find the z-score for a 75-inch tall man.
2. Find the z-score for a 70-inch tall woman.
3. Who is relatively taller, a 75-inch man or a 70-inch woman?
4. What is the z-score for a man whose height is 2.3 standard deviations below the average height?
5. David's height is 1.1 standard deviations above the mean. How tall is David?

Using z-scores to check for Outliers
When the data are approximately normal, a data value is regarded as an outlier if it falls more than three standard deviations from the mean.
In other words, if an observation has a z-score less than -3 or greater than +3, then it is an outlier. (Think about this in connection with the Empirical Rule, which says that nearly all (99.7%) of the data should fall within three standard deviations of the mean. Any observation beyond this three standard deviation band is an outlier.)

Example 17: Heights of Men
Assume that male heights are normally distributed with a mean of 70 inches and a standard deviation of 2.8 inches.
1. Would a male with a height of 78 inches be considered an outlier?
2. What height would be the cutoff for outliers among short men?
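
The z-score rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the course materials: the function names z_score and is_outlier are hypothetical, and the mean and standard deviation are the male-height values given in Example 17.

```python
def z_score(observation, mean, sd):
    """Return how many standard deviations the observation lies from the mean."""
    return (observation - mean) / sd


def is_outlier(observation, mean, sd, cutoff=3):
    """Flag an observation whose z-score is below -cutoff or above +cutoff."""
    return abs(z_score(observation, mean, sd)) > cutoff


# Values from Example 17 (male heights, in inches).
MEAN_HEIGHT = 70.0
SD_HEIGHT = 2.8

# z-score for a 78-inch male, and whether the three-standard-deviation rule flags it.
z = z_score(78, MEAN_HEIGHT, SD_HEIGHT)
flagged = is_outlier(78, MEAN_HEIGHT, SD_HEIGHT)

# Heights exactly three standard deviations below and above the mean.
low_cutoff = MEAN_HEIGHT - 3 * SD_HEIGHT
high_cutoff = MEAN_HEIGHT + 3 * SD_HEIGHT

print("z-score for 78 inches:", round(z, 2))
print("flagged as outlier:", flagged)
print("cutoffs:", round(low_cutoff, 1), "to", round(high_cutoff, 1))
```

The cutoff is left as a parameter because three standard deviations is a convention tied to approximately normal data; the rule should not be applied to strongly skewed distributions.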