MAT 1000 Mathematics in Today's World

Last Time
1. Collecting data requires making measurements.
2. Measurements should be valid.
3. We want to minimize bias and variability as much as possible.

Today
1. Three keys for summarizing a collection of data
2. The distribution of a data set
3. Two ways to visualize a distribution

Summarizing data
The best summary of a large collection of data tells us about three things:
• Shape
• Center
• Spread
Today we focus on the “shape” of a collection of data.

Visualization
A graph is a visual presentation of a collection of data. Graphing is an excellent way to reveal the shape of a collection of data.
There are many different types of graph, each with advantages and disadvantages. We will look at two types of graph:
• Histograms
• Stemplots

Organizing data
Before we can visualize the data, it may be necessary to organize it. One way is to count how often particular values occur in our data set. For example: how many students in this class are psychology majors?

The number of times a value occurs is called the value’s frequency.
Number of psychology majors = frequency of psychology majors.
The proportion of times a value occurs is called the relative frequency of that value.
Percent of psychology majors = relative frequency of psychology majors.

The variable “a student’s major” is not numeric. For non-numeric variables we can always find frequencies or relative frequencies. What about numeric variables?

We can find the frequency or relative frequency for numeric variables, but often there’s a better option: organize by grouped frequencies. This means we put the data into classes, lumping together numbers which are close.

However we choose to organize the data (by count, proportion, or in classes), we produce a list of different values and how often they occur.
Distribution: a list of different data values and how often each value occurs.
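As a minimal sketch of the definitions above (frequency, relative frequency, and a distribution as a list of values with how often each occurs), the Python below tabulates a non-numeric variable. The list of majors and the function name `distribution` are made up for illustration; they are not data from this class.

```python
from collections import Counter

def distribution(values):
    """Return {value: (frequency, relative frequency)} for a list of values."""
    counts = Counter(values)      # frequency: how many times each value occurs
    total = len(values)
    # relative frequency: the proportion of all observations with that value
    return {v: (n, n / total) for v, n in counts.items()}

# Hypothetical data: the majors of 8 students (invented for this example).
majors = ["psychology", "biology", "psychology", "art",
          "biology", "psychology", "history", "psychology"]

dist = distribution(majors)
for value, (freq, rel) in sorted(dist.items()):
    print(f"{value:<12} frequency = {freq}   relative frequency = {rel:.1%}")
```

The frequencies always add up to the number of observations, and the relative frequencies always add up to 1 (that is, 100%).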
A distribution shows the “shape” of the data. This shape is best presented visually.

Example
Consider the set: 3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49 (the ages of a population consisting of 16 people).
Knowing the frequency of each value (how many 1s, how many 2s, how many 3s, etc.) would be useless, because no number occurs more than once. Instead, let’s look at grouped frequencies.

Data Range    Frequency
0-9           1
10-19         3
20-29         6
30-39         4
40-49         2

Now we can make a chart of the frequency distribution of the data. The following is called a frequency histogram:
[Figure: frequency histogram of the ages data]

Histograms
There is a bar for each class. The height of the bar is the number of data values in the class.
Note that the bars touch each other. Only leave a blank space for empty classes.

The shape of a distribution
Important features to identify:
• Number of peaks
• Symmetric or asymmetric
• Asymmetric: skewed to the left, the right, or neither
• Outliers: values that stand out from the overall shape
• Clusters

Symmetric Distributions: Bell-Shaped
Symmetric Distributions: Mound-Shaped
Symmetric Distributions: Uniform
Asymmetric Distributions: Skewed to the Left
Asymmetric Distributions: Skewed to the Right

The shape of a distribution
Earlier example: symmetric with one peak and no outliers or clusters.
[Figure: frequency histogram of home runs, classes 0-9 through 70-79] Asymmetric with one peak, skewed to the left, no clusters, one outlier in the 70-79 class.
Asymmetric with one peak, skewed to the right, and no outliers or clusters.
Asymmetric with multiple peaks, not skewed, no outliers, two clusters.

The disadvantage of histograms
In a histogram the original data points are lost.
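To make this loss concrete, here is a small Python sketch. It groups the ages from the earlier example into classes of width 10, then does the same for a second data set that was invented for this illustration. The function name `grouped_frequencies` and the second data set are assumptions, not part of the lecture.

```python
def grouped_frequencies(data, width=10):
    """Count how many non-negative whole numbers fall in each class of the given width."""
    top = max(data) // width          # index of the last class needed
    counts = [0] * (top + 1)          # empty classes keep a count of 0
    for x in data:
        counts[x // width] += 1
    return {f"{k * width}-{k * width + width - 1}": n
            for k, n in enumerate(counts)}

ages = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49]
# Invented data set: different values, same number of values in each class.
other = [9, 10, 10, 10, 20, 20, 20, 20, 20, 20, 30, 30, 30, 30, 40, 40]

table = grouped_frequencies(ages)
for data_range, frequency in table.items():
    print(f"{data_range:<8}{frequency}")

# Both data sets produce the same table, so their histograms are identical.
print(table == grouped_frequencies(other))
```

Because two different data sets can share one table of grouped frequencies, the histogram drawn from that table cannot tell us which values were actually observed.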
[Figure: frequency histogram of home runs; horizontal axis "Home Runs" with classes 0-9 through 70-79, vertical axis "Frequency" from 0 to 8]
We can see that there is one data value in the 70-79 range, but there is no way to determine the value.

Stemplots
Here is a sample of a stemplot.
[Figure: sample stemplot]
The numbers on the left are the “stems.” The other numbers are the “leaves.”
The leaf is the rightmost digit of the data value. The stem is the rest of the data value.
For example, the 0 in the last row means that the number 60 is in this data set.
Notice there are no leaves on the 1 stem, but we still include it in the stemplot.

How to make a stemplot
1. Each observation gets separated into a stem (all but the rightmost digit) and a leaf (the final digit).
2. The stems get put in a vertical column with the smallest at the top. A vertical line is then drawn to the right of this column.
3. Each leaf is then written in the row to the right of its stem, in increasing order out from the stem.
4. Make sure to line up the leaves in columns.

Example
The following data are the annual home run totals of the baseball player Barry Bonds over his entire 22-year career, sorted from smallest to largest.
5 16 19 24 25 25 26 28 33 33 34 34 37 37 40 42 45 45 46 46 49 73

0 | 5
1 | 6 9
2 | 4 5 5 6 8
3 | 3 3 4 4 7 7
4 | 0 2 5 5 6 6 9
5 |
6 |
7 | 3

Comparing histograms and stemplots
Let’s compare our stemplot to a histogram of the same data.
[Figure: the stemplot above next to a frequency histogram of the same data; horizontal axis "Home Runs" with classes 0-9 through 70-79, vertical axis "Frequency"]
Stemplots are like histograms that are “tipped over.” Stemplots give all of the same information about the shape of the distribution. In addition, stemplots show all of the data values, which histograms do not.
But we can’t use stemplots for large data sets.

How to make a stemplot
Sometimes you may need to round the data to improve a stemplot.

Example
8.623 8.735 9.529 9.873 10.023
After rounding to the nearest tenth, these are
8.6 8.7 9.5 9.9 10.0
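The stemplot steps and the rounding example can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's official method: the function name `stemplot` and its `leaf_unit` parameter are hypothetical, and the data are assumed to be non-negative. Each value is rounded to the nearest `leaf_unit`, the final digit becomes the leaf, and every stem between the smallest and largest is listed even if it has no leaves.

```python
def stemplot(data, leaf_unit=1):
    """Return the rows of a stemplot as strings (non-negative data assumed).

    leaf_unit is the place value of a leaf: 1 for whole numbers, 0.1 for tenths.
    """
    # Round each value to the nearest leaf_unit and write it as a whole number
    # of leaf units, e.g. 8.623 with leaf_unit=0.1 becomes 86.
    scaled = sorted(round(x / leaf_unit) for x in data)
    leaves_by_stem = {}
    for v in scaled:
        # stem = all but the rightmost digit, leaf = the rightmost digit
        leaves_by_stem.setdefault(v // 10, []).append(v % 10)
    lo, hi = scaled[0] // 10, scaled[-1] // 10
    width = len(str(hi))                  # so the vertical line lines up
    rows = []
    for stem in range(lo, hi + 1):        # stems with no leaves are kept
        leaves = " ".join(str(d) for d in leaves_by_stem.get(stem, []))
        rows.append(f"{stem:>{width}} | {leaves}".rstrip())
    return rows

# The rounding example: leaves are tenths.
rounded_rows = stemplot([8.623, 8.735, 9.529, 9.873, 10.023], leaf_unit=0.1)
print("\n".join(rounded_rows))

# The Barry Bonds home run totals from the earlier example: leaves are ones.
bonds = [5, 16, 19, 24, 25, 25, 26, 28, 33, 33, 34,
         34, 37, 37, 40, 42, 45, 45, 46, 46, 49, 73]
bonds_rows = stemplot(bonds)
print("\n".join(bonds_rows))
```

In the rounded stemplot the stems are 8, 9, and 10 and the leaves are the tenths digits, so the row for stem 8 holds the leaves 6 and 7 (for 8.6 and 8.7).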