CHAPTER 1, SECTIONS 1.1 AND 1.2
Lecture 2, Section 1.1 & 1.2
Revised January 24, 2012

Section 1.1 Looking at Data

Individuals (units) are the objects described by a set of data. A unit can be a person, a place, or a thing (e.g., a student, the city of West Lafayette, Purdue University). A person is often called a subject.

A variable is any characteristic of an individual. It could be your birth date, gender, marital status, in-state or out-of-state status, major, semester in school, GPA, etc.

There are two types of variables:

Quantitative variable: takes numerical values for which arithmetic operations such as adding and averaging make sense. Examples: your age, GPA, exam scores in this course.

Categorical variable: places an individual into one of several groups or categories, and is summarized by the count or percent of the individuals in each category. Examples: your major, gender, academic year (freshman, sophomore, junior, senior), in-state vs. out-of-state status, etc.

The distribution of a variable describes what values the variable takes and how often it takes each value.

If you have more than one variable in a problem, look at each variable by itself first, then look for any relationships between the variables.

Example: Identify the unit in each of the following questions and whether the answer would give you categorical or quantitative data. If it is categorical, state the possible answers.
a) What letter grade did you get in your Calculus class last semester?
b) What was your score on the last exam?
c) What is your GPA?
d) How many red M&Ms are in this bag?
e) What colors are in the bag?
f) What color was the most common?
g) Which type of M&Ms has more red ones, peanut or plain?

We will look at describing data on a single variable by:
1. Graphing it, always a good place to start.
2. Finding numerical summaries.

Graphing Methods: The best way to examine data is to make a graph of it. Graphing techniques depend on the type of variable.
For categorical variables, the most commonly used types are:
1. Bar graph
2. Pie graph

With bar graphs, the y-axis represents either
1. the frequency (number of observations) in each of the categories displayed on the x-axis, OR
2. the relative frequency (number of observations in the category / total number of observations).

Example: The distribution of the brand of golf drivers used in the 2008 Mercedes Championship:

Driver Brand    No. of Users    Percent
Taylor Made     15              37.5
Titleist        10              25
Callaway        6               15
Ping            3               7.5
Nike            2               5
All others      4               10
Total           40              100

For quantitative variables:
1. Stem-and-leaf plots
2. Histograms
3. Boxplots

Stem-and-leaf plots: Display the actual values of all observations. Good for small amounts of data. Similar to a histogram.

Steps:
1. Arrange the data in ascending order.
2. Dissect each observation into a stem and a leaf. The leaf is the right-most digit (always a single digit from 0 to 9), and the stem is all digits in front of the leaf.
3. Write the stems in a vertical column with the smallest at the top, and draw a vertical line to separate the stems from the leaves.
4. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
5. As an option, we might want to "split" the stems, i.e., list each stem twice, in order to spread out the leaves. Leaves 0 through 4 go on the first stem and leaves 5 through 9 go on the second stem.
6. As another option, we might want to reduce the number of stems by rounding each observation to a value ending in 0 and dropping the 0, so that the leaf is the last digit before the 0.

Example: Bob's last 20 golf scores, beginning with his last score.
69 73 77 77 80 76 75 77 78 78 77 81 82 75 79 76 83 77 80 84

Leaves in the order recorded:
stems | leaves
6     | 9
7     | 3776578875967
8     | 012304

Leaves in increasing order:
stems | leaves
6     | 9
7     | 3556677777889
8     | 001234

Split stems:
stems | leaves
6     |
6     | 9
7     | 3
7     | 556677777889
8     | 001234
8     |

Example: Breaking strength of wood.
23422, 25389, 28128, 22673, 29452, 28138, 24487, 26841, 27793

Rounded to 100's:
stems | leaves
22    | 7
23    | 4
24    | 5
25    | 4
26    | 8
27    | 8
28    | 11
29    | 5

Rounded to 1000's (split stems):
stems | leaves
2     | 334
2     | 578889

Histograms: Used for large data sets. The data are classified into class intervals, and the height of each bar displays the count or the percent of the observations in that class interval.

Steps:
1. Divide the range of the data into class intervals of equal width.
2. Count the number of individuals in each class interval, or compute the percent of the total in each class interval. The counts in each class interval are called frequencies. The table of frequencies for all class intervals is called a frequency table.
3. Draw the histogram.

Example 1: USGA handicap ranges and percentage within each range.

HANDICAP    PERCENT
0 - <5      4.8
5 - <10     15.6
10 - <15    26.4
15 - <20    24.8
20 - <25    15.6
25 - <30    7.9
30 - <35    3.2
35 - <40    1.7
All         100

If the overall pattern of a large number of observations is quite regular, we choose to describe it by a smooth curve called a density curve. A density curve is an idealized model for a distribution of data.

There are many types of density curves. Some are sketched below.
[Figure: density curves labeled Unimodal, Bimodal, Multimodal]

What to look for in a histogram or density curve:
Look at the overall pattern of the data and any deviations from that pattern (outliers). The pattern is described by shape, center, and spread.

1. SHAPE:
[Figure: curves labeled Unimodal Left Skewed, Unimodal Symmetric, Unimodal Right Skewed]

2. The center is defined by either 1) median, 2) mean, or 3) mode.
The median, M, is the middle value in an ascending list of values.
The mean is the arithmetic average of all the values: either the sample mean, $\bar{x}$, or the population mean, $\mu$.
The mode is the value which occurs most often.

If the pattern is symmetric, the mean = median.
If the pattern is left skewed, the mean < median.
If the pattern is right skewed, the mean > median.

3. The spread defines the width of a distribution.
Sometimes the spread is a From ___ To ____ expression.
Sometimes it is just the total range.
Sometimes it is the center + a value and the center - a value.

Section 1.2 Describing Distributions with Numerical Summaries

A numerical summary must describe two important features:
1. The value that represents the center of the distribution, and
2. A measure of the spread (variability) going away from the center.

Measuring Center:

1. Mode: The measurement that occurs most often. Not often used.

2. Median: The middle value (countwise) when the measurements are arranged from the lowest to the highest. The symbol is M.

Steps to finding the median:
a. Arrange the observations from smallest to largest. Count the observations, n. Calculate $(n+1)/2$ to find the position of the center of the data set.
b. If n is odd, M is the data point at the center of the data set.
c. If n is even, $(n+1)/2$ ends with .5 and falls between two data points, called the middle pair. M = the average of the middle pair.

Example: Bob's last 20 golf scores,
69 73 77 77 80 76 75 77 78 78 77 81 82 75 79 76 83 77 80 84
Put the data in ascending order:
69 73 75 75 76 76 77 77 77 77 77 78 78 79 80 80 81 82 83 84
n = 20, position of the median = (20 + 1)/2 = 10.5
The median is halfway between the 10th and 11th values (77 and 77) and is therefore 77.

3. Mean or average (arithmetic mean): The sum of the measurements divided by the total number of measurements.

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum x_i$$

On Bob's last 20 golf scores: sum = 1554, so $\bar{x} = 1554 / 20 = 77.7$

RESISTANT MEASURES: A measure that can resist the influence of extreme observations is called a resistant measure.

Characteristics of each measure:

Mode
- There can be more than one mode for a data set.
- Resistant measure.
- For grouped data, the value can change depending on the categories.
- Can be found for both categorical and quantitative data.

Median
- Only one median.
- Resistant measure.
- Quantitative data only.
- Used with skewed data usually.
- Affected less than the mean.

Mean
- Only one mean.
- Non-resistant measure.
- Quantitative data only.
- Used with symmetric data usually.
- Affected more than the median.

Measuring Spread (variability):

There are three measures of spread that we will look at:
1. Range, which is Maximum - Minimum
2. Five number summary, which is Minimum, First Quartile, Median, Third Quartile, and Maximum
3. Standard deviation, or variance, which is the standard deviation squared.

1. Range
The difference between the largest and the smallest measurement of a data set.
Range = Maximum - Minimum
Example: the range of Bob's golf scores is 84 - 69 = 15

2. The Five Number Summary
The five number summary involves determination of the quartiles and the median. We showed how the median is determined above. Now we show how the quartiles are determined.

The quartiles Q1 and Q3 are calculated as follows:
a. Arrange the observations in ascending order and determine the median, M, as shown previously.
b. Repeat the procedure to find the median of the lower half, i.e., the first quartile, Q1, is the median of the observations which are below the overall median, M.
c. Repeat the procedure to find the median of the upper half, i.e., the third quartile, Q3, is the median of the observations which are above the overall median, M.

Example: Bob's last 20 golf scores, beginning with his last score.
69 73 77 77 80 76 75 77 78 78 77 81 82 75 79 76 83 77 80 84
Put the data in ascending order:
69 73 75 75 76 76 77 77 77 77 77 78 78 79 80 80 81 82 83 84
Lower half: observations 1 through 10, n = 10 in lower half
Upper half: observations 11 through 20, n = 10 in upper half

NOTE THAT THE MEDIAN DOES NOT BELONG TO EITHER HALF. THE MEDIAN SEPARATES THE TWO HALVES.

The first quartile is at position (10 + 1)/2 = 5.5 in the lower half, between the 5th and 6th values (76 and 76), and has a value of 76.
The third quartile is at position 5.5 in the upper half, between the 5th and 6th values of that half (80 and 80), and has a value of 80.
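The median and quartile steps above can be sketched in Python. This is only an illustration under stated assumptions, not part of the original notes: the function name `median_and_quartiles` is a hypothetical helper, and it follows the rule used in these notes that the overall median belongs to neither half (other textbooks and software use slightly different quartile rules).

```python
from statistics import median


def median_and_quartiles(values):
    """Return (Q1, M, Q3) using the notes' rule: halves exclude the median.

    Hypothetical helper for illustration; needs at least 2 observations.
    """
    data = sorted(values)          # step a: arrange in ascending order
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    # When n is odd, skip the middle observation so it is in neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


# Bob's last 20 golf scores, as listed in the notes.
scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]

q1, m, q3 = median_and_quartiles(scores)
data_range = max(scores) - min(scores)
print("Q1 =", q1, "M =", m, "Q3 =", q3, "Range =", data_range)
```

For Bob's scores this should reproduce the values worked out by hand above: Q1 = 76, M = 77, Q3 = 80, and a range of 15.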
Five Number Summary: Minimum, Q1, Median, Q3, Maximum
Example with Bob's golf scores: 69, 76, 77, 80, 84

The Interquartile Range, IQR
The IQR is the distance between the first and the third quartiles.
IQR = Q3 - Q1
In the example above, it would be 80 - 76 = 4

Outliers (1.5 × IQR criterion):
Call an observation a suspected outlier if it falls more than 1.5(IQR) above the third quartile or more than 1.5(IQR) below the first quartile.

Example with Bob's golf scores:
IQR = 80 - 76 = 4
1.5(IQR) = 6
Upper cutoff = 80 + 6 = 86
Lower cutoff = 76 - 6 = 70 (69 is a suspected outlier)

Boxplot: A boxplot is a graph of the five number summary.
- A central box spans the quartiles, Q1 and Q3.
- A line inside the box marks the median, M.
- Lines extend from the box out to the smallest and largest observations.
- In a modified boxplot, lines extend from the box out to the smallest and largest observations which are NOT outliers by the 1.5(IQR) rule, and an asterisk marks any outliers.

Example with Bob's golf scores:
[Figure: boxplot of Bob's golf scores]

3. Standard Deviation
The variance, $s^2$, of a set of observations is:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$

For Bob's last 20 golf scores: $s^2 = 230.2 / 19 = 12.11579$

The standard deviation, $s$, is the square root of the variance, $s^2$.
Example: the standard deviation of Bob's last 20 golf scores is
$s = \sqrt{s^2} = \sqrt{12.11579} = 3.48077$

Characteristics of measures of spread:

Range
- Non-resistant.
- Simple.

IQR
- Resistant.
- Used with the median.
- IQR = 0 does not mean there is no spread.
- Used with nonsymmetric data usually.

Standard Deviation
- Non-resistant.
- Approximately equal to range/4 (a rough rule of thumb).
- Used with the mean.
- Good for symmetric distributions with no outliers.
- s = 0 means there is no spread.

Best method for describing center and spread:
The five number summary is better for describing skewed distributions or distributions with outliers.
The mean and standard deviation are preferred for describing reasonably symmetric distributions free of outliers.
Again, always start with a graph of the data to evaluate skewness.

Linear Transformation of data values:
Example: The temperature is 86 degrees F (Fahrenheit). What is the temperature in Celsius?
C = 5/9 (F - 32)
C = 5/9 (86 - 32) = 30

A linear transformation of the form $x_{new} = a + bx$ multiplies measures of center by b and then shifts them by a. Measures of spread (range, IQR, standard deviation) are multiplied by |b| and are not affected by a.

Mean and Standard Deviation for Grouped Data:
See my webpage for an Excel file which shows how to calculate the mean and standard deviation from grouped data.
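The variance, standard deviation, and linear transformation rules from this section can be checked numerically with a short Python sketch. It is an illustration only: the function names are hypothetical helpers, and the list `temps_f` is made-up data chosen for the demonstration, not data from the notes. The Fahrenheit-to-Celsius formula C = 5/9 (F - 32) is rewritten in the form a + bx with a = -160/9 and b = 5/9.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean: sum of the measurements divided by their count."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sample variance s^2: squared deviations from the mean over n - 1."""
    xbar = mean(values)
    return sum((x - xbar) ** 2 for x in values) / (len(values) - 1)


def sample_sd(values):
    """Sample standard deviation s: square root of the variance."""
    return sqrt(sample_variance(values))


def linear_transform(values, a, b):
    """Apply x_new = a + b*x to every observation."""
    return [a + b * x for x in values]


# Bob's last 20 golf scores, as listed in the notes.
scores = [69, 73, 77, 77, 80, 76, 75, 77, 78, 78,
          77, 81, 82, 75, 79, 76, 83, 77, 80, 84]
print(round(mean(scores), 1),
      round(sample_variance(scores), 5),
      round(sample_sd(scores), 5))

# Hypothetical Fahrenheit readings (made-up data for illustration only).
temps_f = [68, 77, 86, 95]
a, b = -160 / 9, 5 / 9            # C = 5/9 (F - 32) = -160/9 + (5/9) F
temps_c = linear_transform(temps_f, a, b)

# The center is multiplied by b and shifted by a; the spread is multiplied by |b|.
print(round(mean(temps_c), 4), round(a + b * mean(temps_f), 4))
print(round(sample_sd(temps_c), 4), round(abs(b) * sample_sd(temps_f), 4))
```

Each of the last two print statements should show the same number twice, which is the transformation rule in action: converting the data and then summarizing gives the same result as summarizing first and then converting the summary.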