PROBABILITY & STATISTICS FOR P-8 TEACHERS
Chapter 3: Data Description

WHAT IS NEXT?
Now that we know how to organize data and create graphs to present the results, we need to focus on describing patterns in the data.
Summarizing data sets numerically:
o Are there certain values that seem more typical for the data?
o How typical are they?
A number that helps describe a set of data is an AVERAGE, sometimes called a MEASURE OF CENTRAL TENDENCY.

NUMERICAL MEASURES OF DATA
o Central tendency is the value or values around which the data tend to cluster.
o Variability shows how strongly the data cluster around that value.

FINDING THE CENTER
All of these are measures of central tendency:
o MEAN
o MEDIAN
o MODE
o MIDRANGE
The question “What’s my average?” has many meanings. What we should say is “What’s my mean?”

WHAT DO THEY ALL MEAN?

MEAN
The arithmetic mean (mean) is the measure of center obtained by adding the values and dividing the total by the number of values. It is what most people call an average.

NOTATION
o $\sum$ denotes the sum of a set of values.
o $x$ is the variable usually used to represent the individual data values.
o $n$ represents the number of data values in a sample.
o $N$ represents the number of data values in a population.

MEAN
The sample mean is computed using sample data.
o Denoted by $\bar{x}$
The sample mean is a statistic. If $x_1, x_2, \ldots, x_n$ are the $n$ observations of a variable from a sample, then the sample mean, $\bar{x}$, is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

MEAN
The population mean is computed using all data points in a population.
o Denoted by µ
The population mean is a parameter. If $x_1, x_2, \ldots, x_N$ are the $N$ observations of a variable from a population, then the population mean, µ, is
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$

MEAN
o $\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values: $\bar{x} = \frac{\sum x}{n}$
o µ is pronounced ‘mu’ and denotes the mean of all values in a population: $\mu = \frac{\sum x}{N}$

COMPUTING SAMPLE MEAN
The following data represent the travel times (in minutes) to work for a sample of seven employees of an insurance company.
23, 36, 23, 18, 5, 26, 43
Compute the sample mean.

COMPUTING SAMPLE MEAN
$\bar{x} = \frac{\sum x}{n} = \frac{23 + 36 + 23 + 18 + 5 + 26 + 43}{7} = \frac{174}{7} \approx 24.9$ minutes

MEAN
Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance.

MEDIAN
The median is the middle value when the original data values are arranged in increasing or decreasing order.
o The median will be one of the data values if there is an odd number of values.
o The median will be the average of the two middle data values if there is an even number of values.

MEDIAN
o The median is the value with exactly half the data values below it and half above it.
o It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
o It has the same units as the data.

COMPUTING THE MEDIAN
The following data represent the travel times (in minutes) to work for a sample of seven employees of an insurance company.
23, 36, 23, 18, 5, 26, 43
Determine the median of this data.

COMPUTING THE MEDIAN
23, 36, 23, 18, 5, 26, 43
Step 1: Order the data: 5, 18, 23, 23, 26, 36, 43
Step 2: Locate the middle data point.
Median = 23

COMPUTING THE MEDIAN
Suppose the insurance company hires a new employee. The travel time of the new employee is 70 minutes. Determine the median of the “new” data set.
23, 36, 23, 18, 5, 26, 43, 70

COMPUTING THE MEDIAN
23, 36, 23, 18, 5, 26, 43, 70
Step 1: Order the data: 5, 18, 23, 23, 26, 36, 43, 70
Step 2: Locate the middle data points (23 and 26).
Step 3: Find the mean of the two middle data points.
Median = (23 + 26) / 2 = 24.5

DESCRIBE THE DISTRIBUTION
The following data represent the asking price (in dollars) of homes for sale in Lincoln, NE.
 79,995   99,899  105,200  111,000  120,000  121,700  125,950  126,900
128,950  130,950  131,800  132,300  134,950  135,500  138,500  147,500
149,900  151,350  154,900  159,900  163,300  165,000  174,850  180,000
189,900  203,950  217,500  260,000  284,900  299,900  309,900  349,900
Source: http://www.homeseekers.com

DESCRIBE THE DISTRIBUTION
o Find the mean and median.
o Use the mean and median to identify the shape of the distribution.
o Verify your result by drawing a histogram of the data.
The mean asking price is $168,320.
The median asking price is $148,700.
Because the mean is larger than the median, we would conjecture that the distribution is skewed right.
[Figure: histogram "Asking Price of Homes in Lincoln, NE"; horizontal axis: Asking Price; vertical axis: Frequency]

MODE
The mode is the value that occurs most often in a data set.
There may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal).

MODE
NFL Signing Bonuses
Find the mode of the signing bonuses of eight NFL players for a specific year. The bonuses in millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
You may find it easier to sort first.
10, 10, 10, 11.3, 12.4, 14.0, 18.0, 34.5
Select the value that occurs the most.
The mode is 10 million dollars.

MODE
Coal Employees in Pennsylvania
Find the mode for the number of coal employees per county for 10 selected counties in southwestern Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
No value occurs more than once. There is no mode.

MODE
Licensed Nuclear Reactors
The data show the number of licensed nuclear reactors in the United States for a recent 15-year period. Find the mode.
104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109
104 and 109 both occur the most. The data set is said to be bimodal.
The modes are 104 and 109.

MODAL CLASS
Miles Run per Week
Find the modal class for the frequency distribution of miles that 20 runners ran in one week.
Class          Frequency
 5.5 – 10.5        1
10.5 – 15.5        2
15.5 – 20.5        3
20.5 – 25.5        5
25.5 – 30.5        4
30.5 – 35.5        3
35.5 – 40.5        2
The modal class is 20.5 – 25.5, the class with the largest frequency.
The mode, taken as the midpoint of the modal class, is 23 miles per week.

MIDRANGE
The midrange is the average of the lowest and highest values in a data set.
$MR = \frac{\text{Lowest} + \text{Highest}}{2}$

MIDRANGE
Water-Line Breaks
In the last two winter seasons, the city of Brownsville, Minnesota, reported these numbers of water-line breaks per month. Find the midrange.
2, 3, 6, 8, 4, 1
$MR = \frac{1 + 8}{2} = \frac{9}{2} = 4.5$
The midrange is 4.5.

PROPERTIES OF THE MEAN
o Uses all data values.
o Varies less than the median or mode from sample to sample.
o Used in computing other statistics, such as the variance.
o Unique, and usually not one of the data values.
o Cannot be computed for a frequency distribution with an open-ended class.
o Affected by extremely high or low values, called outliers.

PROPERTIES OF THE MEDIAN
o Gives the midpoint.
o Used when it is necessary to find out whether the data values fall into the upper half or lower half of the distribution.
o Can be used for an open-ended distribution.
o Affected less than the mean by extremely high or extremely low values.

PROPERTIES OF THE MODE
o Used when the most typical case is desired.
o Easiest average to compute.
o Can be used with nominal data.
o Not always unique, and may not exist.

PROPERTIES OF THE MIDRANGE
o Easy to compute.
o Gives the midpoint.
o Affected by extremely high or low values in a data set.

DISTRIBUTIONS
[Figure]

MEASURE OF DISPERSION
o The mean, median and mode give us an idea of the central tendency, or where the “middle” of the data is located.
o Variability gives us an idea of how spread out the data are around that middle.
o The combination of central tendency and dispersion provides a more complete picture of the data.

MEASURE OF DISPERSION
Without knowing something about how the data are dispersed, measures of central tendency may be misleading.
For example: a residential street with 20 homes having a mean value of $200,000, where all the homes are in a similar price range, would be very different from a street with the same mean value but with 3 homes valued at $1 million and the other 17 clustered around $60,000.

MEASURES OF VARIATION
How Can We Measure Variability?
o Range
o Variance
o Standard Deviation
o Coefficient of Variation
o Chebyshev’s Theorem
o Empirical Rule (Normal)

RANGE
The range is the difference between the highest and lowest values in a data set.
$R = \text{Highest} - \text{Lowest}$
Find the range in the following test scores.
100, 68, 74, 56, 57, 68
Range = High - Low = 100 - 56 = 44

RANGE IN A HISTOGRAM
[Figure: the range marked on a histogram]

RANGE
The range is easy to compute, but it has disadvantages:
o Not very informative
o Considers only two observations (the smallest and largest)

VARIANCE & STANDARD DEVIATION
o The variance is the average of the squares of the distance each value is from the mean.
o The standard deviation is the square root of the variance.
o The standard deviation is a measure of how spread out your data are.

VARIANCE & STANDARD DEVIATION
The population variance is
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
The population standard deviation is
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

VARIANCE & STANDARD DEVIATION
Find the variance and standard deviation for the data set for how long paint lasts before it fades.
Months, X     µ     X - µ    (X - µ)²
   10         35     -25       625
   60         35      25       625
   50         35      15       225
   30         35      -5        25
   40         35       5        25
   20         35     -15       225
                      Sum =   1750
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7$
$\sigma = \sqrt{\frac{1750}{6}} \approx 17.1$

VARIANCE & STANDARD DEVIATION
The sample variance is
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
The sample standard deviation is
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

COMPUTATIONAL FORMULA
The sample variance is
$s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1}$
The sample standard deviation is
$s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}$

WHY n - 1?
o s is an estimate of the population standard deviation (σ).
o A sample variance computed with n in the denominator tends to underestimate the population variance.
o Dividing by n - 1 instead of n gives an unbiased estimate of the population variance.

EUROPEAN AUTO SALES
Find the variance and standard deviation for the amount of European auto sales for a sample of 6 years. The data are in millions of dollars.
   X        X²
 11.2     125.44
 11.9     141.61
 12.0     144.00
 12.8     163.84
 13.4     179.56
 14.3     204.49
 ----     ------
 75.6     958.94
$s^2 = \frac{958.94 - (75.6)^2 / 6}{6 - 1} \approx 1.28$
$s = \sqrt{1.28} \approx 1.13$

COMPARING STANDARD DEVIATIONS
Data set    Mean     s
Data A      15.5    3.338
Data B      15.5    0.926    (least variable)
Data C      15.5    4.570    (most variable)
[Figure: dot plots of Data A, Data B, and Data C on a common axis from 11 to 21]

COEFFICIENT OF VARIATION
The measures discussed so far are primarily useful when comparing members from the same population, or comparing similar populations.
When looking at two or more dissimilar populations, it doesn’t make any more sense to compare standard deviations than it does to compare means.

COEFFICIENT OF VARIATION
The coefficient of variation is the standard deviation divided by the mean, expressed as a percentage.
$CVar = \frac{s}{\bar{X}} \cdot 100\%$
Use CVar to compare standard deviations when the units are different.

SALES OF AUTOMOBILES
The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
Sales: $CVar = \frac{5}{87} \cdot 100\% \approx 5.7\%$
Commissions: $CVar = \frac{773}{5225} \cdot 100\% \approx 14.8\%$
Commissions are more variable than sales.

RANGE RULE OF THUMB
The Range Rule of Thumb approximates the standard deviation as
$s \approx \frac{\text{Range}}{4}$
when the distribution is unimodal and approximately symmetric.

RANGE RULE OF THUMB
The shortest home run hit by Mark McGwire was 340 ft and the longest was 550 ft. Use the range rule of thumb to estimate the standard deviation.
Range = 550 - 340 = 210 ft
Standard deviation approximation: s ≈ Range / 4 = 210 / 4 = 52.5 ft

CHEBYSHEV’S THEOREM
The proportion of values from any data set that fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1 (k is not necessarily an integer).
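As a quick check of the bound, here is a minimal Python sketch. The function names `chebyshev_min_proportion` and `proportion_within` are ours, not part of the chapter, and the comparison uses the population standard deviation. It evaluates 1 - 1/k² for a few values of k and compares the bound with the share of the travel-time sample from earlier in the chapter that actually falls within k standard deviations of its mean.

```python
import statistics


def chebyshev_min_proportion(k):
    """Lower bound 1 - 1/k^2 on the share of values within k standard deviations."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


def proportion_within(values, k):
    """Observed share of values within k population standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)


# Travel times (minutes) from the sample-mean example earlier in the chapter.
travel_times = [23, 36, 23, 18, 5, 26, 43]

for k in (2, 3, 4):
    bound = chebyshev_min_proportion(k)
    observed = proportion_within(travel_times, k)
    print(f"k={k}: at least {bound:.2%} guaranteed, {observed:.2%} observed")
```

The observed share can be well above the bound; Chebyshev’s Theorem only promises a minimum that holds for every data set.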
# of standard      Minimum proportion within     Minimum percentage within
deviations, k      k standard deviations         k standard deviations
     2             1 - 1/4  = 3/4                75%
     3             1 - 1/9  = 8/9                88.89%
     4             1 - 1/16 = 15/16              93.75%

MEASURES OF VARIATION: CHEBYSHEV’S THEOREM
Prices of Homes
The mean price of houses in a certain neighborhood is $50,000, and the standard deviation is $10,000. Find the price range for which at least 75% of the houses will sell.
Chebyshev’s Theorem states that at least 75% of a data set will fall within 2 standard deviations of the mean.
50,000 - 2(10,000) = 30,000
50,000 + 2(10,000) = 70,000
At least 75% of all homes sold in the area will have a price between $30,000 and $70,000.

EMPIRICAL RULE (NORMAL DISTRIBUTION)
The approximate percentage of values from a data set that fall within k standard deviations of the mean in a normal (bell-shaped) distribution is listed below.
# of standard      Proportion within
deviations, k      k standard deviations
     1             68%
     2             95%
     3             99.7%

EMPIRICAL RULE (NORMAL)
[Figure: normal curve showing the empirical rule]

MEASURES OF POSITION
o z-score
o Percentile
o Quartile
o Outlier

MEASURES OF POSITION: Z-SCORE
A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.
For a sample: $z = \frac{x - \bar{x}}{s}$
For a population: $z = \frac{x - \mu}{\sigma}$
A z-score represents the number of standard deviations a value is above or below the mean.

TEST SCORES
A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions on the two tests.
$z = \frac{x - \bar{x}}{s}$
Calculus test: $z = \frac{65 - 50}{10} = 1.5$
History test: $z = \frac{30 - 25}{5} = 1.0$
She has a higher relative position in the calculus class.

MEASURES OF POSITION: PERCENTILES
Percentiles separate the data set into 100 equal groups.
The percentile rank of a data value X is the percentage of data values below it:
$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
To find the data value corresponding to the p-th percentile of n sorted values, compute the position
$c = \frac{n \cdot p}{100}$
If c is not a whole number, round it up and take the value in that position. If c is a whole number, average the values in positions c and c + 1.

PERCENTILES
Measures of location
o There are 99 percentiles, denoted $P_1, P_2, \ldots, P_{99}$, which divide a set of data into 100 groups with about 1% of the values in each group.
o Use cumulative data to keep track of relative positions.

MEASURES OF POSITION: EXAMPLE OF A PERCENTILE GRAPH
[Figure: percentile graph]

PERCENTILES FOR TEST SCORES
A teacher gives a 20-point test to 10 students. Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
There are 6 values below 12.
$\text{Percentile} = \frac{6 + 0.5}{10} \cdot 100\% = 65\%$
A student whose score was 12 did better than 65% of the class.

PERCENTILES FOR TEST SCORES
A teacher gives a 20-point test to 10 students. Find the value corresponding to the 25th percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
$c = \frac{n \cdot p}{100} = \frac{10 \cdot 25}{100} = 2.5$, which rounds up to 3.
The 3rd value, 5, corresponds to the 25th percentile.

QUARTILES
Quartiles separate the data set into 4 equal groups.
Q1 = P25, Q2 = P50 (median), Q3 = P75
We can easily find the quartiles by separating the sorted data into two halves:
o Q2 = median of all data points
o Q1 = median of lower half
o Q3 = median of upper half

QUARTILES
For quartiles, we want to divide our data into 4 equal pieces.
Consider the following data set (already in order):
1  1  2  |  2  2  3  |  4  5  6  |  6  7  8
         Q1          Q2          Q3
The quartiles divide the data into 4 groups, each with three elements. Here Q1 = 2, Q2 = 3.5, and Q3 = 6.

BOX PLOTS
The Five-Number Summary is composed of the following numbers:
o Minimum Value
o Q1
o Median
o Q3
o Maximum Value
The Five-Number Summary can be graphically represented using a boxplot.

BOX PLOT
The box plot is sometimes called a box-and-whisker plot.
[Figure: box plot labeled with the 5-Number Summary]

PROCEDURE TABLE
Constructing Boxplots
1. Find the five-number summary.
2. Draw a horizontal axis with a scale that includes the maximum and minimum data values.
3. Draw a box with vertical sides through Q1 and Q3, and draw a vertical line through the median.
4. Draw a line from the minimum data value to the left side of the box and a line from the maximum data value to the right side of the box.

METEORITES (BOX PLOT)
The number of meteorites found in 10 U.S. states is shown. Construct a boxplot for the data.
89, 47, 164, 296, 30, 215, 138, 78, 48, 39
Sort in ascending order.
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
5-Number Summary: Min = 30, Q1 = 47, Q2 (Median) = 83.5, Q3 = 164, Max = 296
[Figure: box plot with Min = 30, Q1 = 47, Median = 83.5, Q3 = 164, Max = 296]

OUTLIERS
An outlier is an extremely high or low data value when compared with the rest of the data values.
The Interquartile Range, IQR = Q3 - Q1, is the range of the middle 50% of the data.
Lower bound = Q1 - 1.5(IQR)
Upper bound = Q3 + 1.5(IQR)
An outlier is any value less than the lower bound or more than the upper bound.

USING IQR TO FIND OUTLIERS
[Figure: box plot with red lines marking the limits]
The red lines lie 1.5 times the IQR beyond the box. Going left 1.5(IQR) from Q1 and going right 1.5(IQR) from Q3, we establish limits.
All values below the left limit or above the right limit are outliers.

OUTLIERS
Check the meteorite example for outliers.
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
Step 1: The first and third quartiles are Q1 = 47 and Q3 = 164.
Step 2: The interquartile range is 164 - 47 = 117.
Step 3: The boundaries are
Lower Bound = Q1 - 1.5(IQR) = 47 - 1.5(117) = -128.5
Upper Bound = Q3 + 1.5(IQR) = 164 + 1.5(117) = 339.5
Step 4: Every value lies between -128.5 and 339.5. Even the largest value, 296, is less than 339.5. Therefore, this data set has no outliers.

OUTLIERS
o An outlier can have a dramatic effect on the mean.
o An outlier can have a dramatic effect on the standard deviation.
o An outlier can have a dramatic effect on the scale of the histogram so that the true nature of the distribution is totally obscured.

SUMMARY
o Measures of Central Tendency
   o Mean, Median, Mode
o Measures of Dispersion
   o Range, Variance, Standard Deviation
o Measures of Position
   o Percentiles, Quartiles
   o 5-Number Summary, Box Plots, Outliers
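The worked examples in this chapter can be re-checked with a short script. The sketch below uses Python's standard `statistics` module; the helper names `modes`, `midrange`, `quartiles`, and `outlier_bounds` are our own illustrations, not part of the chapter. `quartiles` follows the median-of-halves method described above and assumes at least two data values. For the meteorite data the bounds come out as 47 - 1.5(117) = -128.5 and 164 + 1.5(117) = 339.5.

```python
import statistics
from collections import Counter


def modes(values):
    """Values tied for the highest count; an empty list when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2


def quartiles(values):
    """Q1, Q2, Q3 as the medians of the lower half, all data, and the upper half."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[len(data) - half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


def outlier_bounds(values):
    """Lower and upper bounds Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Data sets copied from the chapter's examples.
travel_times = [23, 36, 23, 18, 5, 26, 43]
bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
auto_sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]

print("mean travel time:", round(statistics.mean(travel_times), 1))
print("median travel time:", statistics.median(travel_times))
print("mode of bonuses:", modes(bonuses))
print("midrange of travel times:", midrange(travel_times))
print("sample variance of auto sales:", round(statistics.variance(auto_sales), 2))
print("sample std. dev. of auto sales:", round(statistics.stdev(auto_sales), 2))
print("meteorite quartiles:", quartiles(meteorites))

low, high = outlier_bounds(meteorites)
print("outlier bounds:", low, high)
print("outliers:", [v for v in meteorites if v < low or v > high])
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1 (sample formulas), while `statistics.pvariance` and `statistics.pstdev` divide by N (population formulas).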