Chapter 3 Numerical Descriptive Measures Last (Family) Name: __________________________________. First (Given) Name: __________________________________ INTRODUCTION FROM PREVIOUS CHAPTERS: A parameter is a numerical measure that describes a characteristic of a population. A statistic is a numerical measure that describes a characteristic of a population sample. What are the following? According to a Pew survey, twenty one percent of Americans believe that President Obama is a Muslim. The average age of children in our family is 17. First time unemployment claims were down by 5% in the last quarter. WHY NUMERICAL DESCRIPTIVE MEASURES? Charts, frequency distributions, and cross-tabs reduce the detail in the data in ways that make it easier to see key patterns in the data. Numerical descriptive measures reduce detail even further; they say something important about a variable with a single number. “The average age of our employees is 34.” “Nearly everyone has more than the average number of legs.” “More than 80% of our customers live within three miles of our store.” Page 1 TYPES OF DESCRIPTIVE MEASURES: Central tendency describes the extent to which all of the data values group around a typical center value. Variation describes the amount of dispersion, or scattering, of values away from a central value. Shape is the pattern of the distribution of values from the lowest value to the highest value. Page 2 3.1 MEASURES OF CENTRAL TENDENCY THE MEDIAN The middle value in ordered data. Where is the “center” of the distribution? Ranked values (odd number) Is the measure of central tendency a meaningful “typical” value? Data: 12, 18, 27, 42, 150. Median = 27 THE MEAN Ranked values (even number) The mean is what most people call “the average” Data: 4, 8, 15, 16, 23, 42 Tie for middle value; take the average of 15 and 16. Mean = X = sum of the values / number of values Values: 1, 3, 5, 7, 9 Sum: 25 Number: 5 Median = 15.5 Works with numerical data Not sensitive to outliers. Also works with ordered categorical data. X = 25/5 = 5 Rates self as above average: 42 Only works with numerical data. Rates self as average: A few outliers can throw off the mean Values: 1, 3, 5, 7, 30 Sum: 46 Number: 5 19 Rates self as below average: 6 Median is “Rates self as above average.” THE MODE X = 46/5 = 9.2; yet no value is even close to this. The category with the largest frequency: Data do not have to be ordered: California 35 Values: 1, 9, 5, 7, 3 New York 9 Sum: 25 Nebraska 72 Number: 5 The mode is Nebraska X = 25/5 = 5 Works with numerical and categorical data Page 3 3.2: VARIATION AND SHAPE THE RANGE The range is the largest value minus the smallest value. THE VARIANCE AND THE STANDARD DEVIATION Measure the “average” scatter around the mean. You do not have to know how to compute them. However, you do need to know: That the standard deviation (S) and the variance (V) measure the same thing. That the variance is the square of the standard deviation. THE COEFFICIENT OF VARIATION A relative measure of scatter compared to the size of the mean. CV = (S/X) x 100% Good for comparing things that have different units of measurement, like the weight and volume differences among packing crates. Z SCORES The Z score is a measure of how far an outlier is away from the mean. Z = (X-X)/S Page 4 3.3 NUMERICAL DESRIPTIVE MEASURES OF A POPULATION Means, standard deviations, and variances for the entire population, not just a sample Sample Population Mean X µ (mu) Standard Deviation S δ (lower-case sigma) Variance S2 δ2 Note: There are slight differences in how sample and population standard deviations and variances are measured, but you can ignore them. THE EMPIRICAL RULE AND THE CHEBYSHEV RULE Percentage of Values Within the Interval Normal Bell-Shaped Distribution (Empirical Rule) Any Distribution (Chebyshev Rule) Mean +/- one δ Approximately 68% NA Mean +/- two δs Approximately 95% At least 75% Mean +/- three δs Approximately 99.7% At least 88.89% Typically, a value more than two standard deviations from the mean is viewed as an outlier. Page 5 COMPUTING KEY STATISTICS IN EXCEL Page 6 3.4 QUARTILES AND THE BOX PLOT QUARTILES A crude but common way of expressing ranges Process Order the data 2, 3, 4, 5, 6, 7, 8, 12 Divide it into four parts 2, 3, 4, 5, 6, 7, 8, 12 The lowest 25% of the data values form the first quartile—Q1 (2,3) The next 25% of the data values from the second quartile—Q2 (4,5) The next 25% of the data values from the third quartile—Q3 (6,7) The highest 25% of the data values form the fourth quartile—Q4 (8,12) EXAMPLES In General Q1 = (N+1)/4 ranked value Q2 = 2(N+1)/4 ranked value Q3 = 3(N+1)/4 ranked value Example Times in Minutes: 29, 31, 35, 39, 39, 40, 43 44, 46, 52 10 Values Q1 = (10+1)/4 = 2.75th value This is 35 (the third value) Q2 = 2*(10+1)/4 = 5.5th value The 5th value is 39 The 6th value is 40 Choose 39.5 Page 7 Q3 = 3*(10+1)/4 = 8.25th value This is 44 (the 8th value) THE INTERQUARTILE RANGE The difference between the third and first quartile. The middle 50 percent of the data. Times in Minutes: 29, 31, 35, 39, 39, 40, 43 44, 44, 52 10 Values Q1 = (10+1)/4 = 2.25th value This is 35 (the third value) Q3 = 3*(10+1)/4 = 8.25th value This is 44 (the 8th value) Interquartile range = 44-33 = 9 minutes THE FIVE-NUMBER SUMMARY THE BOXPLOT Page 8 3.5 CORRELATION RELATIONSHIPS Relationships describe how two variables behave relative to one another. POSITIVE RELATIONSHIP As one variable increases, the other variable also increases or As one variable decreases, the other variable also decreases. NEGATIVE RELATIONSHIP As one variable increases, the other variable decreases or As one variable decreases, the other variable increases. NO RELATIONSHIP As one variable changes, this has no impact on the other variable. Page 9 THE COVARIANCE THE COEFFICIENT OF CORRELATION CORRELATION COEFFICIENT The correlation coefficient r (Pearson’s r) Is a number that describes the strength of the relationship between two variables. The Excel function is CORREL(data series 1, data series 2) Interpreting the correlation coefficient 1.0 is perfect correlation (rare) 0 is no correlation -1.0 is perfect negative correlation (rare) Page 10 3.6 PRESENTING DESCRIPTIVE STATISTICS: PITFALLS AND ETHICAL ISSUES You need to summarize data to understand it. Single numerical measures can be very powerful. However, they may summarize too much and lose important specifics. Often, when you read a report, you are only given one or two measures. This may leave you unable to interpret the results meaningfully. Example: cannot tell if the mean and median are different, which would indicate a skewed distribution. Example: Cannot tell if the distribution is bimodal, so that the mean and median are values for which there are no items in the distribution. You must personally not be deceptive in what your report. Giving a single number that is not characteristic of the distribution. Giving single numbers that rely on characteristics your distribution does not have. You must report the results fairly, not withholding information that detracts from the point you are trying to make. Page 11 HOMEWORK 1. CREATE A SCATTER PLOT FOR ADVERTISING PER CAPITAL AND SALES PER CAPITA USING THE FOLLOWING DATA. To do this, select the two data series and not the header row. Then go to the Insert tab and select Scatter Chart. Make Sales Per Capital the Vertical Title Make Advertising Per Capita the Horizontal Title Region Honolulu Orlando Omaha Ventura Dallas Chicago Nashville Boston San Francisco Boise Advertising pc $10 $6 $3 $2 $4 $1 $7 $9 Sales pc $153 $135 $116 $115 $126 $114 $140 $153 $5 $8 $127 $143 Paste your chart onto the following line: * Compute Pearson’s r The value you computed: ______________________ Page 12 2. CREATE A SCATTER CHART FOR PRICE AND UNITS SOLD USING THE FOLLOWING DATA. Use directions from Problem 1 but with these column names. The table is for a sample. Region San Francisco Dallas Omaha Orlando Chicago Ventura Boston Boise Honolulu Nashville Price 5 4 3 6 1 2 9 8 10 7 Units Sold 21.7319557 32.1765348 35.6313442 21.6527418 29.7849633 36.4823921 25.1103128 25.6841931 12.0050118 30.7900904 Chart: * Correlation Coefficient Answer: Page 13 3. CREATE A SCATTER CHART FOR SALES PER CAPITA AND INCOME PER CAPITA USING THE FOLLOWING DATA. Use directions from Problem 1 but with these column names. The table is for a population. Region Honolulu Orlando Omaha Ventura Dallas Chicago Nashville Boston San Francisco Boise Sales pc $56 $48 $3 $54 $34 $34 $12 $74 Income pc $46,079 $26,355 $83,108 $23,164 $74,177 $34,859 $16,414 $92,937 $44 $100 $31,164 $89,303 Chart: * Correlation Coefficient Answer: Page 14 4. FOR SALES PER CAPITA, COMPUTE THE FOLLOWING STATISTICS. The table is for a population. City Sales per Capita Chicago Ventura Omaha Dallas San Francisco Orlando Nashville Boise Boston Honolulu Cheyenne Raleigh Detroit Denver Seattle Bismark New York Trenton Topeka New Orleans $145.66 $127.27 $145.41 $120.69 $151.42 $120.92 $149.06 $156.05 $125.91 $121.10 $165.61 $138.08 $112.99 $140.55 $157.04 $140.40 $161.95 $109.60 $139.17 $163.87 Mean: ______ Median: _____ Standard Deviation: _______ Variance: __________ Z value for Trenton; ________ Z value for Bismark: _________ Page 15