Week 2 (September 8-12): Five Mini-Lectures
QMM 510, Fall 2014

ML 2.1

Chapter 4: Describing Data Numerically

Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Correlation and Covariance
4.7 Grouped Data
4.8 Skewness and Kurtosis
So many topics, so little time …

Center, Variability, Shape
Three key characteristics of numerical data: center, variability, and shape.

Visual Description
[Figure: visual description of the data]

Measures of Center

Mean
• A familiar measure of center.
• Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Excel function: =AVERAGE(Data), where Data is an array of data values.

Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower halves of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.

Mode
• The most frequently occurring data value.
• Familiar and easy to understand.
• But data may have multiple modes or no mode.
• Most useful for discrete or categorical data with only a few values. Rarely useful for continuous data or data with a wide range.

Example: Revenue growth in 32 bio-tech companies last year.

 0.57    4.00    8.29   11.00
 1.57    4.01    8.43   11.57
 1.71    5.28    9.14   11.57
 1.71    5.29    9.29   11.86
 1.86    6.14   10.00   12.43
 2.14    6.43   10.29   13.43
 2.43    6.71   10.43   13.57
 2.86    6.86   10.43   14.14

Caution: In decimal data, some data values may occur more than once, but this is likely due to chance (not central tendency). Excel’s =MODE(Data) returns only the first mode (1.71 in this example).

• Compare the mean and median, or look at the histogram, to determine the degree of skewness.
• Figure 4.10 shows prototype population shapes with varying degrees of skewness.
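The measures of center above can be checked on the bio-tech revenue growth data with a short Python sketch. This is an illustration only, not part of the course materials; it assumes Python 3.8 or later, where `statistics.mode` returns the first mode it encounters (comparable to Excel's =MODE) and `statistics.multimode` lists every tied mode.

```python
import statistics

# Revenue growth for the 32 bio-tech companies in the example above,
# entered in sorted order.
growth = [
    0.57, 1.57, 1.71, 1.71, 1.86, 2.14, 2.43, 2.86,
    4.00, 4.01, 5.28, 5.29, 6.14, 6.43, 6.71, 6.86,
    8.29, 8.43, 9.14, 9.29, 10.00, 10.29, 10.43, 10.43,
    11.00, 11.57, 11.57, 11.86, 12.43, 13.43, 13.57, 14.14,
]

mean = statistics.mean(growth)            # like =AVERAGE(Data)
median = statistics.median(growth)        # n is even: average of the middle two values
first_mode = statistics.mode(growth)      # first mode encountered, like =MODE(Data)
all_modes = statistics.multimode(growth)  # every value tied for most frequent

print(f"mean = {mean:.4f}, median = {median:.3f}")
print(f"first mode = {first_mode}, all modes = {all_modes}")
```

Listing all modes makes the caution concrete: three values each occur twice, so the single mode Excel reports is not a meaningful "typical" value for these data.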
Geometric Mean
• The geometric mean (G) is a multiplicative average: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• In Excel: =GEOMEAN(Data), or for example =(2*3*7*9*10*12)^(1/6)

Growth Rates
• A variation on the geometric mean is used to find the average growth rate for a time series: $GR = \left(\frac{x_n}{x_1}\right)^{1/(n-1)} - 1$
• For example, from 2006 to 2010, JetBlue Airlines revenues were:

Year    Revenue (mil)
2006    2,361
2007    2,843
2008    3,392
2009    3,292
2010    3,779

• The average growth rate: $GR = \left(\frac{3779}{2361}\right)^{1/4} - 1 = 0.125$, or 12.5% per year.

Midrange
• The midrange is the point halfway between the lowest and highest values of X: $\text{Midrange} = \frac{x_{\min} + x_{\max}}{2}$
• Easy to use but sensitive to extreme data values.
• For the J.D. Power quality data, the midrange (126.5) is higher than the mean (114.70) or median (113).

Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k by n: 0.05 × 33 = 1.65, which rounds to 2 observations.
• So, we would remove the two smallest and two largest observations before averaging the remaining values.
• Here is a summary of all the measures of central tendency for the J.D. Power data, along with Excel functions:

Mean:            114.70   =AVERAGE(Data)
Median:          113      =MEDIAN(Data)
Mode:            111      =MODE.SNGL(Data)
Geometric Mean:  113.35   =GEOMEAN(Data)
Midrange:        126.5    =(MIN(Data)+MAX(Data))/2
5% Trim Mean:    113.94   =TRIMMEAN(Data, 0.1)

• The trimmed mean mitigates the effects of very high values.

Measures of Variability
Variability is the “spread” of data points about the center of the distribution in a sample.

Range
  Formula: $x_{\max} - x_{\min}$
  Excel: =MAX(Data)-MIN(Data)
  Pro: Easy to calculate.
  Con: Sensitive to extreme data values.

Sample variance (s²)
  Formula: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
  Excel: =VAR.S(Data)
  Pro: Plays a key role in mathematical statistics.
  Con: Nonintuitive meaning.

Population variance
  Formula: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$

Population standard deviation
  Formula: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$

Sample standard deviation (s)
  Formula: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
  Excel: =STDEV.S(Data)
  Pro: Most common measure. Uses same units as the raw data ($, £, ¥, grams, etc.).
  Con: Nonintuitive meaning.

Sample coefficient of variation (CV)
  Formula: $CV = 100 \times \frac{s}{\bar{x}}$
  Excel: =100*STDEV.S(Data)/AVERAGE(Data)
  Pro: Measures relative variation in percent, so data sets can be compared.
  Con: Requires nonnegative data.

Mean absolute deviation (MAD)
  Formula: $\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \bar{x}\rvert$
  Excel: =AVEDEV(Data)
  Pro: Easy to understand.
  Con: Lacks “nice” theoretical properties.

Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean.
• Only appropriate for nonnegative data.

Example: Class scores on a 16-point quiz on the first day of class and after students had an opportunity to review the material.

Caution: CV is undefined if the mean is zero or negative (this could happen, for example, if stocks in a portfolio had negative rates of return).

Standardized Data

ML 2.2 Topics
• sorting, standardizing, z-scores
• normal distribution as a benchmark
• Empirical Rule (MegaStat)
• outliers and unusual observations
• Excel functions (Appendix J)
• examples: birth weight, voting
• using MegaStat and Minitab

The Empirical Rule
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of the observed data:

  k = 1: 68.26% will lie within μ ± 1σ
  k = 2: 95.44% will lie within μ ± 2σ
  k = 3: 99.73% will lie within μ ± 3σ

Note: No upper bound is given. Data values outside μ ± 3σ are rare.

Standardized Data
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
• Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
• Standardization formula for a sample (for n > 30): $z_i = \frac{x_i - \bar{x}}{s}$
• A negative z value means the observation is to the left of the mean. A positive z value means the observation is to the right of the mean.

Example: Birth Weights (n = 1429)
The distribution resembles a normal distribution except for the low tail (a few extremely tiny babies). Weights are in ounces.
Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.

• 5 pound baby’s z-score: z = (80 - 116.14)/21.96 = -1.65
• 9 pound baby’s z-score: z = (144 - 116.14)/21.96 = 1.27
• 11 pound baby’s z-score: z = (176 - 116.14)/21.96 = 2.73

Example: Voting in 2004 Presidential Election

State           Voting%   z-Score
Hawaii          46.2      -2.35
California      49.1      -1.89
Texas           50.3      -1.71
Nevada          51.3      -1.55
Georgia         52.6      -1.35
…               …         …
Oregon          70.6       1.45
North Dakota    70.8       1.48
Maine           72.0       1.67
Wisconsin       73.0       1.82
Minnesota       76.7       2.40

Mean = 61.29, St Dev = 6.43, n = 50

Use Excel’s function =STANDARDIZE(x, μ, σ).

Only two states stand out as unusual.

Note: Sorting the data values allows you to see the extremes. Values within μ ± 1σ are of less interest.
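The z-score arithmetic in the two examples above can be reproduced with a few lines of Python. This is a hedged sketch: the helper name `z_score` and the cutoff of |z| ≥ 2 for "unusual" are assumptions made for illustration, and the voting z-scores are computed from the rounded mean and standard deviation shown on the slide, so a last digit may differ from the table.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Birth weights in ounces (mean 116.14, standard deviation 21.96).
BIRTH_MEAN, BIRTH_SD = 116.14, 21.96
birth_z = {oz: round(z_score(oz, BIRTH_MEAN, BIRTH_SD), 2) for oz in (80, 144, 176)}

# Voting percent for the ten states listed (mean 61.29, standard deviation 6.43).
VOTE_MEAN, VOTE_SD = 61.29, 6.43
voting = {
    "Hawaii": 46.2, "California": 49.1, "Texas": 50.3, "Nevada": 51.3,
    "Georgia": 52.6, "Oregon": 70.6, "North Dakota": 70.8, "Maine": 72.0,
    "Wisconsin": 73.0, "Minnesota": 76.7,
}

# Assumed rule of thumb for this sketch: flag a state when |z| >= 2.
unusual = [
    state for state, pct in voting.items()
    if abs(z_score(pct, VOTE_MEAN, VOTE_SD)) >= 2
]

print(birth_z)
print(unusual)
```

The function mirrors Excel's =STANDARDIZE(x, μ, σ): the same three inputs in the same order.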
Excel
Voting percent in 50 states (Excel’s Descriptive Statistics):

Voting%
Mean                 61.286
Standard Error       0.909788089
Median               61.5
Mode                 59.7
Standard Deviation   6.433173274
Sample Variance      41.38571837
Kurtosis             0.014949556
Skewness             0.00241464
Range                30.5
Minimum              46.2
Maximum              76.7
Sum                  3064.3
Count                50

Note: In Excel’s Descriptive Statistics, you can’t choose the statistics displayed.

MegaStat
Voting percent in 50 states (MegaStat’s Descriptive Statistics):

Statistic                        Voting%
count                            50
mean                             61.286
sample variance                  41.386
sample standard deviation        6.433
minimum                          46.2
maximum                          76.7
range                            30.5
1st quartile                     57.450
median                           61.500
3rd quartile                     64.950
interquartile range              7.500
mode                             59.700

empirical rule
mean - 1s                        54.853
mean + 1s                        67.719
percent in interval (68.26%)     68.00%
mean - 2s                        48.420
mean + 2s                        74.152
percent in interval (95.44%)     96.00%
mean - 3s                        41.986
mean + 3s                        80.586
percent in interval (99.73%)     100.00%

low outliers                     0
high outliers                    1
high extremes                    0

Note: You can choose the statistics displayed (e.g., Empirical Rule).

Appendix J: Excel Functions
[Figures: Appendix J tables of Excel functions]

Quantiles

ML 2.3 Topics
• percentiles, quartiles, boxplots
• fences, another view of outliers
• examples: birth weight, city MPG

Percentiles, Quartiles, and Box Plots

Percentiles
• Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.
• Percentiles may be used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th, and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles can be used in employee merit evaluation and salary benchmarking.

Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.

  Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

• The three values that separate the four groups are called Q1, Q2, and Q3.
• The second quartile Q2 is the median, a measure of central tendency.

  Lower 50% | Q2 | Upper 50%

Method of Medians
• For small data sets, find quartiles using the method of medians:
  Step 1: Sort the observations.
  Step 2: Find the median Q2.
  Step 3: Find the median of the data values that lie below Q2.
  Step 4: Find the median of the data values that lie above Q2.
• The first quartile Q1 is the median of the data values below Q2. Within the lower half of the data, 50% of the values lie above Q1 and 50% lie below it.
• The third quartile Q3 is the median of the data values above Q2. Within the upper half of the data, 50% of the values lie above Q3 and 50% lie below it.

Method of Medians Example:
[Figure: method of medians example]

Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• For the previous P/E ratios example:

  Xmin   Q1    Q2     Q3     Xmax
  7      27    35.5   40.5   49

• The box plot is displayed visually.
[Figures: box plots]

Box Plots: Midhinge
• The average of the first and third quartiles: $\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
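The method of medians, the five-number summary, and the midhinge can be sketched in Python as follows. The function name `quartiles_by_medians` and the six-value `sample` are hypothetical, chosen only for illustration. The sketch splits the sorted data by position, dropping the middle observation when n is odd, which is one common reading of "the data values that lie below Q2".

```python
import statistics

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)            # Step 1: sort the observations
    n = len(data)
    q2 = statistics.median(data)     # Step 2: the median
    lower = data[: n // 2]           # values below the median position
    upper = data[(n + 1) // 2:]      # values above the median position
    q1 = statistics.median(lower)    # Step 3: median of the lower half
    q3 = statistics.median(upper)    # Step 4: median of the upper half
    return q1, q2, q3

# Hypothetical sample, not taken from the slides.
sample = [41, 7, 39, 15, 40, 36]
q1, q2, q3 = quartiles_by_medians(sample)
five_number = (min(sample), q1, q2, q3, max(sample))
midhinge = (q1 + q3) / 2

# Midhinge from the P/E quartiles quoted in the five-number summary above.
pe_midhinge = (27 + 40.5) / 2

print(five_number, midhinge, pe_midhinge)
```

Spreadsheet and statistics packages often use interpolation formulas instead, so their quartiles can differ slightly from the method of medians on the same data.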
The name midhinge derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”.

Box Plots: Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:

                Inner fences            Outer fences
  Lower fence   Q1 - 1.5 (Q3 - Q1)      Q1 - 3.0 (Q3 - Q1)
  Upper fence   Q3 + 1.5 (Q3 - Q1)      Q3 + 3.0 (Q3 - Q1)

• Values outside the inner fences are unusual, while those outside the outer fences are outliers.

Example: Birth Weights (n = 1429)
Source: Birth records from the North Carolina State Center for Health and Environmental Statistics and the Institute for Research in Social Science at University of North Carolina at Chapel Hill.
Note: The middle 50% of birth weights lie within a small range (105 to 130 oz, or about 6.56 lb to 8.13 lb). But there are extremes on the low end.

Box Plots with Fences

Fences Visualized:
[Figure: box plot with inner and outer fences]

Fences Example:
Interpretation: There are three outliers (beyond the inner upper fence). One is on the border of the upper outer fence, so it is almost an extreme outlier. Lower fences are not displayed since they are irrelevant for this sample.

Example: Fences and Unusual Data Values
Interpretation: Based on the fences, there is only one outlier and no extreme outliers. Lower fences are not displayed since they are not needed for this sample.

Correlation, Grouped Data, Shape

ML 2.4 Topics
• scatter plots
• correlation coefficient
• covariance: population, sample
• mean from grouped data
• skewness, kurtosis (Excel)

Correlation and Covariance

Correlation Coefficient
The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y:

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
Note: -1 ≤ r ≤ +1. A value of r = -1 indicates perfect negative correlation; r = +1 indicates perfect positive correlation.

Illustration of Correlation Coefficients
[Figure: scatter plots illustrating different values of r]

Correlation Coefficient: Examples
• X = car weight (lbs), Y = city MPG
• X = gestation (months), Y = birth weight (oz)
[Figures: scatter plots for the examples]

Covariance
The covariance of two random variables X and Y (denoted σXY) measures the degree to which the values of X and Y change together.
• Population covariance: $\sigma_{XY} = \frac{\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)}{N}$
• Sample covariance: $s_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$

Caution: The covariance is not easy to interpret because its units depend on the units of X and Y (e.g., dollars). That’s why we usually refer to the correlation coefficient (it is unit free).

Grouped Data

Weighted Mean
$\bar{x} = \frac{\sum_{j} w_j x_j}{\sum_{j} w_j}$, where $w_j$ is the weight attached to value $x_j$.

Group Mean
$\bar{x} = \frac{\sum_{j} f_j m_j}{n}$, where $f_j$ is the frequency and $m_j$ is the midpoint of class j.

Note: You will rarely need this. If you are given only grouped data, you will have to make your own tables in Excel.

Skewness
To interpret Excel’s skewness coefficient, you need a table showing critical values for various sample sizes.
Note: You can assess skewness from the histogram or boxplot (usually revealed by outliers or a long tail). It’s usually not worth it to bother with the table.

Kurtosis
To interpret Excel’s kurtosis coefficient, you need a table showing critical values for various sample sizes.
Caution: You cannot reliably assess kurtosis from the histogram, because its x-axis scale affects its appearance. Maybe best to let statisticians worry about this topic.
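Returning to section 4.6, the sample covariance and correlation coefficient can be computed directly from their definitions. The sketch below is illustrative only: the five (weight, MPG) pairs are hypothetical values invented for the example, not data from the slides, and the function names are assumptions.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Sample correlation: covariance scaled by both standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of X with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical paired data: car weight (thousands of lbs) and city MPG.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [30, 27, 24, 20, 18]

s_xy = sample_covariance(weight, mpg)
r = sample_correlation(weight, mpg)

print(f"covariance = {s_xy:.3f}, correlation = {r:.3f}")
```

The covariance carries the units of both variables, while r is unit free and always falls between -1 and +1, which is why r is the easier number to interpret.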
Assignments

ML 2.5
• Connect C-2 (covers chapter 4)
  • You get three attempts
  • Feedback is given if requested
  • Printable if you wish
  • Deadline is midnight each Monday
• Project P-1 (data, tasks, questions)
  • Review instructions
  • Look at the data
  • Your task is to write a nice, readable report (not a spreadsheet)
  • Length is up to you

Projects: General Instructions
For each team project, submit a short (5-10 page) report (using Microsoft Word or equivalent) that answers the questions posed. Strive for effective writing (see textbook Appendix I). Creativity and initiative will be rewarded. Avoid careless spelling and grammar. Paste graphs and computer tables or output into your written report (it may be easier to format tables in Excel and then use Paste Special > Picture to avoid weird formatting and permit sizing within Word). Allocate tasks among team members as you see fit, but all should review and proofread the report (submit only one report).

Project P-1
Random teams are assigned on Moodle (submit only one report).

Data: Download Big Dataset 02 - Crime in Major Cities from Moodle. Your team is assigned one crime category (but you can change it if you wish). Copy the city names and the chosen crime data column to a new spreadsheet. Delete lines (if any) with missing data.

Analysis:
(a) Sort the observations (with city names).
(b) List the top 10 and bottom 10 data values (with city names).
(c) For the entire data set, calculate the mean and median. What do they tell you about center? Would the mode be helpful for this type of data? Explain.
(d) Calculate the standard deviation.
(e) Calculate the standardized z-value for each observation.
(f) Are there outliers or unusual data values (see p. 137)? Discuss.
(g) Use MegaStat (or Minitab or Excel) to make a histogram. Describe its shape.
(h) Calculate the quartiles. Make a boxplot and describe it.
(i) Make a scatter plot of your kind of crime versus a different type of crime. What does it show?
(j) Ambitious students: Sort the database in random order (see bottom of page 36) using Excel’s function =RAND(). Copy and paste the first few sorted lines into your report to illustrate your sorting method.

Comment on anything unusual (or interesting things that you might find on the web).

Watch the video walkthrough using Voting, North Carolina Births, and CEO compensation as examples (posted on Moodle).

Project P-1
Your 2010 data will look like this (2005 and 2000 are also available).

Crime Rates in U.S. Metropolitan Areas, 2010 (n = 365)
Violent crimes per 100,000: All Violent, Murder, Rape, Robbery, Assault
Property crimes per 100,000: All Property, Burglary, Larceny, Car Theft

Metropolitan Statistical Area | All Violent | Murder | Rape | Robbery | Assault | All Property | Burglary | Larceny | Car Theft
Abilene, TX M.S.A. | 423.0 | 3.1 | 48.9 | 72.7 | 298.3 | 3617.3 | 1009.0 | 2459.8 | 148.5
Akron, OH M.S.A. | 304.7 | 3.7 | 40.9 | 105.1 | 155.0 | 3185.6 | 947.7 | 2074.5 | 163.3
Albany, GA M.S.A. | 566.0 | 8.7 | 24.9 | 150.4 | 382.1 | 4512.6 | 1417.8 | 2803.4 | 291.4
Albany-Schenectady-Troy, NY M.S.A. | 310.4 | 1.5 | 21.0 | 98.5 | 189.4 | 2693.6 | 512.1 | 2076.2 | 105.4
Albuquerque, NM M.S.A. | 670.4 | 5.8 | 44.8 | 124.3 | 495.6 | 3896.1 | 920.6 | 2586.2 | 389.4
Alexandria, LA M.S.A. | 638.0 | 5.8 | 23.1 | 132.3 | 476.7 | 4592.9 | 1203.3 | 3176.3 | 213.3
Allentown-Bethlehem-Easton, PA-NJ M.S.A. | 228.2 | 3.5 | 20.3 | 93.6 | 110.9 | 2298.0 | 432.2 | 1758.1 | 107.7
Altoona, PA M.S.A. | 243.6 | 0.8 | 38.0 | 49.8 | 155.0 | 1811.7 | 425.4 | 1318.2 | 68.0
Amarillo, TX M.S.A. | 513.1 | 5.7 | 40.8 | 98.9 | 367.8 | 4812.7 | 1137.2 | 3390.5 | 285.0
Ames, IA M.S.A. | 299.5 | 1.1 | 41.7 | 12.4 | 244.4 | 2528.1 | 478.6 | 1966.1 | 83.3
Anchorage, AK M.S.A. | 812.9 | 4.2 | 85.9 | 148.5 | 574.4 | 3506.3 | 416.1 | 2813.4 | 276.8
Anderson, IN M.S.A. | 205.8 | 2.3 | 33.4 | 70.6 | 99.5 | 3353.8 | 848.1 | 2294.6 | 211.1
Anderson, SC M.S.A. | 586.0 | 5.3 | 36.4 | 75.9 | 468.4 | 4707.8 | 1297.6 | 3041.7 | 368.4
Ann Arbor, MI M.S.A. | 338.5 | 1.4 | 43.2 | 69.8 | 224.0 | 2713.7 | 659.7 | 1879.5 | 174.4
Appleton, WI M.S.A. | 155.8 | 0.0 | 21.4 | 13.8 | 120.5 | 2136.7 | 378.5 | 1708.2 | 50.0
Asheville, NC M.S.A. | 229.7 | 1.9 | 21.8 | 59.9 | 146.1 | 2454.9 | 749.6 | 1534.9 | 170.3
Athens-Clarke County, GA M.S.A. | 374.9 | 4.2 | 19.6 | 70.5 | 280.5 | 3843.7 | 1018.0 | 2588.1 | 237.5
Atlanta-Sandy Springs-Marietta, GA M.S.A. | 413.8 | 6.1 | 20.9 | 149.7 | 237.1 | 3462.6 | 957.0 | 2135.7 | 370.0
Atlantic City-Hammonton, NJ M.S.A. | 529.8 | 8.0 | 18.9 | 245.5 | 257.5 | 3550.3 | 741.5 | 2685.7 | 123.1
Augusta-Richmond County, GA-SC M.S.A. | 412.9 | 10.2 | 37.4 | 156.6 | 208.7 | 4815.3 | 1355.1 | 3037.7 | 422.5
Austin-Round Rock-San Marcos, TX M.S.A. | 327.9 | 3.4 | 24.7 | 84.0 | 215.8 | 3792.0 | 754.3 | 2866.9 | 170.8
Bakersfield-Delano, CA M.S.A. | 593.0 | 9.0 | 19.9 | 148.4 | 415.7 | 3713.1 | 1148.0 | 1931.6 | 633.6
Baltimore-Towson, MD M.S.A. | 685.3 | 10.3 | 23.6 | 214.4 | 437.0 | 3090.7 | 649.5 | 2135.5 | 305.7
Bangor, ME M.S.A. | 68.4 | 2.0 | 12.6 | 27.2 | 26.6 | 3098.2 | 573.3 | 2429.3 | 95.7
Barnstable Town, MA M.S.A. | 434.6 | 0.5 | 36.1 | 57.6 | 340.3 | 2972.8 | 1116.6 | 1764.7 | 91.5
Battle Creek, MI M.S.A. | 697.6 | 4.5 | 75.3 | 109.6 | 508.3 | 3703.5 | 1145.6 | 2411.1 | 146.8
Bay City, MI M.S.A. | 335.2 | 0.9 | 78.1 | 50.8 | 205.2 | 2472.4 | 610.1 | 1776.6 | 85.7
Beaumont-Port Arthur, TX M.S.A. | 498.3 | 5.6 | 37.7 | 157.9 | 297.0 | 3865.3 | 1156.9 | 2488.4 | 220.1
Bellingham, WA M.S.A. | 267.0 | 2.5 | 44.7 | 50.6 | 169.1 | 3197.8 | 694.2 | 2372.7 | 130.8
Bend, OR M.S.A.² | 304.9 | 4.3 | 29.0 | 30.9 | 240.7 | 2973.7 | 497.5 | 2360.2 | 116.0

Definitions
• Violent crime: murder and nonnegligent manslaughter, forcible rape, robbery, aggravated assault.
• Property crime: burglary, larceny-theft, motor vehicle theft.

Example: CEO Compensation
• Sorting is a good first step.
• Highlight all data (including the headings) and use Custom Sort.
• Now you can clearly see the high and low data values (and comment on any weird data values).
• Use MegaStat’s Descriptive Statistics to get your basic stats along with a nice boxplot.
• Use MegaStat’s Frequency Distributions to get a frequency table, histogram, etc. (User annotations on the histogram: “severely skewed” and “normal if logs used?”)
• Standardize the sorted list by subtracting the mean from each x value and then dividing by the standard deviation (or use the =STANDARDIZE function).
• After standardizing the sorted list, unusual z values can be seen.
• To randomize the list, enter =RAND() in a column beside the data, paste the results as values, and custom sort on that column.
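The same sort, standardize, and randomize workflow can be sketched outside Excel. The Python below is a minimal illustration, not the required method for the project: it uses the "All Violent" rate for the first ten metropolitan areas in the Project P-1 table, and the seed value 510 is an arbitrary choice made so the shuffle is repeatable.

```python
import random
import statistics

# "All Violent" crime rate per 100,000 for the first ten areas in the table.
violent = {
    "Abilene, TX": 423.0,
    "Akron, OH": 304.7,
    "Albany, GA": 566.0,
    "Albany-Schenectady-Troy, NY": 310.4,
    "Albuquerque, NM": 670.4,
    "Alexandria, LA": 638.0,
    "Allentown-Bethlehem-Easton, PA-NJ": 228.2,
    "Altoona, PA": 243.6,
    "Amarillo, TX": 513.1,
    "Ames, IA": 299.5,
}

# Sort (city, rate) pairs from lowest to highest rate, keeping names attached.
ranked = sorted(violent.items(), key=lambda item: item[1])

# Standardize each value with the sample mean and sample standard deviation.
rates = list(violent.values())
mean = statistics.mean(rates)
sd = statistics.stdev(rates)
z_scores = [(city, (rate - mean) / sd) for city, rate in ranked]

# Randomize the row order, like sorting on a pasted =RAND() column.
rng = random.Random(510)  # arbitrary seed, only for repeatability
shuffled = ranked[:]
rng.shuffle(shuffled)

for city, z in z_scores:
    print(f"{city:35s} z = {z:6.2f}")
```

Keeping the city name paired with its value through every step is the point of "highlight all data (including the headings)" in Custom Sort: the rows must move together.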