E370 Statistical Analysis for Bus & Econ
Chapter 3: Summary Statistics

Objectives:
• Be able to import data into Excel.
• Be able to generate useful summary statistics from the data using Excel. (Data Crunch)
• Be able to interpret summary statistics and use them to describe a dataset. (Value Added)

Why do we care?
These days the problem is not a lack of data, but an overwhelming amount of data. We need to extract information from the datasets we have. To do that, we:
• condense data into tables and graphs
• summarize data into descriptive statistics

Overview: Summary of a dataset (3 dimensions)
• Center: Where are the data values concentrated? What seem to be typical or middle data values?
• Dispersion: The scattering or spread of data around its center. How much variation is there in the data?
• Shape: Are the data values distributed symmetrically? Skewed?

Measures of Center:

Mean
  Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Excel Command: =AVERAGE(Array)
  Pros: uses all the information
  Cons: sensitive to extreme values

Median
  Formula: middle value in the sorted array
  Excel Command: =MEDIAN(Array)
  Pros: robust to extreme values
  Cons: does not use all the information

Mode
  Formula: most frequently occurring data value
  Excel Command: =MODE.MULT(Array)
  Pros: the only measure of center for nominal data
  Cons: unreliable

Measures of Dispersion:

Range
  Formula: $X_{\max} - X_{\min}$
  Excel Command: =MAX(Array)-MIN(Array)

Variance
  Population: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
  Excel Command: =VAR.P(Array)
  Sample: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
  Excel Command: =VAR.S(Array)

Standard Deviation
  Population: $\sigma = \sqrt{\sigma^2}$
  Excel Command: =STDEV.P(Array)
  Sample: $s = \sqrt{s^2}$
  Excel Command: =STDEV.S(Array)

Coefficient of Variation (CV)
  Population: $(\sigma / \mu) \times 100$
  Excel Command: =STDEV.P(Array)/AVERAGE(Array)*100
  Sample: $(s / \bar{x}) \times 100$
  Excel Command: =STDEV.S(Array)/AVERAGE(Array)*100

Measures of Dispersion (cont’d):
1. Range: the distance between the largest value and the smallest value in the dataset.
2. Variance: the average squared distance of observations from their mean. The “squared” units are difficult to interpret.
3. Standard Deviation: a type of average distance of observations from their mean.
(Calculated by taking the square root of the variance.)
4. Coefficient of Variation (CV): a measure of “relative” dispersion (unit-free). It is useful for comparing the dispersion of variables measured in different units or with different means.

Shape of Distribution:
By comparing the three measures of center:
a. Mean > Median (> Mode): positively or right-skewed
b. Mean = Median (= Mode): symmetric
c. Mean < Median (< Mode): negatively or left-skewed
The tail points in the direction of skewness.

Shape of Distribution (cont’d):
By using Pearson’s Second Skewness Coefficient:
a. Pearson’s Second Skewness Coefficient > 0: positively or right-skewed
b. Pearson’s Second Skewness Coefficient = 0: symmetric
c. Pearson’s Second Skewness Coefficient < 0: negatively or left-skewed
Pearson’s Second Skewness Coefficient = 3 * (mean - median) / standard deviation

Summary Statistics

Calories
  Mean                  146.1111
  Standard Error        4.012704
  Median                145
  Mode                  140
  Standard Deviation    29.48723
  Sample Variance       869.4969
  Kurtosis              -0.68913
  Skewness              -0.1484
  Range                 110
  Minimum               90
  Maximum               200
  Sum                   7890
  Count                 54

• $\text{Mean} = \frac{\text{Sum}}{\text{Count}}$
• The Standard Deviation and Sample Variance in this output are sample statistics
• $\text{Standard Deviation} = \sqrt{\text{Sample Variance}}$
• $\text{Population Variance} = \text{Sample Variance} \times \frac{N-1}{N}$
• The Skewness in this output is NOT Pearson’s skewness coefficient

Excel Commands:

Excel Command                                Output
AVERAGE(Array)                               Mean of the data
MEDIAN(Array)                                Median of the data
MODE.MULT(Array)                             Mode(s) of the data
VAR.P(Array)                                 Population variance
STDEV.P(Array)                               Population standard deviation
VAR.S(Array)                                 Sample variance
STDEV.S(Array)                               Sample standard deviation
MAX(Array)                                   Largest number in the data
MIN(Array)                                   Smallest number in the data
MAX(Array)-MIN(Array)                        Range of the data
Data/Data Analysis/Descriptive Statistics    Table of selected descriptive statistics
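
The relationships among the numbers in the Calories output can also be checked outside Excel. The following is a minimal Python sketch, not part of the course's Excel workflow. The function name summarize and all variable names are illustrative assumptions; only the standard math and statistics modules are used. The first part mirrors the Excel commands for a raw list of values. The second part reuses only the figures printed in the Calories table, since the raw data are not shown here.

```python
import math
import statistics


def summarize(values):
    """Sample-based summary statistics for a list of numbers.

    Hypothetical helper for illustration; it roughly mirrors AVERAGE,
    MEDIAN, MODE.MULT, VAR.S, VAR.P and STDEV.S.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sample_std = statistics.stdev(values)  # divides by n - 1, like STDEV.S
    return {
        "mean": mean,
        "median": median,
        # multimode returns every most-frequent value; unlike MODE.MULT it
        # returns all values when none repeats
        "modes": statistics.multimode(values),
        "range": max(values) - min(values),
        "sample_variance": statistics.variance(values),       # like VAR.S
        "population_variance": statistics.pvariance(values),  # like VAR.P
        "sample_std": sample_std,
        "cv_percent": sample_std / mean * 100,
        "pearson_skew2": 3 * (mean - median) / sample_std,
    }


# Figures copied from the Calories summary table.
total = 7890
count = 54
median = 145
sample_variance = 869.4969

mean = total / count                                  # Sum / Count
sample_std = math.sqrt(sample_variance)               # sqrt(Sample Variance)
population_variance = sample_variance * (count - 1) / count
standard_error = sample_std / math.sqrt(count)
cv_percent = sample_std / mean * 100
pearson_skew2 = 3 * (mean - median) / sample_std

print(f"mean                = {mean:.4f}")
print(f"sample std          = {sample_std:.5f}")
print(f"population variance = {population_variance:.4f}")
print(f"standard error      = {standard_error:.6f}")
print(f"CV (%)              = {cv_percent:.2f}")
print(f"Pearson's 2nd skew  = {pearson_skew2:.4f}")
```

For the Calories figures, Pearson's Second Skewness Coefficient comes out slightly positive (about 0.11) because the mean exceeds the median, while the Skewness reported by Descriptive Statistics is -0.1484. The two are different measures and need not agree in sign.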