Introduction to Descriptive Statistics 17.871 Spring 2006 First, Some Words about Graphical Presentation • Aspects of graphical integrity (following Edward Tufte, Visual Display of Quantitative Information) – Represent number in direct proportion to numerical quantities presented – Write clear labels on the graph – Show data variation, not design variation – Deflate and standardize money in time series Population vs. Sample Notation Population Greeks , , Vs Sample Romans s, b Types of Variables Nominal (Qualitative) U&H: “categorical” ~Nominal (Quantitative) Ordinal Interval or ratio Describing data Moment Center Spread Mean Variance (standard deviation) Skew Skewness Non-mean based measure Mode, median Range, Interquartile range -- Peaked Kurtosis -- Mean n x i i 1 n X Variance, Standard Deviation ( xi ) 2 , n i 1 2 n ( xi ) n i 1 n 2 Variance, S.D. of a Sample ( xi ) 2 s , n 1 i 1 2 n Degrees of freedom ( xi ) s n 1 i 1 n 2 The z-score or the “standardized score” z x x x Skewness Symmetrical distribution • IQ • SAT Frequency • “No skew” • “Zero skew” • Symmetrical Value Skewness Asymmetrical distribution • GPA of MIT students Frequency • “Negative skew” • “Left skew” Value Skewness (Asymmetrical distribution) • Income • Contribution to candidates • Populations of countries • “Residual vote” rates Frequency Value • “Positive skew” • “Right skew” Skewness n x x 3 i i 1 n (n 1) n2 3/ 2 x x i 1 3/ 2 i (mean mod e) / s 3 (mean median) / s Skewness Frequency Value Kurtosis k>3 leptokurtic Frequency k=3 k<3 Value Beware the “coefficient of excess” mesokurtic platykurtic A few words about the normal curve • Skewness = 0 • Kurtosis = 3 1 ( x ) / 2 2 f ( x) e 2 More words about the normal curve 34% 34% 47% 49% 47% 49% “Empirical rule” Range s 6 SEG example The instructor and/or section leader: Mean s.d. Skew Kurt Gives well-prepared, relevant presentations 6.0 0.69 -1.7 8.5 Explains clearly and answers questions well 5.9 0.68 -1.0 4.8 Uses visual aids well 5.6 0.85 -1.8 8.9 Uses information technology effectively 5.5 0.91 -1.1 5.0 Speaks well 6.1 0.69 -1.5 6.8 Encourages questions & class participation 6.1 0.66 -0.88 3.7 Stimulates interest in the subject 5.9 0.76 -1.1 4.7 Is available outside of class for questions 5.9 0.68 -1.3 6.3 Overall rating of teaching 5.9 0.67 -1.2 5.5 Graph Graph some SEG variables The instructor and/or section leader: s.d. Skew Kurt 5.6 0.85 -1.8 8.9 Graph .6 Fraction Uses visual aids well Mean 0 1 6.1 0.66 -0.88 3.7 .6 Fraction Encourages questions & class participation 7 (mean) q3 0 1 7 (mean) q6 Binary data X prob( X ) 1 proportion of time x 1 s x (1 x ) s x x (1 x ) 2 x Commands in STATA for getting univariate statistics • summarize varname • summarize varname, detail • histogram varname, bin() start() width() density/fraction/frequency normal • graph box varnames • tabulate [NB: compare to table] Example of Sophomore Test Scores • High School and Beyond, 1980: A Longitudinal Survey of Students in the United States (ICPSR Study 7896) • totalscore = % of questions answered correctly on a battery of questions • recodedtype = (1=public school, 2=religious private private, 3 = non-sectarian private) Explore totalscore some more . table recodedtype,c(mean totalscore) -------------------------recodedty | pe | mean(totals~e) ----------+--------------1 | .3729735 2 | .4475548 3 | .589883 -------------------------- Graph totalscore 1 .5 0 Density 1.5 2 . hist totalscore -.5 0 .5 totalscore 1 Divide into “bins” so that each bar represents 1% correct 1 .5 0 Density 1.5 2 • hist totalscore,width(.01) • (bin=124, start=-.24209334, width=.01) -.5 0 .5 totalscore 1 Add ticks at each 10% mark 1 .5 0 Density 1.5 2 histogram totalscore, width(.01) xlabel(-.2 (.1) 1) (bin=124, start=-.24209334, width=.01) -.2 -.1 0 .1 .2 .3 .4 .5 totalscore .6 .7 .8 .9 1 Superimpose the normal curve (with the same mean and s.d. as the empirical distribution) 1 .5 0 Density 1.5 2 . histogram totalscore, width(.01) xlabel(-.2 (.1) 1) normal (bin=124, start=-.24209334, width=.01) -.2 -.1 0 .1 .2 .3 .4 .5 totalscore .6 .7 .8 .9 1 Do the previous graph by school types .histogram totalscore, width(.01) xlabel(-.2 (.1)1) by(recodedtype) (bin=124, start=-.24209334, width=.01) 2 Density 0 1 2 3 1 -.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 0 1 2 3 3 -.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 totalscore Graphs by recodedtype Main issues with histograms • Proper level of aggregation • Non-regular data categories (see next) A note about histograms with unnatural categories (start here) From the Current Population Survey (2000), Voter and Registration Survey How long (have you/has name) lived at this address? -9 -3 -2 -1 1 2 3 4 5 6 No Response Refused Don't know Not in universe Less than 1 month 1-6 months 7-11 months 1-2 years 3-4 years 5 years or longer Simple graph Fraction .557134 0 1 6 PES8 Solution, Step 1 Map artificial category onto “natural” midpoint -9 -3 -2 -1 1 2 3 4 5 6 No Response missing Refused missing Don't know missing Not in universe missing Less than 1 month 1/24 = 0.042 1-6 months 3.5/12 = 0.29 7-11 months 9/12 = 0.75 1-2 years 1.5 3-4 years 3.5 5 years or longer 10 (arbitrary) Graph of recoded data Fraction .557134 0 0 1 2 3 4 5 longevity 6 7 8 9 10 Density plot of data Total area of last bar = .557 Width of bar = 11 (arbitrary) Solve for: a = w h (or) .557 = 11h => h = .051 0 0 1 2 3 4 5 6 longevity 7 8 9 10 15 Density plot template Category F X-min X-max X-length Height (density) < 1 mo. .0156 0 1/12 .082 .19* 1-6 mo. .0909 1/12 ½ .417 .22 7-11 mo. .0430 ½ 1 .500 .09 1-2 yr. .1529 1 2 1 .15 3-4 yr. .1404 2 4 2 .07 5+ yr. .5571 4 15 11 .05 * = .0156/.082 Draw the previous graph with a box plot 1 . graph box totalscore -.5 0 .5 Upper quartile Median Lower quartile } } 1.5 x IQR Inter-quartile range Draw the box plots for the different types of schools . graph box totalscore,by(recodedtype) -.5 0 .5 1 1 -.5 0 .5 1 3 Graphs by recodedtype 2 Draw the box plots for the different types of schools using “over” option -.5 0 .5 1 graph box totalscore,over(recodedtype) 1 2 3 Issue with box plots • Sometimes overly highly stylized Three words about pie charts: don’t use them So, what’s wrong with them • For non-time series data, hard to get a comparison among groups; the eye is very bad in judging relative size of circle slices • For time series, data, hard to grasp crosstime comparisons Time series example An exception to the no pie chart rule The worst graph ever published