IIntroduction t d ti tto Descriptive Statistics 17.871 Spring 2012 1 Key measures Describing data Center Spread Skew Peaked Moment Non-mean based measure Mean Mode, median Variance Range, (standard deviation) Interquartile range Skewness -­ Kurtosis -­ 2 Key distinction Population vs. Sample Notation Population vs. Sample vs Greeks Romans μ, σ, β s, b 3 Mean n xi i1 n X 4 Variance, Standard Deviation of of a Population (xi ) 2 , n i1 2 n ( xi ) n i1 2 n 5 V i Variance, S S.D. D off a Samplle (xi ) 2 s , n 1 i 1 2 n Degrees of freedom ((xxi ) s n 1 i1 2 n 6 Bi Binary data d t X prob( X ) 1 proportion of time x 1 s x(1 x) s x x (1 x ) 2 x 7 Example of this, using today’s NBC News/M /Marist i Poll ll iin Mi Michigan hi Candidate Pct. Santorum 35 Romneyy 37 Paul 13 Gingrich 8 [Unaccounted [U t d for] [7] 8 gen santorum = 1 if candidate==“Santorum” replace santorum = 0 if candidate~=“Santorum” the command th d summ santorum produces Mean = .35 35 Var = .35(1-.35)=.2275 S.d. = . 4769696 Normall di distributi t ib tion example l # PEOPLE Frequency 600 400 200 46 52 58 64 70 76 82 88 94 HEIGHT (inches) Image by MIT OpenCourseWare. 1 (( x ) / 2 2 f (x) e 2 9 IQ SAT Height “No skew” “N k ” “Zero skew” Symmetrical Mean = median = mode Skewness Asymmetrical distribution Frequency Value 10 Income Contribution to candidates did t Populations of countries “Residual vote” rates “Positive skew” “Right Right skew skew” Density D 1 1.5 Distribution of the average $$ of dividends/tax return (in K’s) .5 Hyde County County, SD 5 10 15 .02 .04 Density .06 6 dividends_pc Fuel economy of cars for sale in the US 11 0 0 .08 0 .1 Mitsubishi i-MiEV (which is supposed to be all electric) 0 50 100 var1 150 Skewness Asymmetrical distribution Frequency GPA of MIT students “Negative skew” “Left skew” Value 12 0 ..02 Densitty .04 .06 .08 Placement of Republican Party Party on 100-point scale 0 20 40 60 80 place on ideological scale - republican party 13 100 Skewness Sk Frequency Value 14 0 .0 02 .04 Dens sity .06 .08 .1 Placement of Republican Party Party on 100-point scale 0 20 40 60 80 place on ideological scale - democratic party Mean = 26.8; median = 25; mode = 25 15 100 K t i Kurtosis k>3 Frequency leptokurtic Image by MIT OpenCourseWare. k=3 k<3 Value 16 mesokurtic platykurtic Image by MIT OpenCourseWare. .15 .1 Density .05 Mean s.d. Skew. Kurt. Selfplacement 55.1 26.4 -0.14 2.21 Rep. pty. 26.8 21.2 0.87 3.59 Dem. pty 74.7 21.8 -1.18 4.29 0 20 40 60 place on ideological scale - yourself 80 100 0 .02 .04 Density .06 .08 .1 1 0 20 40 60 80 place on ideological scale - democratic party 100 0 20 40 60 80 place on ideological scale - republican party 100 0 .02 .04 De ensity .06 .08 .1 0 Source: Cooperative Congressional Election Study, 2008 17 N Normal l di distribution t ib ti # PEOPLE 600 Frequency 400 200 46 52 58 64 70 76 82 88 94 HEIGHT (inches) Image by MIT OpenCourseWare. 1 (( x ) / 2 2 f ( x) e 2 18 Skewness = 0 Kurtosis = 3 More word ds ab boutt tth he normall curve 0.4 0.3 34.1% 34.1% 0.2 0.1 0.1% 2.1% 2.1% 13.6% 13.6% 0.1% 0.0 -3σ -2σ -1σ µ 1σ 2σ 2σ Image by MIT OpenCourseWare. 19 The z-score or the “standardized standardized score” score x x z x 20 Commands in STATA for univariate statistics summarize varname summarize varname, varname detail histogram varname, bin() start() width() density/fraction/frequency normal graph box varnames tabulate 21 E Example l off Florida Fl id voters t Question: does the age of voters vary by race? Combine Florida voter extract files, 2008 gen new_birth_date=date(birth_date,"MDY") gen birth_year=year(new_b) gen age= 2010-birth 2010 bi h_year 22 0 .005 .0 01 Density .015 .02 .025 L k att di Look distribution t ib ti off bi birth th year 1850 1900 1950 birth_year 23 2000 E l Explore age b by votiting mode d . table race if birth_year>1900,c(mean age) ---------------------race | mean(age) ----------+----------1 | 45.61229 2 | 42 89916 42.89916 3 | 42.6952 4 | 45.09718 52.08628 5 | 6 | 44.77392 9 | 40.86704 ---------------------- 3 = Black 4 = Hispanic 5 = Whit White 24 0 .01 Density .02 .03 G h birth Graph bi th year 0 20 40 60 age . hist age if birth_year>1900 (bin=71, start=9, width=1.3802817) 25 80 100 0 .005 Density .01 .015 .02 Divide into “bins” bins so that each bar bar represents 1 year 0 20 40 60 80 age . hist hi t age if bi birth_year>1900,width(1) th 1900 idth(1) 26 100 Add tticks i k att 10-year 10 i t intervals l 0 .005 Density .01 ..015 .02 histogram totalscore, width(1) xlabel(-.2 (.1) 1) 20 30 40 50 60 age 27 70 80 90 100 S perimpose the normal ccurve Superimpose r e (with the same mean and s.d. as the empirical distribution) 0 .005 Density .01 .015 5 .02 hist i age if i birth_year>1900,wid(1) i i xlabel(20 l l (10) 100) normal 20 30 40 50 28 60 age 70 80 90 100 . summ age if birth_year>1900,det age ------------------------------------------------------------Percentiles Smallest 1% 18 9 5% 21 16 16 10% 24 Obs 12612114 16 25% 34 Sum of Wgt. 12612114 50% 48 75% 90% 95% 99% 63 77 83 91 Largest 107 107 107 107 29 Mean Std. Dev. 49.47549 19.01049 Variance Skewness Kurtosis 361.3986 .2629496 2.222442 Histograms by race hist age if birth birth_year year>1900&race>=3&race<=5,wid(1) 1900&race 3&race 5,wid(1) xlabel(20 (10) 100) normal by(race) 4 .03 3 0 20 30 40 50 60 70 80 90 100 .01 ..02 .03 5 0 Density .01 .02 3 = Black 4 = Hispanic 5 = White 20 30 40 50 60 70 80 90 100 age D Density it normal age Graphs by race 30 M i issues with Main ith hi histograms t Proper level of aggregation Non Non-regular regular data categories 31 Draw the previous graph with a box box plot graph box age if birth_year>1900 100 0 } 0 Upper quartile Median Lower q quartile age 50 1.5 x IQR 32 } Inter-quartile range Draw the box plots for the different different races graph box age if birth_year>1900&race>=3&race<=5,by(race) 4 100 3 age 0 50 3 = Black 4 = Hispanic 5 = White 0 50 100 5 Graphs by race 33 Draw the box plots for the different races using g “over” option age 50 0 3 = Black 4 = Hispanic 5 = White 100 graph box age if birth_year>1900&race>=3&race<=5,over(race) 3 4 34 5 A note about histograms with unnatural categories From the Current Population Survey (2000), Voter and Registration Survey How long (have you/has name) lived at this address? -9 -3 -2 -1 1 2 3 4 5 6 No Response Refused Don't know Not in universe Less than 1 month 1-6 months 7-11 7 11 months 1-2 years 3-4 years 5 years or longer 35 Solution,,Ste p1 Map artificial category onto natural midpoint “natural” -9 -3 -2 -1 1 2 3 4 5 6 No Response missing g Refused missing Don't know missing Not in universe missing Less than 1 month 1/24 = 0.042 1 6 months 3.5/12 1-6 3 5/12 = 0 0.29 29 7-11 months 9/12 = 0.75 1-2 years 1.5 3-4 years 3.5 5 years or longer 10 (arbitrary) recode live_length live length (min/-1 (min/ 1 = =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10) )(1= 042)(2= 29)(3= 75)(4=1 5)(5=3 5)(6=10) 36 Graph h off recod ded d data d t histogram longevity, fraction Fraction .557134 0 0 1 2 3 4 5 longevity 37 6 7 8 9 10 D Density it pllott off d data t Total area of last bar = .557 557 Width of bar = 11 (arbitrary) Solve for: a = w h (or) .557 = 11h => h = .051 0 0 1 2 3 4 5 6 longevity 7 38 8 9 10 15 D Density it pllott ttemplate l t Category Fraction X-min X-max X-length Height (density) < 1 mo. mo .0156 0156 0 1/12 .082 082 .19 19* 1-6 mo. .0909 1/12 ½ .417 .22 7 11 mo 7-11 mo. .0430 0430 ½ 1 .500 500 .09 09 1-2 yr. .1529 1 2 1 .15 3 4 yr. 3-4 .1404 1404 2 4 2 .07 07 5+ yr. .5571 4 15 11 .05 * = .0156/.082 39 Three words about pie charts: charts: don’t use them 40 So, wh hat’ t’s wrong wiith th tth hem For non-time series data, hard to get a parison among gg groups; p ; the ey ye is very y comp bad in judging relative size of circle slices For time series series, data, data hard to grasp cross cross­ time comparisons 41 Some words about graphical presentation Aspects of graphical integrity (following p y of Edward Tufte,, Visual Display Quantitative Information) Main point should be readily apparent Show as much data as possible Write clear labels on the graph Show data variation, not design variation 42 in There is a difference between graphs in research and publication Publish OK 4 0 20 30 40 50 60 70 80 90 100 Black Hispanic White Total 20 30 40 50 60 70 80 90 100 20 30 40 50 60 70 80 90 100 .01 .01 .02 2 .02 .03 .03 5 0 0 .01 Graphs by race .03 Densit De nsity normal age .02 age 0 20 30 40 50 60 70 80 90 100 Density Density .01 .02 .03 3 Do not publish Age Graphs by race 43 MIT OpenCourseWare http://ocw.mit.edu 17.871 Political Science Laboratory Spring 2012 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.