Introduction to Descriptive Statistics

17.871, Spring 2006

First, Some Words about Graphical Presentation

• Aspects of graphical integrity (following Edward Tufte, The Visual Display of Quantitative Information)
  – Represent numbers in direct proportion to the numerical quantities presented
  – Write clear labels on the graph
  – Show data variation, not design variation
  – Deflate and standardize money in time series

Population vs. Sample Notation

Population: Greeks (μ, σ, β)
Sample: Romans (s, b)

Types of Variables

• Nominal (qualitative); U&H: “categorical”
• ~Nominal (quantitative)
  – Ordinal
  – Interval or ratio

Describing data

           Moment                          Non-mean based measure
Center     Mean                            Mode, median
Spread     Variance (standard deviation)   Range, interquartile range
Skew       Skewness                        --
Peaked     Kurtosis                        --

Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Variance, Standard Deviation

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

Variance, S.D. of a Sample

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The denominator n − 1 is the degrees of freedom.

The z-score or the “standardized score”

$$z = \frac{x - \bar{x}}{\sigma_x}$$

Skewness: Symmetrical distribution

• Examples: IQ, SAT
• “No skew”
• “Zero skew”
• Symmetrical

[Figure: symmetrical distribution; y-axis Frequency, x-axis Value]

Skewness: Asymmetrical distribution

• Example: GPA of MIT students
• “Negative skew”
• “Left skew”

[Figure: left-skewed distribution; y-axis Frequency, x-axis Value]

Skewness (Asymmetrical distribution)

• Examples: income, contributions to candidates, populations of countries, “residual vote” rates
• “Positive skew”
• “Right skew”

[Figure: right-skewed distribution; y-axis Frequency, x-axis Value]

Skewness

$$\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}} \approx \frac{\text{mean} - \text{mode}}{s} \approx \frac{3\,(\text{mean} - \text{median})}{s}$$

Some texts and packages apply small-sample corrections involving n − 1 and n − 2.

Kurtosis

• k > 3: leptokurtic
• k = 3: mesokurtic
• k < 3: platykurtic
• Beware the “coefficient of excess” (kurtosis minus 3, so the normal curve scores 0 rather than 3)

[Figure: three curves of differing peakedness; y-axis Frequency, x-axis Value]

A few words about the normal curve

• Skewness = 0
• Kurtosis = 3

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$$

More words about the normal curve

• Area between the mean and 1, 2, and 3 standard deviations on either side: roughly 34%, 47%, and 49% (more precisely 34.1%, 47.7%, and 49.9%)
• “Empirical rule”: s ≈ Range / 6

SEG example

The instructor and/or section leader:          Mean   s.d.   Skew   Kurt
Gives well-prepared, relevant presentations    6.0    0.69   -1.7   8.5
Explains clearly and answers questions well    5.9    0.68   -1.0   4.8
Uses visual aids well                          5.6    0.85   -1.8   8.9
Uses information technology effectively        5.5    0.91   -1.1   5.0
Speaks well                                    6.1    0.69   -1.5   6.8
Encourages questions & class participation     6.1    0.66   -0.88  3.7
Stimulates interest in the subject             5.9    0.76   -1.1   4.7
Is available outside of class for questions    5.9    0.68   -1.3   6.3
Overall rating of teaching                     5.9    0.67   -1.2   5.5

Graph some SEG variables

The instructor and/or section leader:          Mean   s.d.   Skew   Kurt
Uses visual aids well                          5.6    0.85   -1.8   8.9
Encourages questions & class participation     6.1    0.66   -0.88  3.7

[Figures: histograms of (mean) q3 and (mean) q6; y-axis Fraction (0 to .6), x-axis rating (1 to 7)]

Binary data

$$\bar{X} = \text{prob}(X = 1) = \text{proportion of time } x = 1$$

$$s_x^2 = \bar{x}(1 - \bar{x}) \;\Rightarrow\; s_x = \sqrt{\bar{x}(1 - \bar{x})}$$

Commands in STATA for getting univariate statistics

• summarize varname
• summarize varname, detail
• histogram varname, bin() start() width() density/fraction/frequency normal
• graph box varnames
• tabulate [NB: compare to table]

Example of Sophomore Test Scores

• High School and Beyond, 1980: A Longitudinal Survey of Students in the United States (ICPSR Study 7896)
• totalscore = % of questions answered correctly on a battery of questions
• recodedtype = (1 = public school, 2 = religious private, 3 = non-sectarian private)

Explore totalscore some more

    . table recodedtype,c(mean totalscore)

    --------------------------
    recodedty |
    pe        | mean(totals~e)
    ----------+---------------
            1 |       .3729735
            2 |       .4475548
            3 |        .589883
    --------------------------

Graph totalscore

    . hist totalscore

[Figure: histogram of totalscore; y-axis Density (0 to 2), x-axis totalscore (-.5 to 1)]

Divide into “bins” so that each bar represents 1% correct

    . hist totalscore,width(.01)
    (bin=124, start=-.24209334, width=.01)

[Figure: histogram of totalscore with bin width .01; y-axis Density, x-axis totalscore]

Add ticks at each 10% mark

    . histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
    (bin=124, start=-.24209334, width=.01)

[Figure: same histogram with x-axis labels from -.2 to 1 in steps of .1]

Superimpose the normal curve (with the same mean and s.d. as the empirical distribution)

    . histogram totalscore, width(.01) xlabel(-.2 (.1) 1) normal
    (bin=124, start=-.24209334, width=.01)

[Figure: same histogram with a normal curve overlaid]

Do the previous graph by school types

    . histogram totalscore, width(.01) xlabel(-.2 (.1) 1) by(recodedtype)
    (bin=124, start=-.24209334, width=.01)

[Figure: three histogram panels, one per recodedtype (1, 2, 3); y-axis Density, x-axis totalscore; “Graphs by recodedtype”]

Main issues with histograms

• Proper level of aggregation
• Non-regular data categories (see next)

A note about histograms with unnatural categories

From the Current Population Survey (2000), Voter and Registration Survey:
How long (have you/has name) lived at this address?

    -9   No Response
    -3   Refused
    -2   Don't know
    -1   Not in universe
     1   Less than 1 month
     2   1-6 months
     3   7-11 months
     4   1-2 years
     5   3-4 years
     6   5 years or longer

Simple graph

[Figure: histogram of PES8 (codes 1 to 6); y-axis Fraction, tallest bar at .557134]

Solution, Step 1: Map artificial category onto “natural” midpoint

    -9   No Response         → missing
    -3   Refused             → missing
    -2   Don't know          → missing
    -1   Not in universe     → missing
     1   Less than 1 month   → 1/24 = 0.042
     2   1-6 months          → 3.5/12 = 0.29
     3   7-11 months         → 9/12 = 0.75
     4   1-2 years           → 1.5
     5   3-4 years           → 3.5
     6   5 years or longer   → 10 (arbitrary)

Graph of recoded data

[Figure: histogram of longevity (0 to 10 years); y-axis Fraction, tallest bar at .557134]

Density plot of data

• Total area of last bar = .557
• Width of bar = 11 (arbitrary)
• Solve for the height: a = w × h, or .557 = 11h, so h = .051

[Figure: density plot of longevity; x-axis from 0 to 15 years]

Density plot template

    Category   F       X-min   X-max   X-length   Height (density)
    < 1 mo.    .0156   0       1/12    .083       .19*
    1-6 mo.    .0909   1/12    1/2     .417       .22
    7-11 mo.   .0430   1/2     1       .500       .09
    1-2 yr.    .1529   1       2       1          .15
    3-4 yr.    .1404   2       4       2          .07
    5+ yr.     .5571   4       15      11         .05

    * = .0156/.083

Draw the previous graph with a box plot

    . graph box totalscore

[Figure: box plot of totalscore (-.5 to 1), annotated with upper quartile, median, lower quartile, inter-quartile range (IQR), and 1.5 × IQR]

Draw the box plots for the different types of schools

    . graph box totalscore,by(recodedtype)

[Figure: three box-plot panels, one per recodedtype (1, 2, 3); “Graphs by recodedtype”]

Draw the box plots for the different types of schools using “over” option

    . graph box totalscore,over(recodedtype)

[Figure: box plots of totalscore for recodedtype 1, 2, and 3 on a single axis]

Issue with box plots

• Sometimes overly stylized

Three words about pie charts: don’t use them

So, what’s wrong with them

• For non-time series data, it is hard to compare groups; the eye is very bad at judging the relative size of circle slices
• For time series data, it is hard to grasp cross-time comparisons

Time series example

An exception to the no pie chart rule

The worst graph ever published
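The moment-based measures from the Mean, Variance, Skewness, and Kurtosis slides can be checked by hand on a small data set. The following is a minimal Python sketch rather than the course's Stata workflow; the function names `describe` and `z_scores` and the toy data are hypothetical, the skewness and kurtosis use the plain 1/n moment definitions with no small-sample correction, and the z-scores are standardized by the sample s.d. (an assumption, since the slide does not say which s.d. it intends).

```python
import math


def describe(xs):
    """Return mean, sample s.d., skewness, and kurtosis of a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    # Central moments m_r = (1/n) * sum((x - mean)^r)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    # Sample variance divides by the degrees of freedom, n - 1
    sd = math.sqrt(m2 * n / (n - 1))
    return {
        "mean": mean,
        "sd": sd,
        "skew": m3 / m2 ** 1.5,  # 0 for a symmetrical distribution
        "kurt": m4 / m2 ** 2,    # 3 for the normal curve
    }


def z_scores(xs):
    """Standardize each value: (x - mean) / sample s.d."""
    stats = describe(xs)
    return [(x - stats["mean"]) / stats["sd"] for x in xs]


# Made-up illustrative data, not from the course
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(describe(symmetric))
print(describe(right_skewed))
print(z_scores(symmetric))
```

A long right tail, as in the second list, produces a positive skewness, which matches the “positive skew” / “right skew” labels on the slides.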
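The percentages on the normal-curve slide, and the “empirical rule” that s is roughly Range / 6, follow from the areas under the standard normal curve. As a hedged illustration (the helper name `area_from_mean` is hypothetical), the area between the mean and k standard deviations on one side can be computed from the error function in Python's standard library:

```python
import math


def area_from_mean(k):
    """Area under the normal curve between the mean and k s.d. on one side."""
    return 0.5 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    one_side = area_from_mean(k)
    # Doubling gives the area within k s.d. on both sides of the mean
    print(k, round(100 * one_side, 1), round(100 * 2 * one_side, 1))
```

Because nearly all of the area lies within 3 standard deviations on either side of the mean, the observed range of roughly normal data spans about 6 standard deviations, which is where the Range / 6 approximation comes from.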
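The binary-data result (the mean of a 0/1 variable is the proportion of ones, and its variance is that proportion times one minus it) can be verified numerically. This is a small Python sketch on made-up data; `binary_summary` is a hypothetical name. Note that the identity is exact for the variance with n in the denominator; the n − 1 version differs by the factor n / (n − 1).

```python
def binary_summary(xs):
    """Return (proportion of ones, p * (1 - p), variance with n in the denominator)."""
    n = len(xs)
    p = sum(xs) / n                              # mean of a 0/1 variable
    var_n = sum((x - p) ** 2 for x in xs) / n    # variance computed directly
    return p, p * (1 - p), var_n


# Made-up illustrative data: five ones out of eight observations
p, shortcut, direct = binary_summary([1, 0, 0, 1, 1, 0, 1, 1])
print(p, shortcut, direct)
```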
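The density plot template applies one rule to every category: the bar's area equals the fraction of cases, so height = F / width. A minimal Python sketch of that calculation, using the fractions and interval endpoints from the template (the variable names are hypothetical, and the upper limit of 15 years for the open-ended category is the slide's arbitrary choice):

```python
# (category, fraction F, x-min in years, x-max in years), from the template
categories = [
    ("< 1 mo.", 0.0156, 0.0, 1 / 12),
    ("1-6 mo.", 0.0909, 1 / 12, 0.5),
    ("7-11 mo.", 0.0430, 0.5, 1.0),
    ("1-2 yr.", 0.1529, 1.0, 2.0),
    ("3-4 yr.", 0.1404, 2.0, 4.0),
    ("5+ yr.", 0.5571, 4.0, 15.0),  # upper limit is arbitrary
]

heights = []
for label, frac, x_min, x_max in categories:
    width = x_max - x_min
    height = frac / width  # area = width * height, so height = area / width
    heights.append(height)
    print(f"{label:9s} width={width:6.3f} height={height:.2f}")

# The bar areas add back up to the total fraction of cases
total_area = sum(h * (c[3] - c[2]) for h, c in zip(heights, categories))
print(round(total_area, 4))
```

Plotting bars with these heights and widths keeps each bar's area proportional to its share of respondents, which a plain histogram of the category codes does not.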
