Stat 11
January 24, 2008

Descriptive Statistics (single variable)

Notation

$n$: number of items in sample
$x_1, x_2, \ldots, x_n$: values in the sample (or use $y$, $z$, etc., all lower case)
$x_i$: $i$-th value (where $i = 1, \ldots, n$)
$\alpha$ (Greek “alpha”): some number between 0 and 1, usually near 0, representing, for example, a fraction of observations
$x_1', x_2', \ldots, x_n'$: “order statistics”, the same values as $x_1, x_2, \ldots, x_n$, but sorted in increasing order. For example, $x_n'$ is the largest of the numbers $x_1, x_2, \ldots, x_n$. Always: $x_1' \le x_2' \le \cdots \le x_n'$.

Graphical representations

Dot plot

Histogram: Bars should normally have equal width. If not, then be sure that equal areas represent equal quantities (“equal area principle”). This means that the vertical axis has units of “number of observations per x-axis unit.”

Box plot
[Figure: box plot; alternative form per Tufte]

Measures of Location

Mean = Average = Arithmetic mean (AM):

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n}$$

Other kinds of means:

Geometric mean (GM) $= \sqrt[n]{x_1 x_2 \cdots x_n}$ (interesting only if all $x_i$’s are positive)

Harmonic mean (HM) = inverse of average of inverses $= \dfrac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$

Root-mean-square (RMS) $= \sqrt{\dfrac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}$
Grammatically challenging. We’ll sometimes call it the RMS mean.

Median:

$$\tilde{x} = \begin{cases} \text{middle value}, & \text{if } n \text{ is odd} \\ \text{average of two middle values}, & \text{if } n \text{ is even} \end{cases}$$

Midmean: average of the middle half (exclude largest 25% and smallest 25%; report the average of the rest)

Trimmed mean (really, “$\alpha$-trimmed mean”): $\bar{x}_\alpha$ = average of remaining values, after the $\alpha n$ lowest values and the $\alpha n$ highest values are removed (sensible only if $\alpha < 1/2$).

Weighted averages: weighted average $= w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$, where each $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$.
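To make the measures of location concrete, here is a minimal Python sketch of each one. The function names and the sample data are illustrative assumptions, not part of the course material. The trimmed mean also has to settle what to do when $\alpha n$ is not a whole number; this sketch rounds it down, which is one possible convention among several.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # n-th root of the product, computed through logarithms,
    # so every value must be positive.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    # Inverse of the average of the inverses.
    return len(xs) / sum(1 / x for x in xs)

def rms(xs):
    # Square root of the mean of the squares.
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def median(xs):
    s = sorted(xs)  # order statistics
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def trimmed_mean(xs, alpha):
    # Assumption: alpha * n is rounded down to a whole number of
    # values to drop from each end.
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/2")
    s = sorted(xs)
    k = int(alpha * len(s))
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def midmean(xs):
    # Average of the middle half: the 0.25-trimmed mean.
    return trimmed_mean(xs, 0.25)

def weighted_average(xs, ws):
    # Weights are expected to be nonnegative and to sum to 1.
    return sum(w * x for w, x in zip(ws, xs))

sample = [1, 2, 4, 8]  # hypothetical data
print(arithmetic_mean(sample))  # 3.75
print(median(sample))           # 3.0
print(midmean(sample))          # 3.0
print(geometric_mean(sample), harmonic_mean(sample), rms(sample))
```

With equal weights $w_i = 1/n$, the weighted average reduces to the arithmetic mean, and the midmean is the trimmed mean with $\alpha = 0.25$.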
Percentiles

Percentiles ( = quantiles = fractiles ): $x_\alpha$ = value such that fraction $\alpha$ of observations are $\le x_\alpha$ and fraction $1 - \alpha$ of observations are $\ge x_\alpha$ (not well defined for certain values of $\alpha$)

Quartiles:
$Q_1$ = lower quartile, same as $x_{0.25}$
$Q_3$ = upper quartile, same as $x_{0.75}$

Extremes:
$x_{\min}$ = minimum value among $x_1, x_2, \ldots, x_n$
$x_{\max}$ = maximum value among $x_1, x_2, \ldots, x_n$

Five-number summary: minimum value, $Q_1$, median, $Q_3$, maximum value. (Not quite standard. Some would use some extreme percentiles in place of the minimum and maximum, especially in large data sets prone to outliers.)

Measures of Dispersion

Variance (or “population variance,” VARP( ) in Excel):

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{“mean square deviation”}$$

Standard deviation (or “population standard deviation,” STDEVP( ) in Excel):

$$\sigma = \sqrt{\sigma^2} = \text{“root-mean-square deviation”}$$

(“standard” often means root-mean-square in statistics, so “standard deviation” is a formula as well as a name)

Sample variance (VAR( ) in Excel):

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Sample standard deviation (STDEV( ) in Excel):

$$s = \sqrt{s^2}$$

Range: $x_{\max} - x_{\min}$ (largest value minus smallest) (not really standard usage; some use the word “range” to refer to the pair $x_{\min}$, $x_{\max}$.)

Interquartile range (IQR): $Q_3 - Q_1 = x_{0.75} - x_{0.25}$ (not well defined if $n$ is a multiple of 4)

Mean absolute deviation:

$$\text{mean absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$
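The percentile and dispersion formulas can be checked numerically with a short Python sketch. Function names and data are illustrative assumptions. Because the percentile definition is not always well defined, the sketch has to pick a convention: it interpolates linearly between order statistics, which is only one of several rules in common use, so other software may report slightly different quartiles.

```python
import math

def quantile(xs, alpha):
    # Assumed convention: linear interpolation between order statistics
    # at 0-based fractional position alpha * (n - 1).
    s = sorted(xs)
    pos = alpha * (len(s) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def five_number_summary(xs):
    return (min(xs), quantile(xs, 0.25), quantile(xs, 0.5),
            quantile(xs, 0.75), max(xs))

def mean(xs):
    return sum(xs) / len(xs)

def population_variance(xs):
    # Mean square deviation: divide by n.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def population_sd(xs):
    # Root-mean-square deviation.
    return math.sqrt(population_variance(xs))

def sample_variance(xs):
    # Same sum of squared deviations, divided by n - 1.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    return math.sqrt(sample_variance(xs))

def value_range(xs):
    return max(xs) - min(xs)

def iqr(xs):
    return quantile(xs, 0.75) - quantile(xs, 0.25)

def mean_abs_dev(xs):
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean 5
print(five_number_summary(data))   # (2, 4.0, 4.5, 5.5, 9)
print(population_variance(data))   # 4.0
print(population_sd(data))         # 2.0
print(sample_variance(data))       # 32/7, slightly larger than 4.0
print(value_range(data), iqr(data), mean_abs_dev(data))  # 7 1.5 1.5
```

Here $n = 8$ is a multiple of 4, the case where the quartiles are not uniquely determined by the definition; the interpolation rule simply selects one admissible answer.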