Stat 61
October 29, 2007

Descriptive Statistics (single variable)

Notation

$n$: number of items in sample
$x_1, x_2, \ldots, x_n$: values in the sample (or use $y$, $z$, etc., all lower case)
$x_i$: $i$-th value (where $i = 1, \ldots, n$)
$\alpha$ (Greek “alpha”): some number between 0 and 1, usually near 0, representing, for example, a fraction of observations
$x_1', x_2', \ldots, x_n'$: “order statistics”, the same values as $x_1, x_2, \ldots, x_n$, but sorted in increasing order. For example, $x_n'$ is the largest of the numbers $x_1, x_2, \ldots, x_n$. Always: $x_1' \le x_2' \le \cdots \le x_n'$.

Graphical representations

Dot plot

Histogram
Bars should normally have equal width. If not, then be sure that equal areas represent equal quantities (“equal area principle”). This means that the vertical axis has units of “number of observations per x-axis unit.”

Stem and leaf (obsolete)

Box plot
[Figures: a box plot and an alternative version per Tufte]

Empirical cdf (= empirical cumulative distribution function)
$F(x)$ = fraction of observations that are $\le x$

In general, the “empirical distribution” corresponding to $x_1, x_2, \ldots, x_n$ is a discrete distribution with $p(x) = (\text{number of } x_i\text{’s equal to } x) / n$. It is the distribution corresponding to picking one of the $x_i$’s at random.

Empirical survival function
$S(x) = 1 - F(x)$ = fraction of observations that are $> x$

Measures of Location

Mean = Average = Arithmetic mean (AM)
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n} \]

Other kinds of means:

Geometric mean = $\sqrt[n]{x_1 x_2 \cdots x_n}$ (interesting only if all $x_i$’s are positive)

Harmonic mean = inverse of average of inverses
\[ = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}} \]

Root-mean-square
\[ = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}} \]
The name is grammatically challenging. We’ll sometimes call it the RMS mean.

Transformed means, in general:
(1) transform the individual $x_i$’s by applying some function $f$;
(2) calculate the arithmetic mean of the transformed values;
(3) apply the reverse transformation, $f^{-1}$.

For the geometric mean, use $f(x) = \log x$.
For the harmonic mean, use $f(x) = 1/x$.
For the root-mean-square, use $f(x) = x^2$.
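The three-step recipe for transformed means can be sketched in a few lines of Python. This is only an illustration: the function names (`transformed_mean`, `geometric_mean`, and so on) and the sample data are assumptions made for this example, not part of the handout.

```python
import math

def transformed_mean(xs, f, f_inv):
    """Steps (1)-(3): transform each value with f, average, then undo with f_inv."""
    transformed = [f(x) for x in xs]
    return f_inv(sum(transformed) / len(transformed))

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # f(x) = log x, inverse is exp; requires every value to be positive
    return transformed_mean(xs, math.log, math.exp)

def harmonic_mean(xs):
    # f(x) = 1/x is its own inverse; requires every value to be nonzero
    return transformed_mean(xs, lambda x: 1 / x, lambda y: 1 / y)

def rms_mean(xs):
    # f(x) = x^2, inverse taken as the nonnegative square root
    return transformed_mean(xs, lambda x: x * x, math.sqrt)

# Hypothetical sample, chosen only for illustration
sample = [1.0, 2.0, 4.0]
for name, fn in [("AM", arithmetic_mean), ("GM", geometric_mean),
                 ("HM", harmonic_mean), ("RMS", rms_mean)]:
    print(name, fn(sample))
```

For positive data the four results always come out in the order harmonic ≤ geometric ≤ arithmetic ≤ RMS, which is a quick sanity check on any implementation.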
Median:
\[ \tilde{x} = \begin{cases} \text{middle value}, & \text{if } n \text{ is odd} \\ \text{average of two middle values}, & \text{if } n \text{ is even} \end{cases} \]

Trimmed mean (really, “$\alpha$-trimmed mean”):
$\bar{x}_\alpha$ = average of remaining values, after the $n\alpha$ lowest values and the $n\alpha$ highest values are removed (sensible only if $\alpha < 1/2$).

Weighted averages:
\[ \text{weighted average} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n, \quad \text{where each } w_i \ge 0 \text{ and } \sum_{i=1}^{n} w_i = 1. \]
A weighted average is also called a convex combination of the $x_i$’s. (Without the restrictions on the weights the same expression is called a linear combination of the $x_i$’s, but it would not then be called a weighted average.)

Percentiles (quantiles, fractiles):
$x_\alpha$ = value such that fraction $\alpha$ of observations are $\le x_\alpha$ and fraction $1 - \alpha$ of observations are $\ge x_\alpha$ (not well defined for certain values of $\alpha$)

Quartiles:
$Q_1$ = lower quartile, same as $x_{0.25}$
$Q_3$ = upper quartile, same as $x_{0.75}$

Extremes:
$x_{\min}$ = minimum value among $x_1, x_2, \ldots, x_n$
$x_{\max}$ = maximum value among $x_1, x_2, \ldots, x_n$

Five-number summary: minimum value, $Q_1$, median, $Q_3$, maximum value. (Not quite standard. Some would use some extreme percentiles in place of the minimum and maximum, especially in large data sets prone to outliers.)

Measures of Dispersion

Variance (or “population variance,” VARP( ) in Excel):
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{“mean square deviation”} \]

Standard deviation (or “population standard deviation,” STDEVP( ) in Excel):
\[ \sigma = \sqrt{\sigma^2} = \text{“root-mean-square deviation”} \]
(“standard” often means root-mean-square in statistics, so “standard deviation” is a formula as well as a name)

Sample variance (VAR( ) in Excel):
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Sample standard deviation (STDEV( ) in Excel):
\[ s = \sqrt{s^2} \]

Range: $x_{\max} - x_{\min}$ (largest value minus smallest)
(not really standard usage; some use the word “range” to refer to the pair $x_{\min}$, $x_{\max}$.)
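As a numerical check on the definitions so far, the median, the trimmed mean, and the two variances can be sketched in Python. The function names and the data set are hypothetical. The sketch also assumes that when $n\alpha$ is not a whole number it is rounded down, a case the definition above leaves open.

```python
def median(xs):
    s = sorted(xs)                      # order statistics x'_1 <= ... <= x'_n
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # middle value, n odd
    return (s[mid - 1] + s[mid]) / 2    # average of two middle values, n even

def trimmed_mean(xs, alpha):
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/2")
    s = sorted(xs)
    n = len(s)
    # Assumption: round n*alpha down; the tiny tolerance guards against
    # floating-point products that land just below a whole number.
    k = int(n * alpha + 1e-9)
    k = min(k, (n - 1) // 2)            # always keep at least one value
    kept = s[k:n - k]
    return sum(kept) / len(kept)

def population_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n        # divide by n

def sample_variance(xs):
    n = len(xs)                                        # needs n >= 2
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)  # divide by n - 1

# Hypothetical data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(median(data), trimmed_mean(data, 0.25))
print(population_variance(data), sample_variance(data))
```

The only difference between the two variance functions is the divisor, which corresponds to the VARP( ) versus VAR( ) distinction in Excel.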
Interquartile range: $Q_3 - Q_1 = x_{0.75} - x_{0.25}$ (not well defined if $n$ is a multiple of 4)

Mean absolute deviation:
\[ \text{mean absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]

Moments

The “$k$-th moment of the $x_i$’s about zero” is the average value of $x_i^k$:
\[ M_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k \quad \text{for } k = 1, 2, \ldots \]
(the 0-th moment would always be 1)

The “$k$-th moment about the mean” is the average value of $(x_i - \bar{x})^k$:
\[ m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k \quad \text{for } k = 2, 3, \ldots \]

So: Mean = 1st moment about 0; Variance = 2nd moment about mean.

Skewness: $\dfrac{m_3}{\sigma^3}$

Kurtosis: $\dfrac{m_4}{\sigma^4}$
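The moment definitions translate directly into code. The following Python sketch uses hypothetical function names and data. It takes $\sigma$ to be the population standard deviation (the square root of the 2nd moment about the mean) and computes kurtosis without subtracting 3, following the formula above rather than the “excess kurtosis” convention used by some software.

```python
import math

def moment_about_zero(xs, k):
    """M_k: average value of x_i ** k."""
    return sum(x ** k for x in xs) / len(xs)

def moment_about_mean(xs, k):
    """m_k: average value of (x_i - mean) ** k."""
    mean = moment_about_zero(xs, 1)     # mean = 1st moment about 0
    return sum((x - mean) ** k for x in xs) / len(xs)

def skewness(xs):
    sigma = math.sqrt(moment_about_mean(xs, 2))   # population standard deviation
    return moment_about_mean(xs, 3) / sigma ** 3

def kurtosis(xs):
    sigma = math.sqrt(moment_about_mean(xs, 2))
    return moment_about_mean(xs, 4) / sigma ** 4  # no "- 3" correction

# Hypothetical data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(moment_about_zero(data, 1))   # the mean
print(moment_about_mean(data, 2))   # the (population) variance
print(skewness(data), kurtosis(data))
```

Both functions divide by a power of $\sigma$, so they fail with a division by zero when all values are equal. A symmetric data set gives a skewness of 0.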