Jan. 20-23

Shapes of distributions…
“Statistics” for one quantitative variable…
    Mean and median
    Percentiles
    Standard deviations
Transforming data…
    Rescale: Y = c times X
    Recenter: Y = X plus a
    Adding variables to each other
    Other transformations

Shape of a distribution…
    Outliers
    Unimodal, bimodal, multimodal
    Symmetrical
    Skew: right or left?

[Figure: Colleges, Datadesk histogram]
[Figure: histogram of GE daily changes ($/share); bins from -0.7 to 1.5 plus “More”, counts from 0 to 10]
[Figure: histogram of errors from 1/26 NH polls (NH polls, 1/26/04); bins from -0.1 to 0.1, counts from 0 to 12]

Population vs. Sample
A statistic is anything that can be computed from data.

STATISTICS of a single quantitative variable
    MEAN
    MEDIAN
    QUARTILES (Q1, Q3)
        Five-number summary
        Boxplots
        Interquartile range
    PERCENTILES / QUANTILES / FRACTILES
        (“quantiles” and “fractiles” are synonyms for “percentiles” for people who don’t like the implied multiplication by 100)
    STANDARD DEVIATION
    VARIANCE

Statistics of one variable…
    MEAN: Sum of values, divided by n
    MEDIAN: Middle value (when values are ranked, smallest to largest)
            (or, average of two middle values)

Number of Colleges (ranked; read down each column)
     1    1    2    6    8   12
     1    1    4    6    8   12
     1    1    5    6    8   12
     1    1    5    6    8   13
     1    1    5    6    8   14
     1    1    5    7    8   14
     1    1    5    7    9   14
     1    1    5    7    9
     1    1    5    7   10
     1    1    5    7   10
     1    1    5    7   10
     1    1    6    7   10
     1    2    6    8   12

[Figure: Colleges, Datadesk histogram]
    median: 5
    mean: 5.36

Salaries (ranked; read down each column)
    20000    50000      80000
    30000    50000      80000
    30000    50000      80000
    30000    50000      80000
    30000    50000      85000
    30000    50000      90000
    30000    60000     100000
    30000    60000     100000
    35000    60000     100000
    35000    60000     120000
    40000    60000     125000
    40000    60000     150000
    40000    65000     150000
    40000    70000     150000
    40000    70000     200000
    45000    70000     250000
    45000    72500     400000
    50000    75000     500000
    50000    75000     600000
    50000    75000    1000000

Salaries
    median: 60,000
    mean: 106,875

So, which measure of “center” is best?
    All the measures agree (roughly) when the distribution is symmetrical.
    The mean has attractive mathematical properties.
    Also, the mean is related to the total, if that’s what you care about.
    The median may be more “typical” when the distribution is nonsymmetrical.
    A measure is “robust” if it works reasonably well under a wide variety of circumstances.
    Medians are robust.

Jan. 23

RMS, geometric mean
Percentiles, quartiles (Q1, Q3), BOX PLOTS
Measures of spread:
    IQR (range containing middle half)
    Standard deviation (σ, s)
    Variance
Transforming data…
    Rescale: Y = c times X
    Recenter: Y = X plus a
    Adding variables to each other
    Other transformations
“STANDARDIZING” a variable
NORMAL DISTRIBUTIONS

Computing percentiles
To calculate the 20th percentile:
    Rank the values from smallest to largest.
    Compute 20% of n… 20% of 72 = 14.4
    Count off that many values (from lowest)…
    The value at which you stop is the 20th percentile.
What if you stop between values?
(In the data below, the 14th and 15th ranked values are both 1, so the 20th percentile is 1 either way.)

Number of Colleges (ranked; read down each column)
     1    1    2    6    8   12
     1    1    4    6    8   12
     1    1    5    6    8   12
     1    1    5    6    8   13
     1    1    5    6    8   14
     1    1    5    7    8   14
     1    1    5    7    9   14
     1    1    5    7    9
     1    1    5    7   10
     1    1    5    7   10
     1    1    5    7   10
     1    1    6    7   10
     1    2    6    8   12

QUARTILES
    Lower quartile (Q1) = 25th percentile
    Upper quartile (Q3) = 75th percentile
    (What’s Q2?)
INTERQUARTILE RANGE (IQR) = Q3 minus Q1

Five-number summary
    maximum
    Q3
    median
    Q1
    minimum

VARIANCE and STANDARD DEVIATION
VARIANCE (s^2):
    $$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
STANDARD DEVIATION (s):
    $$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

Linear Transformations
If you MULTIPLY or DIVIDE a variable by a constant…
    Y = c times X
    Y = X / c
then…
    measures of center are multiplied or divided by c
    measures of spread are multiplied or divided by |c|
If you ADD or SUBTRACT a constant from a variable…
    Y = X + a
    Y = X - a
then…
    measures of center are increased (decreased) by a
    measures of spread are UNCHANGED.

More transformations
ADDING VARIABLES: W = X + Y
    Mean (W) = Mean (X) + Mean (Y)
    Standard Deviation of (W): anything can happen
OTHER TRANSFORMATIONS:
    Y = X squared?
    Y = log (X)?
    …NO RELIABLE RULES for mean or std. dev.

Standardized Variables
Write $\bar{x}$ and $s$ for the mean and standard deviation of X.
Then form the transformed variable:
    $$ Z = \frac{X - \bar{x}}{s} $$
Then…
    mean (Z) = 0
    std dev (Z) = 1
Z answers the question: How many standard deviations is this value above (or below) the mean?

Jan. 25

More on transforming and standardizing variables
More on normal distributions

Jan. 27++

Relations among variables
    scatterplots
    “independent” variables
    correlations
    linear regressions (best fit lines)

Normal Density Function
X ~ N(μ, σ)
    μ = mean, σ = std. dev.
    (Why Greek? Why not x-bar, s?)

Trying the integral
Standard normal: mean = 0, std. dev. = 1
Density curve:
    $$ f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} x^2} $$
…so the area between a and b is:
    $$ \int_a^b \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} x^2} \, dx $$

The core computation
If X ~ N(μ, σ), what fraction of values are between a and b?
    Rule of 68-95-99.7
    Standardizing
    Tables and computers
    Reversing the calculation

Standardizing
Same question:
    Is X between a and b?
    Is (X - μ)/σ between (a - μ)/σ and (b - μ)/σ?
But Z = (X - μ)/σ is a variable with a standard normal distribution (mean 0, standard deviation 1).
So, if we can answer this question for standard normals, we can answer it for all normals.
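
The standardizing argument above can be tried out on a computer. The following is a minimal sketch, not the course's own method: it assumes Python and uses the standard library's error function `math.erf` in place of a printed normal table. The function names `standard_normal_cdf` and `normal_fraction_between`, and the example values (mean 100, std. dev. 15), are made up for illustration.

```python
import math


def standard_normal_cdf(z):
    """Fraction of a standard normal distribution (mean 0, std. dev. 1) below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_fraction_between(a, b, mean, std_dev):
    """Fraction of a normal(mean, std_dev) distribution between a and b.

    Standardize both endpoints, then answer the question for the
    standard normal distribution.
    """
    z_a = (a - mean) / std_dev
    z_b = (b - mean) / std_dev
    return standard_normal_cdf(z_b) - standard_normal_cdf(z_a)


# Check the rule of 68-95-99.7 on the standard normal itself.
for k in (1, 2, 3):
    fraction = normal_fraction_between(-k, k, 0.0, 1.0)
    print(f"within {k} std. dev. of the mean: {fraction:.4f}")

# Hypothetical example: mean 100, std. dev. 15, interval 85 to 115.
# Both endpoints standardize to -1 and +1, so this matches the k = 1 case.
print(normal_fraction_between(85, 115, 100, 15))
```

Because only the standardized endpoints enter the calculation, any normal distribution gives the same answer as the standard normal once a and b are expressed in standard deviations from the mean.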