Standardization of variables
Maarten Buis
5-12-2005

Recap
• Central tendency
• Dispersion
• SPSS

Standardization
• Is used to improve the interpretability of variables.
• Some variables have a naturally interpretable metric: e.g. income, age, gender, country.
• Others, primarily ordinal variables, do not: e.g. education, attitude items, intelligence.
• Standardizing these variables makes them more interpretable.

Standardization
• Transforming the variable to a comparable metric
  – known unit
  – known mean
  – known standard deviation
  – known range
• Three ways of standardizing:
  – P-standardization (percentile scores)
  – Z-standardization (z-scores)
  – D-standardization (dichotomize a variable)

When you should always standardize
• When averaging multiple variables, e.g. when creating a socioeconomic status variable out of income and education.
• When comparing the effects of variables with unequal units, e.g. does age or education have a larger effect on income?

P-standardization
• Every observation is assigned a number between 0 and 100, indicating the percentage of observations beneath it.
• Can be read from the cumulative distribution
• In case of ties: assign midpoints
• The median, quartiles, quintiles, and deciles are special cases of P-scores.

room   rent   cum %     percentile
 1     175      5.3%      5.3%
 2     180     10.5%     10.5%
 3     185     15.8%     15.8%
 4     190     21.1%     21.1%
 5     200     26.3%     26.3%
 6     210     31.6%     36.8%
 7     210     36.8%     36.8%
 8     210     42.1%     36.8%
 9     230     47.4%     47.4%
10     240     52.6%     55.3%
11     240     57.9%     55.3%
12     250     63.2%     65.8%
13     250     68.4%     65.8%
14     280     73.7%     73.7%
15     300     78.9%     81.6%
16     300     84.2%     81.6%
17     310     89.5%     89.5%
18     325     94.7%     94.7%
19     620    100.0%    100.0%

P-standardization
• Turns the variable into a ranking, i.e. it turns the variable into an ordinal variable.
• It is a non-linear transformation: relative distances change
• Results in a fixed mean, range, and standard deviation: M=50, SD=28.6. These can change slightly due to ties.
• A histogram of a P-standardized variable approximates a uniform distribution

Linear transformation
• Say you want income in thousands of guilders instead of guilders.
• Divide INCMID by ƒ1000,-

              M          SD
Incmid        ƒ2543,-    ƒ1481,-
Incmid/1000   kƒ2.543    kƒ1.481

Linear transformation
• Say you want to know the deviation from the mean
• Subtract the mean (ƒ2543,-) from INCMID

              M          SD
Incmid        ƒ2543,-    ƒ1481,-
Incmid-M      ƒ0,-       ƒ1481,-

Recap: multiplication and addition and the number line

Linear transformation
• Adding a constant (X’ = X+c)
  – M(X’) = M(X)+c
  – SD(X’) = SD(X)
• Multiplying by a constant (X’ = X*c)
  – M(X’) = M(X)*c
  – SD(X’) = SD(X) * |c|

Z-standardization
• Z = (X-M)/SD
• Two steps:
  – center the variable (the mean becomes zero)
  – divide by the standard deviation (the unit becomes the standard deviation)
• Results in a fixed mean and standard deviation: M=0, SD=1
• Not in a fixed range!
• Z-standardization is a linear transformation: relative distances remain intact.

Z-standardization
• Step 1: subtract the mean
• c = -M(X)
• M(X’) = M(X)+c
• M(X’) = M(X)-M(X) = 0
• SD(X’) = SD(X)

Z-standardization
• Step 2: divide by the standard deviation
• c = 1/SD(X)
• M(Z) = M(X’) * c
• M(Z) = 0 * 1/SD(X) = 0
• SD(Z) = SD(X’) * c
• SD(Z) = SD(X) * 1/SD(X) = 1

Normal distribution
• Normal distribution = Gauss curve = bell curve
• Formula (McCall p. 120)
  – Note the (x-m)² part
  – Apart from that, all you have to remember is that the formula is complicated
• A normal distribution occurs when a large number of small random events cause the outcome, e.g. measurement error

Normal distribution
• Other examples: the height of individuals, intelligence, attitude
• But: the variables Education, Income and Age in Eenzaam98 are not normally distributed

Z-scores and the normal distribution
• Z-standardization will not result in a normally distributed variable
• Standardization is NOT the same as normalization
• We will not discuss normalization (but it does exist)
• But: if the original variable is normally distributed, then the z-standardized variable will have a standard normal distribution.

Standard normal distribution
• Normal distribution with M=0 and SD=1.
• Table A in Appendix 2 of McCall
• Important numbers (to be remembered):
  – 68% of the observations lie within ± 1 SD
  – 90% of the observations lie within ± 1.64 SD
  – 95% of the observations lie within ± 1.96 SD
  – 99% of the observations lie within ± 2.58 SD

Why bother?
• If you know:
  – that a variable is normally distributed
  – its mean and standard deviation
• Then you know the percentage of observations above or below any observation
• These numbers are a good approximation, even if the variable is not exactly normally distributed

P & Z-standardization
• Both give a distribution with a fixed mean, standard deviation, and unit
• P-standardization also gives a fixed range
• Both are relative to the sample: if you take observations out, then you have to recompute the standardized variables

P & Z-standardization
• When interpreting Z-standardized variables one uses percentiles
• With P-standardization one lowers the scale of measurement to ordinal, BUT this improves interpretability.

Student recap

Do before Wednesday
• Read McCall chapter 5
• Understand Appendix 2, Table A
• Do exercises 5.7-5.28
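
As a rough illustration of the P-standardization slides, the sketch below recomputes the percentile column of the room-rent table. It assumes that the cumulative percentage of an observation is its position in sorted order divided by n, times 100, and that tied observations all receive the mean of their cumulative percentages (the "midpoints" rule). The function name `p_scores` is hypothetical and not part of the course material.

```python
from collections import defaultdict


def p_scores(values):
    """Return a percentile score (0-100) for each value, in the original order.

    Assumption: cum % = position / n * 100 in sorted order, and tied
    values share the midpoint (mean) of their cumulative percentages.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])

    # Collect the cumulative percentages reached by each distinct value.
    cum_by_value = defaultdict(list)
    for position, i in enumerate(order, start=1):
        cum_by_value[values[i]].append(100.0 * position / n)

    # Ties get the midpoint of their cumulative percentages.
    midpoint = {v: sum(c) / len(c) for v, c in cum_by_value.items()}
    return [midpoint[v] for v in values]


# Rents of the 19 rooms from the table in the slides.
rents = [175, 180, 185, 190, 200, 210, 210, 210, 230, 240,
         240, 250, 250, 280, 300, 300, 310, 325, 620]

scores = p_scores(rents)
for room, (rent, score) in enumerate(zip(rents, scores), start=1):
    print(f"{room:2d}  {rent:4d}  {score:5.1f}%")
```

Under these assumptions the three rooms with a rent of 210 all come out at 36.8%, as in the table.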
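
The linear transformation rules and the two steps of Z-standardization can be checked numerically with a small sketch like the one below. It reuses the rents from the room table and assumes the sample standard deviation (dividing by n-1); the slides do not say which version of the SD is meant, and the same check works with the population SD. The helper name `z_scores` is hypothetical.

```python
import statistics

# Rents of the 19 rooms from the table in the slides.
rents = [175, 180, 185, 190, 200, 210, 210, 210, 230, 240,
         240, 250, 250, 280, 300, 300, 310, 325, 620]

m = statistics.mean(rents)
sd = statistics.stdev(rents)  # assumption: sample SD (n-1)

# Adding a constant shifts the mean and leaves the SD unchanged.
shifted = [x + 100 for x in rents]

# Multiplying by a constant c multiplies the mean by c and the SD by |c|.
scaled = [x * -2 for x in rents]


def z_scores(values):
    """Z = (X - M) / SD, done in the two steps from the slides."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    centered = [x - mean for x in values]   # step 1: mean becomes 0
    return [x / spread for x in centered]   # step 2: SD becomes 1


z = z_scores(rents)
print("mean of z:", round(statistics.mean(z), 6))
print("SD of z:  ", round(statistics.stdev(z), 6))
print("largest z:", round(max(z), 2))
```

The largest z-score belongs to the room with a rent of 620 and lies more than 3 standard deviations above the mean, which illustrates that z-scores have no fixed range.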
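
The "important numbers" for the standard normal distribution, and the "Why bother?" argument, can be reproduced without Table A. The sketch below is one possible way to do so, using the error function from Python's standard library; the function names `pct_below` and `pct_within` are hypothetical, and the results only apply to the extent that the variable really is normally distributed.

```python
import math


def pct_below(x, m=0.0, sd=1.0):
    """Percentage of a normal distribution (mean m, SD sd) lying below x."""
    z = (x - m) / sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pct_within(k):
    """Percentage of a normal distribution lying within +/- k SD of the mean."""
    return pct_below(k) - pct_below(-k)


# The four cut-offs listed on the "Standard normal distribution" slide.
for k in (1.0, 1.64, 1.96, 2.58):
    print(f"within +/- {k:.2f} SD: {pct_within(k):.1f}%")
```

Given a mean and a standard deviation, `pct_below(x, m, sd)` returns the approximate percentage of observations below `x`, which is the kind of statement the "Why bother?" slide refers to.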