X. Standard Scores and the Normal Distribution

Subtopics
Introduction
Standard (z) scores
The Normal Distribution
Key Concepts
Exercises
For Further Study

SPSS Tools
New with this topic
o Histogram
Review
o Getting Started
o Compute
o Descriptives
o Frequencies
o Explore
o Boxplot

Introduction

There is an old saying that “you can’t compare apples and oranges.” Fortunately, this is not always the case. Any interval or ratio variable can be converted to a standard unit of measurement called a standard score. This is especially convenient whenever variables are normally distributed.

Standard (z) scores

Examine the variable descriptions in the codebook for the countries data. Notice that different variables are measured in radically different units, including, among other things, dollars, people, and percentages. Any of these can easily be transformed into a standard (z) score which, by definition, will have a mean of 0 and a standard deviation of 1 regardless of the original unit of measurement. The formula for converting raw scores on any variable X to z scores is:

$$z_i = \frac{X_i - \mu}{\sigma}$$

where:
$z_i$ is the standard score for case i,
$X_i$ is the raw score for case i,
$\mu$ is the mean of the variable, and
$\sigma$ is the standard deviation of the variable.

You can use the compute tool to convert any interval or ratio variable to z scores once you know the mean and the standard deviation. An easier way is to let the descriptives procedure do the work for you: just check the box to "Save standardized values as variables."

The Normal Distribution

Many variables are normally distributed. A typical curve of the normal distribution is shown in the figure below. (A normal curve is sometimes called a “bell-shaped” curve.) Normal curves have certain defining characteristics. The most frequent values are found in the middle of the distribution, and frequencies taper off the farther one moves from the middle.
The distribution is symmetric, meaning that the upper half of the distribution is a mirror image of the lower half. As a result, the mean, median, and mode are all the same.

While many variables are normally distributed, many are not. An easy way to tell if a variable is at least roughly normally distributed is to construct a histogram of the variable and compare the result to a normal curve. The next figure describes an index of the self-identified ideology (liberal, moderate, conservative) of residents of different states, derived from analysis of CBS/New York Times polls by Gerald C. Wright et al.[1] The index ranges from -100 (all residents of a state identifying as conservative) to 100 (all residents identifying as liberal). The distribution is approximately bell shaped. In other words, there are a few very conservative states and a few very liberal ones, but more are toward the middle of the road.

Consider, on the other hand, the distribution of voting records in the U.S. Senate. The following histogram uses DW-NOMINATE scores produced for members of the U.S. Senate by Jeff Lewis and Keith Poole. This index ranges from about -1 (most liberal) to about 1 (most conservative).[2] The distribution is not even close to forming a normal curve. While the two measures are not directly comparable, senators would seem to be far more polarized than their constituents.

There are other ways to examine a variable in order to determine whether it is normally distributed. A boxplot provides another tool. If a variable is at least approximately normally distributed, the median (the 50th percentile) will be midway in the inter-quartile range (the range between the 25th and 75th percentiles), the top and bottom “whiskers” above and below the box will be about the same length, and there will be few if any outliers or extreme values beyond the whiskers.
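The boxplot check just described can also be worked through numerically. The following is only a minimal sketch in Python, not part of the SPSS procedures used in this chapter. The function name `boxplot_summary` and both sample lists are hypothetical, and the sketch assumes the common convention that whiskers extend to the most extreme values within 1.5 times the inter-quartile range of the box.

```python
import statistics

def boxplot_summary(values):
    """Return the quantities a boxplot displays, for judging symmetry.

    Assumption: whiskers reach the most extreme values lying within
    1.5 * IQR of the box; anything beyond that counts as an outlier.
    """
    data = sorted(values)
    # 25th, 50th, and 75th percentiles
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        # 0.5 means the median sits exactly midway in the box
        "median_position": (median - q1) / iqr if iqr else None,
        "lower_whisker": q1 - inside[0],
        "upper_whisker": inside[-1] - q3,
        "outliers": outliers,
    }

# Hypothetical data: one symmetric variable, one with high outliers.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 1, 2, 2, 3, 3, 4, 9, 40]

print(boxplot_summary(symmetric))
print(boxplot_summary(skewed))
```

For the symmetric list, the median sits midway in the box, the two whiskers are the same length, and there are no outliers. For the skewed list, the two highest values fall beyond the upper whisker, which is the pattern the boxplot would flag.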
There are also a couple of descriptive statistics that help measure departures from the normal distribution. Skewness measures departures from normality due to the impact of very high or very low values. In a perfectly normal distribution, it will have the value 0. If the mean is higher than the median (because the mean is inflated by some very high values), a distribution will have a positive skew. If the reverse is the case (due to some extremely low values), the skew will be negative.

Kurtosis measures “peakedness,” the tendency of values to cluster near the middle of the distribution. As reported by SPSS, it will have the value 0 in a perfectly normal distribution. A positive kurtosis indicates that values are more closely clustered toward the middle than would be the case in a normal distribution, while a negative kurtosis indicates that values are more spread out.

Many statistical techniques that require at least interval level measurement also require that variables be normally distributed. It is a good idea, therefore, to begin data analysis with some exploratory research into the distribution of the variables in the dataset. Considerable caution should be exercised in analyzing variables with markedly non-normal distributions.

If variables are normally distributed, standard scores become extremely useful. It turns out that, in a normal distribution, about 68 percent of cases will be within one standard deviation of the mean (that is, will have a z score within the range of ±1), about 95 percent will be within two standard deviations of the mean, and about 99.7 percent will be within three standard deviations of the mean. In fact, if a variable is normally distributed, you can, by converting raw scores to z scores:

o convert a score to a percentile (a score in, for example, the 90th percentile would be one for which 90 percent of cases had that score or lower).
o determine the probability that a case will be above or below a certain number, or between two numbers.

Most statistics texts include a “table of the normal distribution” for these purposes. There are also “applets” (little applications) on the Internet that do the same thing.

Key Concepts

histogram
kurtosis
normal distribution
skewness
standard (z) score

Exercises

Note: In SPSS, histograms can be produced using either the frequencies or the explore procedure. There is also a separate procedure specifically designed to produce histograms. Except for explore, these procedures include the option of superimposing a normal curve on the histogram. Skewness and kurtosis can be produced with frequencies, descriptives, or explore. Z scores can be produced with descriptives or compute.

1. Start SPSS, and open countries.sav. Look at the countries codebook. Calculate the means and standard deviations for any two interval or ratio variables. Now compute two new variables by converting each of the original variables to z scores. For the new variables, calculate means and standard deviations. Use descriptives to accomplish the same purpose.

2. Pick several variables in the countries file, and obtain histograms, comparing the results to a normal distribution. Are the variables at least roughly normally distributed? Why or why not?

3. Open senate.sav. Look at the senate codebook. The file includes several measures of the voting behavior of senate members:

acu. These are ratings of senators’ voting records compiled by the American Conservative Union (ACU). The ACU, like many interest groups, selected what it regarded as key votes on which to score elected representatives. A score of 100 would indicate a perfect conservative record, while a score of 0 would indicate a perfect liberal record.

ada. Compiled by the liberal Americans for Democratic Action, these scores are roughly a mirror image of the ACU scores.
A score of 100 would indicate a perfect liberal record, while a score of 0 would indicate a perfect conservative record.

dwnom. This is a measure designed by political scientists Keith Poole and Howard Rosenthal to provide a more comprehensive index of roll call behavior. Scores range from about -1 (most liberal) to about 1 (most conservative).

unity. Congressional Quarterly's Party Unity Score measures the percentage of the time each senator voted in agreement with his or her party when majorities of the parties were on opposite sides of a roll call. Vermont's Bernie Sanders and Connecticut's Joe Lieberman (who were elected as independents, but who caucus with the Democratic Party for purposes of organizing the chamber and, in return, receiving their committee assignments) are treated as Democrats for purposes of this variable.

Using explore, examine the distributions of each of these measures. (Ask for histograms rather than stem and leaf plots.) Are these variables normally distributed? Repeat the analysis, this time using party as a “factor.” What do these distributions look like?

Use boxplots to compare the distributions of acu, ada, and dwnom scores, broken down by party. (Note: Joe Lieberman of Connecticut and Bernie Sanders of Vermont are coded as independents for the party variable. To treat party as a dummy variable, either 1) recode to treat these two senators as Democrats (since they caucus with the Democratic Party), 2) go to SPSS Variable View and make “3” a missing value for this variable, or 3) use select cases to exclude these senators from your analysis.) Does the distribution for dwnom look anything like those for acu and ada? (Why not?) Now convert all three variables to standard scores and run the boxplots again. You should notice a dramatic difference.

4. Repeat exercise 3, but using house.sav. Look at the house codebook.

5. Open states.sav. Look at the states codebook. Obtain the means and standard deviations for several interval or ratio level variables.
In “Data View,” find the scores on these variables for your state. Convert these scores to z scores. On which variables is your state least typical?

6. Open house.sav. Look at the house codebook. Repeat exercise 5 for your congressional district. (If you are not sure which district you are in, go to http://www.vote-smart.org/.)

7. Go to http://psych.colorado.edu/%7Emcclella/java/normal/normz.html or to http://faculty.vassar.edu/lowry/tabs.html#z and, using the applet found there, answer the following questions about a normally distributed variable with a mean of 50 and a standard deviation of 10:
a. What is the z score for a raw score of 72?
b. What percent of cases will have scores over 72?
c. What percent of cases will have scores between 28 and 72?

For Further Study

Brown, James Dean, “Skewness and Kurtosis,” The JALT Testing & Evaluation SIG Newsletter. April 1997. http://www.jalt.org/test/bro_1.htm. Accessed November 23, 2003.

Lane, David M., “What is a Normal Distribution?” Hyperstat. http://davidmlane.com/hyperstat/normal_distribution.html.

[1] http://php.indiana.edu/~wright1/. Accessed July 3, 2007.

[2] Royce Carroll, et al., “DW-NOMINATE Scores with Bootstrapped Standard Errors,” VoteView. http://voteview.com/. Accessed February 14, 2010.