X. Standard Scores and the Normal Distribution
Subtopics
o Introduction
o Standard (z) scores
o The Normal Distribution
o Key Concepts
o Exercises
o For Further Study
SPSS Tools
New with this topic
o Histogram
Review
o Getting Started
o Compute
o Descriptives
o Frequencies
o Explore
o Boxplot
Introduction
There is an old saying that “you can’t compare apples and oranges.” Fortunately, this is not
always the case. Any interval or ratio variable can be converted to a standard unit of
measurement called a standard score. This is especially convenient whenever variables are
normally distributed.
Standard (z) scores
Examine the variable descriptions in the codebook for the countries data. Notice that different
variables are measured in radically different units of measurement including, among other things,
dollars, people, and percentages. Any of these can easily be transformed into a standard (z)
score which, by definition, will have a mean of 0 and a standard deviation of 1 regardless of the
original unit of measurement. The formula for converting raw scores on any variable X to z
scores is:
$z_i = \frac{X_i - \mu}{\sigma}$, where:
$z_i$ is the standard score for case i,
$X_i$ is the raw score for case i,
μ is the mean of the variable, and
σ is the standard deviation of the variable.
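As a minimal sketch of this conversion outside SPSS, the Python code below subtracts the mean from each raw score and divides by the standard deviation, for two made-up variables measured in different units. The function name z_scores and the data values are hypothetical and are not taken from the countries data. The sketch uses the population standard deviation (σ); statistical packages often divide by n - 1 instead, so their results may differ slightly for small samples.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Convert raw scores to z scores: (raw score - mean) / standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (divides by n)
    if sigma == 0:
        raise ValueError("z scores are undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical raw scores in two very different units of measurement
income_dollars = [12000, 25000, 31000, 47000, 90000]
percent_urban = [22.0, 35.5, 48.0, 61.5, 83.0]

z_income = z_scores(income_dollars)
z_urban = z_scores(percent_urban)

# Whatever the original unit, the converted scores have mean 0 and
# standard deviation 1 (up to floating-point rounding).
print(round(fmean(z_income), 6), round(pstdev(z_income), 6))
print(round(fmean(z_urban), 6), round(pstdev(z_urban), 6))
```

Once both variables are on this common scale, a case's relative position on one variable can be compared with its position on the other.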
You can use the compute tool to convert any interval variable to z-scores once you know the
mean and the standard deviation. An easier way is to let the descriptives procedure do the work
for you: just check the box to “Save standardized values as variables.”
The Normal Distribution
Many variables are normally distributed. A typical curve of the normal distribution is shown in
the figure below. (A normal curve is sometimes called a “bell-shaped” curve.) Normal curves
have certain defining characteristics. The most frequent values are found in the middle of the
distribution, and frequencies taper off the farther one goes from the middle. The distribution is
symmetric, meaning that the upper half of the distribution is a mirror image of the lower half.
Taken together, these properties mean that the mean, median, and mode are all the same.
While many variables are normally distributed, many are not. An easy way to tell if a variable is
at least more or less normally distributed is to construct a histogram of the variable, and compare
the result to a normal curve. The next figure describes an index of the self-identified ideology
(liberal, moderate, conservative) of residents of different states derived from analysis of
CBS/New York Times polls by Gerald C. Wright et al.[1] The index ranges from -100 (all
residents of a state identifying as conservative) to 100 (all residents identifying as liberal). The
distribution is approximately bell shaped. In other words, there are a few very conservative and
a few very liberal states, but more are toward the middle of the road.
Consider, on the other hand, the distribution of voting records in the U.S. Senate. The following
histogram uses DW-NOMINATE scores produced for members of the U.S. Senate by Jeff Lewis
and Keith Poole. This index ranges from about -1 (most liberal) to about 1 (most
conservative).[2] The distribution is not even close to forming a normal curve. While the two
measures are not directly comparable, senators would seem to be far more polarized than their
constituents.
There are other ways to examine a variable in order to determine whether it is normally
distributed. A boxplot provides another tool. If a variable is at least approximately normally
distributed, the median (the 50th percentile) will be midway in the inter-quartile range (the range
between the 25th and 75th percentiles), the length of the top and bottom “whiskers” above and
below the box will be about the same, and there will be few if any outliers or extreme values
beyond the whiskers. There are also a couple of descriptive statistics that help measure
departures from the normal distribution. Skewness measures departures from normality due to
the impact of very high or very low values. In a perfectly normal distribution, it will have the
value 0. If the mean is higher than the median (because the mean is inflated by some very high
values), a distribution will generally have a positive skew. If the reverse is the case (due to some
extremely low values), the skew will generally be negative. Kurtosis measures “peakedness,” the tendency
of values to cluster near the middle of the distribution. In a perfectly normal distribution, kurtosis
as SPSS reports it will have the value 0. A positive kurtosis indicates that values are more closely clustered toward the
middle than would be the case in a normal distribution, while a negative kurtosis indicates that
values are more spread out.
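To make the two statistics concrete, here is a rough Python sketch that computes simple moment-based versions of skewness and kurtosis, with 3 subtracted from kurtosis so that a normal distribution scores 0. The function names and the tiny data sets are hypothetical. SPSS applies small-sample adjustments to both statistics, so its output should not be expected to match these figures exactly.

```python
from statistics import fmean, median


def central_moment(values, power):
    """Average of (x - mean) raised to the given power."""
    mu = fmean(values)
    return fmean([(x - mu) ** power for x in values])


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data: one symmetric set, one with a single very high value
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]

print(skewness(symmetric))        # symmetric, so no skew
print(skewness(right_skewed))     # positive: pulled up by the value 20
print(fmean(right_skewed) > median(right_skewed))  # mean exceeds median
print(excess_kurtosis(symmetric))  # negative: evenly spread, no central peak
```

In the second data set the single high value inflates the mean above the median, which is the pattern the text associates with a positive skew.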
Many statistical techniques that require at least interval level measurement also require that
variables be normally distributed. It is a good idea, therefore, to begin data analysis with some
exploratory research into the distribution of the variables in the dataset. Considerable caution
should be exercised in analyzing variables with markedly non-normal distributions.
If variables are normally distributed, standard scores become extremely useful. It turns out that,
in a normal distribution, about 68 percent of cases will be within one standard deviation of the mean
(that is, will have a z score within the range of ±1), about 95 percent will be within two standard
deviations of the mean, and about 99.7 percent will be within three standard deviations of the mean. In
fact, if a variable is normally distributed, you can, by converting raw scores to z scores:
o convert a score to a percentile (a score in, for example, the 90th percentile would be one
for which 90 percent of cases had that score or lower).
o determine the probability that a case will be above or below a certain number, or between
two numbers.
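As an alternative to a printed table, the following Python sketch uses the standard library's NormalDist class to look up areas under the normal curve. The helper names and the example variable (mean 100, standard deviation 15) are assumptions chosen for illustration and do not come from any of the data files used here.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def percentile_from_z(z):
    """Percent of cases at or below a given z score."""
    return 100 * standard_normal.cdf(z)


def share_within(k):
    """Proportion of cases within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def prob_between(low, high, mean, sd):
    """Probability that a case falls between two raw scores."""
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


# Hypothetical normally distributed variable: mean 100, standard deviation 15
z = (115 - 100) / 15                      # raw score 115 is one SD above the mean
print(round(percentile_from_z(z), 1))     # about 84.1
print(round(1 - standard_normal.cdf(z), 4))  # probability of scoring above 115

# The one, two, and three standard deviation shares quoted in the text
for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

print(round(prob_between(85, 130, 100, 15), 4))
```

The same three steps apply to any normally distributed variable: convert the raw score to a z score, look up the cumulative proportion, and subtract where a range is needed.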
Most statistics texts include a “table of the normal distribution” for these purposes. There are
also “applets” (little applications) on the Internet that do the same thing.
Key Concepts
histogram
normal distribution
kurtosis
skewness
standard (z) score
Exercises
Note: In SPSS, histograms can be produced using either the frequencies or the explore
procedure. There is also a separate procedure specifically designed to produce histograms.
Except for explore, these procedures include the option of superimposing a normal curve on the
histogram. Skewness and kurtosis can be produced with frequencies, descriptives, or explore.
Z-scores can be produced with descriptives or compute.
1. Start SPSS, and open countries.sav. Look at the countries codebook. Calculate the means
and standard deviations for any two interval or ratio variables. Now compute two new variables
by converting each of the original variables to z scores. For the new variables, calculate means
and standard deviations. Use descriptives to accomplish the same purpose.
2. Pick several variables in the Countries file, and obtain histograms, comparing the results to a
normal distribution. Are the variables at least roughly normally distributed? Why or why not?
3. Open senate.sav. Look at the senate codebook. The file includes several measures of the
voting behavior of senate members:
o acu. These are ratings of senators’ voting records compiled by the American
Conservative Union (ACU). The ACU, like many interest groups, selected what it
regarded as key votes on which to score elected representatives. A score of 100 would
indicate a perfect conservative record, while a score of 0 would indicate a perfect liberal
record.
o ada. Compiled by the liberal Americans for Democratic Action, these scores are roughly
a mirror image of the ACU scores. A score of 100 would indicate a perfect liberal record,
while a score of 0 would indicate a perfect conservative record.
o dwnom. This is a measure designed by political scientists Keith Poole and Howard
Rosenthal to provide a more comprehensive index of roll call behavior. Scores range
from about -1 (most liberal) to about 1 (most conservative).
o unity. Congressional Quarterly's Party Unity Score measures the percent of the time
each senator voted in agreement with his or her party when majorities of the parties were
on opposite sides of a roll call. Vermont's Bernie Sanders and Connecticut's Joe
Lieberman (who were elected as independents, but who caucus with the Democratic
Party for purposes of organizing the chamber and, in return, receiving their committee
assignments) are treated as Democrats for purposes of this variable.
Using explore, examine the distributions of each of these measures. (Ask for histograms rather
than stem and leaf plots.) Are these variables normally distributed? Repeat the analysis, this
time using party as a “factor.” What do these distributions look like?
Use boxplots to compare the distributions of acu, ada, and dwnom scores, broken down by party.
(Note: Joseph Lieberman of Connecticut and Bernie Sanders of Vermont are coded as
independents for the party variable. To treat party as a dummy variable, either 1) recode to treat
these two senators as Democrats (since they caucus with the Democratic Party), 2) go to SPSS
Variable View and make “3” a missing value for this variable, or 3) use select cases to exclude
these senators from your analysis.) Does the distribution for dwnom look anything like those for
acu and ada? (Why not?) Now convert all three variables to standard scores and run the boxplots
again. You should notice a dramatic difference.
Repeat, but using house.sav. Look at the house codebook.
5. Open states.sav. Look at the states codebook. Obtain the means and standard deviations for
several interval or ratio level variables. In “Data View,” find the scores on these variables for
your state. Convert these scores to z scores. On which variables is your state least typical?
6. Open house.sav. Look at the house codebook. Repeat exercise 5 for your congressional
district. (If you are not sure which district you are in, go to http://www.vote-smart.org/.)
7. Go to http://psych.colorado.edu/%7Emcclella/java/normal/normz.html or to
http://faculty.vassar.edu/lowry/tabs.html#z and, using the applet found there, answer the
following questions about a normally distributed variable with a mean of 50 and a standard
deviation of 10:
a. What is the z score for a raw score of 72?
b. What percent of cases will have scores over 72?
c. What percent of cases will have scores between 28 and 72?
For Further Study
Brown, James Dean, “Skewness and Kurtosis,” The JALT Testing & Evaluation SIG Newsletter.
April 1997. http://www.jalt.org/test/bro_1.htm. Accessed November 23, 2003.
Lane, David M., “What is a Normal Distribution?” Hyperstat.
http://davidmlane.com/hyperstat/normal_distribution.html.
[1] http://php.indiana.edu/~wright1/. Accessed July 3, 2007.
[2] Royce Carroll, et al., “DW-NOMINATE Scores with Bootstrapped Standard Errors,”
VoteView. http://voteview.com/. Accessed February 14, 2010.