Univariate Statistics

Slide 1
This lecture addresses what most people refer to as descriptive statistical analysis.
Slide 2 (No Audio)
Slide 3
Sadly, many people are either afraid of statistics and statistical analyses or think about statistics
in the wrong way. Here are three examples of thinking about statistics in the wrong way.
Walkup’s first three laws are:
Law #1: Everything correlates with everything, especially when the same individual
defines the variables to be correlated.
Law #2: It won’t help very much to find a good correlation between the variable you
are interested in and some other variable that you don’t understand any better.
Law #3: Unless you can think of a logical reason why two variables should be connected
as cause and effect, it doesn’t help much to find a correlation between them. In
Columbus, OH the mean monthly rainfall correlates very nicely with the number of letters
in the names of the months.
Clearly, Walkup is addressing misconceptions about the use of statistics. Law #1—that
everything correlates with everything else, especially if the same person defines the variables—
reflects the existence of researcher artifacts. Law #2 indicates that it’s nonsensical to search for
correlations between variables to help explain one another when you don’t understand the
relationships between the variables. Law #3 suggests that there’s always data to show the
relatedness between variables, but without a good theoretical basis for believing that the
relationship is true and stable, that supposed relationship is meaningless.
Slide 4
This slide makes a key point about providing evidence for drawing a conclusion. Mrs. Fergusen
believes her son is a good boy—that’s her null hypothesis—but the police officer gives Mrs.
Fergusen repeated reasons and evidence for believing she’s mistaken in that null hypothesis.
By the time the officer gets to Thursday’s evidence, Mrs. Fergusen has decided that she should
reject her null hypothesis. That’s what’s meant by statistical evidence: the statistical
analysis provides evidence that helps to test null hypotheses.
Slide 5
Here’s what is meant by descriptive analysis: the transformation of raw data into a form that
makes them easy to understand and interpret. It entails rearranging, ordering, and manipulating
data to generate descriptive information. In essence, descriptive analyses summarize data in a
more meaningful and reduced form.
Slide 6 (No Audio)
Slide 7
One of the most basic ways to describe data is to summarize it through tabulation, which is the
orderly arrangement of data in a table or other summary format. As subsequent examples will
show, running tabulations in a statistical package like SPSS creates frequency tables that
include both counts and percentages.
Slide 8
More formally, a frequency table contains numerical data arranged in a row and column format
that shows the count and percentages of responses or observations for each category assigned
to a variable. Those categories are preselected by the researcher. For example, if we assigned
respondents to one of four income categories, those categories would be pre-determined. All a
statistical program like SPSS does is to report the frequency and percentages of respondents
that fall into one of those four categories.
Slide 9
The next two slides show sample frequency tables of the types generated by SPSS. The
first table shows the frequency of respondents by income.
• The eight income categories are mutually exclusive and exhaustive; in other words, all
possible incomes are represented by a unique category. There is a ninth category for the
‘no answer’ code, which indicates respondents who opted not to answer the income
question.
• The column to the right of the code column contains the number, or respondent count—
out of 400 total respondents—who fell into that category.
• The percentages column shows the percentage that each count represents out of 400.
For example, for income category $5,000 to $9,999, the count is 40, which represents
10% of the 400 respondents.
• It might be useful to adjust those percentages for non-respondents, and that’s what
appears in the next column. Those percentages are adjusted and a little higher than the
percentages immediately to the left. Consider that same income category ($5,000 to
$9,999); it’s 40 out of 400 total respondents, but because 40 people didn’t respond, the
net number of responders to the income question is 360. Dividing 40 by 360 gives 11.1%,
which appears in the adjusted percentage column.
• The rightmost column contains adjusted cumulative percentages, which are a running
total of the percentages in the column immediately to the left. Take 6.9 and add it to
11.1; the sum is 18.0. Take 18.0 and add it to 16.7—the adjusted percentage for
respondents in the $10,000 to $14,999 income category; the sum is 34.7%. This
summation continues until 99.9% is reached.
In summary, the adjusted cumulative percentage is a running total of the percentages adjusted
for non-respondents. The raw numbers and percentages refer to all respondents, whether or not
they answered the question. The code is the category in which each respondent could fall, and
those numbers are entered into the computer database. The far left column is provided to
enhance the readability of this table. The meaning of each code can be entered into an SPSS
data file; for example, a code of ‘1’ means an income of less than $5,000.
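The same kind of frequency table can be reproduced outside SPSS. Below is a minimal sketch in Python using pandas; the income codes, the use of ‘9’ for ‘no answer’, and the specific counts beyond those mentioned on the slide are hypothetical, chosen only to mirror the table just described.

```python
import pandas as pd

# Hypothetical income codes for 400 respondents; 9 = 'no answer'.
# Counts are illustrative, matching the percentages discussed above.
income = pd.Series([1]*25 + [2]*40 + [3]*60 + [4]*70 + [5]*60 +
                   [6]*50 + [7]*30 + [8]*25 + [9]*40, name="income")

counts = income.value_counts().sort_index()        # frequency per code
percent = 100 * counts / len(income)               # percent of all 400 respondents
valid = income[income != 9]                        # drop the 'no answer' cases
valid_pct = 100 * valid.value_counts().sort_index() / len(valid)  # adjusted percent
cum_pct = valid_pct.cumsum()                       # adjusted cumulative percent

table = pd.DataFrame({"Frequency": counts,
                      "Percent": percent.round(1),
                      "Valid Percent": valid_pct.round(1),
                      "Cumulative Percent": cum_pct.round(1)})
print(table)
```

Running the sketch reproduces the arithmetic described above: code 2 ($5,000 to $9,999) shows 40 cases, 10.0% of all respondents, and 11.1% of the 360 respondents who answered the question.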
Slide 10
This slide contains a table that is more reflective of the output you would receive from SPSS.
Notice the word ‘Affect’ in the upper left-hand corner; that’s the name for this question. In SPSS
you may enter a name that is as long as eight characters. Immediately to the right of the
question (or variable) name is ‘Feelings about the commercial you just saw.’ Entering that type of
information into SPSS makes the output more readable. The second column—the one titled
‘Value’—contains the numbers 1 through 5, which are the numbers entered into the computer
that correspond with respondents’ answers to this question. Immediately to the left of the
column ‘Value’ is the column ‘Value Label’. That column also contains information that can be
entered into SPSS to enhance the table readability. The remaining four columns—headed by
Frequency, Percent, Valid Percent, and Cumulative Percent—correspond to the last four
columns on the previous slide. The one major difference is ‘Valid cases’ and ‘Missing cases’ at
the bottom of the table. In the previous slide, missing cases were designated by the number ‘9’
and were included in the table. In this example, missing data was left blank in the database and
SPSS picked up that blank case. The last number in the ‘Frequency’ column (‘5’) corresponds to
the five blank cases, the adjacent ‘.9’ is the percent of blank cases, and the ‘Valid Percent’
column contains the word ‘Missing.’ The only difference between this table and the previous
table is the handling of missing cases.
Slide 11
SPSS can also generate histograms, which are a pictorial way of showing frequency data.
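Histograms of this kind can be produced with most statistical tools, not just SPSS. Here is a small sketch using Python's matplotlib with made-up scores; the variable and its values are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical metric variable: 400 scores centered near 180.
rng = np.random.default_rng(seed=1)
scores = rng.normal(loc=180, scale=10, size=400)

plt.hist(scores, bins=12, edgecolor="black")  # bar heights show how often each range of values occurs
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Histogram of scores")
plt.show()
```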
Slide 12
The issue introduced in the previous slide, regarding valid cases, relates to the notion of the base.
A base is the number of respondents or observations in a row or column that is used as a basis
for computing percentages. There are several alternative bases that make more or less sense
to use depending on what’s being summarized. For example, for questions that all respondents
answered, a base of all respondents means taking the number who responded each way and
dividing by the total number of respondents. Alternatively, if only a subset of respondents
answered a question (that they could validly skip), then it’s more meaningful to divide the
number of responses in each category by the number of respondents who were asked that
question. That’s what occurs in the previous frequency table; only 535 of the 540 respondents
answered the ‘Affect’ question. As a result, all percentages are based on respondents who
answered that question.
The major issue about bases relates to multi-response questions. For example, respondents
could be asked to check off, from a lengthy list, all the magazines to which they subscribe.
list. One way to look at that data is ‘What percent of respondents subscribe to each magazine?’
To calculate this percent, it’s necessary to divide the number of subscribers by the total number
of respondents. Another way to look at that data is ‘To what degree does each magazine
comprise the universe of what people read?’ To calculate that percent, it’s necessary to divide
the number of subscriptions for each magazine by the total number of subscriptions. Either is a
reasonable way—depending on the research question—to look at the data.
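The two bases give different percentages from the same raw counts. The sketch below uses a hypothetical magazine checklist, with made-up counts, just to show the arithmetic.

```python
# Hypothetical multi-response data: number of respondents who checked each magazine.
subscriptions = {
    "Magazine A": 120,
    "Magazine B": 80,
    "Magazine C": 40,
}
total_respondents = 200                               # everyone who was asked the question
total_subscriptions = sum(subscriptions.values())     # 240 checks in total

for name, count in subscriptions.items():
    pct_of_respondents = 100 * count / total_respondents      # base = respondents
    pct_of_subscriptions = 100 * count / total_subscriptions  # base = all checks
    print(f"{name}: {pct_of_respondents:.1f}% of respondents, "
          f"{pct_of_subscriptions:.1f}% of all subscriptions")
```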
Slide 13
Frequency tables provide an excellent way to summarize survey data, but as the previous
tables suggest, such tables contain many numbers. Is there a more efficient way to summarize
survey data? By efficient, I mean that survey responses are summarized by a smaller number of
numbers. Fortunately, the answer is ‘yes’ because of measures of central tendency. There are
three such measures. The mode, which is the value that occurs most often, can summarize
nominal or categorical data. For example, if more males than females responded to a ‘Sex’
question, then the mode on that question is ‘male’. The median, which is the midpoint on scores
ordered from lowest to highest, can summarize ordinally scaled variables. The median is the
same as the 50th percentile. The arithmetic mean, which is the sum of all scores divided by the
number of respondents, can summarize metric—intervally or ratio-scaled—data.
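As an illustration, all three measures can be computed directly. The sketch below uses Python's statistics module on made-up responses; the variables and values are hypothetical.

```python
import statistics

sex = ["male", "male", "female", "male", "female"]   # nominal data
satisfaction = [1, 2, 2, 3, 4, 4, 5]                 # ordinal data
income = [24000, 31000, 35000, 42000, 58000]         # metric (ratio-scaled) data

print(statistics.mode(sex))             # mode: most frequent category -> 'male'
print(statistics.median(satisfaction))  # median: midpoint of the ordered scores -> 3
print(statistics.mean(income))          # mean: sum of scores / number of cases -> 38000
```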
Slide 14
Although measures of central tendency—like mode, median, and mean—provide the best single
summary number for a set of responses to a survey question, that single number lacks
information about the degree to which people’s responses differed on that question. It would be
useful to know if everyone who answered that question answered similarly, or if they answered
that question rather differently. Thus, both (1) a best single summary measure, and (2) a
measure of the degree to which that single measure captures all responses, are needed. That
second measure would detect dispersion or spread. Five standard measures of dispersion are
range, inter-quartile range, mean absolute deviation, variance, and standard deviation.
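Each of those five measures can be computed from the same set of scores. Here is a brief sketch with numpy and hypothetical responses.

```python
import numpy as np

scores = np.array([150, 170, 175, 180, 180, 185, 190, 210])  # hypothetical responses

data_range = scores.max() - scores.min()            # range
q75, q25 = np.percentile(scores, [75, 25])
iqr = q75 - q25                                      # inter-quartile range
mad = np.mean(np.abs(scores - scores.mean()))        # mean absolute deviation
variance = scores.var(ddof=1)                        # sample variance (n - 1 denominator)
std_dev = scores.std(ddof=1)                         # sample standard deviation

print(data_range, iqr, mad, variance, std_dev)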
Slide 15
The next two slides illustrate low versus high dispersion. In both examples, the mean value on
the variable is roughly 180. In the first example, all the answers tend to cluster around that 180
score. In the second example, the scores range from 150 to 210. In the first example, the mean
is an excellent summary of scores on this variable across all cases or respondents. In the
second example, the mean is not reflective of the score for some cases or respondents.
Slide 16 (No Audio)
Slide 17
Although range is a relatively primitive measure of spread, it’s easy to compute and easy to
understand. Range is the difference between the smallest and largest values in a set of
numbers. Inter-quartile range, which loosely corresponds to plus or minus 1 standard deviation,
is the difference between the 75th percentile and the 25th percentile. Range also provides a
useful approximation for standard deviation. Often, the lowest score in the range corresponds to
three standard deviations below the mean and the highest score corresponds to three standard
deviations above the mean; hence, the range often is roughly six standard deviations. Taking
the range and dividing by six should give a good estimate of the standard deviation.
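That rule of thumb is easy to check on any set of scores. The short sketch below, using hypothetical and roughly normal data, compares the range-based estimate with the computed sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
scores = rng.normal(loc=180, scale=10, size=400)  # hypothetical, roughly normal scores

estimate = (scores.max() - scores.min()) / 6      # range divided by six
actual = scores.std(ddof=1)                       # sample standard deviation
print(f"range/6 estimate: {estimate:.1f}, sample standard deviation: {actual:.1f}")
```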
Slide 18
If we have metric data, so that we can add and subtract, then we can compute deviation scores.
These scores are the differences between each observed value and the mean.
Slide 19
The mean is the best single summary number for metric data. If we subtract the mean from
every score and sum those deviations, the net result is zero (0), which indicates the mean is an
unbiased estimator of central tendency. The deviations below the mean exactly balance the
deviations above the mean.
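That zero-sum property is easy to verify; here is a minimal sketch with arbitrary metric scores.

```python
scores = [4, 7, 9, 10, 15]                # arbitrary metric scores
mean = sum(scores) / len(scores)          # mean = 9.0
deviations = [x - mean for x in scores]   # deviation of each score from the mean
print(sum(deviations))                    # 0 (up to floating-point rounding)
```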
Slide 20
Because the average deviation is always zero (0), one useful way to make deviation scores
meaningful is to square them. Squaring both eliminates the zero (0) problem and weighs bigger
deviations from the mean more than smaller deviations from the mean. Because the square of a
large number is far larger than the square of a smaller number, squaring deviations weighs
numbers that differ more from the mean to a greater extent than numbers that differ less from
the mean.
Slide 21
Continuing with the basic notion of difference from the mean, consider variation from the mean
or variance. For a survey, sample variance is the variability of responses to a given question.
Sample variance is an estimate of the population variance.
Slide 22
A mean squared deviation is computed by subtracting the mean from each score on a variable,
squaring each difference, summing those squares, and dividing by ‘n’ (the number of sample
members). The n-1 in the denominator of this equation is an adjustment.
Slide 23
Essentially, if the n-1 adjustment in the denominator is ignored, the variance is just a mean
squared deviation, given in squared units. When it’s more convenient not to present variability
in terms of squared units, taking the square root of the variance gives the standard deviation.
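In symbols, with x̄ denoting the sample mean and n the number of respondents, the sample variance and the sample standard deviation are:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}
```

Dividing by n instead of n - 1 would give the plain mean squared deviation described on the previous slide; the n - 1 adjustment makes the sample variance a better estimate of the population variance.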
Slide 24
To summarize the last several slides, I spoke about central tendency and dispersion. I indicated
that the appropriate measure of central tendency depends on the type of scale: nominal, ordinal,
or metric data. In addition, the measure of dispersion depends on the type of scale. For nominal
data, it’s impossible to create a rank order, so no meaningful measure of dispersion is possible.
For ordinal data, the appropriate measure of dispersion is the percentile, so measures like inter-quartile range are meaningful. For metric data, standard deviation and variance are appropriate
measures of dispersion.
Slide 25 (No Audio)
Slide 26
If the data on one variable is distributed symmetrically, like in this bell-shaped curve, then the
mean, the mode, and the median are identical.
Slide 27
In contrast, if data are skewed positively or negatively, then the values of the mode, median,
and mean will differ. One way to remember positive versus negative skew is by looking at the
direction of the tail. If the tail points to the positive end of the number line, then the data are
skewed positively, and if the tail points to the negative end of the number line, then the data are
skewed negatively. With positively skewed data, the mode is less than the median, which in turn
is less than the mean. With negatively skewed data, the reverse is true.
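A quick numerical illustration of that ordering, using hypothetical positively skewed scores:

```python
import statistics

# Hypothetical positively skewed scores: a long tail toward high values.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

print(statistics.mode(scores))    # mode = 2
print(statistics.median(scores))  # median = 3.0
print(statistics.mean(scores))    # mean = 4.5
```

The mode (2) is less than the median (3.0), which is less than the mean (4.5), exactly the pattern described for positive skew.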
Consider housing prices mentioned in the news media. Realtors report the median selling price
of existing and new homes. House prices are in dollars, which is metric data; hence, the mean
is a valid and seemingly appropriate statistic. Nonetheless, the median rather than mean is
reported because the former is more stable over time, which permits more reliable month-to-month comparisons. In contrast, the mean selling price for a month could increase or decrease
markedly in a community if one or two expensive homes are or aren’t sold. To establish long-term trends in housing prices, a more stable indicator of prices is preferred. Here, the data is
metric, yet the median provides a better summary number. Researchers must consider the
distribution of their data and what they’re trying to accomplish before selecting the best measure
of central tendency.
Slide 28
Consider one type of symmetric distribution: the normal distribution. This bell-shaped
distribution of values is symmetric about its mean. The mean identifies the highest point on the
curve or the most frequent response on that variable. Almost all values are within +/-3 standard
deviations from the mean, which is why the range divided by six provides a good estimate of
standard deviation. Normal distributions are assumed to be continuous distributions; there could
be an infinite number of cases. As survey data contains discrete scores, it cannot form a true
normal distribution. Nonetheless, it’s worthwhile to assess how closely survey data conforms to
a normal distribution because a skewed distribution of responses has data analysis implications.
Slide 29
Inter-quartile range is a good approximation of +/-1 standard deviation from the mean. The inter-quartile range is the middle 50% of the distribution; +/-1 standard deviation from the mean in a
normal distribution is a bit more than 68%.
Slide 30 (No Audio)
Slide 31
For IQ scores, the standard deviation, historically, has been 15 points. The mean score is 100
and one standard deviation above the mean score is 115; three standard deviations above the
mean score is 145.
Slide 32
Normally distributed data can be standardized by treating the total area under the bell curve as a
probability of 1.0. This sets the mean equal to ‘0’ and the standard deviation equal to ‘1’.
Slide 33 (No Audio)
Slide 34
Normally distributed data can be converted from raw scores into Z scores, which generally range
from -3 to +3. A Z score indicates the extent to which a score lies above or below its mean. When dealing
with raw metric data, it’s important to know the mean and difference from that mean. All that
information is contained within a Z score.
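A Z score is simply the raw score minus the mean, divided by the standard deviation. Below is a minimal sketch using the IQ figures from Slide 31 (mean 100, standard deviation 15); the helper function name is my own.

```python
def z_score(raw, mean, std_dev):
    """Number of standard deviations the raw score lies above or below the mean."""
    return (raw - mean) / std_dev

print(z_score(115, 100, 15))   # 1.0  -> one standard deviation above the mean
print(z_score(145, 100, 15))   # 3.0  -> three standard deviations above the mean
print(z_score(85, 100, 15))    # -1.0 -> one standard deviation below the mean
```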
Slide 35 (No Audio)
Slide 36
This slide depicts the transformation of a normally distributed variable into a standardized
normal variable.
Slide 37
Finally, there’s data transformation, which is similar to recoding but not identical. Data
transformation converts the data from its original form into a new format that may be more
appropriate for a specific type of analysis. Transforming data creates a new variable.
Slide 38
When dealing with multiple Likert-scaled items that address the same underlying construct, it’s
often useful to sum the scores across all those related items. Assume we have three questions
that relate to the same underlying attitude and were asked in a similar direction (so there’s no
need for reverse coding). It’s reasonable to sum the scores on those three questions to create a
total score on that construct. Researchers use multi-item scales because responses to single
questions are less reliable than the sum of responses to multiple related questions in reflecting
underlying constructs.
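A minimal sketch of that summation, assuming three hypothetical Likert items (att1 through att3) already coded in the same direction:

```python
import pandas as pd

# Hypothetical responses (1-5) from four respondents to three related Likert items.
responses = pd.DataFrame({
    "att1": [4, 2, 5, 3],
    "att2": [5, 1, 4, 3],
    "att3": [4, 2, 5, 2],
})

# Summing across the related items creates a new total-score variable (3 to 15 here).
responses["attitude_total"] = responses[["att1", "att2", "att3"]].sum(axis=1)
print(responses)
```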
Slide 39
Consider index numbers as a type of data transformation. With an index number, the score or
observation is recalibrated to indicate how it relates to a base number. For example, the
Consumer Price Index (CPI) relates current prices to some base year. The problem with index
numbers, especially in secondary reports, is that the base year can change. I recall when
the CPI was pegged to 1967 prices. Since that time, the index has been pegged to
price levels in later years. Comparing index numbers reported in 1971 to the index numbers
reported in 1995 could cause confusion because the base year has changed. Index numbers
are a useful way to transform data, but such numbers must be used with care if they’ve been
acquired from secondary sources.
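An index number expresses each observation relative to a base period that is set to 100. The sketch below uses made-up price levels (not real CPI figures) to show how the same series looks different when the base year changes.

```python
# Hypothetical average price levels by year (illustrative only, not real CPI data).
prices = {1967: 20.0, 1971: 24.0, 1995: 60.0}

def to_index(series, base_year):
    """Re-express each value relative to the chosen base year (base = 100)."""
    base = series[base_year]
    return {year: round(100 * value / base, 1) for year, value in series.items()}

print(to_index(prices, 1967))   # 1967 = 100; later years are stated relative to 1967
print(to_index(prices, 1995))   # the same data rebased so that 1995 = 100
```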
Slide 40
To recap this lecture, I discussed count and frequency data to introduce you to descriptive data
analysis. I then talked about the best numbers for summarizing frequency data. Those two
numbers relate to central tendency and dispersion from central tendency. I then talked about
symmetric and skewed data distributions. Then I discussed the value of standardizing normally
distributed data. Finally, I closed with a brief discussion about data transformation.