DATA ANALYSIS WORKBOOK
LAB 2
UNIVARIATE FREQUENCY DISTRIBUTIONS &
MEASURES OF CENTRAL TENDENCY
OVERVIEW
The purpose of this exercise is to increase your understanding of how the shape of a
frequency distribution can affect the ability of the mean to describe the central tendency
or location of the distribution of a variable. You do this by:
1. looking at a graph of the frequency distribution - a histogram
2. guessing which is more typical or representative of the distribution: the mean or
median
3. checking your guess against the actual values of the mean and median
STATISTICS AND DATA ANALYSIS
Using a Histogram to Graph a Frequency Distribution
A graph provides the picture of a frequency distribution. A frequency distribution is a
table that lists the values of the variable and the frequency of occurrence for each value.
A histogram is a bar graph of the distribution of a variable that is measured at either the
interval or ratio levels. Although the variables in the 1987 GSS are discrete, we will treat
the ratio, interval, and in some cases ordinal, variables as being continuous. In a
continuous variable, the values of the variable are capable of being infinitely subdivided.
For example, the unit of time used to measure a person's age could be subdivided into
years, months, weeks, days, hours, minutes, seconds, etc. In contrast, the values of a
discrete variable—for example, family size—cannot be subdivided. In making the
frequency distribution of a continuous variable, the values of the variable are grouped into
class intervals. Usually, the class intervals in a frequency distribution have the same
size—for example, ten-year classes in the case of age, and one year in the case of
education.1
To make a histogram, a bar graph is constructed so that the relative area2 of each bar
equals the relative frequency of occurrence for the class interval. (The relative frequency
of occurrence equals the proportion of cases that have values in the class interval.) For
frequency distributions with equal class intervals, one obtains this equality by making the
height of the bar correspond to the frequency of occurrence for the class interval. To
construct a histogram of a frequency distribution with unequal intervals is more
complicated. In this case, you make the height of the bar equal to the
frequency of the class interval divided by the size of the interval. Since we rarely, if ever,
encounter variables grouped into unequal class intervals, we go no further on this topic.3
1 In treating a variable as continuous, we adopt the convention of using intervals that are “closed at the
bottom but open at the top.” This means that you read the class interval as containing those values
beginning with the lower limit and up to but not including the upper limit. In the case of age grouped into
class intervals of size ten (years), we would read the interval 15 - 25 as including the ages 15 and up to but
not including 25. In cases where the size of the class interval is one, we interpret the value of the variable as
the midpoint of the class interval that begins a half unit below the value and ends a half unit above. For
example, we interpret 12 years of schooling as the midpoint of an interval that begins at 11.5 and ends at
(but does not include) 12.5.
2 The area of a bar equals the width of the bar times its height. The relative area of a bar is equal to
the bar's area divided by the sum of the areas for all the bars.
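The grouping convention described above (class intervals closed at the bottom and open at the top) can be sketched in a few lines of Python. This is only an illustration: the ages are invented, and the function name class_interval is our own, not part of SPSS or the GSS.

```python
from collections import Counter

def class_interval(value, start, width):
    """Return the (lower, upper) limits of the class interval holding value.

    The interval is closed at the bottom and open at the top: the lower
    limit belongs to the interval, the upper limit does not.
    """
    index = int((value - start) // width)
    lower = start + index * width
    return (lower, lower + width)

# Hypothetical ages, invented for illustration only.
ages = [15, 19, 24, 25, 31, 34, 35, 44, 52, 64]

# Frequency of occurrence for each ten-year class interval.
counts = Counter(class_interval(age, start=15, width=10) for age in ages)

# Relative frequency = proportion of cases falling in the interval.
n = len(ages)
relative_freq = {interval: count / n for interval, count in sorted(counts.items())}

for (lower, upper), share in relative_freq.items():
    print(f"{lower} up to (but not including) {upper}: {share:.2f}")
```

Note that age 25 falls in the interval that begins at 25, not in the interval 15 - 25. Because every interval here has the same size, bars whose heights equal these frequencies also have relative areas equal to the relative frequencies.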
INTERPRETING A HISTOGRAM
The picture given by a histogram provides four rough ideas about the shape of the
distribution of the variable being graphed. First, we can get a rough idea of the
variable’s central tendency. As discussed in more detail below, the central tendency of a
variable is a “typical value” that we use to stand for or summarize the other values of the
variable that have been observed. Second, we can get an idea of the degree of spread or
the amount of variation in the variable. Third, we can see whether the distribution is
asymmetric or skewed. Finally, we can see whether the shape of distribution is either
relatively flat, peaked, or “normal,” technically referred to as the distribution’s kurtosis.
This lab focuses on the central tendency and skewedness of a distribution. We discuss
the variability or spread in Chapter 3. We provide no further discussion of kurtosis.
The distribution of a variable is asymmetric when a minority of cases have either
extremely high or low values (but not both). When the asymmetry is substantial, we say
that the distribution is skewed. The skew corresponds to the region of the extreme
values. The histogram for a skewed distribution will exhibit a “tail” that represents these
extreme cases. Depending on whether the values of the extreme cases are low or high,
relative to the rest of the values, we say that the distribution is skewed to the left (“negatively
skewed”) or skewed to the right (“positively skewed”). For example, income is positively
skewed because only a small number of people enjoy very large incomes. Most people
have low or moderate values of income, relative to the wealthy. Students' scores on an
“easy” exam, on the other hand, typically are skewed to the left (or negatively skewed).
Most students get scores in the middle or upper end of the possible range of scores. A
few students who do not read the text or attend lectures, or who do neither, get extremely low
scores.
3 One might group years of schooling into unequal class intervals so that the intervals end at the points at
which students often leave school. The discrete (rather than continuous) intervals for an American sample
might be: 0 – 6, 7 – 8, 9 – 11, 12, 13 – 15, 16, and > 16. If, say, a hundred students had left school after
grade 12, the researcher would draw the bar so that the height would correspond to 100 students. If 30
students had left school either at grade 9, 10, or 11, the researcher would draw the bar so that the height
would correspond to 10 students (30/3 = 10). In constructing the histogram this way, the researcher
assumes a uniform distribution of cases across the values of the class interval.
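The arithmetic in the footnote on unequal class intervals can be checked with a short Python sketch. Only the two intervals for which the footnote gives counts are included; the helper name bar_height is our own.

```python
# Each entry: (interval label, number of grades covered, number of students).
# The counts for 9-11 and 12 are the ones given in the footnote; the other
# intervals are left out because the footnote reports no counts for them.
intervals = [
    ("9-11", 3, 30),
    ("12", 1, 100),
]

def bar_height(frequency, width):
    """Height of a histogram bar for a class interval of the given size."""
    return frequency / width

heights = {label: bar_height(freq, width) for label, width, freq in intervals}

# Area = width times height, which recovers the original frequency.
areas = {label: heights[label] * width for label, width, freq in intervals}

for label, width, freq in intervals:
    print(f"grades {label}: height {heights[label]:.0f}, area {areas[label]:.0f}")
```

Dividing by the size of the interval keeps each bar's area, rather than its height, proportional to the number of cases.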
Using a Measure of Central Tendency or Location to Describe a Frequency
Distribution
A measure of central tendency (sometimes called a measure of “location”) is a value of
the variable that data analysts use to represent the entire distribution of values for a set of
cases. Since we use it to stand for the different values of a variable, we want to choose a
“typical” value. “Typical,” however, can be defined a number of ways. Introductory
statistics courses usually present three measures of central tendency: the mean, median, and
mode. Throughout most of the course we concentrate exclusively on the mean (or some
variation of it) as the measure of central tendency. In this lab, however, we also look at
the median (in order to better understand the mean’s strengths and weaknesses as a
measure of central tendency).
The mean is the arithmetic average of the values of a variable for a set of cases. To
compute it, you (or the computer) add up the values of the variable (for all the cases) and
divide the sum by the number of cases. Equation 1 contains the definitional formula for
the mean for the variable y. In this equation, y stands for the variable, n stands for the
number of cases, and Σ is the arithmetic operator that tells you to add up the values of y.
Physically, the mean is the centre of gravity or balancing point for a distribution. If you
can visualize the distribution sitting on a “teeter-totter,” the mean is the point at which the
teeter-totter will balance (that is, where the board will be parallel to the ground). As we
discuss below, this feature of the mean both enhances and detracts from its usefulness as
a measure of central tendency.
(1)  $\bar{y} = \dfrac{\sum y}{n}$
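A small Python sketch may help make the definition and the “balancing point” idea concrete. The years of schooling below are invented for illustration. The deviations above and below the mean cancel out, which is what it means for the mean to be the centre of gravity.

```python
def mean(values):
    """Arithmetic average: add up the values and divide by the number of cases."""
    return sum(values) / len(values)

# Hypothetical years of schooling for five cases (illustration only).
years = [8, 12, 12, 14, 16]

y_bar = mean(years)

# Distance of each case from the mean; negative values lie below the mean.
deviations = [y - y_bar for y in years]

print(f"mean = {y_bar:.1f}")
print(f"sum of deviations = {sum(deviations):.10f}")
```

The sum of the deviations is zero (up to rounding in the computer's arithmetic), so moving any one case changes the balancing point.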
A second measure of central tendency is the median. The median is the value of the
variable that divides the distribution in half. Half the cases have a value greater than
the median; half have a value that is less than the median.4 For an odd number of cases,
the median is the value of the variable for the middle case. For an even number of cases,
you compute the median by taking the average of the two middle cases. When treating a
variable as continuous, you can use linear interpolation to compute a more precise value
of the median. Since none of the computer programs do this, we do not describe this
procedure in detail.5
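For readers who want to see the two rules at work, here is a minimal Python sketch. The function median follows the odd/even rule given above. The function interpolated_median is one common textbook form of linear interpolation; it is our assumption about the procedure, since the workbook does not spell it out, and it assumes whole-number values that stand for the midpoints of intervals one unit wide. The data are invented.

```python
from collections import Counter

def median(values):
    """Middle case for an odd n; average of the two middle cases for an even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def interpolated_median(values, width=1):
    """Median by linear interpolation within the interval holding the middle case.

    Assumption: each value is the midpoint of an interval of the given width,
    and cases are spread evenly across that interval.
    """
    counts = Counter(values)
    half = len(values) / 2
    below = 0  # number of cases in the intervals below the current one
    for value in sorted(counts):
        if below + counts[value] >= half:
            lower_limit = value - width / 2
            return lower_limit + (half - below) / counts[value] * width
        below += counts[value]

# Hypothetical years of schooling for ten cases (illustration only).
schooling = [10, 11, 12, 12, 12, 12, 13, 14, 16, 16]

print(median(schooling))               # average of the two middle cases
print(interpolated_median(schooling))  # interpolated within 11.5 to 12.5
```

With these invented data the two middle cases both equal 12, so the simple rule gives 12, while interpolation places the median at 12.25, three quarters of the way through the four cases in the interval that begins at 11.5.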
The mode of a distribution is the third measure of central tendency. It is the most
frequently occurring value of the variable (or class interval in the case of a grouped
frequency distribution). Often, a distribution will have two or more frequently occurring
values (relative to the rest). In this case, we say the distribution is “bimodal” or
“multimodal.” We refer to distributions in which all values have the same frequency of
occurrence as “uniform.”
4 The median is a single point, so, strictly speaking, none of the cases have exactly the value of the median.
However, a certain per cent will fall in the interval that is bounded by half a unit above and below the
median.
5 One consequence of the failure of computer packages to use linear interpolation to compute the median is
that the relative values of the computed mean and median in skewed distributions will fail to exhibit the
properties described below--i.e., the mean should be greater than the median in positively skewed
distributions and less than the median in negatively skewed distributions. This occurs in the example
described below.
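The mode is easy to find by counting. The sketch below uses multimode from Python's standard statistics module (available in Python 3.8 and later), which returns every value tied for the highest frequency. The family sizes are invented for illustration.

```python
from statistics import multimode

# Hypothetical family sizes (illustration only).
one_peak = [2, 3, 3, 3, 4, 5]
two_peaks = [2, 3, 3, 4, 4, 5]
flat = [2, 3, 4, 5]

print(multimode(one_peak))   # a single mode
print(multimode(two_peaks))  # two values tie: a bimodal distribution
print(multimode(flat))       # every value ties: a uniform distribution
```

Note that this strict count treats values as modes only when they tie exactly. In practice we also call a distribution bimodal when it has two clear peaks of somewhat different heights, which is a judgment made from the histogram.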
The Choice of a Measure of Central Tendency: The Mean Versus the Median
The choice of which measure of central tendency to use depends, among other things, on
the level at which the variable is measured. Use of the mean assumes at least interval
measurement (or the willingness to treat an ordinal variable as if it were interval). The
reason is straightforward. To change the position of a case in a distribution will affect the
centre of gravity. (Think of the example of a child shifting his or her position on a
teeter-totter.) A mean, therefore, will make sense only if the researcher uses the interval
(or ratio) properties of numbers when measuring the variable. A median assumes only
ordinal measurement because it is unaffected by whether the “number” assigned to cases
on either side of the middle case is either close to the middle number or far away (since
the ordinal property of the number but not its distance from other numbers is meaningful).
Finally, you can use the mode with any level of measurement since finding the most
frequently occurring category (“value”) does not depend on either the order of the
categories or their distance from one another.
This lab deals with interval or ratio variables, so the choice of a measure of central
tendency is between the mean and the median. In this case, the preferred choice of
central tendency is usually the mean. The reason is that, as the centre of gravity, the mean
makes use of more information than the median.6 The extent to which the distribution is
skewed, however, qualifies this choice. In the case of a symmetric distribution (no skew),
the mean and the median will equal one another, so no choice is necessary. In the case of
skewed distributions, on the other hand, the mean will lie between the median and the
values of the extreme scores. Thus, the mean will be greater than the median in a
distribution (positively) skewed to the right and less than the median in a distribution
(negatively) skewed to the left. The difference occurs because the mean is affected by the
distance between cases as well as their order.
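The relationship between skew and the relative positions of the mean and median can be seen with three small, invented data sets. The sketch uses mean and median from Python's standard statistics module; the numbers are hypothetical and chosen only to make the pattern obvious.

```python
from statistics import mean, median

# Hypothetical data (illustration only).
incomes = [20, 25, 30, 35, 40, 45, 300]   # one very high value: positive skew
scores = [20, 70, 75, 80, 85, 90, 95]     # one very low value: negative skew
symmetric = [10, 20, 30, 40, 50]          # no skew

for name, data in [("incomes", incomes), ("scores", scores), ("symmetric", symmetric)]:
    print(f"{name}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

In each skewed set the single extreme case pulls the mean toward the tail while leaving the median where it was, because the median depends only on the order of the cases.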
The sensitivity of the mean to variations in the way the scores are bunched in a
distribution is not always bad. After all, the purpose of the statistic is to summarize the
distribution. It is only when the skew is extreme that the mean becomes unrepresentative
or atypical of the distribution. In this case, the median is preferred as a measure of central
tendency. One purpose of this lab is to sensitize you to the effect of skewedness on the
choice of a measure of central tendency. For the remainder of this course, however, we
shall assume that the distribution is sufficiently “well behaved” (i.e., any skew is not
extreme) to warrant the use of the mean as the measure of central tendency.
6 One way of seeing this property of the mean is to invoke the principle of least squares. Stated formally in
equation (f1), this principle says that the sum of the squared deviations from the mean is less than the sum
of the squared deviations from any constant c, including the median. The significance of this principle is that we
can interpret the sum of the squared deviations as a measure of the error that results from using a constant
(the mean, median, mode, or any other measure of central tendency) to represent the values of a variable for
a set of cases. The principle of least squares implies that using the mean incurs the least error.
(f1)  $\sum (y - \bar{y})^2 < \sum (y - c)^2$, where c is any constant other than $\bar{y}$.
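The principle of least squares can be checked numerically. The Python sketch below uses invented years of schooling and compares the sum of squared deviations around the mean with the sum around several other constants, including the median. The function name sum_squared_deviations is our own.

```python
def sum_squared_deviations(values, c):
    """Total squared error from using the constant c to stand for every case."""
    return sum((y - c) ** 2 for y in values)

# Hypothetical years of schooling for five cases (illustration only).
years = [8, 12, 12, 14, 16]

y_bar = sum(years) / len(years)   # the mean
middle = sorted(years)[len(years) // 2]   # the median (odd number of cases)

error_mean = sum_squared_deviations(years, y_bar)
error_median = sum_squared_deviations(years, middle)

print(f"error around the mean   = {error_mean:.1f}")
print(f"error around the median = {error_median:.1f}")

# Any other constant does worse than the mean as well.
for c in [10, 11, 13, 14]:
    print(f"error around {c} = {sum_squared_deviations(years, c):.1f}")
```

The error around any constant c exceeds the error around the mean by n times the squared distance between c and the mean, so the gap grows quickly as c moves away from the mean.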
DATA ANALYSIS EXAMPLE
Research Question
To what extent is the univariate frequency distribution of the variable (EDUC v21)
skewed? In particular, is it so skewed that the median is preferred as the measure of central
tendency? To answer these questions, we will first look at the histogram for the variable
and then compare the mean and the median.
Computer Generated Histograms
Using the “explore” command, SPSS produces, first, a set of statistics that we can use to
describe a frequency distribution and, as an additional option, plots a histogram of the
distribution. In making the histogram, the program automatically chooses the interval
width. In doing so, it follows the convention of treating the interval limits as closed at the
bottom and open at the top, so, as pointed out earlier, you read the interval as containing
all values from the lower limit up to (but not including) the upper limit.
Results (Respondent's Education, v21)
The descriptive statistics and histogram for the respondent's education are produced
below in Figure 1 and Table 1. Look at the histogram in Figure 1. A large percentage of
the cases pile up at grade 12. This is the completion of high school in the United States
and the most common point at which people leave school. Looking more closely at the
histogram, you might see an indication of a negative skew; there are a few extreme cases
with no or little schooling. Whether the skew is sufficiently extreme to have a substantial
effect on the mean is a matter of judgment. I believe that the skew is not that extreme, so
I would choose the mean over the median as the better measure of central tendency.
Go now to the statistics in the table below the histogram to check out my judgment. The
program provides a large number of statistics. The only ones that you will deal with in
the lab are the mean and median, although I will comment briefly on the value of the
skew in discussing this example.7 The impression of a negative skew should lead you to
expect that the mean will be less than the median. In fact, it is the other way around due
to the failure of SPSS to use linear interpolation when computing the median (see
footnote 5). Treating the variable as continuous, I get a median of approximately 12.7,
slightly greater than the mean of 12.33. Moreover, the value of the skew is -.500.
7 The descriptive statistics in Table 1 consist of measures of central tendency, measures of dispersion, and
“higher moments.” The measures of central tendency are the mean (“point” and “interval estimates”), the
median, and a 5% trimmed mean. You are familiar with the mean and median. The trimmed mean is the
mean of the cases that remain after dropping extreme cases in the two tails (2.5% in the lower tail; and 2.5%
in the upper tail). The confidence interval provides a more “accurate” but less precise estimate of the
population mean. This estimation is the topic of lab six. The variance, standard deviation, minimum,
maximum, range, and inter-quartile range measure the “dispersion” or the extent to which the values of
education vary. Lab Three focuses on this topic. The measures of “higher moments” are the degree of
skewness and the degree of kurtosis. The degree of skewness is based on the sum of the cubed differences
between the scores and the mean (the third moment), while the measure of kurtosis is based on the sum of
the differences raised to the fourth power (the fourth moment).
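To see why a skewness statistic based on cubed differences from the mean comes out negative for a left-skewed distribution, consider the following Python sketch. It computes the simple moment-based coefficient (the third moment divided by the second moment raised to the power 1.5). This is an assumption about the general form only: statistical packages typically add a small-sample adjustment, so this function should not be expected to reproduce the printed value of -.500 exactly. The data are invented.

```python
def moment_skewness(values):
    """Third moment about the mean, scaled by the second moment to the power 1.5."""
    n = len(values)
    y_bar = sum(values) / n
    second = sum((y - y_bar) ** 2 for y in values) / n
    third = sum((y - y_bar) ** 3 for y in values) / n
    return third / second ** 1.5

# Hypothetical data (illustration only).
low_tail = [20, 70, 75, 80, 85, 90, 95]    # a few extremely low cases
high_tail = [20, 25, 30, 35, 40, 45, 300]  # a few extremely high cases
balanced = [10, 20, 30, 40, 50]            # symmetric around 30

print(f"low tail:  {moment_skewness(low_tail):.2f}")
print(f"high tail: {moment_skewness(high_tail):.2f}")
print(f"balanced:  {moment_skewness(balanced):.2f}")
```

Cubing keeps the sign of each difference and gives the most weight to the cases farthest from the mean, so a tail of extreme low values produces a negative result and a tail of extreme high values produces a positive one.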
Figure 1. Histogram for Respondent's Education
[Histogram: horizontal axis EDUC, from 0.0 to 20.0 in steps of 2.0; vertical axis Frequency, from 0 to 800. Mean = 12.3, Std. Dev = 3.28, N = 1809.]
Table 1. Descriptives for EDUC
                                              Statistic    Std. Error
Mean                                          12.33        7.70E-02
95% Confidence Interval     Lower Bound       12.18
for Mean                    Upper Bound       12.48
5% Trimmed Mean                               12.42
Median                                        12.00
Variance                                      10.727
Std. Deviation                                3.28
Minimum                                       0
Maximum                                       20
Range                                         20
Interquartile Range                           3.00
Skewness                                      -.500        .058
Kurtosis                                      1.206        .115
These results support my appraisal of the histogram: the distribution of schooling is
negatively skewed. Again, however, the comparison of the values of the mean and
median also supports my judgment that the difference is not great enough to choose the
median over the mean as a measure of central tendency.
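Several of the statistics in Table 1 are related to one another, and it can be reassuring to check the relationships by hand. The Python sketch below starts from the reported mean, variance, and number of cases and recovers the standard deviation, the standard error of the mean, and the confidence limits. The multiplier 1.96 is our assumption (the usual large-sample value for a 95% interval); the workbook takes up interval estimation in a later lab.

```python
import math

# Values reported in the SPSS output for EDUC.
n = 1809
reported_mean = 12.33
variance = 10.727

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# The standard error of the mean is the standard deviation over the root of n.
std_error = std_dev / math.sqrt(n)

# Assumption: 1.96 standard errors on either side of the mean for 95%.
z = 1.96
lower = reported_mean - z * std_error
upper = reported_mean + z * std_error

print(f"Std. Deviation = {std_dev:.2f}")
print(f"Std. Error     = {std_error:.3f}")
print(f"95% interval   = {lower:.2f} to {upper:.2f}")
```

The results agree with the printed table: a standard deviation of 3.28, a standard error of .077 (printed as 7.70E-02), and confidence limits of 12.18 and 12.48.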
LAB 2 EXERCISES
Research Question:
What is the shape and best measure of central tendency for each of the following
variables?
1. The respondent's age
2. The income the respondent earns from his job (v31)
3. The age at which the respondent got married (v17)
4. The respondent's score on the seven-word vocabulary test (v50)
5. The number of brothers and sisters that the respondent has (v07)
6. The respondent's mother's education (v11)
• examine the shape of each distribution (approximately symmetric, positively skewed,
or negatively skewed) by looking at the histograms that you generate using SPSS.
• describe the shape of the distribution and the measure of central tendency that you
would choose to describe the distribution
Tasks:
1. Lab Exercise 2.1a - Variable Information: For each of the six variables, use the
blue codebook to find the variable name, minimum and maximum values/codes and
labels. You should also determine the metric and level of measurement of each
variable.
2. Lab Exercise 2.1b - Histograms – Skew: Use the histograms to determine whether
the distribution is skewed. If skewed, suggest which will be larger: the mean
(positive skew) or the median (negative skew).
3. Lab Exercise 2.2 - Comparing Measures of Central Tendency: Use the
descriptive statistics generated by the “explore” command to compare the mean and
median. Is the difference large enough to choose one over the other? (Use the criteria
provided on yellow sheet 2.2 to decide.) If so, which is larger: the mean (positive
skew) or the median (negative skew)?
Description of Variables:
For each variable, write one sentence that describes the variable using the measure of
central tendency you have chosen to represent the distribution. Use the metric of the
variable in your description and report the number of cases in the distribution. Write your
description on the back of a yellow worksheet. For example - the distribution for
respondent's education for 1,809 cases is approximately symmetric, so that the mean
of 12.3 years best describes the central tendency of this variable.