DATA ANALYSIS WORKBOOK
LAB 2

UNIVARIATE FREQUENCY DISTRIBUTIONS & MEASURES OF CENTRAL TENDENCY

OVERVIEW

The purpose of this exercise is to increase your understanding of how the shape of a frequency distribution can affect the ability of the mean to describe the central tendency or location of the distribution of a variable. You do this by:

1. looking at a graph of the frequency distribution - a histogram
2. guessing which is more typical or representative of the distribution: the mean or median
3. checking your guess against the actual values of the mean and median

STATISTICS AND DATA ANALYSIS

Using a Histogram to Graph a Frequency Distribution

A graph provides the picture of a frequency distribution. A frequency distribution is a table that lists the values of the variable and the frequency of occurrence for each value. A histogram is a bar graph of the distribution of a variable that is measured at either the interval or ratio levels. Although the variables in the 1987 GSS are discrete, we will treat the ratio, interval, and in some cases ordinal, variables as being continuous. In a continuous variable, the values of the variable are capable of being infinitely subdivided. For example, the unit of time used to measure a person's age could be subdivided into years, months, weeks, days, hours, minutes, seconds, etc. In contrast, the values of a discrete variable (for example, family size) cannot be subdivided.

In making the frequency distribution of a continuous variable, the values of the variable are grouped into class intervals. Usually, the class intervals in a frequency distribution have the same size (for example, ten-year classes in the case of age, and one year in the case of education).[1] To make a histogram, a bar graph is constructed so that the relative area[2] of each bar equals the relative frequency of occurrence for the class interval.
(The relative frequency of occurrence equals the proportion of cases that have values in the class interval.) For frequency distributions with equal class intervals, one obtains this equality by making the height of the bar correspond to the frequency of occurrence for the class interval. Constructing a histogram of a frequency distribution with unequal intervals is more complicated. In this case, you make the height of the bar equal to the frequency of the class interval divided by the size of the interval. Since we rarely, if ever, encounter variables grouped into unequal class intervals, we go no further on this topic.[3]

[1] In treating a variable as continuous, we adopt the convention of using intervals that are “closed at the bottom but open at the top.” This means that you read the class interval as containing those values beginning with the lower limit and up to but not including the upper limit. In the case of age grouped into class intervals of size ten (years), we would read the interval 15 - 25 as including the ages from 15 up to but not including 25. In cases where the size of the class interval is one, we interpret the value of the variable as the midpoint of the class interval that begins a half unit below the value and ends a half unit above. For example, we interpret 12 years of schooling as the midpoint of an interval that begins at 11.5 and ends at (but does not include) 12.5.

[2] The area of a bar equals the width of the bar times its height. The relative area of a bar is equal to the bar's area divided by the sum of the areas for all the bars.

INTERPRETING A HISTOGRAM

The picture given by a histogram provides four rough ideas about the shape of the distribution of the variable being graphed. First, we can get a rough idea of the variable’s central tendency.
As discussed in more detail below, the central tendency of a variable is a “typical value” that we use to stand for or summarize the other values of the variable that have been observed. Second, we can get an idea of the degree of spread or the amount of variation in the variable. Third, we can see whether the distribution is asymmetric or skewed. Finally, we can see whether the shape of the distribution is relatively flat, peaked, or “normal,” a property technically referred to as the distribution’s kurtosis.

This lab focuses on the central tendency and skewness of a distribution. We discuss the variability or spread in Chapter 3. We provide no further discussion of kurtosis.

The distribution of a variable is asymmetric when a minority of cases have either extremely high or extremely low values (but not both). When the asymmetry is substantial, we say that the distribution is skewed. The skew corresponds to the region of the extreme values. The histogram for a skewed distribution will exhibit a “tail” that represents these extreme cases. Depending on whether the values of the extreme cases are low or high, relative to the rest of the values, we say that the distribution is skewed to the left (“negatively skewed”) or skewed to the right (“positively skewed”). For example, income is positively skewed because only a small number of people enjoy very large incomes. Most people have low or moderate values of income, relative to the wealthy. Students' scores on an “easy” exam, on the other hand, typically are skewed to the left (or negatively skewed). Most students get scores in the middle or upper end of the possible range of scores. A few students who do not read the text, do not attend lectures, or do neither get extremely low scores.

[3] One might group years of schooling into unequal class intervals so that the intervals end at the points at which students often leave school.
The discrete (rather than continuous) intervals for an American sample might be: 0 – 6, 7 – 8, 9 – 11, 12, 13 – 15, 16, and > 16. If, say, a hundred students had left school after grade 12, the researcher would draw the bar so that the height would correspond to 100 students. If 30 students had left school at grade 9, 10, or 11, the researcher would draw the bar so that the height would correspond to 10 students (30/3 = 10). In constructing the histogram this way, the researcher assumes a uniform distribution of cases across the values of the class interval.

Using a Measure of Central Tendency or Location to Describe a Frequency Distribution

A measure of central tendency (sometimes called a measure of “location”) is a value of the variable that data analysts use to represent the entire distribution of values for a set of cases. Since we use it to stand for the different values of a variable, we want to choose a “typical” value. “Typical,” however, can be defined in a number of ways. Introductory statistics courses usually present three measures of central tendency: the mean, median, and mode. Throughout most of the course we concentrate exclusively on the mean (or some variation of it) as the measure of central tendency. In this lab, however, we also look at the median (in order to better understand the mean’s strengths and weaknesses as a measure of central tendency).

The mean is the arithmetic average of the values of a variable for a set of cases. To compute it, you (or the computer) add up the values of the variable (for all the cases) and divide the sum by the number of cases. Equation 1 contains the definitional formula for the mean of the variable y. In this equation, y stands for the variable, n stands for the number of cases, and Σ is the arithmetic operator that tells you to add up the values of y. Physically, the mean is the centre of gravity or balancing point for a distribution.
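The definition of the mean and its balancing-point property can be checked with a few lines of Python. This is only a minimal sketch and is not part of the SPSS procedure used in this lab: the seven values of schooling are hypothetical (they are not GSS data), and the names `mean`, `years`, `y_bar`, and `balance` are chosen here purely for illustration.

```python
from fractions import Fraction


def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of cases."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty set of cases is undefined")
    return Fraction(sum(values), n)


# Hypothetical years of schooling for seven cases (assumption, not GSS data).
years = [8, 10, 12, 12, 12, 14, 16]
y_bar = mean(years)

# Balancing point: the deviations of the cases from the mean sum to zero.
deviations = [y - y_bar for y in years]
balance = sum(deviations)

print("mean =", float(y_bar), "sum of deviations =", balance)
```

Exact fractions are used so that the deviations sum to exactly zero rather than to a small rounding error; with ordinary floating-point numbers the sum may differ from zero by a negligible amount.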
If you can visualize the distribution sitting on a “teeter-totter,” the mean is the point at which the teeter-totter will balance (that is, where the board will be parallel to the ground). As we discuss below, this feature of the mean both enhances and detracts from its usefulness as a measure of central tendency.

(1)  $\bar{y} = \frac{\sum y}{n}$

A second measure of central tendency is the median. The median is the value of the variable that divides the distribution in half. Half the cases have a value greater than the median; half have a value that is less than the median.[4] For an odd number of cases, the median is the value of the variable for the middle case. For an even number of cases, you compute the median by taking the average of the values of the two middle cases. When treating a variable as continuous, you can use linear interpolation to compute a more precise value of the median. Since none of the computer programs do this, we do not describe this procedure in detail.[5]

[4] The median is a single point, so, strictly speaking, none of the cases have exactly the value of the median. However, a certain per cent will fall in the interval that is bounded by half a unit above and below the median.

[5] One consequence of the failure of computer packages to use linear interpolation to compute the median is that the relative values of the computed mean and median in skewed distributions may fail to exhibit the properties described below: that is, the mean should be greater than the median in positively skewed distributions and less than the median in negatively skewed distributions. This occurs in the example described below.

The mode of a distribution is the third measure of central tendency. It is the most frequently occurring value of the variable (or class interval in the case of a grouped frequency distribution). Often, a distribution will have two or more frequently occurring values (relative to the rest). In this case, we say the distribution is “bimodal” or
“multimodal.” We refer to distributions in which all values have the same frequency of occurrence as “uniform.”

The Choice of a Measure of Central Tendency: The Mean Versus the Median

The choice of which measure of central tendency to use depends, among other things, on the level at which the variable is measured. Use of the mean assumes at least interval measurement (or the willingness to treat an ordinal variable as if it were interval). The reason is straightforward. Changing the position of a case in a distribution changes the centre of gravity. (Think of the example of a child shifting his or her position on a teeter-totter.) A mean, therefore, will make sense only if the researcher uses the interval (or ratio) properties of numbers when measuring the variable. A median assumes only ordinal measurement because it is unaffected by whether the “numbers” assigned to cases on either side of the middle case are close to the middle number or far away from it (the ordinal property of a number, but not its distance from other numbers, is meaningful). Finally, you can use the mode with any level of measurement since finding the most frequently occurring category (“value”) does not depend on either the order of the categories or their distance from one another.

This lab deals with interval or ratio variables, so the choice of a measure of central tendency is between the mean and the median. In this case, the preferred choice is usually the mean. The reason is that, as the centre of gravity, the mean makes use of more information than the median.[6]

The extent to which the distribution is skewed, however, qualifies this choice. In the case of a symmetric distribution (no skew), the mean and the median will equal one another, so no choice is necessary. In the case of skewed distributions, on the other hand, the mean will lie between the median and the values of the extreme scores.
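A small numerical sketch may make the position of the mean in a skewed distribution concrete. The incomes below are hypothetical (they are not GSS data), and the variable names `incomes` and `mirrored` are illustrative only; the code uses Python's standard `statistics` module rather than SPSS.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands of dollars (assumption, not GSS data).
# One very large value creates a tail on the right (positive skew).
incomes = [18, 22, 25, 27, 30, 34, 250]

# Mirroring the same values around 300 puts the tail on the left (negative skew).
mirrored = [300 - x for x in incomes]

for label, values in [("skewed right", incomes), ("skewed left", mirrored)]:
    print(label, "| mean:", mean(values), "| median:", median(values),
          "| range:", min(values), "to", max(values))
```

In the first list the single extreme income pulls the mean above the median and toward the high tail; in the mirrored list the extreme low value pulls the mean below the median. The median is the same middle case in both lists and does not respond to how far away the extreme case lies.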
Thus, the mean will be greater than the median in a distribution (positively) skewed to the right and less than the median in a distribution (negatively) skewed to the left. The difference occurs because the mean is affected by the distance between cases as well as their order. The sensitivity of the mean to variations in the way the scores are bunched in a distribution is not always bad. After all, the purpose of the statistic is to summarize the distribution. It is only when the skew is extreme that the mean becomes unrepresentative or atypical of the distribution. In this case, the median is preferred as a measure of central tendency.

One purpose of this lab is to sensitize you to the effect of skewness on the choice of a measure of central tendency. For the remainder of this course, however, we shall assume that the distribution is sufficiently “well behaved” (i.e., any skew is not extreme) to warrant the use of the mean as the measure of central tendency.

[6] One way of seeing this property of the mean is to invoke the principle of least squares. Stated formally in equation (f1), this principle says that the sum of the squared deviations from the mean is less than the sum of the squared deviations from any other constant c, including the median when it differs from the mean. The significance of this principle is that we can interpret the sum of the squared deviations as a measure of the error that results from using a constant (the mean, median, mode, or any other measure of central tendency) to represent the values of a variable for a set of cases. The principle of least squares implies that using the mean incurs the least error.

(f1)  $\sum (y - \bar{y})^2 < \sum (y - c)^2$, where c is any constant other than $\bar{y}$.

DATA ANALYSIS EXAMPLE

Research Question

To what extent is the univariate frequency distribution of the variable EDUC (v21) skewed? In particular, is it so skewed that the median is preferred as the measure of central tendency?
To answer these questions, we will first look at the histogram for the variable and then compare the mean and the median.

Computer Generated Histograms

Using the “explore” command, SPSS first produces a set of statistics that we can use to describe a frequency distribution and, as an additional option, plots a histogram of the distribution. In making the histogram, the program automatically chooses the interval width. In doing so, it follows the convention of treating the intervals as closed at the bottom and open at the top, so, as pointed out earlier, you read each interval as containing all values from the lower limit up to (but not including) the upper limit.

Results (Respondent's Education, v21)

The descriptive statistics and histogram for the respondent's education are presented below in Figure 1 and Table 1. Look at the histogram in Figure 1. A large percentage of the cases pile up at grade 12. This is the completion of high school in the United States and the most common point at which people leave school. Looking more closely at the histogram, you might see an indication of a negative skew; there are a few extreme cases with little or no schooling. Whether the skew is sufficiently extreme to have a substantial effect on the mean is a matter of judgment. I believe that the skew is not that extreme, so I would choose the mean over the median as the better measure of central tendency.

Go now to the statistics in the table below the histogram to check my judgment. The program provides a large number of statistics. The only ones that you will deal with in the lab are the mean and median, although I will comment briefly on the value of the skew in discussing this example.[7] The impression of a negative skew should lead you to expect that the mean will be less than the median.
In fact, it is the other way around due to the failure of SPSS to use linear interpolation when computing the median (see footnote 5). Treating the variable as continuous, I get a median of approximately 12.7, slightly greater than the mean of 12.33. Moreover, the value of the skew is -.500.

[7] The descriptive statistics in Table 1 consist of measures of central tendency, measures of dispersion, and “higher moments.” The measures of central tendency are the mean (“point” and “interval estimates”), the median, and a 5% trimmed mean. You are familiar with the mean and median. The trimmed mean is the mean of the cases that remain after dropping extreme cases in the two tails (2.5% in the lower tail and 2.5% in the upper tail). The confidence interval provides a more “accurate” but less precise estimate of the population mean. This estimation is the topic of Lab Six. The variance, standard deviation, minimum, maximum, range, and inter-quartile range measure the “dispersion” or the extent to which the values of education vary. Lab Three focuses on this topic. The measures of “higher moments” are the degree of skewness and the degree of kurtosis. The degree of skewness is based on the sum of the cubed differences between the scores and the mean (the third moment), while the measure of kurtosis is based on the sum of the differences raised to the fourth power (the fourth moment).

Figure 1. Histogram for Respondent's Education
[Histogram of EDUC. Horizontal axis: EDUC, 0.0 to 20.0. Vertical axis: Frequency, 0 to 800. Std. Dev = 3.28, Mean = 12.3, N = 1809.00]

Table 1. Descriptives for EDUC

                                       Statistic    Std. Error
Mean                                   12.33        7.70E-02
95% Confidence Interval for Mean
    Lower Bound                        12.18
    Upper Bound                        12.48
5% Trimmed Mean                        12.42
Median                                 12.00
Variance                               10.727
Std. Deviation                         3.28
Minimum                                0
Maximum                                20
Range                                  20
Interquartile Range                    3.00
Skewness                               -.500        .058
Kurtosis                               1.206        .115

These results support my appraisal of the histogram: the distribution of schooling is negatively skewed. Again, however, the comparison of the values of the mean and median also supports my judgment that the difference is not great enough to choose the median over the mean as a measure of central tendency.

LAB 2 EXERCISES

Research Question: What are the shape and the best measure of central tendency for each of the following variables?

1. The respondent's age
2. The income the respondent earns from his or her job (v31)
3. The age at which the respondent got married (v17)
4. The respondent's score on the seven-word vocabulary test (v50)
5. The number of brothers and sisters that the respondent has (v07)
6. The respondent's mother's education (v11)

Examine the shape of each distribution (approximately symmetric, positively skewed, or negatively skewed) by looking at the histograms that you generate using SPSS. Describe the shape of the distribution and the measure of central tendency that you would choose to describe the distribution.

Tasks:

1. Lab Exercise 2.1a - Variable Information: For each of the six variables, use the blue codebook to find the variable name, minimum and maximum values/codes, and labels. You should also determine the metric and level of measurement of each variable.
2. Lab Exercise 2.1b - Histograms – Skew: Use the histograms to determine whether the distribution is skewed. If skewed, suggest which will be larger: the mean (positive skew) or the median (negative skew).
3. Lab Exercise 2.2 - Comparing Measures of Central Tendency: Use the descriptive statistics generated by the “explore” command to compare the mean and median. Is the difference large enough to choose one over the other? (Use the criteria provided on yellow sheet 2.2 to decide.) If so, which is larger: the mean (positive skew) or the median (negative skew)?
Description of Variables: For each variable, write one sentence that describes the variable using the measure of central tendency you have chosen to represent the distribution. Use the metric of the variable in your description and report the number of cases in the distribution. Write your description on the back of a yellow worksheet. For example: the distribution of respondent's education for 1,809 cases is approximately symmetric, so the mean of 12.3 years best describes the central tendency of this variable.
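The linear interpolation of the median mentioned in footnote 5 is not described in this lab. For readers who want to see the idea, the following is a minimal sketch of the usual grouped-data formula, assuming the midpoint convention from the note on class intervals (a value v stands for the interval from v - 0.5 up to but not including v + 0.5) and assuming that cases are spread evenly within each interval. The function name `interpolated_median` and the frequency table `schooling` are hypothetical; the frequencies are invented for illustration and are not meant to reproduce the value reported for EDUC.

```python
from statistics import median


def interpolated_median(freq_by_value, width=1.0):
    """Median of a discrete variable treated as continuous.

    Each value v is read as the midpoint of the class interval
    [v - width/2, v + width/2); cases are assumed to be spread
    evenly within their interval.
    """
    n = sum(freq_by_value.values())
    if n == 0:
        raise ValueError("the frequency table contains no cases")
    half = n / 2
    cumulative = 0  # number of cases in the intervals below the current one
    for value in sorted(freq_by_value):
        freq = freq_by_value[value]
        if freq > 0 and cumulative + freq >= half:
            lower_limit = value - width / 2
            # Move into the interval in proportion to the cases still needed.
            return lower_limit + (half - cumulative) / freq * width
        cumulative += freq


# Hypothetical frequency table: years of schooling -> number of cases.
schooling = {10: 1, 11: 2, 12: 8, 13: 5, 14: 4}

# Conventional median computed from the individual cases, for comparison.
cases = [value for value, freq in schooling.items() for _ in range(freq)]
conventional = median(cases)
interpolated = interpolated_median(schooling)

print("conventional median:", conventional)
print("interpolated median:", interpolated)
```

In this invented table the conventional median falls on the value 12, while the interpolated median lies above 12 because more than half of the cases in the interval around 12 are needed to reach the middle case. This is the kind of difference that can change which of the mean and median appears larger.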