
Chapter 2-6. More on Levels of Measurement

Summing Dichotomous-Scaled Variables or Ordinal-Scaled Variables Produces an Interval-Scaled Variable

Almost all standardized tests, such as the Zung Self Rating Depression Scale (Zung, 1965), are made up of several ordinal scale items, and a total score is derived by summing the item scores. For example, Zung’s scale is made up of 20 items, each scored from 1 to 4. The first item is:

    I feel down-hearted and blue
    (1) A little of the time
    (2) Some of the time
    (3) Good part of the time
    (4) Most of the time

A total score is then computed by summing the scores from the 20 items, which gives a range from 20 to 80.

It is widely accepted by measurement theory experts that these total scores, or totals from subsets of the items, are sufficiently close to interval scales to be analyzed as such, while individual items should be treated as ordinal scales.

Two of the best known measurement theory experts, Nunnally and Bernstein (1994, p.16), comment,

“Whereas there is usually little dispute over whether nominal or ordinal properties have been established, there is often great dispute over whether or not a scale possesses a meaningful unit of measurement. Formal scaling methods designed to this end are discussed in Chapters 2, 10, and 15. For now, it suffices to note that many measures are sums of item responses, such as conventionally scored multiple-choice, true-false, and Likert scale items. Data from individual items are clearly ordinal. However, the total score is usually treated as interval, as when the arithmetic mean score, which assumes equality of intervals, is computed. Those who perform such operations thus implicitly use a scaling model to convert data from a lower (ordinal) to a higher (interval) level of measurement when they sum over items to obtain a total score.
Some adherents of Stevens’ position have argued that these statistical operations are improper and advocate, among other things, that medians, rather than arithmetic means should be used to describe conventional test data. We strongly disagree with this point of view for reasons we will note throughout this book, not the least of which is the results of summing item responses are usually indistinguishable from using more formal methods. However, some situations clearly do provide only ordinal data, and the results of using statistics that assume an interval can be misleading. One example would be the responses to individual items scored on multi-category (Likert-type) scales.”

_______________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine. Chapter 2-6. (Accessed February 13, 2012, at http://www.ccts.utah.edu/biostats/?pageId=5385). Chapter 2-6 (revision 13 Feb 2012).

A refinement of this idea

Although summing individual items to produce an interval scale is widely accepted, a little thought will make you wonder. For example, suppose you provide the following list of tasks for someone putting weight (post weightbearing) on a leg after a hip replacement operation:

    1. Stand up from a sitting position
    2. Walk from room to room in own house using a cane or walker
    3. Walk from room to room in own house unassisted
    4. Walk up one flight of stairs
    5. Run on treadmill for 5 minutes

If you sum up the number of tasks completed to get a total score, is it really an interval scale? The problem is that the tasks do not have the same level of difficulty, so the sum will not strictly have equal intervals. Going from 4 to 5 tasks (adding the treadmill run) is a much larger step in function than going from 1 to 2.

To make this a true interval scale, you need to weight the items by level of difficulty. An excellent way to assign weights is the Rasch Model, which is widely used in measurement development. An excellent textbook on applying this method is Bond and Fox (2007).
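To see why a raw count of completed tasks need not have equal intervals, here is a minimal Python sketch of the Rasch item response function, in which the probability of completing an item depends on the person's ability minus the item's difficulty on a logit scale. It is an illustration only, not part of the Stata course materials. The five difficulty values are hypothetical numbers chosen for the example, not estimates from any dataset, and the function names are made up.

```python
import math

def rasch_prob(ability, difficulty):
    """Rasch model: probability of completing an item (both arguments in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical difficulties (logits) for the five hip-replacement tasks,
# ordered from easiest (stand up) to hardest (treadmill run).
difficulties = [-3.0, -1.5, 0.0, 1.0, 4.0]

def expected_raw_score(ability):
    """Expected number of tasks completed at a given ability level."""
    return sum(rasch_prob(ability, b) for b in difficulties)

# Equal one-logit steps in ability...
abilities = [-2, -1, 0, 1, 2]
scores = [expected_raw_score(a) for a in abilities]

# ...produce unequal steps in the expected raw score.
steps = [later - earlier for earlier, later in zip(scores, scores[1:])]

for a, s in zip(abilities, scores):
    print(f"ability {a:+d} logits -> expected raw score {s:.2f}")
print("raw-score steps:", [round(step, 2) for step in steps])
```

Under these assumed difficulties, equal steps in ability do not map to equal steps in the raw score, which is the sense in which the unweighted sum falls short of a strict interval scale.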
In referring to the common practice of scoring the number of items correctly answered on a school test, such as a math test, or expressing this as a percentage, to measure the student’s ability, Bond and Fox (2007, p.21) regard the result as only an ordinal scale,

“…The routine procedure in education circles is to express each of these n/N fractions as a percentage and to use them directly in reporting students’ results. We will soon see that this commonplace procedure is not justified. In keeping with the caveat we expressed earlier, these n/N fractions should be regarded as merely orderings of the nominal categories, and as insufficient for the inference of interval relations between the frequencies of observations.”

Bond and Fox (2007) then go on to show how to weight the difficulty of the exam questions using a Rasch Model to provide a true interval scale with equal intervals of difficulty or student ability.

Visual Analog Scales for Symptom Measurements

A frequently used way to assess pain is the visual analog scale (VAS). Here, the study subject rates his or her pain by placing a mark on a visual scale, such as,

    |--------------------------------------------------------------|
    no pain                                      worst possible pain

These are frequently drawn with a line 100 mm long, so the score is the distance in mm from the left end (range, 0 to 100). Another variation is an integer rating, from 0 to 10.

In remarking on what level of measurement such a scale achieves, McDowell (2006, p.478), in his textbook on rating scales, comments without committing himself to an opinion,

“Although nonparametric statistical analyses are generally considered appropriate (4), one study showed that VAS measures produced a measurement with ratio scale properties (12).”
-------------
(4) Huskisson EC. (1982). Measurement of pain. J Rheumatol 9:768-769.
(12) Price DD, McGrath PA, Rafi A, et al. (1983).
The validation of visual analogue scales as ratio scale measures for chronic and experimental pain. Pain 17:45-56.

Perhaps the best article to cite for justifying that a VAS can be analyzed as an interval scale is Dexter and Chestnut (1995). These authors did a Monte Carlo simulation, sampling from an actual VAS dataset, and demonstrated that the independent sample t-test (assumes interval scale) performed as well as the Wilcoxon-Mann-Whitney test (assumes ordinal scale) in not inflating the Type I error rate. Similarly, they showed the oneway analysis of variance (assumes interval scale) performed as well as the Kruskal-Wallis test (assumes ordinal scale) in not inflating the Type I error rate. So, treating the VAS as an interval scale for analysis resulted in a correct hypothesis test.

In their methods paper assessing the bias and precision of VASs, Paul-Dauphin et al (1999) analyze these scales using a statistical approach that assumes at least an interval scale. These authors discuss the different ways of presenting the VAS, such as with or without reference ticks and labels, and vertical versus horizontal orientation. It is a good paper to cite if you intend to use a VAS and want to show you have put some effort into designing your study well.

How Many Categories In An Ordinal Scale Are Required To Consider It an Interval Scale

It would seem that adding more categories would take an ordinal scale closer to an interval scale, regardless of whether the intervals are strictly equal sized or not. This occurs because with more categories, there is less opportunity for the intervals to have large inequalities. Also, an ordinal scale has an underlying theoretical continuous scale. So, the scores of the ordinal scale are approximations of the underlying continuous scale. It is somewhat analogous to expressing height by rounding to the nearest inch or centimeter.
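The rounding analogy can be made concrete with a small Python sketch (an illustration only, not part of the Stata course materials). It assumes a hypothetical underlying trait spread evenly over the interval from 0 to 1 and cuts it into k equal-width categories scored 1 to k. Both assumptions are idealizations, since real ordinal items need not have equal-width categories. It then computes the Pearson correlation between the category scores and the underlying values for several choices of k.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed directly from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical underlying continuous trait: an even grid of values in (0, 1).
n_points = 10000
underlying = [(i + 0.5) / n_points for i in range(n_points)]

def to_categories(values, k):
    """Score each value 1..k by cutting (0, 1) into k equal-width bins."""
    return [int(v * k) + 1 for v in values]

# Correlation between the k-category score and the underlying trait.
r_by_k = {k: pearson_r(underlying, to_categories(underlying, k))
          for k in (2, 3, 5, 7, 11)}

for k, r in r_by_k.items():
    print(f"{k:2d} categories: r = {r:.4f}")
```

Under these assumptions the correlation rises quickly with the number of categories and is already above 0.99 with 11 categories, which is the sense in which more categories bring the ordinal score closer to the underlying continuous scale.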
So, just how many categories does it take to justify analyzing an ordinal scale as an interval scale?

Nunnally and Bernstein (1994, p.115) make a suggestion about the number of categories,

“We will somewhat arbitrarily treat a variable as continuous if it provides 11 or more levels, even though it is not continuous in the mathematical sense. Consequently we will normally think of item responses as discrete and total scores as continuous. The number 11 is not ‘magical,’ but experience has indicated that little information is lost relative to a greater number of categories. Moreover, the law of diminishing returns applies, and so using even 7 or 9 categories does little harm if the convenience of reporting data as a single digit is important to the application.”

Exercise. Download the Multiple Sclerosis Quality of Life (MSQOL)-54 Instrument from the website:

    http://gim.med.ucla.edu/FacultyPages/Hays/MSQOL-54%20instrument.pdf

Go to items 53 and 54 on the second to last page. Item 53 is an 11-point scale and item 54 is a 7-point scale. Which would you say does the best job of approaching the accuracy of an interval scale?

Is It All Right to Treat an Ordinal Scale as an Interval Scale for Analysis?

Point

Actually it is okay to analyze an ordinal scale using statistical methods that require an interval scale, but do not do it, since the idea has not yet caught on in biomedicine. You can, however, use this idea to make yourself feel comfortable analyzing, as interval scales, sums of items that were not developed using the Rasch method, or VAS scores.

Detail (if you are curious)

As explained by Nunnally and Bernstein (1994, p.20), there is one camp, called the “fundamentalists”, who hold that ordinal scales should strictly be analyzed by the nonparametric tests that rely only on the rank order in the data.
The other camp, called the “representationalists”, advocated that the essential information in an interval scale is the rank ordering, and that there is little harm in analyzing ordinal data using parametric tests that assume an interval scale. These points of view were hotly debated in the 1950s in the social science literature. Studies were done that demonstrated there was little difference in outcomes by treating an interval scale as an ordinal scale; either approach produced basically the same correlation coefficient and p value. What came out of that was a justification for many social science researchers to analyze ordinal scales as interval scales.

The trend was not adopted by researchers in biomedicine. Most of the measurements in medicine are either dichotomous or interval, so the issue did not have to be faced. In contrast, most of the measurements in the social sciences are ordinal, or are scale totals derived from multiple ordinal items. Therefore, the social scientists looked into this question in earnest, to enable themselves to use regression models and their variants.

A famous biostatistician, Ralph D’Agostino, published some papers to introduce the idea into biostatistics. In his first paper (Heeren and D’Agostino, 1987), it was demonstrated by simulation that analyzing ordinal scales having only a few categories with a t test, with small sample sizes of 5 to 20, had the desired statistical property of the type I error being what it should be.

In a followup paper, Sullivan and D’Agostino (2003) investigate the performance of analysis of covariance, a parametric technique that assumes an interval scale, on ordinal data with 3, 4 and 5 categories. Again, they discovered the type I error was not inflated, while the power of the test remained high.

D’Agostino did not take a position by stating a conclusion for or against analyzing ordinal scaled data with interval-level statistical approaches.
Instead, he published his papers to lay the groundwork to move biostatisticians in this direction. His work, however, implies that this could be done and that the idea will slowly catch on.

Dichotomous Variables Are Actually Interval Scaled Variables

Hardly anyone knows this, because it is not taught in statistics courses, but a dichotomous scale is also an interval scale.

What statistics books do advocate, however, is that categorical variables be converted to a set of dummy variables, or indicator variables (these are dichotomies, scored 0 or 1), as a way to include a categorical variable (either nominal or ordinal) in a regression model.

Statistics books fail to point out that the reason this works is that dichotomous variables are actually interval scales, so arithmetic can be done on the variables themselves.

Linear regression estimates an intercept and slope (the equation for a straight line), using the following equations:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \quad \text{where } \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

We can see that arithmetic is being done on the variables themselves.

Interval Scale Assumption

Linear regression, as well as the other forms of regression models, assumes that all predictor variables have at least an interval scale. This assumption is necessary so arithmetic can be performed on the values of each predictor variable.

It makes sense to do arithmetic on an interval scaled variable, since this scale is sufficiently close to our notion of integers and real numbers (the interval scale shares the property of equal intervals with both of these number systems). It is generally accepted that it does not make sense to do arithmetic on nominal and ordinal scales, since these scales do not have equal intervals.
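To see what the slope and intercept formulas do when the predictor is a 0-1 indicator variable, here is a minimal Python sketch (an illustration only, not part of the Stata course materials). The seven data points are hypothetical, and the function name ols_fit is made up for the example. It applies the least squares formulas directly to the 0-1 values and compares the result with the two group means.

```python
def ols_fit(x, y):
    """Intercept and slope from the least squares formulas in the text."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: x is a 0-1 indicator (dummy) variable, y is the outcome.
x = [0, 0, 0, 1, 1, 1, 1]
y = [2.0, 4.0, 6.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = ols_fit(x, y)

# Group means computed separately, without any regression.
group0 = [yi for xi, yi in zip(x, y) if xi == 0]
group1 = [yi for xi, yi in zip(x, y) if xi == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print("intercept:", b0, " mean of group 0:", mean0)
print("slope:", b1, " difference in group means:", mean1 - mean0)
```

With a 0-1 predictor, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means, so doing arithmetic on the 0-1 values gives an interpretable result.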
Although it is rarely claimed as such, a dichotomous scale could be considered an interval scale, since it has order (although perhaps an arbitrary order), it has equal intervals (one interval that is equal to itself), and one category can be selected to represent 0.

This claim is made by Jum C. Nunnally, one of the best-known psychometric experts (Nunnally and Bernstein, 1994, p.16):

“When there are only two categories, there is only one interval to consider, so that one interval may be considered an ‘equal’ interval. That is why binary (dichotomous) variables may be considered to form interval scales, the point noted above as being so important to modern regression theory and elsewhere in statistics.”

Nunnally and Bernstein (1994, pp. 189-190) further state:

“As noted in the section titled ‘Another form of Partialling,’ categorical variables are now used quite commonly in multivariate analysis thanks to Cohen (1968). This use reflects the point made in Chapter 1 that a scale may be regarded as an interval scale when it contains only two points. This is the basis of the analysis of variance. If the variable takes on only two values, such as gender, one level may be coded 0 and the other coded 1…. A variable coded 0 or 1 is called a ‘dummy’ or ‘indicator’ variable. The independent variable’s ‘scale’ has interval properties, by definition, because the scale has only two points.”

Sarle (1997), on his web-site discussing measurement theory, states the same thing,

“What about binary (0/1) variables? For a binary variable, the classes of one-to-one transformations, monotone increasing/decreasing transformations, and affine transformations are identical--you can't do anything with a one-to-one transformation that you can't do with an affine transformation. Hence binary variables are at least at the interval level.
If the variable connotes presence/absence or if there is some other distinguishing feature of one category, a binary variable may be at the ratio or absolute level. Nominal variables are often analyzed in linear models by coding binary dummy variables. This procedure is justified since binary variables are at the interval level or higher.”

This is why you can recode nominal and ordinal predictor variables into indicator, or dummy, variables and include them directly in the regression equation. The regression algorithm treats the indicator variable as an interval scale, and performs arithmetic directly on the 0-1 values.

This claim that dichotomous variables are actually interval scales is rarely taught in statistics classes, so few people are even aware why indicator variables work in regression models. Statisticians are traditionally trained to think of a 0-1 variable as a “Bernoulli variable,” rather than as a continuous “interval scale” variable.

A Bernoulli variable has mean p and variance p(1-p), where p is the probability of a 1 (Ross, 1998).

The derivation of this mean and variance for a Bernoulli variable, with the standard deviation being the square root of the variance, is taught in the first semester of a master’s degree level statistics program. The important point about the formulas is that they just use the nominal scale property of the variable. That is, they are based on simply counting the number of occurrences of the variable’s outcomes (how many 0’s and how many 1’s), and then doing arithmetic on the counts. Arithmetic is not done on the values of the variable themselves.

These formulas for the mean and standard deviation of a Bernoulli variable look very different from the sample mean and sample standard deviation used in statistics:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{(sample mean)}$$

and

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \quad \text{(sample standard deviation)}$$

Let’s apply these standard formulas to a dichotomous variable and see what happens.
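As a quick check on a toy example, the following Python sketch (an illustration only, not part of the Stata course materials) applies both sets of formulas to a hypothetical list of ten 0-1 values. The Bernoulli formulas use only the count of 1's, while the functions from Python's standard statistics module do arithmetic on the values themselves. Because the Bernoulli standard deviation divides by n rather than n-1, it is compared with the population standard deviation, and with the sample standard deviation rescaled by the square root of (n-1)/n.

```python
import math
import statistics

# Hypothetical 0-1 data: three 1's out of ten observations.
x = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
n = len(x)

# Bernoulli formulas: arithmetic on the counts only.
p = x.count(1) / n
bernoulli_mean = p
bernoulli_sd = math.sqrt(p * (1 - p))

# Ordinary interval-scale formulas: arithmetic on the values themselves.
ordinary_mean = statistics.mean(x)
sample_sd = statistics.stdev(x)        # divides by n - 1
population_sd = statistics.pstdev(x)   # divides by n

# Rescaling the sample SD by sqrt((n - 1) / n) gives the population SD.
rescaled_sd = sample_sd * math.sqrt((n - 1) / n)

print("means:", bernoulli_mean, ordinary_mean)
print("SDs:  ", bernoulli_sd, population_sd, rescaled_sd)
```

The two means agree, and the Bernoulli standard deviation agrees with the population version of the ordinary formula, while the sample standard deviation is slightly larger because of its n-1 divisor.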
Reading in the Stata formatted data file, births.dta, using Stata menus:

    File
      Open
        Find the directory where you copied the course CD:
        Change to the subdirectory datasets & do-files
        Single click on births.dta
        Open

    use births.dta

Requesting a frequency table for the dichotomous variable, lowbw, using Stata menus:

    Statistics
      Summaries, tables & tests
        Tables
          Oneway tables
            Categorical variable: lowbw
            OK

    tabulate lowbw

      low birth |
         weight |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        440       88.00       88.00
              1 |         60       12.00      100.00
    ------------+-----------------------------------
          Total |        500      100.00

We see that the lowbw variable is a 0-1 variable, or Bernoulli variable. Using the Bernoulli formulas, we get

    mean = p = 60/500 = 0.1200
    variance = p(1-p) = 0.1200(0.8800) = 0.1056
    standard deviation = sqrt(p(1-p)) = sqrt(0.1056) = 0.324962

Notice how we just use the counts of the categories, the “Freq.” column of the frequency table, and then do arithmetic on the counts, rather than the values of the variable. That is, we computed these descriptive statistics using only the nominal scale property of the variable (we just counted the frequency of occurrence of the name, or label, given to the variable).

Now, using the ordinary statistical formulas for mean and standard deviation, which were designed for interval scales,

    Statistics
      Summaries, tables & tests
        Summary and descriptive statistics
          Summary statistics
            Variables: lowbw
            Options: standard display
            OK

    summarize lowbw

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
           lowbw |       500         .12     .325287          0          1

We see that the Bernoulli mean is exactly the same as when the ordinary formula for the mean is applied, both giving 0.12.

We see that the Bernoulli standard deviation of 0.324962 does not quite match the ordinary “sample” standard deviation formula value of 0.325287.
However, that is only because the Bernoulli formula is the population formula. The ordinary “population” formula for the standard deviation divides by N rather than N-1,

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

where sigma, $\sigma$, is the population standard deviation and mu, $\mu$, is the population mean.

If we multiply our sample standard deviation by $\sqrt{(n-1)/n}$, then we have the population standard deviation calculation,

$$\sqrt{\frac{n-1}{n}}\, s = \sqrt{\frac{n-1}{n}} \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}},$$

where $\bar{X}$ is assumed to be equal to $\mu$.

When we do that,

    display 0.325287*sqrt(499)/sqrt(500)

    .32496155

which we see is an exact match to the Bernoulli formula, which gave .324962.

So, treating a dichotomous variable as an interval scale works for descriptive statistics. That is, treating a dichotomous variable as an interval scale and then applying the ordinary formulas produces the same result as treating it as a nominal scale Bernoulli variable and then applying the Bernoulli formulas.

Next, let’s see what happens with significance tests, checking whether interval scale significance tests give the same result as categorical significance tests.

Computing a t test, using lowbw as the outcome variable, using Stata menus:

    Statistics
      Summaries, tables & tests
        Classical tests of hypotheses
          Two-group mean-comparison test
            Variable name: lowbw
            Group variable name: sex
            OK

    ttest lowbw , by(sex)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           1 |     264    .1022727    .0186842    .3035821    .0654831    .1390624
           2 |     236    .1398305    .0226235    .3475482    .0952598    .1844012
    ---------+--------------------------------------------------------------------
    combined |     500         .12    .0145473     .325287    .0914185    .1485815
    ---------+--------------------------------------------------------------------
        diff |           -.0375578    .0291209               -.0947728    .0196572
    ------------------------------------------------------------------------------
        diff = mean(1) - mean(2)                                      t =  -1.2897
    Ho: diff = 0                                     degrees of freedom =      498

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0989         Pr(|T| > |t|) = 0.1977          Pr(T > t) = 0.9011

Next, taking the more traditional statistical approach, compare the proportions using a chi-square test. Using Stata menus,

    Statistics
      Summaries, tables & tests
        Tables
          Two-way tables with measures of association
            Row variable: lowbw
            Column variable: sex
            Test statistics: Pearson chi-squared
            Cell contents: Within-column relative frequencies (i.e., column %’s)
            OK

    tabulate lowbw sex , col chi2

    +-------------------+
    | Key               |
    |-------------------|
    |     frequency     |
    | column percentage |
    +-------------------+

     low birth |      sex of baby
        weight |         1          2 |     Total
    -----------+----------------------+----------
             0 |       237        203 |       440
               |     89.77      86.02 |     88.00
    -----------+----------------------+----------
             1 |        27         33 |        60
               |     10.23      13.98 |     12.00
    -----------+----------------------+----------
         Total |       264        236 |       500
               |    100.00     100.00 |    100.00

              Pearson chi2(1) =   1.6645   Pr = 0.197

We discover that the two-tailed p values are nearly identical between the t test (0.1977) and the chi-square test (0.197). Also, notice that the column percents in the crosstabulation table agree with the means in the t-test output. A proportion is nothing more than a mean of a 0-1 scored variable:

$$\text{(mean)} \quad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1 + 0 + \dots + 1}{n} = p \quad \text{(proportion)}$$

So, it works for significance tests.
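The agreement between the two tests can be checked from the four cell counts alone. The following Python sketch (an illustration only, not part of the Stata course materials) recomputes the Pearson chi-square, the z test for two proportions, and the equal-variance t statistic from the counts in the crosstabulation above. The variable names are made up for the example. The p value for the chi-square with 1 degree of freedom is obtained from the complementary error function, and the t test p value is not recomputed because the t distribution is not in Python's standard library.

```python
import math

# Cell counts from the lowbw-by-sex crosstabulation in the text.
a, b = 237, 203   # lowbw = 0: sex 1, sex 2
c, d = 27, 33     # lowbw = 1: sex 1, sex 2
n1, n2 = a + c, b + d
N = n1 + n2

# Pearson chi-square from the 2 x 2 shortcut formula.
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# z test for two proportions, using the pooled proportion p.
p1, p2 = c / n1, d / n2
p = (c + d) / N
z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Equal-variance t test on the 0-1 values. For 0-1 data the sum of squared
# deviations within a group of size n with proportion q is n*q*(1-q).
ss_within = n1 * p1 * (1 - p1) + n2 * p2 * (1 - p2)
s_pooled = math.sqrt(ss_within / (N - 2))
t = (p1 - p2) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))

# Upper-tail p value for a chi-square with 1 df (same as the two-sided z test).
p_chi2 = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.4f}, z^2 = {z * z:.4f}, p = {p_chi2:.4f}")
print(f"t = {t:.4f}, t^2 = {t * t:.4f}")
print(f"chi2*(N-2)/(N-chi2) = {chi2 * (N - 2) / (N - chi2):.4f}")
```

The squared z statistic equals the chi-square exactly. The squared t statistic is not the same number, but it is tied to the chi-square by t² = χ²(N−2)/(N−χ²), which is why the two tests give nearly the same p value when N is large.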
We have verified, then, that treating a dichotomous outcome variable as an interval scale, and then applying ordinary interval scaled significance tests, provides essentially the same result as treating it as a categorical variable and applying categorical variable significance tests.

D’Agostino (1972) published a similar demonstration, comparing one-way ANOVA to the chi-square test. A one-way ANOVA with two groups is identical to the t test, so his demonstration applies to that shown in this chapter. D’Agostino (1972, p. 32) concluded,

“We have seen for the situation studied that the one-way ANOVA procedure and the standard chi-squared procedure are algebraically similar and under the null hypothesis asymptotically equivalent. Pointing this out to students and users of statistical methods may aid substantially in their understanding of statistical methodology. There really are not two distinct ways of handling this problem.”

It seems kind of surprising that the chi-square test, which has the form:

$$\chi^2 = \sum_i \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

gives a nearly identical result to the t test, since they have very different looking formulas. In the chi-square formula, a, b, c, d are the cell counts of the 2 x 2 crosstabulation table, and N is the total sample size (we are only doing arithmetic on the counts of values).

It turns out the two formulas are algebraically very similar, and equivalent in large samples. To see this, first we use the fact that the chi-square test is algebraically identical to the z test for proportions (shown in Chapter 2-4), which has the form:

$$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},$$

where p is the pooled proportion, so that p(1-p) is the pooled variance. This has the same form as the equal variance version of the two-sample t test,

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},$$

where s is the pooled standard deviation. For a 0-1 variable the group means are the group proportions, so the numerators are the same. The two statistics differ only in how the variance is estimated: the z test uses p(1-p) from the combined sample, while the t test pools the within-group variances with n1 + n2 - 2 in the denominator. With large samples the two estimates are almost equal.

Suggested Use of This Knowledge

Do nothing with it.
If you use a t test to compare two proportions, readers and editors, even statistical editors, will think you are incompetent, since they will never have heard about all this. Just be happy with now knowing why you can put a 0-1 variable into a regression equation.

Also, this is why you can include dichotomous variables when you compute a Pearson correlation coefficient, which we will do in a later chapter. The Pearson correlation coefficient assumes both variables are interval scaled, since it does arithmetic on the variables themselves.

References

Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman & Hall/CRC.

Bond TG, Fox CM. (2007). Applying the Rasch Model: Fundamental Measurement in the Human Sciences. 2nd ed. Mahwah, NJ, Lawrence Erlbaum Associates, Publishers.

Cohen J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin 70:426-443.

D’Agostino RB. (1972). Relation between the chi-squared and ANOVA tests for testing the equality of k independent dichotomous populations. The American Statistician 26(3):30-32.

Dexter F, Chestnut DH. (1995). Analysis of statistical tests to compare visual analog scale measurements among groups. Anesthesiology 82(4):896-902.

Heeren T, D’Agostino R. (1987). Robustness of the two independent samples t-test when applied to ordinal scaled data. Stat Med 6:79-90.

McDowell I. (2006). Measuring Health: A Guide to Rating Scales and Questionnaires. 3rd ed. New York, Oxford University Press.

Nunnally JC, Bernstein IH. (1994). Psychometric Theory. 3rd ed. New York, McGraw-Hill Book Company.

Paul-Dauphin A, Guillemin F, Virion J-M, Briancon S. (1999). Bias and precision in visual analogue scales: a randomized controlled trial. Am J Epidemiol 150(10):1117-27.

Sarle WS. (1997). Measurement theory: frequently asked questions. Version 3, Sep 14. URL: ftp://ftp.sas.com/pub/neural/measurement.html

Sullivan LM, D’Agostino RB Sr. (2003). Robustness and power of analysis of covariance applied to ordinal scaled data as arising in randomized controlled trials. Stat Med 22:1317-1334.

Zung WWK. (1965). A self-rating depression scale. Arch Gen Psychiatry 12:63-70.