Measurement

Operationalization is the process by which we manipulate or measure the constructs we are interested in researching. The act of measurement involves assigning numbers to objects or events in such a way that the numbers tell us how much of some attribute a given object or event has. For example, I have an interest in investigating the relationship between misanthropy and attitudes about animal rights, and I have developed questionnaires for measuring both constructs. The value of research involving such questionnaires (or other means of measuring the concepts of interest) depends, in part, on the reliability and validity of those questionnaires. Measurement systems can be characterized in terms of the relationship between the measurements we get and the true amounts of the measured attribute that the measured objects or events have. When interpreting the results of our research, it may be helpful to consider the level of measurement of the data we analyzed. In this unit we shall discuss levels of measurement and the concepts of reliability and validity.

Scales of Measurement

Stanley S. Stevens (1946: On the theory of scales of measurement. Science, 103, 677-680) proposed that any measuring system can be characterized as being nominal, ordinal, interval, or ratio in scale. These scales should be thought of as forming a hierarchy: for each scale there is a criterion or criteria that must be met for a measuring system to be considered to belong to that scale, and each higher scale includes the criteria of all lower scales.

Nominal Scales

The only criterion here is:

1. The numbers (or other symbols) are assigned to objects according to some rule.

The numbers need not indicate how much of any attribute the measured object or event has. For example, the government assigned you a "social security number." That number does not tell you how much of something you have; it simply serves as an alternative name for you. As another example, we might classify a person's religion as A = Protestant, B = Catholic, C = Jewish, D = Other.

Ordinal Scales

Let m(Oi) = our measurement of the amount of some attribute that object i has, and t(Oi) = the true amount of that attribute that object i has. For a measuring scale to be ordinal, the following two criteria must be met:

2. m(O1) ≠ m(O2) only if t(O1) ≠ t(O2). If two measurements are unequal, the true magnitudes (amounts of the measured attribute) are unequal.

3. m(O1) > m(O2) only if t(O1) > t(O2). If measurement 1 is larger than measurement 2, then object 1 has more of the measured attribute than object 2.

If the relationship between the Truth and our Measurements is positive monotonic [whenever T increases, M increases], these criteria are met.

Interval Scales

For a measuring scale to be interval, the above criteria must be met and also a third criterion:

4. The measurement is some positive linear function of the Truth. That is, letting Xi stand for t(Oi) to simplify the notation, m(Oi) = a + bXi, b > 0.

Thus, we may say that t(O1) - t(O2) = t(O3) - t(O4) if m(O1) - m(O2) = m(O3) - m(O4), since the latter is

(a + bX1) - (a + bX2) = (a + bX3) - (a + bX4)
bX1 - bX2 = bX3 - bX4
X1 - X2 = X3 - X4

In other words, a difference of y units on the measuring scale represents the same true amount of the attribute being measured regardless of where on the scale the measurements are taken.
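A brief numerical sketch (in Python, using made-up true scores that are not from the text) may make the distinction concrete: a positive linear transformation of the true scores preserves the equality of differences, whereas a transformation that is merely monotonic preserves only the ordering.

# Sketch: linear vs. merely monotonic transformations of hypothetical true scores.
true = [10.0, 20.0, 30.0, 40.0]            # t(O1)..t(O4); note 20 - 10 == 40 - 30

def linear(x, a=5.0, b=2.0):               # m = a + bX, with b > 0 (interval criterion)
    return a + b * x

def monotonic(x):                          # order-preserving, but not linear
    return x ** 0.5

lin = [linear(x) for x in true]
mon = [monotonic(x) for x in true]

print(lin[1] - lin[0], lin[3] - lin[2])    # 20.0 20.0: equal intervals preserved
print(mon[1] - mon[0], mon[3] - mon[2])    # unequal: only the ordering survives
print(mon == sorted(mon))                  # True: the ordinal criteria are still met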
Consider the following data on how long it takes each of five runners to finish a race:

Runner:             Joe    Lou      Sam    Bob    Wes
Rank:                1      2        3      4      5
(True) Time (sec):   60     60.001   65     75     80

The ranks are ordinal data. In terms of ranks, the difference between Lou and Sam is equal to the difference between Joe and Lou, even though the true difference between Lou and Sam is 4999 times as large as the true difference between Joe and Lou. Assume we used God's stopwatch (or one that produced scores that are a linear function of her stopwatch, which is ultimately a matter of faith) to obtain the Time (sec) measurements above. As interval data, the amount of time represented by 65 minus 60 (Sam vs. Joe) is exactly the same as that represented by 80 minus 75 (Wes vs. Bob). That is, an interval of 5 sec represents the same amount of time regardless of where on the scale it is located.

Note that a linear function is monotonic, but a monotonic function is not necessarily linear. A linear function has a constant slope. The slope of a monotonic function has a constant sign (always positive or always negative) but not necessarily a constant magnitude.

Ratio Scales

In addition to the three criteria already mentioned, a fourth criterion is necessary for a scale to be ratio:

5. a = 0, that is, m(Oi) = bXi; we have a "true zero point." If m(O) = 0, then 0 = bX, thus X must equal 0, since b > 0. In other words, if an object has a measurement of zero, it has absolutely none of the measured attribute.

For ratio data, m(O1)/m(O2) = bX1/bX2 = X1/X2. Thus, we can interpret the ratio of two ratio measurements as the ratio of the true magnitudes. For nonratio, interval data, m(O1)/m(O2) = (a + bX1)/(a + bX2) ≠ X1/X2, since a ≠ 0.

When you worked Gas Law problems in high school [for example, volume of gas held constant, pressure of gas at 10 degrees Celsius given, what is the pressure if the temperature is raised to 20 degrees Celsius?] you had to convert from Celsius to Kelvin because you were working with ratios, and Celsius is not ratio [20 degrees Celsius is not twice as hot as 10 degrees Celsius], but Kelvin is.

Please note that classifying a measuring system into one of these levels is, in part, a matter of faith, because we can never really know the nature of the relationship between the true scores and our measurements. One way around this is to define the true scores in terms of the measurements. For example, I may argue that the construct which I am measuring is some abstraction that is a linear transformation of my measurements.

Reliability

Simply put, a reliable measuring instrument is one which gives you the same measurements when you repeatedly measure the same unchanged objects or events. We shall briefly discuss here methods of estimating an instrument's reliability. The theory underlying this discussion is that which is sometimes called "classical measurement theory." The foundations for this theory were developed by Charles Spearman (1904, "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293).

If a measuring instrument were perfectly reliable, then it would have a perfect positive (r = +1) correlation with the true scores. If you measured an object or event twice, and the true scores did not change, then you would get the same measurement both times. We theorize that our measurements contain random error, but that the mean error is zero. That is, some of our measurements have errors that make them lower than the true scores, while others have errors that make them higher than the true scores, with the sum of the score-decreasing errors being equal to the sum of the score-increasing errors. Accordingly, random error will not affect the mean of the measurements, but it will increase the variance of the measurements.
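This is easy to see in a small simulation. The sketch below (Python; all parameter values are arbitrary choices for illustration, not from the text) adds independent, mean-zero random error to a set of hypothetical true scores: the mean of the measurements is essentially unchanged, but their variance is inflated by the error variance.

import numpy as np

# Simulation sketch of classical measurement theory: measurement = truth + random error.
# All parameter values are arbitrary and chosen only for illustration.
rng = np.random.default_rng(1)
n = 100_000
true = rng.normal(loc=50, scale=10, size=n)         # hypothetical true scores, variance = 100
m1 = true + rng.normal(loc=0, scale=5, size=n)      # one measurement, error variance = 25
m2 = true + rng.normal(loc=0, scale=5, size=n)      # a second, independent measurement

print(true.mean(), m1.mean())        # means nearly identical: the mean error is zero
print(true.var(), m1.var())          # about 100 vs. about 125: error inflates the variance
print(true.var() / m1.var())         # about .80, the proportion of variance due to true scores
print(np.corrcoef(m1, m2)[0, 1])     # correlating two such measurements also gives about .80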
Our definition of reliability is r_XX = σ²_T / σ²_M = σ²_T / (σ²_T + σ²_E) = r²_TM. That is, reliability is the proportion of the variance in the measurement scores that is due to differences in the true scores rather than due to random error. Please note that I have ignored systematic (nonrandom) error, optimistically assuming that it is zero or at least small. Systematic error arises when our instrument consistently measures something other than what it was designed to measure. For example, a test of political conservatism might mistakenly also measure personal stinginess. Also note that I can never know what the reliability of an instrument (a test) is, because I cannot know what the true scores are. I can, however, estimate reliability.

Test-Retest Reliability. The most straightforward method of estimating reliability is to administer the test twice to the same set of subjects and then correlate the two measurements (that at Time 1 and that at Time 2). Pearson r is the index of correlation most often used in this context. If the test is reliable, and the subjects have not changed from Time 1 to Time 2, then we should get a high value of r. We would likely be satisfied if our value of r were at least .70 for instruments used in research, and at least .80 (preferably .90 or higher) for instruments used in practical applications such as making psychiatric diagnoses (see my document Nunnally on Reliability). We would also want the mean and standard deviation not to change appreciably from Time 1 to Time 2. On some tests, however, we would expect some increase in the mean due to practice effects.

Alternate/Parallel Forms Reliability. If there are two or more forms of a test, we want to know that the forms are equivalent (on means, standard deviations, and correlations with other measures) and highly correlated. The r between alternate forms can be used as an estimate of the tests' reliability.

Split-Half Reliability. It may be prohibitively expensive or inconvenient to administer a test twice to estimate its reliability. Also, practice effects or other changes between Time 1 and Time 2 might invalidate test-retest estimates of reliability. An alternative approach is to correlate scores on one random half of the items on the test with the scores on the other random half. That is, just divide the items into two groups, compute each subject's score on each half, and correlate the two sets of scores. This is like computing an alternate-forms estimate of reliability after producing two alternate forms (split halves) from a single test. We shall call this coefficient the half-test reliability coefficient, r_hh.

Spearman-Brown. One problem with the split-half reliability coefficient is that it is based on alternate forms that have only one-half the number of items that the full test has. Reducing the number of items on a test generally reduces its reliability coefficient. To get a better estimate of the reliability of the full test, we apply the Spearman-Brown correction, r_sb = 2r_hh / (1 + r_hh).
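As a concrete sketch of these two steps (Python; the item scores are invented purely for illustration), the code below correlates two halves of a short test and then applies the Spearman-Brown correction.

import numpy as np

# Hypothetical scores of 6 subjects on a 4-item test (rows = subjects, columns = items).
# These numbers are invented solely to illustrate the computations.
items = np.array([[4, 5, 4, 5],
                  [2, 3, 3, 2],
                  [5, 4, 5, 5],
                  [1, 2, 1, 2],
                  [3, 3, 4, 3],
                  [4, 4, 3, 4]], dtype=float)

half_a = items[:, [0, 1]].sum(axis=1)       # each subject's score on one half of the items
half_b = items[:, [2, 3]].sum(axis=1)       # each subject's score on the other half

r_hh = np.corrcoef(half_a, half_b)[0, 1]    # half-test reliability coefficient
r_sb = 2 * r_hh / (1 + r_hh)                # Spearman-Brown corrected estimate for the full test

print(round(r_hh, 3), round(r_sb, 3))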
Cronbach's Coefficient Alpha. Another problem with the split-half method is that the reliability estimate obtained using one pair of random halves of the items is likely to differ from that obtained using another pair of random halves of the items. Which random half is the one we should use? One solution to this problem is to compute the Spearman-Brown corrected split-half reliability coefficient for every one of the possible split halves and then find the mean of those coefficients. This mean is known as Cronbach's coefficient alpha. Instructions for computing it can be found in my document Cronbach's Alpha and Maximized Lambda4. You should learn to use SPSS to compute Cronbach's alpha.

Maximized Lambda4. H. G. Osburn (Coefficient alpha and related internal consistency reliability coefficients, Psychological Methods, 2000, 5, 343-355) noted that coefficient alpha is a lower bound to the true reliability of a measuring instrument, and that it may seriously underestimate the true reliability. Osburn used Monte Carlo techniques to study a variety of alternative methods of estimating reliability from internal consistency and concluded that maximized lambda4 (λ4) was the most consistently accurate of the techniques. λ4 is the r_sb for one pair of split halves of the instrument. To obtain maximized λ4, one simply computes λ4 for all possible split halves and then selects the largest obtained value of λ4. The problem is that the number of possible split halves is (2n)! / [2(n!)²] for a test with 2n items. If there are only four or five items, this is tedious but not unreasonably difficult. If there are more than four or five items, computing maximized λ4 is unreasonably difficult, but it can be estimated; see my document Estimating Maximized Lambda4.
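For a small item set both coefficients can be computed directly. The sketch below (Python; the item-score matrix is hypothetical) obtains coefficient alpha with the usual computational formula, k/(k-1) × (1 - Σ item variances / variance of total score), and obtains maximized λ4 by brute force over all possible split halves, as described above.

import numpy as np
from itertools import combinations

# Hypothetical scores of 6 subjects on a 4-item test; numbers invented for illustration.
items = np.array([[4, 5, 4, 5],
                  [2, 3, 3, 2],
                  [5, 4, 5, 5],
                  [1, 2, 1, 2],
                  [3, 3, 4, 3],
                  [4, 4, 3, 4]], dtype=float)
k = items.shape[1]
total = items.sum(axis=1)

# Coefficient alpha via the usual computational formula.
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))

# Maximized lambda4: Spearman-Brown corrected r for every possible split half; keep the largest.
def corrected_split_half(cols_a):
    cols_b = [c for c in range(k) if c not in cols_a]
    a = items[:, list(cols_a)].sum(axis=1)
    b = items[:, cols_b].sum(axis=1)
    r = np.corrcoef(a, b)[0, 1]
    return 2 * r / (1 + r)

# Fixing item 0 in the first half avoids counting each split twice; with 2n = 4 items
# there are (2n)!/[2(n!)^2] = 3 distinct splits.
splits = [c for c in combinations(range(k), k // 2) if 0 in c]
lambda4_max = max(corrected_split_half(c) for c in splits)

print(round(alpha, 3), round(lambda4_max, 3))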
Intra-Rater Reliability. Sometimes your measuring instrument is a person who is rating the subjects on one or more attributes. If you are using a single rater, you could, under some circumstances, evaluate his or her reliability with a test-retest reliability coefficient. For example, suppose the rater is viewing videotapes of children playing and is rating how aggressive each child is. You could have the rater view the videotapes twice, each time rating a child, and then correlate the ratings from the first viewing with those from the second viewing.

Inter-Rater Reliability. If you have two or more raters, you will want to estimate reliability using something more like an alternate-forms reliability coefficient. That is, you want to demonstrate that the ratings are pretty much the same whether the one person or the other person is doing the rating. When the measurement variable is categorical in nature, one may compute the percentage of agreement as an inter-rater reliability statistic. For example, suppose that each of two persons were observing children at play and at a designated time or interval of time determining whether or not the target child was involved in a fight. Furthermore, if the rater decided a fight was in progress, the target child was classified as being the aggressor or the victim. Consider the following hypothetical data (expected frequencies are shown in parentheses on the main diagonal):

                                Rater 2
              No Fight      Aggressor     Victim      marginal
Rater 1
  No Fight    70 (54.75)        3            2            75
  Aggressor    2              6 (2.08)       5            13
  Victim       1                7          4 (1.32)       12
  marginal    73               16           11           100

The percentage of agreement here is pretty good, (70 + 6 + 4)/100 = 80%, but not all is rosy. The raters have done a pretty good job of agreeing regarding whether the child is fighting or not, but there is considerable disagreement between the raters with respect to whether the child is the aggressor or the victim.

Jacob Cohen developed a coefficient of agreement, kappa, that corrects the percentage-of-agreement statistic for the tendency to get high values by chance alone when one of the categories is very frequently chosen by both raters. On the main diagonal of the table above I have entered in parentheses the number of agreements that would be expected by chance alone given the marginal totals. Each of these expected frequencies is computed by taking the marginal total for the column the cell is in, multiplying it by the marginal total for the row the cell is in, and then dividing by the total count. For example, for the No Fight-No Fight cell, (73)(75)/100 = 54.75. Kappa is then computed as

κ = (ΣO - ΣE) / (N - ΣE),

where the O's are the observed frequencies on the main diagonal, the E's are the expected frequencies on the main diagonal, and N is the total count. For our data,

κ = [(70 + 6 + 4) - (54.75 + 2.08 + 1.32)] / [100 - (54.75 + 2.08 + 1.32)] = 21.85 / 41.85 = 0.52,

which is not so impressive. More impressive would be these data, for which kappa is 0.82:

                                Rater 2
              No Fight      Aggressor     Victim      marginal
Rater 1
  No Fight    70 (52.56)        0            2            72
  Aggressor    2             13 (2.40)       1            16
  Victim       1                2          9 (1.44)       12
  marginal    73               15           12           100
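These computations are easy to verify with a few lines of code. Here is a sketch in Python that reproduces kappa for both tables above.

import numpy as np

def cohen_kappa(table):
    # table is a square agreement table: rows = Rater 1, columns = Rater 2
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n  # chance frequencies
    return (np.trace(table) - np.trace(expected)) / (n - np.trace(expected))

table1 = [[70, 3, 2], [2, 6, 5], [1, 7, 4]]       # first table: 80% agreement
table2 = [[70, 0, 2], [2, 13, 1], [1, 2, 9]]      # second table: fewer disagreements

print(round(cohen_kappa(table1), 2))   # 0.52
print(round(cohen_kappa(table2), 2))   # 0.82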
If your ratings are ranks, you can estimate inter-rater reliability with Spearman rho or Kendall's tau if there are two raters, or with Kendall's coefficient of concordance if there are more than two raters. If your ratings can be treated as a continuous variable, you can, for two raters, simply compute Pearson r between the two sets of ratings. For two or more raters, a better procedure is to compute an intraclass correlation coefficient. It is possible for ratings to be well correlated with one another even though the raters do not agree on the magnitude of the measured attribute. The intraclass correlation coefficient takes this into account, producing a better coefficient for measuring degree of agreement. Please read my document Inter-Rater Agreement for more details on this.

Construct Validity

Simply put, the construct validity of an operationalization (a measurement or a manipulation) is the extent to which it really measures (or manipulates) what it claims to measure (or manipulate). When the dimension being measured is an abstract construct that is inferred from directly observable events, then we may speak of "construct validity."

Face Validity. An operationalization has face validity when others agree that it looks like it does measure or manipulate the construct of interest. For example, if I tell you that I am manipulating my subjects' sexual arousal by having them drink a pint of isotonic saline solution, you would probably be skeptical. On the other hand, if I told you I was measuring my male subjects' sexual arousal by measuring erection of their penises, you would probably consider that measurement to have face validity.

Content Validity. Assume that we can detail the entire population of behaviors (or other things) that an operationalization is supposed to capture. Now consider our operationalization to be a sample taken from that population. Our operationalization will have content validity to the extent that the sample is representative of the population. To measure content validity we can do our best to describe the population of interest and then ask experts (people who should know about the construct of interest) to judge how representative our sample is of that population.

Criterion-Related Validity. Here we test the validity of our operationalization by seeing how it is related to other variables. Suppose that we have developed a test of statistics ability. We might employ the following types of criterion-related validity:

Concurrent Validity. Are scores on our instrument strongly correlated with scores on concurrent variables (variables that are measured at the same time)? For our example, we should be able to show that students who just finished a stats course score higher than those who have never taken a stats course. Also, we should be able to show a strong correlation between scores on our test and students' current level of performance in a stats class.

Predictive Validity. Can our instrument predict future performance on an activity that is related to the construct we are measuring? For our example, is there a strong correlation between scores on our test and the subsequent performance of employees in an occupation that requires the use of statistics?

Convergent Validity. Is our instrument well correlated with measures of other constructs to which it should, theoretically, be related? For our example, we might expect scores on our test to be well correlated with tests of logical thinking, abstract reasoning, verbal ability, and, to a lesser extent, mathematical ability.

Discriminant Validity. Is our instrument only weakly correlated with measures of other constructs to which it should not be related? For example, we might expect scores on our test not to be well correlated with tests of political conservatism, ethical ideology, love of Italian food, and so on.

Threats to Construct Validity

Inadequate Preoperational Explication. Here our failure is with respect to defining the population of behaviors that represents the construct. For example, if I want to construct a measure of "love of Italian cuisine," it might be a mistake to sample only items that have to do with pizza, spaghetti, and ravioli. Would my results generalize to a broader definition of Italian cuisine?

Mono-Operation Bias. Here the problem is that we have relied on only one method of manipulating the construct of interest. Suppose I am interested in the effects of physical exercise. My research involves only one type of exercise, jogging. Would my results generalize to other types of exercise?

Mono-Method Bias. Here the problem is that we use only one means of measuring a construct. For example, I use a self-report questionnaire for measuring love of Italian cuisine. Would my results be different if I used direct observations of what my subjects ordered when eating out, if I asked their friends what sort of meals my subjects cooked, or if I investigated what types of cookbooks were on my subjects' shelves?

Interaction of Different Treatments. The problem here is that subjects exposed to one treatment may also have been exposed to other treatments, and those other treatments may have altered the effect of the treatment of interest. For example, a researcher wants to test the effect of a new drug on his colony of monkeys. These monkeys have previously been used in other drug experiments. Would their response to the new drug be the same if they had not been used in those other experiments?
Testing x Treatment Interaction. This is a problem only when subjects are given a pretest, then an experimental treatment, then a posttest. The question is whether or not the pretesting affected the subjects' response to the experimental treatment.

Restricted Generalizability Across Constructs. The problem here is that an experimental treatment might have unanticipated effects on constructs that you did not measure. For example, the drug we are testing for effectiveness in treating depression might also affect a variety of other traits, but if we did not measure those other traits, then we cannot fully describe the effects of the drug.

Confounding Constructs with Levels of Constructs. This problem involves our thinking that the few levels of our experimental variable used in our study fairly represent the entire range of possible levels of the construct. For example, suppose that we are interested in the effects of test anxiety on performance of a complex cognitive task. We manipulate our subjects' test anxiety with only two levels, low and medium. We find that their performance is higher with medium anxiety than with low anxiety and conclude that test anxiety is positively correlated with performance. We might have found a more complex relationship if we had used different levels of anxiety. For example, if we had used low, medium, and high anxiety, we might have found a curvilinear relationship, with performance decreasing as anxiety increased beyond moderate levels.

Social Threats. The basic problem here is that subjects who are observed may not behave the same as subjects who are not observed. Is the effect of your experimental treatment due to or affected by the subjects' knowing that you are observing them? Here are some examples of threats of this type:

o Hypothesis Guessing. Your subjects guess what your hypotheses are and then either try to help you confirm them (the good-guy effect) or do their best to give you data that would lead to disconfirmation (the screw-you effect). Whether the subjects guessed correctly or not, their behavior is affected by their thoughts regarding what your hypotheses are.

o Evaluation Apprehension. Subjects may get nervous when they know that you are evaluating them. They may even be afraid that you, the psychologist, will be able to see their innermost secrets. Such apprehension may well affect their behavior.

o Expectancy Effects. Both the subjects and the experimenter may have expectancies about the effects of the experimental treatments. In one classic experiment on experimenter-expectancy effects, graduate students were given rats to test in a maze. They were told that there were two different groups of rats, one group composed of stupid rats and the other of smart rats. The students were also told to which group each tested rat belonged. As expected, the smart rats did much better in the maze than did the stupid rats, but guess what -- there were not really two different groups of rats; the rats had all been randomly chosen from one group. To avoid such problems, the data collector may be kept ignorant with respect to which group each subject is in. Likewise, if the subject knows which group he is in, his behavior may be affected by that expectation, a subject-expectancy problem. Accordingly, the researcher may take steps to prevent the subjects from knowing which group they are in. For example, in drug trials, subjects who are in the group that does not get the drug treatment may be given a placebo to take, a substance that has no effect on the dependent variable.
This is intended to make it harder for the subject to figure out whether he is getting the experimental drug or not. When both the data collector and the subjects are kept in the dark with respect to who got what treatments, the experiment is said to be a double-blind experiment.

Copyright 2011, Karl L. Wuensch - All rights reserved.