Measurement
Operationalization is the process by which we manipulate or measure the
constructs we are interested in researching. The act of measurement involves our
assigning numbers to objects or events in such a way that the numbers tell us how
much of some attribute a given object or event has.
I have an interest in investigating the relationship between misanthropy and
attitudes about animal rights. I have developed questionnaires for measuring
misanthropy and attitudes about animal rights. The value of research involving such
questionnaires (or other means of measuring the concepts of interest) depends, in part,
on the reliability and validity of those questionnaires.
Measurement systems can be characterized in terms of the relationship between
the measurements we get and the true amounts of the measured attribute that the
measured objects or events have. When interpreting the results of our research it may
be helpful for us to consider the level of measurement for the data we analyzed.
In this unit we shall discuss levels of measurement and the concepts of reliability
and validity.
Scales of Measurement
Stanley S. Stevens (1946: On the theory of scales of measurement. Science,
103, 677-680) proposed that any measuring system can be characterized as being
nominal, ordinal, interval, or ratio in scale. These scales should be thought of as part of
a hierarchy. For each scale there is a criterion or criteria that must be met for a
measuring system to be considered to belong to that scale, and each higher scale
includes the criteria for all lower scales.
Nominal Scales
The only criterion here is:
1. The numbers (or other symbols) are assigned to objects according to some
rules.
The numbers need not indicate how much of any attribute the measured object
or event has. For example, the government assigned you a “social security number.”
That number does not tell you how much of something you have; it simply serves as an
alternative name for you. As another example, we might classify a person’s religion as
A = Protestant, B = Catholic, C = Jewish, D = Other.
Ordinal Scales
Let
m(Oi ) = our measurement of the amount of some attribute that object i has, and
t(Oi ) = the true amount of that attribute that object i has

For a measuring scale to be ordinal, the following two criteria must be met:
2. m(O1) ≠ m(O2) only if t(O1) ≠ t(O2). If two measurements are unequal, the true magnitudes (amounts of the measured attribute) are unequal.
3. m(O1) > m(O2) only if t(O1) > t(O2). If measurement 1 is larger than measurement 2, then object 1 has more of the measured attribute than object 2.
If the relationship between the Truth and our Measurements is positive
monotonic [whenever T increases, M increases], these criteria are met.
Interval Scales
For a measuring scale to be interval, the above criteria must be met and also a
third criterion:
4. The measurement is some positive linear function of the Truth. That is,
letting Xi stand for t(Oi ) to simplify the notation: m(Oi ) = a + bXi, b > 0
Thus, we may say that t(O1) - t(O2) = t(O3) - t(O4) if m(O1) - m(O2) = m(O3) - m(O4), since the latter is (a + bX1) - (a + bX2) = (a + bX3) - (a + bX4) ⇒ bX1 - bX2 = bX3 - bX4 ⇒ X1 - X2 = X3 - X4.
In other words, a difference of y units on the measuring scale represents the
same true amount of the attribute being measured regardless of where on the scale the
measurements are taken. Consider the following data on how long it takes a runner to
finish a race:
Runner:              Joe     Lou      Sam     Bob     Wes
Rank:                 1       2        3       4       5
(True) Time (sec):   60     60.001    65      75      80
The ranks are ordinal data. In terms of ranks, the difference between Lou and
Sam is equal to the difference between Joe and Lou, even though the true difference
between Lou and Sam is 4999 times as large as the true difference between Joe and
Lou. Assume we used God’s stopwatch (or one that produced scores that are a linear
function of her stopwatch, which is ultimately a matter of faith) to obtain the TIME (sec)
measurements above. As Interval data, the amount of time represented by 65 minus
60, Sam vs Joe, is exactly the same as that represented by 80 minus 75, Wes vs Bob.
That is, an interval of 5 sec represents the same amount of time regardless of where on
the scale it is located.
Note that a linear function is monotonic, but a monotonic function is not
necessarily linear. A linear function has a constant slope. The slope of a monotonic
function has a constant sign (always positive or always negative) but not necessarily a
constant magnitude.
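To make the distinction concrete, here is a small Python sketch using the runner data from the table above: converting the times to ranks is a monotonic (order-preserving) transformation, but it discards the interval information.

```python
# Ranks preserve order but not interval information: a sketch of the runner example.
times = {"Joe": 60.0, "Lou": 60.001, "Sam": 65.0, "Bob": 75.0, "Wes": 80.0}

# Ranks: 1 for the fastest time, 5 for the slowest.
ordered = sorted(times, key=times.get)
ranks = {runner: i + 1 for i, runner in enumerate(ordered)}

# Equal rank differences ...
print(ranks["Lou"] - ranks["Joe"], ranks["Sam"] - ranks["Lou"])   # 1 1
# ... but wildly unequal true differences (0.001 sec versus 4.999 sec).
print(round(times["Lou"] - times["Joe"], 3),
      round(times["Sam"] - times["Lou"], 3))                      # 0.001 4.999
```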
Ratio Scales
In addition to the three criteria already mentioned, a fourth criterion is necessary
for a scale to be ratio:
5. a = 0, that is m(Oi ) = bXi, that is, we have a “true zero point.”
If m(O) = 0, then 0 = bX, thus X must = 0 since b > 0. In other words, if an object has a measurement of zero, it has absolutely none of the measured attribute.
For ratio data, m(O1) / m(O2) = bX1 / bX2 = X1 / X2. Thus, we can interpret the ratios of ratio measurements as the ratios of the true magnitudes.
For nonratio, interval data, m(O1) / m(O2) = (a + bX1) / (a + bX2) ≠ X1 / X2, since a ≠ 0.
When you worked Gas Law problems in high school [for example, volume of gas held constant, pressure of gas at 10 degrees Celsius given, what is the pressure if the temperature is raised to 20 degrees Celsius?] you had to convert from Celsius to Kelvin because you were working with ratios, and Celsius is not a ratio scale [20 degrees Celsius is not twice as hot as 10 degrees Celsius], but Kelvin is.
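Here is a small Python sketch of that Gas Law computation (the starting pressure of 1 atm is made up for illustration): forming the temperature ratio on the Kelvin scale gives a sensible answer, while forming it on the Celsius scale absurdly doubles the pressure.

```python
# Pressure at constant volume is proportional to absolute (Kelvin) temperature,
# so the ratio must be taken on the Kelvin (ratio) scale, not Celsius (interval).
p1_atm = 1.00             # hypothetical starting pressure
t1_c, t2_c = 10.0, 20.0   # Celsius temperatures from the example

# Correct: convert to Kelvin (true zero point) before forming the ratio.
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
p2_correct = p1_atm * (t2_k / t1_k)   # about 1.035 atm

# Incorrect: treating Celsius as if it were ratio doubles the pressure.
p2_wrong = p1_atm * (t2_c / t1_c)     # 2.0 atm -- nonsense

print(round(p2_correct, 3), p2_wrong)
```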
Please note that classifying a measuring system into one of these levels is, in part, a matter of faith, because we cannot ever really know the nature of the relationship between the true scores and our measurements. One way around this is to define the true scores in terms of the measurements -- for example, I may argue that the construct which I am measuring is some abstraction that is a linear transformation of my measurements.
Reliability
Simply put, a reliable measuring instrument is one which gives you the same
measurements when you repeatedly measure the same unchanged objects or events.
We shall briefly discuss here methods of estimating an instrument’s reliability. The
theory underlying this discussion is that which is sometimes called “classical
measurement theory.” The foundations for this theory were developed by Charles
Spearman (1904: “General intelligence,” objectively determined and measured.
American Journal of Psychology, 15, 201-293).
If a measuring instrument were perfectly reliable, then it would have a perfect
positive (r = +1) correlation with the true scores. If you measured an object or event
twice, and the true scores did not change, then you would get the same measurement
both times.
We theorize that our measurements contain random error, but that the mean
error is zero. That is, some of our measurements have errors that make them lower than the true scores, but others have errors that make them higher than the true scores, with the sum of the score-decreasing errors being equal to the sum of the score-increasing
errors. Accordingly, random error will not affect the mean of the measurements, but it
will increase the variance of the measurements.
Our definition of reliability is rXX = σ²T / σ²M = σ²T / (σ²T + σ²E) = r²TM. That is, reliability is the proportion of the variance in the measurement scores that is due to differences in the true scores rather than due to random error.
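A small Python simulation (with made-up true-score and error variances) can illustrate this definition: if M = T + E with random error E that is independent of T, then Var(T) / Var(M) and the squared correlation between T and M come out essentially equal.

```python
# A quick simulation sketch of the classical definition of reliability.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_scores = rng.normal(loc=50, scale=10, size=n)   # T, sd = 10
error = rng.normal(loc=0, scale=5, size=n)           # E, sd = 5, mean zero
measurements = true_scores + error                   # M = T + E

reliability = true_scores.var() / measurements.var()   # about 100 / 125 = .80
r_tm = np.corrcoef(true_scores, measurements)[0, 1]
print(round(reliability, 3), round(r_tm ** 2, 3))      # both approximately .80
```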
Please note that I have ignored systematic (nonrandom) error, optimistically
assuming that it is zero or at least small. Systematic error arises when our instrument
consistently measures something other than what it was designed to measure. For
example, a test of political conservatism might mistakenly also measure personal
stinginess.
Also note that I can never know what the reliability of an instrument (a test) is,
because I cannot know what the true scores are. I can, however, estimate reliability.
Test-Retest Reliability. The most straightforward method of estimating
reliability is to administer the test twice to the same set of subjects and then correlate
the two measurements (that at Time 1 and that at Time 2). Pearson r is the index of
correlation most often used in this context. If the test is reliable, and the subjects have
not changed from Time 1 to Time 2, then we should get a high value of r. We would
likely be satisfied if our value of r were at least .70 for instruments used in research, at
least .80 (preferably .90 or higher) for instruments used in practical applications such as
making psychiatric diagnoses (see my document Nunnally on Reliability). We would
also want the mean and standard deviation not to change appreciably from Time 1 to
Time 2. On some tests, however, we would expect some increase in the mean due to
practice effects.
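For illustration, here is a minimal Python sketch of a test-retest estimate; the Time 1 and Time 2 scores below are hypothetical.

```python
# Test-retest reliability: correlate Time 1 and Time 2 scores of the same subjects.
from scipy.stats import pearsonr

time1 = [12, 18, 15, 22, 9, 17, 20, 14, 11, 19]   # made-up Time 1 scores
time2 = [13, 17, 16, 21, 10, 18, 19, 15, 10, 20]  # made-up Time 2 scores

r, p = pearsonr(time1, time2)
print(round(r, 2))   # for research use we would want this to be at least about .70
```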
Alternate/Parallel Forms Reliability. If there are two or more forms of a test, we
want to know that the two forms are equivalent (on means, standard deviations, and
correlations with other measures) and highly correlated. The r between alternate forms
can be used as an estimate of the tests’ reliability.
Split-Half Reliability. It may be prohibitively expensive or inconvenient to
administer a test twice to estimate its reliability. Also, practice effects or other changes
between Time 1 and Time 2 might invalidate test-retest estimates of reliability. An
alternative approach is to correlate scores on one random half of the items on the test
with the scores on the other random half. That is, just divide the items up into two
groups, compute each subject’s score on each half, and correlate the two sets of
scores. This is like computing an alternate forms estimate of reliability after producing
two alternate forms (split-halves) from a single test. We shall call this coefficient the
half-test reliability coefficient, rhh.
Spearman-Brown. One problem with the split-half reliability coefficient is that it is based on alternate forms that have only one-half the number of items that the full test has. Reducing the number of items on a test generally reduces its reliability coefficient. To get a better estimate of the reliability of the full test, we apply the Spearman-Brown correction, rsb = 2rhh / (1 + rhh).
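Here is a small Python sketch (with simulated item scores) of the split-half procedure followed by the Spearman-Brown correction.

```python
# Split-half reliability with the Spearman-Brown correction, on simulated data.
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_items = 50, 10
ability = rng.normal(size=(n_subjects, 1))
items = ability + rng.normal(scale=1.0, size=(n_subjects, n_items))  # fake item scores

# Split the items into two random halves and score each half.
cols = rng.permutation(n_items)
half1 = items[:, cols[: n_items // 2]].sum(axis=1)
half2 = items[:, cols[n_items // 2 :]].sum(axis=1)

r_hh = np.corrcoef(half1, half2)[0, 1]   # half-test reliability coefficient
r_sb = 2 * r_hh / (1 + r_hh)             # Spearman-Brown estimate for the full test
print(round(r_hh, 2), round(r_sb, 2))
```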
Cronbach’s Coefficient Alpha. Another problem with the split-half method is
that the reliability estimate obtained using one pair of random halves of the items is
likely to differ from that obtained using another pair of random halves of the items.
Which pair of random halves should we use? One solution to this problem is to
compute the Spearman-Brown corrected split-half reliability coefficient for every one of
the possible split-halves and then find the mean of those coefficients. This mean is
known as Cronbach’s coefficient alpha. Instructions for computing it can be found in my
document Cronbach’s Alpha and Maximized Lambda4. You should learn to use SPSS
to compute Cronbach’s alpha.
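For illustration, here is a short Python function implementing the standard variance-based computational formula for alpha (the same coefficient that SPSS's RELIABILITY procedure reports); the item scores fed to it below are simulated.

```python
# Cronbach's coefficient alpha from a subjects-by-items score matrix.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: rows = subjects, columns = items."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    ability = rng.normal(size=(100, 1))
    items = ability + rng.normal(scale=1.0, size=(100, 8))  # hypothetical item scores
    print(round(cronbach_alpha(items), 2))
```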
Maximized Lambda4. H. G. Osburn (Coefficient alpha and related internal consistency reliability coefficients, Psychological Methods, 2000, 5, 343-355) noted that coefficient alpha is a lower bound to the true reliability of a measuring instrument, and that it may seriously underestimate the true reliability. Osburn used Monte Carlo techniques to study a variety of alternative methods of estimating reliability from internal consistency and concluded that maximized lambda4 was the most consistently accurate of the techniques.
λ4 is the rsb for one pair of split-halves of the instrument. To obtain maximized λ4, one simply computes λ4 for all possible split-halves and then selects the largest obtained value of λ4. The problem is that the number of possible split-halves is .5(2n)! / (n!)² for a test with 2n items. If there are only four or five items, this is tedious but not unreasonably difficult. If there are more than four or five items, computing maximized λ4 is unreasonably difficult, but it can be estimated -- see my document Estimating Maximized Lambda4.
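Here is a brute-force Python sketch (using a small set of simulated items) that counts the possible splits and finds maximized λ4 by trying every split.

```python
# Maximized lambda4 by brute force: compute the Spearman-Brown corrected
# split-half coefficient for every possible split and keep the largest.
from itertools import combinations
from math import factorial
import numpy as np

rng = np.random.default_rng(4)
ability = rng.normal(size=(60, 1))
items = ability + rng.normal(scale=1.0, size=(60, 6))   # 2n = 6 made-up items

n_items = items.shape[1]
half_size = n_items // 2
# Number of distinct splits: .5 * (2n)! / (n!)^2 = 10 for six items.
print(factorial(n_items) // (2 * factorial(half_size) ** 2))

best = -1.0
# Each split is visited twice (once per half); harmless for finding the maximum.
for half in combinations(range(n_items), half_size):
    other = [j for j in range(n_items) if j not in half]
    s1 = items[:, list(half)].sum(axis=1)
    s2 = items[:, other].sum(axis=1)
    r_hh = np.corrcoef(s1, s2)[0, 1]
    best = max(best, 2 * r_hh / (1 + r_hh))
print(round(best, 2))   # maximized lambda4
```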
Intra-Rater Reliability. Sometimes your measuring instrument is a person who
is rating the subjects on one or more attributes. If you are using a single rater, you
could, under some circumstances, evaluate his or her reliability with a test-retest
reliability coefficient. For example, suppose the rater is viewing videotapes of children
playing and is rating how aggressive a child is. You could have the rater view the
videotapes twice, each time rating a child, and then correlate the ratings from the first
viewing with those from the second viewing.
Inter-Rater Reliability. If you have two or more raters, you will want to estimate
reliability using something more like an alternate forms reliability coefficient. That is,
you want to demonstrate that the ratings are pretty much the same whether the one
person or the other person is doing the ratings.
When the measurement variable is categorical in nature, one may compute the
percentage of agreement as an inter-rater reliability statistic. For example, suppose that
each of two persons were observing children at play and at a designated time or interval
of time determining whether or not the target child was involved in a fight. Furthermore,
if the rater decided a fight was in progress, the target child was classified as being the
aggressor or the victim. Consider the following hypothetical data:
                           Rater 2
               No Fight    Aggressor    Victim    marginal
 Rater 1
  No Fight    70 (54.75)       3           2          75
  Aggressor        2        6 (2.08)       5          13
  Victim           1           7        4 (1.32)      12
  marginal        73          16          11         100
The percentage of agreement here is pretty good, (70 + 6 + 4) / 100 = 80%, but all is not rosy here. The raters have done a pretty good job of agreeing regarding whether or not the child is fighting, but there is considerable disagreement between the raters with respect to whether the child is the aggressor or the victim.
Jacob Cohen developed a coefficient of agreement, kappa, that corrects the
percentage of agreement statistic for the tendency to get high values by chance alone
when one of the categories is very frequently chosen by both raters. On the main
diagonal of the table above I have entered in parentheses the number of agreements
that would be expected by chance alone given the marginal totals. Each of these
expected frequencies is computed by taking the marginal total for the column the cell is
in, multiplying it by the marginal total for the row the cell is in, and then dividing by the
total count. For example, for the No Fight-No Fight cell, (73)(75) / 100 = 54.75. Kappa is then computed as: κ = (ΣO − ΣE) / (N − ΣE), where the O’s are observed frequencies on the main diagonal, the E’s are expected frequencies on the main diagonal, and N is the total count. For our data, κ = (70 + 6 + 4 − 54.75 − 2.08 − 1.32) / (100 − 54.75 − 2.08 − 1.32) = 21.85 / 41.85 = 0.52, which is not so impressive.
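Here is a small Python sketch that reproduces the percentage of agreement and kappa for the table above (rows = Rater 1, columns = Rater 2).

```python
# Percentage of agreement and Cohen's kappa for the first table above.
import numpy as np

table = np.array([[70, 3, 2],
                  [2, 6, 5],
                  [1, 7, 4]])

n = table.sum()                              # 100
observed_agreement = np.trace(table) / n     # (70 + 6 + 4) / 100 = .80

row_marginals = table.sum(axis=1)
col_marginals = table.sum(axis=0)
expected_diag = (row_marginals * col_marginals) / n   # 54.75, 2.08, 1.32
kappa = (np.trace(table) - expected_diag.sum()) / (n - expected_diag.sum())
print(observed_agreement, round(kappa, 2))   # 0.8 and about 0.52
```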
More impressive would be these data, for which kappa is 0.82:
                           Rater 2
               No Fight    Aggressor    Victim    marginal
 Rater 1
  No Fight    70 (52.56)       0           2          72
  Aggressor        2       13 (2.40)       1          16
  Victim           1           2        9 (1.44)      12
  marginal        73          15          12         100
If your ratings are ranks, you can estimate inter-rater reliability with Spearman
rho or Kendall’s tau if there are two raters, or with Kendall’s coefficient of concordance if
there are more than two raters.
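For example, with two raters’ rankings (the ranks below are made up), Spearman rho and Kendall tau can be obtained from SciPy as follows.

```python
# Inter-rater reliability for ranked ratings by two raters.
from scipy.stats import spearmanr, kendalltau

rater1 = [1, 2, 3, 4, 5, 6]   # made-up rankings of six subjects
rater2 = [2, 1, 3, 5, 4, 6]

rho, _ = spearmanr(rater1, rater2)
tau, _ = kendalltau(rater1, rater2)
print(round(rho, 2), round(tau, 2))
```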
If your ratings can be treated as a continuous variable, you can, for two raters,
simply compute Pearson r between the two sets of ratings. For two or more raters a
better procedure is to compute an intraclass correlation coefficient. It is possible for
ratings to be well correlated with one another even though the raters do not agree on the magnitude of the measured attribute. The intraclass correlation coefficient takes this into account,
producing a better coefficient for measuring degree of agreement. Please read my
document Inter-Rater Agreement for more details on this.
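The following Python sketch illustrates the point with made-up ratings in which Rater 2 always rates two points higher than Rater 1: Pearson r is a perfect 1.0, while an absolute-agreement intraclass correlation (computed here with the two-way, absolute-agreement, single-measures form often labeled ICC(A,1)) is noticeably lower.

```python
# Pearson r versus an absolute-agreement intraclass correlation.
import numpy as np

ratings = np.array([[2, 4],
                    [4, 6],
                    [6, 8],
                    [8, 10],
                    [10, 12]], dtype=float)   # rows = subjects, columns = raters

r = np.corrcoef(ratings[:, 0], ratings[:, 1])[0, 1]   # Pearson r = 1.0

# Two-way ANOVA mean squares (no replication) for the ICC.
n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1)
col_means = ratings.mean(axis=0)
ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
residual = ratings - row_means[:, None] - col_means[None, :] + grand
ms_error = (residual ** 2).sum() / ((n - 1) * (k - 1))

icc_a1 = (ms_rows - ms_error) / (
    ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
)
print(round(r, 2), round(icc_a1, 2))   # 1.0 versus about 0.83
```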
Construct Validity
Simply put, the construct validity of an operationalization (a measurement or a
manipulation) is the extent to which it really measures (or manipulates) what it claims to
measure (or manipulate). When the dimension being measured is an abstract construct
that is inferred from directly observable events, then we may speak of
“construct validity.”
Face Validity. An operationalization has face validity when others agree that it
looks like it does measure or manipulate the construct of interest. For example, if I tell
you that I am manipulating my subjects’ sexual arousal by having them drink a pint of
isotonic saline solution, you would probably be skeptical. On the other hand, if I told
you I was measuring my male subjects’ sexual arousal by measuring erection of their
penises, you would probably think that measurement to have face validity.
Content Validity. Assume that we can detail the entire population of behavior
(or other things) that an operationalization is supposed to capture. Now consider our
operationalization to be a sample taken from that population. Our operationalization will
have content validity to the extent that the sample is representative of the population.
To measure content validity we can do our best to describe the population of interest
and then ask experts (people who should know about the construct of interest) to judge
how representative our sample is of that population.
Criterion-Related Validity. Here we test the validity of our operationalization by
seeing how it is related to other variables. Suppose that we have developed a test of
statistics ability. We might employ the following types of criterion-related validity:
• Concurrent Validity. Are scores on our instrument strongly correlated with scores on other concurrent variables (variables that are measured at the same time)? For our example, we should be able to show that students who just
finished a stats course score higher than those who have never taken a stats
course. Also, we should be able to show a strong correlation between score on
our test and students’ current level of performance in a stats class.
• Predictive Validity. Can our instrument predict future performance on an activity that is related to the construct we are measuring? For our example, is there a strong correlation between scores on our test and subsequent performance of employees in an occupation that requires the use of statistics?
• Convergent Validity. Is our instrument well correlated with measures of other
constructs to which it should, theoretically, be related? For our example, we
might expect scores on our test to be well correlated with tests of logical thinking,
abstract reasoning, verbal ability, and, to a lesser extent, mathematical ability.
• Discriminant Validity. Is our instrument not well correlated with measures of
other constructs to which it should not be related? For example, we might expect
scores on our test not to be well correlated with tests of political conservatism,
ethical ideology, love of Italian food, and so on.
Threats to Construct Validity.
• Inadequate Preoperational Explication. Here our failure is with respect to
defining the population of behaviors that represents the construct. For example,
if I want to construct a measure of “love of Italian cuisine,” it might be a mistake
to sample only items that have to do with pizza, spaghetti, and ravioli. Would my
results generalize to a broader definition of Italian cuisine?
• Mono-Operation Bias. Here the problem is that we have relied on only one
method of manipulating the construct of interest. Suppose I am interested in
effects of physical exercise. My research involves only one type of exercise,
jogging. Would my results generalize to other types of exercise?
• Mono-Method Bias. Here the problem is that we use only one means of
measuring a construct. For example, I use a self-report questionnaire for
measuring love of Italian cuisine. Would my results be different if I used direct
observations of what my subjects ordered when eating out, or if I asked their
friends what sort of meals my subjects cooked, or if I investigated what types of cookbooks were on my subjects’ shelves, etc.?
• Interaction of Different Treatments. The problem here is that subjects exposed to one treatment may also have been exposed to other treatments, and those other treatments may have altered the effect of the treatment of interest. For
example, a researcher wants to test the effect of a new drug on his colony of
monkeys. These monkeys have been used in other drug experiments previously.
Would their response to the new drug be the same if they had not been used in
other drug experiments previously?
• Testing x Treatment Interaction. This is a problem only when subjects are
given a pretest, then an experimental treatment, then a posttest. The question is
whether or not the pretesting affected the subjects’ response to the experimental
treatment.
• Restricted Generalizability Across Constructs. The problem here is that an
experimental treatment might have unanticipated effects on constructs that you
did not measure. For example, the drug we are testing for effectiveness in
treating depression might also affect a variety of other traits, but if we did not
measure those other traits, then we cannot fully describe the effects of the drug.
• Confounding Constructs with Levels of Constructs. This problem involves
our thinking that the few levels of our experimental variable used in our study
fairly represent the entire range of possible levels of the construct. For example,
suppose that we are interested in the effects of test anxiety on performance of a
complex cognitive task. We manipulate our subjects’ test anxiety with only two
levels, low and medium. We find that their performance is higher with medium
anxiety than with low anxiety and conclude that test anxiety is positively
correlated with performance. We might have found a more complex relationship
if we had used different levels of anxiety. For example, if we used low, medium,
and high anxiety, we might have found a curvilinear relationship, with
performance decreasing as anxiety increased beyond moderate levels.
• Social Threats. The basic problem here is that subjects who are observed may
not behave the same as subjects who are not observed. Is the effect of your
experimental treatment due to or affected by subjects knowing that you are
observing them? Here are some examples of threats of this type:
o Hypothesis Guessing. Your subjects guess what your hypotheses are and then either try to help you confirm them (the good-guy effect) or do their best to give you data that would lead to disconfirmation (the screw-you effect). Whether the subjects guessed correctly or not, their behavior is affected by their thoughts regarding what your hypotheses are.
o Evaluation Apprehension. Subjects may get nervous when they know
that you are evaluating them. They may even be afraid that you, the
psychologist, will be able to see their innermost secrets. Such
apprehension may well affect their behavior.
o Expectancy Effects. Both the subjects and the experimenter may have
expectancies about the effects of the experimental treatments. In one
classic experiment on experimenter-expectancy effects, graduate
students were given rats to test in a maze. They were told there were two
different groups of rats, one group comprised of stupid rats and the other
of smart rats. The students were also told to which group each tested rat
belonged. As expected, the smart rats did much better in the maze than
did the stupid rats, but guess what -- there were not really two different groups of rats; the rats had all been randomly chosen from one group. To
avoid such problems, the data collector may be kept ignorant with respect
to which group each subject is in. Likewise, if the subject knows which
group he is in, his behavior may be affected by that expectation, a
subject-expectancy problem. Accordingly, the researcher may take
steps to prevent the subjects from knowing what group they are in. For
example, in drug trials, subjects who are in the group that does not get the
drug treatment may be given a placebo drug to take, a substance that has
no effect on the dependent variable. This is intended to make it harder for
the subject to figure out whether he is getting the experimental drug or not.
When both the data collector and the subjects are kept in the dark with
respect to who got what treatments, the experiment is said to be a
double-blind experiment.
Copyright 2011, Karl L. Wuensch - All rights reserved.