Classical Test Theory & Reliability Notes
Basic Facts and Notation
We have a sample of N observed values of a variable x.
The observed values will generally be denoted xobs, but the observed value for
subject i in the sample will be denoted xi.
We shall use xave to represent the sample mean (average) for the variable.
We shall use s to represent the sample standard deviation and Var(xobs) to
represent the sample variance.
Note that Var(xobs) = s².
Also note that Var(xobs) = {Σi (xi – xave)²}/(N – 1).
We shall use G(a,b) to denote normal (Gaussian) distribution which has a mean of
a and a standard deviation of b.
Thus, Gn(0,1) denotes one particular standard normal distribution (the subscript n is used to distinguish between different, independent distributions).
Traditionally, ε is used to denote a random value from a normally-distributed error
variable, with a mean of zero and a variance of σerr².
Exp{Q} denotes the expected value of a quantity Q. In a large sample of measures
of Q, this value will be approximated by the average value of Q.
D(Q) denotes the distribution of values of a quantity Q.
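As a concrete illustration of this notation, here is a minimal Python sketch (the sample values and parameters are made up for the example) computing the sample mean, variance, and standard deviation exactly as defined above:

```python
import random
import statistics

# A minimal illustration of the notation above (the numbers are arbitrary):
# a sample of N observed values of x, its mean xave, standard deviation s,
# and sample variance Var(xobs) = s^2.
random.seed(1)
N = 5000
x_obs = [random.gauss(100.0, 15.0) for _ in range(N)]

x_ave = sum(x_obs) / N                                   # sample mean
var_x = sum((xi - x_ave)**2 for xi in x_obs) / (N - 1)   # {sum_i (xi - xave)^2}/(N - 1)
s = var_x ** 0.5                                         # sample standard deviation
```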
Assumptions of Classical Test Theory
Classical test theory (CTT) starts with the notion of the true value of a variable, e.g.
xtrue. CTT assumes that the true values of variable x in a population of interest follow
a normal distribution. Let us denote the population mean by μ and the population
standard deviation by σtrue. Using the notation introduced above, we can therefore
write that the distribution of the true values for a population of participants will be:
D(xtrue) = G1(μ, σtrue)
… (1)
The population values (μ and σtrue, also called parameters) are different from those
which we measure in a sample (xave and s) due to sampling error. Classical test theory
is concerned with how the measured (i.e. observed) values of x will be related to the
true values xtrue. CTT proposes that the observed values are a combination of the true
values plus a random measurement error component. By stressing random error, CTT
is making 3 assumptions about the error component:
(i) The error component will have zero mean, so the observed mean will not be
systematically distorted away from the true value by the error (in contrast to a
systematic bias effect, which would distort the observed mean away from its true
value).
(ii) The measurement errors are assumed to follow a normal distribution.
(iii) The measurement errors are uncorrelated with the true values.
Therefore, according to CTT, we can write the following expression for the
distribution of xobs:
D(xobs) = D(xtrue) + G2(0, σerr)
…(2)
where err is the standard deviation of the normal random error term. For an
individual (ith) participant we could also write the following expression for their
observed score on variable x:xi = xi,true + 2i
… (3)
where xi,true denotes the value of xtrue for participant i, drawn from the true value
distribution G1(μ, σtrue), and ε2i denotes the error term for the ith participant, drawn
from the error distribution G2(0, σerr).
It follows from these assumptions that the expected value of the sample mean,
Exp{xave}, is . Also, the sample standard deviation of x is going to be larger than
true because the random error component (with a standard deviation of err) increases
the variation in xobs.
In fact, it is easy to work out the expected sample variance exactly. Imagine one has
two variables a and b, and a variable c which is the sum of a and b (i.e., a + b). It
follows from the definition of variance given in the section on “Basic Facts and
Notation” that:
Var(c) = Var(a) + Var(b) + 2*rab*sqrt(Var(a))*sqrt(Var(b))
… (4)
where rab is the correlation between a and b and sqrt(Q) is the square root of quantity
Q.
We can use the general formula given in (4) to evaluate the expected sample variance
according to CTT, because the observed values are the sum of the true values plus
random measurement error. Also recall that the correlation between true and error
values is assumed to be zero under the theory. Thus, the expected value of the sample
variance is just the sum of the variances of the true score and error terms:
Exp{s2} = true2 + err2
… (5)
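Expression (5) can also be verified by a small simulation (the parameter values below are made up): observed scores are generated as true scores plus zero-mean error, and the sample variance lands close to the sum of the two variances, while the sample mean stays close to the population mean:

```python
import random
import statistics

# A small simulation of expression (5): when x_obs = x_true + error, with
# zero-mean error uncorrelated with the true scores, the expected sample
# variance is sigma_true^2 + sigma_err^2.
random.seed(42)
N = 20000
mu, sigma_true, sigma_err = 50.0, 10.0, 5.0

x_true = [random.gauss(mu, sigma_true) for _ in range(N)]    # G1(mu, sigma_true)
x_obs = [t + random.gauss(0.0, sigma_err) for t in x_true]   # + G2(0, sigma_err)

x_bar = statistics.mean(x_obs)    # close to mu: the error does not bias the mean
s2 = statistics.variance(x_obs)   # close to 10^2 + 5^2 = 125
```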
Correlations Between Measures
Assume two measures (x and y) index the same psychological process, p. Following
CTT, assume further that the process is normally distributed in the population of
interest with a mean μp and standard deviation σp. We can write the following
expressions for the distributions of xobs and yobs:
D(xobs) = G1(p, p) + G2(0, errx)
D(yobs) = G1(p, p) + G3(0, erry)
… (6)
As before, the subscripts on the normal distribution terms (i.e. Gn) are used to indicate
that the error distributions affecting x and y are independent of one another (different
subscript), whereas the measures contain an identical true value for the same subject
(drawn from the same distribution, indicated by a common subscript).
The correlation between x and y is denoted by rxy and is defined thus:
rxy = Covar(xobs, yobs) / sqrt(Var(xobs)*Var(yobs))
… (7)
where Covar(a, b) is the covariance between a and b.
From the equations (4) and (5) above, the expected values of sample variance of x and
y can be written as:
Exp{Var(xobs)} = σp² + σerrx²
Exp{Var(yobs)} = σp² + σerry²
… (8)
The covariance (shared variance) between x and y is σp². It follows from the definition
of the correlation between measures (7) and the expected variance results (8) that we
can obtain the following result for the expected value of the correlation:
Exp{rxy} = p2 / sqrt((p2 + errx2)*( p2 + erry2))
… (9)
Reliability of a Measure
The reliability of a measure is defined as the proportion of variance in the measure
that is due to the construct being measured (rather than measurement error). Thus,
according to CTT reliability is defined as:
Reliability = true2 / (true2 + err2)
… (10)
We can show, using CTT, that a test-retest correlation for a particular measure will
give us an estimate of reliability. A test-retest correlation uses a measure taken at two
separate times (denoted 1 and 2). The test-retest correlation for x can be written as r12.
We can write the following expressions:
D(xobs1) = G1(μ, σtrue) + G2(0, σerr1)
D(xobs2) = G1(μ, σtrue) + G3(0, σerr2)
… (11)
where we have assumed that the error variance at time 2 is independent of the error
variance at time 1. However, we have no reason to suppose that the amount of
measurement error variance will differ at each timepoint. Therefore, we assume that
err1 = err2 = err. Using this assumption, plus the expressions given in (9) and (11)
the expected value of the test-retest correlation has the following value:
Exp{r12} = true2 / (true2 + err2)
which is the same as the CTT definition of reliability (10, above).
… (12)
Summing Two Measures to Increase Reliability
We can use CTT to show that summing measures of the same construct increases
reliability. Let’s return to our two measures x and y of process p that we discussed
previously. Let’s define our combined measure, Cobs = 0.5*(xobs + yobs). Using the
expressions in (6), defined above, we can write:
D(Cobs) = G1(p, p) + 0.5*G2(0, errx) + 0.5*G3(0, erry)
… (13)
For simplicity let us assume that both xobs and yobs are equally reliable measures;
i.e., they have the same measurement error variance. Therefore, we can write the
following: σerrx = σerry = σerr. Also, it follows from the definition of standard deviation
that n*G(m, s) = G(n*m, n*s). We can therefore rewrite expression (13) thus:
D(Cobs) = G1(μp, σp) + G2(0, 0.5*σerr) + G3(0, 0.5*σerr)
… (14)
Using our result for the sum of two variables (expression 4 above), we can combine
the two uncorrelated error terms of expression (14) into a single random error term,
thus:
Cobs = G1(p, p) + G4(0, sqrt(0.5)*err)
… (15)
From the definition of reliability given earlier (10), we note that the reliability of the
combined measure is p2 / (p2 + 0.5*err2). From the expressions in 8 above, and
after replacing the separate error variances by the common error variance term (err2),
we can see that either measure alone has a reliability of p2 / (p2 + err2). We can thus
see that the combined measure is more reliable. The absolute amount of error variance
in the combined measure is halved.
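This result can be sketched in simulation (the parameter values are made up): averaging two equally reliable measures halves the error variance, so the combined measure C = 0.5*(x + y) comes out more reliable than either measure alone:

```python
import random
import statistics

# Averaging two equally reliable measures of the same process p halves the
# error variance, so C = 0.5*(x + y) is more reliable than x or y alone.
random.seed(3)
N = 40000
sigma_p, sigma_err = 1.0, 1.0

p = [random.gauss(0.0, sigma_p) for _ in range(N)]
x = [pi + random.gauss(0.0, sigma_err) for pi in p]
y = [pi + random.gauss(0.0, sigma_err) for pi in p]
c = [0.5 * (xi + yi) for xi, yi in zip(x, y)]

# Reliability = true variance / observed variance (expression 10).
rel_single = sigma_p**2 / statistics.variance(x)    # about 1/(1 + 1)   = 0.5
rel_combined = sigma_p**2 / statistics.variance(c)  # about 1/(1 + 0.5) = 0.667
```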
Cronbach’s Alpha (α) as a measure of internal consistency
In the previous section we combined two variables by averaging them. We can extend
this by combining more than 2 variables (either by summing or averaging, it doesn’t
matter which). Let us assume we have a scale that is made up of m items and the
observed scores on each item are given by xj. Let the scale total score (formed by
summing all the items) be described as xsum (i.e., xsum = Σj xj). Cronbach’s α is a
measure of reliability for such a scale. It is intended to reflect the internal consistency
of the scale, i.e. the extent to which all the items tend to measure the same construct
(process p), and it is defined thus:
α = (m/(m – 1))*(1 – {Σj Var(xj)} / Var(xsum))
…(16)
Using CTT we can assume that the distribution of scores on each item can be represented
by the following kind of expression:
D(xj) = G(p, p) + Gj(0, errj)
…(17)
where Gj relates to the unique error associated with each item. It is easy to show,
using CTT and expressions like (17), that the value of α reduces to the following
expression:
 = (m2 * p2) / {(m2 * p2) + j errj2 }
… (18)
Expression (18) reveals why α measures internal consistency. For example, if the
scale has only two items (m = 2) and the error variance for each item is the same
(i.e., σerr1 = σerr2 = σerr) then expression (18) reduces to σp² / (σp² + 0.5*σerr²). This
was shown earlier to be the reliability of a combined (average) score derived from two
variables (see equation 15).
Note also that α is directly related to the square of the number of items in the scale.
Imagine that, for every item on the scale, the true and error variances are the same
(i.e., σp² = σerr²). The reliability of each item is thus 0.5. If you
have two items (m=2) then the internal consistency, α, is 0.667. If you have 10 items
(m=10) then α is 0.909. So the 2-item scale and the 10-item scale differ numerically
in the value of α even though both scales are made up of individual items of identical
reliability. Thus, an impressive-seeming value of α (e.g., 0.95) might just reflect the
fact that you have a scale with many items.
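The worked numbers above can be reproduced by computing α directly from definition (16) on simulated item scores (the simulation parameters are invented; each item is built with equal true and error variance, so its reliability is 0.5, and expression (18) predicts α = m²/(m² + m) = m/(m + 1)):

```python
import random
import statistics

# Cronbach's alpha computed from its definition (16) on simulated items in
# which the true and error variances are equal (item reliability = 0.5).
# Expression (18) then predicts alpha = m/(m + 1).
random.seed(11)
N = 30000

def cronbach_alpha(m):
    p = [random.gauss(0.0, 1.0) for _ in range(N)]
    items = [[pi + random.gauss(0.0, 1.0) for pi in p] for _ in range(m)]
    x_sum = [sum(vals) for vals in zip(*items)]                    # scale total score
    sum_item_var = sum(statistics.variance(item) for item in items)
    return (m / (m - 1)) * (1 - sum_item_var / statistics.variance(x_sum))

alpha_2 = cronbach_alpha(2)     # predicted 2/3   (about 0.667)
alpha_10 = cronbach_alpha(10)   # predicted 10/11 (about 0.909)
```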
The Need to Take Care With Difference Measures
Imagine an experiment (e.g., a reaction time experiment) in which we measure RT in
3 different conditions. Condition 1 is supposed to measure process p1; condition 2 is
supposed to measure the combined effect of p1 plus p2; condition 3 the combination
of all 3 processes p1, p2 and p3.
Assuming the processes combine additively, we can estimate processes p2 and
p3 by subtraction. CTT expressions for the observed score distributions in each
condition (x1, x2, and x3) can be written:
D(x1) = G1(p1, p1) + G11(0, err1)
D(x2) = G2(p2, p2) + G1(p1, p1) + G12(0, err2)
D(x3) = G3(p3, p3) + G2(p2, p2) + G1(p1, p1) + G13(0, err3)
… (19)
Thus, the distributions of our empirical estimates of p2 and p3 are given by:
D(p2est) = G2(μp2, σp2) + G12(0, σerr2) - G11(0, σerr1)
D(p3est) = G3(μp3, σp3) + G13(0, σerr3) - G12(0, σerr2)
… (20)
Expression (20) shows two important properties about these difference measures:
(i) They are less reliable than either of the constituent measures (as they contain the
error terms from each part of the subtraction).
(ii) One should generally not correlate them with one another, as they share a common
error term [σerr2]. This common term has opposing signs in the two expressions in
(20) above and so it will generate a negative correlation between the measures
even where no correlation between the processes exists.
If one really wants to correlate the subtractive estimate of p2 with that for p3, then one
has to make careful design choices to avoid the problems caused by a shared error
term. For example, one can have twice as many trials from condition 2 as the other
conditions and then use a randomly chosen half of the condition 2 trials to estimate p2
and the other half to estimate p3 by appropriate subtractions. This requires that the
random errors associated with one half of the condition 2 trials can be argued to be
independent of the random measurement errors for the other half of condition 2 trials
(i.e., the same assumption classical test theory is making for the separate conditions).
Correlated Error Terms
Error terms need not be completely independent (uncorrelated) across the conditions
of an experiment (such as the RT study discussed above). Error is that part of the
measure of x1-x3 which is not caused by variation in processes p1-p3. Some error
processes may affect all conditions equally. For example, if a particular participant is
tired on the testing day then this is still a random error process across the sample
tested. It is likely to be a zero mean error process across the whole sample too, as
other subjects may be unusually alert. This means that the whole sample will give a
reasonable value for the average RT across participants. However, for the individual
participant who is tired, it is likely that their average RTs in each condition would be
slower than would normally be the case if they had been tested on a typical day. For
the RT study discussed earlier we can write a set of modified expressions to take
account of the possible correlation in part of the error across conditions:
D(x1) = G1(μp1, σp1) + G11(0, σerr1) + G0(0, σerrc)
D(x2) = G2(μp2, σp2) + G1(μp1, σp1) + G12(0, σerr2) + G0(0, σerrc)
D(x3) = G3(μp3, σp3) + G2(μp2, σp2) + G1(μp1, σp1) + G13(0, σerr3) + G0(0, σerrc)
… (21)
where G0(0, errc) denotes the distribution of an error component which is correlated
across the 3 conditions. Each participant will thus have the same value of this error
component in each of the 3 conditions, in much the same way that they will have the
same value for process p1 in conditions 1 and 2. The expressions for the difference
measure estimates of our key processes (20) are still correct because the subtraction
process removes the correlated error term (G0(0, σerrc)). In this case, however, the
relative size of the correlated and uncorrelated error terms will determine whether the
difference measures (which contain 2 of the uncorrelated error terms) are more or less
reliable than the individual condition measures (which contain the correlated error
term plus one of the uncorrelated error terms).
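A quick check (with invented values) confirms the cancellation: a zero-mean error component shared across conditions inflates each condition's variance but drops exactly out of the difference measure, leaving its estimate of the process unchanged:

```python
import random
import statistics

# The correlated error component ec appears identically in x1 and x2, so it
# cancels exactly out of the difference measure x2 - x1.
random.seed(9)
N = 20000
ec = [random.gauss(0.0, 25.0) for _ in range(N)]   # correlated error G0(0, sigma_errc)
e1 = [random.gauss(0.0, 10.0) for _ in range(N)]
e2 = [random.gauss(0.0, 10.0) for _ in range(N)]
p1 = [random.gauss(300.0, 40.0) for _ in range(N)]
p2 = [random.gauss(100.0, 20.0) for _ in range(N)]

x1 = [p1[i] + e1[i] + ec[i] for i in range(N)]
x2 = [p2[i] + p1[i] + e2[i] + ec[i] for i in range(N)]

p2_est = [x2[i] - x1[i] for i in range(N)]         # shared ec cancels exactly
mean_est = statistics.mean(p2_est)                 # about 100
var_est = statistics.variance(p2_est)              # about 20^2 + 2*10^2 = 600
```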