Classical Test Theory

Classical Test Theory & Reliability Notes
Psychometrics

• Psychometrics is concerned with the measurement of psychological attributes (personality, abilities, attitudes, etc.)

• The aim is often to study differences between individuals/groups

• It is crucial that measurement is accurate and reliable

• An understanding of statistical theory can help us:
o construct reliable test measures
o assess reliability (and other properties) using formal tests based on
statistical theory (e.g. Cronbach’s alpha)
Statistical theory in psychometrics
Classical Test Theory (CTT)
• Evolved in the early 1900s from the work of Spearman on the measurement of individual differences in mental abilities

• Used in psychometric research for over a century

• Basic principle: X = T + E (observed score = true score + error)
Item Response Theory (IRT) – not covered in this lecture
• A modern approach to reliability assessment (also the Rasch model)

• Certain advantages of the IRT approach over CTT

• There are also relative advantages of CTT, including: valid with smaller representative samples, relatively straightforward maths, and available in SPSS!

• Fan (1998) even suggests that there is little empirical difference between CTT and IRT
Basic Facts and Notation
N = the number of observations in the sample
xobs = observed value, and xi = observed value for the ith subject
Σ = 'the sum of'
Population statistics:
μ = population mean, σ = population standard deviation, σ² = population variance
Sample statistics:
x̄ = sample mean, sx = sample standard deviation, sx² = sample variance
Note that:
Var(xobs) = s² = Σi (xi – x̄)² / (N – 1)
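As a concrete illustration, here is a minimal Python sketch of this calculation (the five scores are made up, and numpy is assumed to be available):

import numpy as np

# Hypothetical observed scores for N = 5 subjects (made-up numbers)
x_obs = np.array([12.0, 15.0, 9.0, 14.0, 10.0])

x_bar = x_obs.mean()
# Sample variance with the N - 1 denominator, as in the formula above
s2_manual = ((x_obs - x_bar) ** 2).sum() / (len(x_obs) - 1)
s2_numpy = x_obs.var(ddof=1)   # ddof=1 gives the same N - 1 denominator

print(s2_manual, s2_numpy)     # both print 6.5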
Distributions
• D(Q) denotes the distribution of values of a quantity Q
• G(a, b) denotes a normal (Gaussian) distribution with a mean of a and a standard deviation of b
For example, D(IQ) = G(100, 15)
• Gn(0, 1) is the nth value drawn from a standard normal distribution
• ε = a random value from a normally-distributed error variable, with a mean of zero and a variance of σerr²
Expected values
• Exp{Q} = the expected value of a quantity Q.
In a large sample of measures of Q, this value will be approximated by the average value of Q
Assumptions of Classical Test Theory
CTT starts with the notion of the true value of a variable, e.g. xtrue. CTT assumes that the true values of variable x in a population of interest follow a normal (or 'Gaussian') distribution. Let us denote the population mean by μ and the population standard deviation by σtrue. Using the notation introduced above, we can therefore write that the distribution of the true value for a population of participants will be:
D(xtrue) = G1(μ, σtrue)
… (1)
The population values (μ and σtrue, also called parameters) are different from those which we measure in a sample (x̄ and s) due to sampling error. Classical test theory is concerned with how the measured (i.e. observed) values of x will be related to the true values xtrue. CTT proposes that the observed values are a combination of the true values plus a random measurement error component. By stressing random error, CTT is making three assumptions about the error component:
(i) The error component will have zero mean, and so the observed mean will not be systematically distorted away from the true value by the error (this contrasts with a systematic bias effect, which would distort the observed mean away from its true value).
(ii) The measurement errors are assumed to follow a normal distribution.
(iii) The measurement errors are uncorrelated with the true values.
Therefore, according to CTT, we can write the following expression for the distribution of xobs:
D(xobs) = D(xtrue) + G2(0, σerr)
… (2)
where σerr is the standard deviation of the normal random error term. For an individual (ith) participant we could also write the following expression for their observed score on variable x:
xi = xi,true + ε2i
… (3)
where xi,true denotes the value of xtrue for participant i, drawn from the true value distribution G1(μ, σtrue), and ε2i denotes the error term for the ith participant, drawn from the error distribution G2(0, σerr).
It follows from these assumptions that the expected value of the sample mean, Exp{x̄}, is μ. Also, the sample standard deviation of observed x is going to be larger than σtrue, as the random error component (with a standard deviation of σerr) increases the variation in xobs.
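To make expressions (1)–(3) concrete, here is a minimal simulation sketch in Python (the values μ = 100, σtrue = 15 and σerr = 5 are purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
N = 100_000                                    # large sample, so expectations are well approximated
mu, sigma_true, sigma_err = 100.0, 15.0, 5.0   # illustrative parameter values

x_true = rng.normal(mu, sigma_true, N)         # draws from G1(mu, sigma_true)
error  = rng.normal(0.0, sigma_err, N)         # draws from G2(0, sigma_err)
x_obs  = x_true + error                        # expression (3): observed = true + error

print(x_obs.mean())                       # close to mu (assumption i: zero-mean error)
print(x_obs.std(ddof=1))                  # close to sqrt(15**2 + 5**2) ≈ 15.8, larger than sigma_true
print(np.corrcoef(x_true, error)[0, 1])   # close to 0 (assumption iii)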
In fact, it is easy to work out the expected sample variance exactly. Imagine one has two variables a and b, and a variable c which is the sum of a and b (i.e., c = a + b). The variance of the new variable c is given by:
Var(c) = Var(a) + Var(b) + 2rab·√[Var(a) · Var(b)]
… (4)
or σc² = σa² + σb² + 2rab·√[σa² · σb²]
where rab is the correlation between a and b.
We can use the general formula given in (4) to evaluate the expected sample variance according to CTT, as the observed values are the sum of the true values and random measurement error. Using 'true' and 'error' instead of 'a' and 'b', the expected value of the sample variance is just the sum of the variances of the true score and error terms:
Exp{s²} = σtrue² + σerr²
… (5)
The last term in formula (4) disappears because we are assuming that there is zero correlation between the true and error values (assumption iii).
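A short numerical check of formulas (4) and (5), again as an illustrative sketch (the component variances of 100 and 25 are assumed):

import numpy as np

rng = np.random.default_rng(2)
N = 100_000

# Two uncorrelated components, as CTT assumes for true scores and errors
true_part = rng.normal(0.0, 10.0, N)   # variance 100
err_part  = rng.normal(0.0, 5.0, N)    # variance 25
c = true_part + err_part

r = np.corrcoef(true_part, err_part)[0, 1]   # near zero, so the last term of (4) vanishes
lhs = c.var(ddof=1)
rhs = (true_part.var(ddof=1) + err_part.var(ddof=1)
       + 2 * r * np.sqrt(true_part.var(ddof=1) * err_part.var(ddof=1)))
print(lhs, rhs)        # both close to 125, matching expression (5)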
Correlations Between Measures
Assume two measures (x and y) index the same psychological process, p. Following CTT, assume further that p is normally distributed in the population of interest with a mean μp and standard deviation σp. We can write the following expressions for the distributions of xobs and yobs:
D(x) = G1(μp, σp) + G2(0, σerrx)
D(y) = G1(μp, σp) + G3(0, σerry)
… (6)
The subscripts on the normal distribution terms (i.e. Gn) are used to indicate that the
error distributions affecting x and y are independent of one another (different
subscript), whereas the measures contain an identical true value for the same subject
(drawn from the same distribution, indicated by a common subscript).
The correlation between x and y is denoted by rxy and is defined thus:
rxy = Covar(x, y) / √[σx² · σy²]
… (7)
where Covar(x, y) is the covariance between x and y. rxy indicates the proportion of covariance relative to the total variance (and thus r cannot exceed 1). The more closely x and y measure p (i.e. the smaller the error), the greater the value of rxy.
From equation (5) above, the expected values of the sample variances of x and y can be expressed as:
Exp{sx²} = σp² + σerrx²
Exp{sy²} = σp² + σerry²
… (8)
The covariance (shared variance) between x and y is σp². If we substitute the expected variances of x and y given above (8) into equation (7), we can obtain the following result for the expected value of the correlation:
Exp{rxy} = σp² / √[(σp² + σerrx²) · (σp² + σerry²)]
… (9)
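As an illustrative worked example (the values are assumed): if σp² = 1 and σerrx² = σerry² = 1, then Exp{rxy} = 1 / √[(1 + 1)·(1 + 1)] = 0.5; with less error, say σerrx² = σerry² = 0.25, Exp{rxy} = 1 / 1.25 = 0.8.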
Reliability of a Measure
The reliability of a measure is defined as the proportion of variance in the measure that is due to the construct being measured (rather than to measurement error). Thus, according to CTT, reliability is defined as:
Reliability = σtrue² / (σtrue² + σerr²)
… (10)
We can show, using CTT, that a test-retest correlation for a particular measure will give us an estimate of reliability. A test-retest correlation uses a measure taken at two separate times (denoted 1 and 2). The test-retest correlation for x can be written as r12. We can write the following expressions (from formula 6):
D(xobs1) = G1(μ, σtrue) + G2(0, σerr1)
D(xobs2) = G1(μ, σtrue) + G3(0, σerr2)
… (11)
where we have assumed that the error at time 2 is independent of the error at time 1. However, we have no reason to suppose that the amount of measurement error variance will differ between the two timepoints. Therefore, we assume that σerr1 = σerr2 = σerr. Using this assumption, plus the expressions given in (9) and (11), the expected value of the test-retest correlation has the following value:
Exp{r12} = σtrue² / (σtrue² + σerr²)
… (12)
which is the same as the CTT definition of reliability (10, above).
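A brief simulation sketch of this result (σtrue = 10 and σerr = 6 are assumed for illustration):

import numpy as np

rng = np.random.default_rng(3)
N = 100_000
mu, sigma_true, sigma_err = 50.0, 10.0, 6.0    # illustrative values

x_true = rng.normal(mu, sigma_true, N)          # same true scores at both timepoints
x_t1 = x_true + rng.normal(0.0, sigma_err, N)   # time 1: independent error draw
x_t2 = x_true + rng.normal(0.0, sigma_err, N)   # time 2: independent error draw

r12 = np.corrcoef(x_t1, x_t2)[0, 1]
reliability = sigma_true**2 / (sigma_true**2 + sigma_err**2)
print(r12, reliability)    # both close to 100 / 136 ≈ 0.735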
Summing Two Measures to Increase Reliability
We can use CTT to show that summing measures of the same construct increases reliability. Let's return to our two measures x and y of process p that we discussed previously. Let's define our combined measure as Cobs = 0.5·(xobs + yobs). Using the expressions in (6), defined above, we can write:
D(Cobs) = G1(μp, σp) + 0.5·G2(0, σerrx) + 0.5·G3(0, σerry)
… (13)
For simplicity, let us assume that xobs and yobs are equally reliable measures (i.e., have the same measurement error), so we can write σerrx = σerry = σerr. Also, it follows from the definition of the standard deviation that n·G(m, s) = G(n·m, n·s); that is, if you multiply the entire set of scores by a number n, then the standard deviation (and the mean) become n times bigger (for example, '-1 0 1' × 0.5 = '-0.5 0 0.5'). We can therefore rewrite expression (13) thus:
D(Cobs) = G1(μp, σp) + G2(0, 0.5·σerr) + G3(0, 0.5·σerr)
… (14)
Using our result for the sum of two variables (expression 4 above), we can combine the two uncorrelated error terms of expression (14) into a single random error term, thus:
D(Cobs) = G1(μp, σp) + G4(0, √0.5·σerr)
… (15)
From the definition of reliability given earlier (10), we note that the reliability of the combined measure is σp² / (σp² + 0.5·σerr²). However, from the previous expressions we can see that either measure alone has a reliability of σp² / (σp² + σerr²). It is thus apparent that the combined measure is more reliable; in fact, the absolute amount of error variance in the combined measure is halved.
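For example (values assumed for illustration): if σp² = σerr² = 1, each measure alone has a reliability of 1 / (1 + 1) = 0.5, whereas the combined measure has a reliability of 1 / (1 + 0.5) ≈ 0.67.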
Cronbach’s Alpha (α) as a Measure of Internal Consistency
In the previous section we combined two variables by averaging them. We can extend this by combining more than two variables (either by summing or by averaging; it doesn't matter which). Let us assume we have a scale that is made up of n items and that the observed scores on each item are given by xj. Let the scale total score (formed by adding all the items) be denoted xtotal (i.e., xtotal = Σj xj). Cronbach's α is a measure of reliability for such a scale. It is intended to reflect the internal consistency of the scale, i.e. the extent to which all the items tend to measure the same construct (process p), and it is defined thus:
α = [n / (n – 1)] · [1 – Σj σxj² / σxtotal²]
… (16)
where σxj² is the variance of the scores on item j and σxtotal² is the variance of the total scores.
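A minimal Python sketch of formula (16) for a small subjects-by-items score matrix (the data are made up for illustration; in practice the same value would normally be obtained from a reliability procedure in a package such as SPSS):

import numpy as np

# Made-up scores: 6 subjects (rows) on a 3-item scale (columns)
scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 2, 3],
    [1, 2, 1],
    [4, 4, 5],
], dtype=float)

n_items = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)        # variance of each item (the sum in formula 16)
total_var = scores.sum(axis=1).var(ddof=1)    # variance of xtotal, the summed scale score

alpha = (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)
print(alpha)    # roughly 0.94 for these made-up data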
Using CTT, we can assume that the distribution of scores on each item can be represented by the following kind of expression:
D(xj) = G(μp, σp) + Gj(0, σerrj)
… (17)
where Gj relates to the unique error associated with each item. It is easy to show, using CTT and expressions like (17), that the value of α reduces to the following expression:
α = (n² · σp²) / (n² · σp² + Σj σerrj²)
… (18)
Expression (18) reveals why α measures internal consistency. For example, if the scale has only two items (n = 2) and the error variance for each item is the same (i.e., σerr1 = σerr2 = σerr), then expression (18) reduces to σp² / (σp² + 0.5·σerr²). This was shown earlier to be the reliability of a combined (average) score derived from two variables (see equation 15).
Note also that α is directly related to the square of the number of items in the scale. Imagine that, for every item on the scale, the true and error variances are the same (i.e., σp² = σerr²). The reliability of each item is thus 0.5. If you have two items (n = 2) then the internal consistency, α, is 0.667. If you have 10 items (n = 10) then α is 0.909. So the 2-item scale and the 10-item scale differ numerically in the value of α even though both scales are made up of individual items of identical reliability. Thus, an impressive-seeming value of α (e.g., 0.95) mainly reflects the fact that you have a scale with many items.
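A small sketch of expression (18) under the equal-error-variance assumption, reproducing the figures above (the values of n chosen are illustrative):

def alpha_from_ctt(n, var_p, var_err):
    # Expression (18) with the same error variance for every item
    return (n**2 * var_p) / (n**2 * var_p + n * var_err)

# Items of identical reliability 0.5 (var_p = var_err = 1)
for n in (2, 5, 10, 20):
    print(n, round(alpha_from_ctt(n, 1.0, 1.0), 3))
# prints: 2 0.667, 5 0.833, 10 0.909, 20 0.952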
An alternative, more intuitive expression of formula (18), algebraically reworked, is:
α = n·r̄ / [1 + (n – 1)·r̄]
where r̄ is the average correlation between the items. This expression further underlines that there are two important factors that increase reliability (alpha): the number of items in the test (or number of tests) and the correlation between the items. The more items and the higher their correlation, the greater the reliability.
Note: when calculating Cronbach’s alpha, make sure the reverse-coded items are
reflected (i.e. scored in the same direction as the other items).