Classical Test Theory & Reliability Notes

Basic Facts and Notation

We have a sample of N observed values of a variable x. The observed values will generally be denoted xobs, but the observed value for subject i in the sample will be denoted xi. We shall use xave to represent the sample mean (average) for the variable. We shall use s to represent the sample standard deviation and Var(xobs) to represent the sample variance. Note that Var(xobs) = s². Also note that Var(xobs) = {Σi (xi – xave)²}/(N – 1).

We shall use G(a,b) to denote a normal (Gaussian) distribution which has a mean of a and a standard deviation of b. Thus, Gn(0,1) is a value drawn from one particular standard normal distribution. Traditionally, ε is used to denote a random value from a normally-distributed error variable, with a mean of zero and a variance of σerr². Exp{Q} denotes the expected value of a quantity Q. In a large sample of measures of Q, this value will be approximated by the average value of Q. D(Q) denotes the distribution of values of a quantity Q.

Assumptions of Classical Test Theory

Classical test theory (CTT) starts with the notion of the true value of a variable, e.g. xtrue. CTT assumes that the true values of variable x in a population of interest follow a normal distribution. Let us denote the population mean by μ and the population standard deviation by σtrue. Using the notation introduced above, we can therefore write that the distribution of the true value for a population of participants will be:

D(xtrue) = G1(μ, σtrue) … (1)

The population values (μ and σtrue, also called parameters) are different from those which we measure in a sample (xave and s) due to sampling error. Classical test theory is concerned with how the measured (i.e. observed) values of x are related to the true values xtrue. CTT proposes that the observed values are a combination of the true values plus a random measurement error component.
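This model is easy to illustrate with a short simulation. Below is a minimal sketch in Python/NumPy; the parameter values (μ = 50, σtrue = 10, σerr = 5) are illustrative assumptions, not values from these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000                 # large sample, so estimates sit close to expectations
mu, sd_true = 50.0, 10.0    # assumed population parameters (mu and sigma_true)
sd_err = 5.0                # assumed error standard deviation (sigma_err)

x_true = rng.normal(mu, sd_true, N)   # true scores: D(x_true) = G1(mu, sd_true)
err = rng.normal(0.0, sd_err, N)      # random error: G2(0, sd_err), independent
x_obs = x_true + err                  # observed score for each participant

print(x_obs.mean())         # close to mu (the error has zero mean)
print(x_obs.var(ddof=1))    # close to sd_true**2 + sd_err**2 = 125
```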
By stressing random error, CTT is making three assumptions about the error component:

(i) The error component has zero mean, so the observed mean will not be systematically distorted away from the true value by the error (this contrasts with a systematic bias effect, which would distort the observed mean away from its true value).
(ii) The measurement errors follow a normal distribution.
(iii) The measurement errors are uncorrelated with the true values.

Therefore, according to CTT, we can write the following expression for the distribution of xobs:

D(xobs) = D(xtrue) + G2(0, σerr) … (2)

where σerr is the standard deviation of the normal random error term. For an individual (ith) participant we could also write the following expression for their observed score on variable x:

xi = xi,true + ε2i … (3)

where xi,true denotes the value of xtrue for participant i, drawn from the true value distribution G1(μ, σtrue), and ε2i denotes the error term for the ith participant, drawn from the error distribution G2(0, σerr).

It follows from these assumptions that the expected value of the sample mean, Exp{xave}, is μ. Also, the sample standard deviation of x will be larger than σtrue because the random error component (with a standard deviation of σerr) increases the variation in xobs. In fact, it is easy to work out the expected sample variance exactly. Imagine one has two variables a and b, and a variable c which is the sum of a and b (i.e., c = a + b). It follows from the definition of variance given in the section on “Basic Facts and Notation” that:

Var(c) = Var(a) + Var(b) + 2*rab*sqrt(Var(a))*sqrt(Var(b)) … (4)

where rab is the correlation between a and b and sqrt(Q) is the square root of quantity Q. We can use the general formula given in (4) to evaluate the expected sample variance according to CTT, because the observed values are the sum of the true values plus random measurement error.
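The variance-sum rule in (4) is an exact identity in any sample, not just an expectation, which a quick NumPy check with arbitrary made-up variables confirms:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated variables a and b (arbitrary illustrative choices), and c = a + b.
a = rng.normal(0, 2, 50_000)
b = 0.5 * a + rng.normal(0, 3, 50_000)   # built to correlate with a
c = a + b

r_ab = np.corrcoef(a, b)[0, 1]
lhs = c.var(ddof=1)
rhs = a.var(ddof=1) + b.var(ddof=1) + 2 * r_ab * a.std(ddof=1) * b.std(ddof=1)

print(abs(lhs - rhs))   # ~0: expression (4) holds exactly in the sample
```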
Also recall that the correlation between the true and error values is assumed to be zero under the theory. Thus, the expected value of the sample variance is just the sum of the variances of the true score and error terms:

Exp{s²} = σtrue² + σerr² … (5)

Correlations Between Measures

Assume two measures (x and y) index the same psychological process, p. Following CTT, assume further that the process is normally distributed in the population of interest with a mean μp and standard deviation σp. We can write the following expressions for the distributions of xobs and yobs:

D(xobs) = G1(μp, σp) + G2(0, σerrx)
D(yobs) = G1(μp, σp) + G3(0, σerry) … (6)

As before, the subscripts on the normal distribution terms (i.e. Gn) are used to indicate that the error distributions affecting x and y are independent of one another (different subscripts), whereas the measures contain an identical true value for the same subject (drawn from the same distribution, indicated by a common subscript). The correlation between x and y is denoted by rxy and is defined thus:

rxy = Covar(xobs, yobs) / sqrt(Var(xobs)*Var(yobs)) … (7)

where Covar(a, b) is the covariance between a and b. From equations (4) and (5) above, the expected values of the sample variances of x and y can be written as:

Exp{Var(xobs)} = σp² + σerrx²
Exp{Var(yobs)} = σp² + σerry² … (8)

The covariance (shared variance) between x and y is σp². It follows from the definition of the correlation between measures (7) and the expected variance results (8) that we can obtain the following result for the expected value of the correlation:

Exp{rxy} = σp² / sqrt((σp² + σerrx²)*(σp² + σerry²)) … (9)

Reliability of a Measure

The reliability of a measure is defined as the proportion of variance in the measure that is due to the construct being measured (rather than measurement error).
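Result (9) describes the attenuation of a correlation by measurement error. A simulation sketch (with assumed illustrative values σp = 10, σerrx = 6, σerry = 8) shows the observed correlation matching (9), well below the perfect correlation between the underlying true scores:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 200_000
sd_p, sd_ex, sd_ey = 10.0, 6.0, 8.0     # assumed process and error SDs

p = rng.normal(0.0, sd_p, N)            # shared true scores: G1 in (6)
x = p + rng.normal(0.0, sd_ex, N)       # x_obs: true score plus error G2
y = p + rng.normal(0.0, sd_ey, N)       # y_obs: same true score, error G3

r_xy = np.corrcoef(x, y)[0, 1]
predicted = sd_p**2 / np.sqrt((sd_p**2 + sd_ex**2) * (sd_p**2 + sd_ey**2))

print(r_xy, predicted)   # both close to 0.67, despite x and y sharing p exactly
```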
Thus, according to CTT, reliability is defined as:

Reliability = σtrue² / (σtrue² + σerr²) … (10)

We can show, using CTT, that a test-retest correlation for a particular measure will give us an estimate of reliability. A test-retest correlation uses a measure taken at two separate times (denoted 1 and 2). The test-retest correlation for x can be written as r12. We can write the following expressions:

D(xobs1) = G1(μ, σtrue) + G2(0, σerr1)
D(xobs2) = G1(μ, σtrue) + G3(0, σerr2) … (11)

where we have assumed that the errors at time 2 are independent of the errors at time 1. However, we have no reason to suppose that the amount of measurement error variance will differ at each timepoint. Therefore, we assume that σerr1 = σerr2 = σerr. Using this assumption, plus the expressions given in (9) and (11), the expected value of the test-retest correlation is:

Exp{r12} = σtrue² / (σtrue² + σerr²) … (12)

which is the same as the CTT definition of reliability (10, above).

Summing Two Measures to Increase Reliability

We can use CTT to show that summing measures of the same construct increases reliability. Let’s return to our two measures x and y of process p that we discussed previously. Let’s define our combined measure, Cobs = 0.5*(xobs + yobs). Using the expressions in (6), defined above, we can write:

D(Cobs) = G1(μp, σp) + 0.5*G2(0, σerrx) + 0.5*G3(0, σerry) … (13)

For simplicity let us assume that both xobs and yobs are equally reliable measures; i.e., they have the same measurement error variance. Therefore, we can write σerrx = σerry = σerr. Also, it follows from the definition of standard deviation that n*G(m, s) = G(n*m, n*s).
We can therefore rewrite expression (13) thus:

D(Cobs) = G1(μp, σp) + G2(0, 0.5*σerr) + G3(0, 0.5*σerr) … (14)

Using our result for the sum of two variables (expression 4 above), we can combine the two uncorrelated error terms of expression (14) into a single random error term, thus:

D(Cobs) = G1(μp, σp) + G4(0, sqrt(0.5)*σerr) … (15)

From the definition of reliability given earlier (10), we note that the reliability of the combined measure is σp² / (σp² + 0.5*σerr²). From the expressions in (8) above, and after replacing the separate error variances by the common error variance term (σerr²), we can see that either measure alone has a reliability of σp² / (σp² + σerr²). We can thus see that the combined measure is more reliable: the absolute amount of error variance in the combined measure is halved.

Cronbach’s Alpha (α) as a Measure of Internal Consistency

In the previous section we combined two variables by averaging them. We can extend this by combining more than two variables (either by summing or averaging; it doesn’t matter which). Let us assume we have a scale that is made up of m items and the observed scores on each item are given by xj. Let the scale total score (formed by summing all the items) be denoted xsum (i.e., xsum = Σj xj). Cronbach’s α is a measure of reliability for such a scale. It is intended to reflect the internal consistency of the scale, i.e. the extent to which all the items tend to measure the same construct (process p), and it is defined thus:

α = (m/(m – 1))*(1 – {Σj Var(xj)} / Var(xsum)) … (16)

Using CTT we can assume that the distribution of scores on each item can be represented by the following kind of expression:

D(xj) = G(μp, σp) + Gj(0, σerrj) … (17)

where Gj relates to the unique error associated with each item. It is easy to show, using CTT and expressions like (17), that the value of α reduces to the following expression:

α = (m²*σp²) / {(m²*σp²) + Σj σerrj²} … (18)

Expression (18) reveals why α measures internal consistency.
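Both the definitional formula (16) and the CTT prediction (18) can be checked with a simulated scale. This is a sketch assuming m = 5 items with equal true and error variance per item; the values are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(4)

N, m = 100_000, 5
sd_p, sd_err = 1.0, 1.0          # assumed: equal true and error SD for every item

p = rng.normal(0.0, sd_p, N)                          # shared process scores
items = p[:, None] + rng.normal(0.0, sd_err, (N, m))  # expression (17), item by item
x_sum = items.sum(axis=1)

# Definitional alpha, expression (16)
alpha = (m / (m - 1)) * (1 - items.var(axis=0, ddof=1).sum() / x_sum.var(ddof=1))

# CTT prediction, expression (18), with m equal error variances
predicted = (m**2 * sd_p**2) / (m**2 * sd_p**2 + m * sd_err**2)

print(alpha, predicted)   # both close to 0.833
```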
For example, if the scale has only two items (m = 2) and the error variance for each item is the same (i.e., σerr1 = σerr2 = σerr), then expression (18) reduces to σp² / (σp² + 0.5*σerr²). This was shown earlier to be the reliability of a combined (average) score derived from two variables (see expression 15). Note also that α is directly related to the square of the number of items in the scale. Imagine that, for every item on the scale, the true and error variances are the same (i.e., σp² = σerr²). The reliability of each item is thus 0.5. If you have two items (m = 2) then the internal consistency, α, is 0.667. If you have 10 items (m = 10) then α is 0.909. So the 2-item scale and the 10-item scale differ numerically in the value of α even though both scales are made up of individual items of identical reliability. Thus, an impressive-seeming value of α (e.g., 0.95) might just reflect the fact that you have a scale with many items.

The Need to Take Care With Difference Measures

Imagine an experiment (e.g., a reaction time experiment) in which we measure RT in 3 different conditions. Condition 1 is supposed to measure process p1; condition 2 is supposed to measure the combined effect of p1 plus p2; condition 3 the combination of all 3 processes p1, p2 and p3. Assuming the processes combine additively, we can estimate processes p2 and p3 by subtraction.
CTT expressions for the observed score distributions in each condition (x1, x2, and x3) can be written:

D(x1) = G1(μp1, σp1) + G11(0, σerr1)
D(x2) = G2(μp2, σp2) + G1(μp1, σp1) + G12(0, σerr2)
D(x3) = G3(μp3, σp3) + G2(μp2, σp2) + G1(μp1, σp1) + G13(0, σerr3) … (19)

Thus, the distributions of our empirical estimates of p2 and p3 are given by:

D(p2est) = G2(μp2, σp2) + G12(0, σerr2) – G11(0, σerr1)
D(p3est) = G3(μp3, σp3) + G13(0, σerr3) – G12(0, σerr2) … (20)

Expression (20) shows two important properties of these difference measures:

(i) They are less reliable than either of the constituent measures, as they contain the error terms from each part of the subtraction.
(ii) They should generally not be correlated with one another, because they share a common error term (σerr2). This common term has opposing signs in the two expressions in (20) above, and so it will generate a spurious negative correlation between the measures even where no correlation between the processes exists.

If one really wants to correlate the subtractive estimate of p2 with that for p3, then one has to make careful design choices to avoid the problems caused by a shared error term. For example, one can run twice as many trials in condition 2 as in the other conditions and then use a randomly chosen half of the condition 2 trials to estimate p2 and the other half to estimate p3 by the appropriate subtractions. This requires that the random errors associated with one half of the condition 2 trials can be argued to be independent of the random measurement errors for the other half of the condition 2 trials (i.e., the same assumption classical test theory makes for the separate conditions).

Correlated Error Terms

Error terms need not be completely independent (uncorrelated) across the conditions of an experiment (such as the RT study discussed above). Error is that part of the measure of x1–x3 which is not caused by variation in processes p1–p3. Some error processes may affect all conditions equally.
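The shared-error problem is worth seeing concretely. In the sketch below (all process and error SDs assumed equal to 1, with p2 and p3 genuinely uncorrelated), the two difference scores correlate at about –1/3 purely because both contain the condition-2 error term:

```python
import numpy as np

rng = np.random.default_rng(5)

N = 200_000
# Assumed: independent unit-SD processes and errors in every condition.
p1, p2, p3 = rng.normal(0.0, 1.0, (3, N))
e1, e2, e3 = rng.normal(0.0, 1.0, (3, N))

x1 = p1 + e1                  # condition 1, expression (19)
x2 = p1 + p2 + e2             # condition 2
x3 = p1 + p2 + p3 + e3        # condition 3

p2_est = x2 - x1              # = p2 + e2 - e1, expression (20)
p3_est = x3 - x2              # = p3 + e3 - e2, expression (20)

r = np.corrcoef(p2_est, p3_est)[0, 1]
print(r)   # about -1/3: Cov = -Var(e2) = -1, and each estimate has Var = 3
```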
For example, if a particular participant is tired on the testing day then this is still a random error process across the sample tested. It is likely to be a zero-mean error process across the whole sample too, as other subjects may be unusually alert. This means that the whole sample will give a reasonable value for the average RT across participants. However, the tired participant's average RTs in each condition are likely to be slower than they would have been had that participant been tested on a typical day. For the RT study discussed earlier, we can write a set of modified expressions to take account of the possible correlation in part of the error across conditions:

D(x1) = G1(μp1, σp1) + G11(0, σerr1) + G0(0, σerrc)
D(x2) = G2(μp2, σp2) + G1(μp1, σp1) + G12(0, σerr2) + G0(0, σerrc)
D(x3) = G3(μp3, σp3) + G2(μp2, σp2) + G1(μp1, σp1) + G13(0, σerr3) + G0(0, σerrc) … (21)

where G0(0, σerrc) denotes the distribution of an error component which is correlated across the 3 conditions. Each participant will thus have the same value of this error component in each of the 3 conditions, in much the same way that they will have the same value for process p1 in conditions 1 and 2. The expressions for the difference measure estimates of our key processes (20) are still correct, because the subtraction process removes the correlated error term (G0(0, σerrc)). In this case, however, the relative size of the correlated and uncorrelated error terms will determine whether the difference measures (which contain 2 of the uncorrelated error terms) are more or less reliable than the individual condition measures (which contain the correlated error term plus one of the uncorrelated error terms).
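A final sketch confirms that the subtraction in (20) removes the correlated error term G0. The SDs are again assumed illustrative values; the shared component is given SD 2 so its removal is easy to see:

```python
import numpy as np

rng = np.random.default_rng(6)

N = 200_000
sd_c = 2.0                             # assumed SD of the shared error component
p1, p2 = rng.normal(0.0, 1.0, (2, N))
e1, e2 = rng.normal(0.0, 1.0, (2, N))  # condition-specific errors, G11 and G12
ec = rng.normal(0.0, sd_c, N)          # G0: one value per participant, all conditions

x1 = p1 + e1 + ec                      # conditions 1 and 2 of expression (21)
x2 = p1 + p2 + e2 + ec

p2_est = x2 - x1                       # = p2 + e2 - e1: the ec term cancels

print(np.corrcoef(p2_est, ec)[0, 1])   # ~0: no trace of the shared error remains
print(p2_est.var(ddof=1))              # ~3 = Var(p2) + Var(e2) + Var(e1), no ec variance
```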