Psychometrics and Test Theory

Genuinely interdisciplinary:
{mathematics & statistics} ∩ {psychology & behavioral science}
"Test" = a formalized, abstracted model variable substituted for some original psychological
variable, typically used for a diagnostic purpose
General aim of "test theory": improving the quality of diagnostic methods
(a loose analogy is the theory of errors in technical measurement)
Practical uses:
- assessment of scientific “provability” of data
- new test construction
- old test verification
- prediction
- person selection
- correction due to selection
etc.
Object of study: formalized properties of "tests"
Method:
mathematics, statistics, probability theory,
partly also: operations research, optimization (linear programming), ...
Specific purpose: to formulate mathematical relationships between test properties
so as to enable manipulating some of them to optimize the desired target
property/ies, usually the diagnostically important ones, e.g. validity etc.
Test properties are conceptually formalized and expressed in terms of statistical indexes:
illustration: "difficulty" of a test as the percentage of persons fulfilling / not fulfilling a "norm"
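For instance, the difficulty index of a single item can be computed as a simple proportion; a minimal sketch with made-up response data:

```python
# Hypothetical illustration (made-up data): the "difficulty" of an item
# expressed as the proportion of persons fulfilling the norm, here scoring
# the item correctly (1 = correct, 0 = incorrect).
responses = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]

p = sum(responses) / len(responses)  # proportion fulfilling the norm
q = 1 - p                            # proportion not fulfilling it
```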
Test properties:
validities - there are about 20 types of validity, all variations on
"how well the test measures the attribute it is supposed to measure"
(a nice operationalist tautology!);
classical test theory: validity defined as the absolute magnitude of a correlation
reliability - an analogy of measurement error and accuracy in technical measurement
(often mistakenly confused with mere similarity of results on repetition)
objectivity
difficulty
length
test time
test speed
dimensionality (vs. non-statistical "content homogeneity")
consistency
generalizability
equivalency
specificity
and several others
Illustration of some non-formalized "farmer's logic" relationships between the properties:
- difficulty → validity
- length → reliability
- reliability → validity
Some types of validity ... (type of criterion = D.V., test = I.V.):
- simple vs. composed
(in classical terms: simple vs. multiple correlation);
in the composed case also: incremental, "pure"
- internal vs. external
- manifest vs. latent ... empirical vs. "construct" (usually "factor")
- convergent vs. discriminant (= divergent)
- concurrent vs. predictive (test = predictor, criterion = predictand)
- absolute vs. differential
some non-formalized:
- "content" validity by experts' opinion
- "face" validity and motivation
Some types of test equivalency - "homogeneous" tests measuring the same construct can be:
- unidimensional
- congeneric
- tau-equivalent
- parallel
Scientific concept formation vs. weak associative measurement of "constructs"
(mathematical dimensionality of latent variables vs. psychological homogeneity)
Three traditional models
(in the modern approach they are subsumed under one generalized model):
1. Classical model of test score
2. Factor analysis
3. IRT models for binary tests (Item Response Theory; supersedes the obsolete "Item Analysis")
Practical use:
- assessment of scientific “provability” of data - critical difference of scores
- new test construction
- old test verification
- prediction
- person selection
- correcting predictive validity with respect to selection and selection ratio
Examples of formulas derived on the theoretical-axiomatic basis:
maximum validity of test x with respect to an errorless criterion y: max rxy = √Relx, the square root of reliability
ditto with respect to a criterion of reliability Rely (correction for attenuation):
max rxy = rxy / √(Relx · Rely)
suppose rxy = .48 and reliabilities .64 and .81; then max = .48 / (.8 × .9) ≈ .67
Spearman-Brown formula, influence of length L:
RelL·x = L · Relx / [1 + (L − 1) Relx]
standard error of test score:
se = sx √(1 − Relx)
critical difference of two scores (at the 95% level):
D95 = 2.8 se (more precisely 1.96 √2 se ≈ 2.77 se)
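The formulas above can be verified in a few lines of Python; a minimal sketch using the illustrative values from the text (the score standard deviation s_x = 10 is an added assumption):

```python
import math

# Illustrative values from the text: r_xy = .48, Rel_x = .64, Rel_y = .81;
# the score standard deviation s_x = 10 is an added assumption.
r_xy, rel_x, rel_y = 0.48, 0.64, 0.81
s_x = 10.0

# Maximum validity against an errorless criterion: sqrt(Rel_x)
max_validity = math.sqrt(rel_x)

# Correction for attenuation (criterion with reliability Rel_y)
corrected = r_xy / math.sqrt(rel_x * rel_y)   # .48 / (.8 * .9) = .48 / .72

# Spearman-Brown: reliability of a test lengthened L times
def spearman_brown(rel, L):
    return L * rel / (1 + (L - 1) * rel)

rel_doubled = spearman_brown(rel_x, 2)

# Standard error of a test score and 95% critical difference of two scores
se = s_x * math.sqrt(1 - rel_x)
d95 = 1.96 * math.sqrt(2) * se                # approx 2.77 * se
```

Note that lengthening the test raises reliability with diminishing returns: doubling a test with Rel = .64 yields about .78, not .128 × more.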
correction of predictive validity: Fajfer's example
Further:
- multivariate simultaneous optimization of the number of subtests, their lengths, reliabilities
and mutual weights
- constituting test batteries of items from item banks:
- test difficulty equating
- linear programming to obtain desired reliability
- dimensionality of test batteries - simplest approach:
confirmatory factor analysis - as a part of the methods for test construction
- constitutive and reductive processes of test construction
- incremental validity and suppressors
An illustration of formal building of classical test theory
Testing as a formalized procedure
A general problem to be formalized:
a "test" is a "diagnostic device" which should allow one to "approximate" the level of a
psychological attribute that is not accessible to direct observation; the
"approximation" should fulfil certain conditions to ensure its minimum diagnostic quality
Formalized formulation (i) - deterministic:
pi ... an object possessing attribute A
P = { pi } ... a nonempty set, possibly infinite (i = 1, 2, ...)
A ... an assumed attribute with levels conceptually representable on
a scale of interval type
A(pi) ... a real point function representing the level of A
(in the sense of the theory of representational measurement)
fA(pi) ... a function approximating the unknown function A(pi)
{A1, A2, ...} ... a set of axioms and conditions setting the way fA approximates A(pi),
among other things dealing also with the discrepancy between fA and A(pi),
e.g. the difference fA − A(pi), which indicates that the formalized definition
must include the pair {fA, A(pi)}
In the ideal case the axioms should be consistent (non-contradictory), independent and complete.
Formally, the problem is thus set.
Interpretation:
P ... population of persons
A ... psychological attribute to be "tested"
A(pi) ... the ideal diagnostic procedure's score representing the level of A
fA ... the practically accessible "test score" approximating the ideal A(pi)
Re-formulation (ii) - stochastic:
τ ≡ τA(pi) ... random variable substituting the ideal score A(pi)
x ≡ xA(pi) ... random variable substituting the test score fA(pi)
Statistical assumptions:
(the sample space, resp. variation range, of τ and x corresponds to the domain, resp. range, of fA)
S1: the 1st and 2nd moments of τ and x are finite
S2: their 2nd moments are nonzero
Definition - Tests
Assume P is a set of elements pi (i = 1, 2, ...), and for each of the n elements j, k, ... from a
non-empty index set I1n assume an ordered pair {xj, τj}, {xk, τk}, ... of random variables,
each variable in each pair with finite first moments and finite, nonzero second moments over P,
where the first variable is directly observable*) but the second one is not. Then the set of the pairs,
T = [{x1, τ1}, ..., {xj, τj}, {xk, τk}, ..., {xn, τn}],
is a family of tests, and each of the pairs is a test, if the following system of three axioms A1, A2, A3 holds.
Axioms, for simplicity formulated using the auxiliary definition xj − τj ≡ εj:
(Remark: the definition is related to the "quality of approximation".)
A1 Expectation: E(εj) = 0 for all j
A2 Correlation: R(εj, τk) = 0 for all pairs j, k
A3 Correlation: R(εj, εk) = 0 for all mutually different j, k
The "auxiliary definition" xj − τj ≡ εj
then turns into
xj = τj + εj, the model of classical test theory,
with the interpretations:
xj ... the observed score of a test
τj ... the true score
εj ... the "error" of the observed score
-------*) Observability vs. unobservability from a statistical point of view is another problem; it depends on the
randomness in the measurement/observation conditions, and the true score can then be defined as the expectation of
the observed score of a test of infinite length. It is also related to the problem of decomposing the "error" or
"unique" part of the score into "specific" and "unreliable" components. For this moment let us rely on just an
intuitive meaning of "observability".
---------
The true score can be of two kinds:
- specific true score - specific to only the one given test, e.g. the "true" number of words in a
verbal test, i.e. uninfluenced by unreliability conditions such as environmental factors
disturbing optimum concentration, the motivation of the tested person, climate conditions etc.;
the attribute A is then concrete and empirically or operationally defined by the test-specific
task, i.e. uniquely and completely measured by the test
- generic true score - common to the content-homogeneous battery of several tests
measuring the same abstract psychological concept - theoretically assumed “construct” A
(say “verbal ability”), the tests thus being of the same “generic” character
For the generic true score the model can then be further decomposed:
xj = τj + εj = γ + sj + εj ,
where, in the generic case, the unique term δj ≡ sj + εj is to be recognized not as just an error due to
unreliability of measurement of the abstract concept γ but rather as:
a) from the point of view of modeling, δj is the discrepancy of the model;
b) from the point of view of testing, δj is the unique part of the test score, i.e. the part unique
to the separate test, while the rest is the part common with the other tests that jointly measure their
common generic attribute A.
In either of the two viewpoints the decomposition
δj = sj + εj
is into
sj ... the specific part, which is not due to unreliability, and
εj ... the error due to unreliability.
An illustration of the derivation of several consequences of the formal formulation:
Decomposition of the variance of the specific true score:
S²(xj) = S²(τj + εj) = S²(τj) + S²(εj) + 2 S²(τj, εj) by the known theorem,
but due to A2 the covariance in the last term is 0, and therefore
S²(xj) = S²(τj) + S²(εj).
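The decomposition can be checked with exact population moments; a minimal sketch with hypothetical discrete distributions, where independence of the true score and the error guarantees the zero covariance required by axiom A2:

```python
from itertools import product

# Numerical check of S^2(x) = S^2(tau) + S^2(eps) with exact population
# moments of two independent discrete variables (hypothetical distributions).
# Independence implies the zero covariance required by axiom A2, and the
# error distribution has mean 0 as required by axiom A1.
tau_dist = [(90, 0.2), (100, 0.5), (110, 0.3)]   # (value, probability)
eps_dist = [(-4, 0.25), (0, 0.5), (4, 0.25)]

def var(dist):
    mean = sum(v * p for v, p in dist)
    return sum(p * (v - mean) ** 2 for v, p in dist)

# Distribution of x = tau + eps over the product space
x_dist = [(t + e, pt * pe) for (t, pt), (e, pe) in product(tau_dist, eps_dist)]

assert abs(var(x_dist) - (var(tau_dist) + var(eps_dist))) < 1e-9
```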
The true variance is equal to the observed-true score covariance:
- by the general definition this covariance is
S²(xj, τj) = E(xj τj) − E(xj) E(τj)
- substitution yields
S²(xj, τj) = E[(τj + εj) τj] − E(τj + εj) E(τj)
= E(τj²) − [E(τj)]² + E(εj τj) − E(εj) E(τj)
= S²(τj) + S²(εj, τj),
and by A2:
S²(xj, τj) = S²(τj). (*)
Coefficient of determination:
R²(xj, τj) = S²(τj) / S²(xj),
as follows after substituting the finding S²(xj, τj) = S²(τj)
into the definition of the correlation coefficient:
R(xj, τj) = S²(xj, τj) / [S(xj) S(τj)] = S²(τj) / [S(xj) S(τj)] = S(τj) / S(xj).
This coefficient of determination has two interpretations:
1. As the ratio of true to observed variance, or the complement of the error-to-observed variance ratio,
S²(τj) / S²(xj) = 1 − S²(εj) / S²(xj) ≡ Rel,
it is called the reliability of the test.
2. Its square root, the correlation of the observed and true score,
R(xj, τj),
is one of the cases of validity.
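The two interpretations can be checked numerically; a minimal sketch with hypothetical variance values (not from the text):

```python
import math

# Hypothetical true and error variances, consistent with the decomposition
# S^2(x) = S^2(tau) + S^2(eps); the numbers are illustrative.
var_tau, var_eps = 49.0, 8.0
var_x = var_tau + var_eps

rel = var_tau / var_x                  # reliability: true / observed variance
rel_complement = 1 - var_eps / var_x   # equivalent complement form
validity_to_true = math.sqrt(rel)      # R(x, tau): validity toward true score
```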
Since, according to the definition of a test, the variances S²(τj) > 0 and S²(xj) > 0, it follows
from (*) that the covariance is also positive, and therefore the correlation
R(xj, τj) > 0,
so by the definition the pair {xj, τj} is a test only if it has nonzero reliability and the
observed score has nonzero validity toward the true score.
The square root of the error variance, i.e. the standard deviation of the errors,
S(εj), is called the standard error of the test score.
Axiom of parallelism - this is the strongest type of test equivalency:
parallel tests or "classically equivalent tests"
{xj, τ}, {xk, τ}, ...
are defined as tests of the same τ (their common true score) that have the same standard
errors:
S(εj) = S(εk) = ...
Consequences of classical-model parallelism:
- parallel tests have the same means of observed scores
(follows from applying E to the model xj = τ + εj and from axiom A1)
- parallel tests have the same observed variances
(follows from the definition of parallelism and from the variance decomposition)
- the true variance of parallel tests is equal to their observed covariance:
S²(xj, xk) = S²(τ) (**)
which follows from substituting the model into the definition of covariance and applying axioms A2, A3, A1:
S²(xj, xk) = E(xj xk) − E(xj) E(xk)
= E[(τ + εj)(τ + εk)] − E(τ + εj) E(τ + εk)
= E[τ² + τ εj + τ εk + εj εk] − {[E(τ) + E(εj)] [E(τ) + E(εk)]}
= E(τ²) + 0 + 0 + 0 − [E(τ)]² − 0·E(τ) − 0·E(τ) − 0·0
= E(τ²) − [E(τ)]² ≡ S²(τ)
This important theorem makes it possible to estimate reliability when two parallel tests or
parallel forms of a test are at one's disposal.
BUT: the frequent use of two repeated measurements assumed to be parallel forms
leads to the tragically mistaken pseudo-explanation of reliability as an
"ability of the test to yield similar or stable results on repetitions".
- parallel tests have the same reliability:
Relj = Relk = Rel
(follows from the definitions of parallelism and of the reliability ratio)
- the mutual observed validity of parallel tests is equal to their reliability:
R(xj, xk) = S²(τ) / S²(x·)
(substitute formula (**) into the numerator of the reliability ratio; further, due to the equal variances,
its denominator can be substituted by the product S(xj) S(xk), which turns the ratio into the
correlation coefficient by definition)
- the correlation coefficients of mutual observed validities between all pairs of parallel tests are
equal, and the correlation matrix of a battery of parallel tests contains this constant in its
off-diagonal cells
(the correlations are equal since each equals the common reliability)
- the correlation matrix of the parallel tests is therefore unidimensional; speaking in
factor-analytic terms, it is "explained" perfectly by one common and general factor, with
factor validity coefficients (loadings) mutually equal (and with equal
uniquenesses/communalities)
- it is also possible to show that parallel tests have the same validity for predicting any
external criterion variable (as dependent variable)
So, in this sense parallel tests - as the strongest case among the types of equivalent tests
- are by intuition really "fully equivalent" and mutually completely interchangeable.
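The parallel-test consequences can likewise be checked with exact moments; a sketch with hypothetical distributions, where the true score and both errors are independent and the errors are identically distributed (giving the equal standard errors required by parallelism):

```python
from itertools import product

# Exact-moment check of two parallel-test consequences: Cov(x_j, x_k) = S^2(tau)
# and R(x_j, x_k) = Rel. tau, eps_j, eps_k are independent discrete variables
# (hypothetical distributions); eps_j and eps_k share one distribution.
tau_dist = [(90, 0.2), (100, 0.5), (110, 0.3)]
eps_dist = [(-4, 0.25), (0, 0.5), (4, 0.25)]

joint = [((t, ej, ek), pt * pj * pk)
         for (t, pt), (ej, pj), (ek, pk) in product(tau_dist, eps_dist, eps_dist)]

def moment(f):
    return sum(p * f(v) for v, p in joint)

E_xj = moment(lambda v: v[0] + v[1])
E_xk = moment(lambda v: v[0] + v[2])
cov_jk = moment(lambda v: (v[0] + v[1]) * (v[0] + v[2])) - E_xj * E_xk
var_x = moment(lambda v: (v[0] + v[1]) ** 2) - E_xj ** 2
var_tau = moment(lambda v: v[0] ** 2) - moment(lambda v: v[0]) ** 2

r_jk = cov_jk / var_x        # correlation (the two observed variances are equal)
rel = var_tau / var_x        # reliability

assert abs(cov_jk - var_tau) < 1e-9   # Cov(x_j, x_k) = S^2(tau)
assert abs(r_jk - rel) < 1e-9         # mutual validity equals reliability
```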