
University of Ostrava
Czech Republic
26-31 March 2012
Classical Test Theory (CTT):
The first half of the 20th century
Advantages – simplicity of data treatment and results
Has a number of essential limitations
Modern Test Theory (Item Response Theory (IRT)):
The second half of the 20th century
Allows overcoming the shortcomings and limitations of CTT
Additional opportunities for data analysis and use of
new technologies
Total number of examinees
Maximum observed score
Minimum observed score
Average score
Standard deviation
Average difficulty level
Maximum difficulty level
Minimum difficulty level
Average discrimination index
Average point-biserial correlation
Reliability (Cronbach's alpha)
Measurement error
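The summary statistics listed above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the workshop: the response matrix is made up, "difficulty" is taken as the proportion of correct answers, and the point-biserial correlation is computed against the uncorrected total score (all of these are assumptions).

```python
import numpy as np

def ctt_summary(responses):
    """Classical statistics for a 0/1 response matrix (examinees x items)."""
    x = np.asarray(responses, dtype=float)
    n_examinees, n_items = x.shape
    total = x.sum(axis=1)                      # raw score of each examinee
    difficulty = x.mean(axis=0)                # proportion correct per item
    # Point-biserial: correlation of each item with the total score
    point_biserial = np.array(
        [np.corrcoef(x[:, i], total)[0, 1] for i in range(n_items)]
    )
    # Cronbach's alpha from item variances and total-score variance
    item_var = x.var(axis=0, ddof=1)
    total_var = total.var(ddof=1)
    alpha = n_items / (n_items - 1) * (1.0 - item_var.sum() / total_var)
    # Standard error of measurement: a single value for the whole group
    sem = np.sqrt(total_var * (1.0 - alpha))
    return {
        "n_examinees": n_examinees,
        "max_score": total.max(),
        "min_score": total.min(),
        "mean_score": total.mean(),
        "sd_score": np.sqrt(total_var),
        "difficulty": difficulty,
        "point_biserial": point_biserial,
        "alpha": alpha,
        "sem": sem,
    }

# Made-up responses of 6 examinees to 4 items, for illustration only
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
stats = ctt_summary(data)
for name, value in stats.items():
    print(name, value)
```

Note that the standard error of measurement comes out as one number for all examinees, which is one of the CTT limitations discussed next.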
Examinee scores are test-dependent
Item characteristics are group-dependent
Measurement scale is ordinal. No transformation of raw scores
improves the scale level
Methods of assessing reliability rest on strong assumptions and
give distorted results
Standard error of measurement is the same for all examinees
Doesn’t provide satisfactory solutions to specific test problems
Is not appropriate for Computer Adaptive Testing
Examinee scores and item parameters lie on different scales
It is not possible to predict item performance
Examinee scores are not test-dependent
Item characteristics are not group-dependent
Standard error of measurement is estimated individually for each
ability estimate and each item difficulty estimate
Measurement scale is interval
Examinee scores and item parameters lie on the same scale
Allows further analysis of the influence of other factors
Provides solutions to specific test problems
Theory of Computer Adaptive Testing is based on IRT
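To illustrate the point about individual standard errors, here is a minimal sketch that assumes the Rasch (1PL) model and a hypothetical set of item difficulties. Under that model the test information at ability θ is the sum of P(1 − P) over items, and the standard error of the ability estimate is its inverse square root.

```python
import math

def rasch_probability(theta, delta):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def standard_error(theta, difficulties):
    """Standard error of an ability estimate located at theta."""
    information = 0.0
    for delta in difficulties:
        p = rasch_probability(theta, delta)
        information += p * (1.0 - p)           # item information
    return 1.0 / math.sqrt(information)

# Hypothetical item difficulties in logits (not taken from the slides)
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, difficulties), 3))
```

The error is smallest where the items are targeted (near θ = 0 here) and grows toward the extremes, in contrast to the single CTT value.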
• between examinee abilities
• between item difficulties
• between examinee ability and item difficulty
• between examinee ability and cut score
The task is to transform formal observations of the outcomes of
random events into measurements expressed as
continuous variables with values on an interval scale
Transforming observations into measurements is possible
only by means of a mathematical model
The main objective of IRT is to develop a mathematical model of
the testing process. The parameters of the model are characteristics of
examinees and test items
Item Characteristic Function (ICF): the probability Pi(θ) that an
examinee with ability θ answers item i correctly
Different IRT models differ in the mathematical
form of the ICF and/or the number of parameters
specified in the model
The plot of the ICF Pi(θ) is called the Item Characteristic Curve
(ICC) of item i
The ICC describes the relationship between the probability of a
correct answer and examinee ability
Each item has its own characteristic curve
ICCs of different items differ in location along the
ability axis and in shape (steepness). These two properties
of the ICC correspond to two characteristics of
items: item difficulty and item discrimination
Item difficulty is a location parameter,
indicating the position of the ICC in relation
to the ability scale.
The greater the value of this parameter, the
greater the ability that is required for an
examinee to get the item right; hence, the
harder the item.
Difficult items are located toward the right end of the
ability scale, easy items toward the left end.
The discrimination parameter is proportional to
the slope of the ICC at its middle point: the
greater the value of this parameter, the
steeper the slope of the ICC and the more useful the
item is for separating examinees into different
ability levels; hence, the greater the discriminating
power of the item.
Unidimensionality: only one ability is measured by
a given set of items
Local Independence: for examinees of a given ability, responses to any
pair of items are statistically independent
The ICF reflects the true relationship between latent
variables (ability level) and observed variables (item responses)
Additional assumptions about item characteristics
that influence item performance by examinees
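Stated formally, and assuming dichotomous items with the ICF notation used above, local independence means that the probability of a whole response pattern, conditional on ability, factors into the item probabilities:

```latex
P(X_{n1}=x_1,\dots,X_{nI}=x_I \mid \theta_n)
  = \prod_{i=1}^{I} P_i(\theta_n)^{x_i}\,\bigl[1-P_i(\theta_n)\bigr]^{1-x_i},
  \qquad x_i \in \{0,1\}
```

This factorization is what makes it possible to write the likelihood of the data as a product over items and examinees.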
The first line of development:
• Lord & Novick, “Statistical Theories of Mental Test Scores” (1968)
• Seminar at ETS (F. Lord, M. Novick, A. Birnbaum, S. Messick,
F. Samejima, R. McDonald, W. Meredith)
• R. Darrell Bock and his collaborators at the University of Chicago (David
Thissen, Eiji Muraki, Robert Gibbons, Robert Mislevy): algorithms of
parameter estimation and the first computer programs for data analysis
The second line of development:
• G. Rasch, “Probabilistic Models for Some Intelligence and Attainment
Tests” (1960)
• Andersen (1972): effective methods of parameter estimation under
Rasch models
• B. Wright (University of Chicago) and his followers (David Andrich,
Geoffrey Masters, Mark Wilson, Mike Linacre and others):
“Best Test Design” (1979, with Stone M.H.)
“Rating Scale Analysis” (1982, with Masters G.N.)
The choice of model for measurement
Parameter estimation
Fit analysis
Analysis of test items
Analysis of examinees
Analysis of the test (dimensionality, error of
measurement, co-functioning of different item types,
validity issues, etc.)
Specific test problems (test equating, detection and
measuring rater effects, standard setting procedures,
DIF analysis, detection of doubtful test results)
Scale construction and results reporting
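As an illustration of the parameter estimation stage, the following sketch estimates one examinee's ability by maximum likelihood with Newton-Raphson iterations. It assumes the Rasch model and treats the item difficulties as already known; the difficulties and responses are hypothetical. Operational programs estimate item and person parameters jointly or marginally, which is not shown here.

```python
import math

def estimate_ability(responses, difficulties, max_iter=50, tol=1e-8):
    """ML ability estimate and its standard error under the Rasch model."""
    raw_score = sum(responses)
    n_items = len(responses)
    if raw_score == 0 or raw_score == n_items:
        raise ValueError("zero and perfect scores have no finite ML estimate")
    theta = math.log(raw_score / (n_items - raw_score))   # starting value
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties]
        information = sum(p * (1.0 - p) for p in probs)
        # Newton-Raphson step: (observed - expected score) / information
        step = (raw_score - sum(probs)) / information
        theta += step
        if abs(step) < tol:
            break
    # Recompute the information at the final estimate
    probs = [1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return theta, 1.0 / math.sqrt(information)

# Hypothetical item difficulties (logits), treated as known
difficulties = [-1.5, -0.5, 0.5, 1.5]
theta_hat, se = estimate_ability([1, 1, 1, 0], difficulties)
print(round(theta_hat, 3), round(se, 3))
```

Under the Rasch model only the raw score enters the estimate, so any response pattern with three correct answers on these items gives the same ability estimate.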
An IRT model is defined by the mathematical equation for its ICF
The main difference between models is the number of item
parameters specified in the model
Any model rests on certain assumptions,
which must be checked
The choice of model is the responsibility of the researcher.
Properties of test items (dichotomous / polytomous, item
type, item characteristics)
Properties of the test (unidimensional / multidimensional)
Properties of the model (scale properties, possible
solutions of test specific problems)
Model-data fit
Comparison of models (statistical indices)
Comparison of item difficulty estimates and ability
estimates in different models
Available sample size
Rasch Models
Characteristic curves of items (in the
dichotomous case) or of their steps
(in the polytomous case) differ only in
location along the ability axis, i.e. they
don’t cross (are parallel)
Item difficulty is the only item
characteristic that defines the result
of item performance by an examinee
Other IRT Models
Characteristic curves of items (in the
dichotomous case) or of their steps
(in the polytomous case) are not
parallel; they cross: every curve
has its own slope (discrimination
parameters are different)
Items differ in difficulty level and discrimination
The simplest models that provide parameter
Include minimal number of parameters
Parameters have simple interpretation, can be
easily estimated (on the interval scale with estimate
of precision)
Can be applied to all item types that are used in
educational and psychological tests
Theory of item and examinee analysis is well developed
All specific testing problems can be easily solved
Dichotomous Models
One-Parameter Logistic Model (1PL, Rasch Model)
Two-Parameter Logistic Model (2PL)
Three-Parameter Logistic Model (3PL)
Polytomous Models
Partial Credit Model (PCM, Masters,
1982): extension of the Rasch Model;
Rating Scale Model (RSM, Andrich, 1978):
modification of PCM for use with
questionnaires that have a common
rating scale format (for example,
Likert-type scales);
Graded Response Model (GRM,
Samejima, 1969): extension of 2PL;
Modified Graded Response Model (MGRM, Muraki, 1992): modification of
GRM for use with questionnaires that
have a common rating scale format (for
example, Likert-type scales).
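To make the link between the Partial Credit Model and the Rasch model concrete, here is a minimal sketch of PCM category probabilities. It uses the standard step-difficulty parameterization, in which the weight of score x is the exponential of the sum of (θ − δk) over the first x steps; the function name and the step values are illustrative assumptions.

```python
import math

def pcm_probabilities(theta, step_difficulties):
    """Probabilities of scores 0..m on one item under the Partial Credit Model."""
    cumulative = [0.0]                         # score 0: empty sum
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item scored 0, 1, 2 with step difficulties -1 and 1
probs = pcm_probabilities(0.0, [-1.0, 1.0])
print([round(p, 3) for p in probs])
```

With a single step the expression reduces to the dichotomous Rasch model, and at θ equal to a step difficulty the two adjacent score categories are equally likely.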
Unidimensional Models
Multidimensional Models
All IRT models were initially
developed as unidimensional ones
Unidimensionality needs to be checked
The latent characteristic that defines test
results is not necessarily a valid
measure of the construct the test was
designed to measure
Some instruments are designed
with items meant to measure
multiple domains of ability
Only one attempt at an item
Only one attempt at an item is allowed
Most IRT models satisfy this condition
Multiple attempts at an item
Several independent attempts at an item are
allowed and the number
of successes is counted (tests of the
psychomotor system)
Binomial Model
Poisson Model (the number of trials
is infinite and the probability of
success is small)
Nominal Response Model (NRM, R. Darrell Bock)
Many-Facet Rasch Model (Linacre J.M.)
$$P(X_{ni}=1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}$$
where P(Xni = 1 | θn, δi) is the probability that
examinee n (n = 1, …, N) with ability θn answers
item i (i = 1, …, I) with difficulty δi correctly.
δi – the point on the ability scale where the probability of a correct response is
0.5. The greater the value of this parameter, the greater the ability that is
required for an examinee to have a 50% chance of getting the item correct;
hence, the harder the item.
In theory δi parameter can vary from -∞ to +∞, but typically values of δi vary
from about -3 to +3.
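The formula can be checked numerically. The sketch below is an illustration only: the function name and the ability grid are assumptions, and the expression 1/(1 + exp(−x)) is used as an algebraically equivalent form of exp(x)/(1 + exp(x)).

```python
import math

def rasch_probability(theta, delta):
    """P(X = 1 | theta, delta) under the 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# An easy, a medium and a hard item over a grid of abilities
for delta in (-3.0, 0.0, 3.0):
    row = [round(rasch_probability(theta, delta), 2) for theta in (-3, -1, 0, 1, 3)]
    print("delta =", delta, row)
```

Whenever θ equals δ the probability is exactly 0.5, and for a fixed ability the probability falls as δ increases.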
ICCs differ only in their location along the ability
scale, they don’t cross (are parallel).
Item difficulty is the only item characteristic that
influences examinee performance.
All items are equally discriminating.
The lower asymptote of the ICC is zero: examinees
of very low ability have zero probability of correctly
answering the item (no guessing).
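One way to see why the curves do not cross is to rewrite the model in log-odds (logit) form, which follows directly from the 1PL formula:

```latex
\ln\frac{P(X_{ni}=1 \mid \theta_n,\delta_i)}{1-P(X_{ni}=1 \mid \theta_n,\delta_i)} = \theta_n-\delta_i
\quad\Longrightarrow\quad
\operatorname{logit}P_{ni}-\operatorname{logit}P_{nj}=\delta_j-\delta_i
```

The difference between two items does not depend on θn, so on the logit scale the item curves are parallel straight lines shifted by the difference in difficulty.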
$$P(X_{ni}=1 \mid \theta_n, \delta_i, a_i) = \frac{\exp[a_i(\theta_n - \delta_i)]}{1 + \exp[a_i(\theta_n - \delta_i)]}$$
where P(Xni = 1 | θn, δi, ai) is the probability that
examinee n (n = 1, …, N) with ability θn answers
correctly item i (i = 1, …, I) with difficulty δi and
discrimination ai
The item difficulty parameter δi is defined in the same way as in
the 1PL model.
Discrimination parameter is proportional to the
slope of the ICC at point δi : items with steeper
slopes are more useful for separating examinees
into different ability levels than items with less
steep slopes.
In theory ai parameter can vary from -∞ to +∞,
but typically values of ai vary from about 0 to +2.
Negatively discriminating items should be deleted
because something is wrong with the item if the
probability of answering it correctly decreases as
examinee ability increases.
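The statement about the slope can be made concrete: differentiating the 2PL formula gives a slope of ai·P·(1 − P), which equals ai/4 at θ = δi. The sketch below checks this numerically; the function names and parameter values are illustrative assumptions.

```python
import math

def two_pl_probability(theta, delta, a):
    """P(X = 1 | theta, delta, a) under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def icc_slope(theta, delta, a, h=1e-5):
    """Numerical slope of the ICC at theta (central difference)."""
    upper = two_pl_probability(theta + h, delta, a)
    lower = two_pl_probability(theta - h, delta, a)
    return (upper - lower) / (2.0 * h)

for a in (0.5, 1.0, 2.0):
    print("a =", a, "slope at delta:", round(icc_slope(0.0, 0.0, a), 4))

# A negatively discriminating item: success becomes less likely as ability grows
print(two_pl_probability(-1.0, 0.0, -1.0), two_pl_probability(1.0, 0.0, -1.0))
```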
• ICCs are not parallel, they can cross.
• Each ICC has its own slope (discrimination values
are different).
• Items differ in difficulty level. The difficulty parameter δi
is the point on the ability scale where an examinee has
a 50% chance of getting the item correct.
• The lower asymptote of the ICC is zero: examinees
of very low ability have zero probability of correctly
answering the item (no guessing).
For item 1: δ1 = -1, a1 = 1;
for item 2: δ2 = -1, a2 = 0.5;
for item 3: δ3 = 1, a3 = 0.75;
for item 4: δ4 = 1, a4 = 2.
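With the four parameter sets above, the crossing of 2PL curves can be computed directly. Two ICCs meet where a1(θ − δ1) = a2(θ − δ2), i.e. at θ = (a1δ1 − a2δ2)/(a1 − a2). The following sketch applies this to the listed items; the helper names are assumptions made for illustration.

```python
import math

# (difficulty, discrimination) for items 1-4 as listed above
ITEMS = {1: (-1.0, 1.0), 2: (-1.0, 0.5), 3: (1.0, 0.75), 4: (1.0, 2.0)}

def two_pl_probability(theta, delta, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def crossing_point(first, second):
    """Ability at which the ICCs of two items intersect (None if parallel)."""
    (d1, a1), (d2, a2) = ITEMS[first], ITEMS[second]
    if a1 == a2:
        return None                            # equal slopes: curves do not cross
    return (a1 * d1 - a2 * d2) / (a1 - a2)

for pair in [(1, 2), (3, 4), (1, 4)]:
    theta = crossing_point(*pair)
    p = two_pl_probability(theta, *ITEMS[pair[0]])
    print(pair, "cross at theta =", theta, "with P =", round(p, 3))
```

Items with the same difficulty cross at that difficulty (P = 0.5), while items 1 and 4 cross at θ = 3, so which of the two is "easier" depends on the ability level.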
$$P(X_{ni}=1 \mid \theta_n, \delta_i, a_i, c_i) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_n - \delta_i)]}{1 + \exp[a_i(\theta_n - \delta_i)]}$$
The additional parameter ci is often called the guessing
parameter. It provides a nonzero lower asymptote for the ICC to
take into account performance at the low end of the ability scale
It is the probability of answering the item correctly for
low-ability examinees.
For 1PL and 2PL ci = 0.
In theory the ci parameter can vary from 0 to 1, but typically
values below 0.35 are considered acceptable.
The parameter ci represents the probability of examinees
with low ability answering the item correctly.
Item difficulty has a different meaning than in the 1PL
and 2PL models. It is still the point of inflection of the ICC, but it is
no longer the point at which the probability of success is 0.5. Now it
is the point on the ability scale where the probability of a
correct answer equals (1 + ci)/2 = 1/2 + ci/2, the
midpoint between ci and 1.
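The value (1 + ci)/2 follows from substituting θn = δi into the 3PL formula, where the exponent becomes zero:

```latex
P(X_{ni}=1 \mid \theta_n=\delta_i) = c_i + (1-c_i)\,\frac{\exp(0)}{1+\exp(0)}
  = c_i + \frac{1-c_i}{2} = \frac{1+c_i}{2}
```

For example, with ci = 0.2 an examinee whose ability equals the item difficulty answers correctly with probability 0.6 rather than 0.5.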
Discrimination parameter ai is proportional to the slope of the
ICC at point δi as before.
For item 1: δ1 = -1, a1 = 1.5, c1 = 0.1;
For item 2: δ2 = 0, a2 = 2, c2 = 0.05;
For item 3: δ3 = 1, a3 = 2, c3 = 0.2;
For item 4: δ4 = 1, a4 = 1, c4 = 0.
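The behaviour of these four items can be reproduced with a short sketch of the 3PL formula. The function and variable names are assumptions; the parameter values are the ones listed above.

```python
import math

# (difficulty, discrimination, guessing) for items 1-4 as listed above
ITEMS = {
    1: (-1.0, 1.5, 0.10),
    2: (0.0, 2.0, 0.05),
    3: (1.0, 2.0, 0.20),
    4: (1.0, 1.0, 0.00),
}

def three_pl_probability(theta, delta, a, c):
    """P(X = 1 | theta, delta, a, c) under the 3PL model."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - delta)))
    return c + (1.0 - c) * logistic

for number, (delta, a, c) in ITEMS.items():
    at_difficulty = three_pl_probability(delta, delta, a, c)
    very_low = three_pl_probability(-10.0, delta, a, c)
    print(number, round(at_difficulty, 3), round(very_low, 3))
```

At θ = δi each item gives (1 + ci)/2, and for very low abilities the probability approaches the lower asymptote ci; item 4, with c4 = 0, behaves like a 2PL item.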
Winsteps (all Rasch models)
XCalibre, Multilog, Parscale (all unidimensional IRT
models: 1PL, 2PL, 3PL and their polytomous extensions)
Facets (Many-Facet Rasch Model)
ConQuest (all Rasch models, including Many-Facet and
Multidimensional ones)
Simple to use
A huge number of outputs:
item analysis, test analysis
(including special kinds of
analysis), examinee analysis
Good graphics
Good manual
Inexpensive
Works only with Rasch models
No other disadvantages
TITLE = "History-var1"
NAME1 = 1 ; column of start of person information
; maximum length of person information
ITEM1 = 4 ; column of first item-level response
NI = 51 ; number of items = test length
; number of columns per response
PERSON = Student ; Persons are called ...
ITEM = Item ; Items are called ...
CODES = 012345 ; valid response codes
;IVALUEC= 0112 ; for rescoring for item type A
;IVALUED= 01122; for rescoring for item type B
;IVALUEG= 01233
;24 .75
;25 2.83
;26 3.13
Distribution of examinees and items on the same scale
Checking Dimensionality
DIF – analysis of items
Optimizing Rating Scale and Partial Credit
Category Effectiveness
Analysis of examinees (statistics and
Analysis of different scales
An opportunity to fix item difficulties or
category difficulties
Fit the
Handy and intuitively clear
Works both with Rasch models
and other IRT models
(dichotomous and polytomous)
Good item analysis
Allows choosing the method of
parameter estimation
Performs DIF analysis
Produces a report automatically
Is intended for item analysis only
Analysis of examinees is not provided
No special kinds of analysis
(variable map, dimensionality
investigation, etc.)
[Figure: IRT Item Parameter Calibration Report, created on 12.07.2011]
Works both with Rasch models
and other IRT models
(dichotomous and polytomous)
Allows choosing the method of
parameter estimation
Allows choosing the best model
Attractive graphics (ICCs of all
items are shown on the same plot)
Dated and inconvenient; the control file
is hard to create
Analysis of examinees is not provided
No special kinds of analysis
(variable map, dimensionality
investigation, DIF analysis, etc.)
Rasch model
Works with a large number of
models, including Many-Facet
and Multidimensional ones
Reports a confidence interval for
fit statistics
Good item analysis
Creates a variable map
Many outputs
Requires creation of a control file
Requires special knowledge to
interpret outputs in complex cases
Doesn’t work with 2PL and 3PL
models and their polytomous extensions
Title Rater Effects Model One;
datafile CQdata.dat;
format id 1-4 rater 6-8 responses 10-18 rater 20-22
responses 23-31 ! criteria (9);
codes 0,1,2,3;
score (0,1,2,3) (0,1,2,3);
labels << Items.lab;
model rater + criteria + criteria*step;
show ! estimates=latent >> rater1.shw;
itanal >> rater1.itn;
show cases ! estimates=mle >> students1.shw;
Plot icc;
Plot expected;
Comparison of
two raters who
evaluated the
same set of
The model used
Required analysis
Software availability
Personal preferences