An Introduction to Item Response Theory - IRT
Dalton Andrade (UFSC)
Héliton Tavares (UFPA)
Adriano Borgatto (UFSC)
Overview
 Main ideas, concepts and applications of Item Response
Theory – IRT in different areas
 Session 1:
1. Main ideas and the concepts of IRT
2. Unidimensional models for dichotomous and polytomous items
3. Estimation methods
4. Construction of latent trait scales
 Session 2:
1. Equating methods
2. Differential item functioning – DIF
3. Computerized adaptive testing – CAT
4. Several applications of IRT in many different areas
Why do we need to get measurements?
We have got measurements since the first day of our lives!
• Temperature (degrees): Celsius / Fahrenheit, C = (F – 32)/1.8
• Body weight (mass): kilogram / pound, 1 kg = 2.2046226218 lb
• Body height: meter / feet, 1 m = 3.28084 ft
• Blood pressure: millimeters of mercury (mm Hg)
• etc.
• What about these characteristics?
  • Satisfaction
  • Depression (psychiatry)
  • Quality of life
  • Math proficiency/ability (education)
  • Use of statistics in the workplace
  • Web usability (e-commerce)
  • Diagnostic reasoning (nursing)
  • Resistance to change
  • Organizational environmental performance
  and so on.
• They are all examples of what we call a “latent trait”
• They are “characteristics” that cannot be measured or observed directly
• We need to “build scales/metrics” to measure them
How to measure a latent trait?
• To build a scale, we will need:
  • Measurement instrument: questionnaire, test, ...
  • Scale/metric
Motivation
• Is it possible to estimate the body height of a person?
INSTRUCTIONS: Please read the statements below about yourself and answer 1 for “YES” and 0 for “NO”, filling in the green line or using the NO/YES options.
1. In bed, I often suffer from cold feet
2. When walking down the stairs, I take two steps at a time
3. I think I would do well on a basketball team
4. As a police officer, I would not make much of an impression
5. In most cars, I sit uncomfortably
6. I literally look up to most of my friends
7. I am able to pick up an object from the top of a cabinet without using a ladder
8. I bump my head quite often
9. I can store luggage in the overhead compartment of a plane or bus
10. I usually set the car seat back
11. When I get a ride, I am usually offered the front seat
12. For school pictures I was always asked to stand in the last row
13. I have trouble fitting into a bus seat
14. Among several friends, I would be the one preferred for changing light bulbs
Motivation
“Playing” with body height(*)
[Figure: height scale marked from 1.50 m to 1.85 m in steps of 5 cm]
(*) Many thanks to Prof. C. A. W. Glas – University of Twente – Netherlands - ABE - SINAPE 2006.
Positioning respondents and items on the same scale
[Figure sequence: items I9, I7, I10, I2 and I4 are placed on the height scale (1.50 m to 1.85 m); the respondents Dalton, Adriano and Héliton are then placed on the same scale. The same positions are then shown without the height labels, on a scale running from -4 to 3, and on a scale running from 60 to 130.]
Concepts and Constructs
 Concept:
 “An abstraction formed by generalization from particulars”
 Abstracts are hard to define
 E.g. intelligence
 Construct:
 A concept with scientific purpose (i.e. operationalized)
 Can be measured and studied.
 E.g. IQ
Psy 427 - Cal State Northridge
What is item analysis in general?
 Item analysis provides a way of measuring the quality of
questions - seeing how appropriate they were for the
respondents and how well they measured their ability/trait.
 It also provides a way of re-using items over and over again in
different tests with prior knowledge of how they are going to
perform; creating a population of questions with known
properties (e.g. test bank)
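Item Analysis
[Diagram: item analysis branches into Classical Test Theory and Latent Trait Models; the latent trait branch covers Item Response Theory and Rasch models, with the model families 1PL, 2PL, 3/4/5PL, Grad, Nom, Mult, Ass, Unfold, …]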
Types of items
• Dichotomous: Body height
“In bed, I often suffer from cold feet”: Yes/No
• Polytomous ordinal: Memory
“Do you forget to give messages?”: Never, Rarely, Sometimes, Frequently, Always
• Polytomous nominal: Math
“Multiple choice item with five categories A, B, C, D, E”: usually treated as Right/Wrong
……
• One can have more than one type of item in the same questionnaire
Classical Test Theory (CTT)
• Classical Test Theory (CTT) is often called the “true score model”
• It is called classical relative to Item Response Theory (IRT), which is a more modern approach (digital vs. analog)
• CTT describes a set of psychometric procedures used to assess the reliability, difficulty, discrimination, etc. of items and scales
Classical Test Theory vs.
Latent Trait Models
 Classical analysis has the test (not the item) as
its basis. Although the statistics generated are
often generalised to similar students taking a
similar test; they only really apply to those
students taking that test
 Latent trait models aim to look beyond that at
the underlying traits which are producing the
test performance. They are measured at item
level and provide sample-free measurement
Classical Test Theory (CTT)
• Assumes that every person has a true score on an item or a scale, which we would observe if we could measure it directly without error
• CTT analysis assumes that a person’s test score is composed of their “true” score plus some measurement error
• This is the common true score model:
$X = T + E$
CTT: Internal Consistency Reliability
Coefficient alpha (Cronbach's) can also be defined as:
$\alpha = \frac{k}{k-1}\cdot\frac{s_{Total}^{2} - \sum_{i=1}^{k} s_{i}^{2}}{s_{Total}^{2}}$
• $s_{Total}^{2}$ is the composite variance (if items were summed)
• $s_{i}^{2}$ is the variance of each item i = 1, …, k
• k is the number of items
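The coefficient alpha definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `cronbach_alpha` and the small data set are hypothetical, and sample variances are used for both the items and the total score.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # Variance of each item across respondents (s_i^2).
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of the summed score (s_Total^2).
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (total_var - sum(item_vars)) / total_var

# Hypothetical data: 5 respondents answering 4 dichotomous items.
data = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
alpha = cronbach_alpha(data)
print(round(alpha, 3))
```

Alpha approaches 1 when the items vary together, so that the variance of the total is much larger than the sum of the item variances.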
Standard Error of Measurement
• The standard error of measurement is the error associated with trying to estimate a true score from a specific test
• This error can come from many sources
• We can calculate its size by:
$S_{meas} = S\sqrt{1 - \alpha}$
• S is the standard deviation; $\alpha$ is the reliability
Graphical Information
CTT Parameters (“Middle” Item)
Key: B    N: 20438    Difficulty: Average    Discrimination: Very Good

Responses       A       B       C       D
% Total     24.10   53.30   16.10    6.50
% Group 1   28.20   31.10   28.10   12.60
% Group 2   29.10   47.50   17.40    6.00
% Group 3   17.10   75.50    5.50    1.90
Rbis        -0.10    0.40   -0.35   -0.34

[Figure: percentage choosing each option (A, B, C, D) in groups 1, 2 and 3; vertical axis from 0 to 100]
Item Response Theory (IRT)
 Item Response Theory (IRT) – refers to a family of latent trait
models used to establish psychometric properties of items
and scales
 Sometimes referred to as modern psychometrics because in
large-scale education assessment, testing programs and
professional testing firms IRT has almost completely replaced
CTT as method of choice
 IRT has many advantages over CTT that have brought IRT into
more frequent use
 Item response theory (IRT)
 Set of probabilistic models that…
 Describes the relationship between a respondent’s magnitude on
a construct (a.k.a. latent trait; e.g., extraversion, cognitive ability,
affective commitment)…
 To his or her probability of a particular response to an individual
item
Some other advantages
 Provides more information than classical
test theory
 Classical test statistics depend on the set of
items and sample examined
 IRT modeling not dependent on sample
examined
 Can examine item bias/ measurement
equivalence and provide conditional
standard errors of measurement
Three Basic Components of IRT
• Item Response Function (IRF): mathematical function that relates the latent trait to the probability of endorsing an item
  • IRFs can be plotted as Item Characteristic Curves (ICC), which show the probability of endorsing the item as a function of the respondent's ability
• Item Information Function: an indication of item quality; an item’s ability to differentiate among respondents
• Invariance: position on the latent trait can be estimated from any items with known IRFs, and item characteristics are population independent up to a linear transformation
Item Response Theory
Models: they depend on the item type
• Items scored as right/wrong, Yes/No, etc.
Logistic model (one-dimensional trait) with 1, 2 or 3 parameters
$P(U_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta_j - b_i)}}$
3-parameter logistic model (3PL)
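As a quick illustration of the 3PL formula, the sketch below evaluates the probability of a correct response for a few ability values. The function name `p_3pl` and the parameter values are assumptions chosen for the example; they do not come from the slides.

```python
import math

def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: discrimination 1.2, difficulty 1.0, guessing 0.2.
for theta in (-2.0, 0.0, 1.0, 3.0):
    print(theta, round(p_3pl(theta, a=1.2, b=1.0, c=0.2), 3))
```

Setting c = 0 gives the 2PL model, and additionally fixing a common value of a for all items gives the 1PL model. At theta = b the probability equals (1 + c)/2, which is 0.5 only when c = 0.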
Item Characteristic Curve - ICC
[Figure: ICC of the 3PL model; vertical axis: probability of correct response (0 to 1); horizontal axis: ability (latent trait), from -4 to 4; the parameters a, b and c are marked on the curve; annotations: a = 1.7, b = 0, c = 0]
IRF – Item Parameters Location (b)
• An item’s location is defined as the amount of the latent trait needed to have a .5 probability of endorsing the item (for models without guessing, c = 0).
• The higher the “b” parameter, the higher on the trait level a respondent needs to be in order to endorse the item
• Analogous to difficulty in CTT
• Like z-scores, the values of b typically range from -3 to +3 on the (0,1) scale
IRF – Item Parameters Discrimination (a)
• Indicates the steepness of the IRF at the item's location
• An item's discrimination indicates how strongly related the item is to the latent trait, like loadings in a factor analysis
• Items with high discrimination are better at differentiating respondents around the location point; small changes in the latent trait lead to large changes in probability
• Vice versa for items with low discrimination
IRF – Item Parameters Guessing (c)
• The inclusion of a “c” parameter suggests that respondents very low on the trait may still choose the correct answer.
• In other words, respondents with low trait levels may still have a small probability of endorsing an item
• This is mostly used with multiple choice testing, and the value should not vary excessively from the reciprocal of the number of choices.
$P(U_{ij} = 1 \mid \theta_j) = c_i + \frac{d_i - c_i}{\left[1 + e^{-a_i(\theta_j - b_i)}\right]^{f_i}}$
IRF – Item Parameters Upper asymptote (d)
• The inclusion of a “d” parameter suggests that respondents very high on the latent trait are not guaranteed (i.e. have less than 1 probability) to endorse the item
• Often an item that is difficult to endorse (e.g. suicidal ideation as an indicator of depression)
• Not used in most cases
IRF – Item Parameters Asymmetry (f)
• The inclusion of an “f” parameter suggests that the item's IRF may not be symmetrical around the “b” parameter.
• Not used in most cases
Effect of the “a” parameter
[Figure: ICC with small “a”: poor discrimination]
[Figure: ICC with larger “a”: better discrimination]
Effect of the “b” parameter
[Figure: ICC with low “b”: “easy item”]
[Figure: ICC with higher “b”: more difficult item; “b” is inversely related to the CTT p (proportion correct)]
Effect of the “c” parameter
[Figure: ICC with c = 0: asymptote at zero]
[Figure: ICC with c > 0: “low ability” respondents may endorse the correct response]
ICC from real data
More calibrated items
Some items have problems and should be revised.
ICCs from body height application
Item Response Theory
Some other models
IRT: Nominal Response Model (NRM)
• Introduced by Bock (1972)
• Polytomous responses in NRM are unordered
• It considers all response categories (h = 1, ..., m_i)
$P(U_{ijs} = 1 \mid \theta_j) = \frac{\exp[a_{is}(\theta_j - b_{is})]}{\sum_{h=1}^{m_i} \exp[a_{ih}(\theta_j - b_{ih})]}$
The interpretation of a_is and b_is is the same as in the logistic model
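The nominal model formula can be evaluated directly as a normalized set of exponentials. The sketch below is only an illustration: the function name `nrm_probs` is hypothetical, and the parameter values a = (-2, -1, 1, 0) and b = (-2, -1, 2, 1) are the ones quoted in the slide's figure for a four-category item.

```python
import math

def nrm_probs(theta, a, b):
    """Category probabilities under the nominal response model."""
    z = [a_s * (theta - b_s) for a_s, b_s in zip(a, b)]
    z_max = max(z)  # subtract the maximum for numerical stability
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]

a = (-2.0, -1.0, 1.0, 0.0)
b = (-2.0, -1.0, 2.0, 1.0)
for theta in (-4.0, 0.0, 4.0):
    print(theta, [round(p, 3) for p in nrm_probs(theta, a, b)])
```

With these values the first category is the most likely one at very low trait levels and the third category at very high trait levels, since those categories have the smallest and largest slopes.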
[Figure: Nominal model; probabilities P1 to P4 (0 to 1) versus latent trait (-4 to 4), with a = (-2, -1, 1, 0) and b = (-2, -1, 2, 1)]
IRT: Graded Response Model (GRM)
• Samejima (1969, 1972, 1995)
• Likert-scale items (strongly disagree, disagree, neutral, agree, and strongly agree)
• GRM considers ordered categories (h = 1, ..., m_i)
$P(U_{ijs} = 1 \mid \theta_j) = \frac{1}{1 + \exp[-a_i(\theta_j - b_{is})]} - \frac{1}{1 + \exp[-a_i(\theta_j - b_{i(s+1)})]}$
with $b_{i1} \le b_{i2} \le \dots \le b_{im_i}$
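The graded model builds each category probability as a difference of two cumulative logistic curves. The sketch below assumes the usual boundary conventions, which the slide does not state explicitly: the cumulative probability is 1 below the first threshold and 0 above the last one, so an item with m thresholds has m + 1 categories numbered 0 to m. The function name `grm_probs` is hypothetical; the values a = 1.2 and b = (-2, -1, 1) are those quoted in the slide's figure.

```python
import math

def grm_probs(theta, a, b):
    """Category probabilities (categories 0..m) under the graded response model.

    b holds the ordered thresholds b_1 <= ... <= b_m.
    """
    # Cumulative probabilities P(category >= s), with the boundary
    # conventions P(>= 0) = 1 and P(>= m + 1) = 0 (assumed here).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b_s))) for b_s in b]
    cum += [0.0]
    return [cum[s] - cum[s + 1] for s in range(len(b) + 1)]

a, b = 1.2, (-2.0, -1.0, 1.0)
for theta in (-4.0, 0.0, 4.0):
    print(theta, [round(p, 3) for p in grm_probs(theta, a, b)])
```

Because the thresholds are ordered, every difference is non-negative and the category probabilities sum to 1.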
[Figure: Graded response model; probabilities P0 to P3 versus latent trait (-4 to 4), with a = 1.2 and b = (-2, -1, 1)]
Item Response Models
• Generalized Partial Credit Model (GPCM)
• Partial Credit Model (PCM): GPCM with ai = 1
• Rating Scale Model (RSM): Andrich (1978a, 1978b), GRM with bis = bi – ds
• Unfolding models (Roberts, 2000): non-cumulative latent traits (attitude, behavior, ...)
• Multidimensional (Dr. Reckase): Compensatory Three-Parameter Logistic Model (MC3PLM) …
• These models can be used for one-group or multiple-group analysis
Item Response Theory
Item and Test Information Function
IRT – Item Information Function
Statistical Fisher information
• Each IRF can be transformed into an item information function (IIF): the precision an item provides at all levels of the latent trait.
• The information is an index representing the item’s ability to differentiate among individuals.
• The variance of the latent trait estimate is the reciprocal of the information, so the standard error of measurement is the reciprocal of its square root; more information means less error.
• Measurement error is expressed on the same metric as the latent trait level, so it can be used to build confidence intervals.
IRT – Item Information Function
Difficulty parameter - the location of the
highest information point
Discrimination - height of the information.
Large discriminations - tall and narrow IIFs;
high precision/narrow range
Low discrimination - short and wide IIFs; low
precision/broad range.
[Figure: item information functions for the 2PL, 3PL and 4PL models; information (0 to 1) versus trait level (-3 to 3)]
IRT – Test Information Function
 Test Information Function (TIF) – The IIFs are also additive
so that we can judge the test as a whole and see at
which part of the trait range it is working the best.
[Figure: test information function; information (0 to 1.4) versus trait level (-3 to 3)]
Item Response Theory
Important Assumptions:
1. Invariance
2. Dimensionality
IRT - Invariance
 Invariance - IRT model parameters have an
invariance property
 Examinee trait level estimates do not depend on
which items are administered, and in turn, item
parameters do not depend on a particular sample
of examinees (within a linear transformation).
 Invariance allows researchers to:
1) efficiently “link” different scales that measure
the same construct,
2) compare examinees even if they responded
to different items, and
3) implement computerized adaptive testing.
IRT - Dimensionality
 The models presented make a common
assumption of unidimensionality
 Hattie (1985) reviewed 30 techniques
 Some propose the ratio of the 1st eigenvalue
to the 2nd eigenvalue (Lord, 1980)
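The eigenvalue-ratio check can be sketched with NumPy as below. This is a rough illustration only: the data are simulated from a hypothetical unidimensional 2PL test, the function name `eigenvalue_ratio` is made up, and ordinary Pearson correlations between the 0/1 responses are used for simplicity, whereas tetrachoric correlations are often preferred for dichotomous items.

```python
import numpy as np

def eigenvalue_ratio(responses):
    """Ratio of the first to the second eigenvalue of the inter-item correlation matrix."""
    corr = np.corrcoef(np.asarray(responses, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return eigenvalues[0] / eigenvalues[1]

# Simulated example (assumption): 2000 respondents, 10 items, one latent trait.
rng = np.random.default_rng(0)
theta = rng.standard_normal(2000)
b = np.linspace(-1.5, 1.5, 10)
a = 1.5
prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
responses = (rng.random(prob.shape) < prob).astype(int)
ratio = eigenvalue_ratio(responses)
print(ratio)
```

A large ratio suggests a dominant first factor; there is no universally agreed cut-off, so the value is best read together with a scree plot.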
PAF and scree plots
• If the data are dichotomous, factor analyze tetrachoric correlations
• Assume a continuum underlies item responses
[Figure: scree plot showing a dominant first factor]
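Item Response Theory
Estimation:
1. Items (calibration): marginal maximum likelihood
2. Ability: conditional ability distribution
Estimating equations for item parameters
$\frac{\partial \log L(\zeta,\eta)}{\partial a_i} = D(1-c_i)\sum_{k=1}^{K}\sum_{j=1}^{s_k} r_{kj}\int_{\mathbb{R}}\left[(u_{kji}-P_i)(\theta-b_i)W_i\right]g_{kj}^{*}(\theta)\,d\theta$
$\frac{\partial \log L(\zeta,\eta)}{\partial b_i} = -D a_i(1-c_i)\sum_{k=1}^{K}\sum_{j=1}^{s_k} r_{kj}\int_{\mathbb{R}}\left[(u_{kji}-P_i)W_i\right]g_{kj}^{*}(\theta)\,d\theta$
$\frac{\partial \log L(\zeta,\eta)}{\partial c_i} = \sum_{k=1}^{K}\sum_{j=1}^{s_k} r_{kj}\int_{\mathbb{R}}\left[(u_{kji}-P_i)\frac{W_i}{P_i^{*}}\right]g_{kj}^{*}(\theta)\,d\theta$
Here D is a scaling constant in the logistic exponent (D = 1 gives the 3PL form shown earlier).
• Numerical process: EM algorithm
Estimating equation for ability
Based on the ability distribution conditional on the response vector:
$g_j^{*}(\theta) \equiv g(\theta \mid u_j) \propto P(u_j \mid \theta)\, g(\theta \mid \eta)$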
[Figure: likelihood function for each individual, plotted against ability (-3 to 3), together with the N(0,1) density]
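One way to use the conditional ability distribution is to take its mean as the ability estimate (often called EAP) and its standard deviation as the error. The sketch below is an assumption-laden illustration rather than the procedure used in the slides' software: it evaluates likelihood times an N(0,1) prior on an equally spaced grid, and the function names and item parameters are hypothetical.

```python
import math

def p_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """Posterior mean and standard deviation of theta on a grid, with an N(0,1) prior."""
    grid = [lo + (hi - lo) * q / (n_points - 1) for q in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0,1) density
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            w *= p if u == 1 else 1.0 - p   # likelihood of the response vector
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical calibrated items (a, b, c) and one response vector.
items = [(1.0, -1.0, 0.0), (1.2, 0.0, 0.0), (1.0, 1.0, 0.0), (1.5, 0.5, 0.2)]
print(eap_estimate([1, 1, 0, 1], items))
```

Respondents with all-correct or all-wrong response vectors still receive finite estimates, because the prior keeps the posterior proper.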
Example 1
• SARESP 2007: Portuguese (LP)
• 3rd grade of high school: Prova-POR-3EM-Manha.pdf
• 30 multiple choice items
• 1,001 students (small sample just for presentation)
• Data: 3EM_Manha.DAT file
• Software: BilogMG - 3EM_Manha.BLM (syntax)
• Results: 3EM_Manha.PH1, 3EM_Manha.PH2, 3EM_Manha.PH3, 3EM_Manha.PAR, 3EM_Manha.SCO
Example 2
• Lifestyle
• Questionnaire: next page
• 15 polytomous ordinal items with four categories: No, Sometimes, Almost always, Always
• 580 respondents
• Data: Estilo.dat file
• Software: Multilog – Estilo.MLG (syntax)
• Results: Estilo.OUT
Items by dimension (item number and description)
Nutrition
1. Your daily diet includes at least 5 servings of fruits and vegetables.
2. You avoid eating fatty foods (fatty meats, fried foods) and sweets.
3. You eat 5 varied meals a day, including a full breakfast.
Physical Activity
4. You do at least 30 minutes of moderate/intense activity, continuous or accumulated, on 5 or more days a week.
5. At least twice a week you do exercises that involve strength and muscle stretching.
6. In your daily life, you walk or cycle as a means of transport and, preferably, use the stairs instead of the elevator.
Preventive Behavior
7. You know your blood pressure and your cholesterol levels, and you try to control them.
8. You do not smoke and do not drink alcohol (or drink in moderation).
9. You respect traffic rules (as a pedestrian, cyclist or driver); if you drive, you always wear a seat belt and never drink alcohol.
Social Relationships
10. You seek to cultivate friends and are satisfied with your relationships.
11. Your leisure includes meeting friends, group sports activities, and participation in associations or social organizations.
12. You seek to be active in your community, feeling useful in your social environment.
Stress Control
13. You set aside time (at least 5 minutes) every day to relax.
14. You can hold a discussion without getting upset, even when contradicted.
15. You balance the time dedicated to work with the time dedicated to leisure.
Example 3: Body Height
Is it possible to estimate the body height of a person?
The 14 statements from the Motivation questionnaire are answered with 1 for “YES” and 0 for “NO”.
Responses to items
Item       1  2  3  4  5  6  7  8  9  10  11  12  13  14
Response   1  1  0  0  1  1  1  0  1   1   1   1   1   1
Score: 11
Height (cm): 184 (6'0")
[Figure: statistical distribution of height; horizontal axis from 145 to 185 cm]
Item Response Theory
Building the Ability scale
1. Positioning items
2. Interpretation
Positioning Items
• Definition of anchor items:
Consider two consecutive levels Y and Z, with Y < Z.
We say that an item is an anchor item at level Z if and only if
a) $P(X = 1 \mid \theta = Z) \ge 0.65$
b) $P(X = 1 \mid \theta = Y) < 0.50$
c) $P(X = 1 \mid \theta = Z) - P(X = 1 \mid \theta = Y) \ge 0.30$
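The three anchoring conditions translate directly into a small check. The sketch below is illustrative: the function names, the scale levels and the item parameters are hypothetical, and the 3PL model is used to obtain the probabilities at each level.

```python
import math

def p_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def is_anchor(p_y, p_z):
    """Conditions a), b) and c) for an item to anchor at level Z."""
    return p_z >= 0.65 and p_y < 0.50 and (p_z - p_y) >= 0.30

def anchor_level(levels, a, b, c=0.0):
    """Return the first level Z at which the item anchors, or None."""
    for y, z in zip(levels, levels[1:]):
        if is_anchor(p_3pl(y, a, b, c), p_3pl(z, a, b, c)):
            return z
    return None

levels = [-2.0, -1.0, 0.0, 1.0, 2.0]         # hypothetical scale levels
print(anchor_level(levels, a=1.5, b=0.5))    # anchors at level 1.0
print(anchor_level(levels, a=0.3, b=0.0))    # too flat to anchor: None
```

Items with low discrimination rarely satisfy condition c), so they tend not to anchor at any level.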
Positioning Items
• Back to Example 1: Portuguese, 30 items
• PositioningItems.xlsx
• Interpretation of the scale
Test Equating
 Participants that have taken different tests measuring
the same construct, can be placed on the same scale
and compared or scored equivalently
 Equating across grades on math ability
 Equating across years for placement or admissions
tests
Test Equating
• Example 3: Evening students
• 3rd grade of high school: Prova-POR-3EM-Noite.pdf
• 30 multiple choice items
• 1,001 students (small sample just for presentation)
• Data: 3EM_Noite.DAT file
• Software: BilogMG - 3EM_Noite.BLM (syntax)
• Results: 3EM_Noite.PH1, 3EM_Noite.PH2, 3EM_Noite.PH3, 3EM_Noite.PAR, 3EM_Noite.SCO
Test Equating
Five common items between the two tests: items 15 to 19 in both tests
• Invariance principle
$P(U = 1 \mid \theta, a, b, c) = P(U = 1 \mid \theta^{*}, a^{*}, b^{*}, c^{*})$
with $\theta^{*} = \lambda\theta + \beta$, $b^{*} = \lambda b + \beta$, $a^{*} = a/\lambda$ and $c^{*} = c$
Equal_24_07_Posteriori_ManhaNoite.xls
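The invariance principle can be checked numerically, and the constants λ and β can be estimated from the common items. The sketch below uses the mean-sigma method as one common choice; the slides do not say which method the spreadsheet uses, so this is an assumption. The function names and the difficulty values for the five common items are hypothetical.

```python
import math
from statistics import mean, stdev

def p_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def rescale_item(a, b, c, lam, beta):
    """Move item parameters to the target scale: a* = a/lam, b* = lam*b + beta, c* = c."""
    return a / lam, lam * b + beta, c

def mean_sigma(b_from, b_to):
    """Estimate lam and beta from the difficulties of the common items (mean-sigma method)."""
    lam = stdev(b_to) / stdev(b_from)
    beta = mean(b_to) - lam * mean(b_from)
    return lam, beta

# Hypothetical difficulties of the five common items on each scale.
b_morning = [-1.0, -0.4, 0.1, 0.7, 1.5]
b_evening = [1.2 * b + 0.3 for b in b_morning]
lam, beta = mean_sigma(b_morning, b_evening)

# Invariance: the probability is unchanged after the linear transformation.
theta, a, b, c = 0.5, 1.1, 0.2, 0.2
p_before = p_3pl(theta, a, b, c)
p_after = p_3pl(lam * theta + beta, *rescale_item(a, b, c, lam, beta))
print(lam, beta, p_before, p_after)
```

With real estimates the common-item difficulties are not an exact linear function of each other, so λ and β are only approximate.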
Test Equating
Multiple group equating: k = 1, 2, …, K
Group k: mean µk and variance σk²
$P(U_{ijk} = 1 \mid \theta_{jk}) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta_{jk} - b_i)}}$
For the reference group R, we set µR = 0 and σR² = 1
Test Equating
Example 4: Example 1 + Example 3 with K = 2 and R = 1, so µ1 = 0 and σ1² = 1
• 55 (= 30 + 30 – 5) multiple choice items
• 2,002 students
• Data: 3EM_Equat_MxE.DAT file
• Software: BilogMG - 3EM_Equat_MxE.BLM (syntax)
• Results: 3EM_Equat_MxE.PH1, 3EM_Equat_MxE.PH2, 3EM_Equat_MxE.PH3, 3EM_Equat_MxE.PAR, 3EM_Equat_MxE.SCO
Test Equating
Example 5: National Basic Education Assessment System - SAEB
• 5th and 9th grades (Fundamental) and 3rd grade (High school)
• Every two years (odd years)
• In each grade, the number of items needed is much larger than what one student can answer
How many items do we need to cover a matrix?
One example (SARESP): 13 booklets with 8 items each (104 items). Every examinee takes just 3 booklets.
Items in each booklet (rows: item position within the booklet; columns: booklets 1 to 13)
Position   B1   B2   B3   B4   B5   B6   B7   B8   B9  B10  B11  B12  B13
1          10   93   29   55   67   11  103   21   74   89   18   38   40
2          79   27   84   28   64    1   48   47  102    8   82   66   90
3          52  100   45   72   24   75   59   13   32   95   77   19   73
4          81   76   62   80   63   51   25   33    2   12   37   96   46
5          68   53   14    4    5   85   30   83   86   39   23  104   54
6          16   91   61   97   78   69   56   57   41   70   65   31    7
7          44   87   26   15   99    3   35   71   98   49    6   88   50
8          22   34   43   94   20   60   92   42    9  101   58   36   17
BIB (Balanced Incomplete Block) design:
Total: 26 bundles.
Bundle   Booklets (positions 1, 2, 3)     Bundle   Booklets (positions 1, 2, 3)
 1        1    2    3                     14        1    2    5
 2        2    3    4                     15        2    3    6
 3        3    4    5                     16        3    4    7
 4        4    5    6                     17        4    5    8
 5        5    6    7                     18        5    6    9
 6        6    7    8                     19        6    7   10
 7        7    8    9                     20        7    8   11
 8        8    9   10                     21        8    9   12
 9        9   10   11                     22        9   10   13
10       10   11   12                     23       10   11    1
11       11   12   13                     24       11   12    2
12       12   13    1                     25       12   13    3
13       13    1    2                     26       13    1    4
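The bundle table follows a cyclic pattern: bundles 1 to 13 take booklets (k, k+1, k+2) and bundles 14 to 26 take booklets (k, k+1, k+4), wrapping around after booklet 13. The sketch below regenerates the design from that pattern and counts how often each booklet is used; the function names are hypothetical and the pattern is inferred from the table rather than stated in the slides.

```python
from collections import Counter

N_BOOKLETS = 13

def wrap(k):
    """Map any positive integer onto booklet numbers 1..13 cyclically."""
    return (k - 1) % N_BOOKLETS + 1

def build_bundles():
    first = [(k, wrap(k + 1), wrap(k + 2)) for k in range(1, N_BOOKLETS + 1)]
    second = [(k, wrap(k + 1), wrap(k + 4)) for k in range(1, N_BOOKLETS + 1)]
    return first + second

bundles = build_bundles()
# How many bundles contain each booklet, overall and by position.
usage = Counter(booklet for bundle in bundles for booklet in bundle)
by_position = [Counter(bundle[p] for bundle in bundles) for p in range(3)]

for number, bundle in enumerate(bundles, start=1):
    print(number, bundle)
print(dict(usage))
```

In this layout every booklet appears in six bundles, twice in each of the three positions, which balances position effects across booklets.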
Test Equating
Example 5: National Basic Education Assessment System - SAEB
• Common items between grades
• Common items between years
• Multiple groups model
• Items already calibrated and new items
[Figures: SAEB - LP (Portuguese) and SAEB - MT (Mathematics)]
Differential Item Functioning (DIF)
• How can age groups, genders, cultures, ethnic groups, and socioeconomic backgrounds be meaningfully compared?
• Can be a research goal, as opposed to just a test of an assumption
• Test the equivalence of test items translated into multiple languages
• Test for items influenced by cultural differences
• Test for intelligence items that are gender biased
• Test for age differences in response to personality items
Post-Administration Activities
• Helps identify whether a test item is accurately reflecting real differences between groups or whether the item itself is producing unfair differences.
• Discard items that are demonstrably unfair
• Individuals with the same score/proficiency respond differently to an item because they belong to different groups
• Examples: sex, race, region, EJA/non-EJA, etc.
Important:
• We are not saying, for example, that students from the Northeast cannot show a higher proportion of correct answers on a mathematics item than students from the South!
• What we are saying is that students with the same mathematics proficiency, whether from the Northeast or from the South, should show the same performance on the item
• DIFFERENCES between groups = impact
• DIFFERENCES between matched-for-ability groups = DIF
DIF by IRT
• Uniform DIF: only the b (difficulty) parameter differs between groups
• Non-uniform DIF: the b (difficulty) and a (discrimination) parameters differ between groups
[Figure: uniform DIF]
[Figure: non-uniform DIF]
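The difference between uniform and non-uniform DIF can be seen by comparing the item characteristic curves of two groups. The sketch below is a descriptive illustration, not a statistical DIF test: the function names and parameter values are hypothetical, a 2PL model is assumed, and the classification only looks at whether the difference between the two curves changes sign on a fixed grid from -4 to 4.

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def classify_dif(ref, focal, tol=1e-9):
    """Compare the ICCs of the reference and focal groups on a grid of theta values.

    ref and focal are (a, b) pairs estimated separately for each group.
    """
    grid = [x / 10.0 for x in range(-40, 41)]
    diffs = [p_2pl(t, *ref) - p_2pl(t, *focal) for t in grid]
    has_pos = any(d > tol for d in diffs)
    has_neg = any(d < -tol for d in diffs)
    if has_pos and has_neg:
        return "non-uniform"   # curves cross: direction of DIF depends on theta
    if has_pos or has_neg:
        return "uniform"       # one group is favored at every trait level
    return "none"

print(classify_dif(ref=(1.2, 0.0), focal=(1.2, 0.5)))   # only b differs
print(classify_dif(ref=(1.5, 0.0), focal=(0.7, 0.0)))   # a differs
```

In practice the parameter estimates carry sampling error, so any observed difference would need a formal test before an item is flagged.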
Computerized Adaptive Testing
CAT
An item is given to the participant (usually easy
to moderate difficulty) and their answer allows
their trait score to be estimated, so that the next
item is chosen to target that trait level
After the second item is answered their trait
score is re-estimated, etc.
CA tests are at least twice as efficient as their
paper and pencil counterparts with no loss of
precision
The implementation of a CAT is not an easy task.
It involves different skills in different areas of
knowledge and very sensitive issues such as
information security, item bank development,
choice of estimation methods, criteria for
selecting the next item, stopping rules,
incorporation of new items etc.
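The basic CAT cycle (select an item, score the answer, re-estimate the trait, repeat until a stopping rule is met) can be sketched as below. Every design choice here is an assumption made for illustration: maximum-information item selection, EAP estimation with an N(0,1) prior, a stopping rule based on a maximum test length or a standard error threshold, a hypothetical 25-item bank, and a simulated respondent who answers deterministically.

```python
import math

def p_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c=0.0):
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def eap(responses, items, n_points=81):
    """Posterior mean and standard deviation of theta with an N(0,1) prior."""
    grid = [-4.0 + 8.0 * q / (n_points - 1) for q in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, max_items=10, se_stop=0.40):
    """Administer items one at a time until max_items or the target standard error."""
    administered, responses = [], []
    theta, se = 0.0, float("inf")   # start at the prior mean
    while len(administered) < max_items and se > se_stop:
        remaining = [i for i in range(len(bank)) if i not in administered]
        if not remaining:
            break
        # Select the unused item with maximum information at the current estimate.
        nxt = max(remaining, key=lambda i: item_information(theta, *bank[i]))
        administered.append(nxt)
        responses.append(answer(nxt))
        theta, se = eap(responses, [bank[i] for i in administered])
    return theta, se, administered

# Hypothetical bank: 25 items with a = 1.5, b from -3 to 3, no guessing.
bank = [(1.5, -3.0 + 0.25 * i, 0.0) for i in range(25)]

def answer(i):
    """Simulated respondent: answers correctly whenever b <= 1.0."""
    return 1 if bank[i][1] <= 1.0 else 0

theta_hat, se_hat, used = run_cat(bank, answer)
print(round(theta_hat, 2), round(se_hat, 2), used)
```

A production CAT would also need item exposure control, content balancing and secure item bank management, which are among the issues listed above.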
Implementation and use in Brazil
1. University of São Paulo at São Carlos (USP-SC)
2. Federal University of Santa Catarina (UFSC)
3. Federal University of Pará (UFPA)
4. Cesgranrio Foundation
5. University of Brasília - Cespe/Cebraspe
6. Vunesp Foundation
Applications of IRT in Education
Brazilian assessments
• ENEM (Exame Nacional do Ensino Médio / National High School Exam)
• SAEB (Sistema Nacional de Avaliação da Educação Básica / National Basic Education Assessment System)
• ENCCEJA: National Exam for Certification of Competences of Youngsters and Adults
• ANA: National Literacy Assessment
• SARESP, SisPAE, SaePE …
Applications of IRT in Education
International assessments
• PISA: Programme for International Student Assessment - OECD
• TIMSS: Trends in International Mathematics and Science Study
• TALIS: Teaching and Learning International Survey
• (T)ERCE (Unesco): (Third) Comparative Latin America and Caribbean Study:
  • “The most important large-scale study of learning achievement in the region, as it covers 15 countries (Argentina, Brazil, Chile, Colombia, Costa Rica, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Dominican Republic and Uruguay) plus the State of Nuevo León (Mexico).”
Applications of IRT in other areas
Environment and Ecology
Almeida, V. L. Avaliação do Desempenho Ambiental de Estabelecimentos de Saúde, por meio da Teoria da Resposta ao Item, como Incremento da criação do Conhecimento Organizacional. Doctoral thesis, PPGEGC/UFSC, 2009.
Trierweiller, A. C., Peixe, B. C. S., Tezza, R., Bornia, A. C., Andrade, D. F. and Campos, L. M. S. Environmental management performance for brazilian industrials: measuring with the item response theory. Work, 41, 2179-2186, 2011.
Afonso, M. H. F. Mensuração da Predisposição ao Comportamento Sustentável por Meio da Teoria da Resposta ao Item. Master's dissertation, PPGEP/UFSC, 2013.
Trierweiller, A. C., Peixe, B. C. S., Bornia, A. C., Campos, L. M. S. and Tezza, R. (2013). Evidenciation of environmental management: an evaluation with item response theory. Brazilian Journal of Operations & Production Management, v. 9, no. 2, 91-109.
Peixe, B. C. S. Mensuração da Maturidade do Sistema de Gestão Ambiental de Empresas Industriais Utilizando a Teoria de Resposta ao Item. Doctoral thesis, PPGEP/UFSC, 2014.
Applications of IRT in other areas
Customer Satisfaction
Costa, M. B. F. (2001). Técnica derivada da teoria da resposta ao item aplicada ao setor de serviços. Master's dissertation, PPGMUE/UFPR.
Bortolotti, S. L. V. (2003). Aplicação de um modelo de desdobramento da teoria da resposta ao item – TRI. Master's dissertation, EPS/UFSC.
Bayley, S. (2001). Measuring customer satisfaction. Evaluation Journal of Australasia, v. 1, no. 1, 8-16.
Total Quality Management
Alexandre, J. W. C., Andrade, D. F., Vasconcelos, A. P. and Araújo, A. M. S. (2002). Uma proposta de análise de um construto para a medição dos fatores críticos da gestão pela qualidade através da teoria da resposta ao item. Gestão & Produção, v. 9, n. 2, p. 129-141.
Bosi, M. A. (2010). Um Estudo sobre o Grau de Maturidade e a Evolução da Gestão pela Qualidade Total no Setor de Transformação Cearense por Meio da Teoria da Resposta ao Item. Master's dissertation, GES-LOG/UFC.
Applications of IRT in other areas
Psychiatry / Psychology
Psychiatric scales:
• Beck Depression Inventory (BDI)
• Depressive symptoms scale (CES-D)
• Sex addiction screening scale (ERDS)
Schaeffer, N. C. (1988). An Application of Item Response Theory to the Measurement of Depression. Sociological Methodology, 18, 271–307.
Coleman, M. J., Matthysse, S., Levy, D. L., Cook, S., Lo, J. B. Y., Rubin, D. B. and Holzman, P. S. (2002). Spatial and object working memory impairments in schizophrenia patients: a bayesian item-response theory analysis. Journal of Abnormal Psychology, 111, number 3, 425-435.
Hays, R., Morales, L. S. and Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, v. 38.
Kirisci, L., Hsu, T. C. and Tarter, R. (1994). Fitting a two-parameter logistic item response model to clarify the psychometric properties of the drug use screening inventory for adolescent alcohol and drug abusers. Alcohol Clin. Exp. Res. 18: 1335–1341.
Langenbucher, J. W., Labouvie, E., Sanjuan, P. M., Bavly, L., Martin, C. S. and Kirisci, L. (2004). An application of item response theory analysis to alcohol, cannabis and cocaine criteria in DSM-IV. Journal of Abnormal Psychology 113: 72–80.
Yesavage, J. A., Brink, T. L., Rose, T. L. et al. (1983). Development and validation of a geriatric depression screening scale: a preliminary report. J Psychiat Res, 17: 37-49.
Cúri, M. (2006). Análise de questionários com itens constrangedores. Doctoral thesis, IME/USP, São Paulo.
Applications of IRT in other areas
Organizational Leadership
SCHERBAUM, C. A., FINLINSON, S., BARDEN, K. & TAMANINI, K. Applications of Item Response Theory to Measurement Issues in Leadership Research. Leadership Quarterly, 17, 366-386, 2006.
Applies both models, cumulative (GRM) and unfolding (GGUM).
Attribute Importance
SAMARTINI, A. L. S. Modelos com Variáveis Latentes Aplicados à Mensuração de Importância de Atributos. Doctoral thesis, Escola de Administração de Empresas de São Paulo, 2006.
Applications of IRT in other areas
Quality of Life
Mesbah, M., Cole, B. F. and Lee, M. L. T. (Eds.) (2002). Statistical methods for quality of life studies: design, measurements and analysis. Boston: Kluwer Academic Publishers.
Genetics: to measure the predisposition of an individual to a specific disease
Tavares, H. R.; Andrade, D. F.; Pereira, C. A. (2004). Detection of determinant genes and diagnostic via item response theory. Genetics and Molecular Biology, v. 27, n. 4, p. 679-685.
Food insecurity
Wilde, P. E. (2004). Differential Response Patterns Affect Food-Security Prevalence Estimates for Households with and without Children. J. Nutr. 134: 1910–1915.
Physicians' Clinical Competence
Das, J. and Hammer, J. (2005). Which doctor? Combining vignettes and item response to measure clinical competence. Journal of Development Economics 78, 348-383.
 Tezza. R., Bornia, A.C., Andrade, D.F.(2011). Measuring web
usability using item response theory: Principles, features and
opportunities. Interacting with Computers, 23, 167-175.
 Menegon, L.S.(2013). Mensuração de Conforto e Desconforto
em Poltrona de Aeronave pela Teoria da Resposta ao Item.
Tese de Doutorado, PPGEP/UFSC.
Adilson(tese)
Silvana(tese)
 Juliano(paper)
Applications of IRT in other areas
116
Laboratório de Custos e Medidas – LCM/
EPS/UFSC (www.custosemedidas.ufsc.br)
Linha de Pesquisa: Teoria da Resposta ao Item
Aplicada às Organizações
PPGEP / UFSC
Some “Theoretical Applications”
 Santos, V.L.F., Moura, F.A.S., Andrade, D.F., Gonçalves,
K.C.M. (2016). Multidimensional and Longitudinal Item
Response Models for Non-ignorable Data. Computational
Statistics and Data Analysis (Accepted for publication)
 Borgatto, A.F., Azevedo, C.L., Pinheiro, A., Andrade, D.F.(2015).
Comparison of Ability Estimation Methods Using IRT for Tests
with Different Degrees of Difficulty. Communications in
Statistics-Simulation and Computation, 44, 474-488.
 Caio(paper)
 Mariana(paper)
 Héliton(paper)
Computational Aspects
 Commercial: BILOG-MG, MULTILOG, IRTPRO, …
 Non-commercial (free): R packages for IRT
 ltm
 irtoys
 mirt
 catR
 mirtCAT
 psych
 https://en.wikipedia.org/wiki/Psychometric_software
References
 ANDRADE, D. F., TAVARES, H. R., VALLE, R. C. (2000).
Teoria da Resposta ao Item: conceitos e aplicações.
14º SINAPE, Associação Brasileira de Estatística.
(Available at www.inf.ufsc.br/~dandrade/tri)
 BAKER, F. B., (1992). Item Response Theory: Parameter
Estimation Techniques. Marcel Dekker.
 BEATON, A. E.; ALLEN, N. L. (1992). Interpreting scales through scale
anchoring. Journal of Educational Statistics, v. 17, p. 191–204.
 BOCK, R.D. & ZIMOWSKI, M.F. (1996). Multiple Group IRT,
in Linden, W.J. van der & Hambleton, R.K. (eds).
Handbook of Modern Item Response Theory, Springer.
 Embretson, S. E. and Reise, S. P. (2000). Item response
theory for psychologists. New Jersey: Lawrence
Erlbaum Associates, Inc., Publishers.
 KLEIN, R. (2003). Utilização da Teoria de Resposta ao
Item no Sistema Nacional de Avaliação da Educação
Básica (SAEB). Ensaio: Avaliação e Políticas Públicas
em Educação, Rio de Janeiro, v.11, n.40, p.283-296,
2003.
 LORD, F. M. (1980). Applications of item response
theory to practical testing problems. Hillsdale: Lawrence
Erlbaum Associates Inc.
 LORD, F. M.; NOVICK, M. R. (1968). Statistical Theories of Mental
Test Scores. Reading: Addison-Wesley.
 RECKASE, M. D. (2009). Multidimensional Item Response
Theory. New York: Springer.
 Sistema Nacional de Avaliação da Educação Básica:
SAEB 2001, Relatório Técnico (2002). Consórcio
Fundação Cesgranrio/Fundação Carlos Chagas, Rio
de Janeiro.
Thank you!!!
 Dalton F. Andrade (UFSC/Vunesp)
dalton.andrade@ufsc.br
 Héliton R. Tavares (UFPA/Vunesp/UF)
heliton@ufpa.br
 Adriano Borgatto (UFSC)
adriano.borgatto@ufsc.br
Applications
 Scaling individuals for further analysis
 We often collect data in multifaceted forms (e.g. multi-item
surveys) and then collapse them into a single raw score
 IRT-based scores represent an optimal scaling of individuals on the
trait
 Most sophisticated analyses require at least interval-level
measurement, and IRT scores are closer to interval level than raw
scores
 Using scaled scores as opposed to raw scores has been shown to
reduce spurious results
Applications
 Scale Construction and Modification
 The focus is changing from creating fixed length,
paper/pencil tests to creating a “universe” of items with
known item response functions (IRFs) that can be used interchangeably
 Scales are being designed based around IRT properties
 Pre-existing scales that were developed using CTT are
being “revamped” using IRT
Research trends
 Use of Response Times: Same approach works with
any other source of collateral information, e.g.,
physiological measures, confidence marking, etc.
 Item Cloning for banking:
 Use computer to generate new items from family for
administration to examinee
 Calibrate item families of clones rather than each
individual item
 Use hierarchical IRT to allow for (small) random variation
in item parameter values
 Multidimensional and/or multivariate approaches
Applications
 Computer Adaptive Testing (CAT)
 Computer adaptive tests are at least twice as efficient as their
paper-and-pencil counterparts with no loss of precision
 Primary testing approach used by ETS
 Adaptive form of the Headache Impact Survey outperformed the
paper-and-pencil counterpart in reducing patient burden, tracking
change and in reliability and validity (Ware et al., 2003)
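One common way to make a test adaptive is to pick, at each step, the unused item with the largest information at the current ability estimate. The sketch below illustrates that rule under the 3PL model. The scaling constant `D = 1.7`, the function names and the three-item bank are assumptions made for illustration; they are not taken from the slides or from any particular CAT program.

```python
import math

D = 1.7  # assumed scaling constant (normal-ogive metric)


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of one 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def next_item(theta, bank, administered):
    """Index of the not-yet-administered item with maximum information."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))


# Hypothetical bank of (a, b, c) triples: equal slopes, no guessing.
bank = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

first = next_item(0.9, bank, administered=set())
second = next_item(0.9, bank, administered={first})
```

With equal slopes and no guessing, information peaks where the difficulty equals the ability, so the rule simply walks through the items in order of closeness to the current estimate.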
Item Response Theory
Estimation with
BILOG-MG and R
Before we begin…
 Data preparation
 Raw data must be recoded if necessary (negatively
worded items must be reverse coded such that all items
in the scale indicate a positive direction)
 Dichotomization (optional)
 Reducing multiple options into two separate values (0, 1;
right, wrong)
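The two preparation steps can be sketched in a few lines. This is only an illustration: the 1 to 5 response scale, the cut point of 4 and the function names are assumptions, not values from the tutorial data.

```python
def reverse_code(responses, low, high):
    """Reverse-code a negatively worded item scored from low to high."""
    return [low + high - r for r in responses]


def dichotomize(responses, cut):
    """Collapse ordered categories into 0/1: 1 if the response is >= cut."""
    return [1 if r >= cut else 0 for r in responses]


# Hypothetical answers to a negatively worded item on a 1-5 scale.
raw = [1, 2, 4, 5]
recoded = reverse_code(raw, low=1, high=5)
binary = dichotomize(recoded, cut=4)
```

Reverse coding comes first so that, after dichotomization, a 1 points in the positive direction of the scale for every item.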
Estimating 3PL parameters
BILOG-MG (Scientific Software)
Multiple files in directory (ASCII text): BLM, DAT, NPF, PRM, …
Data file must be saved as ASCII text
Each record: ID number, then individual responses
BILOG-MG input file (*.BLM)
Title lines: 2 (the second may be blank or not)
AGREEABLENESS CALIBRATION FOR IRT TUTORIAL.
>COMMENT
>GLOBAL DFN='AGR2_CAL.DAT', NIDW=4, NPARM=3, OFNAME='OMIT.KEY', SAVE;
>SAVE SCO = 'AGR2_CAL.SCO', PARM = 'AGR2_CAL.PAR', COV = 'AGR2_CAL.COV';
>LENGTH NITEMS=(10);
>INPUT SAMPLE=99999;
(4A1,10A1)
>TEST TNAME=AGR;
>CALIB NQPT=40, CYC=100, NEW=30, CRIT=.001, PLOT=0;
>SCORE MET=2, IDIST=0, RSC=0, NOPRINT;
BILOG input file (*.BLM)
>GLOBAL DFN='AGR2_CAL.DAT', NIDW=4, NPARM=3, OFNAME='OMIT.KEY', SAVE;
DFN: data file name
NIDW: characters in ID field
NPARM: number of item parameters (3 for the 3PL model)
OFNAME: file for missing responses
BILOG input file (*.BLM)
>SAVE SCO = 'AGR2_CAL.SCO', PARM = 'AGR2_CAL.PAR', COV = 'AGR2_CAL.COV';
Requested files for: scoring (SCO), parameters (PARM), covariances (COV)
BILOG input file (*.BLM)
>LENGTH NITEMS=(10);
NITEMS: number of items
>INPUT SAMPLE=99999;
SAMPLE: sample size
BILOG input file (*.BLM)
(4A1,10A1)
FORTRAN statement for reading data
>TEST TNAME=AGR;
TNAME: name of scale/measure
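The format statement (4A1,10A1) describes a fixed-width record: four one-character fields for the ID (matching NIDW=4) followed by ten one-character fields for the item responses (matching NITEMS=(10)). As a rough illustration, the same record can be split in Python as shown below. The sample record and the function name are hypothetical, and the sketch assumes every response is already coded as the digit 0 or 1.

```python
def parse_record(line, id_width=4, n_items=10):
    """Split one fixed-width data record into (ID, list of 0/1 responses)."""
    line = line.rstrip("\n")
    if len(line) < id_width + n_items:
        raise ValueError("record is shorter than the declared format")
    respondent_id = line[:id_width]
    # Assumes each response column holds the digit 0 or 1.
    responses = [int(ch) for ch in line[id_width:id_width + n_items]]
    return respondent_id, responses


# Hypothetical record: ID "0001" followed by ten dichotomous responses.
rid, resp = parse_record("00011101001110")
raw_score = sum(resp)
```

Checking a few records this way before calibration is a cheap guard against a data file whose columns do not line up with the format statement.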
BILOG input file (*.BLM)
>CALIB NQPT=40, CYC=100, NEW=30, CRIT=.001, PLOT=0;
Estimation specifications (not the default for BILOG-MG)
BILOG input file (*.BLM)
>SCORE MET=2, IDIST=0, RSC=0, NOPRINT;
Scoring: maximum likelihood, no prior distribution of scale
scores, no rescaling
Phase one output file (*.PH1)
CLASSICAL ITEM STATISTICS FOR SUBTEST AGR
              NUMBER   NUMBER                      ITEM*TEST CORRELATION
ITEM  NAME     TRIED    RIGHT   PERCENT  LOGIT/1.7    PEARSON   BISERIAL
------------------------------------------------------------------------
   1  0001    1500.0   1158.0     0.772       0.72      0.535      0.742
   2  0002    1500.0    991.0     0.661       0.39      0.421      0.545
   3  0003    1500.0   1354.0     0.903       1.31      0.290      0.500
   4  0004    1500.0   1187.0     0.791       0.78      0.518      0.733
   5  0005    1500.0    970.0     0.647       0.36      0.566      0.728
   6  0006    1500.0   1203.0     0.802       0.82      0.362      0.519
   7  0007    1500.0    875.0     0.583       0.20      0.533      0.674
   8  0008    1500.0    810.0     0.540       0.09      0.473      0.594
   9  0009    1500.0   1022.0     0.681       0.45      0.415      0.542
  10  0010    1500.0    869.0     0.579       0.19      0.426      0.538
------------------------------------------------------------------------
Can indicate problems in parameter estimation
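The LOGIT/1.7 and BISERIAL columns can be approximately recovered from the proportion correct and the Pearson (point-biserial) correlation using the usual textbook conversions. The sketch below applies them to item 0001. It is an assumption that the program computes the columns exactly this way, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def logit_over_1_7(p):
    """Log-odds of a correct response, divided by the constant 1.7."""
    return math.log(p / (1.0 - p)) / 1.7


def biserial_from_pearson(r_pb, p):
    """Convert a point-biserial correlation into a biserial correlation."""
    z = NormalDist().inv_cdf(p)        # normal deviate cutting off proportion p
    ordinate = NormalDist().pdf(z)     # normal density at that deviate
    return r_pb * math.sqrt(p * (1.0 - p)) / ordinate


# Item 0001 from the Phase one output: 1158 right out of 1500 tried.
p_item1 = 1158.0 / 1500.0
logit_item1 = logit_over_1_7(p_item1)
biserial_item1 = biserial_from_pearson(0.535, p_item1)
```

For item 0001 both results land close to the printed 0.72 and 0.742; the small gap in the biserial comes from starting with a Pearson value already rounded to three decimals.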
Phase two output file (*.PH2)
CYCLE 12: LARGEST CHANGE = 0.00116
-2 LOG LIKELIHOOD = 15181.4541
CYCLE 13: LARGEST CHANGE = 0.00071
[FULL NEWTON STEP]
-2 LOG LIKELIHOOD = 15181.2347
CYCLE 14: LARGEST CHANGE = 0.00066
Check for convergence
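One simple way to check convergence is to compare each cycle's largest parameter change with the criterion requested on the >CALIB line (CRIT=.001). The sketch below does this for the three cycles shown. It assumes the criterion is applied to the LARGEST CHANGE value, and the parsing pattern and function names are illustrative, written only for the line layout seen on this slide.

```python
import re

CRIT = 0.001  # from CRIT=.001 on the >CALIB line

# The three cycle lines shown in the Phase two output.
log_text = """CYCLE 12: LARGEST CHANGE = 0.00116
CYCLE 13: LARGEST CHANGE = 0.00071
CYCLE 14: LARGEST CHANGE = 0.00066"""


def largest_changes(text):
    """Return (cycle, largest change) pairs found in the log text."""
    pattern = re.compile(r"CYCLE\s+(\d+):\s+LARGEST CHANGE\s*=\s*([0-9.]+)")
    return [(int(c), float(v)) for c, v in pattern.findall(text)]


def first_cycle_below(changes, crit):
    """First cycle whose largest change is under the criterion, else None."""
    for cycle, change in changes:
        if change < crit:
            return cycle
    return None


changes = largest_changes(log_text)
converged_at = first_cycle_below(changes, CRIT)
```

If no cycle falls under the criterion before the CYC limit is reached, the function returns None, which would be a reason to inspect the item statistics or raise the cycle limit.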
Phase three output file (*.PH3)
 Theta estimation
 Scoring of individual respondents
 Required for DTF analyses
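To give a feel for what theta estimation involves, the sketch below finds a maximum likelihood ability estimate by a plain grid search over the 3PL likelihood, using the slope, threshold and asymptote values listed in the parameter file for the first seven items. This is not the algorithm the program uses; the scaling constant `D = 1.7`, the grid limits and the two response patterns are assumptions chosen for illustration.

```python
import math

D = 1.7  # assumed scaling constant (normal-ogive metric)

# (a, b, c) for items 0001-0007, copied from the parameter file.
ITEMS = [
    (1.533393, -0.737439, 0.147203),
    (0.870309, -0.414371, 0.132796),
    (0.743095, -1.983831, 0.197127),
    (1.256263, -0.952323, 0.090901),
    (1.403904, -0.387767, 0.056774),
    (0.777440, -1.147869, 0.173882),
    (1.369223, -0.127368, 0.088135),
]


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def ml_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood estimate of theta on [lo, hi]."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Two hypothetical response patterns to the seven items.
theta_high = ml_theta([1, 1, 1, 1, 1, 1, 0], ITEMS)
theta_low = ml_theta([0, 0, 1, 0, 0, 1, 0], ITEMS)
```

A bounded grid also keeps the estimate finite for all-correct or all-incorrect patterns, where the unrestricted maximum likelihood estimate does not exist.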
Parameter file (specified, *.PAR)
AGREEABLENESS CALIBRATION FOR IRT TUTORIAL.
>COMMENT
1 10 10
Estimates (items 0001 to 0007 shown):
ITEM          INTERCEPT  SLOPE “a”  THRESHOLD “b”  DISPERSION  ASYMPTOTE “c”
0001AGR  111   1.130784   1.533393      -0.737439    0.652148       0.147203
0002AGR  211   0.360630   0.870309      -0.414371    1.149018       0.132796
0003AGR  311   1.474175   0.743095      -1.983831    1.345723       0.197127
0004AGR  411   1.196368   1.256263      -0.952323    0.796012       0.090901
0005AGR  511   0.544388   1.403904      -0.387767    0.712300       0.056774
0006AGR  611   0.892399   0.777440      -1.147869    1.286273       0.173882
0007AGR  711   0.174395   1.369223      -0.127368    0.730341       0.088135
Standard errors, in the same column order:
0001AGR  111   0.101834   0.185726       0.135455    0.078989       0.053688
0002AGR  211   0.087236   0.097709       0.098866    0.129000       0.054461
0003AGR  311   0.108974   0.084487       0.250499    0.153003       0.087578
0004AGR  411   0.087856   0.114710       0.123613    0.072684       0.042937
0005AGR  511   0.071490   0.133486       0.080438    0.067727       0.026086
0006AGR  611   0.093109   0.082096       0.152846    0.135828       0.075829
0007AGR  711   0.083777   0.159712       0.085084    0.085190       0.032376
Format for reading “a”, “b” and “c”: (32X,2F12.6,12X,F12.6)
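As a quick sanity check on how the file is read, the sketch below takes the values the slide labels “a”, “b” and “c” for item 0001AGR and evaluates the 3PL response probability at theta = b, where it should equal (1 + c)/2 whatever scaling constant is used. It also checks that the first and fourth numbers of the same record match -a·b and 1/a, which suggests they are the intercept and the dispersion. The constant `D = 1.7` and the variable names are assumptions for illustration.

```python
import math

D = 1.7  # assumed scaling constant; it does not affect the value at theta = b

# Item 0001AGR: values labelled "a", "b" and "c" on the slide.
a, b, c = 1.533393, -0.737439, 0.147203
# First and fourth numbers in the same record of the file.
first_value, fourth_value = 1.130784, 0.652148


def p3pl(theta, a, b, c, d=D):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


p_at_b = p3pl(b, a, b, c)                      # equals (1 + c) / 2
looks_like_intercept = abs(-a * b - first_value) < 1e-5
looks_like_dispersion = abs(1.0 / a - fourth_value) < 1e-5
```

Both checks come out true for this item, which is consistent with the format statement skipping one 12-character field before “a” and another between “b” and “c”.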
Scoring and covariance files
 Like the *.PAR file, specifically requested
 *.COV - Provides parameters as well as the
variances/covariances between the parameters
 Necessary for DIF analyses
 *.SCO - Provides ability score information for each
respondent