An Introduction to Item Response Theory – IRT

Dalton Andrade (UFSC)
Héliton Tavares (UFPA)
Adriano Borgatto (UFSC)

Overview
Main ideas, concepts and applications of Item Response Theory (IRT) in different areas.

Session 1:
1. Main ideas and concepts of IRT
2. Unidimensional models for dichotomous and polytomous items
3. Estimation methods
4. Construction of latent trait scales

Session 2:
1. Equating methods
2. Differential item functioning – DIF
3. Computerized adaptive testing – CAT
4. Several applications of IRT in many different areas

Why do we need to get measurements?
We have been taking measurements since the first day of our lives!

Temperature (degrees): Celsius / Fahrenheit, C = (F - 32)/1.8
Body weight (mass): kilogram / pound, 1 kg = 2.2046226218 lb
Body height: meter / feet, 1 m = 3.28084 ft
Blood pressure: millimeters of mercury (mm Hg)
etc.

Why do we need to get measurements? What about these characteristics?
Satisfaction
Depression (psychiatry)
Quality of life
Math proficiency/ability (education)
Use of statistics in the workplace
Web usability (e-commerce)
Diagnostic reasoning (nursing)
Resistance to change
Organizational environmental performance
and so on...

They are all examples of what we call a "latent trait": characteristics that cannot be measured or observed directly. We need to build scales/metrics to measure them.

How do we measure a latent trait? To build such a measure we need:
- a measurement instrument: questionnaire, test, ...
- a scale/metric.

Motivation
Is it possible to estimate the body height of a person?
INSTRUCTIONS: Please read the questions below about you and answer 1 for "YES" and 0 for "NO", filling in the green line or using the NO/YES options.
1. In bed, I often suffer from cold feet
2. When walking down the stairs, I take two steps at a time
3. I think I would do well on a basketball team
4. As a police officer, I would not make much of an impression
5. In most cars, I sit uncomfortably
6. I literally look up to most of my friends
7. I am able to pick up an object on top of a cabinet without using stairs
8. I bump my head quite often
9. I can store luggage in the luggage rack of a plane or bus
10. I usually set the car seat back
11. When I get a ride, I am usually offered the front seat
12. For school pictures I was always asked to stand in the last row
13. I have trouble getting comfortable on the bus
14. Among several friends, I would be the one asked to change the light bulbs

Motivation: "Playing" with body height(*)
[Figure: a body-height scale running from 1.50 m to 1.85 m in 0.05 m steps.]
(*) Many thanks to Prof. C. A. W. Glas, University of Twente, Netherlands. ABE - SINAPE 2006.
Positioning respondents and items on the same scale
[Figures: a sequence of diagrams placing items 9, 7, 10, 2 and 4, and then the respondents Dalton, Adriano and Héliton, on a single common scale: first in meters (1.50 m to 1.85 m), then with the measurement units removed, then on a metric running from -4 to 4, and finally on a metric running from 60 to 130. The figures illustrate that items and respondents can be positioned on the same scale and that the origin and unit of that scale are arbitrary.]

Concepts and constructs
Concept: "an abstraction formed by generalization from particulars". Abstractions are hard to define, e.g., intelligence.
Construct: a concept with a scientific purpose (i.e., operationalized); it can be measured and studied, e.g., IQ.

What is item analysis in general?
Item analysis provides a way of measuring the quality of questions: seeing how appropriate they were for the respondents and how well they measured their ability/trait. It also provides a way of re-using items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g., a test bank).

Item analysis approaches:
- Classical Test Theory
- Latent trait models:
  - Item Response Theory: 1PL, 2PL, 3-, 4- and 5-parameter logistic, graded, nominal, multidimensional and unfolding models
  - Rasch models

Types of items
Dichotomous (body height): "In bed, I often suffer from cold feet": Yes/No.
Polytomous ordinal (memory): "Do you forget to give messages?": Never, Rarely, Sometimes, Frequently, Always.
Polytomous nominal (math): a multiple-choice item with five categories A, B, C, D, E, usually treated as Right/Wrong.
One can have more than one type of item in the same questionnaire.

Classical Test Theory – CTT
Classical Test Theory (CTT) is often called the "true score model". It is called classical relative to Item Response Theory (IRT), which is a more modern approach (analog vs. digital). CTT describes a set of psychometric procedures used to assess items and scales: reliability, difficulty, discrimination, etc.

Classical Test Theory vs. latent trait models
Classical analysis has the test (not the item) as its basis. Although the statistics generated are often generalized to similar students taking a similar test, they only really apply to the students taking that test. Latent trait models aim to look beyond that, at the underlying traits that produce the test performance. They are specified at the item level and provide sample-free measurement.

Classical Test Theory (CTT)
CTT assumes that every person has a true score on an item or a scale, if only we could measure it directly without error. CTT analyses assume that a person's test score is composed of their "true" score plus some measurement error. This is the common true score model:

X = T + E

CTT: internal consistency reliability
Coefficient alpha (Cronbach's) can be defined as

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} S_i^2}{S_{Total}^2}\right)

where S_{Total}^2 is the composite variance (of the summed items), S_i^2 is the variance of item i, i = 1, ..., k, and k is the number of items.

Standard error of measurement
The standard error of measurement is the error associated with trying to estimate a true score from a specific test. This error can come from many sources. Its size can be calculated as

S_{meas} = S\sqrt{1 - \alpha}

where S is the standard deviation of the test scores and \alpha is the reliability.
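The two formulas above can be checked with a few lines of R. This is a minimal sketch on simulated data (the data and item difficulties below are invented for illustration; nothing here comes from the examples used later in the deck):

# Coefficient alpha and the standard error of measurement for a simulated 0/1 item matrix.
set.seed(123)
n <- 200; k <- 10
theta <- rnorm(n)                                   # latent trait
b <- seq(-1.5, 1.5, length.out = k)                 # item difficulties
X <- sapply(b, function(bi) rbinom(n, 1, plogis(theta - bi)))   # 0/1 responses

total <- rowSums(X)
alpha <- (k / (k - 1)) * (1 - sum(apply(X, 2, var)) / var(total))
sem   <- sd(total) * sqrt(1 - alpha)                # S * sqrt(1 - alpha)
alpha; sem
# psych::alpha(X) gives the same coefficient; the psych package is one of the
# free tools listed under Computational Aspects at the end of the deck.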
Graphical information: CTT statistics for one multiple-choice item
Key: B. N = 20,438. Difficulty: average ("middle"). Item discrimination: very good.

Option        A        B        C        D
% Total     24.10    53.30    16.10     6.50
% Group 1   28.20    31.10    28.10    12.60
% Group 2   29.10    47.50    17.40     6.00
% Group 3   17.10    75.50     5.50     1.90
r_bis       -0.10     0.40    -0.35    -0.34

[Figure: percentage of respondents choosing each option (A to D) by ability group 1 to 3; the key (B) rises across groups while the distractors fall.]

Item Response Theory – IRT

Item Response Theory (IRT)
IRT refers to a family of latent trait models used to establish the psychometric properties of items and scales. It is sometimes referred to as modern psychometrics because, in large-scale educational assessment, testing programs and professional testing firms, IRT has almost completely replaced CTT as the method of choice. IRT has many advantages over CTT that have brought it into more frequent use.

Item response theory (IRT) is a set of probabilistic models that describes the relationship between a respondent's magnitude on a construct (a.k.a. latent trait; e.g., extraversion, cognitive ability, affective commitment) and his or her probability of giving a particular response to an individual item.

Some other advantages
IRT provides more information than classical test theory. Classical test statistics depend on the set of items and the sample examined; IRT modelling is not dependent on the sample examined. IRT can examine item bias / measurement equivalence and provides conditional standard errors of measurement.

Three basic components of IRT
Item Response Function (IRF): a mathematical function that relates the latent trait to the probability of endorsing an item. IRFs can be plotted as Item Characteristic Curves (ICCs), graphs of the probability of endorsing the item as a function of the respondent's ability.
Item Information Function: an indication of item quality, that is, an item's ability to differentiate among respondents.
Invariance: the position on the latent trait can be estimated from any items with known IRFs, and item characteristics are population independent within a linear transformation.

Item Response Theory models: they depend on the item type
For items scored as right/wrong, Yes/No, etc., a logistic model (unidimensional trait) with 1, 2 or 3 parameters is used. The three-parameter logistic model (3PL) is

P(U_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta_j - b_i)}}

Item Characteristic Curve – ICC
[Figure: ICC of a 3PL item with a = 1.7, b = 0 and c = 0; the probability of a correct response is plotted against ability (latent trait) from about -4 to 4. The parameter b marks the location of the curve, a its steepness and c its lower asymptote.]
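A minimal R sketch of this ICC, using the same illustrative values as the figure (a = 1.7, b = 0, c = 0); these are not estimates from real data:

# The 3PL item response function plotted as an item characteristic curve.
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

theta <- seq(-4, 4, by = 0.01)
plot(theta, p3pl(theta, a = 1.7, b = 0, c = 0), type = "l",
     xlab = "Ability (latent trait)", ylab = "Probability of correct response",
     ylim = c(0, 1))
abline(v = 0, h = 0.5, lty = 2)   # at theta = b the probability is (1 + c)/2, here 0.5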
IRF – item parameters: location (b)
An item's location is defined as the amount of the latent trait needed to have a .5 probability of endorsing the item. The higher the b parameter, the higher on the trait a respondent needs to be in order to endorse the item. It is analogous to difficulty in CTT. Like z-scores, the values of b typically range from about -3 to +3 when the trait is on the (0, 1) scale.

IRF – item parameters: discrimination (a)
The discrimination indicates the steepness of the IRF at the item's location. An item's discrimination indicates how strongly related the item is to the latent trait, like loadings in a factor analysis. Items with high discrimination are better at differentiating respondents around the location point: small changes in the latent trait lead to large changes in probability. The reverse holds for items with low discrimination.

IRF – item parameters: guessing (c)
The inclusion of a c parameter reflects the fact that respondents very low on the trait may still choose the correct answer; in other words, respondents with low trait levels may still have a small probability of endorsing the item. It is mostly used with multiple-choice testing, and its value should not differ much from the reciprocal of the number of choices.

With the two additional parameters described next, the model generalizes to

P(U_{ij} = 1 \mid \theta_j) = c_i + (d_i - c_i)\,\frac{1}{\left[1 + e^{-a_i(\theta_j - b_i)}\right]^{f_i}}

IRF – item parameters: upper asymptote (d)
The inclusion of a d parameter means that respondents very high on the latent trait are not guaranteed (i.e., have a probability of less than 1) to endorse the item. This often applies to items that are difficult to endorse (e.g., suicidal ideation as an indicator of depression). It is not used in most cases.

IRF – item parameters: asymmetry (f)
The inclusion of an f parameter allows the IRF not to be symmetrical around the b parameter. It is not used in most cases.

Effect of the parameters
[Figures: ICCs illustrating the effect of each parameter. A small a gives poor discrimination while a larger a gives better discrimination; a low b gives an "easy" item while a higher b gives a more difficult item (b is inversely related to the CTT p-value); with c = 0 the lower asymptote is at zero, while with c > 0 "low ability" respondents may still endorse the correct response.]

[Figures: ICCs from real data; more calibrated items (some items have problems and should be revised); ICCs from the body-height application.]

Item Response Theory: some other models

IRT: Nominal Response Model (NRM)
Introduced by Bock (1972). Polytomous responses in the NRM are unordered. The model considers all response categories (h = 1, ..., m_i):

P(U_{ijs} = 1 \mid \theta_j) = \frac{\exp[a_{is}(\theta_j - b_{is})]}{\sum_{h=1}^{m_i} \exp[a_{ih}(\theta_j - b_{ih})]}

The interpretation of a_{is} and b_{is} is analogous to that of the logistic model.

[Figure: category response curves of a nominal-model item with a = (-2, -1, 1, 0) and b = (-2, -1, 2, 1); probability of each category P1 to P4 as a function of the latent trait.]

IRT: Graded Response Model (GRM)
Samejima (1969, 1972, 1995). Suitable for Likert-scale items (strongly disagree, disagree, neutral, agree, strongly agree). The GRM considers ordered categories (h = 1, ..., m_i):

P(U_{ijs} = 1 \mid \theta_j) = \frac{1}{1 + \exp[-a_i(\theta_j - b_{is})]} - \frac{1}{1 + \exp[-a_i(\theta_j - b_{i(s+1)})]}, \qquad b_{i1} \le b_{i2} \le \dots \le b_{im_i}

[Figure: category response curves of a graded-response item with a = 1.2 and b = (-2, -1, 1); probability of categories P0 to P3 as a function of the latent trait.]
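A minimal R sketch of these graded-response category curves, using the illustrative values from the figure above (a = 1.2, thresholds b = -2, -1, 1):

# Category probabilities of Samejima's graded response model.
grm_probs <- function(theta, a, b) {
  pstar <- c(1, plogis(a * (theta - b)), 0)   # cumulative probabilities P*(0), ..., P*(m+1)
  -diff(pstar)                                # P(category s) = P*(s) - P*(s+1)
}

theta <- seq(-4, 4, by = 0.1)
probs <- t(sapply(theta, grm_probs, a = 1.2, b = c(-2, -1, 1)))
matplot(theta, probs, type = "l", lty = 1,
        xlab = "Latent trait", ylab = "Probability")   # curves for categories P0 to P3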
Item response models (continued)
- Partial Credit Model (PCM): obtained by fixing a_i = 1.
- Generalized Partial Credit Model (GPCM).
- Rating Scale Model (RSM), Andrich (1978a, 1978b): obtained by restricting b_{is} = b_i - d_s.
- Unfolding models (Roberts, 2000): for non-cumulative latent traits (attitude, behavior, ...).
- Multidimensional models (Reckase): e.g., the Multidimensional Compensatory Three-Parameter Logistic Model (MC3PLM), among others.
These models can be used in one-group or multiple-group analyses.

Item Response Theory: item and test information functions

IRT – Item Information Function
This is the statistical (Fisher) information. Each IRF can be transformed into an item information function (IIF): the precision an item provides at each level of the latent trait. The information is an index representing the item's ability to differentiate among individuals. The error variance of the latent trait estimate is the reciprocal of the information, so more information means less error. Measurement error is expressed on the same metric as the latent trait level, so it can be used to build confidence intervals.

IRT – Item Information Function
The difficulty parameter gives the location of the highest information point, and the discrimination governs the height of the information curve. Large discriminations give tall, narrow IIFs: high precision over a narrow range. Low discriminations give short, wide IIFs: lower precision over a broad range.
[Figure: item information functions of 2PL, 3PL and 4PL versions of an item over trait levels -3 to 3.]

IRT – Test Information Function
Test Information Function (TIF): the IIFs are additive, so we can judge the test as a whole and see at which part of the trait range it is working best.
[Figure: test information function over trait levels -3 to 3.]

Item Response Theory: important assumptions
1. Invariance
2. Dimensionality

IRT – invariance
IRT model parameters have an invariance property: examinee trait estimates do not depend on which items are administered and, in turn, item parameters do not depend on a particular sample of examinees (within a linear transformation). Invariance allows researchers to: 1) efficiently "link" different scales that measure the same construct, 2) compare examinees even if they responded to different items, and 3) implement computerized adaptive testing.

IRT – dimensionality
The models presented make a common assumption of unidimensionality. Hattie (1985) reviewed 30 techniques for assessing it. Some propose the ratio of the 1st to the 2nd eigenvalue (Lord, 1980). Principal axis factoring and scree plots can be used; if the data are dichotomous, factor-analyze tetrachoric correlations. One assumes that a continuum underlies the item responses and looks for a dominant first factor.

Item Response Theory – estimation
1. Items (calibration): marginal maximum likelihood.
2. Abilities: based on the conditional (posterior) ability distribution.

Estimating equations for the item parameters (set equal to zero):

\frac{\partial \log L}{\partial a_i} = D(1 - c_i)\sum_{k=1}^{K}\sum_{j=1}^{s_k} r_{kj}\int_{\mathbb{R}} (u_{kji} - P_i)(\theta - b_i)\,W_i\,g^*_{kj}(\theta)\,d\theta

\frac{\partial \log L}{\partial b_i} = -D\,a_i(1 - c_i)\sum_{k=1}^{K}\sum_{j=1}^{s_k} r_{kj}\int_{\mathbb{R}} (u_{kji} - P_i)\,W_i\,g^*_{kj}(\theta)\,d\theta

\frac{\partial \log L}{\partial c_i} = \sum_{k=1}^{K}\sum_{j=1}^{s_k} r_{kj}\int_{\mathbb{R}} (u_{kji} - P_i)\,\frac{W_i}{P_i^*}\,g^*_{kj}(\theta)\,d\theta

Here r_{kj} is the number of examinees in group k with distinct response pattern j (j = 1, ..., s_k); P_i^* denotes the logistic part of the 3PL and Q_i^* = 1 - P_i^*, Q_i = 1 - P_i; W_i = P_i^* Q_i^*/(P_i Q_i); and g^*_{kj}(\theta) is the posterior distribution of \theta for pattern j in group k. The numerical process is the EM algorithm.

Estimating equation for ability
Ability estimation is based on the ability distribution conditional on the response vector:

g^*_j(\theta) = g(\theta \mid u_j) \propto P(u_j \mid \theta)\, g(\theta)

[Figure: likelihood function for each of 20 individuals plotted over ability values from -3 to 3, together with the N(0, 1) ability distribution.]
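A minimal R sketch of EAP ability estimation for a single response vector, using the posterior g*(theta | u) above with a N(0, 1) prior and simple quadrature. The item parameters are invented for illustration; they are not the SARESP estimates used in the examples that follow:

# EAP ability estimate for one 0/1 response vector under the 3PL and a N(0,1) prior.
p3pl <- function(theta, a, b, cpar) cpar + (1 - cpar) / (1 + exp(-a * (theta - b)))

a    <- c(1.2, 0.8, 1.5, 1.0, 0.9)     # hypothetical discriminations
b    <- c(-1.0, -0.5, 0.0, 0.5, 1.0)   # hypothetical difficulties
cpar <- rep(0.2, 5)                    # hypothetical guessing parameters
u    <- c(1, 1, 1, 0, 0)               # observed responses

nodes <- seq(-4, 4, length.out = 81)   # quadrature points
prior <- dnorm(nodes)                  # N(0,1) ability distribution
like  <- sapply(nodes, function(th) {
  p <- p3pl(th, a, b, cpar)
  prod(p^u * (1 - p)^(1 - u))          # P(u | theta)
})
post   <- like * prior / sum(like * prior)    # normalized posterior g*(theta | u)
eap    <- sum(nodes * post)                   # posterior mean (EAP estimate)
se_eap <- sqrt(sum((nodes - eap)^2 * post))   # posterior standard deviation
eap; se_eap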
Example 1
SARESP 2007, Portuguese language (LP), 3rd grade of high school (morning): Prova-POR-3EM-Manha.pdf.
30 multiple-choice items; 1,001 students (a small sample, just for presentation).
Data: 3EM_Manha.DAT file.
Software: BILOG-MG, 3EM_Manha.BLM (syntax).
Results: 3EM_Manha.PH1, 3EM_Manha.PH2, 3EM_Manha.PH3, 3EM_Manha.PAR, 3EM_Manha.SCO.

Example 2
Lifestyle questionnaire (next page): 15 polytomous ordinal items with four categories: No, Sometimes, Almost always, Always. 580 respondents.
Data: Estilo.dat file.
Software: Multilog, Estilo.MLG (syntax).
Results: Estilo.OUT.

Lifestyle questionnaire items and dimensions:

Dimension: Nutrition (Alimentação)
1. Your daily diet includes at least 5 servings of fruit and vegetables.
2. You avoid eating fatty foods (fatty meat, fried food) and sweets.
3. You eat 5 varied meals a day, including a full breakfast.

Dimension: Physical activity (Atividade física)
4. You do at least 30 minutes of moderate/intense physical activity, continuously or accumulated, on 5 or more days a week.
5. At least twice a week you do exercises involving muscular strength and stretching.
6. In your day-to-day life you walk or cycle as a means of transport and, preferably, use the stairs instead of the elevator.

Dimension: Preventive behavior (Comportamento preventivo)
7. You know your blood pressure and cholesterol levels and try to keep them under control.
8. You do not smoke and do not drink alcohol (or only in moderation).
9. You follow traffic rules (as a pedestrian, cyclist or driver); if you drive, you always wear a seat belt and never drink alcohol.

Dimension: Social relationships (Relacionamento social)
10. You try to cultivate friendships and are satisfied with your relationships.
11. Your leisure includes meeting friends, group sports activities, and participation in associations or social organizations.
12. You try to be active in your community and feel useful in your social environment.

Dimension: Stress management (Controle do estresse)
13. You set aside time (at least 5 minutes) every day to relax.
14. You can hold a discussion without losing your temper, even when contradicted.
15. You balance the time devoted to work with the time devoted to leisure.

Example 3: Body height
Is it possible to estimate the body height of a person? One respondent answered the 14 yes/no items from the motivation example (1 for "YES", 0 for "NO"):

Item:     1  2  3  4  5  6  7  8  9 10 11 12 13 14
Response: 1  1  0  0  1  1  1  0  1  1  1  1  1  1

Score: 11 "yes" answers. Actual height: 184 cm (6'0"). Get it!
[Figure: distribution of body heights, roughly from 145 cm to 185 cm.]
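The examples above are calibrated with BILOG-MG and Multilog; the free R packages listed under Computational Aspects at the end of the deck can fit the same families of models. A minimal sketch, assuming a 0/1 response matrix `resp` (as in Examples 1 and 3) and an ordinal response matrix `lifestyle` (as in Example 2); these object names are placeholders, not the data files above:

library(ltm)

# Dichotomous items: two- and three-parameter logistic models.
fit2pl <- ltm(resp ~ z1)       # 2PL
fit3pl <- tpm(resp)            # 3PL (difficulty, discrimination, guessing)
coef(fit3pl)                   # item parameter estimates
factor.scores(fit3pl)          # ability (latent trait) estimates

# Polytomous ordinal items: Samejima's graded response model.
fitgrm <- grm(lifestyle)
coef(fitgrm)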
Item Response Theory: building the ability scale
1. Positioning items
2. Interpretation

Positioning items
Definition of anchor items. Take two consecutive levels Y and Z of the scale, with Y < Z. We say that an item is an anchor at level Z if and only if:
a) P(X = 1 | theta = Z) >= 0.65,
b) P(X = 1 | theta = Y) < 0.50, and
c) P(X = 1 | theta = Z) - P(X = 1 | theta = Y) >= 0.30.

Positioning items
Back to Example 1: Portuguese, 30 items. See PositioningItems.xlsx.

Interpretation of the scale.

Test equating
Participants who have taken different tests measuring the same construct can be placed on the same scale and compared or scored equivalently. Examples: equating across grades on math ability; equating across years for placement or admissions tests.

Test equating – Example 3: evening students
3rd grade of high school (evening): Prova-POR-3EM-Noite.pdf.
30 multiple-choice items; 1,001 students (a small sample, just for presentation).
Data: 3EM_Noite.DAT file.
Software: BILOG-MG, 3EM_Noite.BLM (syntax).
Results: 3EM_Noite.PH1, 3EM_Noite.PH2, 3EM_Noite.PH3, 3EM_Noite.PAR, 3EM_Noite.SCO.

Test equating
There are five common items between the two tests: items 15 to 19 in both. Invariance principle:

P(U = 1 \mid \theta, a, b, c) = P(U = 1 \mid \theta^*, a^*, b^*, c^*)

with \theta^* = \lambda\theta + \beta, b^* = \lambda b + \beta, a^* = a/\lambda and c^* = c.
See Equal_24_07_Posteriori_ManhaNoite.xls.

Test equating: multiple-group equating
Groups k = 1, 2, ..., K, where group k has mean \mu_k and variance \sigma_k^2:

P(U_{ijk} = 1 \mid \theta_{jk}) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta_{jk} - b_i)}}

For the reference group R we set \mu_R = 0 and \sigma_R^2 = 1.

Test equating – Example 4: Example 1 + Example 3, with K = 2 and R = 1 (\mu_1 = 0 and \sigma_1^2 = 1).
55 (= 30 + 30 - 5) multiple-choice items; 2,002 students.
Data: 3EM_Equat_MxE.DAT file.
Software: BILOG-MG, 3EM_Equat_MxE.BLM (syntax).
Results: 3EM_Equat_MxE.PH1, .PH2, .PH3, .PAR, .SCO.

Test equating – Example 5: National Basic Education Assessment System (SAEB)
5th and 9th grades (fundamental education) and 3rd grade of high school; administered every two years (odd years). In each grade, the number of items needed is much larger than what one student can answer.

How many items do we need to cover a content matrix? One example (SARESP): 13 booklets with 8 items each (104 items). Every examinee takes just 3 booklets.

Booklet:   1    2    3    4    5    6    7    8    9   10   11   12   13
Item 1:   10   93   29   55   67   11  103   21   74   89   18   38   40
Item 2:   79   27   84   28   64    1   48   47  102    8   82   66   90
Item 3:   52  100   45   72   24   75   59   13   32   95   77   19   73
Item 4:   81   76   62   80   63   51   25   33    2   12   37   96   46
Item 5:   68   53   14    4    5   85   30   83   86   39   23  104   54
Item 6:   16   91   61   97   78   69   56   57   41   70   65   31    7
Item 7:   44   87   26   15   99    3   35   71   98   49    6   88   50
Item 8:   22   34   43   94   20   60   92   42    9  101   58   36   17

BIB (Balanced Incomplete Block) design: 26 bundles in total, each made up of 3 booklets.

Bundle  Booklets       Bundle  Booklets
  1     1  2  3         14     1  2  5
  2     2  3  4         15     2  3  6
  3     3  4  5         16     3  4  7
  4     4  5  6         17     4  5  8
  5     5  6  7         18     5  6  9
  6     6  7  8         19     6  7 10
  7     7  8  9         20     7  8 11
  8     8  9 10         21     8  9 12
  9     9 10 11         22     9 10 13
 10    10 11 12         23    10 11  1
 11    11 12 13         24    11 12  2
 12    12 13  1         25    12 13  3
 13    13  1  2         26    13  1  4

Test equating – Example 5: SAEB (continued)
Common items between grades; common items between years; multiple-group models; items already calibrated and new items.
[Figures: SAEB proficiency scales for Portuguese language (LP) and mathematics (MT).]
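In Example 4 the two forms are equated by calibrating both groups together in BILOG-MG. As an illustration of the invariance relation theta* = lambda*theta + beta above, here is a minimal R sketch of one simple alternative, the mean/sigma linking method, which estimates lambda and beta from the difficulties of the common items (items 15 to 19) calibrated separately in each group; the numbers are invented placeholders, not the SARESP estimates:

# Hypothetical difficulty estimates of the 5 common items from two separate calibrations.
b_morning <- c(-0.50, 0.10, 0.45, 0.90, 1.30)   # scale of the morning group
b_evening <- c(-0.10, 0.52, 0.88, 1.35, 1.72)   # scale of the evening group

# Mean/sigma method: choose lambda and beta so that lambda * b_morning + beta
# matches the mean and standard deviation of b_evening.
lambda <- sd(b_evening) / sd(b_morning)
beta   <- mean(b_evening) - lambda * mean(b_morning)

# Put all morning-group results on the evening-group scale:
# theta* = lambda*theta + beta,  b* = lambda*b + beta,  a* = a/lambda,  c* = c.
to_evening_scale <- function(theta, a, b, c)
  list(theta = lambda * theta + beta, a = a / lambda, b = lambda * b + beta, c = c)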
Differential Item Functioning – DIF
How can age groups, genders, cultures, ethnic groups and socioeconomic backgrounds be meaningfully compared? DIF analysis can be a research goal in itself, not just the test of an assumption. Typical uses: testing the equivalence of test items translated into multiple languages; detecting test items influenced by cultural differences; testing intelligence items for gender bias; testing for age differences in responses to personality items.

Post-administration activities
DIF analysis helps to identify whether a test item is accurately reflecting real differences between groups or whether the item itself is producing unfair differences, and to discard items that are shown to be unfair. DIF occurs when individuals with the same score/proficiency respond differently to an item because they belong to different groups. Examples of groups: sex, race, region, EJA/non-EJA (adult education), etc.

Important
We are not saying, for example, that students from the Northeast cannot show a higher proportion of correct answers on a mathematics item than students from the South. What we are saying is that students with the same mathematics proficiency, whether from the Northeast or from the South, should show the same performance on the item.

DIFFERENCES between groups = impact.
DIFFERENCES between matched-for-ability groups = DIF.

DIF by IRT
Uniform DIF: only the b (difficulty) parameter differs between groups.
Non-uniform DIF: both the b (difficulty) and a (discrimination) parameters differ.
[Figures: ICCs for two groups illustrating uniform and non-uniform DIF.]

Computerized Adaptive Testing – CAT
An item (usually of easy to moderate difficulty) is given to the participant, and their answer allows their trait score to be estimated, so that the next item can be chosen to target that trait level. After the second item is answered, the trait score is re-estimated, and so on (a minimal item-selection sketch is given after the application examples below). CA tests are at least twice as efficient as their paper-and-pencil counterparts, with no loss of precision.

Computerized Adaptive Testing – CAT
The implementation of a CAT is not an easy task. It involves different skills in different areas of knowledge and very sensitive issues such as information security, item bank development, choice of estimation methods, criteria for selecting the next item, stopping rules, incorporation of new items, etc.

Implementation and use in Brazil
1. University of São Paulo at São Carlos (USP-SC)
2. Federal University of Santa Catarina (UFSC)
3. Federal University of Pará (UFPA)
4. Cesgranrio Foundation
5. University of Brasília – Cespe/Cebraspe
6. Vunesp Foundation

Applications of IRT in education: Brazilian assessments
ENEM (Exame Nacional do Ensino Médio / National High School Exam)
SAEB (Sistema Nacional de Avaliação da Educação Básica / National Basic Education Assessment System)
ENCCEJA: National Exam for the Certification of Competences of Youths and Adults
ANA: National Literacy Assessment
SARESP, SisPAE, SaePE, ...

Applications of IRT in education: international assessments
PISA: Programme for International Student Assessment – OECD
TIMSS: Trends in International Mathematics and Science Study
TALIS: Teaching and Learning International Survey
(T)ERCE (Unesco): (Third) Regional Comparative Study for Latin America and the Caribbean, "the most important large-scale learning assessment study in the region, covering 15 countries (Argentina, Brazil, Chile, Colombia, Costa Rica, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Dominican Republic and Uruguay) plus the state of Nuevo León (Mexico)."
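A minimal R sketch of the CAT step described above: after each response, re-estimate theta and pick the unused item with maximum Fisher information at that estimate. The item bank is invented for illustration, and a real CAT (as the slides note) would add exposure control, content balancing and stopping rules:

# Choose the next item: the unused item with maximum information at the current theta.
p3pl    <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))
info3pl <- function(theta, a, b, c) {            # standard 3PL item information
  p <- p3pl(theta, a, b, c)
  a^2 * ((1 - p) / p) * ((p - c) / (1 - c))^2
}

set.seed(1)
bank <- data.frame(a = runif(50, 0.8, 2.0),      # hypothetical 50-item bank
                   b = rnorm(50),
                   c = 0.2)

next_item <- function(theta_hat, administered) {
  info <- info3pl(theta_hat, bank$a, bank$b, bank$c)
  info[administered] <- -Inf                     # never repeat an item
  which.max(info)
}
next_item(theta_hat = 0.3, administered = c(12, 5))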
Applications of IRT in other areas: environment and ecology
Almeida, V. L. (2009). Avaliação do Desempenho Ambiental de Estabelecimentos de Saúde, por meio da Teoria da Resposta ao Item, como Incremento da Criação do Conhecimento Organizacional. Tese de Doutorado, PPGEGC/UFSC.
Trierweiller, A. C., Peixe, B. C. S., Tezza, R., Bornia, A. C., Andrade, D. F. and Campos, L. M. S. (2011). Environmental management performance for Brazilian industrials: measuring with the item response theory. Work, 41, 2179-2186.
Afonso, M. H. F. (2013). Mensuração da Predisposição ao Comportamento Sustentável por Meio da Teoria da Resposta ao Item. Dissertação de Mestrado, PPGEP/UFSC.
Trierweiller, A. C., Peixe, B. C. S., Bornia, A. C., Campos, L. M. S. and Tezza, R. (2013). Evidenciation of environmental management: an evaluation with item response theory. Brazilian Journal of Operations & Production Management, 9(2), 91-109.
Peixe, B. C. S. (2014). Mensuração da Maturidade do Sistema de Gestão Ambiental de Empresas Industriais Utilizando a Teoria de Resposta ao Item. Tese de Doutorado, PPGEP/UFSC.

Applications of IRT in other areas: customer satisfaction
Costa, M. B. F. (2001). Técnica derivada da teoria da resposta ao item aplicada ao setor de serviços. Dissertação de Mestrado, PPGMUE/UFPR.
Bortolotti, S. L. V. (2003). Aplicação de um modelo de desdobramento da teoria da resposta ao item – TRI. Dissertação de Mestrado, EPS/UFSC.
Bayley, S. (2001). Measuring customer satisfaction. Evaluation Journal of Australasia, 1(1), 8-16.

Applications of IRT in other areas: total quality management
Alexandre, J. W. C., Andrade, D. F., Vasconcelos, A. P. and Araújo, A. M. S. (2002). Uma proposta de análise de um construto para a medição dos fatores críticos da gestão pela qualidade através da teoria da resposta ao item. Gestão & Produção, 9(2), 129-141.
Bosi, M. A. (2010). Um Estudo sobre o Grau de Maturidade e a Evolução da Gestão pela Qualidade Total no Setor de Transformação Cearense por Meio da Teoria da Resposta ao Item. Dissertação de Mestrado, GES-LOG/UFC.

Applications of IRT in other areas: psychiatry / psychology
Psychiatric scales: Beck Depression Inventory (BDI); depressive symptoms scale (CES-D); sexual addiction screening scale (ERDS).
Schaeffer, N. C. (1988). An application of item response theory to the measurement of depression. Sociological Methodology, 18, 271-307.
Coleman, M. J., Matthysse, S., Levy, D. L., Cook, S., Lo, J. B. Y., Rubin, D. B. and Holzman, P. S. (2002). Spatial and object working memory impairments in schizophrenia patients: a Bayesian item-response theory analysis. Journal of Abnormal Psychology, 111(3), 425-435.
Hays, R., Morales, L. S. and Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38.
Kirisci, L., Hsu, T. C. and Tarter, R. (1994). Fitting a two-parameter logistic item response model to clarify the psychometric properties of the Drug Use Screening Inventory for adolescent alcohol and drug abusers. Alcoholism: Clinical and Experimental Research, 18, 1335-1341.
Langenbucher, J. W., Labouvie, E., Sanjuan, P. M., Bavly, L., Martin, C. S. and Kirisci, L. (2004). An application of item response theory analysis to alcohol, cannabis and cocaine criteria in DSM-IV. Journal of Abnormal Psychology, 113, 72-80.
Yesavage, J. A., Brink, T. L., Rose, T. L. et al. (1983). Development and validation of a geriatric depression screening scale: a preliminary report. Journal of Psychiatric Research, 17, 37-49.
Cúri, M. (2006). Análise de questionários com itens constrangedores. Tese de Doutorado, IME/USP, São Paulo.
Applications of IRT in other areas: organizational leadership
Scherbaum, C. A., Finlinson, S., Barden, K. and Tamanini, K. (2006). Applications of item response theory to measurement issues in leadership research. Leadership Quarterly, 17, 366-386. The paper applies both a cumulative model (GRM) and an unfolding model (GGUM).

Applications of IRT in other areas: attribute importance
Samartini, A. L. S. (2006). Modelos com Variáveis Latentes Aplicados à Mensuração de Importância de Atributos. Tese de Doutorado, Escola de Administração de Empresas de São Paulo.

Applications of IRT in other areas: quality of life
Mesbah, M., Cole, B. F. and Lee, M. L. T., eds. (2002). Statistical Methods for Quality of Life Studies: Design, Measurements and Analysis. Boston: Kluwer Academic Publishers.

Genetics: measuring the predisposition of an individual to a specific disease
Tavares, H. R., Andrade, D. F. and Pereira, C. A. (2004). Detection of determinant genes and diagnostic via item response theory. Genetics and Molecular Biology, 27(4), 679-685.

Applications of IRT in other areas: food insecurity
Wilde, P. E. (Gerald J. and Dorothy R. Friedman School of Nutrition) (2004). Differential response patterns affect food-security prevalence estimates for households with and without children. Journal of Nutrition, 134, 1910-1915.

Physicians' clinical competence
Das, J. and Hammer, J. (2005). Which doctor? Combining vignettes and item response to measure clinical competence. Journal of Development Economics, 78, 348-383.

Applications of IRT in other areas
Tezza, R., Bornia, A. C. and Andrade, D. F. (2011). Measuring web usability using item response theory: principles, features and opportunities. Interacting with Computers, 23, 167-175.
Menegon, L. S. (2013). Mensuração de Conforto e Desconforto em Poltrona de Aeronave pela Teoria da Resposta ao Item. Tese de Doutorado, PPGEP/UFSC.

Applications of IRT in other areas
Laboratório de Custos e Medidas – LCM/EPS/UFSC (www.custosemedidas.ufsc.br). Research line: Item Response Theory applied to organizations, PPGEP/UFSC.

Some "theoretical applications"
Santos, V. L. F., Moura, F. A. S., Andrade, D. F. and Gonçalves, K. C. M. (2016). Multidimensional and longitudinal item response models for non-ignorable data. Computational Statistics and Data Analysis (accepted for publication).
Borgatto, A. F., Azevedo, C. L., Pinheiro, A. and Andrade, D. F. (2015). Comparison of ability estimation methods using IRT for tests with different degrees of difficulty. Communications in Statistics – Simulation and Computation, 44, 474-488.

Computational aspects
Commercial: BILOG-MG, Multilog, IRTPRO, ...
Non-commercial (free): R packages for IRT, such as ltm, irtoys, mirt, catR, mirtCAT and psych.
See https://en.wikipedia.org/wiki/Psychometric_software

References
ANDRADE, D. F., TAVARES, H. R. and VALLE, R. C. (2000). Teoria da Resposta ao Item: conceitos e aplicações. 14º SINAPE, Associação Brasileira de Estatística. (Available at www.inf.ufsc.br/~dandrade/tri)
BAKER, F. B. (1992). Item Response Theory: Parameter Estimation Techniques. Marcel Dekker.
BEATON, A. E. and ALLEN, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191-204.
BOCK, R. D. and ZIMOWSKI, M. F. (1996). Multiple group IRT. In van der Linden, W. J. and Hambleton, R. K. (eds.), Handbook of Modern Item Response Theory. Springer.
EMBRETSON, S. E. and REISE, S. P. (2000). Item Response Theory for Psychologists. New Jersey: Lawrence Erlbaum Associates.
KLEIN, R. (2003). Utilização da Teoria de Resposta ao Item no Sistema Nacional de Avaliação da Educação Básica (SAEB). Ensaio: Avaliação e Políticas Públicas em Educação, Rio de Janeiro, 11(40), 283-296.
LORD, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale: Lawrence Erlbaum Associates.
LORD, F. M. and NOVICK, M. R. (1968). Statistical Theories of Mental Test Scores. Reading: Addison-Wesley.
RECKASE, M. D. (2009). Multidimensional Item Response Theory. New York: Springer.
Sistema Nacional de Avaliação da Educação Básica: SAEB 2001, Relatório Técnico (2002). Consórcio Fundação Cesgranrio / Fundação Carlos Chagas, Rio de Janeiro.

Thank you!!!
Dalton F. Andrade (UFSC/Vunesp) – dalton.andrade@ufsc.br
Héliton R. Tavares (UFPA/Vunesp/UF) – heliton@ufpa.br
Adriano Borgatto (UFSC) – adriano.borgatto@ufsc.br

Applications: scaling individuals for further analysis
We often collect data in multifaceted forms (e.g., multi-item surveys) and then collapse them into a single raw score. IRT-based scores represent an optimal scaling of individuals on the trait. Most sophisticated analyses require at least interval-level measurement, and IRT scores are closer to interval level than raw scores. Using scaled scores rather than raw scores has been shown to reduce spurious results.

Applications: scale construction and modification
The focus is changing from creating fixed-length, paper-and-pencil tests to creating a "universe" of items with known IRFs that can be used interchangeably. Scales are being designed around IRT properties, and pre-existing scales that were developed using CTT are being "revamped" using IRT.

Current trends / research
Use of response times: the same approach works with any other source of collateral information, e.g., physiological measures, confidence marking, etc.
Item cloning for banking: use the computer to generate new items from an item family for administration to the examinee; calibrate families of cloned items rather than each individual item; use hierarchical IRT to allow for (small) random variation in item parameter values.
Multidimensional and/or multivariate approaches.

Applications: computerized adaptive testing (CAT)
CA tests are at least twice as efficient as their paper-and-pencil counterparts, with no loss of precision. CAT is the primary testing approach used by ETS. An adaptive form of the Headache Impact Survey outperformed its paper-and-pencil counterpart in reducing patient burden, tracking change, and in reliability and validity (Ware et al., 2003).

Item Response Theory: estimation with BILOG-MG and R

Before we begin: data preparation
Raw data must be recoded if necessary: negatively worded items must be reverse-coded so that all items in the scale indicate a positive direction. Dichotomization (optional): reducing multiple options into two separate values (0, 1; right, wrong).

Estimating 3PL parameters
BILOG-MG (Scientific Software) works with multiple files in the working directory (ASCII text): BLM, DAT, NPF, PRM, ... The data file must be saved as ASCII text and contains an ID number followed by the individual responses.
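A minimal R sketch of the data-preparation step described above, assuming a data frame `raw` of items scored 1 to 4 in which `item2` is negatively worded; the object and item names are placeholders, not part of the BILOG-MG example that follows:

# Reverse-code negatively worded items and (optionally) dichotomize.
raw <- data.frame(item1 = c(1, 4, 2), item2 = c(4, 1, 3), item3 = c(2, 3, 4))

neg <- c("item2")                  # negatively worded items
raw[neg] <- 5 - raw[neg]           # on a 1-4 scale this maps 1<->4 and 2<->3

dich <- as.data.frame(lapply(raw, function(x) as.integer(x >= 3)))   # optional 0/1 recoding
dich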
BILOG-MG input file (*.BLM)

AGREEABLENESS CALIBRATION FOR IRT TUTORIAL.

>COMMENT
>GLOBAL DFN='AGR2_CAL.DAT', NIDW=4, NPARM=3, OFNAME='OMIT.KEY', SAVE;
>SAVE SCO = 'AGR2_CAL.SCO', PARM = 'AGR2_CAL.PAR', COV = 'AGR2_CAL.COV';
>LENGTH NITEMS=(10);
>INPUT SAMPLE=99999;
(4A1,10A1)
>TEST TNAME=AGR;
>CALIB NQPT=40, CYC=100, NEW=30, CRIT=.001, PLOT=0;
>SCORE MET=2, IDIST=0, RSC=0, NOPRINT;

Annotations, one part of the file at a time:
- Title lines: the first two lines, blank or not, are title lines.
- >GLOBAL: DFN is the data file name, NIDW the number of characters in the ID field, NPARM the number of item parameters (3 = 3PL), and OFNAME the file for missing (omitted) responses.
- >SAVE: requested output files for scoring (SCO), item parameters (PARM) and covariances (COV).
- >LENGTH NITEMS: number of items. >INPUT SAMPLE: sample size.
- (4A1,10A1): FORTRAN statement for reading the data (a 4-character ID followed by 10 one-character item responses).
- >TEST TNAME: name of the scale/measure.
- >CALIB: estimation specifications (not the BILOG-MG defaults).
- >SCORE: scoring specifications: maximum likelihood, no prior distribution of scale scores, no rescaling.
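A minimal R sketch of reading a data file laid out as the format statement (4A1,10A1) above describes, a 4-character ID followed by 10 one-character responses; the file name is the one used in this BILOG-MG example:

# Read the fixed-format data file into an ID column plus a 0/1 response matrix.
dat <- read.fwf("AGR2_CAL.DAT", widths = c(4, rep(1, 10)),
                colClasses = "character",
                col.names = c("id", paste0("i", 1:10)))
resp <- sapply(dat[, -1], as.integer)
head(resp)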
Phase one output file (*.PH1)

CLASSICAL ITEM STATISTICS FOR SUBTEST AGR

                NUMBER   NUMBER                          ITEM*TEST CORRELATION
ITEM  NAME       TRIED    RIGHT   PERCENT  LOGIT/1.7     PEARSON    BISERIAL
  1   0001      1500.0   1158.0     0.772       0.72       0.535       0.742
  2   0002      1500.0    991.0     0.661       0.39       0.421       0.545
  3   0003      1500.0   1354.0     0.903       1.31       0.290       0.500
  4   0004      1500.0   1187.0     0.791       0.78       0.518       0.733
  5   0005      1500.0    970.0     0.647       0.36       0.566       0.728
  6   0006      1500.0   1203.0     0.802       0.82       0.362       0.519
  7   0007      1500.0    875.0     0.583       0.20       0.533       0.674
  8   0008      1500.0    810.0     0.540       0.09       0.473       0.594
  9   0009      1500.0   1022.0     0.681       0.45       0.415       0.542
 10   0010      1500.0    869.0     0.579       0.19       0.426       0.538

These classical statistics can indicate problems in parameter estimation.

Phase two output file (*.PH2)
Check for convergence:

CYCLE 12:  LARGEST CHANGE = 0.00116
           -2 LOG LIKELIHOOD = 15181.4541
CYCLE 13:  LARGEST CHANGE = 0.00071  [FULL NEWTON STEP]
           -2 LOG LIKELIHOOD = 15181.2347
CYCLE 14:  LARGEST CHANGE = 0.00066

Phase three output file (*.PH3)
Theta estimation: the scoring of individual respondents. Required for DTF analyses.

Parameter file (*.PAR, produced when specifically requested)
[File listing: the two title lines followed by one record per item (0001AGR, 0002AGR, ...), each containing the estimated item parameters, with the b, a and c columns annotated, and their standard errors; the records can be read with the FORTRAN format (32X,2F12.6,12X,F12.6).]

Scoring and covariance files
Like the *.PAR file, these are produced only when specifically requested.
*.COV provides the parameter estimates as well as the variances/covariances between the parameters; it is necessary for DIF analyses.
*.SCO provides ability score information for each respondent.
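To connect the calibration output back to the item and test information functions discussed earlier, here is a minimal R sketch that takes a small table of estimated a, b, c parameters (such as those stored in the *.PAR file; the values below are invented placeholders, not the tutorial's estimates) and plots the test information function and the corresponding standard error:

# Standard 3PL item information; the test information is the sum over items.
p3pl    <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))
info3pl <- function(theta, a, b, c) {
  p <- p3pl(theta, a, b, c)
  a^2 * ((1 - p) / p) * ((p - c) / (1 - c))^2
}

pars  <- data.frame(a = c(1.5, 0.9, 1.2),        # hypothetical calibrated items
                    b = c(-0.7, 0.4, 1.1),
                    c = c(0.15, 0.20, 0.10))
theta <- seq(-3, 3, by = 0.05)
tif   <- rowSums(sapply(seq_len(nrow(pars)),
                        function(i) info3pl(theta, pars$a[i], pars$b[i], pars$c[i])))

plot(theta, tif, type = "l", xlab = "Trait level", ylab = "Test information")
plot(theta, 1 / sqrt(tif), type = "l", xlab = "Trait level",
     ylab = "Standard error")    # the SE is largest where the information is lowest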