University of Ostrava, Czech Republic, 26-31 March 2012

Classical Test Theory (CTT): the first half of the 20th century
• Advantages: simplicity of data treatment and of the interpretation of results
• Has a number of essential limitations

Modern Test Theory (Item Response Theory, IRT): the second half of the 20th century
• Makes it possible to overcome the shortcomings and limitations of CTT
• Offers additional opportunities for data analysis and for the use of new technologies

An example of CTT test analysis results:

Parameter                             Value
Total number of examinees             73
Maximum observed score                24
Minimum observed score                5
Average score                         14.0
Standard deviation                    4.0
Average difficulty level              0.36
Maximum difficulty level              0.88
Minimum difficulty level              0.0
Average discrimination index          0.25
Average point-biserial correlation    0.22
Reliability (Cronbach's alpha)        0.60
Standard error of measurement         2.53

Limitations of CTT:
• Examinee scores are test-dependent
• Item characteristics are group-dependent
• The measurement scale is ordinal; no transformation of raw scores improves the scale level
• Methods of assessing reliability require strong assumptions and give distorted results
• The standard error of measurement is the same for all examinees
• Does not provide satisfactory solutions to specific test problems
• Is not appropriate for Computer Adaptive Testing
• Examinee scores and item parameters lie on different scales
• It is not possible to predict item performance

Advantages of IRT:
• Examinee scores are not test-dependent
• Item characteristics are not group-dependent
• The standard error of measurement is estimated separately for each ability estimate and for each item difficulty estimate
• The measurement scale is interval
• Examinee scores and item parameters lie on the same scale
• Allows further analysis of the influence of other factors
• Provides solutions to specific test problems
• The theory of Computer Adaptive Testing is based on IRT

IRT allows comparisons:
• between examinee abilities
• between item difficulties
• between examinee ability and item difficulty
• between examinee ability and a cut score

The task is to transform formal observations of the outcomes of chance events into measurements expressed as continuous variables on an interval scale. Observations can be transformed into measurements only by means of some mathematical model. The main objective of IRT is therefore to develop a mathematical model of the testing process whose parameters are the characteristics of examinees and test items.

The Item Characteristic Function (ICF) is the probability Pi(θ) that an examinee with ability θ gives a correct answer to item i. Different IRT models differ in the mathematical form of the ICF and/or in the number of parameters specified in the model.

The plot of the ICF Pi(θ) is called the Item Characteristic Curve (ICC) of item i. The ICC describes the relationship between the probability of a correct answer and examinee ability. Each item has its own characteristic curve. ICCs of different items differ in their location along the ability axis and in their shape (steepness). These two properties of the ICC correspond to two characteristics of items: item difficulty and item discrimination.

Item difficulty is a location parameter, indicating the position of the ICC in relation to the ability scale. The greater the value of this parameter, the greater the ability required for an examinee to get the item right; hence, the harder the item. Difficult items are located to the right on the ability scale, easy items to the left.

The discrimination parameter is proportional to the slope of the ICC at its middle point: the greater the value of this parameter, the steeper the slope of the ICC and the more useful the item for separating examinees into different ability levels; hence, the greater the discriminating power of the item.
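To make the two item properties concrete, here is a minimal sketch (the logistic form anticipates the models defined later in this talk; the parameter values and the function name are invented for illustration):

    import math

    def icf(theta, difficulty, discrimination):
        """A logistic item characteristic function: P(correct | theta)."""
        return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

    # An easy, gently sloping item vs. a hard, steep item.
    for theta in (-2, -1, 0, 1, 2):
        p_easy = icf(theta, difficulty=-1.0, discrimination=0.5)
        p_hard = icf(theta, difficulty=1.0, discrimination=2.0)
        print(f"theta = {theta:+d}   easy item: {p_easy:.2f}   hard item: {p_hard:.2f}")

Raising the difficulty shifts the whole curve to the right; raising the discrimination steepens it around its middle point.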
Assumptions of IRT:
• Unidimensionality: only one ability is measured by a given set of items.
• Local independence: an examinee's responses to any pair of items are statistically independent.
• The ICF reflects the true relationship between the latent variable (ability level) and the observed variables (item performance).
• Additional assumptions may be made about item characteristics that influence an examinee's performance on an item.

Two lines of development.

The first line: Lord & Novick, "Statistical Theories of Mental Test Scores" (1968); the seminar at ETS (F. Lord, M. Novick, A. Birnbaum, S. Messick, F. Samejima, R. McDonald, W. Meredith); R. Darrell Bock and his collaborators at the University of Chicago (David Thissen, Eiji Muraki, Robert Gibbons, Robert Mislevy): algorithms of parameter estimation and the first computer programs for data treatment under IRT (BILOG, TESTFACT, MULTILOG, PARSCALE).

The second line: G. Rasch, "Probabilistic Models for Some Intelligence and Attainment Tests" (1960); Andersen (1972): effective methods of parameter estimation under Rasch models; B. Wright (University of Chicago) and his followers (David Andrich, Geoffrey Masters, Mark Wilson, Mike Linacre and others): "Best Test Design" (1979, with M.H. Stone), "Rating Scale Analysis" (1982, with G.N. Masters).

Steps of an IRT analysis:
• Choice of the model for measurement
• Parameter estimation
• Fit analysis
• Analysis of test items
• Analysis of examinees
• Analysis of the test (dimensionality, error of measurement, co-functioning of different item types, validity issues, etc.)
• Specific test problems (test equating, detection and measurement of rater effects, standard-setting procedures, DIF analysis, detection of doubtful test results)
• Scale construction and reporting of results

An IRT model is defined by the mathematical equation of its ICF; the main difference between models is the number of item parameters specified in the model. Any model rests on certain assumptions that must be checked. The choice of the model is the responsibility of the researcher and depends on:
• Properties of the test items (dichotomous/polytomous, item type, item characteristics)
• Properties of the test (unidimensional/multidimensional)
• Properties of the model (scale properties, possible solutions of test-specific problems)
• Model-data fit
• Comparison of models (statistical indices)
• Comparison of item difficulty estimates and ability estimates under different models
• Available sample size

Rasch models vs. other IRT models:
• Rasch models: the characteristic curves of items (in the dichotomous case) or of their steps (in the polytomous case) differ only in their location along the ability axis, i.e. they don't cross (they are parallel). Item difficulty is the only item characteristic that determines how an examinee performs on an item.
• Other IRT models: the characteristic curves of items or of their steps are not parallel and cross; every curve has its own slope (discrimination parameters differ). Items differ both in difficulty level and in discrimination.

Advantages of Rasch models:
• The simplest models that provide parameter invariance
• Include the minimal number of parameters
• Parameters have a simple interpretation and can be estimated easily (on an interval scale, with an estimate of precision)
• Can be applied to all item types used in educational and psychological tests
• The theory of item and examinee analysis is well developed
• All specific testing problems can be solved easily

Dichotomous models:
• One-Parameter Logistic Model (1PL, Rasch Model)
• Two-Parameter Logistic Model (2PL)
• Three-Parameter Model (3PL)

Polytomous models:
• Partial Credit Model (PCM; Masters, 1982): an extension of the Rasch model
• Rating Scale Model (RSM; Andrich, 1978): a modification of the PCM for use with questionnaires that have a common rating-scale format (for example, Likert-type scales)
• Graded Response Model (GRM; Samejima, 1969): an extension of the 2PL
• Modified Graded Response Model (MGRM; Muraki, 1992): a modification of the GRM for use with questionnaires that have a common rating-scale format (for example, Likert-type scales)

Unidimensional and multidimensional models:
• All IRT models were initially developed as unidimensional ones.
• Unidimensionality needs to be demonstrated.
• The latent characteristic that determines test results is not necessarily a valid measure of the construct the test was designed for.
• Some instruments are designed with items meant to measure multiple domains of ability.

One attempt vs. multiple attempts at an item:
• Only one attempt is allowed at an item: most IRT models satisfy this condition.
• Multiple attempts at an item: several independent attempts are allowed and the number of successes is counted (e.g. tests of the psychomotor system). Models: the Binomial Model and the Poisson Model (the number of trials is infinite and the probability of success is small).

Other models: the Nominal Response Model (NRM; R. Darrell Bock) and the Many-Facet Rasch Model (J.M. Linacre).

One-Parameter Logistic Model (1PL, Rasch Model):

    P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},

where P(X_{ni} = 1 | θn, δi) is the probability that examinee n (n = 1, …, N) with ability θn answers item i (i = 1, …, I) with difficulty δi correctly.

δi is the point on the ability scale where the probability of a correct response is 0.5. The greater the value of this parameter, the greater the ability required for an examinee to have a 50% chance of getting the item correct; hence, the harder the item. In theory δi can vary from -∞ to +∞, but in practice its values typically lie between about -3 and +3.

Under the 1PL, ICCs differ only in their location along the ability scale; they don't cross (they are parallel). Item difficulty is the only item characteristic that influences examinee performance: all items are equally discriminating. The lower asymptote of the ICC is zero, so examinees of very low ability have zero probability of correctly answering the item (no guessing).
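A minimal numerical sketch of the Rasch ICF and of local independence (the ability and difficulty values are invented for illustration):

    import math

    def rasch_p(theta, delta):
        """Rasch model: probability of a correct response."""
        return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

    print(rasch_p(0.0, 0.0))   # 0.5 exactly: theta equals delta
    print(rasch_p(1.0, 0.0))   # ~0.73: more able examinee, same item
    print(rasch_p(0.0, 1.0))   # ~0.27: same examinee, harder item

    # Local independence: the probability of a whole response pattern
    # is the product over items of P (if correct) or 1 - P (if wrong).
    deltas = [-1.0, 0.0, 1.0]   # three item difficulties
    pattern = [1, 1, 0]         # correct, correct, wrong
    theta = 0.5
    likelihood = 1.0
    for delta, x in zip(deltas, pattern):
        p = rasch_p(theta, delta)
        likelihood *= p if x == 1 else 1.0 - p
    print(likelihood)           # likelihood of the pattern given theta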
Two-Parameter Logistic Model (2PL):

    P(X_{ni} = 1 \mid \theta_n, \delta_i, a_i) = \frac{\exp(a_i(\theta_n - \delta_i))}{1 + \exp(a_i(\theta_n - \delta_i))},

where P(X_{ni} = 1 | θn, δi, ai) is the probability that examinee n (n = 1, …, N) with ability θn correctly answers item i (i = 1, …, I) with difficulty δi and discrimination ai.

The item difficulty parameter δi is defined in the same way as in the 1PL model. The discrimination parameter ai is proportional to the slope of the ICC at the point δi: items with steeper slopes are more useful for separating examinees into different ability levels than items with less steep slopes. In theory ai can vary from -∞ to +∞, but in practice its values typically lie between about 0 and +2. Negatively discriminating items should be deleted: something is wrong with an item if the probability of answering it correctly decreases as examinee ability increases.

Under the 2PL, ICCs are not parallel and can cross: each ICC has its own slope (discrimination values differ), and items differ in difficulty level. The difficulty parameter δi is still the point on the ability scale where an examinee has a 50% chance of getting the item correct. The lower asymptote of the ICC is zero: examinees of very low ability have zero probability of correctly answering the item (no guessing).

Example (four items; see the sketch below):
• item 1: δ1 = -1, a1 = 1
• item 2: δ2 = -1, a2 = 0.5
• item 3: δ3 = 1, a3 = 0.75
• item 4: δ4 = 1, a4 = 2
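Evaluating these four items at a few ability levels shows the crossing ICCs (a sketch; the function name is invented for illustration):

    import math

    def icf_2pl(theta, delta, a):
        """2PL item characteristic function."""
        z = a * (theta - delta)
        return math.exp(z) / (1.0 + math.exp(z))

    items = {1: (-1.0, 1.0), 2: (-1.0, 0.5), 3: (1.0, 0.75), 4: (1.0, 2.0)}
    for theta in (-3, -1, 0, 1, 3):
        row = "   ".join(f"item {k}: {icf_2pl(theta, d, a):.2f}"
                         for k, (d, a) in items.items())
        print(f"theta = {theta:+d}   {row}")

Items 1 and 2 share the difficulty -1 but item 1 is steeper, so their curves cross at θ = -1; likewise items 3 and 4 cross at θ = +1.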
Three-Parameter Logistic Model (3PL):

    P(X_{ni} = 1 \mid \theta_n, \delta_i, a_i, c_i) = c_i + (1 - c_i)\,\frac{\exp(a_i(\theta_n - \delta_i))}{1 + \exp(a_i(\theta_n - \delta_i))}.

The additional parameter ci is often called the guessing parameter. It gives the ICC a nonzero lower asymptote in order to take into account performance at the low end of the ability scale: it is the probability that examinees of low ability answer the item correctly. For the 1PL and 2PL, ci = 0. In theory ci can vary from 0 to 1, but in practice values below 0.35 are typically considered acceptable.

Item difficulty has a different meaning than in the 1PL and 2PL. δi is still the point of inflection of the ICC, but it is no longer the point at which the probability of success is 0.5. It is now the point on the ability scale where the probability of a correct answer equals (1 + ci)/2 = 1/2 + ci/2, the midpoint between ci and 1. The discrimination parameter ai is proportional to the slope of the ICC at the point δi, as before.

Example (four items; see the sketch below):
• item 1: δ1 = -1, a1 = 1.5, c1 = 0.1
• item 2: δ2 = 0, a2 = 2, c2 = 0.05
• item 3: δ3 = 1, a3 = 2, c3 = 0.2
• item 4: δ4 = 1, a4 = 1, c4 = 0
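A quick numerical check of the inflection-point claim for these four items (a sketch; the function name is invented for illustration):

    import math

    def icf_3pl(theta, delta, a, c):
        """3PL item characteristic function with lower asymptote c."""
        z = a * (theta - delta)
        return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

    items = [(-1.0, 1.5, 0.10), (0.0, 2.0, 0.05), (1.0, 2.0, 0.20), (1.0, 1.0, 0.0)]
    for delta, a, c in items:
        print(f"delta = {delta:+.1f}, c = {c:.2f}: "
              f"P(delta) = {icf_3pl(delta, delta, a, c):.3f}, "
              f"(1 + c)/2 = {(1 + c) / 2:.3f}")

At θ = δi the logistic term equals 0.5, so P = ci + (1 - ci)/2 = (1 + ci)/2, as stated above.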
Software for IRT analysis:
• Winsteps (all Rasch models)
• Xcalibre, Multilog, Parscale (all unidimensional IRT models (1PL, 2PL, 3PL) and their polytomous extensions)
• Facets (Many-Facet Rasch Model)
• ConQuest (all Rasch models, including Many-Facet and Multidimensional ones)

Winsteps.
Advantages:
• Simple to use
• A huge number of outputs: item analysis, test analysis (including special kinds of analysis), examinee analysis
• Good graphics
• Good manual
• Not expensive
Disadvantages:
• Works only with Rasch models
• No other disadvantages

Example of a Winsteps control file:

&INST
TITLE = "History-var1"
NAME1 = 1          ; column of start of person information
NAMLEN = 3         ; maximum length of person information
ITEM1 = 4          ; column of first item-level response
NI = 51            ; number of items = test length
XWIDE = 1          ; number of columns per response
PERSON = Student   ; persons are called ...
ITEM = Item        ; items are called ...
GROUPS = 0
MODELS = R
CODES = 012345     ; valid response codes
IREFER = AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABEBEBBEBEBCCCFCCDG
;IVALUEC= 0112     ; rescoring for item type C
;IVALUED= 01122    ; rescoring for item type D
;IVALUEE= 001
;IVALUEG= 01233
;IDFILE=*
;47-71
;*
;IAFILE=*
;24 .75
;25 2.83
;26 3.13
;*
&END

Winsteps outputs:
• Distribution of examinees and items on the same scale (variable map)
• Checking dimensionality
• DIF analysis of items
• Optimizing rating scale and partial credit category effectiveness
• Analysis of examinees (statistics and patterns)
• Analysis of different scales
• An opportunity to fix item difficulties or category difficulties
• Graphics

[Figure: an item that fits the model, a "good" item]

Xcalibre.
Advantages:
• Handy and intuitively clear interface
• Works both with Rasch models and other IRT models (dichotomous and polytomous)
• Good item analysis
• Allows choosing the method of parameter estimation
• Performs DIF analysis
• Produces a report automatically
Disadvantages:
• Intended for item analysis only
• Analysis of examinees is not conducted
• No special kinds of analysis (variable map, dimensionality investigation, etc.)

[Figure: the automatically produced report "IRT Item Parameter Calibration Report, MATPO", created on 12.07.2011 with Xcalibre 4.1: IRT Item Parameter Estimation Software, © 2011 Assessment Systems Corporation]

[Figure: example of a dichotomous item]

Parscale (Scientific Software International, Inc.).
Advantages:
• Works both with Rasch models and other IRT models (dichotomous and polytomous)
• Allows choosing the method of parameter estimation
• Allows choosing the best model
• Attractive graphics (ICCs of all items are shown in the same picture)
Disadvantages:
• Old and not handy; it is hard to create a control file
• Analysis of examinees is not conducted
• No special kinds of analysis (variable map, dimensionality investigation, DIF analysis, etc.)

[Figure: dichotomous items, Rasch model]

ConQuest.
Advantages:
• Works with a big number of models, including Many-Facet and Multidimensional (Rasch) ones
• Reports a confidence interval for fit statistics
• Good item analysis
• Creates a variable map
• Many outputs
Disadvantages:
• Requires creation of a control file
• Requires special knowledge to interpret the outputs of complex analyses
• Does not work with the 2PL and 3PL models or their polytomous extensions

[Figure: ConQuest dialog window]

Example of a ConQuest control file, comparing two raters who evaluated the same set of examinees (a minimal numerical sketch of this kind of model closes the section):

Title Rater Effects Model One;
datafile CQdata.dat;
format id 1-4 rater 6-8 responses 10-18 rater 20-22 responses 23-31 ! criteria (9);
codes 0,1,2,3;
score (0,1,2,3) (0,1,2,3);
labels << Items.lab;
model rater + criteria + criteria*step;
estimate;
show ! estimates=latent >> rater1.shw;
itanal >> rater1.itn;
show cases ! estimates=mle >> students1.shw;
Plot icc;
Plot expected;

The choice of software depends on:
• The model used
• The required analysis
• Software availability
• Personal preferences
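To illustrate what the statement "model rater + criteria + criteria*step;" estimates, here is a minimal sketch of a many-facet, partial-credit style probability computation (the parameterization, the function name and all values are assumptions made for illustration; real parameter estimates come from ConQuest or Facets):

    import math

    def category_probs(theta, rater_severity, criterion_difficulty, steps):
        """Probabilities of score categories 0..K when the log-odds of
        category k versus k-1 is theta - rater - criterion - steps[k-1]."""
        logits = [0.0]
        for tau in steps:
            logits.append(logits[-1] + theta - rater_severity
                          - criterion_difficulty - tau)
        exps = [math.exp(v) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    steps = [-1.0, 0.0, 1.0]   # step parameters for categories 1, 2, 3
    lenient = category_probs(0.5, -0.5, 0.0, steps)
    severe = category_probs(0.5, 0.5, 0.0, steps)
    print([round(p, 2) for p in lenient])   # mass shifted to higher categories
    print([round(p, 2) for p in severe])    # mass shifted to lower categories

A more severe rater lowers every log-odds by the same amount, which pushes the expected score down for the same examinee: exactly the rater effect the control file above is set up to detect.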