University of Ostrava, Czech Republic, 26-31 March 2012

Situations that call for test equating:
• Different forms of a test
• Item banking
• Achievement monitoring

Classical Test Theory:
• Equating is applied only to different test forms
• It is often ignored (concept of parallel test forms)
• It establishes equivalent scores on different test forms
• It does not create a common scale

Item Response Theory:
• Satisfies all equating needs
• Places all estimates of item and examinee parameters on a common scale

Test equating is a special procedure that establishes the relation between examinee scores on different test forms and places them on the same scale. As a result, a measure based on responses to one test can be matched to a measure based on responses to another test, and the conclusions drawn about an examinee are identical regardless of the test form that produced the measure. Equating of different test forms is called horizontal equating.

• Purpose: comparison of student achievement at different grade levels
• Test forms are designed to differ in difficulty
• Measures from different tests should be placed on the same linear continuum
This test equating procedure is called vertical equating.

Item bank: a set of items from which test forms that create equivalent measures may be constructed. An item bank is composed of test items that have been placed on a common scale, so that different subsets of these items produce interchangeable measures for an examinee. Once an item bank exists, no further equating is needed.

Test equating and item banking:
• Both are designed to place estimated parameters on a common scale
• In test equating the goal is to place person measures from multiple test forms on the same scale
• In item banking the goal is to place item calibrations on the same scale
• The procedures are nearly identical when Rasch measurement is used

Equating: a procedure that ensures that the examinee measures obtained from different subsets of items are interchangeable.
When two tests are equated, the resulting measures are placed on the same scale.
Scaling: a procedure that associates numbers with the performance of examinees. Tests can be scaled identically without having been equated.

Equating in Classical Test Theory:
• Applies only to comparing examinee test scores on two different test forms
• The problem can be ignored (by introducing "parallel" test forms)
• Implies only establishing a relation between test scores on different test forms
• Does not imply creation of a common scale

CTT equating methods:
• Linear equating
• Equipercentile equating

Linear equating is based on equating the standard score on test X to the standard score on test Y:
$$\frac{x - \mu_x}{\sigma_x} = \frac{y - \mu_y}{\sigma_y}$$
Thus,
$$y = A x + B, \quad \text{where } A = \frac{\sigma_y}{\sigma_x}, \quad B = \mu_y - \frac{\sigma_y}{\sigma_x}\,\mu_x$$
Here $\mu_x$, $\mu_y$ are the mean scores and $\sigma_x$, $\sigma_y$ the standard deviations of scores on tests X and Y.

Equipercentile equating: scores on tests X and Y are considered equivalent if their respective percentile ranks in any given group are equal.

• Both methods require assumptions about the identity of the test score distributions and the equivalence of the examinee groups
• Equating in CTT does not imply creation of a common scale

Conditions for equating:
• Measuring the same trait: tests of different content cannot be equated (but can be scaled in a similar manner)
• Invariance of equating results across samples of examinees
• Independence of equating results from the choice of reference test

• Method of common items: linkage between two test forms is accomplished by means of a set of items common to both test forms
• Method of common persons: linkage between two test forms is accomplished by means of a set of persons who respond to both test forms
• Combined methods: linkage between two test forms is accomplished by means of common items and/or common persons plus common raters

Method of common items:
• Internal anchor: each test form has one set of items that is shared with other forms and another set of items that is unique to this form
• External anchor: each test form has an additional set of items that are not part of these test forms

Method of common persons: all examinees respond to both test forms.
There are two approaches to this design:
- same group / same time
- same group / different time
Linkage between the two test forms is accomplished by means of a set of examinees who respond to all items.

• Selecting an equating method
• Parameter estimation
• Transformation of parameters from different test forms to the same scale
• Evaluating the quality of the links between test forms

Simultaneous calibration: all parameters are estimated simultaneously in one run of the estimation software. The data are automatically placed on the same scale.
Separate calibration: parameters are estimated for each test form separately, that is, the data are calibrated in multiple runs of the estimation software. Separate calibration may be more difficult to accomplish because the test developer needs to transform the measures to a common scale.

• Separate calibration of all test forms, followed by transformation of the measures to the common scale
• Simultaneous calibration of all test forms, placing all measures on the common scale
• Separate calibration of all test forms with the difficulty values of the common items anchored, followed by placement of all parameters on the common scale

Separate calibration:
• As a rule this procedure is used with the method of common items, which are called nodal items in this case
• Each test form is calibrated separately. As a result, all estimates for each test form lie on that form's own scale
• The scales differ only in their origins
• This difference can be removed by calculating a location shift
• It is desirable to have at least 15-20 % nodal items (some of them can be deleted from the link later)
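The claim that separately calibrated scales differ only in their origins can be illustrated with a minimal sketch. It assumes the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between ability and difficulty. The function name rasch_probability and all numeric values are illustrative and do not come from the slides.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


theta, delta = 0.5, -0.3  # illustrative ability and difficulty, in logits
shift = 1.2               # arbitrary change of the scale origin

p_original = rasch_probability(theta, delta)
p_shifted = rasch_probability(theta + shift, delta + shift)

# Adding the same constant to ability and difficulty leaves the model
# probability unchanged, so each calibration is free to pick its own origin.
print(abs(p_original - p_shifted) < 1e-12)
```

Because only the difference between ability and difficulty matters, two calibrations of the same items can disagree by a constant, and that constant is what the location shift removes.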
• Choice of a common scale
• Selection of nodal items
• Calibration of all test forms
• Calculating equating constants
• Link quality evaluation
• Transforming all parameters onto a common scale

Shift constant:
$$t_{12} = \frac{1}{l}\sum_{i=1}^{l}\left(\delta_{i2} - \delta_{i1}\right)$$
where $t_{12}$ is the shift constant from test form 1 to test form 2; $\delta_{i1}$ is the difficulty estimate of item $i$ in test form 1; $\delta_{i2}$ is the difficulty estimate of item $i$ in test form 2; $l$ is the number of common items.
Sometimes other formulas are applied (weighted mean, dispersion shift, etc.).

Transformation of item difficulties:
$$\delta_{i1}' = \delta_{i1} + t_{12}, \quad i = 1, \dots, k$$
where $\delta_{i1}$ is the difficulty estimate for item $i$ in test form 1; $\delta_{i1}'$ is the difficulty estimate for the same item on the scale of test form 2; $k$ is the total number of test items.

Transformation of examinee abilities:
$$\theta_{n1}' = \theta_{n1} + t_{12}, \quad n = 1, \dots, N$$
where $\theta_{n1}$ is the ability estimate for examinee $n$ who responded to the items of test form 1; $\theta_{n1}'$ is the ability estimate for the same examinee on the scale of test form 2; $N$ is the total number of examinees who responded to the items of test form 1.
Parameter estimates of test form 1 shifted in this way are placed on the scale of test form 2.

Link quality evaluation:
• Item-within-link (fit analysis of linking items)
• Item-between-link (stability of the item calibrations between two test forms)

$$U_i = \frac{\delta_{i1}' - \delta_{i2}}{\sigma_{i12}}, \quad \sigma_{i12}^2 = \sigma_{i1}^2 + \sigma_{i2}^2$$
where $\sigma_{i1}$, $\sigma_{i2}$ are the standard errors of the difficulty estimate of item $i$ from the calibration of test forms 1 and 2; $\delta_{i1}'$ is the difficulty estimate of item $i$ from test form 1 shifted onto the scale of test form 2; $\delta_{i2}$ is the difficulty estimate of the same item from the calibration of test form 2; $U_i \sim N(0,1)$.

Simultaneous calibration:
• All parameters of all test forms are estimated simultaneously
• It is the simplest approach to equating test forms or calibrating an item bank, because it requires no subsequent transformation of the estimated measures or calibrations
• The data are automatically placed on the same scale in one run of the estimation software

Anchoring:
• As a rule this procedure is used with the method of common items, which are called anchor items in this case
• Common items are estimated once, during calibration of the first test form
• During calibration of another test form the calibration values of these items are treated as fixed (known) and are not estimated. As a result, the remaining parameter estimates are forced onto the same scale as the anchor items
• It is easy to anchor items in most estimation software

    IAFILE=*
    2 -0.29
    4 -1.06
    8 -0.49
    11 -0.04
    17 -0.28
    37 -2.20
    38 -1.34
    *

The numbers of the anchor items and their difficulties are specified. These difficulty values are fixed and are not estimated during calibration of the new test form.

• Choice of a common scale
• Selection of anchor items
• Calibration of the test form whose scale is accepted as the common scale
• Sequential calibration of the other test forms with the difficulty values of the anchor items fixed
• Item-within-link fit (fit analysis of linking items)

If different equating procedures are used, the resulting scales will be different and cannot be compared directly. This is because the origin is selected in a different way in each procedure. All three procedures are analyzed in the literature (for example, Smith R.M. Applications of Rasch Measurement. Chicago: MESA Press, 1992). The precision of the estimated examinee and item parameters is approximately the same, and the correlation between measures is high.
Example:
• Each test form has 26 dichotomous items
• Both test forms have 6 common items: No. 4, 6, 7, 14, 20, 24 (23 % of the total number of items). The link table and the anchor file below list five of them (items 4, 6, 7, 14, 20)
• The total number of examinees is 654 for test form 1 and 661 for test form 2
• Winsteps software was used for test calibration
• The means of the examinee measures are -1.07 and -0.72 logits for test forms 1 and 2, respectively
• The first test form scale was chosen as the common scale

Separate calibration:

    Item     Test form 1              Test form 2
    number   Difficulty   Standard    Difficulty   Standard    Shifted difficulty   ui
             estimate δi  error σi    estimate δi  error σi    estimate δi'
    4        -1.39        0.09        -1.07        0.09        -1.368               -0.17
    6        -0.93        0.10        -0.54        0.09        -0.838                0.69
    7        -2.57        0.10        -1.99        0.10        -2.288                2.0
    14       -0.44        0.10        -0.32        0.09        -0.618               -1.33
    20        0.88        0.12         0.96        0.11         0.662               -1.34
    Sum      -4.45                    -2.96                    -4.45
    Mean     -0.89                    -0.592                   -0.89

Shift constant t12 = -0.298. It is the difference of the common-item means, -0.89 - (-0.592), and it is added to the test form 2 estimates because the test form 1 scale is the common scale.

Simultaneous calibration:
• It implies creation of a common response matrix for both test forms containing 1315 examinees and 46 different items
• Measures of all examinees and difficulty values of all items are placed on a common scale that is centered at the mean difficulty of all 46 items

Anchoring:
• Calibration of test form 1
• Calibration of test form 2 with the difficulty values of the anchor items fixed at the values from the first calibration

    IAFILE=*
    4 -1.39
    6 -0.93
    7 -2.57
    14 -0.44
    20 0.88
    *

As a result, the examinee measures from both test forms are on the scale of the first test form.

• Comparison of examinee measures from the three equating procedures revealed approximately similar results: the correlation is close to 1
• The choice of equating procedure is determined by the real data design and the purpose of the research
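The separate-calibration step of this example can be retraced with a short sketch. It uses only the common-item estimates and standard errors listed in the example, takes the test form 1 scale as the common scale, and computes the shift constant, the shifted test form 2 difficulties and a standardized item-between-link difference. The variable names are illustrative, and the subtraction order in the standardized difference (shifted test form 2 estimate minus test form 1 estimate) is an assumption, so the signs may differ from the ui values reported in the slides.

```python
import math

# Common-item estimates copied from the example (logits).
items = [4, 6, 7, 14, 20]
delta_form1 = [-1.39, -0.93, -2.57, -0.44, 0.88]
se_form1 = [0.09, 0.10, 0.10, 0.10, 0.12]
delta_form2 = [-1.07, -0.54, -1.99, -0.32, 0.96]
se_form2 = [0.09, 0.09, 0.10, 0.09, 0.11]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Form 1 is the common scale, so the shift is added to the form 2 estimates.
shift = mean(delta_form1) - mean(delta_form2)
shifted_form2 = [d + shift for d in delta_form2]

# Standardized difference between the shifted form 2 estimate and the
# form 1 estimate; the subtraction order is an assumption of this sketch.
u = [
    (d2_shifted - d1) / math.sqrt(s1 ** 2 + s2 ** 2)
    for d1, d2_shifted, s1, s2 in zip(delta_form1, shifted_form2, se_form1, se_form2)
]

print(f"shift constant: {shift:.3f}")
for item, d, stat in zip(items, shifted_form2, u):
    print(f"item {item}: shifted difficulty {d:.3f}, u = {stat:.2f}")
```

After the shift, the common items have the same mean difficulty on both forms, which is what fixing the origin amounts to. Items with a large absolute standardized difference are candidates for removal from the link.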