Thanks to the organizers of the Assessment Conference, Hong Kong, for giving me the chance to make this presentation.
January 15-16, 2013, Hong Kong SAR, China

Rasch Model in China: Retrospect and Status Quo
by Prof. Zhang Quan, Ph.D
College of Foreign Studies, Jiaxing University, Zhejiang, China

I. Rasch Model, 20 years ago
• As early as the 1980s, the ideas and concepts of the Rasch Model and IRT were first introduced into China by Prof. Gui Shichun, my Ph.D supervisor. It was Prof. Gui who first conducted, with great success, the ten-year (1990-1999) Equating Project for the Matriculation English Test (MET) in China. MET is the most influential and competitive entrance examination for higher education, administered annually to over 3.3 million candidates at that time. The Equating Project won recognition from Charles Alderson and other foreign counterparts during the 1990s. Academically, those were the Good Old Days for Chinese testing experts and psychometricians. Then, for certain reasons, the equating practice was abruptly discontinued. As a result, in China today the application of the Rasch Model, or of IRT-based software such as BILOG, PARSCALE, Winsteps and others, to real testing problems is confined to a small circle of people.
• Rasch was used to equate MET (Matriculation English Test), the most influential and competitive entrance examination (about 20% of candidates can be enrolled), administered annually to approximately 3.3 million candidates (from 1990 on), a number that increased in the following years.
• Features of MET
• 1. Compulsory: all Chinese middle school students must take it if they plan to study in a college or a university.
• 2. High-stake: passing or failing may decide the rest of one's life.
• 3. Unified and at national level: one and the same test paper is used across mainland China.
• 4. Test format: mainly multiple choice questions plus a small portion of writing.
• 5. Family-bound: passing MET and being admitted to a university is the chief concern and expectation that all parents in China have for their children.
• 6. Equating via anchor items was done annually from 1990 to 1999. (Test scores, after conversion, are comparable on the same scale across China.) MET was the only large-scale test for which equating was conducted with real data and whose rescaled scores were used for recruitment.
• 7. Moderating of test items was based on item analysis.

II. Rasch Model and MET equating, 20 years ago
• One thing worth mentioning is that equating via the Rasch Model was done very successfully in the Chinese situation, which is unique in a number of ways. (The presenter was one of the key members of the equating group headed by Prof. Gui from 1990 to 1999.)
• 1. Because education has developed unevenly and the number of candidates is very large, the population is actually heterogeneous, even though the candidates are all senior middle school graduates. It is difficult to set an unbiased test, let alone to equate two parallel test forms administered on different occasions.
• 2. Although the test papers were set centrally, there was as yet no way to score them centrally. The general practice was for each province to score its own papers and to work out its own norm for recruitment. University authorities were therefore confronted with the problem of selecting candidates whose scores had been graded according to different criteria set by different provinces.
• 3. In China, there is no feasible way to protect test security once a test has been administered. Nor is it possible to reuse common items in different forms, or to conduct any pre-test for future use.
• To find feasible solutions to such problems, we established an anchorage, i.e. three sampling bases (middle schools), to monitor the performance of the candidates. We designed an equivalent test form (35+65=100 items) and had it administered to candidates three days before they took MET. The equivalent test form was used repeatedly for 10 years (1990-1999).
• In doing so, we could not only observe but also compare the performance of candidates taking MET in different years.
• Hypothesis: there will be no big change in the general mean M (English proficiency) within one year. If the means change, the change must be associated with a change in the difficulty level of the test forms across the two years.
• We then came to realize that such a hypothesis is by no means perfect, for at least three reasons:
• First, the sample size. We ran the risk of test leakage. The sample must be big enough to be representative; however, the larger the sample, the greater the danger of test exposure.
• Next, the general level of the population is not likely to remain unchanged. Instead, it may fluctuate, and insignificant changes may accumulate into significant ones (Gui Shichun: 1990).
• Finally, if the difficulty level of the test forms does change, it would not be acceptable simply to make linear adjustments to individual test scores for the difference between the test forms.
• It was on the basis of this hypothesis that the anchor-test-random-groups design was put forward and conducted.

2.1. Anchor-test-random-groups design
[Diagram: Test takers A take Test A and Test takers B take Test B; both groups also take the equivalent test of 35 linking items + 65 items.]
The equivalent test was taken externally, three days before MET was actually administered.

2.2. Anchor-test-random-groups design: summarized (1)
• 1. Sampling
• 2. Administration
• 3. A chi-square test (Wright, 1979) of the 35 linking items was applied so as to delete the inappropriate items (1989: 28 items; 1990-1991: 27 items).
• 4. Equating test forms. The test results of 1988 (the year when MET was first administered across China) were used as the basal reference. With the anchor test and the Rasch Model (GITEST), all the following test forms were equated (calibrated and rescaled) (Wright, 1979).

2.2. Anchor-test-random-groups design: summarized (2)
• 5. Ability estimation
• With the Rasch Model, ability estimation is straightforward. To obtain the maximum likelihood estimate of theta (θ), we used the Newton-Raphson procedure (Hambleton, 1985). The ability values are then converted into probabilities for those who know nothing about Rasch.
• As the model is sample-free, we could make use of the derived data to obtain adjusted scores for the population.

2.2. Anchor-test-random-groups design: summarized (3)
• Why the Rasch Model and not the two- or three-parameter models?
• 1. Feasible implementation
• Once the item parameters are calibrated, the ability parameters can be estimated easily.
• A typical example: a candidate with a raw score of 60 correct answers out of 85 test items is assigned the same ability value regardless of which 60 items were answered correctly. With the two- or three-parameter models, the procedure gets complicated. The estimate depends heavily on the discrimination and the so-called 'guessing' parameters. Two or more candidates with a raw score of 60 out of 85 will therefore be assigned different ability values, because the combination of 60 correct answers varies from person to person. Imagine: the number of possible combinations for raw scores from 1 to 84 is astronomical!
• With those models it was impossible to use the sampled data to predict the population performance.
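The ability estimation in step 5 and the raw-score example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GITEST implementation: the five item difficulties and the two response patterns are hypothetical, and the item parameters are taken as already calibrated. The sketch estimates θ by Newton-Raphson and shows that two candidates with the same raw score but different response patterns receive the same Rasch ability value.

```python
import math

def rasch_p(theta, b):
    """Rasch probability that a person of ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties, tol=1e-6, max_iter=50):
    """Maximum likelihood ability estimate for one candidate, by Newton-Raphson."""
    raw = sum(responses)
    n = len(responses)
    if raw == 0 or raw == n:
        raise ValueError("no finite ML estimate for a zero or perfect score")
    theta = math.log(raw / (n - raw))  # starting value from the raw score
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        gradient = raw - sum(probs)                    # first derivative of the log-likelihood
        information = sum(p * (1 - p) for p in probs)  # minus the second derivative
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical calibrated difficulties (in logits) for a five-item test.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Two hypothetical candidates: same raw score (3), different items correct.
pattern_a = [1, 1, 1, 0, 0]
pattern_b = [0, 1, 0, 1, 1]

theta_a = estimate_theta(pattern_a, difficulties)
theta_b = estimate_theta(pattern_b, difficulties)
print("theta for pattern A:", round(theta_a, 3))
print("theta for pattern B:", round(theta_b, 3))

# One way to show a logit value as a probability: the chance of a correct
# answer from a candidate of ability 0 on an item of the given difficulty.
print("P(correct | theta=0, b=0.5):", round(rasch_p(0.0, 0.5), 2))
```

The likelihood equation uses the responses only through their sum, which is why the pattern does not matter under the Rasch Model. Under a two- or three-parameter model the gradient would weight each response by its item's discrimination, so the two patterns would generally yield different estimates.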
With the two- and three-parameter models, the iteration very often failed to converge, mainly because of two big problems: the limits of the computer configuration, and a mass of data too large to process within two weeks.

2.2. Anchor-test-random-groups design: summarized (4)
• 2. Model-data fit
• With the Rasch Model, item fit and ability fit can be computed (Wright, 1982) and demonstrate the goodness-of-fit of the model.
• ... ...

2.3. Item Difficulty of MET 1988-1992

              MET88          MET89          MET90          MET91          MET92
Phonetics     -0.860 (0.70)  -0.69 (0.48)   0.793 (0.31)   -0.186 (0.55)  0.992 (0.28)
Grammar       0.228 (0.44)   -0.372 (0.59)  0.471 (0.38)   0.500 (0.38)   0.801 (0.31)
BLK-Filling   -0.367 (0.59)  0.271 (0.43)   0.871 (0.30)   0.845 (0.30)   0.609 (0.35)
Reading       -0.330 (0.58)  -0.581 (0.64)  0.600 (0.35)   -0.179 (0.54)  -0.202 (0.55)
Means         -0.206 (0.55)  -0.180 (0.54)  0.657 (0.34)   0.361 (0.41)   0.523 (0.37)

For better illustration, the numbers in brackets are probabilities converted from the difficulties. As the table shows, there is no big difference between MET88 and MET89; MET90, however, turned out to be more difficult.

2.4. Ability (θ) of MET 1988-1992

              MET88     MET89     MET90     MET91     MET92
Total N       136543    117085    128543    136047    133965
θ Means       40.0      44.4      53.7      50.0      54.2
%             47.0      52.2      63.2      58.8      63.8
SD            17.9      16.1      13.5      missing   15.2

The θ means in the table above are the rescaled average ability parameters for the MC part only (full score 85; 85 MC + 15 writing = 100).

III. MET and Rasch Model: Status Quo
• MET remains the most influential and competitive entrance examination (about 20% of candidates can be enrolled), administered annually to approximately 3.3 million candidates (from 1990 on), a number that increased in the following years.
• Features of MET remain:
• 1. Compulsory: all Chinese middle school students must take it if they plan to study in a college or a university.
• 2. High-stake: passing or failing may decide the rest of one's life.
• 3. Unified and at national level: one and the same test paper is used across China.
• 4. Mainly multiple choice questions plus a small portion of writing.
• 5. No equating is done annually. Statistically, test scores are not comparable.
• Current practice: Disbanded. Traditional test item writing has resumed, with scoring at provincial level, reporting in raw scores, no pre-test, no item analysis (Rasch or IRT) and no equating. Problems of test item writing and moderating. Each province or region uses its own test paper. No norm established.
• MET remains the most influential and competitive entrance examination for higher education in China. The number of candidates has been increasing. It reached 9.5 million in 2006, and the graph shows the numbers of candidates taking MET nationwide in recent years (2006-2010).
• The average number of candidates in Zhejiang Province, where our university is located, is 300,000 (360,000 in 2012). According to the latest official report, the candidates taking MET in Jiaxing in 2012 were as follows: 7688 in humanities, 1493 in arts and Chinese, 176 in sports, 12991 in science, 443 in arts and science, and 237 in sports and science.
• The number of students taking MET is decreasing annually.
• MET features (continued)
• According to the most recent source, MET in China will be administered separately from the other entrance examinations, and more than once a year, so that students have more chances to take it. From a professional point of view, such a practice needs equating.
• Updated IRT-based computer software

IV. College English Test (CET)
• CET is another highly influential examination, administered twice a year to non-English-major students, over 10 million of them in recent years.
• Features of CET
• 1. From compulsory to optional: not all non-English-major undergraduates have to take it.
• 2. From high-stake to not very high-stake: passing or failing may make no difference to whether a student gets the diploma.
• 3. Unified and at national level: one and the same test paper is used across China.
• 4. Mainly multiple choice questions plus a small portion of writing.
• 5. Equating has been done annually since 1990 (with a team of qualified test item writers).

The first Rasch-based computer software, developed by Prof. Gui in the 1990s.

Test Paper Report by GITEST
Mean: the mean score of all examinees;
SD: the standard deviation over all examinees;
Varn.: the variance over all examinees;
P+: probability of correct answers;
Pd value: difficulty parameter based on probability;
R11: reliability by Kuder-Richardson 20; this value should be over 0.9;
α value: reliability parameter, also called the alpha value, by the Cronbach formula; this value should be over 0.8;
Rbis: discrimination index (bi-serial);
Skewness: shape of the score distribution. 0 indicates a normal distribution; above 0 indicates positive skewness, showing that the test items are more difficult; below 0 indicates negative skewness, showing that the test items are easier;
Kurtosis: height of the score distribution. 0 indicates normal; above 0 means "narrower", i.e. a small range between the scores; below 0 means "flat", i.e. a big range between the scores;
Difficulty: VD (<0.1), D (0.1-0.3), I (0.3-0.7), E (0.7-0.9), VE (>0.9)

[Figure: three curves of item difficulty generated by GITEST, BILOG and PARSCALE from the same data.]
As shown in the figure above, the curves are very close. The BILOG and PARSCALE curves almost overlap. This is closely related to the number of cycles and the predetermined convergence value set in the respective command files. BILOG converged after 6 cycles with largest change = 0.005, while PARSCALE converged after 72 cycles with largest change = 0.01. The GITEST curve looks slightly different.
This is because all its parameters were left at their defaults. On the whole, there is no big difference in the calibration of test item difficulty.

2.2. Equating and its why
• In testing practice, equating is used to monitor any possible changes in item difficulty so as to adjust the ability estimates yielded by different groups of candidates taking two parallel tests on different occasions, as in the equating project of the Matriculation English Test (MET) in China, launched in 1986, or the equating of the College English Test (candidates take two tests and may choose the higher of the two scores).

2.3. Equating and its concept
[Diagram: Test A (difficulty d unknown) is taken by test takers A (ability θ unknown) and Test B (difficulty d unknown) is taken by test takers B (ability θ unknown); the two forms are connected through linking items.]

Equating defined
The concept of 'equating' discussed here refers to the linking of test forms through common items, so that scores derived from tests administered separately to different test takers on different occasions will, after conversion, be comparable on the same scale (Hambleton & Swaminathan, Gui Shichun: 1985, et al.).

Equating --- Item bank
• Equating makes an item bank possible;
• An item bank serves computerized testing.
• Item bank → calibrated test items to be presented → computerized testing

BILOG-W Command File
EQUATING OF PRETCO2002(20+100) LINKED WITH PRETCO2002 (20+100)
>COMMENTS
The data were collected from more than 1,000 PRETCO candidates of colleges within Guangdong Province. The data are in the file PRETCO01.DAT of the BILOG directory. The respondents' scores are estimated by the ML method and re-scaled to mean 0 and standard deviation 1 in the sample (RSC=2). The item parameter estimates are saved AFTER re-scaling.
>GLOBAL NWGHT=0, FNAME='d:\BILOG\Examples\blgdat\PRETCO01.DAT', NPArm=1, SAVe;
>SAVe GRAPH='PRETCO01.PLT', PARM='PRETCO01.PAR', SCORE='PRETCO01.SCO';
>LENGTH NITems=220;
>INPUT FORms=2, NTOT=120, NALT=4, INOPT=1, NIDCH=12;
(12A1,1X,I1,120A1)
>FORm1 LENgth =120, ITEms = (1(1)120);
>FORm2 LENgth =120, ITEms = (1(1)20,121(1)220);
>TESt TNAMe= 'EQUATING', LINK=(1(0)20,0(0)200);
>CALIB TPRior, SPRior;
>SCORE MET=1, RSC=2;

BILOG-W Data File
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070002 1 1010101001110101010101010010101010101010010101010101010100
GD2006070003 1 1010101001010101010101010010101010101010010101010101010111
GD2006070004 1 1010101001010101010101010010101010101010010101010101010100
GD2006070005 1 1010101001010101010101000010101010101010010101010101010100
GD2006070006 1 1010101001011111110101010010101010101010010101010101010100
GD2006070007 1 1010101001010101010101010010101010101010010101010101010100
… … …
GD2006070001 1 1010101001010101011101010010101010101011111101010101010100
GD2006070001 1 1010101001010101010101010010101010101011110101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101111010101010101011110101010101010010101010101010100
GD2006070001 1 1110101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2007070001 2 1010101001010101010101010010101010101010010101010101010100
GD2007070002 2 1010101001010101010101010010101010101011110101010101010100
GD2007070003 2 1010101001010101010101010010101010101010010101010101010100
GD2007070004 2 1010101001010101010101011110101010111111111111010101010100
GD2007070005 2 1010101001010101010101011110101010101010010101010101010100
GD2007070006 2 1010101001010101010101010010101010101010010101010101010100
… … …
GD2007070007 2 1010101001010101010101011110101010101011010101010101010100
GD2007070008 2 1010101001010101010101011110101010101010010101010101010111
GD2007070008 2 1111111111110101010101010010101010101010010101010101010100

PARSCALE-W Command File
Command file: EQT8599.PSL
EQ8599 Equating: Simulated Data
>COMMENT:
This example illustrates calibration and scoring of two parallel MET tests, MET85 and MET99, each containing 20 common items and 85 MET items. Each test therefore totals 20 linking items plus 85 items. The simulated data represent responses of 300 examinees drawn randomly from a population with a mean trait score of 0.0 and a standard deviation of 1.0. All items are response data from multiple choice questions with four alternatives. All items have varying difficulties and discriminating powers, saved in the file MET85-99.DAT. The scores, which are equated to be comparable on the same scale, are not printed but saved in the file METEQT8599.SCO.
In addition, the estimated item parameters are saved in the file METEQT8599.PAR. Estimation is by the maximum likelihood method (MLE) under the one-parameter model.
>FILE DFNAME='MET8599.DAT', NFNAME= 'MET8599.NPR', SAV;
>SAVE PARM='MET8599.PAR', SCORE= 'METEQT8599.SCO';
>INPUT NIDW=10, NTOTAL=190, NTEST=1;
(10A1,190A1)
>TEST1 TNAME='EQ8599', ITEM=(1(1)190), NBLOCK=1, SLOPE;
>BLOCK NITEMS=190, NCAT=2, GPARM=0.0, GUESS=(2,FIX), CSLOPE, ORIGINAL=(0,1), MODIF=(1,2);
>CAL LOGISTIC, SCALE=1.7, NQPTS=30, CYCLE=30, CRIT=0.01, ITEMFIT=6;
>SCORE MLE;

PARSCALE-W Data File
TESTX011101010100110000001011000119999999999999
TESTX020110101100011110111111111109999999999999
TESTX031000101100000101110000000019999999999999
TESTX101111110001011110111100111109999999999999
TESTX110101111111110110000000000119999999999999
TESTX120011000000111100001010101009999999999999
… … …
TESTX181110010101010000000001011009999999999999
TESTX191001010100010000001000011009999999999999
TESTX200001100110001000000001000109999999999999
TESTX210011010101010010001000010009999999999999
TESTX221010111000111000110000000009999999999999
TEXTY080111101010101099999999999990010011111000
TEXTY091111010101010199999999999991110111111011
TEXTY101011010101011199999999999990110110100101
TEXTY111111010101010099999999999990011010100011
TEXTY121001101010101099999999999990010000000000
… … …
TEXTY150101010101011099999999999991110010000000
TEXTY161101101010101099999999999990010010101001
TEXTY171100110101010199999999999990010010110001
TEXTY180101101010101199999999999991111010000010
TEXTY311100111010101199999999999990000010100010

[Figure: the numbers of candidates taking MET nationwide in recent years (2006-2010), in millions; x-axis: 1990-1999, 2006, 2007, 2008, 2009, 2010.]

[Figure: two curves of item difficulty for PET1999 and PET2011, generated by GITEST.]
After being equated, the item difficulties of PET1999 and PET2011 are comparable on the same scale.
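To make "comparable on the same scale" concrete, here is a minimal Python sketch of common-item linking under the Rasch Model. It uses a simple mean-shift method, which is an assumption for illustration and not necessarily what GITEST, BILOG or PARSCALE do internally. All difficulty values and the function names `linking_constant` and `to_base_scale` are hypothetical.

```python
from statistics import mean

def linking_constant(anchor_base, anchor_new):
    """Shift that maps the new calibration onto the base scale,
    computed from the difficulties of the common (anchor) items."""
    if len(anchor_base) != len(anchor_new):
        raise ValueError("anchor items must be matched one to one")
    return mean(anchor_base) - mean(anchor_new)

def to_base_scale(values, shift):
    """Apply the shift to difficulties (or abilities) from the new calibration."""
    return [v + shift for v in values]

# Hypothetical difficulties (logits) of the same four anchor items,
# as calibrated in the base year and again in a later year.
anchor_base = [-1.2, -0.4, 0.3, 1.1]
anchor_new = [-0.9, -0.1, 0.6, 1.4]

shift = linking_constant(anchor_base, anchor_new)

# Hypothetical difficulties of two items unique to the later form.
new_form_items = [0.5, 1.0]
equated_items = to_base_scale(new_form_items, shift)

print("linking constant:", round(shift, 3))
print("equated difficulties:", [round(d, 3) for d in equated_items])
```

Because the Rasch Model fixes every item's slope, a single additive constant is enough to place both calibrations on one scale. Anchor items whose difficulty drifts between administrations would normally be screened out first, as with the chi-square test of the linking items described earlier.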
The most recently updated BILOG and PARSCALE can process, in a single run, an unlimited number of test items taken by an unlimited number of test takers. The data matrix is practically unlimited. This makes CAT feasible.

V. What we need in the present status quo
• Examinations on large scale in China today
• (1) Matriculation English Test (MET)
• (2) College English Test Band-4 and Band-6 (CET)
• (3) Test for English Majors (TEM)
• (4) Practical English Test for Colleges (PRETCO)
• (5) Public English Test System (PETS)

VI. What we need in the future
Towards International Practice of Language Testing in China
• (1) Testing theory: Rasch model, IRT, ... ...
• (2) More workshops
• (3) More experienced experts
• (4) More textbooks on language testing
• (5) More PROMS conferences
• (6) More cooperation and exchanges

Thank you for your attention
Questions

Prof. Zhang Quan, Ph.D
Dean, College of Foreign Studies, Jiaxing University, Zhejiang Province, P.R. China
email: qzhang141@yahoo.cn
Tel: 86-0573-83640029
Cell: 86-13902251564