The application of Rasch Model in China

Thanks to the
organizers of the Assessment Conference, Hong Kong,
for giving me the chance to give this presentation
January 15-16, 2013
Hong Kong SAR, China
Rasch Model in China:
Retrospect and Status Quo
by
Prof. Zhang Quan Ph.D
College of Foreign Studies, Jiaxing University
Zhejiang, China
I. Rasch Model, 20 years ago
• As early as the 1980s, the ideas and concepts of the Rasch Model
and IRT were first introduced into China by Prof. Gui Shichun, my
Ph.D supervisor. It was Prof. Gui who first conducted, with great
success, the ten-year (1990-1999) Equating Project for the
Matriculation English Test (MET) in China. MET was the most
influential and competitive entrance examination for higher education,
administered annually to over 3.3 million candidates at the time. The
Equating Project won recognition from Charles Alderson and other
foreign counterparts during the 1990s. Academically, those were the Good
Old Days for Chinese testing experts and psychometricians. Then, for
certain reasons, the equating practice was abruptly discontinued. As a result,
in China nowadays, the application of the Rasch Model or of IRT-based
software such as BILOG, PARSCALE, Winsteps and others to real testing
problems is confined to a small ‘band’ of people.
I. Rasch Model, 20 years ago
• Rasch was used to equate MET (Matriculation English Test),
• the most influential and competitive (only 20% can be enrolled) entrance
examination, administered annually to approximately 3.3 million
candidates (from 1990 on), with the number increasing in the following
years.
• Features of MET
• 1. Compulsory: all Chinese middle school students must take it if they plan to study in a college or a university.
• 2. High-stakes: passing or failing may decide the rest of one’s life.
• 3. Unified and at national level: one and the same test paper is used across Mainland China.
• 4. Test format: mainly multiple-choice questions plus a small portion of writing.
I. Rasch Model, 20 years ago
• Features of MET (continued)
• 5. Family-bound: passing MET and being admitted to a university for higher learning is the chief concern and expectation that all parents in China have for their children.
• 6. Equating via anchor items was done annually from 1990 to 1999, so that test scores, after conversion, were comparable on the same scale across China. MET was the only large-scale test for which equating with real data was conducted and whose rescaled scores were used for admission.
• 7. Moderation of test items was based on item analysis.
II. Rasch Model and MET equating, 20 years ago
• One thing worth mentioning here is that equating via the
Rasch Model was done very successfully in the Chinese situation,
a situation somewhat unique in a number of ways.
(The presenter was one of the key members of the equating
group headed by Prof. Gui from 1990 to 1999.)
• 1. Because of the uneven development of education and the large
number of candidates taking the test, the population is
actually heterogeneous, though the candidates are all senior
middle school graduates. It is difficult to set an unbiased test, let
alone to equate two parallel test forms administered on
different occasions.
II. Rasch Model and MET equating, 20 years ago
• 2. Although the test papers were set centrally,
there was as yet no way to score them centrally.
The general practice was for each
province to score its own papers and to
work out its own norm for admission. This confronted
university authorities with the problem
of selecting candidates whose scores had been graded
according to different criteria set by different
provinces.
II. Rasch Model and MET equating, 20 years ago
• 3. In China, there is no feasible way to protect test
security once a test has been administered. Nor is it
possible to use common items in different forms, nor is it
feasible to conduct any pre-test for future use.
• To find feasible solutions to such problems, we
established an anchorage, i.e. three sampling bases
(middle schools) to monitor the performance of the
candidates. We designed an equivalent test form
(35+65=100 items) and had it administered, three days
before MET itself, to candidates who were
going to take MET. The equivalent test form
was used repeatedly for 10 years (1990-1999).
II. Rasch Model and MET equating, 20 years ago
• In doing so, we could not only observe but also
compare the performance of candidates taking
MET in different years.
• Hypothesis:
• There will be no big change in the general mean (M) of
English proficiency within one year’s time. If
there is any change of means, it must be
associated with a change in the difficulty level of
the test forms across the two years.
II. Rasch Model and MET equating, 20 years ago
• Then we came to realize that such a hypothesis is
by no means perfect, for at least three reasons:
• First, the sample size. We were running the risk of test
leakage. The sample must be big enough to be
representative; however, the larger the sample, the
greater the danger of test exposure.
• Next, the general level of the population is not likely
to remain unchanged. Instead, it may fluctuate.
Insignificant changes may accumulate into
significant changes. (Gui Shichun, 1990)
II. Rasch Model and MET equating, 20 years ago
• Finally, if there are any changes in the
difficulty level of the test forms, it would not be
acceptable simply to make linear adjustments
to individual test scores according to the
difference between the test forms.
• It is on the basis of such a hypothesis that the anchor-test-random-groups design was put forward and
carried out.
2.1. Anchor-test-random-groups design
Test takers A -> Test A + equivalent test
Test takers B -> Test B + equivalent test
Equivalent test: 35 linking items + 65
The equivalent test was taken externally, three days
before MET was actually administered.
2.2. Anchor-test-random-groups design:
summarized (1)
• 1. Sampling
• 2. Administration
• 3. A chi-square test (Wright, 1979) was applied to the 35 linking
items so as to delete the inappropriate
items. In 1989, 28 items; in 1990-1991, 27 items.
• 4. Equating test forms
The test results of 1988 (the year when MET was first
administered across China) were used as the baseline
reference. With the anchor test and the Rasch Model (GITEST), all
subsequent test forms were equated (calibrated and
rescaled) (Wright, 1979).
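The equating step above can be sketched as follows. This is only a minimal illustration, not the GITEST procedure: it assumes the anchor items were calibrated separately with each year's group and uses the difference between the mean anchor difficulties as the linking constant, a common choice for Rasch linking. All names and numbers are hypothetical.

```python
from statistics import mean

def linking_constant(anchor_base, anchor_new):
    """Shift that moves the new calibration onto the base-year scale.

    Both lists hold Rasch difficulties (in logits) of the SAME anchor
    items, in the same order, from two separate calibrations.
    """
    if len(anchor_base) != len(anchor_new):
        raise ValueError("anchor lists must describe the same items")
    return mean(anchor_base) - mean(anchor_new)

def rescale(values, shift):
    """Apply the linking constant to difficulties or abilities."""
    return [v + shift for v in values]

# Hypothetical anchor difficulties: base year vs. a later year whose
# scale origin differs by 0.25 logits.
anchor_base = [-1.0, -0.5, 0.0, 0.5, 1.0]
anchor_new = [-0.75, -0.25, 0.25, 0.75, 1.25]

shift = linking_constant(anchor_base, anchor_new)

# Hypothetical difficulties of the later form's own items, moved onto
# the base-year scale so the two forms can be compared directly.
new_form = [0.25, 0.5, 1.0]
new_form_on_base = rescale(new_form, shift)
print(shift, new_form_on_base)
```

The same shift is applied to the ability estimates of the later group, which is what makes scores from different years comparable on one scale.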
2.2. Anchor-test-random-groups design:
summarized (2)
• 5. Ability estimation
In the case of the Rasch Model, ability estimation
is straightforward. To obtain the maximum likelihood
estimate of theta (θ), we used the Newton-Raphson
procedure (Hambleton, 1985). The ability values were
then converted into probabilities for those who know
nothing about Rasch.
• As the model is sample-free, we could
make use of the derived data to obtain adjusted scores
for the population.
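A minimal sketch of the Newton-Raphson step for the Rasch ability estimate is given below. It is an illustration under stated assumptions, not the original GITEST code: the item difficulties are taken as already calibrated, and the function names, starting value and tolerance are hypothetical choices.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(raw_score, difficulties, tol=1e-6, max_iter=50):
    """Newton-Raphson maximum likelihood estimate of ability.

    Zero and perfect scores have no finite estimate, so they are rejected.
    """
    n = len(difficulties)
    if not 0 < raw_score < n:
        raise ValueError("MLE is undefined for zero or perfect scores")
    theta = math.log(raw_score / (n - raw_score))  # starting value
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        gradient = raw_score - sum(probs)              # d lnL / d theta
        information = sum(p * (1 - p) for p in probs)  # -d2 lnL / d theta2
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical calibrated difficulties for a five-item test.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta = estimate_theta(3, difficulties)

# At the estimate, the expected score equals the observed raw score.
expected_score = sum(rasch_p(theta, b) for b in difficulties)
print(round(theta, 3), round(expected_score, 3))
```

Only the raw score enters the estimate, which is why a score-to-ability conversion table can be prepared once per test form.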
2.2. Anchor-test-random-groups design:
summarized (3)
• Why the Rasch Model and not other models, such as two- or three-parameter models?
• 1. Feasible implementation
• Once the item parameters are calibrated, the ability parameters can be easily
estimated.
• A typical example: a candidate with a raw score of 60 correct answers out of
85 test items is assigned one ability value regardless of which
60 items were answered correctly. With two- or three-parameter models, the procedure gets
complicated. The estimate depends heavily on the discrimination and
the so-called ‘guessing’ parameters. Therefore, two or more candidates with
a raw score of 60 correct answers out of 85 test items will be assigned different
ability values because the combinations of 60 correct answers vary from person
to person. Imagine: the number of possible combinations across raw scores from 1 to 84 is
astronomical!
• With those models it was impossible to use the sampled data to predict the population performance. Very
often the iteration never converged, mainly because of two big
problems: computer configuration problems and a jumble of data too large
to process within two weeks.
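The contrast can be made concrete with a small sketch. Under stated assumptions (hypothetical item parameters, a plain Newton-Raphson estimator, no guessing parameter), two response patterns with the same raw score receive the same Rasch ability but different two-parameter abilities.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct answer; a = 1 for every item is Rasch."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mle_theta(responses, a_params, b_params, tol=1e-8, max_iter=100):
    """Newton-Raphson ability estimate for one 0/1 response pattern."""
    theta = 0.0
    for _ in range(max_iter):
        probs = [p_correct(theta, a, b) for a, b in zip(a_params, b_params)]
        gradient = sum(a * (u - p)
                       for a, u, p in zip(a_params, responses, probs))
        information = sum(a * a * p * (1 - p)
                          for a, p in zip(a_params, probs))
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical item parameters for a five-item test.
b = [-1.0, -0.5, 0.0, 0.5, 1.0]
rasch_a = [1.0] * 5                      # equal discrimination (Rasch)
two_pl_a = [0.5, 0.8, 1.0, 1.2, 1.5]     # unequal discrimination (2PL)

# Two candidates with the same raw score (3) but different patterns.
pattern_1 = [1, 1, 1, 0, 0]
pattern_2 = [0, 0, 1, 1, 1]

rasch_1 = mle_theta(pattern_1, rasch_a, b)
rasch_2 = mle_theta(pattern_2, rasch_a, b)
twopl_1 = mle_theta(pattern_1, two_pl_a, b)
twopl_2 = mle_theta(pattern_2, two_pl_a, b)

# Number of distinct ways to answer exactly 60 of 85 items correctly.
n_patterns = math.comb(85, 60)
print(rasch_1, rasch_2, twopl_1, twopl_2, n_patterns)
```

In the Rasch case the likelihood equation involves the responses only through their sum, so one conversion table per form is enough; in the two-parameter case each pattern has to be scored individually.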
2.2. Anchor-test-random-groups design:
summarized (4)
• 2. Model-data fit
• With Rasch Model, item and ability fit can be
computed (Wright,1982 ) and can demonstrate
the degree of goodness-of-fit of the model.
• ... ...
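As an illustration of what an item fit statistic looks like, the sketch below computes the widely used outfit and infit mean squares for one item. This is an assumption about the kind of statistic meant here, not a reproduction of the GITEST output; the abilities, difficulty and response patterns are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (outfit, infit) mean squares for one item.

    responses: 0/1 answers to the item, one per person
    thetas:    ability estimates of the same persons
    b:         calibrated difficulty of the item
    """
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1 - p))
    # Outfit: plain mean of the squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances))
    outfit /= len(responses)
    # Infit: information-weighted mean square.
    infit = sum(squared_residuals) / sum(variances)
    return outfit, infit

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical abilities
orderly = [0, 0, 1, 1, 1]              # weak candidates fail, strong ones pass
disorderly = [1, 1, 0, 0, 0]           # the reverse, contradicting the model

outfit_ok, infit_ok = item_fit(orderly, thetas, b=0.0)
outfit_bad, infit_bad = item_fit(disorderly, thetas, b=0.0)
print(outfit_ok, infit_ok, outfit_bad, infit_bad)
```

Values near 1 indicate that the responses behave as the model expects; values well above 1 flag items whose responses contradict the model.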
2.3. Item Difficulty of MET 1988-1992
              MET88          MET89          MET90          MET91          MET92
Phonetics     -0.860 (0.70)  -0.69 (0.48)   0.793 (0.31)   -0.186 (0.55)  0.992 (0.28)
Grammar       0.228 (0.44)   -0.372 (0.59)  0.471 (0.38)   0.500 (0.38)   0.801 (0.31)
BLK-Filling   -0.367 (0.59)  0.271 (0.43)   0.871 (0.30)   0.845 (0.30)   0.609 (0.35)
Reading       -0.330 (0.58)  -0.581 (0.64)  0.600 (0.35)   -0.179 (0.54)  -0.202 (0.55)
Means         -0.206 (0.55)  -0.180 (0.54)  0.657 (0.34)   0.361 (0.41)   0.523 (0.37)
For better illustration, the numbers in brackets are
probabilities converted from difficulties. As shown in the table,
there are no big differences between MET88 and MET89; however,
MET90 turned out to be more difficult.
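The bracketed probabilities in the Means row are consistent with the Rasch probability that a candidate of ability 0 answers an item of difficulty d correctly, 1 / (1 + exp(d)). That reading is inferred from the numbers rather than stated on the slide, so the sketch below should be taken as an assumption; the function name is hypothetical.

```python
import math

def difficulty_to_probability(d):
    """Chance of a correct answer for a candidate of ability 0 (assumed)."""
    return 1.0 / (1.0 + math.exp(d))

# Mean difficulties from the table above.
mean_difficulty = {
    "MET88": -0.206,
    "MET89": -0.180,
    "MET90": 0.657,
    "MET91": 0.361,
    "MET92": 0.523,
}

probabilities = {form: difficulty_to_probability(d)
                 for form, d in mean_difficulty.items()}
for form, p in probabilities.items():
    print(form, f"{p:.2f}")
```

A higher difficulty gives a lower probability, which is why MET90 (0.657) shows 0.34 against 0.55 for MET88.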
2.4. Ability (θ) of MET 1988-1992
           MET88    MET89    MET90    MET91    MET92
Total N    136543   117085   128543   136047   133965
θ Means    40.0     44.4     53.7     50.0     54.2
%          47.0     52.2     63.2     58.8     63.8
SD         17.9     16.1     13.5     missing  15.2
The θ Means shown in the table above are the rescaled
average ability parameters (e.g. 40.0 for MET88) for the MC part only,
whose full score is 85; the % row expresses each mean as a percentage of 85.
Full test: 85 MC + 15 writing = 100.
III. MET and Rasch Model: Status Quo
• MET remains the most influential and competitive (only 20% can be
enrolled) entrance examination, administered annually to
approximately 3.3 million candidates (from 1990 on), with the number
increasing in the following years.
• Features of MET remain:
• 1. Compulsory: all Chinese middle school students must take it if they plan to study in a college or a university.
• 2. High-stakes: passing or failing may decide the rest of one’s life.
• 3. Unified and at national level: one and the same test paper is used across China.
• 4. Mainly multiple-choice questions plus a small portion of writing.
• 5. No equating is done annually. Statistically, test scores are not comparable.
• Status quo:
• Disbanded. Resumed the traditional test item writing, scoring at provincial level and reporting in raw scores, with no pre-test, no item analysis (Rasch or IRT) and no equating. Problems of test item writing and moderating.
• Each province or region uses its own test paper. No norm established.
III. MET and Rasch Model: Status Quo
• MET remains the most influential and competitive entrance
examination for higher education in China. The number of candidates has been
increasing. It reached 9.5 million in 2006, and the graph shows
the numbers of candidates taking MET in recent years (2006-2010)
nationwide.
• The average number of candidates in Zhejiang Province,
where our university is located, is 300,000 (360,000 in 2012).
According to the latest official report, the numbers of candidates
taking MET in Jiaxing in 2012 were as follows: 7688 in humanities,
1493 in arts and Chinese, 176 in sports, 12991 in science, 443 in
arts and science, and 237 in sports and science.
• The number of students taking MET is decreasing annually.
III. MET and Rasch Model: Status Quo
• MET features (Continued)
• According to the most updated source, MET in
China will be administered separately from the
other entrance examinations and will be
administered more than once within a year’s
time so that students may have more chances to
take MET. From the professional point of view,
such a practice needs equating.
• Updated IRT-based computer software
IV. College English Test (CET)
• CET is another highly influential examination, administered twice a
year to non-English majors, with over 10
million candidates in recent years.
• Features of CET
• 1. From compulsory to optional:
• Not all non-English-major undergraduates have to take it.
• 2. From high-stakes to not very high-stakes: passing or failing may
make no difference to whether a student gets the diploma.
• 3. Unified and at national level: one and the same test paper is used
across China.
• 4. Mainly multiple-choice questions plus a small portion of writing.
• 5. The test for which equating has been done annually since 1990 in
China (with a team of qualified test item writers).
The first Rasch-based computer software, developed by Prof. Gui in the 1990s.
Test Paper Report by GITEST
Mean        the mean score of all examinees;
SD          the standard deviation of all examinees' scores;
Varn.       the variance based on all examinees;
P+          probability of correct answers;
Pd          difficulty value (parameter) based on probability;
R11         reliability by Kuder-Richardson 20; this value should be over 0.9;
aVALUE      reliability parameter, also called the α value, by Cronbach's formula;
            this value should be over 0.8;
Rbis        discrimination index (in biserial units);
Skewness    score distribution value:
            0 indicating a normal distribution;
            above 0, indicating positive skewness, showing the test items are more difficult;
            below 0, indicating negative skewness, showing the test items are easier;
Kurtosis    score distribution height:
            0 indicating normal; above 0 showing “narrower”, i.e. a small range between
            the scores; below 0 indicating “flat”, i.e. a big range between scores;
Difficulty  VD (<0.1), D (0.1-0.3), I (0.3-0.7), E (0.7-0.9), VE (>0.9)
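Some of the classical statistics in the report can be sketched in a few lines. The example below computes P+ for each item, KR-20 for the whole test, and the difficulty band of each item. It is a hedged illustration rather than GITEST's code: the population variance is used for the total scores, the handling of the band boundaries is a guess, and the data are hypothetical.

```python
from statistics import pvariance

def classical_stats(matrix):
    """Return (p_plus per item, KR-20) for a 0/1 score matrix.

    matrix: one row per examinee, one column per item.
    """
    n_people = len(matrix)
    n_items = len(matrix[0])
    totals = [sum(row) for row in matrix]
    p_plus = [sum(row[i] for row in matrix) / n_people
              for i in range(n_items)]
    item_variance_sum = sum(p * (1 - p) for p in p_plus)
    kr20 = (n_items / (n_items - 1)) * (1 - item_variance_sum / pvariance(totals))
    return p_plus, kr20

def difficulty_band(p):
    """Map P+ to the report's bands (boundary handling is assumed)."""
    if p < 0.1:
        return "VD"
    if p < 0.3:
        return "D"
    if p < 0.7:
        return "I"
    if p <= 0.9:
        return "E"
    return "VE"

# Hypothetical responses of five examinees to four items.
matrix = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

p_plus, kr20 = classical_stats(matrix)
bands = [difficulty_band(p) for p in p_plus]
print(p_plus, round(kr20, 3), bands)
```

With real data the matrix would hold one row per candidate and one column per multiple-choice item.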
[Figure: three curves generated by GITEST, BILOG and PARSCALE, indicating item difficulties based on the same data]
As shown in the figure above, the curves are very close. The BILOG and PARSCALE
curves almost overlap. This is closely related to the number of cycles and the predetermined convergence value set in the respective command files. BILOG
converged after 6 cycles with largest change = 0.005, while PARSCALE
converged after 72 cycles with largest change = 0.01. GITEST looks a
little different because all its parameters were left at their defaults. On the whole,
there is no big difference in terms of test item difficulty calibration.
2.2. Equating and its why
• In testing practice, equating is used to monitor
any possible changes in item difficulty so as to
adjust the ability estimates yielded by different
groups of candidates taking two parallel tests on
different occasions, as in the equating project
of the Matriculation English Test (MET) in China,
launched in 1986, or the equating of the College
English Test (candidates take two tests and may
choose the higher of the two scores).
2.3. Equating and its concept
Test A (difficulty d = ?) taken by test takers A (ability θ = ?)
Test B (difficulty d = ?) taken by test takers B (ability θ = ?)
Test A (difficulty / ability?) <- linking items -> Test B (difficulty / ability?)
Equating defined
The concept of ‘equating’ discussed here
refers to linking of test forms through
common items so that scores derived from
the tests which were administered
separately to different test takers on
different occasions after conversion will be
comparable on the same scale.
(Hambleton & Swaminathan, Gui Shichun: 1985, et al.)
Equating --- Item bank
• Equating makes an item-bank possible;
• An item-bank serves computerized
testing.
• Item bank (calibrated test items) -> computerized testing (items to be presented)
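How a calibrated item bank serves computerized testing can be sketched as follows. This is a simplified illustration, not a description of any system mentioned in the talk: it assumes Rasch difficulties on a common scale and picks the unused item with the greatest information at the current ability estimate. The bank and all names are hypothetical.

```python
import math

def item_information(theta, b):
    """Rasch item information at ability theta: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1 - p)

def next_item(theta, bank, used):
    """Pick the unused item that is most informative at theta."""
    candidates = {name: b for name, b in bank.items() if name not in used}
    if not candidates:
        raise ValueError("item bank exhausted")
    return max(candidates,
               key=lambda name: item_information(theta, candidates[name]))

# Hypothetical bank: item name -> equated difficulty in logits.
bank = {"item01": -1.5, "item02": -0.5, "item03": 0.4, "item04": 1.2}

first = next_item(0.0, bank, used=set())     # start from an average ability
second = next_item(0.8, bank, used={first})  # estimate rose after a correct answer
print(first, second)
```

Because every item in the bank sits on the same scale after equating, candidates who see different items can still be compared.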
BILOG-W Command File
EQUATING OF PRETCO2002(20+100) LINKED WITH PRETCO2002 (20+100)
>COMMENTS
The data were collected from more than 1,000 PRETCO candidates of
colleges within Guangdong Province.
The data are in the file PRETCO01.DAT of the BILOG directory;
The respondents' scores are estimated by the ML method and re-scaled to
mean 0 and standard deviation 1 in the sample (RSC=2).
The item parameter estimates are saved AFTER re-scaling.
>GLOBAL NWGHT=0, FNAME='d:\BILOG\Examples\blgdat\PRETCO01.DAT',
NPArm=1, SAVe;
>SAVe
GRAPH='PRETCO01.PLT',
PARM='PRETCO01.PAR',
SCORE='PRETCO01.SCO';
>LENGTH NITems=220;
>INPUT
FORms=2, NTOT=120, NALT=4, INOPT=1, NIDCH=12;
(12A1,1X,I1,120A1)
>FORm1 LENgth =120, ITEms = (1(1)120);
>FORm2 LENgth =120, ITEms = (1(1)20,121(1)220);
>TESt
TNAMe= 'EQUATING', LINK=(1(0)20,0(0)200);
>CALIB
TPRior, SPRior;
>SCORE
MET=1, RSC=2;
BILOG-W Data File
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070002 1 1010101001110101010101010010101010101010010101010101010100
GD2006070003 1 1010101001010101010101010010101010101010010101010101010111
GD2006070004 1 1010101001010101010101010010101010101010010101010101010100
GD2006070005 1 1010101001010101010101000010101010101010010101010101010100
GD2006070006 1 1010101001011111110101010010101010101010010101010101010100
GD2006070007 1 1010101001010101010101010010101010101010010101010101010100
... … … … … … … …
... … … … … … … …
GD2006070001 1 1010101001010101011101010010101010101011111101010101010100
GD2006070001 1 1010101001010101010101010010101010101011110101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101111010101010101011110101010101010010101010101010100
GD2006070001 1 1110101001010101010101010010101010101010010101010101010100
GD2006070001 1 1010101001010101010101010010101010101010010101010101010100
GD2007070001 2 1010101001010101010101010010101010101010010101010101010100
GD2007070002 2 1010101001010101010101010010101010101011110101010101010100
GD2007070003 2 1010101001010101010101010010101010101010010101010101010100
GD2007070004 2 1010101001010101010101011110101010111111111111010101010100
GD2007070005 2 1010101001010101010101011110101010101010010101010101010100
GD2007070006 2 1010101001010101010101010010101010101010010101010101010100
... … … … … … … …
GD2007070007 2 1010101001010101010101011110101010101011010101010101010100
GD2007070008 2 1010101001010101010101011110101010101010010101010101010111
GD2007070008 2 1111111111110101010101010010101010101010010101010101010100
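Reading such a data file back in can be sketched as below. The sketch follows the format statement (12A1,1X,I1,120A1) in the command file: a 12-character ID, one blank, a one-digit form number, then the item responses. The listing above shows an additional blank after the form number, so the exact column layout of the real file is an assumption, and the two sample records are hypothetical and shortened to six items.

```python
def parse_record(line, n_items=120):
    """Split one fixed-width record laid out as (12A1,1X,I1,<n_items>A1)."""
    examinee_id = line[0:12]          # columns 1-12: candidate ID
    form = int(line[13])              # column 14: test form (1 or 2)
    responses = [int(ch) for ch in line[14:14 + n_items]]
    return examinee_id, form, responses

# Hypothetical records with only six responses each.
sample = [
    "GD2006070001 1101010",
    "GD2007070001 2010101",
]

records = [parse_record(line, n_items=6) for line in sample]

# Group the raw scores by form to compare the two groups.
raw_scores = {}
for examinee_id, form, responses in records:
    raw_scores.setdefault(form, []).append(sum(responses))
print(records, raw_scores)
```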
PARSCALE-W Command File
Command file: EQT8599.PSL
EQ8599 Equating: Simulated Data
>COMMENT:
This example illustrates calibration and scoring of two parallel MET tests: MET85 and MET99 containing
respectively 20 common items and 85 MET items. The total items for each test is 20 linking items plus 85
items. The simulated data represent responses of 300 examinees drawn randomly from a population with a
mean trait score of 0.0 and standard deviation of 1.0.
All items are response data from multiple choice questions with four alternatives. All items have varying
difficulties and discriminating powers saved in the file MET85-99.DAT. The scores, which are equated to be
comparable on the same scale, are not printed but saved in the file METEQT8599.SCO. In addition, the
estimated item parameters are saved in the file METEQT8599.PAR. by maximum likelihood method (MLE)
from one-parameter model.
>FILE DFNAME='MET8599.DAT', NFNAME= 'MET8599.NPR', SAV;
>SAVE PARM='MET8599.PAR', SCORE= 'METEQT8599.SCO';
>INPUT NIDW=10, NTOTAL=190, NTEST=1;
(10A1,190A1)
>TEST1 TNAME='EQ8599', ITEM=(1(1)190), NBLOCK=1, SLOPE;
>BLOCK NITEMS=190, NCAT=2, GPARM=0.0, GUESS=(2,FIX), CSLOPE,
ORIGINAL=(0,1), MODIF=(1,2);
>CAL LOGISTIC, SCALE=1.7, NQPTS=30, CYCLE=30, CRIT=0.01, ITEMFIT=6;
>SCORE MLE;
PARSCALE-W Data File
TESTX011101010100110000001011000119999999999999
TESTX020110101100011110111111111109999999999999
TESTX031000101100000101110000000019999999999999
TESTX101111110001011110111100111109999999999999
TESTX110101111111110110000000000119999999999999
TESTX120011000000111100001010101009999999999999
… … …… … …
TESTX181110010101010000000001011009999999999999
TESTX191001010100010000001000011009999999999999
TESTX200001100110001000000001000109999999999999
TESTX210011010101010010001000010009999999999999
TESTX221010111000111000110000000009999999999999
TEXTY080111101010101099999999999990010011111000
TEXTY091111010101010199999999999991110111111011
TEXTY101011010101011199999999999990110110100101
TEXTY111111010101010099999999999990011010100011
TEXTY121001101010101099999999999990010000000000
… … … … … …
TEXTY150101010101011099999999999991110010000000
TEXTY161101101010101099999999999990010010101001
TEXTY171100110101010199999999999990010010110001
TEXTY180101101010101199999999999991111010000010
TEXTY311100111010101199999999999990000010100010
The numbers of candidates taking MET
in recent years (2006-2010) nationwide.
[Bar chart, unit: million; vertical axis from 0 to 12; bars for 1990-1999, 2006, 2007, 2008, 2009 and 2010]
[Figure: two curves indicating the item difficulties of PET1999 and PET2011, generated by GITEST; after being equated, they are comparable on the same scale.]
The most recently updated BILOG and
PARSCALE can process, in a single run,
an unlimited number of test items taken by an unlimited
number of test takers.
The data matrix is effectively unbounded. This
makes CAT feasible.
V. What we need in the present status quo
Examinations on large scale in China today
(1) Matriculation English Test (MET)
(2) College English Test Band-4 and Band-6 (CET)
(3) Test for English Majors (TEM)
(4) Practical English Test for Colleges (PRETCO)
(5) Public English Test System (PETS)
VI. What we need in the future
Towards International Practice of Language Testing in China
(1) Testing theory: Rasch model, IRT, ... ...
(2) More workshops
(3) More experienced experts
(4) More textbooks on language testing
(5) More PROMS conferences
(6) More cooperation and exchanges
Thank you for your attention
Questions
Prof. Zhang Quan Ph.D
Dean, College of Foreign Studies, Jiaxing University,
Zhejiang Province, P.R.China
email: qzhang141@yahoo.cn
Tel: 86-0573-83640029 Cell: 86-13902251564