FUTURE OF ASSESSMENT IN INDIA
CHALLENGES AND SOLUTIONS
HARIHARAN SWAMINATHAN
UNIVERSITY OF CONNECTICUT
Testing: A Brief History
1. Testing is part of human nature and has a long history. In the Old Testament, “mastery testing” was used to “classify” people into two groups:
• The Gileadites defeated the Ephraimites, and when the Ephraimites tried to escape, they were given a high-stakes test;
• The test was to pronounce the word “shibboleth”. Failure had a drastic consequence.
2. Civil service exams were used in China more than 3000
years ago.
Testing: A Brief History
3. Testing in India may go even further back in time.
4. In the West, testing has a mixed history – its use has waxed and waned.
5. In the US, Horace Mann argued for written (objective-type) examinations, and the first was introduced in Boston in 1845.
6. Tests were used for grade-to-grade promotion.
7. This testing practice fell into disrepute because of teaching to the test.
Testing: A Brief History
8. Grade promotion based on testing was abolished in Chicago
in 1881.
9. Binet introduced mental testing in 1901 (became the
Stanford Binet Test).
10. The notion of fairness that “everyone should get the same test” was not relevant to Binet.
11. Binet ordered the items by difficulty and targeted items to the child’s ability.
12. The first “Adaptive Test” was born.
Individualized Testing
• Adaptive testing was the primary mode of testing before the
notion of group testing was introduced.
• With the advent of group testing, individualized (adaptive) testing was put on the back burner, primarily because it was impossible to administer to large groups.
• Group-based adaptive testing was not feasible, until…
• We will return to adaptive testing later.
Testing in India
• Testing in India has a long history
• Knowledge/skill testing was common in ancient India:
– Rama was subjected to competency testing by Sugriva
– Yudhisthira was tested with a 133-item, high-stakes test (Yaksha Prashna)
– According to the Bhagavatam, the name Parikshit (Examiner) was given to the successor of Yudhisthira
• Testing in the form of puzzles and games was often used for entertainment
Testing in India
India is a country of superlatives
• India can boast of probably the longest tradition of education, stretching over at least two and a half millennia
• Takshashila (Taxila) is considered the oldest seat of learning in the world
• Nalanda is perhaps the oldest university in the world
• This tradition of learning and the value placed on education continues in India
Testing in India
Population Distribution in India

Country | Age Range        | Number
India   | 0-4              | 128 Million
India   | 5-9              | 128 Million
India   | 10-15            | 128 Million
India   | 0-15             | 384 Million
USA     | 0-15             | 32 Million
USA     | Total Population | 352 Million
Testing in India
• These numbers are expected to decline slightly over the next twenty years but will nevertheless far exceed those of any other country in the world.
• The tradition of learning, the value placed on education, and the population explosion have created considerable stress on the Indian education system.
• It is not surprising that assessment and testing procedures in India have focused on selection and certification.
Testing in India
• Some of the selection examinations in India are perhaps the most grueling and most selective in the world.
-- IAS Examination: of the 450,000 candidates, 1,200 are selected (0.3%)
-- IIT Joint Entrance Examination: of the 500,000 applicants, only about 10,000 are selected for admission (2%)
• According to the scientific advisor to the previous prime minister, C.N.R. Rao, “India has an examination system but not an education system.”
• Another criticism levelled against the examinations is that they promote intensive coaching with little regard for a properly grounded knowledge base.
Testing in India
• Needless to say, these criticisms are not unlike the criticisms levelled against testing in the US.
• However, intensive testing in schools in the US is now directed more towards assessment of learning and growth for accountability purposes than towards assessment of student achievement for the purposes of certification and promotion.
• Although assessment and testing play an important role in
Indian education, assessment practices in India do not seem
to have kept pace with the modern approaches and trends
in testing and assessment.
Test Uses in the U.S.
1. Management of Instruction
2. Placement and Counseling
3. Selection
4. Licensure and Certification
5. Accountability
Test Uses in the U.S.
1. Management of Instruction
• Classroom and standardized tests for daily
management of instruction (formative and
diagnostic evaluation)
• Classroom and standardized tests for grading
(summative evaluation)
2. Placement and Counseling
• Standardized tests for transition from one level of
school to another or from school to work
Test Uses in the U.S.
3. Selection (Entry Decisions)
• Standardized achievement and aptitude tests for admission to college, graduate school, and special programs
4. Licensure and Certification
• Standardized tests for determining qualification for entry into a profession
• School graduation requirement
Test Uses in the U.S.
5. Accountability
• Standardized tests to show satisfactory achievement
or growth of students in schools receiving public
money, often required by state and federal legislation.
If students in schools do not show adequate progress,
sanctions are imposed on the schools and on the
state.
Theoretical Framework for Tests
• For all these purposes we need to design tests for measuring the student’s “ability” or “proficiency”
• The use of the ability/proficiency test scores must be appropriate for the intended use
• Tests are measurement instruments, and when we use them to measure, as with all measurement devices, we make measurement errors
Theoretical Framework for Tests
• The objective of measurement is to measure what we want to measure appropriately and to do so with minimum error.
• The construction of tests and the determination of an examinee’s proficiency level/ability scores are carried out within one of two theoretical frameworks:
• Classical Test Theory
• Modern Test Theory or Item Response Theory (IRT)
Classical Test Theory Model
X = T + E
(Observed Score = True Score + Error)
𝝈𝑿² = Observed Score Variance
𝝈𝑻² = True Score Variance
𝝈𝑬² = Error Variance
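Under this model, when T and E are uncorrelated the variances add: 𝝈𝑿² = 𝝈𝑻² + 𝝈𝑬². A minimal Python simulation (not from the talk; the sample size and variances are arbitrary assumptions) illustrates the decomposition:

```python
import numpy as np

rng = np.random.default_rng(42)

true_scores = rng.normal(loc=50, scale=8, size=100_000)   # T
errors = rng.normal(loc=0, scale=4, size=100_000)         # E, uncorrelated with T
observed = true_scores + errors                           # X = T + E

# With T and E uncorrelated, the variances add: var(X) ~ var(T) + var(E)
print(observed.var().round(1), (true_scores.var() + errors.var()).round(1))

# Reliability is the proportion of observed-score variance that is true-score variance
print((true_scores.var() / observed.var()).round(3))      # ~ 64 / (64 + 16) = 0.8
```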
Indices Used In Traditional Test Construction
• Item and Test Indices:
– Item difficulty
– Item discrimination
– Test score reliability
– Standard Error of Measurement
• Examinee Indices:
– Test score
Classical Item Indices: Item Difficulty
• Item difficulty: proportion of examinees answering the item correctly
• It is an index of how difficult an item is.
• If the value is low, it indicates that the item is very difficult. Only a few examinees will respond correctly to this item.
• If the value is high, the item is easy, as many examinees will respond correctly to it.
Classical Item Indices: Item Discrimination
• Item discrimination: correlation between item score and total score
• A value close to 1 indicates that examinees with high scores (ability) answer this item correctly, while examinees with low ability respond incorrectly
• A low value implies that there is hardly any relationship between ability and how examinees respond to this item
• Such items are not very useful for separating high-ability examinees from low-ability examinees
• Items with high values of discrimination are very useful for ADAPTIVE TESTING
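As an illustration, here is a minimal sketch of how both classical indices might be computed from a scored response matrix; the random data and dimensions are placeholder assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder scored responses: 500 examinees x 20 items, 1 = correct, 0 = incorrect
responses = (rng.random((500, 20)) < 0.6).astype(int)

# Item difficulty: proportion of examinees answering each item correctly
difficulty = responses.mean(axis=0)

# Item discrimination: correlation between item score and total score
total = responses.sum(axis=1)
discrimination = np.array(
    [np.corrcoef(responses[:, j], total)[0, 1] for j in range(responses.shape[1])]
)

print(difficulty.round(2))
print(discrimination.round(2))
```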
Standard Error Of Measurement 𝝈𝑬
• Indicates the amount of error to be expected in the test scores
• Arguably, the most important quantity
• Depends on the scale, and therefore it is difficult to assess its magnitude
• Can be re-expressed in terms of reliability, which varies between 0 and 1
Test Score Reliability
• The reliability coefficient, 𝝆, is defined as the correlation between observed scores on “parallel” tests.
• It takes on values between 0 and 1, with 0 denoting totally unreliable test scores and 1 perfectly reliable test scores.
• It is related to the Standard Error of Measurement according to the expression
𝝈𝑬 = 𝝈𝑿 √(𝟏 − 𝝆)
• If the test scores are perfectly reliable, 𝝆 = 1 and 𝝈𝑬 = 0.
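A worked example of this expression, with made-up values:

```python
import math

sigma_x = 15.0   # observed-score standard deviation (illustrative assumption)
rho = 0.91       # reliability coefficient (illustrative assumption)

sem = sigma_x * math.sqrt(1 - rho)   # sigma_E = sigma_X * sqrt(1 - rho)
print(round(sem, 2))                 # 15 * sqrt(0.09) = 4.5
```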
Reliability
• Error in scores is due to factors such as testing conditions, fatigue, guessing, or the emotional or physical condition of the student
• Different types of reliability coefficients reflect different
interpretations of error
Reliability
1. Reliability refers to consistency of test scores
• over time
• over different sets of test items
2. Reliability refers to test results, not the test itself
3. A test can have more than one type of reliability
coefficient
4. Reliability is necessary but not sufficient for validity
Shortcomings of the Indices Based on Classical Test Theory
• They are group DEPENDENT, i.e., they change as the groups change.
• Reliability, and hence the Standard Error of Measurement, is defined in terms of parallel tests, which are almost impossible to realize in practice.
And what’s wrong with that?
• We cannot compare item characteristics for items whose indices were computed on different groups of examinees
• We cannot compare the test scores of individuals who have taken different sets of test items
Wouldn’t it be nice if ….?
• Our item indices did not depend on the characteristics of the individuals on which the item data were obtained
• Our examinee measures did not depend on the characteristics of the items that were administered
ITEM RESPONSE THEORY
solves the problem!*
* Certain conditions apply. Individual results may vary. IRT is
not for everyone, including those with small samples. Side
effects include nausea, drowsiness, and difficulty
swallowing. If symptoms persist, consult a psychometrician.
For more information, see Hambleton and Swaminathan
(1985), and Hambleton, Swaminathan and Rogers (1991).
Item Response Theory
• Based on the postulate that the probability of a correct response to an item depends on the ability of the examinee and the characteristics of the item
The Item Response Model
• The mathematical relationship between the
probability of a response, the ability of the
examinee, and the characteristics of the item
is specified by the
ITEM RESPONSE MODEL
Item Characteristics
• An item may be characterized by its
DIFFICULTY level (usually denoted by b),
DISCRIMINATION level (usually denoted by a),
“PSEUDO-CHANCE” level (usually denoted by c).
Item Response Model
[Figure: Item characteristic curve showing probability of correct response against theta (proficiency) for an item with a = 0.5, b = −0.5, c = 0.0]
Item Response Model
[Figure: Item characteristic curves for two items (a = 0.5, b = −0.5, c = 0.0 and a = 2.0, b = 0.0, c = 0.25), probability of correct response against theta (proficiency)]
Item Response Model
[Figure: Item characteristic curves for three items (a = 0.5, b = −0.5, c = 0.0; a = 2.0, b = 0.0, c = 0.25; a = 0.8, b = 1.5, c = 0.1), probability of correct response against theta (proficiency)]
IRT Item Difficulty b
• Differs from classical item difficulty
• b is the θ value at which the probability of a correct response is .5
• The harder the item, the higher the b
• b is on the same scale as θ and does not depend on the characteristics of the group of test takers
IRT Item Discrimination a
• Differs from classical item discrimination
• a is proportional to the slope of the ICC at θ = b
• The slope indicates how much the probability of a correct response changes for individuals with slightly different θ values, i.e., how well the item discriminates between them
IRT Item Guessing Parameter
• No analog in classical test theory
• c is the probability that an examinee with very
low θ will answer the item correctly
• b is now the θ value at which the probability of
a correct response is (1 + c)/2
Item Response Models
• The One-Parameter Model (Rasch Model)
P(correct response) = e^(θ − b) / (1 + e^(θ − b))
• The Two-Parameter Model
P(correct response) = e^(a(θ − b)) / (1 + e^(a(θ − b)))
• The Three-Parameter Model
P(correct response) = c + (1 − c) e^(a(θ − b)) / (1 + e^(a(θ − b)))
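Because the three models nest within one another, they can be written as a single function. A minimal sketch, assuming the three-parameter form with defaults that reduce it to the simpler models:

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the three-parameter model.
    c = 0 gives the two-parameter model; a = 1 and c = 0 give the
    one-parameter (Rasch) model."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))   # e^(a(theta-b)) / (1 + e^(a(theta-b)))
    return c + (1.0 - c) * logistic

# At theta = b, the probability is .5 when c = 0 and (1 + c)/2 otherwise
print(p_correct(0.0, a=2.0, b=0.0, c=0.0))    # 0.5
print(p_correct(0.0, a=2.0, b=0.0, c=0.25))   # 0.625 = (1 + 0.25)/2
```

The second call also illustrates the earlier point that, with a guessing parameter, b is the θ value at which the probability of a correct response is (1 + c)/2.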
How Is IRT Used In Practice?
• Test construction
• Equating of test forms
• Vertical scaling (for growth assessment)
• Detection of differential item functioning
• Adaptive testing
Test Construction
• Traditional approach: select items with p-values in the .2–.8 range and as highly discriminating as possible
• With this approach, however, we cannot design a test that has the required reliability, SEM, and score distribution.
• We cannot design a test with pre-specified characteristics.
Test Construction (cont.)
• IRT approach: INFORMATION FUNCTIONS
• The information provided by the test about an examinee with a given ability 𝜽 is directly related to the Standard Error of Measurement
• We CAN assemble a test that has the characteristics we want, something that is impossible to accomplish in a classical framework
Test Construction (cont.)
• The TEST INFORMATION FUNCTION specifies the information provided by the test across the θ range
• Test information is the sum of the information provided by each item
• Because of this property, we can combine items to obtain a pre-specified test information
Test Construction (cont.)
• Items can be selected to maximize information in desired θ regions depending on the test purpose
• A test can be constructed of minimal length to keep standard errors below a specified maximum (see the sketch below)
• By selecting items that have optimal properties, we can create a shorter test that has the same degree of precision as a longer test
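A minimal sketch of this idea: greedily add the most informative remaining item at a target θ until the standard error falls below a bound. The bank, the bound, and the use of the standard 3PL information formula are illustrative assumptions, not an operational assembly procedure:

```python
import numpy as np

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    # Standard 3PL item information: a^2 * ((1-P)/P) * ((P - c)/(1 - c))^2
    p = p3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

def assemble_test(bank, target_theta, max_se=0.35):
    """Greedily add the most informative remaining item at target_theta
    until the standard error 1/sqrt(information) falls below max_se."""
    chosen, total_info = [], 0.0
    available = list(range(len(bank)))
    while available and (total_info == 0 or 1 / np.sqrt(total_info) > max_se):
        j = max(available, key=lambda k: item_information(target_theta, *bank[k]))
        available.remove(j)
        chosen.append(j)
        total_info += item_information(target_theta, *bank[j])
    return chosen, 1 / np.sqrt(total_info)

rng = np.random.default_rng(3)
bank = np.column_stack([rng.uniform(0.6, 2.0, 100),    # a
                        rng.uniform(-2.0, 2.0, 100),   # b
                        np.full(100, 0.2)])            # c
print(assemble_test(bank, target_theta=0.0))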
Item Information Functions
• Bell-shaped
• Peak is at or near the difficulty value b: the item provides the greatest information at θ values near b
• Height depends on discrimination; more discriminating items provide greater information over a narrow range around b
• Items with low c provide the most information
Test And Item Information Functions
[Figure: Test and item information functions, information plotted against theta from -2.5 to 2.5]
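A short sketch reproducing the computation behind such a plot, using the three items from the ICC figures earlier (illustrative; these need not be the items actually plotted):

```python
import numpy as np

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    p = p3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

theta = np.linspace(-2.5, 2.5, 101)
items = [(0.5, -0.5, 0.0), (2.0, 0.0, 0.25), (0.8, 1.5, 0.1)]   # (a, b, c)

# The test information function is the sum of the item information functions,
test_info = sum(item_information(theta, a, b, c) for a, b, c in items)
# and the standard error of the theta estimate is its reciprocal square root.
se = 1 / np.sqrt(test_info)
print(float(test_info.max().round(2)), float(se.min().round(2)))
```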
And what do we get for all this?
• The proficiency level (ability) of an examinee is
not tied to the specific items we administer
• We CAN compare the ability scores of
examinees who have taken different sets of test
items
• We can therefore match items to examinee’s
ability level and measure ability more precisely
with shorter tests
And what do we get for all this?
• We can create a bank of items by administering
different items to different groups of examinees
at different times
• This will allow us to administer comparable
tests or individually tailored tests to examinees
• By administering different items to different
individuals or groups we can improve test
security and minimize cheating
And what do we get for all this?
• We can ensure the fairness of tests by making
sure the test and test items are functioning in
the same way across different groups
• For assessment of learning, we can give different SHORT tests made up of items that measure the entire domain of skills; otherwise such coverage would require a very long, unmanageable test.
Equating
• Purpose of equating is to place scores from one
form of a test on the scale of another
• The goal of equating is for scores to be
exchangeable; it should not matter to
examinees which form of the test they take
• True equating is not strictly possible using
traditional procedures
IRT “Equating”
• Under IRT, equating is not necessary
• But we have to undo the scaling; when we
calibrate a test, i.e., estimate the ability scores
and item parameters, we commonly
“standardize” the ability scores. We have to
undo this scaling to place the parameters on a
common scale.
• IRT “equating” is simply rescaling.
• We will describe the scaling procedure in detail
later.
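As a preview of that procedure, here is a minimal sketch of one common linking method, mean-sigma, which uses the difficulty estimates of items common to the two calibrations; all values are made-up placeholders:

```python
import numpy as np

# Difficulty estimates for the items common to the two calibrations
b_target = np.array([-1.2, -0.4, 0.3, 1.1])   # common items on the target scale
b_new = np.array([-0.9, -0.1, 0.6, 1.4])      # same items in the new calibration

A = b_target.std() / b_new.std()
B = b_target.mean() - A * b_new.mean()

b_rescaled = A * b_new + B    # difficulties transform as b* = A*b + B
# a_rescaled = a_new / A      # discriminations transform as a* = a / A
# theta_rescaled = A * theta_new + B
print(b_rescaled.round(2))    # recovers b_target in this toy example
```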
Differential Item Functioning (DIF)
• When we design a test to measure a construct
of interest, that test should not measure
something else that is irrelevant.
• For example, if we are measuring reading comprehension, we should not include items that have mathematical content (unless mathematical skill is itself a relevant and required skill).
Differential Item Functioning (DIF)
• Similarly, we should not include items that have
a heavy reading component if our purpose is to
measure the mathematical ability of the
student.
• Such items will favor one group over another.
• Construct irrelevant items adversely affect
validity.
Differential Item Functioning (DIF)
• IRT provides a natural framework for defining
and assessing DIF
Definition:
An item shows DIF if two examinees at the
same ability level but from different groups do
not have the same probability of answering the
item correctly.
Differential Item Functioning (DIF)
• Detecting and eliminating such items is carried out routinely in all testing programs.
• It is a critical part of the validation process, in determining whether construct-irrelevant variables pose threats to the intended uses of the test.
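IRT-based DIF analysis compares the estimated item characteristic curves across groups. A simpler screen that is also in routine use is the Mantel-Haenszel procedure; the sketch below (an illustration, not from the talk) matches examinees on total score and compares the odds of a correct response across groups within each score stratum:

```python
import numpy as np

def mh_common_odds_ratio(item, total, group):
    """Mantel-Haenszel common odds ratio for one item.
    item: 0/1 item scores; total: matching (total test) scores;
    group: 0 = reference group, 1 = focal group."""
    num = den = 0.0
    for s in np.unique(total):                # one 2x2 table per score stratum
        m = total == s
        a = np.sum((item[m] == 1) & (group[m] == 0))   # reference, correct
        b = np.sum((item[m] == 0) & (group[m] == 0))   # reference, incorrect
        c = np.sum((item[m] == 1) & (group[m] == 1))   # focal, correct
        d = np.sum((item[m] == 0) & (group[m] == 1))   # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")

# A ratio near 1 is consistent with no DIF; ratios far from 1 flag the item for review.
```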
Tailored Testing
• Fred Lord introduced Item Response Theory in the early 1950s and, with it, the notion of Tailored Testing.
• Without computers, tailored testing was not feasible.
• To overcome this problem, Lord developed “FLEXILEVEL TESTING”
• A flexilevel test follows Binet’s idea; only the difficulty level of the item is used in routing
Computerized Adaptive Testing (CAT)
• Adaptive testing is the process of tailoring the test items administered to the best current estimate of an examinee’s trait level
• Items are most informative when their difficulty is close to the examinee’s θ value
• Different examinees take different tests
• Only through IRT can items be appropriately selected, trait values estimated after each item is administered, and the resulting test scores compared
Advantages Of CAT
• Testing time can be shortened
• Examinees’ trait values can be estimated with a desired degree of precision
• Scoring and reporting can be immediate
• Scoring errors and loss of data are reduced
• Test security is preserved (in theory)
• Paper use is eliminated
• Need for supervision is reduced
Adaptive Testing
• Items are pre-calibrated and stored in a bank
• The item bank should be large and varied in difficulty
• The examinee is administered one or more items of moderate difficulty to obtain an initial trait estimate
Computerized Adaptive Testing (CAT)
• Once an initial θ estimate is obtained, items are selected for administration based on their information functions – the most informative item at the current estimate of θ is selected, taking into account content considerations
• The trait estimate is updated after each item response
• Testing is terminated after a fixed number of items or when the standard error of the estimate is at a desired level
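A minimal end-to-end sketch of this loop, assuming a 3PL item bank, EAP (posterior-mean) trait estimation on a grid, and a standard-error stopping rule; the bank, prior, and stopping values are illustrative assumptions, not an operational algorithm:

```python
import numpy as np

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info(theta, a, b, c):
    p = p3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

def simulate_cat(bank, true_theta, se_target=0.3, max_items=30, seed=0):
    """bank: array with one (a, b, c) row per item."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-4, 4, 161)
    posterior = np.exp(-0.5 * grid**2)           # N(0, 1) prior on theta
    available = list(range(len(bank)))
    theta_hat = 0.0                              # initial trait estimate
    for _ in range(max_items):
        # Select the most informative remaining item at the current estimate
        j = max(available, key=lambda k: info(theta_hat, *bank[k]))
        available.remove(j)
        a, b, c = bank[j]
        x = rng.random() < p3pl(true_theta, a, b, c)   # simulated response
        p = p3pl(grid, a, b, c)
        posterior = posterior * (p if x else 1 - p)    # update after each response
        posterior = posterior / posterior.sum()
        theta_hat = float(np.sum(grid * posterior))    # EAP trait estimate
        se = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior)))
        if se < se_target:                             # stop at desired precision
            break
    return theta_hat, se

rng = np.random.default_rng(1)
bank = np.column_stack([rng.uniform(0.8, 2.0, 200),    # a
                        rng.uniform(-2.5, 2.5, 200),   # b
                        np.full(200, 0.2)])            # c
print(simulate_cat(bank, true_theta=1.0))
```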
Computerized Adaptive Testing (CAT)
• The Office of Naval Research, the Army, and the Air Force funded research for advancing CAT during the 1970s.
• David Weiss and his team at the University of Minnesota were funded to develop operational procedures for implementing CAT
• I was funded to develop Bayesian estimation procedures so that item parameters could be estimated more accurately
• All these activities were motivated by the large volume of test takers in the armed forces
Operationalizing CAT
• I was on the Board of Directors of the GRE in the mid-1980s.
• The Board authorized and funded research for implementing the GRE-CAT
• The GRE CAT was operational in the early 1990s. The GMAT followed suit soon after.
Theory v. Practice
• The first clash between theory and practice occurred in the implementation of CAT
• As personal computers were not common, the GRE had to contract with a delivery-system provider
• “Seat time” was the major obstacle.
• In theory, testing should continue until the stopping criterion, a prescribed standard error, is reached. Instead, testing time, i.e., test length, had to be fixed.
• Examinees had to complete 80% of the items
Issues
• Item Bank: A large item bank is needed and must be maintained well. In developing item banks, items from paper-and-pencil administrations should not be used without careful investigation.
• Exposure Control: In high-stakes testing, exposure control is critical
• Content Specification and Balancing: This is a critical issue and must be addressed early on in the development of the item bank and the item selection criteria
Issues (cont’d)
• CAT algorithm: Unchecked, a CAT algorithm will be greedy and choose items with the highest information. Algorithms for selecting items with content balancing must be in place.
• Item parameter drift: Over time, item parameter values will change because of instruction, targeted instruction, and exposure of items. Item parameters must be re-estimated, and items that show large drifts must be eliminated.
Issues (Cont’d)
• Item BIAS (Differential Item Functioning): The performance of subgroups on items must be examined to determine if subgroups are performing differentially on items. This analysis must be carried out not only in the development of the item bank but also during operational administrations.
VALIDITY
"The concept of validity refers to the appropriateness,
meaningfulness, and usefulness of the specific inferences
made from test scores. Test validation is the process of
accumulating evidence to support such inference."
Standards for Educational and
Psychological Testing
AERA/APA/NCME
VALIDITY
-- refers to the appropriateness of interpretations of test scores
-- is not a property of the test itself
-- is a matter of degree
-- is specific to a particular use of the test
VALIDITY
is assessed using evidence from three categories:
content-related
criterion-related
construct-related
Validity of Ancient Tests
• Was Sugriva’s test of Rama’s skills valid for the intended use?
• Sugriva could have given Rama a general knowledge test to see if Rama knew of Vali’s strength and skill in battle
Validity of Ancient Tests
• Sugriva could have given Rama a math test to determine if Rama could compute the angle to shoot the arrow to reach the target.
• Sugriva could have given Rama a physics test to determine if he could calculate the force necessary to penetrate the giant trees.
Validity of Ancient Tests
• But then anyone with a degree of general knowledge, and knowledge of mathematics and physics, could have answered the question without even knowing how to use a bow and arrow.
• Sugriva chose to give Rama an “authentic” test to determine his ability.
• The test was clearly appropriate for the use and therefore valid.
Validity of Ancient Tests
• The God of Death, Yama, could have given Yudhisthira a general knowledge test.
• But that would not have served his purpose. The God Yama designed a test that was valid for its intended purpose.
Innovative Item Formats
• Sugriva and the God Yama have shown how to choose the right item format to conduct their examinations.
• With the aid of computers, we can design tests with innovative item formats.
• These item formats will permit authentic assessments of the skills we intend to measure.
• These formats are currently being used in credentialing examinations.
In closing….
I have provided a very general overview of:
• How tests are used in the US
• How testing is being viewed in the US
• The classical and the modern IRT frameworks that underpin the development of measurement instruments
In closing….
I have provided a very general overview of:
• The advantages IRT offers over the classical framework, i.e., how IRT
– Provides item characteristics that are invariant over subgroups
– Provides proficiency scores that are not dependent on the sets of items administered
– Enables the construction of tests that have pre-specified standard error of measurement and reliability
In closing….
how IRT
– Enables the determination of proficiency with the desired accuracy at critical points on the proficiency continuum
– Enables the delivery of individualized and targeted tests for the efficient determination of proficiency scores
In closing….
• Are any of these innovations and approaches to testing relevant or applicable in India?
• Assessment is a tradition in India.
• By necessity, assessment has been employed primarily for selection
• These assessments, while necessary, can be streamlined and made shorter and more targeted through CAT
• Advances in technology have made innovative item formats possible. Higher-order skills and creativity can be assessed through these innovative item formats.
In closing….
• India can most certainly benefit from the advances made in assessment.
• The weak link in the Indian assessment system is that not much attention seems to have been paid to the issue of validity.
• The current system of examinations has had some negative side effects. It has promoted extensive coaching and cheating, two factors that can lead to inequity in education and stifle creativity.
In closing….
• Assessment can play a critical role in the education system and lead to improvements in learning.
• Cognitive Diagnostic Assessment is receiving considerable attention in the US.
• Cognitive diagnostic assessment can be applied successfully to identify misconceptions, especially in mathematics and science, and through feedback to students, it can enhance student learning and instructional techniques.
India Has Got Talent!
• India has the talent and manpower to lead the way in the use of technology in instruction and assessment.
• We have world-class experts, such as Professor Dhande, on the panel here to provide leadership in this area.
• Indian students, if given a fair chance, can become leading world-class scholars, a fact evidenced by the achievements of Indian students who have gone abroad to seek education.
• India can most certainly benefit from the advances made in assessment. By judiciously applying assessments, it can transform itself from a country of examination systems to a country of education systems.