CLEAR Pre-Conference
Workshop
Testing Essentials
Job Analysis - Reed A. Castle, PhD
Item Writing - Steven S. Nettles, EdD
Test Development - Julia M. Leahy, PhD
Standard Setting - Paul D. Naylor, PhD
Scaling/Scoring - Lauren J. Wood, PhD, LP
5 topics, 20 minutes each, and 20 minutes of Q&A
Presented at CLEAR’s 23rd Annual Conference
Toronto, Ontario September, 2003
Job Analysis
Reed A. Castle, Ph.D.
Schroeder Measurement Technologies, Inc.
What is a Job Analysis?
An investigation of the ability requirements that go with a particular job (credentialing exam context).
It is the study that helps establish a link between test scores and the content of the profession.
The Joint Technical Standards, 14.14:
"The content domain to be covered by a credentialing test should be defined clearly and justified in terms of the importance of the content for credential-worthy performance in an occupation or profession. A rationale should be provided to support a claim that the knowledge or skills being assessed are required for credential-worthy performance in an occupation and are consistent with the purpose for which the licensing or certification program was instituted."
Why Conduct a Job Analysis?
Need to establish a validity link.
Need to articulate a rationale for examination content.
Need to reduce the threat of legal challenges.
Need to determine what is relatively important in practice.
Need to understand the profession before we assess it.
Types of Job Analyses
Focus Group
Traditional Survey-Based
Electronic Survey-Based
Transportability
Focus Group
Need to identify the best group of SMEs possible:
- Areas of practice
- Geographic representation
- Demographically balanced
8 to 12 participants
Focus Group
Prior to meeting:
Comprehensive review of the profession
- Job descriptions
- Performance appraisals
- Curriculum
- Other job-related documents
Create a master task list
Send the list to SMEs prior to the meeting to give them a chance to review it
Focus Group
At meeting:
Review the comprehensive task list
Determine which tasks are important
Determine which tasks are performed with an appropriate level of frequency
Determine which tasks are duplicative
Identify and add missing tasks
Organize into a coherent outline
Focus Group
Advantages:
May be the only solution for new/emerging professions
Relatively quick
Less expensive
Focus Group
Disadvantages:
Based on one group (results may not generalize)
May be considered a weaker model when considering validation
May result in complaints from constituents about the content of the test
Traditional Survey-Based
First steps are similar to the focus group (i.e., the task list is generated in the same manner).
After the task list is created, three more issues must be addressed to complete the first survey development meeting.
Traditional Survey-Based
First, demographic questions must be developed with two goals in mind:
- Questions should help describe the sample of respondents.
- Some questions will be used in analyses that help generalize across groups (e.g., geographic regions).
Traditional Survey-Based
Second, rating scale(s) should be developed.
Minimally, two pieces of information should be collected:
- Importance or significance
- Frequency of performance
Additional scales can be added but may reduce the response rate.
- Shorter is sometimes better.
Traditional Survey-Based
Sample scale combining importance and frequency (frequency and importance ratings correlate highly, .95 and higher):

Considering both the importance and frequency, how important is this task in relation to the safe, effective, and competent performance of a Testing Professional? If you believe the task is never performed by a Testing Professional, please select the 'Not performed' rating.

0 = Not performed
1 = Minimal importance
2 = Below average or low importance
3 = Average or medium importance
4 = Above average or high importance
5 = Extreme or critical importance
Traditional Survey-Based
Sampling: one of the more important considerations is the sampling model employed.
Surveys should be distributed to a sample that is reflective of the entire population.
Demographic questions help describe the sample.
One should anticipate a low response rate (20%) when planning for an appropriate number of responses.
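A minimal Python sketch of this planning arithmetic (the target of 400 usable responses is a hypothetical figure; the 20% rate is the one anticipated above):

```python
# Plan survey volume around an anticipated low response rate.
needed_responses = 400          # hypothetical target for stable task ratings
expected_response_rate = 0.20   # the low rate anticipated above

surveys_to_distribute = needed_responses / expected_response_rate
print(f"Distribute at least {surveys_to_distribute:.0f} surveys")  # -> 2000
```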
Traditional Survey-Based
Mailing surveys:
Enclose a postage-paid return envelope.
Plan well in advance for international mailings (can be logistically painful with different countries).
When bulk mailing, plan extra time.
Keep daily track of return volume.
Electronic Survey-Based
Identical to traditional, but delivery and return are different.
Need email addresses.
Need a profession with ready access to the Internet.
Electronic Survey-Based
Advantages:
Faster response time.
Data entry is no longer needed.
Reduced processing time on the R&D side.
Possibly less expense (lower administrative costs).
Can modify sampling and survey on the fly if needed.
Sample can be the population with little additional cost.
Electronic Survey-Based
Disadvantages:
Need email addresses
High rate of "bounce-back"
Must control for ballot stuffing
Data compatibility
Transportability
Using the results of another job analysis
Determine compatibility or transportability
Process is similar to a focus group
Review of the Four Types
Focus Group
Traditional Survey-Based
Electronic Survey-Based
Transportability
Data
Demographics
Importance ratings
Frequency ratings
Composite ratings
Sub-group analyses
Decision rules
Reliability
- Raters
- Instrument
Survey adequacy
Primary Demographics
Geographic region
Years of experience
Work setting
Position
Role/function
Percent of time in certain activities
Mean Importance Ratings (3.0 criterion)

Task   Mean   Decision
6      2.45   Out
4      2.97   Out
1      3.21   In
5      3.85   In
3      3.91   In
7      4.25   In
2      4.28   In
% Not Performed Ratings, Criterion 25% (75% perform)

Task   % NP   Decision
6      38%    Out
4      29%    Out
1      26%    Out
5      16%    In
2      10%    In
3       5%    In
7       3%    In
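The two decision rules can be applied together, as in this Python sketch using the sample data from the two tables above (requiring every task to pass both rules is what makes multiple decision rules conservative):

```python
# task: (mean importance rating, proportion rating it "Not performed")
tasks = {
    1: (3.21, 0.26), 2: (4.28, 0.10), 3: (3.91, 0.05), 4: (2.97, 0.29),
    5: (3.85, 0.16), 6: (2.45, 0.38), 7: (4.25, 0.03),
}

# Keep a task only if mean importance >= 3.0 AND <= 25% rate it "Not performed".
retained = [t for t, (imp, pct_np) in sorted(tasks.items())
            if imp >= 3.0 and pct_np <= 0.25]
print(retained)  # -> [2, 3, 5, 7]; task 1 passes importance but fails the 25% rule
```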
Composite Ratings
Composite ratings (when multiple scales are used) can be calculated from the natural logs of the rating scales and combined based on some weighting scheme.
For example, if you want to weight frequency 33.33% and importance 66.67%, you can adjust for this in the composite rating equation.
Personal opinion: you will likely end up in a very similar place if you establish decision criteria on each scale individually.
In addition, using multiple decision rules is more conservative.
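A minimal sketch of one such composite equation, using the weights from the example above (the task ratings themselves are hypothetical, and real programs may combine the logs differently):

```python
import math

def composite(importance, frequency, w_imp=0.6667, w_freq=0.3333):
    # Weighted combination of the natural logs of the two ratings,
    # mapped back to the original rating scale with exp().
    return math.exp(w_imp * math.log(importance) + w_freq * math.log(frequency))

print(round(composite(importance=4.2, frequency=2.5), 2))  # one task's composite
```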
Mean Importance: Sub-group Analyses by Region

Task   Africa   Asia   Australia   Europe   N. America   S. America   Regions <= 3.0
1      3.22     3.12   3.01        2.96     3.21         3.18         1
2      4.21     4.08   3.85        3.84     4.51         4.38         0
3      3.91     3.87   3.78        3.75     3.48         3.25         0
4      2.95     2.99   3.03        3.10     2.91         2.89         4
5      3.88     3.82   3.84        3.89     3.78         3.48         0
6      2.41     2.85   2.14        2.47     2.85         2.35         6
7      4.22     4.09   3.85        3.84     4.47         4.25         0
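The right-hand column can be reproduced directly from the table, as in this Python sketch:

```python
# Regional mean importance ratings for tasks 1-7 (from the table above).
means = {
    "Africa":     [3.22, 4.21, 3.91, 2.95, 3.88, 2.41, 4.22],
    "Asia":       [3.12, 4.08, 3.87, 2.99, 3.82, 2.85, 4.09],
    "Australia":  [3.01, 3.85, 3.78, 3.03, 3.84, 2.14, 3.85],
    "Europe":     [2.96, 3.84, 3.75, 3.10, 3.89, 2.47, 3.84],
    "N. America": [3.21, 4.51, 3.48, 2.91, 3.78, 2.85, 4.47],
    "S. America": [3.18, 4.38, 3.25, 2.89, 3.48, 2.35, 4.25],
}

for task in range(7):
    low = sum(1 for region in means.values() if region[task] <= 3.0)
    print(f"Task {task + 1}: {low} region(s) at or below 3.0")
# Tasks 4 and 6 fall below the criterion in most regions; task 1 in one region only.
```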
Assessment Type
SMEs are asked to determine which assessment type will best measure a given task:
Multiple choice
Performance
Essay/short answer
Cognitive Levels
Each task on the content outline requires some level of cognition to perform.
Three basic levels exist (from Bloom's Taxonomy):
- Knowledge/Recall
- Application
- Analysis
Steve will discuss these in the next presentation.
Cognitive Levels
Of the tasks remaining after the inclusion decision criteria are applied, SMEs are asked to rate each on a 3-point scale.
For each major content area, an average rating is calculated.
The average is applied to specific criteria to determine the number of items by cognitive level for each content area.
Weighting
Weighting is usually done with SMEs based on some type of data, for example, the average importance or composite rating for a given content area.
Weights are applied to assessment types and cognitive levels.
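A minimal sketch of this kind of rational weighting (the content areas, ratings, and test length here are hypothetical):

```python
# Convert mean composite ratings per content area into item counts.
mean_ratings = {"Assessment": 4.1, "Planning": 3.6,
                "Implementation": 4.4, "Evaluation": 3.2}
test_length = 150

total = sum(mean_ratings.values())
items = {area: round(test_length * rating / total)
         for area, rating in mean_ratings.items()}
print(items)  # rounding may leave the total one item off; SMEs adjust by consensus
```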
Test Specifications/Weights
Standard exclusion/inclusion criteria
Test specifications
- Assessment type/cognitive levels
Weights based on a rational approach
- Reflect test type
- Statistical
- Consensus
Item Writing
Steven S. Nettles, EdD
Applied Measurement Professionals, Inc.
Overview of
Measurement
Job Analysis
Test Specifications
Detailed Content Outline
Item Writing
Examination Development
Standard Setting
Administration and Scoring
Test Specifications &
Detailed Content Outline
Developed based on the judgment of an
advisory committee as they interpreted
job analysis results from many
respondents.
Guides item writing and examination
development.
Provides information to candidates.
Required!
Item Writing Workshop
Goals
- appropriate item content and cognitive complexity
- consistent style and format
- efficient examination committee work
A test item
- measures one unit of content.
- contains a stimulus (the question).
- prescribes a particular response form.
The response allows an inference about candidates' abilities on the specific bit of content.
When items are linked to job content, summing all correct item responses allows broader inferences about candidates' abilities to do a job.
Preparing to Write
Each item must be linked to the prescribed
- part of the Detailed Content Outline.
- cognitive level (optional).
Write multiple-choice items.
- Three options are better for similar-ability groups.
- Five options are better for diverse groups.
Why multiple choice?
Dichotomous (right/wrong) scoring encourages measurement precision.
Validity is strongly supported because each item measures one specific bit of content.
Many items sample the entire content area.
The flexible format allows measurement of a variety of objectives.
Examinees cannot bluff their way to receiving credit (although they can correctly guess).
We will talk more about minimizing effective guessing.
Item components include
- a stem.
- three to four options:
  - one key
  - two to three distractors.
Item Components
Stem
- The statement or question to which candidates respond.
- The stem can also include a chart, table, or graphic.
- The stem should clearly present one problem or idea.
Example Stems
Direct question:
Which of the following best describes the primary purpose of the Code of Federal Regulations?
Incomplete statement:
The primary purpose of the CFR includes
New writers tend to write clearer direct questions. If you are new to item writing, it may be best to concentrate on that type.
Among the options will be the key.
- With a positively worded stem, the key is the best or most appropriate of the available stem responses.
- With a negatively worded stem, the key is the least appropriate or worst of the available stem responses.
Negatively written items are not encouraged!
The remaining options are distractors: plausible yet incorrect responses to the stem.
Cognitive levels
Recall
Application
Analysis
Cognitive levels are designated because we recognize that varying dimensions of the job require varying levels of cognition. By linking items to cognitive levels, a test better represents the job, i.e., is more job-related.
Cognitive levels
Recall items
- require rote memorization.
- are NEVER situationally dependent.
- have options that frequently start with nouns.
Recall item
Which of the following beers is brewed in St.
John’s?
A. LaBlatts
B. Molson
C. Moosehead
Cognitive levels
Application items
- use interpretation, classification, translation, or recognition of elements and relationships.
- have keys that depend on the situation presented in the stem.
  - Any item involving manipulation of formulas, no matter how simple, is application level.
  - Items using graphics or data tables will be at least at the application level.
  - If the key would be correct in any situation, then the item is probably just a dressed-up recall item.
- have options that frequently start with verbs.
Application item
Which of the following is the best approach
when trout-fishing in the Canadian Rockies?
A. Use a fly fishing system with a small insect
lure.
B. Use a spinning system with a medium
Mepps lure.
C. Use a bait casting system with a large
nightcrawler.
Cognitive levels
Analysis items
- use information synthesis, problem solving, and evaluation of the best response.
- require candidates to find the problem from clues and act toward resolution.
- have options that frequently start with verbs.
Analysis item
Total parenteral nutrition (TPN) is initiated in a nondiabetic patient at a rate of 42 ml/hour. On the second day of therapy, serum and urine electrolytes are normal, urine glucose level is 3%, and urine output exceeds parenteral intake. Which of the following is the MOST likely cause of these findings?
A. The patient has developed an acute glucose tolerance.
B. The patient's renal threshold for glucose has been exceeded.
C. The patient is now a Type 2 diabetic requiring supplemental insulin.
Other Item Types: complex multiple-choice (CMC)
- are best for complex situations with multiple correct solutions.
- may incorporate a direct question or incomplete statement stem format.
CMC item
Which of the following lab test results have been associated with fibromyalgia or myofascial pain?
Elements:
1. Elevated CPK
2. Elevated LDH isoenzyme subsets
3. White blood cell magnesium deficiency
4. EMG abnormalities
Options:
A. 1 and 3 only
B. 1 and 4 only
C. 2 and 3 only
D. 2 and 4 only
K-Type Item
A child suffering from an acute exacerbation of rheumatic fever usually has
1. An elevated sedimentation rate
2. A prolonged PR interval
3. An elevated antistreptolysin O titer
4. Subcutaneous nodules
A. 1, 2, and 3 only
B. 1, 3 only
C. 2, 4 only
D. 4 only
E. All are correct
(From: Constructing Written Test Questions for the Basic and Clinical Sciences, Case & Swanson, 1996, NBME, Philadelphia, PA)
Other Item Types: negatively worded
Avoid negative wording when a positively worded item (e.g., CMC type) can be used.
Negative wording encourages measurement error when able candidates become confused.
Negatively worded
items
A civil subpoena is valid for all of the following
EXCEPT when it is
A. served by registered mail.
B. accompanied by any required witness fee.
C. accompanied by a written authorization from
the patient.
Convert negatively worded items to CMC items
If you find yourself writing a negatively worded item, finish it.
Then consider rewriting it as a CMC item where 2-3 elements are true, and 1 or 2 elements are not included in the key.
Don't write all CMC items to have 3 true and 1 false element.
- Mix it up, e.g., 2 true and 2 false.
Things to do
Use an efficient and clear option format.
List options on separate lines.
Begin each option with a letter (i.e., A, B, C, D) to avoid confusion with numerical answers.
Write options of similar length.
- New item writers tend to produce keys that are longer and more detailed than the distractors.
Put as many words as possible into the stem.
The psychometrician should recommend
A. that the committee write longer, more difficult to read stems.
B. that the committee write distractors of length similar to the key.
Write distractors with care
Item difficulty largely depends on the quality of the distractors.
The finer the distinctions candidates must make, the more difficult the item.
When writing item stems, you should do all you can to help candidates clearly understand the situation and the question.
Distractors should be written with a more ruthless (but not tricky) attitude.
Write distractors you know some candidates will select.
Use common misconceptions.
Use candidates' familiar language.
Use impressive-sounding and technical words in the distractors.
Use scientific and stereotyped phrases, and verbal associations.
Things to avoid
Avoid
- using "All of the above" or "None of the above" as options.
- using stereotypical or prejudicial language.
- overlapping data ranges.
- using humorous options.
- placing similar phrases in the stem and key, even including identical words.
- writing the key in far more technical, detailed language.
- producing items related to definitions.
Avoid
- using modifiers associated with
  - true statements (e.g., may, sometimes, usually) for keys.
  - false statements (e.g., never, all, none, always) for distractors.
- options having the same meaning.
  - If two options mean the same thing, both must be incorrect.
- using parallel (mutually exclusive) options unless balanced by another pair of parallel options.
- writing items with undirected stems.
  - Use the "undirected stem test."
- writing items that allow test-wise candidates to converge on the key.
Converging on the key
A. 2 pills qid
B. 4 pills bid
C. 2 pills qid
D. 6 pills tid
Are you test-wise?
You are test-wise if you can select the key based on clues given in the item without knowing the content.
Please refer to your Pre-test Exercise.
Test Development
Julia M. Leahy, PhD
Chauncey Group International
Test Development
[Diagram: the test development cycle, linking Job Analysis, Test Plan, Draft Items, Content Experts, Edit & Review, Pretest, Item Evaluation, and Test]
Validity
Test specifications derived from job analysis.
Test items linked to job analysis and test specifications.
Test items measure content that is relevant to the occupation or job.
Validity
Content validity refers to the degree to which the items on a licensure/certification examination are representative of the knowledge and/or skills that are necessary for competent performance.
Validity
The specific use of test scores and/or the interpretations of the results.
Validity
Supports the appropriateness of the test content to the domain the test is intended to represent.
Test Form Assembly
Program Considerations
Factors to consider:
Computer-Based Testing
- Linear test forms versus pools of items
- Continuous testing or windows
- Issues of exposure
- Item bank size: probably large
Paper-and-Pencil Examinations
- Single versus multiple administrations
- Issues of exposure
- Item bank size: small to large
Test Specifications
What content is important to test?
What content is necessary to be a minimally competent practitioner?
How much emphasis should be placed on certain content categories?
Creating Test Specifications
Use data from the Job Analysis
Determine the following:
- What content do we put in each test?
- What feedback do we give people who are not successful?
- What kinds of questions do we ask?
- How many of each kind do we ask?
Test Specifications
Should include:
- Purpose of the test
- Intended population
- Test domain and relative emphasis
  - Content to be tested
  - Cognitive level to be assessed
- Mode of assessment & item types
- Psychometric characteristics
Test Specifications
Use test specifications
- every time a test form is created
- to be certain each test form asks questions on important content
Validity: passing the test is supposed to mean that a person knows enough to be considered proficient.
Fairness: it would be unfair if certain content were not on every form.
Using Test Specifications for Item Development
You're not flying blind!
Factors Influencing Numbers of Forms and Items Needed Annually
Test Modality
- Is the test a paper-and-pencil or a computer-based test?
- Are both methodologies used?
- If CBT, will the test be administered in windows or continuously?
Factors Influencing Numbers of Forms and Items Needed Annually
Test Length
- How many items will be needed for one form?
- How many forms?
- How often will forms be changed?
- What is the allowable percentage of overlap?
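These questions drive simple but consequential bank-size arithmetic. A rough Python sketch (all numbers hypothetical):

```python
items_per_form = 150
forms_per_year = 4
allowed_overlap = 0.20   # max fraction of items shared between forms

# Each form after the first must be at least 80% fresh items.
unique_items_needed = (items_per_form
                       + (forms_per_year - 1) * items_per_form * (1 - allowed_overlap))
print(f"Minimum unique items per year: {unique_items_needed:.0f}")  # -> 510
```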
Factors Influencing Numbers of Forms and Items Needed Annually
Number of test administrations per year
- Will each administration have a different form?
- What is the expected test volume per form?
- How are special test situations handled?
Factors Influencing Numbers of Forms and Items Needed Annually
Level of Security Needed
- Is this a high-stakes examination for licensure or certification in an occupation or profession?
- Or is the examination for low-stakes certificates, such as with continuing education or self-assessment?
Factors Influencing Numbers of Forms and Items Needed Annually
Organizational Policies
- When and under what circumstances can failing candidates repeat the examination?
- Must items be blocked for repeat candidates?
- Is there a minimum number of candidate responses required for new/pretest items?
Using Test
Specifications
Test Form Assembly
Select items
• Meet the test plan specifications
  • Total number of items
  • Correct distribution of items by domains or subdomains
• Give preference to items with known statistical performance
  • Distribution of statistical parameters, such as difficulty and discrimination
Test Form Assembly
Select items
• Determine the need to consider non-test-plan parameters in form assembly
  • Cognitive level
• Use automatic selection software, if possible
  • Generate test forms that meet required and preferred parameters
Test Form Assembly
Consider Test Delivery: CBT
• For continuous CBT delivery, large numbers of
equivalent forms are needed for security reasons
• Every form must meet the same detailed content
and statistical specifications
• Quality assurance is vital; forms cannot vary in
quality, content coverage, difficulty, or pacing
• Reproducibility and accuracy of scores and
pass/fail decisions must be consistent across
forms and over time
Test Form Assembly
Automatic Item Selection
• Consider multiple rules for selecting items, such as content codes and statistics
• Determine the number of forms needed to reduce exposure
• Still need to evaluate each selected form for overlap
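A toy Python sketch of the selection step (real systems layer on the statistical targets and overlap checks noted above; the bank and blueprint here are hypothetical):

```python
import random

blueprint = {"A": 3, "B": 2}                      # items required per content domain
bank = [("A", 101), ("A", 102), ("A", 103), ("A", 104),
        ("B", 201), ("B", 202), ("B", 203)]       # (domain, item id)

form = []
for domain, count in blueprint.items():
    pool = [item for item in bank if item[0] == domain]
    form.extend(random.sample(pool, count))       # random draw reduces exposure
print(form)
```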
5-Step Item Review
Process
• Grammar
• Style
• Internal Expert Review
• Sensitivity/Fairness
• External/Client Review
Grammar
• Review items for spelling,
punctuation, and grammar
• Usually done by an editor or trained
test developer
Style
• Format items to conform with
established style guidelines
• Use capitalization and bolding as
appropriate to alert candidates to
words such as:
• Maximum, Minimum, and Except
Internal Review
Review by internal experts for verification of content accuracy:
- Is the key correct?
- Is the key referenced (if applicable)?
- Are the distractors clearly wrong, yet plausible?
- Is the item relevant?
Sensitivity/Fairness
• Candidate Fairness: all candidates
should be treated equally & fairly
regardless of differences in
personal characteristics that are not
relevant to the test
• Acknowledge the multicultural
nature of society and treat its
diverse population with respect
Sensitivity/Fairness
Review by test developers and/or external organizations
Review items for references to gender, race, religion, or any possibly offensive terminology
- Use such references only when relevant to the item
Sensitivity/Fairness
ETS Sensitivity Review Guidelines and Procedures:
- Cultural diversity of the United States
- Diversity of background, cultural traditions, and viewpoints
- Changing roles of and attitudes toward groups in the US
- Contributions of various groups
- Role of language in setting and changing attitudes toward various groups
Sensitivity/Fairness
Stereotypes:
- No population group should be depicted as either inferior or superior
- Avoid inflammatory material
- Avoid inappropriate tone
  - Appropriate tone reflects respect and avoids upsetting or otherwise disadvantaging a group of test takers
Sensitivity/Fairness
Stereotypes. Examples:
- Men who are abusers
- Women who are depressed
- African Americans who live in depressed environments
- Adults 65 and older who are frail, elderly, and unemployable
Sensitivity/Fairness
Stereotypes. Examples:
- People with disabilities who are nonproductive
- Using diagnoses or conditions as adjectives
Sensitivity/Fairness
How to Avoid Stereotypes. Examples:
- A depressed patient -> the patient with depression
- A diabetic patient -> the patient with diabetes mellitus
- An elderly person -> a 72-year-old person
- A psychiatric patient -> a patient with paranoid schizophrenia
Sensitivity/Fairness
How to Avoid Stereotypes. Examples:
- A male who abuses women -> an individual with abusive tendencies (avoid gender)
- A housewife -> an individual who is a primary caretaker
- A Hispanic who speaks no English -> an individual who speaks English as a second language
- An Asian American who eats sushi -> an individual whose diet consists mainly of fish
Sensitivity/Fairness
Population diversity: no one population group should be dominant
- Ethnic balance: use ethnicity only when necessary
- Gender balance: avoid male and female identification if at all possible
Sensitivity/Fairness
Ethnic group references:
- African American or Black
- Caucasian or White
- Hispanic American
- Asian American
Sensitivity/Fairness
Inappropriate tone:
- Avoid highly inflammatory material that is inappropriate to the content of the examination
- Avoid material that is elitist, patronizing, sarcastic, derogatory, or inflammatory
  - Examples: lady lawyer; little woman; strong-willed male
- Avoid terminology that might be known to only one group
  - Examples: stickball; country clubs; maven
Differential Item Functioning
Identifies items that function differently
for two groups of examinees or
candidates
DIF is said to occur for an item when the
performance on that item differs
systematically for focal and reference
group members with the same level of
proficiency
Differential Item Functioning
Reference group:
- the majority group; in nursing, that is generally White females
Focal groups:
- all minority groups
  - men
  - non-White ethnic groups
Differential Item Functioning
Use the Mantel-Haenszel (MH) procedure, which matches the reference and focal groups on some measure of proficiency, generally the total number-right score on the test.
Requires a minimum number per focal group; the analysis can be done with a minimum of 40-50 in the focal group.
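A minimal sketch of the MH common odds ratio for a single item (the stratum counts are hypothetical; the ETS delta transformation and the category C threshold are standard conventions, paraphrased here rather than quoted from the presentation):

```python
import math

# One stratum per matched score level:
# (reference correct, reference incorrect, focal correct, focal incorrect)
strata = [
    (30, 20, 10, 15),
    (50, 15, 20, 10),
    (40,  5, 25,  5),
]

num = sum(rc * fi / (rc + ri + fc + fi) for rc, ri, fc, fi in strata)
den = sum(ri * fc / (rc + ri + fc + fi) for rc, ri, fc, fi in strata)
alpha_mh = num / den                   # common odds ratio across strata
delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta metric; |delta| >= 1.5 flags category C
print(round(alpha_mh, 2), round(delta_mh, 2))
```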
Differential Item Functioning
Examples of content-related issues:
- Negative category C DIF was noted for males on content related to women's health.
- Positive category C DIF was noted for males on content involving the use of equipment and actions likely to be taken in emergencies.
Differential Item Functioning
Negative category C DIF was noted for the focal minority groups on content involving references to/inferences about:
- assumptions regarding the nuclear family, childrearing, and the dominant culture;
- idiomatic use of language;
- hypothetical situations requiring "role-playing"
External/Client Review
Review by external experts for verification of content accuracy:
- Is the key correct?
- Is the key referenced (if applicable)?
- Are the distractors clearly wrong, yet plausible?
- Is the item relevant?
Review and Approval
• Item review can be an iterative
process
• Yes, No, and Yes with modifications
• Who has final sign off on items?
Test Production
Paper-and-pencil forms
• Determine format
  • One column
  • Two columns
• Directions on the back page, if the booklet is sealed
• Answer sheets
Test Production
Computer-based tests
• Linear forms or test pools
• Tutorials
• Item appearance:
  • Top/bottom
  • Side by side
• Survey forms
• Form review
Item/Pool Review
• Establishing a process for item
review
• How item review relates to item
approval
• Who should be involved in the item
review
• How often should items be reviewed
Timing of Item/Pool Reviews
• Establish a schedule for item reviews
  - Anticipate regulatory or industry changes
• Review and revise items for content accuracy
  - Use candidate comments for feedback on items
• Review and revise items based on statistical information*

* To be discussed in the afternoon session
Setting The Cut-Score
Paul D. Naylor, Ph.D.
Psychometric Consultant
Standard Setting
The process used to arrive at a passing
score
Lowest score that permits entry to the
field
Recommended standard
Standards
Mandated
Norm-referenced
Criterion-referenced
Mandated Standards
Often used in licensing
Difficult to defend
Not related to minimum qualification
Norm-referenced
Standards
Popular in schools
Limits entry
Inconsistent results
Criterion-referenced Standards
Wide acceptance in professional testing
Determines minimum qualification
Not test-population dependent
Exam or item centered
Procedures
Angoff (modified)
Nedelsky
Ebel
Others
Minimally Competent
Performance
Minimum acceptable performance
Minimal qualification
Borderline
It’s all relative
Angoff Method
Judges
- Selection
- Training
Probabilities
- Would vs. should
Rater agreement
Tabulation
Example Angoff ratings (6 items, 6 judges):

Item   Judge 1  Judge 2  Judge 3  Judge 4  Judge 5  Judge 6  Item Avg
1      60       75       70       50       80       75       68.3
2      60       50       45       45       55       60       52.5
3      85       80       75       70       70       55       72.5
4      95       95       90       80       85       90       89.2
5      75       70       90       70       75       80       76.7
6      75       70       85       80       85       70       77.5
Mean   75.0     73.33    75.83    65.83    75.0     71.66    72.78

The grand mean (72.78) serves as the recommended standard.
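The tabulation is a simple averaging exercise, as in this Python sketch using the judge data above:

```python
# rows = items, columns = judges 1-6; each entry is a judge's estimate of the
# percentage of minimally competent candidates who would answer correctly.
ratings = [
    [60, 75, 70, 50, 80, 75],
    [60, 50, 45, 45, 55, 60],
    [85, 80, 75, 70, 70, 55],
    [95, 95, 90, 80, 85, 90],
    [75, 70, 90, 70, 75, 80],
    [75, 70, 85, 80, 85, 70],
]

item_avgs = [sum(row) / len(row) for row in ratings]
cut_score = sum(item_avgs) / len(item_avgs)
print([round(a, 1) for a in item_avgs])  # 68.3, 52.5, 72.5, 89.2, 76.7, 77.5
print(round(cut_score, 2))               # 72.78, the recommended standard
```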
Application
Adjustment
Angoff Values
Alternative Forms of Exam
Passing Score
References
Livingston, S.A., & Zieky, M.J. (1982). Passing Scores. Princeton, NJ: ETS.
Cizek, G.J. (1996). Standard-setting guidelines. Educational Measurement: Issues and Practice, 15(1), 13-21.
CLEAR Exam Review (Winter 2001, Summer 2001, and others)
Presentation Follow-up
Please pick up a handout from this presentation
- AND/OR -
Please give me your business card to receive an e-mail of the presentation materials
- OR -
Presentation materials will be posted on CLEAR's website
QUESTIONS
AND
ANSWERS
THANK YOU!
Scaling and Scoring
Lauren J. Wood, PhD, LP
Director of Test Development
Experior Assessments:
A Division of Capstar
Scaling and Scoring
Objectives:
- Describe a number of types of scores that you may wish to report
- Define "scaled scores" and describe the scaling process
Scaling and Scoring
Back to the Basics
Public Protection
versus
Candidate Fairness
Scoring
Scoring of the examination needs to be considered long before the examination is administered.
It is important to examinees that the decision (pass/fail) and the score (raw, scaled) be reported in simple, clear language.
Scaling and Scoring
Scoring: Who gets the score reports?
Scoring: What do you report?
Types of scores:
- Score outcome: pass/fail
- Score in comparison to a criterion
- Score in comparison to others
Scoring: What do you report?
Helpful to candidates to report subscore or section information
- Strengths and weaknesses
- Plan for remediation
Take care that a subscore or section score is meaningful
- Number of subscores/questions
- Can be graphical rather than numeric
Scoring: When/where do you report the scores?
Delayed: mailed to the examinee's home
- Group comparisons
- Scoring/scaling
Immediate: at the test site
Scoring
Score reporting:
- Compute examinee scores directly (raw scores)
- Compute scaled or derived scores
Scaling and Scoring
Three Candidates' Experience
Exam: Residential Plumbing Exam (Theory)
Administration date: August 11, 2003
Cut score: 70% correct
Scaling and Scoring
Three Candidates' Score Reports
Candidate #2
The Occupational and Professional Licensing Division regrets to inform you that you did not attain a satisfactory grade on the Residential Plumbing examination that you took on August 11, 2003.
Your examination grade is: 74 (Fail)
You may retake the examination…
Scaling and Scoring
Three Candidates' Score Reports
Candidate #3
The Occupational and Professional Licensing Division is pleased to inform you that you attained a satisfactory grade on the Residential Plumbing examination that you took on August 11, 2003.
Your examination grade is: 66 (Pass)
Congratulations! You may apply for licensure by…
Scaling and Scoring
Three Candidates' Score Reports

Candidate      Score   Result
Candidate #1   68      Fail
Candidate #2   74      Fail
Candidate #3   66      Pass
Scaling and Scoring
Three Candidates' Score Reports
What happens next?
Scaling and Scoring
Three Candidates' Score Reports
Candidate #1
The Occupational and Professional Licensing Division regrets to inform you that you did not attain a satisfactory grade on the Residential Plumbing examination that you took on August 11, 2003.
Your examination raw score is: 68 (Fail)
You may retake the examination…
Scaling and Scoring
Three Candidates' Score Reports
Residential Plumber Examination
- Form #1
- Form #2
- Form #3
Scaling
Why develop multiple forms of the same examination?
1) Exam security
2) Examination and question content changes over time
Scaling and Scoring
How do the forms of the examination differ?
- Item content differs, though the content of the items remains true to the exam content outline.
- Item difficulty, discrimination, etc. differ across the different forms of the examination.
Scaling and Scoring
Three Candidates' Score Reports
Residential Plumber Examination
Form #1: P-value = .66
Form #2: P-value = .72
Form #3: P-value = .70
Scaling and Scoring
Equating: the design and statistical procedure that permits scores on one form of a test to be comparable to scores on an alternative form of an examination.
Scaling and Scoring
Why equate forms?
- Adjust for unintended differences in form difficulty
- Ease candidate-to-candidate score interpretation
- Maintain candidate fairness in the testing process
Scaling and Scoring
How is this done?
There are a number of methods used to equate examination scores.
Statistical conversions of the scores are applied, and the resulting scores are often called "scaled scores" or "derived scores."
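One common family is linear equating. A minimal mean-sigma sketch in Python (the form means and standard deviations are hypothetical):

```python
# Map scores on new form Y onto the scale of base form X so that
# equal z-scores correspond to equal reported scores.
mean_x, sd_x = 70.0, 8.0   # base form
mean_y, sd_y = 66.0, 9.0   # new, slightly harder form

def equate_y_to_x(y_score):
    # (x - mean_x) / sd_x == (y - mean_y) / sd_y, solved for x
    return mean_x + sd_x * (y_score - mean_y) / sd_y

print(round(equate_y_to_x(66.0), 1))  # the form-Y mean maps to 70.0 on form X
```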
Scaling and Scoring
Three Candidates' Score Reports

Candidate      Raw Score   Scaled Score   Status
Candidate #1   68          67             Fail
Candidate #2   74          66             Fail
Candidate #3   66          70             Pass
Scaling and Scoring
Three Candidates' Score Reports
Candidate #1
The Occupational and Professional Licensing Division regrets to inform you that you did not attain a satisfactory grade on the Residential Plumbing examination that you took on August 11, 2003.
Your examination scaled score is: 67 (Fail)
You may retake the examination…
Scaling and Scoring
Raw score
- Advantage
  - Meaning clearly understood
- Disadvantage
  - Can't make comparisons
  - Specific to each test administration
Scaling and Scoring
Scaled score: based on the test mean, standard deviation, and raw score
- Advantage
  - Makes meaningful comparisons possible
- Disadvantage
  - Interpretation is not clear-cut
Scaling
When scaling procedures are performed, it is important to explain to candidates in advance that such procedures will be applied and what the resulting scores will mean.
Scaling
"The Purpose of Scaling" (Candidate Information Bulletin):
Scaling allows scores to be reported on a common scale. Instead of having to remember that a 35 on the examination that you took is equivalent to a 40 on the examination that your friend took, we can use a common scale and report your score as a scaled score of 75.
Since we know that your friend's score of 40 is equal to your score of 35, your friend's score would also be reported as a scaled score of 75.
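A sketch of how that bulletin example could work, assuming a simple linear reporting scale that pins the cut to a scaled 75 and the maximum raw score to 100 (the raw maximums are hypothetical):

```python
def scaled_score(raw, raw_cut, raw_max, scale_cut=75, scale_max=100):
    # Linear map: raw_cut -> scale_cut, raw_max -> scale_max
    return scale_cut + (raw - raw_cut) * (scale_max - scale_cut) / (raw_max - raw_cut)

print(scaled_score(35, raw_cut=35, raw_max=60))  # your form: 75.0
print(scaled_score(40, raw_cut=40, raw_max=65))  # your friend's form: 75.0
```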
Scaling and Scoring
- Starts at the beginning of the test development process
- Report as an outcome, in relation to a criterion, or in relation to others
- Delayed or immediate
- Raw or scaled scores
Presentation Follow-up
Please pick up a handout from this presentation
- AND/OR -
Please give me your business card to receive an e-mail of the presentation materials
- OR -
Presentation materials will be posted on CLEAR's website