Planning Classroom Tests
Designing Test and Item Specifications to Help Ensure Validity and Reliability
You learned last week that measurement is the process of assigning a number to express the
degree to which a particular characteristic is present. Most of us know the common examples of
height, in which the scale used to measure it is in feet and inches, and weight, in which the scale
used to measure it is in pounds. The task here is to measure learning effectively and accurately,
and skills attainment is usually measured as a score on an achievement (test) instrument. The
process of evaluation, or a procedure used to determine quality, is then used to determine if the
test does what it is intended to do. Test evaluation involves comparing the measure of
performance to some established criteria in order to establish its worth in skills attainment.
Test evaluation of the instrument itself involves asking the following:
1. If the test were to be given repeatedly, would it consistently yield the same score for each
individual (reliability)?
2. Does the test measure what it is supposed to measure (validity)?
3. Is the cost of the test reasonable, and is it practical for the purpose intended (utility)?
Criterion-Referenced Tests
Most classroom teachers are concerned with specific content and skills that are subsumed under
state standards and design and develop tests that are specifically linked to these standards in
order to determine learning. The content being measured, in which the student is to acquire
skills, is often referred to as a domain, i.e., there is an existing goal framework of subordinate
skills with hierarchical and non-hierarchical relationships that is to be learned in order to
accomplish a particular goal, or overall learning task. Tests that measure such specific content
and skills are referred to as Criterion-Referenced Tests (CRT) in which the items are drawn from
a delineated domain of tasks. CRTs are also referred to as objective-referenced tests in that
specific measurable behavioral objectives are developed before test construction and guide item
development. To guide item development, and to link the behavioral objectives to the specific
types of behavior that the learner is to demonstrate, a Table of Specifications is developed.
Table of Specifications
A table of specifications is a two-way matrix that lists the objectives to be learned along with
the level of learning required, and also indicates the item format and the number of items to be
used on the test. Such a table ensures content validity, or the appropriateness of inferences
that are to be made from the test scores for particular subject matter. As stated earlier,
behavioral objectives are developed by using the behavior and content directly from the goal
framework with the addition of conditions of performance. Questions include the following (a
minimal sketch of such a table as a data structure follows the list):
1. Do the behavioral objectives represent the goal framework?
A direct representation ensures that there is content validity, i.e., the objectives represent the
main content area and are balanced across the different levels of learning.
2. How will test results be used?
There are several uses for test results, including remediation (entry-level skills), evaluation of
the instruction (practice tests), enrichment (when it has been determined that much of the
learning has already occurred), and ensuring that there were lessons for all of the objectives
(content validity).
3. Which test item format should be used to measure achievement of the objective?
It is necessary to ensure that there is a direct match between the behavior that is stated in the
objective and the type of response that is required of the student.
4. How many items should be included in the table of specifications and the test so as to
measure performance adequately (number of items for minimal mastery)?
Simple skills that are measured at low levels of learning, such as knowledge-level tasks,
usually require only one item because the learner either knows the correct answer or does
not. More complex levels of learning, such as application-level tasks and higher, need
several items in order to allow for discrimination between those learners who do indeed
know the content and those learners who do not. For example, let's say you have a safety
objective with 4 items measuring the objective. In this case you would probably want every
item passed since the objective is so important. For other complex objectives, minimal
mastery may be 3 out of 4, or 3 or 4 out of 5 (at least a simple majority).
5. When and how will the test be given?
Various types of tests serve various purposes. For example, pretests help the teacher plan
lessons, determine what remediation is needed, and decide which objectives to emphasize or
de-emphasize, depending on what skills the learners were already determined to possess.
Practice tests, or post-instruction tests, help the learners to practice the skills they are
learning, help the teacher evaluate the quality of each lesson, and determine if additional
review or enrichment activities are needed. Post-tests, which are given after all instruction
and review, should be announced in advance to allow for adequate learner preparation.
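
To make the structure of a table of specifications concrete, here is a minimal sketch of one as a
data structure in Python. The objectives, Bloom levels, item formats, item counts, and mastery
criteria shown are hypothetical examples, not drawn from any particular unit.

```python
# A minimal sketch of a table of specifications as a data structure.
# All objectives, levels, formats, and mastery counts are hypothetical.

from dataclasses import dataclass

@dataclass
class SpecRow:
    objective: str     # behavioral objective being measured
    bloom_level: str   # level of learning (Bloom's Taxonomy)
    item_format: str   # item type matched to the stated behavior
    n_items: int       # number of items used to measure the objective
    min_mastery: int   # items that must be passed for minimal mastery

table = [
    SpecRow("Define a noun", "Knowledge", "completion", 1, 1),
    SpecRow("Classify words as person/place/thing", "Comprehension", "matching", 6, 5),
    SpecRow("Use nouns correctly in sentences", "Application", "short answer", 4, 3),
]

total_items = sum(row.n_items for row in table)
print(f"Test length: {total_items} items")
for row in table:
    print(f"{row.objective}: {row.n_items} {row.item_format} item(s), "
          f"mastery = {row.min_mastery}/{row.n_items}")
```

Note how the single knowledge-level objective gets one item while the application-level objective
gets several, with mastery set at 3 out of 4, following the reasoning in question 4 above.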
Reliability
Reliability refers to the consistency of the test scores that would be obtained if the learner
undertook repeated measures. Similar scores over time indicate a reliable measure, which means
that one has more closely obtained the learner's "true score." Many factors can affect a test
score, including memory loss, fatigue, attention, and motivation, all of which introduce random
error. Random errors cause the learner to potentially miss items on content they truly do know,
thus impairing the ability to obtain the "true score." (One common way to estimate reliability
from a single administration is sketched after the list below.) To increase reliability one must:
1. Obtain a representative sample of objectives from the goal framework, select an appropriate
number of items per objective, place the same number of items on different forms, and
balance the test in terms of difficulty.
2. Select items that adequately represent the skills that are being measured, and make sure the
items cover all possible categories. For example, when covering the learning goal "nouns,"
make sure person, place, and thing are all tested and not just person.
3. Format the items so as to reduce guessing, which will reduce the reliability of the scores
obtained.
4. Make sure the students have ample time to answer all of the items, as timed tests create
anxiety, which will also reduce the reliability of the scores obtained.
5. Try to maintain a positive learner attitude when testing. Testing should never be used as a
punishment technique (no "pop" quizzes), as a substitute for a lesson plan (when one is not
prepared for the day), etc.
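
One widely used way to estimate reliability from a single test administration is the
Kuder-Richardson Formula 20 (KR-20), which applies to items scored right/wrong. Below is a
minimal sketch; the response matrix is made-up illustration data, not real test results.

```python
# A sketch of the Kuder-Richardson Formula 20 (KR-20), a common
# internal-consistency reliability estimate for tests scored 0/1.

def kr20(responses):
    """responses: list of learners, each a list of 0/1 item scores."""
    n_learners = len(responses)
    k = len(responses[0])                    # number of items
    totals = [sum(r) for r in responses]     # total score per learner
    mean = sum(totals) / n_learners
    # Population variance of total scores, used here for simplicity.
    variance = sum((t - mean) ** 2 for t in totals) / n_learners
    # p = proportion passing each item, q = 1 - p.
    pq_sum = 0.0
    for i in range(k):
        p = sum(r[i] for r in responses) / n_learners
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / variance)

scores = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
]
print(f"KR-20 estimate: {kr20(scores):.2f}")   # KR-20 estimate: 0.81
```

Values closer to 1.0 indicate more consistent scores; the practices in the list above (adequate
item counts, balanced difficulty, reduced guessing) all work to push this estimate upward.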
Types of Subordinate Skills
It was mentioned earlier that various types of tests serve various purposes. Skills that learners
should already possess before an instructional unit begins are called prerequisite skills and the
tests that measure them are called readiness tests. Readiness tests gather information to let the
teacher know if remediation is necessary, if the goal to be taught is too complex, or if the
learners consist of masters or non-masters. Skills that focus on the current instruction are called
enabling skills and are measured using pretests, which are used for lesson planning, let the
teacher know what to emphasize during the instruction, and enable grouping of the students by
mastery level. Practice tests are usually given on a subset of skills immediately after a segment
of instruction and are usually linked to in-class or homework assignments. Data from practice
tests are used to determine instructional effectiveness, mastery, or new presentations; regrouping
of the learners according to mastery; for diagnosing misconceptions during the instruction; and
for the development of review materials. Practice tests also lessen test anxiety that often occurs
during post-tests. Post-tests are used at the actual end of instruction to evaluate student progress
and assign grades. Examples of post-tests are quizzes, unit tests, and comprehensive exams.
When developing tables of test specifications for individual units, it is important to group the
objectives and classify them by their level of learning (Bloom) and whether they are prerequisite
or enabling skills. Then for each objective, it should be decided what item format is most
appropriate in measuring the stated behavior. Next, it should be decided what number of items is
needed to adequately measure each objective in order to establish minimal learning levels. Also,
one should review the evaluation decision that is going to be made regarding learning progress
(pretest, practice, post) and, from the table, select a subset or the entire set of skills and
objectives for the needed test. For example, using the nutritional analysis framework below, a
practice test could be developed for just carbohydrates, proteins, and fats, and a comprehensive
posttest could be developed for just the highest-level skills. (A sketch of this kind of subset
selection follows the framework.) A comprehensive posttest should balance all of the goals that
were taught, should be balanced across all of the levels of learning (with time for the learner
to consider and answer each item), include an adequate number of items in order to determine
minimal competency, and represent all of the skills in the framework (content validity).
[Goal framework diagram: the overall goal "Foods," with subordinate skill branches for
carbohydrates, proteins, and fats; in the original, the branches were color-coded by test subset.]
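
As a concrete illustration of selecting subsets from a unit's specification table, here is a
minimal sketch. The rows are hypothetical objectives loosely modeled on the Foods framework
above, not an actual unit plan.

```python
# A sketch of selecting subsets of a specification table for different
# tests. Each row: (objective, content_area, level, skill_type).
# All entries are hypothetical examples.

spec = [
    ("Identify common carbohydrates", "carbohydrates", "Knowledge", "enabling"),
    ("Identify common proteins", "proteins", "Knowledge", "enabling"),
    ("Identify common fats", "fats", "Knowledge", "enabling"),
    ("Analyze a meal for nutritional balance", "all", "Analysis", "enabling"),
]

# Practice test: just the skills from the current segment of instruction.
practice = [row for row in spec
            if row[1] in ("carbohydrates", "proteins", "fats")]

# Comprehensive posttest subset: the highest-level skills.
posttest = [row for row in spec if row[2] == "Analysis"]

print("Practice test objectives:", [row[0] for row in practice])
print("Posttest objectives:", [row[0] for row in posttest])
```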
Criteria for Constructing Objective Test Items
There are five criteria for constructing and evaluating objective test items:
1. The items are congruent with the objective.
Every objective must have items that demonstrate the types of behavior stated in that
objective, the items must have content that covers all of the areas taught, and one must be
sure any words used are in the proper context. The conditions must also be met; that is, the
learners must have the appropriate resources to do the activity (equipment such as
microscopes).
2. There is congruence of the item to the learners’ characteristics.
One must use relevant items for the learners, and one must use a suitable format and
vocabulary for the items. Try to stay away from sensitive areas such as religion and
politics, and use them only if specified by the instructional goal. Two factors will affect
reliability in this case: bias, which may be cultural, racial, or sexual, so try to avoid these
types; and familiarity with the response format, so get the learner familiar with the item
type, optical scanning sheets, web-based forms, and procedures before the test.
3. The items are clearly written.
Write the items clearly to avoid ambiguity and unnecessary complexity. Leave out
extraneous information, and make sure each item has only one correct answer.
4. There is accuracy of the measures.
Factors related to accuracy include item novelty (do not reuse items, as they introduce a
memory effect that takes about a month to overcome); susceptibility to guessing (such as
clues in the directions; give several choices, not one or two); and susceptibility to
cheating (use alternate forms and a different item sequence on each form, as sketched
after this list).
5. There is freedom from bias.
Items should be free of bias due to stereotyping, whether cultural, racial, or sexual, and one
should be aware of culture-dependent contexts.
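
To illustrate criterion 4's suggestion of alternate forms with a different item sequence on each
form, here is a minimal sketch; the item texts are placeholders.

```python
# A sketch of generating alternate test forms by shuffling item order,
# one way to reduce susceptibility to cheating. Items are placeholders.

import random

items = [f"Item {n}" for n in range(1, 11)]   # ten placeholder items

def make_form(items, seed):
    """Return a reproducibly shuffled copy of the item list."""
    rng = random.Random(seed)   # a seed makes each form reproducible
    form = items[:]             # copy so the master list is untouched
    rng.shuffle(form)
    return form

form_a = make_form(items, seed=1)
form_b = make_form(items, seed=2)
print("Form A order:", form_a)
print("Form B order:", form_b)
```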
Formative Evaluation of Items
For posttests, one should evaluate a draft copy before giving it to the learners: look first for
item-objective congruence, then student characteristics, then item clarity, and, last, accuracy of
the measures. Place the item and objective together and break them down into their respective
components. It is worthwhile to pull a small representative sample of students to go over the
test with you, so as to clarify the directions and items and to have them identify any clues to
the correct answers. Also, have colleagues judge the items independently, and use the same people
who checked item-objective congruence. Accuracy of measures is based on a personal decision;
only the teacher knows the seating chart, whether the items are novel, and whether guessing is
likely.
Objective Items
Items are considered objective when they can be scored based on the learners' selection of a
response. Written-response items include sentence completion (fill-in-the-blank) and short-answer
items, where the learner has to supply the answer from memory. Selected-response items
include alternative response (true-false, fact-opinion), matching exercises, keyed items (choices
are listed at the top and are used repeatedly for a cluster of questions), and multiple choice.
One may ask, why are these item types used? First, objective items allow for evaluation according
to both Gagné's Types of Learning and Bloom's Taxonomy; such items are also scored quickly, with
a wide range of content being measured (a larger sampling of the content domain); next, objective
tests are easier to administer, score, and analyze; the items can easily be adapted for machine
scoring (sketched below); and the scores are more reliable due to fewer scoring errors.
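
The machine-scoring advantage mentioned above amounts to comparing each response against an
answer key. A minimal sketch follows; the key and the learner's responses are hypothetical.

```python
# A sketch of the kind of machine scoring that objective items lend
# themselves to: each response is compared against a keyed answer.

key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}

def score(responses, key):
    """Count responses that match the keyed answer."""
    return sum(1 for q, ans in responses.items() if key.get(q) == ans)

learner = {"Q1": "B", "Q2": "A", "Q3": "A", "Q4": "C"}
print(f"Score: {score(learner, key)}/{len(key)}")   # Score: 3/4
```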
Written-Response Test Items
Written-response test items require learners to recall information from memory or to apply a
skill before they write the answer. Positive features of these items are that guessing is
minimized, since no alternatives are given; original responses give insight into learner
misconceptions; and they are versatile in that they use a variety of formats. Negative features
are that they must be clearly written in order to produce the proper response, and written
responses cannot be machine scored, thus reducing their efficiency.
One form of written-response item is the completion item, where learners are required to write a
key word or words in a blank near the end of the statement. In order for these items to work
well, the possibility of several correct answers should be eliminated, as well as any clues to
the correct answer. The teacher should use paraphrasing to avoid rote memorization of the
content, and reduce scoring time by using short answer sheets. Short-answer items require the
learner to complete a statement by inserting a word, phrase, or sentence. Such items are common
on mathematics tests, and require the learner to associate a given stimulus with a response. It
is important that the teacher provide a blank per word, specify the units for the answer (i.e.,
grams, pounds, feet, etc.), and ensure that the directions for a cluster of items are appropriate
for all of the items. (A sketch of scoring such items against a set of acceptable answers
follows.)
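
As a sketch of how such scoring might be handled, the example below checks a learner's short
answer against a set of acceptable answers, assuming (hypothetically) an item that specified
grams as the unit. The item name and answer list are illustrations only.

```python
# A sketch of scoring short-answer items against acceptable answers,
# assuming the units were specified in the item as suggested above.

acceptable = {
    # hypothetical item: "What is the mass of the sample, in grams?"
    "mass of sample": {"12 grams", "12 g", "12"},
}

def score_short_answer(item, response):
    """Normalize case and whitespace, then check against acceptable answers."""
    return response.strip().lower() in acceptable[item]

print(score_short_answer("mass of sample", " 12 g "))   # True
print(score_short_answer("mass of sample", "12 kg"))    # False
```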
Selected Response Items
Selected-response items require learners to choose an answer from a set of given choices.
Incorrect alternatives are known as distracters or foils, but they should not be tricky or
contain unknown information. Positive features include that learners must choose from given
alternatives, that responses take less time, allowing learners to answer more items, and that
the items lend themselves to machine scoring. One negative feature is an increased risk of
guessing.
One form of selected-response item is the alternative-response item, in which the learner has a
choice of two responses and each item is a statement that the learner must judge. The most
commonly known are true-false items, though correct-incorrect and fact-opinion items are also
used. Positive features include that they test recall, can require classification, allow for the
judging of analysis or synthesis material, can assess learners' ability to evaluate, and allow
the learner to apply principles to judge the accuracy of statements of causality or correlation.
Negative features include that guessing becomes a 50-50 probability (quantified in the sketch
after this paragraph), and that learners who are given no word to respond to tend to write
"false" as their answer. Such items require that the statement be worded positively with no clues
to possible answers. Matching items provide the learners with two columns of information to be
matched. The left side contains premises while the right side contains responses. The learner
must have clear directions on how many times a response can be used, the content in the set must
be kept homogeneous, and the teacher must construct the set so as to avoid the process of
elimination and any one-to-one correspondence. Keyed items combine matching and alternative
response, in that the same responses (usually at the top of the question set) are used for
different questions. It is important to note that when developing and administering tests for
non-readers, the teacher should read a general statement of the skill to be performed to the
learner, the skill should be demonstrated on the response form, and the non-reader should be
asked questions for clarification.
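
The guessing risk noted above can be quantified. The expected number of items answered correctly
by blind guessing is n/k for k-choice items, and the standard correction-for-guessing
(formula-scoring) adjustment is R - W/(k - 1), where R is the number right and W the number
wrong. A minimal sketch with illustrative numbers:

```python
# A sketch quantifying guessing on selected-response items.

def expected_chance_score(n_items, k):
    """Expected number right from blind guessing on k-choice items."""
    return n_items / k

def corrected_score(right, wrong, k):
    """Formula score removing the expected gain from guessing."""
    return right - wrong / (k - 1)

# On a 20-item true-false test (k = 2), blind guessing yields 10 right
# on average, and 15 right / 5 wrong corrects down to 10.
print(expected_chance_score(20, 2))                # 10.0
print(corrected_score(right=15, wrong=5, k=2))     # 10.0
```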
The most popular item type, multiple-choice items with corresponding item banks, measures
Gagné's verbal information and intellectual skills and all levels of learning of Bloom's
Taxonomy. Positive features include that such items can focus on a particular aspect of a
problem; offer multiple responses, which force the learner to choose from alternatives; can have
a best answer, which measures the ability to make fine discriminations; can have several correct
answers; can measure a wide range of content; are reliably scored; are easily adapted for
machine scoring or computer administration; and are easily compiled into item banks (a sketch
of drawing a test from such a bank follows). Limitations include that they cannot be used to
measure attitudes or motor skills, that there is an increased chance of guessing, and that it is
hard to write distracters that reflect actual misconceptions. A multiple-choice item includes a
stem, which may be a complete sentence or question or an incomplete statement, or the stem may
be embedded in the directions. Incorrect responses are called distracters or foils.
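
As a sketch of how an item bank might be organized and drawn from, consider the example below.
The bank entries and the draw function are hypothetical illustrations, not a prescribed format.

```python
# A sketch of compiling multiple-choice items into an item bank and
# drawing a test by level of learning. All entries are hypothetical.

import random

bank = [
    {"stem": "Which food is highest in protein?", "objective": "proteins",
     "level": "Knowledge", "choices": ["bread", "chicken", "apple", "butter"],
     "answer": "chicken"},
    {"stem": "Which meal is nutritionally balanced?", "objective": "all",
     "level": "Analysis", "choices": ["A", "B", "C", "D"],
     "answer": "B"},
]

def draw(bank, level, n, seed=0):
    """Sample n items at a given level of learning from the bank."""
    pool = [item for item in bank if item["level"] == level]
    return random.Random(seed).sample(pool, min(n, len(pool)))

for item in draw(bank, level="Knowledge", n=1):
    print(item["stem"])
```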
© by Carolyn Pearson 2005. All rights reserved.