File: edf6432 Module 8 Overview

Table of Contents
Module 1 Overview
  Module 1 Part 1
    Measure, Test, Assess, or Evaluate?
    Measure
    Test (or Instrument)
    Assess
    Evaluate
  Module 1 Part 2: High Stakes Testing
    Additional Resources
    Alignment
    Ethics
    Practice
  Module 1 Part 3: Purposes and Types of Tests
    Overview of classifications
    Common types of classifications
Module 2 Overview
Module 3 Overview
Module 4 Overview
Module 5 Overview
  Module 5 Part 1: Central Tendency
    Summarizing Data (Please read Chapters 12 and 13 in the Textbook)
    Central Tendency
  Module 5 Part 2: Variability
    Summarizing Data: Variability (Please read Chapter 13 in the Textbook)
    Variability
Printable View of: Module 6: Correlation and Validity
File: Module 6 Overview
Module 6 Overview
  Module 6 Part 1: Correlation
    Interpreting the Correlation Coefficient (r) and Coefficient of Determination (r²)
    Scatterplots
  Module 6 Part 2: Validity
    Content Validity
    Criterion-related Validity: Concurrent and Predictive
    Construct Validity
Printable View of: Module 7: Reliability and Accuracy
File: Module 7 Overview
Module 7 Overview
  Module 7 Reliability and Accuracy
    Reliability
    Stability, Equivalence, and Stability and Equivalence Methods
    Internal Consistency Reliability Estimates
    Inter-rater Reliability
    Standard Error of Measurement
    Factors That Influence Reliability Interpretation
Printable View of: Module 8: Standardized Test Score Interpretation
File: edf6432 Module 8 Overview
Module 8 Overview
  Module 8 Part 1: Standardized Testing
    Basic Characteristics of Standardized Tests
  Module 8 Part 2: Standardized Test Score Interpretation
Module 9 Overview
  Module 9 Assessment Issues for Language Enriched Pupils & Exceptional Student Education Settings
    General Principles
    Systematically Planned and Constructed
    Designed to Fit Characteristics of Content
    Designed to Fit Characteristics of Learners
    Designed for Maximum Feasibility
    Professional in Format and Appearance
  Part 2
    More Resources Related to Accommodations for Students in ESE and ESOL/LEP Programs
    Considerations When Identifying and Implementing Accommodations for Students in ESE Programs
    Assistive Technology
    Specific Exceptionalities
    Empirical and Research-oriented Studies Related to Assessment Accommodations
    Considerations When Identifying and Implementing Accommodations for Students in ESOL Programs
EDF6432 - Measurement and Evaluation in Education
Dr. Haiyan Bai
Module 1 Overview
The concepts in this module are important whether you are using measurement skills as a
teacher in a classroom or in another professional role such as school leader, counselor,
instructional designer, or researcher. As you begin, consider all the ways that proficiency in
measurement and evaluation is vital to your effective professional performance. Consider how
these measurement skills can assist you in performing your role, consistent with your
professional philosophy, and with high quality information at your fingertips to make effective
decisions. These measurement skills are also important for interpreting and conducting research
(teacher or school leader action research, school leaders' or private, non-profit evaluation
research, scholarly research).
Module 1 corresponds to Chapters 1, 2, & 4 in our textbook. We will begin with general but
critical concepts related to measurement and evaluation, continue with high-stakes testing, and go
on to purposes and specific types of tests. Content in this module relates to the text but includes
content not found in the textbook as well.
One of the most important attributes of high quality assessment is the validity of results (the
extent that inferences that we make from the results are appropriate). One of the most
important steps to ensuring validity is identifying what it is you want to assess (who and what),
for what purpose (why), and under what conditions (how). In this module, we will learn skills
that will help you enhance the validity of results of tests you use, create, or evaluate for research
purposes.
The table below contains the objectives, readings, learning activities, and assignments for
Module 1.
Module 1 focuses on the following objectives:

Objectives

Chapter 1
- Compare and contrast testing and assessment.
- Explain why testing and assessment skills are vital to today's classroom teacher.
- Identify the implications of current trends in educational measurement for today's classroom teacher.

Chapter 2
- Describe the broad impact high-stakes testing has on students, teachers, administrators, schools, and the community.
- Explain the relationship between academic standards, performance standards, and alignment in standards-based reform.
- Identify AERA's 12 conditions that high-stakes testing programs should meet.

Chapter 4
- Associate various types of decisions with the types of tests that provide data for these decisions.
- Determine whether or not a given test is appropriate for a given purpose.
- Describe the various types of tests available, and identify situations in which they would be appropriate.
- Discriminate among the various types of tests and their appropriate uses.

Readings
- Chapters 1, 2, & 4 in text
- Content and articles specified in module
- (explore) Florida Department of Education Accountability, Research, and Measurement (ARM) found at http://www.fldoe.org/arm/
- Professional standards in your field (see list under Materials tool)
- (selected student performance standards from) Florida Sunshine State Standards found at http://www.fldoe.org/bii/curriculum/sss/

Learning Activities
- Several non-posted practice tasks
- Posting to group (professional standard related to measurement and evaluation; student performance standard that has been classified by level along with a derived learning target)

Assignments
- Begin Final Project Part A (Instructions will be available under Assignments tool: Final Project)
Module 1 Part 1
Measure, Test, Assess, or Evaluate?
We often hear measurement-related terms substituted for each other when they actually have
distinct meanings. While the terms are often used interchangeably by the public, media, and
even colleagues, we can communicate more effectively if we use the terms correctly on a
consistent basis.
Let's first define the terms and then practice classifying real life examples. We will still hear the
terms used frequently in different ways by others. At least we will know how they are
appropriately defined by accomplished educators and measurement scholars. We may be more
likely to use the terms correctly in our communications with others and this may help others to
use the terms correctly, as well.
Measure


Measurement is the process of quantifying or describing the degree to which an attribute
is present using numbers. In our work as educators, we often need to describe the
extent that a person possesses a certain characteristic. We often must express how
much learning has taken place or the strength or presence of an attitude. Examples
include descriptions of:
o how many problems students are able to solve
o the extent that a student can explain or apply a concept
o the level of motivation to learn that a student or a class demonstrates
o our own level of performance with the Florida Teacher Accomplished Practices
In this effort, we attempt to assign a number to help us express the amount of an
attribute that is present. Among other things, this enables us to:
o communicate with others more precisely using quantified information
o monitor changes in amounts of attributes over time
o make more precise and reasonable plans
Test (or Instrument)


A test is what is used to obtain the measure. In education, tests are categorized in many
different ways according to their purpose (more on this in a later module). For example,
in educational settings, tests are often categorized as either objective or alternative.
o Examples of objective style tests include multiple choice, matching, and short
answer.
o Examples of alternative style tests include product and performance tests (with
rubrics), portfolios, and behavior rating scales.
In addition to the instrument itself, a test could be considered a set of procedures used
to get a measure. Consider these examples of tests that involve a set of procedures
designed to get a measure.
o 50 yard dash: mark out 50 yards of track, locate student at starting point, signal
start, use stop watch to record time from start to finish
o reading comprehension: student reads designated passage, responds to questions
designed to indicate extent of understanding, teacher documents level of
understanding
o expressive communication: a student is placed in a specific, structured social
context; an observer determines the number of appropriate verbal interactions
exhibited in a one-hour block of time
If we want high quality measures, or good information on which to base our educational
decisions, we must use high quality tests. The following characteristics are most
important for determining the quality of an instrument of any type.
o validity of results
o reliability of results
o utility of the test in the specific measurement context
Assess



Assessment is a process of gathering information, both measures and verbal
descriptions, about an attribute such as students' progress toward instructional goals,
the operations of an educational program, or a teacher's development as an effective
professional. The information is usually needed to make an educational decision.
Formal assessment is a systematic process and each step should be scrutinized for
quality. The process often includes the following steps:
o identify decision to be made
o identify information needed to make the decision
o collect the information (could be both formally collected and informally collected
data)
o judge the quality of the information gathered
o integrate the information
o make the decision
The quality of educational decisions depends on the quality of the assessment process.
Even when informal assessments are conducted, one should consider the quality of the
information that was used.
Evaluate



Evaluation is the process of making a judgment according to a set of criteria. Evaluation
is conducted at many levels. In the broad scope, we take the results of our assessment
and use them to make judgments. For example, we might use the results of
assessments of students' progress, assessment of students' motivation, assessment of
teachers' attitudes in order to make a judgment about the quality or worth of a
particular academic program. We might use the assessment results to judge the
program as effective, ineffective, or even harmful. On a more narrow scope, we might
use standards of reliability to judge the results of one of our teacher-made tests as
either poor or good quality.
To have confidence that our evaluation is reasonable, we again are depending on high
quality measures and assessments.
The more impact that the evaluative decision has, the greater our responsibility to
ensure that the assessments used were of high quality.
Examine the sample scenarios and then try your hand at categorizing an activity as testing,
assessment, or evaluation.
Testing
A teacher creates a set of math problems designed to tell how well
students can recognize the relationships between fractions, decimals,
and percents; express a given quantity in a variety of ways; determine if
numbers expressed in different ways are equal; and convert numbers
expressed in one way to their equivalent in another form. The students
complete the math problems and the teacher determines how many each
student got right.
Assessment
A teacher is making a decision about whether the student is making
progress in controlling the appropriateness of their expressive
communication behavior. The teacher gathers the results of formal
behavior observations, perceptions of other teachers, and the
perceptions of the individual student from their journal as well as from a
personal interview. The results are used to determine how much control
is being demonstrated by the student. The teacher will compare the
current results to the student's previous results as well as to expectations
set forth in a developmental communication scale.
Evaluation
A team of teachers is charged with determining whether a given software
program containing a variety of maps (local, world, social, geographical,
etc.) that the district is considering for purchase will meet the needs of
both the social science and physical science curriculum. The team
decides to base the judgment about the software program on cost, utility
(easy to use, matches content in both science and social studies, etc.),
and quality of maps (whether they are current, complete, easy to read,
etc.).
Now try your hand at classifying the following scenarios. Make a note of your choice and then
scroll down to compare your classification with that of the author.
Test, Assessment, or Evaluation?
A. A teacher has collected several pieces of information in order to decide which selections the student orchestra will play for the spring program. The teacher surveyed the participants (students, parents, school personnel) to determine their musical preferences; reviewed the difficulty levels of available sheet music; calculated the amount of practice time; and made a chart of the students' performance levels. After analyzing the available information, the teacher selected three appropriate pieces.

Test, Assessment, or Evaluation?
B. A panel of three teachers has been given the task of determining the fluency level of graduates of a Spanish language training program. The teachers listen to students read assigned passages, engage in a brief conversation with each student, and then listen to a three minute oral presentation by each student. After these activities the panel determines whether students speak at an appropriate pace, are able to make themselves understood, and are able to use a range of appropriate vocabulary. Considering this information, the panel categorizes the fluency level of each student as beginning, intermediate, or proficient.

Test, Assessment, or Evaluation?
C. A teacher has given students a set of instructions to write a paragraph on a specified topic. The students will earn points for various aspects of correct paragraph construction. They will also earn points for using correct spelling and punctuation. The teacher will use a scoring rubric to determine the number of points students earn showing how well they can correctly construct a paragraph.

Test, Assessment, or Evaluation?
D. Students complete a set of 40 multiple choice questions. The number correct out of 40 indicates the extent they have mastered a list of 10 instructional objectives.
Now compare your choices with those below.
Classification Scenario
Assessment
A. A teacher has collected several pieces of information in order to
decide which selections the student orchestra will play for the spring
program. The teacher surveyed the participants (students, parents,
school personnel) to determine their musical preferences; reviewed the
difficulty levels of available sheet music; calculated the amount of
practice time; and made a chart of the students' performance levels.
After analyzing the available information, the teacher selected three
appropriate pieces.
Evaluation
B. A panel of three teachers has been given the task of determining the
fluency level of graduates of a Spanish language training program. The
teachers listen to students read assigned passages, engage in a brief
conversation with each student, and then listen to a three minute oral
presentation by each student. After these activities the panel
determines whether students speak at an appropriate pace, are able to
make themselves understood, and are able to use a range of
appropriate vocabulary. Considering this information, the panel
categorizes the fluency level of each student as beginning,
intermediate, or proficient.
Test
C. A teacher has given students a set of instructions to write a
paragraph on a specified topic. The students will earn points for various
aspects of correct paragraph construction. They will also earn points for
using correct spelling and punctuation. The teacher will use a scoring
rubric to determine the number of points students earn showing how
well they can correctly construct a paragraph.
Test
D. Students complete a set of 40 multiple choice questions. The
number correct out of 40 indicates the extent they have mastered a list
of 10 instructional objectives.
If you were off target for most, you may wish to discuss your thoughts with classmates on the
Help from Classmates discussion board.
Module 1 Part 2: High Stakes Testing
Additional Resources
High-stakes tests: Tests for which there are significant contingencies associated with the results.
High stakes testing is a topic that is earning considerable attention from many educational
participants and stakeholders as well as media. While we read in the paper or hear/read in the
news about events and the perceptions of people without extensive training in assessment, what
do those with considerable training and research background in assessment think? In this part of
the module you will explore selected resources related to high stakes testing that are based on
research and best practice. As you continue to develop your educational philosophy and practice,
you are encouraged to include the information and skills from these resources to your
assessment and evaluation "tool kit."
The following are sites on the World Wide Web that will supplement your reading from Chapter 2
in our textbook.
AERA Research Point: The contribution made by teachers in a value-added assessment.
American Evaluation Association Position Statement on High-Stakes Testing in Pre-K - 12
Education.
Additional Readings (locate at least one of the articles and then check out your professional
organization website for info on high-stakes testing):
Volume 42, Issue 1 of the journal Theory Into Practice focused on high stakes testing. These
citations are for two of the articles in that issue.
Mapping the Landscape of High-Stakes Testing and Accountability Programs, By: Goertz,
Margaret, Duffy, Mark, Theory Into Practice, 0040-5841, Winter 2003, Vol. 42, Issue 1. Retrieved
online from Academic Search Premier.
NAEP and No Child Left Behind: Technical Challenges and Practical Solutions , By: Hombo,
Catherine M., Theory Into Practice, 0040-5841, January 1, 2003, Vol. 42, Issue 1. Retrieved
online from Academic Search Premier.
High Stakes Testing Uncertainty and Student Learning Amrein, A. L., & Berliner, D. C. (2002,
March 28). High-stakes testing, uncertainty, and student learning. Education Policy Analysis
Archives, 10(18). Retrieved September 16, 2002, from http://epaa.asu.edu/epaa/v10n18/
Locate the website for your professional organization. Determine whether there is a position
statement related to high-stakes testing. Example: National Council of Teachers of Mathematics
Position Statement on High-Stakes Testing
Alignment


An important consideration when it comes to understanding and interpreting the results
of high-stakes tests in education is the extent to which the curriculum, instruction, and
assessment related to the tests are in alignment.
The No Child Left Behind (NCLB) Act of 2001 requires that states test in reading and
math at least in grades 3 through 8 and one time in high school; starting in the
2007-2008 year, tests in science must be administered at least once in grades 3 through
5, 6 through 9, and 10 through 12. The law states that the tests are to be aligned with
the challenging state-mandated curriculum standards. These tests are considered
high-stakes because of the significant contingencies associated with them. It is important
to monitor the extent that curriculum, instruction, and formative assessment are aligned
with the standards tested with these high-stakes, standards-based tests.
Ethics



Test developers and users must adhere to ethical standards for the creation and
administration of tests (high-stakes or not). These ethical standards can be found in
various sources including the Codes listed below and in various professional
organizations' resources. It is wise to review these periodically to monitor the extent
that assessment activities in your context are consistent with ethical standards.
Another source of ethical practice is the Standards for Educational and Psychological
Testing (1999) published by the American Educational Research Association and
prepared by the Joint Committee on Standards for Educational and Psychological Testing
of the American Educational Research Association, American Psychological Association,
and the National Council on Measurement in Education. You are encouraged to visit the
websites of these organizations periodically.
Many of the principles for the use of standardized tests are true for the high-stakes
category of testing. We will study these further in later chapters and modules.
Explore the Codes listed in the links below and then try to locate others that pertain to ethics of
assessment for professionals in your specific field.
Code of Fair Testing Practices Copyright 2004 by the Joint Committee on Testing Practices.
Reprinted with Permission. Code of Fair Testing Practices in Education. (Copyright 2004).
Washington, DC: Joint Committee on Testing Practices. (Mailing Address: Joint Committee on
Testing Practices, Science Directorate, American Psychological Association (APA),750 First Street,
NE, Washington, DC 20002-4242; http://www.apa.org/science/jctpweb.html.)
Code of Professional Responsibilities in Educational Measurement copyright 1995 National Council
on Measurement in Education. Any portion of this Code may be reproduced and disseminated for
educational purposes.
Practice
Try your hand at classifying the following scenarios as high-stakes or not high-stakes
testing.
High-stakes or not?
A student gives a speech as a performance assessment. The score obtained by using a rubric to evaluate the student's performance will contribute 15% of the student's unit grade. The score is considered a summative measure of the student's skill in delivering a speech.

High-stakes or not?
Results of a performance test will be used to determine whether or not a student is allowed to participate in the school orchestra. The student will either be considered musically talented and invited to join the orchestra or not musically talented and prohibited from joining the orchestra based on the obtained performance test score.

High-stakes or not?
A student will be allowed to graduate from high school only if he or she earns a score equal to or beyond a certain cut-off point.
Compare your choices with the feedback in the following table.
Not high-stakes (the grade for a unit on speech will not have consequences at the level of significance to be considered high-stakes): A student gives a speech as a performance assessment. The score obtained by using a rubric to evaluate the student's performance will contribute 15% of the student's unit grade. The score is considered a summative measure of the student's skill in delivering a speech.

High-stakes (joining or not being allowed to join the orchestra is more significant given the potential impact on a student's life and the categorization: talented or not talented): Results of a performance test will be used to determine whether or not a student is allowed to participate in the school orchestra. The student will either be considered musically talented and invited to join the orchestra or not musically talented and prohibited from joining the orchestra based on the obtained performance test score.

High-stakes (graduating or failing to graduate from high school will have a significant impact on a student's life): A student will be allowed to graduate from high school only if he or she earns a score equal to or beyond a certain cut-off point.
Module 1 Part 3: Purposes and Types of Tests
In this section we will review the purposes and types of tests. An important aspect of the validity
of an instrument is that it will measure what it was designed to measure. We must choose a test
that is appropriate for getting the information we need. Also, when designing our own tests,
there are design considerations specific to the various types of tests. Therefore, we must be very
clear on the purpose of the test and what it is supposed to measure before we build it or select it
from the many choices available.
Overview of classifications



One reason for considering the many classifications for tests is that it is important to
pick the right type of test for the task at hand. Consider the aspects of what you are
trying to measure and then make sure the instrument you select is a good match for
those aspects.
Another reason for being aware of the various classifications is to communicate more
effectively. You will be more likely to find the instrument you are looking for if you know
how it has been classified - you will be more likely to look in the right places. You will be
more effective in communicating about the results if you know from which type of test
the results were obtained.
A single instrument can fit into more than one of the categories. It is helpful to think of
the primary aspects that are to be emphasized and then categorize the instrument along
those lines. Use the primary aspects first, secondary aspects next, and then omit the
classifications that aren't relevant to the purpose of the test. An instrument can be
criterion-referenced, objective, power, and teacher-made all at the same time. It
couldn't be classified as both objective and subjective at the same time, though. An
instrument could be subjective and summative but could not be summative and
formative at the same time. (There are exceptions, though; subsections within an
instrument could be classified differently; Part A of an instrument might be power and
Part B might be speed, for example.)
Common types of classifications



Review the classification types and definitions in Chapter 4. Criterion-referenced and
norm-referenced are described in the next module.
A single test may be used in different ways at different times. For example, an objective
style test may be used as a formative evaluation of students' performance in one
context while that same test may be used as a summative evaluation in another
context.
o Think of the product test in which students must write a short paragraph. A
teacher might use this as a practice test while students are still in the formative
stage of developing their skills for one group; while in another class, that same
paragraph test could be used as an end-of-unit posttest to determine how well
students can create a paragraph after instruction has been completed. The scores
in the formative assessment would not count toward a student's grade while the
scores of the posttest would be considered a summative description of the
students' end performance and would count toward their term grade.
Common types of instruments in education and an example of each are listed in the
following table. For each one, try to think of an example of an instrument you have
encountered in your past experience that also fits that category.
Objectively scored: A multiple choice test that measures students' understanding of key concepts within a unit on geography.

Subjectively scored: An essay test in which the student describes key elements of democracy and then predicts what would happen if each country on a specific continent adopted a democratic form of government.

Individually administered: The Wechsler Intelligence Scale for Children (WISC).

Group administered: The Cognitive Abilities Test, which measures reasoning abilities.

Verbal: Almost any test that requires that students read the test instructions and then read questions and write their responses; our midterm exam is an example of a verbal test. Even if the instructions and questions were read to the students and they answered orally, it would be considered a verbal test because taking the test is so dependent on students' verbal abilities.

Non-verbal: The Naglieri Nonverbal Ability Test measures students' non-verbal reasoning and problem solving skills.

Speed: A test of 'keyboarding' skills that measures how many words students can type per minute. (This doesn't just mean that there is a limit to how much time a student can take. It means how fast they can perform the skill.)

Power: A standardized achievement test (e.g., Stanford Achievement Test) is designed to measure the amount of content the student has mastered. The breadth and depth of a student's knowledge are emphasized over how fast one can perform.

Academic: The Bracken Basic Concept Scale measures skills typically needed to succeed in school. Also, any test that covers subjects typically taught in schools would be considered an academic test when contrasted with the next category: affective tests.

Affective: The Keirsey Temperament Sorter II is a personality questionnaire that is available online.

Teacher-made: Our first product exam, which will ask you to create an objective style test, is an example of a teacher-made test (not made by a commercial test publishing company or educational materials company).

Standardized: The Florida Comprehensive Assessment Test is a standardized test. It is to be administered and scored in the same manner everywhere that it is used (same administration procedures, same instructions to teachers and students, and same scoring procedures).
How did you do? Were you able to think of (or locate online) another example for each of the
categories? If not, you may want to use a search engine and see what you come up with or use
the Buros Mental Measurements Yearbook website (search Buros for test reviews) to find
examples.
Tasks for Module 2
Week 3 & 4 tasks:
1. Study Module 2 and Chapters 5, 6, and 7.
2. Objective Exam 1 opens on Monday and is available 1/27-2/7, due at 11:59 pm on 2/7. You can take it
anytime during this period, and you may take it up to three times; the best score will be recorded and
counted toward your final grade. Please don't miss it, since it is open long enough for you to select a good
day to take it. The answer key will be available after I receive all the submissions.
3. By the end of this week, you should start Part A of the Final Project, but you don't need to submit
Part A yet. It should be submitted together with Part B by the end of the semester. Please follow the
schedule so that you do not fall behind. I will check online to see your progress and provide my
comments when necessary.
If you have any questions, please feel free to let me know; I am ready to help you have a successful
semester.
Thank you all for a good start!
Haiyan Bai
Module 2 Overview
The concepts in this module are important whether you are using measurement skills as a teacher in a
classroom or in another professional role such as school leader, counselor, instructional designer, or
researcher. As you begin, consider all the ways that proficiency in measurement and evaluation is vital
to your effective professional performance. Consider how these measurement skills can assist you in
performing your role, consistent with your professional philosophy, and with high quality information at
your fingertips to make effective decisions. These measurement skills are also important for interpreting
and conducting research (teacher or school leader action research, school leaders' or private, non-profit
evaluation research, scholarly research).
Module 2 corresponds to Chapters 5, 6, & 7 in our textbook. We will cover critical concepts related to
norm- and criterion-referenced tests, test types and purposes, learning outcomes, and constructing
objective-style items. Content in this module relates to the text but includes content not found in the
textbook as well.
One of the most important attributes of high quality assessment is the validity of results (the extent that
inferences that we make from the results are appropriate). One of the most important steps to ensuring
validity is identifying what it is you want to assess (who and what), for what purpose (why), and under
what conditions (how). In this module, we will learn skills that will help you enhance the validity of
results of tests you use, create, or evaluate for research purposes.
The table below contains the objectives, readings, learning activities, and assignments for Module 2.
Module 2 focuses on the following objectives:
Objectives

Chapter 5
- Discriminate between norm- and criterion-referenced tests.
- Describe why content validity is important for classroom achievement tests.
- Discriminate between specific instructional or behavioral objectives, and general or expressive objectives.

Chapter 6
- Describe the components of a well-written instructional objective.
- Discriminate between observable and unobservable learning outcomes.
- Write objectives at different levels of the taxonomy.
- Construct a test blueprint for a given unit of instruction, according to the guidelines provided.

Chapter 7
- Identify the type of item format appropriate for different objectives.
- Describe ways to minimize the effects of guessing on true-false items.
- Write fault-free objective test items that match their instructional objectives.

Readings
- Chapters 5, 6, & 7 in text
- Content and articles specified in module
- (explore) Florida Department of Education Accountability, Research, and Measurement (ARM) found at http://www.fldoe.org/arm/
- Professional standards in your field (see list under Materials tool)
- (selected student performance standards from) Florida Sunshine State Standards found at http://www.floridastandards.org/index.aspx

Learning Activities
- Several non-posted practice tasks
- Postings to working group and specific discussion topics (assessment procedure validity compromise; student performance standard that has been classified by level; planning to create instructional objective and test item)

Assignments
- Continue Final Project Part A (to be found under Assignments tool)
Module 2 Purposes and Types of Tests - Criterion Referenced
We continue to examine ways in which the many tests used in education can be classified according to
purpose and use. Two categories that are highly relevant to educators are Criterion-referenced and
Norm-referenced tests. This section and the next will discuss these two important categories. It is not
just the test itself (the paper with print or the set of procedures used in a performance test) but the
perspective from which scores are interpreted that helps us categorize the instrument as criterion-referenced or norm-referenced.
Criterion referenced interpretation
The majority of the instruments that are designed by teachers for use in their classrooms are criterion-referenced. A set of test items is developed to measure a carefully analyzed framework of instructional
outcomes. The student's score on the test is interpreted to represent the extent the student can
successfully perform the set of outcomes. The student's performance is compared to the original set of
outcomes. Scores interpreted from this perspective are often reported as a list of objectives mastered,
number of objectives mastered, and/or a percentage loosely representing the proportion of the original
set of outcomes that has been mastered.
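To make these reporting formats concrete, here is a minimal sketch in Python of how a criterion-referenced summary might be produced from item-level results. The objective names, the scores, and the 0.80 mastery cut-off are invented for illustration; they are not values from the textbook or from any particular test.

# Illustrative sketch only: objective names, scores, and the 0.80 mastery
# cut-off are invented for this example, not taken from the textbook.

student_results = {
    "defines key terms (government, economy, religion)": 0.86,  # proportion of related items correct
    "locates the Eastern and Western hemispheres": 1.00,
    "compares social traditions and customs": 0.60,
}

MASTERY_CUTOFF = 0.80  # assumed criterion for calling an objective "mastered"

mastered = [obj for obj, score in student_results.items() if score >= MASTERY_CUTOFF]
percent_of_domain = 100 * len(mastered) / len(student_results)

print("Objectives mastered:")
for obj in mastered:
    print(" -", obj)
print(f"Number mastered: {len(mastered)} of {len(student_results)}")
print(f"Approximate percentage of the domain mastered: {percent_of_domain:.0f}%")

Whatever the size of the domain, the logic is the same: the student's score is interpreted against the domain of outcomes itself rather than against the performance of other students.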
Imagine that a teacher is planning an instructional unit and would like to know whether the
students have acquired the prerequisite skills needed to be successful with the new skills included in the
upcoming unit. Let's say the unit is based on:
Social Studies, Grades 6 -8, Sunshine State Standard 3: The student understands Western and Eastern
civilization since the Renaissance. (SS.A.3.3); # 5 Understands the differences between institutions of
Eastern and Western civilizations (e.g., differences in governments, social traditions and customs,
economic systems and religious institutions).
If the teacher had criterion referenced information available he could look up exactly which
pre-requisite skills the class had acquired and which were missing. Can the students locate the Western
and Eastern hemispheres? Can the students define and give examples of government, religion,
economy, etc.?
What would not be as helpful in this context is to know the proportion of the norm group that
performed lower than a particular student, or the stanine scores earned by students in the class, or the
rank standing of each student in the class compared to the norm group. For the specific information
needed by this teacher, comparing performance against a domain of skills would be more helpful than
comparing performance of students to a norm group.
Compare the two types of information in the following tables. Both sets of information are useful but
each is useful for different purposes in different circumstances.
Notice that in the Criterion Referenced Table, the information shows that Annabelle has acquired
approximately 86% of the terms in the domain and can locate the hemispheres. The teacher might
interpret that the group is strong in relation to definition of terms but only about half of them know
where the hemispheres are located. This information could help the teacher in planning the upcoming
unit. The information in Table B, while helpful in another context, would not help the teacher in
determining the specific content that must be incorporated into the learning activities for the students
to be successful in the new unit.
You may also imagine this interpretation from the perspective of a school leader who is identifying
which goals and objectives related to the school plan are areas of strength and which areas need more
attention. Effective, data-driven decision-making involves the comparison of gathered evidence related
to school effectiveness indicators against the targeted goals (yes/no; 60% attained; etc.). This
constitutes interpretation of data from a criterion-referenced perspective. When trying to know which
actions to take in the coming years, it is not as helpful to know how you compare against other schools
but rather how you compare against your own targeted goals.
A. Criterion Referenced Perspective

Student      Defines Terms: Govt, Econ., ... (Percentage Score)    Locates Hemispheres
Annabelle    86                                                    Y
Fred         92                                                    N
Randal       88                                                    Y
Ysela        94                                                    N

B. Norm Referenced Perspective

Student      Social Studies (Percentile Rank)    Rdg. Comp. (Percentile Rank)
Annabelle    72                                  88
Fred         89                                  78
Randal       82                                  75
Ysela        90                                  70
Module 2 Purposes and Types of Tests - Norm Referenced
We have learned that there are many ways of classifying tests based on their format, purpose, or types
of scores. Next we'll look at the norm referenced perspective of score interpretation. This type of
interpretation is often required when trying to interpret standardized test score reports. It can be
helpful when describing a group's prior achievement (e.g., "above average, homogeneous prior
achievement within a subject area" or "heterogeneous, including below average, average, and above
average achievement across subject areas"). In contrast to the criterion referenced perspective, this
perspective is not as helpful for detailed planning of instructional units.
Norm referenced interpretation
Recall that a norm referenced interpretation of scores means we are comparing a student's
performance against that of a norm group. With norm referenced interpretation, we will be describing a
student's performance as below average, average, or above average compared to a reference group,
called the norm group. To accurately interpret the students' performances, we must have some
information about the reference group. In our classroom context, we would want to know if the
characteristics of students in the norm group are similar to the characteristics of the students in our
class. If they are, we would have more confidence in making the comparison. If they are not similar, we
would be less confident in comparing the performance of the students in our class to the performance
of the norm group. We will study this concept further in later chapters on standardized testing.
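As a rough illustration, the following Python sketch converts a raw score to a percentile rank relative to a norm group. The norm-group scores and the student's raw score are invented, and the sketch uses one common definition of percentile rank (the percentage of the norm group scoring below the student, counting half of any ties); published tests may compute and report norm-referenced scores differently.

# Illustrative sketch only: the norm-group scores and the student's raw score
# are invented, and this uses one common definition of percentile rank
# (percentage of the norm group scoring below the student, plus half the ties).

from bisect import bisect_left, bisect_right

norm_group_scores = sorted([12, 15, 18, 18, 20, 22, 22, 23, 25, 27, 28, 30, 31, 33, 35])
student_raw_score = 27

below = bisect_left(norm_group_scores, student_raw_score)          # scores strictly below
ties = bisect_right(norm_group_scores, student_raw_score) - below  # scores equal to the student's

percentile_rank = 100 * (below + 0.5 * ties) / len(norm_group_scores)
print(f"A raw score of {student_raw_score} corresponds to a percentile rank of about "
      f"{percentile_rank:.0f} in this norm group.")

Notice that the resulting percentile rank locates the student relative to the norm group but says nothing about which specific skills have been mastered, which is why norm-referenced information is less useful for detailed instructional planning.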
Do you remember the criterion and norm referenced tables in the previous module section? Let's
examine a circumstance where the norm referenced information could be useful.
Imagine a teacher was ready to begin planning prior to the start of a school year. The teacher
would like to select strategies, materials, and activities that will be the best match for the characteristics
of the students. He would like to gather as much information as possible about the relevant
characteristics of the students. One relevant characteristic would be the group's prior achievement in
both the targeted subject area and areas that are related to the target subject.
For example, the teacher knows that social studies skill acquisition is related to reading comprehension.
It would probably be impractical to test each student in the social studies class on their reading
comprehension. It would also be too tedious and time consuming to review the criterion-referenced
score reports on every student for each reading skill. Examining the class standardized test score report
in the areas of reading comprehension and social studies may be a more realistic possibility. The specific
skill information in Table A was useful for the unit-to-unit detailed planning. The information in the
norm referenced table would be useful for the teacher's general information need in this specific
context.
A. Criterion Referenced Perspective

Student      Defines Terms: Govt, Econ., ... (Percentage Score)    Locates Hemispheres
Annabelle    86                                                    Y
Fred         92                                                    N
Randal       88                                                    Y
Ysela        94                                                    N

B. Norm Referenced Perspective

Student      Social Studies (Percentile Rank)    Rdg. Comp. (Percentile Rank)
Annabelle    72                                  88
Fred         89                                  78
Randal       82                                  75
Ysela        90                                  70
One can see from the information in the norm referenced table that the students are somewhat heterogeneous (average to above average) in their prior achievement in social studies while they are homogeneous (mostly average) in their reading comprehension when compared to the norm group. This is a useful piece of information when selecting instructional strategies for the group (e.g., the teacher may choose to reinforce text readings with activities that incorporate other learning styles). As we have seen, the two score interpretations serve different purposes and each can be useful in a certain context.
Module 2 - Part 3 Domain of Skills
Goal Frameworks
A domain of skills is a collection of goals and objectives related to a specific topic or group of topics
within a subject. In our field of education, we hear many different uses of the term "domain".
When the collection of goals and objectives is represented in a carefully analyzed format with the
content and learning levels clearly delineated according to a learning taxonomy, you could call it a
framework of subordinate skills. In this type of domain, learning outcomes have been identified based
on careful analysis of the goals. The subordinate skills have been derived based on a breakdown of
subskills and the relationships among the subskills within the goal. Test scores that come from a test
based on this type of domain would be a meaningful representation of the students' performance in
relation to the domain. This type of score interpretation is called criterion-referenced. We are
interpreting students' performance in relation to a criterion (the domain of skills). This type of
interpretation is sometimes also called domain-referenced (i.e., scores are linked to a domain).
Sometimes learning objectives are gathered more informally. They are collected because they generally
cover the same topic or a loose collection of related topics. Some of our colleagues refer to this as a
domain of skills, also. This type of domain is clearly different from the framework of subordinate skills
described above. Results from a test based on this type of "domain" would be interpreted differently
than results from a test based on a careful analysis of goals and subskills within a subject. Results from
this type of test are also called domain referenced. Do you see the difference between the two? When
we are using the terms and hear them used by another, it is important to understand to which type of
domain the person refers.
When instructional objectives have been written from the collection of subordinate skills within the
domain and a test is constructed from those objectives, we may hear those results referred to as
objectives-referenced scores. Again, we will want to determine whether this objectives-referenced test
was developed from a carefully analyzed goal framework of subordinate skills or from a loose collection
of skills that are related to a similar topic. Scores from the two types of objectives-referenced tests
would be interpreted differently.
When it comes to selecting and designing high quality classroom assessments, the most important
quality criterion is the validity of the test results. Validity means the extent that inferences made from
the results of a test are appropriate. Validity of results allows us to have confidence that the information
that we get from the administration of an instrument to a particular set of test takers, in a particular
context, under specified conditions is the best information possible. The results are in fact giving us
information about the attribute we are trying to measure and are not clouded by other factors we were
not trying to measure.
There are various ways of examining validity and we will study this concept further in later chapters and
modules. In this module section we are examining one type in more detail: content validity. We are
examining the extent the items on a test represent the domain of skills on which instruction was based.
Using the design principles from the previous module section, examine the following scenarios to see if
you can spot compromises to the validity of the test results. Compromises to the validity include
instances where inappropriate inferences are being made from the results of the test. Make a note of
each possible compromise you recognize.
A science teacher would like to increase her understanding of the students' reading
comprehension level so she can select appropriate instructional materials. She is especially
interested in knowing how much she can use the Internet as a tool to convey instructional
content to the class. She has found a reading comprehension test that students can take
online. The class has very limited access to the computer lab in order to take the test but she
Example is able to schedule the class of 28 students from 10:45 to 11:15 AM on Monday morning.
1
There are 25 computers in the lab and the teacher did not notice when the three most shy
students were sitting at tables without computers. Once she did notice, she felt guilty for not
Part 1
noticing sooner but was not otherwise concerned because these students never give her any
trouble. She concludes that they must be good readers or they would ask more questions.
The teacher also noticed some students about which she was unsure of their reading
comprehension. Instead of taking the test, they were playing video games. Eventually the
students randomly bubble in answers and submit their tests.
Five of the students in that same class speak English as a second language but there are no
dictionaries in the computer lab. The teacher advises the students to quickly find a foreign
language dictionary online and use it, if needed, as they are taking the test. The students are
surprised because they did not know these types of dictionaries were available online.
Example
At 10:55, the students typical lunch hour, the class begins taking the exam which is designed
1
to be taken individually and requires 30 minutes for administration. The directions found in
Part 2
the administrator's manual of the test state that the teacher is to read instructions aloud to
the class before students are allowed to begin. The teacher is afraid the class won't be able to
finish before 11:15 so she tells the students to start reading the comprehension passages
while she reads the directions aloud aloud to them; she will then go around individually while
the class is working to ask if they have any questions.
Example Once the students have started, the teacher has a chance to read the passages herself. She
1
Part 3
gets a chuckle out of some of the vocabulary found on the test. There are some unusual
words like "lift" instead of elevator; "dustbin" for garbage can; and "lorry" for truck. She
hopes the students have watched enough movies with British settings so they will not be
bothered by the differences. The teacher also believes the students might not notice the
different vocabulary because the text on the screen is rather fuzzy because of the aging
computer monitors. The screen fades in and out from time to time if the computers are on
for longer than 20 minutes at a time.
The bell rings at 11:15 and the new class of students begins to arrive. She tells her class to hit
the "finish" button no matter where they are and then submit their results for scoring
because the class must give up the lab. She looks forward to getting the test results and
finding out just how strong the reading comprehension skills are within this class.
Example 2
A statewide comprehensive assessment test covering learning outcomes from the state curriculum standards is administered each year to every student in the state. At one school, the faculty researched an alternative curriculum, which they voted to adopt this past year and which was approved by the school board. They believe the alternative curriculum is more current and has more comprehensive and relevant goals than the state-adopted curriculum. They waited with great excitement to see the results of the comprehensive assessment test because they believed the curriculum they had chosen was better than the state-adopted curriculum. Their hopes were dashed when they found out their students earned lower scores than the other students in their district. They had been so confident that their students had learned a great deal after their experiences with the new curriculum. The faculty remains perplexed and disillusioned. They are not sure how to proceed with this frustrating information.
Example 3
The state department of education is examining the performance of various schools throughout the state. One particular school has generated a great deal of excitement because of high levels of dedication by the faculty, attendance and motivation of the student body, and participation by students' families. The school has received recognition from a national corporation for how hard everyone has been working and their great attitudes toward learning. Students are feeling high levels of satisfaction, both because of the amount they have learned since the program started (they have made almost two years' worth of gain in a single school year) and because of the recognition by popular celebrities. After the assessment results are published, everyone is shocked because, using the state's grading system, the school received a grade of "D".
You have probably found many compromises to validity in these examples. Select one or two that
resonate with you and post a brief comment on the discussion board under the Validity discussion topic.
Which compromises resonate most with your classmates?
Module 2 - Part 4: Writing Instructional Objectives
Teachers learn about instructional objectives in several different courses or workshops. As you may have
noticed, objectives are sometimes defined and written differently in different contexts. In one context
you may have learned that the objective should always be written with the phrase Students will be able
to..., while in another context you learned that you don't need to write it like that because you already
know that the objective is what the student does and not what the teacher will do. In one context you
may have learned that objectives are written at a fairly general level while in another context you
learned they are to be written with great specificity almost like a test item. As a student of pedagogy,
you will synthesize all of this advice for your own educational context and use it to design high quality
instruction, to select or create instructional materials and techniques using the appropriate curriculum
for your students, and to select or create high quality assessments. In this section we will learn to write
instructional objectives in a format that will help you design the tests that will yield the most valid and
reliable results.
Whatever the format, most experts agree that well-written instructional objectives are instrumental in
producing higher quality lesson plans and assessments. It is hoped that the instructional objective
format that you learn here will contribute to your skills in effective instructional design and assessment.
Defining and Recognizing Instructional Objectives
An instructional objective is a statement of learning outcome that contains conditions, content,
behavior, and criteria. The term learning outcome is an important part of the definition of an
instructional objective. A learning outcome is what the student will be able to do (in terms of the skill)
following instruction. It is the knowledge or skill or ability that they take with them and perform even
outside of the context of the lesson. The learning outcome is not the specific task on the test that elicits
the skill (e.g., match the definition with the term), it is not the practice exercise they do during the
lesson (e.g., locate the vocabulary words in the word puzzle); it is the skill they will be able to perform
under any context that presents itself in the future (e.g., define the terms associated with ...).
Conditions are statements of any materials, equipment, stimulus material, or context that must be
provided for the student to perform the skill specified by the behavior and content in the objective.
Conditions are very important to the instructional objective because they create the opportunity for the
student to demonstrate the skill and influence the level of difficulty at which it will be performed. The
conditions will influence the validity and reliability of the resulting test item. Validity is influenced
because the expected task to be elicited by the item is more clear when conditions have been specified.
Reliability is influenced in that it is more likely that the item will be clear and uniform across its
presentation on tests because the way in which the student is to demonstrate the skill has been clearly
specified. Students will be performing under similar conditions each time the item is presented.
Content is the topic or subject matter or issue within the skill contained in the objective.
Behavior is the action part of the statement. It must be presented in observable, measurable terms; that is, there must be a product or performance that you can see, hear, or touch. You can see the student's product resulting from "...solve quadratic equations...," but you cannot see the student's comprehension in "...comprehends quadratic equations...." You could measure whether or not the student solved the equations, but you could not measure whether or not the student comprehended them until the skill was operationalized using observable, measurable terms.
Instructional objectives often must include the criteria or mastery level at which the student should
perform the skill. Some instructional designers believe that the criteria level is only necessary when
there are degrees of correctness expected or when degrees of correctness differ across the
developmental levels of the students. For example, consider the objective: From memory, state the
capital of Florida. The mastery level of the learning outcome is implied in that the student is expected to
state the capital of Florida correctly every time the opportunity presents itself, not every 2 out of 3
times or 8 out of 10 times. Now consider this learning outcome: Given a basketball with regulation size
hoop and distance from the foul line, make a free throw. The student could not reasonably be expected
to make the free throw every time. Criteria would state the mastery level appropriate for the
characteristics of the students (e.g., third graders versus professional basketball players). Instructional designers who consider the criteria unnecessary when it is implied that the student should perform the skill correctly every time typically set the mastery level at the test level rather than at the instructional objective level (e.g., 75% correct equals satisfactory performance on the test, or 80% mastery earns a "B" on the test).
Identify the Parts of Instructional Objectives: Examine the instructional objectives in the table below.
Identify the part indicated at the beginning of each row. The feedback follows.
1. Behavior: Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical notation term.
2. Behavior: Given a piano and sheet music for an etude, play the etude with fewer than 5 errors.
3. Content: Given an example of a specified ecosystem containing producers, consumers, and decomposers, identify the specified role as either a producer, consumer, or decomposer.
4. Content: Given a liquid and the necessary heating and cooling devices, demonstrate the change in matter from each state to each of the other states.
5. Conditions: Given pairs of mixed numbers, explains the effect of multiplication, division, and the inverse relationship of multiplication on the mixed numbers.
6. Conditions: Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs.
7. Criteria: Given names of historical leaders who have influenced western civilization since the Renaissance, identifies at least one of the factors for which the leader was influential.
8. Criteria: Given an annual salary and lifestyle example with specified necessary expenses, create a monthly budget correct to within $10.
Compare your choices with the feedback in the table below. The indicated part is quoted after each objective.
1. Behavior: Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical notation term. (Indicated part: "select the definition.")
2. Behavior: Given a piano and sheet music for an etude, play the etude with fewer than 5 errors. (Indicated part: "play the etude.")
3. Content: Given an example of a specified ecosystem containing producers, consumers, and decomposers, identify the specified ecosystem role as either a producer, consumer, or decomposer. (Indicated part: "the specified ecosystem role as either a producer, consumer, or decomposer.")
4. Content: Given a liquid and the necessary heating and cooling devices, demonstrate the change in matter from each state to each of the other states. (Indicated part: "the change in matter from each state to each of the other states.")
5. Conditions: Given pairs of mixed numbers, explains the effect of multiplication, division, and the inverse relationship of multiplication on the mixed numbers. (Indicated part: "Given pairs of mixed numbers.")
6. Conditions: Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs. (Indicated part: "Given sets of data on two variables with at least ten observations.")
7. Criteria: Given names of historical leaders who have influenced western civilization since the Renaissance, identifies at least one of the factors for which the leader was influential. (Indicated part: "at least one of the factors.")
8. Criteria: Given an annual salary and lifestyle example with specified necessary expenses, create a monthly budget correct to within $10. (Indicated part: "correct to within $10.")
How did you do? Are you ready to create instructional objectives that contain each of these parts? You
are asked to do this in the Final Project. If you have already started the Final Project, it may be a good idea to review the objectives that you have written to make sure they contain all the necessary pieces.
Classifying Instructional Objectives According to Learning Level
Another important aspect of instructional objectives is the learning level they represent. From a
measurement perspective, this is especially important. To ensure the congruence between the learning
level expected in the instructional objective and the level contained in the instruction and tests, we
must first be aware of the learning level expected in the objective. Is it a higher level or a lower one? If it
is higher, we want to make sure we are providing students with instruction and practice opportunities at
that level. If it is a lower level of learning, we must make sure we are not demanding more in the test
item than was expected in the objective and instructional activities. This is especially important for the
validity of our test results.
After reading about the types and levels of learning described in your textbook, try to classify the
following objectives accordingly. In the first table, you are asked to identify the type of learning as either
affective, cognitive, or psychomotor. The next table asks you to classify the objective according to the
levels of learning within the cognitive type.
Type? 1. Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical notation term.
Type? 2. Given a regulation size soccer ball in a game situation, pass the soccer ball to an offensive player on the player's team.
Type? 3. Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs.
Type? 4. Given a variety of paint brushes, demonstrates the technique and control that would be necessary to obtain a variety of visual effects.
Type? 5. Within a game situation, demonstrates consideration for others.
Type? 6. Given the opportunity to interact with people of differing physical abilities during team sports, chooses to show respect for people of like and different physical ability.
Compare your choices with the feedback in the table below. You may wish to discuss discrepancies or
confusion within your group's discussion area.
Cognitive: 1. Supplied with a term related to musical notation and a set of definitions that include the correct and incorrect definitions, select the definition for the specified musical notation term.
Psychomotor: 2. Given a regulation size soccer ball in a game situation, pass the soccer ball to an offensive player on the player's team.
Cognitive: 3. Given sets of data on two variables with at least ten observations, displays the data using histograms, bar graphs, circle graphs, and line graphs.
Psychomotor: 4. Given a variety of paint brushes, demonstrates the technique and control that would be necessary to obtain a variety of visual effects.
Affective: 5. Within a game situation, demonstrates consideration for others.
Affective: 6. Given the opportunity to interact with people of differing physical abilities during team sports, chooses to show respect for people of like and different physical ability.
Now try to classify objectives according to their level within the cognitive component of Bloom's
taxonomy (knowledge, comprehension, application, analysis, synthesis, and evaluation).
Level? 1. Given a problem within a specified context that contains relevant and irrelevant information, decides what information is appropriate and then collects, displays, and interprets data to answer relevant questions regarding the problem.
Level? 2. Given written examples of elements, molecules, and compounds, recognizes them as such.
Level? 3. Given two different strategies and problems of length in units of feet, correctly estimates length to within one foot.
Level? 4. Given a short story, critique the author's use of the elements of plot (setting, events, problems, conflicts, and resolutions).
Level? 5. From memory, recalls ways in which conflict can be resolved.
Level? 6. Given a specified topic, writes a speech and uses a variety of techniques to convey meaning to an audience (movement, placement, gestures, silence, facial expression).
Compare your choices with the feedback in the table below. You may wish to discuss discrepancies or
confusion within your group's discussion area.
Analysis: 1. Given a problem within a specified context that contains relevant and irrelevant information, decides what information is appropriate and then collects, displays, and interprets data to answer relevant questions regarding the problem.
Comprehension: 2. Given written examples of elements, molecules, and compounds, recognizes them as such.
Application: 3. Given two different strategies and problems of length in units of feet, correctly estimates length to within one foot.
Evaluation: 4. Given a short story, critique the author's use of the elements of plot (setting, events, problems, conflicts, and resolutions).
Knowledge: 5. From memory, recalls ways in which conflict can be resolved.
Synthesis: 6. Given a specified topic, writes a speech and uses a variety of techniques to convey meaning to an audience (movement, placement, gestures, silence, facial expression).
Review a set of student performance standards in an area of interest to you (e.g., Sunshine State
Standards). While these are broader than the instructional objectives we have been practicing with, they will still be helpful for practicing the classification of skills according to learning levels. Select two (one that is
lower and one that is higher) and classify them according to type and level of learning. Post them to
your group's discussion area and review the standards and classifications your group members have
posted. Discuss any differences you may observe. While you are selecting the standards, examine those
you have selected to determine whether they are written using observable and measurable terms for
the behavior or not. Comment on this in your posting, also.
Module 2 - Part 5: Test Blueprints
In the previous modules we have been learning about basic test design considerations as well as the
specific foundations of test items: instructional objectives. We will now learn about an especially
important tool when it comes to test design: the test blueprint. A test blueprint is also referred to as a
Table of Test Specification. It guides us in the development of our test and will help ensure that the test
is designed to yield the most valid and reliable results. A test blueprint is a plan for the test that
designates:
what content will be included on the test
which learning levels will be included on the test
the number of items on the total test
how many items, and what proportion of the test, will be devoted to each content area and learning level
the format of the test items
an estimate of time for the overall test and for each item
Before we examine how a test blueprint does this, review the two most important criteria related to the quality
of a test: validity and reliability.
Validity
Validity is defined as the appropriateness of inferences made from test results. Previously, you may have
heard the definition of validity as "whether or not the test measures what it is supposed to measure."
This definition is true but somewhat limited as it targets content validity but does not give the full
picture when it comes to the quality of test results. In addition to measuring what it is supposed to
measure, we must ensure that the interpretations or judgments we make from the test results are
appropriate. There are several test design factors that we must consider to help ensure that the results we get will be valid (Carey, 2001).
Consider how well the subordinate skills selected for the test represent the overall goal framework (all aspects of its content and learning levels; is there a good match between subskills and test items?).
Consider the information expected from the test and plan accordingly (i.e., will it be formative or summative? Is it to determine whether students have mastered the prerequisite skills, or whether they have mastered the skills in the current unit? Will the test be used to evaluate the effectiveness of your instruction?).
Identify the best format for the tasks or skills included on the test (written response, selected response, essay, product, portfolio, etc.).
Determine the appropriate number of items needed to effectively measure each skill. Is one item sufficient (such as with recall-level skills), or do students need multiple opportunities to demonstrate the skill (such as skills that require classification, problem solving, or physical movement)?
Determine how and when the test will be administered (how much time will students need to prepare? When do you need the information for your planning or feedback purposes?).
Considering these factors as you design the test will help ensure that more appropriate inferences are made from the results, i.e., more valid results.
Reliability
Reliability is the next most important criterion when it comes to test design. Reliability is the consistency
with which scores are obtained with the same measure under the same conditions. We think of
reliability as the consistency or stability of the scores. As Nitko (2004) further explains, reliability is the extent to which students' assessment results are the same when: they complete the same tasks on two or more
different occasions; two or more raters mark their performance on the same task; or they complete two
or more different but equivalent tasks on the same or different occasions. Knowing how much
consistency there is in the set of scores is useful for determining how much confidence you can have in
the test scores. Later in the semester we will study ways to estimate the reliability of the test results.
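Although we will study formal ways to estimate reliability later, a small sketch can make the idea of consistency concrete now. The Python example below is only an illustration, not a method from the textbook: it correlates the same five students' scores from two hypothetical administrations of the same test (the score lists are made up). A correlation near 1.0 indicates highly consistent, i.e., reliable, results.

import statistics

def pearson_r(x, y):
    # Pearson correlation between two score lists of equal length
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical scores for the same five students on two occasions
occasion_1 = [78, 85, 62, 90, 70]
occasion_2 = [80, 83, 65, 88, 72]

# Values near 1.0 indicate highly consistent (reliable) results
print(round(pearson_r(occasion_1, occasion_2), 2))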
As with validity, there are important steps we can take in the design of our instruments (Carey, 2001) to
help ensure the consistency of results. These include:
Select a representative sample of content and skill levels from the goal framework or set of instructional objectives.
Make sure there are enough items to adequately capture the skill that is to be demonstrated (e.g., if the skill is to "make plurals," give students the opportunity to make more than one word or type of word plural: -s, -es, -ies).
Select the test item formats that will best reduce the possibility of guessing.
Make sure you have selected only the number of items students can reasonably finish in the time allotted.
Help students develop and maintain a positive attitude toward testing (announce tests in advance; tell them the purpose of the test; avoid using tests as punishment); i.e., consider what your students will need in order to be motivated to do well.
Considering these factors as you design and construct the test will help ensure more consistency in performance, i.e., more reliable scores and more confidence in the results. We will study various ways to estimate reliability and validity in a later part of the course.
Creating the Test Blueprint or Table of Test Specifications
Now we can take all of this advice about general test design, validity, and reliability and use it to make
the best possible test. To do this we need a good plan - that's where the blueprint comes in. Good tests
take careful planning and a blueprint is a way to plan tests using all of the design considerations we have
learned. To create a table of test specifications, start with a matrix. It almost looks like the goal
framework. There is an example Table of Test Specifications format under the Materials tool. You may
want to download or print that now as you practice constructing a blueprint. You may also use it to
create your test blueprint in your Final Project if you like.
The table of test specifications helps you record the content that you have selected to be on the test you
are planning. The row headings on the left indicate what content you are selecting to appear on the test.
The content is classified according to learning level (these classifications could come from the goal
framework if you have one to work with) and the column headings are then used to indicate the
learning levels that will be present on the test. These steps are important as you are trying to make sure
the content and learning levels on the test match the content and learning levels from the instructional
objectives that were used to plan the lessons or learning activities. This is especially important when
considering the content validity of the test.
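Before drafting your own table, it may help to see the blueprint matrix expressed as a simple data structure. The Python sketch below is only an illustration: the content areas, learning levels, and item counts are hypothetical placeholders, with rows for content and columns for learning level, just as in the table of specifications. Summing across a row shows the share of the test devoted to each content area, which is the proportion you would check against the emphasis in your instruction.

# Hypothetical blueprint: rows = content areas, columns = learning levels,
# cells = planned number of items
blueprint = {
    "Fractions":        {"Knowledge": 4, "Comprehension": 3, "Application": 3},
    "Decimals":         {"Knowledge": 3, "Comprehension": 3, "Application": 2},
    "Percent problems": {"Knowledge": 2, "Comprehension": 2, "Application": 3},
}

total_items = sum(sum(row.values()) for row in blueprint.values())
for content, row in blueprint.items():
    share = sum(row.values()) / total_items
    # Proportion of the total test devoted to each content area
    print(f"{content}: {sum(row.values())} items ({share:.0%} of the test)")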
Identify an instructional goal that is relevant for your discipline (subject matter/content and age/grade
level). Create a rough draft of the major content and learning levels that would likely be needed for a
table of specifications for that goal. If you are not yet able to identify an instructional goal from your
discipline, you may wish to use the framework in the next section (Table 3.1) as the foundation for this
draft table of specifications. You may wish to discuss your draft with your group members and give them
some feedback on their draft as well. This is one of the skills found on our Final Project so you may wish
to get some practice-with-feedback within your group at this time.
Creating Items from Subordinate Skills and Instructional Objectives
Now we will look at the actual selection of objectives to place on the test and how they are used to
develop effective test items. Well-planned and well-written items will tend to contribute to valid and
reliable test results. As we practice constructing an actual blueprint, let's trace the path of test items
from their subordinate skills to their development into test questions. Remember the goal framework
from Module 2? It is copied as Table 3.1 below. It contains the results of the analysis of the instructional
goal: Write instructional objectives. Take a moment to review the framework, noticing the content (left-hand row headings) and the learning levels (column headings) that make up the skill. The intersection of
the column and row creates the particular subordinate skill; and the contents of the cell at that
intersection is what the students will need to know or do.
Let's look at Row 2, which is concerned with the "behavior" part of an instructional objective. Now look at Column C, which asks students to state the qualities of the content in question. The intersection of Column C and Row 2 means that students are asked to state the qualities of the behavior aspect of an instructional objective: C.2 = State the quality characteristics of behavior in an instructional objective. If you asked the student, "What are the qualities of 'behavior' in an instructional objective?" they should then be able to tell you the contents of cell C.2: "It is clear, and the behavior is observable and measurable; also, the behavior should be appropriate for my students."
While we are using a framework that is on the topic of instructional objectives, try to think of what a
framework in your area of study might look like (one on states of matter, history of western civilization,
problems of mass and volume, engaging in a social conversation in an alternative language, etc.). The
subordinate skills that make up the goal framework will then become instructional objectives. You turn
the subordinate skills into instructional objectives by adding the conditions and criteria. Make sure the
objective contains all the right pieces: behavior, content, conditions, and criteria.
Table 3.1 Instructional Goal Framework for the Goal: Write instructional objectives.
Learning levels (columns): Knowledge: A. State/Recall Physical Characteristics; B. State/Recall Functional Characteristics; C. State/Recall Quality Characteristics. Comprehension: D. Discriminate Examples and Non-examples. Application: E. Create an example. Evaluation: F. Evaluate given examples.
Row 1. Instructional Objective
A. A statement of learning outcome that contains 3 - 4 parts (below).
B. Serves as the foundation of instructional planning and assessment.
C. Clear, appropriate scope. Matches subordinate skill in content and learning level.
D. Discriminate between: instructional objectives and instructional activities; instructional objectives and goals.
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 2. Behavior
A. The action or "verb" part of the objective.
B. Specifies the action part of the skill the student is to perform. Helps guide construction of test items or tasks.
C. Clear, observable, measurable. Appropriate for learners.
D. Discriminate between behavior and the other parts (content, conditions, criteria).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 3. Content
A. The part of the objective that states the subject matter or topic of learning.
B. Identifies the topic or subject matter the student is learning. Serves as the basis of lesson planning, material selection, and test items or tasks.
C. Clear, relevant, observable, measurable. Appropriate for learners.
D. Discriminate between content and the other parts (behavior, conditions, criteria).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 4. Conditions
A. The part of the objective that specifies equipment and materials. Usually at the beginning of the objective and starts with "Given ...".
B. Specifies the equipment or materials the learner needs to perform the skill. Assists in setting the level of difficulty. Helps ensure the test item will match the instruction.
C. Clear, relevant, practical. Appropriate for learners and context.
D. Discriminate between conditions and the other parts (behavior, content, criteria).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 5. Criteria
A. The part of the objective that indicates the mastery level.
B. Indicates the level of mastery at which the skill is to be performed (correct 85% of the time; correct to within 3 feet of the target, etc.).
C. Clear, observable, measurable. Appropriate for learners and content.
D. Discriminate between criteria and the other parts (content, conditions, behavior).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Now let's review steps we take in deciding on the type of test question. After considering the goals and
subordinate skills within a specific unit, a teacher will then determine the best way to measure students'
skills. As you examine the behaviors from the first three columns under the learning levels (Columns A - C: "state/recall") of Table 3.1, try to imagine the best way to measure students' mastery of these lower-order skills. What do we mean by best? In most cases, we mean "what type of instrument or what set of
tasks will provide the most valid and reliable results with maximum authenticity, feasibility, and
efficiency." (Remember design characteristics from Module 2?)
Now examine the last Column E ("create"). Would you select the same types of items to measure the
subskills in this column as you chose for Columns A - C? Probably not. Skills from Columns A - C could be
efficiently measured with selected response type items like multiple choice or matching while Column E
would be measured best with a written response item such as a short answer format or even as part of a
product exam (such as a portfolio that contains lesson plans and examples of instruments to measure
student outcomes following the lessons). Both types of formats can yield valid and reliable results but
one type is more authentic than the other and so may be more desirable in this context. It would not be
efficient to measure the definitions such as are found in Columns A - C with short answer items (because
of the length of time to grade, subjectivity of scoring, etc.). Likewise, the best way to measure the skill
"create an instructional objective" would not be with a multiple choice or true false item. In the next
module section, we will study more about creating various objective style items.
Now we can practice tracing the development of items using the subordinate skills found in the
framework above. Examine the table below to trace the path of some objective style items. You will be
asked to create an objective and item of your own as well.
The table below pairs each subordinate skill from Table 3.1 with an instructional objective for that skill and an example of a test item or task to measure it.

Subordinate skill 1.B: State or recall the function of an instructional objective.
Instructional objective 1.B.1: Given the term "instructional objective" and a list of purposes of related instructional terms, recall the purpose of an instructional objective.
Item for skill 1.B.1:
For what reasons do we use instructional objectives?
a. Foundation of lesson plans
b. Basis for test items
c. Keep accurate attendance
d. Both a and b are correct.
Instructional objective 1.B.2: Given the term "instructional objective", state the purpose.
Item for skill 1.B.2:
What is the purpose for writing an instructional objective?
________________________________

Subordinate skill 1.E: Create an instructional objective.
Instructional objective 1.E.1: Given a subordinate skill, write a complete instructional objective for the skill.
Item for skill 1.E.1:
Use the following subordinate skill to create an instructional objective: Identify the migratory patterns of birds in the western hemisphere.
________________________________
Instructional objective 1.E.2: Given a goal framework and instructions for the task of creating an authentic lesson plan and test, create an instructional objective and test item from a skill in the framework.
Task for skill 1.E.2:
(This task would be one part of a larger portfolio assessment that would include the instructions for creating objectives, designing lessons, and creating tests to measure student learning outcomes following the lessons.) For example:
...For each of the skills in the framework you have selected, write an appropriate instructional objective to serve as the foundation of your lesson plan and test. Review the objective you created to see if it contains the necessary parts...

Subordinate skill 4.D
Instructional objective: (Now it's your turn. Create an instructional objective for this subskill: 4.D.)
Item: (Yes, now it's time to try your hand at writing a test item for that instructional objective. Don't worry, we will learn more about this in another module section.)
Discuss and post the instructional objective and test item (if you can create some items at this point) for your Final Project under your group's Discussion forum. Review and offer constructive criticism on the postings of your classmates to improve the objectives and test items (you can use the criteria found in Table 3.1, Columns A - C, to remind yourself of the characteristics of good instructional objectives).
Module 2 - Part 6: Test Design General Principles
While you were reviewing the professional standards related to assessment from your professional
organization, did you notice that it was important to be able to develop or select a variety of high quality
assessment instruments? For example, review the Assessment standard in the Florida Accomplished
Practices. (Remember, links to other professional standards related to assessment are found on Page 1.)
These principles will guide you as you are creating or selecting instruments to use in your specific
teaching/learning context. In this module, we will consider principles of good test design. The two most
important characteristics of a high quality instrument are: validity and reliability. They are followed by
utility (feasibility, cost, etc.). The principles described here will help you create instruments that will
provide valid and reliable results for your decision making needs.
You can also apply many of these design principles to the state and district-wide tests that you are
required to administer and interpret. As professionals, we must evaluate these instruments using design
criteria that are based on research and best practice rather than popular opinion. Tests that are well-designed and appropriately administered would yield more useful results for our decision-making
purposes. Results from those less well-designed and poorly administered should be interpreted with
great caution, if at all.
Most authors and researchers in the field of measurement agree on a similar set of principles that guide
effective test design and construction. When these principles are employed, the instruments will tend to
provide more valid and reliable results than if the principles are not employed. Remember, we are trying
to set the students up for success. Nothing, except lack of proficiency, should get in the way of their
successful performance. The principles that will help you ensure validity, reliability, and utility of results
include:
Systematically planned
Good match for content
Good match for learners
Feasibility
Professional in format and appearance
Systematically Planned and Constructed
Teachers must consider their resources (time, materials, cost, professional skill, etc.) along with the
entire instructional curriculum and then set up an assessment system at the beginning of the year that
will support their instructional system. The two systems go hand in hand and are widely considered two
parts of a single Instruction/Assessment system. However you wish to describe it, the assessment
system requires a variety of good testing materials and procedures. Effective teachers use good
assessment techniques to gather and analyze the information needed for the many decisions they must
make.
A good assessment system considers the types and number of decisions a teacher must make
throughout the term and includes reasonable methods for gathering information needed to make the
most informed decisions. Some methods will be formal such as objective-style or alternative tests
(essay, product, or performance tests, portfolios) and some will be informal such as anecdotal
observations. The plan should be tailored to your specific context and resources. It is a challenge to find
the balance between a plan that is both comprehensive and systematic as well as feasible. Using a
spreadsheet program or a commercially published instructional manager program may facilitate your
effort. Once you have a plan in place, you can then select or create the best instrument for the type of
decision to be made, for the content or attribute to be measured, and for the students or participants
from which you will gather information. Specific procedures for constructing an instrument (Test
Blueprint and Item Design) will be covered in another section of this module.
Table 2.1 Example of Generic Assessment Plan for a School Term
1. What decisions/information is needed? Quality of student progress on instructional objectives (the plan would list the actual objectives): objective # (keyed to SSS); Individualized Educational Plan short- or long-term objective #; other objectives specific to the context.
When to gather data?* 1.1 Following individual lessons (include dates); 1.2 following units, etc.
How will information be collected?** Various methods: quizzes with both selected and written response items; portfolio entries; unit exams.
2. What decisions/information is needed? Quality of data-gathering procedures. Target the instruments that are new, most critical, or have not been subject to evaluation in the past (list the specific instruments that were targeted, e.g., Quiz 3, Portfolio entry #6, etc.).
When to gather data?* 2.1 Following each targeted objective-style exam (review item analysis data); 2.2 upon completion of grading each targeted alternative-style test.
How will information be collected?** Quantitative and logical analysis procedures (keyed to the targeted instruments): use Excel or another spreadsheet program or management tool to calculate difficulty and discrimination indexes, conduct distractor analysis, and calculate reliability estimates; use logical analysis by reviewing group performance on the alternative exams and portfolio entries; review blueprints for validity.
3. What decisions/information is needed? Quality of instructional techniques and materials.
When to gather data?* 3.1 Following individual lessons (include dates); 3.2 following units, etc.; 3.3 following the implementation of a set of materials or techniques.
How will information be collected?** Various methods: quizzes with both selected and written response items; portfolio entries; unit exams; student satisfaction questionnaires, interviews, observations; notes from discussions with colleagues.
4. What decisions/information is needed? Student attitudes: motivation, satisfaction (see attitudinal attributes listed on report cards).
When to gather data?* 4.1 Toward the beginning, middle, and end of the term (list actual dates).
How will information be collected?** Variety of methods: interviews with a sample of students; anonymous questionnaires; interviews with a sample of parents; informal observation of behavior.
5. What decisions/information is needed? Professional self-evaluation: effectiveness of instruction, ethics, satisfaction with job.
When to gather data?* 5.1 Toward the beginning, middle, and end of the term (list actual dates).
How will information be collected?** Variety of methods: peer evaluation (list dates and peers involved); Accomplished Practice indicators; supervisor evaluation (list dates and methods, e.g., interview, observation, etc.); other indicators specific to my context (school, district, professional organization); quantitative and logical analysis of student performance data (specify the specific data sets to be used); student feedback (specify data sets).
6. What decisions/information is needed? Others as needed for your professional context.
When to gather data?* Others as needed.
How will information be collected?** Others as needed.
*The actual plan would include estimated dates of the data collection events. **The actual plan could include the specific planned instrument or data-gathering technique keyed to the decision in Column 1 and the date in Column 2 of the table.
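The quantitative item analysis mentioned in row 2 of Table 2.1 (difficulty and discrimination indexes) can be done in a spreadsheet or with a short script. The Python sketch below is only an illustration, not a required procedure: it assumes each item has already been scored 1 (correct) or 0 (incorrect) for each student, and the small response matrix is made up. Items with discrimination values near zero or below are usually flagged for review.

# Each inner list is one student's scored responses (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]

n_students = len(responses)
n_items = len(responses[0])
totals = [sum(student) for student in responses]

# Difficulty index: proportion of students answering the item correctly
difficulty = [sum(student[i] for student in responses) / n_students for i in range(n_items)]

# Discrimination index: item difficulty in the top-scoring half minus the bottom half
ranked = [student for _, student in sorted(zip(totals, responses), key=lambda pair: pair[0], reverse=True)]
half = n_students // 2
upper, lower = ranked[:half], ranked[half:]
discrimination = [
    sum(s[i] for s in upper) / len(upper) - sum(s[i] for s in lower) / len(lower)
    for i in range(n_items)
]

for i in range(n_items):
    print(f"Item {i + 1}: difficulty = {difficulty[i]:.2f}, discrimination = {discrimination[i]:.2f}")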
Try conducting a quick search using your favorite search tool (Google, Yahoo, etc.) to find available
course management tools that might be useful in your particular teaching/learning context. (I used
"instructional management software" for search terms and came up with many hits.) As you can see,
educational materials are part of a multi-billion dollar industry with very effective marketing strategies.
As a professional, what criteria would you use to select from among the many possibilities? Compare
the tools you find in your search with those your district may already be using. Then come back and
continue studying the important test design considerations. Will you change your mind after learning
more?
Designed to Fit Characteristics of Content
As mentioned previously, it is important to pick the right kind of instrument (or items) for the objectives
you are trying to measure. This generally takes some analysis of the instructional content. You can use
one of the educational taxonomies in Taxonomy of Educational Objectives (Chapter 5) to build a goal framework. The framework serves as the basis of your instruction and assessment for the lesson or unit. It includes, at minimum, the content and learning levels contained within a goal, but could be as large in scope as a unit or even a year's worth of instruction.
Frameworks may come with your commercially published teaching materials, may be available from
your professional organization's resources, or are available online from other sources. If goal
frameworks of subordinate skills do not exist for the particular subject or goals you are teaching, it may
be worth your time to develop a rudimentary set on your own. They are essential for planning effective
instruction and assessment. Table 2.2 is an example framework for the instructional objective: Given a
subordinate skill, write an instructional objective. Notice how the content is clearly specified (row
headings) and the learning levels at which students are to perform have been identified (column
headings).
Imagine that you have been asked to teach a workshop for paraprofessionals at your school. The topic
is: How to Write Learning Objectives for Planning Lessons and Assessments. As you prepare to design
the instructional activities and then measures that will help determine whether your workshop
participants learned anything from the lessons, wouldn't it be useful to have a resource like this? It
would help to ensure your lessons were complete (covered all the subskills) and then would help you go
on to create a posttest that really matched the instruction. Review Table 2.2 and then locate
frameworks of skills in your own subject area and grade level. These may be provided by your school
district or you may need to search online. Compare the various frameworks you find for clarity,
comprehensiveness, level of content expertise, etc.
Table 2.2 Instructional Goal Framework for the Goal: Write instructional objectives.
Learning levels (columns): Knowledge: A. State/Recall Physical Characteristics; B. State/Recall Functional Characteristics; C. State/Recall Quality Characteristics. Comprehension: D. Discriminate Examples and Non-examples. Application: E. Create an example. Evaluation: F. Evaluate given examples.
Row 1. Instructional Objective
A. A statement of learning outcome that contains 3 - 4 parts (below).
B. Serves as the foundation of instructional planning and assessment.
C. Clear, appropriate scope. Matches subordinate skill in content and learning level.
D. Discriminate between: instructional objectives and instructional activities; instructional objectives and goals.
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 2. Behavior
A. The action or "verb" part of the objective.
B. Specifies the action part of the skill the student is to perform. Helps guide construction of test items or tasks.
C. Clear, observable, measurable. Appropriate for learners.
D. Discriminate between behavior and the other parts (content, conditions, criteria).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 3. Content
A. The part of the objective that states the subject matter or topic of learning.
B. Identifies the topic or subject matter the student is learning. Serves as the basis of lesson planning, material selection, and test items or tasks.
C. Clear, relevant, observable, measurable. Appropriate for learners.
D. Discriminate between content and the other parts (behavior, conditions, criteria).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 4. Conditions
A. The part of the objective that specifies equipment and materials. Usually at the beginning of the objective and starts with "Given ...".
B. Specifies the equipment or materials the learner needs to perform the skill. Assists in setting the level of difficulty. Helps ensure the test item will match the instruction.
C. Clear, relevant, practical. Appropriate for learners and context.
D. Discriminate between conditions and the other parts (behavior, content, criteria).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
Row 5. Criteria
A. The part of the objective that indicates the mastery level.
B. Indicates the level of mastery at which the skill is to be performed (correct 85% of the time; correct to within 3 feet of the target, etc.).
C. Clear, observable, measurable. Appropriate for learners and content.
D. Discriminate between criteria and the other parts (content, conditions, behavior).
E. Create an example.
F. Evaluate given examples (use criteria from columns A - C).
After considering the goals and subordinate skills, a teacher will then determine the best way to
measure students' progress in relation to those skills. As you examine the behaviors from the first three
columns under the learning levels (Columns A - C: "state/recall") of Table 2.2, try to imagine the best
way to measure students' mastery of these lower-order skills. What do we mean by best? In most cases,
we mean "what type of instrument or what set of tasks will provide the most valid and reliable results
with maximum authenticity, feasibility, and efficiency."
Now examine the last Column E ("create"). Would you select the same types of items to measure the
subskills in this column as you chose for Columns A - C? Probably not. Skills from Columns A - C could be
efficiently measured with selected response type items like multiple choice or matching while Column E
would be measured best with a written response item such as a short answer format or even as part of a
product exam (such as a portfolio that contains lesson plans and examples of instruments to measure
student outcomes following the lessons). Both types of formats can yield valid and reliable results but
one type is more authentic than the other and so may be more desirable in this context. It would not be
efficient to measure the definitions such as are found in Columns A - C with short answer items (because
of the length of time to grade, subjectivity of scoring, etc.). Likewise, the best way to measure the skill
"create an instructional objective" would not be with a multiple choice or true false item. In the next
module section, we will look at an example of this process.
Designed to Fit Characteristics of Learners
It is very important that instruments are designed so that they are a good match for the characteristics
of the learners. What characteristics of learners (in addition to their proficiency with the skill) will have
an impact on students' performances? Did you think of factors such as reading and vocabulary level?
These are some of the many characteristics to keep in mind during test design and construction.
Consider developmental characteristics specific to the age group (attention span, interests, physical
dexterity, ...).
Keep in mind whether or not students have physical, cognitive, or social-emotional challenges (visual
impairment, cerebral palsy, developmental delay or severe intellectual disability, a specific learning
disability, behavior disorders, ...). It is especially important to note whether they are receiving
Exceptional Student Education (ESE) Services and have specific testing accommodations identified on
their Individualized Educational Plan (IEP).
Consider whether the learners speak English as a second language.
Be aware of cultural backgrounds of students in the group. Their background may influence the way
they approach an exam, the way they interpret questions, and/or the way they respond to the tasks or
questions.
Consider students' experiential backgrounds. For example, if they have not been to a snowy climate
then it may not be a good idea to include a snow scenario as background in an item (unless this is a class
on weather patterns, of course).
Be aware of students' prior experience with any equipment needed to perform the skill (if students
practiced with a plastic ball and bat then it would not be good to suddenly produce a real baseball and
bat for the actual performance test; similarly, if they practiced writing paragraphs with a paper and
pencil then you would not provide a PC on which to take the exam unless you knew they have word
processing experience).
Designed for Maximum Feasibility
Besides the obvious feasibility issues related to cost, availability of materials, and access to special equipment or contexts (electron microscopes, different-sized rooms and tape measures for calculating area, etc.), time is an important factor when it comes to the feasibility of testing procedures. Time to select or design the instrument, time to create it, and time to score students' work all relate to feasibility. Is there
time to take the entire class of students with severe developmental delay to the grocery store to
measure whether they can shop with a list, make purchases using both coins and paper money, make
sensible purchases, and stay within their budget? The most authentic determination of their skills would
be to observe them during an actual performance. The logistics of setting up these testing procedures
would likely be prohibitive. In this case, a teacher would try to determine the next most authentic
context (e.g. a classroom simulation of a grocery store). In this instance, we would lose some
authenticity but may gain in reliability. After all, how much attention could you give to the individual
student you were trying to observe when you must protect the safety of the other 19 students in the
class at the same time?
Professional in Format and Appearance
Here we must consider other important details that may contribute to or inhibit students' successful
demonstration of skills during test administration. As effective educators we want our materials to
appear professional in quality. Mistakes and lack of clarity can inhibit students' performance. They are
distracting to students and may even compromise students' positive attitudes toward test taking. The
following are design considerations to keep in mind to achieve a professional caliber result. It's a good
idea to review your testing materials with fresh eyes and even to hunt down some help with this
(colleagues, other students, a willing family member - but not Fido).
absence of bias (culture, race, gender, religion, politics, etc.)
spelling
grammar
legibility
clarity
Applying these design principles will help ensure your instruments will provide the most valid and
reliable results possible when the tests you create or select are administered to students. As you gain
experience, you will probably start to apply the principles more automatically but even seasoned
professionals benefit from reviewing them occasionally. The time we invest as we create the
instruments is worth the payoff in getting the best possible information available to make the important
decisions we must make on our jobs every day.
Module 2 - Part 7: Creating Objective-style Items
Objective style items are very widely used and have many advantages when it comes to classroom
assessment. They are versatile and can measure any level of learning, although there are some types of
learning (psychomotor and attitudes) that they are not able to measure. (An attitudinal survey that uses
a Likert type scale is not considered a multiple choice test). Objective-style items include multiple-choice, matching, completion, and alternative response type items like true-false questions.
Make sure you have studied the guidelines for writing test items and the advantages and disadvantages
of the various item types.
Creating Objective-style Test Items
While reading your text, you have studied the rules or guidelines for creating each of the objective-style
items. There are many sets of guidelines available in measurement textbooks and online. You have
probably discovered that it takes some practice to write high-quality objective items, even once you have learned the guidelines to be followed to ensure validity and reliability. In the table of contents
for this module, you will find another set of guidelines by Popham (2003). Now try using the guidelines
from our text and the Popham handout to evaluate existing items.
Locate a couple of existing tests that contain objective-style items. Try to pick from a couple different
sources: one that you may have created in the past, one that is included in a commercially published
text, or one that you find online for example. Evaluate the quality of the items on the tests using the
criteria from the textbook (Chapter 7) and Popham handout ("Guidelines for Item Construction" under
the Materials Tool). You do not need to post or submit this; it is just for your own practice.
Test Instructions and Format
Now we will learn some recommendations for the format of the instructions you include on a test. To
the extent possible, you should include these elements in the instructions in this order:
Begin the instructions by stating the skill the students are to perform.
Direct their attention to any stimulus material they need to respond to the item.
Tell them how they are to respond; or how they are to record their answer.
Finally, give them any additional information they may need (number of points, amount of time,
whether to show their work or not, etc.)
Students who get their instructions in this order will likely do better than students who take a test with
the instructions that are not in this sequence. Take a look at an example of this sequence applied to a
set of matching items. This set comes from the textbook (even though our text author did not use this
sequence in the instructions).
Recall events associated with United States presidents. Column A describes events associated with
United States presidents. Column B contains names of presidents. Find the name that matches the event
and write the letter that matches that president in the space beside the event. Each name may only be
used once.
Column A: Events
_____ 1. A president not elected to office.
_____ 2. Delivered the Emancipation Proclamation.
_____ 3. Only president to resign from office.
_____ 4. Only president elected for more than two terms.
_____ 5. Our first president.
Column B: Names of United States presidents
a. Abraham Lincoln
b. Richard Nixon
c. Gerald Ford
d. George Washington
e. Franklin Roosevelt
f. Theodore Roosevelt
g. Thomas Jefferson
h. Woodrow Wilson
Practice Exercise
Now try evaluating and creating a variety of objective-style items on your own. Download the Objective Item Writing Practice. It is a practice exercise to evaluate faulty items and then create items on your own.
Module 2 - Practice Exercises
Norm and Criterion-Referenced Interpretations and Content Validity - click on the Next arrow in the left corner to take the self-test
Example Tables of Test Specifications - download this Word document for the practice exercise
Objective Item Writing Practice - download this Word document for the practice exercise
Module 3 Overview
The concepts in this module are important whether you are using measurement skills as a teacher in a
classroom or in another professional role such as school leader, counselor, instructional designer, or
researcher. As you begin, consider all the ways that proficiency in measurement and evaluation is vital
to your effective professional performance. Consider how these measurement skills can assist you in
performing your role, consistent with your professional philosophy, and with high quality information at
your fingertips to make effective decisions. These measurement skills are also important for interpreting
and conducting research (teacher or school leader action research, school leaders' or private, non-profit
evaluation research, scholarly research).
Module 3 corresponds to Chapters 8, 9, & 10 in our textbook. We will learn the use of alternative
assessments, writing and scoring performance-based tasks, portfolio assessment, and writing and
scoring essay items. Content in this module relates to the text but includes content not found in the
textbook as well.
One of the most important attributes of high quality assessment is the validity of results (the extent that
inferences that we make from the results are appropriate). One of the most important steps to ensuring
validity is identifying what it is you want to assess (who and what), for what purpose (why), and under
what conditions (how). In this module, we will learn skills that will help you enhance the validity of
results of alternative tests you use, create, or evaluate for research purposes.
The table below contains the objectives, readings, learning activities, and assignments for Module 3.
Module 3 focuses on the following objectives:
Chapter 8
Identify the types of learning outcomes for which essays are best suited.
Identify situations in which use of essay items is appropriate.
Construct a complete extended response essay item, including a detailed scoring scheme that considers content, organization, and process criteria.
Distinguish between the assessment of knowledge organization and concepts.
Chapter 9
Develop a scoring rubric.
Develop a primary trait scoring scheme.
Identify the primary constraints that must be decided on when developing a performance measure.
Compare and contrast student portfolios with other performance assessment measures.
Chapter 10
Identify the cognitive skills that will be assessed by student portfolios.
Identify the pitfalls that can undermine the validity of portfolio assessment.
Prepare instructions to students for how work gets turned in and returned.
Construct the criteria to use in judging the extent to which the purposes for portfolios are achieved.
Complete a portfolio development checklist to ensure the quality of the portfolio.
Chapters 8, 9, & 10 in text
Content and articles specified within module
Readings
Professional standards in your field (see list under
Materials tool)
(selected student performance standards from)
Florida Sunshine State Standards found at
http://www.floridastandards.org/index.aspx
Several non-posted practice tasks
Learning
Activities
Posting to working group (example objectives,
restricted/extended response essay questions with
scoring procedures, and critique; example
performance assessment with critique)
Assignments
Continue your work on the Final Project (Please read
the instructions of the Final Project carefully)
Part 1: Essay, Product, & Performance Assessment
Basic Characteristics of Essay Tests
Use the following framework as a study guide for your review and practice with creating essay tests.
Content within the cells provides a review of the content from Chapter 8 of the Kubiszyn & Borich text.
(Are you gaining a better understanding of goal frameworks? Are you getting closer to being able to
develop one on your own?)
Content: 1. Essay Test
Learning Levels:

A. State or Recall Physical Characteristics
Test with questions to which students supply responses
Test questions, scoring rubric, and a planned set of procedures for administering and scoring the test
Two types: restricted response (1 page or less) and extended response (more than 1 and fewer than 20 pages)

B. State or Recall Functional Characteristics
Measures complex cognitive skills or processes and communication skills
Requires an original response from the student
Reduces guessing
Requires students to organize, integrate, and synthesize knowledge
Requires students to use information to solve problems
Able to sample less content
Relatively easy to construct; requires longer time to score
Less reliable than objective tests
No single correct answer; subject to bluffing
Requires that the scorer be knowledgeable, possibly expert

C. State or Recall Quality Characteristics
Question provides appropriate structure to students' responses
Enables consistency in scoring when clear and when it contains appropriate guidance and organizational information (specifies response length, number of points or amount of time, and other scoring criteria that will be used)

D. Discriminate Examples from Non-Examples
Essays from objective items (for both objectives and items)
Essays from active performance
Essays from non-written products

E. Evaluate Examples (flaws to detect)
Mismatch of format with objective
Inappropriately measures lower-level skill
Unclear
Fails to specify length or the criteria on which the response will be graded

F. Create Examples
Restricted and extended types at various levels of complexity for various age groups, content areas, and student types
Now that you have reviewed the features of essay tests, examine the following benchmarks from
various instructional standards (e.g., Sunshine State Standards). In the table that follows, you will find
pairs of standards within a subject area. Within the pairs, determine which would be better measured
with objective style items and which would be more appropriately measured with essay items.
Suggested feedback can be found in a table immediately following the examples.
Learning Outcome | Type of test? (objective or essay)

Example 1 (Pre K - 2; Technology Standards)
1. Prior to completion of Grade 2 students will communicate about technology using developmentally appropriate and accurate terminology.
2. Use technology resources (e.g., puzzles, logical thinking programs, writing tools, digital cameras, drawing tools) for problem solving, communication, and illustration of thoughts, ideas, and stories.
Objective: #___  Essay: #___
International Society for Technology in Education (2005). Standards for students. Retrieved February 10, 2005 from http://cnets.iste.org/currstands/

Example 2 (Grades 9 - 12; Social Studies Standards)
1. Understands how government taxes, policies, and programs affect individuals, groups, businesses, and regions.
2. Understands basic terms and indicators associated with levels of economic performance and the state of the economy.
Objective: #___  Essay: #___
Florida Department of Education (2005). Sunshine state standards: Social studies grades 9 - 12. Economics standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html

Example 3 (Grades 3 - 5; Language Arts)
1. Identifies the author's purpose in a simple text.
2. Reads and organizes information for a variety of purposes, including making a report, conducting interviews, taking a test, and performing authentic work.
Objective: #___  Essay: #___
Florida Department of Education (2005). Sunshine state standards: Language arts grades 3 - 5. Reading standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.htm

Example 4 (Grades 6 - 8; Science)
1. Knows that the structural basis of most organisms is the cell and most organisms are single cells, while some, including humans, are multicellular.
2. Knows that behavior is a response to the environment and influences growth, development, maintenance, and reproduction.
Objective: #___  Essay: #___
Florida Department of Education (2005). Sunshine state standards: Science grades 6 - 8. Processes of life standard 1. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html

Example 5 (Grades 9 - 12; The Arts: Music)
1. Understands the musical elements and expressive techniques (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement) that generate aesthetic responses.
2. Analyzes music events within a composition using appropriate music principles and technical vocabulary.
Objective: #___  Essay: #___
Florida Department of Education (2005). Sunshine state standards: The arts: Music grades 9 - 12. Standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html

Example 6 (Grades 4 - 8; ESL)
1. Identify and associate written symbols with words (e.g., written numerals with spoken numbers, the compass rose with directional words).
2. Take a position and support it orally or in writing.
Objective: #___  Essay: #___
Teachers of English to Speakers of Other Languages (2005). ESL Standards for Pre-K-12 Students, Online Edition. Retrieved February 10, 2005, from http://www.tesol.org/s_tesol/seccss.asp?CID=113&DID=1583
Compare your responses with those in the feedback table below.
Learning Outcome | Type of test? (objective or essay)

Example 1 (Pre K - 2; Technology Standards)
1. Prior to completion of Grade 2 students will communicate about technology using developmentally appropriate and accurate terminology.
2. Use technology resources (e.g., puzzles, logical thinking programs, writing tools, digital cameras, drawing tools) for problem solving, communication, and illustration of thoughts, ideas, and stories.
Objective: #_1_  Essay: #_2_
International Society for Technology in Education (2005). Standards for students. Retrieved February 10, 2005 from http://cnets.iste.org/currstands/

Example 2 (Grades 9 - 12; Social Studies Standards)
1. Understands how government taxes, policies, and programs affect individuals, groups, businesses, and regions.
2. Understands basic terms and indicators associated with levels of economic performance and the state of the economy.
Objective: #_2_  Essay: #_1_
Florida Department of Education (2005). Sunshine state standards: Social studies grades 9 - 12. Economics standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html

Example 3 (Grades 3 - 5; Language Arts)
1. Identifies the author's purpose in a simple text.
2. Reads and organizes information for a variety of purposes, including making a report, conducting interviews, taking a test, and performing authentic work.
Objective: #_1_  Essay: #_2_
Florida Department of Education (2005). Sunshine state standards: Language arts grades 3 - 5. Reading standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.htm

Example 4 (Grades 6 - 8; Science)
1. Knows that the structural basis of most organisms is the cell and most organisms are single cells, while some, including humans, are multicellular.
2. Knows that behavior is a response to the environment and influences growth, development, maintenance, and reproduction.
Objective: #_1_  Essay: #_2_
Florida Department of Education (2005). Sunshine state standards: Science grades 6 - 8. Processes of life standard 1. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html

Example 5 (Grades 9 - 12; The Arts: Music)
1. Understands the musical elements and expressive techniques (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement) that generate aesthetic responses.
2. Analyzes music events within a composition using appropriate music principles and technical vocabulary.
Objective: #_1_  Essay: #_2_
Florida Department of Education (2005). Sunshine state standards: The arts: Music grades 9 - 12. Standard 2. Retrieved February 10, 2005 from http://www.firn.edu/doe/curric/prek12/index.html

Example 6 (Grades 4 - 8; ESL)
1. Identify and associate written symbols with words (e.g., written numerals with spoken numbers, the compass rose with directional words).
2. Take a position and support it orally or in writing.
Objective: #_1_  Essay: #_2_
Teachers of English to Speakers of Other Languages (2005). ESL Standards for Pre-K-12 Students, Online Edition. Retrieved February 10, 2005, from http://www.tesol.org/s_tesol/seccss.asp?CID=113&DID=1583
How did you do? You may want to go to the Students Helping Students Discussion topic for some peer
feedback if you disagreed with many of the choices. You might also want to check with your group
members in case they need a little help. Remember, just because these objectives contrasted lower and
higher order skills for discriminating objective and essay type items, it doesn't mean you can't write
higher-order objective style items. This is a common misconception based on poor practices from the
past (objective style items were written primarily at the knowledge and comprehension level; skilled
objective style item writers like you can write multiple written- and selected-response items at higher
learning levels). The Nitko & Brookhart (2007) textbook has better coverage than most textbooks on
how to write objective-style items at higher cognitive levels. You may want to consult that text for your
item construction in the future.
As a test designer, it is important to think about the levels of learning that are intended by the behaviors
in the Sunshine State Standards. When the behaviors are ambiguous, the learning activities and test
items are less likely to be congruent with the standards. This can become a challenge to the validity of
test results if care is not taken to clarify before planning learning activities and in turn the best
instruments to measure students' performance. In Part 2 of this module we will contrast the two types
of essay items (restricted and extended response). We are continuing to learn how to select the best
measurement procedure for the targeted skills and the needs of our learners. This is critical to the
validity of test results.
Part 2: Extended and Restricted Response Formats
Now that you are familiar with the features of essay items in general, examine the differences between
the restricted response and extended response varieties.
Distinguish Between Restricted and Extended Response Essay Items
Use the goal framework that follows to review the characteristics of restricted and extended response
essay items. Consider educational contexts with which you are familiar. Think about when the restricted
response format is more appropriate and when the extended response is more appropriate in those
contexts. Try to imagine some examples of the different formats as you are working through the
contexts.
Content: 1. Restricted Response Essay
Learning Levels:

A. State or Recall Physical Characteristics
Test with questions to which students supply responses
Test questions, scoring rubric, and a planned set of procedures for administering and scoring the test
Restricted response (1 page or less)

B. State or Recall Functional Characteristics
Often used in conjunction with objective style items
Causes students to recall and organize information, then to draw conclusions and present them within imposed constraints of time and length
Often used to assess knowledge, comprehension, and application level skills
Due to time constraints, better to use with small classes, in smaller numbers, or when fewer objectives need to be covered

C. State or Recall Quality Characteristics
Good to use when test security is an issue
Good to use when information must be supplied rather than recognized or selected
Used more frequently than extended response items
Can cover somewhat more content than extended response
Can be scored with more reliability than extended response

D. Discriminate Examples from Non-Examples
Restricted from objective variety
Restricted from extended variety

E. Evaluate Examples (flaws to detect)
Mismatch of format with objective
Unclear
Fails to specify length or the criteria on which the response will be graded

F. Create Examples
Restricted variety for various age groups, content areas, and student types
Content: 1. Extended Response Essay
Learning Levels:

A. State or Recall Physical Characteristics
Test with questions to which students supply responses
Test questions, scoring rubric, and a planned set of procedures for administering and scoring the test
Student determines length and complexity of response
Longer responses than restricted response items (often more than one page, usually less than 20 pages)

B. State or Recall Functional Characteristics
Often used to assess analysis, synthesis, and evaluation level skills
Causes students to use higher order cognitive skills
Student must assemble and critically analyze information and use it to solve new problems; students synthesize concepts and principles and then predict or evaluate outcomes
Takes relatively more time to develop and score
Relatively more difficult to score
Requires more time and resources of students

C. State or Recall Quality Characteristics
Can be used to evaluate students' communication skills

D. Discriminate Examples from Non-Examples
Extended from restricted variety
Extended from active performance

E. Evaluate Examples (flaws to detect)
Mismatch of format with objective
Measures lower-level skill
Unclear
Fails to specify length or the criteria on which the response will be graded

F. Create Examples
Extended variety at various levels of complexity for various age groups, content areas, and student types
Recall that an essay test really consists of two parts. The first part is the set of questions and instructions
for students and the second part is the scoring procedures (checklist or rubric and the steps or
instructions that will be followed by the rater). We will practice with the questions part first.
Examine the essay questions below. The objectives from which they were written are included so that
you can determine the congruence between the skills specified in the objective (behavior, content,
conditions, and criteria) and the actual item. Notice how some objectives are more appropriately
measured with the shorter restricted response items while others are more appropriately measured
with the more complex extended response items.
Restricted Response Example 1
Objective:
Given the terms for selected music elements and expressive techniques used by composers (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement), explain the aesthetic responses a composer would expect them to generate.
Restricted Response Item Set:
Explain the aesthetic responses that a composer would be able to generate with each of
the following music elements. For each of the terms, briefly explain the aesthetic
response in the space provided. Each element is worth 2 points. Keep answers brief. It is
not necessary to use complete sentences but correct spelling is required. Complete all
five elements.
1. tension and release
2. tempo
3. dynamics
4. harmonic movement
5. melodic movement
Extended Response Example 1
Objective (for teacher's planning and test development purposes):
Given a musical piece from a widely known composer of the Romantic Era, analyze the piece for the five music elements and the way they were employed by the composer.
Item (to be administered as an essay test to the students):
Listen to Hungarian Rhapsody #2 Lento a Capriccio in C# by Franz Liszt. As you listen to the piece, identify an example of how each of the five music elements was used and explain the music principle associated with each. Then explain how Liszt used each of the elements in this piece to evoke an aesthetic response from the listener.
You will have one hour to listen to the piece and write your explanation. You may listen to the entire piece or parts of the piece as many times as needed. It is about 10 minutes in length so be mindful of the time. The test is worth 25 points (1 point for correctly identifying the element; 1 point for including the correct music principle behind the element; 3 points for analyzing the composer's use of the element within the piece). Make sure you identify clearly where each of the elements within the piece is located. Present your response in essay format with paragraphs and complete sentences. Spelling and grammar will not be graded on this test.
Create Restricted and Extended Response Essay Items
To ensure the validity and reliability of the essay test results, it is helpful to design the items using
guidelines developed from research and best practice. Review the suggestions for creating essay items
in the table below. You will then be asked to practice creating items on your own.
Suggestions for writing essay questions
Clearly identify the learning outcomes (content and learning levels) to be measured by the test.
Create items that demonstrate content validity (items match objectives in content and learning level).
Create items that are a good match for the characteristics of the target students.
Create questions that clearly delineate the task students are to perform.
Explain tasks, time, scope, and point values: orally, in the instructions included with the overall test, or within the individual test items.
Indicate whether spelling, grammar, and organization will count toward the overall score.
Indicate the scope of the response and to what extent supporting data are required.
Create questions that elicit higher-order rather than knowledge and comprehension level responses.
Demonstrate ethics in the content and process of test administration and scoring.
Allow reasonable time to complete the test; indicate to students the amount of time allowed.
Use essays when objective items would be inadequate or inappropriate.
Refrain from offering optional items.

Suggestions for developing scoring procedures
Specify criteria for poor, acceptable, and excellent responses ahead of time.
Determine the format needed (checklist or rating scale) based on content and learning levels.
Identify the elements that will be scored (content, organization, process of solving the problem or drawing the conclusion).
Strive for consistency; avoid drift from content; avoid drift of rigor (strict/lax).
Avoid biases.
Keep work anonymous while grading when possible.
Score all students in the group on each question before moving on to the next question.
Avoid the influence of prior questions' scores within a student's paper.
Reevaluate the scores (or at least part of them) before returning the papers.
Now practice creating restricted and extended response essay questions on your own. Use a word-processing program to list an objective and the essay question that would be derived from it. Hang onto
the question for now. In the next module section, you will be asked to create a scoring rubric to post in
your group's discussion area. It's not necessary to use this same table format, just make sure it is
apparent which items and objectives belong together. You may wish to review the list of outcomes for
which restricted response items are recommended in Chapter 8 of the text to help you get started.
Instructional objectives | Essay items
1a. (Locate an instructional objective from a set with which you are familiar, e.g., Sunshine State Standards, that would be more appropriately measured with a restricted response essay item.)
1b. (Now create a restricted response essay question following the guidelines from your text and the table above.)
2a. (Locate an instructional objective from a set with which you are familiar, e.g., Sunshine State Standards, that would be more appropriately measured with an extended response essay item.)
2b. (Now create an extended response essay question following the guidelines from your text and the table above.)
Part 3: Scoring Essays
Essay Item Scoring Procedures
Now that you are familiar with the features of restricted and extended response essay questions, it is
time to create scoring procedures for evaluating students' work. Keep in mind that many of the
recommendations for creating scoring procedures for essay items will apply to scoring other types of
product and performance measures as well (e.g., observation rating scales, rubrics for portfolio exhibits,
product exam rubrics).
Depending on the complexity of the response, you may use either a checklist or rating scale format to
evaluate the student's work. There are certain features in the design of checklists and rating scales that
enhance the reliability of scoring. A checklist is used when the elements of the response are more easily
observed and are either present or absent (all there or not there at all). If the level of quality and not
just the presence or absence of the element is to be rated, a rating scale format would be preferred. If
"degrees of correctness" are to be rated, i.e. the element is present and rated with quality categories
like "needs improvement," "adequate," and "very good," then a rating scale format would be more
desirable than a checklist.
Kubiszyn & Borich present another set of features that many use when scoring essays. These include
content, organization, and process. An example of this process can be found in Table 7.2 in the text.
Please review the example essay test found there. Others use checklists and rating scales such as are
found in Figures 8.5 - 8.8 on pp. 173 - 175 of Chapter 8 in the textbook.
Next we will examine suggested scoring procedures for the example items presented in Part 2 of this
module. Notice the overall format of the scoring guide, the format of the components being rated, and
the quality categories. Each choice the designer makes concerning the format of the scoring guide will
have an impact on the validity and reliability of instrument results.
Restricted Response Item Scoring Procedures
Example Scoring Procedure for the Restricted Response Essay Item
Recall the Objective that was presented earlier:
Given the terms for selected music elements and expressive techniques used by composers (e.g., tension and release, tempo, dynamics, and harmonic and melodic movement), explain the aesthetic responses a composer would expect them to generate.
Recall the Test Question that was presented earlier:
Explain the aesthetic responses that a composer would be able to generate with each of the following music
elements. For each of the terms, briefly explain the aesthetic response in the space provided. Each element is
worth 2 points. Keep answers brief. It is not necessary to use complete sentences but correct spelling is required.
Complete all five elements.
1. tension and release
2. tempo
3. dynamics
4. harmonic movement
5. melodic movement
Now examine the scoring rubric that might be developed for this question.
Checklist for the Aesthetic Elements of Music Short Essay Test
Name: _________________________________   Date: _________
Total Score: ______________
Components:
1. Tension and release   _____   Not present or Incomplete (0)   Somewhat complete (1)   Complete (2)
2. Tempo                 _____   Not present or Incomplete (0)   Somewhat complete (1)   Complete (2)
3. Dynamics              _____   Not present or Incomplete (0)   Somewhat complete (1)   Complete (2)
4. Harmonic movement     _____   Not present or Incomplete (0)   Somewhat complete (1)   Complete (2)
5. Melodic movement      _____   Not present or Incomplete (0)   Somewhat complete (1)   Complete (2)
Comments:
Extended Response Item Scoring Procedures
Now examine a suggested scoring procedure for the extended response essay item example.
Example Scoring Procedure for the Extended Response Essay Item
Recall the Objective that was presented earlier:
Given a musical piece from a widely known composer of the Romantic Era, analyze the piece for the five music elements and
the way they were employed by the composer.
Recall the Test Question that was presented earlier:
Listen to Hungarian Rhapsody #2 Lento a Capriccio in C# by Franz Liszt. As you listen to the piece, identify an example of how each of the five music elements was used and explain the music principle associated with each. Then explain how Liszt used each of the elements in this piece to evoke an aesthetic response from the listener.
You will have one hour to listen to the piece and write your explanation. You may listen to the entire piece or parts of the piece as many times as needed. It is about 10 minutes in length so be mindful of the time. The test is worth 25 points (1 point for correctly identifying the element; 1 point for including the correct music principle behind the element; 3 points for analyzing the composer's use of the element within the piece). Make sure you identify clearly where each of the elements within the piece is located.
Now examine the scoring rubric that might be developed for this question.
Checklist for the Aesthetic Elements of Music Essay Test
Name: _________________________________   Date: _________
Total Score: ______________
Components:
1. Tension and release   Present (1)__   Principle (1)__   Explanation: Ineffective (1)   Somewhat effective (2)   Effective (3)
2. Tempo                 Present (1)__   Principle (1)__   Explanation: Ineffective (1)   Somewhat effective (2)   Effective (3)
3. Dynamics              Present (1)__   Principle (1)__   Explanation: Ineffective (1)   Somewhat effective (2)   Effective (3)
4. Harmonic movement     Present (1)__   Principle (1)__   Explanation: Ineffective (1)   Somewhat effective (2)   Effective (3)
5. Melodic movement      Present (1)__   Principle (1)__   Explanation: Ineffective (1)   Somewhat effective (2)   Effective (3)
Subtotals: ____/10 (element identified and principle)   ____/15 (explanations)
Comments:
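To see how the 25 points are assembled from this rubric, here is a small illustrative calculation; the sample ratings are invented for demonstration.

```python
# One hypothetical student's ratings on the extended response rubric.
# Each element: present (0/1), principle (0/1), explanation quality (1-3).
elements = {
    "tension and release": (1, 1, 3),
    "tempo":               (1, 1, 2),
    "dynamics":            (1, 0, 1),
    "harmonic movement":   (1, 1, 3),
    "melodic movement":    (1, 0, 1),
}

identify_and_principle = sum(p + q for p, q, _ in elements.values())   # subtotal out of 10
explanations = sum(e for _, _, e in elements.values())                 # subtotal out of 15
total = identify_and_principle + explanations                          # out of 25
print(identify_and_principle, explanations, total)                     # 8 10 18
```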
A. Read the first article in the list that follows. Note the others for your resources and future reference.
In addition to being useful resources, they may provide some good examples as you begin the Project
Part B.
1. Tierney, Robin & Marielle Simon (2004). What's still wrong with rubrics: focusing on the consistency
of performance criteria across scale levels. Practical Assessment, Research & Evaluation, 9(2). Retrieved
May 22, 2007 from http://PAREonline.net/getvn.asp?v=9&n=2.
2. Moskal, Barbara M. (2000). Scoring rubrics: what, when and how?. Practical Assessment, Research &
Evaluation, 7(3). Retrieved May 22, 2007 from http://PAREonline.net/getvn.asp?v=7&n=3 .
3. Mertler, Craig A. (2001). Designing scoring rubrics for your classroom. Practical Assessment, Research
& Evaluation, 7(25). Retrieved May 22, 2007 from http://PAREonline.net/getvn.asp?v=7&n=25 .
Part 4: Performance Assessment
Module 3 is related to Chapters 8, 9, and 10 in the Kubiszyn & Borich (2009) textbook. Our Final Project
Part B uses the information and skills in these chapters and module as the basis for constructing a
performance assessment. You may want to read the Final Project Part B instructions after reading
Chapters 8, 9, and 10 in your text. Also, there are performance assessment resources relevant to these
skills under the Materials tool.
Performance-Based Assessment Background
You may find it helpful to build a table similar to the one below to organize important concepts related
to designing performance-based assessment.
Describe the four steps involved in construction of a performance assessment.*
Step 1:
Step 2:
Step 3:
Step 4:
Describe the three components of a good performance assessment.**
Component 1:
Component 2:
Component 3:
*Do you recognize this as content for a cell under the column State/Recall the Physical Characteristics if
we were to create a framework of subordinate skills related to performance-based assessment?
**Do you recognize this as content for a cell under the column State/Recall the Quality Characteristics if
we were to create a framework of subordinate skills related to performance-based assessment?
Measuring the Five Learner Accomplishments
Note the five types of learner accomplishments that can/should be measured with performance-based
assessments. Locate a set of instructional standards (e.g. Sunshine State Standards) of interest to you.
For each of the types of learner accomplishments in the table below, locate a standard that would call
for that type of learner accomplishment. Select examples that you would be able to use as the basis for
developing a performance measure (instructions and scoring scheme). You will later be asked to try your
hand at actually developing a performance assessment. An example of each type of standard has been
included to help you get started.
Learner Accomplishments | Standard Representing the Type of Accomplishment
Products
Example: Designs and performs real-world statistical experiments that involve more
than one variable, then analyzes results and reports findings.
Florida Department of Education (2008). Sunshine State Standards Mathematics
Grades 9 - 12: Data Analysis and Probability Standard 2. Retrieved September 20,
2008 online at http://www.paecfame.org/math_standards/Math_Standards_HighSchool.pdf
Your example:
Complex
cognitive
processes
Example: Uses a variety of maps, geographic technologies including geographic
information systems (GIS) and satellite-produced imagery, and other advanced
graphic representations to depict geographic problems.
Florida Department of Education (2008). Sunshine State Standards Social Studies
Grades 9 - 12: People Places and Environments Standard 1. Retrieved September 20,
2008 online at
http://www.floridaconservation.org/panther/teachers/plans/lesson17.pdf
Your example:
Observable
performance
Example: Students practice responsible use of technology systems, information, and
software.
International Society for Technology in Education (2005). Standards for students.
Retrieved September 20, 2008 from
http://www.doe.virginia.gov/VDOE/Superintendent/Sols/compteck12.doc
Your example:
Habits of mind
Example: Identifies specific personal listening preferences regarding fiction, drama,
literary nonfiction, and informational presentations.
Florida Department of Education (2008). Sunshine State Standards Language Arts
Grades 3 -5:The student uses listening strategies effectively Standard 2. Retrieved
September 20, 2008 online at
http://sage.pinellas.k12.fl.us/htm/SSS/SSS_GLE_Lang_3-5.htm
Your example:
Social skills
Example: Recognizes the benefits that accompany cooperation and sharing.
Florida Department of Education (2008). Sunshine State Standards Health Education
& Physical Education PreK-2:Advocate and Promote Physically Active Lifestyles
Standard 2. Retrieved September 20, 2008 online at http://www.ed.uiuc.edu/ylp/9495/PE-Benchmarks.html
Your example:
(You may want to consider one of your examples for your Project Part B learning standard.)
The following are some URLs for sites that contain examples of commercially published and teacher-made
rubrics and checklists. (Found via an Internet search using "teacher created rubrics".)
http://www.uwstout.edu/soe/profdev/rubrics.shtml#powerpoint
http://school.discovery.com/schrockguide/assess.html#rubrics
http://www.rubrician.com/science.htm
http://k6educators.about.com/gi/dynamic/offsite.htm?site=http%3A%2F%2Fwww.4teachers.org%2Fpr
ojectbased%2Fchecklist.shtml
Select two of the standards (or benchmarks contained within them) you identified for the table above
and create a performance assessment for each. Consider trying one cognitive and one affective topic to
differentiate the advantages/challenges of each. Before you begin, specify the subject area, grade level,
and a brief description of the context (e.g., 4th grade science, self-contained classroom of 25 students,
heterogeneous group, five students participating in ESOL program, three students participating in ESE
program, two-week instructional unit). Exchange one of your assessments with a member of your group
and offer constructive criticism using the criteria we have learned in chapter 8 of the text.
Part 5: Portfolio Assessment
Portfolio Advantages and Disadvantages
As you have probably gathered from your reading, portfolios have both strengths and weaknesses
when it comes to classroom assessment. Portfolio assessment is another important tool we must have
available in order to design a comprehensive and high quality assessment system in our educational
context (classroom, program, district, etc.).
Read the following research articles and relate the findings to your personal experiences with
alternative assessment.
Title: IMPACT OF A CONTENT SELECTION FRAMEWORK ON PORTFOLIO ASSESSMENT AT THE
CLASSROOM LEVEL , By: Simon, Marielle, Forgette-Giroux, Renee, Assessment in Education: Principles,
Policy & Practice, 0969594X, Mar2000, Vol. 7, Issue 1
Database: Academic Search Premier
Title: DEVELOPING A VALID AND RELIABLE PORTFOLIO ASSESSMENT IN THE PRIMARY GRADES: BUILDING
ON PRACTICAL EXPERIENCE , By: Shapley, Kelly S., Bush, M. Joan, Applied Measurement in Education,
0895-7347, April 1, 1999, Vol. 12, Issue 2
Database: Academic Search Premier
Complete the following steps for development of a portfolio assessment in a learning context familiar to
you. Use instructional objectives that would be relevant for your professional context. If you have been
planning to conduct a portfolio assessment in your own learning context, this would be a good
opportunity to begin the developmental work to ensure the validity and reliability of the instrument
results. If you are already using portfolios, this would be a good opportunity to make any necessary
revisions. This is for your practice and does not need to be submitted to the discussion area or drop box.
However, this will help you with your Project Part B.
Step 1: Identify the purposes that you would want a portfolio in your grade or content area to achieve:
1. _______________________________________________
2. _______________________________________________
3. _______________________________________________
4. _______________________________________________

Step 2: Identify the cognitive learning outcomes (e.g., metacognitive skills), several important behaviors (e.g., self-reflection, planning), and significant dispositions (e.g., flexibility, persistence) that will be reflected in your learners' portfolios:
Outcomes: _______________________________________________
Behaviors: _______________________________________________
Dispositions: _______________________________________________

Step 3: Now identify in what general curricular area you will plan your portfolio (science, geography, reading, math) and describe how you will make decisions about which content areas and how many samples within each area to include. Be sure to include several categories of content from which learners will choose representative samples:
Curricular area: _______________________________________________
Content areas: _______________________________________________
Number of samples: _______________________________________________

Step 4: Prepare a rubric for one of the content areas identified in Step 3. Also, indicate the type of scale for rating the portfolio as a whole:
Rubric: _______________________________________________
Scale: _______________________________________________

Step 5: Now you are ready to choose a procedure to aggregate all portfolio ratings and to assign a grade to the completed portfolio. Decide how you will weight (1) drafts when computing a content area rating, (2) content area ratings when they are averaged, and (3) your rating of the whole portfolio with the average rating of the content areas:
1. Drafts: _______________________________________________
2. Content areas: _______________________________________________
3. Whole portfolio: _______________________________________________

Step 6: Finally, describe how you will handle the following logistical issues:
Timelines: _______________________________________________
How products are turned in and returned: _______________________________________________
Where final products are kept: _______________________________________________
Who has access to the portfolio: _______________________________________________
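If it helps to see the Step 5 aggregation as arithmetic, here is a small illustrative sketch; the weights, ratings, and category names are hypothetical choices for demonstration, not values prescribed by the text.

```python
# Hypothetical ratings and weights, for illustration only.
draft_ratings = [2, 3, 4]           # successive drafts in one content area, each rated 1-4
draft_weights = [0.2, 0.3, 0.5]     # a design choice: later drafts count more

content_area_rating = sum(r * w for r, w in zip(draft_ratings, draft_weights))   # 3.3

area_ratings = {"reading": content_area_rating, "writing": 3.5, "research": 3.0}
average_area_rating = sum(area_ratings.values()) / len(area_ratings)

whole_portfolio_rating = 3.0        # holistic rating of the portfolio as a whole
final_rating = 0.7 * average_area_rating + 0.3 * whole_portfolio_rating

print(round(content_area_rating, 2), round(final_rating, 2))   # 3.3 3.19
```

Whatever weights you choose, the key point of Step 5 is to decide on them before scoring begins so that every portfolio is aggregated the same way.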
Here are other resources that may be useful in your quest to design or select high quality alternative
assessments. Due to copyright constraints, they are not links but addresses only. These are just for your
reference and to offer models and perspectives from a variety of interests and contexts. Select at least
one to explore in depth and make yourself aware of the others for possible future use.
ERIC/OSEP Special Project News Brief. Making alternative portfolio assessment a success.
http://ericec.org/osep/newsbriefs/news17.html
Browse through Dr. Helen Barrett's favorite links on alternative assessments and portfolios. Beware of
information overload; there are a large number of resources here, but be sure to note some related to
the use of technology for portfolio/alternative assessment development.
http://electronicportfolios.com/portfolios/bookmarks.html
Northwest Educational Resource Laboratory Assessment Scoring Guides. A variety of scoring guide
resources, note those for Spanish writing, young children, and group assessment.
http://www.nwrel.org/assessment/scoring.php
Intercultural Development Research Association Newsletter. Portfolios in Secondary ESL Classroom
Assessment: Bringing it All Together (1993). While it appears dated, this article raises important issues
that remain current. http://www.idra.org/Newslttr/1993/Nov/Adela.htm
CRESST Performance Assessment Models: Assessing Content Area Explanations. A comprehensive
resource; while dated 1992, the models (see p. 14) are current and helpful.
http://cresst96.cse.ucla.edu/CRESST/Sample/Perm.pdf (skip the first cold link, test preparation samples,
and check out the list by subject and grade level for useful examples).
The efficacy of portfolios for teacher evaluation and professional development: Do they make a
difference? See http://eaq.sagepub.com/cgi/content/abstract/39/5/572 for the abstract and full article:
Tucker et al. (2003). The Efficacy of Portfolios for Teacher Evaluation and Professional Development:
Do They Make a Difference? Educational Administration Quarterly, 39, 572-602.
Module 4 Overview
The concepts in this module are important whether you are using measurement skills as a teacher in a
classroom or in another professional role such as school leader, counselor, instructional designer, or
researcher. Consider how these measurement skills (evaluating the technical quality of instruments and
designing fair and appropriate marking systems) can assist you in performing your role, consistent with
your professional philosophy, and with high quality information at your fingertips in order to make
effective decisions. These measurement skills, describing and evaluating technical characteristics of tests
and creating marking systems, are also important for interpreting and conducting research (teacher or
school leader action research, school leaders' or private, non-profit evaluation research, scholarly
research).
Module 4 corresponds to Chapters 11 & 12 in our textbook. We will learn ways to administer a test,
evaluate test quality using item analysis, and create a marking system to grade student performance
fairly. Content in this module relates to the text but includes content not found in the textbook as well.
The most important attributes of high quality assessment include the validity and reliability of results
(the extent that inferences that we make from the results are appropriate and that results are
consistent and accurate). In order to accurately evaluate student or program performance and make
appropriate summative inferences, we must have high quality data. Conducting item and test analysis
will help to ensure you are making decisions with the best possible information.
The table below contains the objectives, readings, learning activities, and assignments for Module 4.
Module 4 focuses on the following objectives:
Chapter 11
Discriminate between quantitative and qualitative item analysis.
Identify multiple-choice options in need of modification, given quantitative item analysis data.
Identify acceptable ranges for item difficulty levels and discrimination indices.
Compute quantitative item analysis data for criterion-referenced tests using the modified norm-referenced procedures described in the text.
Interpret these data to assess the appropriateness of criterion-referenced test items.
Chapter 12
Describe the problem of mixing factors other than achievement into marks.
Define and discriminate among the five marking systems presented in the text.
Define and discriminate among the different symbol systems presented in the text.
Describe the procedure suggested in the text (i.e., equate before you weight) to minimize the likelihood that such distortions will affect final marks.
Describe the front-end and back-end equating procedures used to combine performance measures and traditional measures into a single mark.
Chapters 11 & 12 in text
Content and articles specified in module
Student Evaluation Standards (2003) - see link within
module
Readings
Professional standards in your field (see list under
Materials tool) as needed
Selected student performance standards from)
Florida Sunshine State Standards found at
http://www.floridastandards.org/index.aspx (as
needed)
Learning
Activities
Assignments
Several non-posted practice tasks within module;
also, sets of practice exercises found under Table of
Contents
Posting to class discussion topic on Grading Plan and
Item Analysis as needed to compare practice
feedback.
Start Final Project Part B
Module 4 Part 1 Evaluating Test Quality
Module 4 Part 1 corresponds to Chapter 11 in the Kubiszyn and Borich (2007) textbook. Please read the
chapter and review the PowerPoint slides (click on the link) before beginning this section. You may want to
have a calculator and some scratch paper handy. These tasks are for your practice; they do not need to
be uploaded or posted. It would be a good idea to compare your responses with your group members if
you have any confusion.
In this section of the module, we are practicing item analysis skills. These are the skills test designers use
to evaluate the quality of each individual item on the test. It is especially useful when examining the
quality of newly created items to ensure they are functioning as planned. You have learned good test
design and construction guidelines including item-writing recommendations, creating good test
directions, and appropriate test administration procedures. For this practice, we will assume students
have experienced high quality instruction. It is now time to implement procedures that enable us to
evaluate the quality of the items on the test. As we are learning this material keep in mind that
ultimately, from a criterion-referenced perspective, a teacher would like every student to answer every
item on the test correctly. In other words, we want all students to learn all of the skills that are
represented by the sample of items on the test. But does this always (ever?) happen? Not usually. So
then we examine the test item data to determine whether the obtained results were reasonable under
the circumstances. In making a judgment about the functioning of the items in the specific context, you
not only are able to evaluate test quality, you will also gain information about the performance of the
students and quality of instruction.
Difficulty Analysis
The difficulty index is an indication of how difficult a specific test item was for the group. It represents
the proportion of students answering an item correctly. The difficulty index ranges from .00 (everyone
got it wrong) to 1.00 (everyone got it right). The formula for p shows the number of students answering
the item correctly (R) divided by the number of students answering the item (n) or: p = R / n .
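To make the arithmetic concrete, here is a minimal Python sketch of this formula; the function name and the example data are illustrative only and are not taken from the textbook.

```python
def difficulty_index(responses, correct):
    """p = R / n: proportion of students answering the item correctly."""
    n = len(responses)                                        # n: students answering the item
    r = sum(1 for choice in responses if choice == correct)   # R: students answering correctly
    return r / n

# Illustrative data: 20 students' choices on a 4-option item keyed "4".
choices = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 1, 3]
print(difficulty_index(choices, 4))   # 0.85
```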
Specific criteria are used to interpret the difficulty index, depending on the test context (e.g., norm-referenced vs. criterion-referenced). The process for interpreting the difficulty index for each item
involves identifying the range of difficulty levels that would be reasonable for this group of students
with this instructional objective. You then compare the obtained index with what would be reasonable
to decide whether the item is functioning well or seems problematic.
If the item difficulty level seems either too easy or too difficult for the students under the acknowledged
conditions (achievement characteristics of the group, complexity of the material), investigate to
determine the source of the problem. Try to determine whether the item was miskeyed, ambiguous,
subject to guessing, or exhibited some other problem.
Use the criteria in Table 4.1 to interpret the p values you obtain from the data in Table 4.2. These values
are found in Carey (2003) and are appropriate for the interpretation of difficulty indices from a criterion-referenced test that was administered shortly after instruction.
Table 4.1 Standards for interpreting the Difficulty Index for items from a criterion-referenced, objective
style test following instruction.
Difficulty Index Range   Description of Item Difficulty Level*
.90 - 1.00               Very easy for the group.
.70 - .89                Moderately easy for the group.
.50 - .69                Fairly difficult for the group.
less than .50            Very difficult for the group.
*Remember that these must be interpreted with awareness of the context of the test and the characteristics of the students and objectives. Items that are close to 1.00 (very easy for the group) may be fine, or it may mean there is a problem with the item. Items with p values less than .50 are most often described as problematic from a criterion-referenced perspective. When fewer than 50% of the class answer correctly in a criterion-referenced situation following adequate to excellent instruction, there is likely a problem with the item. These standards are not applied in the same way from a norm-referenced perspective.
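If you keep item data in a spreadsheet or script, the Table 4.1 bands can be encoded directly; the helper below is an illustrative sketch of that lookup, assuming the criterion-referenced context described above.

```python
def describe_difficulty(p):
    """Map a difficulty index to the Table 4.1 descriptions
    (criterion-referenced test given shortly after instruction)."""
    if p >= 0.90:
        return "Very easy for the group"
    if p >= 0.70:
        return "Moderately easy for the group"
    if p >= 0.50:
        return "Fairly difficult for the group"
    return "Very difficult for the group"

# Checking the bands against a few sample p values.
for p in (0.80, 0.75, 0.60, 0.30, 0.70):
    print(p, describe_difficulty(p))
```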
Examine the item-by-student data in Table 4.2. Compute the difficulty index for the six items. Imagine
the items were developed from objectives that were classified according to the Learning Levels
indicated in the table and that there were three objectives covered by the test (items 1 and 2 came from
one objective, items 3 and 4 came from the second objective, and items 5 and 6 came from the third
objective). Imagine also that it was a homogeneous group of students who were struggling with even
the most basic unit objectives. Do the item difficulty indices seem reasonable under the circumstances?
Table 4.2 Item-by-student data for a practice test with six items.
Learning Levels:         K**   K     C     C     A     A
Student         Items:   1     2     3     4     5     6     Total
CR*                      4     2     1     3     2     4     (6)
Jasmin                   4     2     1     3     2     4     (6)
Alberto                  4     2     1     3     2     4     (6)
Chad                     4     2     1     3     2     4     (6)
Renee                    4     2     1     1     2     4     (5)
Gustaf                   4     2     1     2     2     1     (4)
Devone                   4     2     1     1     2     3     (4)
Maricela                 4     2     1     4     2     3     (4)
Garrett                  4     2     1     2     2     3     (4)
Brenda                   4     2     1     3     4     3     (4)
Qing                     4     2     3     3     3     4     (4)
Kadar                    4     2     1     1     2     3     (4)
Anna                     4     3     1     4     2     4     (4)
Joe                      4     2     1     3     1     1     (4)
Nadia                    4     2     4     2     2     4     (4)
Andrew                   4     2     3     1     2     4     (4)
Katrina                  4     2     2     2     1     4     (3)
Igor                     2     3     4     4     2     4     (2)
Elena                    1     3     2     2     2     4     (2)
Gordon                   3     1     3     1     3     4     (1)
Burt                     3     3     4     4     1     1     (0)
upper p                  ___   ___   ___   ___   ___   ___
lower p                  ___   ___   ___   ___   ___   ___
difficulty total         ___   ___   ___   ___   ___   ___
Note: *CR = Correct Response; numbers within cells indicate the answer selected by the student; i.e.
Jasmin selected response choice 4 for item #1 and response choice 2 for item #5. **K = Knowledge level;
C = Comprehension level; A = Application level of learning.
Table 4.2 Item Analysis Feedback
Item 1: upper p = 1.00, lower p = .60, difficulty total = .80 (moderately easy)
Item 2: upper p = 1.00, lower p = .50, difficulty total = .75 (moderately easy)
Item 3: upper p = .90, lower p = .30, difficulty total = .60 (fairly difficult)
Item 4: upper p = .50, lower p = .10, difficulty total = .30 (very difficult)
Item 5: upper p = .80, lower p = .60, difficulty total = .70 (moderately easy)
Item 6: upper p = .50, lower p = .70, difficulty total = .60 (fairly difficult)
The difficulty indices for items 1 and 2 seem reasonable. The questions covered lower-order skills
(knowledge) and 75 - 80% of the group answered them correctly. The difficulty indices for items 3 and 4
may indicate a problem. Both items cover the same comprehension-level objective, so there should not be
such a big difference between the two p values. Also, they cover lower-order skills. While the p value for
item 3 (p = .60) seems low but possibly appropriate for the group, the difficulty index for item 4 seems
very low, even for this group. The difficulty indices for items 5 and 6, while showing a fair amount of
difficulty, seem reasonable under these circumstances.
We would expect the difficulty indices to follow the pattern of complexity of the objectives: items 1 & 2
are knowledge-level items with somewhat higher p values; items 3 and 4 are somewhat more complex at
the comprehension level with slightly lower p values; and items 5 & 6 are even more complex at the
application level with even lower p values. Did the difficulty indices follow this expected pattern? Items
1, 2, 5, and 6 seemed to follow the expected pattern, but items 3 & 4 seemed both more difficult than they
should have been and too far apart in value to be appropriate. They must be investigated for possible
revision if they are to be used on future tests.
In summary, the items 1, 2, 5, and 6 seem reasonable for a practice test with this group of students and
this level of complexity. Items 3 and 4 need to be investigated for possible problems. While p = .60 for
question 3 does not seem too unreasonable for this group on a practice test, it does not seem
reasonable when compared to their performance on items 5 and 6. After all, how could they get more
difficult (application level) skills correct while getting less difficult (comprehension) skills wrong? It's a
good idea to investigate further. We will now look at how the items are able to discriminate students
who knew the material fairly well from students who did not know the material well overall.
Item Discrimination
The discrimination index (d) is another tool to help evaluate the quality of a test item. It lets us know if
the item is doing the job it was intended to do; d lets us know if the item is capable of telling us whether
or not students knew the material.
Item discrimination is based on the following assumption. Students who performed well overall on the
test are most likely the students who answer correctly on an item-to-item basis and students who
performed poorly on the overall test are likely to be the ones who miss any given item. (A logical
assumption, right? If you "buy" this assumption, you are well on your way to understanding item
discrimination.)
Item discrimination ranges from -1.00 to 1.00 and is calculated with the formula: d = [(Number of
students in the upper group who answered correctly) minus (Number of students in the lower group
who answered correctly)] divided by (Number of students in either group) or (Ru - Rl)/n of either group.
Another formula is: (p value for upper group) minus (p value for lower group) or d = pu - pl.
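As a minimal sketch (the function names are illustrative, not from the text), the two equivalent forms of the formula look like this:

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """d = (Ru - Rl) / n, where n is the number of students in either group."""
    return (upper_correct - lower_correct) / group_size

def discrimination_from_p(p_upper, p_lower):
    """Equivalent form: d = pu - pl."""
    return p_upper - p_lower

# Item 1 from Table 4.2: 10 of 10 upper-group and 6 of 10 lower-group students answered correctly.
print(discrimination_index(10, 6, 10))    # 0.4
print(discrimination_from_p(1.00, 0.60))  # 0.4, the same result
```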
The discrimination index is a function of the value of the difficulty index. Items with difficulty values
closer to .50 have more potential to discriminate between people who knew the material and people
who did not know the material. Items at the more extreme ends of the difficulty scale do not have the
same potential to tell the difference between who got it right and who got it wrong. In other words, if
everyone gets the question right (p = 1.00) the item does not have a chance to discriminate whether the
people who answered correctly came from the upper rather than the lower performers overall on the
test (because everyone performed the same - they all got it right). The same with the lowest p values (p
values between .00 and .10). If everyone got it wrong the item can't discriminate whether those who
knew the material overall answered the item correctly and those who did not know the material overall
answered the item wrong (again, because everyone performed the same - they all got it wrong).
Standards for interpretation will differ somewhat from author to author. Discrimination values of .30 or
greater would be considered strong, items of .20 or greater would be considered good, items between .00
and .20 would be considered weak but adequate, and items below .00 would be considered poor.
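Those cutoffs can be written as a simple lookup. As noted, the exact boundaries and labels vary by author, so the sketch below reflects just one reasonable reading of these standards.

```python
def describe_discrimination(d):
    """Verbal label for a discrimination index, using the cutoffs quoted above."""
    if d >= 0.30:
        return "strong"
    if d >= 0.20:
        return "good"
    if d >= 0.00:
        return "weak but adequate"
    return "poor"

for d in (0.45, 0.25, 0.10, -0.15):   # arbitrary example values
    print(d, describe_discrimination(d))
```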
Examine the item by student data in Table 4.2 (repeated here from the earlier section). Calculate and
interpret the item discrimination values for Items 1 - 6. While it is important to be able to calculate the
difficulty and discrimination values and to understand the concepts, that is only half the job. The other
half is to be able to take the information you get from interpreting the indices and put it to work to
improve the items and eventually make a better test. How are these items performing?
Table 4.2 Item-by-student data for a practice test with six items.

                      Items:   1    2    3    4    5    6    Total
Learning Levels**              K    K    C    C    A    A
CR*                            4    2    1    3    2    4    (6)
Student
Jasmin                         4    2    1    3    2    4    (6)
Alberto                        4    2    1    3    2    4    (6)
Chad                           4    2    1    3    2    4    (6)
Renee                          4    2    1    1    2    4    (5)
Gustaf                         4    2    1    2    2    1    (4)
Devone                         4    2    1    1    2    3    (4)
Maricela                       4    2    1    4    2    3    (4)
Garrett                        4    2    1    2    2    3    (4)
Brenda                         4    2    1    3    4    3    (4)
Qing                           4    2    3    3    3    4    (4)
Kadar                          4    2    1    1    2    3    (4)
Anna                           4    3    1    4    2    4    (4)
Joe                            4    2    1    3    1    1    (4)
Nadia                          4    2    4    2    2    4    (4)
Andrew                         4    2    3    1    2    4    (4)
Katrina                        4    2    2    2    1    4    (3)
Igor                           2    3    4    4    2    4    (2)
Elena                          1    3    2    2    2    4    (2)
Gordon                         3    1    3    1    3    4    (1)
Burt                           3    3    4    4    1    1    (0)
upper p
lower p
difficulty total
discrim. index
How did you do? Compare your results to the values and interpretations in the feedback table below.
Table 4.2 Item Analysis Feedback

Index                     1       2       3       4       5         6
upper p                   1.00    1.00    .90     .50     .80       .50
lower p                   .60     .50     .30     .10     .60       .70
discrimination index      .40     .50     .60     .40     .20       -.20
interpretation            good    good    good    good    adequate  poor
Items 5 and 6 are possibly in need of revision according to the discrimination indices. Notice for item 6
how more people in the group of students that scored lower overall on the test (lower group) answered
the item correctly than the group of students who tended to be more knowledgeable on the test (upper
group). This is not consistent with the assumption on which discrimination is based. In fact, it is illogical
and indicates the item likely needs revision.
Item Distractor Analysis
Distractor analysis is our final tool for investigating the quality of items. After creating test items and
trying them out on a test with students, it is important to investigate the quality of the items to make
sure they are doing their jobs: a) telling us who knew the material and who did not know the material
and b) giving us information on what aspects of the skills the students have not learned. Distractor
analysis is a procedure for examining patterns in students' response choices to detect faulty test items
(especially faulty response sets).
Distractor analysis allows us to examine the distribution of students' choices across the response set to
determine whether the distractors were functioning as intended. If they are not functioning well, we will
be able to detect this and revise them for future use.
If distractors are functioning appropriately, and if students were selecting wrong answers, we can
determine which wrong answers they were selecting. This tells us what misconceptions students may
have in relation to the skill. This is pretty important information from an instructional perspective. We
learn "who knows what" - and if they don't know, what part of the skill they are having trouble with
(especially if item stems and responses are based on learning objectives and are well written in terms of
their potential for diagnosing students' misconceptions). We are using qualitative analysis along with the
p and d values (quantitative analysis) to evaluate the quality of our test items.
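A distractor analysis is also easy to automate. The minimal Python sketch below (the function names are simply illustrative) tallies the response choices for one item and flags any distractor that no student selected; the response list shown is Item 2 from Table 4.2, read down the student rows.

    # Tally response choices for one item and flag distractors nobody selected.
    from collections import Counter

    def distractor_counts(responses, choices=(1, 2, 3, 4)):
        counts = Counter(responses)
        return {choice: counts.get(choice, 0) for choice in choices}

    def unused_distractors(responses, correct_choice, choices=(1, 2, 3, 4)):
        counts = distractor_counts(responses, choices)
        return [c for c in choices if c != correct_choice and counts[c] == 0]

    # Item 2 responses (correct response = 2), one entry per student in Table 4.2.
    item_2 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 1, 3]

    print(distractor_counts(item_2))       # {1: 1, 2: 15, 3: 4, 4: 0}
    print(unused_distractors(item_2, 2))   # [4] -> choice 4 is not functioning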
Review the pattern of responses in Table 4.2. Are there any items in which the distractors are not
functioning? In other words, are there any incorrect choices that were not selected by any of the
students? Also, are there any patterns in the responses that show signs of ambiguity or guessing? Did
any of these items appear to be miskeyed?
Table 4.2 Item Analysis Feedback for Distractor Analysis (Qualitative Judgments by Item)

Item 1: (no distractor problems noted)

Item 2: Distractor not functioning: no student selected response choice 4. Check to see if it is too obviously incorrect because of a clue, because of implausibility, or if the teacher has "taught to the test" or inadvertently used the item/response in a previous example.

Item 3: (no distractor problems noted)

Item 4: Indicates a pattern of guessing; too few answered correctly and the rest of the students' choices were "all over the place". Notice that this is true even for the students in the upper group who tended to know the material.

Item 5: Distractor not functioning: no student selected response choice 4. Check to see if it is too obviously incorrect because of a clue, because of implausibility, or if the teacher has "taught to the test" or inadvertently used the item/response in a previous example.

Item 6: Distractor not functioning; no student selected response choice 2. Check to see if it is too obviously incorrect because of a clue, because of implausibility, or if the teacher has "taught to the test" or inadvertently used the item/response in a previous example. There is also a pattern of ambiguity in that students tended to select answer choice 3 as frequently as 4. Check to see whether they are poorly written (equally correct). Otherwise this is good diagnostic information, i.e., students aren't skilled enough to distinguish the correctness of 4 over choice 3.
Please do the practice exercise: Item Analysis Exercise. The link is also available on the last page of this module.
Module 4 Part 2: Grading and Reporting Achievement
This part of the module corresponds to Chapter 11 in your text. In addition to information found in
textbooks, a resource related to this topic that would be extremely useful for teachers, administrators,
and other school personnel is:
The Joint Committee on Standards for Educational Evaluation. (2003). The student evaluation standards:
How to improve evaluations of students. Arlen R. Gullickson, Chair. Thousand Oaks, CA: Corwin Press.
Here is a link to the Joint Committee site: Student Evaluation Standards. Follow the links until you reach the specific standards within each of the classifications (propriety, etc.). Make sure you carefully review the specific Accuracy standards (A1 - A11; e.g., A1 Validity Orientation: Student evaluations should be developed and implemented so that interpretations made about the performance of a student are valid and not open to misinterpretation). Review the other Evaluation Standards (Program
Evaluation, Personnel Evaluation) as well. Reflect on ways that your professional activities are consistent
or may be inconsistent with the standards.
Marking Systems - Purpose and Features
The Student Evaluation Standards that are described in the resource cited above were developed, reviewed, and agreed upon by the members of the Joint Committee on Standards for Educational Evaluation (Joint Committee, 2003, p. 2). Sixteen major education organizations are represented on the
committee. The standards are organized around four necessary attributes. These four attributes are
listed in the table below. They are elaborated more fully in the book and are recommended as a
practical and philosophical guide for professionals, students, parents, and others involved in educational
evaluation of students at the classroom level. The four attributes are listed here for your convenience
and review.
Propriety Standards: The propriety standards help ensure that student evaluations will be conducted legally, ethically, and with due regard for the well-being of the students being evaluated and other people affected by the evaluation results. There are seven propriety standards listed in the resource.

Utility Standards: The utility standards ensure that student evaluations are useful. Useful student evaluations are informative, timely, and influential. There are seven utility standards.

Feasibility Standards: The feasibility standards help ensure that student evaluations can be implemented as planned. Feasible evaluations are practical, diplomatic, and adequately supported. There are three feasibility standards.

Accuracy Standards: The accuracy standards help ensure that student evaluation will produce sound information about a student's learning and performance. Sound information leads to valid interpretations, justifiable conclusions, and appropriate follow-up. There are 11 accuracy standards.
Recall the purpose of grading as stated by the authors of our text: to provide feedback about academic
achievement. In addition to this purpose, other authors point to other functions of grading and
reporting systems. Note that some are intended and some are not intended (please see Table 4.2.1).
When grades are used for purposes other than those intended, we must consider carefully whether
grades are really valid for those uses.
Table 4.2.1 Reported Functions of Grading and Reporting Systems
(Authors and the functions of grading included in their discussions)

Linn & Miller (2005): (1) instructional uses (improvement of student learning and development); (2) reporting to parents/guardians (help parents understand the objectives of the school and how well their child is achieving the intended outcomes); (3) administrative and guidance uses (determining promotion and graduation, awarding honors, determining athletic eligibility, reporting to other schools and prospective employers)

Oosterhoff (2005): (1) motivate students (generally undesirable, consequences unknown); (2) discipline students (grades should not be used for this purpose)

Nitko (2005, p. 332): (1) reaffirm what is already known about classroom achievement; (2) documentation; (3) obtain extrinsic rewards, punishment; (4) obtain social attention or teacher attention; (5) request new educational placement; (6) judge a teacher's competence or fairness; (7) indicate school problems for a student; (8) support vocational or career guidance explorations; (9) limit or exclude a student's participation in extracurricular activities; (10) promotion or retention; (11) granting graduation/diploma; (12) determining whether a student has the necessary prerequisite for a higher level course; (13) selecting for postsecondary education; (14) deciding whether an individual has basic skills needed for a particular job
Locate and read one of the following articles (available via a UCF online library search, selecting the ERIC database). Synthesize this information with the other material you are learning in this module.
Lambating, J. & Allen, J. D. (2002). How the multiple functions of grades influence their validity and value
as measures of academic achievement. Paper presented at the annual meeting of the American
Educational Research Association (New Orleans, LA, April 1 - 5, 2002).
Guskey, T.R. (2002). Perspectives on Grading and Reporting: Differences among teachers, students, and
parents. Paper presented at the annual meeting of the American Educational Research Association (New
Orleans, LA, April 1 - 5, 2002).
McMillan, J.H., Myran, S., & Workman, D. (2002). Elementary teachers' classroom assessment and grading practices. [Electronic version]. Journal of Educational Research, 95(4), 203-214.
Validity is one of the most important characteristics to consider when designing a grading or marking
system. There are procedures to follow that will help to ensure the validity of grades. A teacher or other
person responsible for evaluating performance must select an appropriate set of indicators to represent
the expected instructional outcomes. The set must accurately and fairly represent the person's
achievement of the expected goals. A teacher will ask himself or herself: What tasks, tests, projects, portfolio exhibits, etc. will I include as components that will contribute to the students' composite term grades? How will I combine these components to fairly and accurately depict students' achievement? Good answers to these questions will help ensure that the inferences resulting from the interpretation of grades will be appropriate. Poor answers compromise the validity of the grades.
Marking Systems - Creating Composite Scores and Assigning Grades
Grades must first be defined to let students, school personnel, and families know what they mean. In
typical school systems, grades represent students' achievement related to the goals and objectives
covered during the marking term. When this is true, grades are assigned from a criterion-referenced
perspective and the various letter grades are defined to represent students' performance in relation to
the skills taught. In other words, an A represents that the student has mastered all goals at a high level
of skill, a B represents that students have mastered all or most of the goals at least at a minimal level. A
grade of C means that the student has mastered the majority of the goals but is having difficulty and a
grade of D means the student is having difficulty with a majority of the skills. A grade of F means that the
student has made little progress, if any, toward the skills during the term (Carey, 2003, p.425). When
these definitions are used uniformly within and across school systems, we have a better understanding
of interpreting the meaning of grades from one context to another.
In some programs, a norm-referenced grading system is used in which grades are defined as students' achievement of the goals and objectives in relation to the peer group. From this perspective, grades are defined to represent the extent to which a student's performance on goals and objectives is below average, average, or above average when compared to the peer group.
As noted in the previous articles, classroom teachers unknowingly or deliberately combine variables
from each of the two perspectives when assigning grades. When variables other than those intended to
be measured and/or reported are included in the grade, we say that the achievement grade is
confounded (mixed with other variables that should not be included such as ability, attendance,
attitude, etc.). When grades are not well defined and their meaning not communicated effectively to
stakeholders, confusion, frustration, and resentment will often be the result.
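As a concrete illustration of how components might be combined, here is a minimal Python sketch of a weighted composite and a criterion-referenced letter-grade assignment. The component names, weights, and cutoffs are purely illustrative; a real marking system would define them from the goals and objectives of the term and communicate them to stakeholders in advance.

    # Combine component scores into a weighted composite percentage and assign a grade.
    def composite_percent(scores, weights):
        """scores: {component: (points earned, points possible)}; weights sum to 1.0."""
        return 100 * sum(weights[c] * earned / possible
                         for c, (earned, possible) in scores.items())

    def letter_grade(percent, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
        for cutoff, grade in cutoffs:
            if percent >= cutoff:
                return grade
        return "F"

    scores = {"tests": (168, 200), "project": (45, 50), "portfolio": (27, 30)}
    weights = {"tests": 0.5, "project": 0.3, "portfolio": 0.2}

    pct = composite_percent(scores, weights)
    print(f"{pct:.1f}% -> {letter_grade(pct)}")   # 87.0% -> B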
Please do the practice exercise: Grading Practice. (The link can be found on the next page.) It is included to
offer practice related to calculating term composites and assigning grades. If you get stuck, you may
want to discuss the procedures within your working group. You may also want to discuss the rationale
other members used to assign their percentages, if they were very different among your group
members.
Locate and read the following article describing a study that examined procedures for adapting a grading
system for middle school students in an exceptional education program. Think about the many factors
that must be considered when designing a grading system using best pedagogical practices, with the
needs of exceptional students and students in the general population in mind, and implementing good
measurement practice. Think about the principles related to an effective grading system. To what extent
were they implemented in the experimental grading procedures? Using what you know about the
importance of implementing accommodations for exceptional students, about designing instruction and
assessment to support learning and enhance student motivation, critique the grading system described
in the article. If you were invited to help revise the Personalized Grading Plan procedures for use in your
educational context, what changes or additions would you include?
Munk, D.D. & Bursuck, W. D. (2001). Preliminary findings on personalized grading plans for middle
school students with learning disabilities. Exceptional Children, 67(2), 211-234.
This article is available using a UCF Library online search (selecting ERIC database).
Module 4 Exercises
Item Analysis Exercise (click on the link)
Grading Practice (click on the link)
Module 5 Overview
The central tendency and variability concepts in this module are important whether you are
gathering, summarizing, and interpreting data as a teacher in a classroom or in another
professional role such as school leader, counselor, instructional designer, or researcher. As you
begin, consider all the ways that proficiency in using data to help make decisions is vital to your
effective professional performance. These measurement skills are also important for interpreting
and conducting research (teacher or school leader action research, school leaders' or private,
non-profit evaluation research, scholarly research).
Module 5 corresponds to Chapters 13 & 14 in our textbook. We will learn how to summarize test
scores and group performance and convert raw scores to standard scores. Content in this
module relates to the text but includes content not found in the textbook as well.
Once assessments are used to gather information about an attribute (individual student or group
performance), it is important to summarize and make sense of the results. Concepts in this
module will assist you in this process.
The table below contains the objectives, readings, learning activities, and assignments for Module 5.
Module 5 focuses on the following objectives:
Objectives

Chapter 13
• Compare and contrast histograms, frequency polygons, and smoothed curves.
• Discriminate among positively skewed, symmetrical, and negatively skewed distributions.
• Determine the mean, median, and mode, given a set of data.
• Locate correctly the relative positions of the measures of central tendency in various distributions represented by smooth curves.
• Identify the measure of central tendency that best represents the data in various distributions.
• Draw conclusions about data based on the measures of central tendency and/or smooth curves based on the data.

Chapter 14
• Compare and contrast the range, semi-interquartile range, and standard deviation.
• Determine quartiles and percentiles for a given set of data.
• Discriminate between raw and converted scores.
• Use z-score conversions to facilitate comparisons of scores from different distributions.
• Determine equivalent raw scores, z-scores, T-scores, and percentile ranks.
• Use the measures of central tendency, variability, converted scores, and the properties of the normal curve to make decisions about measurement data, both for individual students and for groups.

Readings
Chapters 13 & 14 in text

Learning Activities
Content specified in module
Several non-posted practice tasks

Assignments
Continue Final Project Part B (to be found under the Assignments tool)
Module 5 Part 1: Central Tendency
This module part corresponds to Chapters 12 and 13 in the Kubiszyn and Borich (2007)
textbook. Please read the chapters before beginning this section. You may want to have a
calculator and some scratch paper handy. These tasks are for your practice; they do not need to
be uploaded or posted. Don't forget that you can discuss and compare your responses with your
group members if you have any confusion.
In this section of the module we will focus on summarizing data as well as calculating and
interpreting measures of central tendency. Most of you will have seen these skills before and just
need some review and practice with application of the skills to a new context.
Summarizing Data (Please read Chapter 12 and 13 in the Textbook)
• There are a number of ways to summarize test performance data and the choice often depends on the context. Factors within the context may include the number of observations, the purpose of the data summary (e.g., making instructional decisions, interpreting results obtained in a research study, program evaluation, report dissemination), and available technology. We will concentrate mostly on educational contexts such as individual student performance, class performance, and school or district-wide performance data.
• A first step in summarizing a data set is to create a simple or grouped frequency distribution.
• Once you have done that, it may be useful to create and then interpret a graph such as a histogram, frequency polygon, or smoothed curve.
• To further interpret the data, it is usually useful to calculate and interpret measures of central tendency: mean, median, and/or mode. Central tendency tells us how well the group performed. The choice of central tendency measure will often depend on the nature of the data and the questions being asked of it.
• Along with the measure of central tendency, it is useful to calculate and interpret measures of variability: range, variance, and standard deviation. Measures of variability tell us how dispersed or clustered the scores were about the mean. Variability is covered in Part 2 of this module.
• In academic settings, it is helpful to use the following procedure to describe a group's performance (or, in the ESE or counseling context, a group of scores obtained from one individual over time).
  o Set up expectations about the group's performance while keeping in mind these factors: characteristics of the group; complexity of the material; quality of the test; quality of the instruction (assume the quality of the test and instruction is good unless you have information to the contrary).
  o Calculate measures of central tendency and variability.
  o Compare the obtained values to the values that were expected. Evaluate the obtained results (e.g., "the group generally performed well but were somewhat more heterogeneous than expected").
  o Seek reasonable explanations for any discrepancies between the expected central tendency and variability and what was obtained.
  o Document and use the information for making decisions about the quality of the performance, the quality of instruction and materials, and the quality of your own performance.
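If you want to see the first two steps in code, here is a minimal Python sketch that builds a simple frequency distribution for Set 1 and a grouped frequency distribution for Set 2 (the score sets from Table 5.2A, which appears later in this section). The interval width of 3 is an arbitrary illustrative choice.

    # Simple and grouped frequency distributions for the Table 5.2A score sets.
    from collections import Counter

    set_1 = [20, 20, 19, 19, 19, 19, 18, 18, 14, 14]
    set_2 = [30, 26, 22, 21, 20, 20, 20, 18, 17, 16]

    # Simple frequency distribution: how many times each raw score occurs.
    simple = Counter(set_1)
    for score in sorted(simple, reverse=True):
        print(score, simple[score])            # 20:2, 19:4, 18:2, 14:2

    # Grouped frequency distribution: frequencies within equal-width intervals.
    width = 3                                  # illustrative interval width
    grouped = Counter((score - min(set_2)) // width for score in set_2)
    for idx in sorted(grouped, reverse=True):
        low = min(set_2) + idx * width
        print(f"{low}-{low + width - 1}: {grouped[idx]}")   # e.g. 16-18: 3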
Central Tendency
• There are three commonly used measures of central tendency: mean, median, and mode.
• The mean represents the average of the scores in the group and is calculated using the following formula:

  mean = ΣX / n

  where Σ means "to sum," X represents a score (thus ΣX means to sum the scores), and n = the number of scores.
• The median represents the midpoint in the set of scores. Half of the scores will fall below this point and half will be above the point that is the median. Some authors in the field (Carey, 2001) provide a formula for the median. It allows a more precise estimate of the "midpoint of values."
• The last measure of central tendency is the mode. The mode is the score in the set that appears most frequently.
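Before working the practice items by hand, you may find it helpful to see the three measures computed in a few lines of Python; the sketch below uses Set 1 from Table 5.2A and the statistics module from the standard library.

    # Mean, median, and mode for Set 1 of Table 5.2A.
    import statistics

    set_1 = [20, 20, 19, 19, 19, 19, 18, 18, 14, 14]

    mean = sum(set_1) / len(set_1)        # sum of the scores divided by n
    median = statistics.median(set_1)     # midpoint of the ordered scores
    mode = statistics.mode(set_1)         # most frequently occurring score

    print(mean, median, mode)             # 18.0  19.0  19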
At the end of this part of the module you will be asked to calculate the mean and find the median
for Set 1 in Table 5.2A. But first you will be asked to set up reasonable expectations for the
group's performance as noted in the interpretation procedures listed toward the beginning of the
module. Continue reading about variability and then proceed with the calculations and
interpretations.
Use the sets of scores in Table 5.2A to practice summarizing a given data set. You will later use these scores to calculate and interpret measures of central tendency and variability (please refer to Chapters 12 and 13 in the Textbook).
Table 5.2A Summarize a Data Set

Scores:
  Set 1: 20, 20, 19, 19, 19, 19, 18, 18, 14, 14
  Set 2: 30, 26, 22, 21, 20, 20, 20, 18, 17, 16
  Set 3: 50, 48, 45, 40, 40, 40, 33, 30, 29, 25

Data Summary Procedures:
  Create Simple Frequency Distributions for each of the sets.
  Create a Grouped Frequency Distribution for Set 2.
  Create a histogram for Set 1.
  Create Frequency Polygons for Sets 1 and 2.
  Describe the distributions of scores as related to symmetry and skewness.
Examine the polygons you created from the data in Table 5.2A. Think about the nature of the
performance represented by the distributions of scores. What can you tell about the group's
performance after studying the distributions? When interpreting the distributions, pay attention
to the location of the distribution along the raw score scale. Notice whether the polygon is
situated toward the lower end of the scale or toward the higher end. This gives you an idea of
how well the group performed.
Next, notice whether the polygon is indicating the scores were clustered together about the
mean, fairly spread out around the mean, or widely dispersed about the mean. This gives you a
sense of the variability of the performance (studied further in Part B of the module). Also, notice
whether the polygon is shaped like a bell, indicates skewness, or contains more than one mode
(bimodal, multi-modal). This gives you further insight into the nature of the performance. It
shows you where there are clusters within the group - clusters at the high end are more
desirable when interpreting achievement-related scores and clusters toward the low end present
a challenge. You must then decide whether or how to provide remedial instruction for those
students who are not succeeding with the skills or extension experiences for those students who
need to reach even higher. So, pictures really are worth a thousand words when it comes to
summarizing a group's performance.
Practice calculating and interpreting group performance using the data in Set 1 from Table 5.2A.
Use the information in the following scenario to set your expectations for the group's
performance (is it reasonable to expect them to do very well? to do moderately well? or to have
difficulty with most of the skills?). Then calculate the actual (obtained) measures of central
tendency. Use that information along with the frequency polygon from the beginning of this page
to interpret the results. You may wish to use the table format after the scenario to guide your
work. We will use this same data set in Part B of this module to practice with variability concepts
(if you would like to wait and do them both together, that's ok, too).
Imagine this scenario:
A teacher is about to administer a posttest to a group of students following a unit of instruction.
This teacher has implemented some new teaching techniques and would like to know if the
students have succeeded in mastering the skills. In the past, the group has been quite
heterogeneous in their performance and struggling to achieve even moderate mastery of
skills. The sub-skills that will be measured are classified as fairly difficult. There are 20 possible
points on the exam. What mean might be expected from this group under these circumstances?
What distribution shape will likely result?
Indices                                             Expectation                             Obtained Result
                                                    (make prediction based on scenario)     (calculate using actual Set 1 data)
Mean
Range (continues in Part 2 of module)
Standard Deviation (continues in Part 2 of module)
Distribution Shape
Now compare your obtained values to what you reasonably expected and write your description of the
group's performance.
Wondering how you did? You may wish to post your predictions and obtained results under your
group's discussion topic to talk it over and compare results.
Module 5 Part 2: Variability
This module part corresponds to chapters 12 and 13 in the Kubiszyn and Borich (2007) textbook
and focuses on variability in data. Please read the chapters before beginning this section. You
may want to have a calculator and some scratch paper handy. These tasks are for your practice;
they do not need to be uploaded or posted. Don't forget that you can discuss and compare your
responses with your group members if you have any confusion.
In this section of the module we will focus on summarizing data as well as calculating and
interpreting measures of variability. Again, this may be a review or a chance for you to apply the
skills in a new context.
Summarizing Data: Variability (Please read Chapter 13 in the
Textbook)
• There are a number of ways to summarize test performance data and the choice often depends on the context. Factors within the context may include the number of observations, the purpose of the data summary (e.g., making instructional decisions, interpreting results obtained in a research study, program evaluation, report dissemination), and available technology. We will concentrate mostly on educational contexts such as individual student performance, class performance, and school or district-wide performance data.
• Along with the measure of central tendency (see Part 1 of this module), it is useful to calculate and interpret measures of variability: range, variance, and standard deviation. While the range reflects the distance between the high and low scores, other measures of variability tell us how dispersed or clustered the scores were about the mean.
• In academic settings, it is helpful to use the following procedure to describe a group's performance (or, in the case of ESE and counseling contexts, possibly a group of scores obtained from one individual over time).
  o Set up expectations about the group's performance while keeping in mind these factors: characteristics of the group; complexity of the material; quality of the test; quality of the instruction (assume the quality of the test and instruction is good unless you have information to the contrary).
  o Calculate measures of central tendency and variability.
  o Compare the obtained values to the values that were expected or reasonable within the context. Evaluate the obtained results (e.g., "the group generally performed well but were somewhat more heterogeneous than expected").
  o Seek reasonable explanations for any discrepancies between the expected central tendency and variability and what was obtained.
  o Document and use the information for making decisions about the quality of the performance, the quality of instruction and materials, and the quality of your own performance.
Review the sets of scores in Table 5.2A and your practice summarizing data and calculating
central tendency (from Part 1). You will now use these scores to calculate and interpret
measures of variability.
Table 5.2A Summarize a Data Set

Scores:
  Set 1: 20, 20, 19, 19, 19, 19, 18, 18, 14, 14
  Set 2: 30, 26, 22, 21, 20, 20, 20, 18, 17, 16
  Set 3: 50, 48, 45, 40, 40, 40, 33, 30, 29, 25

Data Summary Procedures:
  Create Simple Frequency Distributions for each of the sets.
  Create a Grouped Frequency Distribution for Set 2.
  Create a histogram for Set 1.
  Create Frequency Polygons for Sets 1 and 2.
  Describe the distributions of scores as related to symmetry and skewness.
Review the polygons for amount of variability in the data set. Next, notice whether the polygon is
indicating the scores were clustered together about the mean, fairly spread out around the
mean, or widely dispersed about the mean. This gives you a sense of the variability of the
performance. Also, notice whether the polygon is shaped like a bell, indicates skewness, or
contains more than one mode (bimodal, multi-modal). This gives you further insight into the
nature of the performance. It shows you where there are clusters within the group - clusters at
the high end are more desirable when interpreting achievement-related scores and clusters
toward the low end present a challenge. You must then decide whether or how to provide
remedial instruction for those students who are not succeeding with the skills or extension
experiences for those students who need to reach even higher. So, pictures really are worth a
thousand words when it comes to summarizing a group's performance.
Variability
• Measures of variability include the range, variance, and standard deviation.
• The range is calculated by subtracting the lowest earned score from the highest earned score. It tells you how far the students' scores spanned along the raw score scale. To interpret the range, you may wish to use the following rule of thumb (Carey, 2001): a range that is equal to 1/4 or less of the total points possible is considered a homogeneous performance; a range that is equal to about 1/3 of the total number of points is considered somewhat heterogeneous; and a range that is 1/2 or more of the total possible points is considered a very heterogeneous performance. (Example: total possible points on a test is 60; the highest earned score was 58 and the lowest earned score was 28; 58 - 28 = 30; 30 divided by 60 equals .50, or about 1/2 of the total possible points; this means the group's performance was very heterogeneous.)

  R = H - L

  where R is the range, H is the highest earned score, and L is the lowest earned score.

• The standard deviation is a number that represents, on average, how far the scores in the set were away from the mean. As a measure of variability it tells us how clustered together or how dispersed the scores were about the mean. Interpretation takes some practice, but a larger number represents more variability and a smaller number represents less variability. You can further interpret the group's performance by comparing the standard deviation to the range. If the standard deviation is about 1/4 or less of the range, the scores are more clustered together within the range; a standard deviation that is around 1/3 of the range indicates scores are somewhat dispersed within the range; and a standard deviation that is about 1/2 or more of the range represents scores that are quite dispersed throughout the range.

  SD = sqrt( Σ(X - X̄)² / n )

  where SD is the standard deviation, Σ means "to sum" (thus Σ(X - X̄)² means the sum of the squared deviations), X represents a score, X̄ is the mean of the scores, and n is the number of scores in the group.

• The variance is another measure of variability that tells us how dispersed the scores were about the mean. The variance is calculated much like the standard deviation. It is an important measure of variability when it comes to conducting and interpreting statistical analyses. Notice how calculation of the variance is the same as for the standard deviation until you take the square root. In other words, the variance is the standard deviation squared.

  variance (SD²) = Σ(X - X̄)² / n

  where Σ means "to sum" (thus Σ(X - X̄)² means the sum of the squared deviations), X represents a score, X̄ is the mean of the scores, and n is the number of scores in the group.
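Here is a minimal Python sketch of the same formulas applied to Set 1 of Table 5.2A. Note that it divides by n, the number of scores, to match the formulas above, and the rule-of-thumb comparisons at the end assume the 20-point test described in the scenario that follows.

    # Range, variance, and standard deviation for Set 1 of Table 5.2A.
    import math

    set_1 = [20, 20, 19, 19, 19, 19, 18, 18, 14, 14]
    n = len(set_1)
    mean = sum(set_1) / n

    score_range = max(set_1) - min(set_1)                 # R = H - L
    variance = sum((x - mean) ** 2 for x in set_1) / n    # sum of squared deviations / n
    sd = math.sqrt(variance)                              # SD is the square root of the variance

    print(score_range, round(variance, 2), round(sd, 2))  # 6  4.4  2.1
    # Rule-of-thumb checks (20 points possible): range / 20 = 0.30, about 1/3 of the
    # possible points (somewhat heterogeneous); SD / range = 0.35, about 1/3 of the
    # range (scores somewhat dispersed within the range).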
Finish practicing calculating and interpreting group performance using the data in Set 1 from
Table 5.2A. Use the information in the following scenario to set your expectations for the group's
performance. Then calculate the actual (obtained) measures of variability. Use that information
along with the frequency polygon from the beginning of this page to interpret the results. You
may wish to use the table format after the scenario to guide your work.
Imagine this scenario:
A teacher is about to administer a posttest to a group of students following a unit of instruction.
This teacher has implemented some new teaching techniques and would like to know if the
students have succeeded in mastering the skills. In the past, the group has been quite
heterogeneous in their performance and struggling to achieve even moderate mastery of
skills. The sub-skills that will be measured are classified as fairly difficult. There are 20 possible
points on the exam. What mean, range, and standard deviation might be expected from this
group under these circumstances? What distribution shape will likely result?
Indices                 Expectation                             Obtained Result
                        (make prediction based on scenario)     (calculate using actual Set 1 data)
Mean
Range
Standard Deviation
Distribution Shape
Now compare your obtained values to what you reasonably expected and write your description of the
group's performance.
Wondering how you did? You may wish to post your predictions and obtained results under your
group's discussion topic to talk it over and compare results.
There is more practice in the handout available under the Table of Contents for this module.
These exercises will give you some practice with another data set. If you would like even more
practice with illustrations of these concepts, there is a site you may find helpful (more practice
for mean and standard deviation calculation) URL is:
http://www.easycalculation.com/statistics/learn-standard-deviation.php. It is a copyrighted
product of HIOX.
Extend your learning beyond the ordinary by trying to complete the calculations and graphs for Table 5.2A using a spreadsheet program like Microsoft Excel. There is a site published by Dr. Del Siegle at the University of Connecticut illustrating this procedure (practice using Excel to calculate the mean and standard deviation). The URL is http://www.gifted.uconn.edu/siegle/research/Normal/stdexcel.htm.
Module 6 Overview
The concepts in this module are important whether you are using skills related to correlation and
validity as a teacher in a classroom or in another professional role such as school leader,
counselor, instructional designer, or researcher. As you begin, consider how many times you
have talked or read about correlation without knowing some of the finer points of interpretation.
Consider how these skills can assist you in reading and understanding research and evidence of
validity of test scores.
Module 6 corresponds to Chapters 15 & 16 in our textbook. We will learn to determine
relationships with correlation and examine test validity. Content in this module relates to the text
but includes content not found in the textbook as well.
One of the most important attributes of high quality assessment is the validity of results (the
extent that inferences that we make from the results are appropriate). In this module, we will
learn skills that will help you understand and evaluate validity of results of tests you use, create,
or evaluate for research purposes.
The table below contains the objectives, readings, learning activities, and assignments for Module 6.
Module 6 focuses on the following objectives:
Objectives

Chapter 15
• Interpret correlation coefficients as to strength and direction.
• Describe why the presence of even a very strong correlation does not imply causality.
• Compare and contrast the correlation coefficient and the coefficient of determination.
• Describe a curvilinear relationship.
• Explain why correlation coefficients computed from a truncated range of data will be weaker than if computed from the entire range of data.

Chapter 16
• Identify types of evidence that indicate whether a test may be valid or invalid for various purposes.
• Compare and contrast content validity, concurrent validity, and predictive validity evidence.
• Describe procedures used to establish the content validity evidence of a test.
• Identify the type of validity evidence most important for achievement tests.
• Explain how group heterogeneity affects the size of a validity coefficient.
• Identify the most appropriate type of validity evidence when given different purposes for testing.

Readings
Chapters 15 & 16 in text

Learning Activities
Content and articles specified in module
Several non-posted practice tasks (within module)
Practice exercises found under the module Table of Contents

Assignments
Continue Final Project Part B
Module 6 Part 1 Correlation
Module 6 corresponds to Chapters 14 and 15 in the Kubiszyn and Borich textbook. There are
practice activities to help you calculate and interpret correlation. You will be asked to locate a
research article that reports a correlation coefficient. The activities found in this part of the
module are not submitted to the assignments tool for a grade.
Earlier editions of many measurement textbooks and the more advanced resources go into more
detail about correlation and validity if you are interested in extending your knowledge even
further. Our library has volumes by Sax (1989) and Hopkins (1990) as well as others that would
be very useful when you are ready to take a look.
Interpreting the Correlation Coefficient (r) and Coefficient of Determination (r²)
• A correlation coefficient is a number that ranges from -1.00 to +1.00 and represents an association between measures of variables.
• There are two dimensions represented by a given correlation coefficient: degree (i.e., strength of association) and direction (positive correlation, i.e., high scores on one variable with high scores on the other, along with low scores on one variable with low scores on the other; OR negative correlation, i.e., high scores on one variable with low scores on the other).
• There are many different types of correlation coefficients, and the one calculated and reported usually depends on the type of data (nominal, ordinal, interval, or ratio; or continuous versus dichotomous). The Pearson Product-Moment Correlation Coefficient is one of the most commonly reported in published research.
• Do not interpret the correlation coefficient as an indication of cause and effect. Also, do not interpret it as if it represents a percentage of association.
• The coefficient of determination is obtained by squaring the correlation coefficient (and multiplying by 100 to express it as a percentage). It is a way of describing the hypothetical percentage of the factors associated with the two variables being correlated (Sax, 1989). Coefficient of determination = correlation coefficient squared, or r².
Two computational formulas for the correlation coefficient are provided below. Also, recall the rank difference correlation coefficient from your textbook. You will not be required to compute with the formulas given below on our objective exam unless advised otherwise. They are here for your information and so that you can see what the spreadsheet or statistical analysis program (like SPSS, for example) is doing for you. It is possible that you will calculate and interpret a rank difference correlation coefficient on the exam.

r = [nΣXY - (ΣX)(ΣY)] / sqrt( [nΣX² - (ΣX)²] [nΣY² - (ΣY)²] )

where r is the correlation coefficient (e.g., the Pearson Product-Moment Correlation Coefficient), X and Y are the paired scores on the two variables, and n is the number of pairs.

Another way to calculate this same type of correlation is using the z-score formula:

r = Σ(zX · zY) / n
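To see what the spreadsheet or statistical package is doing, here is a minimal Python sketch of the Pearson product-moment correlation (computed from deviation scores, which is equivalent to the formulas above) together with the coefficient of determination. The two score lists are made up for illustration.

    # Pearson product-moment correlation and coefficient of determination.
    import math

    def pearson_r(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sum_xx = sum((a - mean_x) ** 2 for a in x)
        sum_yy = sum((b - mean_y) ** 2 for b in y)
        return sum_xy / math.sqrt(sum_xx * sum_yy)

    x = [10, 12, 14, 15, 18, 20]   # hypothetical scores on one measure
    y = [11, 13, 13, 16, 17, 21]   # hypothetical scores on a second measure

    r = pearson_r(x, y)
    print(round(r, 2))                 # strength and direction of the relationship
    print(round(r ** 2 * 100), "%")    # coefficient of determination as a percentage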
Locate criteria for interpreting a correlation coefficient and a research article in a field of interest to you that calculated the correlation between two variables. Using the criteria, practice interpreting the strength and direction of the coefficient reported in the article. In your own words, in about 2-3 sentences, summarize the relationship between the variables examined in that article. If you have any questions, please post them to the discussion board under the topics
"students help students" or "questions for instructor". Here are three reputable websites that
offer three different standards for interpretation. This is not to frustrate or confuse you, it is to
illustrate how many factors contribute to interpretation of this widely used index and why you
will see so many different interpretations (relative to precision) in research articles you read.
Correlation interpretation from US gov HHS's
Another example of guidelines
Another example of r guidelines (note the diagram illustrating interpretation of r squared)
Example
The Effects of Confidence and Perception of Test-taking, By: Smith, Lisa F., North American
Journal of Psychology, 1527-7143, April 1, 2002, Vol. 4, Issue 1
Database: Academic Search Premier
These authors found a statistically significant correlation of .46 between students' confidence and
their performance. This is a small, positive correlation and indicates that about 21% of the
factors related to students' confidence are related to factors contributing to their performance.
Further, the authors found a very low, positive correlation between students' perceptions of their
test taking skills and their performance (r = .14). This low correlation indicates that only about
2% of the factors contributing to students' perceptions of their test-taking skills are related to
factors contributing to their test performance.
Scatterplots
• Scatterplots are another way to represent distributions of scores. Interpreting the scatterplot enables you to use a graph to determine the strength and direction of correlation between scores.
• Interpretation of the scatterplot will also allow you to determine what type of relationship exists between the variables (linear or curvilinear).
Matching Exercise for Practice
Examine and interpret the scatterplot examples below. Column A contains examples of
scatterplots. Column B contains a list of relationship interpretations. Match the number of the
scatterplot from Column A with the appropriate title in Column B. Check your work with the
feedback that follows.
Once you have practiced interpreting the strength and direction of the relationship, try to
imagine two variables that could be related as represented in a given distribution. For example,
Plot #1 could represent Variable A as student enrollment while Variable B represents recruitment
effort by university administration. As recruitment effort goes up, student enrollment goes up
(strong positive correlation).
Column A: Examples of Scatterplots
[Scatterplots 1 - 6]

Column B: Types of relationships
A. No correlation
B. Small, positive correlation
C. Strong, positive correlation
D. Small, negative correlation
E. Strong, negative correlation
F. Curvilinear correlation

Check your choices with the correct matches below.

Correct matches (Scatterplots 1 - 6):
1. C. Strong, positive correlation
2. B. Small, positive correlation
3. A. No correlation
4. E. Strong, negative correlation
5. D. Small, negative correlation
6. F. Curvilinear correlation
Module 6 Part 2 Validity
Module 6 corresponds to chapters 14 and 15 in the Kubiszyn and Borich textbook. Please read
those chapters before working on this module.
In this part of the module you will review the types of validity evidence introduced in the text.
You will be asked to use abstracts of articles reporting studies that have been conducted to examine the validity of the results of specific instruments (find them using the link in the Table of Contents within this module). You may want to download and skim them before going through this material.
Content Validity
• Content validity is the extent that the items or tasks on a test are congruent with whatever the test was designed to measure.
• It is the most important type of validity evidence when measuring achievement. In typical classroom assessment contexts, it is important that tests are congruent with the instructional goals and objectives they were designed to measure as well as the instruction that the students experienced.
  o This means that test items or tasks should be a good match with the behavior (learning levels) and content of the objectives.
  o There should be representation of the domain: good coverage of the content and learning levels that should be present, and no content or learning levels that were not part of the instructional objectives and learning activities.
• Content validity is largely determined through careful logical analysis of the test items and instructional objectives.
• Other types of tests besides achievement tests must also demonstrate good content validity.
• The test blueprint or Table of Test Specifications helps to ensure strong content validity of test results.
Criterion-related Validity: Concurrent and Predictive
• Concurrent validity is demonstrated when the scores on one test show a strong positive correlation with scores on another test that measures the same or nearly the same thing.
• Test results are said to have predictive validity when the scores on a test show a strong positive correlation with a measure of the criterion of interest that is taken after a specified amount of time has passed. When this strong positive correlation is obtained, the scores are then thought to predict the future performance.
• Both forms of criterion-related validity are dependent on the reliability of the measures involved. Both the test being created and the criterion measure (whether concurrent or future) must yield reliable results.
Construct Validity
• Construct validity is demonstrated when the results of a test are shown to be consistent with outcomes that were predicted or expected based on the theory and research surrounding the attribute that is being defined and measured. Developing construct validity often entails:
  o demonstrating that the construct is important, relevant, or necessary to the field
  o distinguishing the construct from other similar attributes being measured
  o demonstrating that the construct can be operationalized (i.e., made measurable; that which was abstract can be made observable)
  o showing that the test results converge with other, similar tests to suggest or reinforce that there is such a thing as the phenomenon being measured
  o demonstrating that the test results do not correlate with tests that do not measure the same or similar attributes
• Empirical procedures (often involving correlation) are used to establish construct validity. Both content and criterion-related validity evidence may be employed.
• Construct validity is more often associated with creating measures of psychological attributes or more complex abstract phenomena than is typically measured in classroom settings.
(No need to post on these activities)
Look up an article that describes the process of gathering validity evidence on an instrument of
interest to you. Identify and evaluate the types of validity evidence that have been obtained.
Consider how you might add to that body of evidence based on the particular uses of the
instrument in your professional context.
Module 7 Overview
The concepts in this module are important whether you are using measurement skills as a
teacher in a classroom or in another professional role such as school leader, counselor,
instructional designer, or researcher. As you begin, consider all the ways that proficiency in using
results of various instruments is vital to your effective professional performance. Skills in
interpreting and estimating reliability of test scores are important for evaluating and conducting
research (teacher or school leader action research, school leaders' or private, non-profit
evaluation research, scholarly research).
Module 7 corresponds to Chapters 17 & 18 in our textbook. Content in this module relates to the
text but includes content not found in the textbook as well.
One of the most important attributes of high-quality assessment is the validity of results (the extent to which the inferences we make from the results are appropriate). To improve the validity of an assessment, the representation of content and learning levels in the objectives should carry through to the items or tasks on the test that is built from them. Describing how the skills were originally classified (for example, by the arrangement of item difficulty levels) and checking the consistency of the test items with the course content is usually called item mapping. The next most important attribute is reliability. In this module, we will learn skills that will help you enhance the reliability of results of tests you use, create, or evaluate for research purposes.
The table below contains the objectives, readings, learning activities, and assignments for Module 7.
Module 7 focuses on the following objectives:
Objectives

Chapter 17
• Describe procedures used to estimate test-retest score reliability and alternate-forms score reliability.
• Describe procedures used to estimate split-half and Kuder-Richardson estimates of internal consistency.
• Describe how the Spearman-Brown Prophecy Formula is used, and its effect on the reliability coefficient.
• Select the best test for a given purpose when provided with score reliability information for different tests.
• Select one best test for a given purpose when given reliability and validity information for several tests.
• Identify the most relevant type of score reliability estimate when given different purposes for testing.
• Explain the factors that affect the obtained value of the reliability coefficient (length of test, heterogeneity of group, content, etc.).

Chapter 18
• Explain how error can operate to both increase and decrease test scores.
• Define and discriminate among obtained, true, and error scores.
• Discriminate between the standard deviation and the standard error of measurement.
• Construct 68, 95, and 99% confidence bands around obtained scores when given the standard error.
• Identify the four sources of error in testing, and give examples of each.
• Describe the extent to which the various estimates of reliability are differentially affected by the sources of error.

Readings
Chapters 17 & 18 in text

Learning Activities
Content and articles specified in module
Several non-posted practice tasks

Assignments
Continue Project Part B (found under the Assignments tool)
Module 7 Reliability and Accuracy
This module corresponds to chapters 16 and 17 of the Kubiszyn and Borich textbook. Please read
the chapters before completing this module. It may be a good idea to skim the material in the
paragraphs below, then read the materials recommended under the "DO THIS" icon at the end of
this page. Then go back and re-read the sections on this page to synthesize across the
resources.
Reliability
• Reliability is the consistency with which an instrument is able to obtain results. After validity, it is the most important characteristic of instrument results. If instrument results are not reliable, they cannot be valid.
• There are a variety of ways to estimate the reliability of test scores. The choice of reliability estimate will depend on a number of factors, and sometimes it may be necessary to use multiple procedures to fully estimate the reliability of scores. Sometimes it is necessary to use a method that is less than ideal but is still the best available.
• The various methods of estimating reliability of test scores include: stability, equivalence, stability and equivalence, internal consistency, and inter-rater consistency. Different methods will result in different values for the reliability coefficient.
• Different resources may suggest different rules of thumb for interpreting reliability, but most would agree that the higher the stakes, the higher the reliability that should be expected of the scores. Standardized multiple-choice tests typically have reliability coefficients in the range of .85 - .95. Paper and pencil tests may range between .65 and .80. Portfolio assessments may range between .40 - .60. Consult reputable resources (e.g., the Standards for Educational and Psychological Testing) when interpreting reliability coefficients. There are a number of factors that affect their interpretation.
Stability, Equivalence, and Stability and Equivalence Methods
• Stability is measured by estimating the correlation of test scores obtained from the same individuals over a period of time (test-retest reliability).
  o Test-retest estimation involves administering the same instrument two times to the same group of students after a pre-determined period of time. Evidence of the stability of the test scores is provided if the students who score high the first time around are the same students who score high the second time around (and the remaining students keep the same relative ranks from the first to the second administration as well).
  o The amount of time that elapses is an important consideration. The longer the interval between administrations, the lower the correlation between the sets of scores. When there is a long interval between administrations, the correlation between the sets of scores is diminished not only by the lack of stability of the scores but also by other factors that could interfere. In addition to the lack of stability of scores, changes in the student could occur which make it less likely that the same rank will be held from one test administration to the next.
  o Scores that might be obtained at one point in time but may be relevant or used at a later point in time must demonstrate stability (standardized tests, for example, that are used for admissions to academic programs). Stability is less important for scores that will be used relatively shortly after they are obtained and are not likely to be relevant at a point in the future (classroom unit tests, for example).
  o As an example of the criteria for evaluating a test-retest reliability coefficient, measures of stability commonly reported for standardized tests of aptitude and achievement administered within the same year are about .80. This is important when using standardized assessment scores from students' permanent records. Consider the date of the assessment and whether there is stability evidence available to indicate that the scores are still relevant after the amount of time that has elapsed (Linn & Miller, 2005).
• Equivalence is evaluated by creating two or more forms (parallel forms) of the same instrument and administering them to the same individuals at about the same time.
  o Parallel-forms reliability is an indication of the short-range constancy of performance as well as the adequacy of sampling of the domain. Recall that to ensure reliability, it is important to get a representative sample of items from the many possible items of the domain. If scores on equivalent forms of the test are highly correlated, it is an indication that the test is an appropriate sample of the domain; if multiple equivalent samples of the domain are correlated, reflecting similarity in the results, then this indicates that the (equivalent) samples are a good representation of the domain.
  o This method of estimation is often used with standardized tests, as there are often multiple (parallel) forms needed. Along with the content validity evidence, evidence of the equivalence of the forms must be provided. This is true for any type of test that would offer parallel-forms reliability estimates.
The following research article abstract is an example of a study that examined the
equivalence of test scores across testing methods.
Abstract
This study explores the equivalence of web-based administration with no local
supervision and traditional paper-and-pencil supervised versions of OPQ32i (the ipsative
format version of the Occupational Personality Questionnaire). Samples of data were
collected from a range of client projects and matched in terms of industry sector,
assessment purpose (selection or development) and candidate category (graduate or
managerial/professional). The analysis indicates that lack of local supervision in high-stakes
situations has little if any impact on scale scores. At worst, some scales appear to
show shifts of less than a quarter of an SD, with most scales showing little change, if any.
Analyses in terms of the Big Five show differences of less than .2 of an SD. Scale
reliabilities and scale covariances appear to be unaffected by the differences between
the supervised and unsupervised administration conditions.
Bartram, D. & Brown, A. (2004). Online testing: mode of administration and the stability
of OPQ 32i scores. International Journal of Selection and Assessment, 12(3), 278-284.
The following research article abstract is an example of a study that examined the stability and
equivalence of test scores from a modified version of a widely used test in early childhood.
Abstract
Examined the psychometric properties of a set of preliteracy measures modified from the
Dynamic Indicators of Basic Early Literacy Skills (DIBELS) with a sample of 75 kindergarten
students. The modified battery (called DIBELS--M) includes measures of Letter Naming Fluency,
Sound Naming Fluency, Initial Phoneme Ability, and Phonemic Segmentation Ability. These
measures were assessed through repeated administrations in 2-wk intervals at the end of the
kindergarten year. Results indicate interrater reliability estimates and coefficients of stability
and equivalence for 3 of the measures ranged from .80 to the mid .90s with about one-half
of the coefficients above .90. Correlations between DIBELS--M scores and criterion measures of
phonological awareness, standardized achievement measures, and teacher ratings of
achievement yielded concurrent validity coefficients ranging from .60 to .70. Hierarchical
regression analysis showed that the 4 DIBELS--M measures accounted for 73% of the variance in
scores on the Skills Cluster of the Woodcock-Johnson Psychoeducational Battery--Revised. The
contributions of the study, including psychometric analysis of the DIBELS--M with a new sample
and formation of composite scores, are discussed in relation to the extant literature.
Elliott, J., Lee, S., & Tollefson, N. (2001). A reliability and validity study of the Dynamic
Indicators of Basic Early Literacy Skills-Modified. School Psychology Review, 30(1), 33-49.
• A stability and equivalence estimate is obtained by administering two forms with a relatively
long delay in between.
o The combined procedure is even more rigorous than the stability or equivalence
procedures alone and is often estimated for standardized test results. While it is a
rigorous test of reliability of scores, it is not as commonly reported in the
literature.
o The procedure is in effect a measure of both constancy of the scores and
representativeness of the domain.
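For readers who would like to see these correlational estimates in concrete terms, here is a minimal computational sketch. It is not part of the course readings; it assumes Python with the NumPy package, and the score lists are invented purely for illustration. The same calculation applies whether the two score sets come from two administrations of the same test (stability), from two parallel forms (equivalence), or from two forms separated by a long interval (stability and equivalence).

import numpy as np

# Hypothetical scores for the same ten students on two administrations
# (or on two parallel forms) of a test. Invented values for illustration only.
first_set = [72, 85, 90, 64, 78, 88, 59, 95, 70, 81]
second_set = [70, 88, 87, 66, 75, 90, 62, 93, 73, 80]

# The reliability estimate is the Pearson correlation between the two sets:
# students who rank high on one set should rank high on the other.
reliability = np.corrcoef(first_set, second_set)[0, 1]
print(f"Estimated reliability coefficient: {reliability:.2f}")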
Internal Consistency Reliability Estimates
• Internal consistency methods involve only one test administration and use
procedures for estimating correlations among items within the test. Common internal
consistency estimates include split-half reliability, the Kuder-Richardson formulas (KR20 and
KR21), and Cronbach's Alpha. (A computational sketch of these estimates appears after this
list.)
• Split-half reliability requires the administration of the test to only one group. The test
is then split into halves that are equivalent. The correlation between the scores on the
two halves is then estimated.
o Split-half is similar to the equivalent forms method of estimation in that it
provides an indication of whether or not the sample of items on the test is a
dependable sampling of the domain.
o A problem with using split-half reliability estimation for classroom achievement
tests is how to split items such that content and difficulty are equivalent across the
halves.
• Kuder-Richardson methods estimate an average reliability found by taking all
possible splits of the test. The methods assume that items on a single test measure the
same attribute and that the test is a power test and not a speed test.
o The estimation involves comparison of the sum of the item variances to the overall
test variance. (Variance, like the range and standard deviation, is a measure of
variability. It is equal to the standard deviation squared.)
o Because the KR methods rely on item variance (as can be observed in the KR20
and KR21 formulas), it is important to remember that they may not be the best
reflection of reliability for criterion-referenced tests in which it is possible and
appropriate that every student answers an item or many items correctly. When
this happens, item variance is reduced and the resulting reliability estimate will be
lower. Therefore, while it may not be the best method for estimating reliability of
results for a criterion-referenced teacher-made test, it is still the most practical
method available and must be interpreted in light of these factors.
o Kuder-Richardson internal consistency reliability methods are not appropriate for
speeded tests - the estimates tend to be inflated when tests are speeded. Teacher-made
criterion-referenced tests are not affected as much by this caution as a
standardized test in which time allowance is more of a factor. Kuder-Richardson
reliability estimates must be interpreted with caution unless it has
been determined that respondents generally have adequate time.
o Another limitation is that the KR reliability estimation methods do not reflect the
stability of scores over time.
• Cronbach's alpha (coefficient alpha) is a variation of the Kuder-Richardson methods
that is used when responses are not scored on a dichotomous scale (i.e., two
possible judgments, "correct" or "incorrect") but instead come from a scale with
multiple response choices, where an answer can receive more than one point.
o This method is used with scales such as Likert-type scales (e.g., 5 - strongly
agree, 4 - agree, 3 - neutral, 2 - disagree, 1 - strongly disagree), which
are often used on survey questionnaires or instruments measuring psychological
traits or attitudes.
o The same cautions that apply to the KR methods apply to coefficient alpha (it should
not be used with speeded tests and does not indicate stability over time). The same
advantage (a single test administration) applies to coefficient alpha as well.
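To make the internal consistency estimates above more concrete, the following minimal sketch (not from the textbook; it assumes Python with NumPy, and the item scores are invented for illustration) computes a split-half estimate, KR20, and Cronbach's alpha from one small set of dichotomously scored items. The split-half correlation is adjusted upward with the Spearman-Brown correction, the standard step for projecting half-test reliability to full-test length; for items scored 0/1, KR20 and coefficient alpha give the same value.

import numpy as np

# Hypothetical item-score matrix: rows are 6 students, columns are 8 items
# scored dichotomously (1 = correct, 0 = incorrect). Invented for illustration.
scores = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0],
])
n_items = scores.shape[1]
totals = scores.sum(axis=1)
total_var = totals.var()                 # variance of the total scores

# Split-half: correlate odd-item and even-item half scores, then apply the
# Spearman-Brown correction to project the half-test correlation to full length.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(odd_half, even_half)[0, 1]
split_half = 2 * r_halves / (1 + r_halves)

# KR20: compares the sum of the item variances to the total-score variance.
p = scores.mean(axis=0)                  # proportion answering each item correctly
kr20 = (n_items / (n_items - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Cronbach's alpha: the same idea, but the item variances are computed directly,
# so the formula also works for multi-point (e.g., Likert-type) items.
alpha = (n_items / (n_items - 1)) * (1 - scores.var(axis=0).sum() / total_var)

print(f"Split-half (Spearman-Brown corrected): {split_half:.2f}")
print(f"KR20: {kr20:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")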
The following research article abstract is an example of a study that examined the Cronbach's
Alpha internal consistency reliability of scores from a teacher report measure of reading
engagement.
Abstract
This study examined psychometric properties of the Kindergarten Reading Engagement Scale
(KRES), a brief teacher-report measure of classroom reading engagement. Participants were 27
students with identified reading deficits from a predominantly low-income, African-American
community. Data were collected in kindergarten (Time 1) and first grade (Time 2). The KRES
demonstrated strong internal consistency (Cronbach's alpha=.96) and modest test-retest
reliability (r=.66). KRES ratings were significantly correlated with scores from the Word Reading
subtest of the Wechsler Individual Achievement Test-Second Edition and the Sound Matching
subtest of the Comprehensive Test of Phonological Processing, measured at Time 1 and Time 2.
Strategies for refining the scale and implications for applying the KRES in school-based program
evaluations are discussed.
Clarke, A.T., Power, T.J., Blom-Hoffman, J., Dwyer, J.F., Kelleher, C.R., & Novak, M. (2003).
Kindergarten reading engagement: An investigation of teacher ratings. Journal of Applied School
Psychology, 20(1), 131-144.
Inter-rater Reliability
• Inter-rater reliability indicates the consistency of scores that require judgments. On a
performance rating for example, it would be important to know if the ratings that were
obtained would be consistent if more than one rater evaluated the work or if the work
was evaluated on more than one occasion.
• Consistency is defined as the similarity of the rank order of ratings by two different
judges.
o The reliability estimate may be a correlation estimated using the two sets of
scores (i.e. the scores from judge one and the scores from judge two).
o Another interrater reliability estimation method involves computing the percentage
of agreement between the two scorers.
o The method selected depends on the purpose of the scores or ratings. If the
scores needed are rankings, the correlation coefficient would be selected; if the
actual score is needed (such as in a pass/fail decision), the percentage agreement
method would be selected. (A computational sketch of both approaches follows.)
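The sketch below is not part of the course readings; it assumes Python with NumPy, and the ratings are invented for illustration. It shows both of the inter-rater approaches described above: a correlation between two judges' ratings and the percentage of exact agreement.

import numpy as np

# Hypothetical ratings assigned by two judges to the same eight student
# performances on a 1-5 rubric. Invented values for illustration only.
judge_1 = [4, 3, 5, 2, 4, 5, 3, 2]
judge_2 = [4, 2, 5, 2, 3, 5, 3, 3]

# Correlation approach: appropriate when the rank ordering of the
# performances is what matters.
r = np.corrcoef(judge_1, judge_2)[0, 1]

# Percent-agreement approach: appropriate when the actual score matters,
# for example when a pass/fail decision rests on it.
agreement = sum(a == b for a, b in zip(judge_1, judge_2)) / len(judge_1)

print(f"Inter-rater correlation: {r:.2f}")
print(f"Exact agreement: {agreement:.0%}")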
The following research article abstract is an example of a study that examined the rater
agreement method of estimating reliability of scores from a structured interview process to
determine special health care needs of children.
Abstract
The purpose of this study was to determine if two teams of raters could reliably assign codes and
performance qualifiers from the Activities and Participation component of the International
Classification of Functioning, Disability, and Health (ICF) to children with special health care
needs based on the results of a developmentally structured interview. Method. Children (N =
40), ages 11 months to 12 years 10 months, with a range of health conditions, were evaluated
using a structured interview consisting of open-ended questions and scored using developmental
guidelines. For each child, two raters made a binary decision indicating whether codes
represented an area of need or no need for that child. Raters assigned a performance qualifier,
based on the ICF guidelines, to each code designated as an area of need. Cohen's Kappa statistic
was used as the measure of inter-rater reliability. Results. Team I reached good to excellent
agreement on 39/39 codes and Team II on 38/39 codes. Team I reached good to excellent
agreement on 5/5 qualifiers and Team II on 10/14 qualifiers. Conclusions. A developmentally
structured interview was an effective clinical tool for assigning ICF codes to children with special
health care needs. The interview resulted in higher rates of agreement than did results from
standardized functional assessments. Guidelines for assigning performance qualifiers must be
modified for use with children.
Kronk, R., Ogonowski, J., Rice, C., & Feldman, H. (2005). Reliability in assigning ICF codes to
children with special health care needs using a developmentally structured interview. Disability &
Rehabilitation, 27(17), 977-983.
Standard Error of Measurement
Our common sense tells us that no measurement procedure is perfect. We must
acknowledge that every score contains a certain amount of error. If we use scores to
make decisions, we have the responsibility of making ourselves familiar with an estimate
of the amount of error they contain. The higher the stakes, the more responsibility we
have to learn about the error and interpret the score accordingly.
Test scores with low reliability contain large amounts of error, and test scores with high
reliability contain lower amounts of error. Low reliability would mean that there would be
large variations in assessment results if students were to be retested over and over
again. If scores reflected high reliability we would have more confidence that if students
were tested over and over again, there would be little variation in the ranking of their
scores.
The standard error of measurement (SEM) provides an estimate of the amount of
error in a score. It represents the amount that a score would be expected to vary if the
test were administered over and over again.
Scores should be interpreted within the context of the error they contain. This involves
reporting a band (range) within which the observed score falls. Upon retesting, the score
might fall anywhere within the band.
Standardized tests report the standard error estimates in the technical manual and often
provide the bands on the score reports. The band indicates how much a student's score
may vary upon retesting.
Bands can be reported at different levels of confidence or probability (based on the
standard normal curve). Bands may be reported at the 68% confidence level, 95%
confidence level, or 99% confidence level.
The standard error of measurement is estimated using the standard formula SEM = SD √(1 - r); that is, the standard deviation of the test scores multiplied by the square root of one minus the reliability coefficient.
Depending on the degree of confidence needed, a band is obtained by adding and
subtracting the SEM until the desired level of confidence is reached. A band at the 68%
confidence is obtained by adding and subtracting one standard error of measurement to
the score. Consider this example: X = 54 and SEM = 2 (recall that X is the symbol for a
score).
o The 68% band is represented by the range of scores from 52 - 56, or 54 ± 1 SEM
(subtract the SEM of 2 from 54 and then add the SEM of 2 to 54); upon retesting,
68% of the time, the student's score is likely to fall between 52 and 56.
o The 95% band is represented by the range of scores from 50 - 58, or 54 ± 2 SEM
(54 ± 4); upon retesting, 95% of the time, the student's score is likely to fall
between 50 and 58.
o The 99% band is represented by the range of scores from 48 - 60, or 54 ± 3 SEM
(54 ± 6); upon retesting, 99% of the time, the student's score is likely to fall
between 48 and 60. (A computational sketch of these bands appears after this discussion.)
More advanced measurement textbooks will further explain the meaning of the standard
error of measurement, how it is derived, and how it is best interpreted. You are
encouraged to study beyond what the time constraints of this course allow,
especially if you use test scores to make the types of decisions that have lasting effects
on the lives of other people.
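As a quick check on your own calculations, here is a minimal sketch (Python; not part of the course materials) that reproduces the banded-score example above and also shows how an SEM could be estimated from a test's standard deviation and reliability coefficient; the SD and reliability values in the last line are invented for illustration.

import math

def sem_from_reliability(sd, reliability):
    """Standard error of measurement: SD multiplied by sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(observed, sem, n_sem):
    """Band of observed +/- n_sem * SEM (1 ~ 68%, 2 ~ 95%, 3 ~ 99%)."""
    return observed - n_sem * sem, observed + n_sem * sem

# The worked example from the text: X = 54 and SEM = 2.
for n_sem, level in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low, high = score_band(54, 2, n_sem)
    print(f"{level} band: {low:.0f} - {high:.0f}")

# The SEM itself can be estimated from a test's standard deviation and
# reliability coefficient; SD = 10 and r = .91 are invented values.
print(f"Estimated SEM: {sem_from_reliability(10, 0.91):.1f}")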
The following research article abstract is an example of a study that reported test-retest
reliability coefficients and standard error of measurement values computed from those coefficients.
Abstract
Test-retest reliability of the Test of Variables of Attention (T.O.V.A.) was investigated in two
studies using two different time intervals: 90 min and 1 week (±2 days). To investigate the 90-min
reliability, 31 school-age children (M = 10 years, SD = 2.66) were administered the T.O.V.A.
then readministered the test 90 min afterward. Significant reliability coefficients were obtained
across omission (.70), commission (.78), response time (.84), and response time variability
(.87). For the second study, a different sample of 33 school-age children (M = 10.01 years, SD =
2.59) were administered the test then readministered the test 1 week later. Significant reliability
coefficients were obtained for omission (.86), commission (.74), response time (.79), and
response time variability (.87). Standard error of measurement statistics were calculated
using the obtained coefficients. Commission scores were significantly higher on second trials
for each retest interval.
TABLE 2
Scores for the 1-Week Interval (N = 33)

                            First Time          Second Time
T.O.V.A. Score              M        SD         M         SD        r      SEM
Omission                    90.39    21.85      91.42     21.86     .86    5.61
Commission                  92.39    19.95      105.88*   15.37     .74    7.65
Response time               94.63    15.55      90.85     21.05     .79    6.87
Response time variability   97.70    18.32      98.64     20.94     .87    5.41

*p < .01.
Leark, R.A., Wallace, D.R., & Fitzgerald, R. (2004). Test-Retest Reliability and Standard Error of
Measurement for the Test of Variables of Attention (T.O.V.A.) With Healthy School-Age Children.
Assessment, 11(4), 285-289.
Factors That Influence Reliability Interpretation
• Several factors affect the calculation and interpretation of reliability estimates and must
be kept in mind when interpreting test results.
o The number of items or tasks on the test affects the reliability estimate. Tests with
more items tend to have higher reliability estimates, and shorter tests tend to have lower
reliability coefficients (see the sketch after this list).
o The variability of the scores also influences the obtained reliability estimate. Less
variability results in lower reliability, while higher variability tends to result in
higher estimates of reliability.
o The level of objectivity influences the reliability. More objectivity results in higher
reliability, while lower objectivity is related to lower reliability.
o The difficulty of test items affects the reliability. Extreme levels, in which all students
answer incorrectly or all students answer correctly, result in lower reliability.
• It is important to consider the possible sources of error in the scores when selecting and
interpreting correlation coefficients.
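The first factor above, test length, can be quantified with the Spearman-Brown prophecy formula, a standard result that is not developed in this module. The short sketch below (Python; the reliability value and length factors are invented for illustration) projects what would happen to a reliability coefficient of .70 if a test were doubled or halved in length, assuming the added or removed items are comparable to the originals.

def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened (or shortened) by
    length_factor, assuming the added or removed items are comparable."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A test with reliability .70 doubled in length (length_factor = 2):
print(f"Doubled test: {spearman_brown(0.70, 2):.2f}")   # about .82
# The same test cut to half its length (length_factor = 0.5):
print(f"Halved test: {spearman_brown(0.70, 0.5):.2f}")  # about .54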
Read the National Council on Measurement in Education (NCME) Instructional Modules on Reliability:
NCME Instructional Module on Reliability of Scores from Teacher-Made Tests, found at
http://www.ncme.org/pubs/items/ITEMS_Mod_3.pdf
NCME Instructional Module on Understanding Reliability, found at
http://www.ncme.org/pubs/items/15.pdf
Now do the practice exercises on the worksheet found in the table of contents of this module.
You do not need to submit these practices but they are very helpful for demonstrating your
understanding of the reliability of test results. You may want to discuss with your group
members any difficulties you are having with the content of the modules or the calculation and
interpretation practices on the worksheets.
Examine an article that reports empirical research on a topic of interest to you. It may be one
that you have located for one of your other courses (or a new one for this purpose, if you wish).
Examine the "instrument" or "measurement" section of the article to locate the reliability
estimates reported for the instruments used to collect data for that study. Evaluate the estimates
of reliability using the criteria you have just learned.
Practice Exercises (please click the link to download the PDF file for the exercises).
EDF 6432 - Measurement and Evaluation in Education
Dr. Haiyan Bai
Module 8 Overview
The concepts in this module are important whether you are interpreting standardized test scores
as a teacher in a classroom or in another professional role such as school leader, counselor, or
researcher. Consider all the attention that standardized tests scores receive in the media and
research. These measurement skills can assist you in performing your role, consistent with your
professional philosophy, and with high quality information at your fingertips to make effective
decisions. The skills are also important for interpreting and conducting research (teacher or
school leader action research, school leaders' or private, non-profit evaluation research, scholarly
research).
Module 8 corresponds to Chapters 19 & 20 in our textbook. We will study standardized tests and
their derived scores. Content in this module relates to the text but includes content not found in
the textbook as well.
We have studied the important attributes of high quality assessment: validity and reliability of
results (the extent that inferences we make from the results are appropriate and the consistency
with which we obtain results). In this module, you will find information on validity and reliability
of standardized test results and will interpret specific types of derived scores.
The table below contains the objectives, readings, learning activities, and assignments for
Module 8.
Module 8 focuses on the following objectives:

Objectives
Chapter 19
• Compare and contrast standardized and teacher-made tests.
• Indicate the sources of error controlled or minimized by standardized tests.
• Develop a local norms table given a set of test scores.
• Describe how student-related factors can affect standardized test scores.
• Convert grade equivalents and percentile ranks to standard scores (using a supplied conversion table) to facilitate determining aptitude-achievement discrepancies.
• Compare and contrast grade equivalent scores, age equivalent scores, percentile ranks, and standard scores.
Chapter 20
• Discriminate among standardized achievement test batteries, single-subject achievement tests, and diagnostic achievement tests.
• Discriminate between aptitude tests and achievement tests.
• Compare and contrast diagnostic tests and survey batteries.
• Explain why there is no universally accepted definition of personality.
• Compare and contrast objective and projective personality assessment techniques and identify the major advantages and disadvantages of each approach.

Readings
• Chapters 19 & 20 in text
• Content and articles specified in module
• Florida Department of Education website related to FCAT reports

Learning Activities
• Several non-posted practice tasks
• Posting to class Standardized Test Score discussion topic

Assignments
• Revise the Final Project
Module 8 Part 1: Standardized Testing
This module accompanies Chapters 18-19 in the Kubiszyn & Borich (2007) textbook. Please read
Chapters 18 and 19. You must also use the principles of validity and reliability as well as
central tendency and variability to completely understand the content related to this module.
Basic Characteristics of Standardized Tests
Standardized tests are usually commercially published after a long and expensive development
process. They are called standardized because they are to be administered and scored according
to specified procedures in the same way every time they are used. As we have learned from "The
Standards," the author, publisher, and users of test results are responsible for identifying the
evidence for the validity of the test results.
After reading about the different types of tests from chapters 18 - 19 in the textbook, locate an
example of each of the following types of standardized tests. You may want to visit the Buros
Mental Measurements Yearbook site again to look for your examples. Recall/review the definition
of each of the test types and then look specifically for evidence of the validity of the test results.
(Many of the large commercial test publishers have websites where you can find some of this
information; other information can be found by conducting an online library or ERIC search.) This
activity is to extend your application of the concepts related to the qualities of instruments that
we have learned throughout the semester. (It is not necessary to post these examples as an
assignment.)
Norm-referenced academic achievement test (locate an example of this type of standardized test)
Criterion-referenced academic achievement test (locate an example of this type of standardized test)
Scholastic aptitude test (locate an example of this type of standardized test)
Identify an example of a standardized test that has been used in a research study of interest
to you. (You may use one of the instruments from the examples above if you wish.) Now locate
the author's or publisher's description of the test's purpose and any information about the
validity of the test that is available from the author/publisher. Compare the way the test was
administered and results interpreted in the research study to the author's stated purpose and
use for the test. Post a brief summary (one short paragraph) to the Standardized Tests
Discussion Topic of what you found when you compared the intended use to the actual use in
the research study. Briefly comment on your reaction to what you have found. Read a couple of
your classmates' postings to compare their findings with yours.
You may wish to use the following outline to guide your work or for evaluating standardized tests
in general. It contains some of the most critical elements that should be considered when
evaluating the quality of a standardized test.
I. Reference Data
a. Title
b. Author(s)
c. Publisher
d. Type of test
e. Description of test and subtests
II. Practical Considerations:
a. Cost
b. Time limits
c. Alternate forms
d. Appropriate grade levels
e. Availability of manual
f. Copyright data of manual and test booklets
g. Purpose of test
h. Required administrator qualifications
III. Reliability
a. Reliability for each recommended use
b. Type(s) of reliability reported
IV. Validity
a. Validity evidence for each recommended use
b. Types of validity evidence
V. Scales and Norms:
a. Types of norms provided
b. Difficulty levels of items
c. Population used for norm group
d. Methods used to select norm group
e. Year and time of year standardization data was collected
Module 8 Part 2: Standardized Test Score Interpretation
This part of the module is associated with Chapters 18 and 19 in the Kubiszyn and Borich
textbook. Please read those chapters before working on these activities.
Practice interpreting a variety of standardized test scores. Create a diagram of the Standard
Normal Curve. Using the means and standard deviations of the scales listed below, plot the
scores that correspond to the familiar marker points. You may want to use Figure 18.3 in the
textbook to help you get started. Then practice interpreting as required in the questions that
follow. Remember the Students Helping Students discussion topic if you would like to compare
answers with classmates or get some pointers.
Scales:
z (Mean 0; SD 1)
T (Mean 50; SD 10)
IQ (Mean 100; SD 15 for the Wechsler or 16 for the Stanford-Binet)
Normal Curve Equivalent (Mean 50; SD 21.06)
Questions for interpretation practice:
How would you describe the performance of a student who earned a z score of 2.5 on a norm
referenced standardized test?
How would you describe the performance of a student who earned a T score of 28 on a norm
referenced standardized test?
How would you describe the performance of a student who earned a percentile rank of 28 on a
norm referenced standardized test?
What Normal Curve Equivalent score is equal to a percentile rank of 16?
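After you have tried the questions above, a small sketch like the following (Python 3.8+, standard library only; not part of the course materials) could be used to check your conversions among the scales and percentile ranks. The example values in it are chosen only for illustration.

from statistics import NormalDist

# Common norm-referenced scales expressed as (mean, standard deviation).
SCALES = {
    "z": (0, 1),
    "T": (50, 10),
    "IQ (Wechsler)": (100, 15),
    "NCE": (50, 21.06),
}

def to_z(score, scale):
    mean, sd = SCALES[scale]
    return (score - mean) / sd

def from_z(z, scale):
    mean, sd = SCALES[scale]
    return mean + z * sd

# Example: express a T score of 28 on each of the other scales and find its
# approximate percentile rank under the standard normal curve.
z = to_z(28, "T")
for name in SCALES:
    print(f"{name}: {from_z(z, name):.1f}")
print(f"Approximate percentile rank: {NormalDist().cdf(z) * 100:.0f}")

# Example: the Normal Curve Equivalent score that corresponds to a
# percentile rank of 16.
z_for_pr_16 = NormalDist().inv_cdf(0.16)
print(f"NCE for a percentile rank of 16: {from_z(z_for_pr_16, 'NCE'):.0f}")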
Recall the process of interpreting percentile bands presented in Chapter 17. Practice building
percentile bands (68%) around the following Obtained Percentile Rank scores. Assume there was
an SEM of 4 on this particular subtest and interpret the performances as indicated. Bands and
interpretations for Chris and Angelina have been done to help you get started.
Student: Angelina
  Percentile Rank: 82
  Percentile Band Boundaries: 78 - 86
  Plot of the 68% Percentile Band: 1_____10_____20_____30_____40_____50_____60_____70____X80XXX__90____99
  Interpretation: Average compared to the norm group

Student: Chris
  Percentile Rank: 92
  Percentile Band Boundaries: 88 - 96
  Plot of the 68% Percentile Band: 1_____10_____20_____30_____40_____50_____60_____70_____80____X90XXX__99
  Interpretation: Above average compared to the norm group; Chris performed better than Angelina

Student: Dewan
  Percentile Rank: 88
  Percentile Band Boundaries:
  Plot of the 68% Percentile Band: 1_____10_____20_____30_____40_____50_____60_____70_____80_____90____99
  Interpretation:

Student: Cheerie
  Percentile Rank: 24
  Percentile Band Boundaries:
  Plot of the 68% Percentile Band: 1_____10_____20_____30_____40_____50_____60_____70_____80_____90____99
  Interpretation:

Student: Lan
  Percentile Rank: 55
  Percentile Band Boundaries:
  Plot of the 68% Percentile Band: 1_____10_____20_____30_____40_____50_____60_____70_____80_____90____99
  Interpretation:
Visit the Florida Department of Education website and locate information on the Florida
Comprehensive Assessment Test. Locate the Assessment and Accountability Briefing Book
(especially pp. 21 - 24) and the FCAT for Reading and Math Technical Report (use FCAT reliability
to search within the FL DOE site under the Shortcuts keyword search). Read about the meaning
of the various scores that are reported. Choose an area of interest to you (i.e., Math, Reading,
etc.) to examine more closely. Locate the validity, reliability, and standard errors reported in the
technical report. Next use the tables found at this link Scale Scores at Achievement Levels (URL
is http://fcat.fldoe.org/pdf/fcAchievementLevels.pdf) to find the range of scale scores associated
with each of the FCAT Levels. For example, what range of Reading Scale Scores is associated
with Grade 3, Level 3? (answer: 284 - 381)
Also, visit the Publications (Educator) for Florida Comprehensive Assessment Test at
http://fcat.fldoe.org/fcatpub2.asp for important information when interpreting the FCAT results.
You may want to save the URL for future reference. For practice, locate the following information
within the documents found on that website.
Examine the FCAT Mathematics 2007 Grade 9 content focus. How many points are possible from
the Data Analysis and Probability content area?________________ (example answer: 8)
Examine the FCAT Reading 2007 Grade 10 content focus. How many points are possible from the
Conclusions and Inferences (Cluster 1) content area?_____________
Examine the FCAT Summary of Tests and Design. What percentage of points are of moderate
complexity on the FCAT Mathematics 6th-7th grade test (Table 9)?_____
Examine the FCAT Technical Report for 2003. Locate Table 69. What is the Cronbach's Alpha
reliability coefficient for the Grade 3 Reading total battery?_________
Consider how much more of this technical information you are able to understand and use
because of the hard work you have done acquiring the skills in this course. Consider what you
know now compared to what you knew prior to studying the content of this course. Good job!
Module 9 Overview
The concepts and resources in this module are important for assessing students receiving
exceptional student education services and students with limited English proficiency whether you
are a teacher in a classroom or someone in another professional role such as school leader,
counselor, instructional designer, or researcher. Consider the importance of validity and
reliability principles to the assessment process and how important they are when selecting,
designing, and administering instruments for students with special needs. Principles for ensuring
validity (appropriateness of inferences made from test results) and reliability (consistency) are
no less critical when making accommodations with existing instruments. These concepts and
skills are also important for interpreting and conducting research (teacher or school leader action
research, school leaders' or private, non-profit evaluation research, scholarly research) that
require measures of students with special needs.
Module 9 corresponds to Chapters 3 & 21 in our textbook. We will learn about the individualized
educational plan (IEP) and the assessment of children with special needs, and we will come to
understand the related policy and practice. Content in this module relates to the text but includes content not
found in the textbook as well.
One of the most important attributes of high quality assessment is the validity of results (the
extent that inferences that we make from the results are appropriate). One of the most
important steps to ensuring validity is identifying what it is you want to assess (who and what),
for what purpose (why), and under what conditions (how). In this module, we will learn skills
that will help you enhance the validity of results of tests you use, create, or evaluate for research
purposes.
The table below contains the objectives, readings, learning activities, and assignments for
Module 9.
Module 9 focuses on the following objectives:

Objectives
Chapter 3
• Identify the types of assessment data the classroom teacher may be called upon to provide as part of the child identification process.
• Identify the types of assessment instruments the classroom teacher may employ during the individual assessment process.
• Describe what response-to-intervention (RTI) is.
• Explain the classroom teacher's role in RTI development and its implementation.
• State the purpose of the RTI.
• Understand how the requirements of the NCLB, IDEIA, and the shift to formative assessment (CBM and RTI) have altered regular classroom testing.
Chapter 21
• Summary of the course.

Readings
• Chapters 3 & 21 in text
• Content and articles specified in module; note there are many sites to explore and keep handy for future reference.
• Chapter on FLDOE website: Accommodations: Assisting Students with Disabilities (link is within module)
• Brochure from FLDOE: Planning FCAT Accommodations for Students with Disabilities (link is within module)

Learning Activities
• Review resources within module (take advantage of guidelines, research, tutorials, etc. available on the Internet that are listed within the module)

Assignments
• Revise the final version of the Final Project for the final submission.
Module 9 Assessment Issues for Language Enriched Pupils & Exceptional
Student Education Settings
General Principles
The general principles we have learned that guide selection, design, and construction of
assessment for the general population apply just as much, or even more so, for Language Enriched
Pupils (LEP) and in Exceptional Student Education (ESE) settings. Recall these principles from
Module 2 Part 6:
• Systematically planned
• Good match for content
• Good match for learners
• Feasibility
• Professional in format and appearance
Imagine students who speak languages other than English in their homes. As an educator, you
will no doubt at some time be responsible for the learning of students who are in the process of
learning to speak, read, write, and understand the English language along with the subject-matter content (math, science, social studies, music, etc.).
Now think about students who must learn differently than students in the general population. For
example, think of a student with mild mental retardation, or with cerebral palsy, or with a
reading disability. This module will help you more effectively plan assessments for students with
these or similar challenges.
Systematically Planned and Constructed
Teachers must consider their resources (time, materials, cost, professional skill, etc.) along with
the entire instructional curriculum and then set up an assessment system at the beginning of the
year that will support their planned instructional system. At times it is necessary to make
adaptations to the assessments within the system to accommodate the needs of special learners.
While an effective instructional/assessment system is tailored to your specific context and
resources, it must also be adapted to meet the needs of a variety of learners. It can be a
challenge to find the balance between a plan that is both individualized for specific learners as
well as feasible. Knowledge and creativity will help in the effort required to meet that challenge;
administrative support, assistance from trained ESE and LEP professionals, as well as patience
are also needed.
Educators must make themselves aware of available resources to help meet the challenge of
appropriately adapting assessments. Look for resources that will help you become familiar with
the specific needs of the learners. These resources may include school-based personnel with
special training in ESOL and ESE; Internet web-sites with information and tips; volunteers from
the community with knowledge of various languages, cultures, or challenges to learning; or
textbooks about these topics. Of course, the learners and their families would be good sources of
information about the learners' strengths and needs as well.
Once you have made yourself familiar with the learning modalities of the special learners, it is
time to apply your creativity and measurement "best practices" to make needed adaptations to
the instruments or administration procedures. Explore the sites listed in Table 9.1 and find
others that will help you become familiar with available resources related to instruction and
assessment of language enriched students or students with exceptional learning needs. Note that
you must evaluate the suggestions you find here through a "Best Practices in Measurement" filter
and then accept, revise, or reject the suggestion based on consistency with measurement best
practices. You may want to save the addresses of the sites that will be helpful in your
professional context. Several of these sites will also be helpful as you work on our Product Exam
Part B.
Table 9.1 Example Resources: Learning and Assessment for Special Learners

Learners: 1. English Speakers of Other Languages
Resources:
1.1 NWREL publication Spring 2006, Volume 11, #3, Everyone's Child (URL
is http://www.nwrel.org/nwedu/11-03/child/)
This site is from the Northwest Regional Educational Laboratory publication
of Spring 2006, Volume 11, #3. Note tips for general education teachers at
the end of the article Everyone's Child.
1.2 NCREL Critical Issue: Mastering the Mosaic (URL is
http://www.ncrel.org/sdrs/areas/issues/content/cntareas/math/ma700.htm)
Mastering the Mosaic - Framing Critical Impact Factors to Aid Limited
English Proficient Students in Math and Science. Resource from the North
Central Regional Educational Laboratory contains extensive research-based
information. Note the overviews of ELL programs and the sections on
Instruction and Assessment. This resource is useful to many education
professionals such as teachers, administrators, and ELL personnel.
1.3 The Florida Department of Education Office of Academic Achievement through Language
Acquisition (AALA) website contains important resources for teachers and administrators (URL is
http://www.fldoe.org/aala/). Among the many resources found there, be sure
to note the Documents and Publications link. In the long list of resources,
especially take note of Accommodations for Limited English Proficient
Students in the Administration of Statewide Assessments; Technical
Assistance Paper: Modifications to the Consent Decree...; Inclusion as an
Instructional Model for LEP students; Clustering II: Technical Assistance
Note....
1.4 University of South Florida St. Petersburg ESOL Infusion Site contains
extensive ESOL resources for a wide variety of education professionals.
(URL is http://fcit.usf.edu/esol/resources/resources_articles.html) This may
be a useful site to bookmark as there are so many resources.
1.5 CCSSO Ensuring Accuracy in Testing for English Language Learners
(URL is
http://www.ccsso.org/Resources/Programs/English_Language_Learners_%2
8ELL%29.html) This is a publication found on the website of the Council of
Chief State School Officers. It is useful to all but essential to administrators.
You will have to follow links to Publications and then search using keyword:
Assessment under category: Limited English Proficient Students. This is a
valuable free, downloadable pdf file with excellent guidelines on
accommodations in test design. Note especially chapters 4, 5, and 9.
1.6 Organizing and Assessing in the Content Area Class (URL is
http://www.everythingesl.net/inservices/judith2.php) provides some useful
suggestions for instruction and assessment. Helpful suggestions (but use
measurement filter here).
1.7 CRESST Reports from the National Center for Research on Evaluation,
Standards, and Student Testing (URL is:
http://www.cse.ucla.edu/products/reports.asp). Contains a variety of
research-based reports. Explore those related to LEP concerns in your
discipline.
Learners: 2. Students with Exceptional Needs
Resources:
2.1 At the Florida Department of Education Bureau of Exceptional Education
and Student Services site, locate the Publications Index (URL is
http://www.fldoe.org/ese/). The site contains many useful resources. Locate
Technical Paper 312783 TAP FY2007-4 Accommodations for students with
Disabilities Taking the Florida Comprehensive Assessment Test (FCAT).
You may want to download or bookmark this for future reference.
2.2 Adaptations and Accommodations for Students with Disabilities (URL is
http://www.nichcy.org/pubs/bibliog/bib15txt.htm) contains a bibliography of
many useful resources. You may want to find the article: Fuchs & Fuchs
(1998, Winter). General educators' instructional adaptations for students with
learning disabilities. Learning Disability Quarterly, 21(1), 23 - 33.
2.3 On The Access Disabled Assistance Program for Tech Students site, there is
a page on Teaching Students with Disabilities (URL is
http://www.catalog.gatech.edu/general/services/assist.php). There are
sections on characteristics of students with various types of disabilities and
lists of academic accommodations grouped by category of disability.
2.4 Assistive Technology Educational Network (ATEN) (URL
is http://www.aten.scps.k12.fl.us/resources.html) provides many resources to
explore. Note the Assistive Technology Links section and keep this site (and
Network) in mind for future reference.
2.5 U.S. Office of Special Education Programs Ideas That Work site contains
many useful resources under Info and Reports (URL is
http://www.seels.net/info_reports/children_we_serve.htm). One of them is
The Children We Serve: Demographic Characteristics of Elementary and
Middle School Students with Disabilities and Their Households. Explore the
reports that may be valuable in your professional role.
Designed to Fit Characteristics of Content
As mentioned previously, it is important to pick the right kind of instrument (or items) for the
objectives you are trying to measure. This generally takes some analysis of the instructional
content - recall the work you have done classifying types and levels of learning (knowledge,
comprehension, application, etc.; psychomotor skills; affective skills). At the same time, keep in
mind the characteristics of the special learners as you are designing your instruction and then
selecting or creating the assessment tools that will measure students' learning. You will want to
select the best format for the content as well as for the variety of learners in your educational
context. While we are making adaptations, it is important to maintain the integrity of the
content while adapting the format to meet the learners' needs. If the content changes, the
meaning of achievement scores will change and this has an impact on validity of the scores. On
the other hand, failing to make appropriate adaptations when they have been specified in an
Individualized Educational Plan or by a LEP classification, so the learner doesn't have a chance to
demonstrate mastery of the content, also changes the meaning of achievement scores and has
an impact on validity.
Designed to Fit Characteristics of Learners
It is very important that instruments are designed so that they are a good match for the
characteristics of the learners. In an earlier module, you were asked to think about the specific
characteristics of the learners in your educational context. As educators, we would think of some
of the following factors as we are designing instructional activities and assessments.
 Consider developmental characteristics specific to the age group (attention span,
interests, physical dexterity, ...).
 Keep in mind whether or not students have physical, cognitive, or social-emotional
challenges (visual impairment, cerebral palsy, developmental delay or severe mental
retardation, a specific learning disability, behavior disorders, ...). It is especially
important to note whether they are receiving Exceptional Student Education (ESE)
Services and have specific testing accommodations identified on their Individualized
Educational Plan (IEP).
 Consider whether the learners speak English as a second language.
 Be aware of cultural backgrounds of students in the group. Their background may
influence the way they approach an exam, the way they interpret questions, and/or the
way they respond to the tasks or questions.
 Consider students' experiential backgrounds. For example, if they have not been to a
snowy climate then it may not be a good idea to include a snow scenario as background
in an item (unless this is a class on weather patterns, of course).
 Be aware of students' prior experience with any equipment needed to perform the skill
(if students practiced with a plastic ball and bat then it would not be good to suddenly
produce a real baseball and bat for the actual performance test; similarly, if they
practiced writing paragraphs with a paper and pencil then you would not provide a PC on
which to take the exam unless you knew they have had experience taking exams on
computers).
When students demonstrate developmental disabilities, we must think beyond these general
factors and come up with very concrete ideas on how students can acquire skills and then
demonstrate what they know and are able to do. The same is true for students whose primary
language is other than English. We must give some thought to the best way for them to
demonstrate what they have learned related to the targeted instructional objectives. We must
come up with ways these students can demonstrate what they can do without letting the lack of
English or their disability stand in the way. This requires both general knowledge of
characteristics of students with that particular condition and specific knowledge of the student's
capabilities.
Designed for Maximum Feasibility
We have realized that time is an important factor when it comes to feasibility of testing
procedures. Time to select or design the instrument, time to create it, time to score students'
work all relate to feasibility. In addition to time, feasibility issues include factors such as support
from classroom or program assistants, specialized equipment and materials that will facilitate the
student's performance, alternative space, scheduling or possibly permission factors. It takes
creativity and support for educators to make adaptations to assessments for students with
special learning challenges. The law, ethics, and our own professional motivation to support the
learning of all students require us to find the time and seek the support we need to make this
happen.
Professional in Format and Appearance
Here we must consider other important details that may contribute to or inhibit students' successful
demonstration of skills during test administration. As effective educators we want our materials
to appear professional in quality. We are reminded that when we make adaptations for special
learners, we must continue to consider the following format and administration procedures.
• absence of bias (culture, race, gender, religion, politics, etc.)
• spelling
• grammar
• legibility
• clarity
When you select or design and adapt an instrument for special learners, it is especially important
to seek feedback on the effectiveness of the adaptations. We know from experience that the
resulting product may not achieve the desired outcome. Critique from colleagues and from
members of groups of special learners helps to refine the instructions and test items or tasks to
create the best assessment possible.
Applying these design principles will help ensure your instruments will provide the most valid and
reliable results when the tests you create or select are administered to special needs students.
As you gain experience, you will probably start to apply the principles more automatically but
even seasoned professionals benefit from reviewing them occasionally. The time we invest as we
create the instruments is worth the payoff in getting the best possible information available to
make the important decisions we must make on our jobs every day.
Read Chapter 3, Assignments and Assessments (the chapter has very important information you need to know),
starting on page 28 of the following document (Accommodations:
Assisting Students with Disabilities) available on Florida's Department of Education website (the
URL is http://www.cpt.fsu.edu/ese/pdf/acom_edu.pdf). Just for practice (i.e., no posting required),
review the recommendations found in the chapter and compare them against the guidelines
you have learned for ensuring validity and reliability of test results. Identify a specific conflict
and then come up with an idea for revising the recommendation to make it more consistent with
validity and reliability guidelines.
Read (and download or make a copy of) the brochure found on the Department of Education
Exceptional Student Education site, Planning FCAT Accommodations for Students with Disabilities.
The URL is http://www.fldoe.org/ese/fcat/fcat-tea.pdf. Consider how these suggested
accommodations might be useful for teacher-made assessments of students in your professional
context.
Think about a student, friend, or family member you have known with a disability. Briefly
summarize that person's challenge (protecting confidentiality, of course) and then identify a
recommendation from the Accommodations chapter that would be helpful for them as they are
taking a test. Really consider the characteristics of the person (strengths and challenges) and
think about the suggested accommodations. Are they "generic - one size fits all" or would they
really be useful to the person without compromising the validity of the test results?
Part 2
More Resources Related to Accommodations for Students in
ESE and ESOL/LEP Programs
Take some time to read or review any of the resources you found within the course site that
were relevant to your interests or professional role (especially the chapter on validity). There are
two required articles related to accommodations for students in ESE and many other optional
resources related to assessment accommodations for students in ESE and/or ESOL programs in
this part of the module.
Considerations When Identifying and Implementing Accommodations
for Students in ESE Programs
Much of the research in this area has been related to accommodations in formal or large-scale
standardized assessment contexts. Less research has been conducted on accommodations in less
formal classroom or other educational contexts. We, as test developers and users (teachers,
administrators, researchers, etc.), are responsible for adapting the large-scale assessment
recommendations as appropriate for our target students and for evaluating them using the most
important criterion for evaluating an assessment practice - validity.
Read the following two articles related to ESE accommodations from the Council for
Exceptional Children journal Teaching Exceptional Children. They are important resources that
you will likely find useful in your future work. Then select and read at least one of the other
articles that is of most interest to your professional role.
Council for Exceptional Children (2005). Supplemental section: Guiding principles for appropriate
adaptations and accommodations. TEACHING Exceptional Children (Sept/Oct), 53-54.
Zirkel, P.A. (2006). What does the law say? TEACHING Exceptional Children, (Jan/Feb), 62-63.
Make sure you are aware of the resources on the Florida Department of Education Bureau of
Exceptional Education and Student Services Publications website found at
http://www.fldoe.org/ese/. There are many important resources related to assessment and
accommodations for exceptional students. Click on Publications and browse through the list. A
publication that most educators should be aware of is Technical Paper 312783 Accommodations
for Students with Disabilities Taking the Florida Comprehensive Assessment Test (FCAT).
(Select and read at least one from the many options listed below.)
Assistive Technology
Consider the following resources when identifying possible accommodations as a member of an
IEP development team, as a classroom teacher, a school administrator, or any other professional
responsible for accommodations for students with exceptionalities. These are ideas and
suggestions for assistive technology useful for assessment that you may not be aware of yet.
Reed, P. (2004). Critical Issue: Enhancing system change and academic success through
assistive technologies for k - 12 students with special needs. North Central Regional Educational
Laboratory. Available online at
http://www.ncrel.org/sdrs/areas/issues/methods/technlgy/te700.htm (Note: click on "technology
devices in use" to view examples of assistive technology.)
Reed, P.P. & Walser, P. (2001). Utilizing assistive technology in making testing accommodations.
Wisconsin Assistive Technology Initiative. Available online at
http://www.wati.org/AT_Services/pdf/Utilizing_AT_for_Accom.pdf
Specific Exceptionalities
Cawthon, S.W. (2006). National survey of accommodations and alternate assessments for
students who are deaf or hard of hearing in the United States. Journal of Deaf Studies and Deaf
Education, 11(3), 337-359.
Empirical and Research-oriented Studies Related to Assessment
Accommodations
Koretz, D.W. & Barton, K. (2003). Assessing students with disabilities: Issues and evidence. CSE
Technical Report. CSE-TR-587. Office of Educational Research and Improvement, Washington,
DC.
Ysseldyke, J. & Nelson, R.R. (2004). What we know and need to know about the consequences of
high-stakes testing for students with disabilities. Exceptional Children, 71(1), 75-94.
Sireci, S.G., Scarpati, S.E., & Shuhong, L. (2005). Test accommodations for students with
disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75(4),
457-490.
Elliott, S.N., Kratochwill, T.R., & McKevitt, B.C. (2001). Experimental analysis of the effects of
testing accommodations on the scores of students with and without disabilities. Journal of School
Psychology, 39(1), 3-24.
Wagner, M., Friend, M., Bursuck, W.D., Kutash, K., Duchnowski, A.J., Sumi, W.C., & Epstein,
M.H. (2006). Journal of Emotional and Behavioral Disorders, 14(1), 12-30.
Weston, T.J. (2003). The validity of oral accommodation in testing: NAEP validity studies.
Working Paper Series. National Center for Education Statistics (ED), Washington, DC. NCES-WP-2003-06. National Assessment of Educational Progress, Princeton, NJ.
Considerations When Identifying and Implementing Accommodations
for Students in ESOL Programs
As with research on assessment accommodations for students in ESE programs, research on
accommodations for students in ESOL programs is not plentiful either. We again must be guided
by principles that ensure validity of results as we identify and implement accommodations. There
are state and federal guidelines, measurement principles, and feasibility issues to consider. Make
yourself familiar with the resources on Florida's Department of Education Office of Academic
Achievement through Language Acquisition website at http://www.fldoe.org/aala/Default.asp.
(Note: be sure to click on documents and publications for important information on assessment.)
You will likely have use for these important resources in the future. Also, you may find the
following articles reporting empirical research in the area useful in your professional work.
Consider reading one of these (optional) articles related to accommodations for students with
limited English proficiency.
Abedi, J. & Hejri, F. (2004). Accommodations for students with limited English proficiency in the
national assessment of educational progress. Applied Measurement in Education, 17(4), 371-392.
Duncan, T.G., Parent, L.R., Chen, L., Ferrara, S., & Johnson, E. (2002). Study of a dual language
test booklet in 8th grade mathematics. Paper presented at the Annual Meeting of the American
Educational Research Association (New Orleans, LA).
Albus, D., Bielinski, J., Thurlow, M., Liu, K. (2001). The effect of a simplified English language
dictionary on a reading test. LEP Projects Report 1. National Center on Educational Outcomes.
Special Education Programs, Washington, DC. Available from
http://education.umn.edu/nceo/OnlinePubs/LEP1.html.
Hafner, A.L. (2001). Evaluating the impact of test accommodations on test scores of LEP
students & non-LEP students. Paper presented at the Annual Meeting of the American
Educational Research Association (Seattle, WA).
Here are a couple more websites with valuable resources. Visit the Center for Applied Linguistics
website at URL http://www.cal.org/index.html to find available resources relevant to your
professional role.
Another resource you may find useful is found under Professional Development "PD Resources"
at the World-Class Instructional Design and Assessment website found at http://www.wida.us.
Select the first presentation Comprehensive School Reform for English Language Learners
(ELL's). After listening to the first presentation you may want to select another one that would be
of most use in your professional role.