Office of Research & Evaluation
UC Irvine
Reliability and Validity
Judy Shoemaker, Ph.D.
University of California, Irvine
September 2006
Every type of measurement instrument, including educational assessment methods, contains some
degree of error. Sources of error may include the individuals being assessed (e.g., a student who
does not perform well due to illness), the administration and scoring procedures used (e.g., a noisy
event outside the classroom that prevents test takers from concentrating), or the
instrument itself. To ensure that the instrument itself is sound, it is important to review evidence
of its reliability (consistency) and validity (accuracy).
Since most of the work on reliability and validity is done within the context of tests and test
scores, that terminology will be used here. However, the concepts can be applied to any other
form of assessment.
Reliability
Reliability refers to the consistency of scores. That is, can we count on getting the same or
similar scores if the test were administered at a different time of day, or if different raters scored
the test? Reliability also refers to how internally consistent the test is.
Reliability is estimated using correlation coefficients (Pearson r) derived from various sets of test
scores. Reliability coefficients range from 0.00 (no consistency) to 1.00 (perfect consistency).
Professionally developed standardized tests often have reliability coefficients of .90 or higher,
while teacher-made tests often have coefficients of .50 or lower (Ebel & Frisbie, 1986).
There are several different methods for estimating the reliability of a test:
Test-Retest Method: Test-retest reliability is a measure of stability (Gronlund & Linn, 1990).
For this method the test is administered twice to the same set of students. The correlation
between students' scores on the first administration and their scores on the second is an estimate of
reliability. This method makes several assumptions, some of which may not be realistic: that
nothing has changed in what students know or can do between administrations, and that students
do not remember details of the first test.
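
As a concrete illustration (not part of the original handout), the test-retest estimate is simply the Pearson correlation between the two administrations. The sketch below uses made-up scores and NumPy's corrcoef function.

```python
# Minimal sketch of a test-retest reliability estimate.
# The score lists are hypothetical; in practice these would be the same
# students' scores on the first and second administrations of the test.
import numpy as np

first_administration = [72, 85, 90, 65, 78, 88, 70, 95]    # scores, time 1
second_administration = [75, 83, 92, 60, 80, 85, 68, 97]   # same students, time 2

r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")
```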
Alternate Forms Method: With this method two parallel tests are developed at the same time,
following the same test blueprint.1 Each student takes both tests, and the correlation between the
two sets of scores is an estimate of the reliability of the test. Alternate forms reliability is a measure
of equivalence (Gronlund & Linn, 1990).
1 A test blueprint is a matrix showing the areas to be assessed (rows) and the cognitive levels to be assessed (columns).
The number in each cell identifies how many test items will be used to measure that area at that specific cognitive level.
A test blueprint is also called a Table of Specifications.
Split-Halves Method: Since it is often difficult and inefficient to develop two parallel tests, a
more common approach is to split the current test into two equivalent halves and correlate the
scores on the two halves. One method is to use the odd-numbered items as one half of the test and
the even-numbered items as the other half. The correlation between the two halves
is an estimate of the reliability of the test. In this case, the correlation coefficient is usually adjusted
upward (typically with the Spearman-Brown formula) to reflect the length of the full test.
Split-halves reliability is a measure of internal consistency.
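
To make the procedure concrete, here is a minimal sketch (not from the original handout) that splits a hypothetical right/wrong item matrix into odd and even halves, correlates the half scores, and applies the Spearman-Brown length adjustment.

```python
# Minimal sketch of split-halves reliability with the Spearman-Brown adjustment.
# The item matrix is hypothetical: rows are students, columns are items scored 1/0.
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

odd_half = responses[:, 0::2].sum(axis=1)    # total on items 1, 3, 5, 7
even_half = responses[:, 1::2].sum(axis=1)   # total on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)           # Spearman-Brown adjustment to full length
print(f"Half-test correlation: {r_half:.2f}; adjusted reliability: {r_full:.2f}")
```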
Kuder-Richardson (K-R) Method. Taking the split-halves method one step further, the method
developed by Kuder and Richardson yields an estimate of reliability that is equivalent to the
“average correlation achieved by computing all possible split-halves correlations for a test” (Ebel
& Frisbie, 1986). Use of the K-R formulas assumes that each test item is scored dichotomously
(one point for a correct answer and no points for an incorrect answer). Kuder-Richardson reliability
is a measure of internal consistency.
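
For readers who want to see the arithmetic, below is a minimal sketch (not from the handout) of the most common Kuder-Richardson formula, KR-20, applied to a hypothetical matrix of dichotomously scored items.

```python
# Minimal sketch of the Kuder-Richardson formula 20 (KR-20).
# Rows are students, columns are items scored 1 (correct) or 0 (incorrect).
import numpy as np

def kr20(responses):
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion correct per item
    q = 1 - p                                       # proportion incorrect per item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

responses = [[1, 1, 0, 1, 1, 0],
             [0, 1, 0, 0, 1, 0],
             [1, 1, 1, 1, 1, 1],
             [0, 0, 0, 1, 0, 0],
             [1, 0, 1, 1, 1, 1]]
print(f"KR-20 reliability estimate: {kr20(responses):.2f}")
```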
Cronbach’s Alpha. This method is appropriate for items scored with values other than 1 or 0,
such as an essay item that might be scored using a 5-point scale. Like the K-R formulas,
Cronbach’s alpha represents an average correlation that would be obtained over all split-halves
of the test. Cronbach’s alpha is a measure of internal consistency and is the most widely used
and reported method for estimating the reliability of test scores.
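
A minimal sketch of the alpha computation (not part of the original handout), using hypothetical ratings of four essay items on a 5-point scale:

```python
# Minimal sketch of Cronbach's alpha for items scored on any scale.
# Rows are students, columns are essay items scored on a 5-point rubric.
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                           # number of items
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = [[4, 5, 3, 4],
           [2, 3, 2, 3],
           [5, 5, 4, 5],
           [3, 3, 3, 2],
           [4, 4, 5, 4]]
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```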
Inter-Rater Reliability. For essay or other performance-based items that are scored by more
than one rater, it is important that all raters score the items in the same way. To estimate
inter-rater reliability, we can calculate the percentage of scores that are in absolute agreement
(all raters assign exactly the same score) when multiple raters rate the same set of papers.
Another commonly used measure is the average correlation of scores between pairs of raters. In
both cases, we would look for values above .70. Inter-rater reliability is especially
important when multiple raters are using a scoring rubric to assess student learning outcomes.
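
The sketch below (not from the handout) illustrates both indices on a hypothetical set of papers scored by three raters with a 4-point rubric: the proportion of papers on which all raters agree exactly, and the average Pearson correlation over all pairs of raters.

```python
# Minimal sketch of two inter-rater reliability indices.
# Rows are papers, columns are raters; each cell is a 1-4 rubric score.
from itertools import combinations
import numpy as np

scores = np.array([
    [3, 3, 4],
    [2, 2, 2],
    [4, 4, 4],
    [1, 2, 1],
    [3, 3, 3],
])

# Proportion of papers on which all raters assign exactly the same score
exact_agreement = np.mean([len(set(row)) == 1 for row in scores])

# Average Pearson correlation over all pairs of raters
pair_rs = [np.corrcoef(scores[:, i], scores[:, j])[0, 1]
           for i, j in combinations(range(scores.shape[1]), 2)]
average_r = np.mean(pair_rs)

print(f"Exact agreement: {exact_agreement:.0%}; average pairwise r = {average_r:.2f}")
```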
Improving reliability: Reliability can be improved by ensuring that test items are written
clearly and without ambiguity and that response options are appropriate and meaningful. Where
feasible, lengthening the test (for example, by adding more items) will improve reliability. If
scoring rubrics are being used, the reliability of ratings can be improved through rater training and
practice.
Types of Reliability

Test-retest
  What's measured: Stability
  Procedure: Give the same test twice to the same set of students over a very short period of time (a day or two) and correlate the test scores.
  Comments: Assumes no learning over time; fairly unrealistic assumptions.

Alternate forms
  What's measured: Equivalence
  Procedure: Develop two parallel versions of the same test, give both tests at the same time, and correlate the test scores.
  Comments: Difficult and time consuming to develop two different versions of the same test.

Split-halves
  What's measured: Internal consistency
  Procedure: Before scoring, split the test into two halves (odd/even items) and correlate the scores on the two halves.
  Comments: Works best when the test has many items; correlations should be adjusted upward to reflect the total number of items on the test.

Kuder-Richardson (K-R)
  What's measured: Internal consistency
  Procedure: Hypothetical average of all possible split-halves correlations.
  Comments: Assumes each test item is scored right/wrong (1/0).

Cronbach's alpha
  What's measured: Internal consistency
  Procedure: Hypothetical average of all possible split-halves correlations.
  Comments: Allows test items to be scored with values other than 1/0. Most widely used measure of internal consistency.

Inter-rater
  What's measured: Agreement between raters
  Procedure: Degree of consistency among raters when more than one rater is used to score the same test items.
  Comments: Important for performance exams or where scoring rubrics are used by multiple raters.
Validity
Validity refers to how accurately a test measures what it is supposed to measure. A foreign
language placement exam is said to be valid if it accurately predicts grades in introductory
foreign language courses. To be valid, a test must first be reliable, but not all reliable tests
demonstrate validity. Stated another way, “Reliability is a necessary but not sufficient condition
for validity” (Gronlund & Linn, 1990, p. 79).
There are different types of validity depending on the purpose of the test. Commonly used types
of validity are face validity, construct validity, predictive validity, and content validity. Unlike
reliability, there is no single statistical method that is used to demonstrate validity.
Face validity. Face validity is the weakest type of validity. Face validity refers to how well the
test “on the face of it” looks like it measures what it is supposed to measure. A mathematics test
made up of mathematics problems is said to have face validity. This type of validity is especially
important to test takers.
Content validity. Content validity is the most important type of validity for assessment of
student learning. Content validity refers to how well the test items represent what is learned in a
course or in a similar knowledge domain. That is, to what extent are the test items representative
of the types of content or skills that were taught? Content validity can be enhanced by carefully
designing the test to reflect what was taught. To ensure this alignment, many test makers use a
test blueprint or matrix where the rows or columns are the important elements of the content and
the cells represent the associated test items.
Not all achievement tests demonstrate content validity for use in classroom assessment. For
example, although nationally standardized achievement tests are developed by content experts,
the content selected may not actually be a good fit for what was taught in a specific course. Thus
it might be inappropriate to assess course learning outcomes with a standardized test unless it can
be shown that there is a good fit between the test and what is actually taught in the course.
Predictive validity. Predictive validity is important when the purpose of the test is to predict
future behavior, such as the foreign language placement exam described earlier. Predictive
validity is demonstrated when scores on the test are positively correlated with future behavior,
such as a grade in a course. This type of validity is also known as criterion-related validity since
test scores are compared with an external criterion.
Construct validity. Construct validity indicates the extent to which a test measures an
underlying construct, such as intelligence or anxiety. Construct validity is demonstrated if the
test correlates with similar tests measuring the same construct, or if test scores behave as the
construct would predict. For example, when individuals are placed in a stressful environment,
their scores on an anxiety test would be expected to go up.
Types of Validity

Face
  Definition: The extent to which a test “looks like” what it is supposed to measure.
  Example: An in-class essay in a writing course.

Content
  Definition: The extent to which a test is representative of what was taught.
  Example: Essential for assessing student learning outcomes.

Predictive (criterion)
  Definition: The extent to which a test accurately predicts future behavior.
  Example: A calculus placement exam used to place students into pre-calculus or regular calculus courses (criterion = course grades).

Construct
  Definition: The extent to which a test corresponds to other variables, as predicted by the construct or theory.
  Example: Scores on a depression scale correlate with a physician's diagnosis of depression.
References
Ebel, R. L., & Frisbie, D. A. (1986). Essentials of educational measurement. Englewood Cliffs,
NJ: Prentice Hall.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New
York: Macmillan.