Chapter 4 - Reliability

Chapter 4 – Reliability
1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
A. Domain sampling – test items
B. Time sampling – test occasions
C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability
Chapter 4 - Reliability
• Measurement of human ability and knowledge is challenging because:
◦ ability is not directly observable – we infer ability from behavior
◦ all behaviors are influenced by many variables, only a few of which matter to us
Observed Scores
O = T + e
O = Observed score
T = True score
e = error
Reliability – the basics
1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).
Reliability – the basics
• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will be the person’s true score, as the simulation below illustrates.
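A minimal sketch in Python/NumPy (the true score, error spread, and 10,000-repetition setup are hypothetical, not from the chapter): the true score T stays fixed, each observed score O = T + e is nudged up or down by random error, and the mean of many observed scores lands close to T.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 75.0                        # T: the fixed, unobservable true score
errors = rng.normal(0, 5, size=10_000)   # e: random error, equally likely + or -
observed = true_score + errors           # O = T + e on each imaginary testing occasion

print(observed[:5].round(1))     # individual observed scores bounce around T
print(observed.mean().round(2))  # close to 75: positive and negative errors cancel
```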
Reliability – the basics
• Example: we want to measure Sarah’s ability to spell English words.
• We can’t ask her to spell every word in the OED, so…
• Ask Sarah to spell a subset of English words.
• Her % correct estimates her true English spelling skill.
• But which words should be in our subset?
Estimating Sarah’s spelling ability…
• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics?
Estimating Sarah’s spelling ability…
• Sarah’s observed score varies as the difficulty of the random sets of words varies.
• But presumably her true score (her actual spelling ability) remains constant.
Reliability – the basics
• Other things can produce error in our measurement.
• E.g., on the first day that we test Sarah she’s tired, but on the second day she’s rested…
• This would lead to different scores on the two days.
Estimating Sarah’s spelling ability…
• Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?
Reliability – the basics
• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.
How do we deal with sources of error?
• Error due to test items → Domain sampling error
• Error due to testing occasions → Time sampling error
• Error due to testing multiple traits → Internal consistency error
Domain Sampling error
• A knowledge base or skill set containing many items is to be tested, e.g., the chemical properties of foods.
• We can’t test the entire set of items, so we select a sample of items.
• That produces domain sampling error, as in Sarah’s spelling test.
Domain Sampling error
• There is a “domain” of knowledge to be tested.
• A person’s score may vary depending upon what is included in or excluded from the test.
Domain Sampling error
• Smaller sets of items may not test the entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.
Domain Sampling error
• Parallel Forms Reliability: choose 2 different sets of test items.
• These 2 sets give you “parallel forms” of the test.
• Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error (see the sketch below).
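A minimal sketch of the parallel-forms check, assuming hypothetical scores for 8 people (none of these numbers come from the chapter):

```python
import numpy as np

# Hypothetical scores for 8 people on two parallel forms of the same test.
form_a = np.array([12, 15, 9, 20, 17, 11, 14, 18])
form_b = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# Parallel-forms reliability = correlation between the two forms across people.
r_parallel = np.corrcoef(form_a, form_b)[0, 1]
print(round(r_parallel, 2))  # a high value suggests little domain sampling error
```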
Time Sampling error
• Test-retest Reliability:
◦ a person taking a test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Give the same test repeatedly & check the correlations among scores.
• High correlations indicate stability – less influence of bad or good days.
Time Sampling error
• The test-retest approach is only useful for traits – characteristics that don’t change over time.
• Not all low test-retest correlations imply a weak test.
• Sometimes the characteristic being measured varies with time (as in learning).
Time Sampling error
• The interval over which the correlation is measured matters.
• E.g., for young children, use a very short period (< 1 month, in general).
• In general, the interval should not be > 6 months.
Time sampling error
• Test-retest approach advantage: easy to evaluate, using correlation.
• Disadvantage: carryover & practice effects.
• Carryover: the first testing session influences scores on the next session.
• Practice: when the carryover effect involves learning.
Internal Consistency error
• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts?
◦ No – because the two ‘skills’ are unrelated.
Internal Consistency Approach
• A low correlation between scores on the 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves.
◦ But how should we divide the test in two to check that correlation?
Internal Consistency error
• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.
Split-half Reliability
• After testing, divide the test items into halves A & B that are scored separately.
• Check the correlation of the results for A with the results for B.
• There are various ways of dividing the test in two – randomly, first half vs. second half, odd vs. even…
Split-half Reliability – a problem
• Each half-test is smaller than the whole.
• Smaller tests have lower reliability (domain sampling error).
• So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test.
Split-half reliability – a problem
• We correct the reliability estimate using the Spearman-Brown formula (see the sketch below):

  r_e = 2r_c / (1 + r_c)

  where r_e = estimated reliability for the whole test, and r_c = computed reliability (the correlation between scores on the two halves A and B).
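A minimal sketch of an odd-even split with the Spearman-Brown correction, using simulated 0/1 item responses (the data set and its size are hypothetical):

```python
import numpy as np

# Hypothetical 0/1 item responses: 30 examinees x 20 items that share a
# common ability factor, so the two halves should correlate.
rng = np.random.default_rng(1)
ability = rng.normal(size=(30, 1))
items = ((ability + rng.normal(size=(30, 20))) > 0).astype(int)

# Odd-even split: score each half separately for every examinee.
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

# Computed reliability r_c: correlation between the two half-test scores.
r_c = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction: estimated reliability r_e of the full-length test.
r_e = 2 * r_c / (1 + r_c)
print(round(r_c, 2), round(r_e, 2))
```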
Kuder-Richardson 20
• Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of the test into 2 halves.
• KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
Kuder-Richardson 20
• The formula contains two basic terms:
1. a measure of all the variance in the whole set of test results.
Kuder-Richardson 20
• The formula contains two basic terms:
2. “item variance” – when items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance = less “item variance”. (A computational sketch follows below.)
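The slides describe KR-20’s two terms verbally; the standard formula (not shown on the slides) is KR-20 = (k/(k−1))·(1 − Σpq/σ²_total), where k is the number of items, p and q are the proportions passing and failing each item, and σ²_total is the variance of total scores. A sketch under that assumption:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for a matrix of 0/1 item scores (rows = examinees, cols = items)."""
    k = items.shape[1]                          # number of items
    p = items.mean(axis=0)                      # proportion getting each item right
    q = 1 - p
    sum_item_var = np.sum(p * q)                # total "item variance" (p*q per item)
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the whole-test scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Example with hypothetical 0/1 responses (30 examinees, 20 items).
rng = np.random.default_rng(1)
items = ((rng.normal(size=(30, 1)) + rng.normal(size=(30, 20))) > 0).astype(int)
print(round(kr20(items), 2))
```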
Internal Consistency – Cronbach’s α
• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally useful measure of internal consistency than KR-20 (see the sketch below).
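A sketch of coefficient α using its standard formula, α = (k/(k−1))·(1 − Σσ²_item/σ²_total), which is not given explicitly on the slides; the Likert-style data are invented for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for item scores on any scale (rows = examinees, cols = items)."""
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Example with hypothetical 1-5 Likert-style responses (50 people, 10 items).
rng = np.random.default_rng(3)
latent = rng.normal(size=(50, 1))
items = np.clip(np.rint(3 + latent + rng.normal(scale=0.8, size=(50, 10))), 1, 5)
print(round(cronbach_alpha(items), 2))
```

For 0/1 items this reduces to KR-20, which is why α is described as the more general measure.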
Review: How do we deal with sources of error?

Approach         Measures                             Issues
Test-Retest      Stability of scores                  Carryover
Parallel Forms   Equivalence & Stability              Effort
Split-half       Equivalence & Internal consistency   Shortened test
KR-20 & α        Equivalence & Internal consistency   Difficult to calculate
Reliability in Observational Studies
• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error due to:
◦ observer failures
◦ inter-observer differences
Reliability in Observational Studies
• Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences by using:
◦ Inter-rater reliability
◦ Kappa statistic
Reliability in Observational Studies
• Inter-rater reliability: % agreement between 2 or more observers.
◦ Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
◦ This means that % agreement may over-estimate inter-rater reliability (see the sketch below).
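A minimal sketch with invented codings from two observers, showing why raw % agreement flatters the raters:

```python
import numpy as np

# Hypothetical yes(1)/no(0) codings of the same 20 behaviours by two observers.
rater_1 = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1])
rater_2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0])

percent_agreement = np.mean(rater_1 == rater_2)
print(percent_agreement)  # 0.75 for these codings

# But two raters guessing at random on a 2-choice code would still agree
# about half the time, so raw % agreement overstates agreement beyond chance.
```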
Reliability in Observational Studies
• Kappa statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance (see the sketch below).
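A sketch using the standard kappa formula, κ = (p_o − p_e)/(1 − p_e), applied to the same invented codings as above:

```python
import numpy as np

def cohens_kappa(r1: np.ndarray, r2: np.ndarray) -> float:
    """Cohen's (1960) kappa: inter-rater agreement corrected for chance."""
    categories = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)                          # observed agreement
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c)    # chance agreement expected
              for c in categories)                   # from each rater's marginals
    return (p_o - p_e) / (1 - p_e)

# Same hypothetical codings as above: % agreement was 0.75, but kappa is only ~0.49.
rater_1 = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1])
rater_2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0])
print(round(cohens_kappa(rater_1, rater_2), 2))
```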
Using Reliability Information
• Standard error of measurement (SEM): estimates the extent to which a test score misrepresents a true score.
• SEM = S√(1 – r), where S is the standard deviation of the test scores and r is the reliability coefficient (a worked sketch follows below).
Standard Error of Measurement
• We use the SEM to compute a confidence interval for a particular test score.
• The interval is centered on the observed test score.
• We have confidence that the true score falls in this interval.
• E.g., 95% of the time the true score will fall within 1.96 SEM on either side of the observed score (see the sketch below).
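Continuing the hypothetical example (observed score 110, S = 10, r = 0.91):

```python
import math

# Hypothetical: observed score 110, S = 10, reliability r = 0.91, so SEM = 3.0.
observed = 110.0
sem = 10.0 * math.sqrt(1 - 0.91)

lower = observed - 1.96 * sem             # 95% interval: observed ± 1.96 * SEM
upper = observed + 1.96 * sem
print(round(lower, 1), round(upper, 1))   # roughly 104.1 to 115.9
```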
Standard Error of Measurement
• A simple way to think of the SEM: suppose we gave one student the same test over and over.
• Suppose, too, that no learning took place between tests and the student did not memorize questions.
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.
What to do about low reliability
• Increase the number of items.
• To find how many items you need, use the Spearman-Brown formula (a sketch of its “prophecy” form follows below).
• Using more items may introduce new sources of error, such as fatigue and boredom.
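The slides don’t spell out this version of the formula; the standard Spearman-Brown “prophecy” form gives the factor n by which the test must be lengthened to reach a target reliability. A sketch with hypothetical numbers:

```python
def lengthening_factor(r_current: float, r_desired: float) -> float:
    """Spearman-Brown 'prophecy' form: how many times longer the test must be
    to raise its reliability from r_current to r_desired."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# Hypothetical: a 20-item test with reliability 0.70, and we want 0.90.
n = lengthening_factor(0.70, 0.90)
print(round(n, 2))            # ~3.86, i.e. about 20 * 3.86 ~ 77 items in total
```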
What to do about low reliability
• Discriminability analysis: find the correlations between each item and the whole test.
• Delete items with low correlations (see the sketch below).
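A minimal sketch of an item-total correlation screen on simulated 0/1 responses (the data, the 0.2 cutoff, and all names are hypothetical illustrations, not part of the chapter):

```python
import numpy as np

# Hypothetical 0/1 responses: 100 examinees x 25 items driven by one ability factor.
rng = np.random.default_rng(2)
ability = rng.normal(size=(100, 1))
items = ((ability + rng.normal(size=(100, 25))) > 0).astype(int)

total = items.sum(axis=1)

# Correlate each item with the whole-test score; low values flag weak items.
item_total_r = np.array([np.corrcoef(items[:, j], total)[0, 1]
                         for j in range(items.shape[1])])
weak_items = np.where(item_total_r < 0.2)[0]   # 0.2 is an arbitrary example cutoff
print(item_total_r.round(2))
print(weak_items)
```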