CHAPTER 7
Procedures for Estimating Reliability
Types of Reliability (overview)

Test-Retest (2 administrations)
- What it is: a measure of stability
- How you do it: administer the same test/measure at two different times to the same group of participants
- Coefficient: r(test1, test2). Ex. an IQ test

Parallel/Alternate (Interitem/Equivalent) Forms (2 administrations)
- What it is: a measure of equivalence
- How you do it: administer two different forms of the same test to the same group of participants
- Coefficient: r(testA, testB). Ex. a stats test

Test-Retest with Alternate Forms (2 administrations)
- What it is: a measure of stability and equivalence
- How you do it: on Monday, administer form A to the 1st half of the group and form B to the 2nd half; on Friday, administer form B to the 1st half and form A to the 2nd half
- Coefficient: r(testA, testB)

Inter-Rater (1 administration)
- What it is: a measure of agreement
- How you do it: have two raters rate behaviors and then determine the amount of agreement between them
- Coefficient: percentage of agreement

Internal Consistency (1 administration)
- What it is: a measure of how consistently each item measures the same underlying construct
- How you do it: correlate performance on each item with overall performance across participants
- Coefficient: Cronbach's Alpha Method, Kuder-Richardson Method, Split-Half Method, or Hoyt's Method
Procedures for Estimating/Calculating Reliability
- Procedures requiring 2 test administrations
- Procedures requiring 1 test administration
Procedures for Estimating Reliability
Procedures Requiring Two (2) Test Administrations
1. Test-Retest Reliability Method: measures stability.
2. Parallel (Alternate/Equivalent) Forms Reliability Method: measures equivalence.
3. Test-Retest with Alternate Forms Reliability Method: measures stability and equivalence.
Procedures Requiring 2 Test Administrations
1. Test-Retest Reliability Method
Administer the same test to the same group of participants; then the two sets of scores are correlated with each other.
The correlation coefficient (r) between the two sets of scores is called the coefficient of stability.
The problem with this method is time sampling: factors related to the passage of time become sources of measurement error, e.g., changes in exam conditions such as noise, the weather, illness, fatigue, worry, mood changes, etc.
How to Measure Test-Retest Reliability

Class IQ scores:

Student   X (first time)   Y (second time)
John      125              120
Jo        110              112
Mary      130              128
Kathy     122              120
David     115              120

r(first time, second time) gives the coefficient of stability.
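To make the computation concrete, here is a minimal Python sketch (not part of the original slides) that computes the coefficient of stability for the class IQ data above; statistics.correlation requires Python 3.10+:

```python
# Test-retest reliability: correlate two administrations of the same test.
from statistics import correlation  # Python 3.10+

first_time  = [125, 110, 130, 122, 115]  # X: first administration
second_time = [120, 112, 128, 120, 120]  # Y: second administration

r_stability = correlation(first_time, second_time)
print(f"coefficient of stability r = {r_stability:.2f}")  # about 0.89 here
```

The identical computation yields the coefficient of equivalence when the two lists hold Form A and Form B scores instead.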
Procedures Requiring 2 Test Administrations
2. Parallel (Alternate) Forms Reliability Method
Different forms of the same test are given to the same group of participants; then the two sets of scores are correlated. The correlation coefficient (r) between the two sets of scores is called the coefficient of equivalence.
How to Measure Parallel Forms Reliability

Class test scores:

Student   X (Form A)   Y (Form B)
John      95            90
Jo        80            85
Mary      78            82
Kathy     82            88
David     75            72

r(form A, form B) gives the coefficient of equivalence.
Procedures Requiring 2 Test Administrations
3. Test-Retest with Alternate Forms Reliability Method
It is a combination of the test-retest and alternate-forms reliability methods.
On Monday, you administer form A to the 1st half of the group and form B to the 2nd half.
On Friday, you administer form B to the 1st half of the group and form A to the 2nd half.
The correlation coefficient (r) between the two sets of scores is called the coefficient of stability and equivalence.
Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
Examines the unidimensional nature of a set of items in a test. It tells us how unified the items are in a test or an assessment.
Ex. If we administer a 100-item personality test, we want the items to relate to one another and to reflect the same construct (personality). We want them to have item homogeneity.
ICR deals with how unified the items are in a test or an assessment; this is called "item homogeneity."

Procedures for Estimating Reliability
Procedures Requiring One (1) Test Administration
A. Internal Consistency Reliability
B. Inter-Rater Reliability
A. Internal Consistency Reliability (ICR)
4 different ways to measure ICR:
1. Guttman Split-Half Reliability Method (used with the Spearman-Brown Prophecy Formula)
2. Cronbach's Alpha Method
3. Kuder-Richardson Method
4. Hoyt's Method
These are different statistical procedures for calculating the reliability of a test.
Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
1. Guttman Split-Half Reliability Method (most popular): usually used for dichotomously scored exams. First, administer a test; then divide the test items into 2 subtests (there are four popular methods for doing this); then find the correlation between the 2 subtests and place it in the formula.
1. Split-Half Reliability Method
The Spearman-Brown formula projects the reliability of the full-length test from the correlation between its two halves:
r_total = 2 × r_half / (1 + r_half)
1. Split-Half Reliability Method
The 4 popular methods are:
1. Assign all odd-numbered items to form 1 and all even-numbered items to form 2.
2. Rank-order the items by difficulty level (p-value) based on the responses of the examinees; then assign items with odd-numbered ranks to form 1 and those with even-numbered ranks to form 2.
1. Split-Half Reliability Method
The 4 popular methods (continued):
3. Randomly assign items to the two half-test forms.
4. Assign items to the half-test forms so that the forms are "matched" in content, e.g., if there are 6 items on reliability, each half will get 3.
1. Split-Half Reliability Method
A high split-half reliability coefficient (e.g., > 0.90) indicates a homogeneous test.
1. Split-Half Reliability Method
Exercise: use the split-half reliability method to calculate the reliability estimate of a test whose two halves correlate at 0.25.
1. Split-Half Reliability Method
Solution: r_total = 2 × 0.25 / (1 + 0.25) = 0.50 / 1.25 = 0.40
1. Split-Half Reliability Method
In the correlation computation, half-test A serves as X and half-test B serves as Y.
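As a quick check of the arithmetic, here is a small Python sketch of the Spearman-Brown computation. The `factor` parameter generalizes the slide's doubling case to any change in test length; that generalization is the standard prophecy formula, not something shown on the slide itself:

```python
def spearman_brown(r: float, factor: float = 2.0) -> float:
    """Spearman-Brown prophecy: reliability of a test `factor` times as long."""
    return factor * r / (1 + (factor - 1) * r)

# Split-half case: r between the two halves is 0.25; factor 2 projects
# the reliability of the full-length test.
print(spearman_brown(0.25))  # 2 * 0.25 / (1 + 0.25) = 0.40
```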
Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
2. Cronbach's Alpha Method: used for a wide range of scoring, both non-dichotomously and dichotomously scored exams. Cronbach's alpha (α), named for Lee Cronbach, is a preferred statistic.
Procedures Requiring 1 Test Administration
Cronbach's α for composite tests:
α = (K / (K − 1)) × (1 − Σσ²i / σ²X)
where K is the number of tests/subtests (items), σ²i is the variance of component i, and σ²X is the total score variance.
A. Internal Consistency Reliability (ICR)
2. Cronbach's Alpha Method (coefficient α is a preferred statistic)
Ex. Suppose that the examinees are tested on 4 essay items and the maximum score for each is 10 points. The variances for the items are as follows: σ²1 = 9, σ²2 = 4.8, σ²3 = 10.2, and σ²4 = 16. If the total score variance is σ²X = 100, use the Cronbach's Alpha Method to calculate the internal consistency of this test. A high α coefficient (e.g., > 0.90) indicates a homogeneous test.

Solution: Σσ²i = 9 + 4.8 + 10.2 + 16 = 40
α = (4/3) × (1 − 40/100) = (4/3) × 0.60 = 0.80
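The same worked example as a short Python sketch:

```python
# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / total variance)
item_vars = [9.0, 4.8, 10.2, 16.0]  # variances of the 4 essay items
total_var = 100.0                   # total score variance, sigma^2 X
k = len(item_vars)

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha = {alpha:.2f}")  # (4/3) * (1 - 0.40) = 0.80
```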
Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
3. Kuder-Richardson Method
The Kuder-Richardson Formula 20 (KR-20) was first published in 1937. It is a measure of internal consistency reliability for measures with dichotomous choices. It is analogous to Cronbach's α, except that Cronbach's α is also used for non-dichotomous tests. Here pq = σ²i, the item variance. A high KR-20 coefficient (e.g., > 0.90) indicates a homogeneous test.
Procedures Requiring 1 Test Administration
KR-20 = (k / (k − 1)) × (1 − Σpq / σ²X)
where k is the number of items, p is the proportion of examinees answering an item correctly, q = 1 − p, and σ²X is the total score variance.
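A sketch of KR-20 computed from raw 0/1 responses. The response matrix below is made up for illustration (the book's Table 7.1 data is not reproduced here), and the population variance is used for the total scores:

```python
# KR-20 for dichotomously scored items (1 = correct, 0 = incorrect).
from statistics import pvariance

responses = [      # rows = examinees, columns = items (hypothetical data)
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
n, k = len(responses), len(responses[0])
totals = [sum(row) for row in responses]

# p * q is each item's variance (pq = sigma^2 i on the slide).
sum_pq = 0.0
for j in range(k):
    p = sum(row[j] for row in responses) / n
    sum_pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - sum_pq / pvariance(totals))
print(f"KR-20 = {kr20:.2f}")
```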
3. Kuder-Richardson Method (KR-20 and KR-21): see Table 7.1 or the data on p. 136.
From Table 7.1: variance = square of the standard deviation = 4.08.
Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
3. Kuder-Richardson Method (KR-21): used only with dichotomously scored items. It does not require computing each item's variance; you compute it once for all items using the test variance (σ²X = total test score variance). See Table 7.1 for the standard deviation and variance for all items. It assumes all items are equal in difficulty.
Procedures Requiring 1 Test Administration
KR-21 = (k / (k − 1)) × (1 − M(k − M) / (k × σ²X))
where M is the mean total score.
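KR-21 needs only the item count, the mean total score, and the total score variance. In this sketch the variance 4.08 is the Table 7.1 value quoted above, while the item count and mean are made-up stand-ins:

```python
def kr21(k: int, mean_score: float, total_var: float) -> float:
    """KR-21: assumes all items are equally difficult."""
    return (k / (k - 1)) * (1 - mean_score * (k - mean_score) / (k * total_var))

print(f"KR-21 = {kr21(10, 6.0, 4.08):.2f}")  # k = 10 and mean = 6 are hypothetical
```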
Procedures Requiring 1 Test Administration
A. Internal Consistency Reliability (ICR)
4. Hoyt's (1941) Method
Hoyt used ANOVA to obtain the variances (mean squares, MS) needed to calculate Hoyt's coefficient.
MS = σ² = S² = variance
Procedures Requiring 1 Test Administration
Hoyt's coefficient:
r = (MS persons − MS residual) / MS persons
4. Hoyt's (1941) Method
MS persons corresponds to MS within; MS items corresponds to MS between.
MS residual has its own calculation; it is not equal to MS total.
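Hoyt's coefficient falls out of a persons-by-items ANOVA. This sketch decomposes the sums of squares by hand for a small hypothetical score matrix:

```python
# Hoyt's (1941) reliability: r = (MS_persons - MS_residual) / MS_persons
scores = [          # rows = persons, columns = items (hypothetical data)
    [4, 3, 5],
    [3, 4, 4],
    [5, 5, 5],
    [2, 3, 2],
]
n, k = len(scores), len(scores[0])
grand_mean = sum(map(sum, scores)) / (n * k)

ss_persons = k * sum((sum(row) / k - grand_mean) ** 2 for row in scores)
ss_items = n * sum(
    (sum(row[j] for row in scores) / n - grand_mean) ** 2 for j in range(k)
)
ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
ss_residual = ss_total - ss_persons - ss_items

ms_persons = ss_persons / (n - 1)                # persons mean square
ms_residual = ss_residual / ((n - 1) * (k - 1))  # residual mean square
print(f"Hoyt r = {(ms_persons - ms_residual) / ms_persons:.2f}")
```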
Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability
It is a measure of consistency from rater to rater.
It is a measure of agreement between the raters.
Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability

Item   Rater 1   Rater 2
1      4         3
2      3         5
3      5         5
4      4         2
5      1         2

First compute r(rater 1, rater 2); then multiply by 100 to express it as a percentage.
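The slide's procedure as a Python sketch, using the ratings above:

```python
# Inter-rater reliability: correlate the two raters' scores, then multiply
# by 100 to express the result as a percentage.
from statistics import correlation  # Python 3.10+

rater1 = [4, 3, 5, 4, 1]
rater2 = [3, 5, 5, 2, 2]

r = correlation(rater1, rater2)
print(f"inter-rater reliability = {r * 100:.0f}%")  # about 46% for these data
```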
Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability
More than 2 raters (e.g., raters 1, 2, and 3):
Calculate r for 1 & 2 = .6
Calculate r for 1 & 3 = .7
Calculate r for 2 & 3 = .8
Mean = .7; × 100 = 70%
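A one-line check of the averaging step:

```python
from statistics import mean

pairwise_r = [0.6, 0.7, 0.8]  # r(1,2), r(1,3), r(2,3) from the slide
print(f"{mean(pairwise_r) * 100:.0f}%")  # 70%
```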
Factors That Affect Reliability Coefficients
1. Group homogeneity
2. Test length
3. Time limit
Factors That Affect Reliability Coefficients
1. Group homogeneity
If a sample of examinees is highly homogeneous on the construct being measured, the reliability estimate will be lower than if the sample were more heterogeneous.
2. Test length
Longer tests are more reliable than shorter tests. The effect of changing test length can be estimated with the Spearman-Brown Prophecy Formula, as in the sketch below.
3. Time limit
A rigid time limit, where some examinees finish but others don't, will artificially inflate the test reliability coefficient.
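The test-length effect mentioned under point 2 can be checked with the same prophecy formula; here is a self-contained example projecting what tripling a test would do (the starting reliability of 0.60 is a made-up value):

```python
# Spearman-Brown length projection: tripling a test whose reliability is 0.60.
factor, r = 3.0, 0.60
print(factor * r / (1 + (factor - 1) * r))  # 1.8 / 2.2 ≈ 0.82
```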
Reporting Reliability Data
According to the Standards for Educational and Psychological Testing:
1. Results of different reliability studies should be reported to take into account the different sources of measurement error that are most relevant to score use.
2. The standard error of measurement and score bands for different confidence intervals should accompany each reliability estimate.
3. Reliability and standard error estimates should be reported for subtest scores as well as the total test score.
Reporting Reliability Data
4. The procedures and samples used in reliability studies should be described in enough detail to permit users to judge the similarity between the conditions of the reliability study and their local situations.
5. When a test is normally used for a particular population of examinees (e.g., those within a grade level or those who have a particular handicap), reliability estimates and standard errors of measurement should be reported separately for each such specialized population.
Reporting Reliability Data
6. When test scores are used primarily for describing or comparing group performance, reliability and standard errors of measurement for aggregated observations should be reported.
7. If standard errors of measurement are estimated by using a model such as the binomial model, this should be clearly indicated, because users will probably assume that the classical standard error of measurement is being reported. (A binomial model is characterized by trials that end either in success, heads, or failure, tails.)
CHAPTER 8
Introduction to Generalizability Theory
Cronbach (1963)
CHAPTER 8
Introduction to Generalizability Theory
Cronbach (1963)
Generalizability is another way to calculate the reliability of a test, using ANOVA.
Generalizability refers to the degree to which a particular set of measurements of an examinee generalizes to a more extensive set of measurements of that examinee (just like conducting inferential research).
Introduction to Generalizability
Generalizability Coefficient
FYI: in Classical True Score Theory, reliability was defined as the ratio of true score variance to observed score variance:
Reliability = σ²T / (σ²T + σ²E)
Also, an examinee's true score is defined as the average (mean) of a large number of strictly parallel measurements, and the true score variance σ²T is defined as the variance of these averages.
Reliability coefficient: ρX1X2 = σ²T / σ²X
Introduction to Generalizability
Generalizability Coefficient
In generalizability theory, an examinee's universe score is defined as the average (mean) of the measurements in the universe of generalization. (The universe score plays the role that the true score plays in classical test theory.)
Introduction to Generalizability
Generalizability Coefficient
The generalizability coefficient ρ is defined as the ratio of universe score variance (σ²μ) to expected observed score variance (Eσ²X):
ρ = σ²μ / Eσ²X
Ex. If the expected observed score variance Eσ²X = 10 and the universe score variance σ²μ = 5, then the generalizability coefficient is 5/10 = 0.5.
Introduction to Generalizability
Key Terms
Universe:
A universe is a set of measurement conditions that is more extensive than the conditions under which the sample measurements were obtained.
Ex: If you took the Test Construction exam here at CAU, then the universe (of generalization) is taking test construction exams at several other universities:

University   Score
CAU          85
FIU          90
FAU          84
NSU          80
UM           88

μ = 85.40 is called the universe score.
Introduction to Generalizability
Key Terms
Universe Score:
It is the counterpart of the true score in Classical Test Theory: the average (mean) of the measurements in the universe of generalization.
Ex: The mean of your scores on the test construction exams you took at the other universities is your universe score (see the previous slide).
Introduction to Generalizability
Key Terms
Facets:
A facet is a part or aspect of something; here, a set of measurement conditions.
Ex. next slide.
Introduction to Generalizability
Facets: Example
If two supervisors want to rate the performance of factory workers under three workloads (heavy, medium, and light), how many sets of measurement conditions (facets) will we have? See the next slide.
Introduction to Generalizability
Facets:
The two sets of measurement conditions, or the two facets, are:
1. the supervisors (one and two);
2. the workloads (heavy, medium, and light).
Ex. 2: next slide.
Introduction to Generalizability
Facets:
A researcher measures students' compositional writing on four occasions. On each occasion, each student writes compositions on two different topics. All compositions are graded by three different raters. How many facets does this design involve? See the next slide.

Introduction to Generalizability
Facets:
This design involves three facets: the occasions (four), the topics (two), and the raters (three).
Introduction to Generalizability
Key Terms
Universe of Generalization:
The universe of generalization consists of all of the measurement conditions for the second, more extensive set of measurements (the "universe"), such as fatigue, room temperature, test specifications, etc.
Ex. All of the conditions under which you took your test construction exams at the other universities.
Introduction to Generalizability
Generalizability theory distinguishes between Generalizability Studies (G-Studies) and Decision Studies (D-Studies).
G-Studies:
G-Studies are concerned with the extent to which a sample of measurements generalizes to a universe of measurements. A G-Study examines the generalizability (quality) of the measurement procedure itself.
Generalizability Studies (G-Studies) and Decision Studies (D-Studies)
D-Studies:
D-Studies provide data for making decisions about examinees. They concern the adequacy of the measurement for a decision.
Ex. next slide.
Generalizability Studies (G-Studies) and Decision Studies (D-Studies)
Ex. Suppose we use an achievement test to test 2000 children from public schools and 2000 children from private schools. If we want to know whether this test is equally reliable for both types of schools, then we are dealing with a G-Study (quality of measurement).
Ex. We can generalize a test to these two different school populations, e.g., CAU and FIU doctoral students taking the EPPP exam.
Generalizability Studies (G-Studies) and Decision Studies (D-Studies)
However, if we want to compare the means of these different types of schools (using the data) and draw a conclusion about differences in the adequacy of the two educational systems, then we are dealing with a D-Study. Ex. Compare the means of the CAU and FIU doctoral students who took the EPPP exam.
Introduction to Generalizability
Generalizability Designs:
There are 4 different generalizability designs in generalizability theory. In the diagrams below, "_" stands for an examinee and "+" stands for a rater (examiner).
Generalizability Designs:

1. _ _ _ _ _ _ _ _ _ _
   +
   One rater rates each one of the examinees.

2. _ _ _ _ _ _ _ _ _ _
   + + +
   A group of raters rates each one of the examinees.

3. _ _ _ _ _ _ _ _ _ _
   + + + + + + + + + +
   One rater rates only one examinee (a different rater for each examinee).

4. _ _ _ _ _ _ _ _ _ _
   +++ +++ +++ +++ +++ +++ +++ +++ +++ +++
   Each examinee is rated by a different group of raters (most expensive).