
Week 3 self-study

Reliability Terms
• Reliability: A test yields the same scores one day and the next if there has been no
instruction intervening. The test yields dependable scores in the sense that they will not
fluctuate very much, so that we may know that the score obtained by a student is quite
close to the score he or she would obtain if we gave the test again.
With no changes in language ability, scores should be reliable (consistent) no matter
 when students take the test (e.g. today or next week),
 which version (form) of the test students take,
 which part of the test students answer (when the questions in each part of the test are designed to measure the same ability),
 who rates the test,
 how the test is rated
Definition: vary (to change or be different), variation (a way to show how data is
dispersed or spread out), variance (tells you how far a data set is spread out, but it is an
abstract number)
Factors that can affect test performance = sources of variance in language test scores
1. Communicative language ability (reliable variance):
 Organizational knowledge
o Grammatical: vocabulary, syntax, phonology, graphology
o Textual: cohesion, rhetorical/ conversational organization
 Pragmatic
o Functional, e.g. ideational, manipulative functions
o Sociolinguistic, e.g. registers, figures of speech
 We expect that differences in test performance will be related to differences
in test takers’ levels of ability
2. Personal characteristics of test takers (source of test bias): relatively stable
attributes of test takers such as age, gender, language/ cultural background, background
knowledge, cognitive abilities and affective schemata
 Not related to language ability
 Individual characteristics, e.g. background knowledge
 Group characteristics, e.g. ethnic group, gender
Example: an English reading test for a group of international students (including
Thais) with the topic "How Thai people celebrate Songkran today". This is a
test of English reading skills, not of cultural knowledge; the Thai students' background
knowledge gives them an advantage unrelated to reading ability, so the test is biased
and its scores unreliable.
3. Characteristics of test method (source of score variance; source of
measurement error or error variance = variance in scores that is not directly
related to the purpose of the test)
 Format, e.g. reading test: multiple choice, T/F. Some students may
perform better in the multiple-choice format than in T/F, or better than other
students; this does not mean their language ability is better
 Test instructions: clear or unclear instructions may confuse students
 Raters, e.g. for a speaking test: trained or untrained raters, lenient or
strict raters
 The effects of test method facets on performance may be the same for all
test takers or may vary from one test taker to another
4. Random factors (source of score variance; source of measurement error or
error variance = variance in scores that is not directly related to the purpose of the test)
 Unexpected irregularities in test administration (e.g. a power failure)
 Unpredictable and temporary conditions of test takers (e.g. a headache)
How to estimate reliability:
Step 1: Logical analysis (identify potential sources of measurement error for a particular
testing situation, e.g. raters, test forms)
 List the factors (the sources of variance in language test scores: test method and
random factors) that are relevant
Step 2: Statistical analyses (design a study to collect two sets of scores, then estimate
the reliability using appropriate statistics)
 Different methods are used for estimating reliability for Norm-referenced Tests (NRT) and
Criterion-referenced Tests (CRT)
o Calculating reliability for NRT (Classical Test Theory)
3 methods, each focusing on a different kind of error that makes scores
unreliable
 Test administrations- source of measurement error (the way the test is
administered, e.g. different proctors, rooms, test forms/versions)
To get a reliability estimate, methods a & b below both require two test administrations
and make use of the Pearson Product-Moment Correlation
a. Test-retest reliability
Source of measurement error: inconsistencies across different
times of administration
 If no learning has taken place between the 2 administrations
(even though the test is held in 2 different places or with
different proctors), the scores should not differ greatly
 If the scores change as a direct result of the different times
of administration, this is a source of unreliability
To estimate the reliability:
 Have the same group of test takers take the same test twice
 The two test administrations should not be too close together
(effect of test familiarity) or too far apart (changes in
students' ability); ~1 month apart is typical
 Calculate the Pearson correlation between the two sets of
scores (time 1 and time 2; Fulcher, p. 48), e.g. using Pearson in SPSS
or the sketch below
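
A minimal sketch of this calculation in Python (assuming the scipy library is
available); the score lists are invented for illustration:

from scipy.stats import pearsonr

# Scores for the same eight students on the same test, ~1 month apart
time1 = [12, 18, 25, 30, 22, 15, 28, 20]  # first administration
time2 = [14, 17, 27, 29, 21, 16, 30, 19]  # second administration

# Pearson correlation between the two sets of scores is the
# test-retest reliability estimate
r, _ = pearsonr(time1, time2)
print(f"test-retest reliability (Pearson r) = {r:.2f}")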
b. Parallel form
Source of measurement error: inconsistencies across different
forms (versions) of the test
To estimate the reliability:
 Construct two forms of the test, equivalent in content
and in the ability measured
 The same group of test takers takes both test forms (A &
B)
 Calculate the Pearson correlation between the two sets
of scores (Test form A & B)
 To minimize the possibility of an ordering effect, use
counterbalanced design with two equivalent groups of
test takers
 Test items- source of measurement error
Internal consistency reliability
Source of measurement error: test items, e.g. a reading test with many
passages and questions in which some items measure something other
than reading comprehension (e.g. calculation)
2 ways to estimate the internal consistency reliability
NOTE: When a test consists of several parts that measure different
skills (e.g. reading, listening), a separate analysis of internal
consistency reliability should be run for each skill.
a. Split-half reliability estimates
To estimate the reliability:
 Administer the whole test (one skill, e.g. grammar
knowledge) to the same group of test takers
 Split the test into 2 equal halves, using either a random
approach (odd- vs. even-numbered items, which measure the
same ability) or a rational approach (split the test based on
the content of the test items)
 Calculate the correlation between the two halves
 Use the Spearman-Brown correction formula to get the
reliability coefficient (p. 51), as in the sketch below
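
A minimal sketch of the split-half procedure with the Spearman-Brown
correction in Python (assuming scipy); the 0/1 item scores are invented:

from scipy.stats import pearsonr

# Rows = test takers, columns = items on a 10-item test (1 = correct)
items = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
]

# Random (odd/even) split: total score on odd-numbered items vs.
# total score on even-numbered items
odd_totals = [sum(row[0::2]) for row in items]
even_totals = [sum(row[1::2]) for row in items]

# Correlate the two halves, then apply the Spearman-Brown correction
# to estimate the reliability of the full-length test
r_half, _ = pearsonr(odd_totals, even_totals)
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")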
b. Cronbach’s alpha (computed from all of the test’s items)
Solves the problem of not being able to split the test into two
halves (e.g. a gap-filling test)
To estimate the reliability
 Administer the test
 Use Cronbach’s alpha to get the reliability coefficient
(Fulcher pp. 51-52), as in the sketch below
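
A minimal sketch of Cronbach's alpha in Python (assuming numpy); the item
scores are invented. The formula used is
alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores):

import numpy as np

# Rows = test takers, columns = items (0/1 here, but the formula also
# works for partial-credit items)
scores = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1],
])

k = scores.shape[1]                          # number of items
item_var = scores.var(axis=0, ddof=1)        # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores

alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")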
 Rating- sources of measurement error: scoring procedures
Rater consistency reliability estimates apply to tests that involve
judgement (e.g. writing or speaking tests), not to multiple-choice tests
Raters: a source of unreliability
a. Intra-rater reliability (one rater)
Inconsistency within a single rater, e.g. due to fatigue or bias
To estimate the reliability
 Administer the test
 Obtain two ratings from the same rater for the same students
(rated the second time in a different, random order)
 Calculate the correlation between the two sets of ratings or
calculate coefficient alpha
b. Inter-rater reliability (more than one rater)
Variation between raters, e.g. some raters are more lenient
To estimate the reliability
 Administer the test
 Two independent ratings from different raters for the same
students
 Calculate Cronbach’s alpha (formula to get the reliability
coefficient, p. 53); see the sketch below
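
A minimal sketch of an inter-rater check in Python (assuming scipy); the
ratings are invented. The simplest version correlates the two raters' scores;
treating each rater as an "item" and computing Cronbach's alpha over the
rater columns (Fulcher p. 53) works the same way as the alpha sketch above:

from scipy.stats import pearsonr

# Independent ratings of the same eight speaking performances
rater_a = [4, 3, 5, 2, 4, 3, 5, 1]
rater_b = [4, 2, 5, 3, 4, 3, 4, 2]

r, _ = pearsonr(rater_a, rater_b)
print(f"inter-rater correlation = {r:.2f}")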
• Dependability: the counterpart of reliability for criterion-referenced tests (see
"Calculating reliability for item-based CRT" below)
• Reliability coefficients or estimates: provide information about how reliable a
particular set of scores is, on average, for a particular group of test takers; they say
nothing about individual test scores
 Range of reliability coefficients: 0 – 1 (0 is randomness: all error; 1 is
perfectly reliable: perfect measurement without any error)
 Acceptable: above 0.7
 Preferable: above 0.8; a reliability coefficient of .8 means that 80% of the
variance in observed scores reflects test takers' true scores, while 20% reflects
measurement error
• Standard error of measurement (Se): calculate Se to estimate the reliability of
individual test scores
To estimate the reliability of individual test scores:
 Use the reliability coefficient to compute the standard error of measurement
(Se)
 Use Se to calculate a confidence interval (a score range, equal to the test
taker's score +/- some value, within which the test taker's "true score"
is likely to fall, i.e. the range of values the result would keep falling
into no matter how many times the test taker took the test)
 Formula: Se = SD × √(1 − R), where SD is the standard deviation of the
test scores and R is the reliability coefficient (Fulcher p. 55)
e.g. test reliability coefficient R = .84, standard deviation SD = 4.97 => Se
= 4.97 × √(1 − .84) = 1.988 ≈ 2; student A’s test score = 30
o 68% confident that her true score lies within one Se of the
obtained score: 30 +/- (2 × 1) = 28 – 32
o 95% confident that her true score lies within two Se of the
obtained score (exact figure: 1.96): 30 +/- (2 × 2) = 26 – 34
o 99% confident that her true score lies within three Se of the
obtained score (exact figure: 2.58): 30 +/- (2 × 3) = 24 – 36
 The smaller the Se, the narrower the confidence interval and the
more consistently the raw scores represent the students’ actual
abilities, e.g. a test with R = .84 may have Se ≈ 2, while one
with R = .5 may have Se ≈ 4
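
A minimal sketch of the Se and confidence-interval calculation in Python,
using the same figures as the example above:

import math

R = 0.84     # reliability coefficient
sd = 4.97    # standard deviation of the test scores
score = 30   # student A's obtained score

se = sd * math.sqrt(1 - R)   # Se = SD * sqrt(1 - R) ≈ 1.988
for z, level in [(1, 68), (1.96, 95), (2.58, 99)]:
    low, high = score - z * se, score + z * se
    print(f"{level}% CI: {low:.1f} - {high:.1f}")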
• Standard deviation (S or SD) (Brown, 2005, p. 103): a measure of the spread or
dispersion of a set of data
 Variance (S²): the square of the standard deviation (S = 4; S² = 16)
Calculating reliability for item-based CRT
- NRT: reliability vs CRT: dependability
- CRT is frequently concerned with classifying students as passing or failing, or as receiving
a particular grade, rather than with ranking them higher or lower relative to one another.
A cut score is set for each decision.
For example, Grades: S/U (S = 70 – 100%; U = below 70%) or A = 80 – 100%, B = 70 – 79%,
C = 60 – 69%, D = 50 – 59%, F = below 50%, as in the sketch below.
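
A minimal sketch of classification by cut scores in Python, using the A–F
bands above (percent scores assumed):

def letter_grade(percent: float) -> str:
    # Each cut score triggers a different classification decision
    if percent >= 80:
        return "A"
    if percent >= 70:
        return "B"
    if percent >= 60:
        return "C"
    if percent >= 50:
        return "D"
    return "F"

print(letter_grade(73))  # -> B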
- Calculating dependability for CRT:
2 types of classification dependability estimates
 Threshold loss agreement indices
 Squared error loss agreement indices
o Dependability classification
o Used for class, school and program level
o Items must be scored right or wrong
o One administration
o Phi lambda (Fulcher p. 84-85) tells us how accurately our test
separates students into various groups: letter grades, course levels,
passing/ failing (a computational sketch follows this list)
 Both estimate how much agreement/ consistency there would be in the
results if students were tested and classified repeatedly
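
A minimal sketch of the phi lambda index in Python, using Brennan's
short-cut formula as an assumed form of the calculation on Fulcher
pp. 84-85; the proportion scores and cut score are invented, and the
variance convention (population vs. sample) should be checked against
the source:

import statistics

n_items = 40    # number of right/wrong items on the test
cut = 0.70      # cut score (lambda), as a proportion
# Each student's total score as a proportion of the maximum
p = [0.85, 0.60, 0.75, 0.90, 0.55, 0.80, 0.65, 0.70]

mean_p = statistics.mean(p)
var_p = statistics.pvariance(p)   # population variance assumed here

# Phi(lambda) = 1 - [1/(n-1)] * [Mp(1-Mp) - Sp^2] / [Sp^2 + (Mp - lambda)^2]
phi_lambda = 1 - (1 / (n_items - 1)) * (
    (mean_p * (1 - mean_p) - var_p) / (var_p + (mean_p - cut) ** 2)
)
print(f"phi(lambda) = {phi_lambda:.2f}")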
How to improve the reliability and dependability (Brown, 1996 & Carr, 2011):
 Before test administration
o Use clear test specifications (e.g. https://takeielts.britishcouncil.org/takeielts/prepare/test-format)
o Write well-designed, carefully written test items
o Use appropriate scoring rubrics and properly trained raters
 During test administration
o Administer tests properly (e.g. standardized testing procedures)
 After test administration
o Estimate reliability or dependability and revise the test by (1) making the
test longer and (2) using item analysis to see if any items are problematic