Reliability Terms

• Reliability: A test yields the same scores one day and the next if no instruction has intervened. The test yields dependable scores in the sense that they will not fluctuate very much, so we can be confident that the score a student obtains is close to the score he or she would obtain if we gave the test again.
  With no change in language ability, scores should be reliable (consistent) no matter:
  o when students take the test (e.g. today or next week)
  o which version (form) of the test students take
  o which part of the test students answer (when the questions in each part are designed to measure the same ability)
  o who rates the test
  o how the test is rated
  Definitions: vary (to change or be different); variation (a way to show how data is dispersed or spread out); variance (tells you how far a data set is spread out, but it is an abstract number)

Factors that can affect test performance = sources of variance in language test scores
1. Communicative language ability (reliable variance)
   Organizational knowledge
   o Grammatical: vocabulary, syntax, phonology, graphology
   o Textual: cohesion, rhetorical/conversational organization
   Pragmatic knowledge
   o Functional, e.g. ideational, manipulative functions
   o Sociolinguistic, e.g. registers, figures of speech
   We expect that differences in test performance will be related to differences in test takers' levels of ability.
2. Personal characteristics of test takers (source of test bias): relatively stable attributes of test takers such as age, gender, language/cultural background, background knowledge, cognitive abilities and affective schemata. These are not related to the language ability being measured.
   o Individual characteristics, e.g. background knowledge
   o Group characteristics, e.g. ethnic group, gender
   Example: an English reading test for a group of international students (including Thais) on the topic "How Thai people celebrate Songkran today". This is a test of English reading skills, not cultural knowledge; the Thai students' background knowledge gives them an advantage, so the scores are unreliable.
3. Characteristics of the test method (source of score variance; source of measurement error or error variance = variance in scores that is not directly related to the purpose of the test)
   o Format, e.g. a reading test in multiple-choice vs. true/false format: some students may perform better in the multiple-choice format than in true/false, or than other students do. This does not mean their language ability is better.
   o Test instructions: unclear instructions may confuse students.
   o Raters, e.g. for a speaking test: trained or untrained raters, lenient or strict raters.
   The effects of test method facets on performance may be the same for all test takers or may vary from one test taker to another.
4. Random factors (source of score variance; source of measurement error or error variance = variance in scores that is not directly related to the purpose of the test)
   o Unexpected irregularities in test administration (e.g. a power failure)
   o Unpredictable and temporary conditions of test takers (e.g. a headache)

How to estimate reliability
Step 1: Logical analysis: identify potential sources of measurement error for a particular testing situation (e.g. raters, test forms); list the sources of variance in language test scores (test method and random factors) that are relevant.
Step 2: Statistical analyses: design a study to collect two sets of scores, then estimate the reliability using appropriate statistics.
Different methods are used for estimating reliability for Norm-referenced Tests (NRT) and Criterion-referenced Tests (CRT).

o Calculating reliability for NRT (Classical Test Theory): 3 methods, each focusing on a different kind of error that makes scores unreliable

Test administrations
Source of measurement error: the way the test is administered (e.g. different proctors, rooms, test forms/versions)
To get a reliability estimate, methods (a) and (b) both require two test administrations and make use of the Pearson product-moment correlation.
a. Test-retest reliability
   Source of measurement error: inconsistencies across different times of administration
   If no learning has taken place between the two administrations (even though the test is held in two different places or with different proctors), the scores should not differ greatly. If the scores change as a direct result of the different times of administration, this is a source of unreliability.
   To estimate the reliability:
   o Have the same group of test takers take the same test twice.
   o The two administrations should not be too close together (effect of test familiarity) or too spread out (changes in students' ability); roughly one month apart.
   o Calculate the Pearson correlation between the two sets of scores (Time 1 and Time 2; Fulcher, p. 48), by hand, with Pearson in SPSS, or with a short script (see the sketch below).
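A minimal sketch of the Pearson product-moment computation, assuming Python with numpy; the score values are invented for illustration only.

```python
import numpy as np

# Hypothetical scores for the same 8 test takers on two administrations
# of the same test about a month apart (Time 1 vs. Time 2).
time1 = np.array([30, 25, 41, 18, 35, 27, 22, 38])
time2 = np.array([32, 24, 40, 20, 33, 29, 21, 37])

# Pearson product-moment correlation = the test-retest reliability estimate.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")
```

The same computation serves for parallel-form reliability: replace the Time 1 / Time 2 scores with the Form A and Form B scores.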
b. Parallel-form reliability
   Source of measurement error: inconsistencies across different forms (versions) of the test
   To estimate the reliability:
   o Construct two forms of the test that are equivalent in content and in the ability measured.
   o Have the same group of test takers take both test forms (A and B).
   o To minimize the possibility of an ordering effect, use a counterbalanced design with two equivalent groups of test takers.
   o Calculate the Pearson correlation between the two sets of scores (Form A and Form B).

Internal consistency reliability
Source of measurement error: test items, e.g. a reading test with many passages and questions in which some items measure something other than reading comprehension (e.g. calculation)
Two ways to estimate internal consistency reliability (see the sketch after this list)
NOTE: When a test consists of several parts that measure different skills (e.g. reading, listening), a separate internal consistency analysis should be run for each skill.
a. Split-half reliability estimates
   To estimate the reliability:
   o Administer the whole test (one skill, e.g. grammar knowledge) to the same group of test takers.
   o Split the test into two equal halves (random approach: odd- vs. even-numbered items, assumed to measure the same ability; rational approach: split based on the content of the test items).
   o Calculate the correlation between the two halves.
   o Use the Spearman-Brown correction formula to get the reliability coefficient (p. 51).
b. Cronbach's alpha (based on all the test's items)
   Solves the problem of not being able to split the test into two halves (e.g. a gap-filling test)
   To estimate the reliability:
   o Administer the test.
   o Use Cronbach's alpha to get the reliability coefficient (Fulcher, pp. 51-52).
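A minimal sketch of both internal-consistency estimates, assuming Python with numpy; the 1/0 item scores are invented for illustration. The split-half function applies the Spearman-Brown correction r_full = 2 * r_half / (1 + r_half); the alpha function uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
import numpy as np

def split_half_reliability(item_scores):
    """Split-half estimate: correlate the odd- and even-numbered item totals,
    then apply the Spearman-Brown correction r_full = 2r / (1 + r)."""
    x = np.asarray(item_scores, dtype=float)   # rows = test takers, columns = items
    odd_total = x[:, 0::2].sum(axis=1)         # items 1, 3, 5, ...
    even_total = x[:, 1::2].sum(axis=1)        # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(item_scores):
    """Cronbach's alpha = (k/(k-1)) * (1 - sum of item variances / variance of totals)."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical right/wrong (1/0) scores: 6 test takers x 6 grammar items.
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
]
print("Split-half (Spearman-Brown corrected):", round(split_half_reliability(scores), 2))
print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))
```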
Rater consistency reliability
Source of measurement error: scoring procedures (rating). Rater consistency estimates apply to tests that involve judgement (e.g. writing and speaking tests), not to multiple-choice tests; raters are a source of unreliability.
a. Intra-rater reliability (one rater)
   Inconsistency within a rater, e.g. fatigue, bias
   To estimate the reliability:
   o Administer the test.
   o Obtain two ratings by the same rater of the same students' performances (rated in a different, random order).
   o Calculate the correlation between the two sets of ratings, or calculate coefficient alpha.
b. Inter-rater reliability (more than one rater)
   Variation between raters, e.g. some raters are more lenient
   To estimate the reliability:
   o Administer the test.
   o Obtain two independent ratings from different raters for the same students.
   o Calculate Cronbach's alpha (formula to get the reliability coefficient, p. 53).

• Reliability coefficients or estimates: provide information about how reliable a particular set of scores is, on average, for a particular group of test takers, not about individual test scores.
  Range of reliability coefficients: 0 - 1 (0 is randomness, i.e. all error; 1 is perfectly reliable, i.e. perfect measurement without any error)
  Acceptable: above 0.7
  Preferable: above 0.8; a reliability coefficient of .8 => 80% of the variance in scores reflects test takers' true scores, while 20% reflects measurement error.
• Standard error of measurement (Se): calculate Se to estimate the reliability of individual test scores.
  To estimate the reliability of individual test scores:
  o Use the reliability coefficient to compute the standard error of measurement: Se = SD x sqrt(1 - reliability).
  o Use Se to calculate a confidence interval: a score range, equal to the test taker's score +/- some value, within which the test taker's "true score" is likely to fall (i.e. the range into which the test taker's score would fall however many times he or she took the test).
  Range of the test taker's score and confidence level (Fulcher, p. 55; a worked sketch follows this list):
  e.g. test reliability coefficient R = .84, standard deviation SD = 4.97 => Se = 1.988 ~ 2; student A's test score = 30
  o 68% confident that her true score lies within one Se of the obtained score: 30 +/- (2 x 1) = 28 - 32
  o 95% confident that her true score lies within two Se of the obtained score (exact figure = 1.96): 30 +/- (2 x 2) = 26 - 34
  o 99% confident that her true score lies within three Se of the obtained score (exact figure = 2.58): 30 +/- (2 x 3) = 24 - 36
  The smaller the Se, the narrower the confidence interval and the more consistently the raw scores represent the students' actual abilities, e.g. reliability R = .84 gives Se ~ 2, while R = .5 gives Se ~ 4.
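A worked sketch of the Se and confidence-interval arithmetic above, assuming Python; the figures are the ones from the notes, and the printed intervals use the unrounded Se, so they differ slightly from the rounded 28-32 / 26-34 / 24-36 ranges.

```python
import math

# Figures from the notes (Fulcher, p. 55): reliability r = .84,
# standard deviation SD = 4.97, student A's observed score = 30.
r = 0.84
sd = 4.97
score = 30

se = sd * math.sqrt(1 - r)        # standard error of measurement
print(f"Se = {se:.3f} (~ 2)")     # 1.988

# Confidence interval: observed score +/- z * Se
for label, z in [("68%", 1), ("95%", 1.96), ("99%", 2.58)]:
    low, high = score - z * se, score + z * se
    print(f"{label} confidence interval: {low:.1f} - {high:.1f}")
```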
• Standard deviation (S or SD) (Brown, 2005, p. 103): a measure of the spread or dispersion of a set of data
  Variance (S²) is the square of the standard deviation (if S = 4, then S² = 16).

Calculating reliability for item-based CRT
- NRT: reliability vs. CRT: dependability
- A CRT is frequently concerned with classifying students as passing or failing, or as receiving a particular grade, rather than with ranking them higher or lower. A cut score is set for each decision. For example:
  o Grades S/U: S = 70 - 100%; U = below 70%
  o Or A = 80 - 100%, B = 70 - 79%, C = 60 - 69%, D = 50 - 59%, F = below 50%
- Calculating dependability for CRT: 2 types of classification dependability estimates
  o Threshold loss agreement indices
  o Squared error loss agreement indices
- Dependability classification with phi(lambda) (Fulcher, pp. 84-85):
  o Used at the class, school and program level
  o Items must be scored right or wrong
  o Requires only one administration
  o Tells us how accurately our test separates students into various groups: letter grades, course levels, passing/failing
- Both types estimate how much agreement/consistency there would be in the results if students were tested and classified repeatedly.

How to improve reliability and dependability (Brown, 1996 & Carr, 2011)
Before test administration
o Use clear test specifications (e.g. https://takeielts.britishcouncil.org/takeielts/prepare/test-format)
o Write well-designed, carefully written test items
o Use appropriate scoring rubrics and properly trained raters
During test administration
o Administer tests properly (e.g. standardized testing procedures)
After test administration
o Estimate reliability or dependability and revise the test by (1) making the test longer *, (2) using item analysis to see if any items are problematic * (a short item-analysis sketch follows)
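A minimal item-analysis sketch, assuming Python with numpy; the 1/0 item scores are invented. It reports item facility (proportion of test takers answering correctly) and a simple item discrimination index (the uncorrected correlation between each item and the total score); a low or negative discrimination flags an item that may be measuring something other than the rest of the test.

```python
import numpy as np

def item_analysis(item_scores):
    """Item facility = proportion correct per item;
    item discrimination = correlation of each item with the total score
    (uncorrected item-total correlation)."""
    x = np.asarray(item_scores, dtype=float)   # rows = test takers, columns = items
    total = x.sum(axis=1)
    facility = x.mean(axis=0)
    discrimination = np.array(
        [np.corrcoef(x[:, i], total)[0, 1] for i in range(x.shape[1])]
    )
    return facility, discrimination

# Hypothetical 1/0 scores: 6 test takers x 4 items.
scores = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
facility, discrimination = item_analysis(scores)
for i, (f, d) in enumerate(zip(facility, discrimination), start=1):
    print(f"Item {i}: facility = {f:.2f}, discrimination = {d:.2f}")
```

In this invented data set, item 3 comes out with a negative discrimination, so it would be a candidate for revision or removal.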