Chapter 4 – Reliability

1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
   A. Domain sampling – test items
   B. Time sampling – test occasions
   C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

Chapter 4 – Reliability
• Measurement of human ability and knowledge is challenging because:
  – ability is not directly observable; we infer ability from behavior
  – all behaviors are influenced by many variables, only a few of which matter to us

Observed Scores
• O = T + e
  O = observed score, T = true score, e = error

Reliability – the basics
1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

Reliability – the basics
• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will therefore be that person's true score.

Reliability – the basics
• Example: we want to measure Sarah's ability to spell English words.
• We can't ask her to spell every word in the OED, so…
• We ask Sarah to spell a subset of English words.
• Her % correct estimates her true English spelling skill.
• But which words should be in our subset?

Estimating Sarah's spelling ability…
• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, a lot of very difficult words – desiccate, arteriosclerosis, numismatics?

Estimating Sarah's spelling ability…
• Sarah's observed score varies as the difficulty of the random sets of words varies.
• But presumably her true score (her actual spelling ability) remains constant.

Reliability – the basics
• Other things can also produce error in our measurement.
• E.g., on the first day that we test Sarah she is tired, but on the second day she is rested…
• This would lead to different scores on the two days.

Estimating Sarah's spelling ability…
• Conclusion: O = T + e, but e1 ≠ e2 ≠ e3 …
• The variation in Sarah's scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?

Reliability – the basics
• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.

How do we deal with sources of error?
• Error due to test items – domain sampling error
• Error due to testing occasions – time sampling error
• Error due to testing multiple traits – internal consistency error

Domain Sampling error
• A knowledge base or skill set containing many items is to be tested, e.g., the chemical properties of foods.
• We can't test the entire set of items, so we select a sample of items. That produces domain sampling error, as in Sarah's spelling test.
• There is a "domain" of knowledge to be tested; a person's score may vary depending upon what is included in or excluded from the test.
• Smaller sets of items may not test the entire knowledge base; larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test (the simulation sketch below illustrates this).
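The claim that reliability grows with the number of items can be checked with a small simulation. This is a minimal sketch, not part of the original slides: it assumes each person has a fixed true ability (probability of getting any item right) and that each "form" samples items at random, so the only error is domain-sampling error. The function name and the ability range are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def parallel_forms_reliability(n_people=500, n_items=20):
    """Correlation between two randomly sampled 'parallel forms'.

    Each person has a fixed true ability; each form's observed score
    adds item-sampling error, as in O = T + e.
    """
    true_ability = rng.uniform(0.3, 0.9, size=n_people)      # T
    form_a = rng.binomial(n_items, true_ability) / n_items   # O1 = T + e1
    form_b = rng.binomial(n_items, true_ability) / n_items   # O2 = T + e2
    return np.corrcoef(form_a, form_b)[0, 1]


# Longer tests give higher parallel-forms correlations.
for k in (5, 20, 80, 320):
    print(f"{k:4d} items: parallel-forms r = {parallel_forms_reliability(n_items=k):.2f}")
```

The printed correlations rise toward 1 as the number of items grows, which is the domain-sampling argument in numerical form.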
Domain Sampling error
• Parallel Forms Reliability: choose 2 different sets of test items.
• These 2 sets give you "parallel forms" of the test.
• Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error.

Time Sampling error
• Test-retest Reliability: a person taking a test might be having a very good or a very bad day – due to fatigue, emotional state, preparedness, etc.
• Give the same test repeatedly and check the correlations among scores.
• High correlations indicate stability – less influence of bad or good days.

Time Sampling error
• The test-retest approach is only useful for traits – characteristics that don't change over time.
• Not all low test-retest correlations imply a weak test.
• Sometimes the characteristic being measured varies with time (as in learning).

Time Sampling error
• The interval over which the correlation is measured matters.
• E.g., for young children, use a very short interval (< 1 month, in general).
• In general, the interval should not be > 6 months.

Time Sampling error
• Test-retest advantage: easy to evaluate, using correlation.
• Disadvantages: carryover and practice effects.
• Carryover: the first testing session influences scores on the next session.
• Practice: a carryover effect that involves learning.

Internal Consistency error
• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts? No – because the two "skills" are unrelated.

Internal Consistency approach
• A low correlation between scores on the 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has a high correlation between scores on its two halves. But how should we divide the test in two to check that correlation?

Internal Consistency error
• Split-half method
• Kuder-Richardson formula
• Cronbach's alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.

Split-half Reliability
• After testing, divide the test items into halves A and B that are scored separately.
• Check the correlation of results for A with results for B.
• There are various ways of dividing the test in two – randomly, first half vs. second half, odd-even…

Split-half Reliability – a problem
• Each half-test is smaller than the whole.
• Smaller tests have lower reliability (domain sampling error).
• So we shouldn't use the raw split-half correlation to assess reliability for the whole test.

Split-half Reliability – a problem
• We correct the reliability estimate using the Spearman-Brown formula:
  r_e = 2 r_c / (1 + r_c)
  r_e = estimated reliability for the whole test
  r_c = computed reliability (the correlation between scores on the two halves A and B)
• A worked sketch of the split-half calculation follows below.
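The following is a minimal sketch of the split-half procedure with the Spearman-Brown correction, not code from the original slides. It assumes an odd-even split of a 0/1 item-score matrix; the simulated data and the function name split_half_reliability are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def split_half_reliability(item_scores):
    """Odd-even split-half correlation, corrected with Spearman-Brown.

    item_scores: 2-D array, one row per person, one column per item
    (e.g., 0/1 for wrong/right).  Returns (raw half-test correlation r_c,
    Spearman-Brown estimate r_e for the full-length test).
    """
    half_a = item_scores[:, 0::2].sum(axis=1)   # odd-numbered items
    half_b = item_scores[:, 1::2].sum(axis=1)   # even-numbered items
    r_c = np.corrcoef(half_a, half_b)[0, 1]     # computed reliability
    r_e = 2 * r_c / (1 + r_c)                   # Spearman-Brown correction
    return r_c, r_e


# Fake data: 200 people, 30 items, right/wrong scoring driven by ability.
ability = rng.uniform(0.3, 0.9, size=(200, 1))
scores = (rng.random((200, 30)) < ability).astype(int)

r_c, r_e = split_half_reliability(scores)
print(f"half-test r_c = {r_c:.2f}, corrected r_e = {r_e:.2f}")
```

The corrected value r_e is always at least as large as r_c, reflecting the fact that each half-test underestimates the reliability of the full-length test.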
Kuder-Richardson 20
• Kuder & Richardson (1937): an internal-consistency measure that doesn't require an arbitrary splitting of the test into 2 halves.
• KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
• The formula contains two basic terms:
  1. a measure of all the variance in the whole set of test results;
  2. "item variance" – when items measure the same trait, they co-vary (the same people get them right or wrong); more co-variance means less "item variance" relative to the total.

Internal Consistency – Cronbach's α
• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally useful measure of internal consistency than KR-20 (a computational sketch appears at the end of the chapter).

Review: How do we deal with sources of error?

  Approach         Measures                             Issues
  Test-Retest      Stability of scores                  Carryover
  Parallel Forms   Equivalence & stability              Effort
  Split-half       Equivalence & internal consistency   Shortened test
  KR-20 & α        Equivalence & internal consistency   Difficult to calculate

Reliability in Observational Studies
• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error is due to: observer failures; inter-observer differences.

Reliability in Observational Studies
• Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences by using: inter-rater reliability; the kappa statistic.

Reliability in Observational Studies
• Inter-rater reliability: % agreement between 2 or more observers.
• Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
• This means that % agreement may overestimate inter-rater reliability.
• Kappa statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement, after correction for chance.

Using Reliability Information
• Standard error of measurement (SEM): estimates the extent to which a test score misrepresents a true score.
• SEM = S √(1 – r), where S is the standard deviation of the test scores and r is the reliability of the test.

Standard Error of Measurement
• We use the SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score.
• We have confidence that the true score falls in this interval.
• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score. (A worked sketch appears at the end of the chapter.)

Standard Error of Measurement
• A simple way to think of the SEM:
• Suppose we gave one student the same test over and over.
• Suppose, too, that no learning took place between tests and the student did not memorize questions.
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.

What to do about low reliability
• Increase the number of items.
• To find how many items you need, use the Spearman-Brown formula.
• Using more items may introduce new sources of error, such as fatigue and boredom.

What to do about low reliability
• Discriminability analysis: find the correlation between each item and the whole test.
• Delete items with low correlations (see the final sketch at the end of the chapter).
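The sketch below shows how Cronbach's α (and, for 0/1 items, KR-20) can be computed from an item-score matrix, following the description earlier in the chapter. It is a minimal illustration, not code from the original slides; the function name cronbach_alpha is an assumption.

```python
import numpy as np


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a matrix of item scores.

    item_scores: 2-D array, one row per person, one column per item.
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
    With items scored 0/1 this reduces to KR-20.
    """
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()   # "item variance" term
    total_var = item_scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)


# Usage (with any people-by-items score matrix `scores`):
# print(f"alpha = {cronbach_alpha(scores):.2f}")
```

When items co-vary strongly, the summed item variances are small relative to the variance of the total scores, so α is high; that is the "item variance" idea from the KR-20 discussion.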
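The SEM and its 95% confidence interval can be written out in a few lines. This is a minimal sketch; the numbers (SD = 15, reliability = .91, observed score = 110) are made up for illustration and are not from the chapter.

```python
import math


def sem_confidence_interval(observed, sd, reliability, z=1.96):
    """Confidence interval for a true score, given an observed score.

    Uses SEM = S * sqrt(1 - r); the interval is observed +/- z * SEM,
    centered on the observed score as described in the chapter.
    """
    sem = sd * math.sqrt(1 - reliability)
    return sem, (observed - z * sem, observed + z * sem)


sem, (low, high) = sem_confidence_interval(observed=110, sd=15, reliability=0.91)
print(f"SEM = {sem:.1f}; 95% CI for the true score = {low:.1f} to {high:.1f}")
```

With these illustrative numbers, SEM = 4.5 and the 95% interval runs from about 101 to 119; a more reliable test gives a smaller SEM and a tighter interval.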
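Finally, a sketch of the discriminability (item-total correlation) analysis from the last section. It is illustrative only: the function name and the .20 cutoff are assumptions, not values given in the chapter.

```python
import numpy as np


def item_total_correlations(item_scores):
    """Correlation of each item with the total test score.

    item_scores: 2-D array, one row per person, one column per item.
    Items with low (or negative) correlations are candidates for deletion.
    (A stricter variant excludes the item itself from the total.)
    """
    total = item_scores.sum(axis=1)
    return np.array([np.corrcoef(item_scores[:, j], total)[0, 1]
                     for j in range(item_scores.shape[1])])


# Usage: flag items whose item-total correlation falls below a chosen cutoff.
# weak_items = np.where(item_total_correlations(scores) < 0.20)[0]
```

Dropping the flagged items and recomputing reliability shows whether discriminability analysis has improved the internal consistency of the test.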