Reliability

• Consistency
• Test Scores & Error
  – X = T + E (observed score = true score + error)
  – As true-score variance goes up and error variance goes down, reliability increases
• Variance & Error Variance
  – σ² = σ²_tr + σ²_e (total variance = true-score variance + error variance)

Sources of Error

• Test construction/content
  – Sampling; finite number of questions
  – Poorly written questions
• Test administration
  – Error related to the test taker
  – Error related to the test environment
  – Error related to the examiner

Sources of Error (cont.)

• Test scoring & interpretation
  – Objective v. subjective scoring; scoring rubrics

Parallel Tests

• Theoretical underpinning of reliability
  – Similar content
• Same true score & same error variance
  – Theoretical; not produced in reality
  – Not to be confused with "alternate forms"
• Reliability can be defined as the correlation between 2 parallel tests: r_xx

Types of Reliability

• Reliability over time
• Internal consistency/reliability
• Inter-rater reliability

Reliability over time

• Test-retest reliability
  – Obtained by correlating pairs of scores from the same sample on two different administrations of the same test
  – Error related to the passage of time & intervening factors
• Alternate-form (immediate & delayed)
  – Error related to time & content

Internal Consistency

• Split-half (sketched in code after the alpha slide below)
  1. Divide the test into two equivalent halves
     • Odd–even
     • Randomly assign items
     • Divide by equivalency of items
  2. Calculate r between the 2 halves
  3. Correct with Spearman–Brown: r_SB = 2·r_hh / (1 + r_hh)
     – Allows estimation of the reliability of a test that has been shortened or lengthened

Internal Consistency (cont.)

• Inter-item consistency
  – Index of the homogeneity of a test: the degree to which all items measure the same construct
  – Desirable: aids in interpretation of the test (as opposed to homogeneity of groups)

Internal Consistency (cont.)

• Kuder–Richardson formulas
  – KR-20: the statistic of choice for estimating the reliability of tests with dichotomous (right/wrong) items
  – KR-21: can be used under the assumption that all items are of similar difficulty

Internal Consistency (cont.)

• Cronbach's coefficient alpha
  – A function of all items on the test & the total test score
  – Each item is conceptualized as a test: a 36-item test is treated as 36 parallel tests
  – Beyond dichotomous items, can also be used with tests containing nondichotomous items, e.g., opinion scales and tests that allow partial credit
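A minimal sketch of the three internal-consistency estimates above (split-half with Spearman–Brown, KR-20, and coefficient alpha), assuming only NumPy. The 10-examinee × 6-item matrix of right/wrong scores and all variable names are made up for illustration; they are not from the lecture.

```python
# Sketch: split-half + Spearman-Brown, KR-20, and Cronbach's alpha
# on a made-up matrix of dichotomous (1 = right, 0 = wrong) item scores.
import numpy as np

X = np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
])
k = X.shape[1]                      # number of items
total = X.sum(axis=1)               # each examinee's total score
var_total = total.var()             # population convention throughout

# 1. Split-half: correlate odd-item and even-item half scores,
#    then step the half-length r up with the Spearman-Brown formula.
odd, even = X[:, 0::2].sum(axis=1), X[:, 1::2].sum(axis=1)
r_hh = np.corrcoef(odd, even)[0, 1]
r_sb = 2 * r_hh / (1 + r_hh)

# 2. KR-20 for dichotomous items: p = proportion passing each item.
p = X.mean(axis=0)
kr20 = (k / (k - 1)) * (1 - (p * (1 - p)).sum() / var_total)

# 3. Coefficient alpha: same form with item variances in place of p*q,
#    so it also covers partial-credit and opinion (Likert) items.
#    For 0/1 items the item variance IS p*q, so alpha == KR-20 here.
alpha = (k / (k - 1)) * (1 - X.var(axis=0).sum() / var_total)

print(f"split-half r = {r_hh:.3f}, Spearman-Brown = {r_sb:.3f}")
print(f"KR-20 = {kr20:.3f}, alpha = {alpha:.3f}")
```

Note the design point from the slides: alpha only swaps p·q for the item variance, which is why it generalizes KR-20 beyond right/wrong scoring.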
Inter-rater reliability

• How well do 2 raters/judges agree?
  – Correlation between the scores from the 2 raters
  – Percentage of agreement; percentage of intervals in which both raters agreed the behavior occurred
  – Kappa

Factors influencing reliability

• Length of test
  – Longer tests increase the percentage of the domain that can be sampled
  – Point of diminishing returns
• Homogeneity of items
  – Items measure the same construct; easier to interpret
• Dynamic or static characteristics (whether the trait itself fluctuates or is stable)

Factors influencing reliability (cont.)

• Homogeneity of sample
  – Restriction of range
  – If the sample is homogeneous, then any observed variance must be error
• Power v. speed tests
  – For speed tests, use test-retest, alternate forms, or a split-half from 2 separately timed half-tests
  – Internal consistency is not applicable
    • Speed-test items are easy, so internal consistency inflates the reliability estimate

Reliability of Individual Scores

• How much error is in an individual score?
  – How much confidence do we have in a particular score?
• Standard error of measurement (SEM)
  – The extent to which one individual's scores vary over tests that are presumed to be parallel
• Assume error is distributed "normally"
  – Where is the individual's "true" score?

Standard Error of Measurement

S_meas = SD · √(1 − r_xx)

Example: S_meas = 15 · √(1 − .96) = 15 · √.04 = 3

SEM (cont.)

• Odds are 68% that the "true" score falls within ±1 SEM.
• Odds are 95% that the "true" score falls within ±2 (more precisely, 1.96) SEM.
• Odds are 99.7% that the "true" score falls within ±3 SEM.
• What is the relationship between reliability & SEM?
  – Inverse: the higher the reliability, the smaller the SEM; at r_xx = 1.0, SEM = 0.

Standard Error of the Difference of Two Scores

• Compare one test taker's performance on two different tests
• Compare two test takers on the same test
• Compare two test takers on two different tests

Standard Error of the Difference

σ_diff = √(σ²_meas1 + σ²_meas2), which, for two tests on the same score scale, equals SD · √(2 − r₁ − r₂)

• Set confidence intervals for difference scores (see the sketch after the last slide)
• Difference scores contain error from both of the comparison measures
  – Difference scores are less reliable than scores from the individual tests

Test-retest reliability: Social Interaction Self-Statement

(+ and − likely denote the positive and negative self-statement subscales; 1 and 2 the two administrations)
• r(+1, +2) = .99
• r(−1, −2) = .99
• r(+1, −1) = −.45
• r(+1, −2) = −.55
• r(+2, −1) = −.47
• r(+2, −2) = −.56
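A short numeric sketch of the SEM and standard-error-of-the-difference formulas above, assuming only Python's standard library. It uses the slides' own example values (SD = 15, r_xx = .96 → SEM = 3); the second test's reliability (.91) and the observed score (110) are hypothetical.

```python
# Sketch: SEM, confidence bands around an obtained score, and the
# standard error of the difference, per the formulas on the slides.
import math

def sem(sd, r_xx):
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - r_xx)

def se_diff(sd, r1, r2):
    """Standard error of the difference for two tests on the same scale:
    SD * sqrt(2 - r1 - r2), i.e. sqrt(SEM1**2 + SEM2**2)."""
    return sd * math.sqrt(2 - r1 - r2)

s = sem(15, 0.96)                    # -> 3.0, matching the slide's example
print(f"SEM = {s:.1f}")

obtained = 110                       # hypothetical observed score
for z, pct in [(1, 68), (1.96, 95), (3, 99.7)]:
    print(f"~{pct}% band: {obtained - z*s:.1f} to {obtained + z*s:.1f}")

# Reliability and SEM move inversely: at r_xx = 1 the SEM is 0, and at
# r_xx = 0 the SEM equals the test's SD.
d = se_diff(15, 0.96, 0.91)          # -> 5.41, larger than either SEM alone
print(f"SE of the difference = {d:.2f}")
```

The last line illustrates the slides' caution: because the difference pools error from both measures, its standard error (≈5.41 here) exceeds either test's SEM (3.0 and 4.5), so difference scores are less reliable than the individual scores.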