CHAPTER 3 RELIABILITY AND OBJECTIVITY

OBJECTIVES
- this chapter discusses methods to estimate reliability and objectivity, and the factors that influence these values for a test score
- on most occasions, a test can be administered several times in a day; the test administrator must decide how many trials to administer and which trials to use as the criterion score
- after reading this chapter, you should be able to:
1. define and differentiate between reliability and objectivity for norm-referenced test scores and outline the methods used to estimate these values
2. identify the factors that influence reliability and objectivity for norm-referenced test scores
3. identify the factors that influence reliability for criterion-referenced test scores
4. select a reliable criterion score based on measurement theory

INTRODUCTION
- certain characteristics are essential to measurement; without them we cannot believe in the measurement, and we can make little use of it; this chapter covers these characteristics
- the most important characteristic of a measurement is validity; an instrument is valid only if it measures what it is supposed to measure; this is discussed in greater detail in chapter 4
- the second most important characteristic is reliability; a reliable instrument measures whatever it measures consistently; before an instrument can be valid, it must first be reliable
- objectivity is another important characteristic; it is sometimes called rater reliability because it is defined in terms of the agreement of judges about the value of the measurement; if two judges/raters cannot agree on a score, the measurement lacks objectivity; a lack of objectivity reduces both reliability and validity
- the majority of this chapter deals with reliability and objectivity of norm-referenced tests; reliability of criterion-referenced tests is presented at the end of the chapter

SELECTING A CRITERION SCORE

Mean Score Versus Best Score
- a criterion score is the measure used to indicate a person's ability
- unless a measure has perfect reliability, it is a better indicator of performance when developed from more than one trial
- multiple trials are common: skinfolds, strength testing, IQ, jumping, etc.
- for multiple trials, the criterion score can be either the best score or the mean of the trials; the best score can represent optimal performance, the mean can represent typical performance
- which to use? it depends; a mean is more difficult to obtain (more calculations and administrations), but the more trials you have, the better the reliability

More Considerations
- if maximum ability is what you are seeking, then you may want the best score
- or perhaps the mean of the two highest scores

Summary
- based on the above information, we can determine a criterion score in any of the following ways (a short sketch of these options appears after this summary):
1. mean of all the trial scores
2. best score of all the trial scores
3. mean of selected trial scores based upon the trials on which the group scored best
4. mean of selected trial scores based upon the trials on which the individual scored best
- in chapter 4 we will learn that for a score to be valid, it must be reliable, but reliability does not guarantee validity; in selecting a criterion score, we must consider what is the most valid and reliable score, and not simply the most reliable score
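To make these options concrete, here is a minimal Python sketch; the three trial values are hypothetical, and only options 1, 2, and 4 are shown (option 3 applies the same idea using the trials on which the group, rather than the individual, scored best):

```python
import numpy as np

# Hypothetical example: three recorded trials of a jump test for one person.
# The values and the choice of "two best trials" are illustrative only.
trials = np.array([182.0, 190.0, 187.0])

mean_of_all = trials.mean()                     # option 1: mean of all trials (~186.3)
best_score = trials.max()                       # option 2: best single trial (190.0)
mean_of_two_best = np.sort(trials)[-2:].mean()  # option 4: mean of this person's best trials (188.5)

print(mean_of_all, best_score, mean_of_two_best)
```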
TYPES OF RELIABILITY
- reliability is traditionally estimated by one of two methods: 1. test-retest (stability) or 2. internal consistency
- each method yields a different coefficient, so it is imperative to use the most appropriate one; it is also important to note which type of reliability others have used to estimate their coefficients
- also, a test may be reliable in one group but not in another

Stability Reliability
- when individual scores change little from one day to the next, they are stable; when scores remain stable, they are considered reliable
- the test-retest method is used to obtain the stability reliability coefficient
- for this, the same individuals are measured with the same instrument on several occasions (usually, and at least, two); the correlation between the sets of scores is the stability reliability coefficient (a short sketch of this calculation appears at the end of this section)
- the closer this coefficient is to +1, the more reliable the scores
- three factors that can contribute to low stability reliability are:
1. the people being tested may perform differently
2. the measuring instrument may be operated or applied differently
3. the person administering the measurement may change
- as a rule of thumb, test administrations are 1 to 3 days apart; for maximum fitness testing (e.g., VO2max) there should be seven days between administrations to allow for complete physiological recovery
- if the interval between administrations is too long, scores can change due to practice, memory, etc., and these are not considered sources of measurement error
- because of the time constraints, some people do not advocate test-retest reliability; it is probably, however, the most appropriate method for determining the reliability of physical performance measures
- also, not all subjects have to be retested; 25% to 50% of the sample size should suffice
- there is no set standard for an acceptable range of stability reliability coefficients; all situations must be evaluated on an individual basis; most physical performance measures, however, exhibit coefficients in the .80 to .95 range

Internal-Consistency Reliability
- used by many; the advantage is that all measures are collected on the same day
- refers to a consistent rate of scoring by the individuals being tested throughout a test, or from trial to trial when multiple trials are administered
- at least two trials must be administered on the same day; significant changes in test scores from trial to trial indicate a lack of reliability
- the correlation among the trial scores is the internal-consistency reliability coefficient

Stability versus Internal Consistency
- the two coefficients are not comparable
- internal consistency is not affected by day-to-day changes in performance; day-to-day changes in performance are a major source of measurement error in stability reliability
- internal-consistency coefficients are typically higher than stability reliability coefficients; it is not unusual to have coefficients ranging from .85 to .99 on performance tests
- disciplines that rely heavily on paper-and-pencil tests primarily use internal-consistency reliability, whereas performance-based disciplines rely heavily on stability reliability
- why? remember that the stability coefficient assumes that true ability has not changed from one day to the next; paper-and-pencil tests typically cannot meet this assumption, whereas performance tests can
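As a quick illustration of the test-retest calculation described above, here is a minimal Python sketch; the ten scores are invented, and the correlation between the two days is simply the Pearson correlation:

```python
import numpy as np

# Hypothetical example: the same five people tested on two days with the same test.
# Scores are made up purely to illustrate the calculation.
day1 = np.array([20, 25, 17, 30, 22], dtype=float)
day2 = np.array([21, 24, 18, 29, 24], dtype=float)

# The correlation between the two sets of scores is the stability reliability coefficient.
stability_r = np.corrcoef(day1, day2)[0, 1]
print(round(stability_r, 2))   # about 0.97 for these made-up scores
```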
RELIABILITY THEORY
- reliability for norm-referenced tests may be better understood through its mathematical foundation
- reliability can be explained in terms of "observed scores", "true scores", and "error scores"
- additionally, reliability theory assumes that any measurement on a continuous scale has an inherent component of error, termed "measurement error"
- any number of things can affect measurement error; the four most frequently occurring are:
1. lack of agreement among scorers (objectivity)
2. lack of consistency by the individuals being tested
3. lack of consistency in the measuring instrument, and
4. inconsistency in following standardized testing procedures
- from the text:
o assume we measure the height of 5 people, all 68 inches tall
o if we report anyone as other than 68 inches tall, we have measurement error
o the variability of measured height around actual height is "measurement error"
o variance, from chapter 2, is s²
o if all measured scores are 68, s is zero and s² is zero, so there is no measurement error
o if all people are not the same height, then s² is due to true differences in height plus differences due to measurement error
o in reliability theory, we are trying to determine how much of the variance reflects true differences and how much reflects measurement error
- in theory, the observed score is the sum of the true score and the measurement error; the equation is X = t + e, where
o X is the observed score
o t is the true score
o e is the error
- for example, if an individual who is 70.25 inches tall is measured at 70.5 inches, the measurement error is 0.25 inches: 70.5 (X) = 70.25 (t) + 0.25 (e)
- the variance for a set of observed scores is equal to the variance of the true scores plus the variance of the error scores: σ²x = σ²t + σ²e
- reliability, then, is the ratio of true-score variance to observed-score variance:
Reliability = σ²t / σ²x = (σ²x − σ²e) / σ²x = 1 − (σ²e / σ²x)
- from the formula, we can observe that when measurement error is 0, reliability equals 1; as measurement error increases, reliability decreases; reliability is an indicator of the amount of measurement error in a set of scores
- reliability, then, is dependent on two factors:
1. reducing the variation attributable to measurement error, and
2. detecting individual differences (true-score variation) within the group measured
- reliability, then, must be viewed in terms of its measurement error (error variance) and its power to discriminate among different levels of ability within the group measured (true-score variance); a small simulation illustrating this decomposition follows
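The following minimal simulation (not from the textbook; the mean and the two standard deviations are arbitrary choices) illustrates the X = t + e model and shows that the ratio of true-score variance to observed-score variance behaves as the formula above predicts:

```python
import numpy as np

# A minimal simulation of the model X = t + e (all numbers are arbitrary).
rng = np.random.default_rng(1)

true_scores = rng.normal(70, 3, size=10_000)   # true-score standard deviation = 3
errors = rng.normal(0, 1, size=10_000)         # measurement-error standard deviation = 1
observed = true_scores + errors                # X = t + e

# Reliability = true-score variance / observed-score variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))   # close to 9 / (9 + 1) = 0.90
```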
ESTIMATING RELIABILITY – INTRACLASS CORRELATION
- remember: X (observed score) = t (true score) + e (error); furthermore, the variance of observed scores equals the variance of true scores plus the variance of the error scores: σ²x = σ²t + σ²e, and Reliability = σ²t / σ²x = 1 − (σ²e / σ²x)
- reliability equals the true-score variance divided by the observed-score variance
- just as an observed score can be divided into true and error scores, the total variability (s²) for a set of scores can be divided into several parts
- to divide, or partition, this variance we use ANOVA; we use the output from the ANOVA to obtain the variance estimates that we need, and then we can calculate an intraclass reliability coefficient
- before ANOVA, two things we must discuss:
1. you must estimate reliability prior to collecting a large amount of data
a. administer the test to a small representative sample
b. estimate reliability
c. this is known as a pilot study
d. you should use more administrations and trials in the pilot study than in the larger study
2. calculations are easier when done with the computer; however, you should still practice by hand to understand how they are calculated

Intraclass R from One-Way ANOVA
- using ANOVA, we replace the reliability formula σ²t / σ²x with:
R = (MS_A − MS_W) / MS_A (formula 3.1)
where
o R is the intraclass correlation coefficient
o MS_A is the mean square among people
o MS_W is the mean square within people
- the mean square values are provided by ANOVA; they are also variance estimates
o (MS_A − MS_W) is an estimate of σ²t, and
o MS_A is an estimate of σ²x
- to calculate these values by hand, we need to define six values:
1. sum of squares total, SS_T
2. degrees of freedom total, df_T
3. sum of squares among people, SS_A
4. sum of squares within people, SS_W
5. degrees of freedom among people, df_A
6. degrees of freedom within people, df_W

SS_T = ΣX² − (ΣX)² / (nk)
SS_A = (ΣT_i²) / k − (ΣX)² / (nk)
SS_W = ΣX² − (ΣT_i²) / k
df_T = nk − 1
df_A = n − 1
df_W = n(k − 1)

where:
ΣX² is the sum of the squared scores
ΣX is the sum of the scores of all people
n is the number of people
k is the number of scores for each person, and
T_i is the sum of the scores for person i

- now we can easily calculate the mean square values (among and within):
MS_A = SS_A / df_A = SS_A / (n − 1)
MS_W = SS_W / df_W = SS_W / [n(k − 1)]
- in calculating MS_A, SS_A will be zero if all people have the same score; it will be greater than zero if people have different scores; you should not anticipate it being zero, since there will be different scores; MS_A, then, should be interpreted as an estimate of observed-score variance σ²x
- also, in calculating MS_W, SS_W will be zero if each person has the same score on all trials; it will be greater than zero if people have different scores on different trials; MS_W, then, should be interpreted as an estimate of error-score variance σ²e
- Step 8 from the text, page 86, the ANOVA source table:

Source          df     SS     MS
Among people    df_A   SS_A   MS_A
Within people   df_W   SS_W   MS_W
Total           df_T   SS_T

- go to handout for problem 3.1
- we interpret this R as the reliability of a criterion score that is the sum or mean test score for each person; when R = 0, there is no reliability; when R = 1, there is maximum reliability
- SPSS handout
- the reliability we have from problem 3.1 is the reliability of a criterion score from the mean of two scores (trials 1 and 2); what if you want R for a single trial from a single test administration? we calculate R for a criterion score from a single score using the following formula:
R = (MS_A − MS_W) / [MS_A + (k/k′ − 1) MS_W] (formula 3.2)
where
o R is the reliability for a criterion score composed of k′ scores
o k is the number of scores per person in the pilot study
o k′ is the number of scores per person in the actual measurement
- if we want to estimate reliability for a score collected on a single day, using the values from problem 3.1, where k = 2 and k′ = 1:
R = (31.5 − 0.33) / [31.5 + (2/1 − 1)(0.33)] = 31.17 / 31.83 = 0.98
- we use Step 9 (formula 3.1) when we want the reliability of a criterion score that is the mean or sum of the trial scores; we can use equation 3.2 if we want the reliability of a best score (a single score)
- this reliability coefficient is only one of many, but it is the simplest to illustrate; more advanced procedures may elicit a more precise coefficient
- a short sketch of these one-way calculations on a small made-up data set follows
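Here is a minimal Python sketch of the one-way calculations above; the 5-person, 2-trial data matrix is made up (it is not the Problem 3.1 data), but the sums of squares, mean squares, and formulas 3.1 and 3.2 follow the definitions just given:

```python
import numpy as np

# Hypothetical data: n = 5 people, k = 2 trials each (not the Problem 3.1 data).
scores = np.array([[9.0, 8.0],
                   [6.0, 7.0],
                   [4.0, 5.0],
                   [10.0, 9.0],
                   [7.0, 7.0]])
n, k = scores.shape

grand_sum = scores.sum()
person_sums = scores.sum(axis=1)          # T_i for each person

ss_among = (person_sums ** 2).sum() / k - grand_sum ** 2 / (n * k)
ss_within = (scores ** 2).sum() - (person_sums ** 2).sum() / k

ms_among = ss_among / (n - 1)             # MS_A
ms_within = ss_within / (n * (k - 1))     # MS_W

# formula 3.1: criterion score is the mean (or sum) of the k trials
R_mean = (ms_among - ms_within) / ms_among

# formula 3.2: criterion score is a single trial (k' = 1)
k_prime = 1
R_single = (ms_among - ms_within) / (ms_among + (k / k_prime - 1) * ms_within)

print(round(R_mean, 2), round(R_single, 2))   # about 0.95 and 0.90 for these made-up scores
```

With real data, the same arithmetic can be read off the ANOVA source table produced by SPSS or any other statistics package.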
Intraclass R from Two-Way ANOVA
- suppose that k scores were collected for each of n people; these could have been collected over k trials or k days; for discussion and illustration, think of the k scores as trials
- an ANOVA source summary table would look like Table 3.2, page 87
- think of it as an extension of the previous calculations; to calculate reliability we have introduced more trials, so we have increased the confusion just a little bit
- to complete these calculations, we need the following:
sum of squares total, SS_T (same as above)
sum of squares among people, SS_P (same as SS_A above)
sum of squares among trials: SS_t = (ΣT_j²) / n − (ΣX)² / (nk)
sum of squares interaction: SS_I = ΣX² − (ΣT_i²) / k − (ΣT_j²) / n + (ΣX)² / (nk)
df_T = nk − 1; df_P = n − 1; df_t = k − 1; df_I = (n − 1)(k − 1)
where T_j is the sum of the scores for trial j, and the rest are defined above

Selecting Criterion Score
- this deals with selecting the criterion score and the ANOVA model for R; review the earlier information (selecting a criterion score)

Criterion Score Is Mean
- if the criterion score will be the mean or sum of the trial scores and NOT THE BEST SCORE, we need to examine whether the trial means are different
- one method is to visually compare them; in Problem 3.2, the means are 5.0, 5.2 and 5.4, not a big difference
- another method of determining whether the means are different is to use an F-test:
F = MS_t / MS_I; from Problem 3.2, F = 0.20 / 0.29 = 0.70
- evaluate the F-test using the additional statistical techniques at the end of chapter 2; if it is significant, there are real differences between the means; if not, then there is no difference in the means
- once you determine the significance of the F-test, you can proceed in one of three ways (see Figure 3-1, page 89):
1. If there is no significant difference among the trial means, the reliability of the criterion score is calculated using:
R = (MS_P − MS_W) / MS_P (formula 3.3), where MS_W = (SS_t + SS_I) / (df_t + df_I)
Note that formula 3.3 has the same form as formula 3.1
2. If there are significant differences among the trial means, we need to make some decisions:
a. we can discard the trial scores that are presenting the problem
b. see Figure 3-2, page 90
c. if we discard, then we do another set of ANOVA calculations
d. then see if the means are different; if they are not, we proceed with step 1 above
e. the purpose of doing this is to remove differences among trials, and it is completely acceptable
3. Or, we can consider the changes in scores as attributable to learning or practice and therefore not due to measurement error:
a. use this especially if you see increases in performance from one trial to another
b. this technique can also be used when attempting to estimate the objectivity of judges
c. the formula for estimating reliability in this case is:
R = (MS_P − MS_I) / MS_P (formula 3.4)
- measurement error is supposed to be random; significant differences among trial means indicate a lack of randomness, that is, a systematic change

Criterion Score Is Single Score
- earlier we provided an example of selecting a criterion score using the one-way model; here we provide information on estimating reliability from a single score; we use the following formula:
R = (MS_P − MS_I) / [MS_P + (k/k′ − 1) MS_I] (formula 3.5)

COMPUTER USE
- you should use a computer at all times, unless the dataset is small
- the advantage of a two-way ANOVA over a one-way ANOVA is that the trial means are provided, as well as a significance test for the trial means; you can calculate R using either equation 3.3 or 3.4 from this output
- in SPSS, Reliability Analysis provides an option for the two-way ANOVA for repeated measures (this option is not available in the Student Version)
o see Section 12 of the Windows Statistical Procedures in Appendix A, page 513
- see Tables 3.3 and 3.4; in Table 3.4, "between people" and "within people" are reported if a one-way ANOVA is used, while "between people", "between measures" and "residual" are reported if a two-way ANOVA is used
- "within people" from the one-way ANOVA is composed of "between measures" and "residual" from the two-way ANOVA; "between measures" and "residual" in Table 3.4 are among trials and interaction, respectively

COMMON SITUATIONS FOR CALCULATING R
- formulas 3.2 and 3.5 are for estimating R for a criterion score that is a single trial or single day score, where two or more trial or day scores were collected
- remember that R is often estimated in a pilot study with more scores per person than will actually be used in the regular study
- presented next are common situations and formulas for calculating R using formulas 3.2 and 3.5; see handouts for situations 1 and 2
- a small sketch of the two-way calculations (formulas 3.3 through 3.5) follows
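The sketch below works through the two-way partition on a new made-up 5-person, 2-trial data matrix; the values and variable names are mine, not the textbook's Problem 3.2 data, but the sums of squares and formulas 3.3, 3.4, and 3.5 follow the definitions above:

```python
import numpy as np

# Hypothetical data: n = 5 people, k = 2 trials (made up for illustration).
scores = np.array([[9.0, 8.0],
                   [6.0, 7.0],
                   [4.0, 6.0],
                   [10.0, 9.0],
                   [7.0, 8.0]])
n, k = scores.shape
grand_sum = scores.sum()
person_sums = scores.sum(axis=1)   # T_i
trial_sums = scores.sum(axis=0)    # T_j

correction = grand_sum ** 2 / (n * k)
ss_people = (person_sums ** 2).sum() / k - correction
ss_trials = (trial_sums ** 2).sum() / n - correction
ss_interaction = ((scores ** 2).sum() - (person_sums ** 2).sum() / k
                  - (trial_sums ** 2).sum() / n + correction)

ms_people = ss_people / (n - 1)
ms_trials = ss_trials / (k - 1)
ms_interaction = ss_interaction / ((n - 1) * (k - 1))

F = ms_trials / ms_interaction   # test whether the trial means differ

# Formula 3.3: trial means do NOT differ significantly (the case here, since F is small).
ms_within = (ss_trials + ss_interaction) / ((k - 1) + (n - 1) * (k - 1))
R_33 = (ms_people - ms_within) / ms_people

# Formula 3.4: trial-to-trial change treated as learning, not measurement error.
R_34 = (ms_people - ms_interaction) / ms_people

# Formula 3.5: criterion score is a single trial (k' = 1).
k_prime = 1
R_35 = (ms_people - ms_interaction) / (ms_people + (k / k_prime - 1) * ms_interaction)

print(round(F, 2), round(R_33, 2), round(R_34, 2), round(R_35, 2))
# about 0.44, 0.87, 0.85, 0.74 for these made-up scores
```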
Sample Size for R
- the size of the estimated R is somewhat dependent on the sample
- to address this, confidence intervals are used to provide a range within which R may fall, for example a 90% confidence interval for R
- when using confidence intervals, you are stating that you are X% confident that the actual R falls within this range; you want a high level of confidence with a small confidence interval
- research has shown that confidence limits are essentially the same for one-way and two-way ANOVA models when the sample size and criterion score are kept equal
- additionally, as the sample size increases or the R increases, the width of the confidence interval decreases; review Table 3.5 to see this relationship, and note how the confidence interval changes as R increases or the sample size increases
- you can obtain confidence intervals through the Reliability Analysis program within SPSS

Acceptable Reliability
- what is an acceptable reliability coefficient? it depends, and it depends on several factors, including the characteristics of the sample, R's from previous and similar studies, the type of R, the study design, etc.
- in performance-related tests, minimum estimates of R might be 0.70 or 0.80
- in behavioral research, estimates of R = 0.60 are sometimes acceptable
- you can also use confidence intervals to set a minimum acceptable estimate of R; for example, with a 95% confidence interval, you might stipulate that the lower end of the confidence interval must be greater than 0.70
- some guidance in the exercise science field might be the following:
o 0.70 – 0.79 is below average but acceptable
o 0.80 – 0.89 is average and acceptable
o 0.90 and greater is above average

Factors Affecting Reliability
- many factors can affect the reliability of a measurement; some include scoring accuracy, number of trials, test difficulty, test instructions, testing environment, and experience
- also the length of the test, since longer tests provide higher estimates of R
- Table 3.6 is a categorization of factors that influence test score reliability
- beyond that, we can expect an acceptable degree of reliability when:
1. the sample is heterogeneous in ability, motivated to do well, ready to be tested, and informed about the nature of the test
2. the test discriminates among ability groups, and is long enough or repeated sufficiently for each person to show his or her best performance
3. the testing environment and organization are favorable to good performance, and
4. the person administering the test is competent
Coefficient Alpha
- often used to determine the reliability of dichotomous data (chapter 14)
- with ordinal data we may use coefficient alpha; in fact, it provides the same answer as formula 3.4 if the data are ratio
- coefficient alpha is an estimate of the reliability of a criterion score that is the sum of the trial scores in one day
- it is determined using the following:
r_alpha = [k / (k − 1)] × [(s²_x − Σs²_j) / s²_x]
where
o r_alpha is coefficient alpha
o k is the number of trials
o s²_x is the variance of the criterion scores, and
o Σs²_j is the sum of the variances of the trials
- if using statistical software or Excel, do not worry about n or n − 1 in the denominator of the variances; they cancel out, so coefficient alpha will be the same
- go to handout for problem 3.3
- besides the needed variances, statistical software can also provide correlations, which can help in deciding which trial scores to use if you don't use all of them

Intraclass R in Summary
- sometimes the intraclass correlation coefficient will be lower than expected or wanted, even though the test scores seem reliable
- this can happen when the sum of squares among people is small; this indicates a more homogeneous group, whereas a heterogeneous group increases the sum of squares among people
- one way to correct this is to increase the sensitivity of the test so that it discriminates more within the homogeneous group
- another way is to increase the heterogeneity of the group

SPEARMAN-BROWN PROPHECY FORMULA
- used to estimate reliability when the length of the test is increased
- assumes that the additional length (or new trials) is just as difficult as the original and is neither mentally nor physically more tiring
- it is an estimate of the reliability of a criterion score that is the mean or sum of the trial scores:
r_kk = (k × r_11) / [1 + (k − 1) × r_11]
where
o r_kk is the estimated reliability of a test increased in length k times
o k is the number of times the test is increased in length, and
o r_11 is the reliability of the present test
- Problem 3.4: a six-trial test has R = 0.94; what if 18 trials were administered (k = 3)?
r_kk = (3 × 0.94) / [1 + (3 − 1)(0.94)] = 2.82 / 2.88 = 0.98
- a short sketch of coefficient alpha and this prophecy formula follows
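A minimal sketch of coefficient alpha and the Spearman-Brown prophecy formula; the 5-person, 3-trial scores used for alpha are made up, while the Spearman-Brown lines reproduce the Problem 3.4 arithmetic (R = 0.94, test lengthened to three times its original length):

```python
import numpy as np

# Coefficient alpha for a criterion score that is the sum of the trials in one day.
# The 5-person x 3-trial scores below are made up for illustration.
scores = np.array([[5.0, 6.0, 6.0],
                   [3.0, 4.0, 3.0],
                   [7.0, 7.0, 8.0],
                   [4.0, 5.0, 5.0],
                   [6.0, 6.0, 7.0]])
k = scores.shape[1]

criterion = scores.sum(axis=1)                 # sum of trials for each person
s2_x = criterion.var(ddof=1)                   # variance of the criterion scores
sum_s2_j = scores.var(axis=0, ddof=1).sum()    # sum of the single-trial variances

alpha = (k / (k - 1)) * (s2_x - sum_s2_j) / s2_x
print(round(alpha, 2))                         # about 0.97 for these made-up scores

# Spearman-Brown prophecy formula, reproducing Problem 3.4:
# a six-trial test with R = 0.94, lengthened to 18 trials (3 times longer).
r11, times_longer = 0.94, 3
r_kk = (times_longer * r11) / (1 + (times_longer - 1) * r11)
print(round(r_kk, 2))                          # 2.82 / 2.88 = 0.98
```

Note that using ddof=1 (or ddof=0) consistently for both variances gives the same alpha, which is the point of the "do not worry about n or n − 1" remark above.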
- what about the accuracy of this formula? the accuracy of the formula increases as the value of k in the formula decreases, meaning that the maximum reliability of the test can be determined through multiple iterations of this equation

STANDARD ERROR OF MEASUREMENT
- it is sometimes useful to estimate the measurement error in each test score
- if we administer several trials to each person, we can calculate a standard deviation for each person; this standard deviation is the measurement error for that individual
- if there is only one score, the measurement error is estimated from the group scores by calculating the standard error of measurement:
s_e = s_x × √(1 − r_xx)
where s_e is the standard error of measurement, s_x is the standard deviation of the test scores, and r_xx is the reliability coefficient for the test scores
- the standard error of measurement reflects the degree to which a test score may be expected to vary due to measurement error; it acts like a standard deviation for a single score and can be interpreted in the same way using the normal curve; it specifies the limits within which we can expect scores to vary due to measurement error
- Problem 3.5: the standard deviation of a test is 5 and R = 0.91
s_e = 5 × √(1 − 0.91) = 5 × √0.09 = 5 × 0.3 = 1.5

OBJECTIVITY
- objectivity, or rater reliability, is the close agreement between the scores assigned to each person by two or more raters
- for a test to be valid it must be both reliable and objective

Factors Affecting Objectivity
- objectivity is dependent on two related factors:
1. the clarity of the scoring system
o certain tests have clear scoring methods: mile run, strength scores, sit-ups, push-ups, etc.
o open-ended tests don't have a clear scoring method and, therefore, interject more subjectivity
2. the degree to which raters can assign scores accurately
o this is affected by familiarity with the scoring mechanism; for example, if a rater doesn't know how to use a stopwatch, less accurate scores can be expected
- a high degree of objectivity is essential when two or more people are administering a test; the lower the objectivity, the more an individual's scores will depend on the particular rater
- high objectivity is also needed when the same rater/tester administers the same test over several days

Estimation
- we can calculate the degree of objectivity between two or more raters using an intraclass correlation coefficient
- to do this, we treat the raters' or judges' scores as trials; we have two or more scores for each person (the number of scores is determined by the number of raters)
- if all raters are supposed to be using the same standards, we can consider the differences among judges as measurement error and calculate objectivity using:
R = (MS_A − MS_W) / MS_A (formula 3.1, not 3.2 as stated in the book)
- if all judges are not expected to use the same standards, we would calculate objectivity using either the alpha coefficient or the appropriate intraclass R formula (formula 3.5)

RELIABILITY OF CRITERION-REFERENCED TEST SCORES
- from chapter 1, based on a criterion-referenced standard a person is classified as either yes or no (proficient or non-proficient)
- criterion-referenced reliability is defined differently than norm-referenced reliability; reliability is defined as consistency of classification
- we use a 2 x 2 classification table to estimate reliability:

                 Day 2
                 Pass   Fail
Day 1   Pass      A      C
        Fail      B      D

- A is the number of people who passed on both occasions and D is the number who failed on both occasions (both are consistent classifications); B and C are the people who did not receive the same pass/fail classification on both occasions
- the objective is to maximize the numbers in the A and D boxes while minimizing the numbers in the B and C boxes
- the proportion of agreement, P, is used to estimate reliability:
P = (A + D) / (A + B + C + D)
- Problem 3.6: determine the proportion of agreement for the data in Table 3.8
P = (84 + 40) / (84 + 21 + 5 + 40) = 124 / 150 = 0.83
- the kappa coefficient allows for chance occurrences within the data (some people got lucky both times or unlucky both times):
k = (Pa − Pc) / (1 − Pc)
where Pa is the proportion of agreement and Pc is the proportion of agreement expected by chance:
Pc = [(A + B)(A + C) + (C + D)(B + D)] / (A + B + C + D)²
- Problem 3.7: determine the kappa coefficient for the data in Table 3.8
Step 1: calculate the proportion of agreement, Pa = 0.83 (already done above)
Step 2: calculate the proportion of agreement by chance:
Pc = [(105)(89) + (45)(61)] / 150² = (9345 + 2745) / 22500 = 12090 / 22500 = 0.54
Step 3: calculate kappa:
k = (0.83 − 0.54) / (1 − 0.54) = 0.29 / 0.46 = 0.63
- a short sketch reproducing these two calculations follows
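A minimal Python sketch reproducing the proportion-of-agreement and kappa calculations for the Table 3.8 data (A = 84, B = 21, C = 5, D = 40, as in Problems 3.6 and 3.7):

```python
# Proportion of agreement and kappa for a 2 x 2 pass/fail classification table.
A, B, C, D = 84, 21, 5, 40        # counts from Table 3.8
N = A + B + C + D

P = (A + D) / N                                        # proportion of agreement
Pc = ((A + B) * (A + C) + (C + D) * (B + D)) / N ** 2  # agreement expected by chance
kappa = (P - Pc) / (1 - Pc)

print(round(P, 2), round(Pc, 2), round(kappa, 2))      # 0.83, 0.54, 0.63
```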
- what are acceptable levels of P? values of .60 and lower should not even be considered; depending on the situation, .90 might be set as a minimum

RELIABILITY OF DIFFERENCE SCORES
- difference scores are sometimes called change scores or improvement scores
- when people start a program with markedly different scores, it is difficult to evaluate change over time, particularly when some people start the program with higher scores; a difference score (the change from beginning to end) can be used to determine the degree to which a person has changed
- there are two problems with using difference scores for comparisons among individuals, comparisons among groups, and/or the development of performance standards:
1. people who score high at the beginning have little chance for improvement
2. difference scores are unreliable
- the formula for estimating the reliability of difference scores (X − Y) is:
R_dd = (R_xx s²_x + R_yy s²_y − 2 R_xy s_x s_y) / (s²_x + s²_y − 2 R_xy s_x s_y)
where
o X is the initial score and Y is the final score
o R_xx and R_yy are the reliability coefficients for tests X and Y
o R_xy is the correlation between tests X and Y
o R_dd is the reliability of the difference scores
o s_x and s_y are the standard deviations of the initial and final scores
- simple linear regression can also be used to predict final scores from initial scores; to do this you need to know the correlation between the two tests, and then predict what the final score would be

SUMMARY
- there are three characteristics of a sound measuring instrument: reliability, objectivity, and validity
- a test is reliable when it consistently measures whatever it is supposed to measure
o there are two types of reliability: stability reliability and internal-consistency reliability
- objectivity is the degree to which different raters agree in their scoring of the same subjects