RELIABILITY
Course: Writing
Professor: Dr. Ahmadi
Student:
Date: 24.10.2012

Different books on testing offer definitions of reliability. Henning (2001, p. 73) describes reliability as a measure of the accuracy, consistency, dependability, or fairness of scores resulting from the administration of a particular examination. Gipps (1994, p. 67) states that reliability is concerned with the accuracy with which the test measures the skill or attainment it is designed to measure. According to these definitions, reliability can be defined as a quality of test scores that refers to the consistency of measurement. A reliable test may not be valid, but a valid test must be reliable. Bachman (1990, p. 161) notes that the concerns of reliability and validity lead to two complementary objectives in designing and developing tests: 1) to minimize the effects of measurement error, and 2) to maximize the effects of the language abilities we want to measure. The investigation of reliability involves both logical analysis and empirical research; we identify sources of error and estimate the magnitude of their effects on test scores. Three measurement frameworks, a) classical true score (CTS) measurement theory, b) generalizability theory, and c) item response theory, can provide powerful tools for test development, test interpretation, and the estimation of reliability.

a) Classical true score measurement theory
According to Bachman (1990, p. 166), in the investigation of reliability the distinction between unobservable abilities and observed test scores is essential. As Bachman states, CTS theory consists of a set of assumptions about the relationship between actual or observed test scores and the factors that affect test scores. An observed score comprises two components: 1) a true score, which is due to an individual's level of ability, and 2) an error score, which is due to factors other than the ability being tested. This assumption is represented as follows: X = Xt + Xe. A further assumption concerns the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores. The CTS model therefore distinguishes two sources of variance in a set of test scores: true score variance, which is due to differences in the ability of the individuals tested, and error variance, which is unsystematic, or random.

Parallel tests
The concept of parallel tests is central to CTS theory (Bachman, 1990, p. 168). In classical measurement theory, parallel tests are defined as two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability. Reliability in CTS theory is defined as the proportion of observed score variance that is true score variance, and it refers to the scores, not to the test itself. Bachman (1990, p. 172) mentions three approaches within the CTS model to estimating reliability: a) internal consistency estimates, which are concerned primarily with sources of error from within the test and the scoring procedures; b) stability estimates, which indicate how consistent test scores are over time; and c) equivalence estimates, which provide an indication of the extent to which scores on alternate forms of a test are equivalent. The estimates of reliability that these approaches yield are called reliability coefficients.
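To illustrate the CTS decomposition described above, the following sketch (a minimal Python simulation with invented figures, not an example from Bachman) generates true scores and random error scores, combines them into observed scores, and shows that the proportion of observed score variance that is true score variance agrees with the correlation between two simulated parallel forms.

```python
import numpy as np

# Minimal simulation of the CTS assumption X = Xt + Xe (all figures invented).
rng = np.random.default_rng(0)
n_examinees = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n_examinees)  # Xt: ability
error_a = rng.normal(loc=0, scale=5, size=n_examinees)        # Xe on form A
error_b = rng.normal(loc=0, scale=5, size=n_examinees)        # Xe on a parallel form B

observed_a = true_scores + error_a  # X = Xt + Xe
observed_b = true_scores + error_b

# Reliability: the proportion of observed score variance that is true score variance.
reliability = true_scores.var() / observed_a.var()

# In CTS theory this equals the correlation between scores on two parallel forms.
parallel_corr = np.corrcoef(observed_a, observed_b)[0, 1]

print(f"true/observed variance ratio:  {reliability:.3f}")
print(f"correlation of parallel forms: {parallel_corr:.3f}")
```

Because the error standard deviation in this sketch is half of the true score standard deviation, both quantities come out near 0.80.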
a) Internal consistency reliability
Several approaches examine the internal consistency of a test; among them are the split-half method, the Spearman-Brown split-half estimate, and the Guttman split-half estimate.

Split-half method: in this method the test is divided into two halves, and the extent to which scores on the two halves are consistent with each other is determined. The halves are treated as parallel tests and must be independent of each other; otherwise there are two possible interpretations of a correlation between the halves: 1) both halves measure the same trait, or 2) individuals' performance on one half depends on how they perform on the other. However, a problem with this method is that most language tests are designed as 'power tests', with the easiest questions at the beginning and the more difficult ones toward the end, so a simple first-half/second-half split would not satisfy the assumption of equivalence (for this reason an odd-even split is often used instead).

The Spearman-Brown split-half estimate: in this approach the correlation between the two sets of half-test scores is computed and then adjusted to estimate the reliability of the full-length test. Two assumptions must be met: 1) the two halves are parallel tests with equal means and variances, and 2) the two halves are independent of each other (Bachman, 1990, p. 175).

The Guttman split-half estimate: this method, developed by Guttman (1945), does not assume equivalence of the halves and does not require computing a correlation between them. The estimate is based on the ratio of the sum of the variances of the two halves to the variance of the whole test (Bachman, 1990, p. 175).

Reliability estimates based on item variances
A test can be split into halves in many different ways, and different splits can yield different reliability estimates. An approach based on the statistical characteristics of the test items, developed by Kuder and Richardson (1937), involves computing the means and variances of the items that constitute the test. Cronbach (1951) developed a general formula for estimating internal consistency, called 'coefficient alpha' or 'Cronbach's alpha'.
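To make these internal consistency estimates concrete, the sketch below (illustrative only; the small item matrix is invented, and with 0/1 items coefficient alpha reduces to the Kuder-Richardson formula KR-20) computes the split-half correlation, the Spearman-Brown adjusted estimate, the Guttman split-half estimate, and Cronbach's alpha for a short dichotomously scored test.

```python
import numpy as np

# Invented 0/1 item scores: rows are test takers, columns are items.
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
])
n_items = scores.shape[1]

# Odd-even split: treat the two halves as (approximately) parallel tests.
half_a = scores[:, 0::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown: adjust the half-test correlation up to full test length.
r_spearman_brown = 2 * r_halves / (1 + r_halves)

# Guttman split-half: based on the ratio of the summed half variances
# to the variance of the whole test; no equivalence assumption needed.
total = scores.sum(axis=1)
var_total = total.var(ddof=1)
r_guttman = 2 * (1 - (half_a.var(ddof=1) + half_b.var(ddof=1)) / var_total)

# Cronbach's alpha: general internal consistency formula based on item
# variances (equal to KR-20 for dichotomous items).
item_vars = scores.var(axis=0, ddof=1)
alpha = (n_items / (n_items - 1)) * (1 - item_vars.sum() / var_total)

print(f"split-half correlation: {r_halves:.3f}")
print(f"Spearman-Brown:         {r_spearman_brown:.3f}")
print(f"Guttman split-half:     {r_guttman:.3f}")
print(f"Cronbach's alpha:       {alpha:.3f}")
```

The Spearman-Brown and Guttman estimates are computed here from the same odd-even split, so any difference between them reflects how far the two halves are from having equal variances.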
Rater consistency: Bachman (1990, p. 178) notes that when test scores are obtained subjectively, as in the rating of compositions or oral interviews, one source of error is inconsistency in the ratings.

Intra-rater reliability: if there is a single rater, we need to be concerned with the consistency within that individual's ratings, that is, with intra-rater reliability.

Inter-rater reliability: inter-rater (or inter-observer) reliability is an important consideration in the social sciences because there are many situations in which the best means of measurement is the report of trained observers. Some performances, such as gymnastics, can only be assessed through the ratings of expert judges. In particular, the acceptability of the reports of two or more observers increases when their observations are similar; the measure of the similarity of the observations coming from two or more sources is inter-rater reliability (www.education.com).

b) Stability estimate
The second approach in the CTS model is the stability estimate, in which reliability is estimated by giving the test more than once to the same group of individuals. This approach is also called 'test-retest' and provides an estimate of the stability of test scores over time. Differential practice effects and differential changes in ability are two sources of inconsistency: differential practice effects can arise when the test is given twice with little time between administrations, whereas differential changes in ability can arise when the test is given twice with a long interval between administrations.

c) Equivalence estimate
According to Heaton (1988, p. 163), another way to estimate reliability is the equivalence estimate, which involves administering parallel forms of the test to the same group. This assumes that two similar versions of a particular test can be constructed: such tests must be identical in the nature of their sampling, difficulty, length, rubrics, and so on. If the results derived from the two tests correspond closely to each other, the test can be considered reliable.

Problems with the CTS model
As Bachman (1990, p. 186) points out, one problem with the CTS model is that it treats error variance as homogeneous in origin. The other problem is that the CTS model considers all error to be random and consequently fails to distinguish systematic error from random error.

b) Generalizability theory
Another framework for estimating reliability is generalizability theory (Bachman, 1990, p. 187). This theory was developed by Cronbach and his colleagues. Reliability is a matter of generalizability, and the extent to which we can generalize from a given score is a function of how we define the universe of measures. The application of G-theory to test development and use takes place in two stages. First, considering the uses that will be made of the test scores, the test developer designs and conducts a generalizability study (G-study) to investigate the sources of variance that are of concern or interest and to obtain estimates of the relative sizes of these different sources of variance. The second stage is a decision study (D-study), in which the test developer administers the test under operational conditions, that is, under the conditions in which the test will be used to make the decisions for which it is designed, and uses G-theory procedures to estimate the magnitude of the variance components. The application of G-theory thus enables the test developer and the test user to specify the different sources of variance that are of concern for a given test use and to estimate the relative importance of these different sources simultaneously. The CTS model is a special case of G-theory in which there are only two sources of variance: a single ability and a single source of error.
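The two-stage procedure can be sketched as follows. In the invented example below, a G-study estimates variance components from a fully crossed persons-by-raters design, and a small D-study then shows how the generalizability coefficient for relative decisions changes as the number of raters is increased; a plain correlation between two raters is included as the simple inter-rater reliability discussed earlier. This is an illustrative sketch with made-up ratings, not a procedure taken from Bachman.

```python
import numpy as np

# Invented G-study data: 8 examinees (persons) each scored by the same
# 3 raters on a 0-10 scale (fully crossed p x r design, one score per cell).
ratings = np.array([
    [7, 8, 5],
    [5, 5, 3],
    [9, 8, 8],
    [3, 4, 2],
    [6, 7, 5],
    [8, 7, 7],
    [4, 5, 3],
    [7, 6, 6],
], dtype=float)
n_p, n_r = ratings.shape

grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# Sums of squares and mean squares from a two-way ANOVA without replication.
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_res = ((ratings - grand) ** 2).sum() - ss_p - ss_r
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# G-study: estimated variance components.
var_residual = ms_res                # person-by-rater interaction plus error
var_person = (ms_p - ms_res) / n_r   # universe ('true') score variance
var_rater = (ms_r - ms_res) / n_p    # differences in rater severity

print(f"variance components: person={var_person:.2f}, "
      f"rater={var_rater:.2f}, residual={var_residual:.2f}")

# D-study: generalizability coefficient (relative decisions) when each
# examinee's score is the average over k raters.
for k in (1, 2, 3, 5):
    g_coef = var_person / (var_person + var_residual / k)
    print(f"{k} rater(s): generalizability coefficient = {g_coef:.3f}")

# For comparison, a simple inter-rater reliability: the correlation
# between the scores awarded by the first two raters.
inter_rater = np.corrcoef(ratings[:, 0], ratings[:, 1])[0, 1]
print(f"inter-rater correlation (raters 1 and 2) = {inter_rater:.3f}")
```

The D-study loop makes the practical point of G-theory explicit: once the variance components have been estimated, the consistency of scores under different measurement designs (here, different numbers of raters) can be projected without collecting new data.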
c) Item response theory
Another framework for estimating reliability is item response theory (Bachman, 1990, p. 202). In this framework, the probability of a given response to an item is modeled as a function of the test taker's ability and of characteristics of the item, such as its difficulty, and the precision of measurement is described by item and test information rather than by a single reliability coefficient.

Factors that affect reliability estimates
There are general characteristics of tests and test scores that influence estimates of reliability (Bachman, 1990, p. 220). Understanding these factors helps us determine which reliability estimates are appropriate for a given set of test scores and how to interpret their meaning. These factors include: a) the length of the test, b) the difficulty of the test and the resulting score variance, c) the cut-off score, d) systematic measurement error and its effects, and e) the effect of the test method.

Conclusion
Reliability is the consistency of measurement, and it is a quality of test scores rather than of the test itself. Reliability can be estimated within three frameworks: 1) classical true score measurement theory, 2) generalizability theory, and 3) item response theory. Within the CTS model there are three approaches to estimating reliability: a) internal consistency, which includes the split-half method, the Spearman-Brown split-half estimate, the Guttman split-half estimate, the Kuder-Richardson reliability coefficients, coefficient alpha, and rater consistency (intra-rater and inter-rater reliability); b) the stability estimate; and c) the equivalence estimate. There are also two problems with the CTS model in estimating reliability: it treats error variance as homogeneous in origin, and it treats all error as random. Finally, several factors affect reliability estimates, and understanding them helps us determine which reliability estimates are appropriate for a given set of test scores.

REFERENCES
Bachman, L.F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Brown, H.D. (2004). Language Assessment: Principles and Classroom Practices. Longman.
Gipps, C.V. (1994). Beyond Testing: Towards a Theory of Educational Assessment. London: The Falmer Press.
Heaton, J.B. (1988). Writing English Language Tests. London and New York: Longman.
Henning, G. (2001). A Guide to Language Testing: Development, Evaluation and Research. Foreign Language Teaching and Research Press.