Measurement, Test, Evaluation

Measurement
• Measurement: the process of quantifying the characteristics of persons according to explicit procedures and rules.
• Quantification: assigning numbers, as distinguished from qualitative description.
• Characteristics: mental attributes such as aptitude, intelligence, motivation, field dependence/independence, attitude, native language, fluency.
• Rules and procedures: the observation must be replicable in other contexts and with other individuals.

Test
• Carroll: a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual.
• Elicitation: obtaining a specific sample of behavior.
• Interagency Language Roundtable (ILR) oral interview: a test of speaking consisting of (1) a set of elicitation procedures, including a sequence of activities and sets of question types and topics, and (2) a measurement scale of language proficiency ranging from a low of 0 to a high of 5.

Test?
• Years of informal contact with a child used to rate the child's oral proficiency: not a test, because the rater did not follow explicit elicitation procedures.
• A rating based on a collection of personal letters used to indicate an individual's ability to write effective argumentative editorials for a news magazine.
• A teacher's rating based on informal interactive social language use to indicate a student's ability to use language to perform various cognitive/academic language functions.

Evaluation
• Definition: the systematic gathering of information for the purpose of making decisions.
• Evaluation need not be exclusively quantitative: verbal descriptions, performance profiles, letters of reference, overall impressions.

Relation Between Evaluation, Test, and Measurement
• 1. Qualitative descriptions of student performance for diagnosing learning problems.
• 2. A teacher's ranking used for assigning grades.
• 3. An achievement test used to determine student progress.
• 4. A proficiency test used as a criterion in second language acquisition research.
• 5. Assigning code numbers to subjects in second language research according to native language.

What Is It: Measurement, Test, or Evaluation?
• Placement test
• Classroom quiz
• Grading of a composition
• Rating of a classroom fast-reading exercise
• Rating of a dictation

Measurement Qualities
• A test must be reliable and valid.

Reliability
• Freedom from errors of measurement.
• If a student takes a test twice within a short time, and the test is reliable, the results of the two administrations should be the same.
• If two raters rate the same writing sample, the ratings should be consistent if they are to be reliable.
• The primary concern in examining reliability is to identify the different sources of error, and then to use appropriate empirical procedures to estimate the effect of these sources of error on test scores.

Validity
• Validity: the extent to which the inferences or decisions based on test scores are meaningful, appropriate, and useful. The test should measure the ability in question and very little else.
• If a test is not reliable, it is not valid.
• Validity is a quality of test interpretation and use.
• The investigation of validity is a matter both of judgment and of empirical research.

Reliability and Validity
• Both are essential to the use of tests.
• Neither is a quality of tests themselves: reliability is a quality of test scores, while validity is a quality of the interpretations or uses that are made of test scores.
• Neither is absolute: we can never attain perfectly error-free measures, and any particular use of a test score depends on many factors outside the test itself.
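The test-retest idea described above can be illustrated with a short computation. The sketch below is hypothetical: the score lists and the choice of a Pearson correlation as the reliability estimate are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch: estimating test-retest reliability as the Pearson
# correlation between two administrations of the same test.
# The score lists below are invented for illustration only.
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


# Hypothetical scores for the same ten students on two occasions.
first_administration = [52, 61, 73, 80, 45, 67, 58, 90, 77, 63]
second_administration = [55, 59, 75, 78, 47, 70, 60, 88, 74, 66]

reliability = pearson_r(first_administration, second_administration)
print(f"Estimated test-retest reliability: {reliability:.2f}")
```

A value close to 1 suggests the two administrations order and space the students consistently; in practice each source of error (occasions, raters, items) calls for its own estimation procedure, as noted above.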
Properties of Measurement Scales
• Four properties:
• Distinctiveness: different numbers are assigned to persons with different values.
• Ordering in magnitude: the larger the number, the larger the amount of the attribute.
• Equal intervals: equal differences between ability levels.
• Absolute zero point: zero represents the absence of the attribute.

Four Types of Scales
• Nominal: names classes or categories.
• Ordinal: places observations in an order with respect to each other.
• Interval: the distances between levels are equal.
• Ratio: includes an absolute zero point.

Nominal
• Examples: license plate numbers; Social Security numbers; names of people, places, objects; numbers used to identify football players.
• Limitations: cannot specify quantitative differences among categories.

Ordinal
• Examples: letter grades (ratings from excellent to failing), military ranks, order of finishing a test.
• Limitations: restricted to specifying relative differences, without regard to the absolute amount of difference.

Interval
• Examples: temperature (Celsius and Fahrenheit), calendar dates.
• Limitations: ratios are meaningless; the zero point is arbitrarily defined.

Ratio
• Examples: distance, weight, temperature in kelvins, time required to learn a skill or subject.
• Limitations: none, except that few educational variables have ratio characteristics.

Nominal, Ordinal, Interval, or Ratio?
• 5 in IELTS
• 550 in TOEFL
• C in BEC
• 8 in CET-4 writing
• 58 in the final evaluation of a student

Property and Type of Scale (see the sketch further below)

Property               Nominal   Ordinal   Interval   Ratio
Distinctiveness           +         +         +         +
Ordering                  -         +         +         +
Equal intervals           -         -         +         +
Absolute zero point       -         -         -         +

Limitations in Measurement
• It is essential that we understand the characteristics of measures of mental abilities and the limitations these characteristics place on our interpretation of test scores.
• These limitations are of two kinds: limitations in specification and limitations in observation and quantification.

Limitations in Specification
• Language ability must be specified at two levels: theoretical and operational.
• Theoretical level
• Task: we need to specify the ability in relation to, or in contrast to, other language abilities and other factors that may affect test performance.
• Reality: the large number of different individual characteristics (cognitive, affective, physical) that could potentially affect test performance makes the task nearly impossible.
• Operational level
• Task: we need to specify the instances of language performance that serve as indicators of the ability we wish to measure.
• Reality: the complexity of, and the interrelationships among, the factors that affect performance on language tests force us to make simplifying assumptions in designing language tests and interpreting test scores.

Conclusion
• Our interpretations and uses of test scores will be of limited validity.
• Any theory of language test performance we develop is likely to be underspecified, and we have to rely on measurement theory to deal with the problem of underspecification.

Limitations in Observation and Quantification
• All measures of mental ability are indirect, incomplete, imprecise, subjective, and relative.

Indirectness
• The relationship between test scores and the abilities we want to measure is indirect. Language tests are indirect indicators of the underlying traits in which we are interested.
• Because scores from language tests are indirect indicators of ability, the valid interpretation and use of such scores depends crucially on the adequacy of the way we have specified the relationship between the test score and the ability we believe it indicates. To the extent that this relationship is not adequately specified, the interpretations and uses made of the test score may be invalid.
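The property table above can also be read as data. The following sketch is purely illustrative (the dictionary layout and function name are assumptions, not a standard API); it encodes the four scale types and reports which of them possess a given property.

```python
# Minimal sketch: the scale-property table above encoded as data.
# The structure and names are illustrative, not a standard library API.

SCALE_PROPERTIES = {
    "nominal":  {"distinctiveness": True, "ordering": False, "equal_intervals": False, "absolute_zero": False},
    "ordinal":  {"distinctiveness": True, "ordering": True,  "equal_intervals": False, "absolute_zero": False},
    "interval": {"distinctiveness": True, "ordering": True,  "equal_intervals": True,  "absolute_zero": False},
    "ratio":    {"distinctiveness": True, "ordering": True,  "equal_intervals": True,  "absolute_zero": True},
}


def scales_with(prop: str) -> list[str]:
    """Return the scale types that possess the given property."""
    return [scale for scale, props in SCALE_PROPERTIES.items() if props[prop]]


# Only interval and ratio scales have equal intervals, which is why
# differences (and means) are meaningful on them; only ratio scales
# have an absolute zero, which is why ratios are meaningful there.
print(scales_with("equal_intervals"))   # ['interval', 'ratio']
print(scales_with("absolute_zero"))     # ['ratio']
```

Only the properties a scale actually has license the corresponding interpretations: ordering alone permits ranking, while equal intervals are needed before differences and averages become meaningful.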
Incompleteness
• The performance we observe and measure in a language test is a sample of an individual's total performance in that language.
• Since we cannot observe an individual's total language use, one of our main concerns in language testing is assuring that the sample we do observe is representative of that total use, a potentially infinite set of utterances, whether written or spoken.
• It is vitally important that we incorporate into our measurement design principles or criteria that will guide us in determining what kinds of performance will be most relevant to and representative of the abilities we want to measure, for example, real-life language use.

Imprecision
• Because of the nature of language, it is virtually impossible (and probably not desirable) to write tests with 'pure' items that test a single construct, or to be sure that all items are equally representative of a given ability. Likewise, it is extremely difficult to develop tests in which all the tasks or items are at exactly the level of difficulty appropriate for the individuals being tested.

Subjectivity
• As Pilliner (1968) noted, language tests are subjective in nearly all aspects:
• Test developers
• Test writers
• Test takers
• Test scorers

Relativeness
• The presence or absence of language abilities is impossible to define in an absolute sense.
• The concept of 'zero' language ability is a complex one.
• The individual with absolutely complete language ability does not exist.
• All measures of language ability based on domain specifications of actual language performance must be interpreted as relative to some 'norm' of performance.

Steps in Measurement
• Three steps:
• 1. Identify and define the construct theoretically.
• 2. Define the construct operationally.
• 3. Establish procedures for quantifying observations.

Defining Constructs Theoretically
• Historically, there have been two distinct approaches to defining language proficiency.
• Real-life approach: language proficiency itself is not defined; instead, a domain of actual language use is identified.
• This approach assumes that if we measure the features present in language use, we measure language proficiency.

Real-Life Approach: Example
• American Council on the Teaching of Foreign Languages (ACTFL) definition of the advanced level:
• Able to satisfy the requirements of everyday situations and routine school and work requirements. Can handle with confidence but not with facility complicated tasks and social situations, such as elaborating, complaining, and apologizing. Can narrate and describe with some detail, linking sentences together smoothly. Can communicate facts and talk casually about topics of current public and personal interest, using general vocabulary.

Interactional/Ability Approach
• Language proficiency is defined in terms of its component abilities.
• These components may be skills such as reading, writing, listening, and speaking (Lado), a functional framework (Halliday), or a communicative framework (Munby).

Example of Pragmatic Competence
• The knowledge necessary, in addition to organizational competence, for appropriately producing or comprehending discourse. Specifically, it includes illocutionary competence, or the knowledge of how to perform speech acts, and sociolinguistic competence, or the knowledge of the sociolinguistic conventions which govern language use.

Defining Constructs Operationally
• This step involves determining how to isolate the construct and make it observable.
• We must decide what specific procedures we will follow to elicit the kind of performance that will indicate the degree to which the given construct is present in the individual.
• The context in which the language testing takes place influences the operations we follow.
• The test must elicit language performance in a standard way, under uniform conditions.

Quantifying Observations
• The units of measurement of language tests are typically defined in two ways.
• 1. Points or levels of language performance: for example, levels from zero to five in an oral interview, or separate levels for mechanics, grammar, organization, and content in writing. Such scales are mostly ordinal and therefore call for statistics appropriate to ordinal scales.
• 2. The number of tasks successfully completed. We generally treat such a score as one on an interval scale, provided certain conditions are met.
• Conditions for treating the score as interval:
• The performances must be defined and selected in a way that enables us to determine their relative difficulty and the extent to which they represent the construct being tested.
• Relative difficulty is determined from the statistical analysis of responses to individual test items.
• How well they represent the construct depends on the adequacy of the theoretical definition of the construct.

Score Sorting (see the sketch after this section)
• Raw scores
• Grouping scores into classes:
• 1. Find the range R (highest score minus lowest score).
• 2. Determine the number of groups: K = 1.87 × (N - 1)^(2/5), where N is the number of scores.
• 3. Determine the class interval: I = R / K.
• 4. Fix the highest and lowest value of each group.
• 5. Arrange the data into the groups.

Central Tendency & Dispersion
• Mean: x̄ = Σx / N
• Median: the middle score when all scores are placed in order.
• Mode: the score around which the bulk of the data congregate.
• Variance: V = Σ(x - x̄)² / (N - 1)
• Standard deviation: S = √[Σ(x - x̄)² / (N - 1)]
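As a worked illustration of the score-sorting steps and the descriptive statistics above, here is a short Python sketch. The score list is invented, and the rounding choices for K and I are assumptions added so that every score falls into a group; the formulas themselves follow the slides.

```python
# Minimal sketch: sorting raw scores into classes and computing the
# descriptive statistics listed above. Scores are invented for illustration.
import math
from collections import Counter

scores = [58, 72, 65, 80, 45, 90, 67, 73, 61, 55, 77, 84, 63, 52, 63]

# --- Score sorting (grouping into classes) ---
N = len(scores)
R = max(scores) - min(scores)            # 1. range
K = round(1.87 * (N - 1) ** (2 / 5))     # 2. number of groups: K = 1.87(N - 1)^(2/5)
I = math.ceil((R + 1) / K)               # 3. class interval I = R / K, rounded up here
                                         #    so the top score is not left out (assumption)
print(f"range R = {R}, groups K = {K}, interval I = {I}")
for g in range(K):                       # 4.-5. group boundaries and counts
    low = min(scores) + g * I
    high = low + I - 1
    count = sum(low <= s <= high for s in scores)
    print(f"  {low}-{high}: {count} score(s)")

# --- Central tendency and dispersion ---
mean = sum(scores) / N                                        # x̄ = Σx / N
ordered = sorted(scores)
median = ordered[N // 2] if N % 2 else (ordered[N // 2 - 1] + ordered[N // 2]) / 2
mode = Counter(scores).most_common(1)[0][0]                   # most frequent score
variance = sum((x - mean) ** 2 for x in scores) / (N - 1)     # V = Σ(x - x̄)² / (N - 1)
std_dev = math.sqrt(variance)                                 # S = √V

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

With this invented data the script groups the 15 scores into 5 classes of width 10 and reports a mean of 67.0, a median of 65, and a mode of 63, illustrating how the grouping and summary formulas above are applied in practice.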