About the author: Professor Lyle F. Bachman is a past president of the International Language Testing Association, a recipient of its Lifetime Achievement Award in language testing, a world-renowned expert on language testing, and a professor at the University of California, Los Angeles (UCLA). In 2010 Professor Bachman published a new book, Language Assessment in Practice. On October 15, 2010, he visited Zhejiang University and, drawing on this new book, gave a lecture entitled "Justifying the Use of Language Assessment", which attracted a large audience of teachers and graduate students and filled the hall.

Short Biographical Sketch

Lyle F. Bachman is Professor, Department of Applied Linguistics and TESL, University of California, Los Angeles. He is a Past President of the American Association for Applied Linguistics and of the International Language Testing Association, and is currently co-editor, with Charles Alderson, of the Cambridge Language Assessment Series. He was the first winner of the TESOL/Newbury House Award for Outstanding Research, has twice won the Modern Language Association of America's Kenneth Mildenberger Award for outstanding research publication, was selected in 1999 as one of 30 American "ESL Pioneers" by ESL Magazine, and in 2004 received a Lifetime Achievement Award from the International Language Testing Association. He is currently a member of the Board on Testing and Assessment, a standing committee of the National Academies. Prof. Bachman has published numerous articles and books in the areas of language testing, program evaluation, and second language acquisition. He is regularly engaged in research projects in language testing and in program design and evaluation, as well as practitioner training workshops in language assessment, both at American institutions and at institutions abroad. His current research interests include validation theory, linking current validation models and procedures to test use, issues in assessing the academic achievement and academic English of English language learners in schools, the interface between language testing research and second language acquisition research, and the dialectic of abilities and contexts in language testing and educational performance assessment.
Publications include the following books:
Fundamental Considerations in Language Testing. Oxford University Press, 1990.
An Investigation into the Comparability of Two Tests of English as a Foreign Language: The Cambridge-TOEFL Comparability Study (with Fred Davidson, Katherine Ryan, and Inn-Chull Choi). University of Cambridge Local Examinations Syndicate and Cambridge University Press, 1994.
Language Testing in Practice (with Adrian S. Palmer). Oxford University Press, 1996.
Interfaces between Second Language Acquisition and Language Testing Research (co-edited with Andrew D. Cohen). Cambridge University Press, 1998.
Keeping Score for All: The Effects of Inclusion and Accommodation Policies on Large-Scale Educational Assessments (co-authored with Judith Koenig). National Academies Press, 2004.
Statistical Analyses for Language Assessment. Cambridge University Press, 2004.
Workbook and CD for Statistical Analyses for Language Assessment (with Antony J. Kunnan). Cambridge University Press, 2005.
Language Assessment in Practice: Developing and Using Language Assessments in the Real World (with Adrian S. Palmer). Oxford University Press, forthcoming.

On Fundamental Considerations in Language Testing, by Lyle F. Bachman

This is an academic monograph on language testing by Lyle F. Bachman, a well-known professor of applied linguistics at the University of California and of English language teaching at the Chinese University of Hong Kong. The book is not a 'nuts and bolts' text on how to write language tests. Rather, it is a discussion of fundamental issues that must be addressed at the start of any language testing effort, whether this involves the development of new tests or the selection of existing tests. One objective of this book is to provide a conceptual foundation for answering practical questions regarding the development and use of language tests.
This foundation includes three broad areas: (1) the context that determines the uses of language tests; (2) the nature of the language abilities we want to measure; and (3) the nature of measurement. A second objective of this book is to explore some of the problems raised by what is perhaps a unique characteristic of language tests and a dilemma for language testers – that in language tests, language is both the instrument and the object of measurement – and to begin to develop a conceptual framework that will eventually lead, if not to their solution, at least to a better understanding of the factors that affect performance on language tests.

The book consists of eight chapters, each of which presents a set of related issues. The issues discussed in this book are relevant to two aspects of language testing: (1) the development and use of language tests; and (2) language testing research. Following the discussion of these issues in each chapter is a summary, notes, suggestions for further reading, and discussion questions. Chapter 1 provides a general context for the discussion of language testing. In Chapter 2 the terms 'measurement', 'test', and 'evaluation' are defined and the relationships among them are discussed. Chapter 3 deals with the various uses of language tests in educational programs, along with examples of different types of programs to illustrate these different uses. In Chapters 4 and 5, the author presents a theoretical framework for describing performance on language tests. Chapters 6 and 7 provide extensive discussions of the issues and problems related to demonstrating the reliability of test scores and the validity of test use. In the final chapter, the author sheds the mantle of objective discussant and takes on more of a proactive advocate's role, dealing with some persistent issues (and controversies) in language testing and proposing an agenda for future research and development.
Measurement

Measurement in the social sciences is the process of quantifying the characteristics of persons according to explicit procedures and rules. This definition includes three distinguishing features: quantification, characteristics, and explicit rules and procedures. A test is a measurement instrument designed to elicit a specific sample of an individual's behavior. What distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior. This distinction is an important one, since it reflects the primary justification for the use of language tests and has implications for how we design, develop, and use them. Tests in and of themselves are not evaluative. They are often used for pedagogical purposes, either as a means of motivating students to study or as a means of reviewing material taught, in which case no evaluative decision is made on the basis of the test results. They may also be used for purely descriptive purposes. The majority of tests, however, are used for the purpose of making decisions about individuals. Evaluation can be defined as the systematic gathering of information for the purpose of making decisions. It is the collection of reliable and relevant information, and therefore does not necessarily entail testing. It is only when the results of tests are used as a basis for making a decision that evaluation is involved. So it is important to distinguish the information-providing function of measurement from the decision-making function of evaluation.

Essential measurement qualities

Reliability is a quality of test scores, and a perfectly reliable score, or measure, would be one which is free from errors of measurement. The most important quality of test interpretation or use is validity, or the extent to which the inferences or decisions we make on the basis of test scores are meaningful, appropriate, and useful.
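The idea that a perfectly reliable score is one free from errors of measurement can be made concrete with a small simulation. The sketch below is not from the book; it is a minimal illustration of the classical view in which an observed score is a true score plus random error, and reliability is the proportion of observed-score variance that is true-score variance (the means, standard deviations, and function name are all assumptions for illustration).

```python
import random
import statistics

# Classical view (illustrative sketch): observed score X = true score T + error E,
# and reliability = var(T) / var(X). As error shrinks, reliability approaches 1.

random.seed(42)

def simulate_reliability(error_sd, n=10_000):
    """Simulate observed scores X = T + E and estimate reliability."""
    true_scores = [random.gauss(50, 10) for _ in range(n)]
    observed = [t + random.gauss(0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)

# Large error: only about half the score variance reflects the true scores.
print(round(simulate_reliability(error_sd=10), 2))  # roughly 0.5
# Small error: nearly all score variance is true-score variance.
print(round(simulate_reliability(error_sd=1), 2))   # close to 1.0
```

A perfectly reliable measure would correspond to `error_sd=0`, where the ratio is exactly 1.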
While reliability is a quality of test scores themselves, validity is a quality of test interpretation and use. They are both essential to the use of tests. In summary, 'measurement' and 'test' involve the quantification of observations, and are thus distinct from qualitative descriptions. Tests are a type of measurement designed to elicit a specific sample of behavior. 'Evaluation' involves decision making, and is thus distinct from measurement, which essentially provides information. Thus, neither measures nor tests are in and of themselves evaluative, and evaluation need not involve measurement or testing.

Uses of language tests

The two major uses of language tests are: (1) as sources of information for making decisions within the context of educational programs; and (2) as indicators of abilities or attributes that are of interest in research on language, language acquisition, and language teaching. The fundamental use of testing in an educational program is to provide information for making decisions, that is, for evaluation. The use of tests as a source of evaluation information requires three assumptions. First, we must assume that information regarding educational outcomes is essential to effective formal education. Second, we must assume that it is possible to improve learning and teaching through appropriate changes in the program, based on feedback. Third, we must assume that the educational outcomes of the given program are measurable. In addition to these assumptions, we must also consider how much and what kind of testing is needed, as well as the quality of information provided by our tests. In a word, the most important consideration in the development and use of language tests is the purpose or purposes for which the particular test is intended. By far the most prevalent use of language tests is for purposes of evaluation in educational programs.
Communicative language ability

According to Bachman, communicative language ability (CLA) can be described as consisting of both knowledge, or competence, and the capacity for implementing, or executing, that competence in appropriate, contextualized communicative language use. The framework of CLA includes three components: language competence, strategic competence, and psychophysiological mechanisms.

(Figure: components of communicative language ability in communicative language use, relating knowledge structures and language competence, through strategic competence and psychophysiological mechanisms, to the context of situation)

(Figure: components of language competence, a tree descending from language competence through organizational and pragmatic competence to their sub-competencies)

Language competence. Language competence includes organizational competence, which consists of grammatical and textual competence, and pragmatic competence, which consists of illocutionary and sociolinguistic competence. Grammatical competence includes those competencies involved in language usage, consisting of a number of relatively independent competencies such as knowledge of vocabulary, morphology, syntax, and phonology/graphology. Textual competence consists of cohesion and rhetorical organization. Illocutionary competence is related to four macro-functions: ideational, manipulative, heuristic, and imaginative. The abilities under sociolinguistic competence are sensitivity to differences in dialect or variety, sensitivity to differences in register, sensitivity to naturalness, and the ability to interpret cultural references and figures of speech.

Strategic competence. Three components are included in strategic competence: assessment, planning, and execution.

Psychophysiological mechanisms.
These are essentially the neurological and physiological processes that characterize the channel (auditory, visual) and mode (receptive, productive) in which competence is implemented.

Test methods

The characteristics of test methods can be seen as restricted or controlled versions of the contextual features that determine the nature of the language performance expected for a given test or test task. Performance on language tests varies as a function both of an individual's language ability and of the characteristics of the test method. It is also affected by individual attributes that are not part of test takers' language ability. The five major categories of test method facets are: (1) the testing environment, which includes the facets of familiarity of the place and equipment used in administering the test, the personnel involved in the test, the time of testing, and physical conditions; (2) the test rubric, which consists of the facets that specify how test takers are expected to proceed in taking the test, including the test organization, time allocation, and instructions; (3) the nature of the input the test taker receives; (4) the nature of the expected response to that input; and (5) the relationship between input and response. The frameworks described here have been presented as a means for describing performance on language tests, and they are intended as a guide both for the development and use of language tests and for research in language testing. These frameworks provide the applied linguistic foundation that informs the discussions in the remainder of the book.

Reliability

A high score on a language test should be determined, or caused, by high communicative language ability, and a theoretical framework defining this ability is thus necessary if we want to make inferences about ability from test scores. Performance on language tests is also affected by factors other than communicative language ability.
These can be grouped into the following broad categories: (1) test method facets, as discussed in Chapter 5; (2) attributes of the test taker that are not considered part of the language abilities we want to measure; and (3) random factors that are largely unpredictable and temporary.

(Figure: factors affecting a test score: communicative language ability, test method facets, personal attributes, and random factors)

Fundamental to the development and use of language tests is being able to identify and estimate the effect of various factors on language test scores. Ideally, test scores are influenced as much as possible by the language ability being measured; any factors other than that ability that affect test scores are potential sources of error, which decrease both the reliability of scores and the validity of their interpretations. Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of their effect on test scores. Measurement theory provides several models that specify the relationships between measures, or observed scores, and the factors that affect these scores. Generalizability theory is an extension of the classical model that overcomes many of its limitations, in that it enables test developers to examine several sources of variance simultaneously and to distinguish systematic from random error. Estimates of reliability based on classical measurement theory are inappropriate for use with criterion-referenced tests because of differences in the types of comparisons and decisions made. Systematic error, such as that associated with test method, is different from random error.

Validation

The primary concern in test development and use is demonstrating not only that test scores are reliable, but that the interpretations and uses we make of test scores are valid. It has been traditional to classify validity into different types, such as content, criterion, and construct validity.
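The distinction drawn above between systematic method error and random error can be illustrated with a toy simulation (not the book's own formulation; the sample sizes and effect magnitudes are invented for illustration). A systematic test-method facet shifts every test taker's score under that method, while random error averages out across persons, which is why generalizability theory's separate variance components are useful.

```python
import random
import statistics

# Toy decomposition (illustrative assumptions): each score is the person's
# ability, plus a systematic effect of the test method used, plus random error.

random.seed(7)

n_persons, n_methods = 500, 3
ability = [random.gauss(0, 1) for _ in range(n_persons)]
method_effect = [random.gauss(0, 0.5) for _ in range(n_methods)]  # systematic error

# Each person is tested under each method; random error differs per score.
scores = [[ability[p] + method_effect[m] + random.gauss(0, 0.3)
           for m in range(n_methods)] for p in range(n_persons)]

# Averaging over persons, random error washes out but the systematic
# method effect remains as a shift in each method's mean score.
method_means = [statistics.mean(scores[p][m] for p in range(n_persons))
                for m in range(n_methods)]
for m, mean in enumerate(method_means):
    print(f"method {m}: mean shift {mean:+.2f} (true effect {method_effect[m]:+.2f})")
```

The recovered mean shifts track the true method effects closely, whereas no analogous average would recover the per-score random errors; this is the sense in which systematic error is distinguishable from random error.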
Validity is a unitary concept, and it always refers to the degree to which evidence supports the inferences that are made from the scores. In addition to the test's content and method, validation must consider how test takers perform. The examination of content relevance and content coverage is a necessary part of the validation process. Information about criterion relatedness – concurrent or predictive – is by itself insufficient evidence for validation. The process of construct validation, of providing evidence for 'the adequacy of a test as a measure of the characteristic it is interpreted to assess', is a complex and continuous undertaking, involving both (1) theoretical, logical analysis leading to empirically testable hypotheses, and (2) a variety of appropriate approaches to empirical observation and analysis. Reliability is a requirement for validity, and the investigation of reliability and validity can be viewed as complementary aspects of identifying, estimating, and interpreting different sources of variance in test scores. Reliability is concerned with determining how much of the variance in test scores is reliable variance, while validity is concerned with determining what abilities contribute to this reliable variance. Another way to distinguish reliability from validity is to consider the theoretical frameworks upon which they depend. The most important quality to consider in the development, interpretation, and use of language tests is validity, which has been described as a unitary concept related to the adequacy and appropriateness of the way we interpret and use test scores, whereas reliability is a necessary condition for validity, in the sense that test scores that are not reliable cannot provide a basis for valid interpretation and use.
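The variance framing above can be sketched numerically. The toy simulation below is not from the book; the construct, the "test-wiseness" factor, and all magnitudes are invented, and treating validity as a simple variance ratio is itself a simplification. It shows why reliability is necessary but not sufficient for validity: a systematic but ability-irrelevant factor adds reliable variance without adding valid variance.

```python
import random
import statistics

# Illustrative sketch: reliable variance = all systematic variance in scores;
# valid variance = only the part due to the ability of interest.

random.seed(1)
n = 5000
ability = [random.gauss(0, 1.0) for _ in range(n)]        # the construct we want
test_wiseness = [random.gauss(0, 0.6) for _ in range(n)]  # systematic but irrelevant
scores = [a + w + random.gauss(0, 0.5) for a, w in zip(ability, test_wiseness)]

var_ability = statistics.pvariance(ability)
var_reliable = var_ability + statistics.pvariance(test_wiseness)  # all systematic variance
var_total = statistics.pvariance(scores)

reliability = var_reliable / var_total  # proportion of variance that is systematic
validity = var_ability / var_total      # proportion due to the intended ability
print(f"reliability ~ {reliability:.2f}, validity ~ {validity:.2f}")
```

Here the scores are quite reliable, yet noticeably less valid, because part of the reliable variance comes from a factor other than the ability being measured.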
In order to examine validity, we need a theory that specifies the language abilities that we hypothesize will affect test performance. Distinguishing between reliability and validity, then, involves differentiating sources of measurement error from other factors that affect test scores. In order to maximize the reliability of test scores and the validity of test use, we should follow three fundamental steps in the development of tests: (1) provide clear and unambiguous theoretical definitions of the abilities we want to measure; (2) specify precisely the conditions, or operations, that we follow in eliciting and observing performance; and (3) quantify our observations so as to assure that our measurement scales have the properties we require.

Some persistent problems and future directions

The challenge facing us is to utilize insights from linguistics, language learning, and language teaching to develop tests as instruments of research that can lead to a better understanding of the factors that affect performance on language tests. As developers and users of language tests, our task is to incorporate this increased understanding into practical test design, construction, and use. Another major challenge will be either to adapt current measurement models to the analysis of language test scores or to develop new models that are appropriate for such data. Meeting these challenges will require innovation and the re-examination of existing assumptions, procedures, and technology. The most complex and persistent problems in language testing are those presented by the relationship between the language use required by tasks on language tests and that which is part of our everyday communicative use of language.
There are two distinct approaches to describing this vital relationship, or test 'authenticity'. One is to identify the 'real-life' language use that we expect will be required of test takers and, with this as a criterion, attempt to design test tasks that mirror it; the other is to examine actual non-test communicative language use in an attempt to identify the critical, or essential, features of such language use. The author is sure there are pressing needs for language tests suitable for use in making minimum competency decisions about foreign language learners and language teachers, and in the evaluation of foreign language teaching methods. First, our highest priority must be given to the continued development and validation of authentic tests of communicative language ability. Second is the development of criterion-referenced measures of communicative language ability. A third area of need is in second language acquisition research, where criterion measures of language abilities that can be used to assess learners' progression through developmental sequences are still largely absent.

This book is an authoritative and inspiring monograph, especially suitable for doctoral and postgraduate students majoring in applied linguistics and foreign language teaching theory, as well as for those who specialize in the development and use of language tests.