On Fundamental Considerations in Language Testing

About the Author:
Professor Lyle F. Bachman is a former president of the International Language Testing Association, a recipient of its Lifetime Achievement Award in language testing research, a world-renowned expert in language testing, and a professor at the University of California, Los Angeles (UCLA). In 2010 Professor Bachman published a new book, Language Assessment in Practice. On 15 October 2010, he visited Zhejiang University and, drawing on this new book, gave a lecture entitled “Justifying the Use of Language Assessment”, which attracted a large audience of teachers and graduate students and filled the venue.
Short Biographical Sketch
Lyle F. Bachman is Professor, Department of Applied Linguistics and
TESL, University of California, Los Angeles. He is a Past President of the American Association for
Applied Linguistics and of the International Language Testing Association, and is currently co-editor, with
Charles Alderson, of the Cambridge Language Assessment Series. He was the first winner of the
TESOL/Newbury House Award for Outstanding Research, has won the Modern Language Association of
America’s Kenneth Mildenberger Award for outstanding research publication twice, in 1999 was selected
as one of 30 American "ESL Pioneers" by ESL Magazine, and in 2004 was given a Lifetime Achievement
Award by the International Language Testing Association. He currently is a member of the Board on
Testing and Assessment, a standing committee of the National Academies of Science.
Prof. Bachman has published numerous articles and books in the areas of language testing, program
evaluation and second language acquisition. He is regularly engaged in research projects in language
testing and in program design and evaluation, as well as practitioner training workshops in language
assessment, both at American institutions and at institutions abroad. His current research interests include
validation theory, linking current validation models and procedures to test use, issues in assessing the
academic achievement and academic English of English language learners in schools, the interface
between language testing research and second language acquisition research, and the dialectic of abilities
and contexts in language testing and educational performance assessment.
Publications include the following books:
Fundamental Considerations in Language Testing. Oxford University Press, 1990.
An Investigation into the Comparability of Two Tests of English as a Foreign Language:
The Cambridge-TOEFL Comparability Study (with Fred Davidson, Katherine Ryan and Inn-Chull
Choi). University of Cambridge Local Examinations Syndicate and Cambridge University Press, 1994.
Language Testing in Practice (with Adrian S. Palmer). Oxford University Press, 1996.
Interfaces between Second Language Acquisition and Language Testing Research (co-edited with
Andrew D. Cohen). Cambridge University Press. 1998.
Keeping Score for All: the effects of inclusion and accommodation policies on large-scale educational
assessments (co-authored with Judith Koenig). National Academies Press. 2004.
Statistical Analyses for Language Assessment. Cambridge University Press, 2004.
Workbook and CD for Statistical Analyses for Language Assessment (with Antony J.
Kunnan). Cambridge University Press, 2005.
Language Assessment in Practice: Developing and Using Language Assessments in the Real World.
(with Adrian S. Palmer). Oxford University Press, forthcoming.
On Fundamental Considerations in Language Testing
— by Lyle F. Bachman
This is an academic monograph on language testing by Lyle F.
Bachman, a well-known professor of applied linguistics at the University of California and of English
language teaching at the Chinese University of Hong Kong. The book is not a ‘nuts and bolts’ text on how to write
language tests. Rather, it is a discussion of fundamental issues that must be addressed at the start of any
language testing effort, whether this involves the development of new tests or the selection of existing tests.
One objective of this book is to provide a conceptual foundation for answering practical questions
regarding the development and use of language tests. This foundation includes three broad areas: (1) the
context that determines the uses of language tests; (2) the nature of the language abilities we want to
measure, and (3) the nature of measurement.
A second objective of this book is to explore some of the problems raised by what is perhaps a unique
characteristic of language tests and a dilemma for language testers – that language is both the instrument and the
object of measurement – and to begin to develop a conceptual framework that will eventually lead, if not to
their solution, at least to a better understanding of the factors that affect performance on language tests.
The book consists of eight chapters, each of which presents a set of related issues. The issues discussed
in this book are relevant to two aspects of language testing: (1) the development and use of language tests;
and (2) language testing research. Following the discussion of these issues are a summary, notes, suggestions
for further reading, and discussion questions. Chapter 1 provides a general context for the discussion of
language testing. In Chapter 2 the terms ‘measurement’, ‘test’, and ‘evaluation’ are defined and the
relationships among them are discussed. Chapter 3 deals with the various uses of language tests in
educational programs, along with examples of different types of programs to illustrate these different
language tests. In Chapters 4 and 5, the author presents a theoretical framework for describing performance
on language tests. Chapters 6 and 7 provide extensive discussions of the issues and problems related to
demonstrating the reliability of test scores and the validity of test use. In the final chapter, the author sheds
the mantle of objective discussant and takes on more of a proactive advocate’s role, dealing with some
persistent issues (and controversies) in language testing, and proposing an agenda for future research and
development.
Measurement
Measurement in the social sciences is the process of quantifying the characteristics of persons
according to explicit procedures and rules. This definition includes three distinguishing features:
quantification, characteristics, and explicit rules and procedures.
A test is a measurement instrument designed to elicit a specific sample of an individual’s behavior. What
distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of
behavior. This distinction is an important one, since it reflects the primary justification for the use of
language tests and has implications for how we design, develop, and use them. Tests in and of themselves
are not evaluative. They are often used for pedagogical purposes, either as a means of motivating students
to study, or a means of reviewing material taught, in which case no evaluative decision is made on the basis
of the test results. They may also be used for purely descriptive purposes. The majority of tests are used for
the purpose of making decisions about individuals.
Evaluation can be defined as the systematic gathering of information for the purpose of making
decisions. It is the collection of reliable and relevant information. Therefore, it does not necessarily entail
testing. It is only when the results of tests are used as a basis for making a decision that evaluation is
involved. So it is important to distinguish the information-providing function of measurement from the
decision-making function of evaluation.
Essential measurement qualities. Reliability is a quality of test scores, and a perfectly reliable score,
or measure, would be one which is free from errors of measurement. And the most important quality of test
interpretation or use is validity, or the extent to which the inferences or decisions we make on the basis of
test scores are meaningful, appropriate, and useful. While reliability is a quality of test scores themselves,
validity is a quality of test interpretation and use. They are both essential to the use of tests.
In summary, ‘Measurement’ and ‘test’ involve the quantification of observations, and are thus distinct
from qualitative descriptions. Tests are a type of measurement designed to elicit a specific sample of
behavior. ‘Evaluation’ involves decision making, and is thus distinct from measurement, which essentially
provides information. Thus, neither measures nor tests are in and of themselves evaluative, and evaluation
need not involve measurement or testing.
Uses of language tests
The two major uses of language tests are: (1) as sources of information for making decisions within the
context of educational programs; and (2) as indicators of abilities or attributes that are of interest in
research on language, language acquisition, and language teaching.
The fundamental use of testing in an educational program is to provide information for making
decisions, that is, for evaluation. The use of tests as a source of evaluation information requires three
assumptions. First, we must assume that information regarding educational outcomes is essential to
effective formal education. Second, we must assume that it is possible to improve learning and teaching through appropriate
changes in the program, based on feedback. Third, we must assume that the educational outcomes of the
given program are measurable. In addition to these assumptions, we must also consider how much and
what kind of testing is needed, as well as the quality of information provided by our tests.
In a word, the main point of this chapter is that the most important consideration in the development
and use of language tests is the purpose or purposes for which the particular test is intended. By far the
most prevalent use of language tests is for purposes of evaluation in educational programs.
Communicative language ability
According to Bachman, communicative language ability (CLA) can be described as consisting of both
knowledge, or competence, and the capacity for implementing, or executing that competence in appropriate,
contextualized communicative language use.
The framework of CLA includes three components: language competence, strategic competence, and
psychophysiological mechanisms.
(Figure: Components of communicative language ability in communicative language use — knowledge structures, language competence, strategic competence, psychophysiological mechanisms, and the context of situation)
(Figure: Components of language competence — organizational competence, comprising grammatical competence [vocabulary, morphology, syntax, phonology/graphology] and textual competence [cohesion, rhetorical organization]; and pragmatic competence, comprising illocutionary competence [ideational, manipulative, heuristic, imaginative functions] and sociolinguistic competence [sensitivity to dialect or variety, register, and naturalness, and cultural references])
Language competence includes organizational competence, which consists of grammatical and textual
competence, and pragmatic competence, which consists of illocutionary and sociolinguistic competence.
Grammatical competence includes those competencies involved in language usage, consisting of a
number of relatively independent competencies such as the knowledge of vocabulary, morphology, syntax,
and phonology/ graphology. Textual competence consists of cohesion and rhetorical organization.
Illocutionary competence is related to four macro-functions: ideational, manipulative, heuristic, and
imaginative. Abilities under sociolinguistic competence are sensitivity to differences in dialect or variety, to
differences in register and to naturalness, and the ability to interpret cultural references and figures of
speech.
Strategic competence. Three components are included in strategic competence: assessment, planning,
and execution.
Psychophysiological mechanisms. These are essentially the neurological and physiological processes
that characterize the channel (auditory, visual) and mode (receptive, productive) in which competence is
implemented.
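The language competence hierarchy summarized above lends itself to a simple tree representation. The following sketch is illustrative only — the structure is taken from the framework as summarized here, not from any notation in the book itself — and encodes it as a nested Python dictionary:

```python
# Bachman's language competence hierarchy, as summarized above,
# encoded as a nested dictionary for illustration.
language_competence = {
    "organizational competence": {
        "grammatical competence": [
            "vocabulary", "morphology", "syntax", "phonology/graphology",
        ],
        "textual competence": ["cohesion", "rhetorical organization"],
    },
    "pragmatic competence": {
        "illocutionary competence": [
            "ideational", "manipulative", "heuristic", "imaginative",
        ],
        "sociolinguistic competence": [
            "sensitivity to dialect or variety",
            "sensitivity to register",
            "sensitivity to naturalness",
            "cultural references and figures of speech",
        ],
    },
}

# Count the subcomponents under each top-level component.
for component, subcompetences in language_competence.items():
    leaves = sum(len(v) for v in subcompetences.values())
    print(component, "->", leaves, "subcomponents")
```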
Test methods
The characteristics of test methods can be seen as restricted or controlled versions of the contextual
features that determine the nature of the language performance that is expected on a given test or test task.
Performance on language tests varies as a function both of an individual’s language ability and of the
characteristics of the test method. It is also affected by individual attributes that are not part of test takers’
language ability.
The five major categories of test method facets are: (1) the testing environment, which includes such
facets as familiarity of the place and equipment used in administering the test, the personnel involved in the
test, the time of testing, and the physical conditions; (2) the test rubric, which consists of the facets that specify
how test takers are expected to proceed in taking the test. These include the test organization, time
allocation, and instructions; (3) the nature of the input the test taker receives; (4) the nature of the expected
response to that input, and (5) the relationship between input and response. The frameworks described here
have been presented as a means for describing performance on language tests, and they are intended as a
guide for both the development and use of language tests and for research in language testing. These
frameworks provide the applied linguistic foundation that informs the discussions in the remainder of the
book.
Reliability
A high score on a language test is determined or caused by high communicative language ability, and a
theoretical framework defining this ability is thus necessary if we want to make inferences about ability
from test scores. Performance on language tests is also affected by factors other than communicative
language ability. These can be grouped into the following broad categories: (1) test method facets, as
discussed in Chapter 5; (2) attributes of the test taker that are not considered part of the language abilities
we want to measure, and (3) random factors that are largely unpredictable and temporary.
(Figure: A test score as the product of communicative language ability, test method facets, personal attributes, and random factors)
Fundamental to the development and use of language tests is being able to identify and estimate the
effect of various factors on language test scores. Ideally, test scores should be influenced as much as possible
by the language ability being measured; any factors other than this ability that affect test scores are potential
sources of error that decrease both the reliability of scores and the validity of their interpretations.
Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of
their effect on test scores.
Measurement theory provides several models that specify the relationships between measures, or
observed scores, and factors that affect these scores. Generalizability theory is an extension of the classical
model that overcomes many of its limitations, in that it enables test developers to examine several
sources of variance simultaneously, and to distinguish systematic from random error. Estimates of
reliability based on classical measurement theory are inappropriate for use with criterion-referenced tests
because of differences in the types of comparisons and decisions made. Systematic error, such as that
associated with test method, is different from random error.
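The distinction between systematic and random error can likewise be illustrated with a toy simulation (invented numbers, not from the book): random error averages out over repeated administrations, while a systematic test-method effect biases every administration the same way and therefore survives in the mean.

```python
import random
import statistics

random.seed(7)

true_score = 70.0      # the test taker's true ability (arbitrary scale)
method_effect = -4.0   # systematic error: an unfamiliar format lowers scores

def administer():
    """One administration: true score + method effect + random error."""
    return true_score + method_effect + random.gauss(0, 5)

scores = [administer() for _ in range(2000)]
mean_score = statistics.mean(scores)

# The random component averages toward zero, but the systematic
# method effect remains: the mean settles near 66, not 70.
print(round(mean_score, 1))
```

This is why more testing alone cannot remove method-related bias; it can only reduce the random component.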
Validation
The primary concern in test development and use is demonstrating not only that test scores are reliable,
but that interpretations and uses we make of test scores are valid. It has been traditional to classify validity
into different types, such as content, criterion, and construct validity. Validity, however, is a unitary concept: it
always refers to the degree to which empirical evidence supports the inferences that are made from the scores. In
addition to the test’s content and method, validation must consider how test takers perform. The
examination of content relevance and content coverage is a necessary part of the validation process.
Information about criterion relatedness – concurrent or predictive – is by itself insufficient evidence for
validation. The process of construct validation, of providing evidence for ‘the adequacy of a test as a
measure of the characteristic it is interpreted to assess’, is a complex and continuous undertaking, involving
both (1) theoretical, logical analysis leading to empirically testable hypotheses, and (2) a variety of
appropriate approaches to empirical observation and analysis.
Reliability is a requirement for validity, and the investigation of reliability and validity can be viewed
as complementary aspects of identifying, estimating, and interpreting different sources of variance in test
scores. Validity is concerned with identifying the factors that produce the reliable variance in test scores.
Reliability is concerned with determining how much of the variance in test scores is reliable variance,
while validity is concerned with determining what abilities contribute to this reliable variance. Another way
to distinguish reliability from validity is to consider the theoretical frameworks upon which they depend.
The most important quality to consider in the development, interpretation, and use of language tests is
validity, which has been described as a unitary concept related to the adequacy and appropriateness of the
way we interpret and use test scores, whereas reliability is a necessary condition for validity, in the sense
that test scores that are not reliable cannot provide a basis for valid interpretation and use. In order to
examine validity, we need a theory that specifies the language abilities that we hypothesize will affect test
performance. Distinguishing between reliability and validity, then, involves differentiating sources of
measurement error from other factors that affect test scores.
In order to maximize the reliability of test scores and the validity of test use, we should follow three
fundamental steps in the development of tests: (1) provide clear and unambiguous theoretical definitions of
the abilities we want to measure; (2) specify precisely the conditions, or operations that we follow in
eliciting and observing performance, and (3) quantify our observations so as to assure that our
measurement scales have the properties we require.
Some persistent problems and future directions
The challenge facing us is to utilize insights from linguistics, language learning, and language teaching
to develop tests as instruments of research that can lead to a better understanding of the factors that affect
performance on language tests. As developers and users of language tests, our task is to incorporate this
increased understanding into practical test design, construction, and use. Another major challenge will be
either to adapt current measurement models to the analysis of language test scores or to develop new
models that are appropriate for such data. Meeting these challenges will require innovation, and the
re-examination of existing assumptions, procedures, and technology.
The most complex and persistent problems in language testing are those presented by the consideration
of the relationship between the language use required by tasks on language tests and that which is part of
our everyday communicative use of language. Two distinct approaches have been taken to describing this
vital relationship, or test ‘authenticity’. The first is to identify the ‘real-life’ language use that we expect
will be required of test takers and, with this as a criterion, to attempt to design test tasks that mirror it; the
other is to examine actual non-test communicative language use, in an attempt to identify the critical, or
essential, features of such language use.
The author is sure there are pressing needs for language tests suitable for use in making minimum
competency decisions about foreign language learners and language teachers, and in the evaluation of
foreign language teaching methods. First, our highest priority must be given to the continued development
and validation of authentic tests of communicative language ability. Second is the development of
criterion-referenced measures of communicative language ability. A third area of need is in second
language acquisition research, where criterion measures of language abilities that can be used to assess
learners’ progression through developmental sequences are still largely absent.
This book is an authoritative and inspiring monograph, especially suitable for doctoral students and
postgraduates majoring in applied linguistics and foreign language teaching theory, as well as those who
specialize in the development and use of language tests.