Language Testing
Liu Jianda

Syllabus

It is expected that, by the end of this module, participants will be able to:
- understand the general considerations that must be addressed in developing new tests or selecting existing language tests;
- make their own judgements and decisions about either selecting an existing language test or developing a new one;
- familiarise themselves with the fundamental issues, approaches, and methods used in measurement and evaluation;
- design, develop, evaluate, and use language tests in ways that are appropriate for a given purpose, context, and group of test takers;
- understand the future development of language testing and the application of IT to computerized language testing.

To achieve these objectives, the module gives participants the opportunity to develop the following skills:
- writing test items
- collecting test data and conducting item analysis
- evaluating language tests with regard to validity and reliability

This is done by considering a wide range of issues and topics related to language testing.
These include the following:
- general concepts in language testing and evaluation
- evaluation of a language test: reliability and validity
- the communicative approach to language testing
- design of a language test
- item writing and item analysis
- interpreting test results
- item response theory and its applications
- computerized language testing and its future development

Class schedule
1. Basic concepts in language testing
2. Test validation: reliability and validity (1)
3. Test validation: reliability and validity (2)
4. Test construction (1)
5. Test construction (2)
6. Test construction (3)
7. Test construction (4)
8. Test construction (5)
9. Test construction (6)
10. Rasch analysis (1)
11. Rasch analysis (2)
12. Language testing and modern technology

Assessment
One 5,000-6,000-word paper on language testing.

Collaborative work: you will be divided into groups of four to complete the development of a test paper. Each of you will be responsible for one part of the test paper, but each part should contribute equally to the whole. Therefore, besides developing your own part, you need to come together to discuss the whole test paper in terms of reliability and validity.

Course books
Bachman, L. F. & Palmer, A. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall Regents.
Li, X. (1997). The Science and Art of Language Testing. Changsha: Hunan Educational Press.
McNamara, T. (1996). Measuring Second Language Performance. London; New York: Longman.
Website: http://www.clal.org.cn/personal/testing/Leeds

Session 1: Basic concepts in language testing

1. A short history of language testing
Spolsky (1978) classified the development of language testing into three periods, or trends:
- the prescientific period
- the psychometric/structuralist period
- the integrative/sociolinguistic period
The prescientific period
- grammar-translation approaches to language teaching
- translation and free-composition tests
- difficult to score objectively
- no statistical techniques applied to validate the tests
- simple, but unfair to students

The psychometric-structuralist period
- audio-lingual and related teaching methods
- objectivity, reliability, and validity of tests considered
- measures discrete structure points
- multiple-choice format (standardized tests)
- follows scientific principles, with trained linguists and language testers

The integrative-sociolinguistic period
- communicative competence
- Chomsky's (1965) distinction between competence and performance: competence is an ideal speaker-listener's knowledge of the rules of the language; performance is the actual use of language in concrete situations
- Hymes's (1972) proposal of communicative competence: the ability of native speakers to use their language in ways that are not only linguistically accurate but also socially appropriate
- Canale & Swain's (1980) framework of communicative competence:
  - grammatical competence: mastery of the language code, such as morphology, lexis, syntax, semantics, and phonology
  - sociolinguistic competence: mastery of appropriate language use in different sociolinguistic contexts
  - discourse competence: mastery of how to achieve coherence and cohesion in spoken and written communication
  - strategic competence: mastery of the communication strategies used to compensate for breakdowns in communication and to enhance the effectiveness of communication
Bachman's (1990) framework of communicative language ability:
- language competence (subsuming Canale & Swain's grammatical, sociolinguistic, and discourse competence):
  - organizational competence: grammatical competence, textual competence
  - pragmatic competence: illocutionary competence, sociolinguistic competence
- strategic competence: performs the assessment, planning, and execution functions in determining the most effective means of achieving a communicative goal
- psychophysiological mechanisms: characterize the channel (auditory, visual) and mode (receptive, productive)

Oller's (1979) pragmatic proficiency test:
- temporally and sequentially consistent with real-world occurrences of language forms
- linked to a meaningful extralinguistic context familiar to the testees

Clark's (1978) direct assessment: approximating the testing context to the real world to the greatest extent possible.

- Cloze test and dictation (Yang, 2002b)
- Communicative testing, or testing communicatively?

Performance tests (Brown, Hudson, Norris, & Bonk, 2002; Norris, 1998)
- not discrete-point in nature
- integrate two or more of the language skills of listening, speaking, reading, and writing, together with other aspects such as cohesion and coherence, suprasegmentals, paralinguistics, kinesics, pragmatics, and culture
- task-based: essays, interviews, extensive reading tasks

Performance tests have three sets of characteristics, concerning the task, the raters, and the rating scale.

The task should:
- be based on needs analysis (What criteria should be used? What content and context? How should experts be used?)
- be as authentic as possible, with the goal of measuring real-world activities
- sometimes have collaborative elements that stimulate communicative interactions
- be contextualized and complex
- integrate skills with content
- be appropriate in terms of the number, timing, and frequency of assessments
- be generally non-intrusive, that is, aligned with the daily activities of the language classroom

The raters should be appropriate in terms of:
- number of raters
- overall expertise
- familiarity with, and training in the use of, the scale

The rating scale should be based on appropriate:
- categories of language learning and development
- breadth of information regarding learner performance abilities
- standards that are both authentic and clear to students

To enhance the reliability and validity of decisions, as well as accountability, performance assessments should be combined with other methods of gathering information (e.g. self-assessments, portfolios, conferences, classroom behaviors, and so forth).

Development graph (Li, 1997: 5)

2. Theoretical issues
Language testing is concerned with both content and methodology.
Development since 1990
- communicative language testing (Weir, 1990)
- reliability and validity
- social functions of language testing
- ethical language testing

Washback (impact) (Qi, 2002; Wall, 1997)
- impact: the effects of tests on individuals, policies, or practices within the classroom, the school, the educational system, or society as a whole
- washback: the effects of tests on language teaching and learning
- ways of investigating washback:
  - analyses of test results
  - teachers' and students' accounts of what takes place in the classroom (questionnaires and interviews)
  - classroom observation

Ethics of test use
- use with care (Spolsky, 1981: 20)
- codes of practice

Professionalization of the field
- training of professionals
- development of standards of practice and mechanisms for their implementation and enforcement

Critical language testing
- placing language testing in its social context

Factors affecting the performance of examinees, all feeding into the TEST SCORE:
- communicative language ability
- random factors
- test method facets
- personal attributes

Testing interlanguage pragmatic knowledge
- currently at the research level
- focus on method validation
- a web-based test by Roever

Computerized language testing
- item banking
- computer-assisted language testing
- computerized adaptive language testing:
  - test items are adapted to the individual
  - the test ends when the examinee's ability has been determined
  - test time is much shorter
- web-based testing
- PhonePass testing

Language testing and second language acquisition (Bachman & Cohen, 1998)
- helps to define the construct of language ability
- findings from language testing can be used to test hypotheses in SLA
- provides SLA researchers with tests and standards of testing

Development of research methodology

Factor analysis
The main applications of factor-analytic techniques are (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables.
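As a minimal illustration of structure detection, the eigenvalues of a correlation matrix indicate how many underlying factors account for a set of observed variables (the Kaiser rule of thumb keeps factors with eigenvalues above 1). The correlation matrix below is hypothetical, standing in for four subtest scores in which two pairs of subtests hang together:

```python
import numpy as np

# Hypothetical correlation matrix for four subtest scores:
# variables 1-2 correlate strongly (say, two reading subtests),
# variables 3-4 correlate strongly (say, two listening subtests),
# and the two pairs are uncorrelated with each other.
R = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])

# eigvalsh returns eigenvalues in ascending order; reverse to get largest first
eigenvalues = np.linalg.eigvalsh(R)[::-1]
print(eigenvalues)  # approximately [1.8, 1.8, 0.2, 0.2]

# Kaiser criterion: retain factors with eigenvalue > 1
n_factors = int(np.sum(eigenvalues > 1))
print(n_factors)  # 2: the four variables reduce to two factors
```

Two eigenvalues exceed 1, so the four observed variables are explained by two latent factors; this is the data-reduction and classification role of factor analysis described above.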
Factor analysis is therefore applied as a data-reduction or structure-detection method.

Generalizability theory (Bachman, 1997; Bachman, Lynch, & Mason, 1995)
- estimates the relative effects of different factors (facets) on test scores
- the most generalizable indicator of an individual's language ability is the universe score; in the real world, however, we can only obtain scores from a limited sample of measures, so we need to estimate the dependability of a given observed score as an estimate of the universe score

Two stages are involved in applying G-theory to test development.

G-study: the purpose is to estimate the effects of the various facets in the measurement procedure (usually conducted during pretesting), e.g. persons (differences in individuals' speaking ability), raters (differences in severity among raters), and tasks (differences in difficulty among tasks), together with the two-way interactions:
- task x rater: different raters rate the different tasks differently
- person x task: some tasks are differentially difficult for different groups of test takers (a source of bias)
- person x rater: some raters score the performance of different groups of test takers differently (an indication of rater bias)

D-study: the purpose is to design an optimal measure for the interpretations or decisions that are to be made on the basis of the test scores (estimation of dependability). The generalizability coefficient (G coefficient) provides an estimate of the proportion of an individual's observed score that can be attributed to his or her universe score, taking into consideration the effects of the different conditions of measurement specified in the universe of generalization. It is appropriate for norm-referenced tests; for criterion-referenced tests, the phi coefficient is used instead.
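For the simplest crossed persons x raters design, the relative G coefficient can be sketched from a two-way ANOVA decomposition: universe-score (person) variance over itself plus the residual variance divided by the number of raters. The score matrix below is hypothetical; a real G-study would use dedicated software:

```python
import numpy as np

# Hypothetical scores: 5 persons (rows) rated by 3 raters (columns)
scores = np.array([
    [3.0, 4.0, 3.0],
    [5.0, 5.0, 4.0],
    [2.0, 3.0, 2.0],
    [4.0, 5.0, 5.0],
    [1.0, 2.0, 1.0],
])
n_p, n_r = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Mean squares from the two-way (persons x raters) decomposition
ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
residual = scores - person_means[:, None] - rater_means[None, :] + grand
ms_res = np.sum(residual ** 2) / ((n_p - 1) * (n_r - 1))

# Variance components
var_p = max((ms_p - ms_res) / n_r, 0.0)  # persons: universe-score variance
var_res = ms_res                         # person x rater interaction + error

# Relative G coefficient for a score averaged over n_r raters
# (norm-referenced case; criterion-referenced tests use phi instead)
g = var_p / (var_p + var_res / n_r)
print(round(g, 3))  # → 0.977
```

With these made-up scores most of the observed variance comes from real differences among persons, so the G coefficient is high; noisier raters would inflate var_res and pull it down.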
(Software for such analyses: GENOVA.)

Item response theory (the Rasch model)
IRT enables us to estimate the statistical properties of items and the abilities of test takers in such a way that these are not dependent upon a particular group of test takers or a particular form of a test. It is widely used in large-scale standardized testing.

Structural equation modelling (Kunnan, 1998)
- a combination of multiple regression, path analysis, and factor analysis
- attempts to explain a correlation or covariance matrix derived from a set of observed variables; latent variables are held responsible for the covariance among the measured variables

Basic procedures in SEM (example from Purpura, 1998), examining the relationships between strategy use and second language test performance:
1. Design two questionnaires, for cognitive strategies and metacognitive strategies (40 items).
2. Ask respondents to answer the questionnaires.
3. Have the respondents take a foreign language test.
4. Cluster the 40 items to measure several variables.
5. Compute the reliability of the variables.
6. Conduct factor analysis to identify factors.
7. Conduct the SEM analysis (AMOS, EQS, LISREL).

Qualitative methods
- verbal report (think-aloud, introspective)
- observation
- questionnaires and interviews
- discourse analysis

3. Classification of language tests

According to families:
- norm-referenced tests
- criterion-referenced tests

Norm-referenced tests
- measure global language abilities (e.g. listening, reading, speaking, writing)
- a score on the test is interpreted relative to the scores of all the other students who took the test
- scores are assumed to be normally distributed (see http://stat-www.berkeley.edu/~stark/Java/NormHiLite.htm)
- students know the format of the test but do not know what specific content or skills will be tested
- a few relatively long subtests with a variety of question contents

Criterion-referenced tests
- measure well-defined and fairly specific objectives
- interpretation of scores is absolute, without reference to other students' scores
- the distribution of scores need not be normal
- students know in advance what types of questions, tasks, and content to expect on the test
- a series of short, well-defined subtests with similar question contents

According to decision purposes:
- proficiency tests
- placement tests
- achievement tests
- diagnostic tests

Proficiency tests
- test students' general levels of language proficiency
- must provide scores that form a wide distribution, so that interpretations of the differences among students will be as fair as possible
- can dramatically affect students' lives, so slipshod decision making in this area would be particularly unprofessional

Placement tests
- group students of similar (homogeneous) ability levels
- help decide each student's appropriate level within a specific program
- the right tests for the right purposes

Achievement tests
- concern the amount of learning that students have done
- the decision may involve who will be advanced to the next level of study or which students should graduate
- must be designed with specific reference to a particular course
- criterion-referenced, conducted at the end of the program
- used to make decisions about students' levels of learning; at the same time, they can be used to effect curriculum changes and to test those changes continually against program realities

Diagnostic tests
- aimed at fostering achievement by promoting the strengths and eliminating the weaknesses of individual students
- require more detailed information about the very specific areas in which students have strengths and weaknesses
- criterion-referenced, conducted at the beginning or in the middle of a language course
- the same test can be diagnostic at the beginning or in the middle of a course but an achievement test at its end
- perhaps the most effective use of a diagnostic test is to report the performance level on each objective (as a percentage) to each student, so that he or she can decide how and where to invest time and energy most profitably

Formative assessment vs. summative assessment
- Formative: a judgement of an ongoing program, used to provide information for program review, identification of the effectiveness of the instructional process, and assessment of the teaching process.
- Summative: a terminal evaluation employed in the general assessment of the degree to which the larger outcomes have been attained over a substantial part of, or all of, a course. It is used to determine whether or not the learner has achieved the ultimate objectives for instruction that were set up in advance of the instruction.

Public examinations vs. classroom tests
- Purpose: proficiency vs. achievement (placement, diagnostic)
- Format: standardized vs. open (objective vs. subjective)
- Scale: large-scale vs. small-scale (self-assessment)
- Scores: normality, backwash
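Returning to the Rasch model introduced under item response theory above: it can be made concrete in a few lines. The probability of a correct answer depends only on the difference between a person's ability and an item's difficulty, and an examinee's ability can be estimated from a response pattern by maximum likelihood. The item difficulties and responses below are hypothetical, and the Newton-Raphson estimator is one standard choice among several:

```python
import math

def p_correct(theta, b):
    """Rasch model: probability that a person of ability theta
    answers an item of difficulty b correctly (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical item difficulties and one examinee's responses
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [1, 1, 1, 0, 0]  # 1 = correct, 0 = wrong

# Maximum-likelihood estimate of theta by Newton-Raphson:
# iterate until the observed score equals the expected score
theta = 0.0
for _ in range(25):
    expected = [p_correct(theta, b) for b in difficulties]
    gradient = sum(responses) - sum(expected)
    information = sum(p * (1 - p) for p in expected)
    theta += gradient / information

print(round(theta, 2))  # → 0.5
```

Note the separation property the text describes: the item difficulties are fixed in logits independently of which examinees happen to take the test, and the ability estimate depends on the examinee's responses only through the items' difficulties, not through the scores of other test takers.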