LANGUAGE PROFICIENCY DESCRIPTORS
Brian North, Eurocentres Foundation, Zürich
This paper is an abbreviated version of a presentation given at the Language Testing Research Colloquium in Tampere, Finland in 1996 (North 1997a). It reports results from a Swiss National Science Research Council project (Schneider and North forthcoming) which developed a scale of language proficiency in the form of a "descriptor bank". The project took place in two rounds: the first for English (1994), the second for French, German and English (1995).
Up until now, most scales of language proficiency have been produced by appeal to intuition and to those scales which already exist rather than to theories of linguistic description or of measurement. The main aim of the project reported was to use an item-banking scaling methodology to develop a bank of transparent descriptors of communicative language proficiency which have known difficulty values. The descriptor bank produced was then exploited to produce the first edition of the "Common Reference Levels" in the Council of Europe "Common European Framework" (Council of Europe 1996) and in the self assessment instruments in a prototype "Language Passport" or "Language Portfolio" recording achievement in relation to that Framework (Council of Europe 1997).
The pilot project for English conducted in 1994 (Year 1) was the subject of a PhD thesis (North 1996). In each of the two years, pools of descriptors were produced by analysing available proficiency scales. Through workshops with representative teachers, the descriptors were then refined into stand-alone criterion statements considered to be clear, useful and relevant to the sectors concerned. Selected descriptors presented on questionnaires were then used by participating teachers to assess the proficiency of learners in their classes. This data was used to scale the descriptors using the Rasch rating scale model. The difficulty estimates for the descriptors produced in relation to English in 1994 proved remarkably stable in relation to French, German and English in 1995.
Introduction
During the past decade, two influences have led to the increasing use of scales of language proficiency. The first influence has been a general movement towards more transparency in educational systems. The second has been a move towards greater international integration, particularly in Europe, which places a higher value on being able to state what the attainment of a given language objective means in practice. The result is that whereas 10 or 15 years ago scales which were not directly or indirectly related back to the 1950s US Foreign Service Institute (FSI) scale (Wilds 1975) were quite rare, the last few years have seen quite a proliferation of European scales which do not take American scales as their starting point. Some examples are: the British National Language Standards (Languages Lead Body 1992); the Eurocentres Scale of Language Proficiency (Eurocentres 1983-92); the Finnish Scale of Language Proficiency (Luoma 1993) and the ALTE Framework (Association of Language Testers in Europe 1994).
In Section I this paper considers scales of language proficiency: the functions they can fulfil and common criticisms of them. Section II then outlines the study which is the subject of the paper; Section III briefly presents the product: a bank of classified, calibrated descriptors. Finally Section IV briefly presents the formats in which the descriptors in the bank can be exploited.
I Scales of Language Proficiency
I Functions
Many scales of proficiency represent what Bachman (1990: 325-330) has described as the "real-life" or behavioural approach to assessment. This is because they try to give a picture of what a learner at a particular level can do in the real world.
Other scales take what Bachman describes as an "interactive-ability" approach, attempting to describe the aspects of the learner's language ability being sampled. Alderson (1991: 71-76) uses a three way classification focusing upon the purposes for which scales are written and used. His expression for Bachman's "interactive-ability" approach is "assessor-oriented" since such scales are intended to bring consistency to the rating process. He identifies in addition a "user-oriented" function to give meaning to scores in reporting results (usually in "real life" terms) as well as a "constructor-oriented" function to provide guidance in the construction of tests or syllabuses, again usually defined in terms of "real life" tasks. Matthews (1990) and Pollitt and Murray (1993/1996) point out that complex analytic grids of "interactive-ability" descriptors intended for profiling can confuse rather than aid the actual assessment process. Pollitt and Murray (ibid) go on to suggest that such profiling grids are, rather, "diagnosis-oriented".
As Alderson points out, problems arise when a scale developed for one function is used for another. In an educational framework, there will be circumstances in which descriptors relating to "real life" tasks are appropriate; there will be circumstances in which descriptors relating to qualitative aspects of a person's proficiency ("interactive-ability") will be appropriate. As Pollitt and Murray point out, rich detail may be appropriate for some functions, but not for others.
Scales offering definitions of learner proficiency at successive bands of ability are becoming more popular because they can be used:
1. To provide "stereotypes" against which learners can compare their self image and roughly evaluate their position (Trim 1978; Oscarson 1978, 1984).
2. To increase the reliability of subjectively judged ratings, providing a common standard and meaning for such judgements (Alderson 1991).
3. To provide guidelines for test construction (Dandonoli and Henning 1990; Alderson 1991).
4. To report results from teacher assessments, scored tests, rated tests and self assessment all in terms of the same instrument - whilst avoiding the spurious suggestion of precision given by a scored scale (e.g. 1-1,000) (Alderson 1991; Griffin 1989).
5. To provide coherent internal links within an institution between pre-course testing, syllabus planning, materials organisation, progress assessment and certification (North 1993a).
6. To establish a framework of reference which can describe achievement in a complex educational system in terms meaningful to all the different partners involved (Trim 1978; Brindley 1986, 1991; Richterich and Schneider 1992; Council of Europe 1996).
7. To enable comparison between systems or populations using a common metric or yardstick (Lowe 1983; Liskin-Gasparro 1984; Bachman and Savignon 1986; Carroll B.J. and West 1989).
II Criticisms
A definition by John Clark (1985: 348) catches the main weakness of the genre: "descriptions of expected outcomes, or impressionistic etchings of what proficiency might look like as one moves through hypothetical points or levels on a developmental continuum". Put another way, there is no guarantee that the description of proficiency offered in a scale is accurate, valid or balanced. Learners may actually be able to interpret a scale remarkably successfully for self assessment; correlations of 0.74 - 0.77 to test/interview results are usual in relation to the Eurocentres scale. Raters may actually be trained to think the same; inter-rater reliability correlations of over 0.8 are common in the literature and correlations over 0.9 are reported in "studio conditions". But the fact that people may be able to use such instruments with surprising effectiveness doesn't necessarily mean that what the scales say is valid.
Furthermore, with the vast majority of scales of language proficiency, it is far from clear on what basis it was decided to put certain statements at Level 3 and others at Level 4 anyway.
Another line of criticism has been that many scales of proficiency cannot be regarded as offering criterion-referenced assessment although they generally claim to do so. Firstly, the meaning of the descriptors at one level is often dependent on a reading of the descriptors at other levels. Secondly, the formulation of the descriptors is itself sometimes overtly relativistic (e.g. "better than Level 2") or norm-referenced (e.g. using expressions like "poor", "weak", "moderate"). Since the descriptors on most scales are not developed independently to check that they are actually saying something, it is not surprising that many scale descriptors fail to present stand-alone criteria capable of generating a Yes / No response (Skehan 1984: 217).
Most scales of language proficiency appear in fact to have been produced pragmatically by appeal to intuition, the local pedagogic culture and those scales to which the author had access. In the process of development it is rare that much consideration is given to the following points:
1. using a model of communicative competence and/or language use;
2. checking that the categories and the descriptor-formulations are relevant and make sense to users, as is standard practice in behavioural scaling in other fields (Smith and Kendall 1963; see North 1993b for a review);
3. using a model of measurement;
4. avoiding the dangers of lifting rank ordered scale content from one context and then using it inappropriately in another (see Spolsky 1986).
Whilst an intuitive approach may be appropriate in the development of scales for use in a low stakes context in which a known group of assessors rate a familiar population of learners, it has been criticised in relation to the development of national framework scales (e.g. Skehan 1984, Fulcher 1987, 1993 in relation to the British ELTS; Brindley 1986, 1991, Pienemann and Johnston 1987 in relation to the Australian ASLPR; Bachman and Savignon 1986, Lantolf and Frawley 1985, 1988, Spolsky 1986, 1993 in relation to the American ACTFL). As De Jong (1990: 72) has put it: "the acceptability of these levels, grades and frameworks seems to rely primarily on the authority of the scholars involved in their definition, or on the political status of the bodies that control and promote them."
On what basis have the authors of the scales put particular content at one level rather than another? Thurstone highlighted this problem as long ago as 1928: "the scale values of the statements should not be affected by the opinions of the people who helped to construct it (the scale). This may turn out to be a severe test in practice, but the scaling method must stand such a test before it can be accepted as being more than a description of the people who construct the scale" (Thurstone 1928: 547-8, cited in Wright and Masters 1982: 15).
In the project reported, a rigorous approach to both qualitative and quantitative scaling was taken. The methodology adopted: thoroughly documented the field of language proficiency scales; developed stand-alone criterion statements describing concrete aspects of language proficiency related to the model in the emerging Council of Europe "Common European Framework" (Council of Europe 1996); validated these descriptors in an extensive series of workshops with teachers; calibrated each of the statements to a mathematical scale with a measurement model on the basis of data from teacher assessments of their students at the end of the school year (Wright and Masters 1982; Linacre 1989).
II. The Study
The focus in the pilot for English was on spoken interaction, including comprehension in interaction, and on spoken production (extended monologue).
Some descriptors were also included for written interaction (letters, questionnaire and form-filling) and for written production (report writing, essays etc.). In 1995 (Year 2) the survey was extended to French and German as well as English and descriptors were added for reading and for non-interactive listening. The project took place in three steps in each of the two years:
I Comprehensive documentation: Creation of a descriptor pool
A survey of existing scales of language proficiency (North 1994) provided a starting point. Forty-one proficiency scales were pulled apart with the definition for each level from each scale assigned to a provisional level. Each descriptor was then split up into sentences which were then each allocated to a provisional category. When adjacent sentences were part of the same point, they were edited into a compound sentence.
In Year 1 the creation of the descriptor pool for the project coincided with the period in which the Council of Europe Framework authoring group were developing the descriptive scheme. The descriptive scheme draws on the area of consensus between existing models of communicative competence and language use (e.g. Canale and Swain 1980, 1981; Van Ek 1986; Bachman 1990; Skehan 1995; McNamara 1995). In addition, an organisation of language activities under the headings Reception, Interaction and Production, developing an idea from Brumfit (1987), was adopted. Space does not permit detailed consideration of the scheme; readers are referred to North (1997b) for a shortened version of a study produced at the time and to the actual document (Council of Europe 1996).
The elimination of repetition, negative formulation and norm-referenced statements now meaningless away from their co-text produced a pool of approximately 1,000 stand-alone, positively worded criterion statements in each of the two years.
II Qualitative Validation: Consultation with teachers through workshops
Qualitative validation of the descriptor pool was undertaken through wide consultation with foreign language teachers representative of the different sectors in the Swiss educational system. Two techniques were used in each of 32 workshops, each attended by between 4 and 25 teachers.
The first technique was adapted from that reported by Pollitt and Murray (1993/1996). Teachers were asked to discuss which of a pair of learners talking to each other on a video was better and justify their choice. The aim was to elicit the metalanguage teachers used to talk about qualitative aspects of proficiency and check that these were included in the categories in the descriptor pool. These discussions were recorded, transcribed in note form, analysed and, if something new was being said, formulated into descriptors.
The second technique was based on that used by Smith and Kendall (1963). Pairs of teachers were given a pile of 60-90 descriptors cut up into confetti-like strips of paper and asked to sort them into 3-4 labelled piles which represented related potential categories of description. At least two, generally four and up to ten pairs of teachers sorted each set of descriptors. A discard pile was provided for descriptors for which the teachers couldn't decide the category, or found unclear or unhelpful. In addition teachers were asked to indicate which descriptors they found particularly clear and useful and which were relevant to their particular sector. This data was coded in descriptor item histories.
III Quantitative Validation: Main data collection & Rasch scaling
Data Collection Instruments: A selection of the best descriptors was scaled in a questionnaire survey in which class teachers assessed learners representative of the spread of ability in their classes. Assessment was of two kinds:
1. Teachers' assessment of the proficiency of 10 learners in their classes using 50 item questionnaires;
2. Teachers' assessment of video performances of selected learners in the survey using "mini questionnaires" of appropriate descriptors selected from the main questionnaires.
Subjects: Exactly 100 teachers took part in the English pilot in 1994, most rating 5 learners from two different classes (total 945 learners). In the second year 192 teachers (81 French teachers, 65 German teachers, 46 English teachers) each rated 10 learners, most rating 10 learners from the same class. In each year about a quarter of the teachers were teaching their mother tongue, and the main educational sectors were represented as follows:
Year 1: Lower Sec: 35%; Upper Sec: 19%; Vocational: 15%; Adult: 31%
Year 2: Lower Sec: 24%; Upper Sec: 31%; Vocational: 17%; Adult: 28%
Analysis Methodology: The analysis method was an adaptation of classic Rasch item banking in which a series of tests (here questionnaires) are linked by common items called "anchor items" in order to create a common item scale (Wright and Stone 1979). Once the descriptors had been calibrated in rank order onto an arithmetical scale in this way, the next task was to establish "cut-off points" between bands or levels on that scale. As Pollitt (1991: 90) shows, there is a relationship between the reliability of a set of data and the number of levels it will bear. In this case the scale reliability of 0.97 justified 10 levels. The first step taken therefore was to set provisional cut-offs at approximately equal intervals to create a 10 band scale. The second step was to fine tune these cut-offs in relation to descriptor wording in case there were threshold effects between levels. Finally the coherence in the scaling of the elements contained in the descriptors was confirmed.
In Year 2, one third of the descriptors used had already been calibrated in Year 1.
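The banding step described above (provisional cut-offs at roughly equal intervals along the calibrated scale) can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual procedure: the scale range of -5.0 to +5.0 logits, the function names `equal_interval_cutoffs` and `band_of`, and the sample difficulty values are all assumptions introduced here, and the subsequent fine-tuning of cut-offs against descriptor wording is not modelled.

```python
import bisect


def equal_interval_cutoffs(low, high, n_bands):
    """Return the n_bands - 1 interior cut-offs dividing [low, high] into equal bands."""
    if n_bands < 2:
        raise ValueError("need at least two bands")
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / n_bands
    return [low + width * i for i in range(1, n_bands)]


def band_of(difficulty, cutoffs):
    """Return the 0-based band index of a calibrated difficulty value."""
    # bisect_right counts the cut-offs that are <= difficulty,
    # so a value sitting exactly on a cut-off falls in the higher band.
    return bisect.bisect_right(cutoffs, difficulty)


# Hypothetical scale range in logits (an assumption, not a figure from the project).
cutoffs = equal_interval_cutoffs(-5.0, 5.0, 10)

# Hypothetical descriptor difficulties, for illustration only.
for value in (-4.5, 0.2, 4.7):
    print(value, "-> band", band_of(value, cutoffs))
```

Keeping the cut-offs in a plain sorted list means that a later adjustment of an individual threshold only requires editing one number, provided the list stays in ascending order.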
The main aim of Year 2 was to see if the difficulty values obtained for descriptors in relation to English in Year 1 would be replicated in relation to French, German and English in Year 2. Parallel analyses were run, one anchoring the items from Year 1 back to their 1994 values in order to link the two analyses onto the same scale, and the other allowing the 1994 items to "float" and establish new values. Large numbers of sub-analyses were also run to see if the different content strands would be better analysed separately and to investigate the way in which descriptors were interpreted in different educational sectors, for different target languages and in different language regions. In the event it was discovered that:
1. Reading did not appear to "fit" a construct dominated by the overlapping concepts of speaking and interaction and needed to be analysed separately, with the resultant Reading Scale being equated subsequently to the main scale.
2. Social-cultural competence could not be scaled in this way, or at least not in the same data set as descriptors for language proficiency, or at least not with descriptors of the quality available.
3. Teachers (even from Berufsschule) were unable to use consistently descriptors for work-related aspects of proficiency which described activity beyond their direct classroom experience, e.g. Telephoning; Attending Formal Meetings; Giving Formal Presentations; Writing Reports & Essays; Formal Correspondence.
4. Descriptors formulated negatively tended to be used inconsistently. Pronunciation, which is often conceived in negative terms - the strength of accent, the amount of foreignness causing comprehension difficulties - was therefore problematic. Descriptors for Pronunciation were also used inconsistently when applied to several languages.
5. While there was a degree of variation in the difficulty values obtained for certain descriptors in different sectors, the statistical significance of such variation in relation to individual descriptors had to be treated with caution. Overall, such differences cancelled each other out, and the scale of levels was equally valid for all languages and sectors concerned.
6. The difficulty values from Year 1 (English) proved to be very stable. Only eight of the 61 1994 descriptors reused in 1995 were interpreted in a significantly different way. After the removal of those eight descriptors, the values of the 103 Listening & Speaking items used in 1995 correlated 0.99 (Pearson) when analysed (a) entirely separately from 1994 and (b) with the 1994 items anchored to their 1994 values.
This is very satisfactory when one considers that:
The 1994 difficulty values were based on judgements by 100 English teachers, whilst the ratings dominating the 1995 construct were those of the French and German teachers;
The questionnaire forms used for data collection in 1994 and 1995 were completely different;
The majority of teachers in 1995 were using the descriptors in French or German, not English.
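The stability check reported in point 6 amounts to correlating two sets of difficulty estimates for the same descriptors: one from the analysis in which items were left to "float" and one from the analysis anchored to the 1994 values. A minimal Python sketch of that comparison follows. The function name `pearson` and the logit values are hypothetical, chosen only for illustration, and are not taken from the project data.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical difficulty estimates (logits) for the same five descriptors.
floating = [-2.1, -0.8, 0.1, 1.3, 2.6]  # items allowed to "float"
anchored = [-2.0, -0.9, 0.2, 1.2, 2.7]  # items anchored to earlier values

r = pearson(floating, anchored)
print(f"r = {r:.3f}")
```

A high coefficient here only shows that the two analyses rank and space the descriptors in nearly the same way; it does not by itself show that individual descriptors were interpreted identically, which is why the paper also reports the number of descriptors removed for significant drift.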
III Product: A bank of classified, calibrated descriptors
The categories for which descriptors were successfully scaled are as follows:
Communicative Activities
Listening: Overall Listening Comprehension
  Receptive: Listening to Announcements & Instructions; Listening as a Member of an Audience; Listening to Radio & Audio Recordings; Watching TV & Film
  Interactive: Comprehension in Spoken Interaction
Reading: Overall Reading Comprehension; Reading Instructions; Reading for Information; Reading for Orientation (scanning)
Interaction: Transactional: Service Encounters & Negotiations; Information Exchange; Interviewing & Being Interviewed; Notes, Messages & Forms
Interaction: Interpersonal: Conversation; Discussion; Personal Correspondence
Production (Spoken) (Sustained Monologue): Describing Experience; Putting a Case; Processing and Summarising
Strategies
Receptive Strategies: Deducing Meaning from Context (only 2 descriptors)
Interaction Strategies: Taking the Turn; Cooperating; Asking for Clarification
Production Strategies: Planning; Compensating; Repairing & Monitoring
Qualitative Aspects of Language Proficiency
Pragmatic (Language Use): Fluency; Flexibility; Coherence; Thematic Development; Precision
Linguistic (Language Resources):
  Range (Knowledge): General Range; Vocabulary Range
  Accuracy (Control): Grammatical Accuracy; Vocabulary Control
When one looks at the vertical scale of calibrated items, it is striking how far descriptors on similar issues land adjacent to each other although they were used on different questionnaires. Indeed, the levels produced by the cut-off points show a remarkable consistency of key characteristics.
Space does not permit a detailed discussion of the whole scale, but taking two levels as an example:
Threshold is intended to represent the Council of Europe specification for a visitor to a foreign country and is perhaps best characterised by two features:
Firstly, the ability to maintain interaction and get across what you want to in a range of contexts: generally follow the main points of extended discussion around him/her, provided speech is clearly articulated in standard dialect; give or seek personal views and opinions in an informal discussion with friends; express the main point he/she wants to make comprehensibly; exploit a wide range of simple language flexibly to express much of what he or she wants to; maintain a conversation or discussion but may sometimes be difficult to follow when trying to say exactly what he/she would like to; keep going comprehensibly, even though pausing for grammatical and lexical planning and repair is very evident, especially in longer stretches of free production.
Secondly, the ability to cope flexibly with less straightforward situations in everyday life: cope with less routine situations on public transport; deal with most situations likely to arise when making travel arrangements through an agent or when actually travelling; make a complaint; enter unprepared into conversations on familiar topics; ask someone to clarify or elaborate what they have just said.
The next main level appears to represent a significant shift, offering some justification for the new name Vantage. According to Trim (personal communication) the intention is, as with Threshold and Waystage, to find a name which hasn't been used before and which symbolises something central to the level concerned. In this case, the metaphor is that having been progressing slowly but steadily across the intermediate plateau, the learner finds he/she has arrived somewhere. He/she acquires a new perspective and can look around him/her in a new way.
This concept does seem to be borne out to a considerable extent by the descriptors calibrated here, which represent quite a break with the content scaled so far.
At the lower end of the band there is a focus on effective argument: account for and sustain his opinions in discussion by providing relevant explanations, arguments and comments; explain a viewpoint on a topical issue giving the advantages and disadvantages of various options; construct a chain of reasoned argument; develop an argument giving reasons in support of or against a particular point of view; explain a problem and make it clear that his counterpart in a negotiation must make a concession; speculate about causes, consequences, hypothetical situations; take an active part in informal discussion in familiar contexts, commenting, putting point of view clearly, evaluating alternative proposals and making and responding to hypotheses.
Running right through the band are two new focuses:
Firstly, being able to more than hold your own in social discourse: e.g. understand in detail what is said to him/her in the standard spoken language even in a noisy environment; initiate discourse, take his/her turn when appropriate and end conversation when he/she needs to, though he/she may not always do this elegantly; use stock phrases (e.g. "That's a difficult question to answer") to gain time and keep the turn whilst formulating what to say; interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without imposing strain on either party; adjust to the changes of direction, style and emphasis normally found in conversation; sustain relationships with native speakers without unintentionally amusing or irritating them or requiring them to behave other than they would with a native speaker.
Secondly, there is a new degree of language awareness, especially self monitoring: correct mistakes if they have led to misunderstandings; make a note of "favourite mistakes" and consciously monitor speech for them; generally correct slips and errors if he becomes conscious of them.
IV Exploitation Formats
There would appear to be three principal ways of physically organising descriptors on paper though each has endless variations: (1) a holistic scale: bands on top of one another; (2) a profiling grid: categories defined at a series of bands; (3) a checklist: individual descriptors each presented as a separate criterion statement. These three formats exploiting descriptors calibrated in the project are all used in the Language Portfolio. They are illustrated in the appendix as follows:
Scale:
1. A global scale - all skills, 6 Common Reference Levels adopted for Council of Europe Framework; also used in the Language Portfolio as a yardstick for situating qualifications.
2. A holistic scale for spoken interaction, showing the full 10 level empirical scale developed in the research project. The bottom level "Tourist" is an ability to perform specific isolated tasks, and is not presented as a level in the Council of Europe Framework; the "Plus Levels" are referred to in the Framework as an option for particular contexts, but the political consensus is to adopt the 6 Common Reference Levels.
Grid:
1. A grid profiling proficiency in communicative activities, centred on Threshold Level. Shows only a limited range of level, defines "Plus Levels".
2. A grid profiling qualitative aspects of proficiency used to rate video performances at the final conference of the research project in September 1996. Shows the full range of levels, but doesn't define "Plus Levels" due partly to fears of causing cognitive overload in what was an initiation session.
Checklist:
1. A self assessment checklist taken from the draft of the Portfolio, see below. Contains only items calibrated at this level, reformulated (if necessary) for self assessment.
References:
Alderson, J.C. 1991: Bands and scores. In Alderson and North: 71-86.
Alderson, J.C. and North, B. (eds.) 1991: Language testing in the 1990s. Modern English Publications/British Council, London, Macmillan.
Association of Language Testers in Europe (ALTE) 1994: A description of the framework of the Association of Language Testers in Europe. Cambridge, ALTE Document 4.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford, OUP.
Bachman, L. and Palmer, A. 1982: The construct validation of some components of communicative proficiency. TESOL Quarterly, 16/4, 449-464.
Bachman, L.F. and Savignon, S.J. 1986: The evaluation of communicative language proficiency: a critique of the ACTFL oral interview. Modern Language Journal, 70/4, 380-90.
Brindley, G. 1986: The assessment of second language proficiency: issues and approaches. Adelaide, National Curriculum Resource Centre.
Brindley, G. 1991: Defining language ability: the criteria for criteria. In Anivan, S. (ed.) Current developments in language testing. Singapore, Regional Language Centre.
Brumfit, C.J. 1987: Concepts and categories in language teaching methodology. AILA Review, 4: 25-31.
Canale, M. and Swain, M. 1980: Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1/1, 1-47.
Carroll, B.J. and West, R. 1989: ESU (English-speaking union) framework. Performance scales for English language examinations. London, Longman.
Clark, J.L. 1985: Curriculum renewal in second language learning: an overview. Canadian Modern Language Review, 42/2, 342-360.
Council of Europe 1992: Transparency and coherence in language learning in Europe: objectives, assessment and certification. Strasbourg, Council of Europe; the proceedings of the intergovernmental Symposium held at Rüschlikon November 1991 (ed. North, B.).
Council of Europe 1996: Modern languages: learning, teaching, assessment. A common European framework of reference. Draft 2 of a framework proposal. CC-LANG (95) 5 rev IV, Strasbourg, Council of Europe.
Council of Europe 1997: European language portfolio. Proposals for development. CC-LANG (97) 1, Strasbourg, Council of Europe.
Dandonoli, P. and Henning, G. 1990: An investigation of the construct validity of the ACTFL proficiency guidelines and oral interview procedure. Foreign Language Annals, 23/1, 11-22.
De Jong, H.A.L. 1990: Response to Masters: Linguistic theory and psychometric models. In De Jong, H.A.L. and Stevenson, D.K. Individualising the assessment of language abilities. Clevedon, Multilingual Matters: 71-82.
Fulcher, G. 1987: Tests of oral performance: the need for data-based criteria. ELT Journal, 41/4, 287-291.
Fulcher, G. 1993: The construction and validation of rating scales for oral tests in English as a foreign language. PhD thesis, University of Lancaster.
Griffin, P.E. 1989: Monitoring proficiency development in language. Paper presented at the Annual Congress of the Modern Language Teachers Association of Victoria, Monash University, July 10-11 1989.
Languages Lead Body 1992: National standards for languages: units of competence and assessment guidance. UK Languages Lead Body, July 1992.
Lantolf, J. and Frawley, W. 1985: Oral proficiency testing: a critical analysis. Modern Language Journal, 69/4, 337-345.
Lantolf, J. and Frawley, W. 1988: Proficiency, understanding the construct. Studies in Second Language Acquisition, 10/2, 181-196.
Linacre, J.M. 1989: Multi-faceted measurement. Chicago, MESA Press.
Liskin-Gasparro, J.E. 1984: The ACTFL proficiency guidelines: a historical perspective. In Higgs, T.C. (ed.) Teaching for proficiency, the organising principle. Lincolnwood (Ill.), National Textbook Company: 11-42.
Lowe, P. 1983: The ILR oral interview: origins, applications, pitfalls and implications. Unterrichtspraxis, 16/2, 230-244.
Luoma, S. 1993: Validating the (Finnish) certificates of foreign language proficiency. Paper presented at the 15th Language Testing Research Colloquium, Cambridge and Arnhem, 2-4 August 1993.
Matthews, M. 1990: The measurement of productive skills. Doubts concerning the assessment criteria of certain public examinations. ELT Journal, 44/2, 117-120.
McNamara, T. 1995: Modelling performance: opening Pandora's box. Applied Linguistics, 16/2, 159-179.
North, B. 1993a: Transparency, coherence and washback in language assessment. In Sajavaara, K., Takala, S., Lambert, D. and Morfit, C. (eds.) 1994: National foreign language policies: practices and prospects. Institute for Education Research, University of Jyväskylä: 157-193.
North, B. 1993b: The development of descriptors on scales of proficiency: perspectives, problems, and a possible methodology. NFLC Occasional Paper, National Foreign Language Center, Washington D.C., April 1993.
North, B. 1994: Scales of language proficiency: a survey of some existing systems. Strasbourg, Council of Europe.
North, B. 1996: The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Unpublished PhD thesis, Thames Valley University.
North, B. 1997a: The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Paper given at the LTRC 1996, Tampere, Finland. In Huhta, A., Kohonen, V., Kurki-Suonio, L. and Luoma, S. Current developments and alternatives in language assessment. Jyväskylä, University of Jyväskylä: 423-449.
North, B. 1997b: Perspectives on language proficiency and aspects of competence. Language Teaching, 30/2.
Oscarson, M. 1978/9: Approaches to self-assessment in foreign language learning. Strasbourg, Council of Europe 1978; Oxford, Pergamon 1979.
Oscarson, M. 1984: Self-assessment of foreign language skills: a survey of research and development work. Strasbourg, Council of Europe.
Pienemann, M. and Johnston, M. 1987: Factors influencing the development of language proficiency. (The Multi-dimensional model - summary). In Nunan, D. (ed.) Applying second language acquisition research. Adelaide, National Curriculum Resource Centre: 89-94.
Pollitt, A. 1991: Response to Alderson: Bands and scores. In Alderson and North: 87-94.
Pollitt, A. and Murray, N.L. 1993/1996: What raters really pay attention to. Paper presented at the 15th Language Testing Research Colloquium, Cambridge and Arnhem, 2-4 August 1993. In Milanovic, M. and Saville, N. (eds.) 1996: Performance testing, cognition and assessment. Cambridge, University of Cambridge Local Examinations Syndicate: 74-91.
Richterich, R. and Schneider, G. 1992: Transparency and coherence: why and for whom? In Council of Europe: 43-50.
Schneider, G. and North, B. forthcoming: Assessment and self-assessment of foreign language proficiency at cross-over-points in the Swiss educational system: transparent and coherent description of foreign language competence as assessment, reporting and planning instruments. Bern, National Science Research Council.
Skehan, P. 1984: Issues in the testing of English for specific purposes. Language Testing, 1/2, 202-220.
Skehan, P. 1995: Analysability, accessibility and ability for use. In Cook, G. and Seidlhofer, B. (eds.), Principle and practice in applied linguistics. Oxford, Oxford University Press.
Smith, P.C. and Kendall, J.M. 1963: Retranslation of expectations: an approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47/2, 149-154.
Spolsky, B. 1986: A multiple choice for language testers. Language Testing, 3/2, 147-158.
Spolsky, B. 1993: Testing and examinations in a national foreign language policy. In Sajavaara, K., Takala, S., Lambert, D. and Morfit, C. (eds.) 1994: National foreign language policies: practices and prospects. Institute for Education Research, University of Jyväskylä: 194-214.
Thurstone, L.L. 1928: Attitudes can be measured. American Journal of Sociology, 33: 529-554; cited in Wright, B.D. and Masters, G. 1982: 10-15.
Trim, J.L.M. 1978: Some possible lines of development of an overall structure for a European unit/credit scheme for foreign language learning by adults. Strasbourg, Council of Europe.
Van Ek, J.A. 1986: Objectives for foreign language teaching, volume I: scope. Strasbourg, Council of Europe.
Wilds, C.P. 1975: The oral interview test. In Spolsky, B. and Jones, R.: Testing language proficiency. Washington D.C., Center for Applied Linguistics: 29-44.
Wright, B.D. and Masters, G. 1982: Rating scale analysis: Rasch measurement. Chicago, MESA Press.
Wright, B.D. and Stone, M.H. 1979: Best test design. Chicago, MESA Press.