LANGUAGE PROFICIENCY DESCRIPTORS
Brian North, Eurocentres Foundation, Zürich
This paper is an abbreviated version of a presentation given at the Language Testing Research
Colloquium in Tampere, Finland in 1996 (North 1997a). It reports results from a Swiss
National Science Research Council project (Schneider and North forthcoming) which
developed a scale of language proficiency in the form of a "descriptor bank". The project took
place in two rounds: the first for English (1994), the second for French, German and English
(1995).
Up until now, most scales of language proficiency have been produced by appeal to intuition
and to those scales which already exist rather than to theories of linguistic description or of
measurement. The main aim of the project reported was to use an item-banking scaling
methodology to develop a bank of transparent descriptors of communicative language
proficiency which have known difficulty values. The descriptor bank produced was then
exploited to produce the first edition of the "Common Reference Levels" in the Council of
Europe "Common European Framework" (Council of Europe 1996) and in the self assessment
instruments in a prototype "Language Passport" or "Language Portfolio" recording
achievement in relation to that Framework (Council of Europe 1997). The pilot project for
English conducted in 1994 (Year 1) was the subject of a PhD thesis (North 1996).
In each of the two years, pools of descriptors were produced by analysing available proficiency
scales. Through workshops with representative teachers, the descriptors were then refined into
stand-alone criterion statements considered to be clear, useful and relevant to the sectors
concerned. Selected descriptors presented on questionnaires were then used by participating
teachers to assess the proficiency of learners in their classes. This data was used to scale the
descriptors using the Rasch rating scale model. The difficulty estimates for the descriptors
produced in relation to English in 1994 proved remarkably stable in relation to French, German
and English in 1995.
Introduction
During the past decade, two influences have led to the increasing use of scales of language
proficiency. The first influence has been a general movement towards more transparency in
educational systems. The second has been the move towards greater international integration,
particularly in Europe, which places a higher value on being able to state what the attainment
of a given language objective means in practice. The result is that whereas 10 or 15 years ago
scales which were not directly or indirectly related back to the 1950s US Foreign Service
Institute (FSI) scale (Wilds 1975) were quite rare, the last few years have seen quite a
proliferation of European scales which do not take American scales as their starting point.
Some examples are: the British National Language Standards (Languages Lead Body 1992);
the Eurocentres Scale of Language Proficiency (Eurocentres 1983-92); the Finnish Scale of
Language Proficiency (Luoma 1993) and the ALTE Framework (Association of Language
Testers in Europe 1994).
In Section I this paper considers scales of language proficiency: the functions they can fulfil
and common criticisms of them. Section II then outlines the study which is the subject of the
paper; Section III briefly presents the product: a bank of classified, calibrated descriptors.
Finally Section IV briefly presents the formats in which the descriptors in the bank can be
exploited.
I Scales of Language Proficiency
I Functions
Many scales of proficiency represent what Bachman (1990: 325-330) has described as the
"real-life" or behavioural approach to assessment. This is because they try to give a picture of
what a learner at a particular level can do in the real world. Other scales take what Bachman
describes as an "interactive-ability" approach, attempting to describe the aspects of the
learner's language ability being sampled. Alderson (1991: 71-76) uses a three-way
classification focusing upon the purposes for which scales are written and used. His expression
for Bachman's "interactive-ability" approach is "assessor-oriented" since such scales are
intended to bring consistency to the rating process. He identifies in addition a "user-oriented"
function to give meaning to scores in reporting results (usually in "real life" terms) as well as a
"constructor-oriented" function to provide guidance in the construction of tests or syllabuses,
again usually defined in terms of "real life" tasks. Matthews (1990) and Pollitt and Murray
(1993/1996) point out that complex analytic grids of "interactive-ability" descriptors intended
for profiling can confuse rather than aid the actual assessment process. Pollitt and Murray
(ibid) go on to suggest that such profiling grids are, rather, "diagnosis-oriented". As Alderson
points out, problems arise when a scale developed for one function is used for another. In an
educational framework, there will be circumstances in which descriptors relating to "real life"
tasks are appropriate; there will be circumstances in which descriptors relating to qualitative
aspects of a person's proficiency ("interactive-ability") will be appropriate. As Pollitt and
Murray point out, rich detail may be appropriate for some functions, but not for others.
Scales offering definitions of learner proficiency at successive bands of ability are becoming
more popular because they can be used:
1. To provide "stereotypes" against which learners can compare their self image and
roughly evaluate their position (Trim 1978; Oscarson 1978, 1984).
2. To increase the reliability of subjectively judged ratings, providing a common standard
and meaning for such judgements (Alderson 1991).
3. To provide guidelines for test construction (Dandonoli and Henning 1990; Alderson
1991).
4. To report results from teacher assessments, scored tests, rated tests and self assessment
all in terms of the same instrument - whilst avoiding the spurious suggestion of
precision given by a scored scale (e.g. 1-1,000) (Alderson 1991; Griffin 1989).
5. To provide coherent internal links within an institution between pre-course testing,
syllabus planning, materials organisation, progress assessment and certification (North
1993a).
6. To establish a framework of reference which can describe achievement in a complex
educational system in terms meaningful to all the different partners involved (Trim
1978; Brindley 1986, 1991; Richterich and Schneider 1992, Council of Europe 1996).
7. To enable comparison between systems or populations using a common metric or
yardstick (Lowe 1983, Liskin-Gasparro 1984; Bachman and Savignon 1986; Carroll
B.J. and West 1989).
II Criticisms
A definition by John Clark (1985: 348) captures the main weakness of the genre:
"descriptions of expected outcomes, or impressionistic etchings of what proficiency
might look like as one moves through hypothetical points or levels on a developmental
continuum".
Put another way, there is no guarantee that the description of proficiency offered in a scale is
accurate, valid or balanced. Learners may actually be able to interpret a scale remarkably
successfully for self assessment; correlations of 0.74-0.77 with test/interview results are usual in
relation to the Eurocentres scale. Raters may actually be trained to think the same; inter-rater
reliability correlations of over 0.8 are common in the literature and correlations over 0.9 are
reported in "studio conditions". But the fact that people may be able to use such instruments
with surprising effectiveness doesn't necessarily mean that what the scales say is valid.
Furthermore, with the vast majority of scales of language proficiency, it is far from clear on
what basis it was decided to put certain statements at Level 3 and others at Level 4 anyway.
Another line of criticism has been that many scales of proficiency cannot be regarded as
offering criterion-referenced assessment although they generally claim to do so. Firstly, the
meaning of the descriptors at one level is often dependent on a reading of the descriptors at
other levels. Secondly, the formulation of the descriptors is itself sometimes overtly relativistic
(e.g. "better than Level 2) or norm-referenced (e.g. using expressions like "poor", "weak",
"moderate"). Since the descriptors on most scales are not developed independently to check
that they are actually saying something, it is not surprising that many scale descriptors fail to
present stand-alone criteria capable of generating a Yes / No response (Skehan 1984: 217).
Most scales of language proficiency appear in fact to have been produced pragmatically by
appeal to intuition, the local pedagogic culture and those scales to which the author had access.
In the process of development it is rare that much consideration is given to the following
points:
1. using a model of communicative competence and/or language use;
2. checking that the categories and the descriptor-formulations are relevant and make
sense to users, as is standard practice in behavioural scaling in other fields (Smith and
Kendall 1963; see North 1993b for a review);
3. using a model of measurement;
4. avoiding the dangers of lifting rank ordered scale content from one context and then
using it inappropriately in another (see Spolsky 1986).
Whilst an intuitive approach may be appropriate in the development of scales for use in a low
stakes context in which a known group of assessors rate a familiar population of learners, it has
been criticised in relation to the development of national framework scales (e.g. Skehan 1984,
Fulcher 1987, 1993 in relation to the British ELTS; Brindley 1986, 1991, Pienemann and
Johnston 1987 in relation to the Australian ASLPR; Bachman and Savignon 1986, Lantolf and
Frawley 1985, 1988, Spolsky 1986, 1993 in relation to the American ACTFL). As De Jong
(1990: 72) has put it: "the acceptability of these levels, grades and frameworks seems to rely
primarily on the authority of the scholars involved in their definition, or on the political status
of the bodies that control and promote them." On what basis have the authors of the scales put
particular content at one level rather than another? Thurstone highlighted this problem as long
ago as 1928:
"the scale values of the statements should not be affected by the
opinions of the people who helped to construct it (the scale).
This may turn out to be a severe test in practice, but the scaling
method must stand such a test before it can be accepted as being
more than a description of the people who construct the scale"
(Thurstone 1928: 547-8, cited in Wright and Masters 1982: 15).
In the project reported, a rigorous approach to both qualitative and quantitative scaling was
taken. The methodology adopted:
• thoroughly documented the field of language proficiency scales;
• developed stand-alone criterion statements describing concrete aspects of language
proficiency related to the model in the emerging Council of Europe "Common
European Framework" (Council of Europe 1996);
• validated these descriptors in an extensive series of workshops with teachers;
• calibrated each of the statements to a mathematical scale with a measurement model on
the basis of data from teacher assessments of their students at the end of the school year
(Wright and Masters 1982; Linacre 1989).
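For reference, the rating scale model used for such calibration can be stated in its standard form (Andrich's formulation, as presented in Wright and Masters 1982); this is the general model rather than anything specific to the project:

$$
P(X_{ni} = k) \;=\; \frac{\exp \sum_{j=0}^{k} (\theta_n - \delta_i - \tau_j)}{\sum_{m=0}^{M} \exp \sum_{j=0}^{m} (\theta_n - \delta_i - \tau_j)}, \qquad \tau_0 \equiv 0,
$$

where θ_n is the ability of learner n, δ_i is the difficulty (scale value) of descriptor i, and τ_j is the threshold between rating categories j-1 and j, shared by all descriptors on the questionnaire.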
II The Study
The focus in the pilot for English was on spoken interaction, including comprehension in
interaction, and on spoken production (extended monologue). Some descriptors were also
included for written interaction (letters, questionnaire and form-filling) and for written
production (report writing, essays etc.). In 1995 (Year 2) the survey was extended to French
and German as well as English and descriptors were added for reading and for non-interactive
listening. The project took place in three steps in each of the two years:
I Comprehensive documentation: Creation of a descriptor pool
A survey of existing scales of language proficiency (North 1994) provided a starting point.
Forty-one proficiency scales were pulled apart, with the definition for each level from each
scale assigned to a provisional level. Each descriptor was then split up into sentences which
were then each allocated to a provisional category. When adjacent sentences were part of the
same point, they were edited into a compound sentence. In Year 1 the creation of the descriptor
pool for the project coincided with the period in which the Council of Europe Framework
authoring group were developing the descriptive scheme. The descriptive scheme draws on the
area of consensus between existing models of communicative competence and language use
(e.g. Canale and Swain 1980, 1981; Van Ek 1986; Bachman 1990; Skehan 1995; McNamara
1995). In addition, an organisation of language activities under the headings Reception,
Interaction and Production, developing an idea from Brumfit (1987), was adopted. Space does
not permit detailed consideration of the scheme; readers are referred to North (1997b) for a
shortened version of a study produced at the time and to the actual document (Council of
Europe 1996). The elimination of repetition, negative formulation and norm-referenced
statements now meaningless away from their co-text produced a pool of approximately 1,000
stand-alone, positively worded criterion statements in each of the two years.
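As a rough illustration of the pool-building step, the sketch below shows one possible data model for pool entries. It is a hypothetical reconstruction: the field names, the automatic sentence split and the example values are invented for illustration, since the project carried out this work editorially rather than automatically.

```python
import re
from dataclasses import dataclass

@dataclass
class PoolEntry:
    text: str                  # one candidate criterion statement
    source_scale: str          # which of the 41 source scales it came from
    provisional_level: int     # level assigned on entry to the pool
    provisional_category: str  # e.g. "Conversation", "Fluency"

def split_into_sentences(definition: str) -> list:
    """Crude automatic split on sentence boundaries; the project did this editorially."""
    return [s.strip() for s in re.split(r'(?<=[.;])\s+', definition) if s.strip()]

# One level definition from one source scale becomes several pool entries.
definition = ("Can maintain a conversation on familiar topics. "
              "Can ask for clarification when needed.")
pool = [PoolEntry(s, "Eurocentres", 5, "Conversation")
        for s in split_into_sentences(definition)]
for entry in pool:
    print(entry.provisional_level, entry.text)
```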
II Qualitative Validation: Consultation with teachers through workshops
Qualitative validation of the descriptor pool was undertaken through wide consultation with
foreign language teachers representative of the different sectors in the Swiss educational
system. Two techniques were used in each of 32 workshops, each attended by between 4 and 25
teachers.
The first technique was adapted from that reported by Pollitt and Murray (1993/6). Teachers
were asked to discuss which of a pair of learners talking to each other on a video was better,
and to justify their choice. The aim was to elicit the metalanguage teachers used to talk about
qualitative aspects of proficiency and check that these were included in the categories in the
descriptor pool. These discussions were recorded, transcribed in note form, analysed and, if
something new was being said, formulated into descriptors.
The second technique was based on that used by Smith and Kendall (1963). Pairs of teachers
were given a pile of 60-90 descriptors cut up into confetti-like strips of paper and asked to sort
them into 3-4 labelled piles which represented related potential categories of description. At
least two, generally four, and up to ten pairs of teachers sorted each set of descriptors. A discard
pile was provided for descriptors for which the teachers could not decide on a category, or which
they found unclear or unhelpful. In addition, teachers were asked to indicate which descriptors they found
particularly clear and useful and which were relevant to their particular sector. This data was
coded in descriptor item histories.
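The sketch below suggests how such an item history might be accumulated across workshops. The structure and field names are hypothetical, invented for illustration rather than taken from the project's actual coding scheme.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ItemHistory:
    text: str
    category_votes: Counter = field(default_factory=Counter)  # sorting-task piles
    clear_useful: int = 0   # times flagged as particularly clear and useful
    relevant: int = 0       # times flagged as relevant to the teacher's sector
    discarded: int = 0      # times placed on the discard pile

    def record_sort(self, pile: str) -> None:
        """Log one pair of teachers' sorting decision for this descriptor."""
        if pile == "discard":
            self.discarded += 1
        else:
            self.category_votes[pile] += 1

h = ItemHistory("Can ask for clarification about key words not understood.")
for pile in ["Strategies", "Strategies", "Conversation", "discard"]:
    h.record_sort(pile)
h.clear_useful += 2
print(h.category_votes.most_common(1), h.discarded)
```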
III Quantitative Validation: Main data collection & Rasch scaling
Data Collection Instruments: A selection of the best descriptors was scaled in a questionnaire
survey in which class teachers assessed learners representative of the spread of ability in their
classes. Assessment was of two kinds:
1. Teachers' assessment of the proficiency of 10 learners in their classes using 50 item
questionnaires;
2. Teachers' assessment of video performances of selected learners in the survey using
"mini questionnaires" of appropriate descriptors selected from the main questionnaires.
Subjects: Exactly 100 teachers took part in the English pilot in 1994, most rating five learners
from each of two different classes (945 learners in total). In the second year, 192 teachers (81 French
teachers, 65 German teachers, 46 English teachers) each rated 10 learners, most rating all 10
learners from the same class. In each year about a quarter of the teachers were teaching their
mother tongue, and the main educational sectors were represented as follows:
Year 1: Lower Sec: 35%; Upper Sec: 19%; Vocational: 15%; Adult: 31%
Year 2: Lower Sec: 24%; Upper Sec: 31%; Vocational: 17%; Adult: 28%
Analysis Methodology: The analysis method was an adaptation of classic Rasch item banking
in which a series of tests (here questionnaires) are linked by common items called "anchor
items" in order to create a common item scale (Wright and Stone 1979). Once the descriptors
had been calibrated in rank order onto an arithmetical scale in this way, the next task was to
establish "cut-off points" between bands or levels on that scale. As Pollitt (1991: 90) shows
there is a relationship between the reliability of a set of data and the number of levels it will
bear. In this case the scale reliability of 0.97 justified 10 levels. The first step taken therefore
was to set provisional cut-offs at approximately equal intervals to create a 10 band scale. The
second step was to fine tune these cut-offs in relation to descriptor wording in case there were
threshold effects between levels. Finally the coherence in the scaling of the elements contained
in the descriptors was confirmed.
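As a rough sketch of the two steps just described, the following illustrates (a) placing one questionnaire's difficulty estimates on another's scale via a simple mean shift over common anchor items, and (b) setting provisional cut-offs at equal intervals. This is a deliberately simplified stand-in for what Rasch calibration software actually does; the item names and logit values are invented.

```python
def anchor(form_a: dict, form_b: dict) -> dict:
    """Place form B's logit estimates on form A's scale using common (anchor) items.
    Assumes the two forms share at least one item."""
    common = form_a.keys() & form_b.keys()
    shift = sum(form_a[i] - form_b[i] for i in common) / len(common)
    return {item: est + shift for item, est in form_b.items()}

def equal_interval_cutoffs(difficulties: list, n_bands: int) -> list:
    """Provisional cut-offs at equal logit intervals across the calibrated scale."""
    lo, hi = min(difficulties), max(difficulties)
    step = (hi - lo) / n_bands
    return [lo + step * k for k in range(1, n_bands)]

form_a = {"d1": -1.2, "d2": 0.3, "d3": 1.1}   # already-calibrated form
form_b = {"d2": 0.8, "d3": 1.6, "d4": 2.4}    # d2, d3 serve as anchor items
linked = anchor(form_a, form_b)                # d4 now expressed on form A's scale
print(equal_interval_cutoffs(list(form_a.values()) + [linked["d4"]], 10))
```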
In Year 2, one third of the descriptors used had already been calibrated in Year 1. The main
aim of Year 2 was to see if the difficulty values obtained for descriptors in relation to English
in Year 1 would be replicated in relation to French, German and English in Year 2. Parallel
analyses were run, one anchoring the items from Year 1 back to their 1994 values in order to
link the two analyses onto the same scale, and the other allowing the 1994 items to "float" and
establish new values. Large numbers of sub-analyses were also run to see if the different
content strands would be better analysed separately and to investigate the way in which
descriptors were interpreted in different educational sectors, for different target languages and
in different language regions.
In the event it was discovered that:
1. Reading did not appear to "fit" a construct dominated by the overlapping concepts of
speaking and interaction and needed to be analysed separately, with the resultant
Reading Scale being equated subsequently to the main scale.
2. Socio-cultural competence could not be scaled in this way, or at least not in the same
data set as descriptors for language proficiency, or at least not with descriptors of the
quality available.
3. Teachers (even from Berufsschule) were unable to use descriptors consistently for
work-related aspects of proficiency which described activity beyond their direct
classroom experience, e.g. Telephoning; Attending Formal Meetings; Giving Formal
Presentations; Writing Reports & Essays; Formal Correspondence.
4. Descriptors formulated negatively tended to be used inconsistently. Pronunciation,
which is often conceived in negative terms - the strength of accent, the amount of
foreignness causing comprehension difficulties - was therefore problematic. Descriptors
for Pronunciation were also used inconsistently when applied to several languages.
5. While there was a degree of variation in the difficulty values obtained for certain
descriptors in different sectors, the statistical significance of such variation in relation
to individual descriptors had to be treated with caution. Overall, such differences
cancelled each other out, and the scale of levels was equally valid for all languages and
sectors concerned.
6. The difficulty values from Year 1 (English) proved to be very stable. Only eight of the
61 1994 descriptors reused in 1995 were interpreted in a significantly different way.
After the removal of those eight descriptors, the values of the 103 Listening &
Speaking items used in 1995 correlated 0.99 (Pearson) when analysed (a) entirely
separately from 1994 and (b) with the 1994 items anchored to their 1994 values. This is
very satisfactory when one considers that:
• the 1994 difficulty values were based on judgements by 100 English teachers, whilst
the ratings dominating the 1995 construct were those of the French and German
teachers;
• the questionnaire forms used for data collection in 1994 and 1995 were completely
different;
• the majority of teachers in 1995 were using the descriptors in French or German, not
English.
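As a toy illustration of the kind of replication check reported in point 6, the following computes a Pearson correlation between two vectors of descriptor difficulty estimates, one from a "floating" re-analysis and one from an anchored analysis. The values below are invented for illustration, not the project's data.

```python
from math import sqrt

def pearson(xs: list, ys: list) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

floating = [-2.1, -0.9, 0.2, 1.3, 2.6]  # difficulties, re-estimated freely
anchored = [-2.0, -1.0, 0.3, 1.2, 2.7]  # difficulties, anchored to 1994 values
print(round(pearson(floating, anchored), 3))
```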
III Product: A bank of classified, calibrated descriptors
The categories for which descriptors were successfully scaled are as follows:
Communicative Activities

Listening:
    Overall Listening Comprehension
    Receptive:
        Listening to Announcements & Instructions
        Listening as a Member of an Audience
        Listening to Radio & Audio Recordings
        Watching TV & Film
    Interactive:
        Comprehension in Spoken Interaction
Reading:
    Overall Reading Comprehension
    Reading Instructions
    Reading for Information
    Reading for Orientation (scanning)
Interaction:
    Transactional:
        Service Encounters & Negotiations
        Information Exchange
        Interviewing & Being Interviewed
        Notes, Messages & Forms
    Interpersonal:
        Conversation
        Discussion
        Personal Correspondence
Production (Spoken, Sustained Monologue):
    Describing Experience
    Putting a Case
    Processing and Summarising

Strategies

Receptive Strategies:
    Deducing Meaning from Context (only 2 descriptors)
Interaction Strategies:
    Taking the Turn
    Cooperating
    Asking for Clarification
Production Strategies:
    Planning
    Compensating
    Repairing & Monitoring

Qualitative Aspects of Language Proficiency

Pragmatic (Language Use):
    Fluency
    Flexibility
    Coherence
    Thematic Development
    Precision
Linguistic (Language Resources):
    Range (Knowledge):
        General Range
        Vocabulary Range
    Accuracy (Control):
        Grammatical Accuracy
        Vocabulary Control
When one looks at the vertical scale of calibrated items, it is striking to what extent
descriptors on similar issues land adjacent to each other, although they were used on different
questionnaires. Indeed, the levels produced by the cut-off points show a remarkable
consistency of key characteristics. Space does not permit a detailed discussion of the whole
scale, but taking two levels as an example:
Threshold is intended to represent the Council of Europe specification for a visitor to a foreign
country and is perhaps best characterised by two features:
Firstly, the ability to maintain interaction and get across what you want to in a range of
contexts:
• generally follow the main points of extended discussion around him/her, provided
speech is clearly articulated in standard dialect;
• give or seek personal views and opinions in an informal discussion with friends;
• express the main point he/she wants to make comprehensibly;
• exploit a wide range of simple language flexibly to express much of what he or she
wants to;
• maintain a conversation or discussion but may sometimes be difficult to follow when
trying to say exactly what he/she would like to;
• keep going comprehensibly, even though pausing for grammatical and lexical planning
and repair is very evident, especially in longer stretches of free production.
Secondly, the ability to cope flexibly with less straightforward situations in everyday life:
• cope with less routine situations on public transport;
• deal with most situations likely to arise when making travel arrangements through an
agent or when actually travelling;
• make a complaint;
• enter unprepared into conversations on familiar topics;
• ask someone to clarify or elaborate what they have just said.
The next main level appears to represent a significant shift, offering some justification for the
new name Vantage. According to Trim (personal communication) the intention is, as with
Threshold and Waystage, to find a name which hasn't been used before and which symbolises
something central to the level concerned. In this case, the metaphor is that, having
progressed slowly but steadily across the intermediate plateau, the learner finds he/she has
arrived somewhere. He/she acquires a new perspective and can look around him/her in a new
way. This concept does seem to be borne out to a considerable extent by the descriptors
calibrated here, which represent quite a break with the content scaled so far.
At the lower end of the band there is a focus on effective argument:
• account for and sustain his opinions in discussion by providing relevant explanations,
arguments and comments;
• explain a viewpoint on a topical issue giving the advantages and disadvantages of
various options;
• construct a chain of reasoned argument;
• develop an argument giving reasons in support of or against a particular point of view;
• explain a problem and make it clear that his counterpart in a negotiation must make a
concession;
• speculate about causes, consequences, hypothetical situations;
• take an active part in informal discussion in familiar contexts, commenting, putting
point of view clearly, evaluating alternative proposals and making and responding to
hypotheses.
Running right through the band are two new focuses:
Firstly, being able to more than hold your own in social discourse, e.g.:
• understand in detail what is said to him/her in the standard spoken language even in a
noisy environment;
• initiate discourse, take his/her turn when appropriate and end conversation when
he/she needs to, though he/she may not always do this elegantly;
• use stock phrases (e.g. "That's a difficult question to answer") to gain time and keep the
turn whilst formulating what to say;
• interact with a degree of fluency and spontaneity that makes regular interaction with
native speakers quite possible without imposing strain on either party;
• adjust to the changes of direction, style and emphasis normally found in conversation;
• sustain relationships with native speakers without unintentionally amusing or irritating
them or requiring them to behave other than they would with a native speaker.
Secondly, there is a new degree of language awareness, especially self monitoring:
• correct mistakes if they have led to misunderstandings;
• make a note of "favourite mistakes" and consciously monitor speech for them;
• generally correct slips and errors if he/she becomes conscious of them.
IV Exploitation Formats
There would appear to be three principal ways of physically organising descriptors on paper,
though each has endless variations: (1) a holistic scale: bands one on top of another; (2) a profiling
grid: categories defined at a series of bands; (3) a checklist: individual descriptors each
presented as a separate criterion statement. These three formats exploiting descriptors
calibrated in the project are all used in the Language Portfolio. They are illustrated in the
appendix as follows:
Scale:
1. A global scale - all skills, 6 Common Reference Levels adopted for Council of Europe
Framework; also used in the Language Portfolio as a yardstick for situating
qualifications.
2. A holistic scale for spoken interaction, showing the full 10-level empirical scale
developed in the research project. The bottom level "Tourist" represents an ability to
perform specific isolated tasks, and is not presented as a level in the Council of
Europe Framework; the "Plus Levels" are referred to in the Framework as an option
for particular contexts, but the political consensus is to adopt the 6 Common Reference
Levels.
Grid:
1. A grid profiling proficiency in communicative activities, centred on Threshold Level.
Shows only a limited range of levels, and defines "Plus Levels".
2. A grid profiling qualitative aspects of proficiency used to rate video performances at
the final conference of the research project in September 1996. Shows the full range of
levels, but doesn't define "Plus Levels" due partly to fears of causing cognitive overload
in what was an initiation session.
Checklist:
1. A self assessment checklist taken from the draft of the Portfolio, see below. Contains
only items calibrated at this level, reformulated (if necessary) for self assessment.
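As a rough sketch of how these three formats could draw on the same bank, the following shows one possible (hypothetical) data model: a checklist is a filter on level, and a grid cell is a filter on category and level. The example descriptors are drawn from the lists above; the field names and functions are invented for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    text: str
    category: str   # e.g. "Conversation", "Discussion", "Fluency"
    level: int      # band assigned by the cut-off points

bank = [
    Descriptor("Can enter unprepared into conversations on familiar topics.",
               "Conversation", 5),
    Descriptor("Can construct a chain of reasoned argument.", "Discussion", 7),
]

def checklist(bank: list, level: int) -> list:
    """Format (3): every descriptor at one level as a separate criterion statement."""
    return [d.text for d in bank if d.level == level]

def grid_cell(bank: list, category: str, level: int) -> str:
    """Format (2): one cell of a profiling grid (category x level)."""
    return " ".join(d.text for d in bank
                    if d.category == category and d.level == level)

print(checklist(bank, 5))
print(grid_cell(bank, "Discussion", 7))
```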
References:
Alderson, J.C. 1991: Bands and scores. In Alderson and North: 71-86.
Alderson, J.C. and North, B. 1991: (eds.): Language testing in the 1990s: Modern English
Publications/British Council, London, Macmillan.
Association of Language Testers in Europe (ALTE) 1994: A description of the framework
of the Association of Language Testers in Europe. Cambridge, ALTE Document 4.
Bachman, L.F. 1990: Fundamental considerations in language testing, Oxford, OUP.
Bachman L. & Palmer A. 1982: The construct validation of some components of
communicative proficiency TESOL Quarterly 16/4: 449-464.
Bachman, L.F. and Savignon S.J. 1986: The evaluation of communicative language
proficiency: a critique of the ACTFL oral interview. Modern Language Journal, 70/4, 380-90.
Brindley, G. 1986: The assessment of second language proficiency: issues and approaches,
Adelaide. National Curriculum Resource Centre.
Brindley, G. 1991: Defining language ability: the criteria for criteria. In Anivan, S. (ed.)
Current developments in language testing, Singapore, Regional Language Centre.
Brumfit, C.J. 1987: Concepts and categories in language teaching methodology. AILA Review,
4: 25-31.
Canale, M. and Swain, M. 1980: Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics, 1/1, 1-47.
Carroll, B.J. and West, R. 1989: ESU (English-Speaking Union) framework. Performance
scales for English language examinations. London: Longman.
Clark, J.L. 1985: Curriculum renewal in second language learning: an overview. Canadian
Modern Language Review, 42/2, 342-360.
Council of Europe 1992: Transparency and coherence in language learning in Europe:
objectives, assessment and certification. Strasbourg, Council of Europe; the proceedings of the
intergovernmental Symposium held at Rüschlikon November 1991 (ed. North, B.).
Council of Europe 1996: Modern languages: learning, teaching, assessment. A common
European framework of reference. Draft 2 of a framework proposal. CC-LANG (95) 5 rev IV,
Strasbourg, Council of Europe.
Council of Europe 1997: European language portfolio. Proposals for development. CC-LANG
(97) 1, Strasbourg, Council of Europe.
Dandonoli, P. and Henning, G. 1990: An investigation of the construct validity of the ACTFL
proficiency guidelines and oral interview procedure. Foreign Language Annals, 23/1, 11-22.
De Jong, H.A.L. 1990: Response to Masters: Linguistic theory and psychometric models, in
De Jong, H.A.L. and Stevenson D.K. Individualising the assessment of language abilities,
Clevedon, Multilingual Matters: 71-82.
Fulcher, G. 1987: Tests of oral performance: the need for data-based criteria. ELT Journal,
41/4, 287-291.
Fulcher, G. 1993: The construction and validation of rating scales for oral tests in English as a
foreign language, PhD thesis, University of Lancaster.
Griffin, P.E. 1989: Monitoring proficiency development in language. Paper presented at the
Annual Congress of the Modern Language Teachers Association of Victoria, Monash
University, July 10-11 1989.
Languages Lead Body 1992: National standards for languages: units of competence and
assessment guidance. UK Languages Lead Body, July 1992.
Lantolf, J. and Frawley, W. 1985: Oral proficiency testing: a critical analysis. Modern
Language Journal, 69/4, 337-345.
Lantolf, J. and Frawley, W. 1988: Proficiency, understanding the construct. Studies in Second
Language Acquisition, 10/2, 181-196.
Linacre, J.M. 1989: Multi-faceted measurement. Chicago, MESA Press.
Liskin-Gasparro, J.E. 1984: The ACTFL proficiency guidelines: a historical perspective. In
Higgs, T.C. (ed.) Teaching for proficiency, the organising principle. Lincolnwood (Ill.):
National Textbook Company: 11-42.
Lowe, P. 1983: The IRL oral interview: origins, applications, pitfalls and implications.
Unterrichtspraxis, 16/2, 230-244.
Luoma, S. 1993: Validating the (Finnish) certificates of foreign language proficiency. Paper
presented at the 15th Language Testing Research Colloquium, Cambridge and Arnhem, 2-4
August 1993.
Matthews, M. 1990: The measurement of productive skills. Doubts concerning the assessment
criteria of certain public examinations. ELT Journal 44/2: 117-120.
McNamara, T. 1995: Modelling performance: opening Pandora's box. Applied Linguistics, 16,
2, 159-179.
North, B. 1993a: Transparency, coherence and washback in language assessment. In
Sajavaara, K., Takala, S., Lambert, D. and Morfit, C. (eds.) 1994: National Foreign
Language policies: practices and prospects. Institute for Education Research, University of
Jyväskylä: 157-193.
North, B. 1993b: The Development of descriptors on scales of proficiency: perspectives,
problems, and a possible methodology. NFLC Occasional Paper, National Foreign Language
Center, Washington D.C., April 1993.
North, B. 1994: Scales of language proficiency: a survey of some existing systems, Strasbourg,
Council of Europe.
North, B. 1996: The development of a common framework scale of descriptors of language
proficiency based on a theory of measurement, Unpublished PhD thesis, Thames Valley
University.
North, B. 1997a: The development of a common framework scale of descriptors of language
proficiency based on a theory of measurement. Paper given at the LTRC 1996, Tampere,
Finland. In Huhta, A., Kohonen, V., Kurki-Suonio, L. and Luoma, S. Current Developments
and Alternatives in Language Assessment. Jyväskylä, University of Jyväskylä: 423-449.
North, B. 1997b: Perspectives on language proficiency and aspects of competence. Language
Teaching, 30/2.
Oscarson, M. 1978/9: Approaches to self-assessment in foreign language learning.
Strasbourg, Council of Europe 1978; Oxford, Pergamon 1979.
Oscarson, M. 1984: Self-assessment of foreign language skills: a survey of research and
development work. Strasbourg, Council of Europe.
Pienemann, M. and Johnston, M. 1987: Factors influencing the development of language
proficiency. (The Multi-dimensional model - summary). In Nunan, D. (ed.) Applying second
language acquisition research. Adelaide, National Curriculum Resource Centre: 89-94.
Pollitt, A. 1991: Response to Alderson: Bands and scores. In Alderson and North: 87-94.
Pollitt, A. and Murray, N.L. 1993/1996: What raters really pay attention to. Paper presented
at the 15th Language Testing Research Colloquium, Cambridge and Arnhem, 2-4 August 1993.
In Milanovic, M. and Saville, N. (eds.) 1996: Performance testing, cognition and assessment.
Cambridge: University of Cambridge Local Examinations Syndicate: 74-91.
Richterich, R. and Schneider, G. 1992: Transparency and coherence: why and for whom? In
Council of Europe: 43-50.
Schneider, G. and North, B. forthcoming: Assessment and self-assessment of foreign
language proficiency at cross-over-points in the Swiss educational system: transparent and
coherent description of foreign language competence as assessment, reporting and planning
instruments. Bern, National Science Research Council.
Skehan, P. 1984: Issues in the testing of English for specific purposes. Language Testing, 1(2),
202-220.
Skehan, P. 1995: Analysability, accessibility and ability for use. In Cook, G. and Seidlhofer,
B. (eds.), Principle and practice in applied linguistics. Oxford: Oxford University Press.
Smith, P.C. and Kendall, J.M. 1963: Retranslation of expectations: an approach to the
construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47/2:
149-154.
Spolsky, B. 1986: A multiple choice for language testers. Language Testing, 3/2, 147-158.
Spolsky, B. 1993: Testing and examinations in a national foreign language policy. In
Sajavaara, K., Takala, S., Lambert, D. and Morfit, C. (eds.) 1994: National foreign
language policies: practices and prospects. Institute for Education Research, University of
Jyväskylä: 194-214.
Thurstone, L.L. 1928: Attitudes can be measured. American Journal of Sociology, 33: 529-554;
cited in Wright, B.D. and Masters, G. 1982: 10-15.
Trim, J.L.M. 1978: Some possible lines of development of an overall structure for a European
unit/credit scheme for foreign language learning by adults. Strasbourg, Council of Europe.
Van Ek, J.A. 1986: Objectives for foreign language teaching, volume I: scope. Strasbourg,
Council of Europe.
Wilds, C.P. 1975: The oral interview test. In Spolsky, B. and Jones, R.: Testing language
proficiency. Washington D.C., Center for Applied Linguistics: 29-44.
Wright, B.D. and Masters, G. 1982: Rating scale analysis: Rasch measurement. Chicago,
MESA Press.
Wright, B.D. and Stone, M.H. 1979: Best test design. Chicago, MESA Press.