Assessment Methods: 1
Assessing assessment methods
– on the reliability of pronunciation tests in EFL
Jolanta Szpyra-Kozłowska, Justyna Frankiewicz,
Marta Nowacka, Lidia Stadnicka
Maria Curie-Skłodowska University, Lublin, Poland
1. Introductory remarks
Teaching another language is inevitably tied to testing. Teachers have to assess
the learners’ linguistic ability, their progress and achievements. In this respect
pronunciation is no different from other language skills; if we regard it as an important
element of communicative competence which deserves a place in language
instruction, we should also be able to evaluate the process of teaching/learning it as
well as its outcome. Yet, as pointed out by Celce-Murcia et al. (1996: 341), ‘in the
existing literature on teaching pronunciation, little attention is paid to issues of testing
and evaluation.’ The major reason for this neglect is the fact that, as argued by
Heaton (1988: 88), speaking, which obviously comprises pronunciation, is far too
complex a skill ‘to permit any reliable analysis to be made for the purpose of objective
testing.’
The present paper addresses the issue of the reliability of the most frequently
employed assessment methods of EFL learners’ pronunciation. First we examine
impression-based pronunciation testing in the internationally recognized Cambridge
English Examinations and point to its various shortcomings. Next we present a report
on an experiment which compares two approaches to pronunciation testing: holistic
(global, impressionistic) and atomistic (analytic). We point to their strengths and
weaknesses, and show that they are not equivalent and lead to different results.
2. Pronunciation assessment in Cambridge English Examinations
In evaluating different methods of pronunciation testing, it seems useful to start with
analyzing the way in which it is done in international English language examinations.
Pronunciation does not play any important role in the majority of them (for a detailed
analysis see Szpyra-Kozłowska 2003). Cambridge examinations are no exception to
this rule; candidates get only 5%-6% of the total score for this skill. The assessment
is impressionistic in nature. The following criteria have been adopted for the 5
basic examinations:
• KET (Key English Test) – pronunciation is heavily influenced by L1 features
and may at times be difficult to understand;
• PET (Preliminary English Test) – pronunciation is generally intelligible, but L1
features may put a strain on the listener;
• FCE (First Certificate in English) – although pronunciation is easily
understood, L1 features may be intrusive;
• CAE (Certificate in Advanced English) – L1 accent may be evident but does
not affect the clarity of the message;
• CPE (Certificate of Proficiency in English) – pronunciation is easily
understood and prosodic features are used effectively; many features,
including pausing and hesitation, are ‘native-like’.
It is obvious that these requirements are very general and impression-based. Comments
addressed to examiners also make constant reference to the vague notions of
intelligibility and the amount of strain a candidate’s pronunciation puts on the listener.
In the manual, evaluators, who are usually experienced nonnative teachers of
English, are instructed as follows: ‘when assessing pronunciation, examiners should
try to put themselves in the position of a non-EFL specialist, native speaker of
English and assess the amount of strain on the listener and the degree of patience
and effort required to understand the candidate.’ This procedure raises the following
problems:
1. A professional teacher of English cannot be required to pretend to be a non-EFL
specialist who, in addition, is a native speaker of English; not everyone
has a talent for pretending to be a completely different person (what if he
2. It is not clear what kind of native speaker the examiner is supposed to
impersonate – a well-travelled university professor, familiar with many
nonnative varieties of English or a small-town housewife who has never left
her birthplace?
3. A nonnative teacher can in most cases understand even very poor English spoken
by his fellow-countrymen because of his frequent exposure to it. He is,
therefore, in no position to judge its intelligibility to users of English of
nationalities other than his own.
4. Having no precise criteria of pronunciation assessment, the examiner is likely
to adopt his own subjective principles of evaluation (see section 3). This often
happens in spite of standardization procedures and examiners’ training.
We can conclude that the examinations under analysis do not provide clear-cut
criteria for assessing the examinees’ pronunciation: they rely too heavily on very
imprecise impressionistic judgements and make unreasonable demands on
nonnative examiners. This, in turn, seriously undermines their inter-rater reliability.
3. Holistic versus atomistic pronunciation testing
As shown in the preceding section, Cambridge English Examinations, similarly to
many other language tests, employ rather objectionable impressionistic evaluation. It
is therefore crucial to examine its logical alternative, i.e. analytic testing. In this
section these two approaches to pronunciation assessment are compared and
In the holistic approach to language testing (Alderson et al. 1996:289), ‘examiners
are asked not to pay too much attention to any one aspect of a candidate’s
performance, but rather to judge its overall effectiveness.’ The greatest advantage of
this procedure is that it can be administered to large groups of learners within a short
period of time. Moreover, according to Underhill (1987:101), ‘impression marking is
used for the kind of categories that are very hard to define but everybody agrees are
important: fluency, ability to communicate, style, naturalness of speech, and so on.’
For these reasons it is advocated by many researchers (e.g. Celce-Murcia et al.
1996, Hughes 1991, Koren 1995).
Nevertheless, global pronunciation testing has many drawbacks. It is often too
general and imprecise since the assessment criteria in the rating scales, as has been
shown in section 2, tend to be vague. This means, in consequence, that different
raters might adopt their own criteria of evaluation. Finally, as pointed out by Underhill
(1987: 101), ‘making accurate impression-based assessments requires a lot of
experience. (…) Even experienced assessors find it difficult to make consistent
impression-based judgements.’ In other words, this procedure raises problems both
of intra-rater and inter-rater reliability.
Analytic evaluation consists in establishing a detailed marking scheme in which
specific aspects of the learner’s performance are evaluated separately. Subsequently
these different ratings are combined to provide an overall mark. An atomistic
approach to pronunciation testing thus involves judgements on the correctness of the
learner’s production of particular vowels, consonants, stress, rhythm, intonation, etc.
This method of pronunciation testing is claimed to be more objective than the holistic
approach as it provides a more detailed diagnosis of the learner’s problems and
achievements. It is generally preferred by pronunciation specialists and phoneticians
(e.g. Vaughan-Rees 1989).
On the other hand, the atomistic procedure is not without its problems. It is extremely
time-consuming and requires recording the learners’ speech samples, which the
raters must then listen to several times. For these reasons this
approach seems unsuitable for large classes and examinations with many
candidates.
According to Hughes (1991), the choice between holistic and analytic scoring
depends to some extent on the purpose of testing; atomistic tests are more reliable
for diagnostic purposes in the language classroom and in the situations in which
scoring is carried out in many places by different judges, while holistic evaluation,
which is faster, is more appropriate for experienced scorers who are thoroughly familiar with
the grading system.
In order to compare both approaches, we have carried out an experiment whose
primary goal was to examine whether the holistic and atomistic procedures of
pronunciation testing are equivalent and bring about the same results.
In the experiment reported here 10 judges, all teachers of English, evaluated the
pronunciation of 10 randomly selected intermediate Polish learners, secondary
school pupils, who were asked to read aloud a short passage, which was
subsequently recorded. The raters were first asked to evaluate the pupils’ recorded
pronunciation holistically, using the ordinary scale of Polish school marks of
1, 2, 2.5, 3, 3.5, 4, 4.5, 5 and 6, where 1 = failure and 6 = excellent. After a break of
two weeks the same group of raters assessed the recordings once again. On this
occasion they were given the following 6 criteria to be employed in the evaluation:
pronunciation of individual words, vowel quality (the /i/ - /i:/ distinction in particular),
the interdental fricatives, the -ing suffix, word stress and other phonetic features.
Each of these aspects was rated individually using the same scoring scale as
before. Subsequently, the means were calculated. Finally, the assessors were asked
to comment on the strengths and weaknesses of both approaches.
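The scoring step described above can be sketched as follows. This is only a minimal illustration, assuming that each rater's six criterion marks are combined by an unweighted mean and that the raters' marks are then averaged for each learner; the text states only that the means were calculated. The function names and the sample marks are hypothetical and are not the data collected in the experiment.

```python
from statistics import mean

# Scale of Polish school marks used in the experiment (1 = failure, 6 = excellent).
SCALE = (1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 6)

# The six criteria given to the raters in the second session.
CRITERIA = (
    "individual words",
    "vowel quality",
    "interdental fricatives",
    "-ing suffix",
    "word stress",
    "other phonetic features",
)


def atomistic_mark(ratings):
    """Unweighted mean of one rater's six criterion marks for one learner.

    ASSUMPTION: equal weights; the paper does not say how criteria were weighted.
    """
    for criterion in CRITERIA:
        if criterion not in ratings:
            raise ValueError(f"missing criterion: {criterion}")
        if ratings[criterion] not in SCALE:
            raise ValueError(f"mark not on the scale: {ratings[criterion]}")
    return mean(ratings[criterion] for criterion in CRITERIA)


def averaged_mark(marks):
    """Mean of the marks that several raters gave to the same learner."""
    return mean(marks)


# HYPOTHETICAL marks from two raters for one learner (not the experimental data).
rater_a = dict(zip(CRITERIA, (4, 3, 2.5, 3, 4, 3.5)))
rater_b = dict(zip(CRITERIA, (3, 3, 2, 3, 3.5, 3.5)))

atomistic = averaged_mark([atomistic_mark(rater_a), atomistic_mark(rater_b)])
holistic = averaged_mark([4, 3.5])  # the same raters' hypothetical holistic marks

print(f"holistic {holistic:.2f}, atomistic {atomistic:.2f}")
```

With a different weighting of the criteria the atomistic mark would change, so the choice of weights is itself part of the marking scheme.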
The questionnaires revealed that in making holistic evaluations the raters
adopted, in fact, various analytic criteria (such as the pronunciation of ‘silent’ letters,
intonation, pauses, devoicing of final obstruents, etc.), which differed from person to
person. Moreover, 90% of assessors regarded atomistic testing as more reliable and
The table below contains the results of the experiment. We provide averaged
atomistic and holistic marks given by the raters.
Holistic Atomistic
Table 1. Results of holistic and atomistic assessment
As can clearly be observed, in 8 cases out of 10 the mean atomistic marks are lower
than the holistic marks. In one case the results are reversed and in one they are the same.
The obtained means are 3.53 in the holistic evaluation and 3.17 in the analytic
assessment.
To verify the obtained results, another experiment, a replica of the previous one, was
conducted with a different group of 5 raters and 5 other learners. This time the
mean scores were 3.56 in the holistic and 3.04 in the analytic assessment.
Thus, a conclusion can be drawn that the holistic and atomistic approaches to
pronunciation testing are not equivalent; the former usually results in higher scores
than analytic assessment. This means that raters generally tend to be more lenient in
their overall impressions than in judgements made on the basis of more specific
criteria. An explanation of this phenomenon can be sought in the likely assumption
that in atomistic testing the focus seems to be on error finding more than in the
holistic procedure, where the criterion of intelligibility is employed, which allows for a
more tolerant approach to phonetic inaccuracies.
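The size of the difference between the two methods can be made explicit with a short calculation. The sketch below uses only the overall means reported in the text; the per-learner pairs passed to the hypothetical helper `tally` are invented for illustration and are not the values from Table 1.

```python
from math import isclose

# Overall means reported in the text: (holistic, analytic).
REPORTED_MEANS = {
    "first experiment": (3.53, 3.17),
    "replication": (3.56, 3.04),
}


def leniency_gap(holistic, analytic):
    """Holistic mean minus analytic mean, rounded to two decimals."""
    return round(holistic - analytic, 2)


def tally(pairs):
    """Count learners whose analytic mark is lower than, equal to,
    or higher than their holistic mark."""
    counts = {"lower": 0, "equal": 0, "higher": 0}
    for holistic, analytic in pairs:
        if isclose(holistic, analytic):
            counts["equal"] += 1
        elif analytic < holistic:
            counts["lower"] += 1
        else:
            counts["higher"] += 1
    return counts


gaps = {name: leniency_gap(h, a) for name, (h, a) in REPORTED_MEANS.items()}
print(gaps)

# HYPOTHETICAL per-learner (holistic, analytic) pairs, not the data from Table 1.
example_pairs = [(4.0, 3.5), (3.0, 3.0), (2.5, 3.0), (5.0, 4.5)]
example_tally = tally(example_pairs)
print(example_tally)
```

On the reported means the holistic marks exceed the analytic ones by 0.36 of a mark in the first experiment and by 0.52 in the replication.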
4. Final remarks
Pronunciation is extremely difficult to test in an objective and reliable fashion. We
have demonstrated that Cambridge English Examinations, just like other similar
tests, are based entirely on impressionistic evaluation and raise many objections with
regard to their reliability. We have considered an alternative procedure of analytic
evaluation and demonstrated that the two methods are not exactly equivalent,
holistic evaluation being more lenient and permissive than analytic evaluation. The atomistic approach can
be regarded as more objective and reliable, and is particularly well-suited for
diagnostic purposes as it allows the teacher to identify specific pronunciation
problems of the learners to be dealt with in the course of subsequent instruction. It is,
however, time-consuming and not easy to execute with large groups of learners or
examinees. Holistic testing, on the other hand, is technically simpler to carry out. It is
invaluable in assessing the overall impression, the intelligibility of the learner’s
speech and other aspects of his pronunciation which cannot be easily expressed by
means of definite, clear-cut criteria. Its reliability, however, is questionable.
Apparently, neither of these two methods can be viewed as fulfilling all the necessary
requirements of objectivity, reliability and practicality.
References
Alderson, J. C., Wall, D. & C. Clapham. (1996). Language Test Construction and
Evaluation. Cambridge: Cambridge University Press.
Celce-Murcia, M., Brinton, D. & J. Goodwin. (1996). Teaching Pronunciation: a
Reference for Teachers of English to Speakers of Other Languages. Cambridge:
Cambridge University Press.
Heaton, J. B. (1988). Writing English Language Tests. London: Longman.
Hughes, A. (1991). Testing for Language Teachers. Cambridge: Cambridge
University Press.
Koren, S. (1995). “Foreign language pronunciation testing: a new approach.” System
23 (3). 387-400.
Szpyra-Kozłowska, J. (2003). “Miejsce i rola fonetyki w międzynarodowych
egzaminach Cambridge, TOEFL i TSE.” Zeszyty Naukowe PWSZ w Płocku.
Neofilologia. Tom V. 181-191.
Underhill, N. (1987). Testing Spoken Language. A handbook of oral testing
techniques. Cambridge: Cambridge University Press.
Vaughan-Rees, M. (1989). “The testing of pronunciation – receptive skills.” Speak
Out! 4. p. 8.