Centre for Research in English Language Learning and Assessment

Three lessons from the historiography of language testing
Cyril J. Weir
CRELLA

Why is a knowledge of the past important for language testing?
“We consider past developments in the field in order to have a richer, more accurate, and complete understanding of our field. It enables us to understand:
• the state of our knowledge at any given point in time
• the causes that have prompted change
• the time it has taken for ideas to change
• the links to developments in other disciplines.”
Micheline Chalhoub-Deville

“Familiarity with how a construct was measured in the past speaks to the importance of humility and maturity as we realize a path has been trodden before us.”
Lynda Taylor

“If I have seen further, it is by standing on ye sholders of giants”
Isaac Newton in a letter to Robert Hooke (15 February 1676)

“Bernard of Chartres used to compare us to [puny] dwarfs perched on the shoulders of giants. He pointed out that we see more and farther than our predecessors, not because we have keener vision or greater height, but because we are lifted up and borne aloft on their gigantic stature.”
John of Salisbury (1159) Metalogicon (quoted in Wikipedia)

Temporal Awareness
Bernard Spolsky reminds us of the need to ensure temporal accuracy:
“…pride of place for a direct measure of oral language proficiency is usually granted to the oral interview created by the Foreign Service Institute (FSI) of the US State Department developed originally between 1952-56…”
Spolsky, B. (1990: 158). Oral examination: an historical note.
Language Testing, 7 (2)

Jack Roach
Spolsky continues: “It turns out to be the case, however, that many of the important issues the FSI linguists had to struggle with, especially those concerning reliability, had been anticipated and intelligently ventilated in a paper written some years before the FSI activity started, printed and circulated internally among examiners of the University of Cambridge Local Examinations Syndicate (UCLES)”.
Some problems of oral examinations in modern languages: an experimental approach based on the Cambridge Examinations in English for foreign students (J O Roach 1945)

Lynda Taylor suggested that Roach's work in the 1930s and 1940s on issues in speaking assessment has much to teach the LT world. Cambridge had already been conducting oral tests for 40 years by the time of the FSI development (an oral component including conversation was part of the original CPE in 1913).
“Thanks in large part to the influence of Roach (Assistant Secretary to the Syndicate from 1925 to 1945), Cambridge was already well-sighted on many of the key issues, e.g. face-to-face format to allow for reciprocal interaction, multiple task design, scales with some sort of descriptors attached, rater training and standardization.”

However…

1830s
Glenn Fulcher shared this earlier example with me:
“The earliest record of an attempt to assess second language speaking dates to the first few years after Rev. George Fisher became Headmaster of the Greenwich Royal Hospital School in 1834. In order to improve and record academic achievement, he instituted a “Scale Book”, which recorded performance on a scale of 1 to 5 with quarter intervals. A scale was created for French as a second language, with typical speaking prompts to which boys would be expected to respond at each level...”
Chadwick, E. (1864). Statistics of educational results. Museum: A Quarterly Magazine of Education, Literature and Science, 3, 479-484.
Cadenhead, K. and Robinson, R. (1987).
Fisher’s “Scale Book”: An Early Attempt at Educational Measurement. Educational Measurement: Issues and Practice, 6(4), 15-18.

Edward L. Thorndike’s standardised scales
Barry O’Sullivan drew my attention to Thorndike’s work in the early 20th century on the creation of standardised scales. Instead of estimating a scale based simply on connoisseurship, as was often the case in the United Kingdom, Thorndike took a large sample of handwritten scripts and used a large number of teachers to rank these scripts in order. From the data he created a scale upon which he placed each script. He then provided a set of exemplar scripts at various levels to operationalise the scale from an absolute zero base, with the scale points and the distances between them defined. Teachers were asked to compare their students’ scripts with those samples on the scale and identify the closest match to give the level.

F.Y. Edgeworth 1888
• Weir (1983) noted that in the C19th Edgeworth (1888) had observed that one-third of scripts marked by different examiners in the British civil service examinations received a different mark and, further, that in a re-examination of scripts by the same examiner one-seventh received a different mark.
• Edgeworth offered two solutions to these problems in scoring validity: increasing the number of components in an exam, and multiple marking. He argued that the more components that were aggregated, the more likely it was that individual marker errors would be eliminated. He also stressed that the more markers that were involved in examining a script, the more likely it was that a ‘true value’ would emerge.
Edgeworth, F.Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, LI, 599-635.
Edgeworth, F.Y. (1890). The element of chance in competitive examinations. Journal of the Royal Statistical Society, 53, 460-75 and 644-63.

First Lesson
“There is nothing new under the sun, but there are lots of old things we don't know.”
Ambrose Bierce, The Devil's Dictionary
[which we should know, or face the ignominy of our work being seen as temporally and/or geographically challenged…]

Part 2
Blind monks examining an elephant
Hanabusa Itcho 1652-1724

A while back Fred Davidson brought the following apposite quotation to the attention of the L-Test list serve:
“Despite some exceptional instances, the first logical step in the development of psychometrics seems to be to devise a series of instruments each of which measures something accurately, regardless of what that something may be; and the second, and following step, to discover what that something is.”
O'Connor, J. 1934. Psychometrics: A Study of Psychological Measurements. Cambridge, MA: Harvard University Press, p. xvi.

Construct: the sine qua non of language testing
The more fully we are able to describe the construct we are attempting to measure at the a priori stage, the more meaningful the statistical procedures contributing to construct validation that can subsequently be applied to the test results are likely to be. Statistical data do not in themselves generate conceptual labels. We can never escape from the need to define what is being measured, just as we are obliged to investigate how adequate a test is in operation.

Measured Constructs 1913-2012
• Attempts at explicit construct definition are a relatively recent phenomenon.
• In the first part of the C20th seemingly little overt attention was paid to the underlying construct(s) in language tests.
• It is only really in the 1960s that it becomes an explicit concern in the work of language testers such as J.B. Carroll, Bernard Spolsky and Alan Davies.

The influence of ideas from language teaching on testing 1913-2012
Changing priorities in the approaches to language learning/teaching that obtained at various stages in the C20th had an influence on language testing in the UK.
These included:
• the Grammar Translation or Traditional Method, based upon the method used for the teaching of classical languages
• the direct method, with its focus on spoken language, promoted in continental Europe for the formal education system
• the structuralist approach
• the communicative approach, with its focus on the needs of learners to use language for real-life communication

Cambridge Certificate of Proficiency in English 1913
The CPE in 1913 can be seen as a hybrid creation which drew on a number of legacies (academic and social) from the past concerning what was to be taught and how:
• (i) the Grammar Translation Approach, reflected in the inclusion of translation tasks and questions on English grammar. “The prime object of scholastic education is the training of the mental faculties” (R.W. Hiley 1887 Journal of Education Vol IX: 308);
• (ii) the Reform Movement (Viëtor 1882, Passy 1899, Jespersen 1904), reflected in the inclusion of a phonetics paper and an Oral paper. The assistance of modern ideas from phonetics allowed for a new pedagogical approach rooted in the spoken language.

Henry Sweet (1845-1912): a champion of the oral approach
Sweet’s (1899) The Practical Study of Languages. A Guide for Teachers and Learners is regarded by Howatt (1984:202) as one of the best language teaching methodology books ever written: “… unsurpassed in the history of linguistic pedagogy”.
http://www.henrysweet.org/
The papers in CPE 1913 correspond closely to the chapters in his book.

Lesson 2
Alan Davies wrote in the first issue of the journal Language Testing:
“…in the end no empirical study can improve a test’s validity... What is most important is the preliminary thinking and the preliminary analysis as to the nature of the language learning we aim to capture.”
Davies, A. (1984). Validating three tests of English language proficiency. Language Testing, 1 (1), 50-69.
Part 3
The wider picture
“People make their own history, but they do not make it as they please; they do not make it under self-selected circumstances, but under circumstances existing already, given and transmitted from the past.”
Karl Marx, The Eighteenth Brumaire of Louis Bonaparte, Part 1

Different gods, different mountain tops
Substantive differences grew between the UK and the USA in their approaches to testing from 1913 to 1970.
In the US the predominant focus was on scoring validity, in particular the psychometric qualities of a test, with a predilection for MCQ.
In the UK, for example in Cambridge English language examinations, there was a far greater concern with content validity: a concern with the how in the US as against the what in the UK.

Socio-economic
An important reason for the Atlantic rift can be found in the differing socio-economic contexts prevailing in Britain and the USA in the early C20th. The compelling need to produce tests on an industrial scale in the US strongly influenced testing organizations in the direction of objective multiple choice methods at a very early stage.

Population explosion
Resnick (1982: 177, 187) describes how in US schools “the need to identify those who had the least probability of being able to carry on normal work for their age, was stimulated by the demographic explosion… In 1870 there were about 80,000 students … by 1910 there were 900,000.”

Allocating wartime jobs in the military
Glenn Fulcher (1999: 390) describes how a serious logistical challenge faced the army in WW1. This was to result in the increased use of objective test formats in intelligence tests.
Resnick (1982: 182) records the successful placement in appropriate jobs of 1.7 million army recruits mobilised in 1917-18 through the administration of the US Army's Alpha and Beta tests, following Robert Yerkes’ successful advocacy of these.
Harold Ormsby recently described this on L-test L as “… a significant moment in the history of mass testing.” Fred Davidson saw it as “‘proof’ that large-scale normative psychometric testing could work”.

Numbers
In short, the pressure of numbers was one of the main factors which helped drive US testing in schools and in the military in the direction of psychometrically driven tests, especially MCQ.
In the UK there were only 3 candidates for the Cambridge Certificate of Proficiency in English (CPE) in 1913, 15 in 1931 and 750 by 1939.
A cottage industry as against an industrial behemoth.

MCQ
Samelson concludes: “The multiple choice test – efficient, quantitative, objective, capable of sampling wide areas of subject matter and easily generating data for complicated statistical analysis – had become the symbol or synonym of American Education.”
Samelson, F. Was Early Mental Testing (a) Racist Inspired, (b) objective science, (c) a technology for democracy, (d) the origin of multiple-choice exams, (e) none of the above? (Mark the RIGHT answer). In Michael M. Sokal, Psychological Testing and American Society, 1890-1930 (New Brunswick and London: Rutgers University Press, 1987).

“Perfidious Albion”: English as an instrument of UK foreign policy
Pennycook (1994:134) viewed the attempt to spread English around the globe in the C20th as part of a wider focus on cultural and linguistic expansion in preference to the earlier material exploitation by the western powers. A “search for new means of social and political control in the world” saw “the prodigious spread” of English.

Spreading English around the world
The Cambridge examination [CPE]…. was seized on by Jack Roach when he joined the Syndicate after the First World War for both ideological and personal reasons.
He thought an international test would realize his ‘modest ambition of making English the world language’ (Roach 1956: 2) and he saw a role for his own activities… Roach, in 1929, hoped for ‘the reaffirmation and spread of British influence’
Spolsky (2004: 305)

Propagation by simplification
Richard Smith (2004:229-31) identified a politically motivated focus on lexical content in British ELT from the 1930s until the end of WW2. He describes how:
“Discussions in the emerging UK ‘centre’ from the mid-1930s until the end of World War II had focused quite explicitly and narrowly on needs to propagate English as a world language via simplification of the lexical contents of instruction”

LCE 1939
• 1939 saw the introduction of the Cambridge Lower Certificate in English (LCE, later in 1975 FCE).
• The UCLES Regulations for the 1939 LCE examination papers (December 1938) reveal a lot about LCE constructs, in particular the references to ‘simplified’ texts, ‘simple English’ and ‘relatively limited vocabulary’.
• Developing a test at a lower level than CPE with a large potential candidature fitted well with the expansionist rationale for ELT that Smith described and Roach subscribed to.
• 1941 saw the signing of an agreement between Cambridge and the British Council to spread the former’s English examinations round the world.

Socio-Economic Forces: the obligation to define multiple levels of proficiency
Progress towards a European Economic Community from the 1970s onwards brought with it a felt need on the part of intergovernmental agencies in Europe to define language teaching and learning goals more precisely and to make a start on delineating the stages of progression across the language proficiency spectrum.
The result was a more granular approach to construct definition at different proficiency levels in Europe, which does not appear to have been a major concern for testers in the United States.
It resulted, for example, in additional Cambridge English language examinations across further proficiency levels from 1980 onwards (PET at the CEFR B1 level in 1980, CAE at the CEFR C1 level in 1991 and KET at the CEFR A2 level in 1994).
With the need for granularity came the need to establish specific criterial contextual and cognitive parameters to differentiate the six proficiency levels of the CEFR.

Third Lesson
The last word goes to Bernard Spolsky (1990a:159), who impresses on us the need to take account of:
“…external, non theoretical, institutional social forces, that on deeper analysis, often turn out to be a much more powerful explanation of actual practice… A clearer view of the history of the field will emerge once we are willing to look carefully at not just the ideas that underlie it, but also the institutional, social and economic situations in which they are realized.”