The Ninetieth Anniversary of the LSA: A Commemorative Symposium
Morphology: the last 40 years
Mark Aronoff
January 3, 2014

Preface: Technology and theory
• The relation between technology and theory goes both ways
• We like to believe that theory leads technology
• At least as often it is the other way round
• Many of the successes of early science were technology driven

Galileo Galilei
Sidereus Nuncius (1610)

Antonie van Leeuwenhoek
In the year of 1675 I discover’d living creatures in Rain water

A Case Study
Morphological Productivity
• Morphological productivity was rarely investigated until the 1980s
• Newly available electronic tools made the quantitative study of morphological productivity possible
• New tools have led to breakthroughs in our understanding of both synchronic and diachronic morphology
• The tools lead us to question fundamental assumptions about the discreteness of language and the value of the competence/performance distinction

Counting Words
Data Resources and English Morphology
• Fundamental discoveries in linguistic morphology over the last half-century have depended on improvements in our ability to count English words
• As the resources for counting words have changed and improved, so have our ideas about morphology changed and (we hope) our understanding improved

Laying the Foundations for Studying Morphological Productivity
• Early linguistic word data resources were not designed for linguistics, though they were focused on language
  – Walker 1775
  – Thorndike 1921, 1932, 1944
• Only in the 1960s did the first truly linguistically driven electronic word data resources appear
  – Brown 1963 (word counts)
  – Kučera and Francis 1967 (frequency counts)

John Walker
1732 – 1807

John Walker
The Godfather of Modern Morphology
• Walker’s Rhyming Dictionary, 1775
• Walker’s dictionary has gone through many editions and remains in print
• The term rhyming dictionary was misleading, though it was a good selling point
• Walker’s dictionary was meant for linguists as much as for poets, though few linguists used it

Notable linguistic remarks from Walker’s original Introduction
• As in other Dictionaries words follow each other in an alphabetical order according to the letters they begin with, in this they follow each other according to the letters they end with.
• The English Language, it may be said, has hitherto been seen through but one end of the perspective; and though terminations form the distinguishing character and specific difference of every language in the world, we have never before had a prospect of our own, in this point of view.

Edward Thorndike, 1874-1949
The Godfather, Part II

The Father of Educational Psychology
• Thorndike was one of the first American experimental psychologists
• Thorndike’s work was a precursor to both behaviorism and modern cognitive psychology
• Thorndike spent his entire career at Teachers College, Columbia University
• Thorndike is regarded as a founding figure in educational psychology

Thorndike’s word books
• Between 1921 and 1944, Thorndike published three frequency-based word books for teachers, to be used in curriculum design
• The last edition (Thorndike and Lorge) contained 30,000 words
• The books consisted almost entirely of frequency lists: 1/1,000,000; 1/4,000,000; 1000 most frequent
• These were the first frequency lists published for any language

A. F. Brown
• A. F. Brown was one of the first computational linguists, working at Penn and then at Lehigh
• In 1963, he published his Normal and Reverse English Word List, prepared under contract with the Air Force Office of Scientific Research
• The list was collated from 18 dictionaries
  – Each list runs to 400 pages of computer printout, with 100 words per page = 400,000 entries

Kučera and Francis
Francis and Kučera
• The Brown Corpus (1964)
  – 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961
  – 500 samples of 2000+ words each
  – Tagged in a variety of ways
• Computational Analysis of Present-Day American English (1967)
• Frequency Analysis of English Usage (1982)
  – Approximately 45,000 distinct lemmas listed with their frequencies
  – Lemmas with adjusted frequency >5 per million in rank order

The last 25 years
Large-scale electronic resources
• The availability in the last quarter century of large-scale electronic resources has made it possible to study English morphology in hitherto unimagined ways
• These resources have changed our perspective on how morphology works
• Two types of resources:
  – Electronic dictionaries
  – Large corpora

The Oxford English Dictionary
• The largest, longest, and most expensive academic publishing project in history
• 1857 Inaugurated
• 1879 Work begins in earnest
• 1933 First full edition
• 1989 OED2
• 1992 CD-ROM of OED2
• 2000 OED Online (by subscription)

À quoi ça sert (l’amour)?
• The OED, unlike Webster’s II and others, is a historical dictionary
• Recent editions of the OED were designed from the bottom up as electronic resources
• The combination allows us to ask questions that we could never before expect to find answers for
• We can even ask questions that we might never before have imagined

OED Tools
• The OED prides itself on the accuracy of its first citations
• The first citations provide the most accurate historical record available in any language of the first use of a word
• The ability to use wild cards permits the simple construction of historical timelines for individual affixes
• The timelines allow easy and accurate study for the first time of the growth and decline of patterns of affixation in English over the last millennium

What the OED shows us
• The system is self-organizing
• We can track the emergence of “borrowed” affixes from the borrowing of large numbers of individual words to the productive use of an affix (e.g., -ment, -ation, -ity, -able)
• Homonymous affixes compete
• The competition between affixes is resolved through competition

[Figure: Sample affix histories from the OED (Anshen & Aronoff 1999)]
[Figure: Sample affix histories from the OED (Marine Lasserre)]
[Recoverable labels: series “Derived ity” and “Derived ment”; x-axis: half centuries, 1251-1300 through 1951-2000; y-axis: adjusted number of words, 0 to 160]

Corpora
• The Brown corpus, compiled 50 years ago, contained a total of 1 million words
• The Google Books database currently contains over 30 million books and over 150 billion words
• Other modern large corpora are comparably large and are tagged for part of speech
• The COCA corpus contains over 450 million words
• Corpora allow for the counting of individual words/lemmas and their frequencies in a corpus

Harald and the Elusive Index

Baayen’s Productivity Indices
• In a series of publications from 1989 on, Harald Baayen developed a number of corpus-based indices intended to capture the intuitive notion of morphological productivity
• Baayen’s indices are based on the idea that words that occur only once in a corpus, hapax legomena, are a window into morphological productivity
• This idea makes no sense in the absence of a searchable corpus of reasonable size
• The general method becomes less useful as the corpus grows in size

P = n1 / N
• The best known of Baayen’s indices is P, which measures the “growth rate” of the affix: the probability that an encounter with a word containing the affix is a new type.
• In the equation, n1 represents the total number of hapaxes containing the affix, and N represents the total number of tokens containing the affix.
• P fits linguists’ intuitions about productivity reasonably well in corpora < 100M words, except when both n1 and N are small (for unproductive affixes)

V and P*
• V is the total number of lexeme types containing a given affix
  – Differences in V between affixes reflect the extent to which relevant base words have been used
• Baayen plots P against V to obtain P*, the relative “global productivity” of affixes
  – This measure is problematic, as Baayen notes, because there is no principled way of scaling the axes

Hapax vs. Hapax
• Baayen’s final measure is P *, the hapax-conditioned degree of productivity
• P * = n1 / h1, where h1 is the total number of hapaxes across all types in the corpus
• Since h1 is the same for all affixes in a corpus, this measure simply counts the numbers of hapaxes for each affix identified in a corpus
• The difference in P * yields intuitively satisfactory results for Baayen’s corpora
• The greatest weakness of P * is that it cannot easily be compared across corpora

Where hapaxes fail
• Both P and P * measurements are dependent on the size (N) of the corpus
• The number of hapaxes in a corpus is a decreasing function of N
  – The rate of increase in the number of hapaxes slows as the size of the corpus increases
  – Very large corpora show few if any hapaxes
• There is no way to know what the “proper” size of a corpus is for hapax-based measures to be useful
• It is not clear what the value of a measure of global productivity is

So far, so good
• We gain insights into morphological productivity if we use quantitative tools
• We cannot treat productivity as a discrete phenomenon if we want to learn about it
• The methods and measures we use depend on the machinery that we have
• The notion of an absolute measure of productivity that is valid across corpora is elusive and problematic

Escape from Hapax
• The number of hapaxes decreases as the size of the corpus increases
• With very large corpora hapaxes are not helpful
• We can learn a great deal from very large corpora if we confine ourselves to the direct comparison of pairs of competing affixes
• This method is not based on hapaxes
• This line of research does not address the question of global productivity at all
• Google Fight!
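
The two hapax-based measures defined above, P = n1 / N and the hapax-conditioned n1 / h1, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name, the crude rule that a word "contains the affix" if it ends in the suffix string, and the invented toy token list. It is not Baayen's implementation, and real studies rely on morphological analysis of a real corpus.

```python
from collections import Counter

def hapax_measures(tokens, suffix):
    """Return (n1, N, P, hapax_conditioned) for one suffix.

    Assumption: a word counts as containing the affix if it merely
    ends in `suffix`; this over-matches and stands in for a proper
    morphological analysis.
    """
    counts = Counter(tokens)
    affixed = {w: c for w, c in counts.items() if w.endswith(suffix)}
    n1 = sum(1 for c in affixed.values() if c == 1)  # hapaxes containing the affix
    N = sum(affixed.values())                        # tokens containing the affix
    h1 = sum(1 for c in counts.values() if c == 1)   # all hapaxes in the corpus
    P = n1 / N if N else 0.0                         # P = n1 / N
    hapax_conditioned = n1 / h1 if h1 else 0.0       # n1 / h1
    return n1, N, P, hapax_conditioned

# Invented toy "corpus", far too small to say anything about English.
toy_tokens = ("happiness kindness kindness sadness darkness "
              "purity purity purity clarity the the a").split()

ness = hapax_measures(toy_tokens, "ness")
ity = hapax_measures(toy_tokens, "ity")
print("-ness:", ness)
print("-ity: ", ity)
```

Because h1 is shared by every affix in the same corpus, ranking affixes by the hapax-conditioned value is the same as ranking them by n1, which is the point made on the slide.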
Using Google Search
• We use Google Search Estimated Total Matches (ETM) as a measure of usage
• PROBLEMS
  – Google is very noisy and must be used with great caution
  – ETM is not an actual count but an estimate based on a proprietary method
• SOLUTIONS
  – Little weight is placed on raw numbers or on individual word pairs
  – Only large differences between affixes are taken into account

A test case

Comparing -ic and -ical
• Sample ETM counts for high frequency doublets (Lindsay & Aronoff 2013)

Comparing -ic and -ical
• Sample ETM counts for high frequency singletons (Lindsay & Aronoff 2013)

Usually -ic wins
Sometimes -ical wins
• -ical is productive in stems ending in -olog (from Lindsay and Aronoff 2013)

                 Favoring -ic   Favoring -ical   Total
Total   Stems    10613          1353             11966
        Ratio    7.84           1
olog    Stems    74             401              475
        Ratio    1              5.42

Why -olog?
• -olog defines the largest set by far of stems with neighborhood length 4 preceding either of the two suffixes (475 members)
• The -olog set contains 2/3 of all stems in -g
• The -olog set is thus a very large morphologically defined subsystem with very few neighbors
• The -olog set is uniquely suited to sustain -ical as a productive suffix, in spite of the clear dominance of -ic overall

Conclusion
• The combination of rich computational resources and quantitative methods allows us to make progress in understanding questions that could not be profitably studied a quarter century ago
• As the resources change, so do the questions, the methods, and the theories that they drive

THANK YOU
Special thanks to those who have joined in my personal struggle over the last 40 years to understand morphological productivity by counting
Morris Halle
Frank Anshen
Mark Lindsay

La lotta continua!
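
As a worked check of the -ic / -ical stem table (Lindsay and Aronoff 2013), the reported ratios and totals can be recomputed from the stem counts alone. The sketch below uses only the four counts given in the table; the function and variable names are hypothetical.

```python
# Stem counts copied from the -ic / -ical table (Lindsay and Aronoff 2013).
stem_counts = {
    "all stems": {"ic": 10613, "ical": 1353},
    "olog stems": {"ic": 74, "ical": 401},
}

def preference(ic, ical):
    """Return (favored suffix, winner-to-loser ratio, total stems)."""
    winner = "-ic" if ic >= ical else "-ical"
    ratio = max(ic, ical) / min(ic, ical)
    return winner, round(ratio, 2), ic + ical

results = {name: preference(c["ic"], c["ical"]) for name, c in stem_counts.items()}
for name, (winner, ratio, total) in results.items():
    print(f"{name}: {winner} favored {ratio}:1 over {total} stems")
```

The comparison uses only the relative size of the two counts in each row, in line with the stated policy of placing little weight on raw numbers.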