Morphology - Linguistic Society of America

The Ninetieth Anniversary of the
LSA: A Commemorative Symposium
Morphology: the last 40 years
Mark Aronoff
January 3, 2014
Preface:
Technology and theory
• The relation between technology and theory
goes both ways
• We like to believe that theory leads
technology
• At least as often it is the other way round
• Many of the successes of early science were
technology driven
Galileo Galilei
Sidereus Nuncius (1610)
Antoni van Leeuwenhoek
In the year of 1675 I discover’d living
creatures in Rain water
A Case Study
Morphological Productivity
• Morphological productivity was rarely investigated
until the 1980s
• Newly available electronic tools made the quantitative
study of morphological productivity possible
• New tools have led to breakthroughs in our
understanding of both synchronic and diachronic
morphology
• The tools lead us to question fundamental assumptions
about the discreteness of language and the value of
the competence/performance distinction
Counting Words
Data Resources and English
Morphology
• Fundamental discoveries in linguistic
morphology over the last half-century have
depended on improvements in our ability to
count English words
• As the resources for counting words have
changed and improved, so have our ideas
about morphology changed and (we hope)
our understanding improved
Laying the Foundations for Studying
Morphological Productivity
• Early linguistic word data resources were not
designed for linguistics, though they were
focused on language
– Walker 1775
– Thorndike 1921, 1932, 1944
• Only in the 1960s did the first truly linguistically
driven electronic word data resources appear
– Brown 1963 (word counts)
– Kučera and Francis 1967 (frequency counts)
John Walker
1732 – 1807
John Walker
The Godfather of Modern Morphology
• Walker’s Rhyming Dictionary. 1775
• Walker’s dictionary has gone through many
editions and remains in print
• The term rhyming dictionary was misleading,
though it was a good selling point
• Walker’s dictionary was meant for linguists as
much as for poets, though few linguists used it
Notable linguistic remarks from
Walker’s original Introduction
• As in other Dictionaries words follow each other
in an alphabetical order according to the letters
they begin with, in this they follow each other
according to the letters they end with.
• The English Language, it may be said, has hitherto
been seen through but one end of the
perspective; and though terminations form the
distinguishing character and specific difference of
every language in the world, we have never
before had a prospect of our own, in this point of
view.
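Walker's ordering by final letters amounts to sorting words on their reversed spelling. A minimal sketch in Python; the function name and the word list are illustrative assumptions and are not taken from Walker:

```python
def reverse_sort(words):
    """Order words by their final letters, as in a reverse (rhyming) dictionary."""
    # Sorting on the reversed string compares last letters first.
    return sorted(words, key=lambda w: w[::-1])

# Illustrative word list (an assumption, not drawn from Walker's dictionary).
words = ["payment", "sanity", "agreement", "purity", "readable", "nation"]

for w in reverse_sort(words):
    print(w.rjust(12))  # right-justify so that shared endings line up
```

Words sharing a termination (here -ment and -ity) end up adjacent, which is what makes such a list useful for studying suffixes.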
Edward Thorndike, 1874-1949
The Godfather, Part II
The Father of Educational Psychology
• Thorndike was one of the first American
experimental psychologists
• Thorndike’s work was a precursor to both
behaviorism and modern cognitive psychology
• Thorndike spent his entire career at Teachers
College, Columbia University
• Thorndike is regarded as a founding figure in
educational psychology
Thorndike’s word books
• Between 1921 and 1944, Thorndike published
three frequency-based word books for teachers,
to be used in curriculum design
• The last edition (Thorndike and Lorge) contained
30,000 words
• The books consisted almost entirely of frequency
lists:
1/1,000,000; 1/4,000,000; 1,000 most frequent
• These were the first frequency lists published for
any language
A. F. Brown
• A. F. Brown was one of the first computational
linguists, working at Penn and then at Lehigh
• In 1963, he published his Normal and Reverse
English Word List, prepared under contract
with the Air Force Office of Scientific Research
• The list was collated from 18 dictionaries
– Each list runs to 400 pages of computer printout,
with 100 words per page = 400,000 entries
Kučera and Francis
Francis and Kučera
• The Brown Corpus (1964)
– 1,014,312 words of running text of edited English prose
printed in the United States during the calendar year 1961
– 500 samples of 2000+ words each
– Tagged in a variety of ways
• Computational Analysis of Present-Day American
English (1967)
• Frequency Analysis of English Usage (1982)
– Approximately 45,000 distinct lemmas listed with their
frequencies
– Lemmas with adjusted frequency >5/m in rank order
The last 25 years
Large-scale electronic resources
• The availability in the last quarter century of
large-scale electronic resources has made it
possible to study English morphology in
hitherto unimagined ways
• These resources have changed our perspective
on how morphology works
• Two types of resources:
– Electronic dictionaries
– Large corpora
The Oxford English Dictionary
• The largest, longest, and most expensive
academic publishing project in history
• 1857 – Inaugurated
• 1879 – Work begins in earnest
• 1933 – First full edition
• 1989 – OED2
• 1992 – CD-ROM of OED2
• 2000 – OED Online (by subscription)
À quoi ça sert (l’amour)?
• The OED, unlike Webster’s II and others, is a
historical dictionary
• Recent editions of the OED were designed
from the bottom up as electronic resources
• The combination allows us to ask questions
that we could never before expect to find
answers for
• We can even ask questions that we might
never before have imagined
OED Tools
• The OED prides itself on the accuracy of its first
citations
• The first citations provide the most accurate historical
record available in any language of the first use of a
word
• The ability to use wild cards permits the simple
construction of historical timelines for individual affixes
• The timelines allow easy and accurate study for the
first time of the growth and decline of patterns of
affixation in English over the last millennium
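The timeline construction described above can be sketched as binning first-citation years into half centuries. This is only an illustration: the function names are hypothetical, the input is a hand-made dictionary rather than an actual OED query, and the counts are raw, with no adjustment of any kind.

```python
from collections import Counter

def half_century_start(year):
    """Return the first year of the half century containing `year` (1290 -> 1251)."""
    return (year - 1) // 50 * 50 + 1

def affix_timeline(first_citations, suffix):
    """Count words ending in `suffix` by the half century of their first citation."""
    counts = Counter(
        half_century_start(year)
        for word, year in first_citations.items()
        if word.endswith(suffix)
    )
    # Label each bin as "start-end", in chronological order.
    return {f"{s}-{s + 49}": n for s, n in sorted(counts.items())}

# Invented nonce words and dates, for illustration only (NOT OED data).
toy = {"blickment": 1290, "dorfment": 1310, "snarpment": 1349,
       "blickity": 1420, "wugment": 1601}
print(affix_timeline(toy, "ment"))
```

Matching on the spelling of the ending stands in for a wildcard search; it does not check that the word is actually derived with the affix.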
What the OED shows us
• The system is self-organizing
• We can track the emergence of “borrowed”
affixes from the borrowing of large numbers
of individual words to the productive use of an
affix (e.g., -ment, -ation, -ity, -able)
• Homonymous affixes compete
• The competition between affixes is resolved
through competition
Sample affix histories from the OED
(Anshen & Aronoff 1999)
[Figure: adjusted number of words (0 to 160) per half century, 1251-1300 through 1951-2000, for derived -ity and derived -ment]
Sample affix histories from the OED
(Marine Lasserre)
Corpora
• The Brown corpus, compiled 50 years ago,
contained a total of 1 million words
• The Google Books database currently contains
over 30 million books and over 150 billion words
• Other modern large corpora are comparably large
and are tagged for part of speech
• The COCA corpus contains over 450 million words
• Corpora allow for the counting of individual
words/lemmas and their frequencies in a corpus
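As a rough illustration of what counting in a corpus involves, the sketch below tallies word forms and normalizes each count to a rate per million tokens. It assumes a pre-tokenized list and counts lowercase word forms rather than lemmas; the function name and the sample sentence are hypothetical.

```python
from collections import Counter

def frequencies_per_million(tokens):
    """Map each word form to (raw count, occurrences per million tokens)."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {w: (n, n * 1_000_000 / total) for w, n in counts.items()}

# Toy input (an assumption); a real corpus would be read from tagged files.
tokens = "The cat saw the dog".split()
for word, (n, per_million) in sorted(frequencies_per_million(tokens).items()):
    print(f"{word:5} {n:3} {per_million:12.1f}")
```

Counting lemmas instead would require part-of-speech tagging and lemmatization, which this sketch leaves out.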
Harald and the Elusive Index
Baayen’s Productivity Indices
• In a series of publications from 1989 on, Harald Baayen
developed a number of corpus-based indices intended
to capture the intuitive notion of morphological
productivity
• Baayen’s indices are based on the idea that words that
only occur once in a corpus, hapax legomena, are a
window into morphological productivity
• This idea makes no sense in the absence of a
searchable corpus of reasonable size
• The general method becomes less useful as the corpus
grows in size
P = n1 / N
• The best known of Baayen’s indices is P, which
measures the “growth rate” of the affix: the
probability that an encounter with a word
containing the affix is a new type.
• In the equation, n1 represents the total number
of hapaxes containing the affix, and N represents
the total number of tokens containing the affix.
• P fits linguists’ intuitions about productivity
reasonably well in corpora < 100M words, except
when both n’s are small (for unproductive affixes)
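A minimal sketch of how P = n1 / N could be computed from a list of tokens. The function name and the toy token list are assumptions, and matching on the spelling of the ending is a crude stand-in for real morphological analysis (it would wrongly count a word like "cement" under -ment).

```python
from collections import Counter

def productivity_p(tokens, suffix):
    """Compute P = n1 / N for the words in `tokens` that end in `suffix`."""
    counts = Counter(t for t in tokens if t.endswith(suffix))
    N = sum(counts.values())                        # tokens containing the affix
    n1 = sum(1 for c in counts.values() if c == 1)  # types occurring exactly once
    return n1 / N if N else 0.0

# Toy corpus (invented for illustration).
tokens = ["payment", "payment", "payment", "agreement", "wugment",
          "sanity", "sanity", "purity", "the"]
print(productivity_p(tokens, "ment"))  # 2 hapaxes out of 5 tokens
print(productivity_p(tokens, "ity"))   # 1 hapax out of 3 tokens
```

With counts this small the values mean little, which is the caveat about small n's noted above.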
V and P*
• V is the total number of lexeme types
containing a given affix
– Differences in V between affixes reflect the extent
to which relevant base words have been used
• Baayen plots P against V to obtain P*, the
relative “global productivity” of affixes
– This measure is problematic, as Baayen notes,
because there is no principled way of scaling the
axes
Hapax vs. Hapax
• Baayen’s final measure is P *, the hapax-conditioned degree of productivity
• P * = n1 / h1, where h1 is the total number of
hapaxes across all types in the corpus
• Since h1 is the same for all affixes in a corpus, this
measure simply counts the numbers of hapaxes
for each affix identified in a corpus
• The difference in P * yields intuitively satisfactory
results for Baayen’s corpora
• The greatest weakness of P * is that it cannot
easily be compared across corpora
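The hapax-conditioned measure n1 / h1 can be sketched in the same way. As before, the function name and tokens are illustrative assumptions, and suffix matching on spelling approximates morphological analysis.

```python
from collections import Counter

def hapax_conditioned(tokens, suffix):
    """Compute n1 / h1: the share of all corpus hapaxes that end in `suffix`."""
    counts = Counter(tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    h1 = len(hapaxes)                                   # all hapaxes in the corpus
    n1 = sum(1 for w in hapaxes if w.endswith(suffix))  # hapaxes with the affix
    return n1 / h1 if h1 else 0.0

# Toy corpus (invented for illustration).
tokens = ["payment", "payment", "payment", "agreement", "wugment",
          "sanity", "sanity", "purity", "the"]
print(hapax_conditioned(tokens, "ment"))  # 2 of the 4 hapaxes
print(hapax_conditioned(tokens, "ity"))   # 1 of the 4 hapaxes
```

Because h1 is fixed within one corpus, ranking affixes by this value gives the same order as ranking them by their raw hapax counts.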
Where hapaxes fail
• Both P and P * measurements are dependent on the
size (N) of the corpus
• The proportion of hapaxes in a corpus is a decreasing
function of N
– The rate of increase in the number of hapaxes slows as the
size of the corpus increases
– Very large corpora show few if any hapaxes
• There is no way to know what the “proper” size of a
corpus is for hapax-based measures to be useful
• It is not clear what the value of a measure of global
productivity is
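One way to see the dependence on corpus size is to recompute P on nested samples of the same token stream. The sketch below does this for a toy stream; the function name and the data are assumptions, and the output illustrates the computation only, not an empirical finding.

```python
from collections import Counter

def p_by_sample_size(tokens, suffix, sizes):
    """Recompute P = n1 / N on the first `size` tokens, for each size in `sizes`."""
    rows = []
    for size in sizes:
        counts = Counter(t for t in tokens[:size] if t.endswith(suffix))
        N = sum(counts.values())                        # affix tokens in the sample
        n1 = sum(1 for c in counts.values() if c == 1)  # affix hapaxes in the sample
        rows.append((size, n1, N, n1 / N if N else 0.0))
    return rows

# Toy token stream (invented); every token happens to end in -ment.
stream = ["wugment", "payment", "payment", "agreement",
          "payment", "agreement", "payment", "payment"]
for size, n1, N, p in p_by_sample_size(stream, "ment", [2, 4, 8]):
    print(f"sample={size}  n1={n1}  N={N}  P={p:.3f}")
```

In this toy stream the same affix yields a different P at each sample size, so a value of P is only interpretable together with the size of the corpus it came from.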
So far, so good
• We gain insights into morphological
productivity if we use quantitative tools
• We cannot treat productivity as a discrete
phenomenon if we want to learn about it
• The methods and measures we use depend on
the machinery that we have
• The notion of an absolute measure of
productivity that is valid across corpora is
elusive and problematic
Escape from Hapax
• The number of hapaxes decreases as the size of
the corpus increases
• With very large corpora hapaxes are not helpful
• We can learn a great deal from very large corpora
if we confine ourselves to the direct comparison
of pairs of competing affixes
• This method is not based on hapaxes
• This line of research does not address the
question of global productivity at all
• Google Fight!
Using Google Search
• We use Google Search Estimated Total Matches (ETM)
as a measure of usage
• PROBLEMS
– Google is very noisy and must be used with great caution
– ETM is not an actual count but an estimate based on a
proprietary method
• SOLUTIONS
– Little weight is placed on raw numbers or on individual
word pairs
– Only large differences between affixes are taken into
account
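The cautions above can be sketched as a tally that ignores small differences between rival forms. Everything here is an assumption made for illustration: the function name, the cutoff of 10, and the counts, which are invented and entered by hand (no search API is called).

```python
def tally_preferences(pair_counts, min_ratio=10.0):
    """Tally which of two rival forms is favored, ignoring small differences.

    pair_counts maps a stem to (count_a, count_b).
    min_ratio is a hypothetical cutoff: a form "wins" only if its count is
    at least min_ratio times that of its rival.
    """
    tally = {"a": 0, "b": 0, "undecided": 0}
    for a, b in pair_counts.values():
        if a > 0 and a >= min_ratio * b:
            tally["a"] += 1
        elif b > 0 and b >= min_ratio * a:
            tally["b"] += 1
        else:
            tally["undecided"] += 1
    return tally

# Invented counts for three hypothetical stems (NOT real search estimates).
toy_counts = {"stem1": (50000, 120), "stem2": (900, 40000), "stem3": (3000, 2500)}
print(tally_preferences(toy_counts))
```

Working with tallies over many stems, rather than with any single pair, is what keeps the noise in individual estimates from driving the result.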
A test case
Comparing -ic and -ical
• Sample ETM counts for high-frequency
doublets (Lindsay & Aronoff 2013)
Comparing -ic and -ical
• Sample ETM counts for high-frequency
singletons (Lindsay & Aronoff 2013)
Usually -ic wins
Sometimes -ical wins
• -ical is productive in stems ending in -olog
(from Lindsay and Aronoff 2013)
               Favoring -ic   Favoring -ical   Total
Total stems    10613          1353             11966
  Ratio        7.84           1
-olog stems    74             401              475
  Ratio        1              5.42
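The ratios in the table follow directly from the stem counts. A short check in Python, with a hypothetical helper name, scaling the smaller side of each comparison to 1:

```python
def favoring_ratio(n_ic, n_ical):
    """Return (ratio for -ic, ratio for -ical), with the smaller side scaled to 1."""
    smaller = min(n_ic, n_ical)
    return round(n_ic / smaller, 2), round(n_ical / smaller, 2)

print(favoring_ratio(10613, 1353))  # all stems: (7.84, 1.0)
print(favoring_ratio(74, 401))      # stems ending in -olog: (1.0, 5.42)
```

The preference reverses between the two rows: -ic is favored about 8 to 1 overall, while -ical is favored more than 5 to 1 among -olog stems.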
Why -olog?
• -olog defines the largest set by far of stems with
neighborhood length 4 preceding either of the
two suffixes (475 members)
• The -olog set contains 2/3 of all stems in -g
• The -olog set is thus a very large morphologically
defined subsystem with very few neighbors
• The -olog set is uniquely suited to sustain -ical as
a productive suffix, in spite of the clear
dominance of -ic overall
Conclusion
• The combination of rich computational
resources and quantitative methods allows us
to make progress in understanding questions
that could not be profitably studied a quarter
century ago
• As the resources change, so do the questions,
the methods, and the theories that they drive
THANK YOU
Special thanks to those who have joined in my
personal struggle over the last 40 years to
understand morphological productivity by counting
Morris Halle
Frank Anshen
Mark Lindsay
La lotta continua!