LIN 3098 Corpus Linguistics
Lecture 7
Albert Gatt
In this lecture
 We look at some ways in which
corpora can be useful in
morphological research.
 Main focus: morphological
productivity
Part 1
Morphology, corpora and productivity
Productivity in linguistics
 The term “productivity” is used in a
wide variety of contexts.
 Syntactic rules are “productive” in the
sense that they can be used to
generate new phrases.
 The same can be said of some
morphological rules.
A definition of productivity
 A linguistic process is productive if:
 It can be used to produce novel forms.
 If a rule is productive, then:
 Novel forms (previously unheard) can be
understood and produced;
 There is no need to store all forms in the
mental lexicon.
A couple of examples
 Imagine an English adjective garmy. How would you
derive a noun out of this adjective?
 Many speakers might say garminess
 This suggests that –ness suffixation is a productive
derivational process.
 E.g. Imagine a Maltese verb intoffa. How would you
produce a noun from it?
 Speakers might say intoffar or intoffament or
intoffazzjoni
 This suggests that –ar and –ment suffixation are
productive derivational processes in Maltese.
Productive vs non-productive
 Some morphological processes or
categories seem to have greater
potential to form new words than
others
 e.g. English -able, -ness
 compare to English –th: warmth,
strength… (much less productive)
Classical approaches to productivity
 Jackendoff (1975):
 morphological rules are called redundancy rules:
 They capture the relationship between related forms


E.g. Warm → warmth (ADJ → N via addition of –th)
E.g. Desire → desirable (N → ADJ via addition of –able)
 If a rule is productive, then it can be used to
create novel forms.
 e.g. adjectives with –able can be produced
“online”
Features of classical approaches
1. They rely on a binary distinction
(productive vs. unproductive)
2. Productive rules are typically assumed to be regular;
sub-regularities are not considered much
(Dressler 2003)
3. Most of these approaches do not look at
corpus data
Productive vs regular
 Usually, productive morphological rules are regular.
Irregular forms are likely to be stored in the lexicon.
 However, we can sometimes detect “sub-regularities”:
 sing-sang
 ring-rang
 bring-brang (?)
 Speakers can sometimes generalise these sub-regular
processes, perhaps by analogy.
 What’s the past tense of tring or spling?
“Possible” vs “attested”
 Our tentative definition of productivity focuses on
production of novel forms.
 By definition, novel forms are:
 Possible words of the language;
 Previously unattested.
 This would suggest that we can’t use corpora to study
productivity, since corpora only contain attested forms.
The problem of frequency
 Suppose we find that a corpus contains lots of words
ending in some suffix –X.
 This doesn’t necessarily imply that the –X suffix is
productive.
 It could have been productive in the past, but no longer
be so.
 In that case, the likelihood of a new word ending in –X
is low, despite the high frequency.
Getting around the problem
 Frequency can’t give us all the answers. However, one
interesting solution is to look at hapax legomena.
 A corpus will usually contain lots of words occurring only
once.
 We can think of hapaxes as “one-offs”.
 It seems likely that some hapaxes will be “new
formations”

NB We can only make this assumption if the corpus is very large.
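A minimal sketch of how hapax legomena could be pulled out of a tokenised corpus. The function name `hapax_legomena` and the toy token list are assumptions made for illustration only; as noted above, hapaxes only become informative when the corpus is very large.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the set of word types that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    return {word for word, freq in counts.items() if freq == 1}

# Toy corpus, invented for illustration (far too small for real conclusions).
tokens = "the warmth of the goodness and the garminess of the warmth".split()

print(sorted(hapax_legomena(tokens)))
# ['and', 'garminess', 'goodness']
```

Note that not every hapax is a new formation: "and" is a hapax here only because the sample is tiny.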
Corpus-based approaches
 View productivity as a gradable
phenomenon:
 some forms become ingrained through frequent
usage
 category can still be productive to some extent
 productivity estimated in terms of a category’s
potential to produce new forms
 can account for sub-regularities: productivity of
a category is due to a lot of factors, including
analogy to existing words
The continuum
[Figure: a continuum running from “productive morphological process” (e.g. ADJ+ness → Noun) to “lexicalised word” (e.g. ADJ+th → Noun)]
 Productive processes tend to:
 be compositional
 result in a lot of new words
Why is productivity interesting?
 No finite lexicon can contain all words of a
language at a certain time
 productive processes can be exploited to parse
new/unseen lexical items
 this is helped by the compositionality of
productive processes
 can also help to distinguish creative neologism
from systematic rule-application. compare:
 well-defined, well-intentioned, well-specified
 lots of adjectives with a well- prefix
 YouTube
 a one-off
Theoretical implications
 raises interesting questions about the
relationship between corpus-based
measures and psycholinguistic data
 likelihood of a morphological process
being applied depends on style, genre,
speech community…
 can give an indication of language
change over time (some processes are
fossilised, others become more
productive)
Statistical measures of productivity
(Baayen 2006)
What we need
 A measure of productivity of a
process/category C should reflect:
 our intuitions about how frequently we
encounter C
 how easily native speakers can form new
words using C
 Is it easier to produce a noun with
–th (like warmth) or one with –ness
(like goodness)?
An analogy
 We can compare morphological processes to companies.
 All try to dominate a market where the number of clients
(words) is limited.
 Productivity reflects the extent to which these
companies:
 have managed to dominate in the past (how many
words they’ve formed)
 are expanding into new areas of the market (how
many new words they’re forming)
 may expand in the future (how many as yet unseen
words they’re going to form)
Realised productivity (RP)
 Given a morphological category C, RP
gives a rough indication of the past
utility of C in forming new words.
 Measured as the number of distinct
types in C in a corpus of size N
 E.g. regular past tense –ed displays
many more types than sub-regular
forms such as keep-kept/sleep-slept
Realised productivity cont/d
 Why types, not tokens?
 Productive processes have lots of types which
are hapaxes, or are very infrequent (low token
frequency).
 Words formed from irregular processes tend to
be very frequent (have high token frequency).
 Some limitations:
 a high RP for a category does not imply that it
will keep forming lots of new words
 RP is heavily dependent on corpus size
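The contrast between type and token counts can be sketched as follows. This is only an illustration: the function names and the toy data are invented, and matching on a word ending with `str.endswith` is a crude stand-in for real membership in a morphological category (it would, for example, wrongly count "witness" as a –ness noun).

```python
def realised_productivity(tokens, suffix):
    """RP sketch: number of distinct types ending in `suffix`."""
    types = set(tokens)
    return sum(1 for word in types if word.endswith(suffix) and word != suffix)

def token_frequency(tokens, suffix):
    """Number of tokens (running words) ending in `suffix`."""
    return sum(1 for word in tokens if word.endswith(suffix) and word != suffix)

# Toy corpus, invented for illustration.
tokens = ["kindness", "darkness", "goodness", "garminess",
          "warmth", "warmth", "warmth", "strength", "strength", "strength"]

print(realised_productivity(tokens, "ness"), token_frequency(tokens, "ness"))  # 4 4
print(realised_productivity(tokens, "th"), token_frequency(tokens, "th"))      # 2 6
```

In this toy sample, token counts would rank –th above –ness, while type counts rank –ness higher, which is the pattern the slide describes.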
Expanding productivity (P*)
 P* gives a rough indication of the rate of
expansion of C.
 Focuses on the number of hapaxes
produced using C in the corpus.
 aka hapax-conditioned productivity
$$P^{*} = \frac{\text{no. of hapaxes formed using } C}{\text{total no. of hapaxes in corpus}}$$
 NB: P* is still heavily dependent on
corpus size!
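A rough sketch of how P* could be computed from a token list, following the definition above. The function name is hypothetical, the data are invented, and suffix matching with `str.endswith` is an assumed simplification of deciding which words belong to category C.

```python
from collections import Counter

def expanding_productivity(tokens, suffix):
    """P* sketch: share of all corpus hapaxes that end in `suffix`."""
    counts = Counter(tokens)
    hapaxes = [word for word, freq in counts.items() if freq == 1]
    if not hapaxes:
        return 0.0  # no hapaxes at all, so nothing to condition on
    in_category = [word for word in hapaxes if word.endswith(suffix)]
    return len(in_category) / len(hapaxes)

# Toy corpus, invented for illustration.
tokens = ["warmth", "warmth", "strength", "strength",
          "kindness", "kindness", "garminess", "blueness", "table"]

# Hapaxes: garminess, blueness, table. Two of the three end in -ness.
print(expanding_productivity(tokens, "ness"))
print(expanding_productivity(tokens, "th"))   # 0.0
```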
Potential productivity (P)
 Gives an indication of how likely a category
C is to form new words in future.
 I.e. it indicates whether C may already be saturated (low P)
 aka category-conditioned productivity
$$P = \frac{\text{no. of hapaxes formed using } C}{\text{total no. of tokens formed using } C}$$
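A rough sketch of P computed from a token list. As before, the function name and data are invented for illustration, and `str.endswith` is only an assumed approximation of category membership.

```python
from collections import Counter

def potential_productivity(tokens, suffix):
    """P sketch: hapaxes in the category divided by tokens in the category."""
    counts = Counter(word for word in tokens if word.endswith(suffix))
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        return 0.0  # category not attested in this corpus
    n_hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return n_hapaxes / n_tokens

# Toy corpus, invented for illustration.
tokens = ["warmth", "warmth", "strength", "strength",
          "kindness", "kindness", "garminess", "blueness", "oddish"]

print(potential_productivity(tokens, "ness"))  # 0.5  (2 hapaxes / 4 tokens)
print(potential_productivity(tokens, "th"))    # 0.0  (0 hapaxes / 4 tokens)
print(potential_productivity(tokens, "ish"))   # 1.0  (1 hapax / 1 token)
```

The last line shows how a category attested only once receives the maximum value of 1, so P should be read with the category's overall frequency in mind.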
Some more on P
 Unlike RP and P*, P is not very sensitive to
corpus size as such
 However, very sensitive to frequency of the
category.
 e.g. if C is realised only once in a corpus of size
N, then P = 1!
 Recent empirical work has shown that RP
and P* may correlate very strongly, but
both exhibit a weak correlation with P
(Vegnaduzzo 2009)
 pattern non-X has high RP and P*, but low P
 pattern X-ish has low RP and P*, but high P
P vs. RP and P*
 A category C can have low RP and P*,
but high P.
 In this case, C hasn’t been used much in the
past, but is being used quite productively at
the moment.
 Corresponds to the “ease” with which new
words can be formed using the category.
 If a category has high RP, it may still be
saturated, and so have low P.
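The way the three measures can come apart is sketched below. All frequencies are invented to mimic the qualitative pattern described above (they are not taken from any corpus or from the cited work), and `productivity_measures` and `TOTAL_HAPAXES` are hypothetical names.

```python
def productivity_measures(category_freqs, total_hapaxes):
    """Return RP, P* and P for one category.

    category_freqs: dict mapping each word type in the category to its token count.
    total_hapaxes:  number of hapaxes in the whole corpus, across all categories.
    """
    if not category_freqs or total_hapaxes <= 0:
        raise ValueError("need a non-empty category and a positive hapax total")
    n_types = len(category_freqs)
    n_tokens = sum(category_freqs.values())
    n_hapaxes = sum(1 for freq in category_freqs.values() if freq == 1)
    return {"RP": n_types,
            "P*": n_hapaxes / total_hapaxes,
            "P": n_hapaxes / n_tokens}

# Invented frequencies: a well-established category and a rarer one.
non_x = {"non-linear": 40, "non-trivial": 30, "non-zero": 25,
         "non-finite": 3, "non-local": 1, "non-garmy": 1}
x_ish = {"tallish": 2, "greenish": 1}
TOTAL_HAPAXES = 10  # assumed number of hapaxes in the whole toy corpus

print(productivity_measures(non_x, TOTAL_HAPAXES))  # RP 6, P* 0.2, P 0.02
print(productivity_measures(x_ish, TOTAL_HAPAXES))  # RP 2, P* 0.1, P about 0.33
```

With these toy numbers the first category wins on RP and P*, while the second has the higher P.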
The psycholinguistic connection
1. Rule vs. direct access:
 To produce a word (e.g. illegal), you can
either retrieve it directly from memory, or apply
the rule on the fly.
 Evidence suggests that the relative frequency of
the base form vs. the derived form is related to
which of the two alternatives applies.
The psycholinguistic connection
2. Complexity-based affix ordering:
 Corpus research: more productive
affixes follow less productive ones in
word formation
 It seems that more highly predictable
(low productivity) affixes are processed
first.
 High productivity may also imply less
likelihood of entering into further
derivational processes.
Works cited
 S. Vegnaduzzo (2009). Morphological
productivity rankings of complex adjectives.
Proc. NAACL-HLT Workshop on
Computational Approaches to Linguistic
Creativity.
 K. Moilanen and S. Pulman (2008). The
good, the bad and the unknown:
Morphosyllabic sentiment tagging of unseen
words. Proc. ACL 2008.
 Baayen 2006 linked from web page