
LING115 Lecture note
Python #5: Exercise - morphological productivity
1. Introduction
We've learned a few things about Python: data types (Boolean, numbers, strings, lists,
dictionaries, files), control flow (for-loop, if-statement), functions (including methods) and
modules. There's more. But we can already do cool things with what we've learned.
2. Problem: morphological productivity
Morphology is the study of how words are formed. There are many different ways a word can be
formed. Here we'll focus on a process called affixation: adding affixes (prefixes, suffixes, etc.) to
the base form of a word. For example, we often add the suffix -ness to an adjective X to form a
word that means 'the state or the quality of being X' as in foolishness. There is another suffix with
a similar function in English: -ity as in scarcity. We want to know which of the two patterns is
more productive: which suffix is more likely to be added by English speakers when they see an
adjective X and want to create a word meaning 'the state or the quality of being X'.
One way to find out is to present native speakers of English with a made-up adjective and ask them
how they would make a noun out of it – by adding -ness or -ity. Another way is to follow Baayen
(1992) and use corpora to find out. Obviously, we will try the corpus-based approach here.
The basic idea of the corpus-based approach is this. We examine nouns ending in -ity or -ness in
a given corpus and see which of the two suffixes is more likely to create a "new" word. The
suffix that is more likely to create a new word is the more productive one. There are two issues
here:

• How do we define a "new" word in the corpus?
• How do we measure the likelihood of a process creating a new word?
Baayen (1992) argues that it is the hapax legomena – words that occur only once in a given
corpus – that should count as "new" words. To somewhat simplify, the idea is that such words
appear only once because they are words created on the fly, possibly due to lack of vocabulary.
Regarding the second issue, Baayen takes the likelihood of a process creating a new word to be
the proportion of tokens of new words created by the process among all tokens produced by the
process. That is, the productivity of a process (P) is estimated as follows:
P = n_hl / N
where n_hl is the number of hapax legomena produced by the process in the corpus and N is the
number of all tokens produced by the process in the corpus. Basically, the productivity of a
morphological process is measured by what proportion of the tokens created by the process
are "new" words (hapax legomena in the corpus).
For example, suppose we saw 30 nouns consisting of an adjective followed by -ity in our corpus.
Some of these nouns appeared more than once, and when we counted every occurrence of a noun as a
separate token, the -ity noun tokens added up to 100. Let's say 20 of the nouns appeared only
once. The quantified productivity of -ity would then be 20/100 = 0.2.
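To make the arithmetic concrete, here is the same toy calculation done in the Python interactive
mode (the numbers are just the made-up figures from this paragraph):

>>> n_hapax = 20          # -ity nouns that occurred exactly once
>>> n_tokens = 100        # all -ity noun tokens
>>> n_hapax / float(n_tokens)
0.2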
Note that we're focusing on token frequency rather than type frequency. The token frequency of a
morphological process is the total number of specific examples of the process found in the
corpus. The type frequency of a process is the total number of different words that exemplify the
process in the corpus. The crucial difference is in "specific examples" vs. "words". A word
may be found more than once in a given corpus. Each occurrence of a word counts as a different
token. On the other hand, all occurrences of the same word count as a single type. Say our
corpus consisted of the following three sentences:
I think awesomeness and freakishness are perfectly fine words.
John doesn't agree.
But I think awesomeness sounds better than awesomeity, for example.
In our mini corpus above, the token frequency of -ness suffixation is three, while its type
frequency is two. There are three tokens of -ness suffixation: two instances of awesomeness and
one instance of freakishness. But there are only two word types exemplifying the same suffixation
process: awesomeness and freakishness.
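If you want to check these counts with Python, here is a minimal sketch. The list below simply
holds the three -ness tokens from the mini corpus by hand; len gives the token frequency, and set
(introduced more fully in section 5.3) collapses duplicates to give the type frequency:

>>> tokens = ['awesomeness', 'freakishness', 'awesomeness']
>>> len(tokens)           # token frequency
3
>>> len(set(tokens))      # type frequency
2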
Anyway, to see which of the two suffixes – -ness or -ity – is more productive, we calculate the
productivity of each suffix as defined above and see which is higher. In the sections below, I first
explain the corpus you'll use to do this (section 3) and then how this can be implemented in Python
(sections 4-6).
3. Data
The Penn Treebank corpus is a collection of sentences from two sources: ATIS (Air Travel
Information System transcripts) and WSJ (Wall Street Journal articles). What's especially nice
about this corpus is that the sentences are tagged and parsed. The words are annotated with their
part-of-speech tags (e.g. NN for singular noun, JJ for adjective, etc.). The hierarchical phrase
structure of each sentence is specified. For example, the words in 'The dog chased the cat' would
be grouped as [[The dog] [chased [the cat]]], where [the dog] is a noun phrase and [chased the
cat] is a verb phrase containing another noun phrase [the cat]. The phrase structure of a sentence
is often represented by an upside-down tree. The corpus is a 'treebank' because you can think of
it as a collection of such trees.
Our copy of the corpus is under /data/TREEBANK/ on the gray machine
(gray.sjsu.edu). It consists of several directories including the following three:

(1) ./RAW/
(2) ./TAGGED/
(3) ./PARSED/
As the names imply, (1) contains raw text, (2) contains text annotated with part-of-speech tags,
and (3) contains parsed text. The text is the same in each directory. The difference is how the text
is annotated. Each directory has two sub-directories: ./ATIS/ and ./WSJ/. The directory
names specify where the text is from.
We want to focus on nouns in Wall Street Journal articles for this exercise. For that, we need to
know the part-of-speech (POS) of each word in WSJ articles. So we are going to use the files
under the 'tagged' directory (/data/TREEBANK/TAGGED/WSJ/).
Words are tagged in the following format in those files: <word>/<pos>. For example,
old/JJ means that 'old' is an adjective (JJ); see
http://bulba.sdsu.edu/jeanette/thesis/PennTags.html for what the tags mean. A string enclosed in
brackets as in [ the/DT board/NN ] means the string is a noun phrase.
For this exercise, we focus on nouns derived from adjectives. So we examine words with either
of the following two tags:

• NN - noun, singular or mass
• JJ - adjective
4. Approach
We need to calculate the morphological productivity of each suffix (-ity or -ness) and see which
is higher. Here's one way to do it:
(1) Identify all tokens (instances) of words that satisfy the following conditions:
a. The word is a noun.
b. The word ends with the given suffix.
c. Its stem (the word minus the suffix) is an adjective.
(2) Count how many of the nouns identified in (1) appear only once in the given corpus.
(3) Count the total number of tokens in (1).
(4) Divide the count in (2) by the count in (3).
5. Implementation
Let's try to implement the approach in section 4 in Python. Note that the same procedure applies to
both -ity and -ness. So instead of writing the same lines of code twice - once for each suffix - let's
define functions for a generic suffix.
5.1. Listing nouns and adjectives
Although this step was not explicitly mentioned in section 4, implementing this step can make
things easier. Basically, we list nouns to narrow our search and we list adjectives to check if the
stem of a noun under consideration is an adjective.
Recall that we will use the tagged sentences from WSJ articles. So we'll use the files that match
/data/TREEBANK/TAGGED/WSJ/*/* . Take a look at any one of those files. You will
realize that it is actually a bunch of lines containing word/tag (e.g. old/JJ) combinations
separated by white space. So we first extract word/tag combinations, just like we extract words
from raw text. We then separate each word/tag combination into two parts - word and tag - to see
if the word is labeled with the part-of-speech tag we want. A word and its tag are conjoined by /
in this corpus, so we can split it by /. This should return a list. The first element will be the word,
while the second element is the tag. We check if the tag is what we want: NN if we're gathering
nouns and JJ if we're gathering adjectives.
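For example, splitting a single word/tag combination works like this in the interactive mode:

>>> 'old/JJ'.split('/')
['old', 'JJ']
>>> 'old/JJ'.split('/')[1]
'JJ'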
Let's implement this by writing a function named get_tokens. It takes two arguments: a
part-of-speech tag and a list of lines of text. It returns a list of the word tokens from those lines
whose tag matches the given tag.
def get_tokens(pos, lines):
    out = []
    for line in lines:
        wt = line.strip().split()
        for each in wt:
            parts = each.split('/')
            if len(parts)==2:
                word = parts[0]
                tag = parts[1]
                if tag==pos:
                    out.append(word.lower())
    return out
Notice two things about the code. First, notice the if-statement if len(parts)==2. This is
to ensure that we only focus on word/tag pairs in a sentence. As you can see in the tagged files,
there are occasional non-word/tag items scattered in the text. We don't want our program to be
confused by such garbage. Second, note that we convert each word to lower-case before we
keep it: out.append(word.lower()). This is to ensure that both 'happiness' and
'Happiness' count as tokens of the same noun.
Assuming that our lines of text are stored in a list named corpus, we can do
get_tokens('NN',corpus) and get_tokens('JJ',corpus) to list the noun and the
adjective tokens, respectively.
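To see what the function does, here is a small check in the interactive mode. The line below is a
made-up example in the word/tag format of the corpus, not an actual WSJ line:

>>> toy = ['[ the/DT board/NN ] approved/VBD [ a/DT new/JJ rule/NN ] ./.']
>>> get_tokens('NN', toy)
['board', 'rule']
>>> get_tokens('JJ', toy)
['new']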
5.2. Zeroing in
Among the noun tokens we extracted, we just need those with the right suffix and an adjectival
stem. The suffix part is easy: we count how many letters the given suffix consists of, say n, and see
if the last n letters of a word match the suffix. The adjectival stem part can be tricky. Ideally,
the stem of a word is the substring that remains when the suffix is removed. But it's a
little more complicated because of spelling changes, as you can see in the following examples:
Word         Stem
happiness    happy
scarcity     scarce
curiosity    curious
possibility  possible
We have to tinker with the spelling to get the stem from a word. We will learn to do this when
we learn regular expressions. For now, I've written a function named stem that
returns a list of potential stems given a word and a suffix. The function is defined in
/home/ling115/python_examples/morph.py. Try the following in the Python
interactive mode:
>>> import sys
>>> sys.path.append('/home/ling115/python_examples/')
>>> import morph
>>> morph.stem('happiness','ness')
['happy']
>>> morph.stem('scarcity','ity')
['scarce', 'scarc']
>>> morph.stem('curiosity','ity')
['curious']
Once we list potential stems, we check if any member of the list is one of the adjectives we
collected.
Let's implement all this by writing a function named zero_in. The function takes three
arguments: a list of noun tokens, a list of adjective tokens, and a suffix. It returns a list of noun
tokens each of which ends with the given suffix and has an adjectival stem. We're going to use
the stem function defined in /home/ling115/python_examples/morph.py . We want
to import that module before we define our zero_in function. We don't want Python to import
the same module every time we run our function or a part of our function.
import sys
sys.path.append('/home/ling115/python_examples/')
import morph
def zero_in(nouns, adjectives, suffix):
    out = []
    n = len(suffix)
    for noun in nouns:
        if noun[-n:]==suffix:
            stem_list = morph.stem(noun,suffix)
            for s in stem_list:
                if s in adjectives:
                    out.append(noun)
                    break
    return out
Notice the use of the break statement in the above function definition. The idea is that as soon as
we find a potential stem that is an adjective, we're done with that noun and we don't have to check
the other potential stems.
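Here is a small check of zero_in in the interactive mode, using hand-made token lists rather than
the real corpus (the stems come from morph.stem as shown in the examples above):

>>> nouns = ['happiness', 'scarcity', 'business', 'happiness']
>>> adjectives = ['happy', 'scarce']
>>> zero_in(nouns, adjectives, 'ness')
['happiness', 'happiness']
>>> zero_in(nouns, adjectives, 'ity')
['scarcity']

Note that happiness shows up twice in the output: zero_in returns tokens, not types, so repeated
tokens of the same noun are all kept. business is excluded because none of its potential stems is in
the adjective list.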
5.3. Counting the number of hapax legomena
We now have ways to extract target noun tokens from our corpus. Next, we need a way to count
how many of them are nouns that appeared only once. To do this, we (1) identify unique nouns
in a given token list, (2) count how many tokens each noun has, and (3) count how many nouns
have only one token.
Let's implement this by defining a function named count_hapax. It takes one argument: a list of
noun tokens. It returns the number of hapax legomena in that list.
def count_hapax(noun_token_list):
    n = 0
    noun_set = set(noun_token_list)
    for noun in noun_set:
        if noun_token_list.count(noun)==1:
            n = n+1
    return n
Notice the use of a built-in function called set. The function takes a sequence (e.g. a string or a
list) and makes a set out of it: a collection of the unique items in the input. It's an efficient
way to get rid of duplicates. A set is a data type of its own in Python. The function set you see
above is a type-conversion function, like the str or the int functions you saw before.
For more on sets, see section 5.7 of http://docs.python.org/library/stdtypes.html.
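As an aside, noun_token_list.count(noun) scans the whole token list once for every unique noun,
which is fine for this exercise but can be slow for very large lists. A sketch of an alternative (not
part of the original script) uses collections.Counter from the standard library to tally every token
in a single pass:

from collections import Counter

def count_hapax_fast(noun_token_list):
    # Tally how many times each noun occurs, in one pass over the list.
    tallies = Counter(noun_token_list)
    # A hapax legomenon is a noun whose tally is exactly 1.
    return sum(1 for count in tallies.values() if count == 1)

The name count_hapax_fast is just for illustration; either version returns the same number.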
5.4. Counting the total number of tokens
We take the length of the token list from 5.2.
5.5. Measuring the productivity
We divide the output of the count_hapax function by the length of the token list.
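Putting 5.3 through 5.5 together, the productivity of a suffix is a single division. As a minimal
sketch (not part of the original script), a helper like the following also guards against the case
where a suffix has no qualifying tokens at all, which would otherwise cause a division by zero:

def productivity(noun_token_list):
    # P = (number of hapax legomena) / (total number of tokens).
    if len(noun_token_list) == 0:
        return 0.0   # no tokens found for this suffix
    return count_hapax(noun_token_list) / float(len(noun_token_list))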
6. Summary
Below is how the above implementations can be put together. You can also check out the code at
/home/ling115/python_examples/ity_vs_ness.py .
import sys
sys.path.append('/home/ling115/python_examples/')
import morph
def get_tokens(pos, lines):
    out = []
    for line in lines:
        wt = line.strip().split()
        for each in wt:
            parts = each.split('/')
            if len(parts)==2:
                word = parts[0]
                tag = parts[1]
                if tag==pos:
                    out.append(word.lower())
    return out
def zero_in(nouns, adjectives, suffix):
    out = []
    n = len(suffix)
    for noun in nouns:
        if noun[-n:]==suffix:
            stem_list = morph.stem(noun,suffix)
            for s in stem_list:
                if s in adjectives:
                    out.append(noun)
                    break
    return out
def count_hapax(noun_token_list):
    n = 0
    noun_set = set(noun_token_list)
    for noun in noun_set:
        if noun_token_list.count(noun)==1:
            n = n+1
    return n
# 1. Collect tokens.
corpus = sys.stdin.readlines() # Get lines from the standard input.
nouns = get_tokens('NN',corpus) # Get noun tokens.
adjectives = get_tokens('JJ',corpus) # Get adjective tokens.
ness = zero_in(nouns,adjectives,'ness') # Get -ness nouns.
ity = zero_in(nouns,adjectives,'ity') # Get -ity nouns.
# 2. Calculate productivity.
p_ness = count_hapax(ness) / float(len(ness)) # Productivity of -ness.
p_ity = count_hapax(ity) / float(len(ity)) # Productivity of -ity.
# 3. Print out results.
print 'Productivity of -ness suffixation = '+str(p_ness)
print 'Productivity of -ity suffixation = '+str(p_ity)
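The script reads the corpus from the standard input, so one way to run it on the tagged WSJ files
(assuming you have a copy of ity_vs_ness.py in your current directory) is something like:

cat /data/TREEBANK/TAGGED/WSJ/*/* | python ity_vs_ness.py

Whichever suffix ends up with the higher number is the more productive one by Baayen's measure.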