LING115 Lecture note Python #5: Exercise - morphological productivity

1. Introduction

We've learned a few things about Python: data types (Booleans, numbers, strings, lists, dictionaries, files), control flow (for-loops, if-statements), functions (including methods) and modules. There's more to learn, but we can already do interesting things with what we have so far.

2. Problem: morphological productivity

Morphology is the study of how words are formed. There are many different ways a word can be formed. Here we'll focus on a process called affixation: adding affixes (prefixes, suffixes, etc.) to the base form of a word. For example, we often add the suffix -ness to an adjective X to form a word that means 'the state or the quality of being X', as in foolishness. There is another suffix with a similar function in English: -ity, as in scarcity. We want to know which of the two patterns is more productive: which suffix is more likely to be added by English speakers when they see an adjective X and want to create a word meaning 'the state or the quality of being X'.

One way to find out is to present native speakers of English with a made-up adjective and ask them how they would make a noun out of it: by adding -ness or by adding -ity. Another way is to follow Baayen (1992) and use corpora to find out. We will, of course, try the corpus-based approach here.

The basic idea of the corpus-based approach is this. We examine nouns ending in -ity or -ness in a given corpus and see which of the two suffixes is more likely to create a "new" word. The suffix that is more likely to create a new word is the more productive one. There are two issues here: How do we define a "new" word in the corpus? And how do we measure the likelihood of a process creating a new word?

Regarding the first issue, Baayen (1992) argues that it is the hapax legomena, words that occur only once in a given corpus, that should count as "new" words. To simplify somewhat, the idea is that such words appear only once because they are words created on the fly, possibly due to lack of vocabulary. Regarding the second issue, Baayen takes the likelihood of a process creating a new word to be the proportion of tokens of new words created by the process among all tokens produced by the process. That is, the productivity of a process (P) is estimated as follows:

    P = n_hl / N

where n_hl = the number of hapax legomena produced by the process in the corpus and N = the number of all tokens produced by the process in the corpus. Basically, the productivity of a morphological process is measured by how large a share of the tokens created by the process are "new" words (hapax legomena in the corpus).

For example, suppose we saw 30 nouns consisting of an adjective followed by -ity in our corpus. Some nouns appeared more than once, and when we tallied all instances of a noun as separate tokens, the number of such noun tokens added up to 100. Let's say there were 20 nouns that appeared only once. The quantified productivity of -ity would then be 20/100 = 0.2.

Note that we're focusing on token frequency rather than type frequency. The token frequency of a morphological process is the total number of specific examples of the process found in the corpus. The type frequency of a process is the total number of different words that exemplify the process found in the corpus. The crucial difference is in "specific examples" vs. "words". A word may be found more than once in a given corpus. Each occurrence of a word counts as a different token, whereas all occurrences of the same word count as a single type.

Say our corpus consisted of the following three sentences:

    I think awesomeness and freakishness are perfectly fine words.
    John doesn't agree.
    But I think awesomeness sounds better than awesomeity, for example.

In this mini corpus, the token frequency of -ness suffixation is three, while its type frequency is two. There are three tokens of -ness suffixation: two instances of awesomeness and one instance of freakishness. There are two (word) types exemplifying the suffixation process: awesomeness and freakishness.
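To make these counts concrete, here is a minimal Python sketch of how the token frequency, type frequency, and number of hapax legomena for -ness in the mini corpus could be computed. The token list is simply typed in by hand here, and the built-in set function used below is explained in section 5.3.

    ness_tokens = ['awesomeness', 'awesomeness', 'freakishness']  # hand-made token list

    token_freq = len(ness_tokens)        # 3 tokens
    type_freq = len(set(ness_tokens))    # 2 types (duplicates collapsed)

    n_hapax = 0                          # how many nouns occur exactly once?
    for noun in set(ness_tokens):
        if ness_tokens.count(noun) == 1:
            n_hapax = n_hapax + 1        # only 'freakishness' qualifies, so n_hapax ends up 1

    print n_hapax / float(token_freq)    # productivity of -ness here: 1/3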
Anyway, to see which of the two suffixes, -ness or -ity, is more productive, we calculate the productivity of each suffix as defined above and see which one is higher. In the sections below, I first explain the corpus you'll use to do this (section 3) and then how this can be implemented in Python (sections 4-6).

3. Data

The Penn Treebank corpus is a collection of sentences from two sources: ATIS (Air Travel Information System transcripts) and WSJ (Wall Street Journal articles). What's especially nice about this corpus is that the sentences are tagged and parsed. The words are annotated with their part-of-speech tags (e.g. NN for singular noun, JJ for adjective, etc.), and the hierarchical phrase structure of each sentence is specified. For example, the words in 'The dog chased the cat' would be grouped as [[The dog] [chased [the cat]]], where [the dog] is a noun phrase and [chased the cat] is a verb phrase containing another noun phrase [the cat]. The phrase structure of a sentence is often represented by an upside-down tree. The corpus is a 'treebank' because you can think of it as a collection of such trees.

Our copy of the corpus is under /data/TREEBANK/ on the gray machine (gray.sjsu.edu). It consists of several directories, including the following three:

(1) ./RAW/
(2) ./TAGGED/
(3) ./PARSED/

As the names imply, (1) contains raw text, (2) contains text annotated with part-of-speech tags, and (3) contains parsed text. The text is the same in each directory; the difference is how the text is annotated. Each directory has two sub-directories, ./ATIS/ and ./WSJ/, whose names specify where the text is from.

We want to focus on nouns in Wall Street Journal articles for this exercise. For that, we need to know the part-of-speech (POS) of each word in the WSJ articles, so we are going to use the files under the 'tagged' directory (/data/TREEBANK/TAGGED/WSJ/). Words in those files are tagged in the following format: <word>/<pos>. For example, old/JJ means that 'old' is an adjective (JJ); see http://bulba.sdsu.edu/jeanette/thesis/PennTags.html for what the tags mean. A string enclosed in brackets, as in [ the/DT board/NN ], means the string is a noun phrase. For this exercise, we focus on nouns derived from adjectives, so we examine words with either of the following two tags:

NN - noun, singular or mass
JJ - adjective
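Just to make the format concrete, here is a small, purely illustrative snippet showing how one of these word/tag tokens can be taken apart in Python. The token old/JJ is the example from above, and the same idea is used in the get_tokens function in section 5.1.

    token = 'old/JJ'           # one token in the <word>/<pos> format described above
    parts = token.split('/')   # ['old', 'JJ']
    word = parts[0]            # 'old'
    tag = parts[1]             # 'JJ'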
4. Approach

We need to calculate the morphological productivity of each suffix (-ity or -ness) and see which is higher. Here's one way to do it:

(1) Identify all tokens (instances) of words that satisfy the following conditions:
    a. The word is a noun.
    b. The word ends with the given suffix.
    c. Its stem (the word minus the suffix) is an adjective.
(2) Count how many of the nouns identified in (1) appear only once in the given corpus.
(3) Count the total number of tokens identified in (1).
(4) Divide the count in (2) by the count in (3).

5. Implementation

Let's try to implement the approach in section 4 in Python. Note that the procedures apply to both -ity and -ness, so instead of writing the same lines of code twice, once for each suffix, let's define functions that work for a generic suffix.

5.1. Listing nouns and adjectives

Although this step was not explicitly mentioned in section 4, implementing it can make things easier. Basically, we list nouns to narrow our search, and we list adjectives to check whether the stem of a noun under consideration is an adjective.

Recall that we will use the tagged sentences from WSJ articles, so we'll use the files that match /data/TREEBANK/TAGGED/WSJ/*/* . Take a look at any one of those files. You will see that it is essentially a bunch of lines containing word/tag combinations (e.g. old/JJ) separated by white space. So we first extract the word/tag combinations, just like we extract words from raw text. We then separate each word/tag combination into two parts, the word and the tag, to see whether the word is labeled with the part-of-speech tag we want. A word and its tag are joined by / in this corpus, so we can split the combination on /. This returns a list whose first element is the word and whose second element is the tag. We then check whether the tag is the one we want: NN if we're gathering nouns and JJ if we're gathering adjectives.

Let's implement this by writing a function named get_tokens. It takes two arguments: a part-of-speech tag and a list of lines of text. It returns a list of the word tokens from those lines that match the tag.

    def get_tokens(pos, lines):
        out = []
        for line in lines:
            wt = line.strip().split()
            for each in wt:
                parts = each.split('/')
                if len(parts) == 2:
                    word = parts[0]
                    tag = parts[1]
                    if tag == pos:
                        out.append(word.lower())
        return out

Notice two things about the code. First, notice the if-statement if len(parts) == 2. This is to ensure that we only look at genuine word/tag pairs. As you can see in the tagged files, there are occasional strings that are not word/tag pairs scattered through the text, and we don't want our program to be confused by such garbage. Second, note that we convert each word to lower case before we keep it: out.append(word.lower()). This is to ensure that 'happiness' and 'Happiness' count as tokens of the same noun.

Assuming that our lines of text are stored in a list named corpus, we can call get_tokens('NN', corpus) and get_tokens('JJ', corpus) to list the noun and the adjective tokens, respectively.
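Where does the list corpus come from? In the summary script in section 6 it is read from standard input, but as a rough sketch (one possibility among several, assuming the get_tokens function above has been defined), you could also build it yourself with Python's standard glob module, which expands a file-name pattern into a list of matching paths:

    import glob

    corpus = []                                           # all tagged lines from the WSJ files
    for filename in glob.glob('/data/TREEBANK/TAGGED/WSJ/*/*'):
        f = open(filename)
        corpus = corpus + f.readlines()                   # add this file's lines to the list
        f.close()

    nouns = get_tokens('NN', corpus)                      # noun tokens
    adjectives = get_tokens('JJ', corpus)                 # adjective tokens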
5.2. Zeroing in

Among the noun tokens we extracted, we just need those with the right suffix and an adjectival stem. The suffix part is easy: we count how many letters the given suffix consists of, say n, and check whether the last n letters of the word match the suffix. The adjectival stem part can be tricky. Ideally, the stem of a word would simply be the substring that remains when the suffix is removed, but it's a little more complicated than that because of spelling changes, as you can see in the following examples:

    Word          Stem
    happiness     happy
    scarcity      scarce
    curiosity     curious
    possibility   possible

We have to tinker with the spelling to get the stem from a word. We will learn to do this when we learn regular expressions. For now, I've written a function named stem that returns a list of potential stems given a word and a suffix. The function is defined in /home/ling115/python_examples/morph.py . Try the following in the Python interactive mode:

    >>> import sys
    >>> sys.path.append('/home/ling115/python_examples/')
    >>> import morph
    >>> morph.stem('happiness', 'ness')
    ['happy']
    >>> morph.stem('scarcity', 'ity')
    ['scarce', 'scarc']
    >>> morph.stem('curiosity', 'ity')
    ['curious']

Once we have listed the potential stems, we check whether any member of the list is one of the adjectives we collected. Let's implement all this by writing a function named zero_in. The function takes three arguments: a list of noun tokens, a list of adjective tokens, and a suffix. It returns a list of the noun tokens that end with the given suffix and have an adjectival stem. We're going to use the stem function defined in /home/ling115/python_examples/morph.py, and we want to import that module once, before we define our zero_in function; we don't want Python to import the same module every time we run our function or a part of it.

    import sys
    sys.path.append('/home/ling115/python_examples/')
    import morph

    def zero_in(nouns, adjectives, suffix):
        out = []
        n = len(suffix)
        for noun in nouns:
            if noun[-n:] == suffix:
                stem_list = morph.stem(noun, suffix)
                for s in stem_list:
                    if s in adjectives:
                        out.append(noun)
                        break
        return out

Notice the use of the break statement in the function definition above. The idea is that as soon as we find a potential stem that is an adjective, we're done and we don't have to worry about the other potential stems.

5.3. Counting the number of hapax legomena

We now have ways to extract the target noun tokens from our corpus. Next, we need a way to count how many of them are nouns that appear only once. To do this, we (1) identify the unique nouns in a given token list, (2) count how many tokens each noun has, and (3) count how many nouns have only one token. Let's implement this by defining a function named count_hapax. It takes one argument, a list of noun tokens, and returns the number of hapax legomena in that list.

    def count_hapax(noun_token_list):
        n = 0
        noun_set = set(noun_token_list)
        for noun in noun_set:
            if noun_token_list.count(noun) == 1:
                n = n + 1
        return n

Notice the use of the built-in function set. The function takes a sequential data type (e.g. a string or a list) and makes a set out of it: a collection of the unique items in the input. It's an efficient way to get rid of duplicates. A set is a data type of its own in Python, and the function set you see above converts to that data type, just like the str or int functions you saw before. For more on sets, see section 5.7 of http://docs.python.org/library/stdtypes.html .

5.4. Counting the total number of tokens

We take the length of the token list from 5.2.

5.5. Measuring the productivity

We divide the output of the count_hapax function by the length of the token list.
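One small point about this division: the code in these notes is written for Python 2, where dividing one integer by another performs integer division and truncates the result. That is why the summary code in section 6 converts one operand with float() before dividing:

    >>> 20 / 100          # integer division in Python 2: the result is truncated
    0
    >>> 20 / float(100)   # converting one operand gives the proportion we want
    0.2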
6. Summary

Below is how the implementations above can be put together. You can also check out the code at /home/ling115/python_examples/ity_vs_ness.py .

    import sys
    sys.path.append('/home/ling115/python_examples/')
    import morph

    def get_tokens(pos, lines):
        out = []
        for line in lines:
            wt = line.strip().split()
            for each in wt:
                parts = each.split('/')
                if len(parts) == 2:
                    word = parts[0]
                    tag = parts[1]
                    if tag == pos:
                        out.append(word.lower())
        return out

    def zero_in(nouns, adjectives, suffix):
        out = []
        n = len(suffix)
        for noun in nouns:
            if noun[-n:] == suffix:
                stem_list = morph.stem(noun, suffix)
                for s in stem_list:
                    if s in adjectives:
                        out.append(noun)
                        break
        return out

    def count_hapax(noun_token_list):
        n = 0
        noun_set = set(noun_token_list)
        for noun in noun_set:
            if noun_token_list.count(noun) == 1:
                n = n + 1
        return n

    # 1. Collect tokens.
    corpus = sys.stdin.readlines()                 # Get lines from the standard input.
    nouns = get_tokens('NN', corpus)               # Get noun tokens.
    adjectives = get_tokens('JJ', corpus)          # Get adjective tokens.
    ness = zero_in(nouns, adjectives, 'ness')      # Get -ness nouns.
    ity = zero_in(nouns, adjectives, 'ity')        # Get -ity nouns.

    # 2. Calculate productivity.
    p_ness = count_hapax(ness) / float(len(ness))  # Productivity of -ness.
    p_ity = count_hapax(ity) / float(len(ity))     # Productivity of -ity.

    # 3. Print out results.
    print 'Productivity of -ness suffixation = ' + str(p_ness)
    print 'Productivity of -ity suffixation = ' + str(p_ity)
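Since the script reads its corpus from standard input, one way to run it (this particular command line is just one possibility) is to pipe the tagged WSJ files into it from the shell:

    cat /data/TREEBANK/TAGGED/WSJ/*/* | python /home/ling115/python_examples/ity_vs_ness.py

The script then prints the two productivity values; the suffix with the higher value is the more productive one by Baayen's measure.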