Dictionaries and Grammar Questions to Address • Do we include all forms of a particular word, or do we include only the base word and derive its forms? • How are the grammatical rules of a language represented? • How do we represent the parts of speech that go with particular grammatical rules? Morphology Definitions • Morphology – The study of the patterns used to form words – E.g. inflection, derivation, and compounds • Morpheme - Minimal meaning-bearing unit – Could be a stem or an affix • Stem {“unthinkable” “realization” “distrust”} – The part of a word that contains the root meaning (E.g. cat) • Affixes {-s, un-, de-, -en, -able, -ize, -hood} – a linguistic element added to a word modify the meaning – E.g.: prefix (unbuckle), suffix (buckled), infix (absobloodylutely), and circumfix (gesagt in German for said). – Affixes can attach to other affixes (boyishness) Knowing Words • When we know a word, we know its 1. 2. 3. 4. • Phonological sound sequences Semantic meanings Morphological relationships Syntactic categories and proper structure of a sentence Morphological relationships adjust word meanings – – – – – – – Person Number Case Tense Degree Gender Part of Speech Jill waits. Jill carried two buckets. The chair’s leg is broken. Jill is waiting there now. Jill ran faster than Jack. Jill is female Jill is a proper noun These are the kind of things we want our computers to figure out Units of Meaning • How many morphemes do each of the following sentences have? – – – – “I have two cats” “She wants to leave soon” “He walked across the room” “Her behavior was unbelievable” • Free Morphemes {eye, think, run, apple} • Bound Morphemes {-able, un-, -s, -tion, -ly} Affix Examples • Prefixes from Karuk, a Hokan language of California [pasip] [nipasip] [/upasip] “Shoot!” “I shoot” “She/he shoots” • Suffixes from Mende spoken in Liberia and Sierra Leone [pElE] [pElEi] [mEmE] [mEmEi] “house” “the house” “glass” “the glass” • Infixes from Bontoc spoken in the Phillipines [fikas] [fumikas] [fusul] [fumusal] “strong” “she is becoming strong” “enemy” “she is becoming an enemy” Turkish Morpology Uygarlastiramadiklarimizdanmissinizcasina Meaning: `behaving as if you are among those whom we could not civilize’ • • • • Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’ How does the Mind Store Meanings? • Hypotheses – Full listing: We store all words individually – Minimum redundancy: We store morphemes and how they relate • Analysis – Determine if people understand new words based on root meanings – Observe whether children have difficulty learning exceptions – Regular form: government/govern, Irregular form: department/depart • Evidence suggests – The mind represents words and affix meanings separately – Linguists observe that affixes were originally separate words that speakers slur together over time General Observations about Lexicons • Meanings are continually changing • Roots and Morphemes do not have to occur in a fixed position in relation to other elements. • How many words do people know? – Shakespeare uses 15,000 words – A typical high school student knows 60,000 (learning 10 words a day from 12 months to 18 years) • How many English words are there? – Over 300,000 words without Morphemes in 1988 Computational Morphology Speech recognition requires a language dictionary How many words would it contain? • Consider all of the morphemes of the word ‘true’ – true, truer, truest, truly, untrue, truth, truthful, truthfully, untruthfully, untruthfulness – Untruthfulness = un- + true + -th + -ful + -ness • Productive morphemes – An affix that at a point in time spread rapidly through the language – Consider goose and geese versus cat and cats • The former was an older way to indicate plurals • The latter is a more recent way that spread throughout • If we store morpheme rules, not all words, we can – Reduce storage requirements and simplify creating entire dictionaries – More closely mimic how the mind does it – Be able to automatically understand newly encountered word forms Morphology Rules • There are rules used to form complex words from their roots – ‘re-’ only precedes verbs (rerun, release, return) – ‘-s’ indicates plurals – ‘-ed’ indicates past tense • Affix Rules – Regular: follow productive affix rules – Irregular: don’t follow productive affix rules • Nouns – Regular: (cat, thrush), (cats, thrushes), (cat’s thrushes’) – Irregular: (mouse, ox), (mice, oxen) Observation: More frequent words resist changes that result from productive affixes and take irregular forms (E.g. am, is, are). Exceptions: A singer sings, and a writer writes. Why doesn’t a whisker whisk, a spider spid, or a finger fing? Parsing Identify components and underlying structure • Morphological parsing – Identifies stem and affixes and how they relate – Example: • fish fish + Noun + Singular or goose + Verb • fish fish +Noun +Plural • fish fish +Verb +Singular – Bracketing: indecipherable [in [[de [cipher]] able]] • Why do we parse? – – – – – spell-checking: Is muncheble a real word? Identify a word’s part-of-speech (pos) Sentence parsing and machine translation Identify word stems for data mining search operations Speech recognition and text to speech Parsing Applications • Lexicon – Create a word list – Include both stems and affixes (with the part of speech) • Morphotactics – Models how morphemes can be affixed to a stem. – E.g., plural morpheme follows noun in English • Orthographic rules – Defines spelling modifications during affixation – E.g. true tru in context of true truthfully Grammatical Morphemes • New forms are rarely added to closed morpheme classes • Examples – prepositions – articles – conjunctions at, for, by a, the and, but, or Morphological Parsing (stemming) • Goal: Break the surface input into morphemes • foxes – Fox is a noun stem – It has -es as a plural suffix • rewrites – Write is the verb stem – It has re- as a prefix meaning to do again – It has a –s suffix indicating a continuing activity Inflectional Morphology Does not change the grammatical category • Nouns – plural marker: -s (dog + s = dogs) – possessive marker: -’s (dog + ’s = dog’s) • Verbs – – – – 3rd person present singular: -s (walk + s = walks) past tense: -ed (walk + ed = walked) progressive: -ing (walk + ing = walking) past participle: -en or -ed (eat + en = eaten) • Adjectives – comparative: -er (fast + er = faster) – superlative: -est (fast + est = fastest) • In English – Meaning transformations are predictable – All inflectional affixes are suffices – Inflectional affixes are attached after any derivational (next slide) affixes • E.g. modern + ize + s = modernizes; not modern + s + ize Concatenative and Non-concatenative • Concatenative morphology combines by concatentation – prefixes and suffixes • Non-concatentative morphology combines in complex ways – circumfixes and infixes – templatic morphology • words change by internal changes to the root • E.g. (Arabic, Hebrew) ktb (write), kuttib (will have been written) ktb C V C C V C kuttib ui Templative Example Verbal Inflective Morphology • Verbal inflection – Main verbs (sleep, like, fear) are relatively regular Standard morphemes: -s, ing, ed These morphemes are productive: Emails, Emailing, Emailed – Combination with nouns for syntactical agreement I am, we are, they were • There are exceptions – – – – Eat (will eat, eats, eating, ate) Catch (will catch, catches, catching, caught) Be (will be, is, being, was) Have (will have, has, having, had) • General Observations about English – There are approximately 250 Irregular verbs that occur – Other languages have more complex verbal inflection rules Nominal Inflective Morphology • Plural forms (s or es) • Possessives (cat’s or cats’) • Regular Nouns – Singular (cat, bush) – Plural (cats, bushes) – Possessive (cat’s bushes’) • Irregular Nouns – Singular (mouse, ox) – Plural (mice, oxen) Derivational Morphology • Word stem combines with grammatical morpheme – Usually produces word of different class – Complex rules that are less productive with many exceptions – Sometimes meanings of derived terms are hard to predict (E.g. hapless) • Examples: verbs to nouns – generalize, realize generalization, realization – Murder, spell murderer, speller • Examples: verbs and nouns to adjectives – embrace, pity embraceable, pitiable – care, wit careless, witless • Example: adjectives adverbs – happy happily • More complicated to model than inflection – Less productive: science-less, concern-less, go-able, sleep-able Derivational Morphology Examples Level 1 • Examples: ize, ization, ity, ic, al, ity, ion, y, ate, ous, ive, ation • Observations – Can attach to non-words (e.g. fratern-al, paternal) – Often changes stem’s stress and vowel quality Level 2 • Examples: hood, ness, ly, s, ing, ish, ful, ly, less, y (adj.) • Observations – Never precede Level 1 suffixes – Never change stress or vowel quality – Almost always attach to words that exists Level 1 + Level 1: histor-ic-al, illumina-at-tion, indetermin-at-y; Level 1 + Level 2: fratern-al-ly, transform-ate-ion-less; Level 2 + Level 2: weight-less-ness Big one: antidisestablishmenterrianism (if I spelled it right) Adjective Morphology • Standard Forms – – – – – – – Big, bigger, biggest Cool, cooler, coolest, cooly Red, redder, reddest Clear, clearer, clearest, clearly, unclear, unclearly Happy, happier, happiest, happily Unhappy, unhappier, unhappiest, unhappily Real, unreal, really • Exceptions: unbig, redly, realest Identify and Classify Morphemes • In each group – Two words have a different morphological structure – One word has a different type of suffix – One word has no suffix at all • Perform the following tasks – 1.Isolate the suffix that two of the words share. – 2.Identify whether it is (i) free or bound; (ii) prefix, infix, suffix; (iii) inflectional or derivational. – 3.Give its function/meaning. – 4.Identify the word that has no suffix – 5.Identify the word that has a suffix which is different from the others in each group. a. rider colder silver actor b. tresses melodies Bess’s guess c. running foundling handling fling d. tables lens witches calculates Computational Techniques • Regular Grammars • Finite State Automata • Finite State Transducer • Parsing – Top down and bottom up Regular Grammars • Grammar: Rules that define legal characters strings • A regular grammar accepts regular expressions • A regular expression must satisfy the following: – – – – The grammar with no strings is regular The grammar that accepts the empty string is regular A single character is a regular grammar If r1 and r2 are regular grammars, then r1 union r2, and r1 concatenated with r2 are regular grammars – If r is a regular grammar, then r* ( where * means zero or more occurrences) is regular Notations to Express Regular Expressions • • • • • • Conjunction: abc Disjunction: [a-zA-Z], gupp(y|ies) Counters: a*, a+, ?, a{5}, a{5,8}, a{5,} Any character: a.b Not: [^0-9] Anchors: /^The dog\.$/ – Note: the backslash before the period is an escape character – Other escape characters include \*, \?, \n, \t, \\, \[, \], etc. • Operators – \d equivalent to [0-9], \D equivalent to [^0-9] – \w equivalent to [a-zA-z0-9 ], \W equivalent to [^\w] – \s equivalent to [ \r\t\n\f], \S equivalent to [^s] • Substitute one regular expression for another: s/regExp1/regExp2/ Examples of Regular Expressions • All strings ending with two zeroes • All strings containing three consecutive zeroes • All strings that every block of five consecutive symbols have at least two zeroes • All strings that the tenth symbol from the right is a one • The set of all modular five numbers Finite State Automata (FSA) FSA’s recognize grammars that are regular Definition: A FSA consists of 1. 2. 3. 4. 5. a set of states (Σ) a starting state (q0) a set of final or accepting states (F Q) a finite set of symbols (Q) a transition function ((q,i) ) that maps QxΣ to Q. It switches from a from-state to a to-state, based on one of the valid symbols Synonyms: Finite Automata, Finite State Machine Recognition Determine if the machine accepts a particular string i.e. Is a string in the language? • Traditionally, Turing used a tape reader to depict a FSA • Algorithm – – – – – – Begin in the start state Examine the current input character Consult the table Go to a new state and update the tape pointer. Until you run out of tape. The machine accepts the string processing stops in a final state Graphs and State Transition Tables • What can we can say about this machine? – – – – – It has 5 states At least b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions • Questions – Which strings does it accept? baaaa, aaabaaa, ba – Is this the only FSA that can accept this language? State Transition Table Annotated Directed Graph An FSA only can accept regular strings. Question: Can you think of a string that is not regular? Recognizer Implementation index = beginning of tape state = start state DO IF transition[index, tape[index]] is empty RETURN false state = transition[index, tape[index]] index = index + 1 UNTIL end of tap is reached IF state is a final state RETURN true ELSE RETURN false Key Points Regarding FSAs • This algorithm is a state-space search algorithm – Implementation uses simple table lookups – Success occurs when at the end of a string, we reach a final state • The results are always deterministic – There is one unique choice at each step – The algorithm recognizes all regular languages • Perl, Java, etc. use a regular expression algorithm – Create a state transition table from the expression – pass the table to the FSA interpreter • FSA algorithms – Recognizer: determines if a string is in the language – Generator: Generates all strings in the language Non-Deterministic FSA • Deterministic: Given a state and symbol, only one transition is possible • Nondeterministic: – Given a state and a symbol, multiple transitions are possible – Epsilon transitions: those which DO NOT examine or advance the tape • The Nondeterministic FSA recognizes a string if: – At least one transition sequence ends at a final state – Note: all sequences DO NOT have to end at a final state – Note: String rejection occurs only when NO sequence ends at a final state Examples ε Concatenation Closure Closure Union Using NFSAs Input State 0 1 2 3 4 b 1 0 0 0 0 a 0 2 2,3 0 0 ! 0 0 0 4 0 e 0 0 0 0 0 NFSA Recognition of “baaa!” Breadth-first Recognition of “baaa!” Nondeterministic FSA Example b a q0 q1 q2 a a ! q2 q3 \ q4 Other FSA Examples Dollars and Cents Exercise: Create a FSA for the following regular expressions (0|1)* [a-f1-9] abc{5} Non Deterministic FSA Recognizer Recognizer (index, state) LOOP IF end of tape THEN IF state is final RETURN true ELSE RETURN false IF no possible transitions RETURN false IF there is only one transition state = transition[index, tape[index]] IF not an epsilon transition THEN index++ ELSE FOR each possible transition not considered result = CALL recognizer(nextState,nextIndex) IF result = true RETURN true END LOOP RETURN false FSA’s and Morphology • Apply an FSA to each word in the dictionary to capture the morphological forms. • Groups of words with common morphology can share FSAs Building a Lexicon with a FSA Derivational Rules Simple Morphology Example unq0 q1 adj-root q2 e er ( un ) adj roots ( est ly Stop states: q2 and q3 -er –est -ly ) q3 From To Output 0 1 un 0 1 NULL 1 2 adj-root-list 2 3 er;est;ly An Extended Example unq0 e adj-root-1 -er –est -ly q1 q2 q5 adj-root-1 q3 q4 -er –est adj-root-2 Adj-root1: clear, happy, real Adj-root2: big, red er ( un )adj roots 1 ( est ly ) From To Output 0 1 un 0 3 NULL 1 2 adj-root-list-1 2 5 er;est;ly 3 2 adj-root-list-1 3 4 adj-root-list-2 4 5 er;est er adj roots 2 ( est ) Representing Derivational Rules nouns noun ity, ness ize ation er verbs verb ative ive able adjectives ful adj ly ly adverbs adverb Finite State Transducer (FST) • Definition: A FST is a 5-tuple consisting of – Q: set of states {q0,q1,q2,q3,q4} – : an alphabet of complex symbols • • • • Each complex symbol contains two simple symbols The first symbol is from an input alphabet i I The second symbol is from an output alphabet o O is in I x O, ε is the null character – q0: a start state – F: a set of final states in Q {q4} – (q,i:o): a transition function mapping Q x to Q • Concept: Translates and writes to a second tape a:o b:m a:o a:o !:? q0 q1 q2 q3 Example: baaaamoooo q4 Transition Example c:c a:a t:t +N:ε +PL:s • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s On-line demos • Finite state automata demos http://www.xrce.xerox.com/competencies/content -analysis/fsCompiler/fsinput.html • Finite state morphology http://www.xrce.xerox.com/competencies/content -analysis/demos/english • Some other downloadable FSA tools: http://www.research.att.com/sw/tools/fsm/ Lexicon for L0 Rule based languages Top Down Parsing Driven by the grammar, working down S NP VP NP Nom Pro Verb Det Noun Noun I prefer a morning flight S → NP VP, NP→Pro, Pro→I, VP→V NP, V→prefer, NP→Det Nom, Det→a, Nom→Noun Nom, Noun→morning, Noun→flight [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]] Bottom Up Parsing Driven by the words, working up The Bottom Up Parse The Grammar 0) S E $ 1)E E + T | E - T | T 2)T T * F | T / F | F 3) F num | id 1)id - num * id 2)F - num * id 3)T - num * id 4)E - num * id 5)E - F * id 6)E - T * id 7)E - T * F 8)E - T 9)E 10)S correct sentence Note: If there is no rule that applies, backtracking is necessary Top-Down and Bottom-Up • Top-down – Advantage: Searches only trees that are legal – Disadvantage: Tries trees that don’t match the words • Bottom-up – Advantage: Only forms trees matching the words – Disadvantage: Tries trees that make no sense globally • Efficient combined algorithms – Link top-down expectations with bottom-up data – Example: Top-down parsing with bottom-up filtering Stochastic Language Models A probabilistic view of language modeling • Problems – A Language model cannot cover all grammatical rules – Spoken language is often ungrammatical • Solution – Constrain search space emphasizing likely word sequences – Enhance the grammar to recognize intended sentences even when the sequence doesn't satisfy the rules Probabilistic Context-Free Grammars (PCFG) Goal: Assist in discriminating among competing choices • Definition: G = (VN, VT, S, P, p); VN = non-terminal set of symbols VT = terminal set of symbols S = start symbol p = set of rule probabilities R = set of rules P(S ->W |G): S is the start symbol, W = expression in grammar G • Training the Grammar: Count rule occurrences in a training corpus P(R | G) = Count(R) / ∑C(R) PFSA (Probabilistic Finite State Automata) • A PFSA is a type of Probabilistic Context Free Grammar – – – – The states are the non-terminals in a production rule The output symbols are the observed outputs The arcs represent a context-free rule The path through the automata represent a parse tree • A PCFG considers state transitions and the transition path S1 a a S1 S2 b S2 S3 b S3 ε Probabilistic Finite State Machines • Probabilistic models determine weights of the transitions • The sum of weights leaving a state total to unity • Operations – Consider the weights to compute the probability of a given string or most likely path. – The machine can ‘learn’ the weights over time .001 .01 Companion Canine .0035 Tooth Another Example Pronunciation decoding [n iy] Merging the machines together [n iy] Another Example