UNIVERSITY OF ABERDEEN - CS4517 SOLUTION SESSION 2003-2004

1. (a) How does inflectional morphology differ from derivational morphology? Explain with examples from the English language. (5)

SOLUTION (from lecture)
Inflectional morphology and derivational morphology refer to the two main processes used in English for forming words from morphemes.

Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem and usually filling some syntactic function such as agreement.
Examples:
1. '-s' is an inflectional morpheme used to mark the plural on nouns (book - books, tree - trees).
2. '-ed' is the past tense marker on verbs (delete - deleted, add - added).

Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly.
Examples:
1. 'Computer' can take '-ize' to form 'computerize'.
2. 'Computerize' can take '-ation' to produce 'computerization'.

(b) You are given the DCG grammar and lexicon as shown in Appendix 1. Modify or extend the grammar and lexicon to accept the following sentences
i. [peter, may, join, the, army] (i.e., Peter may join the army)
ii. [army, men, lead, risky, lives] (i.e., Army men lead risky lives)
iii. [should, nations, sacrifice, the, lives, of, their, citizens] (i.e., Should nations sacrifice the lives of their citizens)
but to reject the following
i. [peter, may, joins, the, army] (i.e., Peter may joins the army)
ii. [army, men, leads, risky, lives] (i.e., Army men leads risky lives)
iii. [should, nations, sacrifices, the, lives, of, their, citizens] (i.e., Should nations sacrifices the lives of their citizens)
(9)

SOLUTION (problem solving from the practical class)

i) Names do not carry any feature.
Add the word 'may' to the lexicon as an auxiliary verb with number and tense features:
aux(-,present) ---> [may].
Add tense and number features to the verb 'join' in the lexicon:
verb(plural,present) ---> [join].
Add a number feature to the noun 'army':
noun(singular) ---> [army].
Define a number feature for the determiner 'the':
det(-) ---> [the].
In a sentence, np and vp must agree in number:
s ---> np(number), vp(number).
Add a new rule for vp that allows an auxiliary verb before the verb; the vp number comes from the auxiliary verb's number:
vp(number) ---> aux(number,-), verb(plural,present), np(-).
Names are always singular in number:
np(singular) ---> name.

ii) Add the word 'army' to the lexicon as an adjective (in this case, 'army' is an adjective):
adj ---> [army].
Add a number feature to the noun 'men':
noun(plural) ---> [men].
Add tense and number features to the verb 'lead' in the lexicon:
verb(plural,present) ---> [lead].
Add a number feature to the noun 'lives' (in this case, 'lives' is a noun):
noun(plural) ---> [lives].
Add a new rule for np that allows an adjective before the noun; the np number is derived from the noun number:
np(number) ---> adj, noun(number).
(Note: for sentences without an auxiliary, the basic vp rule must also carry the number feature, e.g. vp(number) ---> verb(number,present), np(-), so that 'army men lead risky lives' is accepted while 'army men leads risky lives' is rejected.)

iii) Add the word 'should' to the lexicon as an auxiliary verb with number and tense features:
aux(-,past) ---> [should].
Add tense and number features to the verb 'sacrifice' in the lexicon:
verb(plural,present) ---> [sacrifice].
Add a number feature to the noun 'nations':
noun(plural) ---> [nations].
Add a new rule for np that allows a prepositional phrase (pp) after the noun:
np(number) ---> det(number), noun(number), pp.
Add a new rule for pp:
pp ---> prep, np(-).
Add a new rule for s that allows a question sentence; the auxiliary and the subject np agree in number:
s ---> aux(number,-), np(number), verb(plural,present), np(-).
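The following minimal Python sketch (an illustration added here, not part of the model answer) mimics the agreement idea behind the feature-annotated rules above for sentence pattern (i); the lexicon entries and helper names are hypothetical.

# Toy lexicon: each word carries a category and a number feature; None plays the
# role of the '-' feature and agrees with anything.
LEXICON = {
    'peter': ('name', 'singular'),
    'may':   ('aux',  None),
    'join':  ('verb', 'plural'),      # base form, as in verb(plural,present)
    'joins': ('verb', 'singular'),
    'the':   ('det',  None),
    'army':  ('noun', 'singular'),
}

def agrees(a, b):
    # None behaves like the '-' feature: it unifies with anything.
    return a is None or b is None or a == b

def accept(words):
    """Accept the pattern  name aux verb det noun  with number agreement."""
    cats = [LEXICON[w][0] for w in words]
    if cats != ['name', 'aux', 'verb', 'det', 'noun']:
        return False
    aux, verb = words[1], words[2]
    subject_number = 'singular'                 # names are always singular
    # After an auxiliary the verb must be in its base (plural, present) form,
    # and the subject must agree with the vp's number, taken from the aux.
    return LEXICON[verb][1] == 'plural' and agrees(subject_number, LEXICON[aux][1])

print(accept(['peter', 'may', 'join', 'the', 'army']))    # True  (accepted)
print(accept(['peter', 'may', 'joins', 'the', 'army']))   # False (rejected)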
(c) What is robust parsing? How is it achieved? (5)

SOLUTION (from lecture)
In a text understanding system, when we use parsers that follow the grammar strictly, ungrammatical sentences that are input by users are often rejected with a message such as 'syntax error'. Users will not accept this, particularly when the mistake is trivial and a human would have understood something similar. Parsers therefore need to be designed to be more tolerant of input errors that do not distort the meaning. Robust parsing refers to the techniques used for building such tolerant parsers.
Some ideas for building a robust parser:
i) Allow the parser to relax some constraints, such as number agreement between determiner and noun in a noun phrase.
ii) Allow the parser to insert/delete function words such as 'a' and 'the'.

(d) Describe briefly a text-understanding application in which a regular grammar rather than a phrase structure grammar is used for modelling English syntax. (6)

SOLUTION
Information extraction systems such as FASTUS, which was discussed in the lecture, use a finite-state model (regular grammar) of English syntax. The stages of processing in FASTUS are as follows:
1. Tokens - Transform an input stream of characters into a token sequence.
2. Complex Words - Recognize multi-word phrases, numbers, and proper names.
3. Basic Phrases - Segment sentences into noun groups, verb groups, and particles.
4. Complex Phrases - Identify complex noun groups and complex verb groups.
5. Semantic Patterns - Identify semantic entities and events and insert them into templates.
6. Merging - Merge references to the same entity or event from different parts of the text.

2. (a) With the help of the frequencies in Appendix 2, order the following alternative phrases that the acoustic model in a speech recogniser generated, using a unigram model, a bigram model without smoothing, and a bigram model with plus-one smoothing (clearly show your calculations):
i. In muscle bee the good
ii. If music be the food
(8)

SOLUTION (problem solving)

Unigram likelihood
In muscle bee the good: 1968509 * 1778 * 511 * 5894379 * 83077 = 875,808,031,873,759,862,932,026
If music be the food: 261249 * 14747 * 663946 * 5894379 * 18540 = 279,536,718,416,175,882,967,321,080
Ranking: 1) If music be the food 2) In muscle bee the good

Bigram likelihood without smoothing
In muscle bee the good: 30 * 0 * 1 * 3130 = 0
If music be the food: 11 * 5 * 18516 * 2280 = 2,321,906,400
Ranking: 1) If music be the food 2) In muscle bee the good

Bigram likelihood with plus-one smoothing
In muscle bee the good: 31 * 1 * 2 * 3131 = 194,122
If music be the food: 12 * 6 * 18517 * 2281 = 3,041,083,944
Ranking: 1) If music be the food 2) In muscle bee the good

(A short Python sketch reproducing these figures is given after part (b) below.)

(b) What are Markov Models? How are they used in a speech recognition system? (7)

SOLUTION (from lectures)
A Markov Model is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. In a simple (visible) Markov model the input alphabet is the same as the underlying state alphabet; in a Hidden Markov Model the two alphabets differ. Markov Models are used to capture the probabilistic pronunciation models of morphemes or words in a speech processing system.
In a speech recognition system these models are used for identifying either the phonemes from code sequences or the words from phoneme sequences. In a speech synthesis system they represent the pronunciation models for words.
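The following Python sketch (added for illustration, not part of the model answer) reproduces the figures in part (a) from the Appendix 2 counts. As in the worked solution, the 'likelihood' here is simply the product of raw BNC counts (with optional plus-one smoothing of the bigram counts), not a normalised probability.

# Unigram and bigram counts from Appendix 2 (British National Corpus).
UNIGRAM = {'if': 261249, 'music': 14747, 'be': 663946, 'the': 5894379, 'food': 18540,
           'in': 1968509, 'muscle': 1778, 'bee': 511, 'good': 83077}
BIGRAM = {('if', 'music'): 11, ('music', 'be'): 5, ('be', 'the'): 18516, ('the', 'food'): 2280,
          ('in', 'muscle'): 30, ('muscle', 'bee'): 0, ('bee', 'the'): 1, ('the', 'good'): 3130}

def unigram_score(words):
    # Product of the unigram counts of the words.
    score = 1
    for w in words:
        score *= UNIGRAM[w]
    return score

def bigram_score(words, add=0):
    # Product of the bigram counts of adjacent word pairs, optionally add-one smoothed.
    score = 1
    for pair in zip(words, words[1:]):
        score *= BIGRAM.get(pair, 0) + add
    return score

s1 = ['in', 'muscle', 'bee', 'the', 'good']
s2 = ['if', 'music', 'be', 'the', 'food']
print(unigram_score(s1), unigram_score(s2))              # the two unigram products above
print(bigram_score(s1), bigram_score(s2))                # 0 and 2321906400
print(bigram_score(s1, add=1), bigram_score(s2, add=1))  # 194122 and 3041083944

In all three cases the score for 'if music be the food' is the larger, giving the same ranking as in the worked solution.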
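To make the Hidden Markov Model idea in part (b) concrete, the following toy Python sketch decodes the most likely hidden phone sequence from a sequence of observed acoustic codebook symbols using the Viterbi algorithm. The states, observation symbols, and all probabilities are invented for illustration and are not taken from the lecture material.

# A toy HMM: hidden states are the phones of the word "new" (n, uw); observations
# are acoustic codebook symbols (c1, c2). All probabilities are made up.
states = ['n', 'uw']
start_p = {'n': 1.0, 'uw': 0.0}
trans_p = {'n': {'n': 0.4, 'uw': 0.6}, 'uw': {'n': 0.0, 'uw': 1.0}}
emit_p = {'n': {'c1': 0.7, 'c2': 0.3}, 'uw': {'c1': 0.2, 'c2': 0.8}}

def viterbi(observations):
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        best.append({s: max(((best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                             for p in states), key=lambda x: x[0])
                     for s in states})
    # Trace back the most probable state (phone) sequence.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(best) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(['c1', 'c1', 'c2', 'c2']))    # -> ['n', 'n', 'uw', 'uw']

The decoded path shows how the hidden states (phones) are recovered from observations drawn from a different alphabet (the acoustic codes), which is what distinguishes a Hidden Markov Model from a simple Markov model.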
(c) Briefly describe the major stages in a speech synthesis system. (5)

SOLUTION (from lecture)
Speech synthesis or Text-to-Speech (TTS) systems perform two major stages of processing:
1. Text Structure Analysis: Here the grammatical structure of the input text is determined. This helps in achieving a number of subtasks, such as disambiguating words (for example, 'St.' in 'St. Andrews' vs 'Don St.') and determining the prosodic structure. Using this structural analysis and a pronunciation dictionary, this stage produces a sequence of phones and the pitch specification.
2. Waveform Generation: This stage actually produces the sound wave given the 'target' from the previous stage. It often uses triphones from pre-recorded speech.

(d) What is micro-planning? Explain the key tasks of a micro-planner. (5)

SOLUTION (from lecture)
Micro-planning is one of the modules in a text generation system. It takes a document plan tree as input from the previous stage and carries out lexicalisation, referring expression generation and aggregation before passing sentence or phrase specifications on to the realiser.
Lexicalisation: selecting words to communicate the information in messages. Decision trees are used to select words.
Aggregation: combining individual phrases or sentences based on either information content or possible realisation forms.
Referring Expression Generation: identifying specific domain entities and objects. There are two issues here: the first is to introduce an object for the first time, and the second is to make subsequent references to it. A comprehensive solution to this problem is not yet known. First-time references are made either by an indefinite noun phrase or by using a proper name. For subsequent references we use a simple rule such as:
Use a pronoun if the object was mentioned in the previous clause;
else use a definite noun phrase, or a short name if one exists.

3. (a) Summarise the merits and demerits of any three schemes for representing semantics in an NLP system. (6)

SOLUTION (from lecture)
There are many ways of representing the meanings of utterances in an NLP system. Three methods are compared below:

1. Executable programs: C (or SQL) code that carries out the task expressed in the utterance. Merit: suitable for representing procedural knowledge; for example, the meaning of 'Show the directory' is better represented by code that actually retrieves the directory contents. Demerit: usually needs a different translator for each application.

2. Logical formulae: the utterance translates into first-order (or any other) logic. Merit: suitable for representing declarative knowledge; helps in carrying out inferences and generating information that can be used either for answering questions or for performing some decision-making task. Demerit: might differ from the way people think.

3. AI knowledge representation: semantic nets and frames as used in AI systems. Merit: mainly suitable in systems where there is already an existing AI knowledge base. Demerit: may run into problems about the choice of primitives, etc.
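As a small, added illustration of the 'executable program' representation above (not from the lecture notes), the meaning of 'Show the directory' can be represented directly as code that carries out the task; one possible logical-formula rendering of the same utterance is sketched in the comment for contrast.

import os

# Logical-formula view of the same meaning (one possible rendering):
#   request(user, show(contents(current_directory)))
# The executable-program view simply performs the task:

def show_the_directory(path='.'):
    # Retrieve and display the directory contents.
    for entry in sorted(os.listdir(path)):
        print(entry)

show_the_directory()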
Consider the sample movie database in Appendix 3. Define its semantics to Microsoft English Query such that it understands the following queries:
i) List action movies
ii) Who is the oldest actor?
iii) Who is the star in Star Wars?
(9)

SOLUTION (from the practical class, but this particular database was not discussed in the practical; students worked on a different database.)
i) movie entity - corresponds to the movie table.
some_movies_are_action relationship - corresponds to the genre attribute in the movie table.
ii) actor entity - corresponds to the actor table.
age entity - corresponds to the age attribute in the actor table.
actor_has_age relationship - relates the age field to the actor; the adjective 'old' is used for higher ages (say > 65).
iii) movie_cast_actor relationship - corresponds to the join of movie and actor via casting.
cast entity - corresponds to casting.ordinal (the star is the actor with ordinal value 1).

(b), (c) What are referring expressions? In the following passage mark all the referring expressions and explain how they may be processed in a text understanding system. State clearly any assumptions you have to make.
"A policeman saw a youth stealing a camera. The policeman ran after the thief. He caught him." (5)

SOLUTION (from lecture)
Referring expressions are linguistic expressions that are used to identify entities and objects in a text/discourse.
Referring expressions in the example text:
1. A policeman - indefinite noun phrase
2. a youth - indefinite noun phrase
3. a camera - indefinite noun phrase
4. The policeman - definite noun phrase
5. the thief - definite noun phrase
6. He - pronoun
7. him - pronoun
Processing indefinite noun phrases involves creating new instances in the discourse representation without specifying their exact identity.
Processing definite noun phrases involves finding the most recently mentioned entity that fits the definite NP ('the policeman' matches 'a policeman', but not 'a youth').
Processing pronouns involves finding the last object mentioned with the correct gender agreement. In the above example both the policeman and the youth are candidates, so as a further guide for resolving the referent we can use focus: the focus of the second sentence is clearly on the policeman, and therefore 'he' in the third sentence refers to the policeman, not the youth. Moreover, if we assume that the NLP system has access to a commonsense reasoning engine, the system can reason that the policeman is expected to catch the thief, and not the other way round.
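The processing rules just described can be sketched as a toy Python program (an added illustration, not part of the model answer). The passage is pre-analysed by hand; the gender annotations and the knowledge that 'thief' describes the youth are assumptions supplied here as background knowledge.

SYNONYMS = {'thief': 'youth'}   # assumed background knowledge: the thief is the youth
discourse = []                  # entities in order of mention

def process_indefinite(head):
    # Indefinite NP: create a new instance in the discourse representation.
    entity = {'head': head, 'gender': 'male' if head in ('policeman', 'youth') else None}
    discourse.append(entity)
    return entity

def process_definite(head):
    # Definite NP: find the most recently mentioned entity that fits it.
    head = SYNONYMS.get(head, head)
    for entity in reversed(discourse):
        if entity['head'] == head:
            return entity
    return process_indefinite(head)   # unseen definite: accommodate a new entity

def process_pronoun(gender, focus=None):
    # Pronoun: last mentioned object with the right gender, preferring the focus.
    candidates = [e for e in reversed(discourse) if e['gender'] == gender]
    if focus in candidates:
        return focus
    return candidates[0] if candidates else None

# "A policeman saw a youth stealing a camera."
process_indefinite('policeman')
process_indefinite('youth')
process_indefinite('camera')
# "The policeman ran after the thief."  (the focus of this sentence is the policeman)
focus = process_definite('policeman')
process_definite('thief')
# "He caught him."  -- 'him' is taken to be the other male candidate,
# standing in for the commonsense reasoning step described above.
he = process_pronoun('male', focus=focus)
him = next(e for e in reversed(discourse) if e['gender'] == 'male' and e is not he)
print(he['head'], 'caught', him['head'])    # -> policeman caught youth

Run on the example passage, the sketch resolves 'he' to the policeman (via focus) and 'him' to the remaining male candidate, the youth, mirroring the analysis above.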
(d) What is a controlled language? Explain how its use facilitates machine translation. (5)

SOLUTION (from lecture)
A controlled language is a subset of a full natural language such as English, originally designed for writing technical documentation for non-native speakers. Controlled English allows only a limited vocabulary, limited grammatical structures and limited sentence lengths. In addition, words are defined to mean only one thing, and concepts are always expressed by a unique word. There is therefore no scope for ambiguity, which is a major problem for source-text understanding in machine translation systems. The commercial aerospace industry in Europe uses AECMA Simplified English for authoring its technical documentation. Because of its simplicity and lack of ambiguity, this simplified English allows authors to produce texts that are far easier to machine translate.

Appendix 1
Grammar and Lexicon

distinguished(s).

s ---> np, vp.
np ---> name.
np ---> det, noun.
vp ---> verb, np.

name ---> [peter].

noun ---> [army].
noun ---> [nations].
noun ---> [citizens].
noun ---> [men].
noun ---> [lives].

verb ---> [join].
verb ---> [joins].
verb ---> [lead].
verb ---> [leads].
verb ---> [sacrifice].
verb ---> [sacrifices].

adj ---> [risky].

prep ---> [of].

det ---> [the].
det ---> [an].
det ---> [a].
det ---> [their].

Appendix 2
Frequency data from the British National Corpus (BNC)

Word/phrase    Number of occurrences in the BNC
if             261249
if music       11
music          14747
music be       5
be             663946
be the         18516
the            5894379
the food       2280
food           18540
in             1968509
in muscle      30
muscle         1778
muscle bee     0
bee            511
bee the        1
the good       3130
good           83077

Appendix 3
Database

Movie Table: Id (integer, key), Title (text), Year (Decimal(4)), Genre (text)
Casting Table: Movieid (key), Actorid (key), Ordinal (integer)
Actor Table: Id (integer, key), Name (text), Age (integer)

Notes:
Casting.Ordinal - the ordinal position of the actor in the cast list. The star of the movie will have ordinal value 1, the co-star will have value 2, and so on.

Example Tuples:
Movie (1, Star Wars, 1971, science fiction)
Casting (1, 1, 2)
Actor (1, Harrison Ford, 52)
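As an added illustration (not part of the paper), the following Python/sqlite3 sketch loads the Appendix 3 schema and example tuples and shows the kind of SQL that the English Query entities and relationships defined in 3(a) would map the three example questions onto. The exact SQL generated by Microsoft English Query is an assumption here.

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.executescript("""
CREATE TABLE movie   (id INTEGER PRIMARY KEY, title TEXT, year DECIMAL(4), genre TEXT);
CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE casting (movieid INTEGER, actorid INTEGER, ordinal INTEGER);
INSERT INTO movie   VALUES (1, 'Star Wars', 1971, 'science fiction');
INSERT INTO actor   VALUES (1, 'Harrison Ford', 52);
INSERT INTO casting VALUES (1, 1, 2);
""")

# i)  "List action movies": the some_movies_are_action relationship maps the word
#     'action' onto the genre attribute of the movie table.
print(cur.execute("SELECT title FROM movie WHERE genre = 'action'").fetchall())

# ii) "Who is the oldest actor?": the actor_has_age relationship lets 'oldest' be
#     interpreted as the maximum of the age attribute.
print(cur.execute("SELECT name FROM actor ORDER BY age DESC LIMIT 1").fetchall())

# iii) "Who is the star in Star Wars?": the movie_cast_actor relationship joins
#      movie and actor via casting; the star is the actor with ordinal value 1.
print(cur.execute("""
    SELECT a.name
    FROM actor a JOIN casting c ON c.actorid = a.id
                 JOIN movie m   ON c.movieid = m.id
    WHERE m.title = 'Star Wars' AND c.ordinal = 1
""").fetchall())

conn.close()

Note that with the single set of example tuples given, queries (i) and (iii) return no rows, since the sample movie's genre is 'science fiction' and Harrison Ford's ordinal value is 2 rather than 1; query (ii) returns Harrison Ford.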