UNIVERSITY OF ABERDEEN                                   SESSION 2001-2002
CS4517 SOLUTION

1. (a) You are given the DCG grammar and lexicon as shown in Appendix 1. Modify or extend the grammar and lexicon to parse the following sentences
i. John leaves in the early morning.
ii. Mary can accept an evening flight.
iii. John and Mary prefer British Airways.
(9)

SOLUTION (problem solving from the practical class)

i) Add a new rule for np that allows an adjective after the determiner
np ---> det, adj, noun.
Add a new rule for vp that allows a prepositional phrase (pp) after the verb
vp ---> verb, pp.
Add a new rule for pp
pp ---> prep, np.
In this case, the word 'morning' is a noun.

ii) Add a new rule for vp that allows an auxverb before the verb
vp ---> auxverb, verb, np.
In this case, the word 'evening' is an adjective. (The noun 'flight' also needs to be added to the lexicon.)

iii) Add a new rule for np that allows a conjunction between two names or, more generally, between two nps
np ---> name, conj, name.
Alternatively, a more general np can be defined as
np ---> np, conj, np.
(I will accept either of the above two rules involving conjunction.)

(b) Add number and tense features to the grammar and lexicon to enforce their agreement in the above sentences. (8)

SOLUTION (problem solving from the practical class)

Tense and number features are to be added to all the verbs in the lexicon. For instance,
verb(singular, present) ---> [leaves].
verb(plural, present) ---> [prefer].
verb(infinitive) ---> [accept].

Number features are to be added to all the nouns in the lexicon. For instance,
noun(singular) ---> [morning].

Names don't have any feature.

Define the number feature for determiners in the lexicon
det(singular) ---> [his].
det(-) ---> [the].    % (-) means either singular or plural
det(singular) ---> [an].

Define the number feature for the conjunction in the lexicon.
conj(plural) ---> [and].

In a sentence, np and vp must agree in number.
s ---> np(Number), vp(Number).

The vp number comes from the verb number.
vp(Number) ---> verb(Number, -), np(-).
vp(Number) ---> verb(Number, -), pp.
vp(Number) ---> auxverb(Number, -), verb(infinitive), np.

Names are always singular in number.
np(singular) ---> name.

Determiner and noun agree in number.
np(Number) ---> det(Number), noun(Number).
np(Number) ---> det(Number), noun(Number), name.
np(Number) ---> det(Number), adj, noun(Number).

Modification of the np rule involving conjunction is non-trivial linguistically. Since conjunctions were not discussed in the class, I will accept a solution such as
np(plural) ---> name, conj, name.
However, the following solution will also be accepted.
np(Number) ---> name, conj(Number), name.
Yet another solution could be
np(plural) ---> np(-), conj(plural), np(-).
np(Number) ---> np(Number), conj(singular), np(Number).

(A runnable transcription of the rules from parts (a) and (b) in standard DCG notation is sketched after part (c) below.)

(c) What are inflectional and derivational morphology in English? Explain with the help of examples. (4)

SOLUTION (from lecture)

Inflectional morphology and derivational morphology refer to the two processes used in English for forming words from morphemes.

Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function such as agreement. Examples:
1. '-s' is an inflectional morpheme used to mark the plural on nouns (book - books, tree - trees).
2. '-ed' is the past tense marker on verbs (delete - deleted, add - added).

Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly. Examples:
1. 'Computer' can take '-ize' to form 'computerize'.
2. 'Computerize' can take '-ation' to produce 'computerization'.
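As an illustration only (not part of the model answer), the rules from parts (a) and (b) can be written in standard Prolog DCG notation (--> rather than the --->/distinguished(s) notation used in the course notes) so that they can be loaded and tested directly. Argument positions are simplified slightly, words are written as lowercase atoms, and the lexicon entry for 'flight' is an addition needed for sentence ii.

% Minimal runnable sketch of the extended, feature-annotated grammar.
% Covers only the three example sentences from question 1.

s --> np(Number), vp(Number).

np(singular) --> name.
np(plural)   --> name, conj, name.
np(Number)   --> det(Number), noun(Number).
np(Number)   --> det(Number), adj, noun(Number).

vp(Number) --> verb(Number, _Tense), np(_).
vp(Number) --> verb(Number, _Tense), pp.
vp(Number) --> auxverb(Number), verb(infinitive, _), np(_).

pp --> prep, np(_).

name --> [john].
name --> [mary].
name --> [british, airways].

noun(singular) --> [morning].
noun(singular) --> [flight].        % addition needed for sentence ii

verb(singular, present) --> [leaves].
verb(plural, present)   --> [prefer].
verb(infinitive, _)     --> [accept].

auxverb(_) --> [can].               % 'can' combines with any subject number

adj --> [evening].
adj --> [early].

prep --> [in].
conj --> [and].

det(singular) --> [an].
det(singular) --> [his].
det(_)        --> [the].            % 'the' is unmarked for number

% Example queries (e.g. in SWI-Prolog):
% ?- phrase(s, [john, leaves, in, the, early, morning]).
% ?- phrase(s, [mary, can, accept, an, evening, flight]).
% ?- phrase(s, [john, and, mary, prefer, british, airways]).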
(d) How can an NLP system handle spelling errors while processing text? (4)

SOLUTION (based on lecture, but no spelling correction algorithms have been discussed in the class)

Some of the points I'd like to see made are:

Categorization of errors can help in the detection and correction of spelling errors. Spelling errors can belong to the following categories
1. non-word errors
2. real-word errors

80% of all spelling errors are due to
1. character insertion (the -> ther)
2. character deletion (the -> th)
3. character substitution (the -> thw)
4. character transposition (the -> hte)

Spelling correction can be based on
- statistical models of error frequency (e.g. due to key proximity on the keyboard)
- statistical models of word bigrams (e.g. deciding that 'out cat' is more probable than 'our cask')
- AI world knowledge

2. (a) Consider the sample library database in Appendix 2. If you want to use Microsoft English Query as the front end to this database, what semantic model do you need to define to respond to the following queries?
i) List the general books.
ii) Who has David Copperfield?
iii) When is Hamlet due?
(9)

SOLUTION (from the practical class, but this particular database was not discussed in the practical. Students worked on a different database.)

i) books entity - corresponds to the Books table; its name corresponds to the Title attribute.
some_books_are_general (or, equivalently, some_books_are_Technical) relationship - corresponds to the Category attribute in the Books table.

ii) members entity - corresponds to the Members table; its name corresponds to Firstname and Lastname.
member_borrow_book relationship - corresponds to the join of Books and Members via Issues.

iii) member_return_book relationship - corresponds to the join of Books and Members via Issues. Relate this relationship to a time and associate the DueDate field of the Issues table with that time.

Alternative answers will be acceptable.

(b) What are Markov Models? How are they used in speech processing systems? (6)

SOLUTION (from lectures)

A Markov Model is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. In a simple Markov model the input alphabet is the same as the alphabet of the underlying states; in a Hidden Markov Model these two alphabets are different, so the state sequence cannot be read directly off the input.

Markov Models are used to capture probabilistic pronunciation models of morphemes or words in a speech processing system. In a speech recognition system these models are used for identifying either phonemes from code sequences or words from phoneme sequences. In a speech synthesis system they represent the pronunciation models for words.

(c) What are referring expressions? Write down all the referring expressions in the following text and explain how a text understanding system will resolve them.
"A rich doctor met a tax lawyer for lunch. The doctor was unhappy, because he found the lawyer greedy."
(5)

SOLUTION (from lecture)

Referring expressions from the example text:
1. A rich doctor - indefinite noun phrase
2. A tax lawyer - indefinite noun phrase
3. The doctor - definite noun phrase
4. He - pronoun
5. The lawyer - definite noun phrase

Processing indefinite noun phrases involves creating new instances in the discourse representation without specifying their exact identity.
Processing definite noun phrases involves finding the most recently mentioned entity that fits the definite NP ('the doctor' matches 'a rich doctor', but not 'a tax lawyer').
Processing pronouns involves finding the last object mentioned with the correct gender agreement. In the above example, both the lawyer and the doctor are candidates. As a further guide for resolving the referent we can use focus: the focus of the second sentence is clearly on the doctor, and therefore 'he' refers to the doctor rather than the lawyer.
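As a toy illustration only (invented for this solution set, not from the lecture), the recency-plus-agreement strategy described above can be sketched in Prolog. The predicate and entity names are assumptions made for this sketch.

:- use_module(library(lists)).

% Entities introduced by the two indefinite NPs, most recent first.
% Gender is unknown for both, so either is compatible with 'he'.
discourse([entity(lawyer, unknown), entity(doctor, unknown)]).

% The subject of the second sentence is taken to be the focus.
focus(doctor).

compatible(G, G).
compatible(unknown, _).
compatible(_, unknown).

% Pure recency would pick the lawyer ...
resolve_by_recency(PronounGender, Referent) :-
    discourse(Entities),
    member(entity(Referent, Gender), Entities),
    compatible(Gender, PronounGender),
    !.

% ... but a focus preference picks the doctor, as argued above.
resolve(PronounGender, Referent) :-
    focus(Referent),
    discourse(Entities),
    member(entity(Referent, Gender), Entities),
    compatible(Gender, PronounGender),
    !.
resolve(PronounGender, Referent) :-
    resolve_by_recency(PronounGender, Referent).

% ?- resolve_by_recency(masculine, X).   X = lawyer.
% ?- resolve(masculine, X).              X = doctor.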
(d) Define the terms term frequency and inverse document frequency in the context of information retrieval (IR). Explain how they are used in IR systems. (5)

SOLUTION (from lecture)

Mainly the following points are expected.

Term frequency (TF) - a measure of the frequency of a term in a document or query.
Inverse document frequency (IDF) - a measure of the rarity of a term across the collection.

In an IR system both the query and the documents are represented as vectors of terms (excluding stop words). The TF and IDF weights of the terms are used to score the match between the query and the documents, and the documents with the highest TF x IDF scores are retrieved for the user.

3. Explain the different stages (tasks) involved in Information Extraction (IE), clearly stating the techniques used. How do you measure the effectiveness of an IE system? Do you think IE at the state of the art offers a viable solution for creating databases from text bases? Explain your opinion. (25)

SOLUTION

This is a descriptive question. IE has been discussed in the lecture using FASTUS as the example system. The stages of processing in FASTUS are as follows
1. Tokens - transform an input stream of characters into a token sequence.
2. Complex Words - recognize multi-word phrases, numbers, and proper names.
3. Basic Phrases - segment sentences into noun groups, verb groups and particles.
4. Complex Phrases - identify complex noun groups and complex verb groups.
5. Semantic Patterns - identify semantic entities and events and insert them into templates.
6. Merging - merge references to the same entity or event from different parts of the text.

Effectiveness of the system is measured with metrics such as
Recall (R) = number of correct answers given by the system / total number of possible correct answers in the text
Precision (P) = number of correct answers given by the system / number of answers given by the system
F-measure = ((b^2 + 1)PR) / (b^2 P + R) --- when b is less than one precision is favoured, and when b is greater than 1 recall is favoured (a short numerical illustration is given at the end of this answer).

Does IE offer a viable solution for transforming text bases into databases? This is an open-ended question. IE at the state of the art assumes that the input text base contains a small amount of relevant information and a large amount of noise. This assumption allows the processing to ignore large portions of the text and concentrate on the limited amount of text that matters for the user's requirements. If user requirements continue to fulfil this assumption, then one can argue that IE systems have succeeded in narrowing the gap between text bases and databases. However, most information is still hidden in texts, and therefore it can be argued that text bases and databases are as far apart as they have always been. Moreover, most IE systems have been applied to specialised domains such as terrorist events and business mergers. The technology may not scale up to more desirable applications such as automatically acquiring domain ontologies (or database schemas) from a set of domain-specific documents.
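The following small sketch (not part of the original answer) computes the F-measure quoted above for some illustrative precision and recall figures, showing why small values of b favour precision and large values favour recall. The figures P = 0.8 and R = 0.4 are assumed for illustration.

% f_measure(+B, +P, +R, -F): F = ((B^2 + 1) * P * R) / (B^2 * P + R).
f_measure(B, P, R, F) :-
    F is ((B*B + 1) * P * R) / (B*B * P + R).

% With P = 0.8 and R = 0.4 (illustrative figures):
% ?- f_measure(1.0, 0.8, 0.4, F).   F = 0.533...
% ?- f_measure(0.5, 0.8, 0.4, F).   F = 0.666...  (closer to P: small b favours precision)
% ?- f_measure(2.0, 0.8, 0.4, F).   F = 0.444...  (closer to R: large b favours recall)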
Appendix 1: Grammar and Lexicon

distinguished(s).

s ---> np, vp.
np ---> name.
np ---> det, noun.
vp ---> verb, np.

name ---> [John].
name ---> [Mary].
name ---> [British, Airways].
noun ---> [morning].
verb ---> [leaves].
verb ---> [prefer].
verb ---> [accept].
auxverb ---> [can].
adj ---> [evening].
adj ---> [early].
prep ---> [in].
conj ---> [and].
det ---> [his].
det ---> [the].
det ---> [an].

Appendix 2: Database

Members table:
MemberID (numeric, key)
Firstname (text)
Lastname (text)
Address (text)
Phone (numeric)

Issues table:
MemberID (key)
BookID (key)
IssueDate (Date)
DueDate (Date)

Books table:
BookID (numeric, key)
Title (text)
Category (General or Technical)
Author (text)
Publisher (text)
Year (numeric)

Example tuples:
Members (123456, John, Smith, 99 Skene Street, 707070)
Issues (123456, 642.21A, 21Jan2002, 21Jun2002)
Books (642.21A, Natural Language Understanding, Technical, James Allen, Benjamin Cummins, 1994)
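As an illustration only (not part of the paper), the example tuples above can be written as Prolog facts, and the join of Books and Members via Issues that the question 2(a) solution relies on can then be expressed directly. All predicate names here are invented for this sketch, and dates are written as date(Year, Month, Day) terms.

% Appendix 2 example tuples as Prolog facts (names and representation assumed).
member_row(123456, 'John', 'Smith', '99 Skene Street', 707070).
issue_row(123456, '642.21A', date(2002, 1, 21), date(2002, 6, 21)).
book_row('642.21A', 'Natural Language Understanding', technical,
         'James Allen', 'Benjamin Cummins', 1994).

% "Who has <Title>?"  -- the member_borrow_book join (Books-Issues-Members).
has_book(Firstname, Lastname, Title) :-
    book_row(BookID, Title, _Category, _Author, _Publisher, _Year),
    issue_row(MemberID, BookID, _IssueDate, _DueDate),
    member_row(MemberID, Firstname, Lastname, _Address, _Phone).

% "When is <Title> due?"  -- the same join related to the DueDate field.
due_date(Title, DueDate) :-
    book_row(BookID, Title, _, _, _, _),
    issue_row(_, BookID, _, DueDate).

% ?- has_book(F, L, 'Natural Language Understanding').
% F = 'John', L = 'Smith'.
% ?- due_date('Natural Language Understanding', D).
% D = date(2002, 6, 21).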