Solutions to CS4517

UNIVERSITY OF ABERDEEN - CS4517 SOLUTION
SESSION 2003-2004
1.
(a) How does inflectional morphology differ from derivational morphology? Explain with examples from the English
language.
(5)
SOLUTION (from lecture)
Inflectional morphology and derivational morphology refer to the two main processes used in English for forming words from
morphemes.
Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the
original stem and usually filling some syntactic function such as agreement.
Examples:
1. ‘-s’ is an inflectional morpheme used to mark the plural on nouns (book – books, tree – trees).
2. ‘-ed’ is the past-tense marker on verbs (delete – deleted, add – added).
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class,
often with a meaning hard to predict exactly.
Examples:
1. Computer can take ‘-ize’ to form computerize.
2. Computerize can take ‘-ation’ to produce computerization.
(b) You are given the DCG grammar and lexicon as shown in Appendix 1. Modify or extend the grammar and lexicon
to accept the following sentences
i. [peter, may, join, the, army] (i.e., Peter may join the army)
ii. [army, men, lead, risky, lives] (i.e., Army men lead risky lives)
iii. [should, nations, sacrifice, the, lives, of, their, citizens] (i.e., Should nations sacrifice the lives of their citizens)
but to reject the following
i. [peter, may, joins, the, army] (i.e., Peter may joins the army)
ii. [army, men, leads, risky, lives] (i.e., Army men leads risky lives)
iii. [should, nations, sacrifices, the, lives, of, their, citizens] (i.e., Should nations sacrifices the lives of their citizens)
(9)
SOLUTION (problem solving from the practical class)
i)
Names do not carry any feature.
Add the word ‘may’ to the lexicon as an auxiliary verb with number and tense features (‘-’ means unspecified):
aux(-,present) ---> [may].
Add tense and number features to the verb ‘join’ in the lexicon:
verb(plural,present) ---> [join].
Add a number feature to the noun ‘army’:
noun(singular) ---> [army].
Define the number feature for the determiner ‘the’:
det(-) ---> [the].
In a sentence, np and vp must agree in number:
s ---> np(number), vp(number).
Add a new rule for vp that allows an auxiliary verb before the verb; the vp number comes from the auxiliary verb:
vp(number) ---> aux(number,-), verb(plural,present), np(-).
Names are always singular in number:
np(singular) ---> name.
ii)
Add the word ‘army’ to the lexicon as an adjective
CS4517 (Natural Language Processing)
adj ---> [army].
(in this case, army is an adjective)
Add a number feature to the noun ‘men’:
noun(plural) ---> [men].
Add tense and number features to the verb ‘lead’ in the lexicon:
verb(plural,present) ---> [lead].
Add number features to the noun ‘lives’ in the lexicon (in this case, lives is a noun):
noun(plural) ---> [lives].
Add a new rule for np that allows an adjective before the noun; the np number is derived from the noun:
np(number) ---> adj, noun(number).
iii)
Add the word ‘should’ to the lexicon as an auxiliary verb with number and tense features:
aux(-,past) ---> [should].
Add tense and number features to the verb ‘sacrifice’ in the lexicon:
verb(plural,present) ---> [sacrifice].
Add number features to the noun ‘nations’:
noun(plural) ---> [nations].
Add a new rule for np that allows a prepositional phrase (pp) after the noun:
np(number) ---> det(number), noun(number), pp.
Add a new rule for pp:
pp ---> prep, np(-).
Add a new rule for s that allows a question sentence:
s ---> aux(-,number), np(number), verb(plural,present), np(-).
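The extended grammar can be sketched as a small backtracking recognizer in Python (the exam answer itself is in DCG notation; this is only an illustrative re-implementation). As in the DCG rules, ‘-’ acts as a wildcard feature that unifies with anything. Tense features are omitted for brevity, and one extra rule, np ---> noun(number), is assumed so that the bare plural subject ‘nations’ parses; all function and variable names below are my own.

```python
LEX = {
    "peter": [("name", "-")],
    "may": [("aux", "-")], "should": [("aux", "-")],
    "join": [("verb", "plural")], "joins": [("verb", "singular")],
    "lead": [("verb", "plural")], "leads": [("verb", "singular")],
    "sacrifice": [("verb", "plural")], "sacrifices": [("verb", "singular")],
    "army": [("noun", "singular"), ("adj", "-")],
    "men": [("noun", "plural")], "lives": [("noun", "plural")],
    "nations": [("noun", "plural")], "citizens": [("noun", "plural")],
    "the": [("det", "-")], "their": [("det", "-")],
    "risky": [("adj", "-")], "of": [("prep", "-")],
}

def agree(a, b):
    # "-" is an unspecified feature and unifies with anything
    return a == "-" or b == "-" or a == b

def word(ws, i, cat):
    # yield (next position, number feature) for each reading of ws[i] as cat
    if i < len(ws):
        for c, f in LEX.get(ws[i], []):
            if c == cat:
                yield i + 1, f

def np(ws, i):
    for j, _ in word(ws, i, "name"):
        yield j, "singular"                     # names are always singular
    for j, f in word(ws, i, "noun"):
        yield j, f                              # bare noun ("nations")
    for j, _ in word(ws, i, "adj"):
        for k, f in word(ws, j, "noun"):
            yield k, f                          # adj noun ("army men")
    for j, fd in word(ws, i, "det"):
        for k, fn in word(ws, j, "noun"):
            if agree(fd, fn):
                yield k, fn                     # det noun
                for m in pp(ws, k):
                    yield m, fn                 # det noun pp

def pp(ws, i):
    for j, _ in word(ws, i, "prep"):
        for k, _ in np(ws, j):
            yield k

def s(ws):
    n = len(ws)
    for j, fs in np(ws, 0):                     # declarative: np vp
        for k, fv in word(ws, j, "verb"):       # vp -> verb np (must agree)
            if agree(fs, fv) and any(m == n for m, _ in np(ws, k)):
                return True
        for k, _ in word(ws, j, "aux"):         # vp -> aux verb np
            for m, fv in word(ws, k, "verb"):
                if fv == "plural":              # base form after an auxiliary
                    if any(q == n for q, _ in np(ws, m)):
                        return True
    for j, _ in word(ws, 0, "aux"):             # question: aux np verb np
        for k, _ in np(ws, j):
            for m, fv in word(ws, k, "verb"):
                if fv == "plural":
                    if any(q == n for q, _ in np(ws, m)):
                        return True
    return False

accept = [["peter", "may", "join", "the", "army"],
          ["army", "men", "lead", "risky", "lives"],
          ["should", "nations", "sacrifice", "the", "lives", "of", "their", "citizens"]]
reject = [["peter", "may", "joins", "the", "army"],
          ["army", "men", "leads", "risky", "lives"],
          ["should", "nations", "sacrifices", "the", "lives", "of", "their", "citizens"]]
print(all(s(w) for w in accept), any(s(w) for w in reject))  # True False
```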
(c) What is robust parsing? How is it achieved?
(5)
SOLUTION (from lecture)
In a text-understanding system, if the parser follows the grammar strictly, ungrammatical sentences input by users are
rejected with a message such as ‘syntax error’. Users will not accept this, particularly when the mistake is trivial and a
human reader would have understood the input. Parsers therefore need to be designed to be more tolerant of input errors that
do not distort the meaning. Robust parsing is one of the techniques used for building tolerant parsers. Some
ideas for building a robust parser:
i) Allow the parser to relax some constraints, such as number agreement between determiner and noun in a noun phrase.
ii) Allow the parser to insert/delete function words such as ‘a’ and ‘the’.
(d) Describe briefly a text-understanding application in which a regular grammar rather than a phrase structure grammar
is used for modelling English syntax.
(6)
SOLUTION
Information extraction systems such as FASTUS, which was discussed in the lecture, use a finite-state model (regular grammar)
of English syntax. The stages of processing in FASTUS are as follows:
1. Tokens – Turn an input stream of characters into a token sequence.
2. Complex Words – Recognise multi-word phrases, numbers, and proper names.
3. Basic Phrases – Segment sentences into noun groups, verb groups, and particles.
4. Complex Phrases – Identify complex noun groups and complex verb groups.
5. Semantic Patterns – Identify semantic entities and events and insert them into templates.
6. Merging – Merge references to the same entity or event from different parts of the text.
2.
(a) With the help of the frequencies in Appendix 2, order the following alternative phrases that the acoustic model in a
speech recogniser generated, using a unigram model, a bigram model without smoothing, and a bigram model with plus-one
smoothing (clearly show your calculations):
i. In muscle bee the good
ii. If music be the food
(8)
SOLUTION (problem solving)
Unigram likelihood
In muscle bee the good:
1968509 * 1778 * 511 * 5894379 * 83077 = 875,808,031,873,759,862,932,026
If music be the food:
261249 * 14747 * 663946 * 5894379 * 18540 = 279,536,718,416,175,882,967,321,080
Ranking
1) If music be the food
2) In muscle bee the good
Bigram likelihood without smoothing
In muscle bee the good
30 * 0 * 1 * 3130 = 0
If music be the food
11 * 5 * 18516 * 2280 = 2,321,906,400
Ranking
1) If music be the food
2) In muscle bee the good
Bigram likelihood with plus-one smoothing
In muscle bee the good
31 * 1 * 2 * 3131 = 194,122
If music be the food
12 * 6 * 18517 * 2281 = 3,041,083,944
Ranking
1) If music be the food
2) In muscle bee the good
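The bigram calculations above can be checked mechanically. Note that the exam compares raw (unnormalised) products of counts rather than true probabilities, so this sketch does the same; plus-one smoothing here simply adds 1 to each bigram count from Appendix 2.

```python
# BNC bigram frequencies from Appendix 2.
COUNTS = {("in", "muscle"): 30, ("muscle", "bee"): 0, ("bee", "the"): 1,
          ("the", "good"): 3130, ("if", "music"): 11, ("music", "be"): 5,
          ("be", "the"): 18516, ("the", "food"): 2280}

def bigram_score(words, smooth=0):
    """Product of (count + smooth) over adjacent word pairs."""
    score = 1
    for pair in zip(words, words[1:]):
        score *= COUNTS[pair] + smooth
    return score

s1 = "in muscle bee the good".split()
s2 = "if music be the food".split()
print(bigram_score(s1))            # 0
print(bigram_score(s2))            # 2321906400
print(bigram_score(s1, smooth=1))  # 194122
print(bigram_score(s2, smooth=1))  # 3041083944
```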
(b) What are Markov Models? How are they used in a speech recognition system?
(7)
SOLUTION (From lectures)
A Markov Model is a special case of a weighted automaton in which the input sequence uniquely determines which states the
automaton will go through. In a simple Markov Model the input alphabet is the same as the underlying state alphabet; in a
Hidden Markov Model the two alphabets differ.
Markov Models are used to capture probabilistic pronunciation models of morphemes or words in a speech-processing system.
In a speech recognition system these models are used for identifying either phonemes from code sequences or words from
phoneme sequences.
In a speech synthesis system they represent the pronunciation models for words.
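Recovering the most likely hidden state sequence (e.g. phones) from an observation sequence (e.g. acoustic codes) is normally done with the Viterbi algorithm. A minimal sketch, with a toy two-phone model invented purely for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, best hidden-state path) for the observations."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        # for each state, keep the highest-probability extension of any path
        best = {
            s: max(((p * trans_p[prev][s] * emit_p[s][o], path + [s])
                    for prev, (p, path) in best.items()),
                   key=lambda t: t[0])
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])

# Toy model: which phone sequence best explains the acoustic codes?
states = ["iy", "ih"]
start_p = {"iy": 0.5, "ih": 0.5}
trans_p = {"iy": {"iy": 0.7, "ih": 0.3}, "ih": {"iy": 0.4, "ih": 0.6}}
emit_p = {"iy": {"c1": 0.8, "c2": 0.2}, "ih": {"c1": 0.3, "c2": 0.7}}

prob, path = viterbi(["c1", "c1", "c2"], states, start_p, trans_p, emit_p)
print(path)  # ['iy', 'iy', 'ih']
```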
(c) Briefly describe the major stages in a speech synthesis system.
(5)
SOLUTION (from lecture)
Speech synthesis or Text to Speech (TTS) systems perform two major stages of processing:
1. Text Structure Analysis: Here the grammatical structure of the input text is determined. This helps in achieving a number
of subtasks, such as disambiguating words (for example, St. in St. Andrews vs Don St.) and determining the prosodic structure.
Using this structural analysis and a pronunciation dictionary, this stage produces a sequence of phones and the pitch
specification.
2. Waveform Generation: This stage actually produces the sound wave, given the ‘target’ from the previous stage. It
often uses triphones from prerecorded speech.
(d) What is micro-planning? Explain the key tasks of a micro-planner.
(5)
SOLUTION (from lecture)
Micro-planning is one of the modules in a text generation system. It takes a document plan tree as input from the
previous stage and carries out lexicalisation, referring expression generation, and aggregation before passing sentence
or phrase specifications to the realizer.
Lexicalisation: Selecting words to communicate the information in messages. Decision trees are used to select words.
Aggregation: combining individual phrases or sentences based on either information content or possible realisation forms.
Referring Expression Generation: Identifying specific domain entities and objects. There are two issues here: introducing an
object for the first time, and making subsequent references to it. A comprehensive solution to this problem is as yet unknown.
First-time references are made either with an indefinite noun phrase or with a proper name.
For subsequent references we use a simple strategy such as:
- use a pronoun if the object was mentioned in the previous clause;
- else use a definite noun phrase, or a short name if one exists.
3.
(a) Summarise the merits and demerits of any three schemes for representing semantics in an NLP system.
(6)
SOLUTION: (from lecture)
There are many ways of representing the meanings of utterances in an NLP system. Three methods are compared below:

1. Executable programs: C (or SQL) code that carries out the task expressed in the utterance. Suitable for representing
procedural knowledge; for example, the meaning of ‘Show the directory’ is better represented by code that actually
retrieves the directory contents. Demerit: usually needs a different translator for each application.

2. Logical formulae: translation into first-order (or any other) logic. Suitable for representing declarative knowledge;
helps in carrying out inferences and generating information that can be used either for answering questions or for
performing some decision-making task. Demerit: might differ from the way people think.

3. AI knowledge representation: semantic nets and frames used in AI systems. Mainly suitable in systems where there is
already an existing AI knowledge base. Demerit: may run into problems about the choice of primitives, etc.
(b) Consider the sample movie database in Appendix 3. Define its semantics in Microsoft English Query such that it
understands the following queries:
i) List action movies
ii) Who is the oldest actor?
iii) Who is the star in Star Wars?
(9)
SOLUTION (from the practical class, but this particular database was not discussed in the practical. Students worked on a
different database.)
i)
movie entity – corresponds to the movie table
some_movies_are_action relationship – corresponds to the genre attribute in the movie table
ii)
actor entity – corresponds to the actor table
age entity – corresponds to the age attribute in the actor table
actor_has_age relationship – the age field from actor; the adjective ‘old’ is associated with higher ages (say > 65)
iii)
movie_cast_actor relationship – corresponds to the join of movie and actor via casting
cast entity – corresponds to casting.ordinal
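Although English Query itself is a Microsoft tool, the three questions map onto plain SQL over the Appendix 3 schema. A sketch using Python's built-in sqlite3 with the example tuples from the appendix (note that with the example data there is no action movie, and Star Wars has no ordinal-1 star, since Harrison Ford is listed with ordinal 2):

```python
import sqlite3

# Build the Appendix 3 schema and load its example tuples.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, genre TEXT);
CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE casting (movieid INTEGER, actorid INTEGER, ordinal INTEGER);
INSERT INTO movie VALUES (1, 'Star Wars', 1971, 'science fiction');
INSERT INTO actor VALUES (1, 'Harrison Ford', 52);
INSERT INTO casting VALUES (1, 1, 2);
""")

# i) List action movies
action = db.execute("SELECT title FROM movie WHERE genre = 'action'").fetchall()

# ii) Who is the oldest actor?
oldest = db.execute("SELECT name FROM actor ORDER BY age DESC LIMIT 1").fetchone()

# iii) Who is the star in Star Wars? (star = ordinal 1 in the cast list)
star = db.execute("""
    SELECT a.name FROM actor a
    JOIN casting c ON c.actorid = a.id
    JOIN movie m ON m.id = c.movieid
    WHERE m.title = 'Star Wars' AND c.ordinal = 1
""").fetchall()

print(action)  # [] -- no action movies in the example data
print(oldest)  # ('Harrison Ford',)
print(star)    # [] -- Harrison Ford has ordinal 2 (co-star)
```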
(c) What are referring expressions? In the following passage mark all the referring expressions and explain how they
may be processed in a text understanding system. State clearly any assumptions you have to make.
“A policeman saw a youth stealing a camera. The policeman ran after the thief. He caught him.”
(5)
SOLUTION (from lecture)
Referring expressions are linguistic expressions that are used to identify entities and objects in a text/discourse.
Referring Expression from the example text:
1. A policeman – indefinite noun phrase
2. A youth – indefinite noun phrase
3. A camera – indefinite noun phrase
4. The policeman – definite noun phrase
5. The thief – definite noun phrase
6. He – pronoun
7. him – pronoun
Processing:
- Indefinite noun phrases: create new instances in the discourse representation without specifying their exact identity.
- Definite noun phrases: find the most recently mentioned entity that fits the definite NP (‘the policeman’ matches ‘a
policeman’, but not ‘a youth’).
- Pronouns: find the last object mentioned with the correct gender agreement. In the above example both the policeman and
the youth are candidates. As a further guide for resolving the referent we can use focus: the focus of the second sentence is
clearly on the policeman, and therefore ‘he’ in the third sentence refers to the policeman, not the youth. Moreover, if we
assume that the NLP system has access to a common-sense reasoning engine, the system can reason that the policeman is
expected to catch the thief, not the other way round.
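The three processing strategies can be sketched as a toy discourse model. Entity records and function names are invented for illustration; note how the recency heuristic alone resolves ‘he’ to the youth, which is exactly why focus and common-sense reasoning are needed, as discussed above.

```python
discourse = []  # discourse representation: most recently mentioned entity last

def indefinite(head, gender=None):
    # an indefinite NP introduces a new entity of unspecified identity
    entity = {"head": head, "gender": gender}
    discourse.append(entity)
    return entity

def definite(head):
    # a definite NP resolves to the most recent entity with a matching head
    for entity in reversed(discourse):
        if entity["head"] == head:
            return entity
    return None

def pronoun(gender):
    # a pronoun resolves to the most recent gender-compatible entity;
    # recency alone picks the youth for 'he' -- disambiguating correctly
    # needs focus or common-sense reasoning
    for entity in reversed(discourse):
        if entity["gender"] == gender:
            return entity
    return None

p = indefinite("policeman", "male")   # "A policeman"
y = indefinite("youth", "male")       # "a youth"
c = indefinite("camera")              # "a camera"
assert definite("policeman") is p     # "The policeman"
assert pronoun("male") is y           # recency heuristic alone picks the youth
```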
(d) What is a controlled language? Explain how its use facilitates machine translation.
(5)
SOLUTION (from lecture)
A controlled language is a subset of a full natural language such as English, originally designed for writing technical
documentation for non-native speakers. A controlled English allows only a limited vocabulary, limited grammatical structures
and limited sentence lengths. Words are defined to mean only one thing, and each concept is always expressed by a unique word.
There is therefore no scope for ambiguity, which is a major problem in source-text understanding in machine translation
systems. The commercial aerospace industry in Europe has AECMA Simplified English for authoring its technical
documentation. Because of its simplicity and lack of ambiguity, this simplified English allows authors to produce texts that
are far easier to machine translate.
Appendix 1 Grammar and Lexicon
distinguished(s).

s  ---> np, vp.
np ---> name.
np ---> det, noun.
vp ---> verb, np.

name ---> [peter].

noun ---> [army].
noun ---> [nations].
noun ---> [citizens].
noun ---> [men].
noun ---> [lives].

verb ---> [join].
verb ---> [joins].
verb ---> [lead].
verb ---> [leads].
verb ---> [sacrifice].
verb ---> [sacrifices].

adj  ---> [risky].
prep ---> [of].

det ---> [the].
det ---> [an].
det ---> [a].
det ---> [their].
Appendix 2 Frequency data from the British National Corpus (BNC)

Word/phrase    Number of occurrences in the BNC
if             261249
if music       11
music          14747
music be       5
be             663946
be the         18516
the            5894379
the food       2280
food           18540
in             1968509
in muscle      30
muscle         1778
muscle bee     0
bee            511
bee the        1
the good       3130
good           83077
Appendix 3 Database

Movie Table: Id (integer, key), Title (text), Year (Decimal(4)), Genre (text)
Casting Table: Movieid (key), Actorid (key), Ordinal (integer)
Actor Table: Id (integer, key), Name (text), Age (integer)

Notes: Casting.Ordinal – the ordinal position of the actor in the cast list. The star of the movie will have ordinal value 1,
the co-star will have value 2, and so on.

Example Tuples:
Movie (1, Star Wars, 1971, science fiction)
Casting (1, 1, 2)
Actor (1, Harrison Ford, 52)