Information Systems
Natural Language Processing
Advanced Higher
9020
Summer 2001
HIGHER STILL
Information
Systems
Natural Language Processing
Advanced Higher
Support Materials
This publication may be reproduced in whole or in part for educational purposes provided that no profit
is derived from the reproduction and that, if reproduced in part, the source is acknowledged.
First published 2001
Learning + Teaching Scotland
Northern College
Gardyne Road
Broughty Ferry
Dundee
DD5 1NY
Tel. 01382 443 600
CONTENTS
Section 1: Motivation and context for Natural Language Processing
Section 2: Analysing Natural Language Processing
Section 3: Parsing and General Techniques
Section 4: Further Techniques used in Natural Language Processing
Information Systems: Natural Language Processing (AH)
1. MOTIVATION AND CONTEXT FOR NATURAL LANGUAGE PROCESSING
1.1 The Nature and Role of Natural Language Processing
Natural Language Processing (NLP) is a sub-field of Artificial Intelligence. It is
also known as Computational Linguistics.
NLP is concerned with the production and comprehension of natural languages
such as English or Russian. It deals largely with written language or text, but there is
some consideration of spoken language, including phonology, the study of the sounds
that make up a language.
You can find an interesting glossary of terms used in NLP on the World Wide Web at:
http://www.cs.bham.ac.uk/~pxc/nlpa/nlpgloss.html
1.1.1 Functions of language in human communication
Language is the principal method of communication between humans. We use it for a
number of different purposes:
• Transferring information by making statements, e.g. ‘It’s raining outside.’ This allows us to benefit from the experience of others and reduces the amount of exploration or information seeking each individual has to carry out.
• Querying others about some aspect of the world or their experience of it. This is done by asking questions, e.g. ‘What’s the weather like in Spain?’
• Answering questions, e.g. ‘The rain in Spain falls mainly on the plain.’ This can also be regarded as a form of information transfer.
• Making requests or giving commands, i.e. asking others to carry out actions: ‘Peel me a grape’. Direct commands are sometimes considered impolite, so requests are often phrased indirectly, e.g. ‘Could you peel a grape for me?’ or ‘I could use some help peeling these grapes.’
• Promising to do something or offering deals, e.g. ‘I’ll buy the grapes if you’ll peel them.’
• Acknowledging offers or requests, e.g. ‘OK’, ‘Fine’ etc.
• Sharing feelings or experiences, e.g. ‘I really enjoy listening to Mozart.’
Notice that some of these (informing, answering, acknowledging and sharing) are
intended to transfer information to the listener, while others (requesting, commanding,
querying) are intended to prompt the listener to take some action.
Some communications such as greetings ‘Good morning. How are you today?’ ‘I’m
fine thanks. How are you?’ are intended only to build and reinforce social links and
convey little or no real information.
1.1.2 Language as a sign of intelligence
No one really knows if we use language because we’re intelligent or if we’re
intelligent because we use language. Jerison suggests that human language arises
from the need for better ‘cognitive maps’ of our territory. He points out that dogs and
other carnivores rely largely on scent marking and their sense of smell to tell them
where they are and what other animals have been there. The early primates (30
million years ago) lacked this well-developed sense of smell and substituted sounds
for scent marking. Language may simply be a means of compensating for our
inadequate noses!
1.1.3 Natural language and other forms of communication
Natural language is not the only form of communication which exists: we’ll look
briefly at four others: sign language, non-human communication, programming
languages and formal logic.
Sign Language
Sign languages, such as British Sign Language (BSL) and American Sign
Language (ASL) are true languages, with vocabularies of thousands of words, and
grammars as complex and sophisticated as those of any spoken or written language.
BSL is now the fourth most widely used language in the UK and a major campaign is
under way to have it officially recognised by the government, as has already happened
in most EU countries.
ASL is quite different from BSL - it is more closely related to French Sign Language,
due to the influence of Laurent Clerc, the first teacher of the deaf in the United States.
ASL has a Topic-Comment syntax, while English uses Subject-Verb-Object order. In terms
of syntax, ASL shares more with spoken Japanese than it does with English.
Sign languages are not an invented system like Esperanto. They are linguistically
complete, natural languages and are the native languages of many deaf men and
women, as well as some hearing children born into deaf families.
Sign languages are sometimes described as gestural languages. This is not absolutely
correct because hand gestures are only one component. Facial features such as
eyebrow motion and lip-mouth movements are also significant and form a crucial part
of the grammatical system. Sign languages also make use of the space surrounding the
signer to describe places and persons that are not present.
Sign languages have a very complex grammar. Unlike spoken languages where there
is only a single stream of sounds, sign languages can have several things happening at
the same time. For instance, the concept of ‘very big’ is conveyed by the simultaneous
use of a hand gesture for ‘big’ and a mouth/cheek shape for ‘very’. Sign languages
have their own morphology (rules for the creation of words), phonetics (rules for hand
shapes), and grammar that are very unlike those found in spoken languages.
Sign languages should not be confused with Signed English, which is a word-for-word signed equivalent of English. Deaf people tend to find it tiring, because its grammar, like that of spoken languages, is linear, while that of sign languages is primarily spatial.
Non-Human Communication
It is often stated that one of the great differences between humans and animals is the
ability of humans to use language. This has frequently been challenged, particularly
in studies of the use of language by primates. These studies have generally followed
one of two paths: the use of language by primates in the wild and attempts to teach
some form of language (not necessarily spoken) to primates in captivity.
Wild primates use a variety of methods of communication. Many use scents to mark
their territory and they use touch to indicate relationships: mothers carry their young
and adults may sit and/or sleep together or groom each other. The higher primates
look at whatever they are paying attention to. Important visual cues include facial
expression, hair erection, general posture, and tail position.
Primates use vocal communication, from soft grunts to whoops, when they want to
attract the attention of others. Sounds may be used to signal danger of an attack or the
location of a food source. The meaning of primate communication depends on the
social and environmental context as well as the particular signals being used. Most
animals use a fixed set of signals to represent messages, which are important to their
survival (food here, predator nearby, approach, withdraw etc.).
Vervet monkeys have the most sophisticated animal communication that we know of.
The sounds they use are learned, rather than instinctive. They have a variety of calls
for different predators: a loud bark for leopards, a short cough for eagles and a chatter
for snakes. They also use one type of grunt to communicate with dominant members
of their own group, another to communicate with subordinate members and a third
type to communicate with members of other groups. They are even capable of lying!
A vervet that is losing a fight may make the leopard alarm, causing the whole group to
run for the trees and forget the fight.
There have been numerous attempts to teach some kind of language to primates.
Researchers argue that projects of this nature can provide valuable information, not
only about the nature of language and cognitive and intellectual capacities, but also
about such issues as the uniqueness of human language and thought. Such projects
also shed light on the early development of language in humans. Another reason for
teaching language to primates is the hope of discovering better methods for training
children with learning difficulties who fail to develop linguistic skills during their
early years.
Allen and Beatrice Gardner began teaching American Sign Language to an infant
chimpanzee named Washoe in 1966. They provided a friendly environment that they
believed would be conducive to learning. The people who looked after Washoe used
only sign language in her presence. She was able to transfer her signs spontaneously
to a new situation, e.g. she used the word ‘more’ in a variety of contexts, not just for
more tickling, which was the first context.
The Gardners reported that Washoe began to use combinations of signs spontaneously
after learning only about eight or ten of them. At one stage Washoe ‘adopted’ an
infant chimpanzee named Loulis. For the next five years, no sign language was used
by humans in Loulis' presence; however, Loulis still managed to learn over 50 signs
from the other chimpanzees.
The year after Project Washoe began, David and Ann Premack started an experiment
with a different kind of language. They used plastic tokens, which represented words
and varied in shape, size, texture, and colour to train a chimpanzee named Sarah.
Sentences were formed by placing the tokens in a line. Sarah was taught nouns,
verbs, adjectives, pronouns, quantifiers, same-difference, negation, and compound
sentences. To show that she was not simply responding to cues from her trainers, she
was introduced to a new trainer who didn’t know her language. When this trainer
presented her with questions, she gave the correct answers less frequently than usual,
but still well above chance.
A chimpanzee named Lana learned to use another language system, a keyboard with
keys for various lexigrams, each representing one word. When Lana pressed a key
with a lexigram on it, the key would light up and the lexigram would appear on a
projector. If keys were pressed accidentally, Lana used the period key as an eraser so
that she could restart the sentence - she did this on her own before it occurred to the
researchers.
Lana started using ‘no’ as a protest (e.g. when someone else was drinking a Coke and
she did not have one) after having learned it as a negation. Lana acquired many skills
which showed her ability to abstract and generalise, e.g. she spontaneously used ‘this’
to refer to things for which she had no name, and she invented names for things by
combining lexigrams in novel ways.
However, many linguists, including the highly influential Noam Chomsky, argue that
language is a uniquely human gift. According to this school, chimpanzees and other
close relatives cannot use language because they lack the human brain structures that
make language work. Chomsky argues that trying to teach language to a chimpanzee
is a bit like teaching a human being to fly. An athlete may be able to jump 20 feet, but
it’s a crude imitation of flying.
Programming Languages
Programming languages have a number of features in common with natural
languages, but there are also significant differences. Programming languages have a
lexicon (or vocabulary) and rules governing how sentences in the languages are
constructed. Most languages allow two different kinds of words, usually referred to as
keywords and identifiers. There are a fixed number of keywords, e.g. begin, end, do,
while etc. and these have a fixed function. There are an infinite number of identifiers.
These are usually associated with a fixed function at the time of declaration, e.g.
procedure name, variable name etc. In general, computer programmers have far more
ability to generate new words than the speakers of a natural language, although their
new words are often influenced by natural languages, e.g. CustName, TotPrice etc.
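The keyword/identifier distinction described above can be sketched in a few lines. This is a toy classifier, not a full lexer, and the keyword set is a small illustrative subset of a Pascal-like language:

```python
import re

# A fixed, finite set of keywords (illustrative subset only).
KEYWORDS = {"begin", "end", "do", "while", "if", "then"}

def classify(word: str) -> str:
    """Classify a word as a language keyword or a programmer-chosen identifier."""
    return "keyword" if word in KEYWORDS else "identifier"

source = "while TotPrice do begin CustName end"
tokens = [(w, classify(w)) for w in re.findall(r"\w+", source)]
print(tokens)
# 'while', 'do', 'begin' and 'end' are keywords with fixed functions;
# 'TotPrice' and 'CustName' are identifiers invented by the programmer.
```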
The syntax (grammatical rules) of modern programming languages can be rigorously
defined and can often be expressed in a formal notation such as BNF (Backus
Normal Form) or Syntax Diagrams. Unfortunately it’s not quite as easy to describe
the semantics (meaning) of a programming language in a formal manner and this is
normally still done by means of an English description. However, it’s still
considerably more rigorous than a natural language.
Formal Logic
One important area of NLP is the study of the semantics or meaning of natural
language statements. In many cases, the most important aspect of semantics is
determining whether a sentence is true or false. We can simplify this task by defining
a formal language with simple semantics and mapping natural language sentences on
to it.
This formal language should be unambiguous, have simple rules of interpretation and
inference and have a logical structure determined by the form of the sentence. Two
commonly used formal languages are propositional logic and predicate logic.
Formal Logic is covered in greater detail in section 4.1.4.
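As a sketch of this mapping (the predicate names are illustrative choices, not a standard notation): in propositional logic, whole clauses become atomic symbols, so ‘It is raining and it is cold’ maps simply to $p \wedge q$. Predicate logic exposes more internal structure, so ‘Every student reads some book’ can be rendered as:

```latex
\forall x\,\bigl(\mathit{Student}(x) \rightarrow
    \exists y\,(\mathit{Book}(y) \wedge \mathit{Reads}(x, y))\bigr)
```

Once a sentence is in this form, truth can be evaluated against a model, and new facts can be derived by mechanical rules of inference.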
1.1.4 Natural Language Modalities (Speech and Text)
Natural language occurs in two distinct forms, text and speech. Although these can be
considered as two different ways of expressing the same information there are
important distinctions. Speech is usually less formal than text, but it can convey
important additional information by means of volume, tone of voice etc. that are
absent in text. It can also be more confusing as a result of accent, mispronunciation
etc.
NLP has traditionally focused on text with Speech Recognition and Speech
Generation being regarded as relatively disparate fields. However, in recent years
there has been a degree of convergence as researchers have realised that a knowledge
of language structure can assist in recognition or generation. We’ll look briefly at
these fields.
Speech Recognition
Speech recognition is the process by which a computer converts an acoustic speech
signal to text. It should be distinguished from speech understanding, the process by
which a computer converts an acoustic speech signal to some form of abstract
meaning.
Speech recognition systems can be speaker-dependent or speaker-independent. A
speaker-dependent system is designed to operate for a single speaker. These systems
are usually easier to develop, cheaper to buy and more accurate, but not as flexible as
speaker-adaptive or speaker-independent systems.
A speaker-independent system is designed to operate for any speaker of a particular
language. These systems are the most difficult to develop, most expensive and
accuracy is lower than speaker dependent systems. However, they are more flexible.
A speaker-adaptive system is developed to adapt its operation to the characteristics
of new speakers. Its difficulty lies somewhere between speaker-independent and
speaker dependent systems.
The size of vocabulary of a speech recognition system affects its complexity,
processing requirements and accuracy. Some applications only require a few words
(e.g. numbers only), others require very large dictionaries (e.g. dictation machines).
An isolated-word system operates on single words at a time - requiring a pause
between each word. This is the simplest form of recognition to perform because the
end points are easier to find and the pronunciation of a word tends not to affect
others. Thus, because the occurrences of words are more consistent, they are easier to
recognise.
A continuous speech system operates on speech in which words are not separated by
pauses. Continuous speech is more difficult to handle for a variety of reasons. It is
difficult to find the start and end points of words. Another problem is coarticulation - the production of each phoneme is affected by the production of surrounding
phonemes, and similarly the start and end of words are affected by the preceding and
following words. The recognition of continuous speech is also affected by the rate of
speech. Rapid speech tends to be harder.
Speech recognition starts with the digital sampling of speech, followed by acoustic
signal processing. The next stage is recognition of phonemes, groups of phonemes
and words. Most systems utilise some knowledge of the language to aid the
recognition process. Some systems try to ‘understand’ speech, i.e. they try to convert
the words into a representation of what the speaker intended to mean or achieve.
Speech Synthesis
Speech synthesis programs convert written input to spoken output by automatically
generating synthetic speech. Speech synthesis is often referred to as ‘Text-to-Speech’
conversion (TTS). There are several algorithms available. The easiest way is simply to
record the voice of a person speaking the desired phrases. This is useful if only a
restricted set of phrases and sentences is used, e.g. messages in a train station, or
schedule information via phone. The quality depends on the way the recording is done.
More sophisticated, but poorer in quality, are algorithms that split the speech into
smaller pieces. The smaller those units are, the fewer of them there are, but the
quality also decreases. One frequently used unit is the phoneme, the smallest
linguistic element. Depending on the language used there are about 35-50 phonemes
in western European languages, i.e. there are 35-50 single recordings. The problem is
combining them, as fluent speech requires fluent transitions between the elements.
The intelligibility is therefore lower, but the memory required is small.
One solution to this dilemma is the use of diphones. Instead of splitting at the
transitions, the cut is done at the centre of the phonemes, leaving the transitions
themselves intact. This gives about 400 elements (20 × 20) and the quality increases.
The longer the units become, the more elements there are, and the quality increases
along with the memory required. Other units that are widely used are half-syllables,
syllables, words, or combinations of them, e.g. word stems and inflectional endings.
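Since every phoneme can in principle be followed by any other, the diphone inventory grows with the square of the phoneme count. A quick sketch (the figures are illustrative, following the numbers used above):

```python
# Diphone inventory size grows quadratically with the number of phonemes,
# because each phoneme may be followed by any other.
def diphone_count(n_phonemes: int) -> int:
    return n_phonemes * n_phonemes

print(diphone_count(20))   # 400, the figure quoted above
print(diphone_count(45))   # over 2000 for a typical western European language
```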
1.1.5 Ambiguity
Natural languages, including English, display a remarkable amount of ambiguity. The
same word, phrase or sentence can have a whole variety of different meanings.
Consider a simple sentence like ‘He made her duck’. This has at least four different
meanings:
• He cooked duck for her
• He cooked a duck, which she had provided
• He caused her to lower her head
• He made a duck for her (presumably from wood, plaster or some other substance).
Ambiguity, and the process of resolving it (disambiguation) are considered in detail in
Section 2.
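The lexical side of this ambiguity can be illustrated with a short sketch. The mini-lexicon below is a hypothetical toy, not a real tagger: each word is assigned its possible parts of speech, and the number of candidate readings multiplies across the sentence before any grammar has filtered them.

```python
from itertools import product

# A hypothetical mini-lexicon: each word maps to its possible parts of speech.
LEXICON = {
    "he":   ["pronoun"],
    "made": ["verb"],
    "her":  ["possessive", "pronoun"],   # 'her duck' (the bird) vs 'made her ...'
    "duck": ["noun", "verb"],            # the bird vs the action of ducking
}

def tag_sequences(sentence):
    """Enumerate every possible part-of-speech assignment for the sentence."""
    words = sentence.lower().split()
    choices = [LEXICON[w] for w in words]
    return [list(zip(words, combo)) for combo in product(*choices)]

readings = tag_sequences("He made her duck")
print(len(readings))  # 4 candidate tag sequences before any grammar filtering
```

Real sentences, with longer words lists and richer lexicons, generate far more candidates, which is why disambiguation is such a central problem.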
1.1.6 Complexity and Scale
One of the problems faced by NLP researchers is the sheer complexity and scale of
natural language, in terms of the amounts of knowledge of different kinds needed to
describe it. We may justifiably regard certain types of computing systems as large or
complex, for example, airline reservation systems or banking systems, but they are far
less complex than natural languages.
The complexity of natural languages arises largely from three areas: the range of
words available in the language (its lexicon), the grammar of the language (its syntax)
and the meaning of sentences within the language (its semantics).
The Oxford English Dictionary lists more than half a million words, but the English
language is generally reckoned to have a lexicon of around a million words, although
this is difficult to establish precisely because of the number of related words, e.g. are
‘climb’, ‘climbing’ and ‘climber’ three different words, or one word with several
suffixes?
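The ‘climb/climbing/climber’ question can be made concrete with a crude suffix-stripping sketch. This is a toy stemmer, far simpler than real morphological analysers, and the suffix list is an illustrative assumption:

```python
# A toy stemmer: strip a few common English suffixes to find a shared stem.
# Real morphology is far messier (consider 'running' -> 'run', 'flies' -> 'fly').
SUFFIXES = ["ing", "er", "ed", "s"]

def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Only strip if a plausible stem (3+ letters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["climb", "climbing", "climber"]
print({crude_stem(w) for w in words})  # all three reduce to {'climb'}
```

Whether the three surface forms count as one lexical entry or three then becomes a question about where stemming like this is applied.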
This lexicon isn’t fixed. We keep adding to it by borrowing words from other
languages. You can find an interesting list of these at:
http://www.krysstal.com/borrow.html
New words (or neologisms) are also coined continuously, often in technical areas.
Recent additions to the English language include ‘cyberspace’ and ‘gameboy’.
A further level of complexity is added by the syntax of natural languages. The rules
by which words may be combined to produce phrases and sentences are highly
complex and allow for the generation of an infinite number of possible sentences.
Some words can belong to more than one grammatical category (e.g. ‘flying’ can be
an adjective or a verb), adding to the number of possible meanings of sentences.
Another level of complexity comes from the meaning, or semantics, of words and
phrases. Many English words have more than one meaning, and they keep gaining
new ones, e.g. the use of ‘house’ or ‘garage’ to describe types of music.
1.2 Goals for the Development of NLP
NLP has two primary goals. The technological goal is to build intelligent computer
systems, such as natural language interfaces to databases, machine-translation
systems, text analysis systems, speech understanding systems, or computer-aided
instruction systems. This goal cannot be achieved without using sophisticated theories
of the type being developed by theoretical linguists.
The linguistic, or cognitive science, goal is to gain a better understanding of how
humans communicate by using natural language. This second motivation is not unique
to NLP, but is shared with theoretical linguistics and psycholinguistics. The current
state of knowledge about natural language processing is so preliminary that it is not
yet feasible to build a complete model of human communication – this would require
major advances in both NLP and the experimental techniques used by
psycholinguistics.
1.3 Potential Application Areas for NLP
1.3.1 Machine Translation
Machine Translation (MT) can be defined as the use of computer systems to
translate text from one natural language to another, with or without human assistance.
An old joke tells of the scientist who devised a machine to translate English into
Russian and vice versa. To test his machine he input the proverb ‘Out of sight, out of
mind’. The machine translated it into Russian and translated the Russian back into
English. The final output was ‘Blind lunatic’.
Although there is an element of truth here, the joke underestimates the importance
which machine translation has now acquired. There are many types of MT system,
ranging from MAHT (Machine Aided Human Translation) where the emphasis is on
the human element, through to FAHQMT (Fully Automated Human Quality Machine
Translation).
There are three broad categories of MT strategy:
• Direct systems: these carry out a word-for-word translation between source and target language with no intermediate representation. One major limitation of this type of system is poor handling of long-distance dependencies.
• Transfer systems: language-dependent intermediate representations are used between each language pair to be translated between (e.g. English-to-French, French-to-English etc.). Thus, information derived in the analysis stage may provide input to the synthesis stage.
• Interlinguas: the target text is generated from an intermediate language-neutral representation, itself built up from the source text. The analysis and synthesis stages are completely separate.
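A direct system can be sketched in a few lines (the English-to-French glossary below is a tiny illustrative stand-in, not a real bilingual dictionary). The word-for-word output also shows why such systems struggle as soon as word order differs between the languages:

```python
# A minimal 'direct' MT sketch: word-for-word lookup, no analysis stage.
# The glossary is a tiny illustrative stand-in for a bilingual dictionary.
GLOSSARY = {
    "the": "le", "cat": "chat", "black": "noir", "sleeps": "dort",
}

def direct_translate(sentence: str) -> str:
    # Unknown words are passed through unchanged, as many early systems did.
    return " ".join(GLOSSARY.get(w, w) for w in sentence.lower().split())

print(direct_translate("the black cat sleeps"))
# -> 'le noir chat dort': the adjective order is wrong ('le chat noir'
# is correct French), because a direct system never reorders words.
```

Transfer and interlingua systems exist precisely to repair this kind of failure by analysing structure before generating the target text.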
MT has been a major research area since the 1950s. However, MT systems are still far
short of the overall quality achieved by human translators, although they can offer
major benefits in terms of cost and management.
There are three standard ways of improving the quality of output from MT systems:
• Text mark-up facilitates easier translation through the addition of helpful information to the source text.
• Controlled languages are easy-to-translate, general-purpose languages, based on a simple grammar and vocabulary.
• Sublanguages focus on a particular field, allowing more complex grammar and vocabulary within a small range of documents.
The use of sublanguages allows restrictions to be placed on the range of text for
translation and improves the MT output quality without any significant increase in
processing demand. For many types of subject-specific knowledge there will be an
associated sublanguage, e.g. weather forecasting, software manuals.
The restrictions in syntax complexity and size of vocabulary offer many benefits
towards MT system design, including simpler analysis and synthesis modules, a
smaller lexicon, and the avoidance of difficult constructs such as idioms.
In general, system complexity is reduced. However, these benefits should be seen as a
trade-off with the ability of the system to act in a general-purpose way and handle
novel constructs. It may also be difficult to reuse components within other
sublanguage applications.
Sublanguages have provided some major successes. One example is the Météo
system, which is possibly the most successful MT system to date and has been
translating weather reports from English to French for the Canadian Office of
Meteorology for nearly two decades. Météo is based on the specialised sublanguage
of weather forecasts, uses a simple set of temporal dimensions and avoids idioms.
Researchers at the University of Montreal realised that Météo's success was due to the
restricted nature of the texts it worked on and looked at eliminating the input text
altogether in favour of data gathered directly from weather stations. This approach led
to the development of a system that produces parallel English and French weather
bulletins for the Canadian eastern seaboard. The planning of what will be said and in
what order is done once for both languages. It is only towards the end that the
processes diverge. The same approach is now being taken with reports on labour
statistics.
Over the last few years, several web-sites have started offering automatic translations
of short documents, such as web pages. Take a look at the following:
http://babelfish.altavista.com/translate.dyn
http://www.freetranslation.com/
If you want to find out more about machine translation there is an entertaining and
easy-to-read book about it available on the World Wide Web at:
http://clwww.essex.ac.uk/~doug/book/book.html
1.3.2 Text Retrieval and Question Answering
Text retrieval is the process of matching a user query against free-text records, such
as bibliographic records, newspaper articles or sections of a manual. Queries can
range from multi-sentence descriptions of the information required to a few words.
Text retrieval systems currently in use range from simple Boolean systems through to
systems making extensive use of NLP.
The recent growth of the Internet, and especially the World Wide Web, has led to new
search requirements from users who want effective and user-friendly searching
systems. At the same time, computer hardware has become capable of running
complex searches against massive amounts of data with acceptable response times.
This combination of factors has produced a demand for more effective search
methodologies, making greater use of NLP techniques.
Reasons for using NLP in text retrieval are mostly intuitive: users normally decide on
the relevance of documents by reading and analysing them, so if we can automate
document analysis this should help in the process of deciding on document relevance.
However, the use of NLP techniques has not yet significantly improved performance
in text retrieval.
Most researchers believe that it is easier to improve the effectiveness of text retrieval
by means of statistical methods than by NLP-based approaches. Only a small
proportion of current research is based on NLP techniques, but NLP resources like
thesauri, lexicons, dictionaries and proper name databases, are used regularly. It
seems that NLP resources are having more of an impact than NLP techniques on text
retrieval at present. One reason for this is that NLP techniques are not generally
designed to handle large amounts of text from different subject areas.
There is, however, an inherent mismatch between the statistical techniques used in
text retrieval and the linguistic techniques used in NLP. The statistical techniques
attempt to match the rough statistical approximation of a record to a query. Further
refinement of this process using NLP techniques often adds only noise to the
matching process, or fails because of the inconsistencies of language use.
The proper integration of these two techniques is difficult. What we really need are
NLP techniques designed specifically for text retrieval along with text retrieval
techniques developed specifically for taking advantage of NLP techniques.
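The statistical side of this trade-off can be sketched with a minimal bag-of-words matcher. This ranks documents purely by shared-term counts, ignoring all linguistic structure; real systems add weighting schemes such as tf-idf, and the documents below are invented examples:

```python
from collections import Counter

# Three invented example documents for illustration.
DOCUMENTS = {
    "doc1": "machine translation of weather reports",
    "doc2": "statistical methods for text retrieval",
    "doc3": "grammar and parsing of natural language",
}

def score(query: str, text: str) -> int:
    """Count terms shared between query and document (a crude match score)."""
    q = Counter(query.lower().split())
    d = Counter(text.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def rank(query: str):
    """Return document names ordered from best match to worst."""
    return sorted(DOCUMENTS, reverse=True,
                  key=lambda name: score(query, DOCUMENTS[name]))

print(rank("statistical text retrieval"))  # doc2 ranks first
```

Nothing here knows that ‘retrieval’ and ‘searching’ mean similar things, which is exactly the gap NLP resources such as thesauri are used to fill.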
1.3.3 Command and Control
Command and control systems involve interaction with devices, often via speech
recognition. They are widely used in defence-related areas but also provide useful
aids for the handicapped and speech driven applications such as word processing. A
number of significant products have appeared on the market in recent years. Some of
them are described briefly below:
Dragon NaturallySpeaking
Dragon NaturallySpeaking Standard lets you communicate with your PC by speaking.
You can write reports, letters, and e-mails in virtually any Windows application with
your voice. Specialist versions are available for the legal and medical professions.
Product features include:
• revise, edit and format text: with select-and-say editing features.
• manage e-mail by voice: create and send e-mail by voice; listen to your messages read aloud.
• customise vocabulary: with names and terms you use.
• quick correct list: proofread and make changes to your work as you go.
• intuitive commands: say commands that make sense to you in Word 97/2000.
• switch between applications: launch programs by voice and switch between applications just by saying so.
You can get more information about Dragon NaturallySpeaking at:
http://www.dragonsys.com/
and you can read a complete online book about Dragon NaturallySpeaking at:
http://www.sayican.com/lib/sayican/onlinebook.html
IBM ViaVoice
IBM's ViaVoice for Windows allows you to dictate, format and edit text directly in
popular word processing applications like Microsoft Word 97 and 2000. Or dictate
into ViaVoice SpeakPad and transfer your words into other Windows applications -
such as e-mail - with a single voice command. Personalised Attention Words help
dictation and voice commands work more smoothly, and the Text-To-Speech feature
reads back what you've already dictated for quick, easy editing and correction of
spelling and grammar. Product features include the following:
• Direct Dictation in Microsoft Word 97 and 2000: now you can speak, format and edit directly into this popular application.
• Natural Commands in Microsoft Word 97 and 2000: can make working with Word easier as you use everyday language rather than 'computer speak' to command functions within this popular word processor.
• SpeakPad: the simplest way to dictate, format and edit text, directly into the ViaVoice speech-enabled word processor.
• Improved accuracy and correction: greater accuracy and quicker correction means your letters, e-mail and homework can be finished faster and more easily and with greater product satisfaction.
• Text-To-Speech: editing made easy. Listen as ViaVoice reads back the text you have dictated and formatted so you know just what it is you've dictated.
• Attention Words: choose whatever words you like to ensure ViaVoice knows when you are issuing a command rather than dictating text.
• 60,000-word vocabulary: the greater the vocabulary, the greater the accuracy as ViaVoice recognises the words you use.
• Teach ViaVoice new words: greater accuracy when you use special words, terms, nicknames, acronyms, pet phrases, addresses and idioms; ViaVoice can learn to recognise them, and add them to the vocabulary.
• Command the Web: your voice is the easy and natural way to quickly navigate basic Web commands in selected browsers.
You can get further information about ViaVoice from IBM's ViaVoice web site at:
http://www-4.ibm.com/software/speech/desktop/w8-win.html
Game Commander
This program is a bit different, as it's designed to provide voice control for computer
games. The original Game Commander won numerous awards for bringing voice
control to games. Game Commander 2 breaks new ground with lightning fast
command response and even more control over your games. You can even run it
concurrently with popular voice chat programs. It can take command of Windows
applications too.
• Voice commands with no training: put the power of speaker-independent voice control to work immediately without tedious voice training.
• Customisable audible feedback: assign your own sounds and recorded speech to hear your commands being acknowledged and enhance the gaming experience.
• Global commands: common commands are available across all applications.
• Automatic command file loading: the right commands are always ready as soon as you need them. No need to fuss with files while you work and play.
• Powerful command editing: the Game Commander Studio gives you full access to all your commands and supports cut, copy, and paste operations to make editing easy.
• Multi-channel auto fire: say a command and have it repeated until you tell it to stop. Issue more commands while auto fire runs, including more auto fire commands!
• Massive macro capabilities: unleash up to 256 keystrokes per voice command.
• Easy keystroke entry: just press the key as you would in the game. Many special Windows keys and combinations are also supported.
• Adjustable actions: fine-tune any keystroke or action for maximum control.
• Extended actions: configurable delay, key up, and key down actions, and step sequencing add more control capabilities than ever before.
• Works with many voice chat programs: use push-to-talk to switch between Game Commander and popular voice chat programs (Windows 9x and Me only) or use push-to-talk alone to enable command recognition only when you need it.
• Voice training: for special cases, strong accents, or non-English commands, voice training takes only three utterances, not ‘War and Peace’.
You can get more information about Game Commander from:
http://www.gamecommander.com
The Game Commander 2 documentation is available for download in Adobe Acrobat
format from:
http://www.gamecommander.com/misc/
1.3.4 Text Analysis
We might want to use computers to analyse texts for a variety of reasons, e.g. to
determine the readability of a piece of text or its suitability for readers of a specified
level; to determine the authorship of a text, etc.
Readability Analysis
A number of word processing programs, e.g. Word and Word Perfect, now have
grammar checkers built in to them and other grammar and style checkers (such as
Correct Grammar and Grammatik) are available separately. These programs can be
useful for some kinds of textual analysis by providing statistics revealing average
sentence length, average number of syllables per word, percentage of sentences in
passive voice, and a readability index such as the Flesch-Kincaid Grade Level or the
Gunning Fog Index.
Microsoft Word can be used to provide readability statistics for documents. Click
Tools on the menu bar, then choose Options, followed by Spelling and Grammar,
then check the ‘readability statistics’ box under Grammar.
The Flesch-Kincaid grade level arose from a book entitled ‘The Art of Readable Writing’, published by Dr. Rudolf Flesch in 1949, in which he described a simple method of analysing readability. He analysed text samples of about 100 words,
assigning each sample a readability index based upon the average number of syllables
per word and the average number of words per sentence. Most scores range from 0 to
100. College graduates should be able to follow prose in the 0 - 30 range. Scores of 50
- 60 are high-school level and 90 - 100 should be readable by fourth graders, i.e.
children who have completed four years of primary education.
The General Motors Corporation automated Flesch’s algorithm in the early 1970s. The program, called GM-STAR (General Motors Simple Test Approach for Readability), was used to ensure that workshop manuals were readable. The key to this program is a very simple algorithm for counting the number of syllables in a word.
The Flesch Index (F) for a given text sample is calculated from three statistics:
• The total number of sentences (N)
• The total number of words (W)
• The total number of syllables (L)
Though crude, since it is designed simply to reward short words and sentences, the
index is useful. It gives a basic, objective idea of how hard prose is to wade through.
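The text does not state the formula itself; the standard Flesch Reading Ease calculation combines the three statistics as F = 206.835 - 1.015(W/N) - 84.6(L/W). A minimal sketch in Python, with illustrative sample figures:

```python
def flesch_index(n, w, l):
    """Flesch Reading Ease score from the three statistics:
    n sentences, w words, l syllables. Higher scores mean easier prose."""
    return 206.835 - 1.015 * (w / n) - 84.6 * (l / w)

# An illustrative 100-word sample: 5 sentences, 130 syllables.
print(flesch_index(5, 100, 130))  # roughly 76.6: plain, readable prose
```

On this scale, scores of 0 - 30 correspond to graduate-level prose and 90 - 100 to text readable by fourth graders, matching the ranges quoted above.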
Fog Indexes are also used to provide a broad estimate of the grade level (number of
years of education) required to understand written material. (Some writers suggest
that ‘fog’ is an acronym for ‘frequency of gobbledygook’.) Even if grade level
standards change, an index can still be used to estimate the relative difficulty of the
material.
There are several fog indexes, but the one developed by Robert Gunning in
‘Technique of Clear Writing’ is one of the simplest and most effective.
A simplified version of Gunning's procedure is as follows:
a. Pick any 100-word segment of text.
b. Count the sentences in the segment (a fragment of a sentence at beginning or
end counts as a whole sentence).
c. Divide 100 by the number of sentences (= average sentence length).
d. Count the words in the segment with 3 or more syllables.
e. Add the two numbers (average sentence length + number of 3-or-more-syllable words).
f. Multiply the sum by 0.4.
g. Round off the result to the nearest whole number. This is the GFI.
The GFI is not an absolute measure, e.g. reader familiarity with the subject matter is
also important in determining readability.
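Steps a - g above can be sketched in Python. The syllable counter here is an assumption, a crude vowel-group heuristic of the kind GM-STAR used, since the text does not specify one:

```python
def count_syllables(word):
    """Crude heuristic: each run of consecutive vowels is one syllable."""
    vowels = "aeiouy"
    count, prev = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            count += 1
        prev = is_vowel
    return max(count, 1)

def gunning_fog(segment):
    """Simplified Gunning Fog Index for a (roughly 100-word) segment."""
    words = segment.split()
    sentences = max(sum(segment.count(p) for p in ".!?"), 1)
    avg_sentence_length = len(words) / sentences
    hard_words = sum(1 for w in words
                     if count_syllables(w.strip(".,;:!?")) >= 3)
    return round(0.4 * (avg_sentence_length + hard_words))
```

For example, a segment of short, simple sentences yields a GFI of around 2, i.e. readable after about two years of schooling.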
Computer-Based Text Analysis
The first tool of computer-based text analysis was the ‘Concordance’ - a tool that was
transformed by the computer, but originated in methods that go back to the middle
ages. The concordance grew out of medieval biblical scholarship that tried to find
parallels between the Old and New Testaments by finding places where the words in
the text of the Old Testament foreshadowed a passage in the New. Scholars became
aware that it would be useful to group words into categories, and develop indexes that
pointed to all (or at least many) occurrences of those words in the different books of
the Bible. Thus was started the early thematic concordance, which named the major
people, places, things and ideas that appeared in the Bible.
In the late 1940s Roberto Busa, an Italian philosopher, decided to produce a concordance of the complete writings of the medieval scholar Thomas Aquinas. Although computers were virtually unknown in Italy at the time, it was clear that the task required some kind of machinery. The work was begun with punched cards and card sorting machines and was completed (33 years later) in the 1970s, using
large IBM mainframe computers with computer-driven typesetting equipment. With
various indexes and other associated information, the Index consists of about 70,000
typeset pages.
There were two full concordances. One, produced directly by the computer, was a
complete list of the occurrences of all word forms. This type of concordance, called
unlemmatised, lists all word forms under separate entries. Busa's concordance also included a lemmatised concordance, where the headwords are standardised as they might appear in a dictionary - different forms of each verb or noun are gathered under a single entry.
For the lemmatised concordance, the computer could not automatically bring all related forms together on its own. Thus, the lemmatised version was a machine-assisted concordance, requiring significant human interaction.
Early on the KWIC (Keyword in Context) concordance format was developed. The
entire vocabulary of the work is listed in alphabetical order. Each word form, called a
headword, is followed by its occurrences. Each occurrence, in turn, is given on a
separate line consisting, first, of some ‘reference information’ that helps the KWIC
user locate the occurrence in the full text, and then by a brief excerpt that shows the
word in its context - hence the name.
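A KWIC index of this kind is easy to sketch in Python. Here the ‘reference information’ is simply the word's position in the text - an assumption for illustration, since real concordances use book, chapter and line references:

```python
def kwic(text, width=3):
    """Build a KWIC concordance: each headword is mapped to its
    occurrences, given as (position, context excerpt) pairs."""
    words = text.lower().split()
    index = {}
    for i, w in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        index.setdefault(w, []).append((i, f"{left} [{w}] {right}".strip()))
    return dict(sorted(index.items()))  # headwords in alphabetical order

for head, occurrences in kwic("to be or not to be").items():
    print(head, occurrences)
```

Each headword appears once, in alphabetical order, followed by one line per occurrence - the KWIC format described above.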
Most of the early computer work in text analysis was in the production of
unlemmatised concordances. Obviously, if any work was to be done with a computer
on a text, the text itself had to be in ‘machine-readable form’. By the end of the 1970s
it was clear that more texts needed to be available in electronic form and standard
software was required.
Oxford University became an early leader in both these areas with the establishment of the Oxford Text Archive, a repository for electronic texts, and the Oxford Concordance Program (OCP), a mainframe-based text analysis system. Both OCP
and the Archive are still important today. Nowadays, a significant part of the
Archive's holdings are available via the Internet. OCP has been overshadowed by
other computing developments, but the software is still in use.
Alastair McKinnon, a philosopher at McGill University, began working in the 1960s
with the goal of publishing a complete printed concordance of the published writings
of the Danish philosopher Soren Kierkegaard (1.9 million words) - based not only on
the Danish text, but also on the standard German, English and French translations.
After the concordance was published he continued to experiment, and by the early
1980s he realised the benefits of having access to a large and significant text in
electronic form, and having computer tools that could answer sophisticated queries.
When his work became known, he published the electronic form of the Kierkegaard
corpus, along with his collection of computer programs (called Textmap) used to
manipulate it.
The appearance of the personal computer in the early 1980s led to further advances in
text analysis. Word processing software was probably the most important
development, but more specialised software was also written. The first widely known
software package was the Brigham Young Concordance program (BYC) – later sold
as WordCruncher. Another widely used package is TACT, developed at the
University of Toronto.
In addition to allowing scholars to search for words or phrases throughout a text, the availability of electronic texts has made possible increased use of statistical methods - often a particular collection of methods called Multivariate Analysis (MVA). Such
methods promise to allow the computer to do more than just find words. Gerard
Ledger used statistical methods to work out the chronology of Plato's dialogues. He
ignored traditional critical methods and used multivariate analysis techniques to
identify which dialogues were likely to have been written earlier or later.
However, for both the literary scholar and the average reader, working with the ideas
found in a text is much more interesting than working with the words themselves.
Making the computer move from identifying individual words to the ideas that these
words represent is difficult. For this reason there is still scepticism about the use of
computers in text analysis and a corresponding need to improve on the existing tools.
Text Analysis Software
TextQuest is a well-known program for text analysis. Full details can be obtained
from:
http://www.textquest.de/tqe.htm
A demo version can also be downloaded from this site. TextQuest can perform the following operations:
• list of words, sorted by alphabet or by frequency, ascending or descending, also with exclusion lists (STOP-words)
• list of word sequences
• list of word permutations
• KWICs - key word in context with variable line length
• SITs - search patterns in text unit
• content analysis with powerful features like interactive coding, control files, and negation detection
• control of multiple search patterns
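The first of these operations - a word list sorted by frequency, with an exclusion list of STOP-words - can be sketched in a few lines of Python (the punctuation handling and the stop-word list are simplifying assumptions, not TextQuest's actual behaviour):

```python
from collections import Counter

def word_list(text, stop_words=frozenset()):
    """Frequency-sorted word list, excluding any STOP-words."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    counts = Counter(w for w in words if w and w not in stop_words)
    return counts.most_common()  # (word, frequency) pairs, descending

print(word_list("The cat saw the cat and the dog.",
                stop_words={"the", "and"}))
```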
1.3.5 Scanning Newspaper Stories
NLP techniques have been successfully applied in sorting text into fixed topic
categories. One aspect of this is scanning newspaper stories. There are several commercial services which provide subscribers with access to news stories on specified topics, e.g. news on a particular industry, company or geographical area.
Categorisation has traditionally been carried out by human experts, but in recent years
NLP software has been shown to be just as effective, categorising 90% of stories
correctly. This may seem surprising, considering the lack of success in using NLP
techniques in information retrieval. However, in this case, the categories are fixed, so
researchers can spend their time addressing other problems.
1.3.6 Intelligent Tutoring Systems
Computers have been used in education since the late 1970s. The earliest systems
were known as Computer-Based Training (CBT) systems. These systems had a
major drawback - instruction was not tailored to the learner's needs. Instead, the
decisions about how to progress through the material were programmed: ‘if question
19 has been answered correctly, proceed to question 56; otherwise go to question 35.’
The learner's abilities were not taken into account.
While CBT may be effective in helping learners, it does not provide the same kind of individual attention that a student would receive from a human tutor. For a computer-based system to provide such attention, it must reason about the subject matter and the learner. This has led to research into Intelligent Tutoring Systems (ITSs).
These offer considerable flexibility in presentation of material along with a greater
ability to respond to student needs. ITSs have been shown to be effective at increasing
student motivation and performance, e.g. students using Smithtown, an ITS for
economics, performed as well as students taking a traditional economics course, but
only spent half as much time covering the material.
You can download the Smithtown software from:
http://www.pitt.edu/~akatz/akatz.htm
If you want to see another example of an ITS, you can find one for Algebra at:
www.algebratutor.org
Many systems attempt to simulate a realistic working environment in which the
student can learn. One example is the Advanced Cardiac Life Support (ACLS)
Tutor in which a student takes the role of team leader in providing emergency life
support for heart attack patients. The system not only monitors student actions, but
runs a realistic simulation of the patient's condition and maintains an environment that
is reasonably faithful to the real life situation.
Some systems take a less rigorous approach to representing the environment. The
situations presented are similar to real world scenarios, but they are not exact
simulations. Smithtown takes this approach by providing a simulated setting for
students to test hypotheses about economics. However, the underlying model is not an
exact simulation of how the laws of economics would operate in the real world.
Systems tend to concentrate on teaching one type of knowledge. The most common
type of ITS teaches procedural skills; the goal is for students to learn how to perform
a particular task. Systems that are designed according to these principles are often
called cognitive tutors.
Other ITSs concentrate on teaching concepts and mental models. These systems encounter two main difficulties. First, a more substantial body of domain knowledge is needed. Second, since the learning of concepts and frameworks is less well understood than the learning of procedures, there is less cognitive theory to guide knowledge representation and the pedagogical module. ITSs of this type require a larger domain knowledge base and are sometimes referred to as knowledge-based tutors.
As a result of not having a strong model of skill acquisition or expert performance,
these systems are forced to use general teaching strategies. They also place more
emphasis on the communication and presentation system in order to achieve learning
gains. An example of such a system is the Pedagogical Explanation Generation
(PEG) system which uses a substantial domain knowledge base to construct answers
to student queries about electrical circuits.
Generally, tutors that teach procedural skills use a cognitive task analysis of expert
behaviour, while tutors that teach concepts and frameworks use a larger knowledge
base and place more emphasis on communication. There are exceptions to these rules,
but they are useful guidelines for classifying ITSs.
A good survey of Intelligent Tutoring Systems can be found at:
http://www.dis.port.ac.uk/~callear/CBT64203.htm
1.4 Relationship to Other Disciplines and Other Fields of AI
1.4.1 Other Disciplines
Computer Science
NLP is a sub-field of Artificial Intelligence, itself a field of Computer Science. As a result it uses many of the same techniques and procedures as mainstream Computer Science. These include programming, data structures and algorithms. One field which
is closely connected is compiler construction. This is hardly surprising, since
compiler construction deals with programming languages, which share many of the
characteristics of natural languages, although they are considerably less complex.
Compiler construction involves parsers, lexical analysers, syntactic analysers and
other tools used in NLP. It also makes extensive use of algorithms and data structures.
The traffic between NLP and Computer Science hasn’t all been one way. AI research, including NLP, has given Computer Science time sharing, interactive interpreters, the linked-list data type and some of the key concepts of object-oriented programming and graphical user interfaces.
Linguistics
The main goal of linguistics (or more correctly, theoretical linguistics) is to produce a
structural description of natural language. Linguists do not usually consider how
actual sentences are processed (parsed) or how sentences can be generated from
structural descriptions.
Linguistic theories should generally hold true across different languages, so linguists
tend to concentrate on the general principles that underlie all natural languages, and
devote less time and effort to examining any particular language. The aim of
linguistics is a formal specification of language structure, in the form of rules that
define the range of possible structures and the constraints on these.
Psychology
The area of psychology that deals with language is known as psycholinguistics. Like
computational linguists, psycholinguists are interested in how people produce and
understand natural language. In psychological terms, a linguistic theory is only useful if it explains actual behaviour.
Psycholinguists are interested in both the representation of linguistic structures and
the processes by which a person can produce such structures from actual sentences.
The primary tool of psycholinguistics is experimentation, i.e. actual measurements
made on people as they produce and understand language.
Areas studied include the time needed to read each word in a sentence or to decide
whether a given item is a valid word or not, the types of errors people make as they
perform various linguistic tasks, and so on. Experimental data is used to validate or
reject specific hypotheses about language. These hypotheses are often derived from
the theories of theoretical or computational linguists.
Language Teaching
For many years, foreign language teaching has been supplemented by the use of
language laboratories. These are rooms, often divided into booths, where students can
listen individually to recordings of foreign language material, and record and play
back their own responses, while being monitored by a teacher.
When language laboratories were introduced, they were hailed as a technique that
would greatly improve the rate and quality of foreign language learning by removing
the burden of repetitive drills from the teacher and providing students with more
opportunities to practice listening and speaking. Although many schools were quick to
install expensive language laboratory equipment, it became obvious within a few
years that there would be no major breakthrough. The expected improvements were
not realized and the popularity of the language laboratory began to wane.
There were several reasons for this failure. Recorded materials were often poorly
designed, leading to frustration and boredom. Materials were not matched to the other
work students were doing in class and few teachers were properly trained in materials
design or laboratory use. Nowadays we have a better appreciation of the strengths
and limitations of the language lab. We also have access to better hardware and
software. Modern language laboratories which make use of interactive multimedia
have proven to be extremely effective.
When used properly, language laboratories can provide a valuable extra dimension.
Recorded material can supply a variety of authentic and well-recorded models for
improving listening comprehension. Laboratories can be used as libraries of material, giving learners extra opportunities to practise at an appropriate level.
However, the limitations of language laboratories must always be kept in mind. Their
value depends on the development of suitable teaching materials which reinforce what
has been taught in class and provide opportunities for creative use. As in many other
areas, developments in software have failed to keep pace with developments in
hardware.
Modern language labs incorporate Computer-Assisted Language Learning (CALL)
workstations. PCs complement the audio and video facilities, enabling interactive
teaching of written language skills. Several kinds of exercise, such as sentence
restructuring, checking of translation or dictation tasks, and cloze testing can be
controlled by the computer, using texts displayed on the screen. Increasingly clever
interactive games are available. In Storyboard, for example, learners are given a
passage of blanks; they have to ‘buy’ words and complete the passage before their
supply of money runs out.
The use of computers in CALL offers many benefits to students. Computers can offer
real advantages in composing text, and on-line help such as dictionary support can be
very useful. Whilst simple technologies like these are practical and useful to students,
more ambitious assistance, such as providing support through video, has limitations
due to bandwidth requirements. These difficulties will disappear as bandwidth
increases. A history of CALL and a summary of the current state of the field can be
found at:
http://www.gse.uci.edu/markw/call.html
1.4.2 Other Fields of AI
Robotics
Robotics and AI are often seen as totally distinct fields, with Robotics deriving from
Mechanical Engineering and AI from Computer Science. However the two are closely
related and Robotics can be seen to some extent as the physical implementation of AI
principles.
A robot is defined as ‘a reprogrammable, multifunctional manipulator designed to
move material, parts, tools, or specialized devices through various programmed
motions for the performance of a variety of tasks’ (Robot Institute of America, 1979).
The word ‘robot’ was coined by the Czech playwright Karel Capek from the Czech
word for forced labour or serf. He used it in his 1921 play R.U.R. (Rossum's
Universal Robots) which was a huge success throughout Europe. Oddly enough, the
robots in the play were not mechanical in nature but were created through chemical
means.
The term 'robotics' refers to the study and use of robots. It was coined by scientist and
writer Isaac Asimov (1920 - 1992), best known for his many works of science fiction.
He first used the word ‘robotics’ in ‘Runaround’, a short story published in 1942. ‘I,
Robot’, a collection of several short stories about robots, was published in 1950.
Asimov proposed the three ‘Laws of Robotics’ and later added a 'zeroth law'.
• Law Zero: A robot may not injure humanity, or, through inaction, allow humanity to come to harm.
• Law One: A robot may not injure a human being, or, through inaction, allow a human being to come to harm, unless this would violate a higher order law.
• Law Two: A robot must obey orders given it by human beings, except where such orders would conflict with a higher order law.
• Law Three: A robot must protect its own existence as long as such protection does not conflict with a higher order law.
Within the research community the first robots were probably Grey Walter's Machina Speculatrix (1940s) and the Johns Hopkins Beast. Remote-controlled devices had been built even earlier, with at least the first radio-controlled vehicles built by Nikola Tesla in the 1890s. Tesla is better known as the inventor of the induction motor, AC power
transmission, and numerous other electrical devices.
The first modern industrial robots were the Unimates developed by Devol and Engelberger in the late 50s and early 60s. The first patents were by Devol for parts transfer machines. Engelberger formed Unimation and was the first to market robots. As a result, he has been called the ‘father of robotics’. Modern industrial arms have
increased in capability and performance through controller and language
development, improved mechanisms, sensing and drive systems.
The robot industry grew rapidly in the 1980s, primarily due to large investments by
the automotive industry. However, the quick leap into the factory of the future turned
into a plunge when the integration and economic viability of these efforts proved
disastrous.
One of the main applications of AI is in the area of robot control. By using evolving
control architectures, the robot can 'learn' the best way to do a task. Designers can use
neural networks and genetic algorithms to enable the robot to cope with complicated
tasks, such as navigation in a complex environment. Another area is image, sound and pattern recognition - three capabilities that any anthropomorphic robot would need. Again, neural networks could be used to analyse data from the robot's optical or audio devices.
Robotics is in many respects mechanical AI. It is also a lot more complicated, since the data the robot receives is real-time, real-world data - far more complex than the data most software-based AI programs have to deal with. On top of the more complicated programming required, algorithms are needed to drive motors and respond to sensors.
Some researchers believe that the field of robotics is where AI is ultimately heading - that most AI research is intended one day to become part of a robot.
Machine Vision
Many robotics applications require machine vision. Machine vision replaces human
vision with video cameras and specialised computers, and can improve on human
vision where precise and repeatable visual measurements and inspections are required.
Machine vision is primarily used for guiding robot movement for automated assembly
and for automated quality control. Machine vision poses a number of complex
technical problems, including edge detection, depth perception and dealing with
shadows.
In automated assembly, components are selected and placed on an assembly by the
robot. If the robot lacks vision, the components and assembly must be precisely
positioned so that the robot can locate them. This requires expensive fixtures. A robot
with vision can use cheaper and more general fixtures and can be taught to find and
place the components on the assembly. Visual guidance can compensate for some
variations in the components and the assembly and therefore carry out tasks that are
impossible with blind placement. The additional cost of a vision system can be
recovered from improvements in manufacturing flexibility and quality.
One example of automated assembly that requires machine vision is the placement of
surface mount components on printed circuit boards. The vision system determines
the precise location of the printed circuit board using reference marks on the circuit
board. The robot picks up each component and holds it in front of the camera. The
vision system verifies that the component's leads are correctly positioned and
determines the component's precise location. From this information the robot arm
places the component precisely on the printed circuit board.
Machine vision systems are replacing human vision for quality control inspection of
manufactured objects. These inspections are often too fast or precise for human
vision, and the demand for quality requires repeatable visual inspection on each
object. This can be accomplished with a high-speed machine vision system that is integrated into the flow of the manufacturing process, either as part of robot assembly or at a station specifically designed for inspection.
Planning and Searching
In AI, searching usually crops up in the context of problem solving. One simple
example is the missionaries and cannibals problem, usually stated as follows: ‘three
missionaries and three cannibals are on one side of a river, along with a boat that can
hold one or two people. Find a way to get everyone to the other side, without ever leaving a group of missionaries in one place outnumbered by the cannibals in that place’.
A problem consists of four parts: an initial state, a set of operators, a goal test
function, and a path cost function. The environment of the problem is represented by
a state space. A path through the state space from the initial state to a goal state is a
solution.
The first step is to decide what the right operator set is. We know that the operators
will involve taking one or two people across the river in the boat, but we have to
decide if we need a state to represent the time when they are in the boat, or just when
they get to the other side. Because the boat holds only two people, no ‘outnumbering’
can occur in it; hence, only the end points of the crossing are important.
For the purposes of the solution, when it comes time for a cannibal to get into the
boat, it does not matter which one it is. Any permutation of the three missionaries or
the three cannibals leads to the same outcome.
These considerations lead to the following formal definition of the problem:
• States: a state consists of an ordered sequence of three numbers representing the number of missionaries, cannibals and boats on the bank of the river from which they started. Thus, the start state is (3,3,1).
• Operators: from each state the possible operators are to take either one missionary, one cannibal, two missionaries, two cannibals, or one of each across in the boat. There are at most five operators, although most states have fewer because it is necessary to avoid illegal states.
• Goal test: reached state (0,0,0).
• Path cost: number of crossings.
This state space is small enough to make it a trivial problem for a computer to solve.
People find it more difficult because some of the moves involve backtracking and we
react intuitively against this.
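The formal definition above translates directly into a short program. The following Python sketch (the function names and data structures are our own choices, not part of the original problem statement) uses breadth-first search over (missionaries, cannibals, boat) states to find a shortest solution:

```python
from collections import deque

def safe(m, c):
    # A bank is safe if it has no missionaries, or at least as many
    # missionaries as cannibals; (3 - m, 3 - c) is the far bank.
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def solve():
    # State: (missionaries, cannibals, boat) on the starting bank.
    start, goal = (3, 3, 1), (0, 0, 0)
    moves = [(1, 0), (0, 1), (2, 0), (0, 2), (1, 1)]  # the five operators
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        m, c, b = path[-1]
        if path[-1] == goal:
            return path
        for dm, dc in moves:
            # The boat carries people away from whichever bank it is on.
            if b == 1:
                nm, nc, nb = m - dm, c - dc, 0
            else:
                nm, nc, nb = m + dm, c + dc, 1
            state = (nm, nc, nb)
            if 0 <= nm <= 3 and 0 <= nc <= 3 and safe(nm, nc) and state not in visited:
                visited.add(state)
                frontier.append(path + [state])
    return None

solution = solve()
print(len(solution) - 1)  # path cost: the number of crossings
```

Because breadth-first search expands states in order of path length, the first solution found is guaranteed to use the fewest crossings (eleven, for this puzzle).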
This is an artificial problem, but similar techniques can be applied to solving a
number of real-world problems.
Route finding
Route finding is defined in terms of specified locations and transitions along links
between them. Route-finding algorithms are used in a variety of applications, such as
routing in computer networks, automated travel advisory systems, and airline travel
planning systems.
The travelling salesman problem
The travelling salesman problem, or TSP for short, is this: given a finite number of
cities, along with the cost of travel between each pair of them, find the cheapest way
of visiting all the cities and returning to your starting point.
The aim is to find the shortest tour. An enormous amount of effort has been expended
to improve the capabilities of TSP algorithms. In addition to planning trips for
travelling salespersons, these algorithms have been used for tasks such as planning
movements of automatic circuit board drills.
You can find a detailed description of the different versions of the problem and the
attempts made to solve it at:
http://www.keck.caam.rice.edu/tsp/index.html
Robot navigation can be reduced to a variation of the travelling salesman problem.
VLSI layout
The design of silicon chips is one of the most complex engineering design tasks
currently undertaken. A typical VLSI chip can have as many as a million gates, and
the positioning and connections of every gate are crucial to its operation. Two of the
most difficult tasks are cell layout and channel routing.
In cell layout, the components of the circuit are grouped into cells, each of which
performs some recognised function. Each cell has a fixed footprint (size and shape)
and requires a certain number of connections to each of the other cells. The aim is to
place the cells on the chip so that they do not overlap and so that there is room for the
connecting wires to be placed between the cells.
Channel routing finds a specific route for each wire using the gaps between the cells.
These search problems are extremely complex, but definitely worth solving.
Assembly sequencing
Automatic assembly of complex objects by a robot was first demonstrated by a robot
called FREDDY in 1972. Progress since then has been slow but sure, to the point
where assembly of objects such as electric motors is economically feasible. In
assembly problems, the problem is to find an order in which to assemble the parts of
some object. If the wrong order is chosen, there will be no way to add some part later
in the sequence without undoing some of the work already done.
SEARCH ALGORITHMS
A single general search algorithm can be used to solve any problem; specific variants
of the algorithm embody different strategies. Search algorithms are judged on the
basis of:
• Completeness: is the strategy guaranteed to find a solution, if there is one?
• Time complexity: how long does it take to find a solution?
• Space complexity: how much memory is required to find a solution?
• Optimality: does the strategy find the best solution if there are several possible solutions?
Breadth-first search expands the shallowest node in the search tree first. It is
complete, optimal if all operators cost the same, but has high time and space
complexity. The space complexity makes it impractical in most complex cases.
Depth-first search expands the deepest node in the search tree first. It is neither
complete nor optimal, and has high time complexity and low space complexity. In
search trees of large or infinite depth, the time complexity makes it impractical.
(Depth-first and breadth-first strategies with respect to parsing are considered in
greater detail in Section 3.4 of these notes.)
Depth-limited search places a limit on how deep a depth-first search can go. If the
limit happens to be equal to the depth of the shallowest goal state, then time and space
complexity are minimized.
Iterative deepening search repeats a depth-limited search with increasing limits until
a goal is found. It is complete and optimal (when all operators cost the same),
combining the low space complexity of depth-first search with time complexity close
to that of breadth-first search.
Uniform-cost search expands the least-cost leaf node first. It is complete, and unlike
breadth-first search is optimal even when operators have differing costs. Its space and
time complexity are the same as for breadth-first search.
Bi-directional search can enormously reduce time complexity, but is not always
applicable. Its memory requirements may be impractical.
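The strategies above differ mainly in the order in which nodes are expanded. The sketch below (the small graph is an invented example) shows that breadth-first and depth-first search can share a single algorithm, differing only in whether the frontier is treated as a queue or a stack:

```python
from collections import deque

def search(start, goal, successors, depth_first=False):
    """Generic search: breadth-first expands the oldest path on the
    frontier (a queue); depth-first expands the newest (a stack)."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.pop() if depth_first else frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in successors(node):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# An invented route-finding graph: two routes from A to E.
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['E'], 'E': []}
bfs_path = search('A', 'E', lambda n: graph[n])
dfs_path = search('A', 'E', lambda n: graph[n], depth_first=True)
print(bfs_path, dfs_path)
```

Both strategies find a path here, but they explore the graph in different orders, which is exactly why their completeness, optimality and complexity properties differ.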
PLANNING
Planning and problem solving use different approaches to the representation of goals,
states and actions, and the representation and construction of action sequences. We’ll
now look at some of the difficulties encountered by search-based problem-solving
approaches, and the methods used by planning systems to overcome these.
Let us see how these factors affect the ability of a theoretical intelligent device (which
we’ll refer to as an ‘agent’) to solve the following simple problem: ‘Get a pint of milk,
a brown loaf and a lawnmower’. Treating this as a problem-solving exercise, we need
to specify the initial state: the agent is at home but without any of the desired objects,
and the operator set: all the things that the agent can do.
It is obvious that there are too many actions and too many states to consider. The
agent can only choose among states to decide which is closer to the goal; it cannot
eliminate actions from consideration. Even if we could get the agent into the
supermarket, the agent would then resort to a guessing game, by considering actions,
such as buying an orange, buying corn flakes, buying milk and ranking these as good
or bad. It then knows that buying milk is a good idea, but has no idea what to try next
and must start guessing again.
The fact that the problem-solving agent considers sequences of actions starting from
the initial state also contributes to its difficulties. It forces the agent to decide first
what to do in the initial state, where the relevant choices are essentially to go to any of
a number of other places. Until the agent has figured out how to obtain the various
items, by buying, borrowing or stealing etc., it can’t really decide where to go. It
needs a more flexible way of structuring its thoughts, so that it can work on whichever
part of the problem is most likely to be solvable given the current information.
The first key idea behind planning concerns the representation of states, goals, and actions.
Planning algorithms use descriptions in some formal language, usually first-order
logic. States and goals are represented by sets of sentences, and actions are
represented by logical descriptions of preconditions and effects. This enables the
planner to make direct connections between states and actions. For example, if the
agent knows that the goal includes Have(Milk), and that Buy(x) achieves Have(x),
then it knows that it is worthwhile to consider a plan that includes Buy(Milk). It need
not consider irrelevant actions such as Buy(ShoePolish) or GoToSleep.
The second key idea is that the planner is free to add actions to the plan wherever they
are needed, rather than always starting at the initial state. For example, the agent may
decide that it is going to have to Buy(Milk), even before it has decided where to buy
it, how to get there, or what to do afterwards. There is no necessary connection
between the order of planning and the order of execution. The representation of states
as sets of logical sentences plays a crucial role in making this freedom possible. For
example, when adding the action Buy(Milk) to the plan, the agent can represent the
state in which the action is executed as, say, At(Supermarket). Search algorithms that
require complete state descriptions do not have this option.
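These ideas can be sketched in code. In this hypothetical representation (the dictionaries and literal strings below are illustrative, not a standard planning library), an action lists its preconditions and effects, and the planner considers only actions whose effects overlap the goal:

```python
# A minimal STRIPS-style sketch: states and goals are sets of literal
# strings; actions carry preconditions and effects.

def buy(x):
    return {
        'name': f'Buy({x})',
        'preconditions': {'At(Supermarket)'},
        'effects': {f'Have({x})'},
    }

def go_to_sleep():
    return {'name': 'GoToSleep', 'preconditions': set(), 'effects': {'Rested'}}

def relevant(action, goal):
    # An action is worth considering only if it achieves part of the goal.
    return bool(action['effects'] & goal)

goal = {'Have(Milk)'}
actions = [buy('Milk'), buy('ShoePolish'), go_to_sleep()]
worthwhile = [a['name'] for a in actions if relevant(a, goal)]
print(worthwhile)
```

The relevance test is what lets the planner ignore Buy(ShoePolish) and GoToSleep without ever simulating them.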
The third key idea behind planning is that most parts of the world are independent of
most other parts. This makes it feasible to take a combined goal like ‘Get a pint of
milk, a brown loaf and a lawnmower’ and solve it with a divide-and-conquer strategy.
A sub-plan involving going to the supermarket can be used to achieve the first two
objectives, and another subplan (e.g., going to the hardware store or borrowing from a
neighbour) can be used to achieve the third. The supermarket subplan can be further
divided into a milk subplan and a bread subplan. We can then put all the subplans
together to solve the whole problem. This works because there is little interaction
between the two subplans: going to the supermarket does not interfere with borrowing
from a neighbour, and buying milk does not interfere with buying bread.
Divide-and-conquer algorithms are efficient because it is almost always easier to
solve several small sub-problems than one big problem. However, divide-and-conquer
fails in cases where the cost of combining the solutions to the sub-problems is high.
For tricky puzzles, planning techniques will not do any better than problem-solving
techniques. Fortunately, real-world sub-goals tend to be nearly independent. If this
were not the case, the sheer size of the real world would make problem solving
impossible.
Expert Systems
In recent years, expert systems have received a great deal of attention. Expert systems
are AI-based programs which have been used to solve a range of problems in a wide
variety of fields, including computer system design and medical diagnosis. An
expert system stores the knowledge of one or more human experts in a particular field.
The field is called a domain. The experts are called domain experts. A user asks the
expert system about a problem within the domain. The system applies its stored
knowledge to solve it.
A domain expert is a person who, because of training and experience, can do things
the rest of us cannot. Domain experts know a great many things and have tricks and
techniques for applying what they know to problems and tasks. They are also good at
discarding irrelevant information to get at basic issues, and at recognising new
problems as instances of types with which they are already familiar.
The part of the expert system that stores the knowledge is called the knowledge base.
The part that holds details of the problem to be solved is known as the global
database. The part that applies the knowledge to the problem is called the inference
engine.
Like most modern computer programs, expert systems usually have a friendly
user interface. This doesn't make the system work any better, but it does allow
inexperienced users to specify problems and understand the system's conclusions.
Expert systems are produced by knowledge engineers, who begin by reading
domain-related literature to become familiar with issues and terminology. When this
foundation is established, the knowledge engineer holds extensive interviews with one
or more domain experts to acquire their knowledge. The results of these interviews
are organised and translated into software that a computer can use.
The interviews take the most time and effort of any of these stages, and are often the
longest part of the system development.
The format used to capture the knowledge is called a knowledge representation. The
most popular knowledge representation is the production rule (also called the if-then
rule). Production rules reflect the ‘rules of thumb’ or heuristics that experts use in
their day-to-day work. A knowledge base consisting of rules is sometimes called a
rule base.
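A rule base and inference engine can be sketched very simply. The following Python fragment (the rules themselves are invented for illustration) implements forward chaining: it repeatedly fires any production rule whose conditions are all present in the global database, until no new facts can be added:

```python
# Each production rule is (conditions, conclusion): IF all conditions
# hold THEN add the conclusion. The rules below are invented examples.
rules = [
    ({'engine_wont_start', 'lights_dim'}, 'battery_flat'),
    ({'battery_flat'}, 'recharge_battery'),
]

def forward_chain(facts, rules):
    facts = set(facts)          # the global database of known facts
    changed = True
    while changed:              # keep firing rules until nothing new appears
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

derived = forward_chain({'engine_wont_start', 'lights_dim'}, rules)
print(sorted(derived))
```

Note that the second rule fires only because the first rule's conclusion has been added to the database, which is exactly how chains of inference are built from individual rules of thumb.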
Machine Learning
Machine learning is the study of algorithms that enable computers to improve their
performance and increase their knowledge base. Research in this area has taken place
since the mid-1950s. One early development was a program, written by Arthur
Samuel, which learned to play draughts well enough to beat skilled human players.
The study of machine learning has provided information about how adults learn concepts and rules and how children's abilities develop, thus contributing to
developmental and cognitive psychology. Some of these ideas concern language
acquisition, making machine learning relevant to linguistics. However, not all work in
machine learning is directly relevant to psychology, since artificial intelligence
researchers sometimes develop algorithms that bear little relation to human learning.
The most successful machine-learning programs produce expert systems that perform
even better than humans. Recently, machine learning has become a useful technology
for mining large databases for information of commercial and scientific importance.
There are six major approaches to machine learning: learning from examples,
artificial neural networks, genetic algorithms, explanation-based learning, evaluating
hypotheses, and analogical inference (case-based reasoning). The first three
approaches can all function in systems that have very little information and need to
learn from experience, without guidance from previously acquired knowledge. The
final three approaches involve learning in systems that already have extensive
knowledge bases.
LEARNING FROM EXAMPLES
The most active area of research in machine learning has been the learning of
concepts from examples. The input to such learning programs consists of descriptions
of examples, and the output consists of different kinds of representations that
generalise about those examples. These output representations can be new concepts,
new rules, or decision trees that provide a convenient means of classifying new examples.
One of the first projects on learning from examples was a program by Patrick Winston
that learned the concept of an arch. Given examples of arches and examples of things
that were not arches, the program's task was to produce a general description of arches
that would correctly classify additional examples of arches while excluding non-arches.
Some concept-learning algorithms start from specific descriptions that are expanded
to be more general, while others initially produce general descriptions that are then
modified to handle more specific information. For example, a program might learn
the concept of an arch by assuming that every arch is exactly like the first one it
encounters, and then gradually generalise as it receives more examples. Alternatively,
a program might be designed to make an immediate generalisation, such as that an
arch consists of two vertical blocks with an object on top, and then expand this to
encompass new examples.
Another kind of learning from examples produces general rules instead of concepts,
while a third method produces decision trees rather than rules or concepts. Learning
concepts, rules, and decision trees from examples all produce symbolic descriptions,
but a very different kind of output is produced by another approach to machine
learning using artificial neural networks.
Neural Networks
The inputs to a neural network learning program are a network consisting of a set of
nodes connected by excitatory and inhibitory links, along with a set of training
examples. The links represent the weightings of the connections, positive and
negative, between the nodes. Learning algorithms modify these weightings to improve
the performance of the network at some task - for example, predicting the weather or
classifying plants. The output of the learning algorithm is not a new representation
such as a concept or a rule, but rather a modification of the weightings of the
connections in the network.
The commonest learning algorithm in neural networks is called ‘backpropagation’,
which trains a network by adjusting the weights that connect the different nodes. The
network consists of input nodes representing features of examples (e.g. hot, humid,
windy days) and output nodes representing a conclusion (e.g. sunny or rainy).
The input and output nodes are linked by hidden nodes that have no initial
interpretation.
Random weights are assigned to the links between nodes, then the input nodes are
activated for features that the network is meant to learn about. Activation spreads
through the network to the hidden nodes and then to the output nodes.
Errors are determined by calculating the difference between the computed activation
of the output nodes and the expected activation, e.g. an input of hot and dry may
erroneously predict that the day will be rainy.
To enhance future performance, errors are propagated backwards down the links,
changing the weights in such a way that the errors are reduced. After sufficient
examples have been presented to the network, it should perform its task reliably.
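The training cycle just described can be sketched as follows. This is a minimal, illustrative backpropagation example (the 2-2-1 architecture, sigmoid units, learning rate and epoch count are our own choices): a tiny network learns the exclusive-or function, and the total error after training is lower than before:

```python
import math
import random

random.seed(0)
sig = lambda x: 1.0 / (1.0 + math.exp(-x))

# Random initial weights: w[i][j] links input i to hidden node j,
# v[j] links hidden node j to the single output; bh and bo are biases.
w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
bh = [random.uniform(-1, 1) for _ in range(2)]
v = [random.uniform(-1, 1) for _ in range(2)]
bo = random.uniform(-1, 1)

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    # Activation spreads from inputs through hidden nodes to the output.
    h = [sig(x[0] * w[0][j] + x[1] * w[1][j] + bh[j]) for j in range(2)]
    return h, sig(sum(v[j] * h[j] for j in range(2)) + bo)

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

before = total_error()
lr = 0.5
for _ in range(5000):
    for x, t in data:
        h, o = forward(x)
        # Errors are propagated backwards, adjusting weights to reduce them.
        do = (t - o) * o * (1 - o)
        dh = [do * v[j] * h[j] * (1 - h[j]) for j in range(2)]
        for j in range(2):
            v[j] += lr * do * h[j]
            bh[j] += lr * dh[j]
            for i in range(2):
                w[i][j] += lr * dh[j] * x[i]
        bo += lr * do

after = total_error()
print(round(before, 4), '->', round(after, 4))
```

The hidden nodes start with no interpretation, as the text says: their weights are random, and only through repeated error propagation do they come to encode something useful about the task.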
Genetic Algorithms
In the 1970s, John Holland used an analogy between learning and biological
adaptation to develop an approach to machine learning called ‘genetic algorithms’.
Species adapt to their environments by mutation, which introduces new genetic
combinations, and by natural selection, which favours members of the species who
are genetically suited to survive and reproduce.
The inputs to a genetic algorithm program are structures for performing tasks. These
can be simple structures, such as binary strings representing the presence or absence
of features, or more complex structures such as computer programs. The outputs are
modified structures that should perform the desired task more effectively. To obtain
this improvement, genetic algorithms alter the input structures by randomly modifying
them (mutation) and by combining them (crossover). The resulting structures are then
evaluated for their effectiveness, and variation and selection are repeated. Genetic
algorithms have many useful applications, such as generating expert systems for
designing computer circuits.
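A genetic algorithm can be sketched in a few lines. The task below (evolving a binary string of all 1s) and the parameter values are invented for illustration; the mutation, crossover and selection steps follow the description above:

```python
import random

random.seed(1)
LENGTH, POP, GENS = 20, 30, 60

def fitness(s):
    return sum(s)                       # count the 1s in the string

def mutate(s, rate=0.02):
    # Mutation: randomly flip bits, introducing new combinations.
    return [1 - bit if random.random() < rate else bit for bit in s]

def crossover(a, b):
    # Single-point crossover: combine two parent structures.
    point = random.randrange(1, LENGTH)
    return a[:point] + b[point:]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
best_start = max(fitness(s) for s in pop)
for _ in range(GENS):
    # Selection: the fitter half survives and breeds to refill the population.
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children
best_end = max(fitness(s) for s in pop)
print(best_start, '->', best_end)
```

Because the fittest structures survive each generation unchanged, the best fitness never decreases, mirroring the variation-and-selection cycle described above.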
Explanation-Based Learning
The methods described above can generate new knowledge using very little
background knowledge. By contrast, explanation-based learning makes extensive use
of existing knowledge. Explanation-based learning can start from a single example
instead of the large sets of training examples typically used in learning from
examples. The input to an explanation-based learning algorithm consists of an
example plus a database of general rules or schemas.
The output is a new concept, formed by constructing an explanation of why the
example is an instance of the goal concept, and by generalising the explanation to
obtain the goal concept. The new concept does not arise from generalisation from
numerous examples, but from the knowledge-intensive attempt to understand what is
happening in a particular example.
Evaluating Hypotheses
Hypothesis evaluation usually takes place in information rich areas such as medical
diagnosis. A doctor needs to generate and evaluate hypotheses that can explain why a
patient has a particular collection of symptoms. The inputs are a set of facts to be
explained and a knowledge base that can be used to generate hypotheses to explain
them. The outputs are judgments about the acceptability of various hypotheses.
The algorithms required need to generate hypotheses by searching the knowledge base
for explanations, then choose among competing hypotheses. Some systems use
Bayesian networks, in which facts and hypotheses are represented by nodes, connected by links that represent probabilities.
Other systems operate more qualitatively, e.g. by treating hypothesis choice as a
process of constraint satisfaction that can be computed by artificial neural networks.
Analogical Inference (Case-Based Reasoning)
Analogical inference is useful when general information about a problem is not
available but there is a stock of similar, previously solved problems. The required
inputs are a problem to be solved (the target) and a knowledge base of previously
solved problems. Problems can include plans, such as how to get to the airport;
designs, such as how to build an aircraft; and explanations, such as why a plane
crashed. A new plan, design, or explanation can be formed by adapting a previous one
(the source) that already exists in the knowledge base.
To achieve this, we need algorithms for retrieving a potentially relevant case,
mapping it to the target problem, establishing correspondences and adapting the case
to provide a solution. Once the target problem has been solved, another kind of
learning can take place by noting the common properties of the target and source to
provide a basis for subsequent problem solving.
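Retrieval of the most similar source case can be sketched as a nearest-neighbour search. The cases and features below are invented examples; a real system would use a much richer similarity measure and would also adapt the retrieved case:

```python
# A stock of previously solved problems: (problem description, solution).
cases = [
    ({'engine': 'off', 'lights': 'dim', 'fuel': 'full'}, 'flat battery'),
    ({'engine': 'off', 'lights': 'bright', 'fuel': 'empty'}, 'out of fuel'),
    ({'engine': 'running', 'lights': 'bright', 'fuel': 'full'}, 'no fault'),
]

def similarity(a, b):
    # Count the features on which the two problem descriptions agree.
    return sum(1 for k in a if a.get(k) == b.get(k))

def retrieve(target):
    # Map the target onto the most similar stored source case.
    return max(cases, key=lambda case: similarity(case[0], target))

target = {'engine': 'off', 'lights': 'dim', 'fuel': 'half'}
source, diagnosis = retrieve(target)
print(diagnosis)
```

Once the target has been solved, the new (problem, solution) pair can itself be added to the case base, which is the learning step described above.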
2. ANALYSING NLP
2.1 Levels of Processing
The processes involved in the comprehension and production of language can be
analysed at several different levels:
Phonological: the sounds made in the language. The units include phonemes and
syllables. At this level we should know:
• How to classify sounds as belonging to specific linguistic categories.
• How to use patterns of pitch and volume to determine boundaries between units.
• What sounds can legally follow other sounds.
• How units at one level (e.g. phonemes) combine to form units at a higher level (e.g. syllables).
Morphological: the shape or structure of the words used in the language, based on the
prefixes, suffixes and root forms making them up. The units are morphemes. At this level
we should know:
• What order to put, or expect, the morphemes in a polymorphemic word.
• How the sounds in a morpheme change when it combines with another.
• How to interpret or produce a word consisting of a novel combination of morphemes.
There are several different types of morphology:
• Inflectional morphology: refers to the changes to a word that are required in a particular grammatical context, e.g. most nouns add an ‘s’ to signify a plural.
• Derivational morphology: refers to the derivation of a new word from another word, often of a different category, e.g. ‘baldness’ is derived from the adjective ‘bald’ plus the suffix ‘ness’.
• Compound morphology: refers to taking two different words and combining them into one, e.g. ‘goalkeeper’ is a compound of ‘goal’ and ‘keeper’. (‘Keeper’ is itself derived from the verb ‘keep’ by derivational morphology.)
Syntactic: the grammatical rules of the language. The units are phrases, clauses and
sentences. It is often difficult to separate syntax completely from semantics (see
below). At this level we should know:
• How to order the components of a sentence to indicate a particular meaning.
• What roles the components of a sentence play in the state or event described.
• Where to expect gaps in particular patterns.
Semantic: the meaning of utterances in the language. The units are objects, relations,
variables and worlds. At this level we should know:
• How words and sentences relate to objects and relations in the world.
• How to interpret the scope of negation.
• How to ‘find’ the thing that is referred to by a definite noun phrase (such as ‘the woman who always parks in front of my house’).
Pragmatic: the context within which the language is used. Units include utterances,
discourses, turns. At this level we should know:
• How to get somebody to do something for you.
• How the meanings of words such as ‘go’ and ‘come’ change with the context.
• How to begin and end a phone conversation without offending the hearer.
• What forms sound more formal than others and how to select the forms that are appropriate to the situation.
• When someone has said something that is politically incorrect.
2.2 Categories of Words
Words can be classified into different categories according to their function. The same
classes have been used since the days of the Ancient Greeks.
• Adjective: a word that qualifies a noun, e.g. ‘white’ in ‘white horse’;
• Adverb: a word that qualifies a verb, e.g. ‘boldly’ in ‘to go boldly’;
• Auxiliary: a verb which occurs along with another verb to add some further meaning, e.g. ‘will’ in ‘will fly’ adds the notion of future action;
• Determiner: a word which specifies whether we are dealing with a definite object or an indefinite one, e.g. ‘the’ in ‘the book’ versus ‘a’ in ‘a book’;
• Noun: a word that refers to an object or entity, e.g. ‘horse’;
• Preposition: a word that denotes position, e.g. ‘on’ in ‘on the shelf’;
• Pronoun: a word that refers to a person or thing without naming it, e.g. ‘he’, ‘you’;
• Verb: a word that refers to an action, e.g. ‘flying’, ‘washing’, ‘thinking’;
• Conjunction: a word which joins parts of a sentence together, e.g. ‘and’, ‘or’.
We distinguish between content and function words, and between words of open and
closed classes:
Content words (such as nouns and verbs) carry the meaning of a sentence. They are
an open class, since the number of members of the class can be extended almost
indefinitely.
Function words, such as prepositions or conjunctions, serve a grammatical function
in the sentence. They are a closed class since there is only a limited number of each
and it is difficult to add new ones. However, closed classes do change over time. Up
until fairly recently ‘thee’ and ‘thou’ were in common use as pronouns. Nowadays
they only appear in poetry or archaic works.
2.3 Ambiguity
Ambiguity is one of the biggest problem areas in processing natural language by
computer. No-one is really sure why natural language is so ambiguous, although some
linguists believe that ambiguity arises from the variety of sources from which a
language is derived.
Many jokes rely on ambiguity for their effect. We are tricked into expecting a
particular ending, only to find that we are wrong. English is reckoned to be the only
language in the world where your nose can run and your feet can smell.
Ambiguity is a problem in NLP because it increases the range of possible
interpretations of an utterance. If each word in a 10-word sentence has 3 possible
meanings, then the whole sentence could have 3¹⁰ = 59,049 different meanings! This
problem is known as combinatorial explosion. The process of removing ambiguity
is often referred to as disambiguation.
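The arithmetic is easy to check, and shows how quickly the number of possible readings grows with sentence length:

```python
# With 3 senses per word, the number of possible readings of a sentence
# multiplies by 3 for every word added.
readings = {length: 3 ** length for length in (5, 10, 15)}
for length, count in sorted(readings.items()):
    print(length, 'words:', count, 'possible readings')
```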
Ambiguity can be local or global. Local ambiguity means that part of a sentence can
have more than one interpretation, but the whole sentence cannot. Global ambiguity
means that the whole sentence can have more than one interpretation.
Consider the sentence:
‘I know more beautiful women than Julie Rodgers’
This can have two distinct meanings:
‘I know women who are more beautiful than Julie Rodgers’
or
‘I know a larger number of beautiful women than Julie Rodgers knows.’
The sentence is globally ambiguous. However, if we change it to:
‘I know more beautiful women than Julie Rodgers, although she knows quite a few.’
the global ambiguity is removed. The first phrase is still locally ambiguous, and it
is only when the second phrase is added that the ambiguity is resolved.
Local ambiguity can often be resolved by syntactic analysis. Consider the following
sentences.
‘The old train the young’
‘The old train pulled out of the station’
Once we realise that ‘train’ is a verb in the first sentence and a noun in the second, the
ambiguity is resolved.
Resolution of global ambiguity requires semantic or pragmatic analysis. Look at the
following sentences:
‘I saw the Houses of Parliament flying over London’
‘I saw a Boeing 747 flying over London’
Here we can differentiate between the two sentences because we know what can and
cannot fly. The first sentence must mean:
‘I saw the Houses of Parliament (while I was) flying over London’
However, the second sentence remains globally ambiguous as it can have either of the
following meanings:
‘I saw a Boeing 747 (while I was) flying over London’
‘I saw a Boeing 747 (that was) flying over London’
Types of Ambiguity
Structural ambiguity
This occurs where there is more than one way of parsing a given sentence, e.g.
‘You can have peas and beans or carrots with your dinner.’
This can be parsed as:
‘You can have (peas) and (beans or carrots) with your dinner.’
or as
‘You can have (peas and beans) or (carrots) with your dinner.’
The ambiguity can sometimes be resolved by the use of intonation or pauses, if the
sentence is spoken, or by the use of punctuation, if it is written. However, what
about:
‘I saw a man on a hill with binoculars’
There are no commas and there are unlikely to be pauses in speech. We may have to
rely on world knowledge to derive the most likely meaning, e.g. how far away is the
hill, how many men are on it, where are the binoculars?
Form class ambiguity
This occurs when a word can belong to more than one syntactic class, i.e. it can be
represented by more than one terminal symbol. This is sometimes known as syntactic
ambiguity or categorical ambiguity.
E.g. the word ‘time’ can be used as a noun, a verb or an adjective:
‘Time is money’
‘Time me on the last lap’
‘Time travel is impossible’
Or the word ‘flies’ can be a noun or a verb:
‘Time flies like the wind’
‘Fruit flies like bananas’
The ambiguity can be resolved by syntactic analysis, i.e. by deciding which syntactic
class the ambiguous word belongs to.
Word sense ambiguity
This occurs when a word has only one terminal symbol, but can refer to different
concepts. It is sometimes called lexical ambiguity, and arises because a word has
more than one meaning: e.g. ‘lead’ – is it the lead in a pencil, the metal lead or the
dog’s lead?
Consider the use of the word ‘charged’ in the following sentences:
‘The battery was charged with jump leads.’
‘The thief was charged with breaking and entering.’
‘The lecturer was charged with student guidance.’
‘The atmosphere was charged with excitement.’
The ambiguity can be resolved by semantic analysis: we know that jump leads and
student guidance aren’t criminal offences. But what about sentences like:
‘The boy ran away from the bank’
How do we tell whether the bank in question is the side of a river or a financial
institution? We can’t without further information. (See Section 2.4)
Word sense ambiguity can cut across syntactic categories, e.g. the word ‘back’ is an
adverb in ‘go back’, an adjective in ‘back door’, a noun in ‘I have a sore back’ and a
verb in ‘back up your data regularly’.
Referential ambiguity
This occurs where more than one object is referred to by a noun phrase, e.g.
‘Jack was cycling with Colin when he fell off his bike.’
Who fell off the bike?
Referential ambiguity may be resolved in a variety of ways:
‘Jack gave Colin a present and he thanked him profusely.’
We know that it’s most likely that the recipient of the present is doing the thanking,
but what if the present was given in thanks for something else?
An ambiguous pronoun in a second phrase or sentence is more likely to apply to the
subject of the first phrase than to its object.
‘The supervisor fired the worker. He was known to be aggressive.’
It is more likely that the supervisor was ‘known to be aggressive’.
Where there is only one sentence involved, an ambiguous pronoun is more likely to
refer to the person closest to it in the sentence.
‘Kirsten gave Jill a coat because she was cold.’
The most likely interpretation is that Jill was cold. Similar logic would suggest that it
was Colin who fell off the bike in the first example given.
But what about:
‘Jack and Kirsten gave Colin and Jill some computer games because they liked
them.’
Should we assume that ‘them’ refers to Jack and Kirsten, or is this too obvious? We
don't really need to say that we like someone if we give something to them.
2.4 Sequences of Sentences
Sometimes we need to refer to a previous sentence in order to understand the meaning
of a given sentence, e.g. ‘He walked away from the bank.’ It is unclear whether bank
is the side of a river, or a financial institution. However, the following are not
ambiguous:
‘John threw a stone in the river. He walked away from the bank.’
‘John withdrew £50 from the ATM. He walked away from the bank.’
2.5 Garden Path Sentences
Garden path sentences are those which lead the listener up the garden path to an
incorrect parse. The following sentences may look as though they are incorrect, but
each of them is grammatically correct and each has a clear meaning.
1. The horse raced past the barn fell.
2. The man who hunts ducks out on weekends.
3. The cotton clothing is usually made of grows in Mississippi.
4. The prime number few.
5. Fat people eat accumulates.
None of these sentences has random words tacked on; none of them are sentence
fragments stitched together; none of them are incomplete.
Here are the sentences with a bit of explanation that should clarify what they mean.
The horse raced past the barn fell.
The horse (that was) raced past the barn fell (down).
The man who hunts ducks out on weekends.
What does the man who hunts do on weekends? He ducks out on weekends.
The cotton clothing is usually made of grows in Mississippi.
Where do they grow the cotton that that clothing is usually made of? Mississippi.
The prime number few.
The mediocre number many; the prime number few.
Fat people eat accumulates.
The fat that people eat accumulates.
3. PARSING AND GENERATION TECHNIQUES
3.1 A Simple Formal Grammar
Context-Free Grammars (CFGs) are a method of describing language and other
hierarchical structures. They are related to Phrase Structure Trees.
CFGs have the following characteristics:
• a left-hand and right-hand side, separated by the symbol ::= (read as ‘consists of’)
• one symbol only on the left-hand side
• at least one symbol on the right-hand side
• symbols on the left-hand side of rules are always non-terminals (that is, they
  never appear as leaves on trees)
• symbols on the right-hand side of rules may be either terminals or non-terminals.
In this section we’ll define a formal grammar for a small subset of the English
language which we’ll call Innglish. Natural languages use fixed sets of letters (in
written form) or sounds (in spoken form), which combine to form a fixed set of
words, the lexicon or vocabulary of the language.
Words can be regarded as the symbols of a formal language. These symbols can be
combined to form strings. Some strings such as ‘This is a sentence’ are valid English
sentences, while others, such as ‘A this sentence is’ are not. We can devise a
grammar, or set of rules which allow us to generate or recognise reasonably complex
sentences.
There are a number of different systems for describing grammars, but most of them
use the concept of grouping symbols together to form phrases and are thus known as
phrase structure grammars.
Phrases such as ‘a boy’, ‘the kitchen’ and ‘the seat in the garden’ are noun phrases.
As we will see later, phrases help us to describe the semantics, or meaning of a
sentence. Using different types of phrase and specifying the permissible
relationships between them helps us to define the allowable strings of the language.
For example, we can say that a noun phrase can combine with a verb phrase (such as
‘sat on the chair’) to form a sentence.
A formal language is defined as a set of strings, each of which consists of a sequence
of symbols chosen from a finite set of terminal symbols. Terminal symbols cannot be
further subdivided, but they can be categorised into groups according to their
function, e.g. nouns, verbs, adjectives etc.
Other symbols, such as <noun phrase>, <verb phrase> and <sentence> consist of
groupings of terminal symbols and are known as non-terminal symbols. The highest
permitted level of non-terminal symbol (in this case, <sentence>) is known as the
distinguished non-terminal symbol.
The first step in defining a grammar is to define a lexicon, or list of acceptable words.
These words are grouped according to their category or part of speech. A brief lexicon
for Innglish is given below. The ellipsis (…) after the members of a category
indicates that it is possible to add more members.
Category        Members
Noun            boy, seat, garden, kitchen, firemen …
Verb            flying, visiting, sitting, running …
Adjective       big, green …
Adverb          quickly, slowly …
Preposition     in, on, at …
Conjunction     and, or …
Article         a, an, the …
We then specify the rules which describe the grammar:

<sentence>             ::= <noun phrase> <verb phrase>

<verb phrase>          ::= <verb group>
                         | <verb phrase> <noun phrase>
                         | <verb phrase> <adjectival phrase>
                         | <verb phrase> <prepositional phrase>
                         | <verb phrase> <adverb>

<verb group>           ::= <verb>
                         | <auxiliary> <verb>

<noun phrase>          ::= <pronoun>
                         | <noun>
                         | <noun phrase> <conjunction> <pronoun>
                         | <adjectival phrase> <noun>
                         | <article> <noun>
                         | <noun phrase> <prepositional phrase>

<adjectival phrase>    ::= <adjective>
                         | <article> <adjective>

<prepositional phrase> ::= <preposition> <noun phrase>
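For later reference, the grammar can be transcribed directly into a data structure. The Python sketch below is ours, not part of the course text; the prepositional-phrase rule is assumed to be completed by a noun phrase (as in ‘in the garden’):

```python
# The Innglish grammar as a Python dictionary: each non-terminal maps to a
# list of alternative right-hand sides (one list of symbols per alternative).
GRAMMAR = {
    '<sentence>': [['<noun phrase>', '<verb phrase>']],
    '<verb phrase>': [
        ['<verb group>'],
        ['<verb phrase>', '<noun phrase>'],
        ['<verb phrase>', '<adjectival phrase>'],
        ['<verb phrase>', '<prepositional phrase>'],
        ['<verb phrase>', '<adverb>'],
    ],
    '<verb group>': [['<verb>'], ['<auxiliary>', '<verb>']],
    '<noun phrase>': [
        ['<pronoun>'],
        ['<noun>'],
        ['<noun phrase>', '<conjunction>', '<pronoun>'],
        ['<adjectival phrase>', '<noun>'],
        ['<article>', '<noun>'],
        ['<noun phrase>', '<prepositional phrase>'],
    ],
    '<adjectival phrase>': [['<adjective>'], ['<article>', '<adjective>']],
    # assumed completion: a preposition is followed by a noun phrase
    '<prepositional phrase>': [['<preposition>', '<noun phrase>']],
}

print(len(GRAMMAR['<noun phrase>']))   # 6 alternatives for a noun phrase
```

Any symbol that never appears as a key here is a terminal or pre-terminal category.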
3.2 Drawing Phrase-Structure Trees
Suppose that we have an input string such as ‘they are visiting firemen’. We can use
the grammar to recognise whether the sentence is a member of the language described
by the grammar or not. Indeed, if the sentence is a member of the language, then we
shall be able to draw the phrase-structure tree or trees. We draw trees with the root at
the top. The root is labelled with the distinguished symbol of the grammar:
<sentence>
|
‘They are visiting firemen’
We can then draw in the non-terminals that make up the sentence:
<sentence>
    <noun phrase>
        ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then pick one of the non-terminals <noun phrase> and expand it:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then repeat the process by expanding another non-terminal <verb phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            ‘are visiting’
        <noun phrase>
            ‘firemen’
We then repeat the process again by expanding <verb group>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            ‘firemen’
Finally, we expand <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            <noun>
                ‘firemen’
We have now expanded all the non-terminal symbols of the sentence. However this
sentence can also be parsed in another way. The first 3 steps are the same:
<sentence>
|
‘They are visiting firemen’
We can then draw in the non-terminals that make up the sentence:
<sentence>
    <noun phrase>
        ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then pick one of the non-terminals <noun phrase> and expand it:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then repeat the process by expanding another non-terminal <verb phrase>.
However, this time the word ‘visiting’ is placed in the <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            ‘are’
        <noun phrase>
            ‘visiting firemen’
We then expand <verb group>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <verb>
                ‘are’
        <noun phrase>
            ‘visiting firemen’
We then repeat the process again by expanding <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <verb>
                ‘are’
        <noun phrase>
            <adjectival phrase>
                ‘visiting’
            <noun>
                ‘firemen’
Then we expand <adjectival phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <verb>
                ‘are’
        <noun phrase>
            <adjectival phrase>
                <adjective>
                    ‘visiting’
            <noun>
                ‘firemen’
As you can see, the sentence can have two different parse trees (and therefore, two
different meanings) depending on whether we treat the word ‘visiting’ as a verb, or as
an adjective. Note that we cannot tell from the sentence itself which of these meanings
is correct. We may be able to tell if the sentence is placed in context:
‘Who are they? They are visiting firemen.’
‘Where are they? They are visiting firemen.’
This example shows a clear case of global ambiguity. Ambiguity is where there is
more than one interpretation possible. In NLP we distinguish between global
ambiguity, where there is more than one possible interpretation of a whole utterance
(e.g. a sentence) and local ambiguity, where part of an utterance seems ambiguous
in isolation until later information allows us to discard all or most of the competing
interpretations.
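We can check mechanically that the sentence yields exactly these two trees. The sketch below (in Python, not part of the original materials) uses a simplified version of the grammar, keeping only the rules needed for this sentence and using <verb phrase> ::= <verb group> <noun phrase> in place of the left-recursive rule. It enumerates every parse of every span:

```python
from itertools import product

# Simplified, illustrative encoding of the Innglish grammar: only the rules
# needed for 'they are visiting firemen', with left recursion removed.
LEXICON = {
    'they': {'Pron'}, 'are': {'Aux', 'V'},
    'visiting': {'V', 'Adj'}, 'firemen': {'N'},
}
BINARY = [('S', 'NP', 'VP'), ('VP', 'VG', 'NP'),
          ('VG', 'Aux', 'V'), ('NP', 'AdjP', 'N')]
UNARY = [('VP', 'VG'), ('VG', 'V'), ('NP', 'Pron'),
         ('NP', 'N'), ('AdjP', 'Adj')]

def parses(words, cat, i, j):
    """Return every bracketed tree covering words[i:j] rooted at cat."""
    trees = []
    if j - i == 1 and cat in LEXICON[words[i]]:
        trees.append(f"({cat} {words[i]})")
    for parent, child in UNARY:                 # unary rules, e.g. NP -> Pron
        if parent == cat:
            trees += [f"({cat} {t})" for t in parses(words, child, i, j)]
    for parent, left, right in BINARY:          # binary rules, e.g. S -> NP VP
        if parent == cat:
            for k in range(i + 1, j):           # try every split point
                for l, r in product(parses(words, left, i, k),
                                    parses(words, right, k, j)):
                    trees.append(f"({cat} {l} {r})")
    return trees

words = 'they are visiting firemen'.split()
for tree in parses(words, 'S', 0, len(words)):
    print(tree)   # prints the two possible parse trees
```

One printed tree analyses ‘visiting’ as a verb after the auxiliary ‘are’; the other analyses ‘are’ as the main verb and ‘visiting’ as an adjective.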
3.3 Terminology
To those coming to NLP without much of a background in linguistics or traditional
grammar, the terms used in syntax can seem very confusing. We've already
encountered the words non-terminal, pre-terminal and terminal. The good news is that
terminals are simply words from the lexicon, so we need concern ourselves with
these no further.
Non-terminals
In Context-Free Grammars (CFGs), we write rules to describe phrases. Phrases are
simply groups of words; a phrase usually takes its name from some important word
within it, e.g.
<noun phrase>
<verb phrase>
<prepositional phrase>
<sentence>
There are other phrases that are sometimes used, such as <verb group>, used in the
previous example.
Pre-terminals
In the CFG we have used so far, we used categories such as: adjective, adverb,
auxiliary, determiner, noun, preposition, pronoun, verb. These are known as pre-terminals.
3.4 Search and Control in Parsing
We’ll now use the grammar and lexicon presented earlier in this section to show how
phrase-structure trees can be developed top-down or bottom-up. We’ll use Noam
Chomsky’s famous example ‘They are visiting firemen’ to show how the search can
be controlled either depth-first or breadth-first.
The grammar presented earlier was:

<sentence>             ::= <noun phrase> <verb phrase>

<verb phrase>          ::= <verb group>
                         | <verb phrase> <noun phrase>
                         | <verb phrase> <adjectival phrase>
                         | <verb phrase> <prepositional phrase>
                         | <verb phrase> <adverb>

<verb group>           ::= <verb>
                         | <auxiliary> <verb>

<noun phrase>          ::= <pronoun>
                         | <noun>
                         | <noun phrase> <conjunction> <pronoun>
                         | <adjectival phrase> <noun>
                         | <article> <noun>
                         | <noun phrase> <prepositional phrase>

<adjectival phrase>    ::= <adjective>
                         | <article> <adjective>

<prepositional phrase> ::= <preposition> <noun phrase>
The lexicon included:

adjective  ->  visiting
auxiliary  ->  are
noun       ->  firemen
pronoun    ->  they
verb       ->  are
verb       ->  flying
We have already seen that we can derive phrase-structure trees from the grammar for
the sentence ‘They are visiting firemen’. Now we’ll describe how to search for
solutions in the form of phrase-structure trees. There are two main search strategies:
top-down and bottom-up.
The Top-Down Search Strategy
The top-down search strategy is sometimes known as hypothesis-driven search,
because it operates by proposing that the input string (‘they are visiting firemen’), is
covered by the distinguished non-terminal of the grammar. In the example given
above, we would start by proposing <sentence> as the distinguished non-terminal:
<sentence>
|
‘They are visiting firemen’
We can then draw in the non-terminals that make up the sentence:
<sentence>
    <noun phrase>
        ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then pick one of the non-terminals <noun phrase> and expand it:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then repeat the process by expanding another non-terminal <verb phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            ‘are visiting’
        <noun phrase>
            ‘firemen’
We then repeat the process again by expanding <verb group>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            ‘firemen’
Finally, we expand <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            <noun>
                ‘firemen’
The Bottom-Up Search Strategy
The bottom-up search strategy is sometimes known as data-driven search, because it
essentially operates by building upwards from the input string (in this example, ‘they
are visiting firemen’), to the distinguished non-terminal symbol of the grammar.
We start off from the sentence itself, broken down into its constituent parts:
<pronoun>
    ‘they’
<auxiliary>
    ‘are’
<verb>
    ‘visiting’
<noun>
    ‘firemen’
We can then try to add a non-terminal that dominates one or more ‘lower’ nodes:
<noun phrase>
    <pronoun>
        ‘they’
<auxiliary>
    ‘are’
<verb>
    ‘visiting’
<noun>
    ‘firemen’
We then repeat the process:
<noun phrase>
    <pronoun>
        ‘they’
<verb group>
    <auxiliary>
        ‘are’
    <verb>
        ‘visiting’
<noun>
    ‘firemen’
and again:
<noun phrase>
    <pronoun>
        ‘they’
<verb group>
    <auxiliary>
        ‘are’
    <verb>
        ‘visiting’
<noun phrase>
    <noun>
        ‘firemen’
and again:
<noun phrase>
    <pronoun>
        ‘they’
<verb phrase>
    <verb group>
        <auxiliary>
            ‘are’
        <verb>
            ‘visiting’
    <noun phrase>
        <noun>
            ‘firemen’
until we can finally add the distinguished non-terminal symbol to the root of the tree:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            <noun>
                ‘firemen’
Control of the search strategy
As we saw earlier, more than one phrase-structure tree may be derived from the
grammar when applied to ‘they are visiting firemen’. We want our NLP systems to
find all trees of globally ambiguous sentences such as this. (We may want to choose
between the trees on the basis of some other kind of information, such as semantic
information.) When we're writing algorithms, we must have a method of ensuring that
we've considered every possibility. The process of handling alternatives in searches is
known as control. There are essentially two kinds of control: depth-first control and
breadth-first control.
Depth-first control of the search strategy
Control is the process of handling alternatives in search: depth-first control pursues
one alternative as far as possible until it is successful or blocks. Only then does it
consider the next alternative. For example, consider the top-down search when it has
got as far as:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
There are two <verb phrase> rules which can be used:
<verb phrase> ::= <verb group>
                | <verb phrase> <noun phrase>
Depth-first control uses a stack, a last-in, first-out (LIFO) data structure which
operates like a railway siding. Items are always added (pushed) or removed (popped)
from the top of the stack. In depth-first control, the alternatives are placed on a stack
and the first item popped. This is expanded by placing alternatives on the stack. So
our stack would look like:
<verb group>  ::= <auxiliary> <verb>
<verb phrase> ::= <verb phrase> <noun phrase>
When the <auxiliary> and <verb> have been consumed, the alternatives rules for
<noun phrase> are placed on the stack:
<noun phrase> ::= <pronoun>
<noun phrase> ::= <noun>
<noun phrase> ::= <noun phrase> <conjunction> <pronoun>
<noun phrase> ::= <adjectival phrase> <noun>
<noun phrase> ::= <article> <noun>
<noun phrase> ::= <noun phrase> <prepositional phrase>
<verb phrase> ::= <verb phrase> <noun phrase>
Each alternative would be popped from the stack in turn, until only the following
remains:

<verb phrase> ::= <verb phrase> <noun phrase>
It is only now that this alternative <verb phrase> rule can be used.
Breadth-first control of the search strategy
Breadth-first control uses a queue rather than a stack data structure. A queue is a
first-in first-out (FIFO) data structure, i.e. elements are added at one end of the data
structure and taken away from the other. If you think about it, this is exactly how a
queue in a supermarket operates. Customers join at the tail of the queue, move
gradually up to the head, pay for their goods and leave.
If we look again at the handling of the alternative <verb phrase> rules, we'll see the
difference. Again, there are two VP rules which can be used:
<verb phrase> ::= <verb group>
                | <verb phrase> <noun phrase>
The first entry is removed from the queue and expanded, but the new entry goes at the
end of the queue:
<verb phrase> ::= <verb phrase> <noun phrase>
<verb group>  ::= <auxiliary> <verb>
This means that the next entry to be expanded is the alternative <verb phrase> rule,
not the <verb group>.
With one important exception, the two strategies produce the same results, although
not in the same order. If there are two solutions, one involving more ‘steps’ than the
other, then breadth-first search will find the solution with fewer ‘steps’ first. The order
in which depth-first search will find solutions depends on the order of rules in the
grammar.
The important exception is where the search never halts because there is an infinite
branch in the ‘search tree’. Consider a situation where there is one solution and an
infinite branch in the search tree. Breadth-first search will find the solution before
disappearing into infinity (because the solution must have fewer ‘steps’ than the
infinite branch). Depth-first search may find the solution before disappearing into
infinity, but only if the rules in the grammar are ordered so that the ‘finite’ solution is
found before the infinite branch.
If there are no infinite branches, which control is preferable? There is no obvious
answer to this question. Depth-first used to be the only practical option because on
average there are fewer entries on the stack at any one time. When computers were
very restricted as to the amount of memory available, this was an important criterion.
These days, memory is very much larger and so doesn't affect the choice of control to
the same extent.
We have shown that it is possible to have either a top-down or bottom-up search
strategy, depending on whether we start at the distinguished symbol (hypothesis) or
the words (data). We've also shown that it is possible to have either depth-first control
or breadth-first control. This implies that we can have four kinds of search algorithm:
1. Top-down, depth-first
2. Top-down, breadth-first
3. Bottom-up, depth-first
4. Bottom-up, breadth-first.
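The effect of the two control regimes can be sketched directly: the only difference between them is whether the frontier of the search is kept on a stack or in a queue. The search tree below is invented for illustration and is not one of the grammars above (Python, not part of the original materials):

```python
# Depth-first vs breadth-first control over the same illustrative search
# tree: the only change is which end of the frontier we take entries from.
TREE = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
}

def search(start, depth_first=True):
    frontier = [start]
    order = []
    while frontier:
        # stack: take the most recently added entry; queue: take the oldest
        node = frontier.pop() if depth_first else frontier.pop(0)
        order.append(node)
        frontier.extend(TREE.get(node, []))
    return order

print(search('A', depth_first=True))    # ['A', 'C', 'F', 'B', 'E', 'D']
print(search('A', depth_first=False))   # ['A', 'B', 'C', 'D', 'E', 'F']
```

Depth-first follows one branch to the bottom before backtracking; breadth-first visits every node at one level before moving deeper.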
4. FURTHER TECHNIQUES USED IN NLP
4.1 Key Phrase Matching: ELIZA and the Chatterbots
ELIZA is a computer program which emulates the behaviour of a client-centred
psychotherapist. The original program was written by Joseph Weizenbaum at MIT in
1966 for a mainframe computer, but there are numerous PC versions available, as well
as web-based versions. Details of how to obtain some of these are given later.
The following is a typical conversation (Eliza’s responses are in capitals):
Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
It's true. I am unhappy
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE
YOU COME HERE
The operation of the program is straightforward. The input is read and inspected for
the presence of a keyword. If one is found, the sentence is transformed according to a
rule associated with the keyword or, under certain conditions, an earlier
transformation is retrieved. The generated text is then displayed.
Keywords can have a rank or precedence number. The program will abandon a
keyword already found in the left-to-right scan of the text in favour of one having a
higher rank. Commas and periods are treated as delimiters. If one is encountered after
a keyword has been found, all subsequent text is deleted from the input line. If no
keyword has yet been located, the text prior to the delimiter is deleted and the scan
continues. As a result, only single phrases or sentences are ever transformed.
The fundamental technical problems with which ELIZA has to deal are the
following:
1. The identification of the ‘most important’ keyword in the input message.
2. The identification of some minimal context within which the chosen
keyword appears; e.g., if the keyword is ‘you’, is it followed by the word
‘are’ (in which case an assertion is probably being made).
3. The choice of an appropriate transformation rule, and, of course, the
making of the transformation itself.
4. The provision of a mechanism that will permit ELIZA to respond
‘intelligently’ when the input text contains no keywords.
Eliza often appears to behave in an ‘intelligent’ fashion, but there is no real analysis
of the input text and certainly no attempt to understand it. Weizenbaum himself went
to great pains to point out how trivial the program was. You can read his original
paper at:
http://acf5.nyu.edu/~mm64/x52.9265/january1966.html
You can find a web-based version of Eliza at:
http://www.uib.no/People/hhiso/eliza.html
A similar program (Dr Werner Wilhelm Webowitz!) can be found at:
http://www.parnasse.com/drwww.shtml
Various versions of Eliza can be downloaded from:
http://www-cgi.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/classics/eliza/0.html
Eliza spawned a whole generation of programs known collectively as chatterbots.
You can find links to most of them here:
http://www.simonlaven.com
One of the most interesting chatterbots was Julia, who emulates a real player in multi-user role-playing games. Julia was sufficiently realistic to encourage at least one
other participant to ask for a date. You can read a hilarious account of her adventures
at:
http://foner.www.media.mit.edu/people/foner/Julia/Julia.html
Another interesting chatterbot is the Chomskybot, which can be found at:
http://rubberducky.org/cgi-bin/chomsky.pl
This program generates paragraphs of text in the style of the noted linguist, Noam
Chomsky. An example is given below:
‘Conversely, any associated supporting element is rather different from the
requirement that branching is not tolerated within the dominance scope of a
complex symbol. If the position of the trace in (99c) were only relatively
inaccessible to movement, the appearance of parasitic gaps in domains
relatively inaccessible to ordinary extraction cannot be arbitrary in the system
of base rules exclusive of the lexicon. I suggested that these results would
follow from the assumption that a descriptively adequate grammar does not
affect the structure of a general convention regarding the forms of the
grammar. So far, the descriptive power of the base component raises serious
doubts about the ultimate standard that determines the accuracy of any
proposed grammar. However, this assumption is not correct, since an
important property of these three types of EC may remedy and, at the same
time, eliminate non-distinctness in the sense of distinctive feature theory.’
If you think this is difficult to follow, you should try reading some real examples of
Chomsky’s work!
4.1.2 Finite State Automata
A Finite State Automaton (FSA) or Finite State Machine is a mathematical model of
a system which has discrete inputs and outputs and can be in any one of a finite
number of states. A state summarises the information concerning past inputs that is
needed to determine the behaviour of the system on subsequent inputs. Typical
examples are the control mechanism for an elevator, or the lexical analyser
component of a programming language compiler.
Let’s look at a very simple FSA which can be used to recognise strings written in the
sheep language, Sheeptalk, i.e. any string from the following set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa! etc.
We can show the automaton as a directed graph: a finite set of vertices (or nodes),
together with a set of directed links (or arcs) between pairs of vertices. We'll represent
vertices with circles and arcs with arrows. Written out arc by arc, the automaton is:

q0 --b--> q1
q1 --a--> q2
q2 --a--> q3
q3 --a--> q3   (a self-loop, allowing any number of extra a’s)
q3 --!--> q4
This automaton has five states, represented by nodes in the graph. State 0 is the start
state, represented by the incoming arrow. State 4 is the final state or accepting state,
represented by the double circle. The automaton also has five transitions, represented
by arcs in the graph.
The FSA can be used for accepting or recognising strings as follows. Think of the
input as being written on a long tape broken up into cells, with one symbol written in
each cell of the tape:
| b | a | a | a | ! |
The machine begins in the start state (q0), and goes through the following process:
Check the next letter of the input. If it matches the symbol on an arc leaving the
current state, then cross that arc, move to the next state, and advance one symbol in
the input. If the machine is in the accepting state (q4) when it runs out of input, it has
successfully recognised an instance of Sheeptalk.
If the machine never gets to the final state, either because it runs out of input, or it
gets some input that doesn't match an arc, or if it gets stuck in some non-final state,
we say it rejects or fails to accept an input.
We can also represent an automaton by a state-transition table. As with the directed
graph notation, the state-transition table represents the start state, the accepting states,
and what transitions leave each state with which symbols.
              Input
State      b     a     !
  0        1     0     0
  1        0     2     0
  2        0     3     0
  3        0     3     4
  4:       0     0     0
State 4 is marked with a colon to indicate that it's a final state (you can have as many
final states as you want), and the 0 indicates an illegal or missing transition. The first
row should be read as: ‘if we're in state 0 and we see the input b we go to state 1. If
we're in state 0 and we see the input a or !, we fail’.
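The state-transition table maps directly onto code. The sketch below (ours, not from the text) drives the recogniser from the table, with state 0 as the start state and state 4 as the only accepting state:

```python
# Sheeptalk recogniser driven by the state-transition table above.
# A missing (state, symbol) pair corresponds to a 0 entry in the table.
TABLE = {
    (0, 'b'): 1,
    (1, 'a'): 2,
    (2, 'a'): 3,
    (3, 'a'): 3,   # self-loop: any number of extra a's
    (3, '!'): 4,
}
ACCEPTING = {4}

def accepts(string):
    state = 0
    for symbol in string:
        if (state, symbol) not in TABLE:
            return False                     # illegal transition: reject
        state = TABLE[(state, symbol)]
    return state in ACCEPTING                # must finish in a final state

print(accepts('baaa!'))   # True
print(accepts('ba!'))     # False
```

Running out of input in a non-final state, or meeting a symbol with no matching arc, both cause the machine to reject, exactly as described above.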
It is possible to use Finite State Automata to recognise many acceptable English
sentences. However, they cannot model some language constructs, for instance centre-embedded phrases. Also, Finite State Automata descriptions of the syntax of natural
languages are repetitious and long-winded. The following is a description of part of
English that could account for phrases such as:
• the dog
• the large dog
• the very large dog
• the very very large dog
• the very very very large dog
etc.
q0 --article--> q1
q1 --adjective--> q1   (a self-loop)
q1 --noun--> q2        (q2 is the final state)
Instead of labelling the arcs with pre-terminals, such as article, adjective and noun,
we'll use strings of letters to illustrate our examples. We could draw an FSA that
could recognise the following strings:
• cde
• ccde
• cdee
• ccdee
• cccccccdeee
q0 --c--> q1
q1 --c--> q1   (a self-loop)
q1 --d--> q2
q2 --e--> q3   (q3 is the final state)
q3 --e--> q3   (a self-loop)
This language allows any number of c and any number of e. However, suppose that
we want to model a recogniser that will accept any of the following:
• cde
• ccdee
• cccdeee
but reject any of the following:
• ccde
• cdee
• cccdee
• cccccccdeee
• cdeeeeeeeeeeee
i.e. the number of c and the number of e must be the same.
It is impossible to draw a finite state automaton that will recognise this language and
only this language.
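The limitation is one of memory: a recogniser with a single counter (more memory than any finite set of states provides) handles the language easily. The sketch below is ours, not from the text:

```python
# Recognise strings of the form c^n d e^n (equal numbers of c's and e's),
# which is beyond any finite state automaton but easy with one counter.
def recognises(s):
    i = 0
    cs = 0
    while i < len(s) and s[i] == 'c':        # count the leading c's
        cs += 1
        i += 1
    if cs == 0 or i == len(s) or s[i] != 'd':
        return False                         # need some c's, then exactly one d
    i += 1
    es = 0
    while i < len(s) and s[i] == 'e':        # count the trailing e's
        es += 1
        i += 1
    return i == len(s) and es == cs          # nothing left over, counts equal

print(recognises('ccdee'))   # True
print(recognises('ccde'))    # False
```

An FSA would need a separate state for every possible count of c's, and no finite number of states suffices.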
It may seem that these strings of arbitrary letters have nothing to do with English.
However, English includes sentences which are similar in structure. Consider a
sentence such as:
‘The girl whose mother told me that she'd been painted by Van Gogh at the party
shouted.’
We can split this into several parts:
The girl
whose mother told me
that she'd been painted
by Van Gogh
at the party
shouted.
We could go on extending this sentence indefinitely by simply embedding more and
more phrases in the centre. This type of structure is known as centre-embedded.
Centre-embedded sentences cannot be modelled by an FSA, so we can reasonably
conclude that it is impossible to model the whole of English grammar by an FSA.
4.1.3 Using Templates
Many web sites wish to display live data (such as the current date or time) or collect
data from users, e.g. payment or delivery information. This can be done by directly
coding the required commands into the web site, usually in HTML. However, the
problem is often more easily solved by the use of templates, written in a scripting language
such as Perl. The big advantage of this approach is that all pages depending on a
specific template can be amended simply by changing the template. This is a complex
area, but you can obtain more information from:
http://www.zdnet.com/devhead/stories/articles/0,4413,2184927,00.html
Templates can also be used to generate text. A humorous example can be found at:
http://after.logos.uwaterloo.ca/~tjdonald/harpo/harpo.html
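The idea can be sketched in a few lines (Python's string.Template here, though the text mentions Perl; the page layout is invented): the fixed text lives in one template, and live data is substituted into named slots.

```python
from string import Template

# A page template: the fixed layout is written once, and live data such as
# the current date is substituted into the $-named slots at request time.
page = Template('<html><body>Hello $name, today is $date.</body></html>')

html = page.substitute(name='Alice', date='1 June 2001')
print(html)   # <html><body>Hello Alice, today is 1 June 2001.</body></html>
```

Changing the Template string changes every page generated from it, which is the maintenance advantage described above.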
4.1.4 Representations using Logic
One important area of NLP is the study of the semantics or meaning of natural
language statements. In many cases, the most important aspect of semantics is
determining whether a sentence is true or false. We can simplify this task by defining
a formal language with simple semantics and mapping natural language sentences on
to it.
This formal language should be unambiguous, have simple rules of interpretation and
inference and have a logical structure determined by the form of the sentence. Two
commonly used formal languages are propositional logic and predicate logic.
Propositional logic is concerned with the logic of truth functional, sentential or
propositional operators such as and, or and not.
Sentential operators are those which operate on one or more complete sentences to
give a new sentence. If they are also truth functional operators then the truth of the
resulting sentence can be determined knowing only the truth values of the sentences
from which it was constructed.
An example is the construction known as conjunction. This consists of joining two
sentences with the connective and, e.g. the conjunction of the two sentences:

• Grass is green
• Pigs don't fly

is the sentence:

• Grass is green and pigs don't fly.
The conjunction of two sentences will be true if, and only if, each of the two
sentences from which it was formed is true.
Other propositional connectives include:

p or q     known as the disjunction of ‘p’ and ‘q’.
not p      known as the negation of ‘p’.
In natural languages, words whose primary role is truth functional often have other roles as well. This is one of many ways in which natural languages fall short for some logical or technical purposes, and formal languages may be helpful in overcoming these difficulties. Where a logic is concerned only with sentential connectives, it is usually called a propositional logic.
The best known, and probably the simplest, of these logics is classical or Boolean propositional logic, in which it is assumed that all propositions have a definite truth value: a proposition is either true or it is false.
A predicate is a feature of language that you can use to make a statement about something, e.g. to attribute a property to that thing. If you say, ‘Peter is tall’, then you have applied the predicate ‘is tall’ to Peter. A predicate may be thought of as a kind of function that applies to things and yields a proposition; predicates are therefore sometimes known as propositional functions.
Analysing the predicate structure of sentences permits us to make use of the internal structure of atomic sentences, and to understand the structure of arguments that cannot be accounted for by propositional logic alone.
Predicates (or relations):
• are operators that yield atomic sentences
• operate on things other than sentences
• are therefore not truth functional operators
• yield atomic sentences whose truth can be determined knowing only the identity of the things to which the predicate is applied.
The term relation is typically used of a predicate that is applied to more than one thing, e.g. ‘greater than’, which is applied to two things to make a comparison, but it can also be used for predicates taking one or zero things. The number of ‘things’ involved (as arguments) is called the arity of the predicate or relation.
Though predicates are one of the features that distinguish predicate logic from propositional logic, they are really just the extra structure needed to permit the study of quantifiers. The two important features of natural languages captured in predicate logic are the terms ‘every’ and ‘some’, sometimes called the universal and existential quantifiers. These features of language refer to one or more individuals or things, which are not by themselves propositions, and they therefore force some kind of analysis of the structure of ‘atomic’ propositions.
Where a logic is concerned not only with sentential connectives but also with the internal structure of atomic propositions, it is usually called a predicate logic. The best known, and probably the simplest, of these logics is classical or Boolean first-order predicate logic.
‘Classical’ or ‘Boolean’ simply means that propositions are either true or false.
‘First-order’ means that we consider predicates (or relations) on the one hand, and
individuals on the other; that atomic sentences are constructed by applying the former
to the latter; and that quantification is permitted only over the individuals.
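Over a finite domain of individuals, the two quantifiers can be modelled directly: ‘every’ corresponds to checking all individuals and ‘some’ to checking at least one. The following sketch uses an invented toy domain and predicates purely for illustration.

```python
# A toy domain of individuals and two one-place predicates (arity 1).
domain = ["socrates", "plato", "fido"]

def is_man(x):
    return x in ("socrates", "plato")

def is_mortal(x):
    return True  # in this toy model, every individual is mortal

# Universal quantification: "every man is mortal".
every_man_is_mortal = all(is_mortal(x) for x in domain if is_man(x))

# Existential quantification: "some individual is a man".
some_man_exists = any(is_man(x) for x in domain)

print(every_man_is_mortal, some_man_exists)
```

Both statements come out true in this model; changing the predicates or the domain changes the truth values, which is exactly what ‘interpretation in a model’ means.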
A classic example of what can be done with predicate logic is the inference from the premises:
• All men are mortal.
• Socrates is a man.
to the conclusion:
• Socrates is mortal.
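One simple way to mechanise this kind of inference is forward chaining: facts are stored as (predicate, individual) pairs and rules derive new facts until nothing changes. The sketch below is a toy illustration, not a full theorem prover; the single rule encodes the premise ‘All men are mortal’.

```python
# Known facts as (predicate, individual) pairs.
facts = {("man", "socrates")}

# "All men are mortal" as a rule: if man(X) then mortal(X).
rules = [("man", "mortal")]

def forward_chain(facts, rules):
    """Apply each rule to every matching fact until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            for pred, individual in list(derived):
                if pred == antecedent and (consequent, individual) not in derived:
                    derived.add((consequent, individual))
                    changed = True
    return derived

print(("mortal", "socrates") in forward_chain(facts, rules))
```

The conclusion ("mortal", "socrates") is derived mechanically from the two premises, mirroring the inference above.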
4.2 Examples of NLP Techniques in Use
4.2.1 Using Logic in Answering Questions From A Database
Database searching is based on the principles of Boolean logic, a symbolic logic system named after George Boole, the British mathematician who invented it. Most search engines use a set of connecting words (AND, OR, NOT and NEAR) to make your search more useful.
The Internet can be regarded as a vast computer database, whose contents can
be searched according to the rules of computer database searching. On many
Internet search engines, the options to construct logical relationships among
search terms extend beyond the traditional practice of Boolean searching. Full
details can be found at:
http://library.albany.edu/internet/boolean.html
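A minimal sketch of how Boolean searching works over an inverted index follows. The documents are made up, and NEAR is omitted because it requires word-position information.

```python
# Tiny invented document collection.
docs = {
    1: "grammar rules for english parsing",
    2: "parsing russian text with templates",
    3: "boolean logic and search engines",
}

# Inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

all_ids = set(docs)

def AND(a, b):
    return a & b   # documents matching both terms

def OR(a, b):
    return a | b   # documents matching either term

def NOT(a):
    return all_ids - a   # documents not matching the term

# "parsing AND NOT russian": documents about parsing but not Russian.
result = AND(index["parsing"], NOT(index["russian"]))
print(sorted(result))
```

The query narrows the two parsing documents down to the one that does not mention Russian, showing how the connectives combine sets of matching documents.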
4.2.2 Key Phrase Matching in Text Retrieval
Browsing accounts for much of the time users spend interacting with online text
collections or digital libraries, but it is poorly supported by standard search engines.
Conventional systems often operate at the wrong level, indexing words when people
think in terms of topics, and returning documents when people want a broader view.
As a result, users cannot easily determine what is in a collection, how well a particular
topic is covered, or what kinds of queries will provide useful results.
Gutwin, Paynter et al. built a new kind of search engine, Keyphind, explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit
of both indexing and presentation, allowing users to interact with the collection at the
level of topics and subjects rather than words and documents. The keyphrase index
also provides a simple mechanism for clustering documents, refining queries, and
previewing results.
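The idea of a keyphrase index can be sketched in a few lines. Here the keyphrases and document names are supplied by hand; extracting the phrases automatically is the hard part that Keyphind itself addresses.

```python
# Invented keyphrases per document (Keyphind extracts these automatically).
doc_keyphrases = {
    "paper1": ["machine learning", "text classification"],
    "paper2": ["machine learning", "neural networks"],
    "paper3": ["information retrieval", "text classification"],
}

# Invert the mapping: keyphrase -> documents, so users browse by topic.
phrase_index = {}
for doc, phrases in doc_keyphrases.items():
    for phrase in phrases:
        phrase_index.setdefault(phrase, set()).add(doc)

# Browsing: list every topic with its coverage, then drill into one topic.
for phrase, matched in sorted(phrase_index.items()):
    print(f"{phrase}: {len(matched)} document(s)")
print(sorted(phrase_index["machine learning"]))
```

Because the index is keyed by topic rather than by individual word, a user can see at a glance which subjects the collection covers and how well.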
The authors compared Keyphind to a traditional query engine in a small usability
study. Users reported that certain kinds of browsing tasks were much easier with the
new interface, indicating that a keyphrase index would be a useful supplement to
existing search tools.
This is described in some detail in their paper ‘Improving Browsing in Digital
Libraries with Keyphrase Indexes’. The complete paper can be found at:
http://www.cs.usask.ca/homepages/faculty/gutwin/1999/keyphind-journal/keyphind-12-final-TR.html
4.2.3 Using Templates in Intelligent Tutoring Systems
We have already given some consideration to Intelligent Tutoring Systems (ITS) in
Section 1. One major problem with ITSs is the expense of developing them. This
expense can be reduced considerably if we do not have to start afresh every time. One
way of doing this is to make use of shells or templates that can be used across a range of systems.
This area was one of several addressed at an ITS Workshop held in Montreal in 1996.
A collection of the papers from this workshop is available online at:
http://advlearn.lrdc.pitt.edu/its-arch/papers/index.html
The following papers are of particular interest:
http://advlearn.lrdc.pitt.edu/its-arch/papers/blumenthal.html
http://advlearn.lrdc.pitt.edu/its-arch/papers/brusilovsky.html
http://advlearn.lrdc.pitt.edu/its-arch/papers/fleming.html
http://advlearn.lrdc.pitt.edu/its-arch/papers/goodkov.html
4.3 NLP Software
There are a number of sources on the World Wide Web for NLP software. Try the
following:
Natural Language Software Registry:
http://registry.dfki.de/
Summer Institute of Linguistics:
http://www.sil.org/computing/catalog/