Topic 9

ICT619 Intelligent
Systems
Topic 9: Natural Language
Processing and Language
Technology
What is natural language processing
(NLP)?
 An ideal goal for human-computer communication is
the ability to communicate in a natural language
 NLP grew as a sub-domain of AI and linguistics
- the task of developing software capable of
understanding information (commands, text) expressed
in a natural language in order to achieve specific goals
 Understanding natural languages is a challenging task
for computers
 Due to ambiguities, frequent use of context and the
overall knowledge acquisition and use problem
ICT619
2
Speech (voice) recognition and
natural language processing
 Speech recognition concerns understanding spoken
commands or sentences from voice inputs
Example: Telstra’s directory assistance
 A speech recognition system must first extract and
recognise words from audio input
 We might also like the system to be able to answer in
speech - this requires speech generation as well
 In NLP, input is already available in machine-readable
form (eg words as Unicode text)
 Future improvements of speech recognition will to
some extent depend on progress in NLP
Speech Recognition – The state-of-the-art
 60-90% accuracy - good enough for general dictation
 Speaker dependent – needs training
 Cheap desktop software available
 Example: IBM ViaVoice, Dragon NaturallySpeaking
 Issues:
 Isolated vs. continuous speech
 Vocabulary size
 Better speaker independence
Language Technology
 Covers all areas related to NLP with a practical focus
 Language technology is defined as:
The application of knowledge about human language
in computer-based solutions
 Applications covered by language technology include:
 Spoken language dialogue systems (speech recognition,
some understanding, and speech generation)
 Machine translation
 Text summarisation
 Information retrieval
Language Technology (cont’d)
 The input to a language technology system
may be provided through
 speech recognition
 optical character recognition (OCR)
 handwriting recognition
 The output may be in the form of speech,
tailored documents, or web pages.
Approaches to natural language
processing
Main Approaches
 Keyword searching
 Linguistic analysis
 AI-based
 ANN-based
 Statistical analysis
Keyword searching systems
 Early NLP systems - and some in use today - are
based on keyword searching (pattern matching)
Keyword searching NLP systems
 Selected keywords or phrases are searched for
in the input sentence
 The program responds with specific pre-stored
responses based on the keywords or phrases
 Program may actually construct a response
based on a partial reply coupled with keywords
and phrases from the input
 No real understanding of the input is involved
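The keyword-and-template behaviour described on this slide can be sketched in a few lines. This is only an illustration in the spirit of ELIZA, not the original program: the rule table, the patterns and the default reply are all invented for the example.

```python
import re

# Hypothetical keyword table: (pattern to search for, response template).
# "{0}" is filled with the fragment of the input captured by the pattern.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What would you like to talk about?"),
]
DEFAULT_REPLY = "Please go on."  # used when no keyword is found


def respond(sentence):
    """Return a pre-stored or partially constructed reply for one input sentence."""
    text = sentence.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Build the reply from the template plus words taken from the input.
            return template.format(*match.groups())
    return DEFAULT_REPLY


print(respond("I need a holiday."))
print(respond("The weather is nice."))
```

The second call falls through to the default reply, which shows the limitation noted above: anything outside the keyword table gets no meaningful response.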
Keyword searching NLP systems
(cont’d)
The most well known example
- ELIZA program from MIT mid-1960s
Keyword systems
 Limitations
 Inflexible - really just reactive responses
 Unable to cope with anything not in their keyword
look-up tables, and
 No knowledge modelling
 Today’s more sophisticated NLP systems
 Try to understand the content of language by doing
syntactical, semantic and pragmatic analyses
 May be able to do some conceptual modelling
 Better able to maintain continuous dialogues
 Attempt to cope with the ambiguity and other
features common in natural language
Other approaches to NLP
Linguistic analysis approach
 Based on encoding formal grammar rules for
sentence-level processing
 A linguistically-oriented system focuses on the
syntax and semantics
AI based systems
 Focuses on using world knowledge to understand
language
 One example of an AI-based NLP system is BORIS
 written by Michael Dyer, a student of Roger Schank's
 a story understanding program that reads a narrative and
answers questions about it
AI-based NLP example - BORIS
Richard hadn’t heard from his college roommate Paul for years. Richard had borrowed
money from Paul which was never paid back. But now he had no idea where to find his
old friend. When a letter finally arrived from San Francisco, Richard was anxious to
find out how Paul was.
Q: What happened to Richard at home?
BORIS: Richard got a letter from Paul.
Q: Who is Paul?
BORIS: Richard’s friend.
Q: Did Richard want to see Paul?
BORIS: Yes, Richard wanted to know how Paul was.
Q: Had Paul helped Richard?
BORIS: Yes, Paul lent money to Richard.
The BORIS system (from Roger Schank and Peter Childers, The Cognitive
Computer).
Artificial neural networks based
NLP
ANN based systems
 Uses ANNs for processing language, particularly for
lexical disambiguation
 A neural net is trained to disambiguate by using context
 Training presents units of 6 or so words containing the
target word to be learned
 Example: Disambiguation of word “bank” in “We got a
bank loan to buy a house”
 Two possible senses: money sense, river sense
 Groups of co-occurring words (neighbourhoods):
 Money sense: bank money loan branch fee robbery
 River sense: bank river bridge erosion earth slope
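The idea of choosing a sense from co-occurring words can be illustrated without a neural network. The sketch below is a deliberate simplification: it swaps the trained ANN for a plain overlap count between the context window and each neighbourhood listed above. The function name and the window size are assumptions made for the example.

```python
# Neighbourhoods copied from the slide; the scoring rule is a stand-in for a trained net.
NEIGHBOURHOODS = {
    "money": {"bank", "money", "loan", "branch", "fee", "robbery"},
    "river": {"bank", "river", "bridge", "erosion", "earth", "slope"},
}


def disambiguate(sentence, target="bank", window=3):
    """Pick the sense whose neighbourhood shares most words with the local context."""
    words = [w.strip(".,!?\"'").lower() for w in sentence.split()]
    if target not in words:
        return None
    pos = words.index(target)
    # Context: up to `window` words on each side of the target word.
    context = set(words[max(0, pos - window): pos + window + 1]) - {target}
    scores = {sense: len(context & hood) for sense, hood in NEIGHBOURHOODS.items()}
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best if scores[best] > 0 else None


print(disambiguate("We got a bank loan to buy a house"))
print(disambiguate("The bridge crossed the river near the bank"))
```

A trained network would learn graded associations rather than fixed word lists, but the input (a short window around the target) and the output (one sense) have the same shape.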
Statistical approach to NLP
Statistical approach
 Based on extracting statistically significant information
(tags) from large corpora or bodies of text (millions of
words) and using these as very general indexes to
model parts or responses
 Valuable because it does not require as much
hand-modelling of knowledge, but acquires the tags
automatically
 Statistical methods are now receiving much attention,
and more systems are likely to incorporate them in
future.
 Most NLP systems use a combination of the linguistic
and AI approaches
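One very simple instance of acquiring tags automatically is a most-frequent-tag model: count how often each word carries each tag in a tagged corpus, then label new text with the commonest tag. The sketch below assumes a tiny hand-made corpus and tag set (N, V, P, D), both invented for illustration; real systems learn from millions of words.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus (word, tag); a real corpus would be vastly larger.
TAGGED_CORPUS = [
    ("time", "N"), ("flies", "V"), ("like", "P"), ("an", "D"), ("arrow", "N"),
    ("fruit", "N"), ("flies", "N"), ("like", "V"), ("a", "D"), ("banana", "N"),
    ("the", "D"), ("plane", "N"), ("flies", "V"), ("like", "P"), ("a", "D"), ("bird", "N"),
]


def train(corpus):
    """Learn the most frequent tag for every word seen in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(sentence, model, default="N"):
    """Tag each word with its learned tag; unseen words get the default tag."""
    return [(w, model.get(w, default)) for w in sentence.lower().split()]


model = train(TAGGED_CORPUS)
print(tag("Time flies like an arrow", model))
```

Because this model ignores context, it would also tag "flies" as a verb in "Fruit flies like a banana", which is one reason statistical methods are usually combined with other approaches.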
Components of NLP systems
 Five major elements: the parser, the lexicon, the
semantic analyser, the knowledge base, and the
generator
Components of NLP systems
(cont’d)
 A syntactical parser analyses the input sentence using
the language's grammar or rules of syntax
 Output produced is a structural description of the
sentence - known as a parse tree
 Some rules of syntax for English:
S = NP + VP
S: sentence, NP: noun phrase, VP: predicate or verb phrase
The noun phrase can be more than a single noun:
NP = D + ADJ + N
 D: determiner (eg, “a”, “this”), ADJ: adjective, N:
main noun
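To make the two rules concrete, here is a minimal sketch of a parser that applies S = NP + VP and NP = D + ADJ + N and returns a parse tree as nested tuples. It rests on several assumptions not in the slides: the determiner and adjective are treated as optional, VP is taken to be a verb followed by an optional NP, and the small lexicon is made up for the example.

```python
# Hypothetical lexicon: word -> part of speech.
LEXICON = {"mary": "N", "lamb": "N", "had": "V", "a": "D", "this": "D", "little": "ADJ"}


def tag_words(sentence):
    """Look up each word's part of speech (raises KeyError for unknown words)."""
    return [(LEXICON[w.lower()], w) for w in sentence.rstrip(".").split()]


def parse_np(tokens, i):
    """NP = [D] [ADJ] N. Returns (tree, next_position), or (None, i) on failure."""
    children = []
    for wanted, optional in (("D", True), ("ADJ", True), ("N", False)):
        if i < len(tokens) and tokens[i][0] == wanted:
            children.append(tokens[i])
            i += 1
        elif not optional:
            return None, i
    return ("NP", *children), i


def parse_vp(tokens, i):
    """VP = V [NP] (an assumed rule; the slide only names VP)."""
    if i >= len(tokens) or tokens[i][0] != "V":
        return None, i
    verb = tokens[i]
    np, j = parse_np(tokens, i + 1)
    return (("VP", verb), i + 1) if np is None else (("VP", verb, np), j)


def parse_sentence(sentence):
    """S = NP + VP. Returns a parse tree, or None if the sentence does not fit."""
    tokens = tag_words(sentence)
    np, i = parse_np(tokens, 0)
    if np is None:
        return None
    vp, i = parse_vp(tokens, i)
    if vp is None or i != len(tokens):
        return None
    return ("S", np, vp)


print(parse_sentence("Mary had a little lamb"))
```

The nested tuple mirrors a parse tree: S at the root, an NP for "Mary", and a VP containing the verb and the NP "a little lamb".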
Components of NLP systems (cont.)
The lexicon
 An internal dictionary
used to perform the
syntactic and semantic
analysis
 Contains semantic and
grammatical information
(eg, part-of-speech)
about words or word
strings
Fig. An example parse tree for the
sentence “Mary had a little lamb”
The semantic analyser and the
knowledge base
 The semantic analyser uses the parse tree and the
knowledge base to try to determine what the sentence
means
 It creates another data structure that represents the
meaning of the input sentences
 It can also draw inferences from input statements using
general knowledge in the KB
 The semantic analyser's data structure and those in
the KB should be in a common knowledge
representation, such as KQML or Conceptual Graphs
The Generator
 The generator uses the KB data structure created by the semantic
analyser to create a usable output
 The response depends in part on the pragmatics of the input
language eg greetings require greetings, questions require
answers, commands require actions
 The data structure can be used to initiate some action
 eg if the language system is a front-end to a DBMS, the generator
writes commands in a query language to begin a search
 Simple generators feed standard pre-stored output responses to
the user based on the built meaning representation
 More sophisticated generators construct an original response by
instantiating templates based on models of language use
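A template-instantiating generator can be sketched as follows. The meaning representation (a dictionary with an "act" and some "slots"), the template set and the table and column names are all hypothetical; the point is only that the pragmatic type of the input selects the template and the slots fill it.

```python
# Hypothetical templates, one per pragmatic type of input.
TEMPLATES = {
    "greeting": "Hello, {user}.",
    "question": "SELECT {fields} FROM {table} WHERE {column} = '{value}';",
    "command": "Executing: {action}",
}


def generate(meaning):
    """Instantiate the template chosen by the speech act with the meaning's slots."""
    template = TEMPLATES[meaning["act"]]
    # Plain string formatting keeps the sketch short; a real DBMS front-end
    # should use parameterised queries rather than building SQL text directly.
    return template.format(**meaning["slots"])


query_meaning = {
    "act": "question",
    "slots": {"fields": "name, salary", "table": "employees",
              "column": "role", "value": "manager"},
}
print(generate(query_meaning))
print(generate({"act": "greeting", "slots": {"user": "Mary"}}))
```

A simple generator of the pre-stored kind would skip the slots entirely and return a fixed string per speech act.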
Applications of NLP - Natural language
interfaces (NLI)
 An NLP system can be the front-end of information
systems to provide a more user-friendly interface
 Eg, the command “List details of all files in this folder
sorted by time of creation” is much friendlier than “ls -atl”,
especially using voice input
 An NLI processes sentences exchanged between a
user and an application
 Task made easier by a restricted domain of discourse,
eg, a database holds information on a single area of
application
 Interfaces for expert systems, operating systems and
document retrieval systems are also being developed
Homer: A Language-using Agent
Source: Vere, S. & Bickmore, T., A Basic Agent, Computational Intelligence, 1990, 6(4), 41-60.
Natural Language - Homer
TIM> Drop the package at the barge next Saturday at 9pm.
HOMER> OK.
TIM> Are you going to be at the pier next Saturday?
HOMER> Yes.
TIM> Why are you going to the pier?
HOMER> I’ll go to the pier in order to pick up the package.
TIM> What time are you going to pick it up?
HOMER> 8:56pm.
STEVE> Where will the package be next Saturday?
HOMER> At the barge.

STEVE> What is in front of you?
HOMER> A log.
STEVE> Do you own the log?
HOMER> No I don’t.
STEVE> The log belongs to you.
HOMER> Oh.
STEVE> Cows eat grass.
HOMER> I know.
STEVE> Do you own the log now?
HOMER> Yes I do.
Examples of commercial NL interfaces:
Intellect
Intellect (Trinzic Corp.)
 One of the most widely used natural language front-end
interfaces available for mainframes
 Designed for use with DBMS under IBM operating
systems environments
 In addition to allowing access to data in a database,
Intellect allows creation of databases using natural
language
 The built-in lexicon may be modified to fit a particular
application
Q&A (Symantec Corp.)
 A basic file manager with a natural language front-end called “The
Intelligent Assistant”
 Parses common English input questions and converts them into
queries that the file manager can understand
 Paraphrases input requests to ensure full understanding of what
user wants
 Eg, User input:
Show the total 1992 sales for the Central Region
 Q&A Intelligent Assistant’s response:
Shall I do the following?
Create a report showing the amount of sales for the central
region in 1992?
Y(es) – Continue
N(o) – Cancel request
 Symantec discontinued and then sold Q&A to a German company
called CAB GmbH.
Machine translation
Goal:
 To support translation of text from one language into
another
Applications include:
 Desktop and web-based translation services
 Spoken language translation services (eg phone-based)
Requirements:
 Understanding meaning of input sentences
 This would involve a semantic analysis of the input using
semantic knowledge
 An automatic translation system is expected to be robust
and not stop whenever it encounters an item it cannot
understand
Machine translation (cont’d)
Current approaches use a transfer grammar
 Input text  Partial analysis  1st Intermediate
representation of content (related to the source
language)
 Intermediate representation  Transformation using a
transfer grammar  2nd intermediate representation
(related to the target language)
 2nd intermediate representation  NL generator 
Text in target language
 Machine translation as performed since mid-1960s is
not true “understanding” of text
 By 1991, systems that could process sentences with
limited vocabulary started appearing
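The three-stage transfer pipeline (analysis, transfer, generation) can be sketched on a toy scale. Everything in this example is an assumption made for illustration: it handles only English noun phrases of the form determiner + adjective + noun, translates into French with a four-word vocabulary, and uses a single transfer rule that moves the adjective after the noun and picks the determiner by noun gender.

```python
# Hypothetical toy lexicons for an English -> French noun-phrase translator.
EN_TAGS = {"the": "D", "a": "D", "red": "ADJ", "yellow": "ADJ", "car": "N", "book": "N"}
NOUNS = {"car": ("voiture", "f"), "book": ("livre", "m")}
ADJECTIVES = {"red": "rouge", "yellow": "jaune"}
DETERMINERS = {("the", "m"): "le", ("the", "f"): "la", ("a", "m"): "un", ("a", "f"): "une"}


def analyse(text):
    """Partial analysis: tag each word (1st intermediate representation)."""
    return [(EN_TAGS[w], w) for w in text.lower().split()]


def transfer(source_ir):
    """Transfer grammar: map D ADJ N onto a target-side structure (2nd representation)."""
    det = next(w for t, w in source_ir if t == "D")
    noun = next(w for t, w in source_ir if t == "N")
    adjectives = [w for t, w in source_ir if t == "ADJ"]
    fr_noun, gender = NOUNS[noun]
    return {"det": DETERMINERS[(det, gender)], "noun": fr_noun,
            "adjs": [ADJECTIVES[a] for a in adjectives]}


def generate(target_ir):
    """NL generation: linearise in French order (determiner, noun, adjectives)."""
    return " ".join([target_ir["det"], target_ir["noun"], *target_ir["adjs"]])


def translate(text):
    return generate(transfer(analyse(text)))


print(translate("the red car"))
print(translate("a yellow book"))
```

Note that this sketch raises an error on any unknown word, whereas the requirement stated earlier is that a practical system stays robust when it meets an item it cannot understand.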
Current state-of-the-art of machine
translation
 Broad coverage MT systems already available on the Web
with fast turnaround time and acceptable error rate
 Higher accuracy achieved by domain-specific systems
 For example, controlled language used in Caterpillar
manuals
Machine translation products
 Bowne Global Solutions’ iTranslator
 www.itranslator.com
 Systran’s Babel Fish (used by AltaVista)
 www.systransoft.com
Current state-of-the-art of machine
translation (cont’d)
An example: Systran’s Web-based Translator
Spoken language dialogue systems
 Communicate with users via automatic speech recognition
and text-to-speech interfaces
 Mediate the user’s access to a back-end database
Examples:
 Information services: stock quotes, timetables
 Transaction services: banking, betting, flight reservations
 Current technology has been claimed to be capable of
reducing call centre costs from $75 to 18c a call
Some issues:
 Telephony-based systems cannot afford a training period
 Making a conversation too realistic falsely raises user
expectations and can confuse the system
Spoken language dialog systems
(cont’d)
More issues:
 Error handling is a significant issue
 Giving initiative to the user increases difficulty
Some relatively successful examples:
 A Sydney taxi booking service (about 30% of cases have to
go to human operators).
 Telstra directory assistance service (15-20% accuracy but
15-20% of automation may be useful enough)
 Fielded applications of spoken language dialogue systems:
 Nuance (www.nuance.com)
 ScanSoft/SpeechWorks (www.scansoft.com)
 Philips (www.speech.philips.com)
Text processing
 A number of different applications dealing with the
processing of continuous text may be grouped together
under this heading
 Editing tools
 Most common example: spelling and syntax (or grammar)
checkers
 Characterised by avoidance of deep semantic processing
 Content extraction
 Concerns extraction of specific information from texts
 Examples:
Extraction of information related to financial transactions from
a bank telex, or of bibliographic information from research
papers
Text processing (cont’d)
 Content extraction (cont’d)
 Requires deep semantic analysis which is aided by the
restricted domain and a priori knowledge of the
information to be extracted
 Commercial systems exist for electronic mail
processing, banking systems and automatic summary
generation
 Examples:
 ATRANS from Cognitive Systems
 DEAL-READER from Gecosys
Text processing (cont.)
Text summarisation
Objective:
 To produce a version of a document shorter than the
original document
 Applications of text summarisation are found in
 Information browsing
 Voice delivery of Web pages and email
 Issues concerning text summarisation
 Different kinds of summaries:
 Indicative (what is it about?) vs Informative (what is there of
interest to user?)
 Real summarisation requires real understanding
Text summarisation state-of-the-art
 Commercial systems work on a ‘sentence-extraction’ model
 Sentences regarded as ‘important’ are extracted and put
together
 Importance of sentences decided on the basis of location,
inclusion of key words, statistical information such as
frequency
 Current systems are relatively knowledge-free
 Not based on real understanding of the text
 Some text summarisation applications currently available:
 CognIT’s CORPORUM (www.cognit.com)
 INXight’s Summarizer (www.inxight.com)
 MS Word’s summarisation tool
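The sentence-extraction model on this slide can be sketched with two of the cues it mentions, word frequency and location. This is an illustrative toy, not how any of the listed products work: the stop-word list, the scoring formula and the bonus for the first sentence are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; these words are ignored when scoring.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "that", "on", "for"}


def content_words(sentence):
    """Lower-cased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarise(text, n_sentences=1, lead_bonus=0.5):
    """Extract the highest-scoring sentences and return them in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(index):
        words = content_words(sentences[index])
        if not words:
            return 0.0
        average = sum(freq[w] for w in words) / len(words)  # frequency cue
        return average + (lead_bonus if index == 0 else 0.0)  # location cue

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


text = ("Parsers build parse trees. The weather was nice. "
        "Parsers use grammar rules to build trees.")
print(summarise(text, n_sentences=2))
```

The off-topic middle sentence is dropped purely because its words are rare in the document, which illustrates why such systems are described as relatively knowledge-free.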
Search and Information Retrieval
 Ever increasing amount of information available
worldwide, particularly on the Internet
 Searching for and retrieving information relevant to a
topic of interest is an active area of research and
application.
 Document retrieval (DR)
 Also known as text retrieval
 Involves retrieving text ranging from paragraph to book
length for humans to read
 DR may involve
 searching well-maintained bibliographic databases
 scanning hard disks for missing files
 searching thousands of Web servers for natural language
articles on a topic of interest
Search and Information Retrieval
(cont’d)
 Efficacy of a DR system measured by
 Precision – proportion of retrieved documents that are relevant, and
 Recall – proportion of relevant documents that are retrieved
 Retrieval depends on indexing - indicating what documents are
about
 Indexing requires an indexing language, a term vocabulary, and a
method for constructing requests and document descriptions
 Both controlled language indexing and the more sophisticated
natural language indexing require NLP capabilities
 Compact descriptions of a document’s significance may increase
the efficiency of matching
 Increasing both recall and precision is the fundamental goal of
index languages
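The two measures defined at the top of this slide are easy to compute once the retrieved set and the relevant set are known. The sketch below uses made-up document identifiers; returning 0.0 for an empty set is a convention chosen for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents returned, 3 documents actually relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5"])
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). Returning every document would push recall to 1.0 while lowering precision, which is why both need to be raised together.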
Search and Information Retrieval
(cont’d)
Current topics of interest in search and information retrieval include:
 In a concept-based search, documents are characterised by
relevant concepts and not just key words
 For example, a search for ‘car’ should also retrieve documents on
'automobiles'
 Named entity recognition involves recognising names of people,
places, organisations etc.
 One person or organisation can be referred to by many name
variants – eg, John Howard, Mr. Howard, J.W. Howard, the PM
 Many persons or organisations can share the same name – eg,
politician John Howard, actor John Howard
Search and Information Retrieval
(cont’d)
Search and Information Retrieval State-of-the-art
 Current trend (eg Google) is to expand the search
vocabulary by using thesauri (eg, ‘car’  ‘automobile’)
 Linguistic analysis to identify phrases relevant to the initial
query
 Key phrases can be more useful than just key words
 Can be used to expand an initial user query (Khan & Khor
2004)
 Some current search and information retrieval applications:
 Ultra Find: www.ultradesign.com/untrafind/ultrafind.html
 Lotus Discovery Server:
www.lotus.com/products/discserver.nsf
 Smart text processing suites:
 Inxight: www.inxight.com
 Verity: wwwl.verity.com
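Thesaurus-based query expansion, as in the ‘car’ and ‘automobile’ example on this slide, can be sketched as below. The thesaurus entries, the document collection and the simple any-term matching are assumptions for illustration; this is not the method of any search engine named here, nor that of Khan & Khor (2004).

```python
# Hypothetical thesaurus: term -> related terms to add to the query.
THESAURUS = {"car": ["automobile"], "film": ["movie"]}


def expand_query(query):
    """Return the query terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *THESAURUS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(query, documents):
    """Return ids of documents containing at least one expanded query term."""
    terms = set(expand_query(query))
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]


docs = {"d1": "Automobile sales rose", "d2": "Train timetable"}
print(expand_query("car insurance"))
print(search("car", docs))
```

Without expansion the query "car" would miss d1 entirely, so expansion improves recall; the cost is that loosely related terms can lower precision.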
Challenges faced by NLP
 A good NLP system must be capable of handling
common linguistic problems caused by ambiguities and
the use of context
 Prepositional phrase attachment
 A sentence can often be analysed in more than one
way, producing multiple parse trees for the sentence.
 Example sentence:
 “John saw the boy in the park with a telescope”
has 3 possible parses
Without contextual knowledge, it is not known whether
John was looking through the telescope, the boy had a
telescope, or the park had a telescope in it.
Challenges faced by NLP (cont’d)
Lexical ambiguity
 When words have multiple meanings
 A classic example:
 Time flies like an arrow.
 Fruit flies like a banana.
 In the first case, “flies” is a verb and “like” is a
preposition
 In the second case, “flies” is a noun and “like”
is a verb.
Challenges faced by NLP (cont.)
Anaphoric reference or pronoun resolution
 Problem of figuring out what a pronoun refers to
 Example:
Give me the names of all managers and how much
they earn. (1)
Mary went to see Jane. She was happy to see her (2)
 In (1), easy to decide that “they” refers to the managers
already mentioned
 In (2), difficult to decide who “she” and “her” refer to
– was Mary happy to see Jane, or was Jane happy to
see Mary?
Challenges faced by NLP (cont.)
Ellipsis
 Sentences appearing to have parts missing
 Example
 John works in Personnel, Mary in Accounting.
“Mary in accounting” lacks a verb but is
understandable using context of entire
sentence
“Mary in accounting” is an elliptical form of
“Mary works in accounting”.
Challenges faced by NLP (cont.)
Quantifier scope
 Quantifiers such as “all”, “every”, “some”, and “no” can
be ambiguous
 Example:
 Every employee does not like Mr Smith
Meaning - not a single employee likes Mr Smith
or - some do and some don’t.
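The two readings of the example differ only in whether the negation sits inside or outside the scope of “every”. Written in first-order logic, with predicate names chosen here purely for illustration (E(x): x is an employee, L(x, s): x likes Mr Smith):

```latex
% Reading 1: negation inside the quantifier's scope
% "not a single employee likes Mr Smith"
\[ \forall x \,\bigl( E(x) \rightarrow \neg L(x, s) \bigr) \]

% Reading 2: negation outside the quantifier's scope
% "it is not the case that every employee likes Mr Smith"
\[ \neg \forall x \,\bigl( E(x) \rightarrow L(x, s) \bigr)
   \;\equiv\; \exists x \,\bigl( E(x) \wedge \neg L(x, s) \bigr) \]
```

Reading 2 only requires at least one employee who does not like Mr Smith, so it is compatible with the “some do and some don’t” interpretation.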
 No current NLP system can handle all of these
problems – no unrestricted NLP system yet
 Yet some such as HOMER can handle the most
common forms
REFERENCES
 Germain, E., Introducing Natural Language Processing, AI Expert,
August 1992, pp. 30-35.
 Lewis, D.D., and Jones, K.S., Natural Language Processing for
Information Retrieval, Communications of the ACM, Vol. 39, No. 1
(January 1996), pp. 92-100.
 Turban, E., Decision Support and Expert Systems, Prentice Hall,
Englewood Cliffs, New Jersey, 1995, pp. 242-257.
 Thayse, A. (ed.), From Natural Language Processing to Logic for
Expert Systems, John Wiley & Sons, 1991.
 Cole, R., Zaenen, A., & Zampolli, A. (eds), Survey of the State of the Art
in Human Language Technology, Cambridge University Press, 1998.
Available on the web: http://cslu.cse.ogi.edu/HLTsurvey/
 Dale, R., Language Technology: Applications and Techniques,
Tutorial 2004, The 8th Pacific Rim Int. Conf. on Artificial Intelligence,
Auckland, 9-13 August, 2004.
 Khan, M.S., and Khor, S., “Automatic Query Expansion for Enhanced
Web Document Retrieval”, Journal of the American Society for
Information Science and Technology, Vol. 55, No. 1, 2004, pp. 29-40.