Lecture 1b

NLP: Grand Challenges
The Ultimate Goal: for computers to use natural language (NL) as effectively as humans do.
Reading and writing text
▪ Abstracting
▪ Monitoring
▪ Extraction into Databases
Interactive Dialogue: Natural, effective access to computer systems
▪ Informal Speech Input and Output
Translation: Input and Output in Multiple Languages
The Past: Human Machine Interfaces
▪ SHRDLU (Winograd, 1969)
• A fragile demonstration of the fundamental vision
▪ PUNDIT (Hirschman, Palmer, … 1989)
• Information Extraction from Real World texts
SHRDLU: A demonstration proof
Person: PICK UP A BIG RED BLOCK.
Computer: OK. (does it)
Person: GRASP THE PYRAMID.
Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE
HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS
TALLER THAN THE ONE I AM HOLDING.
Computer: OK. (does it)
Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.
The Past: Information Extraction
PUNDIT (Palmer et al. 1987)
Sample CASREP & automatic summary
FAILURE OF ONE OF TWO SACS. UNIT HAD LOW OUTPUT AIR PRESSURE.
RESULTED IN SLOW GAS TURBINE START. TROUBLESHOOTING
REVEALED NORMAL SAC LUBE OIL PRESSURE AND TEMPERATURE.
EROSION OF IMPELLOR BLADE TIP EVIDENT. CAUSE OF EROSION OF
IMPELLOR BLADE UNDETERMINED. NEW SAC RECEIVED.
Status of Sac:
    Part: sac
    State: inoperative
Finding:
    Part: air pressure
    State: low
Finding:
    Part: lube oil pressure
    State: normal
Finding:
    Part: lube oil temperature
    State: normal
Damage:
    Part: blade tip
    State: eroded
Finding:
    Agent: ship’s force
    State: has new sac
The Past: Crucial flaws in the paradigm
These systems worked well, BUT
▪ Usually, only for a small set of examples
▪ Person-years of work to port to new applications and, often, to extend coverage on a single application
▪ Very limited and inconsistent coverage of English
An Early Robust Statistical NLP Application
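• A Statistical Model For Etymology
Church, K.W. (1985) "Stress assignment in letter to sound rules for speech synthesis", Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (University of Chicago). [text to speech; phonetics]
• Determining etymology is crucial for text-to-speech: the same letters are pronounced differently depending on a word's language of origin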
Italian        English
AldriGHetti    lauGH, siGH
IannuCCi       aCCept
ItaliAno       hAte
An Early Robust Statistical NLP Application
Name               Score   Language
Angeletti          100%    Italian
Iannucci           100%    Italian
Italiano           100%    Italian
Lombardino         58%     Italian
Asahara            100%    Japanese
Fujimaki           100%    Japanese
Umeda              96%     Japanese
Anagnostopoulos    100%    Greek
Demetriadis        100%    Greek
Dukakis            99%     Russian
Annette            75%     French
Deneuve            54%     French
Baguenard          54%     Middle French
• Etymology can be determined reasonably accurately from statistics computed over letter trigrams!
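To make the letter-trigram idea concrete, here is a minimal sketch. It is not Church's actual model: it assumes a naive Bayes classifier with a uniform prior and add-one smoothing, and the training names in `TOY_NAMES` are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training names, invented for illustration only.
TOY_NAMES = {
    "Italian": ["Rossetti", "Benedetti", "Marino", "Bellucci", "Romano"],
    "Japanese": ["Tanaka", "Nakamura", "Yamada", "Fujita", "Maeda"],
}

def letter_trigrams(name):
    """Return the letter trigrams of a name, padded with '#' boundary markers."""
    padded = "##" + name.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(names_by_language):
    """Count letter trigrams separately for each language."""
    return {
        lang: Counter(t for name in names for t in letter_trigrams(name))
        for lang, names in names_by_language.items()
    }

def classify(name, counts):
    """Return {language: posterior}, assuming a uniform prior over languages."""
    vocab = set().union(*counts.values())
    log_scores = {}
    for lang, c in counts.items():
        # Add-one smoothing, with one extra slot reserved for unseen trigrams.
        total = sum(c.values()) + len(vocab) + 1
        log_scores[lang] = sum(
            math.log((c[t] + 1) / total) for t in letter_trigrams(name)
        )
    # Normalize in log space to avoid underflow on long names.
    top = max(log_scores.values())
    unnorm = {lang: math.exp(s - top) for lang, s in log_scores.items()}
    z = sum(unnorm.values())
    return {lang: v / z for lang, v in unnorm.items()}

COUNTS = train(TOY_NAMES)
print(classify("Angeletti", COUNTS))
print(classify("Umeda", COUNTS))
```

With so little training data the posteriors are only indicative; a realistic model would be trained on thousands of names per language.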
A Central Challenge: Extracting Meaning
Text or speech → ?? Meaning Extractor ?? → Meaning
Literal vs. Implicit Meaning
▪ Cognitive beings automatically
• combine literal meaning
• with world knowledge
• to see implicit meaning
“The founder of Pakistan's nuclear program, Abdul Qadeer
Khan, has admitted he transferred nuclear technology to
Iran, Libya and North Korea, a Pakistani government official
said Monday… The transfers were made during the late
1980s and in the early and mid 1990s, and were motivated
by "personal greed and ambition," an official said.”
▪ Q: Whose greed? Q: Whose ambition?
• Understanding this involves inferring implicit meaning
▪ Recent NLP has focused on robust extraction of shallow, literal meaning
Levels of Representation
Full Semantics
Explicit Semantics
Syntax
Words
Morphology
Word Unigram Representation

The founder of Pakistan's nuclear program,
Abdul Qadeer Khan, has admitted he
transferred nuclear technology to Iran,
Libya and North Korea, a Pakistani
government official said Monday.
Khan made the confession in a written
statement submitted "a couple of days
ago" to investigators probing allegations
of nuclear proliferation by Pakistan, the
official told The Associated Press on
condition of anonymity.
The transfers were made during the late
1980s and in the early and mid 1990s, and
were motivated by "personal greed and
ambition," the official said.
The official said the transfers were not
authorized by the government.
Unigrams
Word           # in Document
Khan           15
nuclear        14
Pakistan       10
transfers      9
official       8
scientists     5
journalists    5
government     5
Libya          5
officials      4
military       4
…
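A unigram representation is just a table of word counts. The sketch below shows one way to build it, assuming a simple lowercase, punctuation-stripping tokenizer (the slide does not say how words were tokenized). It runs on the last two sentences of the excerpt only, so its counts differ from the slide's, which cover the whole article.

```python
import re
from collections import Counter

# Last two sentences of the excerpt; the slide's counts are for the full article.
EXCERPT = (
    'The transfers were made during the late 1980s and in the early and mid '
    '1990s, and were motivated by "personal greed and ambition," the official '
    'said. The official said the transfers were not authorized by the government.'
)

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def unigram_counts(text):
    """Map each word to the number of times it occurs in the text."""
    return Counter(tokenize(text))

UNIGRAMS = unigram_counts(EXCERPT)
for word, n in UNIGRAMS.most_common(5):
    print(word, n)
```

Note that the most frequent words in raw counts are function words such as "the"; the slide's table appears to list content words only, which would need a stopword filter not shown here.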
Word Bigram Representation

The founder of Pakistan's nuclear
program, Abdul Qadeer Khan, has
admitted he transferred nuclear
technology to Iran, Libya and North
Korea, a Pakistani government
official said Monday.
Khan made the confession in a
written statement submitted "a
couple of days ago" to investigators
probing allegations of nuclear
proliferation by Pakistan, the official
told The Associated Press on
condition of anonymity.
The transfers were made during the
late 1980s and in the early and mid
1990s, and were motivated by
"personal greed and ambition," the
official said.
The official said the transfers were
not authorized by the government.
Bigrams
Bigram                 # in Document
North Korea            4
nuclear transfers      3
Government official    3
Pakistan’s nuclear     3
written statement      2
told investigators     2
other suspects         2
other Muslim           2
nuclear program        2
nuclear powers         2
military officials     2
become nuclear         2
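Bigram counts can be built the same way by pairing each word with its successor. This is a minimal sketch under the same tokenization assumption as before (lowercase, punctuation stripped). It ignores sentence boundaries, and it runs on a two-sentence excerpt rather than the full article counted on the slide.

```python
import re
from collections import Counter

# Two sentences from the excerpt; the slide's counts are for the full article.
EXCERPT = (
    'The transfers were made during the late 1980s and in the early and mid '
    '1990s, and were motivated by "personal greed and ambition," the official '
    'said. The official said the transfers were not authorized by the government.'
)

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bigram_counts(text):
    """Count adjacent word pairs; pairs may span a sentence boundary."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

BIGRAMS = bigram_counts(EXCERPT)
for (first, second), n in BIGRAMS.most_common(5):
    print(first, second, n)
```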
Syntax Representation: Treebank
[Figure: Treebank parse tree with S, SBAR, VP, NP and PP nodes over the sentence "The founder of Pakistan’s nuclear department Abdul Qadeer Khan has admitted he transferred nuclear technology to Iran, Libya, and North Korea"]
TreeBank includes
• Part of speech
• Syntactic structure
1995: A breakthrough in parsing
10^6 words of Treebank Annotation
+ Machine Learning = Robust Parsers
[Figure: training sentences (e.g. "The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea") and their answers (Treebank trees) go into a Training Program, which produces Models; a Parser uses the Models to produce Trees]
• 1990 Best hand-built parsers: ~40-60% accuracy (guess)
• 1995+ Statistical parsers: ~90% accuracy
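As a rough illustration of how a training program can turn treebank trees into a model, the sketch below estimates rule probabilities by relative frequency, as in a probabilistic context-free grammar. This is an assumed, simplified stand-in, not necessarily the method behind the 1995 parsers, and the two-tree treebank is invented for illustration.

```python
from collections import Counter

# Hypothetical two-tree treebank: each node is (label, child, child, ...),
# and each preterminal node is (tag, word).
TOY_TREEBANK = [
    ("S", ("NP", ("PRP", "he")),
          ("VP", ("VBD", "transferred"), ("NP", ("NN", "technology")))),
    ("S", ("NP", ("NNP", "Khan")),
          ("VP", ("VBD", "admitted"))),
]

def rules(tree):
    """Yield (parent, children) for every node; assumes no mixed word/node children."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        yield (label, tuple(children))  # lexical rule, e.g. VBD -> admitted
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)

def estimate_rule_probs(treebank):
    """Relative-frequency estimate: count(parent -> children) / count(parent)."""
    counts = Counter(rule for tree in treebank for rule in rules(tree))
    parent_totals = Counter()
    for (parent, _), n in counts.items():
        parent_totals[parent] += n
    return {rule: n / parent_totals[rule[0]] for rule, n in counts.items()}

RULE_PROBS = estimate_rule_probs(TOY_TREEBANK)
for (parent, children), p in sorted(RULE_PROBS.items()):
    print(parent, "->", " ".join(children), round(p, 3))
```

A parser would then search for the tree whose rules have the highest combined probability; that search step is omitted here.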
Rich Linguistic Representations
+ Powerful Machine Learning
= Robust, Effective NLP
1970s, ’80s: Focus on Linguistic Representations
1990s, early 2000s: Focus on Machine Learning
Recently: New work combining the two