NLP: Grand Challenges

The Ultimate Goal: for computers to use NL as effectively as humans do…
• Reading and writing text
    • Abstracting
    • Monitoring
    • Extraction into Databases
• Interactive Dialogue: natural, effective access to computer systems
    • Informal Speech Input and Output
• Translation: Input and Output in Multiple Languages

The Past: Human Machine Interfaces

SHRDLU (Winograd, 1969)
• A fragile demonstration of the fundamental vision
PUNDIT (Hirschman, Palmer, … 1989)
• Information Extraction from Real World texts

SHRDLU: A demonstration proof

Person: PICK UP A BIG RED BLOCK.
Computer: OK. (does it)
Person: GRASP THE PYRAMID.
Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Computer: OK. (does it)
Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.

The Past: Information Extraction

PUNDIT (Palmer et al. 1987)
Sample CASREP & automatic summary

FAILURE OF ONE OF TWO SACS. UNIT HAD LOW OUTPUT AIR PRESSURE. RESULTED IN SLOW GAS TURBINE START. TROUBLESHOOTING REVEALED NORMAL SAC LUBE OIL PRESSURE AND TEMPERATURE. EROSION OF IMPELLOR BLADE TIP EVIDENT. CAUSE OF EROSION OF IMPELLOR BLADE UNDETERMINED. NEW SAC RECEIVED.
Status of Sac:
    Part: sac                     State: inoperative
Finding:
    Part: air pressure            State: low
Finding:
    Part: lube oil pressure       State: normal
Finding:
    Part: lube oil temperature    State: normal
Damage:
    Part: blade tip               State: eroded
Finding:
    Agent: ship’s force           State: has new sac

The Past: Crucial flaws in the paradigm

These systems worked well, BUT
• Usually, only for a small set of examples
• Person-years of work to port to new applications and, often, to extend coverage on a single application
• Very limited and inconsistent coverage of English

An Early Robust Statistical NLP Application

• A Statistical Model For Etymology
  Church, K.W. (1985) "Stress assignment in letter to sound rules for speech synthesis", Proceedings of the 23rd Annual Meeting (University of Chicago), [text to speech; phonetics]
• Determining etymology is crucial for text-to-speech
    Italian: AldriGHetti, IannuCCi, ItaliAno
    English: lauGH, siGH; aCCept; hAte

Angeletti          100%   Italian
Iannucci           100%   Italian
Italiano           100%   Italian
Lombardino          58%   Italian
Asahara            100%   Japanese
Fujimaki           100%   Japanese
Umeda               96%   Japanese
Anagnostopoulos    100%   Greek
Demetriadis        100%   Greek
Dukakis             99%   Russian
Annette             75%   French
Deneuve             54%   French
Baguenard           54%   Middle French

• Etymology can be determined reasonably accurately from statistics computed from letter sequences (trigrams!)

A Central Challenge: Extracting Meaning

Text or speech → ??Meaning Extractor?? → Meaning

Literal vs. Implicit Meaning

Cognitive beings automatically
• combine literal meaning
• with world knowledge
• to see implicit meaning

“The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea, a Pakistani government official said Monday… The transfers were made during the late 1980s and in the early and mid 1990s, and were motivated by "personal greed and ambition," an official said.”

Q: Whose greed?
Q: Whose ambition?
• Understanding this involves inferring implicit meaning

Recent NLP has focused on robust extraction of shallow, literal meaning

Levels of Representation
• Full Semantics
• Explicit Semantics
• Syntax
• Words
• Morphology

Word Unigram Representation

The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea, a Pakistani government official said Monday.

Khan made the confession in a written statement submitted "a couple of days ago" to investigators probing allegations of nuclear proliferation by Pakistan, the official told The Associated Press on condition of anonymity.

The transfers were made during the late 1980s and in the early and mid 1990s, and were motivated by "personal greed and ambition," the official said.

The official said the transfers were not authorized by the government.

Unigrams
Word           # in Document
Khan           15
nuclear        14
Pakistan       10
transfers       9
official        8
scientists      5
journalists     5
government      5
Libya           5
officials       4
military        4
…

Word Bigram Representation

(Same news excerpt as on the Word Unigram Representation slide.)
Bigrams
Bigram                 # in Document
North Korea            4
nuclear transfers      3
government official    3
Pakistan’s nuclear     3
written statement      2
told investigators     2
other suspects         2
other Muslim           2
nuclear program        2
nuclear powers         2
military officials     2
become nuclear         2

Syntax Representation: Treebank

Treebank includes
• Part of speech
• Syntactic structure

[Figure: Treebank parse tree (node labels S, SBAR, NP, VP, PP) over the sentence "The founder of Pakistan’s nuclear department Abdul Qadeer Khan has admitted he transferred nuclear technology to Iran, Libya, and North Korea"]

1995: A breakthrough in parsing

10^6 words of Treebank Annotation + Machine Learning = Robust Parsers

[Figure: diagram with the labels "training sentences", "answers", "Training Program", "Models", "Parser", and "Trees". Example input: "The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea". Example output: the parse tree of that sentence.]

• 1990 Best hand-built parsers: ~40-60% accuracy (guess)
• 1995+ Statistical parsers: ~90% accuracy

Rich Linguistic Representations + Powerful Machine Learning = Robust, Effective NLP

• 1970s, ’80s: Focus on Linguistic Representations
• 1990s, early 2000s: Focus on Machine Learning
• Recently: New work combining the two
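
Sketch: counting word unigrams and bigrams

The word unigram and bigram tables shown earlier can be approximated with a few lines of Python. This is a minimal sketch, not the procedure used to build the slides. The names tokenize, ngram_counts and EXCERPT are hypothetical, the tokenizer is a simplifying assumption, and EXCERPT is only the last sentence of the news excerpt, so the counts will not match the slide tables, which were computed over the whole article.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (assumed tokenization:
    runs of letters, optionally joined by an apostrophe, e.g. "pakistan's")."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens; keys are n-tuples of words."""
    shifted = (tokens[i:] for i in range(n))  # tokens, tokens[1:], ...
    return Counter(zip(*shifted))


# Assumption: a one-sentence excerpt stands in for the full article.
EXCERPT = "The official said the transfers were not authorized by the government."

tokens = tokenize(EXCERPT)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

if __name__ == "__main__":
    for gram, count in unigrams.most_common(3):
        print(" ".join(gram), count)
    for gram, count in bigrams.most_common(3):
        print(" ".join(gram), count)
```

The slide tables list no function words such as "the", which suggests some filtering of common words before ranking. That step is left out of this sketch.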