Course Introduction

SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
August 30, 2004
Today
Motivation: SIMS student projects
Course Goals
Why NLP is difficult
How to solve it? Corpus-based statistical approaches
What we’ll do in this course
ANLP Motivation:
SIMS Masters Projects
Breaking Story (2002)
Summarize trends in news feeds
Needs categories and entities assigned to all news articles
http://dream.sims.berkeley.edu/newshound/
BriefBank (2002)
System for entering legal briefs
Needs a topic category system for browsing
http://briefbank.samuelsonclinic.org/
Chronkite (2003)
Personalized RSS feeds
Needs categories and entities assigned to all web pages
Paparrazi (2004)
Analysis of blog activity
Needs categories assigned to blog content
Goals of this Course
Learn about the problems and possibilities of natural language analysis:
What are the major issues?
What are the major solutions?
– How well do they work?
– How do they work? (but to a lesser extent than CS 295-4)
At the end you should:
Agree that language is subtle and interesting!
Feel some ownership over the algorithms
Be able to assess NLP problems
– Know which solutions to apply when, and how
Be able to read papers in the field
Today
Motivation: SIMS student projects
Course Goals
Why NLP is difficult
How to solve it? Corpus-based statistical approaches
What we’ll do in this course
We’ve passed the year 2001, but we are not close to realizing the dream (or nightmare …)
Dave Bowman: “Open the pod bay doors, HAL”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”
Why is NLP difficult?
Computers are not brains
There is evidence that much of language understanding is built into the human brain
Computers do not socialize
Much of language is about communicating with people
Key problems:
Representation of meaning
Language presupposes knowledge about the world
Language only reflects the surface of meaning
Language presupposes communication between people
Hidden Structure
English plural pronunciation:
Toy + s → toyz (add z)
Book + s → books (add s)
Church + s → churchiz (add iz)
Box + s → boxiz (add iz)
Sheep + s → sheep (add nothing)
What about new words?
Bach + ’s → boxs (why not boxiz?)
Adapted from Robert Berwick's 6.863J
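As an illustration only (not from the original slides), here is a toy Python sketch of the rule pattern above; it uses spelling as a crude stand-in for the final sound and lets memorized irregular forms take precedence:

# Toy sketch of the plural-pronunciation rule: memorized irregulars win,
# otherwise pick z / s / iz based on (a spelling approximation of) the final sound.
IRREGULAR_PLURALS = {"sheep": "sheep"}           # memorized forms take precedence
SIBILANT_ENDINGS = ("s", "z", "x", "ch", "sh")   # church, box -> add "iz"
VOICELESS_ENDINGS = ("p", "t", "k", "f")         # book -> add "s"

def pluralize(word):
    if word in IRREGULAR_PLURALS:                # memorization
        return IRREGULAR_PLURALS[word]
    if word.endswith(SIBILANT_ENDINGS):          # church -> churchiz
        return word + "iz"
    if word.endswith(VOICELESS_ENDINGS):         # book -> books
        return word + "s"
    return word + "z"                            # toy -> toyz

for w in ["toy", "book", "church", "box", "sheep"]:
    print(w, "->", pluralize(w))

The Bach example shows why spelling alone is not enough: the rule really operates on sounds, not letters.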
Language subtleties
Adjective order and placement
A big black dog
A big black scary dog
A big scary dog
A scary big dog
A black big dog
Antonyms
Which sizes go together?
– Big and little
– Big and small
– Large and small
– Large and little
World Knowledge is subtle
He arrived at the lecture.
He chuckled at the lecture.
He arrived drunk.
He chuckled drunk.
He chuckled his way through the lecture.
He arrived his way through the lecture.
Adapted from Robert Berwick's 6.863J
Words are ambiguous
(have multiple meanings)
I know that.
I know that block.
I know that blocks the sun.
I know that block blocks the sun.
Adapted from Robert Berwick's 6.863J
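To make the ambiguity concrete, a part-of-speech tagger can be run over these sentences. A minimal sketch using NLTK, the toolkit used later in the course (it assumes NLTK is installed and its tokenizer and tagger models have been fetched with nltk.download()):

import nltk

# The same surface words ("that", "block", "blocks") may receive different
# part-of-speech tags depending on the sentence they appear in.
for sent in ["I know that.",
             "I know that block.",
             "I know that blocks the sun.",
             "I know that block blocks the sun."]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))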
Headline Ambiguity
Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
Adapted from Robert Berwick's 6.863J
The Role of Memorization
Children learn words quickly
As many as 9 words/day
Often need only one exposure to associate a meaning with a word
– Can make mistakes, e.g., overgeneralization
“I goed to the store.”
Exactly how they do this is still under study
The Role of Memorization
Dogs can do word association too!
Rico, a border collie in Germany
Knows the names of each of 100 toys
Can retrieve items called out to him with over 90%
accuracy.
Can also learn and remember the names of
unfamiliar toys after just one encounter, putting him
on a par with a three-year-old child.
http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
But there is too much to memorize!
establish
establishment
– the church of England as the official state church
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism
– a political philosophy that is opposed to the separation of church and state
Adapted from Robert Berwick's 6.863J
Rules and Memorization
Current thinking in psycholinguistics is that we use a
combination of rules and memorization
However, this is very controversial
Mechanism:
If there is an applicable rule, apply it
However, if there is a memorized version, that takes precedence. (Important for irregular words.)
– Artists paint “still lifes”, not “still lives”
– Past tense of think → thought, but blink → blinked
This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.
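A toy sketch (my own illustration, not from the slides) of the “memorized form takes precedence, otherwise apply the rule” mechanism, using past tense:

# Look the verb up in a table of memorized (irregular) forms first;
# fall back to the regular -ed rule only when no stored form exists.
IRREGULAR_PAST = {"think": "thought", "go": "went", "sing": "sang"}

def past_tense(verb):
    if verb in IRREGULAR_PAST:     # memorized form wins
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):         # bake -> baked
        return verb + "d"
    return verb + "ed"             # blink -> blinked

print(past_tense("think"), past_tense("blink"), past_tense("go"))
# A child who has not yet memorized "went" applies the rule and produces "goed".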
Representation of Meaning
I know that block blocks the sun.
How do we represent the meanings of “block”?
How do we represent “I know”?
How does that differ from “I know that.”?
Who is “I”?
How do we indicate that we are talking about Earth’s sun vs. some other planet’s sun?
When did this take place? What if I move the block?
What if I move my viewpoint? How do we represent this?
How to tackle these problems?
The field was stuck for quite some time.
A new approach started around 1990
Well, not really new, but the first time around, in the ’50s, they didn’t have the text, disk space, or GHz
Main idea: combine memorizing and rules
How to do it:
Get large text collections (corpora)
Compute statistics over the words in those collections (see the sketch below)
Surprisingly effective
Even better now with the Web
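A minimal sketch of “compute statistics over the words in a collection” using the course toolkit; it assumes NLTK is installed and the Brown corpus has been fetched, e.g. with nltk.download('brown'):

import nltk
from nltk.corpus import brown      # a classic one-million-word corpus shipped with NLTK

# Count how often each word form occurs across the whole collection.
freq = nltk.FreqDist(w.lower() for w in brown.words())

print(freq.most_common(10))        # the highest-frequency words (mostly function words)
print(freq["language"])            # the count for one particular word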
Corpus-based Example:
Pre-Nominal Adjective Ordering
Important for translation and generation
Examples:
big fat Greek wedding
fat Greek big wedding
Some approaches try to characterize this as semantic
rules, e.g.:
Age < color, value < dimension
Data-intensive approaches
Assume adjective ordering is independent of the noun being modified
Compare how often you see {a, b} vs. {b, a}
Keller & Lapata, “The Web as a Baseline”, HLT-NAACL ’04
Corpus-based Example:
Pre-Nominal Adjective Ordering
Data-intensive approaches
Compare how often you see {a, b} vs. {b, a}
What happens when you encounter an unseen pair?
– Shaw and Hatzivassiloglou ’99 use transitive closures
– Malouf ’00 uses a back-off bigram model: P(<a,b> | {a,b}) vs. P(<b,a> | {a,b})
  He also uses morphological analysis, semantic similarity calculations, and positional probabilities
Keller and Lapata ’04 use just the very simple algorithm (see the sketch below)
– But they use the web as their training set
– Gets 90% accuracy on 1000 sequences
– As good as or better than the complex algorithms
Keller & Lapata, “The Web as a Baseline”, HLT-NAACL ’04
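As a rough illustration of the count-based idea (not the authors’ actual code), the sketch below prefers whichever order of an adjective pair has been seen more often; the counts here are made up, whereas Keller and Lapata obtained theirs from web search-engine hits:

# Hypothetical counts for the two possible orders of an adjective pair.
EXAMPLE_COUNTS = {
    ("big", "fat"): 120000,
    ("fat", "big"): 3000,
}

def get_count(a, b):
    """Placeholder lookup: how often adjective a was seen immediately before b."""
    return EXAMPLE_COUNTS.get((a, b), 0)

def preferred_order(a, b):
    """Pick the order of the unordered pair {a, b} that occurs more often."""
    return (a, b) if get_count(a, b) >= get_count(b, a) else (b, a)

print(preferred_order("fat", "big"))   # ('big', 'fat') with the counts above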
Real-World Applications of NLP
Spelling Suggestions/Corrections
Grammar Checking
Synonym Generation
Information Extraction
Text Categorization
Automated Customer Service
Speech Recognition (limited)
Machine Translation
In the (near?) future:
Question Answering
Improving Web Search Engine results
Automated Metadata Assignment
Online Dialogs
Adapted from Robert Berwick's 6.863J
NLP in the Real World
Synonym generation for
Suggesting advertising keywords
Suggesting search result refinement and expansion
Synonym Generation (examples)
What We’ll Do in this Course
Read research papers and tutorials
Use NLTK (the Natural Language Toolkit) to try out various algorithms
Some homework assignments will consist of NLTK exercises
Three mini-projects
Two involve a selected collection
The third is your choice; it can also be on the selected collection
What We’ll Do in this Course
Adopt a large text collection
Use a wide range of NLP techniques to process it
Release the results for others to use
Which Text Collection?
How to analyze a big collection?
Your ideas go here
Python
A terrific language
Interpreted
Object-oriented
Easy to interface to other things (web, DBMS, Tk)
Good stuff from Java, Lisp, Tcl, and Perl
Easy to learn
– I learned it this summer by reading Learning Python
FUN!
Questions?