Natural Language Processing
Why “natural language”?
• Natural vs. artificial
  • Not precise, ambiguous, wide range of expression
• Language vs. English
  • English, French, Japanese, Spanish
• Natural language processing = programs and theories aimed at understanding a problem or question posed in natural language and answering it
Approaches
• System building
  • Interactive
  • Understanding only
  • Generation only
• Theoretical
  • Draws on linguistics, psychology, philosophy


• Building an NL system is hard
• Unlikely to be possible without solid theoretical underpinnings
Natural language is useful
• Question-answering systems: http://tangra.si.umich.edu/clair/NSIR/NSIR.cgi
• Mixed initiative systems: http://www.cs.columbia.edu/~noemie/match.mpg
• Systems that write/speak: http://www-2.cs.cmu.edu/~awb/synthesizers.html, MAGIC
• Information extraction: http://nlp.cs.nyu.edu/info-extr/biomedicalsnapshot.jpg
• Machine translation: http://world.altavista.com/babelfish
Topics
• Syntax
• Semantics
• Pragmatics
• Statistical NLP: combining learning and NL processing
Goal of Interpretation
• Identify sentence meaning
• Do something with that meaning
• Need some representation of action/meaning
Analysis of form: Syntax
• Which parts were damaged by larger machines?
• Which parts damaged larger machines?
• Which larger machines damaged parts?
• Approaches:
  • Statistical part-of-speech tagging
  • Parsing using a grammar
  • Shallow parsing: identify meaningful chunks
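Added illustration (not from the slides): a minimal sketch of statistical part-of-speech tagging on the ambiguous sentences above, using NLTK's off-the-shelf tagger; it assumes nltk and its tagger model are installed.

```python
import nltk

# One-time model downloads (resource names can differ slightly across NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

for sentence in ["Which parts were damaged by larger machines?",
                 "Which parts damaged larger machines?"]:
    tokens = nltk.word_tokenize(sentence)
    # pos_tag returns (token, Penn Treebank tag) pairs; the passive reading should
    # surface as a past participle (VBN) and the active one as a past-tense verb (VBD)
    # for "damaged".
    print(nltk.pos_tag(tokens))
```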
Which parts were damaged by larger machines?
[S(Q) [NP(Q) [Det(Q) which] [N parts]] [VP [V(past) damage] [NP [ADJ larger] [N machines]]]]

Which parts were damaged by machines? – with functional roles
[S(Q) [NP(Q, OBJ) [Det(Q) which] [N parts]] [VP [V(past) damage] [NP(SUBJ) [ADJ larger] [N machines]]]]

Which parts damaged machines? – with functional roles
[S(Q) [NP(Q, SUBJ) [Det(Q) which] [N parts]] [VP [V(past) damage] [NP(OBJ) [ADJ larger] [N machines]]]]
Parsers
• Grammar
  • S -> NP VP
  • NP -> DET {ADJ*} N
• Different types of grammars
  • Context Free vs. Context Sensitive
  • Lexical Functional Grammar vs. Tree Adjoining Grammars
• Different ways of acquiring grammars
  • Hand-encoded vs. machine learned
  • Domain independent (TreeBank, Wall Street Journal)
  • Domain dependent (Medical texts)
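Added illustration (not from the slides): a toy grammar in the style of the rules above, run with NLTK's chart parser; the rules and lexicon are illustrative only.

```python
import nltk

# Illustrative grammar in the style of S -> NP VP, NP -> DET {ADJ*} N
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N | Det ADJ N | ADJ N | N
VP  -> V NP
Det -> 'which'
ADJ -> 'larger'
N   -> 'parts' | 'machines'
V   -> 'damaged'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("which parts damaged larger machines".split()):
    tree.pretty_print()   # draws the constituent structure, like the trees sketched above
```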
Semantics: analysis of meaning
• Word meaning
  • John picked up a bad cold.
  • John picked up a large rock.
  • John picked up Radio Netherlands on his radio.
  • John picked up a hitchhiker on Highway 66.
• Phrasal meaning
  • Baby bonuses -> allocations
  • Senior citizens -> personnes âgées
  • Causing havoc -> sème le désarroi
• Approaches
  • Representing meaning
  • Statistical word disambiguation
  • Symbolic rule-based vs. shallow statistical semantics
Representing Meaning - WordNet
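Added illustration (not from the slides): WordNet's sense inventory can be queried directly; a minimal sketch using NLTK's WordNet corpus reader, assuming the wordnet data has been downloaded.

```python
from nltk.corpus import wordnet as wn
# One-time download: nltk.download('wordnet')

# List the senses (synsets) WordNet records for "bank", with their glosses
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())
```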
OMEGA
• http://omega.isi.edu:8007/index
• http://omega.is.edu/doc/browsers.html
Statistical Word Sense Disambiguation
Context within the sentence determines which sense is correct:
• The candidate picked up [sense6] thousands of additional votes.
• He picked up [sense2] the book and started to read.
• Her performance in school picked up [sense13].
• The swimmers got out of the river and climbed the bank [sloping land] to retrieve their towels.
• The investors took their money out of the bank [financial institution] and moved it into stocks and bonds.
Goal
• A program which can predict the correct sense given a new sentence containing “pick up” or “bank”
• Avoid manually itemizing all words which can occur in sentences with different meanings
• Can we use machine learning?
What do we need?
• Data
• Features
• Machine learning algorithm
  • Decision tree vs. SVM/Naïve Bayes
• Inspecting the output
• Accuracy of these methods
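Added illustration (not from the slides): a sketch of how those pieces fit together, with bag-of-words features and a Naïve Bayes learner in scikit-learn; the two training sentences and sense labels are toy stand-ins, not the data used here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data: context sentences containing "bank", each labelled with a sense
train_sentences = [
    "the swimmers climbed the bank of the river to retrieve their towels",
    "the investors took their money out of the bank and moved it into stocks",
]
train_senses = ["sloping_land", "financial_institution"]

# Features: bag of words from the context; learner: Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_sentences, train_senses)

# Predict the sense of "bank" in an unseen context
print(model.predict(["the investors moved more money into the bank"]))
```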
Using Categories from Roget’s Thesaurus (e.g., machine vs. animal) for training

Training data for “machines”
Predicting the correct sense in unseen text
• Use presence of the salient words in context
• 50-word window
• Use Bayes' rule to compute probabilities for the different categories
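Added illustration (not from the slides): a sketch of that Bayes'-rule computation over the words in the context window, with add-one smoothing in log space; the training contexts below are toy examples, not the Grolier's data.

```python
import math
from collections import Counter

def train(labeled_contexts):
    """labeled_contexts: (category, context_words) pairs, where context_words
    is the +/- 50-word window around the ambiguous word."""
    priors, word_counts, totals, vocab = Counter(), {}, Counter(), set()
    for category, words in labeled_contexts:
        priors[category] += 1
        word_counts.setdefault(category, Counter()).update(words)
        totals[category] += len(words)
        vocab.update(words)
    return priors, word_counts, totals, vocab

def predict(context_words, priors, word_counts, totals, vocab):
    """Bayes' rule: pick the category maximizing P(category) * prod P(word | category),
    with add-one smoothing, computed in log space."""
    n = sum(priors.values())
    best, best_score = None, float('-inf')
    for category in priors:
        score = math.log(priors[category] / n)
        for word in context_words:
            score += math.log((word_counts[category][word] + 1) /
                              (totals[category] + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Toy training contexts for "crane"
model = train([
    ("machine", "treadmills attached to cranes were used to lift heavy objects".split()),
    ("animal",  "the whooping crane winters in coastal marshes and eats fish".split()),
])
print(predict("cranes used to lift water and grind grain".split(), *model))  # -> machine
```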
“Crane”
• Occurred 74 times in Grolier's, 36 as animal, 38 as machine
• Predictions in new sentences were 99% correct
• Example: lift water and to grind grain .PP Treadmills attached to cranes were used to lift heavy objects from Roman times.
Going Home – A play in one act
Scene 1: Pennsylvania Station, NYC
  Bonnie: Long Beach?
  Passerby: Downstairs, LIRR Station
Scene 2: Ticket counter, LIRR
  Bonnie: Long Beach?
  Clerk: $4.50
Scene 3: Information Booth, LIRR
  Bonnie: Long Beach?
  Clerk: 4:19, Track 17
Scene 4: On the train, vicinity of Forest Hills
  Bonnie: Long Beach?
  Conductor: Change at Jamaica
Scene 5: On the next train, vicinity of Lynbrook
  Bonnie: Long Beach?
  Conductor: Right after Island Park.
Question Answering on the web
• Input: English question
• Data: documents retrieved by a search engine from the web
• Output: the phrase(s) within the documents that answer the question
Examples
• When was X born?
  • When was Mozart born?
    Mozart was born in 1756.
  • When was Gandhi born?
    Gandhi (1869-1948)
• Where are the Rocky Mountains located?
• What is nepotism?
Common Approach
• Create a query from the question
  • When was Mozart born -> Mozart born
• Use WordNet to expand terms and increase recall:
  • Which high school was ranked highest in the US in 1998?
  • “high school” -> (high&school)|(senior&high&school)|(senior&high)|high|highschool
• Use search engine to find relevant documents
• Pinpoint passage within document that has answer using patterns
• From IR to NLP
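Added illustration (not from the slides): a sketch of the expansion and pinpointing steps, assuming NLTK's WordNet interface; the surface patterns and the helper names expand_terms and find_birth_year are illustrative, not the actual system.

```python
import re
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def expand_terms(terms):
    """Add WordNet synonyms for each query term to increase recall."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            expanded.update(name.replace('_', ' ') for name in synset.lemma_names())
    return expanded

def find_birth_year(person, passage):
    """Pinpoint the answer with surface patterns such as
    'X was born in YYYY' or 'X (YYYY-...)'."""
    patterns = [
        rf"{re.escape(person)} was born (?:on [^.]*? )?in (\d{{4}})",
        rf"{re.escape(person)} \((\d{{4}})",
    ]
    for pattern in patterns:
        match = re.search(pattern, passage)
        if match:
            return match.group(1)
    return None

print(expand_terms(["high_school"]))   # WordNet lemmas use underscores; adds "senior high school", "highschool", ...
print(find_birth_year("Mozart", "Mozart was born in 1756."))                                  # -> 1756
print(find_birth_year("Gandhi", "Gandhi (1869-1948) led the Indian independence movement."))  # -> 1869
```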
PRODUCE A BIOGRAPHY OF [PERON].
Only these fields are Relevant:
1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members:
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics:
16. Passport number and country of issue:
17. Professional positions:
18. Education:
19. Party or other organization affiliations:
20. Publications (titles and dates):
Biography of Han Ming
• Han Ming, born 1944 March in Pyongyan, South Korean Lei Fa Women’s University in French law, literature, a former female South Korean people, chairman of South Korea women’s groups,… Han, 62, has championed women’s rights and liberal political ideas. Han was imprisoned from 1979 to 1981 on charges of teaching pro-Communist ideas to workers, farmers and low-income women. She became the first minister of gender equality in 2001 and later served as an environment minister.
Biography – two approaches
• To obtain high precision, we handle each slot independently, using bootstrapping to learn IE patterns.
• To improve recall, we utilize a biography language model.
Approach
• Characteristics of the IE approach
  • Training resource: Wikipedia and its manual annotations
  • Bootstrapping interleaves two corpora to improve precision
    • Wikipedia: reliable but small
    • Web: noisy but many relevant documents
  • No manual annotation or automatic tagging of the corpus
  • Use seed tuples (person, date-of-birth) to find patterns
  • This approach is scalable to any corpus
    • Irrespective of size
    • Irrespective of whether it is static or dynamic
• The IE system is augmented with language models to increase recall
Biography as an IE task
• We need patterns to extract information from a sentence
• Creating patterns manually is a time-consuming task, and not scalable
• We want to find these patterns automatically
Biography patterns from Wikipedia
• Martin Luther King, Jr., (January 15, 1929 – April 4, 1968) was the most …
• Martin Luther King, Jr., was born on January 15, 1929, in Atlanta, Georgia.
Run IdFinder on these sentences
• <Person> Martin Luther King, Jr. </Person>, (<Date>January 15, 1929</Date> – <Date>April 4, 1968</Date>) was the most…
• <Person> Martin Luther King, Jr. </Person>, was born on <Date> January 15, 1929 </Date>, in <GPE> Atlanta, Georgia </GPE>.
• Take the token sequence that includes the tags of interest + some context (2 tokens before and 2 tokens after)
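Added illustration (not from the slides): a sketch of that step, collapsing each tagged span to its tag and keeping the tags plus two tokens of context on each side; the tag format and whitespace tokenization are simplifications of IdFinder's real output.

```python
import re

def make_pattern(tagged_sentence, context=2):
    """Collapse each <Tag> ... </Tag> span to a single placeholder token, then keep
    the token span covering all tags plus `context` tokens on each side."""
    collapsed = re.sub(r"<(\w+)>.*?</\1>", lambda m: f"<{m.group(1)}>", tagged_sentence)
    tokens = collapsed.split()
    tag_positions = [i for i, tok in enumerate(tokens) if tok.startswith('<')]
    start = max(0, min(tag_positions) - context)
    end = min(len(tokens), max(tag_positions) + context + 1)
    return ' '.join(tokens[start:end])

sentence = ("<Person> Martin Luther King, Jr. </Person> , was born on "
            "<Date> January 15, 1929 </Date> , in <GPE> Atlanta, Georgia </GPE> .")
print(make_pattern(sentence))   # -> "<Person> , was born on <Date> , in <GPE> ."
```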
Convert to Patterns:
• <My_Person> (<My_Date> – <Date>) was the
• <My_Person> , was born on <My_Date>, in
• Remove more specific patterns – if one pattern contains another, take the smallest with > k tokens:
  • <My_Person> , was born on <My_Date>
  • <My_Person> (<My_Date> – <Date>)
• Finally, verify the patterns manually to remove irrelevant patterns.
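Added illustration (not from the slides): a sketch of the "drop the more specific patterns" step; the substring containment test and the threshold k are assumptions.

```python
def generalize(patterns, k=4):
    """Keep the most general patterns: drop any pattern that contains a shorter
    kept pattern, and ignore patterns with fewer than k tokens."""
    kept = []
    for pattern in sorted(patterns, key=lambda p: len(p.split())):
        if len(pattern.split()) < k:
            continue
        if not any(shorter in pattern for shorter in kept):
            kept.append(pattern)
    return kept

patterns = [
    "<My_Person> , was born on <My_Date> , in",
    "<My_Person> , was born on <My_Date>",
    "<My_Person> ( <My_Date> - <Date> ) was the",
    "<My_Person> ( <My_Date> - <Date> )",
]
print(generalize(patterns))   # keeps only the two shorter, more general patterns
```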
Examples of Patterns:
• 502 distinct place-of-birth patterns:
  600  <MY_Person> was born in <MY_GPE>
  169  <MY_Person> ( born <Date> in <MY_GPE> )
  44   Born in <MY_GPE> <MY_Person>
  10   <MY_Person> was a native <MY_GPE>
  10   <MY_Person> 's hometown of <MY_GPE>
  1    <MY_Person> was baptized in <MY_GPE>
  …
• 291 distinct date-of-death patterns:
  770  <MY_Person> ( <Date> - <MY_Date> )
  92   <MY_Person> died on <MY_Date>
  19   <MY_Person> <Date> - <MY_Date>
  16   <MY_Person> died in <GPE> on <MY_Date>
  3    <MY_Person> passed away on <MY_Date>
  1    <MY_Person> committed suicide on <MY_Date>
  …
Biography as an IE task
• This approach is good for the consistently annotated fields in Wikipedia: place of birth, date of birth, place of death, date of death
• Not all fields of interest are annotated; a different approach is needed to cover the rest of the slots
Bouncing between Wikipedia and Google
• Use one seed tuple only: <my person> and <target field>
  • Google: “Arafat” “civil engineering”, we get:
    • Arafat graduated with a bachelor’s degree in civil engineering
    • Arafat studied civil engineering
    • Arafat, a civil engineering student
    • …
• Using these snippets, corresponding patterns are created, then filtered out manually
• To get more seed pairs, go to Wikipedia biography pages only and search for “graduated with a bachelor’s degree in”. We get:
Bouncing between Wikipedia and Google
• New seed tuples:
  • “Burnie Thompson” “political science“
  • “Henrey Luke” “Environment Studies”
  • “Erin Crocker” “industrial and management engineering”
  • “Denise Bode” “political science”
  • …
• Go back to Google and repeat the process to get more seed patterns!
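Added illustration (not from the slides): the bounce between the two sources summarized as a loop; search_google, search_wikipedia_bios, and snippets_to_patterns are hypothetical stand-ins for components not shown in the slides.

```python
def bootstrap(seed_tuples, search_google, search_wikipedia_bios,
              snippets_to_patterns, iterations=3):
    """Alternate between the web (noisy but large) and Wikipedia biography pages
    (reliable but small) to grow extraction patterns for one biography field."""
    patterns, seeds = set(), set(seed_tuples)      # e.g. {("Arafat", "civil engineering")}
    for _ in range(iterations):
        # 1. Query the web with each (person, field value) seed pair
        snippets = [snippet
                    for person, value in seeds
                    for snippet in search_google(f'"{person}" "{value}"')]
        # 2. Turn the snippets into candidate patterns (filtered manually in the slides)
        patterns |= set(snippets_to_patterns(snippets))
        # 3. Apply the patterns to Wikipedia biography pages to harvest new seed pairs
        for pattern in patterns:
            seeds |= set(search_wikipedia_bios(pattern))
    return patterns, seeds
```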
Bouncing between Wikipedia and Google
• This approach worked well for a few fields, such as: education, publications, immediate family members, and party or other organization affiliations
• It did not provide good patterns for every field (such as religion, ethnic or tribal affiliations, and previous domiciles); we got a lot of noise
• For some slots, we created some patterns manually
Biography as Sentence Selection and Ranking
• To obtain high recall, we also want to include sentences that IE may miss, perhaps due to ill-formed sentences (ASR and MT)
• Get the top 100 documents from Indri
• Extract all sentences that contain the person or a reference to him/her
• Use a variety of features to rank these sentences…
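Added illustration (not from the slides): a sketch of the selection and ranking step; the features used below (biography-flavored verbs, presence of a year, a length penalty) are illustrative placeholders for the "variety of features" mentioned above.

```python
import re

def select_and_rank(sentences, person, aliases=()):
    """Keep sentences that mention the person (or a listed alias) and order them
    by a handful of simple features."""
    names = [person, *aliases]
    candidates = [s for s in sentences if any(n.lower() in s.lower() for n in names)]

    def score(sentence):
        s = 0.0
        if re.search(r"\b(born|died|graduated|married|served|elected)\b", sentence, re.I):
            s += 2.0                                    # biography-flavored verbs
        if re.search(r"\b(18|19|20)\d{2}\b", sentence):
            s += 1.0                                    # mentions a year
        s -= 0.01 * max(0, len(sentence.split()) - 40)  # penalize very long sentences
        return s

    return sorted(candidates, key=score, reverse=True)

print(select_and_rank(
    ["Han was imprisoned from 1979 to 1981.",
     "She became the first minister of gender equality in 2001.",
     "The weather in Seoul was mild."],
    person="Han Ming", aliases=("Han", "She")))
```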