Natural Language Processing
Session in the section “Web search” of the course “Web Mining”
at the Ecole Nationale Supérieure des Télécommunications, in Paris/France, in fall 2010
by Fabian M. Suchanek
This document is available under a Creative Commons Attribution Non-Commercial License
Organisation
• 2h class on Natural Language Processing
• Small homework given at the end, to be handed in for the next session (on paper or by email)
• Web site: http://suchanek.name -> Teaching
• Class will be held in English, questions and answers in French are OK
Natural Language
... appears everywhere on the Internet
• E-mail: “Inbox: 3726 unread messages”
(250 billion mails/day ≈ number of stars in our galaxy; 80% spam)
• Blogs
(100 million blogs)
• Chat: “me: Would you like to have dinner with me tonight? Cindy: no.”
(1 billion chat msg/day on Facebook; 1 billion = 10^9 ≈ distance Chicago-Tokyo in cm)
• Web sites
(1 trillion Web sites; 1 trillion = 10^12 ≈ number of cells in the human body)
Natural Language: Tasks
• Automatic text summarization
Let me first say how proud I am to be the president of this country. Yet, in the past years, our country... [1 hour speech follows]
-> Summary: Taxes will increase by 20%.
• Machine translation
librairie -> book store
• Information Extraction
Elvis Presley lives on the moon. -> lives(ElvisPresley, moon)
• Natural language understanding
Close the file! Clean up the kitchen!
• Natural language generation
Dear user, I have cleaned up the kitchen for you.
• Text Correction
My hardly loved mother-in law -> My heartily loved mother-in-law
• Question answering
Where is Elvis? -> On the moon
Fields of Linguistics
“I can resist anything but temptation” [Oscar Wilde]
• Phonology, the study of pronunciation: /ai kεn rəzist εniθiŋ.../
• Morphology, the study of word constituents: temptations / temptation
• Syntax, the study of grammar: Sentence = Noun phrase + Verbal phrase
• Semantics, the study of meaning: “I” = [image]
Phonology
• Spelling and pronunciation do not always coincide
Spelling            Pronunciation
French “eaux”       /o/
English “women”     /wɪmən/
• The International Phonetic Alphabet (IPA) binds phonetic symbols to exact mouth positions and thus sounds
-> We can describe the sounds of language precisely
Phonology
• The same spelling can be pronounced in different ways
(such words are called heteronyms)
I read a book every day.   / ... ri:d .../
I read a book yesterday.   / ... rɛd .../
...ough: ought, plough, cough, tough, though, through
• The same pronunciation can be spelled in different ways
(such words are called homophones)
site / sight / cite
• Conclusion: It is hard to wreck a nice beach
(= It is hard to recognize speech)
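The two definitions above can be sketched in a few lines of Python. This is only an illustration: the pronunciation table is a hand-made assumption (a real system would use a pronunciation dictionary), and the function names are hypothetical.

```python
from collections import defaultdict

# Toy pronunciation table (assumed for illustration): spelling -> IPA pronunciations.
PRONUNCIATIONS = {
    "site": ["saɪt"],
    "sight": ["saɪt"],
    "cite": ["saɪt"],
    "read": ["riːd", "rɛd"],
    "book": ["bʊk"],
}

def heteronyms(table):
    """Spellings that have more than one pronunciation."""
    return {word for word, sounds in table.items() if len(sounds) > 1}

def homophones(table):
    """Groups of different spellings that share one pronunciation."""
    by_sound = defaultdict(set)
    for word, sounds in table.items():
        for sound in sounds:
            by_sound[sound].add(word)
    return {sound: words for sound, words in by_sound.items() if len(words) > 1}

print(heteronyms(PRONUNCIATIONS))
print(homophones(PRONUNCIATIONS))
```

A speech recognizer faces the homophone direction of this table: one sound sequence, several candidate spellings.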
Morphology: Morphemes
• Words can consist of constituents that carry meaning
(morphemes)
un-break-able
(“un-” is a prefix, “-able” is a suffix; prefixes and suffixes are affixes)
• “un” is a morpheme that indicates negation
• “break” is the root morpheme
• “able” is a morpheme that indicates being capable of something
• Words can be assembled from other words
houseboat = house + boat
Donaudampfschifffahrtskapitaen
= Donau + Dampf + Schiff + Fahrt + Kapitaen
[captain of a steam ship ride on the Danube]
Druckerzeugnisse
= Druck-Erzeugnisse [printed products] ?
= Drucker-Zeugnisse [printer certificates] ?
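The ambiguity of compound words can be made concrete with a small sketch that enumerates every way to cut a word into known parts. The lexicon below is a toy assumption, and `split_compound` is a hypothetical helper, not a standard library function.

```python
def split_compound(word, lexicon):
    """Return every way to write `word` as a concatenation of lexicon entries."""
    if not word:
        return [[]]  # one way to split the empty string: no parts
    splits = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            # Keep this head and split the remainder recursively.
            for rest in split_compound(word[i:], lexicon):
                splits.append([head] + rest)
    return splits

# Toy lexicon (assumed for illustration), lower-cased.
LEXICON = {"druck", "drucker", "erzeugnisse", "zeugnisse", "house", "boat"}

for parts in split_compound("druckerzeugnisse", LEXICON):
    print(" + ".join(parts))
```

With this lexicon the German word has two segmentations, so morphology alone cannot decide which one is meant.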
Morphology: Non-Triviality
• Morphemes do not always simply add up
happy + ness ≠ happyness (but: happiness)
dish + s ≠ dishs (but: dishes)
un + correct ✗ (but: in + correct = incorrect)
• Example: plural and singular
boy + s -> boys (easy)
city + s -> cities
atlas -> atlases, bacterium -> bacteria, automaton -> automata,
alumnus -> alumni, axis -> axes, man -> men, mouse -> mice,
person -> people, physics -> (no pl), mechanic -> mechanics
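A minimal sketch of how such plural formation might be coded: a few spelling rules plus an exception table for irregular forms. The rule set and the `pluralize` name are assumptions for illustration; real English needs many more rules and exceptions.

```python
import re

# Irregular plurals must simply be listed (taken from the examples above).
IRREGULAR = {
    "bacterium": "bacteria",
    "automaton": "automata",
    "alumnus": "alumni",
    "axis": "axes",
    "man": "men",
    "mouse": "mice",
    "person": "people",
}

def pluralize(noun):
    """Form the plural with an exception table and three spelling rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if re.search(r"[^aeiou]y$", noun):       # city -> cities
        return noun[:-1] + "ies"
    if re.search(r"(s|x|z|ch|sh)$", noun):   # dish -> dishes
        return noun + "es"
    return noun + "s"                        # boy -> boys

for noun in ["boy", "city", "dish", "mouse"]:
    print(noun, "->", pluralize(noun))
```

The exception table is what makes the problem non-trivial: no rule predicts “mice” from “mouse”.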
Morphology: Stemming
Stemming: The process of mapping different related words onto one single word form.
bike, biking, bikes, racebike -> BIKE
Stemming is useful to allow search engines to find related words:
Without stemming:
User: “biking” -> word does not appear in “This Web page tells you everything about bikes. ...”
With stemming:
User: “biking” -> BIKE -> word appears in “THIS WEB PAGE TELL YOU EVERYTHING ABOUT BIKE. ...”
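The search scenario above can be sketched as follows. The stem table is a tiny hand-made dictionary (an assumption for illustration), and `tokens` and `matches` are hypothetical helper names.

```python
import re

# Toy stem dictionary (assumed): word form -> stem. Unknown words map to themselves.
STEMS = {"bike": "bike", "biking": "bike", "bikes": "bike",
         "racebike": "bike", "tells": "tell"}

def tokens(text, use_stemming):
    """Lower-cased word set of `text`, optionally mapped to stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return {STEMS.get(w, w) if use_stemming else w for w in words}

def matches(query, page, use_stemming):
    """True if every query word occurs in the page."""
    return tokens(query, use_stemming) <= tokens(page, use_stemming)

PAGE = "This Web page tells you everything about bikes."

print(matches("biking", PAGE, use_stemming=False))
print(matches("biking", PAGE, use_stemming=True))
```

The important design point is that the query and the page are normalized with the same function; stemming only one side would not help.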
Morphology: Stemming
Stemming can be done at different levels of aggressiveness:
• Just mapping plural forms to singular
words -> word
Before: Stemming is the process of mapping different related words onto one word form.
After: Stemming is the process of mapping different related word onto one word form.
Still not trivial:
universities -> university
emus -> emu, but genus -> genus
mechanics -> mechanic (the guy) or mechanics (the science)
Morphology: Stemming
Stemming can be done at different levels of aggressiveness:
• Just mapping plural forms to singular
• Reduction to the lemma, i.e., the non-inflected form
mapping -> map, stemming -> stem, is -> be, related -> relate
Before: Stemming is the process of mapping different related words onto one word form.
After: Stem be the process of map different relate word onto one word form.
Still not trivial:
interrupted, interrupts, interrupt -> interrupt
ran -> run
Morphology: Stemming
Stemming can be done at different levels of aggressiveness:
• Just mapping plural forms to singular
• Reduction to the lemma, i.e., the non-inflected form
• Reduction to the stem, i.e., the common core of all related words
different -> differ (because of “to differ”)
Before: Stemming is the process of mapping different related words onto one word form.
After: Stem be the process of map differ relate word onto one word form.
May be too strong:
interrupt, rupture, disrupt -> rupt
Morphology: Stemming
Stemming can be done at different levels of aggressiveness:
• Just mapping plural forms to singular
• Reduction to the lemma, i.e., the non-inflected form
• Reduction to the stem, i.e., the common core of all related words
Common stemming techniques:
• rule-based (e.g., the Porter Stemmer for reduction to the stem)
• dictionary-based
• stochastic
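To give a feeling for the rule-based family, here is a deliberately naive suffix stripper. It is not the Porter Stemmer, only a sketch with four invented rules (the rule list and the minimum stem length of 3 are assumptions).

```python
# Ordered suffix rules (assumed for illustration): (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Apply the first matching suffix rule that leaves at least 3 letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["universities", "interrupted", "interrupts", "emus", "genus"]:
    print(w, "->", stem(w))
```

The last example shows the limits of pure rules: the sketch cuts the final “s” of “genus” although it is not a plural, which is exactly why dictionary-based and stochastic techniques exist.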
Semantics: Meaning of Words
• A word can have multiple meanings
(such a word is called a homonym)
bow
• A meaning can be expressed by multiple words
(such words are called synonyms)
author / writer
Semantics: Taxonomy of Words
• A word is a hypernym of another word, if its meaning is more general than that of the other word.
The other word is called the hyponym.
• This allows us to arrange words in a hierarchy (called “taxonomy”)
entity
  person
    singer    author
      singer-songwriter
  object
    weapon
      bow
(This is not clean because it mixes up words and concepts)
“Every bow is a weapon.”
“Every weapon is an object.”
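A taxonomy like this one can be stored as a map from each word to its direct hypernyms. The sketch below assumes the tree shape suggested by the slide (in particular that “singer-songwriter” sits under both “singer” and “author”); `is_a` is a hypothetical helper.

```python
# Direct hypernyms of each word (tree shape assumed from the slide).
HYPERNYMS = {
    "person": ["entity"],
    "object": ["entity"],
    "singer": ["person"],
    "author": ["person"],
    "singer-songwriter": ["singer", "author"],
    "weapon": ["object"],
    "bow": ["weapon"],
}

def is_a(word, ancestor):
    """True if `ancestor` equals `word` or is a (transitive) hypernym of it."""
    if word == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in HYPERNYMS.get(word, []))

print(is_a("bow", "weapon"))   # “Every bow is a weapon.”
print(is_a("bow", "object"))   # follows by transitivity
print(is_a("bow", "person"))
```

Following the hypernym links upward is what turns the two quoted sentences into the conclusion that every bow is an object.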
Semantics: WordNet
• WordNet is a well-known semantic lexicon,
http://wordnet.princeton.edu
A concept is identified by a “synset”, i.e., a set of synonymous words
{entity}
  {person, human}
    {singer}    {author, writer}
      {singer-songwriter}
  {object, thing}
    {weapon}
      {bow}
    {decoration}
      {bow, bow tie}
Homonyms belong to multiple synsets.
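The synset idea can be sketched with plain Python sets. This does not use the real WordNet data or API; the synsets are copied from the slide and the helper names are assumptions.

```python
# Synsets from the slide, each a set of synonymous words.
SYNSETS = [
    {"entity"}, {"person", "human"}, {"singer"}, {"author", "writer"},
    {"singer-songwriter"}, {"object", "thing"}, {"weapon"}, {"decoration"},
    {"bow"}, {"bow", "bow tie"},
]

def senses(word):
    """All synsets (concepts) that the word belongs to."""
    return [synset for synset in SYNSETS if word in synset]

def synonyms(word):
    """All other words that share a synset with `word`."""
    return set().union(*senses(word)) - {word}

def is_homonym(word):
    """A word with more than one synset has more than one meaning."""
    return len(senses(word)) > 1

print(synonyms("author"))
print(is_homonym("bow"), is_homonym("singer"))
```

Keeping words and concepts apart in this way avoids the mix-up of the plain word taxonomy: the two meanings of “bow” become two distinct synsets.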
Semantics: WSD
• Word Sense Disambiguation (WSD) is the process of finding the meaning of a word in a context
They used a bow to hunt animals.
Words of the sentence: { they, used, to, hunt, animals }
Words associated with “bow (weapon)” (e.g. from WordNet): { kill, hunt, Indian, prey }
Overlap: 1/5 ✔
Words associated with “bow (bow tie)” (e.g. from WordNet): { suit, clothing, reception }
Overlap: 0/5 ✗
Semantics: WSD
• Word Sense Disambiguation (WSD) is the process of
finding the meaning of a word in a context
• Naive method: Comparison of contexts
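The naive comparison of contexts can be sketched as a set overlap. The word sets are the ones from the “bow” example; the function name and the scoring by overlap divided by context size are assumptions that mirror the 1/5 and 0/5 figures.

```python
# Words associated with each sense of “bow” (from the example).
SENSES = {
    "bow (weapon)": {"kill", "hunt", "Indian", "prey"},
    "bow (bow tie)": {"suit", "clothing", "reception"},
}

def disambiguate(context, senses):
    """Pick the sense whose associated words overlap most with the context."""
    scores = {
        sense: len(context & words) / len(context)
        for sense, words in senses.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Context words of “They used a bow to hunt animals.”
CONTEXT = {"they", "used", "to", "hunt", "animals"}

best, scores = disambiguate(CONTEXT, SENSES)
print(best, scores)
```

The method is fragile: a single shared word decides the outcome, and a context with no shared word leaves all senses tied at zero.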
Syntax: Correct Sentences
Bob stole the cat. ✔
Cat the Bob stole. ✗
Bob, who likes Alice, stole the cat.
Bob, who likes Alice, who hates Carl, stole the cat.
Bob, who likes Alice, who hates Carl, who owns the cat, stole the cat.
-> There are infinitely many correct sentences,
...yet not all sentences are correct.
Syntax: Grammars
Bob stole the cat. ✔
Cat the Bob stole. ✗
Grammar: A formalism that decides whether a sentence is syntactically correct.
Example: Bob eats
Sentence -> Noun Verb
Noun -> Bob
Verb -> eats
Syntax: Phrase Structure Grammars
Non-terminal symbols: abstract phrase constituent names, such as “Sentence”, “Noun”, “Verb”
Terminal symbols: words of the language, such as “Bob”, “eats”, “drinks”
Given two disjoint sets of symbols, N and T, a (context-free) grammar is a relation between N and strings over N ∪ T:
G ⊆ N × (N ∪ T)*
Production rules:
Sentence -> Noun Verb
Lexicon:
Noun -> Bob
Verb -> eats
N = {Sentence, Noun, Verb}
T = {Bob, eats}
Rules always start from non-terminal symbols, never from terminal symbols (hence the name)
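The formal definition translates almost literally into code: a grammar is a set of pairs (left-hand side, right-hand side). The sketch below checks the two conditions of the definition; the function name is an assumption.

```python
N = {"Sentence", "Noun", "Verb"}   # non-terminal symbols
T = {"Bob", "eats"}                # terminal symbols

# The grammar as a relation: pairs of (non-terminal, string over N ∪ T).
G = {
    ("Sentence", ("Noun", "Verb")),
    ("Noun", ("Bob",)),
    ("Verb", ("eats",)),
}

def is_context_free_grammar(n, t, g):
    """Check that N and T are disjoint and that G ⊆ N × (N ∪ T)*."""
    if n & t:
        return False
    return all(
        lhs in n and all(symbol in n | t for symbol in rhs)
        for lhs, rhs in g
    )

print(is_context_free_grammar(N, T, G))
# A rule that starts from a terminal symbol violates the definition.
print(is_context_free_grammar(N, T, G | {("Bob", ("Noun",))}))
```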
Syntax: Using Grammars
Sentence -> Noun Verb
Noun -> Bob
Verb -> eats
N = {Sentence, Noun, Verb}
T = {Bob, eats}
Rule derivation:
Sentence
Apply rule Sentence -> Noun Verb
Noun Verb
Apply rule Noun -> Bob
Bob Verb
Apply rule Verb -> eats
Bob eats
All symbols are terminal -> no more rule applicable -> stop here -> we have a sentence
Rule derivation = Parse tree:
Sentence
  Noun
    Bob
  Verb
    eats
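The derivation can be replayed mechanically. The sketch below assumes a grammar with exactly one rule per non-terminal (as in this example) and always rewrites the leftmost non-terminal; `derive` is a hypothetical helper.

```python
# One rule per non-terminal (as in the example grammar).
RULES = {
    "Sentence": ["Noun", "Verb"],
    "Noun": ["Bob"],
    "Verb": ["eats"],
}

def derive(start, rules):
    """Rewrite the leftmost non-terminal until only terminals remain."""
    form = [start]
    steps = [list(form)]
    while any(symbol in rules for symbol in form):
        # Position of the leftmost symbol that still has a rule.
        i = next(i for i, symbol in enumerate(form) if symbol in rules)
        form[i:i + 1] = rules[form[i]]
        steps.append(list(form))
    return steps

for step in derive("Sentence", RULES):
    print(" ".join(step))
```

The loop stops exactly when all symbols are terminal, which is the stopping condition of the derivation.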
Syntax: A More Complex Example
1. Sentence -> NounPhrase VerbPhrase
2. NounPhrase -> ProperNoun
3. ProperNoun -> Bob
4. VerbPhrase -> Verb NounPhrase
5. Verb -> stole
6. NounPhrase -> Determiner Noun
7. Determiner -> the
8. Noun -> cat
Sentence
  NounPhrase
    ProperNoun
      Bob
  VerbPhrase
    Verb
      stole
    NounPhrase
      Determiner
        the
      Noun
        cat
Syntax: A More Complex Example
1. Sentence -> NounPhrase VerbPhrase
2. NounPhrase -> ProperNoun
3. ProperNoun -> Bob
4. VerbPhrase -> Verb NounPhrase
5. Verb -> stole
6. NounPhrase -> Determiner Noun
7. Determiner -> the
8. Noun -> cat
Sentence
  NounPhrase
    Determiner
      the
    Noun
      cat
  VerbPhrase
    Verb
      stole
    NounPhrase
      ProperNoun
        Bob
Syntax: Recursive Structures
1. Sentence -> NounPhrase VerbPhrase
2. NounPhrase -> ProperNoun
3. NounPhrase -> Determiner Noun
4. NounPhrase -> NounPhrase Subordinator VerbPhrase
5. VerbPhrase -> Verb NounPhrase
Sentence
  NounPhrase
    NounPhrase
      ProperNoun
        Bob
    Subordinator
      who
    VerbPhrase
      Verb
        likes
      NounPhrase
        ProperNoun
          Alice
  VerbPhrase
    Verb
      stole
    NounPhrase
      Determiner
        the
      Noun
        cat
Recursive rules allow a cycle in the derivation
Syntax: Recursive Structures
1. Sentence -> NounPhrase VerbPhrase
2. NounPhrase -> ProperNoun
3. NounPhrase -> Determiner Noun
4. NounPhrase -> NounPhrase Subordinator VerbPhrase
5. VerbPhrase -> Verb NounPhrase
Sentence
  NounPhrase
    NounPhrase
      ProperNoun
        Bob
    Subordinator
      who
    VerbPhrase
      Verb
        likes
      NounPhrase
        Alice who hates ...
          Carl who owns ...
  VerbPhrase
    Verb
      stole
    NounPhrase
      Determiner
        the
      Noun
        cat
Syntax: Language
Language of a grammar: The set of all sentences that can be
derived from the start symbol by rule applications.
In the language:
Bob stole
Bob stole the cat
Bob likes Alice
Bob stole Alice
Alice stole Bob who likes the cat
The cat likes Alice who stole Bob
Bob likes Alice who likes Alice who likes Alice...
...
Not in the language:
The Bob stole likes.
Stole stole stole.
Bob cat Alice likes.
...
The grammar is a finite description of an infinite set of sentences
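The language of a grammar can be enumerated up to a bounded derivation depth. The sketch below uses the recursive rules of the previous slides plus a small lexicon that is an assumption for illustration; `generate` is a hypothetical helper.

```python
from itertools import product

# Rules 1-5 from the slides plus an assumed toy lexicon.
GRAMMAR = {
    "Sentence": [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["ProperNoun"], ["Determiner", "Noun"],
                   ["NounPhrase", "Subordinator", "VerbPhrase"]],
    "VerbPhrase": [["Verb", "NounPhrase"]],
    "ProperNoun": [["Bob"], ["Alice"]],
    "Determiner": [["the"]],
    "Noun": [["cat"]],
    "Subordinator": [["who"]],
    "Verb": [["stole"], ["likes"]],
}

def generate(symbol, depth):
    """All word tuples derivable from `symbol` with at most `depth` nested rule applications."""
    if symbol not in GRAMMAR:
        return {(symbol,)}          # terminal symbol: the word itself
    if depth == 0:
        return set()                # depth budget exhausted
    results = set()
    for rhs in GRAMMAR[symbol]:
        parts = [generate(s, depth - 1) for s in rhs]
        for combo in product(*parts):
            results.add(tuple(word for part in combo for word in part))
    return results

shallow = generate("Sentence", 4)
deeper = generate("Sentence", 5)
print(len(shallow), len(deeper))
print(" ".join(sorted(shallow)[0]))
```

Each extra level of depth adds longer sentences through the recursive rule, which is how a finite rule set describes an infinite language.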
Syntax: What we cannot (yet) do
What is difficult to do with context-free grammars:
• agreement between words
Bob kicks the dog.
I kicks the dog. ✗
• sub-categorization frames
Bob sleeps.
Bob sleeps you. ✗
• meaningfulness
Bob switches the computer off.
Bob switches the cat off. ✗
Syntax: Parsing
Parsing is the process of, given a grammar and a sentence, finding the phrase structure tree.
Sentence -> Noun Verb
Noun -> Bob
Verb -> eats
N = {Sentence, Noun, Verb}
T = {Bob, eats}
Sentence
  Noun
    Bob
  Verb
    eats
Syntax: Ambiguity
Sentence
  NounPhrase
    Pronoun
      They
  VerbPhrase
    Verb
      were
    NounPhrase
      Adjective
        visiting
      Noun
        relatives
(compare: They were playing children)
= They were relatives who came to visit.
Syntax: Ambiguity
Sentence
  NounPhrase
    Pronoun
      They
  VerbPhrase
    Auxiliary Verb
      were
    Verb
      visiting
    NounPhrase
      Noun
        relatives
(compare: They were cooking dinner)
= They were on a visit to relatives.
Syntax: Parsers
Parsing is the process of, given a grammar and a sentence, finding
the parse tree.
There may be multiple parse trees for a given sentence
(a phenomenon called syntactic ambiguity).
Common techniques:
• naive: bottom-up trial and error
• more advanced: dynamic programming, stochastic methods
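As an illustration of syntactic ambiguity, the following sketch finds all parse trees of “They were visiting relatives”. It is a simple exhaustive top-down parser with memoization, not one of the production techniques named above; the grammar, the lexicon, and the function names are assumptions built from the two ambiguity slides.

```python
from functools import lru_cache

# Assumed grammar covering both readings of the example sentence.
RULES = {
    "Sentence": [("NounPhrase", "VerbPhrase")],
    "NounPhrase": [("Pronoun",), ("Adjective", "Noun"), ("Noun",)],
    "VerbPhrase": [("Verb", "NounPhrase"), ("Auxiliary", "Verb", "NounPhrase")],
}
# Assumed lexicon: word -> possible parts of speech.
LEXICON = {
    "They": {"Pronoun"},
    "were": {"Verb", "Auxiliary"},
    "visiting": {"Verb", "Adjective"},
    "relatives": {"Noun"},
}

@lru_cache(maxsize=None)
def parse(symbol, words):
    """All parse trees that derive the tuple `words` from `symbol`."""
    trees = []
    if len(words) == 1 and symbol in LEXICON.get(words[0], ()):
        trees.append((symbol, words[0]))
    for rhs in RULES.get(symbol, ()):
        for children in expand(rhs, words):
            trees.append((symbol,) + children)
    return tuple(trees)

def expand(rhs, words):
    """All ways to distribute `words` over the symbols of `rhs` (each gets >= 1 word)."""
    if len(rhs) == 1:
        return [(tree,) for tree in parse(rhs[0], words)]
    results = []
    for i in range(1, len(words) - len(rhs) + 2):
        for first in parse(rhs[0], words[:i]):
            for rest in expand(rhs[1:], words[i:]):
                results.append((first,) + rest)
    return results

trees = parse("Sentence", ("They", "were", "visiting", "relatives"))
for tree in trees:
    print(tree)
```

The memoization (`lru_cache`) stores the trees of each (symbol, word span) pair, so shared sub-spans are parsed only once; this is the basic idea behind the dynamic programming parsers.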
Syntax: Part Of Speech
Sentence
  NounPhrase
    NounPhrase
      ProperNoun   <- the part of speech (POS) of “Bob”
        Bob
    Subordinator
      who
    VerbPhrase
      Verb
        likes
      NounPhrase
        ProperNoun
          Alice
  VerbPhrase
    Verb
      stole
    NounPhrase
      Determiner
        the
      Noun
        cat
Syntax: Part Of Speech
The part of speech (POS) of a terminal symbol in a parse tree is the parent node of the terminal symbol.
The POS of a word determines the role of the word in the sentence.
Examples: proper nouns, nouns, verbs, prepositions, adjectives, ...
ProperNoun  Subordinator  Verb   ProperNoun  Verb   Determiner  Noun
Bob         who           likes  Alice       stole  the         cat
ProperNoun is the part of speech (POS) of “Bob”
Syntax: POS Tagging
POS tagging is the process of, given a sentence and a (partial) grammar, determining the part of speech of each word.
The train that Alice took was heading to Rome.
The/Determiner train/Noun that/Subordinator Alice/ProperNoun took/Verb was/AuxiliaryVerb heading/Verb to/Preposition Rome/ProperNoun.
Syntax: POS Tagging
POS tagging is the process of, given a sentence and a (partial)
grammar, determining the part of speech of each word.
POS tagging is not trivial, because the same word can appear
with different POS:
• Some words belong to two word classes (“run” as a verb or noun)
• Some word forms may be ambiguous:
Sound sounds sound sound.
Common techniques:
• rule-based
• statistical
• using dynamic programming
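A very small rule-based tagger might look as follows. Everything in it is an assumption for illustration: the lexicon lists the possible tags per word (first entry is the default guess), and three hand-written context rules pick a tag depending on the previous tag.

```python
# Toy lexicon (assumed): possible tags per word, first entry is the default guess.
LEXICON = {
    "the": ["Determiner"],
    "train": ["Verb", "Noun"],
    "that": ["Determiner", "Subordinator"],
    "alice": ["ProperNoun"],
    "took": ["Verb"],
    "was": ["AuxiliaryVerb"],
    "heading": ["Noun", "Verb"],
    "to": ["Preposition"],
    "rome": ["ProperNoun"],
}

# Hand-written context rules (assumed): tag preferred right after a given tag.
PREFERRED_AFTER = {
    "Determiner": "Noun",
    "Noun": "Subordinator",
    "AuxiliaryVerb": "Verb",
}

def pos_tag(sentence):
    """Tag each word, using the previous tag to resolve ambiguous words."""
    tagged = []
    previous = None
    for word in sentence.rstrip(".").split():
        candidates = LEXICON[word.lower()]
        preferred = PREFERRED_AFTER.get(previous)
        tag = preferred if preferred in candidates else candidates[0]
        tagged.append((word, tag))
        previous = tag
    return tagged

for word, tag in pos_tag("The train that Alice took was heading to Rome."):
    print(f"{word}/{tag}")
```

Without the context rules, “train” and “heading” would receive their default (wrong) tags; statistical taggers learn such preferences from annotated text instead of hard-coding them.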
Syntax: Word Classes
A word class is a set of words sharing the same POS.
E.g.: Nouns, verbs, subordinators, determiners,...
An open word class is a word class that can be extended:
• Proper nouns: Alice, Fabian, Elvis, ...
• Nouns: computer, weekend, ...
• Adjectives: self-reloading fridge, ...
• Verbs: download, ...
A closed word class is a word class that cannot be extended:
• Pronouns: he, she, it, this, ... (≈ what can replace a noun)
• Determiners: the, a, these, your, my, ... (≈ what goes before a noun)
• Prepositions: in, with, on, ... (≈ what goes before determiner + noun)
• Subordinators: who, whose, that, which, because, ... (≈ what introduces a subordinate sentence)
Syntax: Stop words
Words of the closed word classes are often perceived as contributing less to the meaning of a sentence.
Without these words: Words closed word classes often perceived contributing less meaning sentence.
Therefore, these words are often ignored in Web search.
Words that are ignored in a Web search are called stop words.
a, the, in, those, could, can, not, ...
This may not always be reasonable:
“Vacation outside Europe” -> “Vacation Europe”
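Stop word removal is a one-line filter, which also makes the problem with the vacation query easy to reproduce. The stop list below extends the examples from the slide with the preposition “outside”, which is an assumption for this sketch.

```python
# Stop words from the slide, plus "outside" (assumed, as a preposition).
STOP_WORDS = {"a", "the", "in", "those", "could", "can", "not", "outside"}

def remove_stop_words(query):
    """Lower-case the query and drop all stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("Vacation outside Europe"))
print(remove_stop_words("Vacation Europe"))
```

Both queries reduce to the same word list although they ask for opposite things.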
Homework
• Phonology: Find two French words that sound the same, but are written differently (homophones)
• Morphology: Find an example Web search query where stemming to the stem (most aggressive variant) is too aggressive.
• Semantics: Make a taxonomy of at least 5 words in a thematic domain of your choice
• Syntax: POS-tag the sentence
The quick brown fox jumps over the lazy dog
-> Hand in by e-mail to f.m.suchanek@gmail.com or on paper in the next session