I256 Applied Natural Language Processing Fall 2009

I256
Applied Natural Language
Processing
Fall 2009
Lecture 9
• Review
Barbara Rosario
Why NLP is difficult
• Fundamental goal: deep understanding of broad language
  – Not just string processing or keyword matching
• Language is ambiguous
  – At all levels: lexical, phrase, semantic
• Language is flexible
  – New words, new meanings
  – Different meanings in different contexts
• Language is subtle
• Language is about human communication
• Problem of scale
  – Many (infinite?) possible words, meanings, contexts
• Problem of sparsity
  – Very difficult to do statistical analysis; most things (words, concepts) have never been seen before
• Long range correlations
• Representation of meaning
2
Linguistics essentials
• Important distinction:
– study of language structure (grammar)
– study of meaning (semantics)
• Grammar
– Phonology (the study of sound systems and abstract
sound units).
– Morphology (the formation and composition of words)
– Syntax (the rules that determine how words combine
into sentences)
• Semantics
– The study of the meaning of words (lexical semantics)
and fixed word combinations (phraseology), and how
these combine to form the meanings of sentences
3
http://en.wikipedia.org/wiki/Linguistics
Today’s review
• Grammar
  – Morphology
  – Part-of-speech (POS)
  – Phrase level syntax
• Lower level text processing
• Semantics
  – Lexical semantics
  – Word sense disambiguation (WSD)
  – Lexical acquisition
• Corpus-based statistical approaches to tackle NLP problems
• Corpora
• Intro to probability theory and graphical models (GM)
  – Example for WSD
  – Language Models (LM) and smoothing

• What I hope we achieved
  • Overall idea of linguistic problems
  • Overall understanding of "lower level" NLP tasks
    – POS, WSD, language models, segmentation, etc.
    – NOTE: will be used for preprocessing and as features for higher level tasks
  • Initial understanding of Stat NLP
    – Corpora & annotation
    – Probability theory, GM
    – Sparsity problem
  • Familiarity with Python and NLTK
4
Morphology
• Morphology is the study of the internal structure of
words, of the way words are built up from smaller
meaning units.
• Morpheme:
– The smallest meaningful unit in the grammar of a language.
• Two classes of morphemes
  – Stems: "main" morpheme of the word, supplying the main meaning (i.e. establish in the example below)
  – Affixes: add additional meaning
    • Prefixes: Antidisestablishmentarianism
    • Suffixes: Antidisestablishmentarianism
    • Infixes: hingi (borrow) – humingi (borrower) in Tagalog
    • Circumfixes: sagen (say) – gesagt (said) in German
  – Examples: unladylike, dogs, technique
5
Types of morphological processes
• Inflection:
– Systematic modification of a root form by means of prefixes and
suffixes to indicate grammatical distinctions like singular and plural.
– Doesn’t change the word class
– New grammatical role
– Usually produces a predictable, non-idiosyncratic change of meaning.
  • run → runs | running | ran
  • hope + ing → hoping;  hop → hopping
• Derivation:
  – Ex: compute → computer → computerization
  – Less systematic than inflection
– It can involve a change of meaning
• Compounding:
– Merging of two or more words into a new word
• Downmarket, (to) overtake
6
Stemming & Lemmatization
• The removal of the inflectional ending from words (strip off any affixes)
  • Laughing, laugh, laughs, laughed → laugh
  – Problems
    • Can conflate semantically different words
      – Gallery and gall may both be stemmed to gall
  – Regular expressions for stemming
  – Porter Stemmer
  – nltk.wordnet.morphy
• A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.
7
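A minimal NLTK sketch of stemming vs. dictionary lookup (assumed setup; requires the NLTK WordNet data):

from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

stemmer = PorterStemmer()
for word in ["laughing", "laughs", "laughed", "gallery"]:
    # stem() applies rule-based suffix stripping and may return non-words;
    # morphy() only returns a form found in the WordNet dictionary (else None)
    print(word, stemmer.stem(word), wn.morphy(word))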
Grammar: words: POS
• Words of a language are grouped into classes to reflect
similar syntactic behaviors
• Syntactic or grammatical categories (aka parts-of-speech)
  – Nouns (people, animals, concepts)
  – Verbs (actions, states)
  – Adjectives
  – Prepositions
  – Determiners
• Open or lexical categories (nouns, verbs, adjectives)
– Large number of members, new words are commonly added
• Closed or functional categories (prepositions,
determiners)
– Few members, clear grammatical use
8
Part-of-speech (English)
9
From Dan Klein’s cs 288 slides
Terminology
• Tagging
– The process of associating labels with each
token in a text
• Tags
– The labels
– Syntactic word classes
• Tag Set
– The collection of tags used
10
Example
• Typically a tagged text is a sequence of
white-space separated base/tag tokens:
These/DT
findings/NNS
should/MD
be/VB
useful/JJ
for/IN
therapeutic/JJ
strategies/NNS
and/CC
the/DT
development/NN
of/IN
immunosuppressants/NNS
targeting/VBG
the/DT
CD28/NN
costimulatory/NN
pathway/NN
./.
11
Part-of-speech (English)
12
From Dan Klein’s cs 288 slides
Part-of-Speech Ambiguity
Words that are highly ambiguous as to their part of speech tag
13
Sources of information
• Syntagmatic: tags of the other words
– AT JJ NN is common
– AT JJ VBP impossible (or unlikely)
• Lexical: look at the words
– The → AT
– Flour → more likely to be a noun than a verb
– A tagger that always chooses the most common tag is
90% correct (often used as baseline)
• Most taggers use both
14
What does Tagging do?
1. Collapses Distinctions
   • Lexical identity may be discarded
   • e.g., all personal pronouns tagged with PRP
2. Introduces Distinctions
   • Ambiguities may be resolved
   • e.g., deal tagged with NN or VB
3. Helps in classification and prediction
15
Why POS?
• A word’s POS tells us a lot about the
word and its neighbors:
– Limits the range of meanings (deal), pronunciation for text-to-speech (the noun object vs. the verb object, record), or both (wind)
– Helps in stemming: saw[v] → see, saw[n] → saw
– Limits the range of following words
– Can help select nouns from a document for
summarization
– Basis for partial parsing (chunked parsing)
16
Choosing a tagset
• The choice of tagset greatly affects the
difficulty of the problem
• Need to strike a balance between
– Getting better information about context
– Making it possible for classifiers to do their job
17
Tagging methods
• Hand-coded
• Statistical taggers
– N-Gram Tagging
– HMM
– (Maximum Entropy)
• Brill (transformation-based) tagger
18
Unigram Tagger
• Unigram taggers are based on a simple statistical
algorithm: for each token, assign the tag that is most
likely for that particular token.
– For example, it will assign the tag JJ to any occurrence of the word
frequent, since frequent is used as an adjective (e.g. a frequent word)
more often than it is used as a verb (e.g. I frequent this cafe).
P(tn | wn)
19
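A minimal sketch of this in NLTK (assumed setup; requires the tagged Brown corpus data):

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(train_sents)   # most frequent tag per word
print(unigram_tagger.tag(['The', 'race', 'for', 'outer', 'space']))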
N-Gram Tagging
• An n-gram tagger is a generalization of a unigram tagger whose
context is the current word together with the part-of-speech tags of
the n-1 preceding tokens
• A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
• Trigram tagger: P(tn | wn, tn−1, tn−2)
20
N-Gram Tagging
• Why not 10-gram taggers?
• As n gets larger, the specificity of the contexts
increases, as does the chance that the data we
wish to tag contains contexts that were not
present in the training data.
• This is known as the sparse data problem, and
is quite pervasive in NLP. As a consequence,
there is a trade-off between the accuracy and
the coverage of our results (and this is related to
the precision/recall trade-off)
21
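One common way to trade accuracy against coverage is backoff: a minimal sketch (assumed setup) of an NLTK bigram tagger that backs off to a unigram tagger, and then to a default tag, whenever a context was never seen in training:

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t2.tag('They are expected to race tomorrow'.split()))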
Markov Model Tagger
• Bigram tagger
• Assumptions:
– Words are independent of each other
– A word identity depends only on its tag
– A tag depends only on the previous tag
22
Markov Model Tagger
[HMM structure: tag sequence t1 → t2 → … → tn, each tag ti emitting word wi]

P(t, w) = P(t1, t2, .., tn, w1, w2, .., wn) = ∏i P(ti | ti−1) P(wi | ti)
23
Rule-Based Tagger
• The Linguistic Complaint
– Where is the linguistic knowledge of a tagger?
– Just massive tables of numbers P(tn | wn, tn−1, tn−2)
– Aren’t there any linguistic insights that could
emerge from the data?
– Could thus use handcrafted sets of rules to tag
input sentences, for example, if input follows a
determiner tag it as a noun.
24
The Brill tagger
(transformation-based tagger)
• An example of Transformation-Based Learning
– Basic idea: do a quick job first (using frequency), then
revise it using contextual rules.
• Very popular (freely available, works fairly well)
– Probably the most widely used tagger (esp. outside
NLP)
– …. but not the most accurate: 96.6% / 82.0 %
• A supervised method: requires a tagged corpus
25
Brill Tagging: In more detail
• Start with simple (less accurate)
rules…learn better ones from tagged
corpus
– Tag each word initially with most likely POS
– Examine set of transformations to see which
improves tagging decisions compared to tagged
corpus
– Re-tag corpus using best transformation
– Repeat until, e.g., performance doesn’t improve
– Result: tagging procedure (ordered list of
transformations) which can be applied to new,
untagged text
26
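A minimal sketch of transformation-based learning in NLTK (module paths assumed for recent NLTK versions; the rule templates and parameters are illustrative, not the settings from the lecture):

import nltk
from nltk.corpus import brown
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = list(brown.tagged_sents(categories='news'))
initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
trainer = BrillTaggerTrainer(initial, fntbl37())         # standard rule templates
brill_tagger = trainer.train(train_sents, max_rules=10)  # learn a few transformations
print(brill_tagger.tag('They are expected to race tomorrow'.split()))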
An example
• Examples:
– They are expected to race tomorrow.
– The race for outer space.
• Tagging algorithm:
1. Tag all uses of "race" as NN (most likely tag in the Brown corpus)
   • They are expected to race/NN tomorrow
   • the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO:
   • They are expected to race/VB tomorrow
   • the race/NN for outer space
27
What gets learned? [from Brill 95]
Tags-triggered transformations
Morphology-triggered transformations
Rules are linguistically interpretable
28
Today’s review
• Grammar
– Morphology
– Part-of-speech (POS)
– Phrase level syntax
• Lower level text processing
• Semantics
– Word sense disambiguation
(WSD)
– Lexical semantics
– Lexical acquisition
• Corpora
• Intro to probability theory and
graphical models (GM)
– Example for WSD
– Language Models (LM) and
smoothing
29
Phrase structure
• Words are organized in phrases
• Phrases: grouping of words that are
clumped as a unit
• Syntax: study of the regularities and
constraints of word order and phrase
structure
30
Major phrase types
• Sentence (S) (whole grammatical unit).
Normally rewrites as a subject noun phrase
and a verb phrase
• Noun phrase (NP): phrase whose head is a
noun or a pronoun, optionally accompanied
by a set of modifiers
– The smart student of physics with long hair
31
Major phrase types
• Prepositional phrases (PP)
– Headed by a preposition and containing a NP
• She is [on the computer]
• They walked [to their school]
• Verb phrases (VP)
– Phrase whose head is a verb
• [Getting to school on time] was a struggle
• He [was trying to keep his temper]
• That woman [quickly showed me the way to hide]
32
Phrase structure grammar
• Syntactic analysis of sentences
– (Ultimately) to extract meaning:
• Mary gave Peter a book
• Peter gave Mary a book
33
Phrase structure parsing
• Parsing: the process of reconstructing the
derivation(s) or phrase structure trees that
give rise to a particular sequence of words
• Parse is a phrase structure tree
– New art critics write reviews with computers
34
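A minimal sketch (toy grammar written for this sentence, not from the slides) of phrase structure parsing with NLTK:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Adj NP | N N | N | NP PP
VP -> V NP | VP PP
PP -> P NP
Adj -> 'New'
N -> 'art' | 'critics' | 'reviews' | 'computers'
V -> 'write'
P -> 'with'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("New art critics write reviews with computers".split()):
    print(tree)   # more than one tree: attachment ambiguity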
Phrase structure parsing &
ambiguity
• The children ate the cake with a spoon
• PP Attachment Ambiguity
• Why is it important for NLP?
35
Today’s review
• Grammar
– Morphology
– Part-of-speech (POS)
– Phrase level syntax
• Lower level text processing
– Text normalization
– Segmentation
• Semantics
• Corpora
• Intro to probability theory and
graphical models (GM)
– Example for WSD
– Language Models (LM) and
smoothing
36
Text Normalization
• Stemming
• Convert to lower case
• Identifying non-standard words including numbers,
abbreviations, and dates, and mapping any such tokens
to a special vocabulary.
– For example, every decimal number could be mapped to a single
token 0.0, and every acronym could be mapped to AAA. This
keeps the vocabulary small and improves the accuracy of many
language modeling tasks.
• Lemmatization
– Make sure that the resulting form is a known word in a dictionary
– WordNet lemmatizer only removes affixes if the resulting word is
in its dictionary
37
Segmentation
• Word segmentation
– For languages that do not put spaces
between words
• Chinese, Japanese, Korean, Thai, German (for
compound nouns)
• Tokenization
• Sentence segmentation
– Divide text into sentences
38
Tokenization
• Divide text into units called tokens (words, numbers, punctuations)
– Page 124—136 Manning
• What is a word?
– Graphic word: string of contiguous alphanumeric characters surrounded by white space
  • $22.50
– Main clue (in English) is the occurrence of whitespace
– Problems
  • Periods: usually we remove punctuation, but sometimes it's useful to keep periods (Wash. → wash)
  • Single apostrophes, contractions (isn't, didn't, dog's): for meaning extraction it could be useful to have 2 separate forms (is + n't or not)
  • Hyphenation:
    – Sometimes best as a single word: co-operate
    – Sometimes best as 2 separate words: 26-year-old, aluminum-export ban
• (RE for tokenization)
39
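A minimal sketch (assumed pattern) of regular-expression tokenization with NLTK, keeping currency amounts and hyphenated words together:

from nltk.tokenize import regexp_tokenize

pattern = r'''(?x)          # verbose regular expression
    \$?\d+(?:\.\d+)?%?      # currency and percent amounts, e.g. $22.50
  | \w+(?:-\w+)*            # words, with optional internal hyphens
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # individual punctuation tokens
'''
print(regexp_tokenize("That 26-year-old paid $22.50 in Wash.", pattern))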
Sentence Segmentation
• Sentence:
  – Something ending with '.', '?', '!' (and sometimes also ':')
  – "You reminded me," she remarked, "of your mother."
    • Nested sentences
    • Note the period inside the closing quote (.")
• Sentence boundary detection algorithms
– Heuristic (see figure 4.1 page 135 Manning)
– Statistical classification trees (Riley 1989)
• Probability of a word to occur before or after a boundary,
case and length of words
– Neural network (Palmer and Hearst 1997)
• Part of speech distribution of preceding and following words
– Maximum Entropy (Mikheev 1998)
40
For reference see Manning
Note: for each sentence-boundary detection algorithm above, note the MODEL used and the features it relies on.
41
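A minimal sketch (assumed setup; requires the NLTK 'punkt' data) of a pre-trained statistical sentence-boundary detector:

import nltk

text = ('"You reminded me," she remarked, "of your mother." '
        'The plant in Wash. shut down. It reopened in Sept. 2009.')
for sent in nltk.sent_tokenize(text):
    print(sent)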
Segmentation as classification
• Sentence segmentation can be viewed as a
classification task for punctuation:
– Whenever we encounter a symbol that could possibly
end a sentence, such as a period or a question mark,
we have to decide whether it terminates the
preceding sentence.
– We'll return to this when we cover classification
• See Section 6.2 NLTK book
• For word segmentation see section 3.8 NLTK
book
– Also page 180 of Speech and Language Processing
Jurafsky and Martin
42
Today’s review
• Grammar
– Morphology
– Part-of-speech (POS)
– Phrase level syntax
• Lower level text processing
• Semantics
– Lexical semantics
– Word sense disambiguation
(WSD)
– Lexical acquisition
• Corpora
• Intro to probability theory and
graphical models (GM)
– Example for WSD
– Language Models (LM) and
smoothing
43
Semantics
•
Semantics is the study of the meaning of words, constructions, and utterances
1. Study of the meaning of individual words
(lexical semantics)
2. Study of how meanings of individual
words are combined into the meaning of
sentences (or larger units)
44
Lexical semantics
• How words are related to each other
• Hyponymy
  – scarlet, vermilion, carmine, and crimson are all hyponyms of red
• Hypernymy
• Antonymy (opposite)
  – male, female
• Meronymy (part of)
  – Tire is a meronym of car
• Etc..
45
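A minimal sketch (assumed setup; requires the WordNet data) of exploring these relations with NLTK:

from nltk.corpus import wordnet as wn

red = wn.synset('red.n.01')
print(red.hyponyms())        # e.g. crimson, scarlet, ...
car = wn.synset('car.n.01')
print(car.part_meronyms())   # parts of a car
print(car.hypernyms())       # more general concepts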
Word Senses
• Words have multiple distinct meanings, or senses:
– Plant: living plant, manufacturing plant, …
– Title: name of a work, ownership document, form of address,
material at the start of a film, …
•
Many levels of sense distinctions
– Homonymy: totally unrelated meanings (river bank, money bank)
– Polysemy: related meanings (star in sky, star on tv, title)
– Systematic polysemy: productive meaning extensions
(metonymy such as organizations to their buildings) or metaphor
– Sense distinctions can be extremely subtle (or not)
•
Granularity of senses needed depends a lot on the task
46
Taken from Dan Klein’s cs 288 slides
Word Sense Disambiguation
• Determine which of the senses of an ambiguous word is invoked in
a particular use of the word
• Example: living plant vs. manufacturing plant
• How do we tell these senses apart?
– “Context”
• The manufacturing plant which had previously sustained the town’s
economy shut down after an extended labor strike.
– Maybe it’s just text categorization
– Each word sense represents a topic
• Why is it important to model and disambiguate word senses?
  – Translation
    • Bank → banca or riva
  – Parsing
    • For PP attachment, for example
  – Information retrieval
    • To return documents with the right sense of bank
47
Adapted from Dan Klein’s cs 288 slides
Features
• Bag-of-words (use words around with no order)
– The manufacturing plant which had previously
sustained the town’s economy shut down after an
extended labor strike.
– Bags of words = {after, manufacturing, which, labor, ..}
• Bag-of-words classification works ok for noun
senses
– 90% on classic, shockingly easy examples (line,
interest, star)
– 80% on senseval-1 nouns
– 70% on senseval-1 verbs
48
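A minimal sketch (hypothetical helper, not from the slides) of extracting bag-of-words features from a context window around the target word:

def bow_features(tokens, target_index, window=5):
    lo = max(0, target_index - window)
    hi = target_index + window + 1
    context = tokens[lo:target_index] + tokens[target_index + 1:hi]
    return {'contains(%s)' % w.lower(): True for w in context}

tokens = ("The manufacturing plant which had previously sustained "
          "the town's economy shut down").split()
print(bow_features(tokens, tokens.index("plant")))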
Verb WSD
• Why are verbs harder?
– Verbal senses less topical
– More sensitive to structure, argument choice
– Better disambiguated by their argument (subject-object): importance of
local information
– For nouns, a wider context likely to be useful
• Verb example: "serve"
  – [function] The tree stump serves as a table
  – [enable] The scandal served to increase his popularity
  – [dish] We serve meals for the homeless
  – [enlist] She served her country
  – [jail] He served six years for embezzlement
  – [tennis] It was Agassi's turn to serve
  – [legal] He was served by the sheriff
• Different types of information may be appropriate for different parts of speech
49
Adapted from Dan Klein’s cs 288 slides
Better features
• There are smarter features:
  – Argument selectional preference:
    • serve NP[meals] vs. serve NP[papers] vs. serve NP[country]
  – Subcategorization:
    • [function] serve PP[as]
    • [enable] serve VP[to]
    • [tennis] serve <intransitive>
    • [food] serve NP {PP[to]}
• Can capture poorly (but robustly) with local windows… but we can also use a parser and get these features explicitly
50
Taken from Dan Klein’s cs 288 slides
Various Approaches to WSD
• Unsupervised learning
– We don’t know/have the labels
– It is discrimination more than disambiguation
  • Cluster into groups and discriminate between these groups without giving labels
• Clustering
  – Example: EM (expectation-maximization), bootstrapping (seeded with some labeled data)
• Supervised learning
51
Adapted from Dan Klein’s cs 288 slides
Supervised learning
• Supervised learning
– When we know the truth (true senses) (not always
true or easy)
– Classification task
– Most systems do some kind of supervised learning
– Many competing classification technologies perform
about the same (it’s all about the knowledge sources
you tap)
– Problem: training data available for only a few words
– Examples: Bayesian classification
• Naïve Bayes (simplest example of Graphical models)
52
Adapted from Dan Klein’s cs 288 slides
Semantics: beyond individual words
• Once we have the meaning of the
individual words, we need to assemble
them to get the meaning of the whole
sentence
• Hard because natural language does not
obey the principle of compositionality by
which the meaning of the whole can be
predicted by the meanings of the parts
53
Semantics: beyond individual words
•
Collocations
– White skin, white wine, white hair
•
Idioms: meaning is opaque
– Kick the bucket
54
Lexical acquisition
• Develop algorithms and statistical
techniques for filling the holes in existing
dictionaries and lexical resources by
looking at the occurrences of patterns of
words in large text corpora
– Collocations
– Semantic similarity
– (Logical metonymy)
– Selectional preferences
55
Collocations
• A collocation is an expression consisting of two
or more words that correspond to some
conventional way of saying things
– Noun phrases: weapons of mass destruction, stiff
breeze (but why not *stiff wind?)
– Verbal phrases: to make up
– Not necessarily contiguous: knock…. door
• Limited compositionality
– Compositional if meaning of expression can be
predicted by the meaning of the parts
– Idioms are the most extreme examples of non-compositionality
• Kick the bucket
56
Collocations
• Non Substitutability
– Cannot substitute words in a collocation
• *yellow wine
• Non modifiability
– To get a frog in one’s throat
• *To get an ugly frog in one’s throat
• Useful for
– Language generation
• *Powerful tea, *take a decision
– Machine translation
• An easy way to test whether a combination is a collocation is to translate it into another language
  – Make a decision → *faire une décision (prendre), *fare una decisione (prendere)
57
Finding collocations
• Frequency
– If two words occur together a lot, that may be
evidence that they have a special function
– Filter by POS patterns
– A N (linear function), N N (regression coefficients) etc..
• Mean and variance of the distance between the words
  • For non-contiguous collocations
• Mutual information measure
58
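A minimal sketch (assumed setup) of finding candidate collocations with NLTK, filtering by frequency and ranking by pointwise mutual information:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import brown

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(brown.words(categories='news'))
finder.apply_freq_filter(5)            # ignore bigrams seen fewer than 5 times
print(finder.nbest(measures.pmi, 10))  # top candidates by mutual information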
Lexical acquisition
• Examples:
– “insulin” and “progesterone” are in WordNet 2.1 but
“leptin” and “pregnenolone” are not.
– “HTML” and “SGML”, but not “XML” or “XHTML”.
– “Google” and “Yahoo”, but not “Microsoft” or “IBM”.
• We need some notion of word similarity to
know where to locate a new word in a
lexical resource
59
Semantic similarity
• Similar if contextually interchangeable
– The degree to which one word can be substituted for another in a given context
  • Suit is similar to litigation (but only in the legal context)
• Measures of similarity
– WordNet-based
– Vector-based
• Detecting hyponymy and other relations with
patterns
60
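A minimal sketch (assumed setup) of a WordNet-based similarity measure in NLTK:

from nltk.corpus import wordnet as wn

dog, cat, car = wn.synset('dog.n.01'), wn.synset('cat.n.01'), wn.synset('car.n.01')
print(dog.path_similarity(cat))   # semantically close pair: higher score
print(dog.path_similarity(car))   # more distant pair: lower score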
Lexical acquisition
• Lexical acquisition problems
– Collocations
– Semantic similarity
– (Logical metonymy)
– Selectional preferences
61
Selectional preferences
• Most verbs prefer arguments of a
particular type: selectional preferences or
restrictions
– Objects of eat tend to be food, subjects of
think tend to be people etc..
– “Preferences” to allow for metaphors
• Fear eats the soul
• Why is it important for NLP?
62
Selectional preferences
• Why Important?
– To infer meaning from selectional restrictions
• Suppose we don't know the word durian (it is not in the vocabulary)
• Susan ate a very fresh durian
• Infer that durian is a type of food
– Ranking the possible parses of a sentence
  • Give higher scores to parses where the verb has "natural" arguments
63
Today’s review
• Grammar
– Morphology
– Part-of-speech (POS)
– Phrase level syntax
• Lower level text processing
• Semantics
– Lexical semantics
– Word sense disambiguation (WSD)
– Lexical acquisition
• Corpus-based statistical approaches
to tackle NLP problems
• Corpora
• Intro to probability theory and graphical
models (GM)
– Example for WSD
– Language Models (LM) and smoothing
64
Corpus-based statistical
approaches to tackle NLP problems
• Data (corpora, labels, linguistic resources)
• Feature extractions (usually linguistics motivated)
• Statistical models
65
The NLP Pipeline
• For a given problem to be tackled:
1. Choose corpus (or build your own)
   – Low level processing done to the text before the 'real work' begins
   – Low-level formatting issues
     • Junk formatting/content (HTML tags, tables)
     • Case change (i.e. everything to lower case)
     • Tokenization, sentence segmentation
2. Choose annotation to use (or choose the label set and label it yourself)
   – Important but often neglected: check labeling (inconsistencies etc.)
3. Extract features
4. Choose or implement new NLP algorithms
5. Evaluate
6. (Eventually) re-iterate
66
Corpora
• Text Corpora & Annotated Text Corpora
– NLTK corpora
– Use/create your own
• Lexical resources
  – WordNet
  – VerbNet
  – FrameNet
  – Domain specific lexical resources
• Corpus Creation
• Annotation
67
Annotated Text Corpora
• Many text corpora contain linguistic
annotations, representing genres, POS
tags, named entities, syntactic structures,
semantic roles, and so forth.
• Not part of the text in the file; it explains
something of the structure and/or
semantics of text
68
Annotated Text Corpora
• Grammar annotation
– POS, parses, chunks
• Semantic annotation
– Topics, Named Entities, sentiment, Author,
Language, Word senses, co-reference …
• Lower level annotation
– Word tokenization, Sentence Segmentation,
Paragraph Segmentation
69
Processing Search Engine Results
• The web can be thought of as a huge
corpus of unannotated text.
• Web search engines provide an efficient
means of searching this text
70
Lexical Resources
• A lexicon, or lexical resource, is a collection of
words and/or phrases along with associated
information such as part of speech and sense
definitions.
• Lexical resources are secondary to texts, and
are usually created and enriched with the help of
texts
– A vocabulary (list of words in a text) is the simplest
lexical resource
• WordNet
• VerbNet
• FrameNet
• Medline
71
Annotation: main issues
• Deciding Which Layers of Annotation to
Include
– Grammar annotation
– Semantic annotation
– Lower level annotation
• Markup schemes
• How to do the annotation
• Design of a tag set
72
Annotation: design of a tag set
• Tag set: the set of the annotation classes: genres, POS
etc.
• The tags should reflect distinctive text properties, i.e. ideally we would want to give distinctive tags to words (or documents) that have distinctive distributions
– That: complementizer and preposition: 2 very different
distributions:
• Two tags or only one?
• If two: more predictive
• If one: automatic classification easier (fewer classes)
• Tension: splitting tags/classes to capture useful
distinctions gives improved information for prediction but
can make the classification task harder
73
How to do the annotation
• By hand
  – Can be difficult and time consuming; domain knowledge and/or training may be required
  – Amazon's Mechanical Turk (MTurk, http://www.mturk.com) allows you to create and post a task that requires human intervention (offering a reward for the completion of the task)
    • Our reward to users was between 15 and 30 cents per survey (< 1 cent per text segment)
    • We obtained labels for 3627 text segments for under $70
    • HITs were completed (by all 3 "workers") within a few minutes to a half-hour
    – [Yakhnenko and Rosario 07]
• Unsupervised methods do not use labeled data and try to learn a task from the "properties" of the data
• Automatic (i.e. using some other metadata available)
• Bootstrapping
  – Bootstrapping is an iterative process where, given (usually) a small amount of labeled data (seed data), the labels for the unlabeled data are estimated at each round of the process, and the (accepted) labels are then incorporated as training data
• Co-training
  – Co-training is a semi-supervised learning technique that requires two views of the data. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance
  – "The description of each example can be partitioned into two distinct views," for which both (a small amount of) labeled data and (much more) unlabeled data are available
  – Co-training is essentially the one-iteration, probabilistic version of bootstrapping
• Non linguistic (i.e. clicks for IR relevance)
74
Why Probability?
• Statistical NLP aims to do statistical
inference for the field of NLP
• Statistical inference consists of taking
some data (generated in accordance with
some unknown probability distribution) and
then making some inference about this
distribution.
75
Why Probability?
• Examples of statistical inference are WSD,
the task of language modeling (ex how to
predict the next word given the previous
words), topic classification, etc.
• In order to do this, we need a model of the
language.
• Probability theory helps us find such a model
76
Probability Theory
• How likely it is that something will happen
• Sample space Ω is the listing of all possible outcomes of an experiment
  – Sample space can be continuous or discrete
  – For language applications it's discrete (i.e. words)
• Event A is a subset of Ω
• Probability function (or distribution): P : Ω → [0, 1]
77
78
http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf
79
http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf
Prior Probability
• Prior probability: the probability before we
consider any additional knowledge
P( A)
80
Conditional probability
• Sometimes we have partial knowledge
about the outcome of an experiment
• Conditional (or Posterior) Probability
• Suppose we know that event B is true
• The probability that A is true given the
knowledge about B is expressed by
P(A | B)

P(A | B) = P(A, B) / P(B)
81
82
http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf
Conditional probability (cont)
P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Note: P(A, B) = P(A ∩ B)
• Chain Rule
• P(A, B) = P(A|B) P(B): the probability that A and B both happen is the probability that B happens times the probability that A happens given that B has occurred.
• P(A, B) = P(B|A) P(A): the probability that A and B both happen is the probability that A happens times the probability that B happens given that A has occurred.
• Multi-dimensional table with a value in every cell giving the probability of that specific state occurring
83
Chain Rule
P(A,B) = P(A|B)P(B)
= P(B|A)P(A)
P(A,B,C,D…) = P(A)P(B|A)P(C|A,B)P(D|A,B,C..)
84
Chain Rule → Bayes' rule

P(A, B) = P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)     (Bayes' rule)

Useful when one quantity is easier to calculate than the other; a trivial consequence of the definitions we saw, but extremely useful.
85
Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)
Bayes' rule translates causal knowledge into diagnostic knowledge.
For example, if A is the event that a patient has a disease, and B is the
event that she displays a symptom, then P(B | A) describes a causal
relationship, and P(A | B) describes a diagnostic one (that is usually
hard to assess).
If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B)
for free.
86
Example
• S:stiff neck, M: meningitis
• P(S|M) =0.5, P(M) = 1/50,000 P(S)=1/20
• I have stiff neck, should I worry?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
87
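A quick check of the slide's arithmetic in Python:

p_s_given_m, p_m, p_s = 0.5, 1 / 50000, 1 / 20
print(p_s_given_m * p_m / p_s)   # 0.0002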
(Conditional) independence
• Two events A and B are independent of each other if
  P(A) = P(A|B)
• Two events A and B are conditionally
independent of each other given C if
P(A|C) = P(A|B,C)
88
Back to language
• Statistical NLP aims to do statistical inference for
the field of NLP
– Topic classification
• P( topic | document )
– Language models
• P (word | previous word(s) )
– WSD
• P( sense | word)
• Two main problems
– Estimation: P is unknown: estimate P
– Inference: We estimated P; now we want to find
(infer) the topic of a document, or the sense of a word
89
Language Models (Estimation)
• In general, for language events, P is
unknown
• We need to estimate P, (or model M of the
language)
• We’ll do this by looking at evidence about
what P must be based on a sample of data
90
Inference
• The central problem of computational Probability
Theory is the inference problem:
• Given a set of random variables X1, … , Xk and
their joint density P(X1, … , Xk), compute one or
more conditional densities given observations.
– Compute
  • P(X1 | X2, …, Xk)
  • P(X3 | X1)
  • P(X1, X2 | X3, X4)
  • Etc.
• Many problems can be formulated in these
terms.
91
Bayes decision rule
• w: ambiguous word
• S = {s1, s2, …, sn}: senses of w
• C = {c1, c2, …, cn}: contexts of w in a corpus
• V = {v1, v2, …, vj}: words used as contextual features for disambiguation
• Bayes decision rule
– Decide sj if P(sj | c) > P(sk | c) for sj ≠ sk
• We want to assign w to the sense s’ where
s’ = argmaxsk P(sk | c)
92
Graphical Models
• Within the Machine Learning framework
• Probability theory plus graph theory
• Widely used
  – NLP
  – Speech recognition
  – Error correcting codes
  – Systems diagnosis
  – Computer vision
  – Filtering (Kalman filters)
  – Bioinformatics
93
(Quick intro to)
Graphical Models
• Nodes are random variables
• Edges are annotated with conditional probabilities
• Absence of an edge between nodes implies conditional independence
• "Probabilistic database"

[Example graph: A → B, A → C, D → C, with local probabilities P(A), P(D), P(B|A), P(C|A,D)]
94
Graphical Models
• Define a joint probability distribution:
  P(X1, .., XN) = ∏i P(Xi | Par(Xi))
  P(A, B, C, D) = P(A) P(D) P(B|A) P(C|A,D)
• Learning
  – Given data, estimate the parameters P(A), P(D), P(B|A), P(C|A,D)

[Same example graph: A → B, A → C, D → C]
95
Graphical Models
• Define a joint probability distribution:
  P(X1, .., XN) = ∏i P(Xi | Par(Xi))
  P(A, B, C, D) = P(A) P(D) P(B|A) P(C|A,D)
• Learning
  – Given data, estimate P(A), P(B|A), P(D), P(C|A,D)
• Inference: compute conditional probabilities, e.g., P(A|B, D) or P(C|D)
• Inference = probabilistic queries
• General inference algorithms (e.g. Junction Tree)

[Same example graph: A → B, A → C, D → C]
96
Naïve Bayes models
• Simple graphical model
[Graph: class Y with children x1, x2, x3]
• Xi depend on Y
• Naïve Bayes assumption: all xi are
independent given Y
• Currently used for text classification and
spam detection
97
Naïve Bayes models
Naïve Bayes for document classification
[Graph: topic with children w1, w2, …, wn]
Inference task: P(topic | w1, w2 … wn)
98
Naïve Bayes for WSD

[Graph: sense sk with children v1, v2, v3]

• Recall the general joint probability distribution:
  P(X1, .., XN) = ∏i P(Xi | Par(Xi))
  P(sk, v1..v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)
99
Naïve Bayes for WSD

[Graph: sense sk with children v1, v2, v3]

P(sk, v1..v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)

Estimation (Training): Given data, estimate:
P(sk), P(v1|sk), P(v2|sk), P(v3|sk)
100
Naïve Bayes for WSD

[Graph: sense sk with children v1, v2, v3]

P(sk, v1..v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)

Estimation (Training): Given data, estimate:
P(sk), P(v1|sk), P(v2|sk), P(v3|sk)

Inference (Testing): Compute conditional probabilities of interest: P(sk | v1, v2, v3)
101
Graphical Models
• Given Graphical model
– Do estimation (find parameters from data)
– Do inference (compute conditional
probabilities)
• How do I choose the model structure (i.e.
the edges)?
102
How to choose the model
structure?
[Four candidate structures over sk, v1, v2, v3, differing in which edges are present and how they are directed]
103
Model structure
• Learn it: structure learning
– Difficult & need a lot of data
• Knowledge of the domain and of the
relationships between the variables
– Heuristics
– The fewer dependencies (edges) we can
have, the “better”
• Sparsity: more edges, need more data
  – Direction of arrows
[Example: edges from sk, v1, v2 into v3 would require estimating P(v3 | sk, v1, v2)]
104
Generative vs. discriminative
Generative: [Graph: sk → v1, v2, v3]
  P(sk, v1..v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1|sk) P(v2|sk) P(v3|sk)
  Estimation (Training): Given data, estimate P(sk), P(v1|sk), P(v2|sk) and P(v3|sk)
  Inference (Testing): Compute P(sk | v1, v2, v3)
  Do inference to find the probability of interest

Discriminative: [Graph: v1, v2, v3 → sk]
  P(sk, v1..v3) = P(v1) P(v2) P(v3) P(sk | v1, v2, v3)
  Estimation (Training): Given data, estimate P(v1), P(v2), P(v3), and P(sk | v1, v2, v3)
  The conditional probability of interest is "ready": P(sk | v1, v2, v3), i.e. modeled directly (there are algorithms to find these conditional probabilities, not covered here)
  The probability of interest is modeled directly
105
Naïve Bayes for topic classification
[Graph: topic T with children w1, w2, …, wn]

Recall the general joint probability distribution:
P(X1, .., XN) = ∏i P(Xi | Par(Xi))
P(T, w1..wn) = P(T) P(w1|T) P(w2|T) … P(wn|T) = P(T) ∏i P(wi|T)
Estimation (Training): Given data, estimate:
P(T), P(wi | T)
Inference (Testing): Compute conditional probabilities:
P(T | w1, w2, ..wn)
106
Exercise
• Topic = politics (num words = 19)
  • D1: Obama hoping rally support
  • D2: billion stimulus package
  • D3: House Republicans tax
  • D4: cuts spending GOP games
  • D4: Republicans obama open
  • D5: political season

• Topic = sport (num words = 15)
  • D1: 2009 open season
  • D2: against Maryland Sept
  • D3: play six games
  • D3: schedule games weekends
  • D4: games games games
Estimate P̂(wi | Tj) and P̂(Tj) for each wi, Tj:

  P̂(wi | Tj) = P̂(wi, Tj) / P̂(Tj)
  P̂(wi, Tj) = c(wi, Tj) / TotWords
  P̂(Tj) = c(words in Tj) / TotWords

P(obama | T = politics) = P(w = obama, T = politics) / P(T = politics) = (c(w = obama, T = politics)/34) / (19/34) = 2/19
P(obama | T = sport) = P(w = obama, T = sport) / P(T = sport) = (c(w = obama, T = sport)/34) / (15/34) = 0
P(season | T = politics) = P(w = season, T = politics) / P(T = politics) = (c(w = season, T = politics)/34) / (19/34) = 1/19
P(season | T = sport) = P(w = season, T = sport) / P(T = sport) = (c(w = season, T = sport)/34) / (15/34) = 1/15
P(republicans | T = politics) = P(w = republicans, T = politics) / P(T = politics) = c(w = republicans, T = politics)/19 = 2/19
P(republicans | T = sport) = P(w = republicans, T = sport) / P(T = sport) = c(w = republicans, T = sport)/15 = 0/15 = 0
107
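A minimal sketch reproducing these MLE estimates in plain Python (documents copied from the exercise; no smoothing):

from collections import Counter

politics = ("obama hoping rally support billion stimulus package house "
            "republicans tax cuts spending gop games republicans obama "
            "open political season").split()
sport = ("2009 open season against maryland sept play six games schedule "
         "games weekends games games games").split()

counts = {"politics": Counter(politics), "sport": Counter(sport)}
n_words = {"politics": len(politics), "sport": len(sport)}   # 19 and 15
total = sum(n_words.values())                                # 34

def p_topic(t):
    return n_words[t] / total

def p_word_given_topic(w, t):
    return counts[t][w] / n_words[t]

print(p_word_given_topic("obama", "politics"))       # 2/19
print(p_word_given_topic("republicans", "sport"))    # 0.0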
Exercise: inference
• What is the topic of new documents:
– Republicans obama season
– games season open
– democrats kennedy house
108
Exercise: inference
• Recall: Bayes decision rule
Decide Tj if P(Tj | c) > P(Tk | c) for Tj ≠ Tk
c is the context, here the words of the documents
• We want to assign the topic T for which
T’ = argmaxTj P(Tj | c)
109
Exercise: Bayes classification
• We compute P(Tj | c) with Bayes rule:

  P(Tj | c) = P(c, Tj) / P(c) = P(c | Tj) P(Tj) / P(c)
  P(c | Tj) = P(w1, …, wn | Tj) = ∏i P(wi | Tj)   (because of the dependencies encoded in the GM)

  T̂ = argmaxTj P(Tj | c)
     = argmaxTj P(c | Tj) P(Tj) / P(c)   (Bayes rule)
     = argmaxTj P(c | Tj) P(Tj)
     = argmaxTj P(Tj) ∏i P(wi | Tj)      (this GM)
110
Exercise: Bayes classification
T' = argmaxTj P(Tj) ∏i P(wi | Tj)

That is, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher.

New sentence: republicans obama season

T = politics?
P(politics | c) = P(politics) P(republicans|politics) P(obama|politics) P(season|politics) = 19/34 × 2/19 × 2/19 × 1/19 > 0

T = sport?
P(sport | c) = P(sport) P(republicans|sport) P(obama|sport) P(season|sport) = 15/34 × 0 × 0 × 1/15 = 0

Choose T = politics
111
Exercise: Bayes classification
T' = argmaxTj P(Tj) ∏i P(wi | Tj)

That is, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher.

New sentence: democrats kennedy house

T = politics?
P(politics | c) = P(politics) P(democrats|politics) P(kennedy|politics) P(house|politics) = 19/34 × 0 × 0 × 1/19 = 0

democrats, kennedy: unseen words → data sparsity
How can we address this?
112
Today’s review
• Grammar
  – Morphology
  – Part-of-speech (POS)
  – Phrase level syntax
• Lower level text processing
• Semantics
  – Lexical semantics
  – Word sense disambiguation (WSD)
  – Lexical acquisition
• Corpus-based statistical approaches to tackle NLP problems
• Corpora
• Intro to probability theory and graphical models (GM)
  – Example for WSD
  – Language Models (LM) and smoothing
113
Language Models
• Model to assign scores to sentences
P(w1, w2, ..., wN)
• Probabilities should broadly indicate likelihood of
sentences
– P( I saw a van) >> P( eyes awe of an)
• Not grammaticality
– P(artichokes intimidate zippers) ≈ 0
• In principle, “likely” depends on the domain, context,
speaker…
114
Adapted from Dan Klein’s CS 288 slides
Language models
• Related: the task of predicting the next word
P(wn | w1, ..., wn−1)
• Can be useful for
– Spelling corrections
• I need to notified the bank
  – Machine translation
  – Speech recognition
  – OCR (optical character recognition)
  – Handwriting recognition
  – Augmentative communication
    • Computer systems to help the disabled in communication
    – For example, systems that let users choose words with hand movements
115
Language Models
• Model to assign scores to sentences
P(w1, w2, ..., wN)
– Sentence: w1, w2, …, wN
– Break the sentence probability down with the chain rule (no loss of generality):
  P(w1, w2, ..., wN) = ∏i P(wi | w1, w2, ..., wi−1)
– Too many histories!
116
Markov assumption: n-gram
solution
P(wi w1w2 ,..., wi 1 )
w1
wi
• Markov assumption: only the prior local context -- the last “few” n words– affects the next word
• N-gram models: assume each word depends
only on a short linear history
– Use N-1 words to predict the next one
P(wi | wi  n , , , , wi  1 )
P( w1 , w2 ,..., wN )   P( wi wi n ...wi 1 )
i
Wi-3
wi
117
Markov assumption: n-gram
solution
• Unigrams (n = 1)
  P(wi | wi−n, ..., wi−1) ≈ P(wi)
  P(w1, w2, ..., wN) = ∏i P(wi)
• Bigrams (n = 2)
  P(wi | wi−n, ..., wi−1) ≈ P(wi | wi−1)
  P(w1, w2, ..., wN) = ∏i P(wi | wi−1)
• Trigrams (n = 3)
  P(wi | wi−n, ..., wi−1) ≈ P(wi | wi−2, wi−1)
  P(w1, w2, ..., wN) = ∏i P(wi | wi−2, wi−1)
118
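A minimal sketch (assumed setup) of MLE bigram estimates with NLTK's conditional frequency distributions:

import nltk
from nltk.corpus import brown

words = [w.lower() for w in brown.words(categories='news')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
total = cfd['the'].N()                   # number of bigrams starting with 'the'
for w, c in cfd['the'].most_common(5):
    print(w, c / total)                  # relative-frequency estimate of P(w | 'the')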
Choice of n
• In principle we would like the n of the n-gram to
be large
  – green
  – large green
  – the large green
  – swallowed the large green
  – "swallowed" should influence the choice of the next word (mountain is unlikely, pea more likely)
  – The crocodile swallowed the large green …
  – Mary swallowed the large green …
  – And so on…
119
Discrimination vs. reliability
• Looking at longer histories (large n) should allow us to make better predictions (better discrimination)
• But it’s much harder to get reliable
statistics since the number of parameters
to estimate becomes too large
– The larger n, the larger the number of
parameters to estimate, the larger the data
needed to do statistically reliable estimations
120
Language Models
• N = size of vocabulary
• Unigrams: P(wi | wi−n, ..., wi−1) ≈ P(wi)
  – For each wi calculate P(wi): N such numbers → N parameters
• Bigrams: P(w1, w2, ..., wN) = ∏i P(wi | wi−1)
  – For each wi, wj calculate P(wi | wj): N×N parameters
• Trigrams: P(w1, w2, ..., wN) = ∏i P(wi | wi−1, wi−2)
  – For each wi, wj, wk calculate P(wi | wj, wk): N×N×N parameters
121
N-grams and parameters
• Assume we have a vocabulary of 20,000 words
• Growth in number of parameters for n-gram models:

  Model             Parameters
  Bigram model      20,000^2 = 400 million
  Trigram model     20,000^3 = 8 trillion
  Four-gram model   20,000^4 = 1.6 × 10^17
122
Sparsity
• Zipf’s law: most words are rare
– This makes frequency-based approaches to language
hard
• New words appear all the time, new bigrams more often,
trigrams or more, still worse!
P(wi) = c(wi) / N                                        → P(wi) = 0 if c(wi) = 0
P(wi | wi−1) = c(wi, wi−1) / c(wi−1)                     → 0 if c(wi, wi−1) = 0
P(wi | wi−1, wi−2) = c(wi, wi−1, wi−2) / c(wi−1, wi−2)   → 0 if c(wi, wi−1, wi−2) = 0

• These relative frequency estimates are the MLE (maximum likelihood estimates): the choice of parameters that gives the highest probability to the training corpus
123
Sparsity
• The larger the number of parameters, the
more likely it is to get 0 probabilities
• Note also the product:
P(w1, w2, ..., wN) = ∏i P(wi | wi−n ... wi−1)
• If we have one 0 for an unseen event, the 0
propagates and gives us 0 probabilities for the
whole sentence
124
Tackling data sparsity
• Discounting or smoothing methods
– Change the probabilities to avoid zeros
– Remember: probabilities have to sum to 1
– Decrease the nonzero probabilities (seen events) and give the remaining probability mass to the zero-probability (unseen) events
125
Smoothing
126
From Dan Klein’s CS 288 slides
Smoothing
• Put probability mass on "unseen events"
• MLE:  P(w | w1) = c(w, w1) / c(w1)
• Add one / delta (uniform prior):
  P(w | w1) = (c(w, w1) + δ·(1/V)) / (c(w1) + δ)
• Add one / delta (unigram prior):
  P(w | w1) = (c(w, w1) + δ·P̂(w)) / (c(w1) + δ)
• Linear interpolation:
  P(w | w1) = λ1·P̂(w | w1) + λ2·P̂(w)
• …
127
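A minimal sketch (not from the slides) of add-delta smoothing applied to the word-given-topic estimates from the earlier exercise, so that unseen words such as "democrats" no longer zero out the product (the vocabulary size is an assumed value):

from collections import Counter

politics = ("obama hoping rally support billion stimulus package house "
            "republicans tax cuts spending gop games republicans obama "
            "open political season").split()
counts = Counter(politics)
V = 10000          # assumed vocabulary size
delta = 1.0        # add-one smoothing

def p_word_given_politics(w):
    # (c(w, T) + delta) / (c(T) + delta * V)
    return (counts[w] + delta) / (len(politics) + delta * V)

print(p_word_given_politics("obama"))       # seen word
print(p_word_given_politics("democrats"))   # unseen word: small but nonzero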
Smoothing: Combining estimators
• Make linear combination of multiple probability
estimates
– (Providing that we weight the contribution of each of
them so that the result is another probability function)
• Linear interpolation or mixture models
PLI(wi | wi−1, wi−2) = λ1·P1(wi) + λ2·P2(wi | wi−1) + λ3·P3(wi | wi−1, wi−2)
with 0 ≤ λi ≤ 1 and Σi λi = 1
128
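A minimal sketch (hypothetical helper) of linear interpolation, where p1, p2 and p3 would be dictionaries of MLE unigram, bigram and trigram estimates; the lambda weights are assumed example values (summing to 1) that in practice would be tuned on held-out data:

def interpolated_trigram(w, w1, w2, p1, p2, p3, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas
    return (l1 * p1.get(w, 0.0)
            + l2 * p2.get((w1, w), 0.0)
            + l3 * p3.get((w2, w1, w), 0.0))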
Smoothing: Combining estimators
• Back-off models
– Special case of linear interpolation
PBO(wi | wi−1, wi−2) =
  C(wi, wi−1, wi−2) / C(wi−1, wi−2)   if C(wi, wi−1, wi−2) > k
  PBO(wi | wi−1)                      otherwise
129
Smoothing: Combining estimators
• Back-off models: trigram version
C ( wi , wi 1 , wi  2 )


P
(
w
i | wi 1 , wi  2 )  

C ( wi 1 , wi  2 )



C ( wi , wi 1 )
PBO( wi | wi 1 , wi  2 )  
1 P( wi | wi 1 )  1
C ( wi 1 )



C ( wi )

P
(
w
i
)



2
2
N

if C ( wi , wi 1 , wi  2 )  0
if C ( wi , wi 1 , wi  2 )  0
and C ( wi , wi 1 )  0
otherwise
130
Today’s review
• Grammar
  – Morphology
  – Part-of-speech (POS)
  – Phrase level syntax
• Lower level text processing
• Semantics
  – Lexical semantics
  – Word sense disambiguation (WSD)
  – Lexical acquisition
• Corpus-based statistical approaches to tackle NLP problems
• Corpora
• Intro to probability theory and graphical models (GM)
  – Example for WSD
  – Language Models (LM) and smoothing

• What I hope we achieved
  • Overall idea of linguistic problems
  • Overall understanding of "lower level" NLP tasks
    – POS, WSD, language models, segmentation, etc.
    – NOTE: will be used for preprocessing and as features for higher level tasks
  • Initial understanding of Stat NLP
    – Corpora & annotation
    – Probability theory, GM
    – Sparsity problem
  • Familiarity with Python and NLTK
131
Next classes
• How do we now tackle ‘higher level’ NLP problems?
• NLP applications
• Text Categorization
  – Classify documents by topics, language, author, spam filtering, information retrieval (relevant, not relevant), sentiment classification (positive, negative)
• Spelling & Grammar Corrections
• Information Extraction
• Speech Recognition
• Information Retrieval
  – Synonym Generation
• Summarization
• Machine Translation
• Question Answering
• Dialog Systems
  – Language generation
132
The NLP Pipeline
• For a given problem to be tackled:
1. Choose corpus (or build your own)
   – Low level processing done to the text before the 'real work' begins
   – Low-level formatting issues
     • Junk formatting/content (HTML tags, tables)
     • Case change (i.e. everything to lower case)
     • Tokenization, sentence segmentation
2. Choose annotation to use (or choose the label set and label it yourself)
   – Important but often neglected: check labeling (inconsistencies etc.)
3. Extract features
4. Choose or implement new NLP algorithms
5. Evaluate
6. (Eventually) re-iterate
133
Next classes
• Classification: important, since many NLP applications can be framed as classification
  – Text Categorization (topics, language, author, spam filtering, sentiment classification (positive, negative))
– Information Extraction
– Information Retrieval
• Feature extraction
• Projects
• NLP applications
134