Supervised and Unsupervised Learning for Natural Language Processing
Manaal Faruqui
Language Technologies Institute
SCS, CMU
Natural Language Processing
Linguistics + Computer Science
Natural Language Processing
But why?
• Inability to handle large amounts of data
• Much, much faster information access
Natural Language Processing
How can this be done?
• Can you teach a computer?
Natural Language Processing = Mathematics
Are you kidding me!
Using maths to learn language?
Machine Learning
Teaching computers to make decisions like humans
Computer vision
Machine Translation
Clustering
Machine Learning
• Supervised: learning by examples
• Unsupervised: learning by patterns
• Semisupervised: learning by patterns + examples
Formal & Informal address
• Most languages distinguish formal (V) and informal (T) address in direct speech (Brown and Gilman, 1960)
• Formal address: Neutrality, distance
• Informal address: Friends, subordinates
• Variety of realization in different languages
• French: Pronoun usage (Vous/Tu)
• German: Pronoun usage (Sie/Du)
• Hindi: Pronoun usage (Aap/Tum)
• Japanese: Verbal inflections
• English: ???
Main goals of this work
• Goal 1: Determine whether English distinguishes between V & T consistently
• If yes, what are the indicators?
• Goal 2: Develop a computational model that labels English sentences as T or V
• Ideally without spending effort on annotation
Methodology
• Use a parallel corpus to analyze aligned sentences with overt (De) T/V choice and covert (En) T/V choice
• For Goal 1: Compare De & En sentences
• For Goal 2: Project De labels onto En sentences
Digression: Creation of a parallel corpus
• Current parallel corpora not suitable
• Europarl: Overwhelmingly formal (99%)
• Newswire: No dialogue
• Creation of a new corpus: De-En literary texts
• 106 19th century novels (Project Gutenberg)
• Sentence-aligned: Gargantua (Braune & Fraser 2010)
• POS-tagged (Schmid 1994)
• German sentence can be labeled as T, V or None
• Using orthographic rules
• Corpus: http://cs.cmu.edu/~mfaruqui
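The labeling and projection steps can be sketched as follows. This is a toy illustration only: the pronoun lists are assumed, simplified cues and are not the orthographic rules used to build the corpus, and the helper names `label_german` and `project_labels` are hypothetical.

```python
import re

# Assumed, simplified cues; the corpus's actual orthographic rules are not given here.
T_PRONOUNS = {"du", "dich", "dir", "dein", "deine"}
V_PRONOUNS = {"Sie", "Ihnen", "Ihr", "Ihre"}


def label_german(sentence):
    """Return 'T', 'V', or None for a German sentence (toy rule set)."""
    tokens = re.findall(r"\w+", sentence)
    # Skip the first token for the V check: it is capitalized in any case,
    # so a sentence-initial "Sie" may just mean "she" or "they".
    rest = tokens[1:]
    if any(tok.lower() in T_PRONOUNS for tok in tokens):
        return "T"
    if any(tok in V_PRONOUNS for tok in rest):
        return "V"
    return None


def project_labels(aligned_pairs):
    """Copy each German T/V label onto the aligned English sentence."""
    projected = []
    for german, english in aligned_pairs:
        label = label_german(german)
        if label is not None:
            projected.append((english, label))
    return projected


# Invented sentence pairs, for illustration only.
pairs = [
    ("Wo bist du gewesen?", "Where have you been?"),
    ("Haben Sie gut geschlafen?", "Did you sleep well?"),
    ("Der Regen fiel.", "The rain fell."),
]
print(project_labels(pairs))
```

The English side never has to be inspected: sentences whose German counterpart carries no T/V cue are simply left unlabeled.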
Goal 1: Compare De and En address
• Give English monolingual text to human annotators
• Ask for T/V judgment
• Their annotation provides the following information
• How well do annotators agree on English text?
• Does English monolingual text provide enough
information to identify T/V? (1a)
• How well do annotators agree with copied labels?
• Is there a direct correspondence ? (1b)
• Only if this is the case is the copying of labels
appropriate
Experiment 1: Human Annotation
• 200 randomly drawn English sentences
• Two annotators (“A1”, “A2”)
• Two conditions:
– No context: just one sentence
– In context: three sentences each of preceding and following context
Results: Reliability
              No context      In context
A1 vs. A2     .75 (κ = .49)   .79 (κ = .58)
• Context improves reliability
– Many sentences cannot be tagged with T/V in isolation
“And she is a sort of relation of your lordship’s,” said Dawson.
“And perhaps sometime you may see her.”
Goal 1a ✓
• Reliability in context is reasonable: English does provide strong clues on T/V
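The κ values reported with the raw agreement are chance-corrected (Cohen's kappa). A minimal sketch of the computation, using invented toy labels rather than the study's annotations:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations (invented): raw agreement is 6/8 = .75.
a1 = ["T", "T", "V", "V", "T", "V", "T", "V"]
a2 = ["T", "V", "V", "V", "T", "T", "T", "V"]
kappa = cohens_kappa(a1, a2)
print(kappa)
```

With balanced labels, chance agreement is .5, so a raw agreement of .75 shrinks to a kappa of .5, which is why the κ figures are noticeably lower than the raw ones.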
Results: Correspondence
                           No context      In context
(A1 ∩ A2) vs. Projection   .67 (κ = .34)   .79 (κ = .58)
• Agreement with German projected labels again reasonable,
but not perfect
Goal 1b ✓
• Error analysis showed strong influence of social norms
• Example: Lovers in 19th cent. novels use V (!)
[...] she covered her face with the other to conceal her tears. “Corinne!”, said
Oswald, “Dear Corinne! My absence has then rendered you unhappy!”
Experiment 2: Prediction of T/V
• Copy German T/V labels onto English: No annotation
• Learn L2-regularized logit classifier on train set; optimize
on dev set; evaluate on test set
• Feature candidates :
– Lexical features (bag-of-words, χ² feature selection)
– Distributional semantic word classes
• 200 word classes clustered with the algorithm by Clark (2003)
– Politeness theory (Brown & Levinson 1987)
• Polite speech has specific features, which are inherited by V
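As a rough sketch of χ² feature selection over bag-of-words features: each word is scored by how strongly its presence is associated with the V or T label, and only the top-scoring words are kept. The toy sentences and the function names are invented for illustration; the actual selection was done with a toolkit.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/label contingency table.

    n11: V sentences containing the word, n10: T sentences containing it,
    n01: V sentences without it,          n00: T sentences without it.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0


def rank_words(labeled, top_k):
    """Return the top_k words by chi-square score (ties broken alphabetically)."""
    vocab = sorted({w for words, _ in labeled for w in words})
    scores = {}
    for word in vocab:
        n11 = sum(1 for ws, y in labeled if word in ws and y == "V")
        n10 = sum(1 for ws, y in labeled if word in ws and y == "T")
        n01 = sum(1 for ws, y in labeled if word not in ws and y == "V")
        n00 = sum(1 for ws, y in labeled if word not in ws and y == "T")
        scores[word] = chi_square(n11, n10, n01, n00)
    return sorted(vocab, key=lambda w: (-scores[w], w))[:top_k]


# Invented toy data: each sentence is a set of words plus a projected label.
data = [
    ({"sir", "you"}, "V"), ({"madam", "you"}, "V"), ({"sir", "please"}, "V"),
    ({"dear", "you"}, "T"), ({"you", "come"}, "T"), ({"dear", "come"}, "T"),
]
print(rank_words(data, 3))
```

In this toy set "you" occurs equally often with both labels and scores zero, which mirrors why English needs other lexical cues than the pronoun itself.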
Supervised Learning
Logistic regression classifier
• Linear combination of features
• Every feature assigned a weight according to its importance
• higher weight = more importance
• L2 regularization to avoid overfitting
• Used “Weka” as the open-source toolkit
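The classifier described above can be sketched in a few lines of plain Python. This is not Weka's implementation; the learning rate, penalty strength, number of epochs, and the three toy features are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on the L2-regularized logistic loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Linear combination of features, squashed to a probability.
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # The L2 penalty adds l2 * w to the gradient, shrinking weights toward zero.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Toy features (assumed): [contains "sir", contains "dear", contains "you"]; 1 = V, 0 = T.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print(w, b)
```

The learned weight for "sir" comes out positive and the one for "dear" negative, while "you", which appears with both labels, stays near zero: higher absolute weight means more importance.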
Context
• As shown by human annotation: Individual sentences
often insufficient for classification
• Simplest solution: Compute features over a window of
context sentences
– Problem: context typically includes non-speech
sentences
“I am going to see his ghost!” Lorry quietly chafed the hands that held
his arm.
Context
• Our solution: a simple “direct speech” recognizer, a CRF-based sequence tagger (Mallet) trained on 1000 sentences
• Ideal results for 8 sentences of direct speech context: +5% accuracy over no context
B-SP: “I am going to see his ghost!”
O: Lorry quietly chafed the hands that held his arm.
Speech context
Sentence context
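Once sentences carry speech tags, building a speech-only context is a simple filter. The sketch below assumes one tag per sentence ("B-SP" for direct speech, "O" otherwise) and collects only preceding sentences; both are simplifying assumptions, and `speech_context` is a hypothetical helper name.

```python
def speech_context(sentences, tags, index, window=8):
    """Collect up to `window` direct-speech sentences preceding `index`.

    `tags` holds one tag per sentence: "B-SP" for direct speech, "O" otherwise.
    """
    context = []
    for i in range(index - 1, -1, -1):
        if tags[i] == "B-SP":
            context.append(sentences[i])
            if len(context) == window:
                break
    context.reverse()  # restore document order
    return context


sentences = [
    "“I am going to see his ghost!”",
    "Lorry quietly chafed the hands that held his arm.",
    "“Come with me.”",  # invented third sentence, for illustration
]
tags = ["B-SP", "O", "B-SP"]
print(speech_context(sentences, tags, 2))
```

Features for the target sentence would then be computed over the returned list instead of over the raw neighboring sentences, so narration such as the Lorry sentence does not dilute them.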
Quantitative results
(Faruqui & Pado, 2011; 2012)
Model                     Accuracy
Frequency BL (V)          59.1
Lexical features          67.0
Semantic class features   57.5
Politeness features       59.6
• Only lexical features yield significant improvement over
frequency baseline
Goal 2 ✓
Qualitative analysis: Lexical features
Top 10 lexical features
Conclusions
• Formal and informal language exists in English as well
– Indicators more dispersed across context
• Bootstrapping a T/V classifier for English possible
• Results still fairly modest
– Asymmetry: V more marked than T → better features
– Difficult to operationalize features with high recall
(sociolinguistic features, first names, …)
References
• M. Faruqui & S. Pado. “I thou thee, thou traitor”: Predicting formal vs. informal address in English literature. ACL 2011.
• M. Faruqui & S. Pado. Towards a model of formal and informal address in English. EACL 2012.
• Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA.
• Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.
• Fabienne Braune & Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. COLING 2010.
• Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.
• Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Unsupervised Learning
Learning by finding patterns in data
Clustering
Word clustering
Why?
• Feature reduction
• From words to word classes
• Generalization of unseen words
• Bangalore ~ Bengaluru
• Identification of words with similar meaning
• Word-sense disambiguation
• Reduces the need for tagged data
Word clustering
How?
• Distributional similarity
• How similar is the occurrence pattern of two words in a given corpus?
“You shall know a word by the company it keeps” – J. R. Firth
• Morphological similarity
• How similar are two words orthographically?
• Madras ~ Chennai … NO
• Bangalore ~ Bengaluru … YES
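Both notions of similarity can be illustrated with a short sketch. The choices here are assumptions made for the example, not the measures used in the clustering work: context is reduced to the immediate left and right neighbor, vectors are compared by cosine, and orthographic similarity uses the character-matching ratio from Python's `difflib`. The toy corpus is invented.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def context_vector(word, corpus):
    """Count the words that appear directly left or right of `word`."""
    counts = Counter()
    for sentence in corpus:
        for i, tok in enumerate(sentence):
            if tok == word:
                if i > 0:
                    counts["L:" + sentence[i - 1]] += 1
                if i + 1 < len(sentence):
                    counts["R:" + sentence[i + 1]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def orthographic_similarity(a, b):
    """Ratio of matching characters, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


corpus = [
    ["she", "lives", "in", "Madras"],
    ["she", "lives", "in", "Chennai"],
    ["he", "eats", "rice"],
]
dist_city = cosine(context_vector("Madras", corpus), context_vector("Chennai", corpus))
dist_other = cosine(context_vector("Madras", corpus), context_vector("rice", corpus))
orth_same = orthographic_similarity("Bangalore", "Bengaluru")
orth_diff = orthographic_similarity("Madras", "Chennai")
print(dist_city, dist_other, orth_same, orth_diff)
```

Madras and Chennai share their contexts but almost no characters, while Bangalore and Bengaluru overlap heavily in spelling, so the two signals complement each other.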
Word clustering
Language modeling approach
1. Ranjitha cooks Uttapam.
2. Ranjitha cooks Rava masala dosa.
3. Ranjitha cooks Facebook.
How do you know which one is wrong?
Word clustering
Language modeling approach
• Maximize the probability of occurrence of a sequence of words
S: Ranjitha cooks Facebook
• P(S) = P(Ranjitha) * P(cooks|Ranjitha) * P(Facebook|cooks)
• P(Facebook|cooks) will be very near zero OR zero!
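The bigram factorization above can be sketched with maximum-likelihood counts. The tiny training corpus is invented, and a start symbol `<s>` is an added assumption so that P(Ranjitha) becomes P(Ranjitha|<s>).

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams, with <s> marking the sentence start."""
    histories = Counter()
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams


def sentence_prob(sent, histories, bigrams):
    """P(S) as a product of maximum-likelihood bigram probabilities."""
    tokens = ["<s>"] + sent
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if histories[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / histories[prev]
    return prob


# Invented toy corpus.
train = [
    ["Ranjitha", "cooks", "Uttapam"],
    ["Ranjitha", "cooks", "dosa"],
    ["Ranjitha", "uses", "Facebook"],
]
histories, bigrams = train_bigram(train)
p_good = sentence_prob(["Ranjitha", "cooks", "Uttapam"], histories, bigrams)
p_bad = sentence_prob(["Ranjitha", "cooks", "Facebook"], histories, bigrams)
print(p_good, p_bad)
```

Because "cooks Facebook" never occurs in training, the last factor is zero and so is the whole product, even though every individual word is known.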
Word clustering
S: w1 w2 w3 w4
[Diagram: class sequence C1, C2, C3, C4, with each class Ci emitting word wi]
P(S) = P(C1) * P(w1|C1) * P(C2|C1) * P(w2|C2) * …
(Och, 1999)
This is called a Hidden Markov Model (HMM)
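The class-based factorization can be evaluated directly once every word has a class. The sketch below assumes a hard clustering (one class per word); the class names and all probability values are invented for illustration.

```python
def class_sequence_prob(words, word_class, class_start, class_trans, word_given_class):
    """P(S) = P(C1) * P(w1|C1) * P(C2|C1) * P(w2|C2) * ... for a hard clustering."""
    prob = 1.0
    prev = None
    for w in words:
        c = word_class[w]
        if prev is None:
            prob *= class_start.get(c, 0.0)          # P(C1)
        else:
            prob *= class_trans.get((prev, c), 0.0)  # P(Ci | Ci-1)
        prob *= word_given_class.get((w, c), 0.0)    # P(wi | Ci)
        prev = c
    return prob


# Invented toy model.
word_class = {"Ranjitha": "NAME", "cooks": "VERB", "Uttapam": "FOOD", "dosa": "FOOD"}
class_start = {"NAME": 1.0}
class_trans = {("NAME", "VERB"): 1.0, ("VERB", "FOOD"): 1.0}
word_given_class = {
    ("Ranjitha", "NAME"): 1.0,
    ("cooks", "VERB"): 1.0,
    ("Uttapam", "FOOD"): 0.5,
    ("dosa", "FOOD"): 0.5,
}
p = class_sequence_prob(["Ranjitha", "cooks", "dosa"], word_class,
                        class_start, class_trans, word_given_class)
print(p)
```

The probability of "cooks dosa" no longer depends on having seen that exact word pair; it only needs the transition from the verb class to the food class.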
Word clustering
Adding morphology (Clark, 2003)
[Diagram: class sequence C1, C2, C3, C4, with each class Ci emitting word wi]
P(S) = P(C1) * P(w1|C1) * Pm(w1|C1) * P(C2|C1) * P(w2|C2) * Pm(w2|C2) * …
Word clustering
Implementation
• Initialization of clusters
• Randomized
• Heuristic-based
• Optimization algorithm
• Greedy, since no closed-form solution exists
• Transfer each word to the cluster that gives the highest improvement
• Termination
• Until no more words are exchanged
• Until fewer than a specified number of words are exchanged
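The greedy exchange procedure can be sketched as below. This is a deliberately naive version that recomputes the full class-bigram log-likelihood for every candidate move; it is not the implementation of any particular clustering package, and the round-robin initialization, the `max_rounds` cap, and the toy bigrams are assumptions.

```python
import math
from collections import Counter


def log_likelihood(bigrams, assign):
    """Log-likelihood of the bigram tokens under a class bigram model."""
    class_bigram = Counter()
    class_hist = Counter()   # class counts in the left (history) position
    class_emit = Counter()   # class counts in the right (emitting) position
    word_count = Counter()
    for prev, word in bigrams:
        class_bigram[(assign[prev], assign[word])] += 1
        class_hist[assign[prev]] += 1
        class_emit[assign[word]] += 1
        word_count[word] += 1
    ll = 0.0
    for (c1, _), n in class_bigram.items():
        ll += n * math.log(n / class_hist[c1])             # P(c2 | c1)
    for word, n in word_count.items():
        ll += n * math.log(n / class_emit[assign[word]])   # P(word | class)
    return ll


def initial_assignment(vocab, n_clusters):
    """Simple deterministic initialization: round-robin over the sorted vocabulary."""
    return {w: i % n_clusters for i, w in enumerate(sorted(vocab))}


def exchange_cluster(bigrams, n_clusters, max_rounds=20):
    vocab = sorted({w for pair in bigrams for w in pair})
    assign = initial_assignment(vocab, n_clusters)
    for _ in range(max_rounds):
        moved = 0
        for w in vocab:
            original = assign[w]
            best_c, best_ll = original, log_likelihood(bigrams, assign)
            for c in range(n_clusters):
                if c == original:
                    continue
                assign[w] = c
                ll = log_likelihood(bigrams, assign)
                if ll > best_ll + 1e-12:
                    best_c, best_ll = c, ll
            assign[w] = best_c   # keep the move with the highest improvement
            if best_c != original:
                moved += 1
        if moved == 0:           # terminate when no word is exchanged
            break
    return assign


# Invented toy bigrams from four short sentences.
bigrams = [("she", "cooks"), ("cooks", "rice"), ("he", "cooks"), ("cooks", "dosa"),
           ("she", "eats"), ("eats", "rice"), ("he", "eats"), ("eats", "dosa")]
clusters = exchange_cluster(bigrams, 2)
print(clusters)
```

Every accepted move strictly increases the log-likelihood, so the loop cannot cycle, but the result is only a local optimum and depends on the initialization.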
Word clustering
Application / Evaluation
• Named Entity Recognition
• Identification and labeling of names of people, places, organizations etc.
• Pre-processing task for many NLP applications
• Tags from the CoNLL-03 shared-task on NER:
• PERson, ORGanization, LOCation, MISCellaneous
(Sonia Gandhi)PER is an (Italian)MISC who lives in (India)LOC.
Named Entity Recognition
NER for German: Challenges
• Complex morphology: difficult lemmatization
• Sparse data: only one NE-tagged dataset (CoNLL 2003)
• Common noun capitalization: no easy entity detection
Poor performance, in particular poor recall
Named Entity Recognition
NER for German: Challenges
           Recall   Precision   F-Score
English    88.5%    89.0%       88.8%
German     63.7%    83.9%       72.4%
Recall is a problem!
• More training data can help, but it is expensive!
• Semantic generalization ?
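For reference, the F-score in the table is the harmonic mean of precision and recall. A small sketch, with `precision_recall` shown for sets of predicted and gold entities (the set representation is an assumption made for the example):

```python
def precision_recall(gold, predicted):
    """Precision and recall for sets of entity annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# German row from the table above.
german_f = f_score(83.9, 63.7)
print(round(german_f, 1))  # 72.4
```

Because the harmonic mean is pulled toward the smaller value, the low German recall drags the F-score well below the precision.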
Named Entity Recognition
Word clustering
• Provides a route to semantic generalization
But how can it help?
Deutschland (70), Ostdeutschland (0), Westdeutschland (5): LOC
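One way to picture the effect is as an extra feature on each token. In the sketch below the cluster ID 17 and the feature names are invented, and `token_features` is a hypothetical helper rather than part of any NER system's API.

```python
def token_features(token, cluster_of):
    """Features for one token; the cluster ID is added when the word has one."""
    feats = {
        "word=" + token.lower(): 1,
        "capitalized": int(token[:1].isupper()),
    }
    cluster = cluster_of.get(token)
    if cluster is not None:
        feats["cluster=" + str(cluster)] = 1
    return feats


# Hypothetical clustering: the three words share one cluster.
cluster_of = {"Deutschland": 17, "Ostdeutschland": 17, "Westdeutschland": 17}

seen = token_features("Deutschland", cluster_of)
unseen = token_features("Ostdeutschland", cluster_of)
shared = sorted(set(seen) & set(unseen))
print(shared)
```

A tagger that has learned from frequent words that the cluster feature points to LOC can apply the same evidence to a word it has rarely or never seen with a tag, since the clusters come from untagged text.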
Named Entity Recognition
Experimental setup
• Cluster German words with Clark’s clustering software on the basis of an untagged generalization corpus
• HGC, deWac (Baroni et al., 2009)
• Stanford’s CRF-based NER system (Finkel and Manning 2009)
• Training on an NER-tagged corpus (CoNLL 2003 German train set, newswire)
• Evaluate on CoNLL 2003 testb set (50M words, in-domain)
Named Entity Recognition
Results (Faruqui & Pado, 2010)
System                 Precision   Recall   F-Score
Florian et al. 2003    83.9%       63.7%    72.4%
Baseline (0/0)         84.5%       63.1%    72.3%
HGC (175m/600)         86.6%       71.2%    78.2%
deWac (175m/400)       86.4%       68.5%    76.4%
Multilingual word clustering
• Clustering words from two languages together
• If parallel data in two languages available
• Word alignments can give additional information
• Additional constraints may give better clustering
I, You, We, They, She
Ich, Sie, Uns, Er
Multilingual word clustering
[Figure: Language 1 and Language 2]
Multilingual word clustering
• Minimize the randomness of the clustering
• Minimize the entropy of the clustering
• If clustering of L1 is represented by a random variable X
• We want to minimize the entropy of one clustering given
the other:
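As an illustration of the quantity involved, the standard conditional entropy H(X|Y) can be estimated from aligned word pairs mapped to their cluster IDs. The symbol Y for the second language's clustering and the toy alignments are assumptions made for this sketch, which shows only the textbook definition.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """H(X | Y) in bits from a list of (x, y) observations."""
    joint = Counter(pairs)
    y_counts = Counter(y for _, y in pairs)
    n = len(pairs)
    h = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log2 p(x | y)
        h -= (n_xy / n) * math.log2(n_xy / y_counts[y])
    return h


# Each pair is (cluster of the L1 word, cluster of the aligned L2 word); invented data.
consistent = [(0, 0), (0, 0), (1, 1), (1, 1)]
scattered = [(0, 0), (1, 0), (0, 1), (1, 1)]
h_consistent = conditional_entropy(consistent)
h_scattered = conditional_entropy(scattered)
print(h_consistent, h_scattered)
```

When aligned words always fall into matching clusters, knowing one cluster determines the other and the entropy is zero; when alignments scatter across clusters, it rises toward one bit in this two-cluster example.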
Multilingual word clustering
• We optimize both the monolingual and multilingual
objective together:
• Further edge filtering heuristics can be used
• Words aligned with stop words generally noisy
• Low frequency words are important
• Finding out whether edge filtering is language
dependent or not
References
• M. Faruqui & S. Pado, Training and Evaluating a German Named Entity
Recognizer with Semantic Generalization, KONVENS 2010.
• Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. JLRE, 43(3):209–226.
• Alexander Clark. 2003. Combining distributional and morphological
information for part of speech induction. Proc. EACL 59–66, Budapest,
Hungary.
• Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity
recognition. Proc. EMNLP, pages 141–150, Singapore.
• Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. Proc. CoNLL, pages 168–171, Edmonton.
• Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the
CoNLL-2003 shared task: Language-independent named entity recognition.
Proc. CoNLL, pages 142–147, Edmonton, AL
Thank you!
Questions?
Please write to:
mfaruqui@cs.cmu.edu