Learning Within-Sentence Semantic Coherence

Elena Eneva
Rose Hoberman
Lucian Lita
Carnegie Mellon University

Semantic (in)Coherence

• Trigram: content words unrelated
• Effect on speech recognition:
  – Actual utterance: “THE BIRD FLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK”
  – Top hypothesis: “THE BIRD FLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID”
• Our goal: model semantic coherence

A Whole-Sentence Exponential Model [Rosenfeld 1997]

\[ \Pr(s) \overset{\text{def}}{=} \frac{1}{Z}\, P_0(s)\, \exp\Big(\sum_i \lambda_i f_i(s)\Big) \]

• P0(s) is an arbitrary initial model (typically an N-gram)
• The fi(s) are arbitrary computable properties of s (a.k.a. features)
• Z is a universal normalizing constant
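
Since Z is a single global constant, relative sentence scores can be compared without ever computing it. Below is a minimal sketch of the scoring rule, assuming the baseline model and the feature functions are supplied as Python callables (illustrative only, not Rosenfeld's implementation):

```python
# Whole-sentence exponential model score (unnormalized): the baseline
# probability P0(s) reweighted by exp(sum_i lambda_i * f_i(s)).
import math

def unnormalized_score(sentence, p0, features, lambdas):
    """p0: callable returning the baseline (e.g. N-gram) probability of the
    sentence; features: list of callables f_i(s); lambdas: matching weights."""
    exponent = sum(lam * f(sentence) for lam, f in zip(lambdas, features))
    return p0(sentence) * math.exp(exponent)
```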

A Methodology for Feature Induction

Given corpus T of training sentences:
1. Train the best-possible baseline model, P0(s)
2. Use P0(s) to generate a corpus T0 of “pseudo-sentences”
3. Pose a challenge: find (computable) differences that allow discrimination between T and T0
4. Encode the differences as features fi(s)
5. Train a new model:
   \[ P_1(s) = \frac{1}{Z}\, P_0(s)\, \exp\Big(\sum_i \lambda_i f_i(s)\Big) \]
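
The five steps can be read as a single loop. The sketch below is schematic only: the helpers for sampling from P0, finding discriminating properties, and fitting weights are hypothetical placeholders, since the slide does not prescribe an implementation.

```python
# Schematic feature-induction loop (hypothetical helper names throughout).
def induce_features(train_corpus, p0_sample, find_discriminators, fit_weights):
    """p0_sample(n): draw n pseudo-sentences from the baseline model P0;
    find_discriminators(real, fake): return feature functions f_i(s);
    fit_weights(features, real, fake): return the weights lambda_i."""
    pseudo_corpus = p0_sample(len(train_corpus))                    # step 2
    features = find_discriminators(train_corpus, pseudo_corpus)     # steps 3-4
    lambdas = fit_weights(features, train_corpus, pseudo_corpus)    # step 5
    return features, lambdas
```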

Discrimination Task: Are these content words generated from a trigram or a natural sentence?

1. - - - feel - - sacrifice - - sense - - - - - - - - meant - - - - - - - - trust - - - - truth
2. - - kind - free trade agreements - - - living - - ziplock bag - - - - - - university japan's daiwa bank stocks step -

Building on Prior Work

• Define “content words” (all but the top 50 words)
• Goal: model the distribution of content words in a sentence
• Simplify: model pairwise co-occurrences (“content word pairs”)
• Collect contingency tables; calculate a measure of association for them

Q Correlation Measure

Derived from a co-occurrence contingency table:

             W2 yes   W2 no
   W1 yes     c11      c12
   W1 no      c21      c22

\[ Q = \frac{c_{11} c_{22} - c_{12} c_{21}}{c_{11} c_{22} + c_{12} c_{21}} \]

• Q values range from –1 to +1
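
The definition translates directly into code; the helper below is a plain transcription of the formula above, with made-up counts in the usage comment.

```python
# Q correlation measure from a 2x2 word-pair co-occurrence contingency table.
def q_measure(c11, c12, c21, c22):
    """c11: both words occur, c12/c21: exactly one occurs, c22: neither occurs.
    Undefined (division by zero) when both products are zero."""
    return (c11 * c22 - c12 * c21) / (c11 * c22 + c12 * c21)

# Example with illustrative counts: q_measure(30, 5, 4, 961) ≈ 0.9986
```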

Density Estimates

• We hypothesized:
  – Trigram sentences: word-pair correlation completely determined by distance
  – Natural sentences: word-pair correlation independent of distance
• Kernel density estimation:
  – distribution of Q values in each corpus
  – at varying distances
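
A minimal sketch of this estimation step, assuming per-corpus lists of (Q, distance) observations and a Gaussian kernel via scipy (the slides do not specify the kernel, so treat that choice as illustrative):

```python
# Kernel density estimate of the Q values observed at one particular distance.
import numpy as np
from scipy.stats import gaussian_kde

def q_density_at_distance(q_values, distances, target_distance):
    """Return a callable giving the estimated density at a single Q value.
    Assumes several observations exist at the requested distance."""
    qs = np.array([q for q, d in zip(q_values, distances) if d == target_distance])
    kde = gaussian_kde(qs)
    return lambda q: kde.evaluate([q])[0]

# One such density would be fitted per distance, separately for the
# Broadcast News corpus and the trigram-generated corpus.
```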

Q Distributions

[Figure: density of Q values at distance 1 and distance 3; trigram-generated (dashed) vs. Broadcast News; x-axis: Q value, y-axis: density]

Likelihood Ratio Feature

\[ L = \prod_{\text{word pairs } i,j} \frac{\Pr(Q_{ij} \mid d_{ij}, \text{BNews})}{\Pr(Q_{ij} \mid d_{ij}, \text{Trigram})} \]

Example: “she is a country singer searching for fame and fortune in nashville”
  – Q(country, nashville) = 0.76, distance = 8
  – Pr(Q = 0.76 | d = 8, BNews) = 0.32
  – Pr(Q = 0.76 | d = 8, Trigram) = 0.11
  – Likelihood ratio = 0.32 / 0.11 = 2.9
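
A sketch of the sentence-level feature, assuming the per-distance density estimates from the previous step are available as callables keyed by distance (hypothetical data structures); the product is accumulated in log space for numerical stability.

```python
# Sentence-level likelihood ratio over content-word pairs:
# product of Pr(Q | d, BNews) / Pr(Q | d, Trigram).
import math

def likelihood_ratio(word_pairs, bn_density, tri_density):
    """word_pairs: iterable of (q, distance) for the sentence's content-word
    pairs; bn_density / tri_density: dicts mapping a distance to a callable
    returning the estimated density at a Q value (e.g. the previous sketch)."""
    log_ratio = 0.0
    for q, d in word_pairs:
        log_ratio += math.log(bn_density[d](q)) - math.log(tri_density[d](q))
    return math.exp(log_ratio)
```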

Simpler Features

• Q-value based:
  – Mean, median, min, max of Q values for content word pairs in the sentence (Cai et al. 2000)
  – Percentage of Q values above a threshold
  – High/low correlations across large/small distances
• Other:
  – Word and phrase repetition
  – Percentage of stop words
  – Longest sequence of consecutive stop/content words
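
Several of these are simple summary statistics. The sketch below assumes the sentence's Q values and a per-token stop-word mask have already been computed; the threshold value is illustrative.

```python
# A few of the simpler features: Q-value summary statistics, the fraction of
# stop words, and the longest run of consecutive stop words.
import statistics

def simple_features(q_values, is_stop, threshold=0.9):
    """q_values: Q scores of the sentence's content-word pairs;
    is_stop: per-token booleans marking stop words."""
    longest_stop_run = run = 0
    for stop in is_stop:
        run = run + 1 if stop else 0
        longest_stop_run = max(longest_stop_run, run)
    return {
        "q_mean": statistics.mean(q_values),
        "q_median": statistics.median(q_values),
        "q_min": min(q_values),
        "q_max": max(q_values),
        "q_above_threshold": sum(q > threshold for q in q_values) / len(q_values),
        "stopword_fraction": sum(is_stop) / len(is_stop),
        "longest_stopword_run": longest_stop_run,
    }
```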

Datasets

• LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN)
• From the remainder of the BN corpus and sentences sampled from the trigram LM:
  – Q value distributions estimated from ~100,000 sentences
  – Decision tree trained and tested on ~60,000 sentences
• Disregarded sentences with < 7 words, e.g.:
  – “Mike Stevens says it’s not real”
  – “We’ve been hearing about it”

Experiments

• Learners:
  – C5.0 decision tree
  – Boosting decision stumps with AdaBoost.MH
• Methodology:
  – 5-fold cross-validation on ~60,000 sentences
  – Boosting for 300 rounds
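
C5.0 is a proprietary learner, so the stand-in below uses scikit-learn's decision tree and generic AdaBoost over decision stumps (not AdaBoost.MH) purely to illustrate the evaluation setup; X and y stand for a hypothetical feature matrix and label vector.

```python
# Rough stand-in for the experimental setup: 5-fold cross-validation of a
# decision tree and of 300 rounds of boosting over decision stumps.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    tree = DecisionTreeClassifier()
    stumps = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # stumps
                                n_estimators=300)                     # 300 rounds
    return {
        "decision tree": cross_val_score(tree, X, y, cv=5).mean(),
        "boosted stumps": cross_val_score(stumps, X, y, cv=5).mean(),
    }
```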

Results

Feature Set                                   Classification Accuracy (%)
Q mean, median, min, max (previous work)      73.39 ± 0.36
Likelihood Ratio                              77.76 ± 0.49
All but Likelihood Ratio                      80.37 ± 0.42
All Features                                  80.37 ± 0.46
Likelihood Ratio + non-Q

Shannon-Style Experiment

• 50 sentences
  – ½ “real” and ½ trigram-generated
  – Stop words replaced by dashes
• 30 participants
  – Average accuracy of 73.77% ± 6
  – Best individual accuracy of 84%
• Our classifier:
  – Accuracy of 78.9% ± 0.42

Summary

• Introduced a set of statistical features which capture aspects of semantic coherence
• Trained a decision tree that classifies with 80% accuracy
• Next step: incorporate the features into an exponential LM

Future Work

• Combat data sparsity:
  – Confidence intervals
  – A different correlation statistic
  – Stemming or clustering the vocabulary
• Evaluate the derived features:
  – Incorporate them into an exponential language model
  – Evaluate the model on a practical application

Agreement among Participants

Expected Perplexity Reduction

• Semantic coherence feature:
  – 78% of Broadcast News sentences
  – 18% of trigram-generated sentences
• Kullback-Leibler divergence: 0.814
• Average perplexity reduction per word = .0419 (2^.814 / 21 per sentence?)
• Features modify the probability of the entire sentence
• The effect of the feature on per-word probability is small
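
One possible reading of the arithmetic above (an assumption, since the slide is terse): a feature contributing D bits of per-sentence KL divergence changes sentence-level perplexity by a factor of 2^D, which, spread over an n-word sentence, amounts to only 2^{D/n} per word, hence the small per-word effect.

```latex
% Assumed reading of the slide's arithmetic, not a derivation from the source:
% D bits of per-sentence KL divergence spread over n words act like 2^{D/n} per word.
\[
  2^{D} = \bigl(2^{D/n}\bigr)^{n},
  \qquad D = 0.814,\; n \approx 21 \;\Rightarrow\; 2^{D/n} \approx 1.027 .
\]
```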

Distribution of Likelihood Ratio

[Figure: density of the likelihood ratio feature; trigram-generated (dashed) vs. Broadcast News; x-axis: likelihood value, y-axis: density]

Discrimination Task

• Natural sentence:
  – but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth
• Trigram-generated:
  – they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though

Q Values at Distance 1

[Figure: density of Q values at distance 1; trigram-generated (dashed) vs. Broadcast News; x-axis: Q value, y-axis: density]

Q Values at Distance 3

[Figure: density of Q values at distance 3; trigram-generated (dashed) vs. Broadcast News; x-axis: Q value, y-axis: density]

Outline

• The problem of semantic (in)coherence
• Incorporating this into the whole-sentence exponential LM
• Finding better features for this model using machine learning
• Semantic coherence features
• Experiments and results