Learning Within-Sentence Semantic Coherence
Elena Eneva, Rose Hoberman, Lucian Lita
Carnegie Mellon University

Semantic (in)Coherence
– Trigram: content words unrelated
– Effect on speech recognition:
  – Actual utterance: "THE BIRD FLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK"
  – Top hypothesis: "THE BIRD FLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID"
– Our goal: model semantic coherence

A Whole-Sentence Exponential Model [Rosenfeld 1997]

  Pr(s) = (1/Z) · P0(s) · exp(Σ_i λ_i f_i(s))

– P0(s) is an arbitrary initial model (typically an N-gram)
– the f_i(s) are arbitrary computable properties of s (aka features)
– Z is a universal normalizing constant

A Methodology for Feature Induction
Given a corpus T of training sentences:
1. Train the best possible baseline model P0(s)
2. Use P0(s) to generate a corpus T0 of "pseudo-sentences"
3. Pose a challenge: find (computable) differences that allow discrimination between T and T0
4. Encode the differences as features f_i(s)
5. Train a new model: P1(s) = (1/Z) · P0(s) · exp(Σ_i λ_i f_i(s))

Discrimination Task
Are these content words generated from a trigram model or drawn from a natural sentence? (Stopwords are replaced by dashes.)
1. - - - feel - - sacrifice - - sense - - - - - - - - meant - - - - - - - - trust - - - - truth
2. - - kind - free trade agreements - - - living - - ziplock bag - - - - - - university japan's daiwa bank stocks step –

Building on Prior Work
– Define "content words" (all words except the 50 most frequent)
– Goal: model the distribution of content words in a sentence
– Simplification: model pairwise co-occurrences ("content word pairs")
– Collect contingency tables; calculate a measure of association for them

Q Correlation Measure
Derived from the co-occurrence contingency table:

            W2 yes   W2 no
  W1 yes    c11      c12
  W1 no     c21      c22

  Q = (c11·c22 - c12·c21) / (c11·c22 + c12·c21)

Q values range from -1 to +1.

Density Estimates
We hypothesized:
– Trigram sentences: word-pair correlation completely determined by distance
– Natural sentences: word-pair correlation independent of distance
Kernel density estimation gives the distribution of Q values in each corpus, at varying distances.

Q Distributions
[Figure: density of Q values at distance 1 and distance 3, trigram-generated vs. broadcast news]

Likelihood Ratio Feature

  L = Π over word pairs (i,j) of Pr(Q_ij | d_ij, BNews) / Pr(Q_ij | d_ij, Trigram)

Example: "she is a country singer searching for fame and fortune in nashville"
– Q(country, nashville) = 0.76, distance = 8
– Pr(Q = 0.76 | d = 8, BNews) = 0.32
– Pr(Q = 0.76 | d = 8, Trigram) = 0.11
– Likelihood ratio = 0.32 / 0.11 = 2.9
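Pulling the last two slides together, here is a minimal Python sketch (not the authors' code) of the Q statistic and the likelihood-ratio feature, computed in log space. The names q_lookup, p_bnews, and p_trigram are hypothetical stand-ins for the precomputed contingency-table Q values and the two kernel density estimates:

```python
import math

def q_statistic(c11, c12, c21, c22):
    """The Q association measure (Yule's Q) from the 2x2 contingency table above:
    Q = (c11*c22 - c12*c21) / (c11*c22 + c12*c21), ranging from -1 to +1."""
    numerator = c11 * c22 - c12 * c21
    denominator = c11 * c22 + c12 * c21
    return numerator / denominator if denominator else 0.0

def likelihood_ratio_feature(content_words, q_lookup, p_bnews, p_trigram):
    """Log of the product over content-word pairs of
    Pr(Q_ij | d_ij, BNews) / Pr(Q_ij | d_ij, Trigram).

    content_words:    list of (position, word) pairs for one sentence
    q_lookup(w1, w2): precomputed Q value for a word pair
    p_bnews(q, d), p_trigram(q, d): density estimates at Q value q, distance d
    """
    score = 0.0
    for k, (pos1, w1) in enumerate(content_words):
        for pos2, w2 in content_words[k + 1:]:
            q = q_lookup(w1, w2)
            d = pos2 - pos1  # word-pair distance within the sentence
            score += math.log(p_bnews(q, d)) - math.log(p_trigram(q, d))
    return score
```

For the example above, the (country, nashville) pair alone would contribute log(0.32/0.11) ≈ 1.07 to this log-space score.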
Simpler Features
– Q-value based:
  – Mean, median, min, and max of the Q values for content word pairs in the sentence (Cai et al. 2000)
  – Percentage of Q values above a threshold
  – High/low correlations across large/small distances
– Other:
  – Word and phrase repetition
  – Percentage of stopwords
  – Longest sequence of consecutive stop/content words

Datasets
– LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN)
– From the remainder of the BN corpus and from sentences sampled from the trigram LM:
  – Q value distributions estimated from ~100,000 sentences
  – Decision tree trained and tested on ~60,000 sentences
– Disregarded sentences with fewer than 7 words, e.g.:
  – "Mike Stevens says it's not real"
  – "We've been hearing about it"

Experiments
– Learners:
  – C5.0 decision tree
  – Boosted decision stumps (AdaBoost.MH)
– Methodology:
  – 5-fold cross-validation on ~60,000 sentences
  – Boosting for 300 rounds

Results

  Feature Set                                  Classification Accuracy
  Q mean, median, min, max (previous work)     73.39 ± 0.36
  Likelihood ratio                             77.76 ± 0.49
  All but likelihood ratio                     80.37 ± 0.42
  All features (likelihood ratio + non-Q)      80.37 ± 0.46

Shannon-Style Experiment
– 50 sentences:
  – half "real" and half trigram-generated
  – stopwords replaced by dashes
– 30 participants:
  – average accuracy: 73.77% ± 6
  – best individual accuracy: 84%
– Our classifier: accuracy of 78.9% ± 0.42

Summary
– Introduced a set of statistical features which capture aspects of semantic coherence
– Trained a decision tree that classifies with 80% accuracy
– Next step: incorporate the features into an exponential LM (see the sketch at the end)

Future Work
– Combat data sparsity:
  – Confidence intervals
  – A different correlation statistic
  – Stemming or clustering the vocabulary
– Evaluate the derived features:
  – Incorporate them into an exponential language model
  – Evaluate that model on a practical application

Agreement among Participants
[Figure: agreement among participants in the Shannon-style experiment]

Expected Perplexity Reduction
– Semantic coherence feature: 78% of broadcast news sentences, 18% of trigram-generated sentences
– Kullback-Leibler divergence: .814
– Average perplexity reduction per word = .0419 (2^.814 / 21)
– Per sentence? Features modify the probability of the entire sentence; the effect of a feature on per-word probability is small

Distribution of Likelihood Ratio
[Figure: density of the likelihood value, trigram-generated vs. broadcast news]

Discrimination Task
– Natural sentence: "but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth"
– Trigram-generated: "they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though"

Q Values at Distance 1
[Figure: density of Q values at distance 1, trigram-generated vs. broadcast news]

Q Values at Distance 3
[Figure: density of Q values at distance 3, trigram-generated vs. broadcast news]

Outline
– The problem of semantic (in)coherence
– Incorporating it into the whole-sentence exponential LM
– Finding better features for this model using machine learning
– Semantic coherence features
– Experiments and results
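As a companion to the "next step" in the Summary, here is a minimal sketch (not the authors' implementation; log_p0, features, lambdas, and log_z are assumed placeholder names) of how the coherence features would enter the whole-sentence exponential model Pr(s) = (1/Z) · P0(s) · exp(Σ_i λ_i f_i(s)), working in log space:

```python
def whole_sentence_logprob(sentence, log_p0, features, lambdas, log_z):
    """Log-probability under the whole-sentence exponential model:
    log Pr(s) = log P0(s) + sum_i lambda_i * f_i(s) - log Z.

    log_p0(s): log-probability under the baseline model (e.g., a trigram LM)
    features:  list of feature functions f_i(s), e.g., the likelihood-ratio
               and simpler coherence features described earlier
    lambdas:   trained feature weights
    log_z:     log of the universal normalizing constant Z
    """
    feature_score = sum(lam * f(sentence) for lam, f in zip(lambdas, features))
    return log_p0(sentence) + feature_score - log_z
```

When such a model is used only to rerank recognizer hypotheses, log Z is shared by all candidate sentences and can be dropped without changing the ranking.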