Learning Within-Sentence Semantic Coherence
Elena Eneva, Rose Hoberman, Lucian Lita
Carnegie Mellon University

Semantic (in)Coherence
– Trigram: content words unrelated
– Effect on speech recognition:
  – Actual utterance: "THE BIRD FLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK"
  – Top hypothesis: "THE BIRD FLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID"
– Our goal: model semantic coherence

A Whole-Sentence Exponential Model [Rosenfeld 1997]

  Pr(s) = (1/Z) · P0(s) · exp(Σ_i λ_i f_i(s))

– P0(s) is an arbitrary initial model (typically an N-gram)
– the f_i(s) are arbitrary computable properties of s (aka features)
– Z is a universal normalizing constant

A Methodology for Feature Induction
Given a corpus T of training sentences:
1. Train the best possible baseline model P0(s)
2. Use P0(s) to generate a corpus T0 of "pseudo-sentences"
3. Pose a challenge: find (computable) differences that allow discrimination between T and T0
4. Encode the differences as features f_i(s)
5. Train a new model: P1(s) = (1/Z) · P0(s) · exp(Σ_i λ_i f_i(s))

Discrimination Task
Are these content words generated from a trigram model or drawn from a natural sentence? (Stopwords are replaced by dashes.)
1. - - - feel - - sacrifice - - sense - - - - - - - - meant - - - - - - - - trust - - - - truth
2. - - kind - free trade agreements - - - living - - ziplock bag - - - - - - university japan's daiwa bank stocks step –

Building on Prior Work
– Define "content words" (all words except the 50 most frequent)
– Goal: model the distribution of content words in a sentence
– Simplification: model pairwise co-occurrences ("content word pairs")
– Collect contingency tables; calculate a measure of association for them

Q Correlation Measure
Derived from the co-occurrence contingency table:

            W2 yes   W2 no
  W1 yes    c11      c12
  W1 no     c21      c22

  Q = (c11·c22 - c12·c21) / (c11·c22 + c12·c21)

Q values range from -1 to +1.

Density Estimates
We hypothesized:
– Trigram sentences: word-pair correlation completely determined by distance
– Natural sentences: word-pair correlation independent of distance
Kernel density estimation gives the distribution of Q values in each corpus, at varying distances.

Q Distributions
[Figure: density of Q values at distance 1 and distance 3, trigram-generated vs. broadcast news]

Likelihood Ratio Feature

  L = Π over word pairs (i,j) of Pr(Q_ij | d_ij, BNews) / Pr(Q_ij | d_ij, Trigram)

Example: "she is a country singer searching for fame and fortune in nashville"
– Q(country, nashville) = 0.76, distance = 8
– Pr(Q = 0.76 | d = 8, BNews) = 0.32
– Pr(Q = 0.76 | d = 8, Trigram) = 0.11
– Likelihood ratio = 0.32 / 0.11 = 2.9
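Pulling the last two slides together, here is a minimal Python sketch (not the authors' code) of the Q statistic and the likelihood-ratio feature, computed in log space. The names q_lookup, p_bnews, and p_trigram are hypothetical stand-ins for the precomputed contingency-table Q values and the two kernel density estimates:

```python
import math

def q_statistic(c11, c12, c21, c22):
    """The Q association measure (Yule's Q) from the 2x2 contingency table above:
    Q = (c11*c22 - c12*c21) / (c11*c22 + c12*c21), ranging from -1 to +1."""
    numerator = c11 * c22 - c12 * c21
    denominator = c11 * c22 + c12 * c21
    return numerator / denominator if denominator else 0.0

def likelihood_ratio_feature(content_words, q_lookup, p_bnews, p_trigram):
    """Log of the product over content-word pairs of
    Pr(Q_ij | d_ij, BNews) / Pr(Q_ij | d_ij, Trigram).

    content_words:    list of (position, word) pairs for one sentence
    q_lookup(w1, w2): precomputed Q value for a word pair
    p_bnews(q, d), p_trigram(q, d): density estimates at Q value q, distance d
    """
    score = 0.0
    for k, (pos1, w1) in enumerate(content_words):
        for pos2, w2 in content_words[k + 1:]:
            q = q_lookup(w1, w2)
            d = pos2 - pos1  # word-pair distance within the sentence
            score += math.log(p_bnews(q, d)) - math.log(p_trigram(q, d))
    return score
```

For the example above, the (country, nashville) pair alone would contribute log(0.32/0.11) ≈ 1.07 to this log-space score.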
Simpler Features
– Q-value based:
  – Mean, median, min, and max of the Q values for content word pairs in the sentence (Cai et al. 2000)
  – Percentage of Q values above a threshold
  – High/low correlations across large/small distances
– Other:
  – Word and phrase repetition
  – Percentage of stopwords
  – Longest sequence of consecutive stop/content words

Datasets
– LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN)
– From the remainder of the BN corpus and from sentences sampled from the trigram LM:
  – Q value distributions estimated from ~100,000 sentences
  – Decision tree trained and tested on ~60,000 sentences
– Disregarded sentences with fewer than 7 words, e.g.:
  – "Mike Stevens says it's not real"
  – "We've been hearing about it"

Experiments
– Learners:
  – C5.0 decision tree
  – Boosted decision stumps (AdaBoost.MH)
– Methodology:
  – 5-fold cross-validation on ~60,000 sentences
  – Boosting for 300 rounds

Results

  Feature Set                                  Classification Accuracy
  Q mean, median, min, max (previous work)     73.39 ± 0.36
  Likelihood ratio                             77.76 ± 0.49
  All but likelihood ratio                     80.37 ± 0.42
  All features (likelihood ratio + non-Q)      80.37 ± 0.46

Shannon-Style Experiment
– 50 sentences:
  – half "real" and half trigram-generated
  – stopwords replaced by dashes
– 30 participants:
  – average accuracy: 73.77% ± 6
  – best individual accuracy: 84%
– Our classifier: accuracy of 78.9% ± 0.42

Summary
– Introduced a set of statistical features which capture aspects of semantic coherence
– Trained a decision tree that classifies with 80% accuracy
– Next step: incorporate the features into an exponential LM (see the sketch at the end)

Future Work
– Combat data sparsity:
  – Confidence intervals
  – A different correlation statistic
  – Stemming or clustering the vocabulary
– Evaluate the derived features:
  – Incorporate them into an exponential language model
  – Evaluate that model on a practical application

Agreement among Participants
[Figure: agreement among participants in the Shannon-style experiment]

Expected Perplexity Reduction
– Semantic coherence feature: 78% of broadcast news sentences, 18% of trigram-generated sentences
– Kullback-Leibler divergence: .814
– Average perplexity reduction per word = .0419 (2^.814 / 21)
– Per sentence? Features modify the probability of the entire sentence; the effect of a feature on per-word probability is small

Distribution of Likelihood Ratio
[Figure: density of the likelihood value, trigram-generated vs. broadcast news]

Discrimination Task
– Natural sentence: "but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth"
– Trigram-generated: "they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though"

Q Values at Distance 1
[Figure: density of Q values at distance 1, trigram-generated vs. broadcast news]

Q Values at Distance 3
[Figure: density of Q values at distance 3, trigram-generated vs. broadcast news]

Outline
– The problem of semantic (in)coherence
– Incorporating it into the whole-sentence exponential LM
– Finding better features for this model using machine learning
– Semantic coherence features
– Experiments and results
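As a companion to the "next step" in the Summary, here is a minimal sketch (not the authors' implementation; log_p0, features, lambdas, and log_z are assumed placeholder names) of how the coherence features would enter the whole-sentence exponential model Pr(s) = (1/Z) · P0(s) · exp(Σ_i λ_i f_i(s)), working in log space:

```python
def whole_sentence_logprob(sentence, log_p0, features, lambdas, log_z):
    """Log-probability under the whole-sentence exponential model:
    log Pr(s) = log P0(s) + sum_i lambda_i * f_i(s) - log Z.

    log_p0(s): log-probability under the baseline model (e.g., a trigram LM)
    features:  list of feature functions f_i(s), e.g., the likelihood-ratio
               and simpler coherence features described earlier
    lambdas:   trained feature weights
    log_z:     log of the universal normalizing constant Z
    """
    feature_score = sum(lam * f(sentence) for lam, f in zip(lambdas, features))
    return log_p0(sentence) + feature_score - log_z
```

When such a model is used only to rerank recognizer hypotheses, log Z is shared by all candidate sentences and can be dropped without changing the ranking.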