Mono- and bilingual modeling of selectional preferences
Sebastian Padó, Institute for Computational Linguistics, Heidelberg University
(joint work with Katrin Erk, Ulrike Padó, Yves Peirsman)

Some context
• Computational lexical semantics: modeling the meaning of words and phrases
• Distributional approach: observe the usage of words in corpora (corpus → knowledge)
• Robustness: broad coverage, manageable complexity
• Flexibility: the corpus choice determines the model

Structure
• Application: prediction of plausibility judgments
• Methods: distributional semantics
• Phenomena: semantic relations in bilingual dictionaries

Plausibility of verb–relation–argument triples

  Verb  Relation  Argument   Plausibility
  eat   subject   customer   6.9
  eat   object    customer   1.5
  eat   subject   apple      1.0
  eat   object    apple      6.4

• A central aspect of language
• Selectional preferences [Katz & Fodor 1963, Wilks 1975]
• A generalization of lexical similarity
• Incremental language processing [McRae & Matsuki 2009]
• Disambiguation [Toutanova et al. 2005], applicability of inference rules [Pantel et al. 2007], SRL [Gildea & Jurafsky 2002]

Modelling plausibility
• Approximating plausibility by frequency in an English corpus:

  (eat, obj, apple)      100  → highly plausible
  (eat, obj, hat)          1  → somewhat plausible
  (eat, obj, telephone)    0  → ?
  (eat, obj, caviar)       0  → ?

• Two lexical variables: the frequency of most triples is zero
• Implausibility or sparse data?
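The sparse-data problem can be made concrete in a few lines; the "corpus" below is a toy list of triples, not real corpus counts:

```python
# Frequency-based plausibility: count (verb, relation, argument) triples.
# Toy data for illustration only.
from collections import Counter

corpus = [("eat", "obj", "apple")] * 100 + [("eat", "obj", "hat")]
counts = Counter(corpus)

def plausibility(triple):
    # A zero count conflates "implausible" with "merely unseen":
    # (eat, obj, telephone) and (eat, obj, caviar) both score 0.
    return counts[triple]

print(plausibility(("eat", "obj", "apple")))   # 100
print(plausibility(("eat", "obj", "caviar")))  # 0
```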
Two generalization strategies
• Generalization based on an ontology (WordNet) [Resnik 1996]
• Generalization based on vector space [Erk, Padó & Padó 2010]

Semantic spaces
[Figure: a French semantic space with the context dimensions cultiver and rouler; e.g. clémentine (5, 1), mandarine (4, 1), voiture (1, 20)]
• Characterization of word meaning through a profile over occurrence contexts [Salton, Wong & Yang 1975, Landauer & Dumais 1997, Schütze 1998]
• Geometrically: a vector in a high-dimensional space
• High vector similarity implies high semantic similarity
• Nearest neighbors = synonyms

Similarity-based generalization [Erk, Padó & Padó 2010]
• Plausibility is the weighted average vector-space similarity to the seen arguments:

  plausibility(v, r, a) = 1/Z · Σ_{a′ ∈ seenargs(v,r)} wt(a′) · sim(a, a′)

• (v, r, a): verb – relation – argument head word triple
• seenargs: set of argument head words seen in the corpus
• wt: weight function
• Z: normalization constant
• sim: semantic (vector-space) similarity

Geometrical interpretation
[Figure: seen objects of "eat" (apple, orange, caviar, breakfast) and seen subjects of "eat" (Peter, husband, child) form clusters; telephone lies far from both]

Evaluation

  Model                               Coverage   Spearman's ρ
  Resnik 1996 [ontology-based]        100%       0.123 n.s.
  EPP [vector-space-based]            98%        0.325 ***
  U. Padó et al. 2006 ["deep" model]  78%        0.415 ***

• Triples with human plausibility ratings [McRae et al. 1996]
• Evaluation: correlation of model predictions with human judgments
• Spearman's ρ = 1: perfect correlation; ρ = 0: no correlation
• Result: the vector-space model attains almost the quality of the "deep" model, at 98% coverage

From one to many languages…
• The vector-space model reduces the need for language resources to predict plausibility judgments: no ontologies
• Still necessary: observations of triples and target words, i.e. a large, accurately parsed corpus
• This is problematic for basically all languages except English:

  Model                              Resources          ρ
  Resnik [Brockmann & Lapata 2002]   TIGER + GermaNet   .37
  EPP [Padó & Peirsman 2010]         HGC                .33

• Can we extend our strategy to new languages?
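The similarity-based formula can be sketched directly. This is a minimal sketch: the 2-d vectors and weights are toy values, whereas the real model uses high-dimensional co-occurrence vectors:

```python
# EPP-style similarity-based generalization:
# plausibility(v, r, a) = 1/Z * sum over seen args a' of wt(a') * sim(a, a')
import math

def cosine(u, v):
    """Cosine similarity as the vector-space sim() function."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

def plausibility(arg, seen_args, vectors, wt):
    """Weighted average similarity of `arg` to the argument head words
    seen for (v, r) in the corpus; wt could e.g. be corpus frequency."""
    Z = sum(wt[a] for a in seen_args)  # normalization constant
    return sum(wt[a] * cosine(vectors[arg], vectors[a]) for a in seen_args) / Z

# Toy 2-d space: seen objects of "eat" plus two unseen candidate arguments
vectors = {"apple": (5.0, 1.0), "orange": (4.0, 1.0),
           "caviar": (4.0, 2.0), "telephone": (0.5, 6.0)}
wt = {"apple": 100, "orange": 50}
seen = ["apple", "orange"]

# caviar lies near the seen objects, telephone does not
print(plausibility("caviar", seen, vectors, wt) >
      plausibility("telephone", seen, vectors, wt))  # True
```

Unseen arguments thus inherit plausibility from their neighbors among the seen arguments, which is exactly how the model escapes the zero-frequency problem.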
Predicting plausibility for new languages
• Cross-lingual knowledge transfer with a bilingual lexicon [Koehn and Knight 2002]
• Example: with cultiver – grow and pomme – apple, map (cultiver, obj, pomme) to the English model's (grow, obj, apple): highly plausible
• Print dictionaries are problematic
• Instead: acquire the lexicon from distributional data

Bilingual semantic space
[Figure: a joint French–English space with the bilingual dimensions cultiver/grow and rouler/drive; mandarine and mandarin lie close together]
• A joint semantic space for words from both languages [Rapp 1995, Fung & McKeown 1997]
• Dimensions are bilingual word pairs and can be bootstrapped
• Frequencies are observable from comparable corpora
• Nearest neighbors: cross-lingual synonyms ⟷ translations

Nearest neighbors in bilingual space
[Figure: in the same space, pear (5, 1) is the nearest neighbor of pomme (4, 2), although the two are not translations]
• Similar usages / context profiles do not necessarily indicate synonymy
• Bilingual case: Peirsman & Padó (2011)
• Lexicon extraction for EN/DE and EN/NL

Evaluation against a gold standard
• Evaluation of the nearest cross-lingual neighbors against a translators' dictionary
• Analysis of 200 noun pairs (EN–DE):

  Meta-relation               Relation      Frequency   Example
  Synonymy (50%)              Synonymy      99          Verhältnis – relationship
                              Antonymy      1           Inneres – exterior
  Semantic similarity (16%)   Co-hyponymy   15          Straßenbahn – bus
                              Hyponymy      3           Kunstwerk – painting
                              Hypernymy     15          Dramatiker – poet
  Semantic relatedness (19%)                39          Kapitel – essay
  Errors (14%)                              28          DDR-Zeit – trainee

• Similarity varies by relation: how to proceed?
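A minimal sketch of nearest-neighbor extraction in such a joint space; the seed dimensions and counts are toy values:

```python
# Cross-lingual nearest neighbors in a joint bilingual space whose
# dimensions are seed translation pairs (here cultiver/grow, rouler/drive).
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

# French and English words over the same bilingual dimensions (toy counts)
fr_space = {"pomme": (4.0, 2.0), "voiture": (1.0, 20.0)}
en_space = {"pear": (5.0, 1.0), "car": (1.0, 20.0)}

def nearest_neighbors(fr_word, n=1):
    """English words most similar to a French word = translation candidates.
    Neighbors need not be synonyms: co-hyponyms and related words rank high too."""
    sims = {en: cosine(fr_space[fr_word], vec) for en, vec in en_space.items()}
    return sorted(sims, key=sims.get, reverse=True)[:n]

print(nearest_neighbors("voiture"))  # ['car']  - a correct translation
print(nearest_neighbors("pomme"))    # ['pear'] - a co-hyponym, not a translation
```

The second call illustrates the caveat on this slide: pomme's nearest neighbor is pear, a semantically similar word rather than a translation.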
• The classical reaction: focus on cross-lingual synonyms
• Aggressive filtering of the nearest-neighbor lists
• Risk: sparse-data issues
• Our hypothesis (preliminary version):
  • Non-synonymous pairs still provide information about bilingual similarity
  • They should be exploited for cross-lingual knowledge transfer
• Experimental validation: vary the number of synonyms and observe the effect on cross-lingual knowledge transfer

Varying the number of neighbors
• Nearest neighbors: 50% synonyms
• Further neighbors: quick decline to 10% synonyms

Experimental setup
• Bilingual lexicon: rouler – drive; bagnole – jalopy, banger, car
• For the triple (bagnole, subj, rouler), the English model considers the plausibilities of (jalopy, subj, drive), (banger, subj, drive), (car, subj, drive)

Details
• Model:
  • English model trained on the BNC as before
  • Bilingual lexicon extracted from the BNC and the Stuttgart newspaper corpus HGC as comparable corpora
  • Prediction based on the n nearest English neighbours of the German argument
• Evaluation:
  • 90 German (v, r, a) triples with human plausibility ratings [Brockmann & Lapata 2003]

Results – EN-DE

  Model                               Resources                       Spearman's ρ
  Resnik [Brockmann & Lapata 2002]    TIGER corpus, German WordNet    .37
  EPP German [Padó & Peirsman 2010]   HGC corpus parsed with a PCFG   .33

  Translated English EPP   1 NN   2 NN   3 NN   4 NN   5 NN
                           0.34   0.44   0.41   0.46   0.40

• Result: the transfer model is significantly better than the monolingual model, but only if non-synonymous neighbors are included

Results: details

                            1 NN   2 NN   3 NN   4 NN   5 NN
  English EPP (all)         0.34   0.41   0.44   0.46   0.40
  English EPP (subjects)    0.53   0.51   0.56   0.56   0.55
  English EPP (objects)     0.58   0.61   0.61   0.64   0.58
  English EPP (PP objects)  0.33   0.45   0.45   0.46   0.42

Sources of the positive effect
• Non-synonyms are in fact informative for plausibility translation
• Semantically similar verbs: eat – munch – feast
  • Similar events, similar arguments [Fillmore et al. 2003, Levin 1993]
• Semantically related verbs: peel – cook – eat
  • Schemas/narrative chains: shared participants [Schank & Abelson 1977, Chambers & Jurafsky 2009]

Our hypothesis, with qualifications
• Using non-synonymous translation pairs is helpful
  1. if the transferred knowledge is lexical (many infrequently observed datapoints)
  2. if the knowledge is stable across semantically related/similar word pairs
• Counterexample: polarity/sentiment judgments
  • food – feast – grub
  • A parallel experiment gave the best results for the single nearest neighbor

Summary
• Plausibility can be modeled with fairly shallow methods: seen head words plus generalization in vector space
• Precondition: an accurately parsed corpus
• If unavailable: transfer from a better-endowed language, with translation through automatically induced lexicons
• Transfer of knowledge about certain phenomena can benefit from non-synonymous translations
• This corresponds to monolingual results from QA [Harabagiu et al. 2000], paraphrases [Lin & Pantel 2001], entailment [Dagan et al. 2006], …
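The transfer scheme from the experimental setup can be sketched as follows; the lexicon entries and plausibility scores are toy values, and english_plausibility is a stand-in for the BNC-trained English EPP model:

```python
# Cross-lingual transfer: translate a source-language triple through the
# n nearest English neighbors and average the English model's scores.
# Toy values throughout; english_plausibility stands in for the EPP model.

lexicon = {"bagnole": ["jalopy", "banger", "car"],  # ranked nearest neighbors
           "rouler": ["drive"]}

english_plausibility = {("jalopy", "subj", "drive"): 0.3,
                        ("banger", "subj", "drive"): 0.2,
                        ("car", "subj", "drive"): 0.9}

def transfer_plausibility(arg, rel, verb, n=3):
    """Average over the n nearest English neighbors of the argument.
    With n > 1, non-synonymous neighbors (banger, car) contribute too."""
    scores = [english_plausibility[(a, rel, v)]
              for a in lexicon[arg][:n] for v in lexicon[verb]]
    return sum(scores) / len(scores)

print(transfer_plausibility("bagnole", "subj", "rouler", n=1))  # 0.3
print(transfer_plausibility("bagnole", "subj", "rouler", n=3))  # (0.3+0.2+0.9)/3
```

Varying n here corresponds to the 1 NN … 5 NN columns in the results tables: widening the neighbor list admits non-synonymous translations into the average.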