Are Distributional Dimensions Semantic Features?
Katrin Erk, University of Texas at Austin
Meaning in Context Symposium, München, September 2015
Joint work with Gemma Boleda

Semantic features by example: Katz & Fodor
• Different meanings of a word are characterized by lists of semantic features

Semantic features
• In linguistics: Katz & Fodor, Wierzbicka, Jackendoff, Bierwisch, Pustejovsky, Asher, …
• In computational linguistics/AI: Schank, Wilks, Masterman, Sowa, …

Schank, Conceptual Dependencies

"drink" in preference semantics (Wilks):
((*ANI SUBJ) (((FLOW STUFF) OBJE) (MOVE CAUSE)))

Semantic features: Characteristics
• Primitive (not themselves defined), unanalyzable
• Small set
• Lexicalized in all languages
• Combined, they characterize the semantics of all lexical expressions in all languages
• Precise, fixed meaning, which is not part of language
  • Wilks: not so
• Individually enable inferences
• Feature lists or complex graphs
Compiled from: Wierzbicka, Geeraerts, Schank

Uses of semantic features
• Event structure in the lexical semantics of verbs (Levin):
  • change-of-state verbs: [[x ACT] CAUSE [BECOME [y <result-state>]]]
• Handle polysemy (Pustejovsky, Asher)
• Characterize selectional constraints (e.g. in VerbNet)
• Characterize synonyms, also cross-linguistically (application: translation)
• Enable inferences: John is a bachelor ⇒ John is unmarried, John is a man

Are distributional dimensions semantic features?
Alligator: believe-v 0.794065, american-a 2.245667, kill-v 1.946722, consider-v 0.047781, seem-v 0.410991, turn-v 0.919250, side-n 0.098926, serve-v 0.479459, involve-v 0.435661, report-v 0.483651, little-a 1.175299, big-a 1.468021, water-n 1.806485, attack-n 1.795050, much-a 0.011354, …
Computed from UKWaC + Wikipedia + BNC + Gigaword, 2-word window, PPMI transform (a sketch of the PPMI computation follows)
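For concreteness, here is a minimal sketch of the PPMI (positive pointwise mutual information) transform over toy co-occurrence counts. All words and counts below are invented for illustration; this is not the exact pipeline behind the alligator vector above.

```python
import numpy as np

# Toy co-occurrence counts: rows = target words, columns = context words.
# Invented numbers; the slide's vector came from much larger corpora
# with a 2-word window.
targets = ["alligator", "crocodile", "freedom"]
contexts = ["water-n", "attack-n", "big-a", "grant-v"]
counts = np.array([
    [30.0, 25.0, 20.0,  0.0],
    [28.0, 22.0, 18.0,  0.0],
    [ 1.0,  2.0,  0.0, 15.0],
])

total = counts.sum()
p_target = counts.sum(axis=1, keepdims=True) / total   # P(target)
p_context = counts.sum(axis=0, keepdims=True) / total  # P(context)
p_joint = counts / total                               # P(target, context)

# PMI(t, c) = log P(t, c) / (P(t) P(c)); PPMI clips negative values to 0,
# so a dimension records only positively associated contexts.
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_joint / (p_target * p_context))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

for word, row in zip(targets, ppmi):
    print(word, {c: round(float(v), 3) for c, v in zip(contexts, row)})
```

Each target row is then a vector whose dimensions look like the alligator dimensions above: context words with continuous association weights.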
Are distributional dimensions semantic features?
• "[The] differences between vector space encoding and more familiar accounts of meaning is easy to exaggerate. For example, a vector space encoding is entirely compatible with the traditional doctrine that concepts are 'bundles' of semantic features. Indeed, the latter is a special case of the former, the difference being that […] semantic dimensions are allowed to be continuous." (Fodor and Lepore 1999: All at Sea in Semantic Space)
• (About connectionism, and Churchland in particular, not distributional models)

Are distributional dimensions semantic features?
• If so, they either address or inherit methodological problems:
  • Coverage of a realistic vocabulary
  • Empirically determining semantic features
  • Meaning creep: predicates used in Cyc did not stay stable in their meaning over the years (Wilks 2008)

Are distributional dimensions semantic features?
• If so, they inherit theoretical problems:
  • Lewis 1970: "Markerese"
  • Fodor et al. 1980, Against Definitions; Fodor and Lepore 1999, All at Sea in Semantic Space
  • Asymmetry between words and primitives: what makes the primitives more basic?
  • Also, how can people communicate if their semantic spaces differ?

Outline
• Differences between distributional dimensions and semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference

Semantic features: Characteristics
• Primitive (not themselves defined), unanalyzable
• Small set
• Lexicalized in all languages
• Combined, they characterize the semantics of all lexical expressions in all languages
• Precise, fixed meaning, not part of language
• Individually enable inferences
• Feature lists or complex graphs

Neither primitive nor with a fixed meaning
• Not unanalyzable: any distributional feature can in principle itself be a distributional target
• Compare: target and dimensions as a graph, with similarity determined on the basis of random walks
[Figure: a graph connecting the target to dimensions d1, d2, d3, with d1 connected to a further dimension dd1]

Neither primitive nor with a fixed meaning
• But are they treated as unanalyzed in practice?
• Features in a vector are usually not analyzed further
• SVD, topic modeling, prediction-based models:
  • induce latent features
  • exploiting distributional properties of the features
• Are latent features unanalyzable? No, they are linked to the original dimensions
• No fixed meaning: distributional features can be ambiguous

Then is it "Markerese"?
• Inference = deriving something non-distributional from distributional representations
• Inference from relations to other words:
  • "X cause Y" and "Y trigger X" occur with similar X, Y; hence the two patterns are probably close in meaning
  • "alligator" appears in a subset of the contexts of "animal"; hence alligators are probably animals
• Inference from co-occurrence with extralinguistic information:
  • Distributional vectors linked to images for the same target
• Alligators are similar to crocodiles, crocodiles are listed in the ontology as animals, hence alligators are probably animals

No individual inferences
• The distributional representation as a whole, in the aggregate, allows for inferences using aggregate techniques (see the sketch after this slide):
  • Distributional similarity
  • Distributional inclusion
  • Whole-vector mappings to visual vectors
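A minimal sketch of the first two aggregate techniques, over invented PPMI-style vectors. `weeds_precision` is a name used here for one common directional inclusion measure (in the spirit of Weeds and Weir's precision), not a standard API.

```python
import numpy as np

def cosine(u, v):
    # Distributional similarity: cosine of the angle between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weeds_precision(u, v):
    # Distributional inclusion, one common operationalization: how much of
    # u's weighted contexts are also contexts of v. A high value suggests
    # u's contexts are (nearly) a subset of v's, i.e. u may be a hyponym.
    shared = u[(u > 0) & (v > 0)].sum()
    return float(shared / u.sum())

# Invented PPMI-style vectors over five shared context dimensions.
alligator = np.array([1.8, 1.7, 1.4, 0.0, 0.0])
animal    = np.array([1.2, 0.9, 1.1, 1.5, 1.3])
freedom   = np.array([0.0, 0.2, 0.0, 0.1, 1.9])

print(cosine(alligator, animal))           # fairly high
print(cosine(alligator, freedom))          # low
print(weeds_precision(alligator, animal))  # 1.0: full inclusion
print(weeds_precision(animal, alligator))  # lower: inclusion is asymmetric
```

Note that both inferences are aggregate: no single dimension licenses the conclusion; it emerges from the vector as a whole.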
No individual inferences
• Feature-based inference is possible even with "John Doe" features:
  • Take a text representation
  • Take it apart into features that are individually almost meaningless
  • The aggregate of such features allows for inferences

Outline
• Differences between distributional dimensions and semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference

Redefining the dichotomy
• Not semantic features versus distributional dimensions, but individual features versus aggregate features
• Individual features:
  • Individually allow for inferences
  • May be relevant to grammar
  • Are introspectively salient
  • Not necessarily primitive
  • Also hypernyms and synonyms
• Aggregate features:
  • May be individually almost meaningless
  • Allow for aggregate inference
• Two modes of inference: individual and aggregate

Individual features in distributional representations
• Some distributional dimensions can be cognitively relevant features
• Thill et al. 2014: because distributional models focus on how words are frequently used, they point to how humans experience concepts
• Freedom (features from Baroni & Lenci 2010):
  • positive events: guarantee, secure, grant, defend, respect
  • negative events: undermine, deny, infringe on, violate

Individual features in distributional representations
• Approaches that find cognitively plausible features distributionally:
  • Almuhareb & Poesio 2004
  • Cimiano & Wenderoth 2007
  • Schulte im Walde et al. 2008: German association norms
  • Baroni et al. 2010: STRUDEL
  • Baroni & Lenci 2010: Distributional Memory
  • Devereux et al. 2010: dependency paths extracted from Wikipedia

Individual features in distributional representations
• Difficult: only a small fraction of human-elicited features can be retrieved
• Baroni et al. 2010: distributional features tend to be different from human-elicited features
  • preference for "'actional' and 'situated' descriptions"
  • motorcycle:
    • elicited: wheels, dangerous, engine, fast
    • distributional: ride, sidecar, park, road

Outline
• Differences between distributional dimensions and semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference

Not a competition
• Use both kinds of features!
• Computational perspective:
  • Distributional features are great:
    • learned automatically
    • enable many inferences
  • Human-defined semantic features are great:
    • less noisy
    • enable inferences with more certainty
    • enable inferences that distributional models do not provide
• How can we integrate the two?

Speculation: Learning both individual and aggregate features
• The learner makes use of features from the textual environment
• Some features are almost meaningless, others more meaningful
• Some of them are relevant to grammar (CAUSE, BECOME)
• Both meaningful and near-meaningless features enter aggregate inference
• Only certain features allow individual inference
• (Unclear: this should not be feature lists, there is structure! But where does that fit in this picture?)

Outline
• Differences between distributional dimensions and semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference

Inferring individual features from aggregates
• Johns and Jones 2012:
  • Compute the weight of the feature bird for nightingale as the summed similarity of nightingale to known birds (see the first sketch after this slide)
• Fagarasan/Vecchi/Clark 2015:
  • Learn a mapping from distributional vectors to vectors of individual features (see the second sketch after this slide)
• Herbelot/Vecchi 2015:
  • Learn a mapping from distributional space to a "set-theoretic space": vectors of quantified individual features (ALL apes are muscular, SOME apes live on coasts)
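A minimal sketch of the Johns and Jones idea, with invented vectors; their actual model weights the similarities more carefully than a plain sum.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def inferred_feature_weight(word_vec, carrier_vecs):
    # Weight of a feature (e.g. bird) for a new word (e.g. nightingale):
    # summed similarity of the word to words known to carry the feature.
    return sum(cosine(word_vec, v) for v in carrier_vecs)

# Invented distributional vectors.
nightingale = np.array([0.9, 0.8, 0.1])
known_birds = [np.array([1.0, 0.7, 0.0]),   # robin
               np.array([0.8, 0.9, 0.2])]   # sparrow
known_tools = [np.array([0.0, 0.1, 1.0])]   # hammer

print("bird weight:", inferred_feature_weight(nightingale, known_birds))
print("tool weight:", inferred_feature_weight(nightingale, known_tools))
```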
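And a minimal sketch of the mapping idea of Fagarasan/Vecchi/Clark: learn a regression from distributional vectors to feature-norm vectors, then apply it to unseen words. Ridge regression and the randomly generated data below are stand-ins, not their exact setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Made-up data: 200 words, 50-dimensional distributional vectors, and
# 10 human-elicited feature weights per word (e.g. has_wings, is_fast).
n_words, n_dims, n_feats = 200, 50, 10
X = rng.normal(size=(n_words, n_dims))            # distributional vectors
W_true = rng.normal(size=(n_dims, n_feats))       # hidden linear relation
Y = X @ W_true + 0.1 * rng.normal(size=(n_words, n_feats))  # feature norms

# Fit the mapping on words with known feature norms ...
model = Ridge(alpha=1.0).fit(X[:150], Y[:150])

# ... then predict individual features for held-out words from their
# aggregate (distributional) representation alone.
Y_pred = model.predict(X[150:])
print("mean absolute error:", float(np.abs(Y_pred - Y[150:]).mean()))
```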
Inferring individual features from aggregates
• Gupta et al. 2015:
  • Regression to learn properties of unknown cities/countries from those of known cities/countries
• Snow/Jurafsky/Ng 2006:
  • Infer the location of a word in the WordNet hierarchy using a distributional co-hyponymy classifier

Individual features influencing aggregate representations
• Andrews/Vigliocco/Vinson 2009, Roller/Schulte im Walde 2013: topic modeling that includes known individual features of the words in the text
• Faruqui et al. 2015: update vector representations to better match known synonymy, hypernymy, hyponymy information (see the retrofitting sketch below)

Individual features influencing aggregate representations
• Boyd-Graber/Blei/Zhu 2007:
  • WordNet hierarchy as part of a topic model
  • Generate a word: choose a topic, then walk down the WordNet hierarchy based on the topic
  • Aim: the best WordNet sense for each word in context
• Riedel et al. 2013, Rocktäschel et al. 2015: Universal Schema (see the factorization sketch below)
  • A relation is characterized by a vector of named-entity pairs (the entity pairs that fill the relation)
  • Both human-defined and corpus-extracted relations
  • Matrix factorization over the union of human-defined and corpus-extracted relations
  • Predict whether a relation holds of an entity pair
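A minimal sketch of the retrofitting update of Faruqui et al.: each vector is repeatedly pulled toward its lexicon neighbors while staying anchored to its original position. The uniform weights alpha and beta and the toy lexicon are simplifying assumptions.

```python
import numpy as np

def retrofit(vectors, lexicon, alpha=1.0, beta=1.0, iterations=10):
    # vectors: word -> original distributional vector
    # lexicon: word -> list of related words (e.g. synonyms)
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in lexicon.items():
            neighbors = [n for n in neighbors if n in new]
            if word not in new or not neighbors:
                continue
            # Weighted average of the original vector (the anchor) and
            # the current vectors of the lexicon neighbors.
            num = alpha * vectors[word] + beta * sum(new[n] for n in neighbors)
            new[word] = num / (alpha + beta * len(neighbors))
    return new

# Toy vectors and a toy synonym lexicon, invented for illustration.
vecs = {
    "happy": np.array([1.0, 0.0]),
    "glad":  np.array([0.0, 1.0]),
    "table": np.array([-1.0, -1.0]),
}
lex = {"happy": ["glad"], "glad": ["happy"]}

retro = retrofit(vecs, lex)
print(retro["happy"], retro["glad"])  # pulled toward each other
print(retro["table"])                 # unchanged: no lexicon neighbors
```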
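Likewise, a minimal sketch of the Universal Schema idea: a matrix of entity pairs by relations, mixing a human-defined relation with corpus-extracted surface patterns, is factorized into low-rank embeddings so that unobserved cells can be scored. The data, plain gradient descent, and squared loss are stand-ins for the actual setup of Riedel et al.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows: entity pairs; columns: relations (human-defined "capital-of" plus
# corpus-extracted patterns). 1 = observed fact, 0 = observed negative,
# nan = unknown. All invented.
pairs = ["(Paris, France)", "(Berlin, Germany)",
         "(Everest, Nepal)", "(Austin, Texas)"]
relations = ["X located in Y", "X is a city in Y", "capital-of"]
M = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, np.nan]])  # capital-of(Austin, Texas) is unknown

mask = ~np.isnan(M)                 # train only on observed cells
target = np.nan_to_num(M)

k = 2                               # embedding dimensionality
P = 0.1 * rng.normal(size=(len(pairs), k))      # entity-pair embeddings
R = 0.1 * rng.normal(size=(len(relations), k))  # relation embeddings

lr = 0.05
for _ in range(2000):               # gradient descent on masked squared error
    E = (P @ R.T - target) * mask
    P, R = P - lr * E @ R, R - lr * E.T @ P

# The unknown cell is scored by the learned embeddings: because
# (Austin, Texas) patterns like (Paris, France) on the observed columns,
# its capital-of score should come out high.
print(np.round(P @ R.T, 2))
```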
Conclusion
• Distributional features are not semantic features:
  • Not primitive
  • Inference from relations between word representations, and from co-occurrence with extra-linguistic information
  • Not (necessarily) individually meaningful
  • Inference from the aggregate of features
• Two modes of inference: individual and aggregate
• Use both individual and aggregate features
• How to integrate the two, and infer one from the other?

References
• Almuhareb, A., & Poesio, M. (2004). Attribute-based and value-based clustering: An evaluation. Proceedings of EMNLP, 1–8.
• Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.
• Asher, N. (2011). Lexical Meaning in Context: A Web of Words. Cambridge University Press.
• Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2), 222–254.
• Baroni, M., & Lenci, A. (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–721.
• Bierwisch, M. (1969). On certain problems of semantic representation. Foundations of Language, 5, 153–184.
• Boyd-Graber, J., Blei, D. M., & Zhu, X. (2007). A topic model for word sense disambiguation. Proceedings of EMNLP.
• Cimiano, P., & Wenderoth, J. (2007). Automatic acquisition of ranked qualia structures from the Web. Proceedings of ACL, 888–895.
• Devereux, B., Pilkington, N., Poibeau, T., & Korhonen, A. (2010). Towards unrestricted, large-scale acquisition of feature-based conceptual representations from corpus data. Research on Language and Computation, 7(2–4), 137–170.
• Fagarasan, L., Vecchi, E., & Clark, S. (2015). From distributional semantics to feature norms: Grounding semantic models in human perceptual data. Proceedings of IWCS.
• Faruqui, M., Dodge, J., Jauhar, S., Dyer, C., Hovy, E., & Smith, N. (2015). Retrofitting word vectors to semantic lexicons. Proceedings of NAACL.
• Fodor, J., Garrett, M. F., Walker, E. C. T., & Parkes, C. H. (1980). Against definitions. Cognition, 8(3), 263–367.
• Fodor, J., & Lepore, E. (1999). All at sea in semantic space: Churchland on meaning similarity. The Journal of Philosophy, 96(8), 381–403.
• Geeraerts, D. (2009). Theories of Lexical Semantics. Oxford University Press.
• Gupta, A., Boleda, G., Baroni, M., & Padó, S. (2015). Distributional vectors encode referential attributes. Proceedings of EMNLP.
• Herbelot, A., & Vecchi, E. M. (2015). Building a shared world: Mapping distributional to model-theoretic semantic spaces. Proceedings of EMNLP.
• Jackendoff, R. (1990). Semantic Structures. MIT Press.
• Johns, B. T., & Jones, M. N. (2012). Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1), 103–120.
• Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39(2), 170–210.
• Lewis, D. (1970). General semantics. Synthese, 22(1), 18–67.
• Pustejovsky, J. (1991). The Generative Lexicon. Computational Linguistics, 17(4).
• Rappaport Hovav, M., & Levin, B. (2001). An event structure account of English resultatives. Language, 77(4).
• Riedel, S., Yao, L., McCallum, A., & Marlin, B. (2013). Relation extraction with matrix factorization and universal schemas. Proceedings of NAACL.
• Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. Proceedings of NAACL.
• Roller, S., & Schulte im Walde, S. (2013). A multimodal LDA model integrating textual, cognitive and visual modalities. Proceedings of EMNLP.
• Schank, R. (1969). A conceptual dependency parser for natural language. Proceedings of COLING.
• Schulte im Walde, S., Melinger, A., Roth, M., & Weber, A. (2008). An empirical characterisation of response types in German association norms. Research on Language and Computation, 6(2), 205–238.
• Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence. Proceedings of ACL-COLING, 801–808.
• Sowa, J. (1992). Logical structures in the lexicon. In J. Pustejovsky & S. Bergler (Eds.), Lexical Semantics and Knowledge Representation (LNCS, Vol. 627, pp. 39–60).
• Thill, S., Padó, S., & Ziemke, T. (2014). On the importance of a rich embodiment in the grounding of concepts: Perspectives from embodied cognitive science and computational linguistics. Topics in Cognitive Science, 6(3), 545–558.
• Wierzbicka, A. (1996). Semantics: Primes and Universals. Oxford University Press.
• Wilks, Y. (2008). What would a Wittgensteinian computational linguistics be like? Presented at the AISB Workshop on Computers and Philosophy, Aberdeen.