Are Distributional Dimensions Semantic Features?

Katrin Erk
University of Texas at Austin
Meaning in Context Symposium
München September 2015
Joint work with Gemma Boleda
Semantic features by example: Katz & Fodor
• Different meanings of a word characterized by lists of semantic features
Semantic features
• In linguistics: Katz & Fodor, Wierzbicka, Jackendoff, Bierwisch, Pustejovsky, Asher, …
• In computational linguistics/AI: Schank, Wilks, Masterman, Sowa, …
Schank, Conceptual Dependencies
“drink” in preference semantics (Wilks):
((*ANI SUBJ) (((FLOW STUFF) OBJE) (MOVE CAUSE)))
Semantic features:
Characteristics
• Primitive (not themselves defined), unanalyzable
• Small set
• Lexicalized in all languages
• Combined, they characterize semantics of all lexical
expressions in all languages
• Precise, fixed meaning, which is not part of language.
• Wilks: not so
• Individually enable inferences
• Feature lists or complex graphs
Compiled from:
Wierzbicka, Geeraerts, Schank
Uses of semantic features
• Event structure in the lexical semantics of verbs (Levin):
• change-of-state verbs:
[[x ACT] CAUSE [BECOME [y <result-state>]]]
• Handle polysemy (Pustejovsky, Asher)
• Characterize selectional constraints (e.g. in VerbNet)
• Characterize synonyms, also cross-linguistically
(application: translation)
• Enable inferences (see the sketch below):
John is a bachelor ⇒ John is unmarried, John is a man
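A minimal sketch (with a hypothetical feature inventory and toy lexicon, not taken from Katz & Fodor or any particular theory) of how individual semantic features license inferences such as the bachelor example above:

```python
# Toy sketch: individual semantic features licensing inferences.
# The feature inventory and lexicon are hypothetical illustrations.

LEXICON = {
    "bachelor":  {"HUMAN", "MALE", "ADULT", "UNMARRIED"},
    "man":       {"HUMAN", "MALE", "ADULT"},
    "unmarried": {"UNMARRIED"},
}

def entails(word_a: str, word_b: str) -> bool:
    """word_a entails word_b if every feature of word_b is also a feature of word_a."""
    return LEXICON[word_b] <= LEXICON[word_a]

print(entails("bachelor", "man"))        # True:  John is a bachelor -> John is a man
print(entails("bachelor", "unmarried"))  # True:  John is a bachelor -> John is unmarried
print(entails("man", "bachelor"))        # False: the inference does not go the other way
```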
Are distributional dimensions
semantic features?
Alligator:
believe-v 0.794065
american-a 2.245667
kill-v 1.946722
consider-v 0.047781
seem-v 0.410991
turn-v 0.919250
side-n 0.098926
serve-v 0.479459
involve-v 0.435661
report-v 0.483651
little-a 1.175299
big-a 1.468021
water-n 1.806485
attack-n 1.795050
much-a 0.011354
…
Computed from UKWaC + Wikipedia + BNC + Gigaword, 2-word window, PPMI transform (toy sketch below)
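A toy sketch of the kind of count-plus-PPMI pipeline behind such a vector; the mini-corpus, vocabulary, and window below are invented stand-ins for the UKWaC + Wikipedia + BNC + Gigaword setup above:

```python
import numpy as np

# Toy sketch of a count-based distributional model with a PPMI transform.
# Corpus and window size are illustrative, not the slide's actual setup.
corpus = "the big alligator swam in the water the alligator attacked".split()
window = 2
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count target-context co-occurrences within a symmetric window.
counts = np.zeros((len(vocab), len(vocab)))
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[idx[target], idx[corpus[j]]] += 1

# Positive pointwise mutual information (PPMI).
total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# The PPMI row for "alligator" is a small analogue of the vector above.
print(dict(zip(vocab, ppmi[idx["alligator"]].round(3))))
```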
Are distributional dimensions
semantic features?
• [The] differences between vector space encoding and
more familiar accounts of meaning is easy to exaggerate.
For example, a vector space encoding is entirely
compatible with the traditional doctrine that concepts
are ‘bundles’ of semantic features. Indeed, the latter is a
special case of the former, the difference being that […]
semantic dimensions are allowed to be continuous.
(Fodor and Lepore 1999: All at Sea in Semantic Space)
(About connectionism and particularly Churchland, not
distributional models)
Are distributional dimensions
semantic features?
• If so, they either address or inherit
methodological problems:
• Coverage of a realistic vocabulary
• Empirically determining semantic features
• Meaning creep: Predicates used in Cyc did not stay
stable in their meaning over the years (Wilks 2008)
Are distributional dimensions
semantic features?
• If so, they inherit theoretical problems
• Lewis 1970: “Markerese”
• Fodor et al 1980, Against Definitions;
Fodor and Lepore 1999, All at Sea in Semantic Space
• Asymmetry between words and primitives:
• What makes the primitives more basic?
• Also, how can people communicate if their semantic
spaces differ?
Outline
• Differences between distributional dimensions and
semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference
Semantic features:
Characteristics
• Primitive (not themselves defined), unanalyzable
• Small set
• Lexicalized in all languages
• Combined, they characterize semantics of all lexical
expressions in all languages
• Precise, fixed meaning, not part of language.
• Individually enable inferences
• Feature lists or complex graphs
Neither primitive nor
with a fixed meaning
• Not unanalyzable: Any distributional feature can in
principle be a distributional target
• Compare: target and dimensions as a graph, with similarity determined on the basis of random walks (toy sketch after the diagram):
[Diagram: graph with a target node and dimension nodes d1, dd1, d2, d3]
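A minimal sketch of the graph view: targets and dimensions as nodes of one co-occurrence graph, with similarity read off short random walks. Nodes, edges, and weights below are invented for illustration:

```python
import numpy as np

# Toy graph over targets and dimensions; similarity via short random walks.
nodes = ["alligator", "crocodile", "water-n", "attack-n", "big-a"]
idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))

def add_edge(a, b, w):
    A[idx[a], idx[b]] = A[idx[b], idx[a]] = w

add_edge("alligator", "water-n", 1.8)
add_edge("alligator", "attack-n", 1.8)
add_edge("alligator", "big-a", 1.5)
add_edge("crocodile", "water-n", 1.6)
add_edge("crocodile", "attack-n", 1.7)

# Row-normalize to transition probabilities, then take a few walk steps.
P = A / A.sum(axis=1, keepdims=True)
walk = np.linalg.matrix_power(P, 3)  # where a 3-step random walk ends up

def walk_similarity(a, b):
    """Two nodes are similar if short walks from them land in similar places."""
    pa, pb = walk[idx[a]], walk[idx[b]]
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))

print(round(walk_similarity("alligator", "crocodile"), 3))
```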
Neither primitive nor
with a fixed meaning
• But are they treated as unanalyzed in practice?
• Features in a vector are usually not analyzed further
• SVD, topic modeling, prediction-based models:
• induce latent features
• exploiting the distributional properties of the features
• Are latent features unanalyzable?
No: they are linked to the original dimensions (see the SVD sketch below)
• No fixed meaning: distributional features can be ambiguous
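A small numpy sketch of the point that latent features remain analyzable: each SVD dimension is a weighted combination of the original context dimensions, so its loadings can be inspected. The matrix below is invented toy data:

```python
import numpy as np

# Toy target-by-context matrix; values are invented for illustration.
contexts = ["water-n", "attack-n", "kill-v", "believe-v", "seem-v"]
targets = ["alligator", "crocodile", "idea"]
X = np.array([
    [1.8, 1.8, 1.9, 0.8, 0.4],   # alligator
    [1.6, 1.7, 1.8, 0.5, 0.3],   # crocodile
    [0.1, 0.2, 0.1, 2.1, 1.9],   # idea
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)
latent = U * S   # targets in the latent space

# Coordinates of the targets on latent dimension 0 ...
print(dict(zip(targets, latent[:, 0].round(2))))
# ... and the original context dimensions that load most strongly on it:
loadings = sorted(zip(contexts, Vt[0]), key=lambda p: -abs(p[1]))
print(loadings[:3])
```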
Then is it “Markerese”?
• Inference = deriving something non-distributional from
distributional representations
• Inference from relations to other words
• "X cause Y" and "Y trigger X" occur with similar X and Y, hence "cause" and "trigger" are probably close in meaning
• "alligator" appears in a subset of the contexts of "animal", hence alligators are probably animals (see the inclusion sketch after this slide)
• Inference from co-occurrence with extralinguistic
information
• Distributional vectors linked to images for the same target
• Alligators are similar to crocodiles, crocodiles are listed in the
ontology as animals, hence alligators are probably animals
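A toy sketch of the inclusion-based inference mentioned above ("alligator" occurring in a subset of the contexts of "animal"); the sparse vectors and the simple weighted-inclusion score are invented for illustration:

```python
# Toy sparse context vectors; real models use large co-occurrence counts.
alligator = {"water-n": 1.8, "attack-n": 1.8, "kill-v": 1.9}
animal    = {"water-n": 1.1, "attack-n": 0.9, "kill-v": 1.2,
             "eat-v": 1.5, "wild-a": 1.0, "keep-v": 0.7}

def weighted_inclusion(narrow, broad):
    """Share of the narrow term's context mass also found with the broad term."""
    shared = sum(w for c, w in narrow.items() if c in broad)
    return shared / sum(narrow.values())

# High inclusion in one direction only suggests "alligator" is a kind of "animal".
print(weighted_inclusion(alligator, animal))  # 1.0 in this toy example
print(weighted_inclusion(animal, alligator))  # 0.5: much lower in the reverse direction
```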
No individual inferences
• Distributional representation as a whole,
in the aggregate, allows for inferences using
aggregate techniques:
• Distributional similarity
• Distributional inclusion
• Whole-vector mappings to visual vectors
No individual inferences
• Feature-based inference possible with “John Doe”
features:
• Take text representation
• Take apart into features that are individually almost
meaningless
• Aggregate of such features allows for inferences
Outline
• Differences between distributional dimensions and
semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference
Redefining the dichotomy
• Not semantic features versus distributional dimensions: individual features versus aggregate features
• Individual features:
• Individually allow for inferences
• May be relevant to grammar
• Are introspectively salient
• Not necessarily primitive
• Also hypernyms and synonyms
• Aggregate features:
• May be individually almost meaningless
• Allow for aggregate inference
• Two modes of inference: individual and aggregate
Individual features in
distributional representations
• Some distributional dimensions can be cognitively
relevant features
•
Thill et al 2014: Because distributional models focus
on how words are frequently used, they point to how
humans experience concepts
• Freedom (features from Baroni & Lenci 2010):
• positive events: guarantee, secure, grant, defend, respect
• negative events: undermine, deny, infringe on, violate
Individual features in
distributional representations
• Approaches that find cognitively plausible features
distributionally:
• Almuhareb & Poesio 2004
• Cimiano & Wenderoth 2007
• Schulte im Walde et al 2008: German association
norms
• Baroni et al 2010: STRUDEL
• Baroni & Lenci 2010: Distributional memory
• Devereux et al 2010: dependency paths extracted from
Wikipedia
Individual features in
distributional representations
• Difficult: only a small fraction of human-elicited features can be retrieved
• Baroni et al 2010: Distributional features tend to be
different from human-elicited features
• preference for “‘actional’ and ‘situated’ descriptions”
• motorcycle:
• elicited: wheels, dangerous, engine, fast
• distributional: ride, sidecar, park, road
Outline
• Differences between distributional dimensions and
semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference
Not a competition
• Use both kinds of features!
• Computational perspective:
• Distributional features are great
• learned automatically
• enable many inferences
• Human-defined semantic features are great
• less noisy
• enable inferences with more certainty
• enable inferences that distributional models do not provide
• How can we integrate the two?
Speculation: Learning both
individual and aggregate features
• Learner makes use of features from textual environment
• Some features almost meaningless, others more
meaningful
• Some of them relevant to grammar (CAUSE, BECOME)
• Both meaningful and near-meaningless features enter
aggregate inference
• Only certain features allow individual inference
• (Unclear: these should not be flat feature lists, there is structure! But where does that fit into this picture?)
Outline
• Differences between distributional dimensions and
semantic features
• Redefining the dichotomy
• No dichotomy after all
• Integrated inference
Inferring individual features
from aggregates
• Johns and Jones 2012:
• Compute weight of feature bird for nightingale as
summed similarity of nightingale to known birds
• Fagarasan/Vecchi/Clark 2015:
• Learn a mapping from distributional vectors to vectors of individual features (see the sketch after this slide)
• Herbelot/Vecchi 2015:
• Learn a mapping from distributional space to “set-theoretic
space”, vectors of quantified individual features (ALL apes
are muscular, SOME apes live on coasts)
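A minimal least-squares sketch of the mapping idea behind these approaches (learn a function from distributional vectors to vectors of individual features, then apply it to unseen words); all dimensions and data below are randomly generated toys, not the models or feature norms used in the cited papers:

```python
import numpy as np

# Toy setup: learn a linear map from distributional space to feature space,
# then predict individual features for an unseen word.
rng = np.random.default_rng(0)
n_train, dist_dim, feat_dim = 50, 20, 5          # invented toy sizes

D_train = rng.normal(size=(n_train, dist_dim))   # distributional vectors (training words)
W_gold = rng.normal(size=(dist_dim, feat_dim))   # pretend relation, used only to fake data
F_train = D_train @ W_gold + 0.1 * rng.normal(size=(n_train, feat_dim))  # feature vectors

# Least-squares mapping from distributional space to feature space.
W, *_ = np.linalg.lstsq(D_train, F_train, rcond=None)

d_unseen = rng.normal(size=dist_dim)             # e.g. a vector for "nightingale"
predicted_features = d_unseen @ W                # e.g. weights for "bird", "sings", ...
print(predicted_features.round(2))
```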
Inferring individual features
from aggregates
• Gupta et al 2015:
• Regression to learn properties of unknown
cities/countries from those of known cities/countries
• Snow/Jurafsky/Ng 2006:
• Infer location of a word in the WordNet hierarchy
using a distributional co-hyponymy classifier
Individual features influencing
aggregate representations
• Andrews/Vigliocco/Vinson 2009,
Roller/Schulte im Walde 2013:
Topic modeling, including known individual features
of words in the text
• Faruqui et al 2015:
Update vector representations to better match known synonymy, hypernymy, and hyponymy information (sketched below)
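A minimal sketch of a retrofitting-style update in the spirit of Faruqui et al. 2015: each vector is pulled toward its lexicon neighbors while staying close to its original distributional estimate. The vectors, the synonymy lexicon, and the weights below are invented:

```python
import numpy as np

# Toy vectors and a hypothetical synonymy lexicon.
vectors = {
    "alligator": np.array([1.0, 0.2, 0.1]),
    "crocodile": np.array([0.8, 0.4, 0.0]),
    "gator":     np.array([0.2, 0.9, 0.3]),   # noisy vector for a rare synonym
}
lexicon = {
    "alligator": ["crocodile", "gator"],
    "crocodile": ["alligator"],
    "gator":     ["alligator"],
}

original = {w: v.copy() for w, v in vectors.items()}
alpha, beta, n_iters = 1.0, 1.0, 10   # toy weights

for _ in range(n_iters):
    for w, neighbors in lexicon.items():
        neighbor_sum = sum(vectors[n] for n in neighbors)
        # Stay close to the original vector, move toward lexicon neighbors.
        vectors[w] = (alpha * original[w] + beta * neighbor_sum) / (alpha + beta * len(neighbors))

print(vectors["gator"].round(2))   # pulled toward "alligator"
```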
Individual features influencing
aggregate representations
• Boyd-Graber/Blei/Zhu 2007:
• WordNet hierarchy as part of a topic model.
• Generate a word: choose topic, then walk down WN hierarchy
based on the topic
• aim: best WN sense for each word in context
• Riedel et al 2013, Rocktäschel et al 2015: Universal Schema
• Relation characterized by a vector of named-entity pairs (the entity pairs that fill the relation)
• Both human-defined and corpus-extracted relations
• Matrix factorization over the union of human-defined and corpus-extracted relations
• Predict whether a relation holds of an entity pair (see the factorization sketch below)
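A tiny sketch of the universal-schema idea: one matrix whose rows are relations (knowledge-base and surface-pattern alike) and whose columns are entity pairs, factorized to score unobserved cells. The matrix, the entity pairs, and the plain truncated SVD used here are illustrative stand-ins for the logistic matrix factorization of Riedel et al. 2013:

```python
import numpy as np

# Toy relation-by-entity-pair observation matrix (1 = observed together).
relations = ["born-in(KB)", "X was born in Y", "X grew up in Y"]
pairs = [("Einstein", "Ulm"), ("Curie", "Warsaw"), ("Turing", "London")]
M = np.array([
    [1, 1, 0],   # born-in(KB)
    [1, 0, 1],   # "X was born in Y"
    [0, 1, 1],   # "X grew up in Y"
], dtype=float)

# Low-rank factorization; reconstructed scores fill in unobserved cells.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                  # latent dimensionality
scores = (U[:, :k] * S[:k]) @ Vt[:k]

# A high score for an unobserved cell suggests the relation holds of the pair.
print(round(scores[0, 2], 2))  # does born-in(KB) hold of (Turing, London)?
```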
Conclusion
• Distributional features are not semantic features:
• Not primitive
• Inference from relations between word representations,
co-occurrence with extra-linguistic information
• Not (necessarily) individually meaningful
• Inference from the aggregate of features
• Two modes of inference: individual and aggregate
• Use both individual and aggregate features
• How to integrate the two, and infer one from the other?
References
• Almuhareb, A., & Poesio, M. (2004). Attribute-based and value-based clustering: an evaluation (pp. 1–8). Presented at EMNLP.
• Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.
• Asher, N. (2011). Lexical meaning in context: a web of words. Cambridge University Press.
• Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2), 222–254.
• Baroni, M., & Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–721.
• Bierwisch, M. (1969). On certain problems of semantic representation. Foundations of Language, 5, 153–184.
• Boyd-Graber, J., Blei, D. M., & Zhu, X. (2007). A topic model for word sense disambiguation. Presented at EMNLP.
References
• Cimiano, P., & Wenderoth, J. (2007). Automatic acquisition of ranked qualia structures from the Web. Proceedings of ACL, pages 888–895, Prague.
• Devereux, B., Pilkington, N., Poibeau, T., & Korhonen, A. (2010). Towards unrestricted, large-scale acquisition of feature-based conceptual representations from corpus data. Research on Language and Computation, 7(2-4), 137–170.
• Fagarasan, L., Vecchi, E., & Clark, S. (2015). From distributional semantics to feature norms: grounding semantic models in human perceptual data. Proceedings of IWCS.
• Faruqui, M., Dodge, J., Jauhar, S., Dyer, C., Hovy, E., & Smith, N. (2015). Retrofitting word vectors to semantic lexicons. Presented at NAACL.
• Fodor, J., Garrett, M. F., Walker, E. C. T., & Parkes, C. H. (1980). Against definitions. Cognition, 8(3), 263–367.
• Fodor, J., & Lepore, E. (1999). All at sea in semantic space: Churchland on meaning similarity. The Journal of Philosophy, 96(8), 381–403.
• Geeraerts, D. (2009). Theories of Lexical Semantics. Oxford University Press.
References
• Gupta, A., Boleda, G., Baroni, M., & Padó, S. (2015). Distributional vectors encode referential attributes. Proceedings of EMNLP.
• Herbelot, A., & Vecchi, E. M. (2015). Building a shared world: Mapping distributional to model-theoretic semantic spaces. Proceedings of EMNLP.
• Jackendoff, R. (1990). Semantic Structures. MIT Press.
• Johns, B. T., & Jones, M. N. (2012). Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1), 103–120.
• Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39(2), 170–210.
• Lewis, D. (1970). General semantics. Synthese, 22(1), 18–67.
• Pustejovsky, J. (1991). The generative lexicon. Computational Linguistics, 17(4).
References
• Rappaport Hovav, M., & Levin, B. (2001). An event structure account of English resultatives. Language, 77(4).
• Riedel, S., Yao, L., McCallum, A., & Marlin, B. (2013). Relation extraction with matrix factorization and universal schemas. Presented at NAACL.
• Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. Presented at NAACL.
• Roller, S., & Schulte im Walde, S. (2013). A multimodal LDA model integrating textual, cognitive and visual modalities. Presented at EMNLP.
• Schank, R. (1969). A conceptual dependency parser for natural language. Proceedings of COLING 1969.
• Schulte im Walde, S., Melinger, A., Roth, M., & Weber, A. (2008). An empirical characterisation of response types in German association norms. Research on Language and Computation, 6(2), 205–238.
References
• Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence (pp. 801–808). Presented at ACL-COLING.
• Sowa, J. (1992). Logical structures in the lexicon. In J. Pustejovsky & S. Bergler (Eds.), Lexical semantics and knowledge representation (LNCS, Vol. 627, pp. 39–60).
• Thill, S., Padó, S., & Ziemke, T. (2014). On the importance of a rich embodiment in the grounding of concepts: Perspectives from embodied cognitive science and computational linguistics. Topics in Cognitive Science, 6(3), 545–558.
• Wierzbicka, A. (1996). Semantics: Primes and Universals. Oxford University Press.
• Wilks, Y. (2008). What would a Wittgensteinian computational linguistics be like? Presented at the AISB workshop on computers and philosophy, Aberdeen.