Corpus-based analysis of near-synonymous constructions:
Larger, faster and more objective
Natalia Levshina, Kris Heylen and Dirk Geeraerts
Introduction
In this paper we propose a radically corpus-driven methodology for studies of near-synonymous constructions. It is based on the quantitative method of semantic space models
(Lin 1998, Heylen et al. 2008), which is used to create a fully bottom-up and objective
classification of constructional slot fillers. Using the Dutch causative constructions
with doen and laten as a case study, we show how this method can be applied to popular statistical models of
constructional near-synonyms, such as regression models and collostructional analysis, and
discuss the advantages of the method, as well as its future challenges.
Aim
The semantic properties of constructional slot fillers are central in usage-based studies of
constructional semantics, often being the main source of evidence (e.g. Stefanowitsch and
Gries 2003, Bybee and Eddington 2006). Most commonly, researchers employ a ready-made classification of semantic classes such as Levin’s (1993), although corpus-driven
bottom-up classes have been proposed, too (Gries and Stefanowitsch 2010). A priori
classifications, however, run the risk of missing some semantic distinctions relevant for the
constructional meaning. We suggest a more realistic and methodologically rigorous approach,
evaluating several possible classifications by their predictive power and parsimony.
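To illustrate what weighing predictive power against parsimony can look like, here is a minimal sketch in Python. It is not the procedure used in the study: the abstract does not name a criterion, so the Akaike Information Criterion (AIC) is assumed here as one common choice. The words, counts and class labels are invented, and the function name aic_for_classification is hypothetical.

```python
import math
from collections import Counter

def aic_for_classification(observations, word_to_class):
    """AIC of a model predicting the auxiliary (doen/laten) from the filler's class.

    With a single categorical predictor, the maximum-likelihood fit simply
    predicts the observed proportion of 'doen' within each class.
    """
    totals, doen = Counter(), Counter()
    for word, aux in observations:
        cls = word_to_class[word]
        totals[cls] += 1
        if aux == "doen":
            doen[cls] += 1

    log_lik = 0.0
    for cls, n in totals.items():
        k = doen[cls]
        for count in (k, n - k):
            if count > 0:  # a zero count contributes nothing to the likelihood
                log_lik += count * math.log(count / n)

    n_params = len(totals)  # one estimated proportion per class
    return 2 * n_params - 2 * log_lik

# Invented toy data: (Causer noun, auxiliary) pairs. Not corpus counts.
observations = (
    [("zon", "doen")] * 8 + [("zon", "laten")] * 2
    + [("wind", "doen")] * 7 + [("wind", "laten")] * 3
    + [("man", "doen")] * 1 + [("man", "laten")] * 9
    + [("vrouw", "doen")] * 2 + [("vrouw", "laten")] * 8
)

# Three candidate classifications of the same nouns, from coarse to fine.
one_class = {w: "all" for w in ("zon", "wind", "man", "vrouw")}
two_classes = {"zon": "natural", "wind": "natural", "man": "human", "vrouw": "human"}
four_classes = {w: w for w in ("zon", "wind", "man", "vrouw")}

aic_one = aic_for_classification(observations, one_class)
aic_two = aic_for_classification(observations, two_classes)
aic_four = aic_for_classification(observations, four_classes)
print(aic_one, aic_two, aic_four)
```

On this toy data the two-class grouping obtains the lowest AIC: it fits almost as well as treating every noun separately while using fewer parameters, which is the kind of trade-off the comparison of classifications is meant to capture.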
Method and data
Our classifications are created with the help of semantic vector space models, which take into
account a variety of distributional (lexical and/or syntactic) properties of lexemes (Geeraerts
2010: 173-176). The words are clustered into varying numbers of classes according to their
distributional profiles. The data for the experiment are taken from the syntactically parsed
Twente Nieuws Corpus (Bouma et al. 2001, Ordelman et al. 2007).
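The general idea of clustering words by their distributional profiles can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: cosine similarity and average-link agglomerative clustering are assumed here because the abstract does not specify the similarity measure or the clustering algorithm, and the context features and counts are invented rather than taken from the Twente Nieuws Corpus.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(vectors, n_classes):
    """Average-link agglomerative clustering down to n_classes clusters."""
    def avg_sim(pair):
        a, b = pair
        sims = [cosine(vectors[x], vectors[y]) for x in a for y in b]
        return sum(sims) / len(sims)

    clusters = [[w] for w in sorted(vectors)]  # start with one cluster per word
    while len(clusters) > n_classes:
        # merge the two clusters whose members are most similar on average
        a, b = max(combinations(clusters, 2), key=avg_sim)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Invented syntactic context counts (hypothetical feature names and values).
vectors = {
    "zon": {"subj:schijnen": 9, "mod:warm": 5},
    "wind": {"subj:waaien": 8, "mod:warm": 3, "mod:koud": 4},
    "minister": {"subj:zeggen": 10, "subj:besluiten": 6},
    "regering": {"subj:besluiten": 9, "subj:zeggen": 5},
}

# The same profiles can be cut into varying numbers of classes.
for k in (3, 2):
    print(k, cluster(vectors, k))
```

Varying n_classes yields coarser or finer classifications from the same profiles, which can then be compared with one another.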
Experiments with Dutch causative constructions
The Dutch causative constructions with doen and laten consist of several slots. Our
experiments involve the two most common nominal slots, the Causer and the Causee, as well as
one verbal slot, the Effected Predicate (see Kemmer and Verhagen 1994). In our first
experiment, we compare the predictive power of various classifications for the three slots,
taken separately and together, on the basis of the observations with doen and laten in a large
data set. We also compare the bottom-up classes with several manual classifications. In the
second experiment, the criterion is how successfully the classification can predict the
attraction of the lexeme to the construction with doen or laten as established by distinctive
collexeme analysis (Gries and Stefanowitsch 2004).
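For readers unfamiliar with distinctive collexeme analysis, the sketch below shows the core computation for a single lexeme: a Fisher exact test on the 2x2 table of the lexeme's frequency in each construction against all other fillers. The one-sided p-value in the direction of the observed preference is an assumption of this sketch, the function names are hypothetical, and the frequencies are invented, not results from the study.

```python
import math

def hypergeom_pmf(k, draws, successes, total):
    """P(exactly k successes) when taking `draws` items from `total` without replacement."""
    return (math.comb(successes, k) * math.comb(total - successes, draws - k)
            / math.comb(total, draws))

def distinctive_collexeme(freq_doen, freq_laten, total_doen, total_laten):
    """Return (preferred construction, one-sided Fisher exact p-value) for one lexeme."""
    total = total_doen + total_laten
    word_total = freq_doen + freq_laten
    expected_doen = word_total * total_doen / total
    lowest = max(0, word_total - total_laten)   # smallest possible count with doen
    highest = min(word_total, total_doen)       # largest possible count with doen
    if freq_doen >= expected_doen:
        preferred, ks = "doen", range(freq_doen, highest + 1)
    else:
        preferred, ks = "laten", range(lowest, freq_doen + 1)
    # sum the probabilities of all tables at least as extreme as the observed one
    p = sum(hypergeom_pmf(k, total_doen, word_total, total) for k in ks)
    return preferred, p

# Invented frequencies: 200 doen tokens and 800 laten tokens overall.
denken = distinctive_collexeme(freq_doen=30, freq_laten=5, total_doen=200, total_laten=800)
zien = distinctive_collexeme(freq_doen=2, freq_laten=60, total_doen=200, total_laten=800)
print(denken, zien)
```

A classification can then be judged by how well class membership predicts these per-lexeme preferences.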
Preliminary results
The preliminary results suggest that the bottom-up classifications have substantial predictive power,
although the manual classifications still perform better. Even in its current state, however, the
method can be successfully used to formulate linguistic hypotheses about the use of
constructions, providing a quick assessment of large amounts of data. Interestingly, the
behaviour of the verbal slot fillers is fundamentally different from that of the nominal slot
fillers: whereas the nouns are modelled well with a small number of classes, the verbs do not
allow for a large data reduction. We offer an explanation of this and other findings from the
constructionist perspective.
References
Bouma, Gosse, Gertjan van Noord and Robert Malouf 2001. Alpino: Wide-coverage Computational
Analysis of Dutch. In: Daelemans, W. et al. (Eds.), Computational linguistics in the Netherlands
2000: Selected papers from the Eleventh CLIN meeting, 45-59. Rodopi, Amsterdam.
Bybee, Joan L. and David Eddington 2006. A Usage-based Approach to Spanish Verbs of
‘Becoming’. Language 82 (2): 323-355.
Geeraerts, Dirk 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.
Gries, Stefan Th. and Anatol Stefanowitsch 2004. Extending collostructional analysis: a corpus-based
perspective on ‘alternations’. International Journal of Corpus Linguistics 9(1). 97-129.
Gries, Stefan Th. and Anatol Stefanowitsch 2010. Cluster analysis and the identification of collexeme
classes. In John Newman & Sally Rice (eds.). Empirical and experimental methods in
cognitive/functional research, 73-90. Stanford, CA: CSLI.
Heylen, Kris, Yves Peirsman and Dirk Geeraerts 2008. Modelling Word Similarity. An Evaluation of
Automatic Synonymy Extraction Algorithms. In Proceedings of the Language Resources and
Evaluation Conference (LREC) Marrakech, Morocco, 2008.
Kemmer, Suzanne and Arie Verhagen 1994. The grammar of causatives and the conceptual structure
of events. Cognitive Linguistics 5, 115-156.
Levin, Beth 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago:
University of Chicago Press.
Lin, Dekang 1998. Automatic retrieval and clustering of similar words. COLING-ACL98, Montreal,
Canada, August, 1998, 768-774.
Ordelman, Roeland, Franciska de Jong, Arjan van Hessen and Henri Hondorp 2007. TwNC: a
Multifaceted Dutch News Corpus. ELRA Newsletter 12 (3-4). Available at:
http://doc.utwente.nl/68090/ (last accessed 7 August 2010).
Stefanowitsch, Anatol and Stefan Th. Gries 2003. Collostructions: Investigating the interaction of
words and constructions. International Journal of Corpus Linguistics 8(2). 209–243.