Corpus-based analysis of near-synonymous constructions: Larger, faster and more objective

Natalia Levshina, Kris Heylen and Dirk Geeraerts

Introduction

In this paper we propose a radically corpus-driven methodology for studies of near-synonymous constructions. It is based on the quantitative method of semantic space models (Lin 1998, Heylen et al. 2008), which is used to create a fully bottom-up and objective classification of constructional slot fillers. Using the Dutch causative constructions with doen and laten as a case study, we show how this method can be applied to popular statistical models of constructional near-synonyms, such as regression models and collostructional analysis, and discuss the advantages of the method, as well as its future challenges.

Aim

The semantic properties of constructional slot fillers are central in usage-based studies of constructional semantics, often being the main source of evidence (e.g. Stefanowitsch and Gries 2003, Bybee and Eddington 2006). Most commonly, researchers employ a ready-made classification of semantic classes such as Levin's (1993), although corpus-driven bottom-up classes have been proposed, too (Gries and Stefanowitsch 2010). These a priori classifications, however, run the risk of missing some semantic distinctions relevant for the constructional meaning. We suggest a more realistic and methodologically rigorous approach, evaluating several possible classifications by their predictive power and parsimony.

Method and data

Our classifications are created with the help of semantic vector space models, which take into account a variety of distributional (lexical and/or syntactic) properties of lexemes (Geeraerts 2010: 173-176). The words are clustered into varying numbers of classes according to their distributional profiles. The data for the experiment are taken from the syntactically parsed Twente Nieuws Corpus (Bouma et al. 2001, Ordelman et al. 2007).
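
The clustering of slot fillers by distributional profile can be illustrated with a minimal sketch. It is not the implementation used in this study: the four nouns, the context features and all counts are invented for illustration, and cosine similarity with average-link agglomerative clustering is only one of several possible choices. The function names (cosine, average_link, cluster) are likewise hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy data: each slot filler is represented by its co-occurrence
# counts with four context features. Nouns, features and counts are invented
# for illustration and are NOT taken from the Twente Nieuws Corpus.
FEATURES = ["subj_of:zeggen", "subj_of:besluiten", "adj:warm", "subj_of:schijnen"]
COUNTS = {
    "minister": [9, 7, 0, 0],
    "regering": [8, 9, 0, 1],
    "zon":      [0, 0, 6, 9],
    "wind":     [0, 1, 7, 5],
}


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise cosine similarity between the members of two clusters."""
    sims = [cosine(vectors[a], vectors[b]) for a in c1 for b in c2]
    return sum(sims) / len(sims)


def cluster(vectors, k):
    """Merge the two most similar clusters repeatedly until k clusters remain."""
    clusters = [[word] for word in sorted(vectors)]
    while len(clusters) > k:
        # Pick the pair of clusters (i < j) with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


if __name__ == "__main__":
    # Produce classifications of varying granularity from the same vectors.
    for k in (3, 2):
        print(k, cluster(COUNTS, k))
```

Calling cluster with different values of k yields the coarser and finer classifications whose predictive power can then be compared; in a realistic setting the raw counts would normally be replaced by weighted association scores over many more features.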

Experiments with Dutch causative constructions

The Dutch causative constructions with doen and laten consist of several slots. Our experiments involve the two most common nominal slots, the Causer and the Causee, as well as one verbal slot, the Effected Predicate (see Kemmer and Verhagen 1994). In our first experiment, we compare the predictive power of various classifications for the three slots, taken separately and together, on the basis of the observations with doen and laten in a large data set. We also compare the bottom-up classes with several manual classifications. In the second experiment, the criterion is how successfully the classification can predict the attraction of the lexeme to the construction with doen or laten as established by distinctive collexeme analysis (Gries and Stefanowitsch 2004).

Preliminary results

The preliminary results suggest that the bottom-up classifications have substantial predictive power, although the manual classifications still perform better. Even in its current state, however, the method can be successfully used to formulate linguistic hypotheses about the use of constructions, providing a quick estimation of large amounts of data. Interestingly, the behaviour of the verbal slot fillers is fundamentally different from that of the nominal slot fillers: whereas the nouns can be reduced to a small number of classes without much loss of predictive power, the verbs do not allow for a large data reduction. We offer an explanation of this and other findings from the constructionist perspective.

References

Bouma, Gosse, Gertjan van Noord and Robert Malouf 2001. Alpino: Wide-coverage Computational Analysis of Dutch. In W. Daelemans et al. (eds.), Computational Linguistics in the Netherlands 2000: Selected Papers from the Eleventh CLIN Meeting, 45-59. Amsterdam: Rodopi.
Bybee, Joan L. and David Eddington 2006. A Usage-based Approach to Spanish Verbs of 'Becoming'. Language 82(2). 323-355.
Geeraerts, Dirk 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.
Gries, Stefan Th. and Anatol Stefanowitsch 2004. Extending collostructional analysis: a corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9(1). 97-129.
Gries, Stefan Th. and Anatol Stefanowitsch 2010. Cluster analysis and the identification of collexeme classes. In John Newman and Sally Rice (eds.), Empirical and experimental methods in cognitive/functional research, 73-90. Stanford, CA: CSLI.
Heylen, Kris, Yves Peirsman and Dirk Geeraerts 2008. Modelling Word Similarity. An Evaluation of Automatic Synonymy Extraction Algorithms. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, 2008.
Kemmer, Suzanne and Arie Verhagen 1994. The grammar of causatives and the conceptual structure of events. Cognitive Linguistics 5. 115-156.
Levin, Beth 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press.
Lin, Dekang 1998. Automatic retrieval and clustering of similar words. COLING-ACL98, Montreal, Canada, August 1998, 768-774.
Ordelman, Roeland, Franciska de Jong, Arjan van Hessen and Henri Hondorp 2007. TwNC: a Multifaceted Dutch News Corpus. ELRA Newsletter 12(3-4). Available at: http://doc.utwente.nl/68090/ (last accessed 7 August 2010).
Stefanowitsch, Anatol and Stefan Th. Gries 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2). 209-243.