Corpus, outils et modélisation statistique pour l'acquisition du langage
Corpora, tools and statistical models for language acquisition research

12-13 November 2013
Conference room, CNRS Pouchet, 59 rue Pouchet, 75017 Paris

12 November

Morning
9h30–12h30: Pauline Beaupoil & Christophe Parisse: Introduction to using CLAN and advanced practice (Introduction à l’utilisation de CLAN et perfectionnement)
11h: Coffee break

Afternoon
14h–15h: Stefan Gries: Some suggestions for better statistics in corpus-based language acquisition research
15h–16h: Dylan Glynn: Correspondence analysis. Exploring categorical data and identifying patterns
16h–16h30: Coffee break
16h30–17h30: Round table: Using corpora of spontaneous child-adult interaction: How much does the child's production match the child's input?

13 November

Morning
9h30–12h30: Christophe Parisse: Tools for processing and analysing language corpora (Outils de traitement et analyse de corpus de langage)
11h: Coffee break

Afternoon
14h–15h: Thomas Hills: Word learning: Growing semantic networks on the statistical structure of language
15h–16h: Colin Bannard: Rethinking the role of imitation in language development
16h–16h30: Coffee break
16h30–17h30: Anna Theakston: Learning grammatical constructions: insights from corpus data

Abstracts

Colin Bannard: Rethinking the role of imitation in language development
This talk is concerned with children's use of imitation as a strategy in language learning. I will describe a range of different studies, all of which are concerned in some way with understanding when children will choose to imitate language they have heard others use, and when they will choose to be selective or creative in their productions. I will explain the utility of imitation via a statistical analysis of the language that children hear, and particularly a discussion of the bias-variance problem, a central concept in machine learning.
I will use this to motivate corpus-derived statistical models that allow us to predict when children will imitate and when they will innovate, and will discuss the utility of such models in accounting for grammatical development.

Dylan Glynn: Correspondence analysis. Exploring categorical data and identifying patterns
Correspondence analysis is an exploratory technique that reveals frequency-based associations in complex categorical data. The technique visualises these associations in ‘biplots’, or maps that depict degrees of correlation and variation through the relative proximity of data points (which represent linguistic usage features and/or the actual examples of use). Linguists often wish to find relations between given linguistic forms, their meanings, and the situations in which those forms and meanings are used. Correspondence analysis is especially designed for identifying such usage patterning.

Stefan Gries: Some suggestions for better statistics in corpus-based language acquisition research
Compared to many other areas in linguistics, especially formal linguistics, research in language acquisition has had a history of being based on empirical data from observational and/or experimental approaches. This in turn resulted in language acquisition research featuring more, and more advanced, statistical analyses than many other sub-disciplines of linguistics. However, given recent developments in quantitative corpus linguistics and statistical methodology, it may be time to take stock, to discuss some common methodological choices in corpus-based language acquisition research, and to explore options for improving on them conceptually and methodologically. In this talk, I will discuss a few methodological choices that have frequently been made or that, more or less implicitly, underlie much corpus-based language acquisition research, with an eye to then proposing alternative ways to think about such data and methods.
Among other things, I will be concerned with the multifactoriality of corpus-based language acquisition data, the question of acquisitional stages, and the identification of trends.

Thomas Hills: Word learning: Growing semantic networks on the statistical structure of language
Children learn language in a sea of words. In this talk, I will report on my recent research attempting to predict word learning based on the structure of child-directed speech. This work is based on a new theory of language acquisition called the associative structure of language, which posits that word associations and contextual diversity work hand in hand to help children learn language. This work involves using network analysis to pit computational models of language acquisition against one another, using large corpora of adult- and child-directed speech.

Anna Theakston: Learning grammatical constructions: insights from corpus data
From a constructivist perspective, children are thought to acquire the grammatical constructions of their language from the language to which they are exposed, that is, caregiver input. From this perspective, first, the distributional properties of the input are centrally important in determining the pattern of acquisition observed in early child speech and, second, the development of adult-like grammatical knowledge is assumed to emerge gradually as a function of a growing and increasingly connected network of representations. In this talk I will give an overview of a number of research projects in which we have investigated the acquisition of grammatical constructions of different kinds through the analysis of corpus data, including a consideration of grammatical constructions (the transitive), grammatical errors (case marking) and morphological systems (the past tense).

Round table: Using corpora of spontaneous child-adult interaction: How much does the child's production match the child's input?
Corpora of child language in interaction are very often used to study children’s production in natural settings and to evaluate the degree of correspondence between their production and their input. Dense corpora provide an even better picture of the correspondence between the child and the adult. However, part of the data is always missing from such corpora. Is it possible to estimate this part and to evaluate how far we can expect to find valid correspondences and how far we should expect the child to be creative or to remember what she heard in the input?