
Corpus, outils et modélisation statistique pour l'acquisition du langage
Corpora, tools and statistical models for language acquisition research
12-13 November 2013
Salle de conférences, CNRS Pouchet,
59 rue Pouchet, 75017 Paris
12 November / 12 novembre
Morning / Matin

9h30–12h30: Pauline Beaupoil & Christophe Parisse
Introduction to using CLAN, and advanced practice

11h: Coffee break / pause café
Afternoon / Après-midi

14h–15h: Stefan Gries
Some suggestions for better statistics in corpus-based language acquisition research

15h–16h: Dylan Glynn
Correspondence analysis. Exploring categorical data and identifying patterns

16h–16h30: Coffee break / pause café

16h30–17h30: Round table: Using corpora of spontaneous child-adult interaction: How much does
the child's production match the child's input?
13 November / 13 novembre
Morning / Matin

9h30–12h30: Christophe Parisse
Tools for processing and analysing language corpora

11h: Coffee break / pause café
Afternoon / Après-midi

14h–15h: Thomas Hills
Word learning: Growing semantic networks on the statistical structure of language

15h–16h: Colin Bannard
Rethinking the role of imitation in language development

16h–16h30: Coffee break / pause café

16h30–17h30: Anna Theakston
Learning grammatical constructions: insights from corpus data
Abstracts / Résumés
Colin Bannard: Rethinking the role of imitation in language development
This talk is concerned with children's use of imitation as a strategy in language learning. I will describe a range of
different studies, all of which are concerned in some way with understanding when children will choose to imitate
language they have heard others use, and when they will choose to be selective or creative in their productions. I will
explain the utility of imitation via a statistical analysis of the language that children hear, and particularly a discussion
of the bias-variance problem, a central concept in machine learning. I will use this to motivate corpus-derived
statistical models that allow us to predict when children will imitate and when they will innovate, and will discuss the
utility of such models in accounting for grammatical development.
Dylan Glynn: Correspondence Analysis. Exploring categorical data and identifying patterns
Correspondence analysis is an exploratory technique that reveals frequency-based associations in complex categorical
data. The technique visualises these associations in ‘biplots’, or maps that depict degrees of correlation and variation
through the relative proximity of data points (which represent linguistic usage features and / or the actual examples of
use). Linguists often wish to find relations among linguistic forms, their meanings, and the situations in which
those forms and meanings are used. Correspondence analysis is designed specifically for identifying such
usage patterns.
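As a rough illustration of the technique described in this abstract, the sketch below runs a simple correspondence analysis on a small contingency table using a singular value decomposition. It is not taken from the talk: the function name `correspondence_analysis` and the counts in `toy_counts` are invented for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def correspondence_analysis(counts):
    """Simple correspondence analysis of a two-way contingency table."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()              # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    # Standardised residuals: departure from independence, scaled by the masses.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates of rows
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates of columns
    return rows, cols, sv ** 2                # squared singular values = inertia per axis

# Hypothetical counts, invented for illustration only:
# rows could be usage features, columns the contexts in which they occur.
toy_counts = [[30, 10, 5],
              [10, 25, 10],
              [5, 10, 30]]

rows, cols, inertia = correspondence_analysis(toy_counts)
share = inertia / inertia.sum()  # proportion of total inertia carried by each axis
```

Plotting the first two columns of `rows` and `cols` on the same axes gives the kind of map the abstract calls a biplot: categories that co-occur more often than independence would predict tend to lie in the same direction from the origin.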
Stefan Gries: Some suggestions for better statistics in corpus-based language acquisition research
Compared to many other areas in linguistics - especially formal linguistics - research in language acquisition has had a
history of being based on empirical data from observational and/or experimental approaches. As a result,
language acquisition research features more, and more advanced, statistical analyses than many other sub-disciplines of
linguistics. However, given recent developments in quantitative corpus linguistics and statistical methodology, it
may be time to take stock, to discuss some common methodological choices in corpus-based language acquisition
research, and to explore options for improving on them conceptually and methodologically. In this talk, I will discuss a
few methodological choices that have frequently been made or that, more or less implicitly, underlie much corpus-based
language acquisition research, with an eye to proposing alternative ways to think about such data and
methods. Among other things, I will be concerned with the multifactoriality of corpus-based language acquisition
data, the question of acquisitional stages, and the identification of trends.
Thomas Hills: Word learning: Growing semantic networks on the statistical structure of language
Children learn language in a sea of words. In this talk, I will report on my recent research attempting to predict word
learning based on the structure of child-directed speech. This work is based on a new theory of language acquisition
called the associative structure of language, which posits that word associations and contextual diversity work
hand-in-hand to help children learn language. This work involves using network analysis to pit computational models of
language acquisition against one another, using large corpora of adult- and child-directed speech.
Anna Theakston: Learning grammatical constructions: insights from corpus data
From a constructivist perspective, children are thought to acquire the grammatical constructions of their language
from the language to which they are exposed – caregiver input. From this perspective, first, the distributional
properties of the input are centrally important in determining the pattern of acquisition observed in early child speech
and second, the development of adult-like grammatical knowledge is assumed to emerge gradually as a function of a
growing and increasingly connected network of representations. In this talk I will give an overview of a number
of research projects in which we have investigated the acquisition of grammatical constructions of different kinds
through the analysis of corpus data, including a consideration of grammatical constructions (the transitive),
grammatical errors (case marking) and morphological systems (the past tense).
Round table: Using corpora of spontaneous child-adult interaction: How much does the child's production
match the child's input?
Corpora of child language in interactions are very often used as a means to study children’s production in natural
settings and to evaluate the degree of correspondence between their production and input. Dense corpora provide an
even better picture of the correspondence between the child and the adult. However, there is always a part of the data
that is missing in such corpora. Is it possible to estimate this part and to evaluate how far we can expect to find valid
correspondences and how far we should expect the child to be creative or to remember what she heard in the input?