abstract_laippala (2)

TIAS-project 2014-2016: Blog, comment and discuss! A quantitative study of French Internet texts
using automatic morpho-syntactic analysis.
My project aims at a quantitative analysis of the characteristics and distinguishing features of a
large collection of different Internet texts in French, using automatic morphological and full
dependency syntax analyses, i.e. the detection of word forms and their functions in the sentence
as well as the identification of the entire sentence structure.
The corpus of the study combines private corpora collected for the purposes of other studies:
French politicians’ blogs and comments, chats from different news sites, discussion forum texts
from student sites and from news sites, as well as follow-up discussions of Le Monde.
The automatic morphological and syntactic analyses will be done using the TALISMANE toolkit
developed at the University of Toulouse-Le-Mirail. If necessary, a syntax parser specifically for the
analysis of Internet texts will be constructed in collaboration with the original developers of the tool
by annotating a small collection of example sentences from the project corpus and by combining
these with the existing components of the toolkit.
The quantitative linguistic analyses will be done using different statistical methods, namely factor
analysis, key word analysis and key structure analysis, all enabled by the morphological and
syntactic information that the automatic analyses provide. This not only gives new information on
the studied texts but also enables the development of these methods, as syntactic features in
particular are still rarely used in quantitative studies.
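As an illustration of key word analysis, the following minimal sketch ranks items that are unusually frequent in a target corpus compared with a reference corpus. It assumes the log-likelihood (G2) statistic, which is common in corpus linguistics but is not named in this abstract; the function names and the toy data are hypothetical. Key structure analysis could in principle reuse the same scoring with syntactic features (for example dependency label and part-of-speech pairs) in place of word forms.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 keyness score (assumed statistic, not specified in the abstract).

    a: frequency of the item in the target corpus
    b: frequency of the item in the reference corpus
    c: total number of tokens in the target corpus
    d: total number of tokens in the reference corpus
    """
    # Expected frequencies if the item were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(target_tokens, reference_tokens, top_n=5):
    """Return the top_n items overrepresented in the target corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for item, a in target.items():
        b = reference.get(item, 0)
        # Keep only items relatively more frequent in the target corpus.
        if a / c > b / d:
            scored.append((item, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical toy data: chat tokens versus news tokens.
    chat = ["lol", "mdr", "je", "suis", "lol", "le", "lol"]
    news = ["le", "gouvernement", "je", "le", "suis", "le", "ministre"]
    for item, score in keywords(chat, news):
        print(item, round(score, 2))
```

The items passed to `keywords` can be word forms, lemmas or any other hashable feature, so the choice of unit is left to the analysis.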
The resulting corpora with morphological and syntactic analyses will be released as freely as
possible, giving other researchers the opportunity to benefit from the study and enabling
research that is currently not possible.
Key words: Computer-mediated communication, automatic morphological and syntactic analysis,
corpus linguistics, factor analysis, key word analysis