Laying the Foundations for a Diachronic
Dictionary of Tunis Arabic
A First Glance at an Evolving New Language Resource
Karlheinz Mörth (1), Stephan Procházka (2), Ines Dallaji (2)
(1) Institute of Corpus Linguistics and Text Technology (Austrian Academy of Sciences)
(2) Department of Oriental Studies (University of Vienna)
Introduction
Two projects
Vienna Corpus of Arabic Varieties (VICAV)
Linguistic Dynamics in the Greater Tunis Area: A Corpus-based Approach (TUNICO)
Text technology + Linguistics
Introduction
VICAV
==> Vienna Corpus of Arabic Varieties
Digital language resources of a wide range of spoken
Arabic varieties: dictionaries, corpora, bibliographies,
language profiles, best practices
Cooperation of University of Vienna and the Austrian
Academy of Sciences
http://corpus3.aac.oeaw.ac.at/vicav2/
Introduction
TUNICO
==> Linguistic Dynamics in the Greater Tunis Area: A
Corpus-based Approach
Funded by the Austrian Science Fund (FWF, P 25706-G23)
Main objectives:
Linguistic exploration of spoken, contemporary Arabic
Two digital language resources
Corpus of spoken youth language
Dictionary of Tunis Arabic
Arabic dialect lexicography
No comprehensive dictionary of the Arabic dialect of
Tunis
Basis for diachronic research:
• Nicolas, A. (1911). Dictionnaire français-arabe
• Beaussier, M. (2006). Dictionnaire pratique arabe-français (arabe
maghrébin)
• Quéméneur, J. (1961). “Notes sur quelques vocables du parler
Tunisien”
• Quéméneur, J. (1962). “Glossaire de dialectal”
• Abdellatif, K. (2010). Dictionnaire «le Karmous» du Tunisien
• Marçais, W., Guîga, A. (1958-61). Textes arabes de Takroûna. II:
Glossaire
Dictionary of Tunis Arabic
- micro-diachronic and machine-readable
- up-to-date and easily accessible lexical information
- incorporation of:
a) contemporary data from a digital corpus
b) various historical sources (e.g. Stumme, H.)
- information added is kept traceable to its origin
- basis: data taken from didactic materials
- 3 other main sources: newly created corpus,
interviews and historical publications
Dictionary of Tunis Arabic
Contemporary sources
1) Corpus of spoken youth language (dialogues,
narratives):
an uncommon approach in Arabic dialectology:
dialectologists have mostly studied the language of older people --> only
older forms of particular varieties are known
here the focus is on modern language, contemporary usage and lexical
neologisms
2) Additional interviews to complete the data gained from
corpus and historical sources
Dictionary of Tunis Arabic
Historical sources
- 800-page grammar of the Arabic dialect of the Medina of Tunis by
Hans-Rudolf Singer (1984): evaluation of data, integration of excerpted
lexicographic data into the dictionary
- Verification and completion of collected data with other
historical resources
- Diachronic dimension helps to understand processes in the
development of the lexicon
- Material gathered will allow analysis of recent developments
(migration of parents from rural areas, influence by other Arabic
varieties, influence of revolution, foreign elements)
Dictionary of Tunis Arabic
Technical issues
Modelling the data
Tools
Dictionary of Tunis Arabic
Technical issues
Single schema for a range of dictionaries
LMF, RDF, SKOS, TEI (P5)
Dictionary of Tunis Arabic
Technical issues
Using the TEI dictionary module to encode digitised print
dictionaries is a fairly common standard procedure in
digital humanities.
The TEI dictionary module needs to be further constrained:
• to enhance interoperability
• to reduce alternate constructs
• to achieve a high degree of compliance with LMF (ISO
24613)
Such constraints are easy to impose when creating born-digital dictionaries.
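As a rough illustration of what "reducing alternate constructs" can mean in practice, the following Python sketch checks an entry against a small whitelist of permitted child elements. The whitelist, the function name `violations` and the two sample entries are assumptions made for this example; they are not the project's actual schema, which would normally be expressed as a TEI customisation rather than in code.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist (an assumption, not the VICAV schema):
# the child elements permitted under each parent element.
ALLOWED = {
    "entry": {"form", "gramGrp", "sense"},
    "form": {"orth"},
    "gramGrp": {"gram"},
    "sense": {"cit"},
    "cit": {"quote"},
}

def violations(elem, path=""):
    """Yield the paths of child elements that the whitelist does not permit."""
    here = f"{path}/{elem.tag}"
    for child in elem:
        if child.tag not in ALLOWED.get(elem.tag, set()):
            yield f"{here}/{child.tag}"
        yield from violations(child, here)

# An entry that gives its translation as <cit>/<quote> passes the check.
good = ET.fromstring(
    "<entry><form><orth>x</orth></form>"
    "<sense><cit><quote>y</quote></cit></sense></entry>"
)
# An entry that uses <def> directly under <entry> is flagged.
bad = ET.fromstring("<entry><form><orth>x</orth></form><def>y</def></entry>")

good_report = list(violations(good))
bad_report = list(violations(bad))
print(good_report)
print(bad_report)
```

Allowing only one way to express a given piece of information is what makes entries from different dictionaries comparable and easier to map onto LMF.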
Dictionary of Tunis Arabic
Basic schema
<TEI>
  <teiHeader>
    ...
  </teiHeader>
  <text>
    <body>
      <div type="entries">
        <entry>...</entry>
        <entry>...</entry>
        <entry>...</entry>
        ...
      </div>
    </body>
  </text>
</TEI>
Dictionary of Tunis Arabic
Basic schema
<body>
  <div type="entries">
    <entry>...</entry>
    <entry>...</entry>
    <entry>...</entry>
    ...
  </div>
  <div type="examples">
    <cit type="example">...</cit>
    <cit type="example">...</cit>
    <cit type="example">...</cit>
    ...
  </div>
</body>
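The two-division layout (entries in one `<div>`, examples in another) can be inspected with a few lines of Python. This is only a sketch: the skeleton below uses empty placeholder elements and no TEI namespace declaration, and `division_sizes` is a name invented for the example.

```python
import xml.etree.ElementTree as ET

# Placeholder skeleton following the slide's layout (no namespace declared,
# empty elements instead of real entries and examples).
SKELETON = """
<TEI>
  <teiHeader/>
  <text>
    <body>
      <div type="entries">
        <entry/><entry/><entry/>
      </div>
      <div type="examples">
        <cit type="example"/><cit type="example"/>
      </div>
    </body>
  </text>
</TEI>
"""

def division_sizes(root):
    """Map each <div type="..."> in the body to its number of child elements."""
    return {div.get("type"): len(list(div)) for div in root.findall("text/body/div")}

sizes = division_sizes(ET.fromstring(SKELETON))
print(sizes)
```

Keeping examples in their own division means that an example can be stored once and then referred to from several entries.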
Dictionary of Tunis Arabic
Basic schema
<entry xml:id="ktaab_001">
  <form type="lemma">
    <orth xml:lang="ar-aeb-x-tunis-vicav">ktāb</orth></form>
  <form type="inflected" ana="#n_pl">
    <orth xml:lang="ar-aeb-x-tunis-vicav">ktub</orth></form>
  <gramGrp>
    <gram type="pos">noun</gram>
    <gram type="root" xml:lang="ar-aeb-x-tunis-vicav">ktb</gram>
  </gramGrp>
  <sense>
    <cit type="translation" xml:lang="en">
      <quote>book</quote></cit>
    <cit type="translation" xml:lang="de">
      <quote>Buch</quote></cit>
    <cit type="translation" xml:lang="fr">
      <quote>livre</quote></cit>
  </sense>
</entry>
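Because the entry is machine-readable, its parts can be pulled out programmatically. The sketch below reads the sample entry with Python's standard library; the helper names `lang_of` and `summarise_entry` are invented for this illustration, the entry is parsed without a TEI namespace, and the language attribute is accepted either as `xml:lang` or as plain `lang`.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# The sample entry from the slide, embedded so the sketch is self-contained.
ENTRY = """
<entry xml:id="ktaab_001">
  <form type="lemma"><orth xml:lang="ar-aeb-x-tunis-vicav">ktāb</orth></form>
  <form type="inflected" ana="#n_pl"><orth xml:lang="ar-aeb-x-tunis-vicav">ktub</orth></form>
  <gramGrp>
    <gram type="pos">noun</gram>
    <gram type="root" xml:lang="ar-aeb-x-tunis-vicav">ktb</gram>
  </gramGrp>
  <sense>
    <cit type="translation" xml:lang="en"><quote>book</quote></cit>
    <cit type="translation" xml:lang="de"><quote>Buch</quote></cit>
    <cit type="translation" xml:lang="fr"><quote>livre</quote></cit>
  </sense>
</entry>
"""

def lang_of(elem):
    """Return the language tag, accepting xml:lang or a plain lang attribute."""
    return elem.get(XML_NS + "lang") or elem.get("lang")

def summarise_entry(entry):
    """Collect lemma, inflected forms, grammar and translations from one entry."""
    inflected = {f.get("ana"): f.findtext("orth")
                 for f in entry.findall("form[@type='inflected']")}
    grams = {g.get("type"): g.text for g in entry.findall("gramGrp/gram")}
    translations = {lang_of(c): c.findtext("quote")
                    for c in entry.findall("sense/cit[@type='translation']")}
    return {
        "lemma": entry.findtext("form[@type='lemma']/orth"),
        "inflected": inflected,
        "pos": grams.get("pos"),
        "root": grams.get("root"),
        "translations": translations,
    }

summary = summarise_entry(ET.fromstring(ENTRY))
print(summary)
```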
Dictionary of Tunis Arabic
Representing diachrony
…
<bibl>
  <author>Ritt-Benmimoun</author>
  <date>2014</date>
</bibl>
…
<bibl>
  <author>Singer</author>
  <date>1958</date>
  <biblScope unit="page">56</biblScope>
</bibl>
…
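One way source references of this kind could support diachronic queries is sketched below in Python. The slide does not show where the `<bibl>` elements are attached, so the sketch assumes, purely for illustration, that each `<bibl>` is a child of the `<form>` it documents. The entry, the `type="variant"` value and the placeholder strings FORM_A and FORM_B are hypothetical; only the two references are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical placement: each <bibl> sits inside the <form> it attests.
# FORM_A and FORM_B are placeholders, not real lexical data.
SAMPLE = """
<entry>
  <form type="lemma">
    <orth>FORM_A</orth>
    <bibl><author>Ritt-Benmimoun</author><date>2014</date></bibl>
  </form>
  <form type="variant">
    <orth>FORM_B</orth>
    <bibl>
      <author>Singer</author><date>1958</date>
      <biblScope unit="page">56</biblScope>
    </bibl>
  </form>
</entry>
"""

def attestations(entry):
    """Return (year, author, page, form) tuples, oldest source first."""
    rows = []
    for form in entry.findall("form"):
        for bibl in form.findall("bibl"):
            rows.append((
                int(bibl.findtext("date")),
                bibl.findtext("author"),
                bibl.findtext("biblScope[@unit='page']"),  # None if no page given
                form.findtext("orth"),
            ))
    return sorted(rows, key=lambda row: row[0])

timeline = attestations(ET.fromstring(SAMPLE))
for row in timeline:
    print(row)
```

Sorting by the year of the source gives a simple timeline of attestations per entry, which is the kind of traceability the dictionary aims for.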
Dictionary of Tunis Arabic
Documentation
http://corpus3.aac.ac.at/vicav2/query/tools/dictionary_encoding_guidelines
Dictionary of Tunis Arabic
Tools
Viennese Lexicographic Editor (VLE)
XML editor providing functionalities typically needed in compiling
lexicographic data
Web-based standalone application
Designed to process standard-based lexicographic and
terminological data such as LMF, TBX, RDF or TEI.
Automating procedures
Freely configurable visualisation (via XSLT)
Validation: MSXML Schema
Client-server architecture (PHP + MySQL)
Freely available and easy to set up
Dictionary of Tunis Arabic
Tools
Corpus – Dictionary interface
tokenEditor
Specialised Web-browser
Dictionary of Tunis Arabic
Tools
corpus_shell
... a modular framework of reusable software components to access
and publish heterogeneous and distributed language resources
such as language corpora, dictionaries, encyclopaedic databases,
prosopographic databases, bibliographies, metadata, and schemata.
Language Resources Portal
clarin.oeaw.ac.at/ccv/corpus_shell.
clarin.oeaw.ac.at/ccv/
Dictionary of Tunis Arabic
Status and outlook
CLARIN-ERIC (Common Language Resources and
Technology Infrastructure).
Open access and open source.
~5000 entries
Thank you for your attention!
شكراً لانتباهكم!