Digital Editions & Language Resources Portal
Workshop - Save the data, 2 December 2014, Vienna
Matej Ďurčo
ICLTT/ACDH, ÖAW
matej.durco@oeaw.ac.at

What kind of data?
TEXT http://www
Dictionaries
• Persian – English Dictionary
• German – Russian Dictionary
• Dictionary of Bavarian dialects in Austria
Cooperation with the Austrian dictionary and the dictionary of German variants
Full word-form, corpus-based lexical database of German
Databases, e.g. prosopographic, bibliographic data, …
Audio – speech recordings (project Tunico)

Complexity, Formats
• XML, TEI
• Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital material from the web (needs cleaning)
• Multi-level enrichment: structural markup, linguistic / semantic annotation (stand-off)
• Linking:
  - combining lexicographic material with information from corpora (encoding in TEI)
  - semantic representation of lexicographic resources in RDF
• Audio with aligned transcription

Size? Qualitative vs. quantitative
• K. Kraus, "Die Fackel" (1899 – 1936): ~ 22,500 pages, ~ 6 million tokens
• AAC: ~ 500 million tokens + facsimiles, 40 TB!
• AMC: ~ 8 billion tokens in over 35 million articles of recent journalistic texts (complete newspapers & magazines in Austria over the last 20 years!), 100 GB
• Prosopographic database: 23,000 entries
• A number of smaller editions/corpora: 5 – 50 works/resources, rich annotation, < 100,000 t
• Multiple dictionaries with a few thousand entries

Metadata
• Bibliographic information encoded as teiHeader
• CMDI – Component Metadata Infrastructure, used within CLARIN
• Allows for flexible "profiles" specific to the type of resource and project/context:
  - Lexical Resource
  - TextCorpus
  - Collection
  - teiHeader (emulated in CMDI)

Requirements on online availability
Varying combinations of:
• full-text search
• semantic search (search for persons, places, search by categories and classifications)
• full view (e.g. text and facsimile of individual pages)
• specialized visualizations (temporal, spatial, graph)
• raw data available for download
• stable references to resources and resource fragments
BUT before publication: collaborative editing in a VRE (virtual research environment)!

Solutions
• Publishing framework: corpus_shell
• Repository for digital objects (Fedora-based)
• Viennese Lexicographic Editor: collaborative environment for lexicographic work
• oXygen, XML database eXist
• Apache Solr, Sketch Engine (/NoSke), DDC for fast, advanced (linguistic) search capabilities
• Most recently: Language Resources Portal

European Research Infrastructures: CLARIN
• "Under construction", but many real services are already available
• A stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
• Network of Centres (real ones with computing and storage, not virtual)
  - Certification process (centre assessment)
  - Types: A (infrastructure), B (LRT data/services), C (metadata)
  - Currently 14 centres certified (+ 4 pending)
  - Coordinated through the SCCTC (Standing Committee on CLARIN Technical Centres)

CLARIN Infrastructure
• Federated Identity: AAI, Single Sign-On
• Persistent Identifier
• CMDI – Component Metadata Infrastructure: flexible framework for creation and publication of metadata
• FCS – Federated Content Search: distributed system for searching in the content of the resources (corpora, …)
• Fostering the use of standards: CLARIN Standards Committee (SCS)

Publishing Framework corpus_shell
• modular framework for publishing a wide range of language resources
• designed to operate in a distributed and heterogeneous environment
• distributed setup, FCS-based
• integration with the CLARIN metadata infrastructure
• (reusing) specialized resource viewers for specific types of resources
• multiple implementations (PHP, Perl, XQuery)
• cooperation/integration with SADE (Scalable Architecture for Digital Editions), BBAW, Berlin
• open source (code on GitHub)

corpus_shell instances
• vicav
• ABaC:us

Lexicography suite
Dictionary Server
• Open and freely available software that can be readily distributed (MySQL, PHP)
• Integrated with corpus_shell (FCS as common protocol)
• Connected to the clients through a REST-style web service
Viennese Lexicographic Editor
• The corresponding client
DictGate
• A service for hosting lexicographic data
• For smaller lexicographic projects

Lexicography suite: Viennese Lexicographic Editor (VLE)
• XML editor specialized for editing lexicographic data
• Generic – support for any (XML) format (LMF, TEI, TBX, RDF)
• Making use of cognate technologies (XSLT, XPath, XSD)
• Various editing modes, configurable keyboard layouts
• Optimised corpus-dictionary interface
• On-the-fly data visualisations

CLARIN Centre Vienna
clarin.oeaw.ac.at
• First Austrian node in the network of CLARIN Centres
• DSA (Data Seal of Approval) and CLARIN Centre B status, April 2014
• Language Resources Portal
  - Mission: national depositing and publishing service for digital language resources
• Tools: corpus_shell, lexicographic suite, …
• Infrastructure Services: "Knowledge Hub", mostly about metadata (under development)

Thank you!
Matej Ďurčo
Austrian Centre for Digital Humanities
Österreichische Akademie der Wissenschaften
matej.durco@oeaw.ac.at