Language Resources Portal - Digital Humanities Austria

Digital Editions & Language Resources Portal
Workshop "Save the data", 2 December 2014, Vienna
Matej Ďurčo
ICLTT/ACDH, ÖAW
matej.durco@oeaw.ac.at
What kind of data?
TEXT
Dictionaries
• Persian – English Dictionary
• German – Russian Dictionary
• Dictionary of Bavarian dialects in Austria
Cooperation with the Austrian dictionary and the dictionary of German variants
Full word-form corpus-based lexical database of German
Databases
e.g. prosopographic, bibliographic data, …
Audio – speech recordings (project Tunico)
Complexity, Formats
XML TEI
Sources: plain text, images (need OCR), Word documents (need conversion), audio (needs transcribing), born-digital web content (needs cleaning)
Multi-level enrichment:
Structural markup, linguistic / semantic annotation (stand-off)
Linking:
• Combining lexicographic material with information from corpora (encoding in TEI)
• Semantic representation of lexicographic resources in RDF
Audio with aligned transcription
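To illustrate what stand-off annotation means in practice, here is a minimal sketch in Python. It is not the encoding used in the projects named here (those use TEI); the `Annotation` class, the layer names and the sample sentence are assumptions made purely for illustration. The point is that the source text is never modified: each annotation layer only points into it by character offsets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation (hypothetical structure for illustration)."""
    start: int   # character offset into the text, inclusive
    end: int     # character offset into the text, exclusive
    layer: str   # e.g. "token" or "entity"
    value: str   # the annotation itself


def annotate_tokens(text: str) -> List[Annotation]:
    """Whitespace tokenisation that records offsets instead of changing the text."""
    annotations = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)  # search from the end of the previous token
        end = start + len(word)
        annotations.append(Annotation(start, end, "token", word))
        pos = end
    return annotations


def covered_text(text: str, ann: Annotation) -> str:
    """Resolve an annotation back to the span of text it refers to."""
    return text[ann.start:ann.end]


text = "Karl Kraus published Die Fackel in Vienna"
tokens = annotate_tokens(text)

# A second, independent layer over the same untouched text.
place_start = text.index("Vienna")
entities = [Annotation(place_start, place_start + len("Vienna"), "entity", "place")]
```

Because every layer is kept apart from the text, linguistic and semantic annotations can be added, replaced or published separately without touching the structural markup.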
Size?
qualitative vs. quantitative
Karl Kraus, "Die Fackel" (1899–1936)
~22,500 pages, ~6 million tokens
AAC (Austrian Academy Corpus): ~500 million tokens + facsimiles, 40 TB!
AMC (Austrian Media Corpus): ~8 billion tokens in over 35 million articles of recent journalistic texts (complete newspapers and magazines in Austria over the last 20 years!) – 100 GB
Prosopographic database with 23,000 entries
A number of smaller editions/corpora
5–50 works/resources, rich annotation, < 100,000 tokens
Multiple dictionaries with a few thousand entries
Metadata
Bibliographic information
encoded as teiHeader
CMDI – Component Metadata Infrastructure, used within CLARIN
Allows for flexible "profiles" specific to the type of resource and project/context
- Lexical Resource
- TextCorpus
- Collection
- teiHeader (emulated in CMDI)
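As a rough illustration of bibliographic information encoded as a teiHeader, the sketch below builds the smallest header skeleton (a `fileDesc` with `titleStmt`, `publicationStmt` and `sourceDesc`) using Python's standard library. The helper names `q` and `build_tei_header` and the placeholder strings are assumptions for this example; real project headers carry far more detail.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace (helper for this sketch)."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei_header(title: str, publication_note: str, source_note: str) -> ET.Element:
    """Build a minimal teiHeader: fileDesc with its three mandatory parts."""
    header = ET.Element(q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))

    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title

    publication_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication_stmt, q("p")).text = publication_note

    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = source_note
    return header


# Placeholder values, not actual project metadata.
header = build_tei_header(
    "Die Fackel",
    "Placeholder publication statement",
    "Placeholder source description",
)
xml_text = ET.tostring(header, encoding="unicode")
```

A CMDI profile that emulates the teiHeader would carry the same fields in CMDI components; that mapping is not shown here.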
Requirements on online availability
Varying combinations of:
full-text search
semantic search (search for persons, places, search by categories and classifications)
full-view (e.g. text and facsimile of individual pages)
specialized visualizations (temporal, spatial, graph)
raw data available for download
stable references to resources and resource fragments
BUT before publication: collaborative editing → VRE (virtual research environment)!
Solutions
Publishing framework: corpus_shell
Repository for digital objects (Fedora-based)
Viennese Lexicographic Editor
Collaborative environment for lexicographic work
oXygen, XML-database eXist
Apache Solr, Sketch Engine (/NoSkE), DDC for fast, advanced (linguistic) search capabilities
Most recently: Language Resources Portal
European Research Infrastructures
CLARIN
"under construction", but many real services already available
Stable organisational structure has been set up: general assembly, board of directors, national coordinators, thematic committees, …
Network of Centres
(real ones with computing and storage – not virtual)
• Certification process (centre assessment)
• Types: A (infrastructure), B (LRT data/services), C (metadata)
• currently 14 centres certified (+ 4 pending)
• Coordinated through the SCCTC (Standing Committee on CLARIN Technical Centres)
CLARIN Infrastructure
Federated Identity
AAI, single sign-on
Persistent identifiers
CMDI – Component Metadata Infrastructure
flexible framework for creation and publication of metadata
FCS – Federated Content Search
distributed system for searching in the content of the resources (corpora, …)
Fostering the use of standards
CLARIN Standards Committee (SCS)
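Federated Content Search builds on the SRU protocol with CQL queries, so a search against any participating endpoint is an ordinary HTTP request. The sketch below only assembles such a request URL from standard SRU 1.2 parameters; the endpoint `https://example.org/fcs` and the function name `build_fcs_search_url` are hypothetical, and real endpoints may expect additional, FCS-specific parameters not shown here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_fcs_search_url(endpoint: str, query: str,
                         start_record: int = 1, max_records: int = 10) -> str:
    """Assemble an SRU 1.2 searchRetrieve URL for a (hypothetical) FCS endpoint."""
    params = {
        "operation": "searchRetrieve",   # standard SRU operation
        "version": "1.2",
        "query": query,                  # CQL query string
        "startRecord": str(start_record),
        "maximumRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint; substitute the address of an actual FCS endpoint.
url = build_fcs_search_url("https://example.org/fcs", "Fackel", max_records=5)
parsed = parse_qs(urlsplit(url).query)
```

Because every endpoint answers the same kind of request, an aggregator can send one query to many centres and merge the results, which is what makes the search "federated".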
Publishing Framework
corpus_shell
modular framework for publishing a wide range of language resources, designed to operate in a distributed and heterogeneous environment
distributed, FCS-based setup
integration with CLARIN metadata infrastructure
(reusing) specialized resource viewers for specific types of resources
multiple implementations (PHP, Perl, XQuery)
cooperation/integration with SADE (Scalable Architecture for Digital Editions, BBAW, Berlin)
open source (code on GitHub)
corpus_shell instances
vicav
ABaC:us
Lexicography suite
Dictionary Server
• Open and freely available software that can be readily
distributed (MySQL, PHP)
• Integrated with corpus_shell (FCS as common protocol)
• Connected to the clients through a REST-style web service
Viennese Lexicographic Editor
The corresponding client
DictGate
• a service for hosting lexicographic data
• for smaller lexicographic projects
Viennese Lexicographic Editor (VLE)
• XML editor specialized for editing lexicographic data
• Generic – support for any (XML) format (LMF, TEI, TBX, RDF)
• Making use of cognate technologies (XSLT, XPath, XSD)
• Various editing modes, configurable keyboard layouts
• Optimised corpus-dictionary interface
• On-the-fly data visualisations
CLARIN Centre Vienna
clarin.oeaw.ac.at
First Austrian node in the network of CLARIN Centres
DSA (Data Seal of Approval) and CLARIN Centre B status, April 2014
Language Resources Portal
Mission: National depositing and publishing service for digital language resources
Tools
corpus_shell, lexicographic suite, …
Infrastructure Services – "Knowledge Hub"
mostly about metadata (under development)
Thank you!
Matej Ďurčo
Austrian Centre for Digital Humanities
Österreichische Akademie der Wissenschaften
matej.durco@oeaw.ac.at