Archiving and Accessing Language Resources
Peter Wittenburg
Max Planck Institute for Psycholinguistics
Nijmegen, The Netherlands

what is the MPI?
• fundamental research at the MPI in mental language processing, language acquisition, and language & cognition
• methods: experiments, signal processing (eye movements, gestures), computer simulations, brain imaging, multimedia observation
• one of the 80 institutes of the Max Planck Society (Germany)
• as member of the central IT committee, pushing eScience ideas

essential questions for us
• which tools can we build to improve linguists' efficiency for the different tasks in their daily work?
• which infrastructures can we build that allow linguists to focus on research work?
• how can we preserve the data about languages and the knowledge about them for future generations?
our "product" is the Language Archiving Technology (LAT)

why care about languages? (thanks to our Ethnologue/SIL colleagues)
• between 6000 and 7000 languages - every two weeks one is dying
• 96 % of languages are spoken by only 3 % of people
• all our languages are getting mixed - lose structure, lose identity, etc.

major workflow change
• speech analysis people have handled speech on computers for a long time, but
• for observational linguistics this was not normal
• recordings were kept on separate carriers, which were transcribed and then put into the cellar
• the annotations were the basis for all further work
• fact: almost no one accessed the carriers anymore
• due to technological innovation things changed completely
• from the 1990s: digitization of material at MPI (gesture, sign)
• from 1995: first multimedia/multimodal annotation tools
• at first much skepticism from users

storage capacity as indicator
[Chart: data volume and data increase in terabytes, 1998-2012; labels: 45 Terabyte, increase 12 TB]
• more important:
  – the service was persistent - people could rely on it
  – immediate availability of digital data
• around 1998/2000 systematic digitization/capturing (also retro)
• currently digitizing a big human-ethological tape archive
• all accessible via web mechanisms
• terabytes are not all
• > 60,000 sessions in the archive
• > 300,000 objects in the archive with complex relations

few major facilitators
• the cost per megabyte in 1977 equals the current cost per terabyte
• relying on a persistent institutional repository
• availability of the ELAN annotation tool in the early 2000s with some advanced features
• "generic" schema; structure and vocabulary are user-definable
• based on extensive world-wide discussion
• availability of the IMDI Metadata Infrastructure with a schema-based set and various tools to structure, manage and find resources

ELAN screenshot
[Screenshot labels: viewers & controls, video player, controls, waveform viewer, crosshair, annotations, tiers, timeline viewer]

IMDI interactive "catalogue"
• of course available to OAI-PMH based service providers

DOBES impulse
• 2000: decision of the Volkswagen Foundation to start the distributed language documentation programme DOBES
• currently 45 teams with about 60 languages operating worldwide
• MPI became the central archive, i.e. long-term preservation and access in focus

preservation problems
• our storage media are problematic (reliability, lifetime)
• UNESCO: 80% of our recordings are highly endangered
• so how to preserve bit-streams?
[Figure: lifetime scale of storage media, 0 to 2000 years; labels: various e-media, clay tablets]
• our "standards" for encoding and structuring come and go
• take video as an obvious example: Cinepak, MPEG1, MPEG2, MPEG4 (H.264), mJPEG2000
• how to maintain interpretability?
• thus:
• continuous change and migration at various levels
• a nightmare for a traditional archivist

preservation at bit-stream level
• 6 copies in centers with a professional LTP strategy, thus regular technology migration
• in 2008: 11 regional archives based on LAT technology
• in 2009: additional regional archives
• are offering synchronization support due to lack of money for archiving
• all based on proper & fair agreements

what at curation level
• difficult ….
• emergence of a universal character encoding standard (Unicode)
• agreement about lossless media encoding schemes (mJPEG2000)
• XML accepted as structuring language for texts
• emergence of generic schemas such as
  – EAF (at MPI for annotated media)
  – Linguistic Annotation Framework (ISO TC37/SC4) ?
  – Lexical Markup Framework (ISO TC37/SC4) ?
• DOBES: everything needs to be formatted according to such open standards
• application of an "immediate transformation policy"; of course: store old formats as well, because transformation is lossy
• only format-coherent and consistent content can be easily accessed and transformed in future
• Beagrie: costs for late curation are at least a factor of 30 higher

but some metadata facts
• statistics on 27,000 MD records
  – language name usage 100%
  – language code usage ~40%
  – content genre usage ~30%
• researchers are not fully committed yet
• no pressure and funding for data curation/preservation aspects
• well - not all our tools are user friendly enough
[Chart: usage in % (axis 0-120) of IMDI metadata elements for written resources (language ID, character encoding, content encoding, size, format, subtype, type, resource link), media files (quality, format, type, size, resource link), actors (description, education, sex, age, birth, ethnic group, family social role, code, full name, name, role, language name, language ID, language description), content (language name, language ID, language description, subject, modalities, task, subgenre, genre, description), communication context (channel, event, social context, involvement, planning type, interactivity), project name, and session (region, address, country, continent, description, recording date, title, name)]

metadata benefits
• benefits now become apparent after a decade
• combination of metadata and content for longitudinal studies, e.g. "use of syntactic forms by children of different age"
• requires a critical mass and high-quality MD
• special portals for DOBES communities where metadata elements can be used dynamically
• very simple in IMDI due to the REST interface
• requires high-quality MD

access and live archives
• attractiveness is important for survival, and of course researchers want a "dynamic archive"
• therefore various ways of accessing and enriching the data
• enriching can mean
  – adding resources or annotation layers
  – uploading new versions (requires a persistent identifier schema)
  – commenting on resources
  – drawing relations between resources in various ways (-> PIDs)
  – etc.

Language Archiving Technology Suite
[Figure labels: Shoebox, CHAT, Transcriber, some XML?, many smart developers, LAT, GIS based access, Multimedia Lexicon, Described Corpus, Photos, Video Clips, Annotated Media]

last development: conceptual spaces
• allow users to create relational domains on top of archive material
• results in completely different views
  – semantic views
  – genealogical view
  – etc.
• much more interesting than a boring catalogue, for example
• also much more inspiring for community people
[Figure labels: acceptance, Documentation, Task]

what has been achieved in MPI/DOBES?
• large archive with equal access to primary and derived material, thus almost theory-neutral re-usage
• neutral and atomic access options
• make data explicit by handing over a copy
• researchers learned to act just as software developers do: don't wait until finished, make versions explicit and usable
• lots of awareness building about formats, needs and benefits of metadata, good tools, etc.
• lots of discussions about rights and ethical aspects
• this is a change in scientific culture and of work paradigms
• some technical advancements

does it cost something?

type                                k€/y                   comment
basic IT infrastructure             80                     4-8 years innovation cycle
copies at large computer centers    <5
system management (1 FTE)           60                     shared for different activities
archive management (1 FTE)          80                     advice, curation, consistency
repository software maintenance     60                     without new functionality
utilization software maintenance    >120                   wide spectrum of tools
total                               405 (225 without sw)

Maintaining a large and complex living archive costs about 400 k€/year (linguistic support, SW development, etc. not calculated).
of course: economy of scale to a certain extent
Digital Dilemma report of the Academy of Motion Picture Arts and Sciences: maintaining a digital master file costs 12 times as much as a film master

is something missing?
• let's address a few simple questions
  – can a researcher do a useful content search on the whole archive?
  – can a researcher easily use a certain lexicon when operating on texts?
  – can a researcher easily align a piece of text with a speech signal?
  – can a researcher combine Trumai data from MPI and AILLA (Austin)?
  – can we easily integrate catalogues?
• Henry Thompson gave part of the answer: Except for XML and Unicode we don't have agreed descriptive systems in the linguistic domain.

CLARIN addressed it this way
• are language resources/tools (LRT) in a good state? Yo
• are LRT visible for researchers? NO
• are LRT accessible by researchers? NOO
• can LRT be combined to virtual collections/workflows? NOOO
• can you execute joint operations on virtual collections? NOOOO
[Figure labels: tools | data]
• data like Christmas trees, but …
• need to get it out of the expert labs and make it available

which resources and tools?
• typical resource types
  – semi-structured texts (newspapers, books, etc.)
  – transcriptions
  – annotated media recordings (sound, video, photos)
  – (annotated) time series data: eye tracking, motion tracking, data glove, etc.
  – lexica (with multimedia extensions)
  – grammar descriptions
  – tree databases (syntax descriptions)
  – concept registries, relation registries, ontologies
  – metadata descriptions
  – schemas, components, profiles
  – etc.
• numerous tools operating on these resources

what is CLARIN?
• stands for Common Language Resource and Technology Infrastructure
• currently a group of 144 of the strongest LRT institutions from almost all European countries
• is meant to build up a persistent infrastructure that helps to overcome the huge fragmentation in the field of LRT and that can give services to all researchers working with LRT
• is meant to start tackling the problems mentioned
• is a fully distributed approach with three layers of responsibility: European - national - institutional
• current funding scheme
  – 3-year EC project called preparatory phase (4.1 Mio €)
  – much funding commitment from an increasing number of countries (D, F, NL, Fi, Dk, Ro, Cz, Sp, etc.; now already at 20 Mio €)
  – intention is to get commitments for many years for a stable RI

which dimensions of work
2 major dimensions
– need to understand the state of LRT and its characteristics (structure, encoding principles, concepts/terminology, etc.)
– need to implement integration and interoperability
is that all?
– need to collaborate/interact with SSH communities
– need to simplify/harmonize IPR/licensing/ethical issues
– need to find an organizational model for operation
– need to do a lot of awareness building, education, training, etc.
– basically the work package structure
[Figure labels: WP5, WP2, MPI, WP3/4, WP7, WP8, WP6]

CLARIN network of stable centres
• need to move from a domain of accidental collaborations to a structured domain of centres with clear responsibilities and commitments for the services they give
• need to convince everyone to make use of centres
• identified different types of centres
• new business models required - close to research

integration level
• researchers are members of national Identity Federations
  – is this true for all in Europe? what about guests etc.?
  – what about the state of harmonization (TERENA, eduGAIN)
  – single identity and sign-on possible
• centres are part of a "Service Provider Federation"
  – what are the requirements (attributes, values, etc.)
[Figure: single identity scenario; labels: eJournal Service Providers, Schema Trust Agreements, national Identity Federations, LRT Service Providers, Trust Agreement]

need persistent identifiers
[Figure: example ePublication (a book abstract on the evolutionary emergence of human culture); labels: ePublication, Repository 1, how long?, eResource, Repository 2]

need persistent identifiers
[Figure labels: ISO Concept Registry, eResource1, Repository 1, how long?, eResource2, Repository 2, Ontology, open registry]

PID - need a highly available system
• need an offer
  – a Handle based system is ready to be launched
• is it specific for CLARIN? NO
  – will offer it to the research community
  – why not DOIs
• need some services
  – fragment addressability
  – authenticity by checksum
  – access permissions should go with the PID
  – little metadata for citation purposes
  – etc.

virtual LRT observatory
• need to create an open marketplace of language resources and technology
• all LRT should be visible, accessible and re-usable
• allow users to build virtual collections
• allow users to build virtual workflows of services
• standards-based rich metadata descriptions are central

VO - joint metadata domain
• what do the astronomers have:
  – clear dimensions (radial, wavelength, etc.) and metrics
• what do we have (or better not have):
  – Henry Thompson: In linguistics there is no agreed descriptive system except Unicode and XML
  – well, come on Henry, we have
    • 8 years of experience with IMDI, OLAC, TEI Headers, etc.
    • 8 years of experience with a single schema
    • the TEI ODD component framework
    • ongoing work on generic schemas such as LMF
    • the ISOcat data category registry
  – Is it all working smoothly yet?
NO

service oriented architecture
• NSF workshop: we still live in a download-first and not in a cyberinfrastructure domain

User-Service Interaction
current way of interaction:
• user interacts with a web-site
• receives intermediate result
• manipulates this result and
• sends it to the next web-site
• etc.

Service-Service Interaction
[Figure labels: User Desk, Algorithm Service, Internet, Service Centre, Web Application, Database, Archive; interfaces described by WSDL]
better way of interaction:
• user interacts with an application
• the application makes use of different services without bothering the user
• user receives the final result
• SOA is not at all simple to achieve, but it is the only architecture that is scalable and flexible enough
• standardization and harmonization are required to realize workflow mechanisms

where are we now?
• do an online inventory of all LRT in Europe - huge participation
  – want to move this towards a hierarchical metadata domain
• carry out centre self-assessment to identify strong institutes and to identify the gaps
  – do a lot of talks - bring grid/federation experts and linguists together
• specified the requirements for a service provider federation
  – PIDs, AAI harmonization, PKI, etc.
• specified the requirements of a component metadata infrastructure
  – start developing - of course the ISOcat data category registry is crucial
• discussed the requirements for web services and workflows
  – working on an inventory of typical processing chains
  – import/export is the big problem - need more standards

we are not alone
• a number of big challenges aside from technology
  – how to convince our providers of best practices
  – how to achieve broad consensus about standards
  – how to get the required funds for the follow-up phases
• we are not alone
  – umbrella allies: APA, DRIVER, e-IRG, etc.
  – grid/federation experts: EGEE, DEISA, TERENA/eduGAIN
  – standards: TEI, ISO TC37, etc.
  – colleagues in SSH: DARIAH, CESSDA, BAMBOO, etc.

will we get there?
Falls nicht to end in Babylonish scenario nous avons still etwas time üm na te think.
(roughly: so as not to end in a Babylonian scenario, we still have some time to think it over)

Thanks for your attention!
www.mpi.nl; www.mpi.nl/dobes; www.clarin.eu
System Managers, Archive Managers, Software Developers
funds from MPG, MPI, NWO, VWS, BMBF, EC

Lexical Markup Framework (ISO TC37/SC4)
[Figure: LMF class diagram with cardinalities (1..1, 0..1, 0..n, 1..n); classes: Lexical DB, Global Info, Lexical Entry, Form, Sense, Morphology, Paradigm, Inflexion; data categories: /orthography/, /variant for/, /identifier/, /lemma/, /POS/, /gender/, /key form/, /number/, /tense/, /person/, /mood/]

ISOcat Model
• ISO 12620 standard
• has administrative, linguistic and descriptive information
• distinguishes between "simple" and "complex" categories
• complex categories have a value domain
• "grammatical gender" has the values masculine, feminine, neuter
• has language sections
• has options for alternative names

ISOcat Model
• user selects appropriate components to create a metadata description
• semantic interoperability partly solved via references to the ISOcat concept registry
[Figure: component registry with components Location (Country, Coordinates), Text (Language, Title), Actor (BirthDate, MotherTongue), Dance (Name, Type), Recording (CreationDate, Type); ISOcat concept registry: Country dcr:1001, Language dcr:1002, BirthDate dcr:1000; DCMI concept registry: Title dc:title]

ISOcat Model
[Figure labels: conversion, resource, NLP/ASR/manual, Process, metadata, journal?, annotation1, metadata1, journal1, annotation2, metadata2, journal2, annotation3, metadata3, journal3, mycollection, repository I..K, registry X..Z, distinguish between]
Problems:
• how does a service know where to store
• how can others find things
• how does a service find all relevant objects and
• how does it find all relevant detailed information

everything fine?
- metadata example
• now cost-effective schemes are possible
• researchers understood that preservation and access can't be handled at reasonable costs

The CHAOS Archive X
• all individuals and teams create independently, but ingest is done in a coordinated manner
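
The component metadata slide (components whose elements refer to entries in the ISOcat or DCMI concept registries) can be illustrated with a small Python sketch. It is only an illustration of the idea, not the CLARIN implementation: the `Element` class, the `shared_concepts` helper, the element name "Land" and the link from MotherTongue to dcr:1002 are assumptions made for this example; only the concept identifiers dcr:1000, dcr:1001, dcr:1002 and dc:title come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One metadata element plus the concept it refers to."""
    name: str     # local element name used in a profile
    concept: str  # concept reference, e.g. "dcr:1001" (ISOcat) or "dc:title" (DCMI)


# Concept references as shown on the slide.
COUNTRY = "dcr:1001"
LANGUAGE = "dcr:1002"
BIRTH_DATE = "dcr:1000"
TITLE = "dc:title"

# Two hypothetical profiles assembled from components.
# "Land" is an invented local name; linking MotherTongue to the
# Language concept is an assumption made for this illustration.
profile_a = [
    Element("Country", COUNTRY),
    Element("Language", LANGUAGE),
    Element("Title", TITLE),
]
profile_b = [
    Element("Land", COUNTRY),
    Element("MotherTongue", LANGUAGE),
    Element("BirthDate", BIRTH_DATE),
]


def shared_concepts(a, b):
    """Map each concept used in both profiles to (name in a, name in b)."""
    names_b = {e.concept: e.name for e in b}
    return {
        e.concept: (e.name, names_b[e.concept])
        for e in a
        if e.concept in names_b
    }


matches = shared_concepts(profile_a, profile_b)
for concept, (name_a, name_b) in sorted(matches.items()):
    print(f"{concept}: {name_a} <-> {name_b}")
```

Elements with different local names are matched because they point to the same concept identifier, which is the sense in which the slide calls semantic interoperability "partly solved": elements without a shared concept reference (here Title and BirthDate) remain unmatched.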