CLARIN Research Infrastructure
Peter Wittenburg, Martin Wynne
Max Planck Institute for Psycholinguistics / Oxford Text Archive

Basic Facts
• 1 Gigabyte (1 GB) = 1000 MB: CD album
• 1 Terabyte (1 TB) = 1000 GB: world yearly book production
• 1 Petabyte (1 PB) = 1000 TB: yearly data production of one LHC experiment
• 1 Exabyte (1 EB) = 1000 PB: world yearly information production
• to a large extent this is language material (texts, audio, video, eye tracking, motion tracking, EEG, fMRI, etc.)
• volume is not the only parameter, complexity counts
• simply take the digital archive at MPI as one example: 45 Terabyte
[Chart: Data Volume and Data Increase of the MPI archive, 1998 to 2012; increase 12 TB]

Basic Questions
• are language resources/tools (LRT) in a good state?
• are LRT visible for researchers?
• are LRT accessible by researchers?
• can LRT be combined to virtual collections/workflows?
• can you execute joint operations on virtual collections?
• answers, in slide order: Yo, NO, NOO, NOOO, NOOOO
• tools | data like Christmas trees, but …
• PS: it's not an NLP infrastructure; there is a place for all kinds of LRT, including minority languages, endangered languages, multimodality studies, child language studies, etc.

Is it true?

Researcher Dream at MPI
• he/she would like to easily align text and speech to better search for interesting acoustic/phonetic phenomena
• example transcript:
  • "and you follow then the sign Kleef"
  • "that's the Oranje Single"
  • "yeah then you follow the sign Kleef"
• there are aligners for the big languages. But can a normal researcher use them? Can they be applied to small languages? The answer is NO.

Why is this so?
• suffer from a huge fragmentation in various dimensions
• researchers create resources/tools, but there is no visibility/persistence component in funding schemes
• awareness of research data as a common treasure is still to come
• of course there are IPR and personality rights (video)
• lack accepted and open dedicated centres to host data/tools
  • the MPI archive is open for deposits of external researchers
• lack integration between these centres (SSO, PIDs, etc.)
• lack structural and semantic interoperability
• lack open interfaces (APIs) and their systematic description
• lack …

Terminology

What is CLARIN?
• wants to overcome these hurdles
• wants to offer services to all researchers interested in LRT
• 2 major dimensions
  • need to understand the state of LRT and its characteristics (structure, encoding principles, concepts/terminology, etc.)
  • need to implement integration and interoperability
• is that all?
  • need to collaborate/interact with SSH communities
  • need to simplify/harmonize IPR/licensing/ethical issues
  • need to find an organizational model for operation
  • need to do a lot of awareness building, education, training, etc.
• basically the work package structure (labels in slide order): WP5, WP2, WP3/4, WP7, WP8, WP6

Basic Character
• CLARIN needs to be an open and distributed infrastructure, i.e.
  • we don't know the actors (LRT contributors, LRT users, Service Centers) and their activities
  • activities will be highly asynchronous
  • some will do exactly the same without knowing
  • some will just try out something
  • some will create serious results to be shared
• it is not a closed project where you can define scopes, formats, vocabularies, processes, etc.
• CLARIN will direct its services to HSS users, i.e.
laymen and in general not power users
• get technology out of the expert labs
• not at all a simple task

Network of Centres
• need to move from a domain of accidental collaborations to a structured domain of centres with clear responsibilities and commitments
• basis of such a domain is visibility and interoperability -> registries, etc.: a virtual observatory of language resources and technology
• need to convince everyone to make use of centres

Centre Types
• various types of centres in CLARIN
• A: infrastructure centres with high availability and persistence (AA infra, PID, centre registry, metadata portals, concept registry, etc.)
• B: resource and technology service providers with a certain commitment for persistent services (texts, lexica, multimedia recordings, parsers, translators, etc.)
• C: metadata service providers without access to the content (enrichment of the visibility of LRT)
• R: centres having resources and tools, but without machine-readable access level
• E: external centres offering services of various types (libraries, national IDFs, national grid centres, TERENA, MPG will offer PID service to research world, etc.)

Pillars of Integration - AAI
• secure server interaction based on TACAR certificates
• researchers are members of national Identity Federations
  • is this true for all in Europe?
  • what about guests etc.?
  • what about the state of harmonization (TERENA, eduGAIN)?
• single identity and single sign-on possible
• centres are part of a Service Provider Federation
  • what are the requirements (attributes, values, etc.)?
[Diagram labels: Trust Agreements Schema; national Identity Federations; Trust Agreement; eJournal Service Providers; LRT Service Providers]

Pillars of Integration - PIDs
[Diagram: an ePublication in Repository 1 (shown as a book abstract) and an eResource in Repository 2, linked by a reference; how long?]

Pillars of Integration - PIDs
[Diagram labels: eResource1 (Repository 1); eResource2 (Repository 2); Ontology (open registry); how long?]

Pillars of Integration - PIDs
• a label in a context associated with a "thing"

Type           Example                                                            Used by
URLs           http://www.mpi.nl/imdi/doc/white-paper                             all
HTTP URIs      http://www.isocat.org/isodcr#12345                                 W3C
URNs           urn:nbn:nl:ui:13-54321                                             EU Libs etc.
Handles        hdl:1839/00-0000-0000-0005-82B0-2                                  many
DOI            Handles + Business Model                                           Publisher
ARKs           http://ark.cdlib.org/ark:/13030/ft4w10060w                         few
XRIs           xri://broadview.library.example.com/(urn:isbn:0-395-36341-1)       ?
PURLs          http://purl.oclc.org/OCLC/PURL/FAQ                                 many
OpenURLs       parameterized http-get requests                                    ?
InfoURIs etc.  integrate legacy material into Web                                 ?
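The identifier examples above differ mainly in their prefix and in where they get resolved. A minimal Python sketch of that idea follows. The function names `pid_scheme` and `handle_to_url` are hypothetical and not part of any CLARIN software, the prefix checks are a rough heuristic, and the sketch assumes the public Handle proxy at hdl.handle.net (a centre could equally run its own resolver).

```python
from urllib.parse import quote

# Assumption: the public Handle System proxy is used for resolution.
HANDLE_PROXY = "https://hdl.handle.net/"


def pid_scheme(pid: str) -> str:
    """Guess the identifier scheme from its prefix (illustrative heuristic only)."""
    lowered = pid.lower()
    if lowered.startswith("hdl:"):
        return "handle"
    if lowered.startswith("urn:"):
        return "urn"
    if lowered.startswith("xri:"):
        return "xri"
    if lowered.startswith(("http://", "https://")):
        # ARKs and PURLs are carried inside ordinary HTTP URLs.
        if "/ark:/" in lowered:
            return "ark"
        if "//purl." in lowered:
            return "purl"
        return "url"
    return "unknown"


def handle_to_url(pid: str) -> str:
    """Turn 'hdl:<prefix>/<suffix>' into a URL on the assumed Handle proxy."""
    if pid_scheme(pid) != "handle":
        raise ValueError(f"not a handle: {pid!r}")
    prefix, _, suffix = pid[len("hdl:"):].partition("/")
    if not prefix or not suffix:
        raise ValueError(f"handle needs the form prefix/suffix: {pid!r}")
    # Percent-encode the suffix so unusual characters survive in a URL.
    return f"{HANDLE_PROXY}{prefix}/{quote(suffix, safe='')}"


if __name__ == "__main__":
    for example in (
        "hdl:1839/00-0000-0000-0005-82B0-2",
        "urn:nbn:nl:ui:13-54321",
        "http://ark.cdlib.org/ark:/13030/ft4w10060w",
    ):
        print(example, "->", pid_scheme(example))
    print(handle_to_url("hdl:1839/00-0000-0000-0005-82B0-2"))
```

The sketch only rewrites strings; whether a given handle actually resolves depends on the resolver and on the repository keeping its registration up to date.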
Pillars of Integration - PIDs

PID type   Standard   Robust Software   Resolution System   Resolution Type   Security Admin   Assoc Info   Costs
URL        RFC2616    ?                 yes (DNS)           single            no               no           no
URN:ISSN   ISO3297    no                no                  ?                 no               no           no
URN:ISBN   ISO2108    no                no                  ?                 no               no           no
URN:NBN    RFC3188    no                no                  ?                 no               no           ?
PURL       no         no                yes                 single            no               no           no
Handle     RFC3650    yes               yes                 multiple          yes              yes          little
DOI        Z39.84…    yes               yes (Handle)        multiple          yes              yes          large
ARK        no         no                (yes)               multiple          (no)             yes          ?
info URI   RFC4452    no                no                  ?                 no               no           no
XRI        no         no                no                  ?                 no               ?            ?

• Handle is the only system operating robustly
• DOI is too expensive for the required granularity (MPI: 30.000 €/y)

Pillars of Integration - PIDs
• need an offer
  • a Handle-based system is ready to be launched
  • is it specific to CLARIN? NO
  • will offer it to the research community
• need some services
  • fragment addressability
  • authenticity by checksum
  • access permissions should go with the PID
  • little metadata for citation purposes etc.
• need independence from CNRI

Open LRT Market Place
• need to create an open market place of language resources and technology
• all LRT should be visible, accessible and re-usable
• allow users to build virtual collections
• allow users to build virtual workflows of services
• standards-based rich metadata descriptions are central

Virtual LRT Observatory
• VO = joint metadata and navigation domain
• what do the astronomers have: clear dimensions (radial, wavelength, etc.) and metrics
• what do we have (or better not have):
  • HT: "In linguistics there is no agreed descriptive system except UNICODE and XML (even no schemas)"
  • well, come on Henry
    • we have 8 years of experience with IMDI, OLAC, TEI Headers, etc. ☺
    • 8 years of experience with a single schema ☺
    • we have the TEI ODD component framework ☺
    • we are working on generic schemas such as LMF ☺
    • we have the ISOcat data category registry ☺
• Is it all working smoothly yet?
NO

Virtual LRT Observatory
• user selects appropriate components to create a metadata description
• component registry (component: elements)
  • Location: Country, Coordinates
  • Text: Language, Title
  • Actor: BirthDate, MotherTongue
  • Dance: Name, Type
  • Recording: CreationDate, Type
• semantic interoperability partly solved via references to concept registries
  • ISOcat concept registry: Country dcr:1001, Language dcr:1002, BirthDate dcr:1000
  • DCMI concept registry: Title dc:title

Service Oriented Architecture
• we still live in a download-first and not in a cyberinfrastructure domain

User-Service Interaction
current way of interaction:
• user interacts with a web site
• receives intermediate result
• manipulates this result and
• sends it to the next web site
• etc.

Service-Service Interaction
[Diagram labels: User Desk; Web Application; Internet; Service Centre; Algorithm Service; Database; Archive; interfaces described by WSDL]
better way of interaction:
• user interacts with an application
• the application makes use of different services without bothering the user
• user receives the final result
• SOA is not at all simple to achieve, but it is the only architecture scalable and flexible enough
• standardization and harmonization are required to realize workflow mechanisms

Basic Annotation Architecture
[Diagram labels: resource; conversion; NLP/ASR/manual process; metadata; journal?; annotation1, metadata1, journal1; annotation2, metadata2, journal2; annotation3, metadata3, journal3; mycollection; repository I..K; registry X..Z]
Problems:
• how does a service know where to store?
• distinguish between
  • how can others find things
  • how does a service find all relevant objects and
  • how does it find all relevant detailed information

Some Points of Interest
• there is a lot to be done
  • our inventories/registrations clearly indicate the size of the task
• a number of big challenges aside from technology
  • how to convince our providers of best practices
  • how to achieve broad consensus about standards
  • how to get the required funds
• we are not alone
  • allies: APA, DRIVER, e-IRG, etc.
  • standards: TEI, ISO TC37, etc.
  • colleagues in SSH: DARIAH, CESSDA, etc.

Will we succeed?
Falls nicht to end in Babylonish scenario nous avons still etwas time üm na te think.
(In plain English: if we are not to end in a Babylonian scenario, we still have some time to think it over.)

Thanks for your attention!