CLARIN Research Infrastructure
Peter Wittenburg, Martin Wynne
Max Planck Institute for Psycholinguistics
Oxford Text Archive
Basic Facts
1 Gigabyte (1 GB) = 1000 MB: a CD album
1 Terabyte (1 TB) = 1000 GB: world yearly book production
1 Petabyte (1 PB) = 1000 TB: one LHC experiment's yearly data production
1 Exabyte (1 EB) = 1000 PB: world yearly information production
to a large extent this is language material
(texts, audio, video, eye tracking, motion tracking, EEG, fMRI, etc)
volume is not the only parameter, complexity counts
simply take the digital archive at MPI as one example
[Figure: Data Volume and Data Increase of the MPI archive in Terabyte (axis 0 to 100), 1998 to 2012; marked values: 45 Terabyte, increase 12 TB]
Basic Questions
are language resources/tools (LRT) in a good state?
are LRT visible for researchers?
are LRT accessible by researchers?
can LRT be combined to virtual collections/workflows?
can you execute joint operations on virtual collections?
answers for both tools and data, one per question: Yo, NO, NOO, NOOO, NOOOO
like Christmas trees but …
PS: it's not an NLP infrastructure - there is place for all kinds of LRT
including minority languages, endangered languages, multimodality studies,
child language studies, etc
Is it true?
Researcher Dream at MPI
He/she would like to easily align text and speech to better search for interesting
acoustic/phonetic phenomena?
and you follow then the sign Kleef that’s the Oranje Single yeah then you follow the sign Kleef
There are aligners for the big languages. But can a normal researcher use
them? Can they be applied for small languages?
The answer is NO.
Why is this so?
suffer from a huge fragmentation in various dimensions
researchers create resources/tools
but no visibility/persistence component in funding schemes
awareness of research data as a common treasure is still to come
of course there are IPR and personality rights (video)
lack accepted and open dedicated centres to host data/tools
MPI archive is open for deposits of external researchers
lack integration between these centres (SSO, PIDs, etc)
lack structural and semantic interoperability
lack open interfaces (APIs) and their systematic description
lack …
Terminology
What is CLARIN?
wants to overcome these hurdles
wants to offer services to all researchers interested in LRT
2 major dimensions
need to understand the state of LRT and its characteristics
(structure, encoding principles, concepts/terminology, etc)
need to implement integration and interoperability
is that all?
need to collaborate/interact with SSH communities
need to simplify/harmonize IPR/licensing/ethical issues
need to find an organizational model for operation
need to do a lot of awareness building, education, training etc
basically the work package structure (labels in the order of the needs listed above): WP5, WP2, WP3/4, WP7, WP8, WP6
Basic Character
CLARIN needs to be an open and distributed infrastructure i.e.
we don't know the actors (LRT contributors, LRT users, Service
Centers) and their activities
activities will be highly asynchronous
some will do exactly the same without knowing
some will just try out something
some will create serious results to be shared
is not a closed project where you can define scopes, formats,
vocabularies, processes etc.
CLARIN will direct its services to HSS users, i.e. laymen and in
general not power users
get technology out of the expert labs
not at all a simple task
Network of Centres
• need to move from a domain of accidental collaborations to a
structured domain of centres with clear responsibilities and commitments
• basis of such a domain is visibility and interoperability -> registries, etc
virtual observatory of language resources and technology
• need to convince everyone to make use of centres
Centre Types
• various types of centers in CLARIN
• A: infrastructure centers with high availability and persistence
(AA infra, PID, center registry, metadata portals, concept registry, etc)
• B: resource and technology service providers with a certain
commitment for persistent services
(texts, lexica, multimedia recordings, parsers, translators, etc)
• C: metadata service providers without access to the content
(enrichment of the visibility of LRT)
• R: centers having resources and tools, but without machine readable
access level
• E: external centers offering services of various types
(libraries, national IDFs, national grid centers, TERENA,
MPG will offer PID service to research world, etc)
Pillars of Integration - AAI
secure server interaction based on TACAR certificates
researchers are members of national Identity Federations
is this true for all in Europe? what about guests etc?
what about state of harmonization (TERENA, EduGain)
single identity and sign on possible
centres are part of a Service Provider Federation
what are the requirements (attributes, values etc)
[Diagram: national Identity Federations, linked by Trust Agreements and a Schema to eJournal Service Providers, and by a Trust Agreement to LRT Service Providers]
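The open question on this slide, which attributes a service provider requires, can be made concrete with a minimal sketch. It assumes a service provider that accepts an assertion only if it comes from a federation it has a trust agreement with and carries all required attributes. The federation names and the classes are hypothetical; the attribute names are borrowed from the eduPerson schema purely as an illustration, not as CLARIN's actual requirements.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Assertion:
    """What an identity federation releases about a signed-in researcher."""
    federation: str              # hypothetical federation identifier
    attributes: Dict[str, str]   # attribute name -> value


@dataclass(frozen=True)
class ServiceProvider:
    """An LRT centre that trusts some federations and needs some attributes."""
    trusted_federations: FrozenSet[str]
    required_attributes: FrozenSet[str]

    def grants_access(self, assertion: Assertion) -> bool:
        # Access needs both a trust agreement and the full attribute set.
        trusted = assertion.federation in self.trusted_federations
        complete = self.required_attributes <= set(assertion.attributes)
        return trusted and complete


# Hypothetical centre: trusts two national federations, needs two attributes.
centre = ServiceProvider(
    trusted_federations=frozenset({"federation-nl", "federation-de"}),
    required_attributes=frozenset(
        {"eduPersonPrincipalName", "eduPersonScopedAffiliation"}
    ),
)

member = Assertion(
    federation="federation-nl",
    attributes={
        "eduPersonPrincipalName": "researcher@example.org",
        "eduPersonScopedAffiliation": "member@example.org",
    },
)
guest = Assertion(federation="federation-xx", attributes={})
```

With these definitions `centre.grants_access(member)` holds and `centre.grants_access(guest)` does not, which is exactly the guest problem raised above.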
Pillars of Integration - PIDs
Biological and cultural processes have evolved
together, in a symbiotic spiral; they are now
indissolubly linked, with human survival unlikely
without such culturally produced aids as clothing,
cooked food, and tools. The twelve original essays
collected in this volume take an evolutionary
perspective on human culture, examining the
emergence of culture in evolution and the underlying
role of brain and cognition. The essay authors, all
internationally prominent researchers in their fields,
draw on the cognitive sciences -- including
linguistics, developmental psychology, and cognition
-- to develop conceptual and methodological tools for
understanding the interaction of culture and genome.
They go beyond the "how" -- the questions of
behavioral mechanisms -- to address the "why" -- the
evolutionary origin of our psychological functioning.
What was the "X-factor," the magic ingredient of
culture -- the element that took humans out of the
general run of mammals and other highly social
organisms?
Several essays identify specific behavioral and
functional factors that could account for human
culture, including the capacity for "mind reading"
that underlies social and cultural learning and the
nature of morality and inhibitions, while others
emphasize multiple partially independent factors -- planning, technology, learning, and language. The X-factor, these essays suggest, is a set of cognitive
adaptations for culture.
[Diagram: an ePublication in Repository 1 refers to an eResource in Repository 2; how long?]
Pillars of Integration - PIDs
[Diagram: eResource1 in Repository 1, eResource2 in Repository 2, and an Ontology in an open registry refer to each other; how long?]
Pillars of Integration - PIDs
A label in a context associated with a "thing"
| Type | Example / note | Used by |
|---|---|---|
| URLs | http:/www.mpi.nl/imdi/doc/white-paper | all |
| HTTP URIs | http://www.isocat.org/isodcr#12345 | W3C |
| URNs | urn:nbn:nl:ui:13-54321 | EU Libs etc |
| Handles | hdl:1839/00-0000-0000-0005-82B0-2 | many |
| DOI | Handles + Business Model | Publisher |
| ARKs | http://ark.cdlib.org/ark:/13030/ft4w10060w | few |
| XRIs | xri://broadview.library.example.com/(urn:isbn:0-395-36341-1) | ? |
| PURLs | http://purl.oclc.org/OCLC/PURL/FAQ | many |
| OpenURLs | parameterized http-get requests | ? |
| InfoURIs | integrate legacy material into Web | ? |
| etc | | |
Pillars of Integration - PIDs
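| PID type | Standard | Robust Software | Resolution System | Resolution Type | Security Admin | Assoc Info | Costs |
|---|---|---|---|---|---|---|---|
| URL | RFC 2616 | ? | yes (DNS) | single | no | no | no |
| URN:ISSN | ISO 3297 | no | no | ? | no | no | no |
| URN:ISBN | ISO 2108 | no | no | ? | no | no | no |
| URN:NBN | RFC 3188 | no | no | ? | no | no | ? |
| PURL | no | no | yes | single | no | no | no |
| Handle | RFC 3650 | yes | yes | multiple | yes | yes | little |
| DOI | Z39.84… | yes | yes (Handle) | multiple | yes | yes | large |
| ARK | no | no | (yes) | multiple | (no) | yes | ? |
| info URI | RFC 4452 | no | no | ? | no | no | no |
| XRI | no | no | no | ? | no | ? | ? |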
Handle is the only system operating robustly
DOI is too expensive for the required granularity (MPI: 30.000 €/y)
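To illustrate how a Handle differs from a plain URL, here is a minimal sketch that splits a `hdl:` identifier into its prefix (the naming authority) and suffix, and builds a web address through the public Handle proxy at hdl.handle.net. The function names are hypothetical, and no network request is made; the sketch only shows the indirection between the persistent label and the current location.

```python
from typing import Tuple
from urllib.parse import quote

# Public HTTP proxy of the Handle System; resolution happens there.
HANDLE_PROXY = "https://hdl.handle.net/"


def parse_handle(pid: str) -> Tuple[str, str]:
    """Split 'hdl:PREFIX/SUFFIX' into (prefix, suffix)."""
    if not pid.startswith("hdl:"):
        raise ValueError(f"not a handle: {pid!r}")
    prefix, sep, suffix = pid[len("hdl:"):].partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"handle needs the form PREFIX/SUFFIX: {pid!r}")
    return prefix, suffix


def handle_to_url(pid: str, proxy: str = HANDLE_PROXY) -> str:
    """Build the proxy URL under which the handle can be resolved."""
    prefix, suffix = parse_handle(pid)
    # Percent-encode anything unusual in the suffix; keep '/' readable.
    return f"{proxy}{prefix}/{quote(suffix, safe='/')}"


# The handle shown on the slide above.
example_url = handle_to_url("hdl:1839/00-0000-0000-0005-82B0-2")
```

The citation keeps the handle, while the repository is free to change the location that the resolver returns.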
Pillars of Integration - PIDs
need an offer
a Handle based system is ready to be launched
is it specific for CLARIN - NO
will offer it to the research community
need some services
fragment addressability
authenticity by checksum
access permissions should go with PID
little metadata for citation purposes
etc
need independence from CNRI
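Two of the services listed above, authenticity by checksum and fragment addressability, can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the function names, the example location, and the use of `#` to address a fragment are hypothetical, not the design of the planned Handle service. SHA-256 is chosen here simply as a common checksum.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PidRecord:
    """Hypothetical information stored alongside a PID."""
    pid: str        # the persistent identifier
    location: str   # where the object currently lives
    sha256: str     # checksum registered at deposit time


def register(pid: str, location: str, content: bytes) -> PidRecord:
    """Create a record that remembers the checksum of the deposited bytes."""
    return PidRecord(pid, location, hashlib.sha256(content).hexdigest())


def is_authentic(record: PidRecord, content: bytes) -> bool:
    """True if the retrieved bytes still match the registered checksum."""
    return hashlib.sha256(content).hexdigest() == record.sha256


def fragment_reference(record: PidRecord, fragment: str) -> str:
    """Address a part of the object (assumed '#fragment' convention)."""
    return f"{record.pid}#{fragment}"


original = b"and you follow then the sign Kleef"
record = register(
    "hdl:1839/00-0000-0000-0005-82B0-2",
    "https://archive.example.org/object/1",  # hypothetical location
    original,
)
```

A user who retrieves the object can call `is_authentic(record, data)` to check that it has not changed since deposit, independent of which repository served it.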
Open LRT Market Place
• need to create an open market place of Language
resources and technology
• all LRT should be visible, accessible and re-usable
• allow users to build virtual collections
• allow users to build virtual workflows of services
• standards based rich metadata descriptions are central
Virtual LRT Observatory
VO = joint metadata and navigation domain
what do the astronomers have:
clear dimensions (radial, wavelength, etc) and metrics
what do we have (or better not have):
HT: In linguistics there is no agreed descriptive system
except UNICODE and XML (not even schemas)
well come on Henry we have
8 years of experience with IMDI, OLAC, TEI Headers etc ☺
8 years of experience with a single schema ☺
we have the TEI ODD component framework ☺
we are working on generic schemas such as LMF ☺
we have the ISOcat data category registry ☺
Is it all working smoothly yet? NO
Virtual LRT Observatory
User selects appropriate components from the component registry to create a metadata description.
Component registry (components and their elements):
Location: Country, Coordinates
Text: Language, Title
Actor: BirthDate, MotherTongue
Dance: Name, Type
Recording: CreationDate, Type
Semantic interoperability is partly solved via references to the ISOcat concept registry.
ISOcat concept registry: Country dcr:1001, Language dcr:1002, BirthDate dcr:1000
DCMI concept registry: Title dc:title
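The component idea can be sketched as a small data structure: a user picks components, each element may point to a concept in a registry, and two differently built descriptions are comparable wherever they share a concept. The component names, elements and concept references below are the ones shown on the slide; the functions and the `Component.Element` naming are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Components and their elements, as shown on the slide.
COMPONENTS: Dict[str, List[str]] = {
    "Location": ["Country", "Coordinates"],
    "Text": ["Language", "Title"],
    "Actor": ["BirthDate", "MotherTongue"],
    "Dance": ["Name", "Type"],
    "Recording": ["CreationDate", "Type"],
}

# Element -> concept reference (ISOcat "dcr:" and DCMI "dc:"), from the slide.
CONCEPT_LINKS: Dict[str, str] = {
    "Country": "dcr:1001",
    "Language": "dcr:1002",
    "BirthDate": "dcr:1000",
    "Title": "dc:title",
}


def build_profile(component_names: List[str]) -> Dict[str, Optional[str]]:
    """Combine the chosen components into one metadata profile.

    Keys are 'Component.Element'; values are the linked concept or None
    when the element has no concept reference yet.
    """
    profile: Dict[str, Optional[str]] = {}
    for name in component_names:
        for element in COMPONENTS[name]:
            profile[f"{name}.{element}"] = CONCEPT_LINKS.get(element)
    return profile


def shared_concepts(a: Dict[str, Optional[str]],
                    b: Dict[str, Optional[str]]) -> Set[str]:
    """Concepts through which two profiles can be searched jointly."""
    concepts_a = {c for c in a.values() if c is not None}
    concepts_b = {c for c in b.values() if c is not None}
    return concepts_a & concepts_b


field_recording = build_profile(["Location", "Text"])
interview = build_profile(["Text", "Actor"])
common = shared_concepts(field_recording, interview)
```

Elements without a concept link, such as `Location.Coordinates` here, stay usable inside their own profile but are invisible to cross-profile search, which is why the slide calls interoperability only "partly solved".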
Service Oriented Architecture
we still live in a download-first domain and not in a cyberinfrastructure domain
User-Service Interaction
current way of interaction:
• user interacts with a web-site
• receives intermediate result
• manipulates this result and
• sends it to the next web-site
• etc
[Diagram: User Desk, Internet, Web Application, Algorithm Service, Service Centre]
Service-Service Interaction
better way of interaction:
• user interacts with an application
• the application makes use of different services without bothering the user
• user receives the final result
[Diagram: User Desk, Service Centres with Algorithm Service, Database, Archive; interfaces described by WSDL]
• SOA is not at all simple to achieve, but it is the only architecture that is scalable and flexible enough
• standardization and harmonization is required to realize workflow mechanisms
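The difference between the two interaction styles can be sketched with plain functions standing in for remote services. In this minimal illustration every "service" takes text and returns text, which is the kind of harmonized interface a workflow needs; the three services and the `run_workflow` helper are hypothetical stand-ins, not CLARIN components, and no web service calls are made.

```python
from typing import Callable, List

# A harmonized interface: every service maps text to text.
Service = Callable[[str], str]


def normalize_service(text: str) -> str:
    """Stand-in service: collapse runs of whitespace."""
    return " ".join(text.split())


def lowercase_service(text: str) -> str:
    """Stand-in service: lowercase the text."""
    return text.lower()


def tokenize_service(text: str) -> str:
    """Stand-in service: put one token on each line."""
    return "\n".join(text.split())


def run_workflow(services: List[Service], data: str) -> str:
    """Service-service interaction: the application chains the services
    and hands only the final result back to the user."""
    for service in services:
        data = service(data)
    return data


workflow = [normalize_service, lowercase_service, tokenize_service]
result = run_workflow(workflow, "  Follow   the sign Kleef ")
```

In the user-service style the researcher would call each step by hand and carry the intermediate results between web-sites; here the chaining only works because all steps agree on one input and output format.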
Basic Annotation Architecture
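[Diagram: a resource with its metadata and journal passes through processes (conversion, NLP/ASR/manual); each process yields an annotation with its own metadata and journal (annotation1..3, metadata1..3, journal1..3); the objects form "mycollection", held in repositories I..K and registries X..Z]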
Problems:
• how does service know where to store distinguish between
• how can others find things
• how does a service find all relevant objects and
• how does it find all relevant detailed information
Some Points of Interest
there is a lot to be done
our inventories/registrations indicate clearly the size of the
task
a number of big challenges aside from technology
how to convince our providers of best practices
how to achieve broad consensus about standards
how to get the required funds
we are not alone
allies: APA, DRIVER, e-IRG, etc
standards: TEI, ISO TC37, etc
colleagues in SSH: DARIAH, CESSDA, etc
Will we succeed?
Falls nicht to end in Babylonish scenario nous avons still etwas
time üm na te think.
Thanks for your attention!