Archiving and Accessing Language Resources Peter Wittenburg Max Planck Institute for Psycholinguistics




Archiving and Accessing Language Resources
Peter Wittenburg
Max Planck Institute for Psycholinguistics
Nijmegen, The Netherlands



what is the MPI?
• MPI does fundamental research on mental language processing,
language acquisition, and language & cognition
• methods: experiments, signal processing (eye movements,
gestures), computer simulations, brain imaging,
multimedia observation
• is one of the 80 institutes of the Max Planck Society (Germany)
• as a member of the central IT committee, pushing eScience ideas



essential questions for us
• which tools can we build to improve linguists' efficiency for the
different tasks in their daily work?
• which infrastructures can we build that allow linguists to focus on
research work?
• how can we preserve the data about languages and the knowledge
about them for future generations?
our "product" is the Language Archiving Technology (LAT)



why care about languages?
(thanks to our Ethnologue/SIL colleagues)
• between 6000 and 7000 languages - one dies every two weeks
• 96 % of languages are spoken by only 3 % of people
• all our languages are getting mixed - losing structure, losing identity, etc



major workflow change
• speech analysis people have long handled speech on computers, but
• for observational linguistics this was not normal
• recordings were made on separate carriers, transcribed, and then
put into the cellar
• the annotations were the basis for all further work
• fact: almost no one accessed the carriers anymore
• due to technological innovation things changed completely
• from the 1990s: digitization of material at MPI (gesture, sign)
• from 1995: first multimedia/multimodal annotation tools
• at first much skepticism from users



storage capacity as indicator
[Chart: "Data Volume" and "Data Increase" in terabytes (axis 0-100) per year, 1998-2012; annotated values: 45 Terabyte, increase 12 TB]
• more important
• service was persistent - people could rely on it - immediate
availability of digital data
• around 1998/2000 systematic digitization/capturing (also retro)
• currently digitizing a big human-ethological tape archive
• all accessible via web mechanisms
• terabytes are not everything
• > 60,000 sessions in the archive
• > 300,000 objects in the archive with complex relations



few major facilitators
• cost per megabyte in 1977 equals the current cost per terabyte
• relying on a persistent institutional repository
• availability of the ELAN annotation tool in the early
2000s with some advanced features
• "generic" schema; structure and vocabulary
user definable
• based on extensive world-wide discussion
• availability of the IMDI Metadata Infrastructure with a schema-based
set and various tools to structure, manage and find resources



ELAN screenshot
[Screenshot labels: viewers & controls, video player, controls, waveform viewer, crosshair, annotations, tiers, timeline viewer]



IMDI interactive "catalogue"
of course available to OAI-PMH based service providers
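
A service provider harvests such a catalogue with plain HTTP requests. The sketch below only builds an OAI-PMH `ListRecords` request URL; the endpoint address is a hypothetical placeholder (the slides give no address), and `oai_dc` is the Dublin Core prefix every OAI-PMH repository has to support.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not give the address of the IMDI provider.
BASE_URL = "https://archive.example.org/oai"


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a harvester."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting of one set of the catalogue.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
```

A harvester would fetch this URL and follow the resumption tokens in the XML response; that part is left out here.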



DOBES impulse
• 2000: decision of the Volkswagen Foundation to start the distributed
language documentation programme DOBES
• currently 45 teams working on about 60 languages worldwide
• MPI became the central archive, i.e. long-term preservation and access
are in focus



preservation problems
• our storage media are problematic (reliability, life time)
• UNESCO: 80% of our recordings are highly endangered
• so how to preserve bit-streams?
[Figure: media lifetime scale (0, 250, 500, 1000, 2000 years), ranging from various e-media to clay tablets]
• our "standards" for encoding and structuring come and go
• take video as an obvious example
Cinepak, MPEG1, MPEG2, MPEG4 (H.264), mJPEG2000
• how to maintain interpretability?
• thus
• continuous change and migration at various levels
• a nightmare for a traditional archivist



preservation at bit-stream level
• 6 copies in centers with a professional LTP strategy,
thus regular technology migration
• in 2008 11 regional archives based on LAT technology
• in 2009 additional regional archives
• are offering synchronization support due to lack of
money for archiving
• all based on proper & fair agreements



what at curation level
• difficult …
• emergence of a universal character encoding standard (Unicode)
• agreement about lossless media encoding schemes (mJPEG2000)
• XML accepted as structuring language for texts
• emergence of generic schemas such as
• EAF (at MPI for annotated media)
• Linguistic Annotation Framework (ISO TC37/SC4) ?
• Lexical Markup Framework (ISO TC37/SC4) ?
• DOBES: everything needs to be formatted according to such open standards
• application of an "immediate transformation policy";
of course, old formats are stored as well because transformation can be lossy
• only content in coherent and consistent formats can be easily accessed
and transformed in future
• Beagrie: costs for late curation are higher by at least a factor of 30
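
To illustrate why an open XML schema such as EAF keeps annotated media interpretable, here is a minimal sketch that reads tiers and time-aligned annotations with the Python standard library. The fragment is a simplified, hand-written EAF-like sample (real files carry a header, linguistic types and more attributes), and `read_tiers` is a hypothetical helper name, not part of any LAT tool.

```python
import xml.etree.ElementTree as ET

# Simplified EAF-like fragment written for this example only.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker1">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_xml: str) -> dict:
    """Return {tier id: [(start_ms, end_ms, text), ...]} for aligned annotations."""
    root = ET.fromstring(eaf_xml)
    # Time slots map symbolic ids to millisecond offsets in the media file.
    slots = {s.get("TIME_SLOT_ID"): int(s.get("TIME_VALUE"))
             for s in root.iter("TIME_SLOT")}
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


if __name__ == "__main__":
    print(read_tiers(EAF_SAMPLE))
```

Because structure and encoding are explicit, a later migration to another schema needs only a transformation of this kind rather than reverse engineering of a tool-specific format.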



but some metadata facts
• statistics on 27,000 MD records
• language name usage 100%
• language code usage ~40%
• content genre usage ~30%
• researchers are not fully committed yet
• no pressure and funding for data curation/preservation aspects
• well - not all our tools are user friendly enough
[Chart: percentage of records (axis 0-120 %) in which each IMDI metadata element is filled in, covering the session, project, content, communication context, actor, media file and written resource elements]
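
Usage figures of this kind can be computed directly from the metadata records. The following is a minimal sketch under stated assumptions: records are plain dictionaries, the field names are hypothetical stand-ins for IMDI elements, and the five toy records are invented for illustration and are not the archive's data.

```python
def field_usage(records, fields):
    """Percentage of records in which each field has a non-empty value."""
    total = len(records)
    usage = {}
    for name in fields:
        filled = sum(1 for r in records if (r.get(name) or "").strip())
        usage[name] = 100.0 * filled / total if total else 0.0
    return usage


# Invented toy records; field names are illustrative only.
TOY_RECORDS = [
    {"language_name": "Trumai", "language_id": "tpy", "genre": "narrative"},
    {"language_name": "Trumai", "language_id": "tpy", "genre": ""},
    {"language_name": "Trumai", "language_id": "", "genre": ""},
    {"language_name": "Trumai", "genre": None},
    {"language_name": "Trumai"},
]

if __name__ == "__main__":
    for name, pct in field_usage(TOY_RECORDS,
                                 ["language_name", "language_id", "genre"]).items():
        print(f"{name}: {pct:.0f} %")
```

Running such a check at ingest time is one way to make gaps visible to depositors before late curation becomes necessary.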



metadata benefits
• benefits are now becoming apparent after a decade
• combination of metadata and content for longitudinal studies, e.g.
"use of syntactic forms by children of different age"
• requires a critical mass and high-quality MD
• special portals for DOBES communities where metadata elements
can be used dynamically
• very simple in IMDI due to REST interface
• requires high-quality MD



access and Live archives
• attractiveness is important for survival and of course researchers
want a "dynamic archive"
• therefore various ways of accessing and enriching the data
• enriching can mean
• adding resources or annotation layers
• uploading new versions (requires a persistent identifier schema)
• commenting on resources
• drawing relations between resources in various ways (-> PIDs)
• etc



Language Archiving Technology Suite
[Diagram labels: Shoebox, CHAT, Transcriber, some XML?, many smart developers, LAT]



GIS based access
[Diagram labels: Multimedia Lexicon, Described Corpus, Photos, Video Clips, Annotated Media]



last development: conceptual spaces
• allow users to create relational domains on top of archive material
• results in completely different views
• semantic views
• genealogical view
• etc
• much more interesting than a boring catalogue for example
• also much more inspiring for community people



acceptance
Documentation
Task



what has been achieved in MPI/DOBES?
• large archive with equal access to primary and derived material
thus almost theory neutral re-usage
• neutral and atomic access options
• make data explicit by handing over a copy
• researchers learned to act just as software developers
don't wait until finished, make versions explicit and usable
• lots of awareness building about formats, needs and benefits of
metadata, good tools, etc
• lots of discussions about rights and ethical aspects
• this is a change in scientific culture and of work paradigms
• some technical advancements



does it cost something?
type                               k€/y   comment
basic IT infrastructure            80     4-8 years innovation cycle
copies at large computer centers   <5
system management (1 FTE)          60     shared for different activities
archive management (1 FTE)         80     advice, curation, consistency
repository software maintenance    60     without new functionality
utilization software maintenance   >120   wide spectrum of tools
total                              405    (225 without sw)
Maintaining a large and complex living archive costs about 400 k€/year
(linguistic support, SW development, etc. not calculated).
of course: economy of scale to a certain extent
Digital Dilemma report of the Academy of Motion Picture Arts and Sciences:
maintaining a digital master file costs 12 times as much as a film master
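
The totals in the cost table can be checked with a few lines of arithmetic. This sketch takes the bounded entries at face value ("<5" as 5 and ">120" as 120), which is an assumption made only so that the sums are reproducible.

```python
# Yearly costs in k€ as listed in the table.
COSTS_KEUR_PER_YEAR = {
    "basic IT infrastructure": 80,
    "copies at large computer centers": 5,      # table says "<5"; upper bound used
    "system management (1 FTE)": 60,
    "archive management (1 FTE)": 80,
    "repository software maintenance": 60,
    "utilization software maintenance": 120,    # table says ">120"; lower bound used
}

# The two software maintenance rows that "without sw" leaves out.
SOFTWARE_ITEMS = {
    "repository software maintenance",
    "utilization software maintenance",
}

total = sum(COSTS_KEUR_PER_YEAR.values())
without_sw = sum(cost for item, cost in COSTS_KEUR_PER_YEAR.items()
                 if item not in SOFTWARE_ITEMS)

if __name__ == "__main__":
    print(f"total: {total} k€/y, without software: {without_sw} k€/y")
```

Both sums agree with the table: 405 k€/y in total and 225 k€/y without the two software maintenance rows.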



is something missing?
• let's address a few simple questions
• can a researcher do a useful content search on the whole archive?
• can a researcher easily use a certain lexicon when operating on texts?
• can a researcher easily align a piece of text with a speech signal?
• can a researcher combine Trumai data from MPI and AILLA (Austin)?
• can we easily integrate catalogues?
• Henry Thompson gave part of the answer:
Except for XML and UNICODE we don’t have agreed descriptive
systems in the linguistic domain.



CLARIN addressed it this way
• are language resources/tools (LRT) in a good state? Yo
• are LRT visible for researchers? NO
• are LRT accessible by researchers? NOO
• can LRT be combined to virtual collections/workflows? NOOO
• can you execute joint operations on a virtual collection? NOOOO
[Diagram labels: tools | data; "like Christmas trees but …"]
• need to get it out of the expert labs and make it available



which resources and tools?
• typical resources types
• semi-structured texts (newspapers, books, etc)
• transcriptions
• annotated media recordings (sound, video, photos)
• (annotated) time series data
eye tracking, motion tracking, data glove etc
• lexica (with multimedia extensions)
• grammar descriptions
• tree databases (syntax descriptions)
• concept registries, relation registries, ontologies
• metadata descriptions
• schemas, components, profiles
• etc
• numerous tools operating on these resources



what is CLARIN?
• stands for Common Language Resources and Technology Infrastructure
• currently a group of 144 of the strongest LRT institutions from almost all
European countries
• is meant to build up a persistent infrastructure that helps to overcome the
huge fragmentation in the field of LRT and that can offer services to all
researchers working with LRT
• is meant to start tackling the problems mentioned
• is a fully distributed approach with three layers of responsibility:
European - national - institutional
• current funding scheme
• 3-year EC project called the preparatory phase (4.1 million €)
• substantial funding commitments from an increasing number of countries
(D, F, NL, Fi, Dk, Ro, Cz, Sp, etc; now already at 20 million €)
• intention is to get commitments for many years for a stable RI



which dimensions of work
• 2 major dimensions
– need to understand the state of LRT and its characteristics
(structure, encoding principles, concepts/terminology, etc)
– need to implement integration and interoperability
• is that all?
– need to collaborate/interact with SSH communities
– need to simplify/harmonize IPR/licensing/ethical issues
– need to find an organizational model for operation
– need to do a lot of awareness building, education, training etc
• basically the work package structure
[Diagram labels: WP2, WP3/4, WP5, WP6, WP7, WP8, MPI]



CLARIN network of stable centres
• need to move from a domain of accidental collaborations to a
structured domain of centres with clear responsibilities and commitments
for the services they give
• need to convince everyone to make use of centres
• identified different types of centres
• new business models required - close to research



integration level
• researchers are members of national Identity Federations
– is this true for all in Europe? what about guests etc?
– what about the state of harmonization (TERENA, eduGAIN)?
– single identity and sign-on possible
• centres are part of a "Service Provider Federation"
– what are the requirements (attributes, values etc)?
[Diagram labels: eJournal Service Providers, LRT Service Providers, national Identity Federations, Schema, Trust Agreements]



single identity scenario



need persistent identifiers
[Figure: an ePublication in Repository 1 (shown as a book abstract) and an eResource in Repository 2; question: how long?]



need persistent identifiers
[Figure labels: ISO Concept Registry, eResource1 (Repository 1), eResource2 (Repository 2), Ontology (open registry); question: how long?]



PID - need a highly available system
• need an offer
– a Handle based system is ready to be launched
• is it specific for CLARIN - NO
– will offer it to the research community
– why not DOIs
• need some services
– fragment addressability
– authenticity by checksum
– access permissions should go with PID
– little metadata for citation purposes
– etc
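
Two of the services listed above, authenticity by checksum and fragment addressability, can be sketched as follows. This is an illustration only: the `PidRecord` layout, the example handle, and the use of `#` for fragments are assumptions, not the design of the Handle based system mentioned in the slides. The only external fact used is the public Handle proxy at hdl.handle.net.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

HANDLE_PROXY = "https://hdl.handle.net/"  # public Handle System proxy


@dataclass(frozen=True)
class PidRecord:
    """Hypothetical minimal record stored with a persistent identifier."""
    handle: str      # e.g. "12345/example-1" (invented prefix and suffix)
    location: str    # current URL of the object; may change over time
    sha256: str      # checksum registered when the object was archived


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def resolver_url(record: PidRecord, fragment: Optional[str] = None) -> str:
    """URL a citation would use; the '#fragment' convention is an assumption."""
    url = HANDLE_PROXY + record.handle
    return f"{url}#{fragment}" if fragment else url


def is_authentic(record: PidRecord, data: bytes) -> bool:
    """True if the retrieved bytes still match the registered checksum."""
    return checksum(data) == record.sha256


if __name__ == "__main__":
    payload = b"annotated media file contents"
    rec = PidRecord("12345/example-1",
                    "https://repo.example.org/objects/1",
                    checksum(payload))
    print(resolver_url(rec, "t=10,20"), is_authentic(rec, payload))
```

The point of the indirection is that `location` can be updated when a repository moves, while the handle in a publication stays the same.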



virtual LRT observatory
• need to create an open market place of Language
resources and technology
• all LRT should be visible, accessible and re-usable
• allow users to build virtual collections
• allow users to build virtual workflows of services
• standards based rich metadata descriptions are central



VO - joint metadata domain
• what do the astronomers have:
– clear dimensions (radial, wavelength, etc) and metrics
• what do we have (or better not have):
– Henry Thompson: In linguistics there is no agreed descriptive
system except UNICODE and XML
– well come on Henry we have
• 8 years of experience with IMDI, OLAC, TEI Headers etc
• 8 years of experience with a single schema
• we have the TEI ODD component framework
• we are working on generic schemas such as LMF
• we have the ISOcat data category registry
– Is it all working smoothly yet? NO



service oriented architecture
NSF workshop: we still live in a download-first domain and not in a cyberinfrastructure domain
User-Service Interaction
current way of interaction:
• user interacts with a web-site
• receives intermediate result
• manipulates this result and
• sends it to the next web-site
• etc
Service-Service Interaction
better way of interaction:
• user interacts with an application
• the application makes use of
different services without bothering
the user
• user receives the final result
[Diagram labels: User Desk, Web Application, Internet, Service Centre, Algorithm Service, Database, Archive; interfaces described by WSDL]
• SOA is not at all simple to achieve, but it is the only architecture that is scalable and flexible enough
• standardization and harmonization are required to realize workflow mechanisms
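
The difference between the two interaction styles can be sketched in a few lines. Here the three "services" are local stand-in functions with invented names; in a real SOA each would be a remote call behind a described interface. What matters is that the application, not the user, passes intermediate results along.

```python
from typing import Callable, Iterable

# A service takes text and returns text; agreeing on one exchange format is
# the kind of harmonization that workflows depend on.
Service = Callable[[str], str]


def normalise(text: str) -> str:
    """Stand-in service: collapse runs of whitespace."""
    return " ".join(text.split())


def lowercase(text: str) -> str:
    """Stand-in service: case folding."""
    return text.lower()


def tokenise(text: str) -> str:
    """Stand-in service: one token per line."""
    return "\n".join(text.split(" "))


def run_workflow(services: Iterable[Service], payload: str) -> str:
    """Service-service interaction: each result is fed to the next service."""
    for service in services:
        payload = service(payload)
    return payload


if __name__ == "__main__":
    # The user sees only the final result.
    print(run_workflow([normalise, lowercase, tokenise], "  Hello   World "))
```

Chaining only works because every stand-in accepts what the previous one emits, which is exactly where the import/export problem shows up with real tools.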



where are we now?
• do an online inventory of all LRT in Europe - huge participation
– want to move this towards hierarchical metadata domain
• carry out centre self-assessment to identify strong institutes and to
identify the gaps
– do a lot of talks - bring grid/federation experts and linguists together
• specified the requirements for a service provider federation
– PIDs, AAI harmonization, PKI, etc
• specified the requirements of a component metadata infrastructure
– start developing - of course ISOcat data category registry is crucial
• discussed the requirements for web services and workflows
– working on an inventory of typical processing chains
– import/export is the big problem - need more standards



we are not alone 
• a number of big challenges aside from technology
– how to convince our providers of best practices
– how to achieve broad consensus about standards
– how to get the required funds for the follow-up phases
• we are not alone
– umbrella allies: APA, DRIVER, e-IRG, etc
– grid/federation experts: EGEE, DEISA, TERENA/eduGAIN
– standards: TEI, ISO TC37, etc
– colleagues in SSH: DARIAH, CESSDA, BAMBOO, etc



will we get there?
Falls nicht to end in Babylonish scenario nous avons still etwas
time üm na te think.
Thanks for your attention!
www.mpi.nl; www.mpi.nl/dobes; www.clarin.eu
System Managers, Archive Managers, Software Developers
funds from MPG, MPI, NWO, VWS, BMBF, EC



Lexical Markup Framework (ISO TC37/SC4)
[UML class diagram: Lexical DB, Global Info, Lexical Entry, Form, Sense, Morphology, Paradigm, Inflexion, linked with multiplicities 1..1, 0..1, 0..n and 1..n; data categories shown: /identifier/, /lemma/, /POS/, /gender/, /key form/, /orthography/, /variant for/, /number/, /tense/, /person/, /mood/]
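
A rough impression of such a lexicon model in code, as a sketch only: the class names follow the diagram, but which data categories belong to which class, the multiplicities enforced here, and the `definition` field on `Sense` are assumptions, since the flattened diagram does not show them unambiguously.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Form:
    orthography: str
    variant_for: Optional[str] = None


@dataclass
class Sense:
    definition: str  # hypothetical field; the diagram lists none for Sense


@dataclass
class LexicalEntry:
    identifier: str
    lemma: str
    pos: str
    forms: List[Form]                                  # assumed 1..n
    senses: List[Sense] = field(default_factory=list)  # assumed 0..n
    gender: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.forms:
            raise ValueError("a lexical entry needs at least one form")


@dataclass
class LexicalDB:
    global_info: dict                                          # assumed 1..1
    entries: List[LexicalEntry] = field(default_factory=list)  # assumed 0..n


if __name__ == "__main__":
    entry = LexicalEntry("e1", "Haus", "noun", [Form("Haus")],
                         [Sense("house")], gender="neuter")
    db = LexicalDB({"language": "German"}, [entry])
    print(db.entries[0].lemma, len(db.entries[0].forms))
```

The benefit of a generic model of this kind is that tools can rely on the skeleton while each lexicon chooses its own data categories.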



ISOcat Model
• ISO 12620 standard
• has administrative, linguistic and descriptive information
• distinction between "simple" and "complex categories"
• complex categories have a value domain
• "grammatical gender" has the values masculine, feminine, neuter
• has language sections
• has options for alternative names



ISOcat Model
User selects appropriate components to create a metadata description
Component registry
• Location: Country, Coordinates
• Text: Language, Title
• Actor: BirthDate, MotherTongue
• Dance: Name, Type
• Recording: CreationDate, Type
Semantic interoperability partly solved via references to ISOcat concept registry
ISOcat concept registry
• Country dcr:1001
• Language dcr:1002
• BirthDate dcr:1000
DCMI concept registry
• Title: dc:title
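
How concept references give partial semantic interoperability can be shown with a small sketch. The identifiers dcr:1001, dcr:1002, dcr:1000 and dc:title are the ones on the slide; the second profile with its differently named elements and the `align` helper are hypothetical, added only to show the matching step.

```python
# Profile A: component elements with the concept references from the slide.
PROFILE_A = {
    "Location.Country": "dcr:1001",
    "Text.Language": "dcr:1002",
    "Actor.BirthDate": "dcr:1000",
    "Text.Title": "dc:title",
}

# Profile B: a hypothetical second profile using other element names.
PROFILE_B = {
    "land": "dcr:1001",
    "sprache": "dcr:1002",
    "titel": "dc:title",
}


def align(profile_a: dict, profile_b: dict) -> dict:
    """Map elements of A to elements of B that reference the same concept."""
    by_concept = {concept: name for name, concept in profile_b.items()}
    return {name: by_concept[concept]
            for name, concept in profile_a.items()
            if concept in by_concept}


if __name__ == "__main__":
    for a_name, b_name in align(PROFILE_A, PROFILE_B).items():
        print(f"{a_name} <-> {b_name}")
```

Elements without a shared reference, such as Actor.BirthDate here, stay unmatched, which is why the slide speaks of interoperability being only partly solved.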



ISOcat Model
[Diagram labels: resource; conversion; NLP/ASR/manual Process; metadata; journal?; annotation1-3, metadata1-3, journal1-3; mycollection; repository I..K; registry X..Z]
Problems:
• how does service know where to store distinguish between
• how can others find things
• how does a service find all relevant objects and
• how does it find all relevant detailed information



everything fine? - metadata example
now cost effective schemes are possible
researchers understood that preservation and access can’t be handled at reasonable costs
[Diagram labels: The CHAOS, Archive X]
all individuals and teams creating independently,
but ingest is done in a coordinated manner