Language Resources and Technology for Humanities Research

Michael Rosner/Vanessa Camilleri
Dept. of Artificial Intelligence, University of Malta
mike.rosner,vanessa.camilleri@um.edu.mt
Acknowledgement
Steven Krauwer
CLARIN Coordinator
Utrecht institute of Linguistics UiL-OTS (NL)
M Rosner & V Camilleri
January 2009
Language Resources and
Technologies for HSS,
2
Outline
• Essential background
• Example
• Overview of CLARIN
• Call for Proposals for Collaboration with HSS projects
Essential Concepts
• Language Resources and Technology (LRT)
– Language Resources
– Language Technology
Language Resources
The term language resource subsumes a whole range of linguistic data types, including
– text corpora
– speech corpora
– multimodal corpora
– annotated corpora
– lexica
– digitised manuscripts
– typological databases
– rules (syntax/morphology)
– treebanks
– ontologies, schemas
Language Technology
• The term language technology covers a wide range of processing and annotation components
– Tokenisers
– Part-of-speech taggers
– Parsers
– Named entity recognisers
– Semantic annotation (automated)
– Manual annotation tools
– Speech to text and text to speech
– Speech alignment tools, etc.
– Multilingual
– Etc.
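To make the first component in this list concrete, here is a minimal sketch of a tokeniser in Python. It is an illustration only, not a CLARIN tool: the function name `tokenise` and the regular expression are assumptions chosen for this example, and real tokenisers handle many more cases (abbreviations, URLs, language-specific rules).

```python
import re

# Hypothetical pattern for illustration:
#   1. numbers, optionally with internal '.' or ',' (e.g. 52,000)
#   2. words, optionally with internal hyphens or apostrophes
#   3. any other single non-space character (punctuation)
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:[-'’]\w+)*|[^\w\s]")


def tokenise(text: str) -> list[str]:
    """Split running text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenise("The Foundation's archive holds 52,000 testimonies."))
```

The output of such a step is what later components (part-of-speech taggers, parsers, named entity recognisers) would typically take as input.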
Archive Example
Shoah Visual History Foundation
Established in 1994 by Steven Spielberg to
collect the testimonies of survivors and
other eyewitnesses to the Holocaust
Mission statement:
To overcome prejudice, intolerance, and bigotry - and the
suffering they cause - through the educational use of the
Foundation’s visual history testimonies.
http://www.vhf.org
The Shoah Archive
– 52,000 testimonies from Jewish survivors,
Jehovah’s Witnesses, Roma and Sinti,
homosexuals, political prisoners, rescuers, and
liberators of concentration camps
– 32 languages including English, Russian, Hebrew,
French, German, Dutch, Hungarian, Italian
– 1-18 hours in length, average 2.5 hours
– 117,000 hours of video
Technology
• 180 terabyte archive located at the Shoah
Foundation
• Robot to load the tapes
• Requires Internet2 connection
• Requires 1 terabyte of local cache
SHOAH Issues
• Nature of Technology Platform
• How to Integrate Resources into Curriculum
• Usability – design of instructional strategies with
digital video
• Tools for Instructors and Students
• Intellectual Property - privacy and security
concerns
• Impact on Support - Management of delivery
technologies
The Problem in General
• Much data in digital archives is language based
• Only known to insiders
• Archives mostly unconnected
• Every archive has its own standards for storage and access
• Normally only simple retrieval of files (text, audio or video documents)
• Social sciences and humanities researchers are not language or speech technologists
• They are often not aware of the potential benefits of using language and speech technology
• Available tools are hard to use for non-specialists
The CLARIN Mission
What:
• Create an infrastructure that makes language resources and technology (LRT) available to scholars of all disciplines, especially social sciences and humanities (SSH)
How:
• Unite existing digital archives into a federation of archives with unified web access
• Provide language and speech technology tools as web services operating on language data in archives
Why a European infrastructure?
• too much fragmentation
• lack of coordination across countries
• lack of visibility
• lack of interoperability
• lack of sustainability
• expertise exists but not in all countries
• language independent tools can be shared
• language dependent tools can often be ported
• most countries not able to bear the cost
Why now?
• Exponential growth of digital data
• Increasing maturity of language and speech
technology:
– high speed
– large volumes
– new research questions
• Growing EU interest in Research Infrastructures
• CLARIN is one of 35 accepted RI proposals
• Receives 3 years of funding for the preparatory phase
Why an infrastructure for SSH?
• Many infrastructures address risks and threats:
– Environment
– Energy
– Climate change
– Health
• CLARIN addresses Social Climate Change as caused or reflected by e.g.:
– Mobility
– Minorities
– Language diversity
– Cultures in contact
Who else do we need?
• The CLARIN consortium now has 32 partners from 22 EU and associated countries
• BUT the membership of our consortium is quite unbalanced:
– Speech & multimodality under-represented
– Humanities other than linguistics under-represented
– Social sciences under-represented
– Some countries still missing
• There is no money to extend the consortium
but we have to fill these gaps
Overall plan for CLARIN
2008-2010: Preparatory phase:
– Put everything in place
2011-2015: Construction phase:
– Build and populate with tools and resources
2016 onwards: Exploitation phase:
– CLARIN in full service
Budget:
– Prep phase: 4.1 M€ from EC, ??? from countries
– Overall budget until 2020: ca 200 M€
4-dimensional approach in the preparatory phase
First 3 years dedicated to the design of
• The technical dimension
• The user dimension
• The language dimension
• The governance and legal dimension
Technical
• Technical specification of the infrastructure
• Construction of a prototype
• Validation on a rich variety of
– languages (>20)
– resources
– services
• Federation of existing archives
• Based on existing resources, tools
• Strong focus on interoperability standards
• Conversion of existing resources
• Encapsulation of existing tools
Languages
• Cover all languages spoken or studied in participating countries
• Representational and descriptive standards should be adequate and validated for all languages
• Same minimal coverage of basic resources and tools for all languages
• BLARK (Basic Language Resource Kit) to be defined and implemented (funds from other sources needed)
Language activities
• Survey of resources and tools, including:
– encoding and annotation data
– quality indicators
• taxonomies and ontologies
• agreeing on common standards
Focus on
• integration of tools
• interoperability
• usage scenarios
• creating missing essential resources
• validating specifications and prototype
User
• Users are SSH scholars (including linguists, translation experts)
• Do WE know what they need?
• Do THEY know what they need?
• Actions:
– analyze past and ongoing SSH projects
– user consultation
– launch typical example projects to show potential
– expertise centers
– awareness actions
Legal
IPR issues
• aim at open source, but IPR for existing and future non-open resources must be accommodated
• federation of archives requires authentication, authorization and trust between archives
• aim at a limited number of template license agreements for the most common cases
• respect national legislation
• address ethical issues
What CLARIN is NOT about
• building the infrastructure – we are just
preparing it
• creating new resources – at this stage we
want to use what is there and adapt it if
necessary
• creating applications – except maybe some
demonstrators
• focusing on the big languages – we find all
languages equally important
• strengthening European industry – our target
audience are SSH researchers, but we don’t
want to exclude anyone
Work Packages
• WP1: Management and coordination
• WP2: Designing the infrastructure and building
the prototype
• WP3: Humanities overview
• WP5: Language resources and technology
overview
• WP6: Dissemination
• WP7: IPR and business models
• WP8: Construction and exploitation agreement
How we work (2)
[Diagram: work packages WP8 (Org & Legal Framework), WP7 (IPR, A&A, licensing), WP2 (Infrastructure Prototype), WP5 (LRT Exploration) and WP3 (Humanities Projects), connected by links numbered 1-8]
Tasks
• Build national community
• Support participation in WGs by others
than partners
• Validation tasks for own languages
• Creation or adaptation of essential
resources
• Pilots and demonstrators
• Humanities projects
Call for Proposals
• Call for Proposals for Collaborating with
Humanities and Social Science Projects
• Pre-proposals are invited for Humanities
research projects that would benefit from
access to LRT
• Research institutions or consortia with
funding but little or no access to LRT or
related expertise are targeted in this call
Example 1
• A literature project wants to study censorship in
translation.
• It has access to uncensored and censored
translations of novels.
• To support the analysis, the project may benefit
from producing a searchable parallel corpus
where different versions of each sentence are
aligned.
• CLARIN participation could involve access to a
corpus alignment tool and transfer of skills in
using the tool.
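As a rough illustration of what aligning "different versions of each sentence" can mean, the sketch below pairs sentences from two versions of a text using Python's standard `difflib` module. It is not the corpus alignment tool referred to above: the function name `align_versions` and the sample sentences are invented for this example, and it assumes both versions are in the same language and have already been split into sentences.

```python
import difflib


def align_versions(uncensored, censored):
    """Pair sentences of two versions of a text.

    Returns a list of (uncensored_sentence, censored_sentence) rows.
    None marks a sentence that has no counterpart in the other version.
    """
    matcher = difflib.SequenceMatcher(None, uncensored, censored, autojunk=False)
    rows = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = uncensored[i1:i2]   # slices are copies, safe to pad
        right = censored[j1:j2]
        # Pad the shorter side so every sentence lands in exactly one row.
        width = max(len(left), len(right))
        left += [None] * (width - len(left))
        right += [None] * (width - len(right))
        rows.extend(zip(left, right))
    return rows


if __name__ == "__main__":
    original = ["He left the city.", "The regime had banned his books.", "She waited."]
    cut = ["He left the city.", "She waited."]
    for row in align_versions(original, cut):
        print(row)
```

Identical sentences are paired, and a sentence present in only one version is paired with `None`, which makes omissions easy to search for. Reworded sentences are paired naively by position within each changed block; dedicated alignment tools use sentence length and lexical evidence instead.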
Example 2
• A history project wanting to study cultural
attitudes in Medieval Northern Europe
wants to search through runic inscriptions.
• CLARIN participation might assist in
– locating existing digitized corpora of runes in
different countries and
– providing assistance for converting the
different materials to a common encoding.
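The conversion step can be pictured with a small sketch. It assumes, purely for illustration, that two archives deliver transliterated inscription text in different character encodings (Latin-1 and UTF-16 here) and that the common target is UTF-8 with Unicode NFC normalisation. The function name `to_common_encoding` and the sample word are hypothetical; real corpora would also need agreement on transliteration and annotation conventions.

```python
import unicodedata


def to_common_encoding(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode as NFC UTF-8."""
    text = raw.decode(source_encoding)
    # NFC composes base letters and combining marks, so the same word
    # gets the same code points regardless of how a source stored it.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample: the same word as two archives might store it.
    from_archive_a = "þórr".encode("latin-1")
    from_archive_b = "þo\u0301rr".encode("utf-16")  # 'o' + combining acute accent
    a = to_common_encoding(from_archive_a, "latin-1")
    b = to_common_encoding(from_archive_b, "utf-16")
    print(a == b)
```

Once both sources share one encoding and one normalisation form, a plain string search finds the word in either archive.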
Who is Targeted
The following target groups are especially addressed:
• Groups of individual researchers with basic institutional funding.
• Early stage researchers in funded PhD positions, with their supervisors.
• Research groups or consortia in an advanced pre-proposal stage with prospects of external funding.
• Research groups or consortia that have secured external funding.
Benefits of Collaboration
• CLARIN will provide consultancy and technical
support to selected projects that are otherwise
financed but lack the necessary resources and
expertise to enhance their activities with LRT
• The contribution of CLARIN to selected
projects will therefore consist of providing
guidance and access to LRT.
• This will involve advice on standards and the
technologies to adopt for the particular
objectives of selected projects.
Example Benefits
• Access to digital language resources and tools for
– Management and exploration of corpora
– Extraction of terms, multi-word units, names
– Speech processing (STT/TTS)
• Assistance with
– Format conversion
– Creation of a methodologically sound workflow with data, tools and modelling approaches for innovative research and development
– Inclusion of LRT into research plans
• Consultancy on methodology and the use of resources and tools.
• Training in the use of specific tools and methods
• Dissemination of the results and outcomes using various CLARIN
dissemination channels.
Collaboration example 1
A project unable to acquire and utilize a
text alignment tool may be given access to
relevant software and may receive expert
advice and training regarding its effective
use from a CLARIN partner institution.
Selection Procedure
• Proposals will be selected in a fast two-step procedure.
• Step 1: (optional)
– short pre-proposals (6 pages) are submitted.
– CLARIN will perform an eligibility check and will provide feedback to proposers.
• Step 2: full proposals will be invited.
• Full proposals will be reviewed non-anonymously by
three experts including the national representative for the
proposal coordinator's country.
• Proposals will be judged according to the following
criteria:
Criteria
• LRT needs / use of LRT towards research goals.
• Capacity of CLARIN to provide needed
LRT/expertise.
• Relevance for testing the CLARIN infrastructure.
• Adherence to CLARIN standards and best
practice.
• Capability to demonstrate the potential of the
CLARIN infrastructure to HSS projects.
• Multilinguality / cross-boundary dimension
• National and European needs and priorities (to
the extent these are formulated).
Collaboration example 2
For a project in need of a database of
runic inscriptions, a CLARIN partner
institution might negotiate access to
existing databases and, if necessary,
assist in converting them to a common
encoding standard to make them
searchable.
Important Dates
• February 15, 2009 (noon UT): deadline for
pre-proposal submission
• March 7, 2009: feedback is provided to
proposers
• April 1, 2009 (noon UT): deadline for full
proposal submission
• April 21, 2009: final decision
More info
• CLARIN Website: http://www.clarin.eu
• CLARIN Office: clarin@clarin.eu
• CLARIN Newsletter:
http://www.clarin.eu/newsletter
• CLARIN Members:
http://www.clarin.eu/members