Language Resources and Technology for Humanities Research
Michael Rosner/Vanessa Camilleri
Dept Artificial Intelligence, University of Malta
mike.rosner,vanessa.camilleri@um.edu.mt

Acknowledgement
Steven Krauwer
CLARIN Coordinator
Utrecht Institute of Linguistics UiL-OTS (NL)

M Rosner & V Camilleri, January 2009, Language Resources and Technologies for HSS

Outline
• Essential background
• Example
• Overview of CLARIN
• Call for Proposals for Collaboration with HSS projects

Essential Concepts
• Language Resources and Technology (LRT)
  – Language Resources
  – Language Technology

Language Resources
The term language resource subsumes a whole range of linguistic data types, including:
  – text corpora
  – speech corpora
  – multimodal corpora
  – annotated corpora
  – lexica
  – digitised manuscripts
  – typological databases
  – rules (syntax/morphology)
  – treebanks
  – ontologies, schemas

Language Technology
• The term language technology covers a wide range of processing and annotation components:
  – Tokenisers
  – Part-of-speech taggers
  – Parsers
  – Named entity recognisers
  – Semantic annotation (automated)
  – Manual annotation tools
  – Speech to text and text to speech
  – Speech alignment tools, etc.
  – Multilingual
  – Etc.

Archive Example
Shoah Visual History Foundation
Established in 1994 by Steven Spielberg to collect the testimonies of survivors and other eyewitnesses to the Holocaust.
Mission statement: To overcome prejudice, intolerance, and bigotry - and the suffering they cause - through the educational use of the Foundation’s visual history testimonies.
http://www.vhf.org

The Shoah Archive
  – 52,000 testimonies from Jewish survivors, Jehovah’s Witnesses, Roma and Sinti, homosexuals, political prisoners, rescuers, and liberators of concentration camps
  – 32 languages including English, Russian, Hebrew, French, German, Dutch, Hungarian, Italian
  – 1-18 hours in length, average 2.5 hours
  – 117,000 hours of video

Technology
• 180 terabyte archive located at the Shoah Foundation
• Robot to load the tapes
• Requires Internet 2 connection
• Requires 1 terabyte of local cache

SHOAH Issues
• Nature of Technology Platform
• How to Integrate Resources into Curriculum
• Usability: design of instructional strategies with digital video
• Tools for Instructors and Students
• Intellectual Property: privacy and security concerns
• Impact on Support: Management of delivery technologies

The Problem in General
• Much data in digital archives is language based
• Only known to insiders
• Archives mostly unconnected
• Every archive has its own standards for storage and access
• Normally only simple retrieval of files (text, audio or video documents)
• Social sciences and humanities researchers are not language or speech technologists
• They are often not aware of the potential benefits of using language and speech technology
• Available tools are hard to use for non-specialists

The CLARIN Mission
What:
• Create an infrastructure that makes language resources and technology (LRT) available to scholars of all disciplines, especially social sciences and humanities (SSH)
How:
• Unite existing digital archives into a federation of archives with unified web access
• Provide language and speech technology tools as web services operating on language data in archives

Why a European infrastructure?
• too much fragmentation
• lack of coordination across countries
• lack of visibility
• lack of interoperability
• lack of sustainability
• expertise exists but not in all countries
• language independent tools can be shared
• language dependent tools can often be ported
• most countries not able to bear the cost

Why now?
• Exponential growth of digital data
• Increasing maturity of language and speech technology:
  – high speed
  – large volumes
  – new research questions
• Growing EU interest in Research Infrastructures
• CLARIN is one of 35 accepted RI proposals
• Receives 3 yrs funding for preparatory phase

Why an infrastructure for SSH?
• Many infrastructures address risks and threats:
  – Environment
  – Energy
  – Climate change
  – Health
• CLARIN addresses Social Climate Change as caused or reflected by e.g.:
  – Mobility
  – Minorities
  – Language diversity
  – Cultures in contact

Who else do we need?
• The CLARIN consortium now has 32 partners from 22 EU and associated countries
• BUT membership and our consortium are quite unbalanced:
  – Speech & multimodality under-represented
  – Humanities other than linguistics under-represented
  – Social sciences under-represented
  – Some countries still missing
• There is no money to extend the consortium but we have to fill these gaps

Overall plan for CLARIN
• 2008-2010: Preparatory phase
  – Put everything in place
• 2011-2015: Construction phase
  – Build and populate with tools and resources
• 2016 onwards: Exploitation phase
  – CLARIN in full service
• Budget:
  – Prep phase: 4.1 M€ from EC, ??? from countries
  – Overall budget until 2020: ca 200 M€

4-dimensional approach in the preparatory phase
First 3 years dedicated to the design of
• The technical dimension
• The user dimension
• The language dimension
• The governance and legal dimension

Technical
• Technical specification of the infrastructure
• Construction of a prototype
• Validation on rich variety of
  – languages (>20)
  – resources
  – services
• Federation of existing archives
• Based on existing resources, tools
• Strong focus on interoperability standards
• Conversion of existing resources
• Encapsulation of existing tools

Languages
• Cover all languages spoken or studied in participating countries
• Representational and descriptive standards should be adequate and validated for all languages
• Same minimal coverage of basic resources and tools for all languages
• BLARK (Basic Language Resource Kit) to be defined and implemented (funds from other sources needed)

Language activities
• Survey of resources and tools, including:
  – encoding and annotation data
  – quality indicators
• taxonomies and ontologies
• agreeing on common standards
Focus on
• integration of tools
• interoperability
• usage scenarios
• creating missing essential resources
• validating specifications and prototype

User
• Users are SSH scholars (including linguists, translation experts)
• Do WE know what they need?
• Do THEY know what they need?
• Actions:
  – analyze past and ongoing SSH projects
  – user consultation
  – launch typical example projects to show potential
  – expertise centers
  – awareness actions

Legal IPR issues
• aim at open source, but IPR for existing and future non-open resources must be accommodated
• federation of archives requires authentication, authorization and trust between archives
• aim at limited number of template license agreements for most common cases
• respect national legislation
• address ethical issues

What CLARIN is NOT about
• building the infrastructure: we are just preparing it
• creating new resources: at this stage we want to use what is there and adapt it if necessary
• creating applications: except maybe some demonstrators
• focusing on the big languages: we find all languages equally important
• strengthening European industry: our target audience is SSH researchers, but we don’t want to exclude anyone

Work Packages
• WP1: Management and coordination
• WP2: Designing the infrastructure and building the prototype
• WP3: Humanities overview
• WP5: Language resources and technology overview
• WP6: Dissemination
• WP7: IPR and business models
• WP8: Construction and exploitation agreement

How we work (2)
[Diagram: numbered interactions among WP8 (Org & Legal Framework), WP7 (IPR, A&A, licensing), WP2 (Infrastructure Prototype), WP5 (LRT Exploration) and WP3 (Humanities Projects)]

Tasks
• Build national community
• Support participation in WGs by others than partners
• Validation tasks for own languages
• Creation or adaptation of essential resources
• Pilots and demonstrators
• Humanities projects

Call for Proposals
• Call for Proposals for Collaborating with Humanities and Social Science Projects
• Pre-proposals are invited for Humanities research projects that would benefit from access to LRT
• Research institutions or consortia with funding but little or no access to LRT or related expertise are targeted in this call

Example 1
• A literature project wants to study censorship in translation.
• It has access to uncensored and censored translations of novels.
• To support the analysis, the project may benefit from producing a searchable parallel corpus where different versions of each sentence are aligned.
• CLARIN participation could involve access to a corpus alignment tool and transfer of skills in using the tool.

Example 2
• A history project wanting to study cultural attitudes in Medieval Northern Europe wants to search through runic inscriptions.
• CLARIN participation might assist in
  – locating existing digitized corpora of runes in different countries and
  – providing assistance for converting the different materials to a common encoding.
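
The conversion step in Example 2 can be pictured with a minimal Python sketch. It assumes two archives that transliterate the same runes with different Latin-based schemes and maps both onto code points in the Unicode Runic block. The tables `SCHEME_A` and `SCHEME_B` and the function `to_unicode_runes` are hypothetical names made up for illustration; they are not CLARIN tools, and real runic corpora would need far richer tables and metadata.

```python
# Hypothetical transliteration tables: each archive is assumed to use its own
# Latin-based scheme for the same six runes (code points from the Unicode
# Runic block, U+16A0 to U+16FF).
SCHEME_A = {"f": "\u16a0", "u": "\u16a2", "th": "\u16a6",
            "a": "\u16a8", "r": "\u16b1", "k": "\u16b2"}
SCHEME_B = {"F": "\u16a0", "U": "\u16a2", "þ": "\u16a6",
            "A": "\u16a8", "R": "\u16b1", "K": "\u16b2"}


def to_unicode_runes(text, scheme):
    """Convert a transliterated inscription to Unicode runes.

    Longer keys are tried first so that a digraph such as "th" is not
    split into two separate symbols.
    """
    keys = sorted(scheme, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(scheme[key])
                i += len(key)
                break
        else:
            # Symbols missing from the table (spaces, damage marks) are kept as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Two differently encoded records of the same inscription become comparable.
record_a = to_unicode_runes("futhark", SCHEME_A)
record_b = to_unicode_runes("FUþARK", SCHEME_B)
print(record_a == record_b)
```

Once every archive is mapped to the same encoding, a single search can run across all of them, which is the point of the common encoding mentioned above.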

Who is Targeted
Especially the following target groups are addressed:
• Groups of individual researchers with basic institutional funding.
• Early stage researchers in funded PhD positions, with their supervisors.
• Research groups or consortia in an advanced pre-proposal stage with prospects of external funding.
• Research groups or consortia that have secured external funding.

Benefits of Collaboration
• CLARIN will provide consultancy and technical support to selected projects that are otherwise financed but lack the necessary resources and expertise to enhance their activities with LRT.
• The contribution of CLARIN to selected projects will therefore consist of providing guidance and access to LRT.
• This will involve advice on standards and the technologies to adopt for the particular objectives of selected projects.

Example Benefits
• Access to digital language resources and tools for
  – Management and exploration of corpora
  – Extraction of terms, multi-word units, names
  – Speech processing (STT/TTS)
• Assistance
  – Format conversion
  – Creation of a methodologically sound workflow with data, tools and modelling approaches for innovative research and development
  – Inclusion of LRT into research plans
• Consultancy on methodology and the use of resources and tools.
• Training in the use of specific tools and methods.
• Dissemination of the results and outcomes using various CLARIN dissemination channels.
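
To make "extraction of multi-word units" concrete, here is a small Python sketch that lists word pairs recurring in a text as candidate multi-word units. It is only a frequency count under simplifying assumptions (no stop-word filtering, no statistical association measure, sentence boundaries ignored); the function name `candidate_multiword_units` is hypothetical and does not refer to any specific CLARIN tool.

```python
import re
from collections import Counter


def candidate_multiword_units(text, min_count=2):
    """Return (bigram, count) pairs for adjacent word pairs seen at least min_count times."""
    # Lowercase and split on word characters; punctuation is dropped,
    # so pairs may span sentence boundaries in this simplified sketch.
    tokens = re.findall(r"\w+", text.lower())
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [(" ".join(pair), n)
            for pair, n in bigrams.most_common()
            if n >= min_count]


sample = ("Language resources matter. Language resources and language "
          "technology. Language technology helps.")
for unit, count in candidate_multiword_units(sample):
    print(unit, count)
```

A production tool would typically rank candidates with an association measure and a part-of-speech filter; the sketch only shows the basic shape of the task.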

Collaboration example 1
A project unable to acquire and utilize a text alignment tool may be given access to relevant software and may receive expert advice and training regarding its effective use from a CLARIN partner institution.

Selection Procedure
• Proposals will be selected in a fast two-step procedure.
• Step 1 (optional):
  – short pre-proposals (6 pages) are submitted.
  – CLARIN will perform an eligibility check and will provide feedback to proposers.
• Step 2: full proposals will be invited.
• Full proposals will be reviewed non-anonymously by three experts including the national representative for the proposal coordinator's country.
• Proposals will be judged according to the following criteria:

Criteria
• LRT needs / use of LRT towards research goals.
• Capacity of CLARIN to provide needed LRT/expertise.
• Relevance for testing the CLARIN infrastructure.
• Adherence to CLARIN standards and best practice.
• Capability to demonstrate the potential of the CLARIN infrastructure to HSS projects.
• Multilinguality / cross-boundary dimension.
• National and European needs and priorities (to the extent these are formulated).

Collaboration example 2
For a project in need of a database of runic inscriptions, a CLARIN partner institution might negotiate access to existing databases and, if necessary, assist in converting them to a common encoding standard to make them searchable.

Important Dates
• February 15, 2009 (noon UT): deadline for pre-proposal submission
• March 7, 2009: feedback is provided to proposers
• April 1, 2009 (noon UT): deadline for full proposal submission
• April 21, 2009: final decision

More info
• CLARIN Website: http://www.clarin.eu
• CLARIN Office: clarin@clarin.eu
• CLARIN Newsletter: http://www.clarin.eu/newsletter
• CLARIN Members: http://www.clarin.eu/members