Mapping endangered records of
endangered cultures
or
We have harvesters but not
enough fruit
Nick Thieberger
School of Languages and Linguistics
University of Melbourne
Charting Vanishing Voices:
A Collaborative Workshop to
Map Endangered Oral Cultures:
WOLP 2012 Workshop
Pacific and Regional Archive for Digital Sources in
Endangered Cultures (PARADISEC)
Metrics (June 2012)
274 collections of which 181 are publicly available
8,268 items of which 7,637 are publicly available
59,987 files
Size: 6.04 TB
Time: 3,390 hours
716 languages represented in the collection, from 65 countries
Type    Files     Size
.cha    39        173.39 KB
.dv     32        145.07 GB
.eaf    125       7.95 MB
.jpg    21,956    39.88 GB
.lbl    30        734.92 KB
.mov    66        493.62 GB
.mp3    9,490     181.59 GB
.mp4    81        19.13 GB
.mpg    106       34.42 GB
.mxf    42        356.05 GB
.pdf    5,035     3.10 GB
.rtf    4,681     87.30 MB
.tab    40        819.65 KB
.tif    1,626     52.46 GB
.trs    189       1.05 MB
.txt    363       23.70 MB
.wav    9,492     4.73 TB
.xml    6,546     142.92 MB
Pacific and Regional Archive for Digital Sources in
Endangered Cultures (PARADISEC)
Collaborative archiving project begun in 2002
Team made up of linguists and musicologists
Three universities in a consortium (Sydney, Melbourne, ANU)
Endangered records
Too little is recorded in most of the world’s
languages
Much of what is recorded is not being looked after
properly
We can’t even find what has been recorded
How can we change that?
Too little is recorded in most of the world’s
languages
How much fieldwork is going on?
• Newman (1992 and 2004) reports 34 US departments
running field methods courses
• LLL conference 2009 – 180 abstracts
• 2nd International Conference on Language Documentation
and Conservation 2011 – 230 abstracts
• Assume at least 100 current fieldwork-based linguistic
projects
• Since 1960, assuming 50 per year there should be
reasonable records of 2500 languages
• Recordings, texts, dictionaries
– paper and digital (from the late 1980s onwards)
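The estimate above is simple arithmetic, and a minimal sketch makes its assumptions explicit. The rate of 50 languages per year and the 1960 start come from the slide; the end year of 2010 is an assumption chosen to give a round 50-year span, and the figure of 7,000 languages is the one used later in this talk.

```python
# Back-of-envelope estimate; every input is an assumption, not a measurement.
LANGUAGES_PER_YEAR = 50   # rate assumed on the slide
START_YEAR = 1960         # start year given on the slide
END_YEAR = 2010           # assumption: a round 50-year span
WORLD_LANGUAGES = 7000    # approximate figure used later in the talk


def estimated_languages_recorded(per_year, start_year, end_year):
    """Return the number of languages recorded at a constant yearly rate."""
    return per_year * (end_year - start_year)


estimate = estimated_languages_recorded(LANGUAGES_PER_YEAR, START_YEAR, END_YEAR)
share = estimate / WORLD_LANGUAGES

print(estimate)          # 2500
print(f"{share:.0%}")    # 36%
```

Even on these generous assumptions, roughly two thirds of the world's languages would have no reasonable record.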
Too little is recorded in most of the world’s
languages
• Not even all funded projects are producing well-formed
records
– Well formed means described, archived and
accessible
– e.g., ELDP has funded 264 projects [1], but ELAR has
somewhere around 110 deposits [2]
[1] http://www.hrelp.org/grants/projects/index.php?year=all
[2] http://www.paradisec.org.au/blog/2012/04/elar-update-update
Too little is recorded in most of the world’s
languages
• More recording by non-linguists is necessary
• New methods (e.g., Basic Oral Language
Documentation - BOLD) that could include more
recording by speakers
• Social media as a source of recordings/texts/etc
• How to ensure this kind of recording has longevity?
There should be reasonable records of 2500
languages
• Where are they?
• How do we find them?
What is recorded is not being looked after properly
Digital recordings are more fragile than analog ones, but most are
not being archived
We can’t even find what has been recorded
Harvesting tools:
WorldCat http://www.oclc.org/worldcat
LLMap (Linguist List, USA) http://www.llmap.org
Multitree http://multitree.org
UNESCO Atlas http://www.unesco.org/culture/languages-atlas
ELCat / Endangered Language Catalog http://www.endangeredlanguages.com
Aggregated information
http://oralliterature.org/database, since mid-2010
We can’t even find what has been recorded
Language codes as a basis for searching
- ISO-639-3, three-letter codes
Typically not used by most repositories (small regional
libraries, State libraries, Film and Sound archives)
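To illustrate why language codes help, here is a minimal sketch of grouping catalog records from different repositories under one three-letter code. The records, the repository names and the name-to-code table are all hypothetical, and the codes shown are illustrative and should be checked against the ISO 639-3 registry before use.

```python
import re
from collections import defaultdict

# ISO 639-3 identifiers are three lowercase letters; this checks shape only,
# not whether the code is actually assigned in the registry.
ISO_639_3_SHAPE = re.compile(r"^[a-z]{3}$")

# Hypothetical lookup from free-text language names (as a repository might
# record them) to codes. Verify every code against the registry.
NAME_TO_CODE = {
    "lewo": "lww",
    "adygeisch": "ady",
    "adyghe": "ady",
    "polnisch": "pol",
    "polish": "pol",
}


def is_iso_639_3_shaped(code):
    """Return True if the string looks like an ISO 639-3 code."""
    return bool(ISO_639_3_SHAPE.match(code))


def index_by_code(records, name_to_code):
    """Group records by language code; collect records with unknown names."""
    index = defaultdict(list)
    unmatched = []
    for record in records:
        key = record["language_name"].strip().lower()
        code = name_to_code.get(key)
        if code is None:
            unmatched.append(record)
        else:
            index[code].append(record)
    return dict(index), unmatched


# Hypothetical records from three repositories using different naming habits.
records = [
    {"repository": "Archive A", "language_name": "Adygeisch"},
    {"repository": "Archive B", "language_name": "Adyghe"},
    {"repository": "Archive C", "language_name": "Lewo"},
    {"repository": "Archive A", "language_name": "Abuluti"},
]

index, unmatched = index_by_code(records, NAME_TO_CODE)
for code, items in sorted(index.items()):
    print(code, [item["repository"] for item in items])
print("unmatched:", [item["language_name"] for item in unmatched])
```

Records named in German and in English end up under the same code, while any name missing from the table is set aside for manual checking rather than silently dropped.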
We can’t even find what has been recorded
British Library
National Library of Australia
Vienna Phonogrammarchiv
||Aikwe (Naro)
Abron
Abuluti
Abzachisch (Adygeisch Dialekt)
Acholi
Adygeisch
Adygeisch Dialekt (Adygeisch)
Afrikaans
Agau
Aghul
Aghul Dialekt (Aghul)
Darra-i Nur (Pashai)
Pashtu
Pashtu Dialekt (Pashtu)
Pelende
Permjakisch
Persisch
Persisch Standardsprache (Persisch)
Phakey
Pidgin- und Kreolsprachen, englisch-basiert
Pokomo
Polnisch
Polnisch Standardsprache (Polnisch)
Polynesisch
Pomo
Pondo (Pana)
Portugiesisch
Pulaar Fulfulde
Punjabi
Rajasthani
Raji
Rakhshani (Baluči Dialekt)
Rathwi-Bhilali (Bhilali)
Rätoromanisch
Rätoromanisch Dialekt (Rätoromanisch)
Raute
Rendille
Romagnolisch (Italienisch Dialekt)
Romanes
Romanes nonvlax Balkan (Romanes)
Romanes nonvlax Gopti (Romanes)
Romanes nonvlax Nord Ost (Romanes)
Romanes nonvlax Nord West (Romanes)
Romanes nonvlax Zentral Nord (Romanes)
Romanes nonvlax Zentral Süd (Romanes)
Romanes vlax (Romanes)
Romanisch
Romanisch Dialekt aus Italien (Romanisch)
Roncalés (Baskisch Dialekt)
Ronga
Rugciriku
Rumänisch
Rumänisch Dialekt (Rumänisch)
Rumänisch Standardsprache (Rumänisch)
Russisch
Russisch Standardsprache (Russisch)
Ruthenisch
Rutulisch
Sadani
Safen
Saho
Šahrī
Sala
Samaritanisch
Samba
Samba Daka
Samburu
Sambyu (Kwangari)
Sami
Samo
Sanaga
Sanga
Sango
Sanskrit
Sanye
Sara
Sardisch
Sardisch Dialekt (Sardisch)
Scherpa
Schopski (Bulgarisch Dialekt)
Schottisch-Gälisch
Schottisch-Gälisch Dialekt (Schottisch-Gälisch)
Schottisch-Gälisch Standardsprache (Schottisch-Gälisch)
Schottisches Englisch (Englisch)
Schottisches Englisch Standardsprache (Schottisches Englisch)
Schottisches)
Online searching for language material
e.g., ‘Lewo’ as a language name?
Google – ‘Lewo’ – 3,080,000 hits
Google – ‘Lewo grammar’ – 2,200 hits
Open Language Archives Community (OLAC) – ‘Lewo’ – 13 hits
OLAC search result
What else is out there?
• Items held in personal collections can’t be located
– speakers who recorded their families
– missionaries
– patrol officers
• These could be listed in catalogs, even if online access
is restricted
Existing resources =
low-hanging fruit
e.g., http://anglicanhistory.org/oceania/
Problems of longevity of website-based data sources
Use the Internet Archive for a persistent identifier
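As a rough illustration, a Wayback Machine address can be built from an original URL and a timestamp using the public URL pattern https://web.archive.org/web/<timestamp>/<url>. The sketch below only constructs the address; it does not check that a capture exists, and the Wayback Machine serves the capture nearest to the requested timestamp if there is one. The function name and the chosen date are assumptions for the example.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build a Wayback Machine address for a URL at a given moment.

    This constructs a string only; no request is made and no capture
    is guaranteed to exist for that moment.
    """
    stamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Example: the website cited above, addressed as of an assumed date.
snapshot = wayback_url("http://anglicanhistory.org/oceania/", datetime(2012, 6, 19))
print(snapshot)
# https://web.archive.org/web/20120619000000/http://anglicanhistory.org/oceania/
```

Citing the archived address alongside the live one means a catalog entry keeps working if the original website moves or disappears.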
06/19/12
Endangered recordings
• Linguists need a shared infrastructure in which to locate
their recordings
– to make them discoverable
– to provide standard descriptions which can be
located by standard search mechanisms
– to enter metadata before it is forgotten
From the laptop to the archive
ExSite9
Metadata creation without (too many) tears
File browser – assigning attributes to files created in
fieldwork
Application writes an XML file capturing relationships
expressed by ‘drag and drop’ in the browser
XML file submitted to an archive’s catalog
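The idea of writing an XML file that records which fieldwork files belong to which item can be sketched as follows. This is not ExSite9's actual schema or code: the element names, attribute names, identifiers and file names are all hypothetical, chosen only to show how grouping decisions made in a file browser could be serialised for an archive's catalog.

```python
import xml.etree.ElementTree as ET


def build_collection_xml(collection_id, groups):
    """Serialise file-to-item groupings as an XML element tree.

    `groups` maps an item identifier to its language code and file names.
    Element and attribute names here are illustrative, not a real schema.
    """
    root = ET.Element("collection", {"id": collection_id})
    for item_id, info in groups.items():
        item = ET.SubElement(
            root, "item", {"id": item_id, "language": info["language"]}
        )
        for file_name in info["files"]:
            ET.SubElement(item, "file", {"name": file_name})
    return root


# Hypothetical groupings, as might result from dragging files onto items.
groups = {
    "001": {"language": "lww", "files": ["story1.wav", "story1.eaf"]},
    "002": {"language": "lww", "files": ["photo1.jpg"]},
}

root = build_collection_xml("XX1", groups)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the relationships are captured at the time the files are created, the description travels with the recordings instead of being reconstructed from memory later.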
ExSite9
In development in mid-2012
Cross-platform tool
Expected release later in 2012
EOPAS – Delivery of text and media
Encourage deposit of text and media
- Provide presentation formats for recorded texts
- Based on a linguist’s normal workflows
Record > Transcribe (ELAN) > Interlinearise (Toolbox) >
XML output > EOPAS
http://linguistics.unimelb.edu.au/research/projects/eopas/
EOPAS transcript view: http://www.eopas.org/transcripts/55
• Playable media
• Metadata
• Selected text: Keyword in Context / Concordance in all texts of that language
• Ability to turn off morphemic view
• Reference to morpheme level
• Reference to timed chunk
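The keyword-in-context concordance mentioned for EOPAS can be sketched in a few lines. This is not EOPAS code: the function, its parameters and the sample sentences are placeholders invented for illustration, and a real concordance would work over the transcribed texts of one language rather than English examples.

```python
def keyword_in_context(texts, keyword, width=3):
    """Return (text_id, left context, match, right context) for each hit.

    `texts` maps a text identifier to its transcript as a plain string.
    `width` is the number of words shown on each side of the keyword.
    """
    hits = []
    target = keyword.lower()
    for text_id, text in texts.items():
        words = text.split()
        for i, word in enumerate(words):
            # Ignore surrounding punctuation and case when matching.
            if word.strip(".,;:!?\"'").lower() == target:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                hits.append((text_id, left, word, right))
    return hits


# Placeholder transcripts, not real language data.
texts = {
    "text1": "the old canoe sailed to the island",
    "text2": "they built a canoe near the shore",
}

for text_id, left, match, right in keyword_in_context(texts, "canoe", width=2):
    print(f"{text_id}: {left} [{match}] {right}")
# text1: the old [canoe] sailed to
# text2: built a [canoe] near the
```

Each hit keeps its text identifier, so a reader can jump from the concordance line back to the full text and its media.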
Stories
Recorded by researchers
Strong source community interest in hearing recordings and reading texts
Stored in digital archives
Digitised from analog sources
Central harvesting by language code (ISO-639-3)
Stories in many of the world’s 7,000 languages
Harvesting tools need something to harvest!
Persuade linguists to create research data properly and to
deposit their materials in archives
- create incentives in academia to create collections
Locate existing digital material and incorporate it into
principled online catalogs
Locate analog collections, digitise them and incorporate them
into principled online catalogs
Build example texts/media for as many languages as
possible
http://paradisec.org.au
thien@unimelb.edu.au
http://www.nflrc.hawaii.edu/ldc/