Design of a Multimedia Corpus of Austronesian Linguistics

Zhemin Lin, Li-May Sung, I-wen Su
Graduate Institute of Linguistics of the National Taiwan University
Abstract. This paper introduces the design of an integrated platform of multimedia
online corpora that aims to serve both linguists and the public, along with its
database schema and programming details. Compared with the Formosan language
archive of Academia Sinica, our design places more emphasis on the normalization,
accessibility and interoperability of the system. The design of an automatically
generated dictionary with cross-references and the capability of searching the entire
database in various ways are also described.
1 Introduction
The development of natural language processing techniques and dynamic web pages
has generated wide interest in the construction of integrated platforms that enable
people to submit, browse and search collected texts in corpora. However, most online
corpora are built specifically for experts; they are sentence-based and do not provide
multimedia content. The NTU corpus of Austronesian languages1 introduced in this
paper is an attempt to construct a multilingual online corpus with multimedia content
that meets the needs of both linguists and the public. In the following sections, we
briefly review previous work and then focus on the features of our current work.
1 http://corpus.linguistics.ntu.edu.tw
2 Formosan Language Archive of Academia Sinica
Zeitoun et al. (2003) discussed some of the problems in the conservation of Formosan
Austronesian languages. The continuous enhancement of their work with many
newly designed tools is further described in Zeitoun and Yu (2005). As discussed in
the two articles, fieldwork data are rarely shared in the linguistic community.
Collected materials are sometimes inaccessible even in the office where they are
stored, owing to changes of storage media or data damage. One of the most serious
problems is that, although there are elicited sentences and recordings, few of them
are organized and published. In response to these problems, researchers at
Academia Sinica have built a Formosan language archive, i.e., an online corpus with
texts, translations, word glosses and sound recordings from native speakers of 14
languages and dialects.2
Despite this effort, their system has shortcomings. One is theoretical: the Sinica
corpora are sentence-based, so pauses, pause fillers, repetitions, intonation contours,
intonation unit (IU) boundaries and other discourse clues are either discarded or
missing. A sentence-based corpus excludes important linguistic information that is
present only in discourse. Words in the system are written in an ad hoc transcription
that mixes in the International Phonetic Alphabet (IPA), a style that prevents native
speakers of the respective languages from using the data directly. Nearly every word
is altered to some extent. Example (1) is a Saisiyat passage extracted from the Sinica
archive.
(1) (a) yao noka maʔiiæh ... hayðaʔ ʔæhæʔ maʔiiæh la m-waaiʔ, yao minaŋaʔŋaʔ nak hini mina-ʃaaəŋ.
    (b) ʔinʔalay hikor may nak hini yakin, ʃβət yakin ho.
    (c) ʔok-ik ʃəβət, m-waaiʔ nak hini pa-paʃœʃ, yao h<œm>ʃœʃ atomalan.
    (Extracted from 05.002a -- 05.002c of “5.我的故事” of the Sinica archive.)
2 http://formosan.sinica.edu.tw/formosan/ch/select_corpus.htm
There is so far no dictionary with a cross-referencing function in the Sinica corpus,
even though cross-referencing in an online corpus is essential for researchers dealing
with elicited or authentic data.
With cross-referencing, as with KWIC (keyword-in-context, cf. Luhn (1960)), a user
can trace a word back to the context where it occurs and browse its surrounding IUs.
Zeitoun et al. (2003) planned a data schema that ran on Microsoft Access. Their
design, however, cannot take advantage of the SQL92 query language. Moreover,
they designed their own XML dialect to improve interoperability, which does not
encourage researchers to share their collected data in a convenient way.
The Sinica archive, though primitive in design, is the first attempt
to provide public access to the nearly extinct linguistic data, which is an effort highly
respectable by itself.
3 NTU Corpus of Austronesian Languages
The system designed in this paper is based on the NTU corpus of Austronesian
languages.
The NTU corpus, first described in Huang, Su, and Sung (2003), is
composed of spoken texts in various languages.
Currently, the NTU Saisiyat corpus contains 22 texts, 3081 intonation units (IUs) and
approximately 10635 words, transcribed following the conventions of Du Bois (1993).
It comprises one conversation, eight narratives of indigenous legends, and thirteen
elicited narratives: five based on the “Pear Stories” (a six-minute silent color film
made by Wallace Chafe, see Chafe (1980)) and eight based on the “Frog Stories”
(a sketch book by Mayer (1980)).
An example of an original data segment follows:
(2)
9.  ... (1.7)  m-wa:i'       'aehae'  ka
               AF-come       one      NOM
10. ...(1.1)   ma'iaeh       ima      h<oem>oehoe'  ka    siri'
               person        ASP      <AF>pull      ACC   goat
11. ...        may           kabih    hiza
               pass.by[AF]   side     there
12. ...(1.9)   ilahiza
               move.to.that.place
“(The man pulling a goat) passed by this way and went that way.” (Pear 3:9-12)
A spoken corpus, in contrast to a written corpus, is composed of utterances shorter
than or equal to sentences, which are segmented according to criteria such as
turn-taking, pauses, and ruptures in the intonation contours of a monologue
(Tao 1996:35). Fig. 1 shows a unified intonation contour in a Praat3 window.
Fig. 1. A unified intonation contour
When a corpus is transcribed, tagged and analyzed, one needs a means to make it
accessible to the public. An integrated platform that stores and presents the collected
data, and that lowers the technological barriers to their further use, is thus necessary.
With the shortcomings of the Sinica archive in mind, normalization, accessibility and
interoperability are emphasized in the design of our system. The following guidelines
are thus proposed.
3 Praat is a programmable phonetic analyser written by Paul Boersma and David Weenink,
Institute of Phonetic Sciences, University of Amsterdam. It is free software licensed
under the GNU General Public License; we are grateful for their outstanding work. Cf.
http://www.fon.hum.uva.nl/praat/
(3) Guidelines of the integrated platform
(a) Easy to customize for most Austronesian languages
(b) Standardized procedures of transcription, annotation and process
(c) Automatic extraction of morphosyntactic information to reduce
repetition of human labor
(d) Web-based, unified input/output interface
(e) Searchable corpus that fits the needs of both linguists and the public
(f) Multimedia representation of collected texts
(g) Interoperable with other systems
(h) Cross-platform, operating system independent
Below is a description of the input, processing and output of our system design.
3.1 Standardization of text commitment and standards of committed texts
Standardization covers the procedure for handling transcribed texts, the transcription
itself, and the morphosyntactic and discourse codes used in the transcription.
The procedure for handling collected texts is designed with low coupling in order to
reduce complexity. The dependence between the human tasks in the system is
therefore almost uni-directional, as can be seen in Figure 2. Whenever a spoken text
is collected, a worker transcribes it. Once the transcription is complete, it is given
to the database maintainer for processing and storage. The web interface presents the
corpus in the database, so that people on the other end of the Internet can browse and
search it.
Fig. 2. Use cases of the system
The transcription follows Du Bois (1993), a de facto standard in the linguistic
community.
Word glosses and annotations follow a standardized coding list
inherited from conventional mark-ups (cf. Appendix A) and the Leipzig glossing
rules4.
A standard operation is also set for the database maintainer to handle
fieldwork collections as shown in Figure 3.
Fig. 3. Standard operation of text commitment
The corpus in our system is stored in Unicode (UTF-8 encoding) to accommodate IPA,
Japanese, and annotations in other languages. If some of the tribes decide to adopt
non-ASCII letters, such as “ ɖ ʈ ɼ ɫ ʔ ”, into their writing systems, the programs can
process them correctly without modification. Because a Unicode BOM (byte-order
mark, U+FEFF) is typically written at the beginning of a file on Microsoft Windows
and is usually absent on Unix-based systems, the mark can cause problems when
reading files edited on different operating systems. The programs therefore handle
files both with and without the mark, in order to fulfil the criterion of platform
independence.
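The paper does not state how the BOM is handled. One common way to do it in Python, given here only as a sketch and not as the authors' method, is to decode with the standard "utf-8-sig" codec, which drops a leading BOM when one is present and otherwise behaves like plain UTF-8. The function names below are hypothetical.

```python
import codecs


def decode_committed_bytes(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM (U+FEFF) if one is present."""
    # "utf-8-sig" strips the BOM when it exists and changes nothing otherwise.
    return raw.decode("utf-8-sig")


def read_committed_text(path: str) -> str:
    """Read a committed file the same way on every operating system."""
    with open(path, "rb") as handle:
        return decode_committed_bytes(handle.read())


# The same header line, saved with and without a BOM, decodes identically.
with_bom = codecs.BOM_UTF8 + "Topic: Pear story\n".encode("utf-8")
without_bom = "Topic: Pear story\n".encode("utf-8")
assert decode_committed_bytes(with_bom) == decode_committed_bytes(without_bom)
```

Decoding from bytes rather than relying on the platform default encoding keeps the behaviour identical on Windows and Unix-based systems.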
A set of metadata is defined in the head of committed files. An example of the head
of a committed text is given in (4), and the fields are described in Table 1.
4 http://www.eva.mpg.de/lingua/files/morpheme.html
(4)
Topic: Pear story
Type: Narrative
Language: Kavalan
Dialect: Xinshe
Speaker: Imui, 潘金妹, F,1952
Time: 00:01:15
Total IUs: 31
Collected: 2003-05-30
Revised: 2003-11-11
Transcribed by: 葉俞廷, 王以勤
Double checked: 鍾曉芳,沈嘉琪 ,葉俞廷
Table 1. Metadata of committed data
Field name       Description                   Format
Topic            Topic of text                 String (e.g., Pear Story)
Type             Style of text                 Narrative|Conversation|...
Language         Language of text              String, first letter in capital
Dialect          Dialect or district           String
Speaker          Base data of the informant    Native/Chinese name, Gender, Age
Time             Length of recording           hh:mm:ss
Total IUs        Number of IUs in text         Numeric
Collected        Date of recording             yyyy-mm-dd
Revised          Date of latest revision       yyyy-mm-dd
Transcribed by   Transcribers and annotators   Comma separated string
Double checked   Inspectors of text            Comma separated string
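A check of the metadata head against the formats in Table 1 might look like the following sketch. It is not the code of canon.py, which the paper does not list; the function name parse_metadata, the choice of which fields become lists, and the error handling are assumptions made for illustration.

```python
import re

# Formats taken from Table 1; fields without a pattern are free strings.
PATTERNS = {
    "Time": re.compile(r"\d{2}:\d{2}:\d{2}"),
    "Total IUs": re.compile(r"\d+"),
    "Collected": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "Revised": re.compile(r"\d{4}-\d{2}-\d{2}"),
}
# Comma separated fields are returned as lists (an assumption of this sketch).
LIST_FIELDS = {"Speaker", "Transcribed by", "Double checked"}


def parse_metadata(header: str) -> dict:
    """Parse 'Field: value' lines and validate the formatted fields."""
    meta = {}
    for line in header.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so "Time: 00:01:15" stays intact.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'Field: value' line: {line!r}")
        name, value = name.strip(), value.strip()
        pattern = PATTERNS.get(name)
        if pattern and not pattern.fullmatch(value):
            raise ValueError(f"bad format for {name}: {value!r}")
        if name in LIST_FIELDS:
            value = [item.strip() for item in value.split(",")]
        meta[name] = value
    return meta


HEADER = """Topic: Pear story
Type: Narrative
Language: Kavalan
Dialect: Xinshe
Speaker: Imui, 潘金妹, F,1952
Time: 00:01:15
Total IUs: 31
Collected: 2003-05-30
Revised: 2003-11-11
Transcribed by: 葉俞廷, 王以勤
Double checked: 鍾曉芳,沈嘉琪 ,葉俞廷
"""
meta = parse_metadata(HEADER)
```

Stripping each comma separated item absorbs the irregular spacing seen in the example head.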
The format of the text following the metadata is illustrated below.
(5)
5.                               [IU #, followed by a period]
.. qay- .. qay-byabas 'nay ,_    [words separated by spaces]
QAY-guava that                   [English gloss separated by spaces]
QAY-芭樂 那                       [Chinese gloss separated by spaces]
6.
... razat 'nay nani.\
person that DM
人 那 DM
#e That person picked guavas. Then,
#c 那個人採芭樂。然後,
#n Elicitation notes
#n (More elicitation notes)
Lines beginning with a sharp (#) are processing instructions (PI). “#e” indicates
a line of English translation of the paragraph composed of the IUs from the last
translation to the current one. “#c” marks a Chinese translation, and “#n” marks
elicitation notes. It is possible to have more than one note. The alignment of
native words and glosses is done automatically. Morpheme boundaries,
morphological information and word senses are extracted using the techniques
introduced in Lin (2005: Chapters 2 and 4.2).
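The structure of a committed text body can be read with a small line-based parser. The sketch below covers only the layout described above (IU number, word line, gloss lines, and the #e, #c and #n instructions); it does not attempt the automatic alignment or morphological extraction of Lin (2005). The function name parse_body, the assumed order of lines inside an IU (native, English, Chinese) and the returned dictionaries are illustrative assumptions.

```python
import re

IU_NUMBER = re.compile(r"^(\d+)\.$")      # e.g. "5."
PI_FIELDS = {"#e": "eng", "#c": "chn"}    # translation instructions


def parse_body(text):
    """Split a committed text body into IU records and paragraph translations."""
    ius, paras = [], []
    pending = []   # IU numbers not yet covered by a translation
    para = None    # paragraph currently receiving #e / #c / #n lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        number = IU_NUMBER.match(line)
        if number:
            ius.append({"no": int(number.group(1)), "lines": []})
            pending.append(ius[-1]["no"])
            para = None
        elif line.startswith("#"):
            code, _, rest = line.partition(" ")
            if para is None:
                if not pending:
                    raise ValueError(f"translation without IUs: {line!r}")
                # A translation covers the IUs since the previous translation.
                para = {"von": pending[0], "bis": pending[-1],
                        "eng": "", "chn": "", "notes": []}
                paras.append(para)
                pending = []
            if code in PI_FIELDS:
                para[PI_FIELDS[code]] = rest
            elif code == "#n":
                para["notes"].append(rest)
            else:
                raise ValueError(f"unknown instruction: {code!r}")
        else:
            if not ius:
                raise ValueError(f"text before the first IU: {line!r}")
            ius[-1]["lines"].append(line.split())
    for iu in ius:
        # Assumed order of the lines inside an IU: native, English, Chinese.
        for key, tokens in zip(("nat", "eng", "chn"), iu.pop("lines")):
            iu[key] = tokens
    return ius, paras


SAMPLE = "\n".join([
    "5.",
    ".. qay- .. qay-byabas 'nay ,_",
    "QAY-guava that",
    "QAY-芭樂 那",
    "6.",
    "... razat 'nay nani.\\",
    "person that DM",
    "人 那 DM",
    "#e That person picked guavas. Then,",
    "#c 那個人採芭樂。然後,",
    "#n Elicitation notes",
    "#n (More elicitation notes)",
])
ius, paras = parse_body(SAMPLE)
```

The von and bis keys mirror the field names of Table para in Appendix B.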
As the transcription is supposed to reflect, more or less, the actual pronunciation of
an informant, spelling may vary slightly from one occurrence of a word to another.
So that the system is not confused by these variations, a feature vector is configured
for each Formosan language. A vector describes how to reduce the variants to a
simpler form. For example, the pronunciations of a and ae are quite similar in
Saisiyat, and glottal stops are sometimes omitted, so 'aehae' “one” is usually spelled
'ahae or aehae. Below are the feature vectors of Saisiyat and Kavalan.5
Saisiyat: ae → a, oe → o, S → s, ' → ∅
Kavalan:  th → l, d → l, ' → ∅
This string substitution is executed before any operation in the database in order to
prevent duplicated entries; otherwise full-text search may fail to work.
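The substitution can be sketched as a sequence of string replacements, reading the vectors above as ae → a, oe → o, S → s, ' → ∅ for Saisiyat and th → l, d → l, ' → ∅ for Kavalan. This is an illustration rather than the code of simplify.py; the names FEATURE_VECTORS and simplify are hypothetical, and dropping morpheme-boundary hyphens is an assumption inferred from the simplified forms shown elsewhere in the paper (e.g. h-oem-angaw stored as homangaw).

```python
# Ordered substitution rules per language, as read from the feature vectors.
FEATURE_VECTORS = {
    "Saisiyat": [("ae", "a"), ("oe", "o"), ("S", "s"), ("'", "")],
    "Kavalan": [("th", "l"), ("d", "l"), ("'", "")],
}


def simplify(word: str, language: str) -> str:
    """Reduce spelling variants of a word to one simplified form."""
    # Assumption: morpheme-boundary hyphens are removed as well.
    simplified = word.replace("-", "")
    for variant, target in FEATURE_VECTORS[language]:
        simplified = simplified.replace(variant, target)
    return simplified


# The three spellings of Saisiyat "one" collapse to a single key.
variants = {simplify(w, "Saisiyat") for w in ("'aehae'", "'ahae", "aehae")}
assert len(variants) == 1
```

Applying the same function to both the stored text and the user's query is what lets differently spelled forms meet on one key.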
3.2 Database design
Database design affects the efficiency of search and storage. To simplify the
programming logic and to speed up queries, we propose a schema that differs from
that of the Sinica archive. Any relational database engine that follows the SQL92
standard can be used to implement the schema.
5 Kavalan is an Austronesian language spoken in Hualien County, east Taiwan.
SQLite6, among relational database systems, is recommended for the following reasons:
1. It is light-weight, fast and platform independent.
2. A database is stored in a single file and is thus easy to maintain.
3. It supports UTF-8 encoding.
4. It is free software.
Each Formosan language is placed in its own database and is thus stored in a single
file. The schema is the same for every language, so that a cross-linguistic search
can be executed from a single page. It is often argued that a database has to be
normalized to the third normal form.7 To be realistic, our system is instead designed
for the sake of efficiency. The relational diagram of the tables in the database is
shown in Figure 4. A full listing of the database schema is given in Appendix B.
6 http://www.sqlite.org
7 There is a good tutorial about database normalization at http://dev.mysql.com/techresources/articles/intro-to-normalization.html
Fig. 4. Relational diagram of tables in the database
The text is mainly stored in Table “iu”. In contrast to the word-based design of
the Sinica archive, every intonation unit is stored in one row. For example,
article : pear3
nat     : ...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw
sim     : . ima homangaw kasnaitol ray kahoy babaw
eng     : . Asp set_a_ladder-AF move_up-AF Loc tree above
For a full-text search, a simple query of “%keyword%” against every field listed
above returns the correct results. The simplified spelling is stored so that a search
matches all spelling variants. Words in the database are separated by a single space,
so that they are easily processed in programs by a single function (explode() in PHP
and split() in Python). Places where no gloss is available are occupied by a period
(“.”); thus, words and glosses are always aligned across the fields.
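The row layout and the “%keyword%” query can be tried out with Python's built-in sqlite3 module. The sketch below uses an in-memory database and declares the text columns as TEXT for simplicity, whereas Appendix B lists them as blob; the IU number 1, the empty Chinese gloss and the function names search and align are placeholders for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iu (article VARCHAR(80), no INTEGER, "
    "nat TEXT, sim TEXT, eng TEXT, chn TEXT)"
)
conn.execute(
    "INSERT INTO iu VALUES (?, ?, ?, ?, ?, ?)",
    ("pear3", 1,  # the IU number 1 is a placeholder
     "...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw",
     ". ima homangaw kasnaitol ray kahoy babaw",
     ". Asp set_a_ladder-AF move_up-AF Loc tree above",
     None),       # the Chinese gloss is not given in the example
)


def search(connection, keyword):
    """Full-text search: match the keyword anywhere in any text field."""
    pattern = f"%{keyword}%"
    return connection.execute(
        "SELECT article, no, nat, eng FROM iu "
        "WHERE nat LIKE ? OR sim LIKE ? OR eng LIKE ? OR chn LIKE ?",
        (pattern, pattern, pattern, pattern),
    ).fetchall()


def align(nat, gloss):
    """Pair each native word with its gloss; both are space separated."""
    words, glosses = nat.split(" "), gloss.split(" ")
    if len(words) != len(glosses):
        raise ValueError("words and glosses are not aligned")
    return list(zip(words, glosses))


for article, no, nat, eng in search(conn, "kahoy"):
    print(article, no, align(nat, eng))
```

The keyword kahoy matches only the simplified field, which is why that field is stored. Note that % and _ in a keyword act as LIKE wildcards and are not escaped in this sketch.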
Another specialized data structure is designed in Table “lemma”. In order to search
for an affix properly, the stem is marked for every word in the dictionary. A
morpheme before the stem is a prefix and one after it is a suffix. For example,
Saisiyat kapapama'an 'bicycle' is stored as ka-#papama'#-an in the table. If one
looks for the prefix ka- or the suffix -an, one can always obtain the right answer by
taking the elements before the first sharp (#) or after the second sharp. Since
infixation is simple in the two languages, it is currently analyzed on the fly by
external programs. 8
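Taking the elements before the first sharp and after the second amounts to one split on “#”. The following sketch assumes that several prefixes or suffixes, when present, are separated by hyphens and that an unaffixed word is stored with empty parts around the stem; neither detail is stated in the paper, and the function name split_lemma is hypothetical.

```python
def split_lemma(lemma: str):
    """Split 'prefix-#stem#-suffix' into (prefixes, stem, suffixes)."""
    parts = lemma.split("#")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '#' marks: {lemma!r}")
    before, stem, after = parts
    # Assumption: multiple affixes on one side are hyphen separated.
    prefixes = [p + "-" for p in before.split("-") if p]
    suffixes = ["-" + s for s in after.split("-") if s]
    return prefixes, stem, suffixes


prefixes, stem, suffixes = split_lemma("ka-#papama'#-an")
```

For the example from the text this yields the prefix ka-, the stem papama' and the suffix -an.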
3.3 Back-end programs and the POS-tagger
The database maintainer commits a pre-processed transcription to the database
through a batch of back-end programs. Commitment is preferably done on the
command line, so that mismatches in alignment or failures of automated
morphological analysis can be corrected immediately and interactively. A prototype
has been implemented to prove the workability of the system. Here is a list of its
programs.
features.py
    defines language-specific feature vectors and provides the connection DSN.
simplify.py
    is the common library for reducing spelling variants.
canon.py
    checks input validity, including metadata and text format. It writes the data
    into the database when the check passes.
extractmorph.py
    defines morphological and discourse codes and extracts them from the texts.
makedict.py
    extracts information from imported texts and updates the dictionary.
mp3splt.py/mpgsplt.py
    splits .mp3 / .mpg files according to the time-file (see below).
tidy.py
    utility to convert Chinese punctuation into ASCII and remove unnecessary
    Microsoft Word mark-ups.
8 After the corpus complies with the Leipzig glossing rules, infixation will be marked by < and >.
The coupling of the modules is fairly low. “features.py” and “simplify.py”
provide the functions needed by all the other programs.
Once texts have been put into the database, they are tagged by a TBL tagger (cf. Lin
2005: Chapter 2), and the dictionary is updated at the same time. When a user looks
up a word, its part-of-speech information can be obtained along with its frequency in
the corpus. Whenever the database maintainer finds an error in the tagged corpus, it
can be corrected on-line as immediate feedback to the tagger. The tagger can later be
retrained with a single click.
3.4 Unified output interface
For the corpus to be accessible to the public, a unified user-friendly interface is built.
The system follows HTML 4.01 (loose) as specified by the World Wide Web
Consortium9 and is designed to be viewed in a web browser, because this is one of
the major means of accessing data on the Internet. For a dynamic and interactive
presentation, the Document Object Model (DOM)10 and JavaScript 1.2 are preferably
used. Popular browsers, such as Internet Explorer 5.0, Mozilla 1.7, Firefox 0.9 and
Opera 4, are compliant with these standards. It is important to support major
browsers for the purpose of accessibility. Figure 5 is a screen dump of the web site
under construction.
9 http://www.w3.org/TR/REC-html40/
10 http://www.w3.org/DOM/
Fig. 5. Screen dump of the web site (under construction)
The interface is composed of the following parts: a window with the informant's
photo where movie clips are played, a list of metadata in the upper-left corner, several
switches to adjust browsing effects, and a frame at the bottom of the screen that
displays the selected article in a format following linguistic convention. A dictionary
pops up whenever a user clicks on an unknown word (see Figure 6). The pages are
being revised for a better visual effect.
Ethnological notes and examples are preferably given in the dictionary with
cross-references. The search interface is simple, yet complicated and specialized
linguistic queries are still possible. For example, by typing tabatathan a user can
find the occurrences of the Kavalan word ta-batad-an; typing 'ahae or aehae finds
'aehae' for Saisiyat, and so on. Interfaces to user-defined functions (UDF) are also
kept for further improvement.
Fig. 6. Pop-up dictionary with cross-references
As bandwidth is quite limited, it is suggested that multimedia data be stored and
transferred as 16 kbps, 11 kHz MPEG-1 Layer 3 for audio and MPEG-1 for video.
3.5 Interoperability
It is important to share the corpus with the linguistic community. The Extensible
Markup Language (XML)11 is a simple and flexible language used to exchange data
between different systems. It is now a de facto standard on the web. So that
researchers in natural language processing can easily benefit from our collected data,
the corpus should be exportable in XML. The morphological information, gloss and
part-of-speech of every word may be output in a uniform manner. An example of the
exported format is given below.
11 http://www.w3.org/XML/
<?xml version="1.0" encoding="utf-8" ?>
<article id="pear_imui">
  <topic>Pear Story</topic>
  <language>Kavalan</language>
  <dialect>Xinshe</dialect>
  <speaker>
    <natname>imui</natname>
    <chnname>潘金妹</chnname>
    <gender>F</gender>
    <age-of-record>51</age-of-record>
  </speaker>
  <duration>00:01:15</duration>
  <total-iu>31</total-iu>
  <collected>2003-05-30</collected>
  <revised>2003-11-11</revised>
  <transcriber>葉俞廷</transcriber>
  <transcriber>王以勤</transcriber>
  <doublecheck>鍾曉芳</doublecheck>
  <doublecheck>沈嘉琪</doublecheck>
  <doublecheck>葉俞廷</doublecheck>
  <text>
    <iu id="iu_1">
      <word>
        <nat>tangi</nat>
        <sim>tangi</sim>
        <eng>today</eng>
        <chn>今天</chn>
        <pos>RB</pos>
      </word>
      <word>
        ...
      </word>
    </iu>
    <iu id="iu_2"> ...
    </iu>
    ...
    <para von="1" bis="4">
      <eng>I just saw a person there ...</eng>
      <chn>我剛剛看到 ...</chn>
      <notes>Some elicitation notes</notes>
    </para>
    ...
  </text>
</article>
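An export of one intonation unit in the format above could be sketched with Python's standard xml.etree.ElementTree module. This is an illustration and not the authors' exporter: the function name iu_to_element is hypothetical, and the pos element is left out because its value comes from the tagger rather than from the space separated fields.

```python
import xml.etree.ElementTree as ET

FIELDS = ("nat", "sim", "eng", "chn")


def iu_to_element(no, nat, sim, eng, chn):
    """Build an <iu> element from the space separated fields of one IU row."""
    iu = ET.Element("iu", {"id": f"iu_{no}"})
    columns = [field.split(" ") for field in (nat, sim, eng, chn)]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("fields are not aligned word by word")
    # Each position across the four fields becomes one <word> element.
    for values in zip(*columns):
        word = ET.SubElement(iu, "word")
        for tag, value in zip(FIELDS, values):
            ET.SubElement(word, tag).text = value
    return iu


element = iu_to_element(1, "tangi", "tangi", "today", "今天")
print(ET.tostring(element, encoding="unicode"))
```

Building the tree through the library, instead of concatenating strings, leaves the escaping of characters such as < and & in glosses to the serializer.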
4 Concluding Remarks
The online version of the NTU corpus of Austronesian languages is still under
construction and more texts are to be added. The Leipzig glossing rules will be
adopted in the near future. As normalization, accessibility and interoperability are
emphasized in the system, it should be useful for linguists, teachers and even native
speakers of Austronesian languages. It is hoped that our work can contribute to the
language communities. The system can be extended to process other languages once
the proper feature vector is set. The implementation is still at an experimental stage.
As Saisiyat and most Formosan languages are on the verge of being endangered, more
people are urged to participate in, use and promote the enlargement of the corpora.
Appendix A. Coding Lists
Table 2. Morphological coding list
English code   Chinese code   Description
1SG            1SG            1st person singular
2SG            2SG            2nd person singular
3SG            3SG            3rd person singular
1IPL.NOM       1IPL.主格      1st person plural, Inclusive, Nominative
1EPL.NOM       1EPL.主格      1st person plural, Exclusive, Nominative
1PL            1PL            1st person plural
2PL            2PL            2nd person plural
3PL            3PL            3rd person plural
ACC            受格           Accusative
AF             主焦           Agent Focus
ASP            動貌           Aspect
AUX            助動詞         Auxiliary
BC             BC             Back Channel / Reactive Token
BF             予焦           Benefactive Focus
CAU            使役           Causative
CLF            量詞           Classifier
CLF.HUM        人量詞         Human Classifier
CLF.NHUM       非人量詞       Non-human Classifier
COM            ?              Comitative
COMP           補語詞         Complementizer
COND           條件詞         Conditional Marker
DAT            予格           Dative
DEF            定指           Definite
DET            限定詞         Determiner
DIST           遠距           Distal
DM             DM             Discourse Marker
EXCL           排除           Exclusive
EXIST          存在           Existential
EXPER          經驗           Experiential
FIL            FIL            Pause Filler
FS             FS             False Start
FUT            未來           Future
GEN            屬格           Genitive
IF             工焦           Instrumental Focus
IMP            祈使           Imperative
INCL           包含           Inclusive
INDF           不定指         Indefinite
INS            工具格         Instrument
INT            感嘆           Interjection
INVIS          不可見         Invisible
IRR            非實現         Irrealis
LF             處焦           Locative Focus
LNK            連詞           Linker
LOC            處格           Locative
NCM            Ncm            Non-common Name Marker
NEG            否定           Negative
NEU            中性格         Neutral
NMZ            名物化         Nominalizer/Nominalization
NOM            主格           Nominative
NRFUT          即將           Near Future
OBL            斜格           Oblique
PF             受焦           Patient Focus
PFV            完成           Perfective
PN             人名/地名      proper name/place name
POSS           所有格         Possessive
PROG           進行           Progressive
PROX           近距           Proximal/Proximate
Q              疑問           Question Marker
QUOT           QUOT           Quotative
REC            交互           Reciprocal
RED            重疊           Reduplication
REL            關係詞         Relativizer
REFL           反身           Reflexive
RF             指焦           Referential Focus
TOP            主題           Topic
VIS            可見           Visible
VOC            呼格           Vocative
X              X              Uncertain Hearing
TU             TU
mazmun         many.HUM       many (humans)
mwaza          many.NHUM      many (animals)
this           這個
that           那個
Table 3. Discourse coding list (adopted from Du Bois (1993))
Meaning                               Marker
Units
  Intonation Unit                     ((newline))
  Truncated IU                        --
  Word                                ((space))
  Truncated word                      -
  Speaker identity / turn start       :
  Speech Overlap                      []
Transitional Continuity
  Final                               .
  Continuing                          ,
  Appeal                              ?
Terminal Pitch Direction
  Fall                                \
  Rise                                /
  Level                               _
Accent and Lengthening
  Primary accent                      ^
  Secondary accent                    `
  High booster                        !
  Low booster                         ;
  Lengthening                         ==
Tone
  Fall                                \
  Rise                                /
  Fall-Rise                           \/
  Rise-fall                           /\
  Level                               _
Pause
  Long                                ...(N)
  Medium                              ...
  Short                               ..
  Latching                            0
Vocal Noises
  Vocal noises                        (CAPITAL LETTERS)
  Inhalation                          (H)
  Exhalation                          (Hx)
  Glottal stop                        %
  Laughter                            @
Quality
  Quality                             <Y Y>
  Laugh quality                       <@ @>
  Quotation quality                   <Q Q>
Phonetics
  Phonetic / phonemic transcription   (/ /)
Transcriber's Perspective
  Researcher's comment                (( ))
  Uncertain hearing                   <X X>
  Indecipherable syllable             X
Specialized Notations
  Duration                            (N)
  IU boundary                         &
  Accent unit boundary                |
  Embedded IU                         <| |>
  Restart                             {Capital Initial}
  False start                         < >
  Code switching                      <L2 L2>
  Nontranscription line               $
Reserved Symbols
  Phonetic / orthographic symbols     '
  Morphosyntactic coding              +*#{}
  User-definable symbols              "~
Appendix B. Database Schema
Table meta: metadata of text
Field name   Format        Description                              Example
article      varchar(80)   Filename                                 pear_imui
topic        varchar(80)   Text name                                Pear Story
texttype     varchar(40)   Text style                               narrative
language     varchar(40)   Text language                            Kavalan
dialect      varchar(40)   Dialect or district                      Xinshe
spknat       varchar(80)   Native name of Informant                 imui
spkhan       varchar(80)   Chinese name of Informant                潘金妹
spkgdr       char(1)       Gender of Informant (M|F)                F
spkage       integer       Age of Informant at time of recording    51
duration     time          Length of the recording                  00:01:15
totaliu      integer       Number of intonation units               31
collected    date          Date of record                           03/5/30
revised      date          Date of last revision                    03/11/11
transcr      blob          Comma separated names of transcribers    A, B, C
dblchk       blob          Names of people who double check         D, E, F
                           the text
Table iu: storage of a single intonation unit
Field name   Format        Description
article      varchar(80)   Text name (foreign key of meta.article)
no           integer       IU #
nat          blob          Native words, space separated
sim          blob          Simplified native words
eng          blob          English gloss, space separated
chn          blob          Chinese gloss, space separated
Table para: translation of a block of intonation units
Field name   Format        Description
article      varchar(80)   Text name (foreign key of meta.article)
von          integer       IU# where the block begins
bis          integer       IU# where the block ends
eng          blob          English translation
chn          blob          Chinese translation
note         blob          Elicitation notes (multi-line, separated by two semicolons)
Table dict: the dictionary
Field name   Format        Description
word         varchar(80)   Word of index (simplified word)
lemma        blob          Word forms, comma separated
eng          blob          English gloss with morphological marks
chn          blob          Chinese gloss with morphological marks
note         blob          Notes (multi-line, separated by two semicolons)
ex           blob          Example of how the word is used
Table lemma: analysis of a lemma
Field name   Format         Description
word         varchar(80)    Lemma of a word (foreign key to an element of dict.lemma)
lemma        varchar(80)    Prefix-#stem#-suffix
enmorph      varchar(255)   Morphological marks in English, comma separated
zhmorph      varchar(255)   Morphological marks in Chinese, comma separated
enstem       varchar(255)   Sense of the stem in English, comma separated
zhstem       varchar(255)   Sense of the stem in Chinese, comma separated
Table affix: dictionary of affixes
Field name   Format         Description
affix        varchar(20)    Affix (prefix-, -infix-, -suffix)
englos       varchar(255)   Morphological analysis in English
zhglos       varchar(255)   Morphological analysis in Chinese
note         blob           Notes
Table xref: cross-reference of a word
Field name   Format        Description
word         varchar(80)   Word (simplified)
xref         blob          Article:IU#.number_of_word
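A cross-reference index in the Article:IU#.number_of_word form can be derived from the simplified field of Table iu. The sketch below is only one possible reading of that format: counting words from 1, skipping the period placeholder, and the function names build_xref and locate are assumptions, as is the IU number 7 in the example row.

```python
from collections import defaultdict


def build_xref(iu_rows):
    """Map each simplified word to its 'article:IU#.word#' references."""
    index = defaultdict(list)
    for article, no, sim in iu_rows:
        # Assumption: word positions are counted from 1 within the IU.
        for position, word in enumerate(sim.split(" "), start=1):
            if word == ".":   # placeholder for a pause or a missing item
                continue
            index[word].append(f"{article}:{no}.{position}")
    return dict(index)


def locate(reference):
    """Turn 'article:IU#.word#' back into (article, IU number, word number)."""
    article, _, rest = reference.rpartition(":")
    no, _, position = rest.partition(".")
    return article, int(no), int(position)


# The IU number 7 is a placeholder; the simplified text is from Section 3.2.
rows = [("pear3", 7, ". ima homangaw kasnaitol ray kahoy babaw")]
index = build_xref(rows)
```

With such an index, a dictionary entry can lead back to every IU in which the word occurs, in the manner of a KWIC lookup.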
References
Chafe, Wallace L., ed. 1980. The pear stories: Cognitive, cultural, and linguistic
aspects of narrative production. Norwood, NJ: Ablex Publishing Corp.
Du Bois, J. W. 1993. Outline of discourse transcription. In Talking data:
Transcription and coding in discourse research, 45-89. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Huang, Shuan-Fan, Lily I-wen Su, and Li-May Sung. 2003. Syntax and cognition in
SaiSiyat. NSC 93-2411-H-022-094.
Lin, Zhemin. 2005. Automatic processing of languages with small-scaled corpus:
Part-of-speech tagging and partial parsing Saisiyat and applications. Master's
thesis, National Taiwan University.
Luhn, H. P. 1960. Keyword-in-context index for technical literature (KWIC index).
American Documentation 11:288-295.
Mayer, Mercer. 1980. Frog, where are you? NY: Dial Books.
Tao, Hongyin. 1996. Units in Mandarin conversation: Prosody, discourse and
grammar. Amsterdam: John Benjamins.
Zeitoun, Elizabeth, Ching-hua Yu, and Cui-xia Weng. 2003. The Formosan language
archive: Development of a multimedia tool to salvage the languages and oral
traditions of the indigenous tribes of Taiwan. Oceanic Linguistics 42(1):218-232.
Zeitoun, Elizabeth, and Ching-Hua Yu. 2005. The Formosan language archive:
Linguistic analysis and language processing. Computational Linguistics and
Chinese Language Processing 10(2):167-200.