Design of a Multimedia Corpus of Austronesian Linguistics Zhemin Lin, Li-May Sung, I-wen Su Graduate Institute of Linguistics of the National Taiwan University Abstract In this paper, the design of an integrated platform of multimedia online corpora aiming to serve both linguists and the public is introduced along with database schema and programming details. Compared with the Formosan language archive of Academia Sinica, our design emphasizes more in terms of normalization, accessibility and interoperability of the system. The design of an automatically generated dictionary with cross-references and the capability of searching the entire database in various ways are also described here. 1 Introduction The development of natural language processing techniques and dynamic web pages has generated wide interest in the construction of an integrated platform which enables people to submit, to browse and to search among collected texts in corpora. However, most online corpora are specially built for experts; they are sentence-based and do not provide multimedia contents. The NTU corpus of Austronesian languages1 introduced in this paper is an attempt to construct a multi-lingual online corpus with multimedia contents meeting the needs of both linguists and the public. In the following sections, we will take a brief review of previous works and then focus on the features of our current work. 1 http://corpus.linguistics.ntu.edu.tw 106 Zhemin Lin, Li-May Sung, I-wen Su 2 Formosan Language Archive of Academia Sinica Zeitoun et al. (2003) discussed some of the problems in the conservation of Formosan Austronesian languages. The continuous enhancement of their work with many newly designed tools is further described in Zeitoun and Yu (2005). As discussed in the two articles, fieldwork data are rarely shared in the linguistic community. Collected materials are sometimes inaccessible even in the office where they are stored, due to the change of storage media or data damage. One of the most serious problems is that, although there are elicitated sentences and recordings, few of them are rearranged and published. As a response to the problems, researchers in the Academia Sinica have built a Formosan language archive, i.e., an online corpora with texts, translations, word glosses and sounds from native speakers of 14 languages and dialects.2 Despite their labour, there are however insufficiencies in their system, one of them being the theoretical issue: the Sinica corpora are sentence-based, where pauses, pause fillers, repetitions, intonation contours, IU boundaries and other discoursal clues are either discarded or missing. A sentence-based corpus excludes important linguistic information only present in discourse. Words in the system are written in an ad hoc mixture style via International Phonetic Alphabet (IPA), in a transcription style that prevents their respective native speakers from using the data directly. Nearly every word is altered to some extent. Example (1) is a Saisiyat example extracted from the Sinica archive. (1) (a) yao noka maʔiiæh ... hayðaʔ ʔæhæʔ maʔiiæh la m-waaiʔ, yao minaŋaʔŋaʔ nak hini mina-ʃaaəŋ. (b) ʔinʔalay hikor may nak hini yakin, ʃβət yakin ho. (c) ʔok-ik ʃəβət, m-waaiʔ nak hini pa-paʃœʃ, yao h<œm>ʃœʃ atomalan. (Extracted from 05.002a -- 05.002c of “5.我的故事” of the Sinica archive.) There is so far no dictionary available with cross-referencing function in the 2 http://formosan.sinica.edu.tw/formosan/ch/select_corpus.htm Design of a Multimedia Corpus of Austronesian Linguistics 107 Sinica corpus, even though cross-referencing for an online corpus is essential for researchers deal with elicited or authentic data. Like KWIC (Keyword-in-context, cf. Luhn (1960)), a user can trace a word back to the context where it occurs, and browse its surrounding IUs. on Microsoft Access. query language. Zeitoun et al (2003) has planed a data schema that ran Their design, however, cannot take advantage of the SQL92 Moreover, they designed an XML dialect to improve the interoperability, which does not encourage researchers to share their collected data in a convenient way. The Sinica archive, though primitive in design, is the first attempt to provide public access to the nearly extinct linguistic data, which is an effort highly respectable by itself. 3 NTU Corpus of Austronesian Languages The system designed in this paper is based on the NTU corpus of Austronesian languages. The NTU corpus, first described in Huang, Su, and Sung (2003), is composed of spoken texts in various languages. Currently NTU Saisiyat corpus contains 22 texts, 3081 intonation units (IUs) and approximately 10635 words, whose transcription follows the conventions of Du Bois (1993). There are one conversation, eight narratives of indigenous legends, thirteen elicited narratives based on “Pear Stories” (5 narratives based on a six-minute color mute film made by Wallace Chafe, see Chafe (1980)) and “Frog Stories” (8 narratives from a sketch book by Mayer (1980)). An example of an original data segment follows: (2) 9. ... (1.7) m-wa:i' 'aehae' ka AF-come one 10. ...(1.1) ma'iaeh ima h<oem>oehoe' person ASP 11. ... may kabih ka <AF>pull hiza pass.by[AF] 12. ...(1.9) ilahiza NOM there siri' ACC goat 108 Zhemin Lin, Li-May Sung, I-wen Su move.to.that.place side “(The man pulling a goat) passed by this way and went that way.” (Pear 3:912) Spoken corpus, in contrast to written corpus, is composed of utterances shorter or equal to sentences, which are transcribed according to certain criteria, such as turntaking, pause, and ruptures in intonation contours of monologue (Tao 1996:35). Fig. 1. shows a unified intonation contours in a praat3 window. Fig. 1. A unified intonation contour When a corpus is transcribed, tagged and analyzed, one needs to look for a means to make it accessible to the public. An integrated platform to store, to represent and to lower the technological boundaries for further use of the collected data is thus necessary. With the insufficiencies of the Sinica archive in mind, normalization, accessibility and interoperability are emphasized in the design of our system. The following guidelines are thus proposed. (3) Guidelines of the integrated platform 3 praat is a programmable phonetic analyser written by Paul Boersma and David Weenink, Institute of Phonetic Sciences, University of Amsterdam. It is licensed in GNU Public License, with the courtesy of their outrageous work and the free software. Cf. http://www.fon.hum.uva.nl/praat/ Design of a Multimedia Corpus of Austronesian Linguistics 109 (a) Easy to customize for most Austronesian languages (b) Standardized procedures of transcription, annotation and process (c) Automatic extraction of morphosyntactic information to reduce repetition of human labor (d) Web-based, unified input/output interface (e) Searchable corpus that fit the needs of both linguists and the public (f) Multimedia representation of collected texts (g) Interoperable with other systems (h) Cross-platform, operating system independent Below is a description of the input, processing and output of our system design. 3.1 Standardization of text commitment and standards of committed texts The standardization comprises the procedure of handling transcribed texts, the transcription itself and morphosyntactic and discoursal codes used in the transcription. The procedure to handle collected texts is designed with low coupling in order to reduce complexity. Therefore, the dependence in human manipulation in the system is almost uni-directional, as can be seen in Figure 2. collected, some worker transcribes it. Whenever a spoken text is Once the transcription is complete, it is given to the database maintainer for processing and storage. The web interface shows the corpus in the database, so that people on the other end of Internet can browse and search the corpus. 110 Zhemin Lin, Li-May Sung, I-wen Su Fig. 2. Use cases of the system Design of a Multimedia Corpus of Austronesian Linguistics 111 The transcription follows Du Bois (1993), a de facto standard in the linguistic community. Word glosses and annotations follow a standardized coding list inherited from conventional mark-ups (cf. Appendix A) and the Leipzig glossing rules4. A standard operation is also set for the database maintainer to handle fieldwork collections as shown in Figure 3. Fig.3. Standard operation of text commitment The corpus in our system is stored in Unicode (UTF-8 encoding) for potential need of IPA, Japanese, and annotations in other languages. If some of the tribes decide to adopt non-ASCII letters, such as “ ɖ ʈ ɼ ɫ ʔ ”, into their writing systems, the programs can process them correctly with no need of modification. As Unicode BOM (byte-order-mark, U+FEFF) is appended in the beginning of the file in Microsoft Windows and is absent in Unix-based systems, the mark may cause a potential problem in reading files edited in different operating systems. It is properly dealt with in order to fulfil the criteria of platform-independence. A set of metadata is defined in the head of committed files. An example of the head of a committed text is given in (4) and the description of the fields is shown 4 http://www.eva.mpg.de/lingua/files/morpheme.html 112 Zhemin Lin, Li-May Sung, I-wen Su in Table 1. (4) Topic: Pear story Type: Narrative Language: Kavalan Dialect: Xinshe Speaker: Imui, 潘金妹, F,1952 Time: 00:01:15 Total IUs: 31 Collected: 2003-05-30 Revised: 2003-11-11 Transcribed by: 葉俞廷, 王以勤 Double checked: 鍾曉芳,沈嘉琪 ,葉俞廷 Table 1. Field name Metadata of committed data Description Format Topic Topic of text String (e.g., Pear Story) Type Style of text Narrative|Conversation|... Language Language of text String, first letter in capital Dialect Dialect or district String Design of a Multimedia Corpus of Austronesian Linguistics Field name Description 113 Format Speaker Base data of the informant Native/Chinese name, Gender, Age Time Length of recording hh:mm:ss Total IUs Number of IUs in text Numeric Collected Date of recording yyyy-mm-dd Revised Date of latest revision yyyy-mm-dd Transcribed by Transcribers and annotaters Comma separated string Double checked Inspectors of text Comma separated string The text following the metadata is described below. (5) 5. [IU #, with a period in the end] .. qay- .. qay-byabas 'nay ,_ [words separated by spaces] QAY-guava that [English gloss separated by spaces] 那 QAY-芭樂 6. ... razat 'nay person 人 nani.\ that 那 DM DM [Chinese gloss separated by spaces] 114 Zhemin Lin, Li-May Sung, I-wen Su #e That person picked guavas. Then, #c 那個人採芭樂。然後, #n Elicitaion notes #n (More elicitation notes) Lines beginning with a sharp (#) are processor instructions (PI). “#e” indicates a line of English translation of a paragraph composed of the IUs from the last translation to the current one. elicitation notes. “#c” marks a Chinese translation, and “#n” is It is possible to have more than one note. native words and glosses is automatically done. The alignment of Morpheme boundaries, morphological information and word senses are extracted using the techniques introduced in Lin (2005: Chapters 2 and 4.2). As the transcription is supposed to more or less reflect actual pronunciation of an informant, spelling may vary slightly from word to word. For the system not to be confused by these variations, a feature vector is configured for each formosan language. A vector describes how to reduce the variants into a simpler form. For example, the pronunciation of a and ae is quite similar in Saisiyat, and glottal-stops are sometimes omitted. 'aehae' “one” is usually spelled 'ahae or aehae. Below are feature vectors of Saisiyat and Kavalan. 5 Saisiyat: ae → a, oe → o, Kavalan: th → l, S → s, '→∅ d → l, '→∅ A string substitution is executed before any operation in the database in order to prevent possible duplicated entries; otherwise full-text search may fail to work. 3.2 Database design Database design affects the efficiency in search and storage. For simplifying 5 Kavalan is an Austronesian language spoken in Hualien County, east Taiwan. Design of a Multimedia Corpus of Austronesian Linguistics 115 programming logic and high-speed query, we proposed a schema that differs from the Sinica archive. Every relational database engine that follows the SQL92 standard can be used in the implementation of the schema. SQLite 6 , among relational database systems, is recommended for the following reasons: 1. It is light-weight, fast and platform independent. 2. A database is stored in a single file, thus is easy to maintain. 3. It supports UTF-8 encoding. 4. It is a free software. One formosan language is placed in one database and is thus stored in a single file. The schema of every language should be the same, therefore cross-linguistic search can be executed in a single page. normalized to the third level. 7 efficiency. It is often argued that a database has to be To be realistic, our system is designed for the sake of The relational diagram of tables in the database is shown in Figure 4. A full list of database schema is given in Appendix B. 6 http://www.sqlite.org 7 There is a good tutorial about database normalization at http://dev.mysql.com/techresources/articles/intro-to-normalization.html 116 Zhemin Lin, Li-May Sung, I-wen Su Fig.4. Relational diagram of tables in the database The text is mainly stored in Table “iu”. In contrast to the word-based design in the Sinica archive, every intonation unit is stored in one row. For example, article : pear3 nat : ...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw sim : . ima homangaw kasnaitol ray kahoy babaw eng : . Asp set_a_ladder-AF move_up-AF Loc tree above For a full-text search, a simple query of “%keyword%” to every field listed above returns the correct results. among spelling variants. The simplified spelling is stored for searching Words in the database are separated by a single space, so that they are easily processed in programs by a single function (explode () in PHP and split () in Python). Places where no gloss is available are occupied by a period (“.”); Design of a Multimedia Corpus of Austronesian Linguistics 117 thus, words and glosses are always aligned across the fields. Another specialized data structure is designed in Table “lemma”. In order to properly search an affix, the stem is marked for every word in the dictionary. morpheme before the stem is a prefix and the one after it a suffix. The For example, Saisiyat kapapama'an 'bicycle' is stored as ka-#papama'#-an in the table. If one looks for a prefix ka- or a suffix -an, one can always obtain the right answer by taking the elements before the first sharp (#) or after the second sharp. Since infixation is simple in the two languages, it is currently analyzed on the fly by external programs. 8 3.3 Back-end programs and the POS-tagger Database maintainer commits a pre-processed transcription into the database through a batch of back-end programs. Commitment is preferably done in the command- line, so that mismatches in alignment or failure of automated morphological analysis may be corrected immediately and interactively. prove the workability of the system. A prototype is implemented to Here is a list of programs. features.py defines language-specific feature-vectors and provides connection DSN. simplify.py is the common library for reducing spelling variants. canon.py checks input validity, including metadata and text format. It writes the data into the database when the check passes. extractmorph.py defines morphological and discoursal codes and extracts them from the texts. makedict.py extracts information from imported texts and updates the dictionary. 8 After the corpus complies with the Leipzig glossing rules, infixation will be marked by < and >. 118 Zhemin Lin, Li-May Sung, I-wen Su mp3splt.py/mpgsplt.py splits .mp3 / .mpg files according to the time-file (see below). tidy.py utility to convert Chinese punctuation into ASCII and remove unnecessary Microsoft Word mark-ups. The coupling of the modules is fairly low. “features.py” and “simplify.py” provide the necessary functions for all programs. As texts have been put into the database, they are tagged by a TBL tagger (cf. Lin 2005: Chapter 2), and the dictionary is updated at the same time. When a user looks up a word, the part-of-speech information can be obtained along with its frequency in the corpus. Any time the database maintainer finds an error in the tagged corpus, it can be corrected on-line as an immediate feedback to the tagger. The tagger can later be retrained by a single click. 3.4 Unified output interface For the corpus to be accessible to the public, a unified user-friendly interface is built. The system follows HTML 4.01 (loose) proposed by the World-Wide Web Consortium9 and is designed to be browsed with a browser, because this is one of the major means to access data from the Internet. For a dynamic and interactive representation, the Document Object Model (DOM) 10 and JavaScript 1.2 are preferably used. Popular browsers, such as Internet Explorer 5.0, Mozilla 1.7, Firefox 0.9 and Opera 4, are compliant to these standards. major browsers for the purpose of accessibility. web site under construction. 9 http://www.w3.org/TR/REC-html40/ 10 http://www.w3.org/DOM/ It is important to support Figure 5 is a screen dump of the Design of a Multimedia Corpus of Austronesian Linguistics Fig.5. 119 Screen dump of the web site (under construction) The interface is composed of the following parts: a window with the informant's photo where movie clips are played, a list of metadata in the upper-left corner, several switches to adjust browsing effects and a frame in the bottom of the screen to dump the selected article in a format following linguistic convention. A dictionary is popped-up anytime when a user clicks on an unknown word (see Figure 6). The pages are being revised for a better visual effect. Ethnological notes and examples are preferably given in the dictionary with crossreference. The design for an interface for searching is simple, yet complicated and special linguistic needs are still possible. For example, by typing tabatathan a user can find the occurrences of the Kavalan word ta-batad-an; typing 'ahae or aehae results in 'aehae' for Saisiyat, and so on. are also kept for further improvement. Interfaces to user-defined functions (UDF) 120 Zhemin Lin, Li-May Sung, I-wen Su Fig. 6. Pop-up dictionary with cross-references As the bandwidth is quite limited, it is suggested that multimedia data are stored and transferred in the formats of 16Kbps 11kHz MPEG-1 layer 3 for audio data and MPEG-1 for video data. 3.5 Interoperability It is important to share the corpus with the linguistic community. The Extensible Mark-up Language (XML)11 is a simple and flexible language used to exchange data between different systems. It is now a de facto standard on the web. For researchers of natural language processing to easily profit from our collected data, the corpus should be able to be exported in XML. Morphological information, gloss and part-of-speech of every word may be output in a uniform manner. format is given below. 11 http://www.w3.org/XML/ An exported Design of a Multimedia Corpus of Austronesian Linguistics <?xml version="1.0" encoding="utf-8" ?> <article id="pear_imui"> <topic>Pear Story</topic> <language>Kavalan</language> <dialect>Xinshe</dialect> <speaker> <natname>imui</natname> <chnname>潘金妹</chnname> <gender>F</gender> <age-of-record>51</age-of-record> </speaker> <duration>00:01:15</duration> <total-iu>31</total-iu> <collected>2003-05-30</collected> <revised>2003-11-11</revised> <transcriber>葉俞廷</transcriber> <transcriber>王以勤</transcriber> <doublecheck>鍾曉芳</doublecheck> <doublecheck>沈嘉琪</doublecheck> <doublecheck>葉俞廷</doublecheck> <text> <iu id="iu_1"> <word> <nat>tangi</nat> <sim>tangi</sim> <eng>today</eng> <chn>今天</chn> <pos>RB</pos> </word> <word> 121 122 Zhemin Lin, Li-May Sung, I-wen Su ... </word> </iu> <iu id="iu_2"> ... </iu> ... <para von="1" bis="4"> <eng>I just saw a person there ...</eng> <chn>我剛剛看到 ...</chn> <notes>Some elicitation notes</notes> </para> ... </text> </article> 4 Conclusive Remarks The online version of NTU corpus of Austronesian languages is still under construction and more texts are to be added. The adaptation of the Leipzig glossing rules will be adopted in the near future. As normalization, accessibility and interoperability are emphasized for the system, it should be useful and helpful for linguists, teachers and even native speakers of Austronesian languages. that our work could contribute to the language communities. It is hoped The system is extendible for the processing of other languages once the proper feature vector is set. The implementation is still on its experimental stage. As Saisiyat and most Formosan languages are on the verge of being endangered, more people are urged to participate, to use and to promote the enlargement of the corpora. Coding Lists Table 2. Morphological coding list Appendix A. Design of a Multimedia Corpus of Austronesian Linguistics English code Chinese code Description 1SG 1SG 1st person singular 2SG 2SG 2nd person singular 3SG 3SG 3rd person singular 1IPL.NOM 1IPL.主格 1st person plural, Inclusive, Nominative 1EPL.NOM 1EPL.主格 1st person plural, Exclusive, Nominative 1PL 1PL 1st person plural 2PL 2PL 2nd person plural 3PL 3PL 3rd person plural ACC 受格 Accusative AF 主焦 Agent Focus ASP 動貌 Aspect AUX 助動詞 Auxiliary BC BC Back Channel / Reactive Token BF 予焦 Benefactive Focus CAU 使役 Causative CLF 量詞 Classifier CLF.HUM 人量詞 Human Classifier CLF.NHUM 非人量詞 Non-human Classifier COM ? Comitative COMP 補語詞 Complementizer COND 條件詞 Conditional Marker DAT 予格 Dative DEF 定指 Definite DET 限定詞 Determiner DIST 遠距 Distal DM DM Discourse Marker EXCL 排除 Exclusive 123 124 Zhemin Lin, Li-May Sung, I-wen Su EXIST 存在 Existential EXPER 經驗 Experiential FIL FIL Pause Filler FS FS False Start FUT 未來 Future GEN 屬格 Genitive IF 工焦 Instrumental Focus IMP 祈使 Imperative INCL 包含 Inclusive INDF 不定指 Indefinite INS 工具格 Instrument INT 感嘆 Interjection INVIS 不可見 Invisible IRR 非實現 Irrealis LF 處焦 Locative Focus LNK 連詞 Linker LOC 處格 Locative NCM Ncm Non-common Name Marker NEG 否定 Negative NEU 中性格 Neutral NMZ 名物化 Nominalizer/Nominalization NOM 主格 Nominative NRFUT 即將 Near Future OBL 斜格 Oblique PF 受焦 Patient Focus PFV 完成 Perfective PN 人名/地名 proper name/place name POSS 所有格 Possessive PROG 進行 Progressive Design of a Multimedia Corpus of Austronesian Linguistics PROX 近距 Proximal/Proximate Q 疑問 Question Marker QUOT QUOT Quotative REC 交互 Reciprocal RED 重疊 Reduplication REL 關係詞 Relativizer REFL 反身 Reflexive RF 指焦 Referential Focus TOP 主題 Topic VIS 可見 Visible VOC 呼格 Vocative X X Uncertain Hearing TU TU mazmun many.HUM many (humans) mwaza many.NHUM many (animals) this 這個 that 那個 Table 3. Discourse coding list (adopted from Du Bois (1993)) Meaning Marker Units Intonation Unit ((newline)) Truncated IU -- Word ((space)) Truncated word - Speaker identity / turn start : Speech Overlap [] 125 126 Zhemin Lin, Li-May Sung, I-wen Su Meaning Marker Transitional Continuity Final . Continuing , Appeal ? Terminal Pitch Direction Fall \ Rise / Level _ Accent and Lengthening Primary accent ^ Secondary accent ` High booster ! Low booster ; Lengthening == Tone Fall \ Rise / Fall-Rise \/ Rise-fall /\ Level _ Pause Long ...(N) Design of a Multimedia Corpus of Austronesian Linguistics Meaning Marker Medium ... Short .. Latching 0 Vocal Noises Vocal noises (CAPITAL LETTERS) Inhalation (H) Exhalation (Hx) Glottal stop % Laughter @ Quality Quality <Y Y> Laugh quality <@ @> Quotation quality <Q Q> Phonetics Phonetic / phonemic transcription (/ /) Transcriber's Perspective Researcher's comment (( Uncertain hearing <X X> Indecipherable syllable X Specialized Notations Duration (N) IU boundary & )) 127 128 Zhemin Lin, Li-May Sung, I-wen Su Meaning Marker Accent unit boundary | Embedded IU <| Restart {Capital Initial} False start < Code switching <L2 Nontranscription line $ |> > L2> Reserved Symbols Phonetic / orthographic symbols ' Morphosyntactic coding +*#{} User-definable symbols Appendix B. "~ Database Schema Table meta: metadata of text Field name Format Description Example article varchar(80) Filename pear_imui topic varchar(80) Text name Pear Story texttype varchar(40) Text style narrative language varchar(40) Text language Kavalan dialect varchar(40) Dialect or district Xinshe spknat varchar(80) Native name of Informant imui spkhan varchar(80) Chinese name of Informant 潘金妹 Design of a Multimedia Corpus of Austronesian Linguistics Field name Format Description spkgdr char(1) Gender of Informant (M|F) spkage integer Age of Informant in time of recording 51 duration time Length of the recording 00:01:15 totaliu integer Number of intonation units 31 collected date Date of record 05/5/30 revised date Date of last revision 03/11/11 transcr blob Comma Example separated names F of A, B, C transcribers dblchk blob 129 Names of people who double check D, E, F the text Table iu: storage of a single intonation unit Field name Format Description article varchar(80) Text name (foreign key of meta.article) no integer IU # nat blob Native words, space separated sim blob Simplified native words eng blob English gloss, space separated chn blob Chinese gloss, space separated Table para: translation of a block of intonation units Field name Format Description article varchar(80) Text name (foreign key of meta.article) 130 Zhemin Lin, Li-May Sung, I-wen Su Field name Format Description von integer IU# where the block begins bis integer IU# where the block ends eng blob English translation chn blob Chinese translation note blob Elicitation notes (multi-line, separated by two semicolons) Table dict: the dictionary Field name Format Description word varchar(80) Word of index (simplified word) lemma blob Word forms, comma separated eng blob English gloss with morphological marks chn blob Chinese gloss with morphological marks note blob Notes (multi-line, separated by two semi-colons) ex blob Example of how the word is used Table lemma: analyse of a lemma Field name Format Description word varchar(80) Lemma of a word (foreign key to an element of dict.lemma) lemma varchar(80) Prefix-#stem#-suffix enmorph varchar(255) Morphological marks in English, comma separated zhmorph varchar(255) Morphological marks in Chinese, comma separated Design of a Multimedia Corpus of Austronesian Linguistics Field name Format Description enstem varchar(255) Sense of the stem in English, comma separated zhstem varchar(255) Sense of the stem in Chinese, comma separated 131 Table affix: dictionary of affixes Field name Format Description affix varchar(20) Affix (prefix-, -infix-, -suffix) englos varchar(255) Morphological analyse in English zhglos varchar(255) Morphological analyse in Chinese note blob Notes Table xref: cross-reference of a word Field name Format Description word varchar(80) Word (simplified) xref blob Article:IU#.number_of_word References Chafe, Wallace L. ed. 1980. The pear stories: Cognitive, cultural, and linguistic aspects of narrative production. Norwood, NJ: Ablex Publishing Corp. Du Bois, J. W. 1993. Talking data: Transcription and coding in discourse research, chapter Outline of discourse transcription, 45-89. NJ: Hillsdale: Lawrence Erlbaum Associates. Huang, Shuan-Fan, Lily I-wen Su, and Li-May Sung. 2003. Syntax and cognition in SaiSiyat. NSC 93-2411-H-022-094. Lin, Zhemin. 2005. Automatic processing of languages with small-scaled corpus: Part-of-speech tagging and partial parsing saisiyat and applications. Master's 132 Zhemin Lin, Li-May Sung, I-wen Su thesis, National Taiwan University. Luhn, H. P. 1960. Keyword-in-context index for technical literature (kwic index). American Documentation 11:288-295. Mayer, Mercer. 1980. Frog, where are you?. NY: Dial Books. Tao, Hongyin. 1996. Units in mandarin conversation: Prosody, discourse and grammar. Amsterdam: John Benjamins. Zeitoun, Elizabeth, Ching hua Yu, and Cui xia Weng. 2003. The formosan language archive: Development of a multimedia tool to salvage the languages and oral traditions of the indigenous tribes of taiwan. Oceanic Linguistics 42(1):218232. Zeitoun, Elizabeth, and Ching-Hua Yu. 2005. The formosan language archive: Linguistic analysis and language processing. Computational Linguistics and Chinese Language Processing 10(2):167-200