
Title
A sense discrimination engine for Chinese
Keywords
Corpus linguistics; lexicography; Chinese; corpus query tools; collocation; salience.
Abstract
Analysis of text and spoken language, for the purposes of second language teaching,
dictionary making and other linguistic applications, used to be based on the intuitions
of linguists and lexicographers. Today’s more scientific approach generally involves
the use of linguistic corpora: large databases of spoken or written language samples.
Numerous large corpora have been assembled for English, including the British
National Corpus and the Bank of English. Dictionaries published by the Longman
Group are based on the 100 million word BNC, and corpora are routinely used by
computational linguists in tasks such as machine translation and speech recognition.
The number of Chinese corpora available is more limited. Academia Sinica and
Lancaster University offer balanced corpora of Chinese (balanced, because the
contents are drawn from multiple sources, and are not restricted to one genre or style).
A much larger Chinese corpus, the LDC’s Gigaword, is composed exclusively of
newswire texts.
Central to corpus analysis is the context in which a word occurs: J R Firth pointed out
that information about meaning can be derived from surrounding words and sentence
patterns. Various tools are available for exploring word context in corpora,
determining for example which words are likely to appear in collocation with which
others. The SARA tool is widely used with the BNC, and the Sinica Corpus user
interface offers some statistical analysis of the corpus contents. Most of these tools,
however, suffer from an important constraint: when considering the context of a word,
an arbitrary number of adjacent words to the left or right is taken into account. Thus,
most tools pay no heed to the grammatical context of a word under investigation.
Furthermore, most query tools consider only part of speech (POS) in the limited
grammatical analysis that they do make.
One corpus query tool which overcomes these limitations is the Sketch Engine. The
analysis exploits relationships such as subject and verb, or verb and object, where
other tools would recognize only that a noun and a verb co-occur.
Occurrences of words are assigned classes according to their grammatical relationship
with other frequently collocating words, and then ranked according to salience (a
formal measure of the significance of the word in a given context).
The proposed research involves extending the Sketch Engine to handle Chinese. First, a
very large corpus of Chinese (probably Gigaword) is segmented and tagged. Because
that corpus is so large, the existing (error-prone) tagging programs for Chinese would
have to be refined and improved. Secondly, a set of grammatical relations appropriate
for Chinese would need to be drawn up.
The proposed research will be of considerable benefit to Chinese lexicography: it will
be possible to distinguish the different senses of a word at a glance, obviating the need
to search through hundreds of lines of corpus output. It also has application in
Chinese language learning, helping the student to distinguish between homonyms and
make use of synonyms.
Summary of results
The PI is no newcomer to the field of Corpus Linguistics, as the research projects he
carried out at both masters and doctoral level involved the use of large corpora. His
MSc dissertation (Smith 1999) presents an algorithm which assigns lexicalization
scores to Chinese verb-object compounds (VOCs), taking into account contextual
information as well as compound-specific features. The dissertation opens with a
review of approaches to wordhood, with special reference to Chinese. A typology of
compound types is offered, and some of the wordhood criteria that have been
suggested in both the Chinese and the general linguistics literature are surveyed. Four
key criteria (boundedness, translatability, idiomaticity and referentiality) are identified;
a synthesis of these is then coded into the lexicalization algorithm. The Academia
Sinica balanced corpus is used to create a VOC database, to which reference is made
in determining lexicalization scores for VOCs encountered in unseen, user-supplied
utterances. A limited evaluation is conducted, using a second corpus. The program
itself aims to be as user-friendly as possible. It is written in LPA Prolog, and takes up
a number of challenges associated with Chinese text processing, providing a
convenient interface for users to key Chinese characters at the LPA Prolog prompt.
The PI’s doctoral thesis presents a data-driven approach to the classification of
utterances, using a novel combination of existing algorithmic approaches. Previous
work had generally classified utterances according to such categories as wh- question,
yes/no question, acknowledgement, response and the like; in general, the audio data
used was specially commissioned and recorded for research purposes. The work
presented in the thesis departs from this tradition, in that the recorded data consists of
genuine interaction between the telephone operator and members of the public, taken
from OASIS, the British Telecom telephone enquiry corpus (BTexaCT 2001).
Moreover, most of the calls recorded can be characterized as queries. The techniques
presented in this thesis attempt to determine, automatically, the class of query, from a
set of six possibilities including "statement of a problem" and "request for action". To
achieve this, a scheme for automatically labelling utterance segments according to
their prosodic features was devised, and this is presented. It is then shown how
labelling patterns encountered in training data can be exploited to classify unseen
utterances. The algorithm is coded in a combination of C and UNIX shell scripts.
Background
Analysis of text and spoken language, for the purposes of second language teaching,
dictionary making and other linguistic applications, used to be based on the intuitions
of linguists and lexicographers. The compilation of dictionaries and thesauri, for
example, required that the compiler read very widely, and record the results of his
efforts (the definitions and different senses of words) on thousands, or millions, of
index cards. Dictionary entries which seemed intuitively similar were placed together
in boxes or piles, according to Speelman (1997), for later analysis. Thus, the
distribution of items among sets preceded the lexical analysis, whereas under a
computer-age model the analysis would come first, guiding the distribution: a
distribution which could be based on masses of data, rather than the intuitions of the
compiler.
Kilgarriff & Tugwell (2002) observe that manual lexicography, with its limited
number of citations per word, emphasizes the unusual, by its nature paying somewhat
less attention to common lexical items and patterns. With the advent of computers and
corpora, full attention can be paid to even the most common words: the more
frequently a word occurs, the more an analyst can say about it, with appropriate
computer tools.
The presently proposed research will adapt and extend one such tool, and make it
compatible with corpora of the Chinese language. Today’s approach to linguistic
analysis generally involves the use of linguistic corpora: large databases of spoken or
written language samples, defined by Crystal (1991) as “A collection of linguistic data,
either written texts or a transcription of recorded speech, which can be used as a
starting-point of linguistic description or as a means of verifying hypotheses about a
language”.
Numerous large corpora have been assembled for English, including the British
National Corpus and the Bank of English. Dictionaries published by the Longman
Group are based on the 100 million word BNC, and corpora are routinely used by
computational linguists in tasks such as machine translation and speech recognition.
The BNC is an example of a balanced corpus, in that it attempts to represent a broad
cross-section of genres and styles, including fiction and non-fiction, books,
periodicals and newspapers, and even essays by students. Transcriptions of spoken
data are also included.
A body called the Linguistic Data Consortium licenses and makes available a variety
of corpora of English and other languages, including the recently released American
National Corpus (Ide & Macleod 2001). Currently, the ANC consists of 10 million
words of American English, but it will eventually be extended to 100 million words.
The LDC also offers spoken corpora, where the data is presented in the form of audio
files, as well as text and spoken corpora from many languages in addition to English,
including Chinese. Current LDC Chinese spoken offerings include Mandarin
Broadcast News, and the Callhome and Callfriend telephone speech corpora. The two
principal Chinese text corpora from this source consist of newswire texts (thus, they
are not balanced corpora): the LDC’s Chinese Treebank is relatively small at 500,000
words, while Chinese Gigaword is by any standards large, at 1.1 billion Chinese
characters.
Academia Sinica and Lancaster University, respectively, offer balanced text corpora
of Taiwan and mainland China Chinese.
Central to corpus analysis is the context in which a word occurs: J R Firth pointed out
that information about meaning can be derived from surrounding words and sentence
patterns: “You shall know a word by the company it keeps”, as he famously stated in
1957. A convenient and straightforward tool for inspecting the context of a given
word in a corpus is the KWIC (keyword in context) concordance, where all lines in
the corpus containing the desired keyword are listed, with the keyword at the centre.
Figure 1 shows such a concordance.
便無情地拖著她,往車站的方向走去。大概是下雨的關係,站牌四周站滿了趕著
想到竟會和其中唯一的女孩成了情侶。大概是她大四最後一年的時候,一起在同一
所羈絆之處甚少,故常有驚人之舉,大概屬率性而為。此等情懷見乎其
,不過妳笑都沒有笑就是。讓我覺得我大概天生不會講笑話,在小鎮唯一的一家
,可以從 1691 年他的一份上諭中看出個大概。那年五月,古北口總兵官蔡元向朝廷
的命題。但是,這一命題是不公平的。大概是八,九年前的某一天,我在翻閱一堆
。我想,擁有如此的氣概和謀略,大概與三晉文明的深厚蘊藏、表裡山河的
很看不慣這些人。具有這種心態的人,大概只有兩種解釋:一是他根本沒有
戶人家,外頭有一個院子擺滿了煤球,大概是做批發的。公寓外只有小陰溝,裡頭
新年當然可以拿紅包。當時的行情大概是几張十塊的鈔票,因為十塊鈔票是
了當時太空探險的熱潮。進入彩色(大概是 73 年)時代後,我最欣賞卡通寶馬
人會放鞭炮呢!當時的兒童讀物不多,大概只有東方出版社專為小孩子們出書,
Figure 1 Excerpt from KWIC concordance of the word 大概 from the Sinica corpus
Figure 1 shows only an excerpt from the full concordance: in a large corpus of, say,
100 million words, a common word such as 大概 would occur hundreds if not
thousands of times. So while KWIC might help a
lexicographer or a student of Chinese to see, for example, that the word in question
often occurs at the beginning of a sentence or a clause, or that it is frequently followed
by the verb 是, any comprehensive analysis taking into account all the occurrences of
the keyword would not be practicable.
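For illustration, the core of a KWIC tool is straightforward to produce
programmatically. The sketch below (in Python; the function name and the toy text
are our own, and real concordancers add keyword alignment, sorting and corpus
indexing on top of this) shows the basic idea.

    import re

    def kwic(text, keyword, width=20):
        """Print every occurrence of keyword with `width` characters of context."""
        for m in re.finditer(re.escape(keyword), text):
            left = text[max(0, m.start() - width): m.start()]
            right = text[m.end(): m.end() + width]
            # alignment is approximate for full-width CJK characters
            print(f"{left!s:>{width}}[{keyword}]{right}")

    kwic("當時的行情大概是几張十塊的鈔票。讓我覺得我大概天生不會講笑話。", "大概")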
Various tools are available for exploring word context in corpora, determining by a
statistical analysis which words are likely to appear in collocation with which others.
Often, the statistic involved is mutual information (MI), first suggested in the
linguistic context by Church and Hanks (1989).
Oakes (1998:63) reported that co-occurrence statistics such as MI “are slowly taking a
central position in corpus linguistics”. MI provides a measure of the degree of
association of a given segment with others. Pointwise MI, calculated as in Equation
(1), is used in lexical processing to measure the degree of association between two
words x and y (a collocation).
(1) I(x; y) = log [ P(x|y) / P(x) ]
Where one constituent of a collocation could scarcely occur other than in the
company of the other (as with “Hong Kong”, perhaps), MI will be positive and
relatively high. Zero MI indicates, in principle, that two items are contiguous by
chance, and that they are independent of each other (although it is quite difficult to
make out a case for independence when word order is clearly constrained by rules of
syntax). A negative MI suggests that the items are relatively common, but in
complementary distribution: ungrammatical sequences such as “the and” would come
into this category.
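As a concrete illustration of Equation (1), pointwise MI can be computed directly
from corpus counts. The following sketch (Python; the toy whitespace-tokenized text
and simple bigram adjacency are our own simplifying assumptions, not the method of
any tool discussed here) makes the sign behaviour described above easy to verify.

    import math
    from collections import Counter

    def pointwise_mi(tokens, x, y):
        """Pointwise MI of the adjacent pair (x, y): log [ P(x|y) / P(x) ]."""
        n = len(tokens)
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        p_xy = bigrams[(x, y)] / (n - 1)
        if p_xy == 0:
            return float("-inf")          # the pair never co-occurs
        # P(x|y) / P(x) is equivalent to P(x,y) / (P(x) * P(y))
        return math.log(p_xy / (p_x * p_y))

    tokens = "he went to the bank of the river near the investment bank".split()
    print(pointwise_mi(tokens, "investment", "bank"))   # high positive MI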
The SARA tool, widely used with the BNC, and the Sinica Corpus user interface both
offer an MI analysis of the corpus contents. Such tools, however, suffer from two
important constraints: first, when considering the context of a word, an arbitrary
number of adjacent words to the left or right is taken into account, ignoring
discontinuous collocations, which occur when other words (in particular function
words like the and of) are found between the collocation components. To illustrate the
problem, imagine that we wish to determine which of two senses of the English word
bank (“the bank of a river”, or “financial institution”) is more common. If the strings
river bank and, say, investment bank are frequent, there might be enough evidence on
which to make a judgment. But such an analysis would ignore Bank of Taiwan and
bank of the river, where the important collocates are not adjacent to the keyword,
even though Taiwan and river stand in the same grammatical relationship to the
keyword as investment and river do in the earlier examples.
The second constraint is that a list of collocates of some keyword could include,
undistinguished, items of any part of speech (POS: noun, verb and so on) and of any
syntactic role (such as subject or object). This sort of grammatical information can
provide useful clues for sense discrimination, which standard corpus analyses are
unable to take advantage of. Consider again the word bank, which has at least two
verbal senses, illustrated by The plane banked sharply and John banked the money.
The first of these is intransitive (it cannot take an object). Thus, if an object is
observed in the sentence featuring the keyword, the chances are that the verb bank is
being used in the second sense.
One corpus query tool which overcomes these limitations is the Sketch Engine,
developed by Adam Kilgarriff and Pavel Rychly, and described by Kilgarriff, Rychly,
Smrz & Tugwell (2004). The description of the Sketch Engine which follows draws
from that source, and from the Sketch Engine website, which is referenced below.
The Sketch Engine is embedded in a corpus query tool called Manatee, and offers a
number of modules. There is a standard concordance tool, whose output is very
similar to that shown at Figure 1. It allows the user to select, as a keyword, either a
lemma (in which case the keyword bank would yield results for all of bank, banks
and banking for example), or a simple word-form match. The user may also specify
the size of the window (the numbers of words to the left and right of the keyword)
that he wishes to view. Word frequency counts are also available, and the user may
define a subcorpus (in the case of the BNC, on which the English version of Sketch
Engine is based, one can choose different parts of the corpus such as fiction or
non-fiction).
The novelty of the Sketch Engine lies in its ability to produce word sketches. The
word sketch for the verb express is shown at Figure 2. It will be seen that occurrences
of express in the corpus are presented according to the grammatical context in which
they occur, along with a frequency count and a salience count (this statistic is based
on mutual information). Thus the most salient collocate of express to act as object is
concern (as in He expressed great concern), while the most salient subject collocate is
(somewhat surprisingly, perhaps) infinitive: the first sentence presented in the
corresponding concordance is actually the event which is experienced (expressed) by
the infinitive.
Figure 2 Word sketch for the lemma express
Clicking on the frequency count for concern yields the concordance shown at Figure
3.
A03  has appealed to Brazilian parliamentarians expressing its concern over the moves to reinstate
A0K  ] falls within the terms of the concern expressed above , then the whole should be submitted
A10  , has written to British Telecom ( BT ) expressing concern about the likely effects on older
A10  Telecommunications ( OFTEL ) . OFTEL had expressed the concern that the level of the deposit
A10  district councils responding to the survey expressed concern with the way in which the Secretary
A10  PSYCHOGERIATRIC SERVICES<p>Similar concern might be expressed for the continuation or development of geriatric
A1E  British Department of Transport officials have expressed concern about the probable restrictions
A2E  protested strongly . There was much concern expressed on Merseyside about the safety aspects of
A2E  we are aware that some concern has been expressed about playing on the Friday night . '</p>
A2J  the Broadwater Farm Youth Association has expressed its concern about the situation to the police
A3T  KEN GILL and others<p>Sir : We write to express our deep concern and profound fears over
A41  night .</p><p>The West German government expressed its concern at the police violence against
A49  outside Yorkshire . Many local people also expressed concern at the number of holiday cottages
A5G  in November .</p><p>Economists yesterday expressed concern that the increase in prices was
Figure 3 Sketch engine concordance for express…concern
It will be observed that the system handles discontinuous collocations of verb and
object such as express our deep concern as well as the canonical expressed concern
where verb and object are adjacent. Other patterns, including the concern expressed
and expressed by the infinitive (a passive form) are appropriately dealt with.
Altogether, the Sketch Engine defines 27 grammatical relations for English. As well
as the subject and object relations, adverbial modifier, and/or, and prepositional
relations may be seen in Figure 2. The grammatical relations are defined using regular
expressions over part-of-speech tags, as shown in (2) for the simplest verb-object
relation.
(2) 1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"
In (2), the 1: and 2: identify the two collocate components. Between the components,
zero or more (denoted by the *) words may appear. If any do appear, they may be
determiners (the or a), numbers, adjectives, adverbs or nouns. Other rules are also
required for the verb-object grammatical relation (for example to capture the passive
form mentioned above).
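To make the mechanism concrete, a rule like (2) can be compiled into an ordinary
regular expression over a string of POS tags. The sketch below is illustrative only
(the tag set, the matcher and the example sentence are our own assumptions; the
Sketch Engine's actual rule compiler is more elaborate).

    import re

    # A token is a (word, tag) pair; the tag names mirror rule (2).
    sentence = [("He", "PRON"), ("expressed", "V"), ("great", "ADJ"), ("concern", "N")]
    tags = " ".join(tag for _, tag in sentence)

    # Rule (2): 1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N", as a regex over the tag string.
    rule = re.compile(r"\b(V)((?: (?:DET|NUM|ADJ|ADV|N))*) (N)\b")

    m = rule.search(tags)
    if m:
        # Convert character offsets in the tag string back to token indices.
        verb = sentence[tags[: m.start(1)].count(" ")][0]
        obj = sentence[tags[: m.start(3)].count(" ")][0]
        print("verb-object:", verb, obj)    # -> verb-object: expressed concern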
Kilgarriff et al. have also ported the Sketch Engine to the Czech language. Because,
unlike English, Czech is a free word order language, this implementation presented
additional challenges. The fact that these problems were overcome, largely through
the provision of a gap() object in the grammatical relations rules, bodes well for
implementation in other languages whose word order is less constrained (such as
Chinese).
Bibliography
Huang, Chu-Ren (ed.) (2003) Meaning and word senses in Chinese, parts I/II/III.
Academia Sinica corpus and CKIP group technical reports 03-01/03-02/03-03.
CKIP group, Academia Sinica Institute of Information Science and Institute of
Linguistics. 1995/1998. Contents and description of the Academia Sinica Chinese
corpus. CKIP technical reports nos. 95-02/98-04.
吳毓傑、陳振南 (2002) Chinese news classification using a clustering approach.
Proceedings of the 8th International Conference on Information Management
Research and Practice, Vol. 1, pp. 619-627.
Agirre, E. and Martinez, D. (2000). Exploring automatic word sense disambiguation
with decision lists and the web. In Proceedings of the Coling 2000 Workshop on
Semantic Annotation and Intelligent Annotation, Centre Universitaire,
Luxembourg.
Bruce, R. and Wiebe, J. (1994). Word-sense disambiguation using decomposable
models. In 32nd Annual Meeting of the Association for Computational
Linguistics (ACL 1994), pages 139-146, Las Cruces.
BTexaCT (2001) OASIS First Utterance corpus, version 2.23 (annotated transcription,
audio files and release notes)
Carroll, J. and McCarthy, D. (2000). Word sense disambiguation using automatically
acquired verbal preferences. Computers and the Humanities, 34(1-2):109-114.
Chen, J. N. (2000) Adaptive Word Sense Disambiguation Using Lexical Knowledge
in Machine-readable Dictionary, Computational Linguistics and Chinese
Language Processing, Vol. 5, No. 2, pp. 1-42.
Chen, Jen-Nan and Sue J. Ker. (2001) Towards a Conceptual Representation of
Lexical Meaning in WordNet. In the 15th Pacific Asia Conference on Language,
Information and Computation, pp. 97-108, Hong Kong, February 1-3
Chodorow, M., Leacock, C., and Miller, G. (2000). A topical/local classifier for word
sense identification. Computers and the Humanities, 34(1-2):115-120.
Choueka, Y. and Lusignan, S. (1985). Disambiguation by short contexts. Computers
and the Humanities, 19:147-158.
Church, K. W. and Hanks, P. (1989) Word association norms, mutual information and
lexicography. In Proc. 27th Annual Meeting of ACL, Vancouver. 1989: 76-83
Clear, J.H. (1993) The British National Corpus. In Paul Delany & G. P. Landow (eds)
The Digital Word: Text-based Computing in the Humanities. Cambridge, Mass.:
MIT Press: 163-187.
Cowie, J., Guthrie, J., and Guthrie, L. (1992). Lexical disambiguation using simulated
annealing. In Proceedings of the 15th International Conference on
Computational Linguistics (Coling 1992), pages 359-365, Nantes.
Crystal, D (1991) A Dictionary of Linguistics and Phonetics. Oxford: Blackwell.
Dagan, I. and Itai, A. (1994). Word sense disambiguation using a second language
monolingual corpus. Computational Linguistics, 20(4):563-596.
Dagan, I., L. Lee, and F. Pereira. (1999) Similarity-based models of word
cooccurrence probabilities. Machine Learning, 34(1)
Dini, L., di Tomaso, V., and Segond, F. (2000). GINGER II: An example-driven word
sense disambiguator. Computers and the Humanities, 34(1-2):121-126.
Edmonds, P. and Kilgarriff, A. (2002). Introduction to the special issue on evaluating
word sense disambiguation systems. Natural Language Engineering, Special
Issue on Word Sense Disambiguation Systems, 8(4):279-291.
Escudero, G., Marquez, L., and Rigau, G. (2000b). A comparison between supervised
learning algorithms for word sense disambiguation. In Proceedings of the 4th
Conference on Computational Natural Language Learning (CoNLL-2000),
pages 31-36, Lisbon.
Fellbaum, C. (ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge,
MA: MIT Press.
Firth, J.R. (1957) A synopsis of linguistic theory, 1930-1955. In Palmer, F.R. (ed)
(1968) Selected papers of J.R. Firth 1952-9. Harlow: Longman
Gale, B., Church, K., and Yarowsky, D. (1992b). A method for disambiguating word
senses in a corpus. Computers and the Humanities, 26:415-439.
Gale, B., Church, K., and Yarowsky, D. (1992d). Work on statistical methods for
word sense disambiguation. In AAAI Fall Symposium on Probabilistic
Approaches to Natural Language, pages 54-60, Cambridge.
Garner, P (1997) On topic identification and dialogue move recognition. In Computer
Speech and Language 11(4): 275-306
Huang, Chu-Ren, I-Ju E. Tseng, Dylan B.S. Tsai and Brian Murphy. 2003.
Cross-lingual Portability of Lexical Semantic Relations: Bootstrapping Chinese
WordNet with English WordNet Relations. Language and Linguistics.
4.3.509-532.
Huang, Chu-Ren, Zhao-ming Gao, Claude C.C. Shen, and Keh-jian Chen. 1998.
Quantitative Criteria for Computational Chinese Lexicography: A Study based
on a Standard Reference Lexicon for Chinese NLP. Proceedings of ROCLING
XI. 87-108.
Huang, Chu-Ren. 1995. Observation, Theory, and Practice: The application of corpus-based
linguistic studies in Chinese language teaching. Presented at the First International
Conference on New Technologies in Chinese Language Teaching. San Francisco.
April 27-30, 1995
Ide, N., Macleod, C. (2001). The American National Corpus: A Standardized
Resource of American English. Proceedings of Corpus Linguistics 2001,
Lancaster UK.
Lin, D. (1998) Using collocation statistics in information extraction. In Proc. of the
Seventh Message Understanding Conference (MUC-7).
Ker, S.J. and Jen-Nan Chen, “Adaptive Word Sense Tagging on Chinese Corpus”, In
Proceedings of 18th Pacific Asia Conference on language, Information and
Computation, Tokyo, Japan, pp. 267-274, Dec 2004.
Kilgarriff, A & Tugwell, D (2002) Sketching Words. In Marie-Hélène Corréard (ed.):
Lexicography and Natural Language Processing. A Festschrift in Honour of
B.T.S. Atkins. Euralex 2002.
Kilgarriff, A, & D Tugwell (2001) WORD SKETCH: Extraction and Display of Significant
Collocations for Lexicography. In Proc. workshop "COLLOCATION: Computational
Extraction, Analysis and Exploitation", pp. 32-38. 39th ACL & 10th EACL, Toulouse,
July 2001.
Kilgarriff, A, Rychly, P, Smrz, P & Tugwell, D (2004). The Sketch Engine, in
Proceedings of EURALEX, Lorient, France, July 2004
Kilgarriff, A. (1997). I don't believe in word senses. Computers and the Humanities,
31:97-113.
Kilgarriff, Adam, and David Tugwell. (2001) "WASP-Bench: an MT Lexicographers'
Workstation Supporting State-of-the-art Lexical Disambiguation". Proc. of MT
Summit VIII, Santiago de Compostela, pp. 187-190.
Leacock, C., Chodorow, M., and Miller, G. (1998). Using corpus statistics and
WordNet relations for sense identification. Computational
Linguistics,24(1):147-165.
Lee, Y. K. and Ng, H. T. (2002). An empirical evaluation of knowledge sources and
learning algorithms for word sense disambiguation. In Proceedings of the 2002
Conference on Empirical Methods in Natural Language Processing
(EMNLP-2002), pages 41-48, Philadelphia.
Lesk, M. (1986). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of
ACM SIGDOC Conference, pages 24-26, Toronto.
McEnery, A., Xiao, Z. & Mo, L. (2003) Aspect marking in English and Chinese:
Using the Lancaster Corpus of Mandarin Chinese for contrastive language study.
Literary and Linguistic Computing 18(4): 361-378.
Miller, G., Richard Beckwith, Christian Fellbaum, Derek Gross, and Katherine Miller.
Introduction to WordNet: An On-line Lexical Database. Cognitive Science
Laboratory, Princeton University, August 1993.
Oakes, M (1998) Statistics for Corpus Linguistics. Edinburgh University Press
Pearce, D. (2001) Synonymy in collocation extraction. In Proc. of the NAACL 2001
Workshop on WordNet and Other Lexical Resources: Applications, Extensions
and Customizations, CMU
Shimohata, S., T. Sugio, and J. Nagata. (1997). Retrieving collocations by
co-occurrences and word order constraints. In Proc. of the 35th Annual Meeting
of the ACL and 8th Conference of the EACL (ACL-EACL'97), pages 476-81,
Madrid, Spain.
Smith, S. (1999) Discontinuous compounds in Mandarin Chinese: A lexicalization
algorithm. Unpublished MSc dissertation, UMIST, Manchester.
Smith, S. (2003) Predicting query types by prosodic analysis. Unpublished PhD
dissertation, University of Birmingham.
Speelman, D. (1997). Abundantia Verborum. A computer tool for carrying out
corpus-based linguistic case studies. Doctoral dissertation Katholieke
Universiteit Leuven.
Tugwell, David and Adam Kilgarriff. (2000) "Harnessing the Lexicographer in the
Quest for Accurate Word Sense Disambiguation" Proc. 3rd Int. Workshop on
Text, Speech, Dialogue (TSD 2000), pp. 9-14. Brno, Czech Republic: Springer
Verlag Lecture Notes in Artificial Intelligence
Vossen P. (ed.). (1998). EuroWordNet: A multilingual database with lexical semantic
networks. Norwell, MA: Kluwer Academic Publishers.
Wilks, Y. and Stevenson, M. (1998). The grammar of sense: Using part-of-speech
tags as a first step in semantic disambiguation. Natural Language Engineering,
4(2):135-144.
Relevant websites
British National Corpus, BNC. http://info.ox.ac.uk/bnc/
Sinica Balanced Corpus http://www.sinica.edu.tw/SinicaCorpus/
Sketch Engine http://www.sketchengine.co.uk
WordNet. http://www.cogsci.princeton.edu/~wn/
Linguistic Data Consortium http://wave.ldc.upenn.edu/
Goals and significance of the work
The purpose of the proposed work is to make available an online word
disambiguation and sense discrimination tool for Chinese, by writing and
implementing a set of grammatical relations for that language. Such a tool would be
of great value to lexicographers, and those involved in determining word senses for
semantic network projects. An example is the Chinese WordNet project, which Huang
Chu-Ren, one of the co-PIs, is currently working on, along with the PI and colleagues
at Academia Sinica’s Institute of Linguistics. At present, sense assignment on such
projects is often based on the linguistic intuitions of researchers, who manually search
corpora and internet resources for instances of candidate senses. While there can be
no substitute for such intuitions, they would be usefully supplemented by the Sketch
Engine, which would provide compact summaries of typical usages for each word.
Wu Yiching of Academia Sinica, with the PI as a co-author, has had a paper accepted
at the 2005 Conference and Workshop on TEFL and Applied Linguistics, to be held at
Ming Chuan University (Wu, Smith & Huang, forthcoming). This paper analyzes and
compares the use in English and Chinese respectively of the apparently equivalent
verbs express and 表示, finding that the two forms exhibit considerable differences.
The Sketch Engine was used for the English part of the analysis, and revealed some
interesting findings: for example, the PI, a native speaker of English, was not aware
that the sentence The justification for this was <expressed> to be that it would thus be
open to a court at a later date to review the matter of sex determination was a
possible sentence of English. It is likely that a Chinese implementation of Sketch
Engine would be equally revealing, and it would certainly be useful in all manner of
linguistic analyses.
For learners of Chinese as a second language, another feature of the Sketch Engine,
the sketch differences module, would be of great assistance. This module allows the
user to see at a glance how the usage of apparent synonyms, or near-synonyms, differs
in practice. The difference between 能, 可以 and 會 is not always easy for learners
to grasp, but the Chinese Sketch Engine would illustrate the differences in usage at a
glance.
Current research status
Before a corpus can be used by the Sketch Engine, it must be segmented into word
tokens (in the case of a language like Chinese which does not indicate word
boundaries by white space) and then tagged for part of speech. The Chinese Gigaword
corpus has now been automatically segmented and tagged by the PI along with Huang
Chu-Ren and colleagues at Academia Sinica. There are, however, tagging and
segmentation errors which need to be tackled, most likely by software modifications,
with which co-PI Chen Jen-Nan will be able to assist. The question of Chinese
character formats also needs attention: the simplified and traditional components of
Gigaword are supplied in Unicode, while the Academia Sinica tagger accepts Big-5. A
one-pass conversion was performed, but such a conversion inevitably generates errors
in the mainland China data, because of the one-to-many mapping between the
simplified and traditional character sets. Ideally, the tagger should be modified to
accept Unicode format, which is after all now an international standard.
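A minimal sketch of the format issue, assuming Python's standard big5 codec (the
file names here are invented): the decode itself is mechanical; it is the
simplified-to-traditional direction that resists a simple table.

    # Big-5 to Unicode is a lossless, table-driven decode:
    with open("tagged_big5.txt", "rb") as f:
        text = f.read().decode("big5")
    with open("tagged_utf8.txt", "w", encoding="utf-8") as f:
        f.write(text)

    # Simplified-to-traditional conversion is NOT a codec problem: one
    # simplified character may map to several traditional ones, e.g.
    one_to_many = {"发": ["發", "髮"], "后": ["後", "后"]}   # illustrative entries
    # so a correct conversion needs lexical context, not a character table.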
In the English and Czech implementations, the Sketch Engine requires a
lemmatization phase, whereby forms such as banking and banks are reduced to the
stem bank. Since morphological affixation is rare in Chinese, this step may well not
be necessary; it will, however, be necessary to consider how the system should handle
morphs such as 們, 了 and 的.
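One naive possibility, offered purely as a sketch of the design question rather than
a proposed solution, is to strip these morphs at the token level so that, for
example, 學生們 and 學生 are counted under one headword:

    SUFFIX_MORPHS = ("們", "了", "的")

    def normalize(token):
        """Strip a single trailing suffix morph, leaving bare morphs intact."""
        # note: this over-strips (e.g. 目的 would become 目); a real system
        # would need a lexicon, and treating the aspect marker 了 as an
        # affix at all is precisely the question to be settled.
        for m in SUFFIX_MORPHS:
            if len(token) > 1 and token.endswith(m):
                return token[: -len(m)]
        return token

    print(normalize("學生們"))   # -> 學生
    print(normalize("去了"))     # -> 去
    print(normalize("了"))       # -> 了 (a bare morph is left alone)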
Research methods
The central task is to make a set of grammatical relations (essentially grammar rules)
for Chinese. The number of grammatical relations should be comparable to the 27
required for English, or the 23 used for Czech. The procedure will be to study large
sections of Gigaword and other corpora, as well as other texts, and determine what
characteristic patterns emerge. Then each grammatical relation will need to be
encoded in the form of regular expressions accepted by the Sketch Engine. Each
relation will probably require several rule formalisms, and will differ from similar
English relations in key respects. The verb-object relation would need to be equipped
to handle fronted objects preceded by 把, for example. The sample rule at (3) covers
such situations.
(3) (把) ("N"的) "(DET|NUM|ADJ|ADV|N)"* 2:"N" 1:"V"
In (3), most of the elements are optional (in brackets, or with an asterisk wildcard). It
validates verb-object sentence fragments with and without 把: 把功課交 (給老師),
for example, 把你的功課交…, and 功課交… would all be identified as verb-object
relations by this rule.
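Under the same illustrative tag-string scheme used earlier (our own assumption, with
把 tagged BA and 的 tagged DE; this is not the Academia Sinica tag set), rule (3)
might be rendered as follows:

    import re

    # Rule (3): optional 把, optional possessor "N 的", optional modifiers,
    # then the fronted object noun (2:) before the verb (1:).
    rule3 = re.compile(r"\b(?:BA )?(?:N DE )?(?:(?:DET|NUM|ADJ|ADV|N) )*(N) (V)\b")

    sentence = [("把", "BA"), ("功課", "N"), ("交", "V")]   # ba gongke jiao
    tags = " ".join(tag for _, tag in sentence)

    m = rule3.search(tags)
    if m:
        obj = sentence[tags[: m.start(1)].count(" ")][0]
        verb = sentence[tags[: m.start(2)].count(" ")][0]
        print("fronted object-verb:", obj, verb)   # -> 功課 交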
It is not known how many grammatical relations (rules) will eventually be defined for
Chinese. The total should be of the same order of magnitude as for English (27
relations); but because of the lack of morphological clues in Chinese, the rules will
certainly need to be more complex and very likely more numerous.
While the grammatical relations are being written, an attempt will be made to load the
Gigaword corpus using the grammatical relations applied to English, and, perhaps,
those for Czech. We expect to see a great improvement in sense discrimination and
organization when the Chinese relations are implemented, but the foreign language
versions will act as a useful benchmark.
Co-PI Chen Jen-Nan, under a separate grant application, intends to write an algorithm
for automatic, adaptive word sense disambiguation in Chinese, following his earlier
publication of such an algorithm for English (Chen 2000). It will be designed so that it
can make use of salient collocation information supplied by the Sketch Engine. Using
this algorithm, we will test the hypothesis that the Sketch Engine benefits from the
Chinese-specific grammatical relations, by conducting word sense disambiguation
experiments using both those relations and the foreign-language benchmark relations.
The PI has a good working relationship with Adam Kilgarriff and Pavel Rychly, the
designers of the Sketch Engine. He will be in frequent contact with them, providing
feedback and progress reports on the Chinese implementation, keeping abreast of
Sketch Engine developments in other languages and resolving problems through
mutual discussion.
Anticipated problems and means of resolution
It is very difficult to predict what problems might be encountered. One possibility is
that it will be hard to formulate grammatical relations for constructions where the
word order in Chinese is fairly free. For example, in 功課還沒有寫完, the object 功
課 precedes the verb. Although Chinese, like English, is a subject-verb-object (SVO)
language, this kind of topicalization is by no means uncommon.
Of course, the difficulty with rule (3) above is that it also admits a subject-verb
relation such as 老師教書 laoshi jiaoshu 'the teacher teaches'. Kilgarriff et al.
(2004) overcame a similar problem in
Czech, which also has relatively free word order, by making appeal to case (a subject
is nominative case); but this grammatical feature does not exist in Chinese.
To resolve such problems, we will proceed as follows. We will examine the corpus to
see what grammatical and lexical features do attend the verb-object and subject-verb
relation, and formulate our rules accordingly. One solution might be to test for an
object after the verb, and if there is none assume that the object has been fronted.
Alternatively, it might be possible to treat 功課 in this sentence as a kind of
pseudo-subject, and experiment to see whether this choice affects results adversely.
Anticipated results
An online resource will in the first instance be made available to the academic
community, enabling researchers to create their own word sketches dynamically.
Those working on word sense discrimination should be able to start using it
immediately, and those involved in pedagogical research, or the teaching of Chinese
to non-native speakers, should find it useful too.
Ultimately, it is hoped that the Sketch Engine could form part of a Chinese CALL
(computer-aided language learning) platform, for the benefit of foreign learners. It
could also be adapted for native Chinese elementary school students, who are
beginning to learn writing skills.
Deliverables will include presentations at domestic conferences, and at least one
international conference, both on the central task – creating a Chinese Sketch
Engine – and on endeavours of linguistic analysis and description which make use of
the Sketch Engine. This should inspire young researchers to test their linguistic
hypotheses, conveniently and readily, on a large corpus of natural Chinese.