Title: A sense discrimination engine for Chinese

Keywords: Corpus linguistics; lexicography; Chinese; corpus query tools; collocation; salience.

Abstract

Analysis of text and spoken language, for the purposes of second language teaching, dictionary making and other linguistic applications, used to be based on the intuitions of linguists and lexicographers. Today's more scientific approach generally involves the use of linguistic corpora: large databases of spoken or written language samples. Numerous large corpora have been assembled for English, including the British National Corpus and the Bank of English. Dictionaries published by the Longman Group are based on the 100 million word BNC, and corpora are routinely used by computational linguists in tasks such as machine translation and speech recognition. The number of Chinese corpora available is more limited. Academia Sinica and Lancaster University offer balanced corpora of Chinese (balanced, because the contents are drawn from multiple sources and are not restricted to one genre or style). A much larger Chinese corpus, the LDC's Gigaword, is composed exclusively of newswire texts.

Central to corpus analysis is the context in which a word occurs: J. R. Firth pointed out that information about meaning can be derived from surrounding words and sentence patterns. Various tools are available for exploring word context in corpora, determining for example which words are likely to appear in collocation with which others. The SARA tool is widely used with the BNC, and the Sinica Corpus user interface offers some statistical analysis of the corpus contents. Most of these tools, however, suffer from an important constraint: when considering the context of a word, an arbitrary number of adjacent words to the left or right is taken into account. Thus, most tools pay no heed to the grammatical context of the word under investigation. Furthermore, most query tools consider only part of speech (POS) in the limited grammatical analysis that they do make.

One corpus query tool which overcomes these limitations is the Sketch Engine. Its analysis exploits relationships such as subject and verb, or verb and object, where other tools would only recognize that a noun and a verb happened to co-occur. Occurrences of words are classified according to their grammatical relationship with other frequently collocating words, and then ranked according to salience (a formal measure of the significance of the word in a given context). The proposed research involves extending the Sketch Engine to handle Chinese. First, a very large corpus of Chinese (probably Gigaword) must be segmented and tagged; because that corpus is so large, the existing (error-prone) tagging programs for Chinese will have to be refined and improved. Second, a set of grammatical relations appropriate for Chinese will need to be drawn up. The proposed research will be of considerable benefit to Chinese lexicography: it will be possible to distinguish the different senses of a word at a glance, obviating the need to search through hundreds of lines of corpus output. It also has applications in Chinese language learning, helping the student to distinguish between homonyms and make use of synonyms.

Background

Analysis of text and spoken language, for the purposes of second language teaching, dictionary making and other linguistic applications, used to be based on the intuitions of linguists and lexicographers.
The compilation of dictionaries and thesauri, for example, required that the compiler read very widely and record the results of his efforts (the definitions and different senses of words) on thousands, or even millions, of index cards. Dictionary entries which seemed intuitively similar were placed together in boxes or piles for later analysis (Speelman, 1997). Thus, the distribution of items among sets preceded the lexical analysis, whereas under a computer-age model the analysis would come first and guide the distribution: a distribution based on masses of data rather than on the intuitions of the compiler. Kilgarriff & Tugwell (2002) observe that manual lexicography, with its limited number of citations per word, emphasizes the unusual, by its nature paying somewhat less attention to common lexical items and patterns. With the advent of computers and corpora, full attention can be paid to even the most common words: the more frequently a word occurs, the more an analyst can say about it, given appropriate computer tools. The research proposed here will adapt and extend one such tool, and make it compatible with corpora of the Chinese language.

Today's approach to linguistic analysis generally involves the use of linguistic corpora: large databases of spoken or written language samples, defined by Crystal (1991) as "A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language". Numerous large corpora have been assembled for English, including the British National Corpus and the Bank of English. Dictionaries published by the Longman Group are based on the 100 million word BNC, and corpora are routinely used by computational linguists in tasks such as machine translation and speech recognition. The BNC is an example of a balanced corpus, in that it attempts to represent a broad cross-section of genres and styles, including fiction and non-fiction, books, periodicals and newspapers, and even essays by students. Transcriptions of spoken data are also included.

A body called the Linguistic Data Consortium (LDC) licenses and makes available a variety of corpora of English and other languages, including the recently released American National Corpus. Currently, the ANC consists of 10 million words of American English, but it will eventually be extended to 100 million words. The LDC also offers spoken corpora, where the data is presented in the form of audio files, as well as text and spoken corpora in many languages besides English, including Chinese. Current LDC Chinese spoken offerings include Mandarin Broadcast News and the Callhome and Callfriend telephone speech corpora. The two principal Chinese text corpora from this source consist of newswire texts (thus, they are not balanced corpora): the LDC's Chinese Treebank is relatively small at 500,000 words, while Chinese Gigaword is by any standards large, at 1.1 billion Chinese characters. Academia Sinica and Lancaster University offer balanced text corpora of Taiwan and mainland China Chinese, respectively.

Central to corpus analysis is the context in which a word occurs: J. R. Firth pointed out that information about meaning can be derived from surrounding words and sentence patterns: "You shall know a word by the company it keeps", as he famously stated in 1957. A convenient and straightforward tool for inspecting the context of a given word in a corpus is the KWIC (keyword in context) concordance, in which every line of the corpus containing the desired keyword is listed, with the keyword at the centre; a minimal sketch of the idea is given below.
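The following Python sketch illustrates the principle of a KWIC display over a pre-segmented Chinese text. It is purely illustrative (it is not the software used to produce Figure 1), and the toy sentence is invented for the example.

    def kwic(tokens, keyword, width=8):
        """Return keyword-in-context lines: `width` tokens of left and
        right context around every occurrence of `keyword`."""
        lines = []
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = "".join(tokens[max(0, i - width):i])
                right = "".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left:>20}  {keyword}  {right}")
        return lines

    # Toy example with one pre-segmented sentence; a real run would scan a whole corpus.
    tokens = ["他", "大概", "是", "下雨", "的", "關係", "，", "沒", "來"]
    for line in kwic(tokens, "大概"):
        print(line)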
Figure 1 shows such a concordance.

便無情地拖著她,往車站的方向走去。大概是下雨的關係,站牌四周站滿了趕著
想到竟會和其中唯一的女孩成了情侶。大概是她大四最後一年的時候,一起在同一
所羈絆之處甚少,故常有驚人之舉,大概屬率性而為。此等情懷見乎其
,不過妳笑都沒有笑就是。讓我覺得我大概天生不會講笑話,在小鎮唯一的一家
,可以從 1691 年他的一份上諭中看出個大概。那年五月,古北口總兵官蔡元向朝廷
的命題。但是,這一命題是不公平的。大概是八,九年前的某一天,我在翻閱一堆
。我想,擁有如此的氣概和謀略,大概與三晉文明的深厚蘊藏、表裡山河的
很看不慣這些人。具有這種心態的人,大概只有兩種解釋:一是他根本沒有
戶人家,外頭有一個院子擺滿了煤球,大概是做批發的。公寓外只有小陰溝,裡頭
新年當然可以拿紅包。當時的行情大概是几張十塊的鈔票,因為十塊鈔票是
了當時太空探險的熱潮。進入彩色(大概是 73 年)時代後,我最欣賞卡通寶馬
人會放鞭炮呢!當時的兒童讀物不多,大概只有東方出版社專為小孩子們出書,

Figure 1: Excerpt from a KWIC concordance of the word 大概 from the Sinica Corpus

That Figure 1 includes only an excerpt from the full concordance is probably fairly obvious: in a large corpus of, say, 100 million words, a common word such as 大概 would occur hundreds if not thousands of times. So while KWIC might help a lexicographer or a student of Chinese to see, for example, that the word in question often occurs at the beginning of a sentence or clause, or that it is frequently followed by the verb 是, any comprehensive analysis taking into account all occurrences of the keyword would not be practicable.

Various tools are available for exploring word context in corpora, determining by statistical analysis which words are likely to appear in collocation with which others. Often, the statistic involved is mutual information (MI), first suggested in the linguistic context by Church and Hanks (1989). Oakes (1998: 63) reported that co-occurrence statistics such as MI "are slowly taking a central position in corpus linguistics". MI provides a measure of the degree of association of a given segment with others. Pointwise MI, calculated as in Equation (1), is what is used in lexical processing to return the degree of association of two words x and y (a collocation).

(1)  I(x; y) = log ( P(x|y) / P(x) )

Where one constituent of a collocation could scarcely occur other than in the company of the other (as with "Hong Kong", perhaps), MI will be positive and relatively high. Zero MI indicates, in principle, that two items are contiguous by chance and that they are independent of each other (although it is quite difficult to make out a case for independence when word order is clearly constrained by rules of syntax). A negative MI suggests that the items are relatively common, but in complementary distribution: ungrammatical sequences such as "the and" would come into this category.
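To make the calculation concrete, here is a minimal Python sketch of pointwise MI estimated from raw corpus counts. The counts are invented for illustration and are not drawn from any of the corpora discussed here; note that P(x|y)/P(x) is the same quantity as P(x,y)/(P(x)P(y)).

    import math

    def pointwise_mi(count_xy, count_x, count_y, n):
        """Pointwise mutual information of the pair (x, y), estimated from
        corpus counts: log2( P(x,y) / (P(x) * P(y)) ), equivalent to the
        log( P(x|y) / P(x) ) form of Equation (1)."""
        p_xy = count_xy / n
        p_x = count_x / n
        p_y = count_y / n
        return math.log2(p_xy / (p_x * p_y))

    # Invented counts, for illustration only.
    print(pointwise_mi(count_xy=80, count_x=100, count_y=90, n=1_000_000))        # strongly associated pair: high positive MI
    print(pointwise_mi(count_xy=1, count_x=50_000, count_y=60_000, n=1_000_000))  # common items that rarely co-occur: negative MI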
The SARA tool, widely used with the BNC, and the Sinica Corpus user interface both offer an MI analysis of the corpus contents. Such tools, however, suffer from two important constraints. First, when considering the context of a word, an arbitrary number of adjacent words to the left or right is taken into account, ignoring discontinuous collocations, which occur when other words (in particular function words like the and of) are found between the collocation components. To illustrate the problem, imagine that we wish to determine which of two senses of the English word bank ("the bank of a river", or "financial institution") is more common. If the strings river bank and, say, investment bank are frequent, there might be enough evidence on which to make a judgment. But such an analysis would ignore Bank of Taiwan and bank of the river, where the important collocates are not adjacent to the keyword, even though Taiwan and river stand in the same grammatical relationship to the keyword as investment and river do in the earlier examples.

The second constraint is that a list of collocates of a keyword could include, undistinguished, items of any part of speech (POS: noun, verb and so on) and of any syntactic role (such as subject or object). This sort of grammatical information can provide useful clues for sense discrimination, clues which standard corpus analyses are unable to take advantage of. Consider again the word bank, which has at least two verbal senses, illustrated by The plane banked sharply and John banked the money. The first of these is an intransitive verb: it cannot take an object. Thus, if an object is observed in the sentence featuring the keyword, the chances are that the occurrence of the verb bank belongs to the second sense.

One corpus query tool which overcomes these limitations is the Sketch Engine, developed by Adam Kilgarriff and Pavel Rychly, and described by Kilgarriff, Rychly, Smrz & Tugwell (2004). The description of the Sketch Engine which follows draws from that source, and from the Sketch Engine website, which is referenced below. The Sketch Engine is embedded in a corpus query tool called Manatee, and offers a number of modules. There is a standard concordance tool, whose output is very similar to that shown in Figure 1. It allows the user to select, as a keyword, either a lemma (in which case the keyword bank would yield results for all of bank, banks and banking, for example) or a simple word-form match. The user may also specify the size of the window (the number of words to the left and right of the keyword) that he wishes to view. Word frequency counts are also available, and the user may define a subcorpus (in the case of the BNC, on which the English version of the Sketch Engine is based, one can choose different parts of the corpus such as fiction or non-fiction).

The novelty of the Sketch Engine lies in its ability to produce word sketches. The word sketch for the verb express is shown in Figure 2. It will be seen that occurrences of express in the corpus are presented according to the grammatical context in which they occur, along with a frequency count and a salience score (a statistic based on mutual information). Thus the most salient collocate of express to act as object is concern (as in He expressed great concern), while the most salient subject collocate is (somewhat surprisingly, perhaps) infinitive: in the first sentence presented in the corresponding concordance, what does the expressing is indeed an infinitive.

Figure 2: Word sketch for the lemma express

Clicking on the frequency count for concern yields the concordance shown in Figure 3.
has appealed to Brazilian parliamentarians expressing its concern over the moves to reinstate
] falls within the terms of the concern expressed above , then the whole should be submitted
, has written to British Telecom ( BT ) expressing concern about the likely effects on older
Telecommunications ( OFTEL ) . OFTEL had expressed the concern that the level of the deposit
district councils responding to the survey expressed concern with the way in which the Secretary
PSYCHOGERIATRIC SERVICES Similar concern might be expressed for the continuation or development of geriatric
British Department of Transport officials have expressed concern about the probable restrictions
protested strongly . There was much concern expressed on Merseyside about the safety aspects of
we are aware that some concern has been expressed about playing on the Friday night . '
the Broadwater Farm Youth Association has expressed its concern about the situation to the police
KEN GILL and others Sir : We write to express our deep concern and profound fears over
night . The West German government expressed its concern at the police violence against
outside Yorkshire . Many local people also expressed concern at the number of holiday cottages
in November . Economists yesterday expressed concern that the increase in prices was

Figure 3: Sketch Engine concordance for express … concern (BNC)

It will be observed that the system handles discontinuous collocations of verb and object, such as express our deep concern, as well as the canonical expressed concern, where verb and object are adjacent. Other patterns, including the concern expressed and expressed by the infinitive (a passive form), are appropriately dealt with.

Altogether, the Sketch Engine defines 27 grammatical relations for English. As well as the subject and object relations, adverbial modifier, and/or, and prepositional relations may be seen in Figure 2. The grammatical relations are defined using regular expressions over part-of-speech tags, as shown in (2) for the simplest verb-object relation.

(2)  1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"

In (2), the 1: and 2: labels identify the two collocate components. Between the components, zero or more words (denoted by the *) may appear; if any do appear, they may be determiners (the or a), numbers, adjectives, adverbs or nouns. Other rules are also required for the verb-object grammatical relation (for example, to capture the passive form mentioned above).

Kilgarriff et al. have also ported the Sketch Engine to Czech. Because, unlike English, Czech is a free word order language, this implementation presented additional challenges. The fact that these problems were overcome, largely through the provision of a gap() object in the grammatical relation rules, bodes well for implementation in other languages whose word order is less constrained (such as Chinese).
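As an illustration of how a rule like (2) can be applied, the following Python sketch finds verb-object pairs in a small POS-tagged clause. It mimics the regular-expression-over-tags idea but is not the Sketch Engine's own rule compiler; the tag names are simplified, and taking the first reachable noun as the object is a simplification of the rule, which also allows intervening nouns.

    # A toy POS-tagged clause: "expressed our deep concern about the delay".
    tagged = [("expressed", "V"), ("our", "DET"), ("deep", "ADJ"),
              ("concern", "N"), ("about", "PREP"), ("the", "DET"), ("delay", "N")]

    FILLERS = {"DET", "NUM", "ADJ", "ADV", "N"}

    def verb_object_pairs(tokens):
        """Find (verb, object) pairs in the spirit of rule (2): a verb followed,
        possibly at a distance, by a noun, with only 'filler' tags in between."""
        pairs = []
        for i, (word, tag) in enumerate(tokens):
            if not tag.startswith("V"):
                continue
            for next_word, next_tag in tokens[i + 1:]:
                if next_tag == "N":
                    pairs.append((word, next_word))   # first noun reachable is taken as the object
                    break
                if next_tag not in FILLERS:
                    break                             # pattern broken by a non-filler tag
        return pairs

    print(verb_object_pairs(tagged))   # [('expressed', 'concern')]

In the real system, such patterns are written in a rule file, compiled, and evaluated over the whole corpus rather than over single hand-tagged sentences.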
Bibliography

黃居仁 (主編) (2003) 中文的意義與詞義之一/之二/之三。中央研究院文獻語料庫與詞庫小組技術報告 03-01/03-02/03-03。
中央研究院資訊所、語言所詞庫小組 (1995/1998) 「中央研究院漢語料庫的內容與說明」。詞庫小組技術報告 95-02/98-04 號。
Church, K. W. & Hanks, P. (1989) Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, Vancouver: 76-83.
Clear, J. H. (1993) The British National Corpus. In Paul Delany & G. P. Landow (eds) The Digital Word: Text-Based Computing in the Humanities. Cambridge, Mass.: MIT Press: 163-187.
Crystal, D. (1991) A Dictionary of Linguistics and Phonetics. Oxford: Blackwell.
Firth, J. R. (1957) A synopsis of linguistic theory, 1930-1955. In Palmer, F. R. (ed.) (1968) Selected Papers of J. R. Firth 1952-59. Harlow: Longman.
Ide, N. & Macleod, C. (2001) The American National Corpus: a standardized resource of American English. In Proceedings of Corpus Linguistics 2001, Lancaster, UK.
Kilgarriff, A. & Tugwell, D. (2002) Sketching words. In Marie-Hélène Corréard (ed.) Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. Euralex.
Kilgarriff, A., Rychly, P., Smrz, P. & Tugwell, D. (2004) The Sketch Engine. In Proceedings of EURALEX 2004, Lorient, France, July 2004.
McEnery, A., Xiao, Z. & Mo, L. (2003) Aspect marking in English and Chinese: using the Lancaster Corpus of Mandarin Chinese for contrastive language study. Literary and Linguistic Computing 18(4): 361-378.
Oakes, M. (1998) Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Speelman, D. (1997) Abundantia Verborum: A Computer Tool for Carrying out Corpus-Based Linguistic Case Studies. Doctoral dissertation, Katholieke Universiteit Leuven.

Relevant websites

British National Corpus (BNC): http://info.ox.ac.uk/bnc/
Sinica Balanced Corpus: http://www.sinica.edu.tw/SinicaCorpus/
Sketch Engine: http://www.sketchengine.co.uk
WordNet: http://www.cogsci.princeton.edu/~wn/

Goals and significance of the work

The purpose of the proposed work is to make available an online word disambiguation and sense discrimination tool for Chinese, by writing and implementing a set of grammatical relations for that language. Such a tool would be of great value to lexicographers, and to those involved in determining word senses for semantic network projects. An example is the Chinese WordNet project, on which co-PI Huang Churen is currently working, along with the PI and colleagues at Academia Sinica's Institute of Linguistics. At present, sense assignment on such projects is often based on the linguistic intuitions of researchers, who manually search corpora and internet resources for instances of candidate senses. While there can be no substitute for such intuitions, they would be usefully supplemented by the Sketch Engine, which would provide compact summaries of typical usages for each word.

Wu Yiching of Academia Sinica, with the PI as a co-author, has had a paper accepted at the 2005 Conference and Workshop on TEFL and Applied Linguistics, to be held at Ming Chuan University. This paper analyzes and compares the use in English and Chinese of the apparently equivalent verbs express and 表示, finding that the two forms exhibit considerable differences. The Sketch Engine was used for the English part of the analysis and revealed some interesting findings: for example, the PI, a native speaker of English, was not aware that "The justification for this was <expressed> to be that it would thus be open to a court at a later date to review the matter of sex determination" was a possible sentence of English. It is likely that a Chinese implementation of the Sketch Engine would be equally revealing, and it would certainly be useful in all manner of linguistic analyses.

For learners of Chinese as a second language, another feature of the Sketch Engine, the sketch differences module, would be of great assistance. This module allows the user to see at a glance how the usage of apparent synonyms, or near-synonyms, differs in practice. The difference between 能, 可以 and 會 is not always easy for learners to grasp, but the Chinese Sketch Engine would make the differences in usage immediately apparent.

Current research status

Before a corpus can be used by the Sketch Engine, it must be segmented into word tokens (necessary for a language like Chinese, which does not indicate word boundaries by white space) and then tagged for part of speech.
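To illustrate what segmentation involves, here is a minimal forward maximum-matching sketch in Python. It is purely illustrative: the Academia Sinica segmenter actually used for Gigaword is far more sophisticated, and the tiny dictionary below is invented for the example.

    def forward_max_match(text, dictionary, max_len=4):
        """Greedy left-to-right segmentation: at each position take the longest
        dictionary word that matches, falling back to a single character."""
        tokens, i = [], 0
        while i < len(text):
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in dictionary:
                    tokens.append(candidate)
                    i += length
                    break
        return tokens

    # Toy dictionary and sentence, invented for illustration.
    toy_dict = {"功課", "還沒有", "寫完", "大概", "是", "下雨", "的", "關係"}
    print(forward_max_match("功課還沒有寫完", toy_dict))
    # ['功課', '還沒有', '寫完']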
The Chinese Gigaword corpus has now been automatically segmented and tagged by the PI, along with Huang Churen and colleagues at Academia Sinica. There are, however, tagging and segmentation errors which need to be tackled, most likely by software modifications, with which co-PI Chen Jennan will be able to assist. The question of Chinese character formats also needs attention: the simplified and traditional components of Gigaword are supplied in Unicode, while the Academia Sinica tagger accepts Big-5. A one-pass manual conversion was performed, but it would have generated many errors in the case of the mainland China data, because of the one-to-many mapping between the simplified and traditional character schemes. Ideally, the tagger should be modified to accept Unicode, which is after all now an international standard.

In the English and Czech implementations, the Sketch Engine requires a lemmatization phase, whereby forms such as banking and banks are reduced to the stem bank. Since morphological affixation is rare in Chinese, this step may well not be necessary; it will, however, be necessary to consider how the system should handle morphs such as 們, 了 and 的.

Research methods

The central task is to produce a set of grammatical relations (essentially grammar rules) for Chinese. The number of grammatical relations should be comparable to the 27 required for English or the 23 used for Czech. The procedure will be to study large sections of Gigaword and other corpora, as well as other texts, and determine what characteristic patterns emerge. Each grammatical relation will then need to be encoded in the regular expression formalism accepted by the Sketch Engine. Each relation will probably require several rules, and will differ from the corresponding English relations in key respects: the verb-object relation, for example, will need to handle fronted objects preceded by 把 (a hypothetical sketch, in the same style as the earlier example, is given below).
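The following Python sketch, building on the earlier verb-object example, shows one way the 把 construction might be accommodated. The tags and the treatment of 把 are assumptions made purely for illustration (they do not follow the Sinica tagset), not a proposal for the final rule set.

    FILLERS = {"DET", "NUM", "ADJ", "ADV", "N"}

    def chinese_verb_object_pairs(tokens):
        """Toy verb-object extractor for Chinese: handles the canonical V ... N
        order and the fronted-object pattern 把 + N ... + V."""
        pairs = []
        for i, (word, tag) in enumerate(tokens):
            # Canonical order: verb followed (through filler tags) by its object noun.
            if tag.startswith("V"):
                for next_word, next_tag in tokens[i + 1:]:
                    if next_tag == "N":
                        pairs.append((word, next_word))
                        break
                    if next_tag not in FILLERS:
                        break
            # 把-construction: 把, then the object noun, then the verb.
            if word == "把":
                obj = next((w for w, t in tokens[i + 1:] if t == "N"), None)
                verb = next((w for w, t in tokens[i + 1:] if t.startswith("V")), None)
                if obj and verb:
                    pairs.append((verb, obj))
        return pairs

    # 他 把 功課 寫完 了 ("He finished writing the homework"); tags invented for illustration.
    tagged = [("他", "PRON"), ("把", "BA"), ("功課", "N"), ("寫完", "V"), ("了", "ASP")]
    print(chinese_verb_object_pairs(tagged))   # [('寫完', '功課')]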
While the grammatical relations are being written, an attempt will be made to load the Gigaword corpus using the grammatical relations written for English and, perhaps, those for Czech. We expect to see a great improvement in sense discrimination and organization when the Chinese relations are implemented, but the foreign-language versions will act as a useful benchmark.

Co-PI Chen Jennan, under a separate grant application, intends to write an algorithm for automatic, adaptive word sense disambiguation in Chinese, following his earlier publication of such an algorithm for English (Adaptive word sense disambiguation using lexical knowledge in a machine-readable dictionary, in ROCLING 2000). It will be designed so that it can make use of the salient collocation information supplied by the Sketch Engine. Using this algorithm, we will test the hypothesis that the Sketch Engine benefits from Chinese-specific grammatical relations, by conducting word sense disambiguation experiments with both those relations and the foreign-language benchmark relations.

Anticipated problems and means of resolution

It is very difficult to predict what problems might be encountered. One possibility is that it will be hard to formulate grammatical relations for constructions where the word order in Chinese is fairly free. For example, in 功課還沒有寫完 ("the homework has not yet been finished"), the object 功課 precedes the verb. Although Chinese, like English, is a subject-verb-object (SVO) language, this kind of topicalization is by no means uncommon. One solution might be to treat 功課 in such a sentence as a kind of pseudo-subject, and experiment to see whether this choice affects results adversely. Alternatively, it might be possible to test for an object after the verb and, if there is none, assume that the object has been fronted.

Anticipated results

An online resource will in the first instance be made available to the academic community, enabling researchers to create their own word sketches dynamically. Those working on word sense discrimination should be able to start using it immediately, and those involved in pedagogical research, or in the teaching of Chinese to non-native speakers, should find it useful too. Ultimately, it is hoped that the Sketch Engine could form part of a Chinese CALL (computer-assisted language learning) platform, for the benefit of foreign learners. It could also be adapted for native Chinese elementary school students who are beginning to learn writing skills.

Deliverables will include presentations at domestic and international conferences, both on the central task (creating a Chinese Sketch Engine) and on work in linguistic analysis and description that makes use of it. This should encourage young researchers to test their linguistic hypotheses, conveniently and readily, on a large corpus of natural Chinese.