Analysis of text and spoken language, for the purposes of second language teaching,
dictionary making and other linguistic applications, used to be based on the intuitions
of linguists and lexicographers. Today’s more scientific approach generally involves
the use of linguistic corpora: large databases of spoken or written language samples.
Numerous large corpora have been assembled for English, including the British
National Corpus and the Bank of English. Dictionaries published by the Longman
Group are based on the 100 million word BNC, and corpora are routinely used by
computational linguists in tasks such as machine translation and speech recognition.
The number of Chinese corpora available is more limited. Academia Sinica and
Lancaster University offer balanced corpora of Chinese (balanced, because the
contents are drawn from multiple sources, and are not restricted to one genre or style).
A much larger Chinese corpus, the LDC’s Gigaword, is composed exclusively of
newswire texts.
Central to corpus analysis is the context in which a word occurs: J R Firth pointed out
that information about meaning can be derived from surrounding words and sentence
patterns. Various tools are available for exploring word context in corpora,
determining for example which words are likely to appear in collocation with which
others. The SARA tool is widely used with the BNC, and the Sinica Corpus user
interface offers some statistical analysis of the corpus contents. Most of these tools,
however, suffer from an important constraint: when considering the context of a word,
an arbitrary number of adjacent words to the left or right is taken into account. Thus,
most tools pay no heed to the grammatical context of a word under investigation.
Furthermore, most query tools consider only part of speech (POS) in the limited
grammatical analysis that they do make.
One corpus query tool which overcomes these limitations is the Sketch Engine. The
analysis exploits relationships such as subject and verb, or verb and object, where
other tools would only recognize that one noun and one verb were being treated.
Occurrences of words are assigned classes according to their grammatical relationship
with other frequently collocating words, and then ranked according to salience (a
formal measure of the significance of the word in a given context).
The proposed research involves extending Sketch Engine to handle Chinese. First, a
very large corpus of Chinese (probably Gigaword) will be segmented and tagged. Because
that corpus is so large, the existing (error-prone) tagging programs for Chinese will
have to be refined and improved. Secondly, a set of grammatical relations appropriate
for Chinese will need to be drawn up.
The proposed research will be of considerable benefit to Chinese lexicography: it will
be possible to distinguish the different senses of a word at a glance, obviating the need
to search through hundreds of lines of corpus output. It also has application in
Chinese language learning, helping the student to distinguish between homonyms and
make use of synonyms.
Summary of results
The PI is no newcomer to the field of Corpus Linguistics, as the research projects he
carried out at both masters and doctoral level involved the use of large corpora. His
MSc dissertation (Smith 1999) presents an algorithm which assigns lexicalization
scores to Chinese verb-object compounds (VOCs), taking into account contextual
information as well as compound-specific features. The dissertation opens with a
review of approaches to wordhood, with special reference to Chinese. A typology of
compound types is offered, and some of the wordhood criteria that have been
suggested in both the Chinese and the general linguistics literature are surveyed. Four
key criteria (boundedness, translatability, idiomaticity and referentiality) are identified;
a synthesis of these is then coded into the lexicalization algorithm. The Academia
Sinica balanced corpus is used to create a VOC database, to which reference is made
in determining lexicalization scores for VOCs encountered in unseen, user-supplied
utterances. A limited evaluation is conducted, using a second corpus. The program
itself aims to be as user-friendly as possible. It is written in LPA Prolog, and takes up
a number of challenges associated with Chinese text processing, providing a
convenient interface for users to key Chinese characters at the LPA Prolog prompt.
The PI’s doctoral thesis presents a data-driven approach to the classification of
utterances, using a novel combination of existing algorithmic approaches. Previous
work had generally classified utterances according to such categories as wh- question,
yes/no question, acknowledgement, response and the like; in general, the audio data
used was specially commissioned and recorded for research purposes. The work
presented in the thesis departs from this tradition, in that the recorded data consists of
genuine interaction between the telephone operator and members of the public, taken
from OASIS, the British Telecom telephone enquiry corpus (BTexaCT 2001).
Moreover, most of the calls recorded can be characterized as queries. The techniques
presented in this thesis attempt to determine, automatically, the class of query, from a
set of six possibilities including "statement of a problem" and "request for action". To
achieve this, a scheme for automatically labelling utterance segments according to
their prosodic features was devised, and this is presented. It is then shown how
labelling patterns encountered in training data can be exploited to classify unseen
utterances. The algorithm is coded in a combination of C and UNIX shell scripts.
Analysis of text and spoken language, for the purposes of second language teaching,
dictionary making and other linguistic applications, used to be based on the intuitions
of linguists and lexicographers. The compilation of dictionaries and thesauri, for
example, required that the compiler read very widely, and record the results of his
efforts – the definitions and different senses of words – on thousands, or millions, of
index cards. Dictionary entries which seemed intuitively similar were placed together
in boxes or piles, according to Speelman (1997), for later analysis. Thus, the
distribution of items among sets preceded the lexical analysis, whereas under a
computer-age model the analysis would come first, guiding the distribution: a
distribution which could be based on masses of data, rather than the intuitions of the
Kilgarriff & Tugwell (2002) observe that manual lexicography, with its limited
number of citations per word, emphasizes the unusual, by its nature paying somewhat
less attention to common lexical items and patterns. With the advent of computers and
corpora, full attention can be paid to even the most common words: the more
frequently a word occurs, the more an analyst can say about it, with appropriate
computer tools.
The presently proposed research will adapt and extend one such tool, and make it
compatible with corpora of the Chinese language. Today’s approach to linguistic
analysis generally involves the use of linguistic corpora: large databases of spoken or
written language samples, defined by Crystal (1991) as “A collection of linguistic data,
either written texts or a transcription of recorded speech, which can be used as a
starting-point of linguistic description or as a means of verifying hypotheses about a language”.
Numerous large corpora have been assembled for English, including the British
National Corpus and the Bank of English. Dictionaries published by the Longman
Group are based on the 100 million word BNC, and corpora are routinely used by
computational linguists in tasks such as machine translation and speech recognition.
The BNC is an example of a balanced corpus, in that it attempts to represent a broad
cross-section of genres and styles, including fiction and non-fiction, books,
periodicals and newspapers, and even essays by students. Transcriptions of spoken
data are also included.
A body called the Linguistic Data Consortium licenses and makes available a variety
of corpora of English and other languages, including the recently released American
National Corpus (Ide & McLeod 2001). Currently, the ANC consists of 10 million
words of American English, but it will eventually be extended to 100 million words.
The LDC also offers spoken corpora, where the data is presented in the form of audio
files, as well as text and spoken corpora from many languages in addition to English,
including Chinese. Current LDC Chinese spoken offerings include Mandarin
Broadcast News, and the Callhome and Callfriend telephone speech corpora. The two
principal Chinese text corpora from this source consist of newswire texts (thus, they
are not balanced corpora): the LDC’s Chinese Treebank is relatively small at 500,000
words, while Chinese Gigaword is by any standards large, at 1.1 billion Chinese
Academia Sinica and Lancaster University, respectively, offer balanced text corpora
of Taiwan and mainland China Chinese.
Central to corpus analysis is the context in which a word occurs: J R Firth pointed out
that information about meaning can be derived from surrounding words and sentence
patterns: “You shall know a word by the company it keeps”, as he famously stated in
1957. A convenient and straightforward tool for inspecting the context of a given
word in a corpus is the KWIC (keyword in context) concordance, where all lines in
the corpus containing the desired keyword are listed, with the keyword at the centre.
Figure 1 shows such a concordance.
,可以從 1691 年他的一份上諭中看出個大概。那年五月,古北口總兵官蔡元向朝廷
了當時太空探險的熱潮。進入彩色(大概是 73 年)時代後,我最欣賞卡通寶馬
Figure 1 Excerpt from KWIC concordance of the word 大概 from the Sinica corpus
That Figure 1 includes only an excerpt from the full concordance is probably fairly
obvious: in a large corpus of say 100 million words, a common word such as 大概
would occur hundreds if not thousands of times. So while KWIC might help a
lexicographer or a student of Chinese to see, for example, that the word in question
often occurs at the beginning of a sentence or a clause, or that it is frequently followed
by the verb 是, any comprehensive analysis taking into account all the occurrences of
the keyword would not be practicable.
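The KWIC idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kwic`, the `window` parameter and the toy English sentence are hypothetical, and a Chinese text would first have to be segmented into word tokens.

```python
def kwic(tokens, keyword, window=4):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]      # up to `window` words before
            right = tokens[i + 1:i + 1 + window]     # up to `window` words after
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


# Toy, pre-tokenized input (an assumption made for illustration).
tokens = "the bank of the river was near the investment bank in town".split()
for left, kw, right in kwic(tokens, "bank", window=3):
    # Right-align the left context so that the keyword sits in one column.
    print(f"{left:>25}  [{kw}]  {right}")
```

Even in this toy form, the output grows linearly with the number of hits, which is why a concordance alone does not scale to a comprehensive analysis of a frequent word.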
Various tools are available for exploring word context in corpora, determining by a
statistical analysis which words are likely to appear in collocation with which others.
Often, the statistic involved is mutual information (MI), first suggested in the
linguistic context by Church and Hanks (1989).
Oakes (1998:63) reported that co-occurrence statistics such as MI “are slowly taking a
central position in corpus linguistics”. MI provides a measure of the degree of
association of a given segment with others. Pointwise MI, calculated by Equation 1, is
what is used in lexical processing to return the degree of association of two words x
and y (a collocation).
\[ I(x; y) = \log \frac{P(x \mid y)}{P(x)} \tag{1} \]
Where one constituent of a collocation could scarcely occur other than in the
company of the other (as with “Hong Kong”, perhaps), MI will be positive and
relatively high. Zero MI indicates, in principle, that two items are contiguous by
chance, and that they are independent of each other (although it is quite difficult to
make out a case for independence when word order is clearly constrained by rules of
syntax). A negative MI suggests that the items are relatively common, but in
complementary distribution: ungrammatical sequences such as “the and” would come
into this category.
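As a rough illustration of Equation 1, the sketch below estimates pointwise MI for adjacent word pairs from raw counts. Several choices here are assumptions rather than part of the source: the function name `pointwise_mi`, the use of base-2 logarithms, the estimation of P(x|y) from adjacent-pair counts, and the convention of returning negative infinity for a pair that never occurs.

```python
import math
from collections import Counter


def pointwise_mi(tokens, x, y):
    """Estimate log2(P(x|y) / P(x)) for the adjacent pair (x, y)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent pairs only
    joint = bigrams[(x, y)]
    if joint == 0:
        # Unseen pair: the logarithm is undefined, so report -infinity
        # (a convention chosen for this sketch).
        return float("-inf")
    p_x_given_y = joint / unigrams[y]            # how often y is preceded by x
    p_x = unigrams[x] / n                        # overall frequency of x
    return math.log2(p_x_given_y / p_x)


# Toy corpus (hypothetical), lower-cased and pre-tokenized.
tokens = "hong kong is a city and hong kong is busy and the city is busy".split()
print(pointwise_mi(tokens, "hong", "kong"))   # strongly associated pair
print(pointwise_mi(tokens, "is", "busy"))     # weaker association
print(pointwise_mi(tokens, "the", "and"))     # never adjacent in this order
```

Because only adjacent pairs are counted, this sketch shares the limitation of window-based tools: it cannot see a collocate that is separated from the keyword by function words.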
The SARA tool, widely used with the BNC, and the Sinica Corpus user interface both
offer an MI analysis of the corpus contents. Such tools, however, suffer from two
important constraints: first, when considering the context of a word, an arbitrary
number of adjacent words to the left or right is taken into account, ignoring
discontinuous collocations, which occur when other words (in particular function
words like “the” and “of”) are found between the collocation components. To illustrate the
problem, imagine that we wish to determine which of two senses of the English word
“bank” (“the bank of a river”, or “financial institution”) is more common. If the strings
“river bank” and, say, “investment bank” are frequent, there might be enough evidence on
which to make a judgment. But such an analysis would ignore “Bank of Taiwan” and
“bank of the river”, where the important collocates are not adjacent to the keyword,
even though “Taiwan” and “river” stand in the same grammatical relationship to the
keyword as “investment” and “river” in the other example.
The second constraint is that a list of collocates of some keyword could include,
undistinguished, items of any part of speech (POS: noun, verb and so on) and of any
syntactic role (such as subject or object). This sort of grammatical information can
provide useful clues for sense discrimination, which standard corpus analyses are
unable to take advantage of. Consider again the word “bank”, which has at least two
verbal senses, illustrated by “The plane banked sharply” and “John banked the money”.
The first of these is an intransitive verb – it cannot take an object. Thus, if an object is
observed in the sentence featuring the keyword, the chances are that forms of the verb
bank properly belong to the second sense.
One corpus query tool which overcomes these limitations is the Sketch Engine,
developed by Adam Kilgarriff and Pavel Rychly, and described by Kilgarriff, Rychly,
Smrz & Tugwell (2004). The description of the Sketch Engine which follows draws
from that source, and from the Sketch Engine website, which is referenced below.
The Sketch Engine is embedded in a corpus query tool called Manatee, and offers a
number of modules. There is a standard concordance tool, whose output is very
similar to that shown at Figure 1. It allows the user to select, as a keyword, either a
lemma (in which case the keyword bank would yield results for all of bank, banks
and banking for example), or a simple word-form match. The user may also specify
the size of the window (the numbers of words to the left and right of the keyword)
that he wishes to view. Word frequency counts are also available, and the user may
define a subcorpus (in the case of the BNC, on which the English version of Sketch
Engine is based, one can choose different parts of the corpus such as fiction or
The novelty of the Sketch Engine lies in its ability to produce word sketches. The
word sketch for the verb express is shown at Figure 2. It will be seen that occurrences
of express in the corpus are presented according to the grammatical context in which
they occur, along with a frequency count and a salience count (this statistic is based
on mutual information). Thus the most salient collocate of express to act as object is
concern (as in He expressed great concern), while the most salient subject collocate is
(somewhat surprisingly, perhaps) infinitive: the first sentence presented in the
corresponding concordance is actually the event which is experienced (expressed) by
the infinitive.
Figure 2 Word sketch for the lemma express
Clicking on the frequency count for concern yields the concordance shown at Figure 3.
has appealed to Brazilian parliamentarians expressing its concern over the moves to reinstate
] falls within the terms of the concern expressed above , then the whole should be submitted
, has written to British Telecom ( BT ) expressing concern about the likely effects on older
Telecommunications ( OFTEL ) . OFTEL had expressed the concern that the level of the deposit
district councils responding to the survey expressed concern with the way in which the Secretary
A10 PSYCHOGERIATRIC SERVICES<p>Similar concern might be expressed for the continuation or development of geriatric
British Department of Transport officials have expressed concern about the probable restrictions
protested strongly . There was much concern expressed on Merseyside about the safety aspects of
we are aware that some concern has been expressed about playing on the Friday night . '</p>
the Broadwater Farm Youth Association has expressed its concern about the situation to the police
KEN GILL and others<p>Sir : We write to express our deep concern and profound fears over
night .</p><p>The West German government expressed its concern at the police violence against
outside Yorkshire . Many local people also expressed concern at the number of holiday cottages
in November .</p><p>Economists yesterday expressed concern that the increase in prices was
Figure 3 Sketch Engine concordance for express…concern
It will be observed that the system handles discontinuous collocations of verb and
object such as express our deep concern as well as the canonical expressed concern
where verb and object are adjacent. Other patterns, including the concern expressed
and expressed by the infinitive (a passive form) are appropriately dealt with.
Altogether, the Sketch Engine defines 27 grammatical relations for English. As well
as the subject and object relations, adverbial modifier, and/or, and prepositional
relations may be seen in Figure 2. The grammatical relations are defined using regular
expressions over part-of-speech tags, as shown in (2) for the simplest verb-object
(2) 1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"
In (2), the 1: and 2: identify the two collocate components. Between the components,
zero or more (denoted by the *) words may appear. If any do appear, they may be
determiners (the or a), numbers, adjectives, adverbs or nouns. Other rules are also
required for the verb-object grammatical relation (for example to capture the passive
form mentioned above).
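To make the mechanics of rule (2) concrete, the following sketch re-expresses it as an ordinary Python regular expression over a string of part-of-speech tags. It does not use the Sketch Engine's own rule syntax or matching semantics: the tag encoding, the function name `verb_object_pairs`, the simplified tag set and the greedy choice of the last noun in a run are all assumptions made for illustration.

```python
import re

# Rule (2) rewritten for Python: a verb, then zero or more intervening
# DET/NUM/ADJ/ADV/N tokens, then a noun. Every tag is followed by one space,
# and the lookbehind stops "V" from matching inside a longer tag like "ADV".
VERB_OBJECT = re.compile(r"(?<!\S)V (?:(?:DET|NUM|ADJ|ADV|N) )*N ")


def verb_object_pairs(tagged):
    """Return (verb, object) word pairs found in a list of (word, tag) tuples."""
    tags = "".join(tag + " " for _, tag in tagged)
    pairs = []
    for m in VERB_OBJECT.finditer(tags):
        # The number of spaces before an offset equals the number of tokens
        # before it, which maps string offsets back to token indices.
        verb_idx = tags[:m.start()].count(" ")
        obj_idx = tags[:m.end()].count(" ") - 1
        pairs.append((tagged[verb_idx][0], tagged[obj_idx][0]))
    return pairs


# Hypothetical hand-tagged fragments modelled on the Figure 3 examples.
discontinuous = [("we", "PRON"), ("express", "V"), ("our", "DET"),
                 ("deep", "ADJ"), ("concern", "N"), ("today", "ADV")]
adjacent = [("officials", "N"), ("expressed", "V"), ("concern", "N")]
print(verb_object_pairs(discontinuous))
print(verb_object_pairs(adjacent))
```

Both fragments yield the same verb-object pairing, which is the point of defining the relation over tags rather than over a fixed window of adjacent words.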
Kilgarriff et al have also ported the Sketch Engine to the Czech language. Because,
unlike English, Czech is a free word order language, this implementation presented
additional challenges. The fact that these problems were overcome, largely through
the provision of a gap() object in the grammatical relations rules, bodes well for
implementation in other languages whose word order is less constrained (such as
Goals and significance of the work
The purpose of the proposed work is to make available an online word
disambiguation and sense discrimination tool for Chinese, by writing and
implementing a set of grammatical relations for that language. Such a tool would be
of great value to lexicographers, and those involved in determining word senses for
semantic network projects. An example is the Chinese WordNet project, which Huang
Churen, one of the co-PIs, is currently working on, along with the PI, and colleagues
at Academia Sinica’s Institute of Linguistics. At present, sense assignment on such
projects is often based on the linguistic intuitions of researchers, who manually search
corpora and internet resources for instances of candidate senses. While there can be
no substitute for such intuitions, they would be usefully supplemented by the Sketch
Engine, which would provide compact summaries of typical usages for each word.
Wu Yiching of Academia Sinica, with the PI as a co-author, has had a paper accepted
at the 2005 Conference and Workshop on TEFL and Applied Linguistics, to be held at
Ming Chuan University (Wu, Smith & Huang, forthcoming). This paper analyzes and
compares the use in English and Chinese respectively of the apparently equivalent
verbs express and 表示, finding that the two forms exhibit considerable differences.
The Sketch Engine was used for the English part of the analysis, and revealed some
interesting findings: for example, the PI, a native speaker of English, was not aware
that the sentence The justification for this was <expressed> to be that it would thus be
open to a court at a later date to review the matter of sex determination was a
possible sentence of English. It is likely that a Chinese implementation of Sketch
Engine would be equally revealing, and it would certainly be useful in all manner of
linguistic analyses.
For learners of Chinese as a second language, another feature of the Sketch Engine,
the sketch differences module, would be of great assistance. This module allows the
user to see at a glance how the usage of apparent synonyms, or near-synonyms, differs
in practice. The difference between 能, 可以 and 會 is not always easy for learners
to grasp, but the Chinese Sketch Engine would illustrate the differences in usage at a glance.
Current research status
Before a corpus can be used by the Sketch Engine, it must be segmented into word
tokens (in the case of a language like Chinese which does not indicate word
boundaries by white space) and then tagged for part of speech. The Chinese Gigaword
corpus has now been automatically segmented and tagged by the PI along with Huang
Churen and colleagues at Academia Sinica. There are, however, tagging and
segmentation errors which need to be tackled, most likely by software modifications,
with which co-PI Chen Jennan will be able to assist. The question of Chinese
character formats also needs attention: the simplified and traditional components of
Gigaword are supplied in Unicode, while the Academia Sinica tagger accepts Big-5. A
one-pass manual conversion was performed, but it will have introduced many errors
in the case of the mainland China data, because of the one-to-many mapping between
simplified and traditional characters. Ideally, the tagger should be modified to
accept Unicode format, which is after all now an international standard.
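The one-to-many problem can be illustrated with a very small sketch. The mapping table below is a hypothetical fragment containing only a handful of characters, and the function name `candidate_conversions` is an assumption; a real conversion would need a full table plus contextual disambiguation.

```python
# Hypothetical, deliberately tiny mapping table (illustration only).
SIMPLIFIED_TO_TRADITIONAL = {
    "发": ["發", "髮"],   # "to emit, to develop" versus "hair"
    "后": ["後", "后"],   # "after, behind" versus "empress"
    "头": ["頭"],
    "展": ["展"],
}


def candidate_conversions(text):
    """Expand a simplified string into every possible traditional rendering."""
    candidates = [""]
    for ch in text:
        # Characters missing from the table are passed through unchanged.
        options = SIMPLIFIED_TO_TRADITIONAL.get(ch, [ch])
        candidates = [prefix + opt for prefix in candidates for opt in options]
    return candidates


print(candidate_conversions("头发"))   # "hair": only the form with 髮 is right
print(candidate_conversions("发展"))   # "develop": only the form with 發 is right
```

A one-pass conversion that always picks the first option gets one of these two words wrong, whichever order the table lists the variants in, which is why errors of this kind are to be expected in the converted mainland data.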
In the English and Czech implementations, the Sketch Engine requires a
lemmatization phase, whereby forms such as banking and banks are reduced to the
stem bank. Since morphological affixation is rare in Chinese, this step may well not
be necessary; it will, however, be necessary to consider how the system should handle
morphs such as 們, 了 and 的.
Research methods
The central task is to make a set of grammatical relations (essentially grammar rules)
for Chinese. The number of grammatical relations should be comparable to the 27
required for English, or the 23 used for Czech. The procedure will be to study large
sections of Gigaword and other corpora, as well as other texts, and determine what
characteristic patterns emerge. Then each grammatical relation will need to be
encoded in the form of a regular expression accepted by the Sketch Engine. Each
relation will probably require several rule formalisms, and will differ from similar
English relations in key respects. The verb-object relation would need to be equipped
to handle fronted objects preceded by 把, for example. The sample rule at (3) covers
such situations.
(3) (把) ("N" 的) "(DET|NUM|ADJ|ADV|N)"* 2:"N" 1:"V"
In (3), most of the elements are optional (enclosed in brackets, or followed by an
asterisk). It validates verb-object sentence fragments with and without 把: for example,
把 gongke jiao (gei laoshi), 把 nide gongke jiao… and gongke jiao… would all be
identified as verb-object relations by this rule.
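A minimal sketch of how rule (3) behaves on the three fragments is given below. It is written as a plain Python regular expression, not in the Sketch Engine's rule format; the romanized tokens, the tag names (`BA`, `DE`, `N`, `V`), the treatment of the pronoun as `N`, and the function name `fronted_object_pairs` are all assumptions made for illustration.

```python
import re

# Rule (3) rewritten for Python: optional 把, optional possessor (N 的),
# optional modifiers, then the object noun directly followed by the verb.
# The words 把 and 的 are matched literally; other tokens by their tag.
FRONTED_OBJECT = re.compile(
    r"(?<!\S)(?:把 )?(?:N 的 )?(?:(?:DET|NUM|ADJ|ADV|N) )*N V "
)


def fronted_object_pairs(tagged):
    """Return (verb, object) pairs where the object noun precedes the verb."""
    symbols = "".join((w if w in ("把", "的") else t) + " " for w, t in tagged)
    pairs = []
    for m in FRONTED_OBJECT.finditer(symbols):
        verb_idx = symbols[:m.end()].count(" ") - 1   # last token of the match
        pairs.append((tagged[verb_idx][0], tagged[verb_idx - 1][0]))
    return pairs


# Hypothetical hand-tagged versions of the three fragments in the text.
with_ba = [("把", "BA"), ("gongke", "N"), ("jiao", "V")]
with_possessor = [("把", "BA"), ("ni", "N"), ("的", "DE"),
                  ("gongke", "N"), ("jiao", "V")]
without_ba = [("gongke", "N"), ("jiao", "V")]
for fragment in (with_ba, with_possessor, without_ba):
    print(fragment_result := fronted_object_pairs(fragment))
```

All three fragments produce the pairing of jiao with gongke. Because every element before the final noun is optional, the pattern reduces to "noun followed by verb" in the simplest case, so any noun that directly precedes a verb will match.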
It is not known how many grammatical relations (rules) will eventually be defined for
Chinese. The total should be of the same order of magnitude as for English (27
relations); but because of the lack of morphological clues in Chinese, the rules will
certainly need to be more complex and very likely more numerous.
While the grammatical relations are being written, an attempt will be made to load the
Gigaword corpus using the grammatical relations applied to English, and, perhaps,
those for Czech. We will hope to see a great improvement in sense discrimination and
organization when the Chinese relations are implemented, but the foreign language
versions will act as a useful benchmark.
Co-PI Chen Jennan, under a separate grant application, intends to write an algorithm
for automatic, adaptive word sense disambiguation in Chinese, following his earlier
publication of such an algorithm for English (Chen 2000). It will be designed so that it
can make use of salient collocation information supplied by the Sketch Engine. Using
this algorithm, we will test the hypothesis that the Sketch Engine benefits from the
Chinese-specific grammatical relations, by conducting word sense disambiguation
experiments using both those relations and the foreign-language benchmark relations.
The PI has a good working relationship with Adam Kilgarriff and Pavel Rychly, the
designers of the Sketch Engine. He will be in frequent contact with them, providing
feedback and progress reports on the Chinese implementation, keeping abreast of
Sketch Engine developments in other languages and resolving problems through
mutual discussion.
Anticipated problems and means of resolution
It is very difficult to predict what problems might be encountered. One possibility is
that it will be hard to formulate grammatical relations for constructions where the
word order in Chinese is fairly free. For example, in 公課還沒有寫完, the object 公
課 precedes the verb. Although Chinese, like English, is a subject-verb-object (SVO)
language, this kind of topicalization is by no means uncommon.
Of course the difficulty with the rule (3) above is that it also admits a subject-verb
relation such as laoshi jiaoshu. Kilgarriff et al (2004) overcame a similar problem in
Czech, which also has relatively free word order, by making appeal to case (a subject
is nominative case); but this grammatical feature does not exist in Chinese.
To resolve such problems, we will proceed as follows. We will examine the corpus to
see what grammatical and lexical features do attend the verb-object and subject-verb
relation, and formulate our rules accordingly. One solution might be to test for an
object after the verb, and if there is none assume that the object has been fronted.
Alternatively, it could be possible to treat 公課 in this sentence as a kind of
pseudo-subject, and to experiment to see whether this course affects results adversely.
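The first of these solutions can be sketched as a simple heuristic. Everything in the example is an assumption made for illustration: the function name `preverbal_noun_role`, the tag set, the romanized tokens, and the simplification that any noun after the verb counts as its object.

```python
def preverbal_noun_role(tagged, verb_idx):
    """Guess whether the noun before the verb is its subject or a fronted object.

    Heuristic from the text: if a noun follows the verb, treat it as the
    object and the preverbal noun as the subject; if none follows, assume
    that the object has been fronted.
    """
    nouns_before = [w for w, t in tagged[:verb_idx] if t == "N"]
    if not nouns_before:
        return None
    has_object_after = any(t == "N" for _, t in tagged[verb_idx + 1:])
    role = "subject" if has_object_after else "fronted object"
    return nouns_before[-1], role


# Hypothetical hand-tagged sentences (romanized for readability).
topicalized = [("gongke", "N"), ("hai", "ADV"), ("meiyou", "ADV"), ("xiewan", "V")]
canonical = [("laoshi", "N"), ("jiao", "V"), ("shu", "N")]
print(preverbal_noun_role(topicalized, 3))
print(preverbal_noun_role(canonical, 1))
```

The outcome depends on segmentation: if jiaoshu were kept as a single verb token, no noun would follow it and the heuristic would wrongly report laoshi as a fronted object. Effects of this kind are what the proposed corpus examination would need to measure.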
Anticipated results
An online resource will in the first instance be made available to the academic
community, enabling researchers to create their own word sketches dynamically.
Those working on word sense discrimination should be able to start using it
immediately, and those involved in pedagogical research, or the teaching of Chinese
to non-native speakers, should find it useful too.
Ultimately, it is hoped that the Sketch Engine could form part of a Chinese CALL
(computer-aided language learning) platform, for the benefit of foreign learners. It
could also be adapted for native Chinese elementary school students, who are
beginning to learn writing skills.
Deliverables will include presentations at domestic conferences, and at least one
international conference, both on the central task – creating a Chinese Sketch
Engine – and on endeavours of linguistic analysis and description which make use of
the Sketch Engine. This should inspire young researchers to test their linguistic
hypotheses, conveniently and readily, on a large corpus of natural Chinese.