Developing Asian Language Corpora

The Lancaster Corpus of Mandarin Chinese (LCMC): A corpus for monolingual and contrastive study
Tony McEnery
Richard Xiao
27/05/2004
LREC 2004, Lisbon
The LCMC Corpus: Aims
• Built for the ESRC project Contrasting tense and aspect in English and Chinese (Grant Ref. RES-000220135)
  – See http://www.regard.ac.uk
• A Chinese match for FLOB/Frown for BrE/AmE
• A publicly available balanced corpus of Mandarin Chinese
• Distributed free of charge for use in non-profit-making research
LCMC: Profile
• One million words
• Standard character and Romanized Pinyin versions
• 1990-1993 (ca. 87% of samples from 1991-1992)
• 15 text categories
• 500 text samples
• Major text provider: SSReader Digital Library, China
• Unicode (UTF-8)
• XML-conformant mark-up
• Marked for paragraphs and sentences
• POS-tagged (precision rate 98%+)
Major Chinese corpus resources (1)
• Sinica Corpus of Mandarin Chinese
  – 5 million words of Mandarin as used in Taiwan
  – http://www.sinica.edu.tw/SinicaCorpus
• PH corpus
  – Ca. 2 million words of newswire text (1990-1991)
  – Available at ftp://ftp.cogsci.ed.ac.uk/pub/chinese
  – POS version available at http://www.ling.lancs.ac.uk/corplang/
• PFR People’s Daily Corpus
  – Newspaper text from People’s Daily 1998
  – Sample (01/98) available at http://icl.pku.edu.cn/Introduction/corpustagging.htm
  – Searchable at http://www.ling.lancs.ac.uk/corplang/
Major Chinese corpus resources (2)
• Linguistic Variation in Chinese Speech Communities (LIVAC)
  – Text from newspapers and electronic media in six Chinese speech communities
  – http://www.livac.org/
• Spoken Chinese Corpus of Situated Discourse (SCCSD)
  – See Gu, Y. 2002. ‘Towards an understanding of workplace discourse’ in C. Candlin (ed) Research and Practice in Professional Discourse (pp. 137-86). Hong Kong: City University of Hong Kong Press.
• Three Mandarin corpora released by LDC
  – TREC, Gigaword and Callhome
  – See the LDC catalogue
Chinese corpora: A comparison
Corpus    POS-tagged  Balanced  Channel  Variety   Contrastive
LCMC      Yes         Yes       Written  Mainland  E–C
Sinica    Yes         Yes       Mixed    Taiwan    No
PH        No          No        Written  Mainland  No
PFR       Yes         No        Written  Mainland  No
LIVAC     No          No        Written  Mixed     C–C
SCCSD     No          Yes       Spoken   Mainland  No
TREC      No          No        Written  Mainland  No
Gigaword  No          No        Written  Mainland  No
Callhome  No          ?         Spoken   Mixed     No
LCMC: Sampling frame
Code   Text category                       Samples  Proportion
A      Press reportage                     44       8.8%
B      Press editorials                    27       5.4%
C      Press reviews                       17       3.4%
D      Religion                            17       3.4%
E      Skills/trades/hobbies               38       7.6%
F      Popular lore                        44       8.8%
G      Biographies/essays                  77       15.4%
H      Miscellaneous                       30       6%
J      Science                             80       16%
K      General fiction                     29       5.8%
L      Mystery/detective fiction           24       4.8%
M      Science fiction                     6        1.2%
N      Adventure and martial arts fiction  29       5.8%
P      Romantic fiction                    29       5.8%
R      Humour                              9        1.8%
Total                                      500      100%
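Each proportion in the sampling frame is simply the category's share of the 500 samples. A minimal Python sketch that recomputes the shares from the counts; the dictionary and function names are illustrative and not part of the corpus distribution:

```python
# Sample counts per text category, copied from the sampling frame.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}


def proportions(samples):
    """Return each category's share of all samples, in percent (one decimal)."""
    total = sum(samples.values())
    return {code: round(100 * n / total, 1) for code, n in samples.items()}


if __name__ == "__main__":
    shares = proportions(SAMPLES)
    print("total samples:", sum(SAMPLES.values()))
    for code, share in shares.items():
        print(f"{code}: {SAMPLES[code]:>3} samples, {share}%")
```

Running the check against the table is a quick way to confirm that the 15 categories add up to the stated total.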
LCMC: Markup (XML)
Level  Code  Gloss                   Attribute  Value
1      text  Text type               TYPE       As per Table 2, Text category
                                     ID         As per Table 2, Code
2      file  Corpus file             ID         Text ID plus file number starting from 01
3      p     Paragraph               ---        ---
4      s     Sentence                n          Starting from 0001 onwards
5      w     Word                    POS        Part-of-speech tags as per the LCMC tagset
       c     Punctuation and symbol  POS        Part-of-speech tags as per the LCMC tagset
       gap   Omission                ---        ---
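Read top to bottom, the markup levels describe a nesting of text, file, p, s and then w or c elements. The fragment in the sketch below is a hand-made illustration of that nesting, not an excerpt from LCMC: the Chinese tokens, their tags, and the exact way the attributes are written out are assumptions based only on the table. The code uses Python's standard xml.etree.ElementTree module to collect (word, tag) pairs per sentence; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the markup table; not real corpus data.
SAMPLE = """<text ID="A" TYPE="Press reportage">
  <file ID="A01">
    <p>
      <s n="0001">
        <w POS="n">经济</w><w POS="v">发展</w><c POS="w">。</c>
      </s>
    </p>
  </file>
</text>"""


def read_sentences(xml_text):
    """Return (file ID, sentence number, [(token, POS), ...]) for every sentence."""
    root = ET.fromstring(xml_text)
    rows = []
    for file_el in root.iter("file"):
        for s_el in file_el.iter("s"):
            # Direct children of <s> are words (<w>) and punctuation (<c>).
            tokens = [(el.text, el.get("POS")) for el in s_el if el.tag in ("w", "c")]
            rows.append((file_el.get("ID"), s_el.get("n"), tokens))
    return rows


if __name__ == "__main__":
    for file_id, number, tokens in read_sentences(SAMPLE):
        print(file_id, number, tokens)
```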
LCMC: Annotations
• Word segmentation
• POS tagging
  – Applying the Peking University tagset
    • 26 Level 1 POS tags
    • 50 Level 2 POS tags
  – POS tagger (ICT Chinese Lexical Analyzing System)
    • Developed by the Institute of Computing Technology, Chinese Academy of Sciences
  – Automatic tagging with a precision rate of 97.16%
  – Post-editing improved the precision to over 98%
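For a rough sense of scale, the two precision figures can be converted into approximate error counts. This sketch assumes exactly 1,000,000 tagged tokens (the slides only say "one million words") and treats 98% as the lower bound after post-editing; the function name is illustrative.

```python
from decimal import Decimal


def expected_tag_errors(tokens, precision_percent):
    """Number of mis-tagged tokens implied by a precision rate given in percent."""
    error_rate = (Decimal(100) - Decimal(precision_percent)) / Decimal(100)
    return int(tokens * error_rate)


if __name__ == "__main__":
    TOKENS = 1_000_000  # assumption: corpus size taken at face value
    before = expected_tag_errors(TOKENS, "97.16")  # automatic tagging
    after = expected_tag_errors(TOKENS, "98")      # post-edited, upper bound on errors
    print(before, after, before - after)
```

Under these assumptions, post-editing removed at least the difference between the two counts; the true figure depends on the actual token count and the exact final precision.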
LCMC Level 1 POS tags
• a. adjective
• b. non-predicative adj.
• c. conjunction
• d. adverb
• e. interjection
• f. directional locality
• g. morpheme
• h. prefix
• i. idiom
• j. abbreviation
• k. suffix
• l. fixed expression
• m. numeral
• n. noun
• o. onomatopoeia
• p. preposition
• q. classifier
• r. pronoun
• s. space word
• t. time word
• u. auxiliary
• v. verb
• w. punctuation/symbol
• x. unclassified item
• y. modal particle
• z. descriptive
LCMC: corpus exploration tools
• Unicode-compliant, XML-aware corpus tools
  – WebConc designed for use with LCMC
    • http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
  – Xaira (XML-aware Sara)
    • Sara: SGML-aware Retrieval Application
      – Originally developed for use with the British National Corpus (BNC)
    • Known as Xara before beta version 1.06
    • Documentation available at http://www.oucs.ox.ac.uk/rts/xaira/
    • A tutorial available at the LCMC website
  – The WordSmith Tools version 4
    • Beta version available
      – http://www.lexically.net/wordsmith/version4/index.htm
Software demonstration
• Using Xaira to search LCMC
  – Query types: Quick query, word query (pattern), POS query, pattern query (regex), Query builder (e.g. a-n vs. a-de-n), etc.
  – Display mode: KWIC mode vs. sentence mode
  – Display format: Plain vs. XML
  – Status bar: Reference
  – Other useful features: distribution, sort, collocation, partition, user-defined stylesheets, etc.
• Using WebConc to access LCMC
  – http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
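The POS-based query types listed for Xaira, such as the Query builder contrast between a-n and a-de-n, can be illustrated without the tool itself. The Python sketch below is not how Xaira or WebConc work internally. It assumes a sentence is already available as (word, tag) pairs using the Level 1 tags, and it reads "de" as the particle 的 tagged u; the sample sentence, that tagging, and all function names are made up for illustration.

```python
# Hypothetical tagged sentence: (word, Level 1 POS tag) pairs.
SENTENCE = [
    ("这", "r"), ("是", "v"), ("美丽", "a"), ("的", "u"), ("花园", "n"), ("，", "w"),
    ("那", "r"), ("是", "v"), ("新", "a"), ("书", "n"), ("。", "w"),
]

# A pattern is a list of (word, tag) constraints; None means "anything".
A_N = [(None, "a"), (None, "n")]                  # adjective + noun
A_DE_N = [(None, "a"), ("的", "u"), (None, "n")]  # adjective + 的 + noun


def matches(token, spec):
    """True if a (word, tag) token satisfies a (word, tag) constraint."""
    word, pos = token
    want_word, want_pos = spec
    return (want_word is None or want_word == word) and (want_pos is None or want_pos == pos)


def find_pattern(tokens, pattern):
    """Return the start index of every run of tokens that matches the pattern."""
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        if all(matches(tokens[i + k], spec) for k, spec in enumerate(pattern)):
            hits.append(i)
    return hits


def kwic(tokens, start, length, width=3):
    """Format one hit as a key-word-in-context line with `width` tokens of context."""
    left = " ".join(w for w, _ in tokens[max(0, start - width):start])
    node = " ".join(w for w, _ in tokens[start:start + length])
    right = " ".join(w for w, _ in tokens[start + length:start + length + width])
    return f"{left} [{node}] {right}"


if __name__ == "__main__":
    for name, pattern in (("a-n", A_N), ("a-de-n", A_DE_N)):
        for start in find_pattern(SENTENCE, pattern):
            print(name, kwic(SENTENCE, start, len(pattern)))
```

Keeping the pattern as a list of per-token constraints mirrors the idea of building a query slot by slot; a sentence-mode display would simply print the whole token list instead of a fixed-width window.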
LCMC: Potential use
• Monolingual study
  – Studying modern Mandarin Chinese as a whole
  – Exploring variation across 15 text categories
• Contrastive study (in conjunction with FLOB/Frown)
  – Contrasting Chinese and BrE/AmE
  – Contrasting text categories in Chinese and English
LCMC: Availability
• Both the standard and Romanized versions are available free of charge for use in non-profit-making research
• Distributed by ELRA and Oxford Text Archive
• Searchable via WebConc on the corpus website
• The LCMC website
  – http://www.ling.lancs.ac.uk/corplang/lcmc
• The Chinese mirror site (the Chinese Academy of Social Sciences)
  – http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm
LCMC: Release notes and licensees
• Release notes
  – 06/2003: Corpus mounted on the website of Corpus-based Language Studies and announced at the UCREL website
  – 08/2003: Chinese mirror site for the corpus established, hosted by the Chinese Academy of Social Sciences, Beijing
  – 12/2003: Corpus release announced at CORPORA-list, ELSNET-list and CLUK-list
  – 03/2004: Corpus release publicised at the 4th Workshop on Asian Language Resources
  – 05/2004: Corpus taken over by ELRA and Oxford Text Archive
• Number of licensees as of 08/04/2004
  – 55 academic institutions and over 40 private and non-academic users
Related publications
• McEnery, A., Xiao, Z. & Mo, L. 2003. ‘Aspect marking in English and Chinese: Using the Lancaster Corpus of Mandarin Chinese for contrastive language study’. Literary and Linguistic Computing 18(4): 361-378.
• Xiao, Z. & McEnery, A. 2004. ‘A corpus-based two-level model of situation aspect’. Journal of Linguistics 40(2).
• Xiao, Z., McEnery, A., Baker, P. & Hardie, A. 2004. ‘Developing Asian language corpora: Standards and practice’. Proceedings of the 4th Workshop on Asian Language Resources, pp. 1-8. March 25, 2004. Sanya, China.
• Xiao, Z. & McEnery, A. Forthcoming. Aspect in Chinese. Amsterdam: Benjamins.
• Xiao, Z. & McEnery, A. Under review. ‘Near synonymy, collocation and semantic prosody: a cross-linguistic perspective’.