The Lancaster Corpus of Mandarin Chinese (LCMC): A corpus for monolingual and contrastive study

Tony McEnery
Richard Xiao

LREC 2004, Lisbon
27/05/2004


The LCMC Corpus: Aims

Built for the ESRC project Contrasting tense and aspect in English and Chinese (Grant Ref. RES-000220135)
  – See http://www.regard.ac.uk
A Chinese match for FLOB (BrE) and Frown (AmE)
A publicly available balanced corpus of Mandarin Chinese
Distributed free of charge for use in non-profit-making research


LCMC: Profile

One million words
Standard character and Romanized Pinyin versions
Texts from 1990-1993 (ca. 87% of samples from 1991-1992)
15 text categories
500 text samples
Major text provider: SSReader Digital Library, China
Unicode (UTF-8)
XML-conformant mark-up
Marked for paragraphs and sentences
POS-tagged (precision rate 98%+)


Major Chinese corpus resources (1)

Sinica Corpus of Mandarin Chinese
  – 5 million words of Mandarin as used in Taiwan
  – http://www.sinica.edu.tw/SinicaCorpus
PH corpus
  – Ca. 2 million words of newswire text (1990-1991)
  – Available at ftp://ftp.cogsci.ed.ac.uk/pub/chinese
  – POS version available at http://www.ling.lancs.ac.uk/corplang/
PFR People’s Daily Corpus
  – Newspaper text from People’s Daily 1998
  – Sample (01/98) available at http://icl.pku.edu.cn/Introduction/corpustagging.htm
  – Searchable at http://www.ling.lancs.ac.uk/corplang/


Major Chinese corpus resources (2)

Linguistic Variation in Chinese Speech Communities (LIVAC)
  – Text from newspapers and electronic media in six Chinese speech communities
  – http://www.livac.org/
Spoken Chinese Corpus of Situated Discourse (SCCSD)
  – See Gu, Y. 2002. ‘Towards an understanding of workplace discourse’ in C. Candlin (ed) Research and Practice in Professional Discourse (pp. 137-86). Hong Kong: City University of Hong Kong Press.
Three Mandarin corpora released by LDC
  – TREC, Gigaword and Callhome
  – See the LDC catalogue


Chinese corpora: A comparison

Corpus     POS-tagged  Balanced  Channel  Variety   Contrastive
LCMC       Yes         Yes       Written  Mainland  E–C
Sinica     Yes         Yes       Mixed    Taiwan    No
PH         No          No        Written  Mainland  No
PFR        Yes         No        Written  Mainland  No
LIVAC      No          No        Written  Mixed     C–C
SCCSD      No          Yes       Spoken   Mainland  No
TREC       No          No        Written  Mainland  No
Gigaword   No          No        Written  Mainland  No
Callhome   No          ?         Spoken   Mixed     No


LCMC: Sampling frame

Code   Text category                        Samples  Proportion
A      Press reportage                      44       8.8%
B      Press editorials                     27       5.4%
C      Press reviews                        17       3.4%
D      Religion                             17       3.4%
E      Skills/trades/hobbies                38       7.6%
F      Popular lore                         44       8.8%
G      Biographies/essays                   77       15.4%
H      Miscellaneous                        30       6%
J      Science                              80       16%
K      General fiction                      29       5.8%
L      Mystery/detective fiction            24       4.8%
M      Science fiction                      6        1.2%
N      Adventure and martial arts fiction   29       5.8%
P      Romantic fiction                     29       5.8%
R      Humour                               9        1.8%
Total                                       500      100%


LCMC: Markup (XML)

Level  Code  Gloss                   Attribute  Value
1      text  Text type               TYPE       As per Table 2 (Text category)
                                     ID         As per Table 2 (Code)
2      file  Corpus file             ID         Text ID plus file number starting from 01
3      p     Paragraph               ---        ---
4      s     Sentence                n          Starting from 0001 onwards
5      w     Word                    POS        Part-of-speech tags as per the LCMC tagset
       c     Punctuation and symbol  POS        Part-of-speech tags as per the LCMC tagset
       gap   Omission                ---        ---


LCMC: Annotations

Word segmentation
POS tagging
  – Applying the Peking University tagset
      26 Level 1 POS tags
      50 Level 2 POS tags
  – POS tagger (ICT Chinese Lexical Analyzing System)
      Developed by the Institute of Computing Technology, the Chinese Academy of Sciences
      Automatic tagging with a precision rate of 97.16%
      Post-editing improved the precision to over 98%


LCMC Level 1 POS tags

a  adjective               n  noun
b  non-predicative adj.    o  onomatopoeia
c  conjunction             p  preposition
d  adverb                  q  classifier
e  interjection            r  pronoun
f  directional locality    s  space word
g  morpheme                t  time word
h  prefix                  u  auxiliary
i  idiom                   v  verb
j  abbreviation            w  punctuation/symbol
k  suffix                  x  unclassified item
l  fixed expression        y  modal particle
m  numeral                 z  descriptive


LCMC: corpus exploration tools

Unicode-compliant, XML-aware corpus tools
WebConc, designed for use with LCMC
  – http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
Xaira (XML-aware Sara)
  – Sara: SGML-aware Retrieval Application, originally developed for use with the British National Corpus (BNC)
  – Known as Xara before beta version 1.06
  – Documentation available at http://www.oucs.ox.ac.uk/rts/xaira/
  – A tutorial is available at the LCMC website
WordSmith Tools version 4
  – Beta version available at http://www.lexically.net/wordsmith/version4/index.htm


Software demonstration

Using Xaira to search LCMC
  – Query types: quick query, word query (pattern), POS query, pattern query (regex), query builder (e.g. a-n vs. a-de-n), etc.
  – Display mode: KWIC mode vs. sentence mode
  – Display format: plain vs. XML
  – Status bar: reference
  – Other useful features: distribution, sort, collocation, partition, user-defined stylesheets, etc.
Using WebConc to access LCMC
  – http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl


LCMC: Potential use

Monolingual study
  – Studying modern Mandarin Chinese as a whole
  – Exploring variation across 15 text categories
Contrastive study (in conjunction with FLOB/Frown)
  – Contrasting Chinese and BrE/AmE
  – Contrasting text categories in Chinese and English


LCMC: Availability

Both the standard and Romanized versions are available free of charge for use in non-profit-making research
Distributed by ELRA and the Oxford Text Archive
Searchable via WebConc on the corpus website
The LCMC website
  – http://www.ling.lancs.ac.uk/corplang/lcmc
The Chinese mirror site (the Chinese Academy of Social Sciences)
  – http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm


LCMC: Release notes and licensees

Release notes
  – 06/2003: Corpus mounted on the website of Corpus-based Language Studies and announced at the UCREL website
  – 08/2003: Chinese mirror site for the corpus established, hosted by the Chinese Academy of Social Sciences, Beijing
  – 12/2003: Corpus release announced at CORPORA-list, ELSNET-list and CLUK-list
  – 03/2004: Corpus release publicised at the 4th Workshop on Asian Language Resources
  – 05/2004: Corpus taken over by ELRA and the Oxford Text Archive
Number of licensees as of 08/04/2004
  – 55 academic institutions and over 40 private and non-academic users


Related publications

McEnery, A., Xiao, Z. & Mo, L. 2003. ‘Aspect marking in English and Chinese: Using the Lancaster Corpus of Mandarin Chinese for contrastive language study’. Literary and Linguistic Computing 18(4): 361-378.
Xiao, Z. & McEnery, A. 2004. ‘A corpus-based two-level model of situation aspect’. Journal of Linguistics 40(2).
Xiao, Z., McEnery, A., Baker, P. & Hardie, A. 2004. ‘Developing Asian language corpora: Standards and practice’. Proceedings of the 4th Workshop on Asian Language Resources, pp. 1-8. March 25, 2004. Sanya, China.
Xiao, Z. & McEnery, A. Forthcoming. Aspect in Chinese. Amsterdam: Benjamins.
Xiao, Z. & McEnery, A. Under review. ‘Near synonymy, collocation and semantic prosody: a cross-linguistic perspective’.
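

Illustration: reading the LCMC markup and running an a-n vs. a-de-n query

The markup levels (text, file, p, s, w, c) and the query-builder example (a-n vs. a-de-n) from the slides above can be sketched in a few lines of Python. This is only a minimal illustration, not the way Xaira or WebConc work internally. The XML fragment is invented: its nesting, the attribute spellings, the POS="w" value on the c element and the two sentences are assumptions inferred from the markup table, and tagging 的 (de) as u (auxiliary) is likewise an assumption. The helper names tagged_sentences and find_pos_pattern are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names follow the markup table
# (text/file/p/s/w/c); the sentences and their tags are invented for illustration.
SAMPLE = """\
<text TYPE="Press reportage" ID="A">
  <file ID="A01">
    <p>
      <s n="0001"><w POS="r">他</w><w POS="v">买</w><w POS="a">新</w><w POS="n">书</w><c POS="w">。</c></s>
      <s n="0002"><w POS="r">她</w><w POS="v">穿</w><w POS="a">漂亮</w><w POS="u">的</w><w POS="n">衣服</w><c POS="w">。</c></s>
    </p>
  </file>
</text>
"""


def tagged_sentences(xml_text):
    """Yield (file_id, sentence_number, [(word, pos), ...]) for each <s> element."""
    root = ET.fromstring(xml_text)
    for file_el in root.iter("file"):
        for s in file_el.iter("s"):
            # Keep only <w> children; <c> (punctuation) and <gap> are skipped.
            tokens = [(w.text, w.get("POS")) for w in s if w.tag == "w"]
            yield file_el.get("ID"), s.get("n"), tokens


def find_pos_pattern(tokens, pattern):
    """Return the word sequences matching a pattern of (pos, word_or_None) items."""
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(pos == p and (form is None or form == word)
               for (word, pos), (p, form) in zip(window, pattern)):
            hits.append([word for word, _ in window])
    return hits


# a-n: adjective directly followed by noun; a-de-n: adjective + 的 + noun.
A_N = [("a", None), ("n", None)]
A_DE_N = [("a", None), ("u", "的"), ("n", None)]

for file_id, n, tokens in tagged_sentences(SAMPLE):
    print(file_id, n,
          "a-n:", find_pos_pattern(tokens, A_N),
          "a-de-n:", find_pos_pattern(tokens, A_DE_N))
```

On this invented fragment the first sentence yields one a-n hit and the second one a-de-n hit. A real query over the corpus files would need to follow the released markup exactly, which may differ in detail from the assumptions made here.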
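

Illustration: checking the sampling-frame proportions

The proportions in the LCMC sampling frame follow directly from the sample counts (samples divided by the 500-sample total). The short Python sketch below recomputes them from the counts given on the sampling-frame slide. The variable names are illustrative only, and grouping codes A, B and C as "press" is simply read off the category names.

```python
# Sample counts per text category code, copied from the sampling-frame slide.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total = sum(SAMPLES.values())

# Proportion of each category as a percentage, rounded to one decimal place.
proportions = {code: round(100 * n / total, 1) for code, n in SAMPLES.items()}

# Combined share of the three press categories (A, B, C).
press_samples = sum(SAMPLES[code] for code in "ABC")
press_share = round(100 * press_samples / total, 1)

print("Total samples:", total)
for code, pct in proportions.items():
    print(f"{code}: {SAMPLES[code]:>3} samples, {pct}%")
print(f"Press (A-C): {press_samples} samples, {press_share}%")
```

The recomputed figures agree with the table: the counts sum to 500, and each percentage matches the value on the slide.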