Association for Computational Linguistics and Chinese Language Processing (ACLCLP)

Established date: March 16, 1990

Membership:
Individual Members: 255 (111 life members, 36 regular members, 111 student members)
Group Members: 11 (including 2 permanent institutional members)

Membership Benefit: Membership includes the IJCLCLP journal (4 issues per volume) and bimonthly newsletters, reduced registration fees at ACLCLP-sponsored conferences, and discounts on ACLCLP publications. ISCA and ACLCLP have signed an agreement under which a person who is a member of both organizations can apply for a 25% discount on the membership fee.

Special Interest Groups:
SIGIR: Special Interest Group on Information Retrieval
SIGSLP: Special Interest Group on Spoken Language Processing
SIGCALL: Special Interest Group on Computer Assisted Language Learning

Publications:
International Journal of the Association for Computational Linguistics and Chinese Language Processing (IJCLCLP): This journal was founded in August 1996 and was originally published twice a year. To promote research and technical advancement and to play a more active role in this area, the journal has published four issues per year since 2005. It covers all aspects of computational linguistics and Chinese speech and language processing.
Newsletter: Bimonthly

Academic Activities:
Annually:
1. Conference on Computational Linguistics and Speech Processing (ROCLING)
   Date: August or September
   Format: paper presentations and invited talks
   Number of participants: 180
2. Speech Processing Workshop
   Date: March or April
   Format: invited talks
   Number of participants: 120
3. Information Retrieval Workshop
   Date: November or December
   Format: invited talks
   Number of participants: 100
4. New Generation Speech Science and Technologies (NeGSST) Seminar
   Date: January and August
   Format: presentations by graduate students
   Number of participants: 100

Resources

Academia Sinica Corpora

News Corpus: Part of the corpus program developed and maintained by the CKIP group at Academia Sinica, the News Corpus includes 14 million words. The CKIP group has been collecting Chinese texts since 1990, mainly from newspapers and magazines. The resources in this database were developed by the Chinese Knowledge Information Processing Group, Institute of Information Science, Academia Sinica, and the ACLCLP is authorized to release it.

Sinica Balanced Corpus: This is the first balanced Chinese corpus with part-of-speech tagging. The size of the corpus is 5 million words. Each text in the corpus is classified and marked according to five criteria: genre, style, mode, topic, and source. The feature values of these classifications are assigned in a hierarchy. Subcorpora can be defined with a specific set of attributes to serve different research purposes. Texts in the corpus are segmented according to the word segmentation standard proposed by the ROC Computational Linguistic Society. Each segmented word is tagged with its part-of-speech. Linguistic patterns and language structures can be extracted from the tagged corpus via a corpus inspection program that can filter the data, generate statistics, sort, and identify collocations. The resources in this database were developed by the Chinese Knowledge Information Processing Group, Institute of Information Science, Academia Sinica, and the ACLCLP is authorized to release it.

Chinese Electronic Dictionary: The Chinese Electronic Dictionary is an electronic lexicon for Mandarin Chinese containing 88,000 entries. Each entry contains:
1. print form (Chinese characters),
2. word frequency (based on a 5-million-word corpus),
3. pronunciation (National Phonetic Alphabet, Zhu4yin1fu2hao4, and Chinese Phonetic Alphabet, Han4Yu3Pin1Yin1),
4. syntactic category (based on the CKIP classification of 198 categories),
5. semantic feature (based on the CKIP classification of 123 concept nodes for nouns).
The resources in this database were developed by the Chinese Knowledge Information Processing Group, Institute of Information Science, Academia Sinica, and the ACLCLP is authorized to release it.

Sinica Treebank: Sinica Treebank 3.0 contains 6 files, 61,087 syntactic tree structures, and 361,834 words. The tree structures were extracted from the Sinica Corpus, and every structure is segmented and parsed. Each segmented word of a tree structure is tagged with its part-of-speech and argument. Sinica Treebank 3.0 is provided free on the website for syntactic and semantic research use. 1,000 syntactic tree structures are available. The resources in this database were developed by the Chinese Knowledge Information Processing Group, Institute of Information Science, Academia Sinica, and the ACLCLP is authorized to release it.

Sinica Bilingual WordNet: The Sinica Bilingual WordNet is a Mandarin-English bilingual database covering 100,000 English synsets. The database was developed on the framework of the English WordNet and on language usage in Taiwan. The information it provides includes Mandarin-English cross-language information transformation, identification of senses and the linking relations between them, and the specific domain and frequency of usage of the senses. The database enables information from different sources to become interoperable. The resources in this database come mainly from data jointly developed by the Corpus Linguistics Group, Institute of Linguistics, and the Mandarin Knowledge Information Processing Group, Institute of Information Science, Academia Sinica, along with WordNet by Princeton University and data jointly developed by Global View Co. Ltd., Taiwan, and Academia Sinica. The copyright of all original content in this database belongs to Academia Sinica and Global View Co. Ltd., except for the English monolingual database WordNet, which was developed by and is used under authorization from Princeton University. All open source data are stored in both plain text and XML format.

Sinica MCDC: The Sinica MCDC contains sound files and transcriptions of eight conversations (4.7 GB). The sound files are segmented into sub-files of approximately three minutes each, and the transcripts contain time codes for all speaker turns. The conversations are orthographically transcribed in Chinese characters and in Pinyin, including discourse particles, markers, and pauses. They are annotated according to specifically defined spontaneous speech phenomena: 1) disfluency, 2) particular pronunciation, 3) discourse-related items, and 4) sociolinguistic phenomena. The annotations are not contained in this corpus. The Sinica MCDC is the result of a research project funded by the Institute of Linguistics, Academia Sinica, and the ACLCLP is authorized to release it.

Speech Corpora

MAT-2000: This is a product of the joint validation project conducted by the Association for Computational Linguistics and Chinese Language Processing and Philips Innovation Center-Taipei. The original database is MAT-2400, in which speech data were collected through telephone networks in Taiwan. The database contains speech data provided by 2232 speakers (1227 males and 1005 females). The MAT Speech Database is the result of a research program subsidized by the National Science Council of the Executive Yuan, and the ACLCLP is authorized to release it.

TCC-300: This is a collection of microphone speech databases produced by National Taiwan University, National Cheng Kung University, and National Chiao Tung University. The speech data of each university were provided by 100 speakers (50 males and 50 females). In total, TCC-300 contains speech data from 300 speakers. The ACLCLP is authorized to release it.

EAT: The English Across Taiwan (EAT) corpus, which contains three groups of channels (PSTN, MIC16K, and GSM), was stored on three DVDs. The PSTN and GSM corpora were stored on the same DVD, which is labeled “PSTN +GSM”. Because the sampling rate of the MIC16K speech data was high, the resulting storage requirement was large. The MIC16K speech was therefore stored on two DVDs labeled “Mic16K English” and “Mic16K NonEnglish”, for the English Department and non-English Department data, respectively.

MATBN: The MATBN Mandarin Chinese broadcast news corpus is a product of a joint project sponsored by the National Science Council, Taiwan. It contains a total of 198 one-hour news shows from the Public Television Service Foundation, Taiwan, with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. The copyright of all original content in this database belongs to National Chiao Tung University, and the ACLCLP is authorized to release it.

Other Corpora

CIRB030: An information retrieval (IR) test collection is used to evaluate the performance of IR systems. It is a useful tool for investigating both systems under development and systems that are already developed. The CIRB030 (Chinese Information Retrieval Benchmark, version 3.0) test collection is one such collection, designed for the evaluation of Chinese document retrieval.
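As a rough illustration of how a test collection of this kind is used, the sketch below scores one ranked result list against a set of relevance judgments. It is only a minimal example under stated assumptions: the document identifiers and judgments are hypothetical, the metrics shown (precision at k and average precision) are common IR measures chosen for illustration, and nothing here reflects the actual file format or official evaluation procedure of CIRB030.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k positions that hold a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents.
    Assumes the ranked list contains no duplicate document IDs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical system output and relevance judgments for a single query.
ranked_results = ["doc3", "doc7", "doc1", "doc9", "doc4"]
relevant_docs = {"doc3", "doc1", "doc8"}

p_at_3 = precision_at_k(ranked_results, relevant_docs, 3)
ap = average_precision(ranked_results, relevant_docs)
print(f"P@3 = {p_at_3:.3f}, AP = {ap:.3f}")
```

Averaging `average_precision` over all queries in a collection would give mean average precision, one common way to compare two retrieval systems on the same benchmark.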