Association for Computational Linguistics and
Chinese Language Processing (ACLCLP)
Established date: March 16, 1990
Membership:
• Individual Members: 255 (111 life members, 36 regular members, 111 student
members)
• Group Members: 11 (including 2 permanent institutional members)
Membership Benefits:
• Membership includes the IJCLCLP journal (4 issues per volume) and bimonthly
newsletters, reduced registration fees at ACLCLP-sponsored conferences, and
discounts on ACLCLP publications.
• ISCA and ACLCLP have signed an agreement under which a person who is a member
of both organizations can apply for a 25% membership fee discount.
Special Interest Groups:
• SIGIR: Special Interest Group on Information Retrieval
• SIGSLP: Special Interest Group on Spoken Language Processing
• SIGCALL: Special Interest Group on Computer Assisted Language Learning
Publications:
• International Journal of the Association for Computational Linguistics and Chinese
Language Processing (IJCLCLP): This journal was founded in August 1996 and
was originally published twice a year. To promote research and technical
advancement and to play a more active role in this area, it has published four
issues per year since 2005. The journal covers all aspects of computational
linguistics and Chinese speech and language processing.
• Newsletter: Bimonthly
Academic Activities:
Annually:
1. Conference on Computational Linguistics and Speech Processing (ROCLING)
   Date: August or September
   Format: paper presentations and invited talks
   Number of participants: 180
2. Speech Processing Workshop
   Date: March or April
   Format: invited talks
   Number of participants: 120
3. Information Retrieval Workshop
   Date: November or December
   Format: invited talks
   Number of participants: 100
4. New Generation Speech Science and Technologies (NeGSST) Seminar
   Date: January and August
   Format: presentations by graduate students
   Number of participants: 100
Resources
Academia Sinica Corpora
• News Corpus: Developed and maintained by the CKIP group at Academia Sinica,
the News Corpus contains 14 million words. The CKIP group has been collecting
Chinese texts since 1990, mainly from newspapers and magazines.
The resources in this database were developed by the Chinese Knowledge
Information Processing Group, Institute of Information Science, Academia Sinica,
and the ACLCLP is authorized to release them.
• Sinica Balanced Corpus: This is the first balanced Chinese corpus with
part-of-speech tagging. It contains 5 million words. Each text in the corpus is
classified and marked according to five criteria: genre, style, mode, topic, and
source. The feature values of these classifications are organized in a
hierarchy. Subcorpora can be defined with a specific set of attributes to serve
different research purposes. Texts in the corpus are segmented according to the
word segmentation standard proposed by the ROC Computational Linguistic
Society. Each segmented word is tagged with its part of speech. Linguistic
patterns and language structures can be extracted from the tagged corpus with a
corpus inspection program that can filter the data, generate statistics, sort,
and identify collocations.
The resources in this database were developed by the Chinese Knowledge
Information Processing Group, Institute of Information Science, Academia Sinica,
and the ACLCLP is authorized to release them.
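
As a rough illustration of how a subcorpus might be defined from a set of attributes, the following minimal Python sketch filters text records on the five classification criteria named above. The record layout, the function name select_subcorpus, and all sample values are assumptions made for this example; they are not the corpus's actual file format or inspection program, and the hierarchy of feature values is ignored for simplicity.

```python
from dataclasses import dataclass


# Hypothetical record type: the five attribute names come from the text above,
# but the field layout and the sample values below are illustrative assumptions.
@dataclass(frozen=True)
class TextRecord:
    text_id: str
    genre: str
    style: str
    mode: str
    topic: str
    source: str


def select_subcorpus(records, **criteria):
    """Return the records whose attributes match every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Made-up sample records, not taken from the Sinica Corpus.
records = [
    TextRecord("t1", "report", "narrative", "written", "society", "newspaper"),
    TextRecord("t2", "commentary", "argumentative", "written", "politics", "magazine"),
    TextRecord("t3", "report", "narrative", "spoken", "society", "broadcast"),
]

written_reports = select_subcorpus(records, genre="report", mode="written")
print([r.text_id for r in written_reports])
```

Calling select_subcorpus with no criteria returns every record, so the same helper can serve as the starting point for any attribute combination.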
• Chinese Electronic Dictionary: The Chinese Electronic Dictionary is an electronic
lexicon for Mandarin Chinese containing 88,000 entries. Each entry contains:
1. print form (Chinese characters),
2. word frequency (based on a 5-million-word corpus),
3. pronunciation (National Phonetic Alphabet, Zhu4yin1fu2hao4, and Chinese
Phonetic Alphabet, Han4Yu3Pin1Yin1),
4. syntactic category (based on the CKIP classification of 198 categories),
5. semantic feature (based on the CKIP classification of 123 concept nodes for
nouns).
The resources in this database were developed by the Chinese Knowledge
Information Processing Group, Institute of Information Science, Academia Sinica,
and the ACLCLP is authorized to release them.
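
The five fields listed above can be pictured as a simple record. The Python sketch below is only an illustration: the class name LexiconEntry, the field names, and the sample values (including the frequency placeholder and the category label) are assumptions for this example and do not describe the dictionary's actual distribution format.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical in-memory layout for one dictionary entry; the five fields
# mirror the list above, but the names and types are assumptions.
@dataclass
class LexiconEntry:
    print_form: str                          # Chinese characters
    frequency: int                           # count in the reference corpus
    zhuyin: str                              # National Phonetic Alphabet
    pinyin: str                              # Chinese Phonetic Alphabet, tone digits
    syntactic_category: str                  # one of the CKIP categories
    semantic_feature: Optional[str] = None   # concept node, nouns only


# Illustrative sample entry; the frequency and category values are
# placeholders, not data taken from the dictionary.
entry = LexiconEntry(
    print_form="語言",
    frequency=0,
    zhuyin="ㄩˇ ㄧㄢˊ",
    pinyin="yu3 yan2",
    syntactic_category="Na",
)

# Index entries by print form for direct lookup.
lexicon = {entry.print_form: entry}
print(lexicon["語言"].pinyin)
```

The semantic feature is optional in this sketch because the text states that the concept-node classification applies to nouns.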
• Sinica Treebank: Sinica Treebank 3.0 contains 6 files, 61,087 syntactic tree
structures, and 361,834 words. The tree structures were extracted from the Sinica
Corpus, and every structure is segmented and parsed. Each segmented word of a
tree structure is tagged with its part of speech and argument. Sinica Treebank
3.0 is provided free on the website for syntactic and semantic research use;
1,000 syntactic tree structures are available. The resources in this database
were developed by the Chinese Knowledge Information Processing Group, Institute
of Information Science, Academia Sinica, and the ACLCLP is authorized to release
them.
• Sinica Bilingual WordNet: The Sinica Bilingual WordNet is a Mandarin-English
bilingual database covering 100,000 English synsets. The database was developed
on the framework of the English WordNet and on language usage in Taiwan. It
provides Mandarin-English cross-language information transformation, the
identification of senses and the linking relations between them, and the
specific domain and usage frequency of each sense. The database makes
information from different sources interoperable. Its resources come mainly from
data jointly developed by the Corpus Linguistics Group, Institute of Linguistics,
and the Mandarin Knowledge Information Processing Group, Institute of
Information Science, Academia Sinica, together with WordNet from Princeton
University and data jointly developed by Global View Co. Ltd., Taiwan, and
Academia Sinica. The copyright of all original content in this database belongs
to Academia Sinica and Global View Co. Ltd., except for the English monolingual
database WordNet, which was developed by and is used under authorization from
Princeton University. All open-source data are stored in both plain text and
XML format.
• Sinica MCDC: The Sinica MCDC contains sound files and transcriptions of eight
conversations (4.7 GB). The sound files are segmented into sub-files of
approximately three minutes each, and the transcripts contain time codes for all
speaker turns. The conversations are orthographically transcribed in Chinese
characters and in Pinyin, including discourse particles, markers, and pauses.
They are annotated according to specifically defined spontaneous speech
phenomena: 1) disfluency, 2) particular pronunciation, 3) discourse-related
items, and 4) sociolinguistic phenomena. The annotations are not contained in
this corpus. The Sinica MCDC is the result of a research project funded by the
Institute of Linguistics, Academia Sinica, and the ACLCLP is authorized to
release it.
Speech Corpora
• MAT-2000: This is a product of the joint validation project conducted by the
Association for Computational Linguistics and Chinese Language Processing and
Philips Innovation Center-Taipei. The original database is MAT-2400, for which
speech data were collected through telephone networks in Taiwan. The database
contains speech data provided by 2232 speakers (1227 males and 1005 females).
The MAT Speech Database is the result of a research program subsidized by the
National Science Council of the Executive Yuan, and the ACLCLP is authorized to
release it.
• TCC-300: This is a collection of microphone speech databases produced by
National Taiwan University, National Cheng Kung University, and National Chiao
Tung University. Each university provided speech data from 100 speakers
(50 males and 50 females), so TCC-300 contains speech data from 300 speakers in
total. The ACLCLP is authorized to release it.
• EAT: The English Across Taiwan (EAT) corpus contains three groups of
channels, PSTN, MIC16K, and GSM, and is stored on three DVDs. The PSTN and GSM
corpora are stored on the same DVD, which is labeled "PSTN +GSM". Because the
sampling rate of the MIC16K speech data is high, its storage requirement is
large. The MIC16K speech is therefore stored on two DVDs, labeled
"Mic16K English" and "Mic16K NonEnglish", for the English Department and
non-English Department data, respectively.
• MATBN: The MATBN Mandarin Chinese broadcast news corpus is a product of a
joint project sponsored by the National Science Council, Taiwan. It contains a total
of 198 one-hour news shows from the Public Television Service Foundation,
Taiwan, with corresponding transcripts. The primary purpose of this collection is
to provide training and testing data for continuous speech recognition evaluation
in the broadcast news domain. The copyright of all original content in this
database belongs to National Chiao Tung University, and the ACLCLP is authorized
to release it.
Other Corpora
• CIRB030: An information retrieval (IR) test collection is used to evaluate the
performance of IR systems. It is a helpful and powerful tool for investigating
both systems under development and finished systems. CIRB030 (Chinese
Information Retrieval Benchmark, version 3.0) is one such test collection,
designed for evaluating Chinese document retrieval.
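
To make the idea of evaluating an IR system against a test collection concrete, the Python sketch below scores one ranked result list against a set of relevance judgments using precision at k and average precision. These two measures, the function names, and the document identifiers are assumptions chosen for illustration; the text above does not say which measures or file formats are used with CIRB030.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


# Made-up system output and relevance judgments for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # 0.5
```

Averaging average_precision over all queries in a collection gives a single score that allows two systems to be compared on the same topics and judgments.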