NLP memo #2
Natural Language Processing
William S-Y. Wang
September 10, 2005

Occasionally I will communicate with my class TRA-7204 via these NLP memos. These memos will be available at two websites:
www.cuhk.edu.hk/tra/macat/nlp/
www.ee.cuhk.edu.hk/~wsywang/

In NLP memo #1, I gave a very brief discussion of the term NLP, and of how students will be assessed in this course. In addition to these memos, I will also occasionally make PowerPoint files available.

My office is in 229 Ho Sing Hang Engineering Bldg, tel. 26098456, which can receive voice messages. Typically, I plan to go to my office after each class for an hour or so for conversation with students. Since I will be traveling quite a bit this semester, email is the surest way to reach me: wsywang@ee.cuhk.edu.hk.

Since a major purpose of the course is to encourage research in the area of natural language processing, students should begin exploring the various resources as early as possible in order to identify a project of interest to them. Here are some useful websites worth exploring. When using the internet for study and research purposes, it is important to keep in mind that the quality and reliability of the resources may be quite uneven.

- http://www.ethnologue.com/ is maintained by the Summer Institute of Linguistics, a missionary organization. It is often associated with Barbara Grimes. It gives the most up-to-date global picture of the some 6,000 languages in the world, based on the fieldwork of numerous linguists.

- http://www.ldc.upenn.edu/ points to the massive materials of the Linguistic Data Consortium, maintained by the University of Pennsylvania. One of its founders is Mark Liberman. The Digital Signal Processing Laboratory of the CUHK EE dept has a complete subscription to these materials, under the care of Mr. Arthur Luk.

- http://ehl.santafe.edu/ is aimed at revealing the evolution of human language, and does this by compiling large etymological dictionaries.
It is based at the Santa Fe Institute, under the direction of Murray Gell-Mann, a Nobel laureate in physics, and involves Sergei Starostin [Moscow] and Merritt Ruhlen [Stanford]. This database contains as a subset a corpus of pronunciations in some 20 Chinese dialects, called DOC, originally compiled at Berkeley under the direction of W. S-Y. Wang.

- http://wordnet.princeton.edu/ was started under the direction of the psychologist George Miller, and contains a wealth of grammatical and semantic information on English words.

- http://childes.psy.cmu.edu/ The Child Language Data Exchange System is directed by the psychologist Brian MacWhinney of Carnegie Mellon University. It contains transcribed data of children learning their first language, in several different languages.

- http://www.elra.info/ European Language Resources Association

- http://helmer.aksis.uib.no/icame.html International Computer Archive of Modern English

- http://ota.ahds.ac.uk/ Oxford Text Archive

- http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html The Brown University corpus of American English was initiated by two linguists, Nelson Francis and Henry Kucera. It is perhaps the first such large-scale corpus.

- http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#bnc The British National Corpus.

Organization of the course.
[# indicates that the reading is downloadable.]

I. 9.10 The nature of language, and the nature of translation.
    # Y. R. Chao. Dimensions of fidelity in translation, with special reference to Chinese. 1967.
    # W. S-Y. Wang. The Chinese language. Scientific American 1973.
II. 9.17
III. 9.24
10.1 National Day
IV. 10.8
V. 10.15
By Friday 10.21 at the latest, each student should submit by email a project prospectus on what he or she wishes to work on.
VI. 10.22
By Friday 10.28 at the latest, students should have received approval on their prospectus by consultation with the instructor, by email, telephone, or in person.
VII. 10.29 The sounds of language: how they are produced, classified, and perceived.
VIII. 11.5 Speech technology 1 [Dr. PENG Gang]. Acoustic properties of speech sounds as revealed by computer analysis, PRAAT software; statistical methods in speech recognition, Hidden Markov Models.
IX. 11.12 Speech technology 2 [Dr. PENG Gang]. Databases for speech technology: Cantonese, Putonghua, and English. Review quiz on materials covered in lectures.
X. 11.19 Student presentations on projects.
XI. 11.26 Student presentations on projects.
XII. 12.3 Student presentations on projects.
Final form of project due.

Ideally, your project should be based on some topic that has interested you for some time. Hopefully, your project report will reflect enough work and original thinking to be accepted for publication by a major journal. However, if you are looking around for ideas, here are some possibilities to get you started:

[1] Hong Kong is a multilingual society, with three major languages [Cantonese, Putonghua, and English] and many other languages, with fewer speakers, from South Asia and Southeast Asia. One often hears a great deal of code switching and code mixing on radio and TV, and sees them in written materials, such as cartoons. What is the nature of the problems people face in such a context? How are these problems different in nature for the computer?

[2] A major impetus for a language to change is contact with other languages. Because of its complex sociolinguistic context, Hong Kong Cantonese is undergoing rapid change at many levels of its structure, presumably differently from Guangzhou Cantonese. What are the major changes going on, and how does one go about studying such processes?

[3] The most difficult problem facing computer analysis of texts, spoken or written, is the very high degree of ambiguity in any corpus. Parsing programs that have been developed in natural language processing are notoriously unsuccessful.
What are the major types of ambiguity in Cantonese, and how may some of them be resolved by computer?

[4] Choose two pieces of text, either in Chinese or in English, one from literature and one from natural science. Translate these two pieces into the other language. What are the different difficulties you encounter with these two genres? How would you relate your efforts to the three dimensions of translation discussed by Y. R. Chao?

[5] Explore some online corpus, such as WordNet, and perform some semantic analysis by algorithm.

[6] Explore some online corpus, such as DOC, and perform some phonological analysis by algorithm.

[7] Using some phonetic software, such as PRAAT, perform some phonetic analysis on any linguistic problem of theoretical interest.