NLP – Natural Language Processing

NLP memo #2
Natural Language Processing
William S-Y. Wang
September 10, 2005
Occasionally I will communicate with my class TRA-7204 via these NLP
memos. These memos will be available at two websites:
www.cuhk.edu.hk/tra/macat/nlp/
www.ee.cuhk.edu.hk/~wsywang/
In NLP memo #1, I gave a very brief discussion of the term NLP and of how
students will be assessed in this course. In addition to these memos, I will
also occasionally make PowerPoint files available.
My office is in Room 229, Ho Sin-Hang Engineering Bldg, tel. 2609 8456, which
can receive voice messages. I plan to be in my office for an hour or so after
each class for conversation with students. Since I will be traveling quite a
bit this semester, email is the surest way to reach me:
wsywang@ee.cuhk.edu.hk.
Since a major purpose of the course is to encourage research in the area of
natural language processing, students should begin exploring the various
resources as early as possible in order to identify a project of interest to them.
Here are some useful websites worth exploring. When using the internet for
study and research purposes, it is important to keep in mind that the quality
and reliability of the resources may be quite uneven.
-http://www.ethnologue.com/
is maintained by the Summer Institute of Linguistics, a missionary organization. It
is often associated with Barbara Grimes. It gives the most up-to-date global picture
of the some 6,000 languages in the world, based on the fieldwork of numerous
linguists.
-http://www.ldc.upenn.edu/
points to the massive materials of the Linguistic Data Consortium,
maintained by the University of Pennsylvania. One of its founders is Mark
Liberman. The Digital Signal Processing Laboratory of the CUHK EE dept
has a complete subscription to these materials, under the care of Mr. Arthur
Luk.
-http://ehl.santafe.edu/
is aimed at revealing the evolution of human language, and does this by
compiling large etymological dictionaries. It is based at the Santa Fe
Institute, under the direction of Murray Gell-Mann, a Nobel laureate in
physics, and involves Sergei Starostin [Moscow] and Merritt Ruhlen
[Stanford]. This database contains as a subset a corpus of pronunciations in
some 20 Chinese dialects, called DOC, originally compiled at Berkeley
under the direction of W.S-Y.Wang.
-http://wordnet.princeton.edu/
was started under the direction of the psychologist George Miller, and
contains a wealth of grammatical and semantic information on English
words.
-http://childes.psy.cmu.edu/
The Child Language Data Exchange System is directed by the psychologist Brian
MacWhinney of Carnegie Mellon University. It contains transcribed data from
children acquiring their first language, covering several different languages.
-http://www.elra.info/
European Language Resources Association
-http://helmer.aksis.uib.no/icame.html
International Computer Archive of Modern English
-http://ota.ahds.ac.uk/
Oxford Text Archive
-http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html
The Brown University corpus of American English was initiated by two linguists,
Nelson Francis and Henry Kucera. It is perhaps the first such large-scale
corpus.
-http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#bnc
The British National Corpus.
Organization of the course. [# indicates that the reading is downloadable.]
I.     9.10   The nature of language, and the nature of translation.
              #Y.R.Chao, Dimensions of fidelity in translation, with special
              reference to Chinese. 1967.
              #W.S-Y.Wang. The Chinese language. Scientific American 1973.
II.    9.17
III.   9.24
       10.1   National Day
IV.    10.8
V.     10.15
              By Friday 10.21 at the latest, students should submit by
              email a project prospectus on what they wish to work on.
VI.    10.22
              By Friday 10.28 at the latest, students should have
              received approval on their prospectus by consultation
              with the instructor, by email, telephone, or in person.
VII.   10.29  The sounds of language – how they are produced,
              classified, and perceived.
VIII.  11.5   Speech technology 1 [Dr. PENG Gang]
              Acoustic properties of speech sounds as revealed by computer
              analysis, PRAAT software; statistical methods in speech
              recognition, Hidden Markov Models.
IX.    11.12  Speech technology 2 [Dr. PENG Gang]
              Data bases for speech technology: Cantonese, Putonghua,
              and English.
              Review quiz on materials covered in lectures.
X.     11.19  Student presentations on projects.
XI.    11.26  Student presentations on projects.
XII.   12.3   Student presentations on projects.
              Final form of project due.
Ideally, your project should be based on some topic that has interested you
for some time. My hope is that your project report will rest on enough work
and original thinking to be accepted for publication by a major
journal. However, if you are looking around for ideas, here are some
possibilities to get you started:
[1] Hong Kong is a multilingual society, with three major languages
[Cantonese, Putonghua, and English] and many other languages with fewer
speakers from South Asia and Southeast Asia. One often hears a great deal
of code switching and code mixing on radio and TV, and sees them in written
materials, such as cartoons. What is the nature of the problems people face
in such a context? How are these problems different in nature for the
computer?
[2] A major impetus for a language to change is contact with other languages.
Because of its complex sociolinguistic context, Hong Kong Cantonese is
undergoing rapid change at many levels of its structure, presumably
differently from Guangzhou Cantonese. What are the major changes going
on, and how does one go about studying such processes?
[3] The most difficult problem facing computer analysis of texts, spoken or
written, is the very high degree of ambiguity in any corpus. Parsing
programs that have been developed in natural language processing are
notoriously unsuccessful. What are the major types of ambiguity in
Cantonese, and how may some of them be resolved by computer?
[4] Choose two pieces of text, either in Chinese or in English, one from
literature and one from natural science. Translate these two pieces into the
other language. What are the different difficulties you encounter with these
two genres? How would you relate your efforts to the three dimensions of
translation discussed by Y.R.Chao?
[5] Explore some online corpus, such as WordNet, and perform some
semantic analysis by algorithm.
[6] Explore some online corpus, such as DOC, and perform some
phonological analysis by algorithm.
[7] Use some phonetic software, such as PRAAT, to perform some
phonetic analysis on any linguistic problem of theoretical interest.
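
To give a rough idea of what "semantic analysis by algorithm" in [5] might look like, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the small "is-a" table below is made up for the example and is not taken from WordNet, and the function names are hypothetical. The measure shown, one over one plus the number of is-a links separating two words, is a common path-based similarity; real WordNet data has multiple senses per word and sometimes more than one hypernym per sense, which this toy tree ignores.

```python
# Toy hypernym ("is-a") links. Hypothetical data for illustration only,
# not taken from WordNet. Each word has at most one parent, so this is a tree.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}


def ancestors(word):
    """Return the chain from word up to its root, with word itself first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_length(a, b):
    """Count is-a links between a and b via their lowest shared ancestor.

    Returns None when the two words share no ancestor in the table.
    """
    up_a = ancestors(a)
    up_b = ancestors(b)
    steps_from_b = {node: i for i, node in enumerate(up_b)}
    for steps_from_a, node in enumerate(up_a):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    return None


def path_similarity(a, b):
    """Path-based similarity: 1 / (1 + number of links), or None if unrelated."""
    d = path_length(a, b)
    return None if d is None else 1.0 / (1 + d)


if __name__ == "__main__":
    for pair in [("dog", "wolf"), ("dog", "cat"), ("dog", "sparrow")]:
        print(pair, path_similarity(*pair))
```

With this table, "dog" comes out closer to "wolf" than to "cat", and closer to "cat" than to "sparrow". A project could replace the toy table with the actual WordNet hierarchy and ask how well such a measure agrees with human judgments of word similarity.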