Introduction to Natural Language Processing

Course: Introduction to Natural Language Processing
Lecturer: Nives Mikelic Preradovic, assistant professor
Substitute Lecturer (if needed): Damir Boras, full professor
Language: English
ECTS credits: 6
Teaching hours: 2 lecture hours and 2 hours of practical classes per week / 30 lecture
hours and 30 hours of practical classes per semester
Duration: 1 semester
Status: elective
Method of teaching: 2 lecture hours and 2 hours of practical classes per week
Prerequisite: none
Assessment: written exam
Course description:
The widespread use of computers has had a profound influence on the way ordinary people
communicate, search for and store information today. For the overwhelming majority of
people and situations, the natural vehicle for such information is natural language: the
language people use in their everyday communication (e.g. Croatian, English, German),
as opposed to artificial languages, such as programming languages. Text and, to a lesser
extent, speech are crucial encoding formats for the information revolution.
Natural language processing concerns applications that use natural language in some
way. For instance, it is used in computer interface design, where commands are given to
the computer in natural language. It is also used in knowledge acquisition,
information retrieval and machine translation.
This course gives an introduction to natural language processing (NLP), focusing on
the use of natural language by computers. It provides students who are interested in
both linguistics and computers with insight into the fundamentals of how computers
are used to represent, process and organize textual and spoken information.
It introduces them to a field that combines insights from linguistics and computer
science to produce applications such as machine translation, information retrieval and
spell checking. The course also offers tips on how to integrate this knowledge
effectively into working practice.
The course encompasses the theory and practice of human language technology. Students
will be exposed to two languages that require somewhat different NLP approaches:
Croatian (representing Slavic, morphologically rich languages) and English (representing
syntactically rich languages).
Students whose native language is not English will have a chance to analyze their own
mother tongue and to implement computational techniques of natural language
processing for it during the laboratory exercises.
Some of the topics included are text encoding, information retrieval technology, tools for
writing support, morphological processing, lexicon building, tagging, parsing, word sense
disambiguation, machine translation, language acquisition, dialogue systems, natural
language understanding and computer-aided language learning. We will move from
simple representations of language, such as finite-state techniques and n-gram analysis, to
more advanced representations, such as those found in context-free and unification-based
parsing.
The course will cover a range of topics that will help students understand how current
NLP technology works and will provide them with a platform for future study and
research.
Students who take this course will gain a thorough understanding of the fundamental
methods used in natural language processing, along with the ability to assess the strengths
and weaknesses of natural language technologies based on these methods.
Course objectives:
The course is designed to develop an understanding of both the linguistic and
computational aspects of Natural Language Processing.
It aims to familiarize students with the leading trends and systems in NLP and to help
them understand the concepts of morphology, syntax, semantics and pragmatics.
These goals will be achieved by:
1. Readings, lectures, and class discussions of the multiple levels of linguistic analysis
required for a computer to accept natural language input, interpret it, and carry out a
particular application;
2. Lab exercises and assignments in analyzing or implementing the computational techniques
required to perform these levels of natural language processing of text.
Course topics:
Unit 1: Introduction to Natural Language Processing (NLP)
NLP topics and goals, reasons to study NLP (an interdisciplinary field), difference between
computational linguistics and NLP, applications and tools for natural language processing
(text processing applications, dialogue applications), Turing test, Loebner prize
Unit 2: Phonetics and phonology: speech and text encoding
Definition and comparison of phonetics and phonology, speech organs, distribution of
phonemes, IPA (International Phonetic Alphabet), speech animator (multimedia
presentation), features of speech (sound waves, speech flow, loudness, pitch, frequency,
etc.), digital representations of speech (oscillogram, spectrogram), Automatic Speech
Recognition (ASR) applications and Text-to-Speech Synthesis (TTS) applications
Unit 3: Writing systems and languages
History and the development of writing systems (pictograms, logograms, ideograms),
presentation of different writing systems today: syllabaries, consonantaries, alphabets,
and alphasyllabaries, comparison of writing systems, character encoding (Big Endian,
Little Endian, ASCII, Unicode - standard method of representing documents with
multiple writing systems)
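The encoding terms in Unit 3 can be illustrated with a minimal sketch. It assumes Python 3 and its built-in codecs; the sample letter is chosen only for illustration and is not prescribed by the course.

```python
# Illustrative sketch: how one character is stored under different encodings.
word = "š"  # U+0161, a letter used in Croatian; it has no ASCII code

utf8 = word.encode("utf-8")          # variable-length encoding, 2 bytes here
utf16_be = word.encode("utf-16-be")  # big endian: most significant byte first
utf16_le = word.encode("utf-16-le")  # little endian: least significant byte first

print(utf8.hex(), utf16_be.hex(), utf16_le.hex())

# ASCII covers only code points 0-127, so encoding this letter fails.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The two UTF-16 results contain the same two bytes in opposite order, which is the difference between Big Endian and Little Endian storage.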
Unit 4: Information retrieval and natural language
Facilities and applications for information retrieval on the internet (search engines, web
directories, meta-search engines, and invisible web), Boolean operators, Google
operators, synonymy, hyponymy, hypernymy, meronymy and antonymy, the specific
knowledge of retrieval technology needed to use it well, differences between specific
and general queries, the evaluation of query results, invisible web (vertical search,
specialized search, types of specialized databases, finding invisible web)
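The Boolean operators mentioned in Unit 4 can be sketched over a tiny inverted index. This is a minimal illustration in Python, assuming a made-up three-document collection; real search engines add ranking, stemming and much more.

```python
# Illustrative sketch: Boolean retrieval over a tiny, made-up document collection.
docs = {
    1: "croatian morphology and syntax",
    2: "english syntax and parsing",
    3: "machine translation of croatian text",
}

# Inverted index: each word maps to the set of document ids that contain it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def lookup(word):
    """Return the ids of the documents containing the word (empty set if none)."""
    return index.get(word, set())

and_result = lookup("croatian") & lookup("syntax")      # croatian AND syntax
or_result = lookup("parsing") | lookup("translation")   # parsing OR translation
not_result = lookup("syntax") - lookup("croatian")      # syntax NOT croatian
print(and_result, or_result, not_result)
```

Set intersection, union and difference correspond directly to AND, OR and NOT, which is why a specific query (more AND terms) returns fewer documents than a general one.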
Unit 5: Regular expressions and finite state automata
Regular expressions (regexp): definition, uses, operators and patterns, writing correct
expressions, finite state automata (FSA): definition, examples, uses, relation between
regexp and FSA, tools that use regexp (e.g. UNIX), regexp in Word: examples
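The relation between regular expressions and finite state automata in Unit 5 can be shown with a small sketch. It assumes Python's standard `re` module and a toy language invented for the example (the letter b, two or more a's, then an exclamation mark); the state numbering is likewise an assumption.

```python
import re

# Regular expression for a toy language: b, then two or more a's, then !
TOY_RE = re.compile(r"baa+!")

# The same language as a deterministic finite state automaton.
# Keys are (state, input symbol); state 4 is the only accepting state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPTING = {4}

def fsa_accepts(text):
    """Run the automaton over the text; reject on any missing transition."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

# The regexp and the automaton agree on every string tried here.
for s in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    assert fsa_accepts(s) == bool(TOY_RE.fullmatch(s))
```

Every regular expression of this kind can be turned into such a transition table, and vice versa.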
Unit 6: Intro to English and Croatian language morphology
Definition of morpheme, root, affixes, allomorphs, derivational morphemes vs.
inflectional morphemes, morpheme vs. lexeme, derivational morphology in English and
Croatian (nouns, adjectives), inflectional morphology in English and
Croatian (nouns, verbs, adjectives)
Unit 7: Computational morphology, finite state automata and finite state
transducers
Morphological analysis in English and Croatian: identifying morphemes, morphological
derivation in English and Croatian: combining morphemes, building of the lexicon,
algorithm: implementation of rules for morphological analysis, finite state automata
(FSA): definition, examples in Croatian and English, finite state transducers (FST):
definition, examples in Croatian and English, two-level morphology
Unit 8: Outline of English and Croatian language syntax
Levels of language analysis, representations and understanding, part of speech definition
and analysis (nouns, verbs, adjectives, adverbs, prepositions), declension, conjugation,
open and closed vocabulary, lexical features, category of phrases with examples: noun
phrase (NP), verb phrase (VP), adjective phrase (AdjP), adverb phrase (AdvP),
prepositional phrase (PP), elements of noun phrases, verb phrases and simple sentences,
adjective phrases, adverbial phrases, prepositional phrases, phrase marker tree,
grammatical correctness
Unit 9: Computational syntax: syntax trees and parsing
Syntax structure rules in English, structural ambiguity, context-free grammar (CFG):
definition and usage, computer implementation of context-free grammar (pushdown
automaton), top-down sentence parsing, bottom-up sentence parsing, grammatical
functions (modifiers, specifiers, complements, heads), generative grammar, Chomsky
hierarchy of grammars
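Top-down parsing with a context-free grammar, as listed in Unit 9, can be sketched as a small recursive-descent recognizer. The grammar and lexicon below are hypothetical toy examples, and the approach only works because this grammar has no left-recursive rules.

```python
# Illustrative sketch: top-down (recursive descent) recognition with a toy CFG.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "Det", "a": "Det",
    "student": "N", "students": "N", "sentence": "N",
    "parses": "V", "sleep": "V",
}

def expand(symbol, words, pos):
    """Yield every position reachable after deriving `symbol` from words[pos:]."""
    if symbol in GRAMMAR:
        for rule in GRAMMAR[symbol]:
            # Thread the possible positions through the rule's right-hand side.
            positions = [pos]
            for part in rule:
                positions = [q for p in positions for q in expand(part, words, p)]
            yield from positions
    elif pos < len(words) and LEXICON.get(words[pos]) == symbol:
        # Part-of-speech symbol: consume one word if its category matches.
        yield pos + 1

def recognize(sentence):
    """True if some derivation of S covers the whole sentence."""
    words = sentence.lower().split()
    return len(words) in expand("S", words, 0)

print(recognize("The student parses a sentence"))
```

The recognizer starts from S and expands rules downwards until it reaches words, which is what distinguishes it from bottom-up parsing, where the words are combined into larger phrases first.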
Unit 10: Computational semantics: selectional preferences and semantic roles
Semantics of English and Croatian, compositional semantics, semantic / deep roles
(Fillmore, Jackendoff), verb valence: definition and usage, semantic roles and noun
phrases, definition and examples of the most popular semantic roles (Agent, Patient, etc.),
syntactic approach to valence (Levin), verb valence lexicon, valence and machine
translation
Unit 11: Spelling and grammar correction tools
Purpose of grammar checkers and spelling correctors, the advantages and disadvantages
of the use of such tools and expected errors, interactive and automatic correctors, error
detection and correction, non-word error detection, isolated word error correction,
context-dependent word error correction, grammar correction, n-gram analysis,
non-positional bigram array, positional bigram array, rule-based method, similarity key
method, minimum edit distance method, probabilistic method
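The minimum edit distance method from Unit 11 can be sketched as follows. This is a minimal Python illustration assuming unit costs for insertion, deletion and substitution (Levenshtein distance); the `suggest` helper and the sample lexicon are hypothetical.

```python
def min_edit_distance(source, target):
    """Levenshtein distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def suggest(word, lexicon):
    """Return the lexicon entries closest to a possibly misspelled word."""
    best = min(min_edit_distance(word, w) for w in lexicon)
    return sorted(w for w in lexicon if min_edit_distance(word, w) == best)

print(suggest("langauge", ["language", "luggage", "lineage"]))
```

An isolated-word corrector of this kind ignores context; context-dependent correction would additionally need n-gram or probabilistic information.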
Unit 12: Language identification and spam filtering
Natural language techniques for classifying documents (including the language the
document is written in), n-grams, frequency distributions, markers of lexical style,
stop/function words, definition of spam vs. ham, introduction to linguistic (rule-based)
and statistical techniques for spam filtering, spam detection
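Language identification with character n-grams, as listed in Unit 12, can be sketched in a few lines. The two training sentences are invented for this example and far too small for real use; the overlap score is one simple choice among many.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams, padding the text with a space on each side."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical one-sentence training samples (assumptions for illustration).
SAMPLES = {
    "en": "the students will learn how computers process natural language",
    "hr": "studenti će naučiti kako računala obrađuju prirodni jezik",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Pick the language whose trigram profile shares the most trigrams with the text."""
    grams = char_ngrams(text)
    scores = {lang: sum((grams & profile).values())
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify("the language"), identify("prirodni jezik"))
```

The same frequency-profile idea carries over to other document classification tasks, with word n-grams or function words used as features instead of character trigrams.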
Unit 13: Machine translation
Purpose of automatic machine translation, reliability of online translation services,
computer translation support functions, statistical (Google Translate) vs. rule-based
machine translation (Babelfish), direct transfer, transformer, interlingua, sentence
alignment, word alignment, “bag of words” method, translation memory, lexical
ambiguity, MT evaluation
Unit 14: Dialogue systems
Basic features of human dialogue, Gricean maxims, speech acts, description of Eliza,
Parry and Alice and their surprising success in engaging people in conversation, the use of
dialogue systems and their specific purposes, detailed analysis of the components of a
dialogue system, modern dialogue systems
Unit 15: Computer-Assisted Language Learning
Key factors involved in learning a foreign language, second language
acquisition, adult language learning, the role of computers in language learning, analysis
of system architecture: from vocabulary training, via presentation of learning material, to
providing feedback on learner errors and progress
Competencies, knowledge and skills developed by the course:
By the end of the course, students should be able:
a) To recognize the features which distinguish the natural language system from other
intelligent systems
b) To give the appropriate examples that will illustrate the concepts of morphology,
syntax, semantics and pragmatics of the language
c) To show that they understand the difference between the approach based on linguistic
rules and the approach based on pure statistics
d) To recognize the significance of pragmatics for natural language understanding
e) To describe an application based on NLP and to identify its points of syntactic, semantic
and pragmatic processing
f) To evaluate the existing systems
Reading list:
1. Jurafsky D. & Martin, J.H. Speech and Language Processing. An Introduction to
Natural Language Processing, Computational Linguistics and Speech
Recognition. Prentice Hall, 2000.
2. Butler, C. (ed). Computers and Written Texts. Blackwell, 1992.
3. Pinker, S. The Language Instinct. London: Penguin, 1994.
Additional reading list:
1. Allen, J. Natural Language Understanding. Redwood, CA: Benjamin, 1995.
2. Dale, R., Moisl, H. & Somers, H. (eds). Handbook of Natural Language
Processing. MIT Press, 2000.
3. Hausser, R.R. Foundations of Computational Linguistics: Human-Computer
Communication in Natural Language. Springer Verlag, 2001.
4. Iwanska, L.M. & Shapiro, S.C. (eds). Natural Language Processing and
Knowledge Representation. MIT Press, 2000.
5. Manning, C. & Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
6. Tadic, M. Building the Croatian Morphological Lexicon. Proceedings of the
EACL2003 Workshop on Morphological Processing of Slavic Languages, pp. 41-46.
Course Evaluation Criteria
Breakdown                 Number    Effective Proportion %
Midterm Exams             1         30
Laboratory Assignments    15        30
Final Exam                1         40
Quality check and success of the course (evaluation):
The quality and success of the course will be assessed by combining internal and external
evaluation. Internal evaluation will be carried out by teachers and students through a survey
at the end of the semester. External evaluation will be carried out by colleagues attending
the course, through monitoring and assessment of the course.
Students with disabilities:
Students who need an accommodation based on the impact of a disability should contact
the lecturer to arrange an appointment as soon as possible to discuss the course format, to
anticipate needs, and to explore potential accommodations.
Curriculum vitae
Nives Mikelic, assistant professor
University of Zagreb, Faculty of Humanities and Social Sciences, Department of
Information Science
E-mail: nmikelic@ffzg.hr
Address: University of Zagreb, Faculty of Humanities and Social Sciences, Ivana
Lucica 3, 10000 Zagreb, Croatia
Education:
• PhD in Information Science, University of Zagreb (2008)
• MPhil in Computer Speech, Text and Internet Technology,
University of Cambridge (2004)
• MSc in Information Science, University of Zagreb (2003)
• BSc in Croatian language, University of Zagreb (2001)
• BSc in Information Science, University of Zagreb (2001)
Current Positions:
• Assistant Professor at the Department of Information Sciences,
Faculty of Humanities and Social Sciences, University of Zagreb
o Course: Introduction to Natural Language Processing
o Course: Language Engineering
o Course: Automatic Text Summarization
o Course: Discourse and dialogue systems
o Course: Digital image and text editing basics
o Course: Service Learning
Research interests:
1. Natural Language Processing
2. Service Learning (community-based learning)
3. Multimedia Applications in Education
4. Croatian Dictionary Heritage
Scholarly Lectures/Workshops:
• “Multimedia in foreign language teaching and learning,” workshop presented at
Foreign Language Studies Center, University of Dubrovnik, Croatia, April, 2008.
• “Introduction to Community-Based Service Learning” workshop, presented at
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia,
February, 2008. (funded by EUR/ACE Democracy Outreach/Alumni Fund)
• “Information education for lecturers and assistants,” workshop presented at
Faculty of Economics, University of Zagreb, Croatia, March, 2007.
• “Multimedia in education: primary school, secondary school and higher
education,” workshop presented at Faculty of Humanities and Social Sciences,
University of Zagreb, Croatia, February, 2007.
Publications:
Natural Language Processing
1. Mikelic Preradovic, Nives; Boras, Damir; Kisicek, Sanja. Marvin - a
Conversational Agent based Interface for the Study of Information Sciences //
Proceedings of the 2nd International Conference “The Future of Information
Sciences: INFuture2009-Digital Resources and Knowledge Sharing”. Zagreb:
2009.
2. Mikelic Preradovic, N., Boras, D., Kisicek, S. CROVALLEX: Croatian Verb
Valence Lexicon // Proceedings of the 31st International Conference on
Information Technology Interfaces / Lužar-Stiffler, V.; Bekić, Z.; Jarec, I. (eds).
Zagreb: SRCE, 2009.
3. Preradovic Mikelic, N., Lauc, T., Boras, D. Text Summarization of XML
documents in Croatian // Modern Topics of Computer Science. Proceedings of
2nd WSEAS International Conference on COMPUTER ENGINEERING and
APPLICATIONS (CEA '08) / Grebennikov, A. and Zemliak, A. (eds). Acapulco,
Mexico. January 25-27, 2008. WSEAS Press. 143-148.
4. Preradovic Mikelic, N., Lauc, T., Boras, D. CROXMLSUM – the System for
XML Document Summarization in Croatian. International Journal of
Mathematics and Computers in Simulation, 1/1(2007), p. 81-89.
5. Ljubesic, Nikola; Mikelić, Nives; Boras, Damir. Language identification: how to
distinguish similar languages? // Proceedings of the 29th International
Conference on Information Technology Interfaces / Budin, Leo; Lužar-Stiffler,
Vesna ; Bekić, Zoran ; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2007.
6. Lauc, Tomislava; Mikelić, Nives; Boras, Damir. Croatian Text Summarizer
(CROSUM) // Proceedings of the 27th International Conference on Information
Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna ; Bekić, Zoran ; Hljuz
Dobrić, Vesna (eds). Zagreb: SRCE, 2005. 651-657.
7. Mikelić, Nives. Word sense disambiguation: Distinguishing between
individuals and kinds / MPhil thesis. Cambridge: Computer Laboratory, 2004.
8. Tuđman, Miroslav; Mikelić, Nives; Boras, Damir. Vocabulary size prediction of
Croatian texts // Proceedings of the 25th International Conference on
Information Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna ; Bekić,
Zoran ; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2003. 223-228.
9. Tuđman Miroslav; Mikelić Nives. Disinformation and Relevance from the
Sender’s Point of View // Proceedings of the IS 2003 Informing Science + IT
Education Conference / Eli Cohen and Elisabeth Boyd (eds.). Pori: Turku School
of Economics and Business Administration, 2003. 1513-1527.
10. Boras, Damir; Mikelić, Nives; Lauc, Davor.
Lexical Inflectional database of Croatian First and Last Names // Models of
Knowledge and Natural Language Processing / Tuđman, Miroslav (ed). Zagreb:
Institute for Information Studies, Faculty of Philosophy, 2003. 219-237.
Service Learning
1. Mikelić Preradović, Nives; Basrak, Bojan; Matić, Sanja. Service Learning in
Information Science: Web for the Blind. // Proceedings of the 1st International
Conference “The Future of Information Sciences: INFuture2007-Digital
Information and Heritage” / Bawden, David; Boras, Damir; Lasic-Lazic, Jadranka;
Seljan, Sanja; Slavic Aida; Stancic, Hrvoje; Sola, Tomislav; Tudman, Miroslav;
Urbania Joze (eds). Zagreb: 2007. 501-507.
2. Mikelić Preradović, Nives; Tuđman, Miroslav; Matić, Sanja. Promotion of
knowledge society through service learning // Proceedings of the 4th International
Congress of Quality Management in the Systems of Education and
Training. Casablanca, Morocco: 2007.
3. Mikelić, Nives; Boras, Damir. Service learning: can our students learn how to
become a successful student? // Proceedings of the 28th International Conference
on Information Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna ; Bekić,
Zoran ; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2006. 651-657.
Multimedia Applications in Education
1. Lauc Tomislava; Matić Sanja; Mikelić, Nives. Educational multimedia software
for English language vocabulary // Current Research in Information Sciences
and Technologies Multidisciplinary approaches to global information systems:
VOLUME I / Vicente P. Guerrero-Bote (ed.). Merida : Open Institute of
Knowledge, 2006. 117-121.
2. Lauc, Tomislava; Matić, Sanja; Mikelić Preradović, Nives.
Project of developing the multimedia software supporting teaching and
learning of English vocabulary // InFuture2007: Digital information and
heritage / Bawden, D. et al. (eds). Zagreb: Department of Information Sciences,
Faculty of Humanities and Social Sciences, University of Zagreb, 2007.
3. Banek, Mihaela; Mikelić, Nives. Reading literature prescribed by the school
curricula: pleasure or a nightmare? // Proceedings of the 2005 IBBY Congress
in Cuba / Habana: 2005.
4. Mikelić, Nives; Lauc, Tomislava; Golubić, Kruno. Computer-assisted learning
of Croatian language stress system (CAL-CROLESS) // Proceedings of the
conference CE / Čičin-Šain, M.; Dragojlović, P.; Turčić Prstačić, I. (eds). Opatija:
MIPRO HU, 2005.
5. Mikelić, Nives; Lauc, Tomislava. Multimedia and multimedia instructional
message // Information Science in the Process of Change / Lasić-Lazić,
Jadranka (ed). Zagreb: Institute for Information Studies, Faculty of Philosophy,
2005. 95-114.
6. Mikelić, Nives. Methods of multimedia information design and its influence on
memorizing and understanding the content / MSc thesis. Zagreb: Faculty of
Philosophy, 2003.
7. Mikelić, Nives; Lauc, Tomislava. Modern educational technology: Media for
Inquiry and Research // Crikvenica 2002 / Šeta, Višnja (eds.). Rijeka, 2003.
8. Mikelić, Nives; Miškić, Jelena. Copyright situation in the Republic of Croatia //
The 11th Bobcatsss Symposium PROCEEDINGS / De Boer, Pelle et al (eds).
Torun: Nicolaus Copernicus University Torun, 2003. 501-515
Croatian Dictionary Heritage
1. Boras, Damir; Mikelić, Nives; Ljubešić, Nikola. Learning medieval and
renaissance Latin in a new way // Proceedings of the Cambridge Latin
conference- Meeting the Challenge: European perspectives on the teaching and
learning of Latin / Lister, B., Landi, L., Rasmussen P. (eds). Cambridge: 2005. (in
press)
2. Boras, Damir; Ljubešić, Nikola; Mikelić, Nives. CROATIAN OLD
DICTIONARY PORTAL (CRODIP) // Proceedings of the Libraries in the digital
age 2005. (in press)
3. Boras, Damir; Mikelić, Nives. Croatian dictionary heritage: old Croatian
dictionaries in digital form // Libraries in the digital age 2003.
4. Boras, Damir; Mikelić, Nives.
Faust Vrančić’s Dictionary – Basis of Croatian Dictionary Heritage (Computer
Analysis) // Models of Knowledge and Natural Language Processing / Tuđman,
Miroslav (ed). Zagreb: Institute for Information Studies, Faculty of Philosophy,
2003. 237 - 272.
Scientific projects:
• Participates in the scientific project “Design of management of public knowledge
in information space”, funded by the Ministry of Science and Technology
• Participates in the scientific project “Croatian dictionary heritage and Croatian
European identity”, funded by the Croatian Ministry of Science and Technology
Awards:
Rector's Award for the student paper “Dictionarium of Faust Vrančić in 16th and 21st
century – inside view”, written within the scientific project “Croatian dictionary
heritage and dictionary knowledge presentation”, led by Damir Boras, PhD.
Scholarships:
• JFDP scholarship (Junior Faculty Development Program, funded by the
U.S. Department of State's Bureau of Educational and Cultural Affairs) in
the academic year 2005/2006
• Cambridge Overseas Trust scholarship at the University of Cambridge in
the academic year 2003/2004
Foreign languages: English, Italian, Czech.