Course: Introduction to Natural Language Processing
Lecturer: Nives Mikelic Preradovic, assistant professor
Substitute Lecturer (if needed): Damir Boras, full professor
Language: English
ECTS credits: 6
Teaching hours: 2 lecture hours and 2 hours of practical classes per week / 30 lecture hours and 30 hours of practical classes per semester
Duration: 1 semester
Status: elective
Method of teaching: 2 lecture hours and 2 hours of practical classes per week
Prerequisite: none
Assessment: written exam

Course description:
The widespread use of computers has had a profound influence on the way ordinary people communicate, search for and store information today. For the overwhelming majority of people and situations, the natural vehicle for such information is natural language: the language people use in their everyday communication (e.g. Croatian, English, German), as opposed to artificial languages such as programming languages. Text and, to a lesser extent, speech are crucial encoding formats for the information revolution. Natural language processing covers applications that use natural language in some way. For instance, it is used in computer interface design, where commands are given to the computer in natural language. It is also used in knowledge acquisition, information retrieval and machine translation.

This course will give an introduction to natural language processing (NLP), focusing on the use of natural language by computers. It will provide students who are interested in both linguistics and computers with insight into the fundamentals of how computers are used to represent, process and organize textual and spoken information. It will introduce them to a field that combines insights from linguistics and computer science to produce applications such as machine translation, information retrieval and spell checking. The course will also offer tips on how to integrate this knowledge effectively into working practice.
The course encompasses the theory and practice of human language technology. Students will be exposed to two languages that require somewhat different NLP approaches: Croatian (representing Slavic, morphologically rich languages) and English (representing syntactically rich languages). Students whose native language is not English will have a chance to analyze their own mother tongue and to implement computational NLP techniques for it during the laboratory exercises. Topics include text encoding, information retrieval technology, tools for writing support, morphological processing, lexicon building, tagging, parsing, word sense disambiguation, machine translation, language acquisition, dialogue systems, natural language understanding and computer-aided language learning. We will move from simple representations of language, such as finite-state techniques and n-gram analysis, to more advanced representations, such as those found in context-free and unification-based parsing. The course will cover a range of topics that will help students understand how current NLP technology works and will provide them with a platform for future study and research. Students who take this course will gain a thorough understanding of the fundamental methods used in natural language processing, along with the ability to assess the strengths and weaknesses of natural language technologies based on these methods.

Course objectives:
The course is designed to develop an understanding of both the linguistic and computational aspects of natural language processing. It aims to teach students the leading trends and systems in NLP and to help them understand the concepts of morphology, syntax, semantics and pragmatics. These goals will be achieved by:
1. Readings, lectures, and class discussions of the multiple levels of linguistic analysis required for a computer to accept natural language input, interpret it, and carry out a particular application;
2. Lab exercises and assignments in analyzing or implementing the computational techniques required to perform these levels of natural language processing of text.

Course topics:

Unit 1: Introduction to Natural Language Processing (NLP)
NLP topics and goals, the reasons to study NLP (an interdisciplinary field), difference between computational linguistics and NLP, applications and tools for natural language processing (text processing applications, dialogue applications), Turing test, Loebner prize

Unit 2: Phonetics and phonology: speech and text encoding
Definition and comparison of phonetics and phonology, speech organs, distribution of phonemes, IPA (International Phonetic Alphabet), speech animator (multimedia presentation), features of speech (sound waves, speech flow, loudness, pitch, frequency, etc.), digital representations of speech (oscillogram, spectrogram), Automatic Speech Recognition (ASR) applications and Text-to-Speech Synthesis (TTS) applications

Unit 3: Writing systems and languages
History and development of writing systems (pictograms, logograms, ideograms), present-day writing systems (syllabaries, consonantaries, alphabets and alphasyllabaries), comparison of writing systems, character encoding (big-endian and little-endian byte order, ASCII, Unicode as the standard method of representing documents with multiple writing systems)

Unit 4: Information retrieval and natural language
Facilities and applications for information retrieval on the internet (search engines, web directories, meta-search engines, and the invisible web), Boolean operators, Google operators, synonymy, hyponymy, hypernymy, meronymy and antonymy, the specific knowledge of retrieval technology needed to use it well, differences between specific and general queries, the evaluation of query
results, invisible web (vertical search, specialized search, types of specialized databases, finding the invisible web)

Unit 5: Regular expressions and finite state automata
Regular expressions (regexp): definition, uses, operators and patterns, writing correct expressions, finite state automata (FSA): definition, examples, uses, relation between regexp and FSA, tools that use regexp (e.g. UNIX), regexp in Word: examples

Unit 6: Introduction to English and Croatian language morphology
Definition of morpheme, root, affixes, allomorphs, derivational morphemes vs. inflectional morphemes, morpheme vs. lexeme, derivational morphology in English and Croatian (nouns, adjectives), inflectional morphology in English and Croatian (nouns, verbs, adjectives)

Unit 7: Computational morphology, finite state automata and finite state transducers
Morphological analysis in English and Croatian: identifying morphemes, morphological derivation in English and Croatian: combining morphemes, building the lexicon, algorithm: implementation of rules for morphological analysis, finite state automata (FSA): definition, examples in Croatian and English, finite state transducers (FST): definition, examples in Croatian and English, two-level morphology

Unit 8: Outline of English and Croatian language syntax
Levels of language analysis, representations and understanding, part of speech: definition and analysis (nouns, verbs, adjectives, adverbs, prepositions), declension, conjugation, open and closed vocabulary, lexical features, phrase categories with examples: noun phrase (NP), verb phrase (VP), adjective phrase (AdjP), adverb phrase (AdvP), prepositional phrase (PP), elements of noun phrases, verb phrases and simple sentences, adjective phrases, adverbial phrases, prepositional phrases, phrase marker tree, grammatical correctness

Unit 9: Computational syntax: syntax trees and parsing
Syntax structure rules in English, structural ambiguity, context-free grammar (CFG): definition and usage,
computer implementation of context-free grammar (pushdown automaton), top-down sentence parsing, bottom-up sentence parsing, grammatical functions (modifiers, specifiers, complements, heads), generative grammar, Chomsky hierarchy of grammars

Unit 10: Computational semantics: selectional preferences and semantic roles
Semantics of English and Croatian, compositional semantics, semantic / deep roles (Fillmore, Jackendoff), verb valence: definition and usage, semantic roles and noun phrases, definition and examples of the most common semantic roles (Agent, Patient, etc.), syntactic approach to valence (Levin), verb valence lexicon, valence and machine translation

Unit 11: Spelling and grammar correction tools
Purpose of grammar checkers and spelling correctors, advantages and disadvantages of such tools and the errors to expect from them, interactive and automatic correctors, error detection and correction, non-word error detection, isolated word error correction, context-dependent word error correction, grammar correction, n-gram analysis, nonpositional bigram array, positional bigram array, rule-based method, similarity key method, minimum edit distance method, probabilistic method

Unit 12: Language identification and spam filtering
Natural language techniques for classifying documents (including identifying the language a document is written in), n-grams, frequency distributions, markers of lexical style, stop/function words, definition of spam vs. ham, introduction to linguistic (rule-based) and statistical techniques for spam filtering, spam detection

Unit 13: Machine translation
Purpose of machine translation, reliability of online translation services, computer translation support functions, statistical (Google Translate) vs.
rule-based machine translation (Babelfish), direct transfer, transformer, interlingua, sentence alignment, word alignment, “bag of words” method, translation memory, lexical ambiguity, MT evaluation

Unit 14: Dialogue systems
Basic features of human dialogue, Gricean maxims, speech acts, description of Eliza, Parry and Alice and their surprising success in engaging people in conversation, the use of dialogue systems and their specific purposes, detailed analysis of the components of a dialogue system, modern dialogue systems

Unit 15: Computer-Assisted Language Learning
Overview of the important factors involved in learning a foreign language, second language acquisition, adult language learning, the role of computers in language learning, analysis of system architecture: from vocabulary training, via presentation of learning material, to providing feedback on learner errors and progress

Competencies, knowledge and skills developed by the course:
By the end of the course, students should be able to:
a) recognize the features that distinguish a natural language system from other intelligent systems
b) give appropriate examples illustrating the concepts of morphology, syntax, semantics and pragmatics
c) show that they understand the difference between an approach based on linguistic rules and an approach based purely on statistics
d) recognize the significance of pragmatics for natural language understanding
e) describe an NLP-based application and identify its points of syntactic, semantic and pragmatic processing
f) evaluate existing systems

Reading list:
1. Jurafsky, D. & Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2000.
2. Butler, C. (ed). Computers and Written Texts. Blackwell, 1992.
3. Pinker, S. The Language Instinct. London: Penguin, 1994.

Additional reading list:
1. Allen, J. Natural Language Understanding.
Redwood City, CA: Benjamin/Cummings, 1995.
2. Dale, R., Moisl, H. & Somers, H. (eds). Handbook of Natural Language Processing. Marcel Dekker, 2000.
3. Hausser, R.R. Foundations of Computational Linguistics: Human-Computer Communication in Natural Language. Springer Verlag, 2001.
4. Iwanska, L.M. & Shapiro, S.C. (eds). Natural Language Processing and Knowledge Representation. MIT Press, 2000.
5. Manning, C. & Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
6. Tadic, M. Building the Croatian Morphological Lexicon. Proceedings of the EACL2003 Workshop on Morphological Processing of Slavic Languages, pp. 41-46.

Course Evaluation Criteria

Breakdown                 Number   Effective Proportion %
Midterm Exams             1        30
Laboratory Assignments    15       30
Final Exam                1        40

Quality check and success of the course (evaluation):
The quality and success of the course will be assessed by combining internal and external evaluation. Internal evaluation will be carried out by teachers and students using a survey at the end of the semester. External evaluation will be carried out by colleagues who attend, monitor and assess the course.

Students with disabilities:
Students who need an accommodation based on the impact of a disability should contact the lecturer to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations.

Curriculum vitae

Nives Mikelic, assistant professor
University of Zagreb, Faculty of Humanities and Social Sciences, Department of Information Science
E-mail: nmikelic@ffzg.hr
Address: University of Zagreb, Faculty of Humanities and Social Sciences, Ivana Lucica 3, 10000 Zagreb, Croatia

Education:
PhD in Information Science, University of Zagreb (2008)
MPhil in Computer Speech, Text and Internet Technology, University of Cambridge (2004)
MSc in Information Science, University of Zagreb (2003)
BSc in Croatian language, University of Zagreb (2001)
BSc in Information Science, University of Zagreb (2001)

Current Positions:
Assistant Professor at the Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
o Course: Introduction to Natural Language Processing
o Course: Language Engineering
o Course: Automatic Text Summarization
o Course: Discourse and dialogue systems
o Course: Digital image and text editing basics
o Course: Service Learning

Research interests:
1. Natural Language Processing
2. Service Learning (community-based learning)
3. Multimedia Applications in Education
4. Croatian Dictionary Heritage

Scholarly Lectures/Workshops:
“Multimedia in foreign language teaching and learning,” workshop presented at Foreign Language Studies Center, University of Dubrovnik, Croatia, April 2008.
“Introduction to Community-Based Service Learning,” workshop presented at Faculty of Humanities and Social Sciences, University of Zagreb, Croatia, February 2008 (funded by EUR/ACE Democracy Outreach/Alumni Fund).
“Information education for lecturers and assistants,” workshop presented at Faculty of Economics, University of Zagreb, Croatia, March 2007.
“Multimedia in education: primary school, secondary school and higher education,” workshop presented at Faculty of Humanities and Social Sciences, University of Zagreb, Croatia, February 2007.

Publications:

Natural Language Processing
1. Mikelic Preradovic, Nives; Boras, Damir; Kisicek, Sanja. Marvin - a Conversational Agent based Interface for the Study of Information Sciences // Proceedings of the 2nd International Conference “The Future of Information Sciences: INFuture2009-Digital Resources and Knowledge Sharing”. Zagreb: 2009.
2. Mikelic Preradovic, N., Boras, D., Kisicek, S. CROVALLEX: Croatian Verb Valence Lexicon // Proceedings of the 31st International Conference on Information Technology Interfaces / Lužar-Stiffler, V.; Bekić, Z.; Jarec, I. (eds). Zagreb: SRCE, 2009.
3. Preradovic Mikelic, N., Lauc, T., Boras, D.
Text Summarization of XML documents in Croatian // Modern Topics of Computer Science. Proceedings of 2nd WSEAS International Conference on COMPUTER ENGINEERING and APPLICATIONS (CEA '08) / Grebennikov, A. and Zemliak, A. (eds). Acapulco, Mexico. January 25-27, 2008. WSEAS Press. 143-148.
4. Preradovic Mikelic, N., Lauc, T., Boras, D. CROXMLSUM – the System for XML Document Summarization in Croatian. International Journal of Mathematics and Computers in Simulation, 1/1(2007), p. 81-89.
5. Ljubesic, Nikola; Mikelić, Nives; Boras, Damir. Language identification: how to distinguish similar languages? // Proceedings of the 29th International Conference on Information Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna; Bekić, Zoran; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2007.
6. Lauc, Tomislava; Mikelić, Nives; Boras, Damir. Croatian Text Summarizer (CROSUM) // Proceedings of the 27th International Conference on Information Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna; Bekić, Zoran; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2005. 651-657.
7. Mikelić, Nives. Word sense disambiguation: Distinguishing between individuals and kinds / MPhil thesis. Cambridge: Computer Laboratory, 2004.
8. Tuđman, Miroslav; Mikelić, Nives; Boras, Damir. Vocabulary size prediction of Croatian texts // Proceedings of the 25th International Conference on Information Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna; Bekić, Zoran; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2003. 223-228.
9. Tuđman, Miroslav; Mikelić, Nives. Disinformation and Relevance from the Sender’s Point of View // Proceedings of the IS 2003 Informing Science + IT Education Conference / Eli Cohen and Elisabeth Boyd (eds). Pori: Turku School of Economics and Business Administration, 2003. 1513-1527.
10. Boras, Damir; Mikelić, Nives; Lauc, Davor. Lexical Inflectional database of Croatian First and Last Names // Models of Knowledge and Natural Language Processing / Tuđman, Miroslav (ed).
Zagreb: Institute for Information Studies, Faculty of Philosophy, 2003. 219-237.

Service Learning
1. Mikelić Preradović, Nives; Basrak, Bojan; Matić, Sanja. Service Learning in Information Science: Web for the Blind // Proceedings of the 1st International Conference “The Future of Information Sciences: INFuture2007-Digital Information and Heritage” / Bawden, David; Boras, Damir; Lasic-Lazic, Jadranka; Seljan, Sanja; Slavic, Aida; Stancic, Hrvoje; Sola, Tomislav; Tudman, Miroslav; Urbania, Joze (eds). Zagreb: 2007. 501-507.
2. Mikelić Preradović, Nives; Tuđman, Miroslav; Matić, Sanja. Promotion of knowledge society through service learning // Proceedings of the 4th International Congress of Quality Management in the Systems of Education and Training. Casablanca, Morocco: 2007.
3. Mikelić, Nives; Boras, Damir. Service learning: can our students learn how to become a successful student? // Proceedings of the 28th International Conference on Information Technology Interfaces / Budin, Leo; Lužar-Stiffler, Vesna; Bekić, Zoran; Hljuz Dobrić, Vesna (eds). Zagreb: SRCE, 2006. 651-657.

Multimedia Applications in Education
1. Lauc, Tomislava; Matić, Sanja; Mikelić, Nives. Educational multimedia software for English language vocabulary // Current Research in Information Sciences and Technologies: Multidisciplinary approaches to global information systems: VOLUME I / Vicente P. Guerrero-Bote (ed). Merida: Open Institute of Knowledge, 2006. 117-121.
2. Lauc, Tomislava; Matić, Sanja; Mikelić Preradović, Nives. Project of developing the multimedia software supporting teaching and learning of English vocabulary // InFuture2007: Digital information and heritage / Bawden, D. et al. (eds). Zagreb: Odsjek za informacijske znanosti, Filozofski fakultet, Sveučilište u Zagrebu, 2007.
3. Banek, Mihaela; Mikelić, Nives. Reading literature prescribed by the school curricula: pleasure or a nightmare? // Proceedings of the 2005 IBBY Congress in Cuba / Habana: 2005.
4.
Mikelić, Nives; Lauc, Tomislava; Golubić, Kruno. Computer-assisted learning of Croatian language stress system (CAL-CROLESS) // Proceedings of the conference CE / Čičin-Šain, M.; Dragojlović, P.; Turčić Prstačić, I. (eds). Opatija: MIPRO HU, 2005.
5. Mikelić, Nives; Lauc, Tomislava. Multimedia and multimedia instructional message // Information Science in the Process of Change / Lasić-Lazić, Jadranka (ed). Zagreb: Institute for Information Studies, Faculty of Philosophy, 2005. 95-114.
6. Mikelić, Nives. Methods of multimedia information design and its influence on memorizing and understanding the content / MSc thesis. Zagreb: Faculty of Philosophy, 2003.
7. Mikelić, Nives; Lauc, Tomislava. Modern educational technology: Media for Inquiry and Research // Crikvenica 2002 / Šeta, Višnja (ed). Rijeka, 2003.
8. Mikelić, Nives; Miškić, Jelena. Copyright situation in the Republic of Croatia // The 11th Bobcatsss Symposium PROCEEDINGS / De Boer, Pelle et al. (eds). Torun: Nicolaus Copernicus University Torun, 2003. 501-515.

Croatian Dictionary Heritage
1. Boras, Damir; Mikelić, Nives; Ljubešić, Nikola. Learning medieval and renaissance Latin in a new way // Proceedings of the Cambridge Latin conference - Meeting the Challenge: European perspectives on the teaching and learning of Latin / Lister, B.; Landi, L.; Rasmussen, P. (eds). Cambridge: 2005. (in press)
2. Boras, Damir; Ljubešić, Nikola; Mikelić, Nives. CROATIAN OLD DICTIONARY PORTAL (CRODIP) // Proceedings of the Libraries in the digital age 2005. (in press)
3. Boras, Damir; Mikelić, Nives. Croatian dictionary heritage: old Croatian dictionaries in digital form // Libraries in the digital age 2003.
4. Boras, Damir; Mikelić, Nives. Faust Vrančić’s Dictionary – Basis of Croatian Dictionary Heritage (Computer Analysis) // Models of Knowledge and Natural Language Processing / Tuđman, Miroslav (ed). Zagreb: Institute for Information Studies, Faculty of Philosophy, 2003. 237-272.
Scientific projects:
Participates in the scientific project “Design of management of public knowledge in information space”, funded by the Ministry of Science and Technology
Participates in the scientific project “Croatian dictionary heritage and Croatian European identity”, funded by the Croatian Ministry of Science and Technology

Awards:
Rector's Award for the student paper “Dictionarium of Faust Vrančić in 16th and 21st century – inside view”, written within the scientific project “Croatian dictionary heritage and dictionary knowledge presentation”, led by Damir Boras, PhD.

Scholarships:
JFDP scholarship (Junior Faculty Development Program, funded by the U.S. Department of State's Bureau of Educational and Cultural Affairs) in the academic year 2005/2006
Cambridge Overseas Trust scholarship at the University of Cambridge in the academic year 2003/2004

Foreign languages: English, Italian, Czech.