Course Syllabus

A. Name of university / Instructor’s name (department / position)
Voronezh State University / Natalia Sapkina (Department of Applied Mathematics, Informatics and Mechanics / Lecturer in Mathematics and Applied Information Technologies)

B. Title of course / Semester
Data Mining / 4th Semester

C. Instructor’s office location and address / office phone
Universitetskaya sq. 1, Voronezh, Russia 394000
Tel.: +7 473 220 83 16

D. Instructor’s e-mail address
natashasapkina@yandex.ru

E. Course description
Data Mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data. The knowledge discovery process includes data selection, cleaning, coding, the application of statistical and machine learning techniques, and visualization of the generated structures. The course will cover all of these issues and illustrate the whole process with examples. Special emphasis will be placed on Machine Learning methods, as they provide the real knowledge discovery tools. Important related technologies, such as data warehousing and on-line analytical processing (OLAP), will also be discussed. Students will use recent Data Mining software. Enrollment in this course is limited to 20 students.

F. Course Objectives
After completing the course, students will recognize basic concepts and techniques of Data Mining; enhance their skills in using recent data mining software for solving practical problems; and gain experience in independent study and research.

G. Methods of Instruction
Course meetings will consist of approximately two-thirds lectures and one-third activities.
There will be three types of activities:
Discussions - We will discuss the motivations, methods, results, and implications of recent papers.
Writing Assignments - A variety of writing assignments will be given throughout the semester.
Individual Presentations - During the final three weeks of class, all students will write and orally present a paper (1500 words maximum) on a topic approved by the instructor.

H. Course Requirements and Grading
Graded work will receive a numeric score reflecting the quality of performance. Relative weights assigned to graded work are as follows:
Exams - 60%
Project
- Final Presentation - 20%
- Written assignments - 15%
Participation in Discussions - 5%
Your overall course grade will be determined according to the following scale:
85-100% A
70-84% B
55-69% C
0-54% F

I. Final Exam
An oral interview will be used as the assessment method.

J. Required texts
Essential readings
1. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition), Morgan Kaufmann, 2005, ISBN: 0-12-088407-0.
2. Hand, Mannila and Smyth, Principles of Data Mining, MIT Press, 2001.
Recommended readings
1. Mitchell, Machine Learning, McGraw-Hill, 1997.
2. Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
3. Berry and Linoff, Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 1999.

K.
Tentative schedule
(Columns: Week / Session / Topic / Assigned readings and due assignments)

Week 1 / Lec1 / Introduction to Data Mining
  What is data mining?
  Related technologies: Machine Learning, DBMS, OLAP, Statistics
  Data Mining Goals
  Stages of the Data Mining Process
  Data Mining Techniques
  Knowledge Representation Methods
  Applications

Week 2 / Lec2 / Data Warehouse and OLAP
  Data Warehouse and DBMS
  Multidimensional data model
  OLAP operations

Week 3 / Dis1 / Data preprocessing
  Data cleaning
  Data transformation
  Data reduction
  Discretization and generating concept hierarchies
  Installing Weka 3 Data Mining System

Week 4 / Lec3 / Data mining knowledge representation
  Task relevant data
  Background knowledge
  Interestingness measures
  Representing input data and output knowledge
  Visualization techniques

Week 5 / Lec4 / Attribute-oriented analysis
  Attribute generalization
  Attribute relevance
  Class comparison
  Statistical measures

Assigned readings, Weeks 1-5: Witten and Frank 2005; Han and Kamber 2000; Hand, Mannila and Smyth 2001

Week 6 / Lec5 / Data mining algorithms: Association rules
  Motivation and terminology
  Example: mining weather data
  Basic idea: item sets
  Generating item sets and rules efficiently
  Correlation analysis
  Assigned readings: Witten and Frank 2005; Han and Kamber 2000

Week 7 / Lec6 / Data mining algorithms: Classification
  Basic learning/mining tasks
  Inferring rudimentary rules: 1R algorithm
  Decision trees
  Covering rules
  Assigned readings: Mitchell 1997; Witten and Frank 2005; Han and Kamber 2000

Week 8 / Dis2 / Data mining algorithms: Prediction
  The prediction task
  Statistical (Bayesian) classification
  Bayesian networks
  Instance-based methods (nearest neighbor)
  Linear models
  Assigned readings: Mitchell 1997; Witten and Frank 2005; Han and Kamber 2000

Week 9 / Lec7 / Evaluating what's been learned
  Basic issues
  Training and testing
  Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
  Combining multiple models (bagging, boosting, stacking)
  Minimum Description Length Principle (MDL)

Week 10 / Lec8 / Mining real data
  Preprocessing data from a real medical domain
  Applying various data mining techniques to create a comprehensive and accurate model of the data

Assigned readings, Weeks 9-10: Mitchell 1997; Hand, Mannila and Smyth 2001; Han and Kamber 2000; Witten and Frank 2005; Berry and Linoff 1999

Week 11 / Lec9 / Clustering
  Basic issues in clustering
  First conceptual clustering system: Cluster/2
  Partitioning methods: k-means, expectation maximization (EM)
  Hierarchical methods: distance-based agglomerative and divisive clustering

Week 12 / Dis3 / Advanced techniques, Data Mining software and applications
  Text mining: extracting attributes (keywords), structural approaches (parsing, soft parsing)
  Bayesian approach to classifying text
  Web mining: classifying web pages, extracting knowledge from the web
  Data Mining software and applications

Assigned readings, Weeks 11-12: Hand, Mannila and Smyth 2001; Mitchell 1997; Han and Kamber 2000; Witten and Frank 2005; Berry and Linoff 1999

Week 13 / Project presentations
Week 14 / Project presentations
Week 15 / Final exam
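
Worked example of the grading scheme in section H. The following minimal Python sketch combines component scores using the weights stated in section H and maps the result to a letter grade on the stated scale. The function names, the sample scores, and the treatment of fractional percentages (a score falls into the band whose lower bound it reaches) are assumptions made for this illustration and are not part of the official grading policy.

```python
# Weights from section H, expressed as fractions of the overall grade.
WEIGHTS = {
    "exams": 0.60,
    "final_presentation": 0.20,
    "written_assignments": 0.15,
    "participation": 0.05,
}

# Lower bound of each letter band from section H, highest band first.
SCALE = [(85, "A"), (70, "B"), (55, "C"), (0, "F")]


def overall_percentage(scores):
    """Weighted sum of component scores, each given on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percentage):
    """Return the letter for a percentage.

    Assumption: a fractional value belongs to the band whose lower
    bound it reaches (the syllabus lists integer ranges only).
    """
    for lower_bound, letter in SCALE:
        if percentage >= lower_bound:
            return letter
    raise ValueError("percentage must be non-negative")


# Hypothetical student scores, for illustration only.
example_scores = {
    "exams": 80,
    "final_presentation": 90,
    "written_assignments": 70,
    "participation": 100,
}

total = overall_percentage(example_scores)
print(f"{total:.1f}% -> {letter_grade(total)}")
```

For the hypothetical scores above, the weighted total is 0.60 x 80 + 0.20 x 90 + 0.15 x 70 + 0.05 x 100 = 81.5%, which falls in the 70-84% band (B).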