Data Mining

Course Syllabus
A. Name of university/ Instructor’s name (department/ position)
Voronezh State University / Natalia Sapkina (Department of Applied
Mathematics, Informatics and Mechanics / a lecturer in Mathematics
and Applied Information Technologies)
B. Title of course/ Semester
Data Mining / 4th Semester
C. Instructor’s office location and address/ office phone
Universitetskaya sq. 1
Tel.: +7 473 220 83 16
D. Instructor’s e-mail address
E. Course description
The course explores Data Mining, the study of algorithms and
computational paradigms that allow computers to find patterns and
regularities in databases, perform prediction and forecasting, and
generally improve their performance through interaction with data. Data
Mining is currently regarded as the key element of a more general process
called Knowledge Discovery, which deals with extracting useful
knowledge from raw data. The knowledge discovery process includes
data selection, cleaning, coding, the use of different statistical and
machine learning techniques, and visualization of the generated
structures. The course will cover all these issues and will illustrate the
whole process with examples. Special emphasis will be placed on
Machine Learning methods, as they provide the real knowledge
discovery tools. Important related technologies, such as data warehousing
and on-line analytical processing (OLAP), will also be discussed. The
students will use recent Data Mining software. Enrollment in this
course is limited to 20 students.
F. Course Objectives
After completing the course, students will:
• recognize basic concepts and techniques of Data Mining;
• enhance their skills in using recent data mining software to solve
practical problems;
• gain experience in independent study and research.
G. Methods of Instruction
Course meetings will consist of approximately two-thirds lectures and
one-third activities. There will be three types of activities:
Discussions - We will discuss the motivations, methods, results, and
implications of recent papers.
Writing Assignments - A variety of writing assignments will be given
throughout the semester.
Individual Presentations - During the final three weeks of class, all
students will write and orally present a paper (1,500 words maximum) on a
topic approved by the instructor.
H. Course Requirements and Grading
Graded work will receive a numeric score reflecting the quality of
performance. Relative weights assigned to graded work are as follows:
• Exams - 60%
• Project
  - Final Presentation - 20%
  - Written assignments - 15%
• Participation in Discussions - 5%
Your overall course grade will be determined according to the following
scale:
85-100% A
70-84% B
55-69% C
0-54% F
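The arithmetic behind the overall grade can be sketched as follows. This is only an illustration of the weights and scale listed above, not an official grading tool. The component names and the functions overall_score and letter_grade are hypothetical labels chosen for the example. It also assumes each component is scored on a 0-100 scale and that a fractional score falls into the highest band whose lower bound it reaches, so 84.6 counts as B.

```python
# Weights from section H. The dictionary keys are illustrative labels.
WEIGHTS = {
    "exams": 0.60,
    "final_presentation": 0.20,
    "written_assignments": 0.15,
    "participation": 0.05,
}

# Lower bound of each letter-grade band, highest band first.
GRADE_BANDS = [(85, "A"), (70, "B"), (55, "C"), (0, "F")]


def overall_score(scores):
    """Return the weighted average of component scores (each 0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter using the band lower bounds.

    Assumption: no rounding is applied before the comparison.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, letter in GRADE_BANDS:
        if score >= lower_bound:
            return letter


if __name__ == "__main__":
    example = {
        "exams": 80,
        "final_presentation": 90,
        "written_assignments": 70,
        "participation": 100,
    }
    total = overall_score(example)
    print(f"{total:.1f} -> {letter_grade(total)}")
```

In this example the components contribute 48, 18, 10.5 and 5 points, for a total of about 81.5, which falls in the B band.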
I. Final Exam
An oral interview will be used as the assessment method.
J. Required texts
Essential readings
1. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques (Second Edition), Morgan
Kaufmann, 2005, ISBN: 0-12-088407-0.
2. Hand, Mannila and Smyth: Principles of Data Mining, MIT Press, 2001.
Recommended readings
1. Mitchell: Machine Learning, McGraw-Hill, 1997.
2. Han and Kamber: Data Mining: Concepts and Techniques,
Morgan Kaufmann, 2000.
3. Berry and Linoff: Mastering Data Mining: The Art and Science of
Customer Relationship Management. Wiley, 1999.
K. Tentative schedule
Week 1: Introduction to Data Mining
  - What is data mining?
  - Related technologies: Machine Learning, DBMS, OLAP, Statistics
  - Data Mining Goals
  - Stages of the Data Mining
  - Data Mining Techniques
  - Knowledge Representation
Week 2: Data Warehouse and OLAP
  - Data Warehouse and DBMS
  - Multidimensional data model
  - OLAP operations
Week 3: Data preprocessing
  - Data cleaning
  - Data transformation
  - Data reduction
  - Discretization and generating concept hierarchies
  - Installing Weka 3 Data Mining System
Week 4: Data mining knowledge
  - Task relevant data
  - Background knowledge
  - Interestingness measures
  - Representing input data and output knowledge
  - Visualization techniques
Week 5: Attribute-oriented analysis
  - Attribute generalization
  - Attribute relevance
  - Class comparison
  - Statistical measures
Week 6: Data mining algorithms: Association rules
  - Motivation and terminology
  - Example: mining weather
  - Basic idea: item sets
  - Generating item sets and rules efficiently
  - Correlation analysis
Week 7: Data mining algorithms:
  - Basic learning/mining tasks
  - Inferring rudimentary rules: 1R algorithm
  - Decision trees
  - Covering rules
Week 8: Data mining algorithms:
  - The prediction task
  - Statistical (Bayesian)
  - Bayesian networks
  - Instance-based methods (nearest neighbor)
  - Linear models
Week 9: Evaluating what's been learned
  - Basic issues
  - Training and testing
  - Estimating classifier accuracy (holdout, cross-validation)
  - Combining multiple models (bagging, boosting, stacking)
  - Minimum Description Length Principle (MDL)
Week 10: Mining real data
  - Preprocessing data from a real medical domain
  - Applying various data mining techniques to create a
    comprehensive and accurate model of the data
Week 11
  - Basic issues in clustering
  - First conceptual clustering system: Cluster/2
  - Partitioning methods: k-means, expectation maximization (EM)
  - Hierarchical methods: agglomerative and divisive
Week 12: Advanced techniques, Data Mining software and
  - Text mining: extracting attributes (keywords), structural
    approaches (parsing, soft parsing)
  - Bayesian approach to classifying text
  - Web mining: classifying web pages, extracting knowledge
    from the web
  - Data Mining software and
Week 13: Project presentations
Week 14: Project presentations
Week 15: Final exam
Assigned readings are drawn from Witten and Frank 2005; Hand, Mannila
and Smyth 2001; Mitchell 1997; Han and Kamber 2000; Berry and Linoff 1999.