Data Mining - College of Business Administration @ Kuwait University

Kuwait University
College of Business Administration
Quantitative Methods and Information System Department
Syllabus
QM# Selected Topics
Data Mining: Concepts and Techniques
Fall 2007
Instructor: Dr. Aboul Ella Hassanien
Email: Abo@cba.edu.kw
Web site: http://www.cba.edu.kw/abo
Office Hours:
TBA
Course web page: http://www.cba.edu.kw/abo/Datamining
Required Textbook:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed. The
Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan
Kaufmann Publishers, Feb. 2006. ISBN 1-55860-901-6.
Web site: http://wwwfaculty.cs.uiuc.edu/~hanj/bk2/index.html
Course Overview & Objectives:
Data Mining studies algorithms and computational paradigms that allow computers
to find patterns and regularities in databases, perform prediction and forecasting, and
generally improve their performance through interaction with data. It is currently regarded
as the key element of a more general process called Knowledge Discovery that deals with
extracting useful knowledge from raw data. The knowledge discovery process includes
data selection, cleaning, coding, using different statistical, pattern recognition and machine
learning techniques, and reporting and visualization of the generated structures. The course
will cover all these issues and will illustrate the whole process by examples of practical
applications. Students will use recent data mining software.
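As a rough illustration of the knowledge discovery steps named in the overview (selection, cleaning, coding, pattern finding, reporting), the following is a minimal Python sketch. The records, field names, and helper functions are hypothetical and invented for this example only; course work itself uses Weka.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw records (illustrative data, not from the course).
RAW = [
    {"age": "25", "income": "low", "buys": "no"},
    {"age": "47", "income": "high", "buys": "yes"},
    {"age": "", "income": "high", "buys": "yes"},      # missing age
    {"age": "33", "income": "medium", "buys": "yes"},
    {"age": "52", "income": "low", "buys": "no"},
    {"age": "38", "income": "medium", "buys": None},   # missing class label
]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop unlabeled rows, fill missing ages with the mean age."""
    labeled = [dict(r) for r in records if r["buys"] in ("yes", "no")]
    known_ages = [int(r["age"]) for r in labeled if r["age"]]
    fill = round(mean(known_ages))
    for r in labeled:
        r["age"] = int(r["age"]) if r["age"] else fill
    return labeled

def code(records):
    """Coding: map numeric ages to discrete categories."""
    return [
        {"age_group": "under40" if r["age"] < 40 else "40plus",
         "income": r["income"],
         "buys": r["buys"]}
        for r in records
    ]

def mine(records, attribute):
    """Pattern finding: class distribution for each value of one attribute."""
    table = {}
    for r in records:
        table.setdefault(r[attribute], Counter())[r["buys"]] += 1
    return table

def report(table):
    """Reporting: one readable line per attribute value."""
    return [f"{value}: {dict(counts)}" for value, counts in sorted(table.items())]

prepared = code(clean(select(RAW, ["age", "income", "buys"])))
income_table = mine(prepared, "income")
for line in report(income_table):
    print(line)
```

Each function stands in for one stage of the process; in practice every stage is far richer, and the "mining" step would be a real learning algorithm rather than a frequency count.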
Course Objectives
- To introduce students to the basic concepts and techniques of data mining.
- To develop skills in using recent data mining software to solve practical problems.
- To gain experience in independent study and research.
Required Software
Weka is a collection of machine learning and data mining software developed at the
University of Waikato, New Zealand. It is open source software issued under the GNU
General Public License. Download the software from: http://www.cs.waikato.ac.nz/ml/weka/
Collections of datasets
- A jarfile containing 37 classification problems, originally obtained from the UCI
  repository (datasets-UCI.jar, 1,190,961 Bytes).
- A jarfile containing 37 regression problems, obtained from various sources
  (datasets-numeric.jar, 169,344 Bytes).
- A jarfile containing 6 agricultural datasets obtained from agricultural researchers in
  New Zealand (agridatasets.jar, 31,200 Bytes).
- A jarfile containing 30 regression datasets collected by Luis Torgo
  (regressiondatasets.jar, 10,090,266 Bytes).
- A gzipped tar file containing UCI and UCI KDD datasets (uci-20050214.tar.gz,
  15,308,385 Bytes).
- A gzipped tar file containing StatLib datasets (statlib-20050214.tar.gz,
  12,785,582 Bytes).
- A gzipped tar file containing ordinal, real-world datasets donated by Dr. Arie Ben David
  (Holon Inst. of Technology/Israel) (datasets-arie_ben_david.tar.gz, 11,348 Bytes).
- A zip file containing 19 multi-class (1-of-n) text datasets donated by George
  Forman/Hewlett-Packard Labs (19MclassTextWc.zip, 14,084,828 Bytes).
Assessment
Course assessment will be based on the combination of the following:

Assignment                                              Final Mark   Due Date
Term Test 1                                             25           TBA
Final exam                                              40           TBA
Quiz, assignments, reports, project, case study, etc.   35
TOTAL                                                   100
Tentative Schedule
Week / Topics (textbook page numbers in parentheses)

Weeks 1, 2 - Chapter 1: Introduction
1.1 What Motivated Data Mining? Why Is It Important? (p. 1)
1.2 What Is Data Mining? (p. 5)
1.3 Data Mining: On What Kind of Data? (p. 9)
1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined? (p. 21)
  1.4.1 Concept/Class Description: Characterization and Discrimination
  1.4.2 Mining Frequent Patterns, Associations, and Correlations (p. 23)
  1.4.3 Classification and Prediction (p. 24)
  1.4.4 Cluster Analysis (p. 25)
  1.4.5 Outlier Analysis (p. 26)
1.6 Classification of Data Mining Systems (p. 29)
1.7 Data Mining Task Primitives (p. 31)
1.8 Integration of a Data Mining System with a Database or Warehouse System (p. 34)
1.9 Major Issues in Data Mining (p. 36)

Weeks 3, 4 - Chapter 2: Data Preprocessing
2.1 Why Preprocess the Data? (p. 48)
2.2 Descriptive Data Summarization (p. 51)
  2.2.1 Measuring the Central Tendency (p. 51)
  2.2.2 Measuring the Dispersion of Data (p. 53)
  2.2.3 Graphic Displays of Basic Descriptive Data Summaries (p. 56)
2.3 Data Cleaning (p. 61)
  2.3.1 Missing Values (p. 61)
  2.3.2 Noisy Data (p. 62)
  2.3.3 Data Cleaning as a Process (p. 65)
2.4 Data Integration and Transformation (p. 67)
  2.4.1 Data Integration (p. 67)
  2.4.2 Data Transformation (p. 70)
2.5 Data Reduction (p. 72)
  2.5.1 Data Cube Aggregation (p. 73)
  2.5.2 Attribute Subset Selection (p. 75)
  2.5.3 Dimensionality Reduction (p. 77)
Introduction to WEKA Software (lab work)

Weeks 5, 6 - Chapter 5: Mining Frequent Patterns, Associations, and Correlations
5.1 Basic Concepts and a Road Map (p. 227)
  5.1.1 Market Basket Analysis: A Motivating Example (p. 228)
  5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules (p. 230)
  5.1.3 Frequent Pattern Mining: A Road Map (p. 232)
5.2 Efficient and Scalable Frequent Itemset Mining Methods (p. 234)
  5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation (p. 234)
  5.2.2 Generating Association Rules from Frequent Itemsets (p. 239)
  5.2.3 Improving the Efficiency of Apriori (p. 240)
  5.2.4 Mining Frequent Itemsets without Candidate Generation (p. 242)
  5.2.5 Mining Frequent Itemsets Using Vertical Data Format (p. 245)
  5.2.6 Mining Closed Frequent Itemsets (p. 248)
5.3 Mining Various Kinds of Association Rules (p. 250)
  5.3.1 Mining Multilevel Association Rules (p. 250)
  5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses (p. 254)
5.4 From Association Mining to Correlation Analysis (p. 259)
  5.4.1 Strong Rules Are Not Necessarily Interesting: An Example (p. 260)
  5.4.2 From Association Analysis to Correlation Analysis (p. 261)
5.5 Constraint-Based Association Mining (p. 265)
  5.5.1 Metarule-Guided Mining of Association Rules (p. 266)
  5.5.2 Constraint Pushing: Mining Guided by Rule Constraints (p. 267)
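
For students previewing the Chapter 5 material, the idea behind frequent itemset mining with candidate generation (topic 5.2.1) and rule confidence (topic 5.2.2) can be sketched in a few lines of Python. This is a simplified illustration, not the textbook's pseudocode and not Weka's implementation; the transactions, the function names, and the `min_count` threshold are assumptions made up for the example.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset, transactions):
    """Support count: number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets meeting min_count."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        counts = {c: count(c, transactions) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(kept)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

frequent = apriori(TRANSACTIONS, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
print("confidence(diapers => beer):", confidence({"diapers"}, {"beer"}, TRANSACTIONS))
```

The prune step relies on the Apriori property: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before the data is scanned again.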
Midterm exam: covers Chapters 1, 2, and 5.

Weeks 7, 8, 9 - Chapter 6: Classification and Prediction
6.1 What Is Classification? What Is Prediction? (p. 285)
6.2 Issues Regarding Classification and Prediction (p. 289)
  6.2.1 Preparing the Data for Classification and Prediction (p. 289)
  6.2.2 Comparing Classification and Prediction Methods (p. 290)
6.3 Classification by Decision Tree Induction (p. 291)
  6.3.1 Decision Tree Induction (p. 292)
  6.3.2 Attribute Selection Measures (p. 296)
  6.3.3 Tree Pruning (p. 304)
  6.3.4 Scalability and Decision Tree Induction (p. 306)
6.4 Bayesian Classification (p. 310)
  6.4.1 Bayes' Theorem (p. 310)
  6.4.2 Naïve Bayesian Classification (p. 311)
  6.4.3 Bayesian Belief Networks (p. 315)
  6.4.4 Training Bayesian Belief Networks (p. 317)
6.5 Rule-Based Classification (p. 318)
  6.5.1 Using IF-THEN Rules for Classification (p. 319)
  6.5.2 Rule Extraction from a Decision Tree (p. 321)
  6.5.3 Rule Induction Using a Sequential Covering Algorithm (p. 322)
6.11 Prediction (p. 354)
  6.11.1 Linear Regression (p. 355)
  6.11.2 Nonlinear Regression (p. 357)
  6.11.3 Other Regression-Based Methods (p. 358)

Chapter 10: Mining Object, Spatial, Multimedia, Text, and Web Data
10.3 Multimedia Data Mining (p. 607)
  10.3.1 Similarity Search in Multimedia Data (p. 608)
  10.3.2 Multidimensional Analysis of Multimedia Data (p. 609)
  10.3.3 Classification and Prediction Analysis of Multimedia Data (p. 611)
  10.3.4 Mining Associations in Multimedia Data (p. 612)
  10.3.5 Audio and Video Data Mining (p. 613)
10.4 Text Mining (p. 614)
  10.4.1 Text Data Analysis and Information Retrieval (p. 615)
  10.4.2 Dimensionality Reduction for Text (p. 621)
  10.4.3 Text Mining Approaches (p. 624)
10.5 Mining the World Wide Web (p. 628)
  10.5.1 Mining the Web Page Layout Structure (p. 630)
  10.5.2 Mining the Web's Link Structures to Identify Authoritative Web Pages (p. 631)
  10.5.3 Mining Multimedia Data on the Web (p. 637)
  10.5.4 Automatic Classification of Web Documents (p. 638)
  10.5.5 Web Usage Mining

Final exam: covers Chapters 1, 6, and 10.