Kuwait University
College of Business Administration
Quantitative Methods and Information System Department

Syllabus
QM# Selected Topics: Data Mining: Concepts and Techniques
Fall 2007

Instructor: Dr. Aboul Ella Hassanien
Email: Abo@cba.edu.kw
Web site: http://www.cba.edu.kw/abo
Office Hours: TBA
Course web page: http://www.cba.edu.kw/abo/Datamining

Required Textbook:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, Feb. 2006. ISBN 1-55860-901-6
Web site: http://wwwfaculty.cs.uiuc.edu/~hanj/bk2/index.html

Course Overview & Objectives:
Data Mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data. The knowledge discovery process includes data selection, cleaning, and coding; the application of statistical, pattern recognition, and machine learning techniques; and the reporting and visualization of the generated structures. The course will cover all of these issues and will illustrate the whole process with examples of practical applications. Students will use recent Data Mining software.

Course Objectives
- To introduce students to the basic concepts and techniques of Data Mining.
- To develop skills in using recent data mining software to solve practical problems.
- To gain experience in independent study and research.
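As a rough illustration of the knowledge discovery steps named in the overview (selection, cleaning, coding, pattern finding, reporting), the sketch below runs them on a tiny table. Everything in it is an assumption made for illustration: the records are invented, the function names are hypothetical, and counting item pairs per customer is only one simple stand-in for the mining step. The course itself uses Weka for this work, not this code.

```python
from collections import Counter

# Invented raw records for illustration: (customer, item, store)
RAW = [
    ("c1", "bread", "store-A"),
    ("c1", "milk", "store-A"),
    ("c2", "bread", "store-B"),
    ("c2", "milk", "store-B"),
    ("c2", "eggs", "store-B"),
    ("c2", "  ", "store-B"),     # missing item name
    ("c3", "Milk ", "store-A"),  # inconsistent spelling
    ("c3", "eggs", "store-A"),
]


def select(records):
    """Selection: keep only the fields the analysis needs."""
    return [(customer, item) for customer, item, _store in records]


def clean(pairs):
    """Cleaning: normalise item names and drop records with no item."""
    cleaned = []
    for customer, item in pairs:
        item = item.strip().lower()
        if item:
            cleaned.append((customer, item))
    return cleaned


def encode(pairs):
    """Coding: represent each customer as a set of items (a basket)."""
    baskets = {}
    for customer, item in pairs:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    """Pattern finding: count the baskets that contain each item pair."""
    counts = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                counts[(ordered[i], ordered[j])] += 1
    return counts


def report(counts, n_baskets):
    """Reporting: one line per pair, most frequent first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} + {b}: support {n}/{n_baskets}" for (a, b), n in ranked]


if __name__ == "__main__":
    baskets = encode(clean(select(RAW)))
    for line in report(mine(baskets), len(baskets)):
        print(line)
```

Each function corresponds to one stage, so a stage can be swapped out (for example, a different cleaning rule) without touching the others.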
Required Software
Weka is a set of software for machine learning and data mining developed at the University of Waikato, New Zealand. Weka is open source software issued under the GNU General Public License.
Download the software from: http://www.cs.waikato.ac.nz/ml/weka/

Collections of datasets
- A jarfile containing 37 classification problems, originally obtained from the UCI repository (datasets-UCI.jar, 1,190,961 Bytes).
- A jarfile containing 37 regression problems, obtained from various sources (datasets-numeric.jar, 169,344 Bytes).
- A jarfile containing 6 agricultural datasets obtained from agricultural researchers in New Zealand (agridatasets.jar, 31,200 Bytes).
- A jarfile containing 30 regression datasets collected by Luis Torgo (regressiondatasets.jar, 10,090,266 Bytes).
- A gzip'ed tar containing UCI and UCI KDD datasets (uci-20050214.tar.gz, 15,308,385 Bytes).
- A gzip'ed tar containing StatLib datasets (statlib-20050214.tar.gz, 12,785,582 Bytes).
- A gzip'ed tar containing ordinal, real-world datasets donated by Dr. Arie Ben David (Holon Inst. of Technology/Israel) (datasets-arie_ben_david.tar.gz, 11,348 Bytes).
- A zip file containing 19 multi-class (1-of-n) text datasets donated by George Forman/Hewlett-Packard Labs (19MclassTextWc.zip, 14,084,828 Bytes).

Assessment
Course assessment will be based on a combination of the following:

Assignment                                               Final Mark   Due Date
Term Test 1                                              25           TBA
Final exam                                               40           TBA
Quiz, assignments, reports, project, case study, etc.    35
TOTAL                                                    100

Tentative Schedule
Each topic is followed by its page in the textbook, where the schedule gives one.

Weeks 1, 2
Chapter 1: Introduction
  1.1 What Motivated Data Mining? Why Is It Important? (p. 1)
  1.2 What Is Data Mining? (p. 5)
  1.3 Data Mining—On What Kind of Data? (p. 9)
  1.4 Data Mining Functionalities—What Kinds of Patterns Can Be Mined? (p. 21)
    1.4.1 Concept/Class Description: Characterization and Discrimination
    1.4.2 Mining Frequent Patterns, Associations, and Correlations (p. 23)
    1.4.3 Classification and Prediction (p. 24)
    1.4.4 Cluster Analysis (p. 25)
    1.4.5 Outlier Analysis (p. 26)
  1.6 Classification of Data Mining Systems (p. 29)
  1.7 Data Mining Task Primitives (p. 31)
  1.8 Integration of a Data Mining System with a Database or Warehouse System (p. 34)
  1.9 Major Issues in Data Mining (p. 36)

Weeks 3, 4
Chapter 2: Data Preprocessing
  2.1 Why Preprocess the Data? (p. 48)
  2.2 Descriptive Data Summarization (p. 51)
    2.2.1 Measuring the Central Tendency (p. 51)
    2.2.2 Measuring the Dispersion of Data (p. 53)
    2.2.3 Graphic Displays of Basic Descriptive Data Summaries (p. 56)
  2.3 Data Cleaning (p. 61)
    2.3.1 Missing Values (p. 61)
    2.3.2 Noisy Data (p. 62)
    2.3.3 Data Cleaning as a Process (p. 65)
  2.4 Data Integration and Transformation (p. 67)
    2.4.1 Data Integration (p. 67)
    2.4.2 Data Transformation (p. 70)
  2.5 Data Reduction (p. 72)
    2.5.1 Data Cube Aggregation (p. 73)
    2.5.2 Attribute Subset Selection (p. 75)
    2.5.3 Dimensionality Reduction (p. 77)
Introduction to WEKA Software (lab work)

Weeks 5, 6
Chapter 5: Mining Frequent Patterns, Associations, and Correlations
  5.1 Basic Concepts and a Road Map (p. 227)
    5.1.1 Market Basket Analysis: A Motivating Example (p. 228)
    5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules (p. 230)
    5.1.3 Frequent Pattern Mining: A Road Map (p. 232)
  5.2 Efficient and Scalable Frequent Itemset Mining Methods (p. 234)
    5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation (p. 234)
    5.2.2 Generating Association Rules from Frequent Itemsets (p. 239)
    5.2.3 Improving the Efficiency of Apriori (p. 240)
    5.2.4 Mining Frequent Itemsets without Candidate Generation (p. 242)
    5.2.5 Mining Frequent Itemsets Using Vertical Data Format (p. 245)
    5.2.6 Mining Closed Frequent Itemsets (p. 248)
  5.3 Mining Various Kinds of Association Rules (p. 250)
    5.3.1 Mining Multilevel Association Rules (p. 250)
    5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses (p. 254)
  5.4 From Association Mining to Correlation Analysis (p. 259)
    5.4.1 Strong Rules Are Not Necessarily Interesting: An Example (p. 260)
    5.4.2 From Association Analysis to Correlation Analysis (p. 261)
  5.5 Constraint-Based Association Mining (p. 265)
    5.5.1 Metarule-Guided Mining of Association Rules (p. 266)
    5.5.2 Constraint Pushing: Mining Guided by Rule Constraints (p. 267)

Midterm exam: covers Chapters 1, 2, and 5

Weeks 7, 8, 9
Chapter 6: Classification and Prediction
  6.1 What Is Classification? What Is Prediction? (p. 285)
  6.2 Issues Regarding Classification and Prediction (p. 289)
    6.2.1 Preparing the Data for Classification and Prediction (p. 289)
    6.2.2 Comparing Classification and Prediction Methods (p. 290)
  6.3 Classification by Decision Tree Induction (p. 291)
    6.3.1 Decision Tree Induction (p. 292)
    6.3.2 Attribute Selection Measures (p. 296)
    6.3.3 Tree Pruning (p. 304)
    6.3.4 Scalability and Decision Tree Induction (p. 306)
  6.4 Bayesian Classification (p. 310)
    6.4.1 Bayes’ Theorem (p. 310)
    6.4.2 Naïve Bayesian Classification (p. 311)
    6.4.3 Bayesian Belief Networks (p. 315)
    6.4.4 Training Bayesian Belief Networks (p. 317)
  6.5 Rule-Based Classification (p. 318)
    6.5.1 Using IF-THEN Rules for Classification (p. 319)
    6.5.2 Rule Extraction from a Decision Tree (p. 321)
    6.5.3 Rule Induction Using a Sequential Covering Algorithm (p. 322)
  6.11 Prediction (p. 354)
    6.11.1 Linear Regression (p. 355)
    6.11.2 Nonlinear Regression (p. 357)
    6.11.3 Other Regression-Based Methods (p. 358)

Chapter 10: Mining Object, Spatial, Multimedia, Text, and Web Data
  10.3 Multimedia Data Mining (p. 607)
    10.3.1 Similarity Search in Multimedia Data (p. 608)
    10.3.2 Multidimensional Analysis of Multimedia Data (p. 609)
    10.3.3 Classification and Prediction Analysis of Multimedia Data (p. 611)
    10.3.4 Mining Associations in Multimedia Data (p. 612)
    10.3.5 Audio and Video Data Mining (p. 613)
  10.4 Text Mining (p. 614)
    10.4.1 Text Data Analysis and Information Retrieval (p. 615)
    10.4.2 Dimensionality Reduction for Text (p. 621)
    10.4.3 Text Mining Approaches (p. 624)
  10.5 Mining the World Wide Web (p. 628)
    10.5.1 Mining the Web Page Layout Structure (p. 630)
    10.5.2 Mining the Web’s Link Structures to Identify Authoritative Web Pages (p. 631)
    10.5.3 Mining Multimedia Data on the Web (p. 637)
    10.5.4 Automatic Classification of Web Documents (p. 638)
    10.5.5 Web Usage Mining

Final exam: covers Chapters 1, 6, and 10