COURSE ANNOUNCEMENT
Spring 2002

92.6961-01 DATA MINING AND KNOWLEDGE DISCOVERY

Many firms have invested heavily in information technology to help them manage their businesses more effectively and gain a competitive edge. Over the last three decades, increasingly large amounts of critical business data have been stored electronically, and this volume is expected to continue to grow considerably in the near future. Yet despite this wealth of data, many companies have been unable to fully capitalize on its value.

Data mining is the computationally intelligent extraction of information from large databases. It is the process of automatically presenting patterns, rules and functions drawn from large databases to support crucial business decisions. This course takes a multi-disciplinary approach to data mining and knowledge discovery involving statistics, rule and tree induction, neural networks and fuzzy logic. Experts from several disciplines will provide guest lectures. The course requires a project and puts a special emphasis on neural networks for data mining. The course is open to graduate students and seniors of all disciplines.

Instructor: Prof. Mark J. Embrechts (x 4009 or 371-4562)
Office hrs: CII 5217, Tuesday 10-12 am
Class Time: Monday/Thursday 10:00-11:20 am (Sage 3713)
Book: Michael J. A. Berry and Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley (1997). ISBN 0-471-17980-9

[Figure: knowledge discovery process. Labels: database, select, selected data, transform, transformed data, extract information, extracted info, interpret]

GRADING:
Tests                 10%
5 Homework Projects   35%
Course Project        40%
Presentation          15%

ATTENDANCE POLICY:
Course attendance is mandatory; a make-up project is required for each missed class. (Unexcused absences that are not made up result in a half-letter-grade penalty.)

COURSE OUTLINE:

1. INTRODUCTION TO DATA MINING AND KNOWLEDGE DISCOVERY
   1.1 What is data mining?
   1.2 What is knowledge discovery?
2. MARKET BASKET ANALYSIS
3. DATA PREPARATION
   3.1 Data cleaning: outlier removal, noise removal, missing records
   3.2 Detrending
   3.3 Scaling
   3.4 Transformations
4. CLUSTERING TECHNIQUES
   4.1 Introduction: k-means algorithm
   4.2 Other clustering techniques
   4.3 Hands-on case study
5. DECISION TREES (Guest lecture)
   5.1 Introduction to decision trees
   5.2 Gains chart
   5.3 Software: hands-on case study
6. NEURAL NETWORKS FOR DATA MINING
   6.1 Introduction to neural networks
   6.2 MetaNeural™: Neural Networks hands-on case study
   6.3 Time series prediction with neural networks
7. STATISTICAL TECHNIQUES FOR DM & KNOWLEDGE DISCOVERY
8. FUZZY LOGIC FOR DM & KD
   8.1 Introduction to fuzzy logic
   8.2 Rule extraction with fuzzy logic
   8.3 Case study
9. GENETIC ALGORITHMS
   9.2 Introduction
   9.3 Rule extraction with GAs
   9.4 Case study
10. BAYESIAN CLASSIFICATION & PNNs FOR DM & KD
   10.1 Bayesian classification
   10.2 Probabilistic neural networks
11. TIME SERIES ANALYSIS
   11.1 Is a time series predictable?
   11.2 Fractal dimensions
   11.3 Hurst analysis of a time series
   11.4 Case study
12. SUPPORT VECTORS (Guest lecture)
13. VISUALIZATION TECHNIQUES FOR DM & KD (Guest lecture)

January 14, 2002

92.6961-01 DATA MINING AND KNOWLEDGE DISCOVERY

LECTURE #1: INTRODUCTION TO DATA MINING AND KNOWLEDGE DISCOVERY

The purpose of the first two lectures is to give an overview of data mining and knowledge discovery.

Handout:
1. Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, “From data mining to knowledge discovery: an overview,” from Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, Eds., AAAI Press/The MIT Press, pp. 1-31, 1996.

Tasks:
1. Read chapters 1-2 in Data Mining Techniques and read the handout.
2. Start thinking about a project topic; meet with me during office hours or by appointment.
3. Browse the WWW about data mining and knowledge discovery and prepare a two-page report about interesting web sites that you visited.
   Try to comment on less popular web sites that stand out for something unusual or interesting (please mention web addresses in the write-up). Note: one of the best and most popular web addresses for data mining is www.kdnuggets.com. (Due January 25, 1998.)

SUGGESTED PROJECTS

1. Molecular property data for drug characteristics (Prof. Curt Breneman - Chemistry)
2. Heart disease data for benchmarking different methodologies: benchmark case studies for data mining (Mark J. Embrechts - DSES)
3. Support Vectors for nonlinear prediction in large data sets (Prof. Kristin P. Bennett - Mathematics)
4. Data cleaning: missing data, outliers, and false data? (Mark J. Embrechts)
5. Color spreadsheet for fast discovery of outliers and false data (Mark J. Embrechts)
6. Web page with downloading (Mark J. Embrechts)
7. Fuzzy logic & data mining (Mark J. Embrechts)
8. Genetic Algorithms and data mining (Mark J. Embrechts)
9. Visualization techniques for data mining: development of VR fly-over the data software (Mark J. Embrechts)
10. Bayesian Networks for data mining
11. Neural networks for data mining (Mark J. Embrechts)
12. Data mining with WEBSOM for text retrieval from the WWW (Mark J. Embrechts)
13. Intelligent agents for Data Mining (Mark J. Embrechts)

WHAT IS EXPECTED FROM THE CLASS PROJECT?

• Prepare a monograph about a course-related subject (20 to 30 written pages, with supporting material in appendices).
• Prepare a 20-minute lecture about your project and give the presentation. Hand in a hard copy of your slides.
• A project starts in the library (or on the web). Prepare to spend at least a full day in the library over the course of the project. Meticulously write down all the relevant references, and attach a copy of the most important references to your report.
• The idea behind the lecture and the monograph is that you invest the maximum effort so that a third party can understand (and, if necessary, present) the same material from your preparation with minimal effort.
• A project should be a finished and self-consistent professional document in which you meticulously digest the prerequisite material, give a brief introduction to your work, and motivate the relevance of the material. Hands-on program development, personal expansions of, and reflections on the literature are strongly encouraged. If your project involves programming, hand in a working version of the program (with source code) and document the program with a user’s manual and sample problems.
• You are expected to spend, on average, at least 6 hours/week on the class project.