COURSE ANNOUNCEMENT Spring 2002 92.6961

92.6961-01 DATA MINING AND KNOWLEDGE DISCOVERY
Many firms have invested heavily in information technology to help them manage their
businesses more effectively and gain a competitive edge. Over the last three decades,
increasingly large amounts of critical business data have been stored electronically and this
volume is expected to continue to grow considerably in the near future. Yet despite this
wealth of data, many companies have been unable to fully capitalize on its value.
Data mining is the computationally intelligent extraction of information from large
databases. It is the process of automated presentation of patterns, rules and functions from
large databases to make crucial business decisions.
This course takes a multi-disciplinary approach to data mining and knowledge
discovery involving statistics, rule and tree induction, neural networks and fuzzy logic.
Experts from several disciplines will provide guest lectures. The course requires a project
and puts a special emphasis on neural networks for data mining.
The course is open to graduate students and seniors of all disciplines.
Instructor: Prof. Mark J. Embrechts (x 4009 or 371-4562)
Office hrs: CII 5217 Tuesday 10-12 am
Class Time: Monday/Thursday 10:00-11:20 am (Sage 3713)
Book: Michael J. A. Berry and Gordon Linoff, Data Mining Techniques:
For Marketing, Sales, and Customer Support, John Wiley (1997).
ISBN 0-471-17980-9
[Figure: data mining process diagram. Stages: database, selected data, transformed data,
extracted info, information. Steps: select, transform, extract, interpret.]
GRADING:
Tests                  10%
5 Homework Projects    35%
Course Project         40%
Presentation           15%
ATTENDANCE POLICY:
Course attendance is mandatory; a make-up project is required for each missed class.
(Each unexcused absence that is not made up results in a half-letter-grade penalty.)
COURSE OUTLINE:
1. INTRODUCTION TO DATA MINING AND KNOWLEDGE DISCOVERY
1.1 What is data mining?
1.2 What is knowledge discovery?
2. MARKET BASKET ANALYSIS
3. DATA PREPARATION
3.1 Data cleaning: outlier removal, noise removal, missing records
3.2 Detrending
3.3 Scaling
3.4 Transformations
4. CLUSTERING TECHNIQUES
4.1 Introduction to the k-means algorithm
4.2 Other clustering techniques
4.3 Hands-on case study
5. DECISION TREES (Guest lecture)
5.1 Introduction to decision trees
5.2 Gains chart
5.3 Software: hands-on case study
6. NEURAL NETWORKS FOR DATA MINING
6.1 Introduction to neural networks
6.2 MetaNeural™: Neural Networks hands-on case study
6.3 Time series prediction with neural networks
7. STATISTICAL TECHNIQUES FOR DM & KNOWLEDGE DISCOVERY
8. FUZZY LOGIC FOR DM & KD
8.1 Introduction to fuzzy logic
8.2 Rule extraction with fuzzy logic
8.3 Case study
9. GENETIC ALGORITHMS
9.1 Introduction
9.2 Rule extraction with GAs
9.3 Case study
10. BAYESIAN CLASSIFICATION & PNNs FOR DM & KD
10.1 Bayesian classification
10.2 Probabilistic neural networks
11. TIME SERIES ANALYSIS
11.1 Is a time series predictable?
11.2 Fractal dimensions
11.3 Hurst analysis of a time series
11.4 Case study
12. SUPPORT VECTORS (Guest lecture)
13. VISUALIZATION TECHNIQUES FOR DM & KD (Guest lecture)
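As a small taste of the clustering topic in the outline, the k-means algorithm can be sketched in a few lines of Python. This is an illustrative sketch only, not course software: the function names `nearest` and `kmeans`, the toy data, and the choice to seed the centers with the first k points are all assumptions made for the example.

```python
def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def kmeans(points, k, max_iters=100):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    # Assumption: the first k points seed the centers; practical code
    # usually randomizes this choice and restarts several times.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iters):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        updated = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:  # converged: no center moved
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(labels)
print(centers)
```

On real data the result depends on the initial centers and on how the variables are scaled, which is one reason data preparation precedes clustering in the outline.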
January 14, 2002
92.6961-01 DATA MINING AND KNOWLEDGE DISCOVERY
LECTURE #1: INTRODUCTION TO DATA MINING AND KNOWLEDGE DISCOVERY
The purpose of the first two lectures is to give an overview of data mining and
knowledge discovery.
Handout:
1. Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, “From data mining
to knowledge discovery: an overview,” in Advances in Knowledge Discovery and
Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and
Ramasamy Uthurusamy, Eds., AAAI Press/The MIT Press, pp. 1-31, 1996.
Tasks:
1. Read chapters 1-2 in Data Mining Techniques and read the handout.
2. Start thinking about a project topic; meet with me during office hours or by
appointment.
3. Browse the WWW for material on data mining and knowledge discovery and prepare a
two-page report about interesting web sites that you visited. Try to comment on less
popular web sites that stand out for something unusual or interesting
(please mention web addresses in the write-up). Note: one of the best and most popular
web addresses for data mining is www.kdnuggets.com. (Due January 25 1998).
SUGGESTED PROJECTS
1. Molecular property data for drug characteristics (Prof. Curt Breneman - Chemistry)
2. Heart disease data for benchmarking different methodologies: Benchmark case
studies for data mining (Mark J. Embrechts - DSES)
3. Support Vectors for nonlinear prediction in large data sets (Prof. Kristin P. Bennett -
Mathematics)
4. Data cleaning: missing data, outliers, and false data? (Mark J. Embrechts)
5. Color spreadsheet for fast discovery of outliers and false data (Mark J. Embrechts)
6. Web page with downloading (Mark J. Embrechts)
7. Fuzzy logic & data mining (Mark J. Embrechts)
8. Genetic Algorithms and data mining (Mark J. Embrechts)
9. Visualization techniques for data mining: development of VR data fly-over
software (Mark J. Embrechts)
10. Bayesian Networks for data mining
11. Neural networks for data mining (Mark J. Embrechts)
12. Data mining with WEBSOM for text retrieval from the WWW (Mark J. Embrechts)
13. Intelligent agents for Data Mining (Mark J. Embrechts)
WHAT IS EXPECTED FROM THE CLASS PROJECT?
• Prepare a monograph about a course-related subject (20 to 30 written pages and
supporting material in appendices).
• Prepare a 20-minute lecture about your project and give the presentation. Hand in a hard
copy of your slides.
• A project starts in the library (or on the web). Prepare to spend at least a full day in the
library over the course of the project. Meticulously write down all the relevant
references, and attach a copy of the most important references to your report.
• The idea for the lecture and the monograph is that you spend the maximum amount of
effort so that a third party can understand (and if necessary present) that same
material, based on your preparation, with a minimal amount of effort.
• A project should be a finished and self-consistent professional document where you
meticulously digest the prerequisite material, give a brief introduction to your work,
and motivate the relevance of the material. Hands-on program development,
personal expansions of, and reflections on the literature are strongly encouraged. If
your project involves programming, hand in a working version of the program (with
source code) and document the program with a user’s manual and sample problems.
• It is expected that you spend on average at least 6 hours/week on the class project.