Course Number: CISC 4327
Course Title: Data Mining Algorithms and Applications
Number of Credits: 3
Location of Class: Davidson Building, Room 122
Meeting Time: 2:30 - 3:50 pm Tuesday & Thursday
Professor: Dr. William G. Tanner, Jr.
Office: Room 119 Davidson Building
Office Hours: posted in Davidson and on-line
Office Phone: (254) 295-4645
Email: btanner@umhb.edu
Class web-page: http://mars.umhb.edu/
With the continuing growth of information sources such as Google and Yahoo, we have truly entered the “Information Age”, and we are also in information overload. For that reason alone, we need efficient and effective algorithms to understand and benefit from the many sources of information.
Data mining is an increasingly important branch of computer science that examines data in order to find and describe patterns. Because we live in a world where we can be overwhelmed with information, it is imperative that we find ways to classify this input, to find the information we need, to illuminate structures, and to draw conclusions. Data mining is a very practical discipline with many applications in science and government, such as web analysis, disease diagnosis and outcome prediction, weather forecasting, fraud detection, and terrorism threat detection. It draws on methods from several fields, mainly machine learning, statistics, and information visualization.
This course examines the design and efficiency of data mining algorithms for classification, association, and the non-trivial discovery of insights and knowledge within data. The data mining applications, linked to data or databases, will use both traditional and new data mining methods. A course in algorithms presents an opportunity to expose students to some of the fundamentals of data mining, in the form of decision trees.
These trees are a basic structure for representing data. Decision tree induction algorithms are used to classify data, perhaps the most common data mining task.
This course will require a lot of out-of-class time. The average student should spend between 3 and 15 hours per week working on programs and projects (keep up, and if you start falling behind, ask for help early). Assignments will be given out in class and posted, along with help, on the CISC 4327 web-page. Web-link: http://mars.umhb.edu/wgt/cisc4327/
This is not a beginning programming course; a previous course in a structured language (Java, C#, or C++) is required. Some of the topics covered will be:
1. Classification and parsing methods.
2. Methods for discovering associations and clusters in data.
3. Data processing, file I/O and data access methods.
4. Data mining algorithms, implemented in C#
5. Incremental learning, Bayesian networks and other classifiers.
6. Use of standard data mining programs and implementation of our own.
Textbook:
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, 2005, Addison Wesley,
ISBN #: 978-0321321367
Textbook Resources: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm
Other items:
A flash drive is required for this class (a 16 GB or larger pen-drive is recommended).
Our computer lab will have appropriate software installed to allow you to program in C++. Either Dev-C++ or Visual C++ is recommended. You are responsible for maintaining backup copies of all your programs. Our web-page at: http://mars.umhb.edu/ will be used to provide software and a BBS for class interaction.
1. Grading: The final grade will be determined according to the distribution described in the UMHB Catalog. The final course grade will be computed from the following percentages:
   Class participation & Daily Assignments   10%
   Laboratory Projects                       10%
   Tests (3) + FINAL                         80%
2. Attendance: The student is expected to attend ALL scheduled classes and will be held responsible for all class work and assignments. Continued absences will result in an unsatisfactory grade report for the course, and a student who attends fewer than 80% of scheduled classes will automatically receive a failing grade (for a TTH course, that means no more than 9 absences).
3. Tests: All students are required to be present for a test. If an extreme emergency occurs and a student cannot make the test time, the student should make every effort to contact the professor by email, telephone, or in person to receive permission to miss the test. Permission will be granted only in the case of extenuating circumstances.
4. Makeup Tests: Students desiring a Makeup Test must make arrangements with the professor to take the test. A Makeup Test must be scheduled during office hours BEFORE the next scheduled test. If a student fails to take a Makeup Test before the next scheduled test, that student will receive a ZERO for the missed test.
5. Assignments: All assignments are due on the DUE-DATE (normally a Tuesday), at the beginning of the class period.
6. Final Exam: The final exam will be comprehensive. NO MAKEUP WILL BE GIVEN FOR THE FINAL.
Day  Date     Topic                                                             Lectures / Exams
0    June 9   Introduction                                                      Chapter 1
     June 9   What is Data Mining?                                              Chapter 1
     June 10  Data: Types, Quality & Preprocessing                              Chapter 2
     June 10  Measures of Similarity and Dissimilarity                          Chapter 2
     June 11  Exploring Data: Summary Statistics                                Chapter 3
     June 11  Visualization & OLAP                                              Chapter 3
     June 15  DUE Take Home Examination 1 (Chapters 1 – 3)                      Exam #1
     June 15  Classification: Basic Concepts & Decision Trees                   Chapter 4
     June 15  Classification: Model Evaluation                                  Chapter 4
     June 16  Classification: Evaluating Classifier                             Chapter 4
     June 16  Classification: Alternative Techniques                            Chapter 5
     June 17  Classification: Rule-Based, Bayesian                              Chapter 5
     June 17  Classification: ANN, SVM, Ensemble Methods                        Chapter 5
     June 22  DUE Take Home Examination 2 (Chapters 4 – 5)                      Exam #2
     June 18  Association Analysis: Basic Concepts - Item Generation            Chapter 6
     June 18  Association Analysis: Rule Generation, FP-Growth Algorithm        Chapter 6
10   June 22  Association Analysis: Evaluation of Association Patterns          Chapter 7
     June 22  Association Analysis: Advanced Concepts, Categorical Attributes   Chapter 7
11   June 23  Association Analysis: Sequential, Subgraph & Infrequent Patterns  Chapter 7
     June 29  DUE Take Home Examination 3 (Chapters 6 – 7)                      Exam #3
12   June 29  Cluster Analysis: Basic Concepts & Algorithms                     Chapter 8
     June 29  Cluster Analysis: DBSCAN & Cluster Evaluation                     Chapter 8
13   June 30  Cluster Analysis: Prototype-Based & Density-Based Clustering      Chapter 9
     June 30  Cluster Analysis: Graph-Based & Scalable Clustering Algorithms    Chapter 9
14   July 01  Review for Final Examination
16   July 02  Final Examination: Chapters (1 – 9)