To: Graduate Council

CSMN 667: Data Mining
(rev 6: January 2005: Kirk Borne)
Spring 2005
Instructor’s Name: Dr. Kirk Borne
Location of class: WebTycho Online
Instructor’s Tel: 301-286-0696
Day & Time of class: Jan. 24 to May 8, 2005
Appointments: contact instructor by e-mail to set up an appointment
As the amount of data has grown, so has the difficulty in analyzing it. Data mining is the
search for hidden, meaningful patterns in large databases. Identifying these patterns and
rules can provide significant competitive advantage to businesses. This course focuses
on the data mining component of the knowledge discovery process. Students will be
introduced to some data mining applications and identify algorithms and techniques
useful for solving different problems. Many of the techniques involve the
application of well-known statistical, machine learning, and database algorithms, including
decision trees, similarity measures, regression, Bayes' theorem, nearest neighbor, neural
networks, and genetic algorithms. Students will also research a data mining application
and learn how to integrate data mining with data warehouses.
Upon completion of this course, the student should be able to:
1. Recognize the role of data mining within knowledge discovery in databases
2. Correctly use data mining terminology.
3. Express the best-known data mining algorithms.
4. Apply statistics, similarity measures, decision trees, neural networks, and genetic
algorithms to data mining tasks.
5. Determine appropriate techniques for classification and clustering applications.
6. Determine approaches used for web content, web structure and web usage mining.
7. Recognize techniques used for temporal mining applications, including pattern
8. Devise an effective case study.
9. Illustrate how to build a data mining application.
10. Formulate specific case studies in data mining and the corresponding techniques
used in those cases (e.g., Human Genome, Counter-Terrorism, Network Security).
11. Express the steps in a data mining project (e.g., cleaning, transforming, indexing).
12. Compare different implementations of data mining (e.g., OLAP, CRM, EDA).
13. Analyze classic examples of data mining and their techniques.
Required text
Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Upper Saddle
River, NJ: Prentice Hall.
Publication manual of the American Psychological Association (5th ed.). (2001)
Washington, DC: American Psychological Association.
Reference texts
Han, J., Kamber, M. (2000) Data Mining: Concepts and Techniques. New York: Morgan
Hand, D., Mannila, H., Smyth, P. (2001) Principles of Data Mining. Cambridge, MA:
MIT Press.
Hastie, T., Tibshirani, R., Friedman, J. (2001) The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York: Springer.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (1996) Advances in
Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press.
Mitchell, T. (1997). Machine Learning. Boston, MA: McGraw-Hill.
The course requirements are as follows:
Mid-Term and Final Exams
There will be two exams.
Lab Project
There will be several lab assignments of equal weight. Some of the lab assignments will
involve running different data mining tasks on a data set stored in a data warehouse.
These tasks will be run using a combination of industry and freeware products. For the
other assignments, students will implement one of the algorithms discussed in class and
use it to ‘mine’ for patterns in an existing data warehouse.
Case Analysis
The goal of the paper assignment is to complete an in-depth study of a data mining
application. Examples of applications include financial, scientific, medical, intrusion
detection and web mining.
Possible case topics include:
• A direct mailing application looking to maximize cross-selling opportunities, e.g.
• A bank determining the credit worthiness of a potential customer, e.g. American
Express, Bank of America.
• A medical insurer looking to detect medical fraud.
• Gene detection in BioInformatics, e.g. Celera.
Class Participation
Participation is critical to maximizing the value of this class. Your participation grade
will be a subjective score assigned by the instructor based on your demonstrated
preparation for discussion and active, timely participation in classroom discussions
throughout the course of the semester.
The final grade will be determined as follows:
Mid-term Examination      20%
Final Examination         20%
Laboratory Assignments    15%
Case Analysis             25%
Class Participation       20%
According to the Graduate School of Management and Technology's grading policy, the
following marks are used:
A (90-100) = Excellent
B (80-89) = Good
C (70-79) = Below standards
F (69 or below) = Failure
FN = Failure for nonattendance
G = Grade pending
P = Passing
S = Satisfactory
U = Unsatisfactory
I = Incomplete
AU = Audit
W = Withdrew
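As an illustration only, the weighting and the numeric grade bands above can be sketched in a few lines of Python. This is not an official grade calculator. It assumes that every component is scored on a 0-100 scale, that the components are combined as a simple weighted average, and that fractional totals are not rounded before a letter mark is assigned; the syllabus does not state any of these, and the function names are hypothetical.

```python
# Component weights (percent of the final grade), taken from the table above.
WEIGHTS = {
    "midterm": 20,
    "final": 20,
    "labs": 15,
    "case_analysis": 25,
    "participation": 20,
}


def final_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_mark(score):
    """Map a numeric score to the A/B/C/F bands listed above.

    Assumption: no rounding is applied, so 89.9 falls in the "B" band.
    """
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "F"


# Hypothetical component scores for one student.
example = {
    "midterm": 85,
    "final": 90,
    "labs": 95,
    "case_analysis": 80,
    "participation": 100,
}
total = final_score(example)      # 89.25
print(total, letter_mark(total))  # 89.25 B
```

In this hypothetical case the Case Analysis carries the largest single weight, so the 80 on that component is what keeps the total just under the "A" band.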
The grade of "B" represents the benchmark for the Graduate School of Management and
Technology. It indicates that the student has demonstrated competency in the subject
matter of the course, e.g., has fulfilled all course requirements on time, has a clear grasp
of the full range of course materials and concepts, and is able to present and apply these
materials and concepts in clear, well-reasoned, well-organized, and grammatically correct
responses, whether written or oral.
Only students who fully meet this standard and, in addition, demonstrate exceptional
comprehension and application of the course subject matter earn a grade of "A."
Students who do not meet the benchmark standard of competency fall within the "C"
range or lower. They, in effect, have not met graduate level standards. Where this failure
is substantial, they can earn an "F." The "FN" grade means a failure in the course because
the student has ceased to attend and participate in course assignments and activities but
has not officially withdrawn.
Effective managers, leaders, and teachers are also effective communicators. Written
communication is an important element of the total communication process. The
Graduate School of Management and Technology recognizes and expects exemplary
writing to be the norm for course work. To this end, all papers, individual and group,
must demonstrate graduate level writing and comply with the format requirements of the
Publication Manual of the American Psychological Association, 5th Edition. Careful
attention should be given to spelling, punctuation, source citations, references, and the
presentation of tables and figures. It is expected that all course work will be presented on
time and error free.
UMUC policy on academic dishonesty and plagiarism
UMUC offers the Vail Tutor, a tutorial program covering scholarly documentation.
The University has a license agreement with a service that helps prevent
plagiarism from internet resources. Your instructor may be using this service in this class
by either requiring students to submit their papers electronically to the service or by
submitting questionable text on behalf of a student. If you or your instructor submit part
or all of your paper, it will be stored by the service in its database throughout the
term of the University's contract with the service. If you object to this temporary
storage of your paper, you must let your instructor know no later than two weeks after the
start of this class. Please note: If you object to the storage of your paper by the service,
your instructor may utilize other services to check your work for plagiarism.
UMUC values its students' feedback. You will be asked to complete a mandatory online
evaluation toward the end of the semester. The primary purpose of this evaluation is to
assess the effectiveness of classroom instruction. UMUC requires all students to complete
this evaluation. Your individual responses are kept confidential.
The evaluation notice will appear on your class screen about 21 days before the end of
the semester. You will have approximately one week to complete the evaluation. If,
within this 21-day period, you do not open the file and either respond to the questions or
click on "no response," you will be "locked out" of the class until you do complete the
evaluation. This means that you will not be able to enter the classroom. Once you have
completed the evaluation, you will regain access to the classroom. If you have any
problem getting back in your classroom, you should immediately contact WebTycho
support at 1.800.807.4862 or at
The Graduate School of Management and Technology takes students' evaluations
seriously, and in order to provide the best learning experience possible, information
provided is used to make continuous improvements to every class. Please take full
advantage of this opportunity to provide constructive recommendations and comments
about potential areas of improvement.
Students with disabilities who want to request and register for services should contact
UMUC's technical director for veteran and disabled student services at least four to six
weeks in advance of registration each semester. Please email or call
301-985-7930 or 301-985-7466 (TTY).
Understanding and navigating through WebTycho is critical to successfully completing
this course. All students are encouraged to complete UMUC’s Orientation to Distance
Education and WebTycho Tour at
The online WebTycho Help Desk is accessible directly in the classroom. In addition,
WebTycho Support is available 24 hours a day, 7 days a week, at 1-800-807-4862 or
SESSION 1: Introduction to data mining
Overview of the course
What is data mining?
Overview of knowledge discovery in databases (KDD)
Data mining vs. KDD
Data mining issues
Readings: Dunham Chapter 1 and
Supplemental reading to be provided by the instructor
SESSION 2: Data mining roots
Database systems
Fuzzy sets and logic
Information retrieval
Data warehousing, OLAP
Machine learning
Foundation for data mining techniques
Readings: Dunham Chapter 2 and
Supplemental reading to be provided by the instructor
SESSION 3: Background techniques used in data mining algorithms
Similarity measures
Decision trees
Neural networks
Genetic algorithms
Readings: Dunham Chapter 3 and
Supplemental reading to be provided by the instructor
SESSION 4: Classification – Part 1
Introduction to classification applications
Issues in classification
Statistical algorithms:
o Regression
o Bayesian classification
Nearest Neighbor algorithm
Readings: Dunham Chapter 4.1 – 4.3 and
Supplemental reading to be provided by the instructor
SESSION 5: Classification – Part 2
Decision Tree algorithm:
o Major considerations
o ID3 approach
o C4.5 and C5.0 approach
o CART approach
Neural network algorithm
Rule-based algorithms
Readings: Dunham Chapter 4.4 – 4.8 and
Supplemental reading to be provided by the instructor
SESSION 6: Clustering Algorithms
Clustering vs. Classification
Popular similarity and distance measures.
Hierarchical clustering algorithms
Partitional algorithms
Algorithms for clustering large databases
Clustering with categorical attributes
Readings: Dunham Chapter 5 and
Supplemental reading to be provided by the instructor
SESSION 8: Association rules – Part 1
Apriori algorithm
Variations using sampling
Variations using partitioning
Readings: Dunham Chapter 6.1 – 6.3 and
Supplemental reading to be provided by the instructor
SESSION 9: Association rules – Part 2
Parallel and distributed algorithms
Overview of concept hierarchies
Generalized association rules
Metrics for measuring rule quality
Readings: Dunham Chapter 6.4 – 6.8 and
Supplemental reading to be provided by the instructor
SESSION 10: Temporal mining
Temporal mining issues
Modeling temporal events
Time series algorithms
Temporal pattern detection
Temporal sequence identification
Temporal association rules
Readings: Dunham Chapter 9 and
Supplemental reading to be provided by the instructor
SESSION 11: Spatial Mining
Spatial mining issues
Modeling spatial events
Spatial indexing schemes and mining algorithms
Readings: Dunham Chapter 8 and
Supplemental reading to be provided by the instructor
SESSION 12: Web Mining
Web mining taxonomy
Web content mining
Web structure mining
Web usage mining
Readings: Dunham Chapter 7 and
Supplemental reading to be provided by the instructor