To: Graduate Council

UNIVERSITY OF MARYLAND UNIVERSITY COLLEGE
GRADUATE SCHOOL
CSMN 667: Data Mining
(rev 6: January 2005: Kirk Borne)
Semester: Spring 2005
Instructor’s Name: Dr. Kirk Borne
Location of class: WebTycho Online
Instructor’s Tel: 301-286-0696
Day & Time of class: Jan. 24 to May 8, 2005
E-mail: kirk.borne@gsfc.nasa.gov
Appointments: contact instructor by e-mail to set up an appointment
COURSE DESCRIPTION:
As the amount of data has grown, so has the difficulty in analyzing it. Data mining is the
search for hidden, meaningful patterns in large databases. Identifying these patterns and
rules can provide significant competitive advantage to businesses. This course focuses
on the data mining component of the knowledge discovery process. Students will be
introduced to some data mining applications and will identify algorithms and techniques
useful for solving different problems. Many of the techniques apply well-known
statistical, machine learning, and database algorithms, including decision trees,
similarity measures, regression, Bayes' theorem, nearest neighbor, neural networks,
and genetic algorithms. Students will also research a data mining application
and learn how to integrate data mining with data warehouses.
COURSE OBJECTIVES:
Upon completion of this course, the student should be able to:
1. Recognize the role of data mining within knowledge discovery in databases
(KDD).
2. Correctly use data mining terminology.
3. Express the best-known data mining algorithms.
4. Apply statistics, similarity measures, decision trees, neural networks, and genetic
algorithms to data mining tasks.
5. Determine appropriate techniques for classification and clustering applications.
6. Determine approaches used for web content, web structure and web usage mining.
7. Recognize techniques used for temporal mining applications, including pattern
detection.
8. Devise an effective case study.
9. Illustrate how to build a data mining application.
10. Formulate specific case studies in data mining and the corresponding techniques
used in those cases (e.g., Human Genome, Counter-Terrorism, Network Security).
11. Express the steps in a data mining project (e.g., cleaning, transforming, indexing).
12. Compare different implementations of data mining (e.g., OLAP, CRM, EDA).
13. Analyze classic examples of data mining and their techniques.
TEXTS:
Required text
Dunham, M. H. (2003). Data mining: Introductory and advanced topics. Upper Saddle
River, NJ: Prentice Hall.
Publication manual of the American Psychological Association (5th ed.). (2001).
Washington, DC: American Psychological Association.
Reference texts
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan
Kaufmann.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA:
MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning:
Data mining, inference, and prediction. New York: Springer.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.). (1996). Advances in
knowledge discovery and data mining. Cambridge, MA: MIT Press.
Mitchell, T. (1997). Machine learning. Boston, MA: McGraw-Hill.
COURSE REQUIREMENTS:
The course requirements are as follows:
Mid-Term and Final Exams
There will be two exams.
Lab Assignments
There will be several lab assignments of equal weight. Some of the lab assignments will
involve running different data mining tasks on a data set stored in a data warehouse.
These tasks will be run using a combination of industry and freeware products. For the
other assignments, students will implement one of the algorithms discussed in class and
use it to ‘mine’ for patterns in an existing data warehouse.
Case Analysis
The goal of the paper assignment is to complete an in-depth study of a data mining
application. Examples of applications include financial, scientific, medical, intrusion
detection and web mining.
Possible case topics include:
• A direct mailing application looking to maximize cross-selling opportunities, e.g.,
  DoubleClick.
• A bank determining the creditworthiness of a potential customer, e.g., American
  Express, Bank of America.
• A medical insurer looking to detect medical fraud.
• Gene detection in bioinformatics, e.g., Celera.
Class Participation
Participation is critical to maximizing the value of this class. Your participation grade
will be a subjective score assigned by the instructor based on your demonstrated
preparation for discussion and active, timely participation in classroom discussions
throughout the course of the semester.
GRADING:
The final grade will be determined as follows:
Mid-term Examination        20%
Final Examination           20%
Laboratory Assignments      15%
Case Analysis               25%
Class Participation         20%
GRADUATE SCHOOL GRADING GUIDELINES:
According to the Graduate School of Management and Technology's grading policy, the
following marks are used:
A (90-100) = Excellent
B (80-89) = Good
C (70-79) = Below standards
F (69 or below) = Failure
FN = Failure for nonattendance
G = Grade pending
P = Passing
S = Satisfactory
U = Unsatisfactory
I = Incomplete
AU = Audit
W = Withdrew
The grade of "B" represents the benchmark for the Graduate School of Management and
Technology. It indicates that the student has demonstrated competency in the subject
matter of the course, e.g., has fulfilled all course requirements on time, has a clear grasp
of the full range of course materials and concepts, and is able to present and apply these
materials and concepts in clear, well-reasoned, well-organized, and grammatically correct
responses, whether written or oral.
Only students who fully meet this standard and, in addition, demonstrate exceptional
comprehension and application of the course subject matter earn a grade of "A."
Students who do not meet the benchmark standard of competency fall within the "C"
range or lower. They, in effect, have not met graduate level standards. Where this failure
is substantial, they can earn an "F." The "FN" grade means a failure in the course because
the student has ceased to attend and participate in course assignments and activities but
has not officially withdrawn.
WRITING STANDARDS
Effective managers, leaders, and teachers are also effective communicators. Written
communication is an important element of the total communication process. The
Graduate School of Management and Technology recognizes and expects exemplary
writing to be the norm for course work. To this end, all papers, individual and group,
must demonstrate graduate level writing and comply with the format requirements of the
Publication Manual of the American Psychological Association, 5th Edition. Careful
attention should be given to spelling, punctuation, source citations, references, and the
presentation of tables and figures. It is expected that all course work will be presented on
time and error free.
POLICY ON ACADEMIC INTEGRITY AND PLAGIARISM
UMUC policy on academic dishonesty and plagiarism
UMUC offers the Vail Tutor, a tutorial program covering scholarly documentation
practices.
The University has a license agreement with Turnitin.com, a service that helps prevent
plagiarism from internet resources. Your instructor may be using this service in this class
by either requiring students to submit their papers electronically to Turnitin.com or by
submitting questionable text on behalf of a student. If you or your instructor submits part
or all of your paper, it will be stored by Turnitin.com in their database throughout the
term of the University's contract with Turnitin.com. If you object to this temporary
storage of your paper, you must let your instructor know no later than two weeks after the
start of this class. Please Note: If you object to the storage of your paper on Turnitin.com,
your instructor may utilize other services to check your work for plagiarism.
COURSE EVALUATION FORM
UMUC values its students' feedback. You will be asked to complete a mandatory online
evaluation toward the end of the semester. The primary purpose of this evaluation is to
assess the effectiveness of classroom instruction. UMUC requires all students to complete
this evaluation. Your individual responses are kept confidential.
The evaluation notice will appear on your class screen about 21 days before the end of
the semester. You will have approximately one week to complete the evaluation. If,
within this 21-day period, you do not open the file and either respond to the questions or
click on "no response," you will be "locked out" of the class until you do complete the
evaluation. This means that you will not be able to enter the classroom. Once you have
completed the evaluation, you will regain access to the classroom. If you have any
problem getting back in your classroom, you should immediately contact WebTycho
support at 1.800.807.4862 or at webtychosupport@umuc.edu.
The Graduate School of Management and Technology takes students' evaluations
seriously, and in order to provide the best learning experience possible, information
provided is used to make continuous improvements to every class. Please take full
advantage of this opportunity to provide constructive recommendations and comments
about potential areas of improvement.
STUDENTS WITH DISABILITIES
Students with disabilities who want to request and register for services should contact
UMUC's technical director for veteran and disabled student services at least four to six
weeks in advance of registration each semester. Please email vdsa@umuc.edu or call
301-985-7930 or 301-985-7466 (TTY).
TECHNICAL ASSISTANCE AND WEBTYCHO SUPPORT:
Understanding and navigating through WebTycho is critical to successfully completing
this course. All students are encouraged to complete UMUC’s Orientation to Distance
Education and WebTycho Tour at http://www.umuc.edu/distance/de_orien/.
The online WebTycho Help Desk is accessible directly in the classroom. In addition,
WebTycho Support is available 24 hours a day, 7 days a week, at 1-800-807-4862 or
webtychosupport@umuc.edu.
COURSE READING ASSIGNMENTS AND SCHEDULE:
SESSION 1: Introduction to data mining
• Overview of the course
• What is data mining?
• Overview of knowledge discovery in databases (KDD)
• Data mining vs. KDD
• Data mining issues
Readings: Dunham Chapter 1 and
Supplemental reading to be provided by the instructor
SESSION 2: Data mining roots
• Database systems
• Fuzzy sets and logic
• Information retrieval
• Data warehousing, OLAP
• Statistics
• Machine learning
• Foundation for data mining techniques
Readings: Dunham Chapter 2 and
Supplemental reading to be provided by the instructor
SESSION 3: Background techniques used in data mining algorithms
• Statistics
• Similarity measures
• Decision trees
• Neural networks
• Genetic algorithms
Readings: Dunham Chapter 3 and
Supplemental reading to be provided by the instructor
SESSION 4: Classification – Part 1
• Introduction to classification applications
• Issues in classification
• Statistical algorithms:
  o Regression
  o Bayesian classification
• Nearest Neighbor algorithm
Readings: Dunham Chapter 4.1 – 4.3 and
Supplemental reading to be provided by the instructor
SESSION 5: Classification – Part 2
• Decision Tree algorithm:
  o Major considerations
  o ID3 approach
  o C4.5 and C5.0 approach
  o CART approach
• Neural network algorithm
• Rule-based algorithms
Readings: Dunham Chapter 4.4 – 4.8 and
Supplemental reading to be provided by the instructor
SESSION 6: Clustering Algorithms
• Clustering vs. Classification
• Popular similarity and distance measures
• Hierarchical clustering algorithms
• Partitional algorithms
• Algorithms for clustering large databases
• Clustering with categorical attributes
Readings: Dunham Chapter 5 and
Supplemental reading to be provided by the instructor
SESSION 7: MIDTERM
SESSION 8: Association rules – Part 1
• Apriori algorithm
• Variations using sampling
• Variations using partitioning
Readings: Dunham Chapter 6.1 – 6.3 and
Supplemental reading to be provided by the instructor
SESSION 9: Association rules – Part 2
• Parallel and distributed algorithms
• Overview of concept hierarchies
• Generalized association rules
• Metrics for measuring rule quality
Readings: Dunham Chapter 6.4 – 6.8 and
Supplemental reading to be provided by the instructor
SESSION 10: Temporal mining
• Temporal mining issues
• Modeling temporal events
• Time series algorithms
• Temporal pattern detection
• Temporal sequence identification
• Temporal association rules
Readings: Dunham Chapter 9 and
Supplemental reading to be provided by the instructor
SESSION 11: Spatial Mining
• Spatial mining issues
• Modeling spatial events
• Spatial indexing schemes and mining algorithms
Readings: Dunham Chapter 8 and
Supplemental reading to be provided by the instructor
SESSION 12: Web Mining
• Web mining taxonomy
• Web content mining
• Web structure mining
• Web usage mining
Readings: Dunham Chapter 7 and
Supplemental reading to be provided by the instructor
SESSION 13: FINAL EXAM