UNIVERSITY OF MARYLAND UNIVERSITY COLLEGE
GRADUATE SCHOOL
CSMN 667: Data Mining
(rev 6: January 2005: Kirk Borne)

Semester: Spring 2005
Instructor’s Name: Dr. Kirk Borne
Location of class: WebTycho Online
Instructor’s Tel: 301-286-0696
Day & Time of class: Jan. 24 to May 8, 2005
E-mail: kirk.borne@gsfc.nasa.gov
Appointments: contact instructor by e-mail to set up an appointment

COURSE DESCRIPTION:
As the amount of data has grown, so has the difficulty of analyzing it. Data mining is the search for hidden, meaningful patterns in large databases. Identifying these patterns and rules can provide significant competitive advantage to businesses. This course focuses on the data mining component of the knowledge discovery process. Students will be introduced to some data mining applications and will identify algorithms and techniques useful for solving different problems. Many of the techniques apply well-known statistical, machine learning and database algorithms, including decision trees, similarity measures, regression, Bayes' theorem, nearest neighbor, neural networks and genetic algorithms. Students will also research a data mining application and learn how to integrate data mining with data warehouses.

COURSE OBJECTIVES:
Upon completion of this course, the student should be able to:
1. Recognize the role of data mining within knowledge discovery in databases (KDD).
2. Correctly use data mining terminology.
3. Express the best-known data mining algorithms.
4. Apply statistics, similarity measures, decision trees, neural networks and genetic algorithms to data mining tasks.
5. Determine appropriate techniques for classification and clustering applications.
6. Determine approaches used for web content, web structure and web usage mining.
7. Recognize techniques used for temporal mining applications, including pattern detection.
8. Devise an effective case study.
9. Illustrate how to build a data mining application.
10. Formulate specific case studies in data mining and the corresponding techniques used in those cases (e.g., Human Genome, Counter-Terrorism, Network Security).
11. Express the steps in a data mining project (e.g., cleaning, transforming, indexing).
12. Compare different implementations of data mining (e.g., OLAP, CRM, EDA).
13. Analyze classic examples of data mining and their techniques.

TEXTS:
Required text
  Dunham, M.H. (2003). Data Mining: Introductory and Advanced Topics. Upper Saddle River, NJ: Prentice Hall.
  Publication Manual of the American Psychological Association (5th ed.). (2001). Washington, DC: American Psychological Association.

Reference texts
  Han, J., Kamber, M. (2000). Data Mining: Concepts and Techniques. New York: Morgan Kaufmann.
  Hand, D., Mannila, H., Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press.
  Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
  Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press.
  Mitchell, T. (1997). Machine Learning. Boston, MA: McGraw-Hill.

COURSE REQUIREMENTS:
The course requirements are as follows:

Mid-Term and Final Exams
There will be two exams.

Lab Project
There will be several lab assignments of equal weight. Some of the lab assignments will involve running different data mining tasks on a data set stored in a data warehouse. These tasks will be run using a combination of industry and freeware products. For the other assignments, students will implement one of the algorithms discussed in class and use it to ‘mine’ for patterns in an existing data warehouse.

Case Analysis
The goal of the paper assignment is to complete an in-depth study of a data mining application. Examples of applications include financial, scientific, medical, intrusion detection and web mining.
Possible case topics include:
  A direct mailing application looking to maximize cross-selling opportunities, e.g. Doubleclick.
  A bank determining the credit worthiness of a potential customer, e.g. American Express, Bank of America.
  A medical insurer looking to detect medical fraud.
  Gene detection in BioInformatics, e.g. Celera.

Class Participation
Participation is critical to maximizing the value of this class. Your participation grade will be a subjective score assigned by the instructor based on your demonstrated preparation for discussion and active, timely participation in classroom discussions throughout the course of the semester.

GRADING:
The final grade will be determined as follows:
  Mid-term Examination      20%
  Final Examination         20%
  Laboratory Assignments    15%
  Case Analysis             25%
  Class Participation       20%

GRADUATE SCHOOL GRADING GUIDELINES:
According to the Graduate School of Management and Technology's grading policy, the following marks are used:
  A (90-100) = Excellent
  B (80-89) = Good
  C (70-79) = Below standards
  F (69 or below) = Failure
  FN = Failure for nonattendance
  G = Grade pending
  P = Passing
  S = Satisfactory
  U = Unsatisfactory
  I = Incomplete
  AU = Audit
  W = Withdrew

The grade of "B" represents the benchmark for the Graduate School of Management and Technology. It indicates that the student has demonstrated competency in the subject matter of the course, e.g., has fulfilled all course requirements on time, has a clear grasp of the full range of course materials and concepts, and is able to present and apply these materials and concepts in clear, well-reasoned, well-organized, and grammatically correct responses, whether written or oral.
Only students who fully meet this standard and, in addition, demonstrate exceptional comprehension and application of the course subject matter earn a grade of "A." Students who do not meet the benchmark standard of competency fall within the "C" range or lower. They, in effect, have not met graduate level standards. Where this failure is substantial, they can earn an "F." The "FN" grade means a failure in the course because the student has ceased to attend and participate in course assignments and activities but has not officially withdrawn.

WRITING STANDARDS
Effective managers, leaders, and teachers are also effective communicators. Written communication is an important element of the total communication process. The Graduate School of Management and Technology recognizes and expects exemplary writing to be the norm for course work. To this end, all papers, individual and group, must demonstrate graduate level writing and comply with the format requirements of the Publication Manual of the American Psychological Association, 5th Edition. Careful attention should be given to spelling, punctuation, source citations, references, and the presentation of tables and figures. It is expected that all course work will be presented on time and error free.

POLICY ON ACADEMIC INTEGRITY AND PLAGIARISM
UMUC policy on academic dishonesty and plagiarism

UMUC offers the Vail Tutor, a tutorial program covering scholarly documentation practices.

The University has a license agreement with Turnitin.com, a service that helps prevent plagiarism from internet resources. Your instructor may be using this service in this class by either requiring students to submit their papers electronically to Turnitin.com or by submitting questionable text on behalf of a student. If you or your instructor submit part or all of your paper, it will be stored by Turnitin.com in their database throughout the term of the University's contract with Turnitin.com.
If you object to this temporary storage of your paper, you must let your instructor know no later than two weeks after the start of this class. Please Note: If you object to the storage of your paper on Turnitin.com, your instructor may utilize other services to check your work for plagiarism.

COURSE EVALUATION FORM
UMUC values its students' feedback. You will be asked to complete a mandatory online evaluation toward the end of the semester. The primary purpose of this evaluation is to assess the effectiveness of classroom instruction. UMUC requires all students to complete this evaluation. Your individual responses are kept confidential. The evaluation notice will appear on your class screen about 21 days before the end of the semester. You will have approximately one week to complete the evaluation. If, within this 21-day period, you do not open the file and either respond to the questions or click on "no response," you will be "locked out" of the class until you do complete the evaluation. This means that you will not be able to enter the classroom. Once you have completed the evaluation, you will regain access to the classroom. If you have any problem getting back in your classroom, you should immediately contact WebTycho support at 1.800.807.4862 or at webtychosupport@umuc.edu.

The Graduate School of Management and Technology takes students' evaluations seriously, and in order to provide the best learning experience possible, information provided is used to make continuous improvements to every class. Please take full advantage of this opportunity to provide constructive recommendations and comments about potential areas of improvement.

STUDENTS WITH DISABILITIES
Students with disabilities who want to request and register for services should contact UMUC's technical director for veteran and disabled student services at least four to six weeks in advance of registration each semester.
Please email vdsa@umuc.edu or call 301-985-7930 or 301-985-7466 (TTY).

TECHNICAL ASSISTANCE AND WEBTYCHO SUPPORT:
Understanding and navigating through WebTycho is critical to successfully completing this course. All students are encouraged to complete UMUC’s Orientation to Distance Education and WebTycho Tour at http://www.umuc.edu/distance/de_orien/. The online WebTycho Help Desk is accessible directly in the classroom. In addition, WebTycho Support is available 24 hours a day, 7 days a week, at 1-800-807-4862 or webtychosupport@umuc.edu.

COURSE READING ASSIGNMENTS AND SCHEDULE:

SESSION 1: Introduction to data mining
  Overview of the course
  What is data mining?
  Overview of knowledge discovery in databases (KDD)
  Data mining vs. KDD
  Data mining issues
  Readings: Dunham Chapter 1 and supplemental reading to be provided by the instructor

SESSION 2: Data mining roots
  Database systems
  Fuzzy sets and logic
  Information retrieval
  Data warehousing, OLAP
  Statistics
  Machine learning
  Foundation for data mining techniques
  Readings: Dunham Chapter 2 and supplemental reading to be provided by the instructor

SESSION 3: Background techniques used in data mining algorithms
  Statistics
  Similarity measures
  Decision trees
  Neural networks
  Genetic algorithms
  Readings: Dunham Chapter 3 and supplemental reading to be provided by the instructor

SESSION 4: Classification – Part 1
  Introduction to classification applications
  Issues in classification
  Statistical algorithms:
    o Regression
    o Bayesian classification
  Nearest Neighbor algorithm
  Readings: Dunham Chapter 4.1 – 4.3 and supplemental reading to be provided by the instructor

SESSION 5: Classification – Part 2
  Decision Tree algorithm:
    o Major considerations
    o ID3 approach
    o C4.5 and C5.0 approach
    o CART approach
  Neural network algorithm
  Rule-based algorithms
  Readings: Dunham Chapter 4.4 – 4.8 and supplemental reading to be provided by the instructor

SESSION 6: Clustering Algorithms
  Clustering vs. Classification
  Popular similarity and distance measures
  Hierarchical clustering algorithms
  Partitional algorithms
  Algorithms for clustering large databases
  Clustering with categorical attributes
  Readings: Dunham Chapter 5 and supplemental reading to be provided by the instructor

SESSION 7: MIDTERM

SESSION 8: Association rules – Part 1
  Apriori algorithm
  Variations using sampling
  Variations using partitioning
  Readings: Dunham Chapter 6.1 – 6.3 and supplemental reading to be provided by the instructor

SESSION 9: Association rules – Part 2
  Parallel and distributed algorithms
  Overview of concept hierarchies
  Generalized association rules
  Metrics for measuring rule quality
  Readings: Dunham Chapter 6.4 – 6.8 and supplemental reading to be provided by the instructor

SESSION 10: Temporal mining
  Temporal mining issues
  Modeling temporal events
  Time series algorithms
  Temporal pattern detection
  Temporal sequence identification
  Temporal association rules
  Readings: Dunham Chapter 9 and supplemental reading to be provided by the instructor

SESSION 11: Spatial Mining
  Spatial mining issues
  Modeling spatial events
  Spatial indexing schemes and mining algorithms
  Readings: Dunham Chapter 8 and supplemental reading to be provided by the instructor

SESSION 12: Web Mining
  Web mining taxonomy
  Web content mining
  Web structure mining
  Web usage mining
  Readings: Dunham Chapter 7 and supplemental reading to be provided by the instructor

SESSION 13: FINAL EXAM