CS490-001: Introduction to Data Mining
Winter 2016 (201610)

Instructor
Robert J. Hilderman
Office: CW308.23, 3rd Floor, College West
Voice: (306) 585-4061
Fax: (306) 585-4745
e-Mail: robert.hilderman@uregina.ca
WWW: http://www.cs.uregina.ca/~hilder

Office Hours
Location: CW308.23, 3rd Floor, College West
Time: MWF 10:30 – 11:30 AM (or by appointment)

Course Overview
This course will be a mix of self-directed study by the student on various core data mining topics, a series of assignments to test the student’s understanding of the core topics, and a term project based upon a research paper chosen by the student from the recent data mining literature. There will be no exams.

Mark Distribution
Assignments (4) (written)                              40%
Term project proposal (written)                        10%
Term project final report and appendices (written)     50%
                                                     ------
                                                      100%

Note: Your final mark must be at least 75% to pass the course.
Note: At the instructor’s discretion, the final mark may be adjusted +/-5%.

Choosing a Term Project Topic
The research paper upon which your term project is based must be a recent publication; specifically, anything published in 2013, 2014, or 2015 is eligible. It must be approved by the instructor. The scope of your term project must include significant software development, data mining, and results evaluation components.

Term Project Proposal Requirements
The term project proposal must contain the following sections (the minimum requirement):

Statement of Problem: Provide a statement of the problem addressed in the selected paper.

Examples: Provide detailed example(s) of the problem that was solved. These must be complete hand-derived examples different from those contained in the paper.

Overview of the Proposed Solution: Provide an approximate form of the proposed solution in writing. This must be a complete hand-derived example of how the problem was solved, different from that contained in the paper.
Proposed Software Solution: Provide a detailed description of the software that you will develop to solve the problem. A pseudocode overview of the details of your software would be appropriate.

Evaluation Criteria: Describe your plans for testing that the software actually solves the problem correctly.

Experimental Results: Describe the datasets that you will use to generate your experimental results.

References: Provide a complete, properly formatted list of the cited references.

The project proposal must be 8 to 10 double-spaced typewritten pages in 12pt font.

Term Project Final Report Requirements
The final project report should be modeled on a format similar to that of typical research papers that you have read, such as the one on which you based your term project. It must contain the following sections (the minimum requirement):

Introduction: Provide some background on the problem addressed by the project, an overview of the proposed solution, and a description of the report document (i.e., the organization of the report).

Statement of Problem and Examples: This section can be adapted from your term project proposal.

Proposed Approach: Provide detailed descriptions of algorithms, data structures, and/or theoretical results. This section can be adapted from your term project proposal and the software you developed.

Experimental Results: Provide a description of sample/typical experimental results, tabular/graphical comparisons of your results with other published results, and a summary of your results (a detailed description of your results will be in the appendix).

Brief Comparison to Related Work: Provide a detailed analysis and discussion of your results in comparison to other related work.

Conclusions: Provide a summary of your results.

References: Provide a complete, properly formatted list of the cited references.

The final project report must be 16 to 18 double-spaced typewritten pages in 12pt font.
This does not include appendices (see below).

Appendices
Source Code Listing: Provide a complete listing of well-formatted, well-documented source code.

Experimental Results: Provide a complete listing of all experimental results (i.e., both raw data and summary data).

Sources for Reference Materials

Books (many other books are available and those below may have newer editions)
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, 1996.
Berry, M.J.A. and Linoff, G.S., Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 2000.
Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
Hand, D., Mannila, H., and Smyth, P., Principles of Data Mining, The MIT Press, 2001.
Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
Fayyad, U., Grinstein, G.G., and Wierse, A. (eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.
Thuraisingham, B., Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press, 1999.
Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, 1999.
Dunham, M.H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach, Prentice Hall, 2003.
Mitchell, T.M., Machine Learning, McGraw-Hill, 1997.
Guillet, F. and Hamilton, H.J., Quality Measures in Data Mining, Springer, 2007.
Liu, B., Web Data Mining, Springer, 2007.
Wu, X. and Kumar, V., The Top Ten Algorithms in Data Mining, CRC Press, 2009.
Bramer, M., Principles of Data Mining, Springer, 2007.

Conference Proceedings (there are many others that have KDD tracks)
Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)
Proceedings of the European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD)
Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD)
Proceedings of the Data Warehousing and Knowledge Discovery Conference (DaWaK)
Proceedings of the International Conference on Data Mining (ICDM)
Proceedings of the International Conference on Very Large Databases (VLDB)
Proceedings of the International Conference on Management of Data (SIGMOD)

Journals (these are just a few of many dealing with KDD)
IEEE Transactions on Knowledge and Data Engineering
Data Mining and Knowledge Discovery
Intelligent Data Analysis
Journal of Intelligent Information Systems
Knowledge and Information Systems
SIGKDD Explorations

Sources for Real World Datasets
To locate each of the sources shown below, use the terms given as keywords in a web search engine.
UCI KDD Database Repository
UCI Machine Learning Repository
DELVE
FEDSTATS
FIMI Repository
Financial Data Finder
Grain Market Research
Investor Links
MIT Cancer Genomics Gene Expression Datasets
MLnet
National Space Science Data Center
PubGene Gene Database
Stanford Microarray Database
STATLOG Project Datasets
United States Census Bureau
DataCrunch
Reuters-21578 Text Categorization Collection
UCR Time Series Archive
DataWeb
WHO Statistical Information System