Instructor: Dr. Vladimir Zanev
Office Location/Phone Number: CCT 442 / 569-3056
Office Hours: Mon-Thu, 3:00 p.m. - 4:00 p.m.; Fri, 10:00 a.m. - 11:00 a.m.
E-mail: WebCT class e-mail or zanev_vladimir@colstate.edu
Website: http://webct.colstate.edu
         http://csc.colstate.edu/zanev/current_courses.asp

This course is offered as an online class in the Spring semester 2007. The class meets 100% online at http://webct.colstate.edu.

Online Interface

WebCT Vista will be the primary method of online interaction in this course. Course materials (course outline, schedule, assignments, projects, course notes, datasets, discussions, resources, and grading) will be available through WebCT Vista.

You can access WebCT Vista at http://webct.colstate.edu. On that page, click the "Log-in" link to open the WebCT Vista logon dialog box, which will ask for your WebCT Vista username and password:

Username: lastname_firstname
Password: DDMMYY, where DDMMYY is the student's birth date (example: a birthday of Oct. 25, 1978 is 251078)

If you try the above and WebCT Vista will not let you in, please use the "Comments/Problems" link at the bottom of the WebCT Vista home page to request help. If you are still having problems gaining access a day or so after the class begins, please email me.

Once you have clicked on the course's name and accessed the course itself, you will find a home page with links to other sections and tools, and a menu on the left-hand side. The course home page and the left-hand menu give you access to all course materials.

Course Description and Objectives

Course Description: Prerequisites - CPSC 5115 Algorithm Analysis and Design, CPSC 5138 Advanced DBMS. These prerequisites will not be enforced.
Consider them suggested background that will help you pass this course with ease. You are not required to have taken these courses; however, completing them or having a working knowledge of the respective areas will greatly help you succeed in this class.

This course is an introduction to data mining. Recent advances in database technology, along with the phenomenal growth of the Internet, have resulted in an explosion of data collected, stored, and disseminated by various organizations. Because of its massive size, it is difficult for analysts to sift through the data even though it may contain useful information. Data mining holds great promise to address this problem by providing efficient techniques to uncover useful information hidden in large data repositories.

Data mining is a modern area of computer science concerned with the automated or convenient extraction of patterns that represent previously unknown knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. In this course we will approach the data mining problem from the position of database design and programming. We will discuss suitable data models, data preparation, and finally the different methods and algorithms one can implement to discover new knowledge from raw data.

The key objectives of this course are two-fold: (1) to teach the fundamental concepts of data mining and (2) to provide extensive hands-on experience in applying the concepts to real-world applications.

The core topics to be covered in this course include:
- data and exploring/preprocessing data
- classification data mining algorithms and methods
- association analysis data mining algorithms and methods
- cluster data mining algorithms and methods
- SQL Server 2005 data mining environment

Expected Outcomes

At the completion of this course, students will have an understanding and knowledge of:
- What is data mining?
- Data and exploring data: sampling, data cleaning, feature selection, and dimensionality reduction
- Classification: basic concepts, decision trees, model evaluation
- Classification: naive Bayes, time series, neural networks
- Association analysis: basic concepts and algorithms, Apriori algorithm
- Cluster analysis: basic concepts and algorithms, partitional and hierarchical clustering methods
- SQL Server 2005 environment, tools, and algorithms
- How to use SQL Server 2005 for data mining

Textbook

Both textbooks are required.

Title: Introduction to Data Mining
Authors: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
Edition: 2006
Publisher: Addison-Wesley
ISBN: 0-321-32136-7

Title: Data Mining with SQL Server 2005
Authors: ZhaoHui Tang and Jamie MacLennan
Edition: 2005
Publisher: Wiley Publishing Inc.
ISBN: 0-471-46261-6

Software

To complete all lessons, the data mining project, assignments, discussions, and exams, you will need:
- a computer with Windows 2000/XP, Internet Explorer, PowerPoint, and Word
- SQL Server 2005 (see the Resources Web page for details on how to obtain SQL Server 2005)
- access to WebCT Vista at CSU

Methods of Instruction

- Online Study
- Assignments
- Data Mining Project
- Discussions
- Midterm Exam
- Final Exam

Online Study

Each student is expected to complete all readings from the textbooks following the course schedule. Make your own notes; you can use them during the exams.

Assignments

Four to six assignments will be given that build upon the concepts covered in the textbooks. They have to be completed on your own time. Assignment deadlines are not flexible for any reason, and late assignments are not accepted for credit. Assignments are usually submitted via WebCT Vista e-mail.

Data Mining Project

The purpose of this project is to give you experience with a data mining implementation. The data mining project is an opportunity to apply the concepts, techniques, and tools studied in class to real data.
The project is developed individually. The objective is to implement and run a data mining algorithm that analyzes real data sets. You can use SQL Server 2005 as the implementation tool, or other data mining software (see the Resources Web page).

Discussions

A special Discussion Board with three discussions will be opened on the course WebCT site. Online discussions will be based on the discussion questions posted by the instructor (threaded discussions). Your participation will be evaluated through your contributions (questions, answers, remarks, and essays) in the three discussions. For details, see the Discussion area on the WebCT class Web site.

Exams

Your performance in this class will be measured by two online exams, a Midterm Exam and a Final Exam. No make-up tests will be given unless an exam was missed due to a documented emergency. The exams will be closed textbook, but you can use your own notes. Questions on the exams may include the following:
- problem solving
- essay questions
- multiple choice answer selection
- filling in the blanks

Evaluation

The final grade will be obtained from the following:

Component       Weight
Assignments     20%
Discussions     15%
Project         20%
Midterm Exam    20%
Final Exam      25%

The letter grade will be assigned as follows:

Grade   Points
A       90-100
B       80-89
C       70-79
D       60-69
F       0-59

Grading Example:

Component       Scores
Assignments     85, 90, 80, 70
Discussions     85, 90, 90
Project         95
Midterm Exam    80
Final Exam      94

G = (85+90+80+70)/4*0.20 + (85+90+90)/3*0.15 + 95*0.20 + 80*0.20 + 94*0.25 = 16.25 + 13.25 + 19.00 + 16.00 + 23.50 = 88.00

It is a B.

Student Responsibilities

Each student is responsible for:
- managing his/her time and maintaining the discipline required to meet the course requirements
- reading from the textbooks all topics covered in the class
- reading from the textbooks all chapter topics, bibliographic notes, and summaries
- completing the data mining project and participating in all discussions
- adhering to all course deadlines
- taking the exams as they are scheduled in the course schedule

"I didn't know" is not an acceptable excuse for failing to meet the course requirements. Students who fail to meet their responsibilities do so at their own risk.

Attendance Policy

Attendance at all classes and other activities (lecture periods, laboratory sessions, tests, examinations, or other scheduled meetings) is required of every student at Columbus State University. The attendance record begins with the first meeting of the class, and anyone who registers late is responsible for the class work missed. Students should note that the Computer Science Faculty does not initiate "class drops". A student wishing to drop should complete the official procedure before the deadline. Those who violate the attendance policy after that deadline may receive an "F" at the discretion of the instructor. After the midpoint of the semester, no drop slip will be signed by the Dean unless extreme circumstances can be proved.

Academic Dishonesty

Academic dishonesty includes, but is not limited to, activities such as cheating and plagiarism (http://aa.colstate.edu/advising/a.htm#AcademicDishonesty/Academic Misconduct). It is a basis for disciplinary action. Any work turned in for individual credit must be entirely the work of the student submitting it. You may share ideas, but submitting identical assignments (for example) will be considered cheating. You may discuss the material in the course and help one another with debugging; however, any work you hand in for a grade must be your own.
A simple way to avoid inadvertent plagiarism is to talk about the assignments without reading each other's work or writing solutions together, unless otherwise directed. For your own protection, keep scratch paper and old versions of assignments to establish ownership until after the assignment has been graded and returned to you. If you have any questions about this, please see me immediately.

For assignments, access to notes, the course textbooks, books, and other publications is allowed. All work that is not your own MUST be properly cited. This includes any material found on the Internet. Stealing, giving, or receiving any code, diagrams, drawings, text, or designs from another person (CSU or non-CSU, including the Internet) is not allowed. Having access to another person's work on the computer system, or giving another person access to your work, is not allowed. It is your responsibility to keep your work confidential.

No cheating in any form will be tolerated. Penalties for academic dishonesty may include:
- a zero grade on the assignment or exam/quiz
- a failing grade for the course
- suspension from the Computer Science program
- dismissal from the Computer Science program

All instances of cheating will be documented in writing, with a copy placed in the Department's files. Students will be expected to discuss the academic misconduct with the faculty members and the chairperson.

For more details see the Faculty Handbook: http://aa.colstate.edu/faculty/FacHandbook0203/sec100.htm#109.14 and the Student Handbook: http://sa.colstate.edu/handbook/handbook2003.pdf

ADA Accommodation Notice

If you have a documented disability as described by the Rehabilitation Act of 1973 (P.L. 93-112, Section 504) and the Americans with Disabilities Act (ADA) and would like to request academic and/or physical accommodations, please contact the Office of Disability Services in the Center for Academic Support and Student Retention, Tucker Hall 100, or at (706) 568-2330, as soon as possible. Course requirements will not be waived, but reasonable accommodations may be provided as appropriate.

Tentative Schedule

CPSC 6127 is an online class, so the schedule below is flexible and serves as a guide. However, the scheduled studies, software installation, assignments, project, and discussions have to be completed on a strict weekly basis. The exam dates and times are firm.

Textbooks:
DM: P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining
SS05: Z. Tang, J. MacLennan, Data Mining with SQL Server 2005

Lecture Topics

Weeks 1-5 (01/09, 01/11, 01/16, 01/18, 01/23, 01/25, 01/30, 02/01, 02/06, 02/08), in order:
- Class organization and administration. WebCT class site. Install SQL Server 2005 (see the Resources Web page for details). SQL Server 2005 will be used with the SS05 textbook. SQL Server 2005 environment: Documentation and Tutorials, Business Intelligence Developer Studio
- Chapter 1. Introduction, DM; Chapter 1. Introduction to Data Mining, SS05
- Chapter 2. Data, DM; Chapter 2. OLE DB for Data Mining, SS05
- Chapter 2. Data, DM; Chapter 3. Using SQL Server Data Mining, SS05
- Chapter 3. Exploring Data, DM; Chapter 3. Using SQL Server Data Mining, SS05
- Chapter 3. Exploring Data, DM

Weeks 6-8 (02/13, 02/15, 02/20, 02/22, 02/27, 03/01), in order:
- Chapter 4. Classification: Basic Concepts, Decision Trees, and Model Evaluation, DM; Chapter 5. Microsoft Decision Trees, SS05
- Review for Midterm Exam (Chapters 1-4, DM; Chapters 1, 2, 3, and 5, SS05)
- Midterm Exam on February 27, 2007, Tue
- Midpoint of semester: March 1, Thu
- Chapter 6. Microsoft Time Series, SS05

March 5-9: Spring Break. No classes.

Weeks 10-17 (03/13, 03/15, 03/20, 03/22, 03/27, 03/29, 04/03, 04/05, 04/10, 04/12, 04/17, 04/19, 04/24, 04/26, 05/03), in order:
- Chapter 11. Mining OLAP Cubes, SS05
- Chapter 5. Classification: Alternative Techniques (sections 5.1-5.4), DM; Chapter 4. Microsoft Naïve Bayes, SS05; Chapter 10. Microsoft Neural Networks, SS05
- Chapter 6. Association Analysis: Basic Concepts and Algorithms (sections 6.1-6.4), DM; Chapter 9. Microsoft Association Rules, SS05
- Chapter 8. Cluster Analysis: Basic Concepts and Algorithms, DM; Chapter 7. Microsoft Clustering, SS05
- Review for the Final Exam
- Final Exam on May 3, 2007, Thu

Assignments, Project, Discussions, and Exams

- Discussion 1 opened on 01/18/2007, Thu
- Assignment 1 begins
- Project begins
- Discussion 1 closed on 02/10/2007, Sat
- Assignment 1 due on 02/13/2007, Tue
- Assignment 2 begins
- Discussion 2 opened on 02/15/2007, Thu
- Project Proposal due on 02/20/2007, Tue
- Assignment 3 begins
- Assignment 2 due on 02/22/2007, Thu
- Midterm Exam on 02/27/2007, Tue
- Assignment 3 due on 03/15/2007, Thu
- Discussion 2 closed on 03/29/2007, Thu
- Assignment 4 begins
- Discussion 3 opened on 04/03/2007, Tue
- Assignment 4 due on 04/10/2007, Tue
- Assignment 5 begins
- Assignment 5 due on 04/24/2007, Tue
- Project Final Report and Project Implementation due on 04/26/2007, Thu
- Discussion 3 closed on 04/27/2007, Fri
- Final Exam on 05/03/2007, Thu
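
Checking Your Grade

If you want to track your standing during the semester, the grading formula from the Evaluation section can be sketched in a few lines of Python. This is only an illustration and not an official grading tool. The function and variable names are made up for this example. It assumes that scores within a component are averaged with equal weight, as in the Grading Example, and that the final score is compared against the lower bound of each letter range without rounding, which the syllabus does not specify.

```python
from statistics import mean

# Component weights from the Evaluation section, as fractions of the final grade.
WEIGHTS = {
    "assignments": 0.20,
    "discussions": 0.15,
    "project": 0.20,
    "midterm": 0.20,
    "final": 0.25,
}

# Lower bound of each letter range, highest first (assumption: no rounding).
LETTER_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def final_grade(scores):
    """Weighted final grade G; scores within a component are averaged first."""
    return sum(WEIGHTS[name] * mean(values) for name, values in scores.items())


def letter_grade(points):
    """Map a numeric grade to a letter using the table in the Evaluation section."""
    for cutoff, letter in LETTER_CUTOFFS:
        if points >= cutoff:
            return letter
    return "F"


# Scores from the Grading Example.
example = {
    "assignments": [85, 90, 80, 70],
    "discussions": [85, 90, 90],
    "project": [95],
    "midterm": [80],
    "final": [94],
}

g = final_grade(example)
print(f"G = {g:.2f} -> {letter_grade(g)}")  # prints "G = 88.00 -> B"
```

To use it, replace the lists in `example` with your own scores. Any component missing from the dictionary simply contributes nothing to the total.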