IST 565: Data Mining
Syracuse University School of Information Studies
Updated Syllabus – 10/28/2009

Instructor: Dr. Ozgur Yilmazel
Research Assistant Professor
School of Information Studies
Syracuse University, Syracuse, NY 13244
Email: oyilmaz@syr.edu
Phone: 315-849-5666

This course introduces popular data mining methods for extracting knowledge from data. It balances theory and practice: the principles of data mining methods will be discussed, and students will also acquire hands-on experience using software to develop data mining solutions to scientific and business problems. The focus of the course is on understanding data and on formulating data mining tasks in order to solve problems using that data. The course covers the key tasks of data mining, including data preparation, concept description, association rule mining, classification, clustering, evaluation and analysis. Throughout the class, students will learn both through readings and through hands-on work with real data in a data mining project.

The course combines lecture and lab: lectures cover the material, and lab time is used to investigate small examples for the topic of the week. There will be weekly readings based on the textbook and on other materials, which will be posted online.

Objectives:
1. Understand the fundamental processes, concepts and techniques of data mining,
2. Develop familiarity with data mining techniques and be able to apply them to real-world problems, and
3. Advance your understanding of contemporary data-mining systems.

Course Materials:
1. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, by Ian H. Witten & Eibe Frank. ISBN 0120884070
2. Software as needed for assignments and projects. Students will need access to a PC where they can install and run programs.

Other materials will be assigned and made available through electronic means, and these will be discussed in class.
Technology Requirements:
This course will be run on the LMS. There will be optional video components. In addition, students will use various tools to clean and manipulate data throughout the semester. This may require installing various programs on a PC, so students should not only have access to a PC but also be comfortable installing and uninstalling programs. Graduate students are expected to meet the minimum and recommended information technology literacy skills required of students in all School of Information Studies master's programs. Please refer to http://istweb.syr.edu/prospective/graduate/literacyreq.asp for the "Computer Literacy Requirements" document.

Student Responsibilities:
You are expected to participate fully in all activities and discussions for the duration of the class, as well as to turn in assignments by the designated time. Just as in the real world, late assignments will be penalized. If you do not understand assignments, readings, etc., it is your responsibility to inform the instructor well before the due date. If you are having difficulty, please contact me early so that I can resolve problems. You must complete all assignments to pass the course.

A grade of Incomplete will be given only in the most extreme emergency cases. If, for some reason, you feel that it is imperative that you receive an Incomplete, you must bring your concerns to our attention as soon as the emergency situation arises.

Please note: Successful time management is the most critical factor of success for students in a distance course. You are expected to log in to class at least four times a week, if not every day. Logging in and participating regularly and in small amounts makes it much easier to avoid a build-up of hours of work.
Statement of Academic Integrity:
Undergraduate, graduate and doctoral students enrolled in IST courses are required to follow the guidelines for academic honesty described in the School of Information Studies Statement on Academic Integrity, available:
• in any IST Student Handbook,
• on the Web, or
• on request at the IST Student Services Office.

Academic dishonesty includes but is not limited to plagiarism, cheating on examinations, unauthorized collaboration, multiple submission of work, misusing resources for teaching and learning, falsifying information, forgery, bribery, and any other acts that deceive others about one's academic work or record. Students should be aware that standards for documentation and intellectual contribution may depend on the course content and method of teaching, and should consult instructors for guidance.

Sanctions for academic dishonesty may include but are not limited to the following:
• requiring students to re-produce work under the supervision of a proctor
• rejecting the student work that was dishonestly created, and giving the student a zero or failing grade for that work
• lowering the course grade
• giving a failing grade in the course
• formal reprimand and warning
• disciplinary probation
• administrative withdrawal from the course
• suspension from the University
• expulsion from the University

Instructors who impose sanctions must notify the student promptly and indicate any formal or informal hearing procedures available. Students accused of academic dishonesty have the right to challenge accusations.

iLMS information
This course will be given via the iSchool Learning Management System powered by WebCT/Blackboard (the iSchool iLMS). On the iSchool iLMS homepage (http://ischool.syr.edu/learn), you will find links to student FAQs, troubleshooting tips, tutorials, and other information that may prove useful for both new and experienced iLMS users.
As students are responsible for being fluent on this educational platform, it is recommended that you review these resources as necessary.

Guidelines for preparing assignments
Prepare a professional document. Include tables and graphs that support your content where appropriate.

When you prepare assignments, be sure to provide proper bibliographical information for any sources referenced, for direct quotations and for the source of key concepts or ideas. Any citation format is acceptable (I personally use APA format), as long as it provides sufficient information for a reader to find the source (i.e., authors' names; title of article or book; title, volume and issue of journal, if appropriate; page numbers; publisher; date of publication). If you cite a webpage, be sure to indicate the URL and the date on which you accessed the page, as pages do change. Failure to cite sources is considered plagiarism and is subject to sanctions ranging from being required to redo the assignment through expulsion (see above). If you have any questions about what must be cited or how to cite, please feel free to ask.

In addition to punctuality, grammar, presentation and the ability to follow instructions are very important, as in the real world. If your work does not meet professional standards, up to 30% of your score may be deducted. It is essential that you spell check and proofread your documents.

According to the Family Educational Rights and Privacy Act, all student work produced as part of this course may be used during this semester for educational purposes. Later use of this work is also permitted, provided that either the work is rendered anonymous or the student gives written permission for such use. If you do not wish your work to be used even if rendered anonymous, let me know and I will not use it.
Special consideration
Students who may need special consideration because of a disability should contact instructors at the start of the course, and anytime thereafter if further consideration is needed. In addition, you are encouraged to register with the Office of Student Assistance in 306 Steele Hall (315-443-4357) and/or the Center for Academic Achievement in 804 University Avenue, Room 303 (315-443-2622).

Course Topics:

Lecture / Date
1. 1/19 – 1/24
2. 1/25 – 1/31
3. 2/1 – 2/7
4. 2/8 – 2/14
5. 2/15 – 2/21
6. 2/22 – 2/28
7. 3/1 – 3/7
8. 3/8 – 3/14
   3/15 – 3/21
9. 3/22 – 3/28
10. 3/29 – 4/4
11. 4/5 – 4/11
12. 4/12 – 4/18
13. 4/19 – 4/25
14. 4/26 – 5/2
15. 5/3 – 5/9

Topic
- Introduction to Course Structure, Content & Requirements
- Data Mining Basic Concepts
- Introduction to Data Issues
- Data Preprocessing
- Data Cleansing, Transformation, Integration & Reduction
- Data Warehousing
- Styles of Machine Learning
- Decision Trees
- Classification
- Clustering
- Introduction to WEKA
- Spring Break (Review and Discuss Track A Projects)
- Data Output
- Visualization of Data Mining Results
- Privacy of Data and Legal Implications
- Data Mining Products
- How to Select Data Mining Software
- TBA
- Future Directions of Data Mining & Knowledge Discovery

Tentative Readings
- Witten & Frank, Ch. 1, What's it all about?
- Benoit, 2002, ARIST Chapter 6, Data Mining
- Witten & Frank, Ch. 2, Input: Concepts, Instances & Attributes
- Witten & Frank, pp. 58-63; 70-72; 89-92; 159-170
- Witten & Frank, pp. 170-183 (or 190)
- Witten & Frank, pp. 75-76; 210-227
- Colet, E. Clustering and Classification: Data Mining Approaches. http://www.tgc.com/dsstar/00/0704/000704.html
- Witten & Frank, Ch. 8
- Witten & Frank, Ch. 3, Output: Knowledge Representation
- Thearling, Becker, DeCoste, Mawby, Pilote & Sommerfield. Visualizing Data Mining Models. http://www3.shore.net/~kht/text/dmviz/modelviz.htm
- Ramakrishnan & Grama. Mining Scientific Data. Advances in Computers. http://people.cs.vt.edu/~ramakris/papers/scimining.pdf
- Witten & Frank, Ch. 9, Looking Forward
- www.kdnuggets.com
- Elder & Abbott. A Comparison of Leading Data Mining Tools. http://www.datamininglab.com/pubs/kdd98_elder_abbott_nopics_bw.pdf

ASSIGNMENTS
These are just outlines. As we approach the sections of the class dealing with each assignment, more detail will be provided. The instructor reserves the right to change the assignments, add new assignments, and alter the percentages attached to the assignments, with ample notice.

Individual Assignments 20% - There will be 3 or 4 assignments during the semester. They will be take-home problem sets.

Overall Participation 20% - In a distance class, a large part of your learning experience comes from participating in class discussions. In this course, you’ll be learning a combination of theory and practice, and the participation in course discussions will provide the theory. This participation is essential, and the weight of this grade should indicate to students that it will be taken seriously.

Course Project 60% - You will work in a group of at most 3 people on a data mining (DM) problem of your choice. It will be up to you to find your data and run experiments.

Here are a few ground rules about posting in the public iLMS forums:
1 – The tone of your messages should be similar to the tone you would use in a classroom discussion, and messages should be placed in the appropriate forum. The “Informal Chit-chat” board is designed for social communication. “Questions to the Professor” should be used whenever possible so that all students may benefit from the answer.
2 – In classroom discussion, present more than your opinion. If you present an opinion, also present some support from the readings, from other sources you have discovered, or from a logical argument based on commonly accepted beliefs. Part of the graduate education experience is to help you learn how to present information with support and not just say “Well, I think that…”.
This also applies to agreeing with someone; the statement “I agree” should be accompanied by some other fact or information new to the discussion.
3 – Remember that there is someone behind every post, and there is a reason that they wrote what they did. Before taking offense, getting upset, or lashing out, think about why they may have made that post. Treat each other with respect.
4 – Review the standard rules of Net Etiquette (a.k.a. netiquette) at http://www.albion.com/netiquette/corerules.html
5 – When discussing a point from a previous post, copy and paste the appropriate points into your e-mail (and include only the portion you need for the discussion). The typical symbol for showing a quote is > before the line.

The instructors maintain the right to move or delete discussions from the public forums as they see fit. This might be done because a post is inflammatory, off-topic for the board, or contains private information.

Grading Policy
Grades will be numeric, ranging from 0 to 100, with the most likely grades for a thorough job being between 70 and 90. There are likely to be few, if any, grades of 100 earned; we reserve this grade for truly exceptional performance. Grades lower than 70 suggest a need to discuss your performance with the instructors.