COURSE SYLLABUS

Semester: Fall 2011
Course Prefix/Number: CAP 6990
Course Title: Web Data Mining
Course Credit Hours: 3.0
Course Meeting Times/Places: Online
Instructor and Contact Information: Dr. Runa Bhaumik
E-mail: rbhaumik@uwf.edu
Course Web Site: http://elearning.uwf.edu/ (log in and select Web Data Mining, CAP6990)
Prerequisites or Co-requisites: Data Mining (CAP5771)

Course Description:
The primary focus of this course is on Web usage mining and its applications to e-commerce and business intelligence. Specifically, we will consider techniques from machine learning, data mining, text mining, and databases to extract useful knowledge from Web data that can be used for site management, automatic personalization, recommendation, and user profiling.

The first half of the course will give a detailed overview of the data mining process and techniques, specifically those that are most relevant to Web mining. The second half will concentrate on the applications of these techniques to Web and e-commerce data, and their use in Web analytics, user profiling, and personalization.

List of Topics:
The following issues and topics will be covered throughout the course. Many of these topics will be revisited several times during the course in a variety of contexts.

Data Mining and Knowledge Discovery
o The KDD process and methodology
o Data preparation for knowledge discovery
o Overview of data mining techniques
o Market basket analysis
o Classification and prediction
o Clustering
o Memory-based reasoning
o Evaluation and Interpretation

Web Usage Mining Process and Techniques
o Data collection and sources of data
o Data preparation for usage mining
o Mining navigational patterns
o Integrating e-commerce data
o Leveraging site content and structure
o User tracking and profiling
o E-Metrics: measuring success in e-commerce
o Privacy issues

Web Mining Applications and Other Topics
o Data integration for e-commerce
o Web personalization and recommender systems
o Web content and structure mining
o Web data warehousing
o Review of tools, applications, and systems

Textbooks and Reading Material:
o Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, Second Edition, by Michael Berry and Gordon Linoff, John Wiley, 2004.
o Various papers or online resources (provided in class or online).

Recommended Books:
o Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten and Eibe Frank, 2nd Ed., Morgan Kaufmann, 2005. [Note: this is the WEKA book]
o Mining the Web: Transforming Customer Data into Customer Value, by Gordon Linoff and Michael Berry, John Wiley & Sons, 2001.
o The Data Webhouse Toolkit, by Ralph Kimball and Richard Merz, John Wiley, 2000.

References:
Weka's site: http://www.cs.waikato.ac.nz/~ml/weka/

Grading Policy:
The final grade will be determined (tentatively) based on the following components:
o Assignments = 65%
o Final Project = 35%

Assignments:
There will be 6-7 assignments during the semester involving the concepts and techniques discussed in class. The assignments may involve experimenting with various tools, as well as other written or problem-oriented exercises. You may work in groups of two students. Do the analysis together, but write your reports separately; do not copy text from each other.
Some assignments must be done individually.

Late Policy:
1. You are expected to complete work on schedule. Deadlines are part of the real-world environment you are being prepared for. Documentation of health or family problems may be required.
2. Late assignments will be penalized 25% per day (that is, an assignment will not be accepted four or more days after the due date).

Course Project:
For the class project, students can choose to do an implementation project, a data analysis project, or a research paper. Implementation projects may be done individually or in groups of two (depending on the complexity and the type of the project). Research papers and data analysis projects must be done individually. Each group or individual will submit a specific project proposal to be approved. More details about the possible project options, as well as due dates for the proposal and the final submission, will be available later.

About this Course:
This course is delivered completely online. Students must have consistent access to the Internet. Learning at a distance may be a very different environment for many of you. You will set your own schedules, participate in class activities at your convenience, and work at your own pace. You may require some additional time online during the first few days while you become accustomed to the online format, and you may even feel overwhelmed at times. It will get better. You should be prepared to spend 8 to 10 hours or more per week online completing lessons and activities and participating in class discussions.

Finally, you may want to incorporate these tips to help you get started:

Set a time at least twice a week (schedule) to:
o Check elearning postings to determine your tasks.
o Check elearning frequently throughout the week for updates.

Within the first week, become familiar with elearning and how to use it.
o It is a tool to help you learn!

Ask questions when you need answers.
o If you have problems, contact your instructor early.

Technology Requirements:
Knowledge of the WEKA machine learning tool (in the Windows environment) will be necessary for the project.

Expectations for Academic Conduct/Plagiarism Policy:
Academic Conduct Policy: (Web Format) | (PDF Format) | (RTF Format)
Plagiarism Policy: (Word Format) | (PDF Format) | (RTF Format)
Student Handbook: (PDF Format)

Assistance:
Students with special needs who require specific examination-related or other course-related accommodations should contact Barbara Fitzpatrick, Director of Disabled Student Services (DSS), dss@uwf.edu, (850) 474-2387. DSS will provide the student with a letter for the instructor that will specify any recommended accommodations.

Other Course Policies:

Class material and due dates: Students are responsible for all announcements and all material presented. Students are expected to keep up with due dates and submit all assignments and work to the elearning dropbox before the due date.

Communication: You are responsible for checking your e-mail and the elearning site regularly, preferably once a day, to keep up with important announcements, assignments, etc.

Re-grading Assignments: It is your responsibility to check graded assignments/tests when they are returned to you. I will gladly re-grade an assignment/test when a question or mistake is brought to my attention. To ensure fairness, I reserve the right to re-grade the entire assignment/test. As a result, your grade may increase, decrease, or remain the same. Grades will not be changed more than a week after the date graded assignments/tests are returned to the class.

Grades: Final grades will be calculated using a standard grade distribution. The last day of the term for withdrawal from an individual course with an automatic grade of "W" is 3/24. Students requesting late withdrawal (W or WF) from class must have the approval of the advisor, the instructor, and the department chairperson (in that order), and finally of the Academic Appeals Committee.
Requests for late withdrawal may be approved only for the following reasons (which must be documented):
1. A death in the immediate family.
2. Serious illness of the student or an immediate family member.
3. A situation deemed similar to categories 1 and 2 by all in the approval process.
4. Withdrawal due to Military Service (Florida Statute 1004.07)
5. National Guard Troops Ordered into Active Service (Florida Statute 250.482)

Requests without documentation will not be accepted. Requests for late withdrawal simply for not succeeding in a course do not meet the criteria for approval and will not be approved.

Applying for an incomplete or "I" grade will be considered only if: (1) there are extenuating circumstances to warrant it, AND (2) you have a passing grade and have completed at least 70% of the course work, AND (3) the department chair approves.

Participation and Feedback:
I encourage active participation and regular feedback. I believe that effective communication between the instructor and students will make the course more useful, interesting, and productive. Please contact me if you have any questions, concerns, or suggestions!

Important Note: Any changes to the syllabus or schedule made during the semester take precedence over this version. Check the elearning site (or email) regularly for up-to-date information.

Overall Grading Scale:
1. A : 92 - 100
2. A-: 89 - 91
3. B+: 87 - 88
4. B : 82 - 86
5. B-: 79 - 81
6. C+: 77 - 78
7. C : 72 - 76
8. C-: 69 - 71
9. D+: 67 - 68
10. D : 59 - 66
11. F : 0 - 58

Tentative Course Schedule:
Lectures & Course Material; Topics & Reading

Week 1/Week 2
Overview: Web Data Mining and E-Business Analytics
Reading:
o Chapters 1 and 2 of Berry and Linoff
o Web Analytics (Wikipedia)
o Web Mining (Wikipedia)
o Web Mining: Information and Pattern Discovery on the World Wide Web, by Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, ICTAI 1997.

Week 3/Week 4
Knowledge Discovery Process; Data Preparation for Mining
Reading:
o Chapters 3 and 17 of Berry and Linoff
o Data Mining Overview
o Driving e-Commerce Profitability From Online and Offline Data, White paper from Torrent Systems.
o Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, by Jaideep Srivastava, et al., SIGKDD Explorations, January 2000.
o Read the Description of Porter's Stemming Algorithm
o An online version of the Porter stemmer algorithm: http://qaa.ath.cx/porter_js_demo.html
o Porter stemmer implementations in many different programming languages can be found here: http://tartarus.org/~martin/PorterStemmer/

Week 5/Week 6
Data Mining Techniques: Mining Association Rules and Sequential Patterns
Reading:
o Chapter 9 of Berry and Linoff
o Web Usage Mining for Web Site Evaluation, by Myra Spiliopoulou, Communications of the ACM, August 2000.
o An Internet-enabled Knowledge Discovery Process, by Alex Buchner, et al., MINEit Software Ltd., 1999.

Week 7/Week 8
Data Mining Techniques: Classification & Prediction, Neural Networks
Reading:
o Chapter 6 of Berry and Linoff
o Modeling Web Robot Navigation Patterns, by Pang-Ning Tan and Vipin Kumar, WebKDD Workshop at the ACM SIGKDD Conference, 2000.
Note: An additional description of the ID3 and C4.5 algorithms can be found in the document "Building Classification Models: ID3 and C4.5" from the AI course at Temple University.

Week 9/Week 10
Data Mining Techniques: Clustering; Memory-Based Reasoning
Reading:
o Chapters 11 and 8 of Berry and Linoff
o Text-Learning and Related Intelligent Agents: A Survey, by Dunja Mladenic, IEEE Intelligent Systems, July/August 1999.
o Clustering Users of Large Web Sites into Communities, by Georgios Paliouras, et al., ICML 2000.

Week 11/Week 12
Web Usage Mining: Data Preparation and Integration
Reading:
o Data Preparation for Mining World Wide Web Browsing Patterns, by Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Knowledge and Information Systems, Volume 1, No. 1, 1999.

Week 13/Week 14
Web Usage Mining: E-Metrics and E-Commerce Data Analysis, Predictive Web Analytics
Reading:
o Chapters 14 and 4 of Berry and Linoff
o E-Commerce Intelligence: Measuring, Analyzing, and Reporting on Merchandising Effectiveness of Online Stores, by Stephen Gomory, et al., IBM T. J. Watson Research Center.
o E-Metrics: Business Metrics For The New Economy, White Paper from NetGenesis.
o Lessons and Challenges from Mining Retail E-Commerce Data, by Ron Kohavi, et al., Journal of Machine Learning, 2003.

Week 15
Web Personalization and Recommender Systems
Reading:
o Analysis of Recommendation Algorithms for E-Commerce, by Badrul Sarwar, et al., ACM Electronic Commerce Conference, November 2000.
o Automatic Personalization Based on Web Usage Mining, by Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava, Communications of the ACM, August 2000.
o Integrating Web Usage and Content Mining for More Effective Personalization, by Bamshad Mobasher et al., EC-Web 2000.

Important Note: Not all lecture notes are prepared from the textbook. This is just a guideline to the topics and a good resource for solving homework problems. To get a better understanding of the topics, you should read the related papers and the text from the book. If you find typos or don't understand a question, please let me know as soon as possible. Do not wait until the last moment.