New Jersey Institute of Technology
College of Computing Sciences
IS-698 Web Mining
Course Syllabus - Fall 2011

Instructor: Min Song
Office: Room 4102 – GITC Building – 4th floor
Office Hours: Check my Web site for hours; other times by appointment
Web Site: http://web.njit.edu/~song/
Telephone: 973-596-5291 (e-mail is much better if you have to leave a message)
E-mail: min.song@njit.edu

Course: IS 698
Where: FMH308
When: Tuesday, 6:00pm – 9:05pm

I am on campus extensively, but send me an e-mail to make sure that I am available in my office.

OVERVIEW

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage logs. It has quickly become one of the most popular areas in computing and information systems because of its direct applications in e-commerce, Web analytics, information retrieval/filtering, Web personalization, and recommender systems. Employees knowledgeable about Web mining techniques and their applications are highly sought by major Web companies such as Google, Amazon, Yahoo, MSN, and others that need to understand user behavior and use patterns discovered from terabytes of user profile data to design more intelligent applications.

The primary focus of this course is on Web usage mining and its applications to business intelligence and biomedical domains. Specifically, we will consider techniques from machine learning, data mining, text mining, and databases to extract useful knowledge from Web data, which can be used for site management, automatic personalization, recommendation, and user profiling. Programming assignments give hands-on experience with Web mining tasks. Programming experience is required.

DISCUSSION: Lectures drawn from the text listed below and other possible sources. Semester-long project and paper.

PREREQUISITES

Knowledge of and experience with Java programming are required.

GRADING

Grades are assigned based on 3-4 assignments, a midterm, a final project, and class participation.
The grading breakdown is as follows:

* Assignments: 20%
* Midterm: 25%
* Final Project: 45%
* Class Participation: 10%

Late assignments will not be accepted.

COLLABORATION POLICY

For assignments, you are not allowed to discuss your answers with other students. Copying solutions from other students is never allowed. For the group project, you will work in teams and hand in only one written report.

TEXT

Required Textbook:
Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, by Bing Liu, Springer, ISBN 3-540-37881-2, Dec 2006. (It is available in the UIC bookstore.)

References:
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.
Principles of Data Mining, by David Hand, Heikki Mannila, and Padhraic Smyth, The MIT Press, ISBN 0-262-08290-X.
Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley, ISBN 0-321-32136-7.
Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7.

IS 698 Course Schedule – Fall 2011

Class | Topic | Reading | Assignment Due / Project Progress

#1 – Sep 6
Topic: Introduction and Project Idea
Reading: Chapter 1 (Week 1)

#2 – Sep 13
Topic: Data pre-processing and Natural Language Processing (NLP)
· Data cleaning
· Data transformation
· Data reduction
· Part-of-Speech
· Sentence Parsing
· Text Chunking
Reading: Chapter 1 (Week 2)
Due: Project team

#3 – Sep 20
Topic: Association rules and sequential patterns
· Basic concepts
· Apriori Algorithm
· Sequential pattern mining
Reading: Chapter 2 (Week 3)

#4 – Sep 27
Topic: Supervised Learning (Classification) and Unsupervised Learning (Clustering)
· Basic concepts
· Decision trees
· Classifier evaluation
· Rule induction
· Classification based on association rules
· Naive-Bayesian learning
Reading: Chapter 3, Chapter 4 (Week 4)
Due: Assignment 1; Project proposal (Proposal Outline)

#5 – Oct 4
Topic: Information retrieval and Web search
Reading: Chapter 5 (Week 5)
Due: Class Exercise

#6 – Oct 11
Topic: Question Answering; Full-text Mining
Reading: Chapter 6 (Week 6)
Due: Assignment 2

#7 – Oct 18
Reading: Chapter 7 (Week 7)
Due: Project progress report

#8 – Oct 25
#8 – Midterm – in-class exam

#9 – Nov 1
Topic: Partially supervised learning
· Word Sense Disambiguation (WSD)

#10 – Nov 8
Topic: Link analysis
· Social network analysis: centrality and prestige
· Citation analysis: co-citation and bibliographic coupling
· Mining communities on the Web
Reading: Chapter 8 (Week 8)

#11 – Nov 15
Topic: Data extraction and information integration

#12 – Nov 22
No Class – Thanksgiving Break
Reading: Chapter 9
Due: Assignment 3

#13 – Nov 29
Topic: Opinion mining and summarization
Reading: Chapter 10

#14 – Dec 6
Topic: Future Direction of Web Mining

#15 – Dec 13
Topic: Project Presentation
Due: Deadline for submission of final paper

Note: The class meets as one three-hour session rather than a one-and-a-half-hour class. Oct 25: conference trip; possibly Nov. 18: conference trip.