Unstructured Data Analytics
MSA 8770E Course Syllabus

Instructor: Dr. Wael Jabr
Email: wjabr@gsu.edu
Office: Robinson College of Business - Room 906
Phone: 404.413.7363
Office Hours: By appointment

Prerequisites: MSA 8050 and CIS 8040.

Textbook & Supplementary material:
Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, by Ronen Feldman and James Sanger, Cambridge University Press.
Mining the Social Web, by Matthew A. Russell, O'Reilly Media.

Course Description:
We have been witnessing an explosion in the amount of unstructured data generated. According to experts at Gartner, Forrester and IDC, about 80% of the world’s data is unstructured. This includes corporate e-mails, financial filings, customer feedback, blogs, online reviews, instant messages, pictures, videos, and graphs, among others. Unstructured data is also growing at fifteen times the rate of structured data (Market Metrix). Firms therefore have a significant competitive opportunity to extract insights from this data, which has remained mostly untapped until now, and they need the tools to sift through large data sets and unlock the hidden insights. The challenges, however, are that such data does not fit within traditional storage tools and that processing these typically large data sets is non-trivial.

Using a hands-on approach, the course introduces students to the process of formulating business objectives, identifying relevant data sources (big and small), implementing rigorous unstructured data processing techniques, and lastly training, testing, implementing and evaluating predictive models. The purpose is to investigate novel approaches to studying firm performance based on unstructured corpora released by the firm (namely financial reports), corpora released by analysts, and corpora contributed by potential shareholders.
The course will build on established concepts in natural language processing and information retrieval, and on newer developments in the applicability and usability of unstructured data analytics. The course will also introduce students to techniques for supplementing their analysis with traditional structured data.

Course Objectives:
Upon completion of this class, students will be able to:
(1) Identify and describe basic concepts and methods in unstructured data management, including document representation, information extraction, document classification and clustering, and topic modeling;
(2) Apply well-established corpora and commercial and open-source unstructured data analysis and visualization tools to unlock hidden patterns;
(3) Gain a conceptual understanding of the probabilistic models of advanced unstructured data management for information retrieval, topic classification, and sense-making;
(4) Select suitable technologies for various unstructured data analysis tasks, and evaluate the benefits and challenges of the chosen approaches;
(5) Synthesize and apply the skills learned to a real-world problem.

Weekly Mini-Projects:
Each week, students complete a hands-on project that further explores the topic or technique covered in class. This is an individual activity. With these mini-projects, students gain proficiency in the various software and databases assigned for this class. Check the course site for details.

Final Project:
The project consists of a research report and presentation on a student-selected topic that is relevant to the course. It is group-based. A sample topic might be to predict the performance of IT-intensive firms through the evaluation of their innovative capital. This requires defining the research problem (e.g., how to quantify innovative capital), identifying contributing factors (e.g., internal ecosystem and competitive pressure), gathering data (e.g., 10-K filings, patents, trademarks, and announcements), and using appropriate analysis techniques.
On the last day of class, each group will present its findings to the class in the presence of a panel of academic and industry “judges”. The deliverables are a 20-minute presentation and a written report. The grade will be based on the evaluation from the instructor as well as the judges. Check the course site for additional details.

Course Software & Databases:
- SAS Enterprise Miner*
- R software and dedicated packages (freely available online at www.r-project.org)
- HPCC Systems*
- Morningstar Database*
- WRDS Database*
- Other corpora*
* will be provided through the class

Typical class session:
Class sessions will comprise (1) lectures/discussions of relevant techniques, concepts and features, (2) instructor demonstrations, and (3) student lab sessions with hands-on work. The purpose of this pedagogical approach is to introduce and reinforce ideas and skill sets so that you can master them on your own after class hours. To bring this knowledge to a highly proficient, professional level, you will have to spend time and effort outside of class reviewing and practicing the class material. To ensure that you have the basic knowledge that will allow you to function on your own after class, be sure to ask the instructor questions during the lecture/discussion, demo, or lab.

Classroom guidelines:
Attendance is not mandatory but highly recommended. Coming to class fully prepared and contributing to the discussion helps deepen learning. Individual deliverables are to be submitted individually and group work is collaborative. Refer to http://www2.gsu.edu/~wwwfhb/sec400.html for additional instructional information.

Grading:
Deliverables
Participation (in-class & online)   20%
Mini-Projects                       50%
Final Project                       30%
Total                               100%

*Late submission policy: deliverables submitted after their due date will be penalized 10% per day. No submission is accepted after the fifth day.
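As a rough illustration of how the grading weights and the late submission policy above might combine, here is a minimal Python sketch. It is not an official grade calculator. It assumes the 10% penalty is taken off the earned score for each day late and that a submission more than five days late earns zero; the syllabus does not spell out either detail. The function names `late_adjusted` and `course_total` are hypothetical.

```python
# Grading weights from the syllabus, as whole percentages.
WEIGHTS = {"participation": 20, "mini_projects": 50, "final_project": 30}


def late_adjusted(score, days_late):
    """Apply the late penalty to one deliverable's score.

    Assumption: 10% of the earned score is deducted per day late, and
    anything more than five days late is not accepted (scored as 0).
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 5:
        return 0.0
    return score * (100 - 10 * days_late) / 100


def course_total(scores):
    """Weighted course total (0-100) from per-component scores (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # A mini-project scored 90 and submitted two days late.
    print(late_adjusted(90, 2))
    # Component averages for a hypothetical student.
    print(course_total({"participation": 80, "mini_projects": 90, "final_project": 70}))
```

Whole-number percentages are used for the weights so that the arithmetic stays exact for typical scores.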
Letter Grade Scale
Score          Grade
98             A+
94             A
90             A-
86             B+
82             B
79             B-
75             C+
71             C
68             C-
65             D+
62             D
Less than 61   F

Class Schedule (adjustments may be necessary)

Class 1
Topic:
- Usefulness and challenges of large amount of unstructured data
- Scaling issues
- Intro to SW & Databases adopted
Reading: TMH* chapters 1, 2, 3
Mini-projects (MP): MP 1 assigned

Class 2
Topic:
- Reading unstructured data: well formatted (XML), not so well-formatted (free flowing)
- Intro to automated data readers
Reading: MSW** chapters 1, 8, 9
Mini-projects (MP): MP 1 due, MP 2 assigned

Class 3
Topic:
- Representing text (NLP, vectors)
- Use of SAS EM
Reading: TMH chapters 4, 5
Mini-projects (MP): MP 2 due, MP 3 assigned

Class 4
Topic:
- Using probabilistic models for information extraction
- Clustering and modeling documents
- Use of R add-ons
Reading: TMH chapters 6, 7
Mini-projects (MP): MP 3 due, MP 4 assigned

Class 5
Topic:
- Visualizing text (graphs, network analysis)
- Use of R add-ons
Reading: TMH chapters 10, 11
Mini-projects (MP): MP 4 due, MP 5 assigned

Class 6
Topic:
- Data handling (ownership, security, privacy...)
Reading: Online material
- HBR: “With Big Data Comes Big Responsibility”
- HBR: “The Cross-Atlantic Tussle over Financial Data and Privacy Rights”
Mini-projects (MP): MP 5 due, MP 6 assigned

Class 7
Topic:
- Extracting Insights
Reading: Online material
- Harvard Case: “Netflix leading with data”
Mini-projects (MP): MP 6 due

Class 8
Project presentation & report submission

* Textbook 1 – Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, by Ronen Feldman and James Sanger, Cambridge University Press.
** Textbook 2 – Mining the Social Web, by Matthew A. Russell, O'Reilly Media