MSA 8770E - Specialized Master's Programs

Unstructured Data Analytics
MSA 8770E
Course Syllabus
Instructor:
Dr. Wael Jabr
Email: wjabr@gsu.edu
Office: Robinson College of Business - Room 906
Phone: 404.413.7363
Office Hours: By appointment
Prerequisites:
MSA 8050 and CIS 8040.
Textbook & Supplementary material:
Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, by Ronen Feldman and James Sanger, Cambridge University Press.
Mining the Social Web, by Matthew A. Russell, O'Reilly Media.
Course Description:
We have been witnessing an explosion in the amount of unstructured data generated. According
to experts at Gartner, Forrester and IDC, about 80% of the world’s data is unstructured. This
includes corporate e-mails, financial filings, customer feedback, blogs, online reviews, instant
messages, pictures, videos, and graphs among others. And unstructured data is growing at
fifteen times the rate of structured data (Market Metrix). It therefore becomes clear that there is
a significant competitive opportunity for firms to extract insights from such unstructured data,
mostly untapped until now. Firms therefore need the tools to sift through large
data sets and unlock the hidden insights. The challenges, however, are that such data does not
fit within traditional storage tools and that processing these typically large data sets is
non-trivial.
Using a hands-on approach, the course introduces students to the process of formulating business
objectives, identifying relevant data sources (big and small), implementing rigorous unstructured
data processing techniques, and lastly training, testing, implementing and evaluating predictive
models. The purpose is to investigate novel approaches to studying firm performance based on
unstructured corpora released by the firm (namely financial reports), corpora released by analysts
and corpora contributed by potential shareholders. The course will build on established concepts
in natural language processing and information retrieval, and on newer developments in the
applicability and usability of unstructured data analytics. The course will also introduce students
to techniques for supplementing their analysis with traditional structured data.
Course Objectives:
Upon completion of this class, students will be able to:
(1) Identify and describe basic concepts and methods in unstructured data management,
including document representation, information extraction, document classification and
clustering, and topic modeling;
(2) Apply well-established corpora, commercial and open-source unstructured data analysis and
visualization tools to unlock hidden patterns;
(3) Gain a conceptual understanding of the probabilistic models of advanced unstructured data
management for information retrieval, topic classification, and sense-making;
(4) Select suitable technologies for various unstructured data analysis tasks, and evaluate the
benefits and challenges of the chosen approaches;
(5) Synthesize and apply the skills learnt to a real-world problem.
Weekly Mini-Projects:
Each week, students complete a hands-on project that further explores the topic or technique
covered in class. This is an individual activity. With these mini-projects, students gain
proficiency in the various software and databases assigned for this class.
Check the course site for details.
Final Project:
The project consists of a research report and presentation on a student-selected topic that is
relevant to the course. It is group-based. A sample topic might be to predict the performance of
IT-intensive firms through the evaluation of their innovative capital. This requires defining the
research problem (e.g., how to quantify innovative capital), identifying contributing factors (e.g.,
internal ecosystem and competitive pressure), gathering data (e.g., 10-K filings, patents, trademarks,
and announcements), and using appropriate analysis techniques. On the last day of class, each
group will present their findings to the class in the presence of a panel of academic and industry
“judges” (a 20-min presentation and a written report). The grade will be based on the evaluation
from the instructor as well as the judges.
Check the course site for additional details.
Course Software & Databases:
SAS Enterprise Miner*
R software and dedicated packages (freely available online at www.r-project.org)
HPCC Systems*
Morningstar Database*
WRDS Database*
Other corpora*
* will be provided through the class
Typical class session:
Class sessions will comprise (1) lectures/discussions of relevant techniques, concepts and
features, (2) instructor demonstrations, and (3) student lab sessions with hands-on work.
The purpose of this pedagogical approach is to introduce and reinforce ideas and skill sets so that
you can master these on your own after class hours.
To bring this knowledge to a highly proficient, professional level, you will have to spend time
and effort outside of class reviewing and practicing the class material.
To ensure that you have the basic knowledge that will allow you to function on your own after
class, be sure to ask the instructor questions during class, either during the lecture/discussion,
demo, or lab.
Classroom guidelines:
Attendance is not mandatory but is highly recommended.
Coming to class fully prepared and contributing to the discussion helps deepen learning.
Individual deliverables are to be submitted individually; group work is collaborative.
Refer to http://www2.gsu.edu/~wwwfhb/sec400.html for additional instructional information.
Grading:
Deliverables                         Weight
Participation (in-class & online)    20%
Mini-Projects                        50%
Final Project                        30%
Total                                100%
*Late submission policy: deliverables submitted after their due date will be penalized 10% per day.
No submission is accepted after the fifth day.
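As an illustration of how the weights and the late policy combine, here is a minimal Python sketch. It is not an official grade calculator. The function names are hypothetical, and two readings of the policy are assumptions: the 10% per day is taken off the score the deliverable earned, and a deliverable that is more than five days late receives zero.

```python
# Weights taken from the Grading table above.
WEIGHTS = {"participation": 0.20, "mini_projects": 0.50, "final_project": 0.30}


def apply_late_penalty(score: float, days_late: int) -> float:
    """Return the score after the late penalty.

    Assumption: 10% of the earned score is deducted per day late,
    and nothing is accepted (score 0) after the fifth day.
    """
    if days_late <= 0:
        return score
    if days_late > 5:
        return 0.0
    return score * (1 - 0.10 * days_late)


def course_total(component_scores: dict) -> float:
    """Weighted course total on a 0-100 scale.

    component_scores maps each key of WEIGHTS to a 0-100 score.
    """
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # A mini-project that earned 90 and was submitted two days late.
    print(apply_late_penalty(90, 2))
    # Hypothetical component averages for one student.
    print(course_total({"participation": 80, "mini_projects": 90, "final_project": 70}))
```

The instructor's actual computation may differ, for example by deducting percentage points from the maximum score instead of from the earned score.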
Letter Grade Scale
Score           Letter grade
98              A+
94              A
90              A-
86              B+
82              B
79              B-
75              C+
71              C
68              C-
65              D+
62              D
Less than 61    F
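The scale can be read as a lookup from a numeric score to a letter. The short Python sketch below assumes that each listed number is the lowest score earning that letter. The scale does not say how scores between 61 and 62 are treated; the sketch treats anything below 62 as an F, which is an assumption. The function name is hypothetical.

```python
from bisect import bisect_right

# Cutoffs from the Letter Grade Scale, in ascending order.
# Assumption: each cutoff is the minimum score for its letter.
CUTOFFS = [62, 65, 68, 71, 75, 79, 82, 86, 90, 94, 98]
LETTERS = ["F", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade.

    bisect_right counts how many cutoffs are <= score, which is
    exactly the index of the matching letter in LETTERS.
    """
    return LETTERS[bisect_right(CUTOFFS, score)]


if __name__ == "__main__":
    for s in (99, 94, 82.5, 62, 50):
        print(s, letter_grade(s))
```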
Class Schedule (adjustments may be necessary)
Readings refer to TMH* and MSW**. MP = mini-project.

Class 1
Topic:
- Usefulness and challenges of large amounts of unstructured data
- Scaling issues
- Intro to software & databases adopted
Reading: TMH chapters 1, 2, 3
Mini-projects: MP 1 assigned

Class 2
Topic:
- Reading unstructured data: well-formatted (XML), not so well-formatted (free-flowing)
- Intro to automated data readers
Reading: MSW chapters 1, 8, 9
Mini-projects: MP 1 due, MP 2 assigned

Class 3
Topic:
- Representing text (NLP, vectors)
- Use of SAS EM
Reading: TMH chapters 4, 5
Mini-projects: MP 2 due, MP 3 assigned

Class 4
Topic:
- Using probabilistic models for information extraction
- Clustering and modeling documents
- Use of R add-ons
Reading: TMH chapters 6, 7
Mini-projects: MP 3 due, MP 4 assigned

Class 5
Topic:
- Visualizing text (graphs, network analysis)
- Use of R add-ons
Reading: TMH chapters 10, 11
Mini-projects: MP 4 due, MP 5 assigned

Class 6
Topic:
- Data handling (ownership, security, privacy...)
Reading: Online material
- HBR: “With Big Data Comes Big Responsibility”
- HBR: “The Cross-Atlantic Tussle over Financial Data and Privacy Rights”
Mini-projects: MP 5 due, MP 6 assigned

Class 7
Topic:
- Extracting Insights
Reading: Online material
- Harvard Case: “Netflix leading with data”
Mini-projects: MP 6 due

Class 8
Project presentation & report submission
* Textbook 1 – Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, by Ronen Feldman
and James Sanger, Cambridge University Press.
** Textbook 2 – Mining the Social Web, by Matthew A. Russell, O'Reilly Media