DSCI 5350 – Big Data Analytics Spring 2016 Readings

SYLLABUS
DSCI 5350 – Big Data Analytics
Spring 2016
CLASS (DAY/TIME): Thursdays 6:30 - 9:20 PM, BLB 245
INSTRUCTOR: Dr. Nick Evangelopoulos
OFFICE HRS: TW 1:00-2:00pm (BLB 365D)
CONTACT INFO:
OFFICE PHONE: 940-565-3056
E-MAIL (preferred): evangeln@unt.edu
Readings



deRoos, Zikopoulos, et al., Hadoop for Dummies, Wiley, ISBN 978-1-118-60755-8.
Getting Started with SAS Text Miner 13.2, SAS Publishing 2014. Also available as a
free PDF from the http://support.sas.com/documentation Web site.
Readings from current journals/periodicals, industry reports, and training material as
assigned.
Software
VMware Player, Cloudera Quickstart VM 5.5, SAS EM 13.2, R, IBM SPSS
Statistics 22, MS Excel, MS Access. NOTE: Some packages are available in our COB
virtual lab, but others you will need to install on your own laptop/PC. You will need a
PC with 8GB of RAM, running a 64-bit version of Windows, or an equivalent Mac.
Software inside Cloudera Quickstart VM (Spring 2016)
MySQL, Sqoop, Hive, Flume, Hue, Impala, Solr, Scala, Spark. NOTE: These
packages are already installed inside the Cloudera Quickstart VM in a Linux operating
system environment. You do not need to install any of them yourself, but you
will need to “play” the VM (which is a very large file) on your PC using a VM Player.
VMware Player 7 is available for free download for non-commercial purposes (freeware).
VMware Workstation 12 Player is also available free of charge for personal use.
Blackboard Learn
Materials for the DSCI 5350 course will be posted on the Blackboard Learn system.
Course Description
Current issues in storage, retrieval, and analysis of large volumes of data (Big Data), in
order to support business decisions. Big Data are stored in a variety of formats, including
Web logs and Internet clickstream data, as well as unstructured data, such as industry reports
and customer comments. Big Data analytics utilize data sources that may be left
untapped by conventional business intelligence solutions. Topics include conventional
data warehousing, retrieval of large data sets that are stored across clustered systems,
natural language processing, topic extraction in textual data, machine learning and
artificial intelligence, and predictive analytics for unstructured data. A semester project
in Big Data Analytics relevant to a functional area of business is an important component
of the course.
Course Objectives
1. Develop an understanding of how Big Data Analytics is needed and used in
managerial decision processes and everyday management situations
2. Develop critical skills that help build a background as a data analyst and a data
scientist
3. Understand the deployment of distributed processing frameworks
4. Develop the ability to implement and demonstrate methods for Big Data analytics
5. Develop familiarity with a number of algorithms for Big Data analytics
6. Understand the challenges in analyzing unstructured data
7. Enhance written and oral presentation skills
8. Enhance discussion and leadership skills
Class Attendance
Regular class attendance and informed participation are expected.
Course Prerequisites
Graduate status and an introductory graduate course in Business Statistics, such as
DSCI 5010 or DSCI 5180, or consent of the ITDS department, are required. Some
experience with database management systems and SQL is helpful. Some experience
with any programming language using the command line is helpful.
Point Allocation
Quizzes (12@5 pts, 4 dropped)     40 pts
HW assignments (11@10 pts)       110 pts
Group Project & Presentation      40 pts
Exam 1                            60 pts
Exam 2 (final)                    50 pts
_____________________________________________________________________________
TOTAL                            300 pts
Letter Grades
270+ pts (=90%) = A
240+ pts (=80%) = B
210+ pts (=70%) = C
180+ pts (=60%) = D
Below 180 pts = F
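As an illustration only, the cutoffs above can be written as a short Python sketch. The function name letter_grade and the sample component scores are hypothetical and are not part of the course materials. The sketch assumes a cutoff is met when the total is at or above it, as the "+" in the list suggests.

```python
# Letter-grade cutoffs from the syllabus, in points out of 300.
GRADE_CUTOFFS = [(270, "A"), (240, "B"), (210, "C"), (180, "D")]


def letter_grade(total_points):
    """Return the letter grade for a course total out of 300 points."""
    for cutoff, letter in GRADE_CUTOFFS:
        if total_points >= cutoff:
            return letter
    return "F"


# Hypothetical scores for one student, one entry per graded component.
example_scores = {
    "quizzes": 36,    # out of 40
    "homework": 98,   # out of 110
    "project": 35,    # out of 40
    "exam1": 50,      # out of 60
    "exam2": 42,      # out of 50
}
example_total = sum(example_scores.values())
print(example_total, letter_grade(example_total))
```

With these made-up scores the total is 261 points, which falls in the 240+ band.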
Quizzes
There will be 12 Quizzes, worth 5 points each. The Quizzes will be closed-book and
multiple-choice. They will cover the material presented in class on the day of the
Quiz. Make-up Quizzes will not be allowed, but the 4 lowest grades among the 12
Quizzes will be dropped.
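A minimal Python sketch of the drop rule, for illustration. It assumes that a missed Quiz counts as 0 points and may be one of the four dropped scores. The function name quiz_total and the sample scores are hypothetical.

```python
def quiz_total(scores, drop=4):
    """Sum quiz scores after dropping the `drop` lowest ones."""
    if len(scores) <= drop:
        raise ValueError("need more scores than the number dropped")
    kept = sorted(scores)[drop:]  # ascending sort, so the lowest come first
    return sum(kept)


# Hypothetical scores on the 12 Quizzes (0 = missed Quiz, by assumption).
sample_scores = [5, 4, 0, 3, 5, 5, 2, 4, 0, 5, 3, 4]
print(quiz_total(sample_scores))
```

Under this rule the most a student can earn from Quizzes is 8 x 5 = 40 points, which matches the Point Allocation.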
Homework Assignments
HW 1: Install Cloudera Quickstart VM using files & instructions posted on Blackboard
HW 2: Introduction to Relational databases and simple SQL queries (Excel, Access)
HW 3: Cloudera Live Beginner Tutorial 1 (Sqoop, Hive, Avro)
HW 4: Cloudera Tutorial 2 (Hue, Impala, SQL queries)
HW 5: Cloudera Tutorial 3 (Flume, Impala, semi-structured data)
HW 6: Cloudera Tutorials 4, 5, 6 (Spark, Solr, Morphlines, Flume, Hue, Web log data)
HW 7: Introduction to R, regression analysis in R (R, software piracy data)
HW 8: Sentiment analysis, Word Cloud (R, Social Media, Twitter API, Twitter data)
HW 9: Text clusters (SAS EM 13.1, Yelp data)
HW 10: Text topics (SAS EM 13.1, Yelp data)
HW 11: Predictive modeling, adjustments, and comparisons (IBM SPSS, Yelp data)
Group Project
There will be a project that will require teamwork. Related handouts will be distributed
in class and related datasets will be posted on the course Web site. You will be asked to
form teams of 2-3 members. You will have to select a business problem that interests
you (the instructor will suggest a problem if you run out of ideas). Problem
data/facts can be real, obtained from published sources, or made up by your team so
that they correspond to realistic situations. Your team will prepare a written report
and a PowerPoint presentation, to be presented in class.
Exams
There will be one mid-term examination during the semester, worth 60 points, and a final
exam, worth 50 points (see the Point Allocation section). Both will be unit exams; the final
will not be comprehensive. Exams will be multiple choice, closed book, and closed
notes. Calculators will not be required. Laptop computers, tablets,
smartphones, smart watches, and similar communication devices will not be allowed
during an exam. Use of cell phones will be allowed only in case of emergency.
Miscellaneous Policies
IMPORTANT DATES: Dates of drop deadlines, exams, final exams, etc., are published
in the university catalog and schedule of classes. It is your responsibility to be informed
with regard to these dates. Unawareness is no excuse. Do not wait until the "last minute"
to drop if you are not making satisfactory progress in this class. Your instructor may not
be available at that time.
Campus Closures
Should UNT close campus, it is your responsibility to keep checking your official UNT
e-mail account (EagleConnect), the UNT Web site, and Blackboard, to learn if your
instructor plans to modify class activities, and how. This may include changing
assignment due dates, rescheduling quizzes and exams, etc.
Student Perceptions of Teaching (SPOT)
Student Perceptions of Teaching (SPOT) utilizes IASystem® and is a requirement for all
organized classes at UNT. This short Web-based survey will be available to you at the
end of the semester, providing you a chance to comment on how this class is taught. I am
very interested in this feedback from my students, as I work to continually improve my
teaching. I consider SPOT to be an important part of your class participation.
Use of Cell Phones
As a courtesy to your instructor and to your fellow classmates, you are asked to set your
cell phone to vibrate, or switch it off. In case of a personal emergency, if you must use
your cell phone, please step out of the classroom.
Students with Disabilities
UNT and the College of Business comply with the Americans with Disabilities Act in
making reasonable accommodations. If you have an established disability you should
request accommodation from the Office for Disability Accommodation.
Academic Integrity
This course adheres to the UNT policy on academic integrity. The policy can be found at
http://vpaa.unt.edu/academic-integrity.htm. Practices that violate academic integrity,
such as “cheating” or “plagiarism”, are strongly discouraged. If you are in violation you
may receive a failing grade on the test or assignment, or a failing grade in the course.
Class Schedule (Subject to change; Effective 1/21/2016)
Week   Date         Topics
1      21-Jan       Course Introduction; Overview of Big Data Analytics
                    Last week to drop for an 80% refund
2      28-Jan       RDBMS and SQL, Data cleaning
                    Last week to drop for a 70% refund
3      4-Feb        Intro to Hadoop; RDBMS vs. Hive
                    Last week to drop for a 50% refund
4      11-Feb       Hadoop ecosystem; SQL queries, Impala
5      18-Feb       Hadoop ecosystem; Spark, Scala
6      25-Feb       Hadoop ecosystem; Solr, Web log data
7      3-Mar        Hadoop ecosystem; Hue, dashboards, midterm exam review
8      10-Mar       Midterm Exam
       17-Mar       UNT Spring Break (NO CLASS)
9      24-Mar       Introduction to R, Sentiment analysis, social
                    media APIs (Twitter, Facebook, Guardian)
10     31-Mar       Text clusters/topics, SAS EM, HW9, HW10
11     7-Apr        Text clusters/topics, SAS EM, HW9, HW10
                    Last week to withdraw from a course
12     14-Apr       Predictive modeling, IBM SPSS, HW11
                    BDA Market Players, Project Progress
13     21-Apr       NO CLASS MEETING (team time)
                    Work on your group project
14     28-Apr       Project progress report, Exam review
15     5-May        Project presentations, course evaluation
       Thu May 12   FINAL EXAM (non-comprehensive):
                    6:30 pm-8:30 pm, normal classroom

Assignment Due (in order of due date): HW1, HW2, HW3, HW4, HW5, HW6, Exam 1 (60 pts),
HW7, HW8, HW9, HW10, HW11, Project Presentation (40 pts), Exam 2 (50 pts)