SYLLABUS
DSCI 5350 – Big Data Analytics
Spring 2016

CLASS (DAY/TIME): Thursdays 6:30 - 9:20 PM, BLB 245
INSTRUCTOR: Dr. Nick Evangelopoulos
OFFICE HRS: TW 1:00-2:00pm (BLB 365D)
CONTACT INFO: OFFICE PHONE: 940-565-3056; E-MAIL (preferred): evangeln@unt.edu

Readings
deRoos, Zikopoulos, et al., Hadoop for Dummies, Wiley, ISBN 978-1-118-60755-8.
Getting Started with SAS Text Miner 13.2, SAS Publishing 2014. Also available as a free PDF from the http://support.sas.com/documentation Web site.
Readings from current journals/periodicals, industry reports, and training material as assigned.

Software
VMware Player, Cloudera Quickstart VM 5.5, SAS EM 13.2, R, IBM SPSS Statistics 22, MS Excel, MS Access.
NOTE: Some packages are available in our COB virtual lab; others you need to install on your own laptop/PC. You will need a PC with 8GB of RAM, running a 64-bit version of Windows, or an equivalent Mac.

Software inside Cloudera Quickstart VM (Spring 2016)
MySQL, Sqoop, Hive, Flume, Hue, Impala, Solr, Scala, Spark.
NOTE: These packages are already installed inside the Cloudera Quickstart VM in a Linux operating system environment. You do not need to install them yourself, but you will need to “play” the VM (which is a very large file) on your PC using a VM Player. VMware Player 7 is available for free download for non-commercial purposes (freeware). VMware Workstation 12 Player is also available free of charge for personal use.

Blackboard Learn
Materials for the DSCI 5350 course will be posted on the Blackboard Learn system.

Course Description
Current issues in storage, retrieval, and analysis of large volumes of data (Big Data), in order to support business decisions. Big Data are stored in a variety of formats, including Web logs and Internet clickstream data, as well as unstructured data such as industry reports and customer comments. Big Data analytics utilizes data sources that may be left untapped by conventional business intelligence solutions.
Topics include conventional data warehousing, retrieval of large data sets that are stored across clustered systems, natural language processing, topic extraction in textual data, machine learning and artificial intelligence, and predictive analytics for unstructured data. A semester project in Big Data Analytics relevant to a functional area of business is an important component of the course.

Course Objectives
1. Develop an understanding of how Big Data Analytics is needed and used in managerial decision processes and everyday management situations
2. Develop critical skills that help build a background as a data analyst and a data scientist
3. Understand the deployment of distributed processing frameworks
4. Develop the ability to implement and demonstrate methods for Big Data analytics
5. Develop familiarity with a number of algorithms for Big Data analytics
6. Understand the challenges in analyzing unstructured data
7. Enhance written and oral presentation skills
8. Enhance discussion and leadership skills

Class Attendance
Regular class attendance and informed participation are expected.

Course Prerequisites
Graduate status and an introductory graduate course in Business Statistics, such as DSCI 5010 or DSCI 5180, or consent of the ITDS department, are required. Some experience with database management systems and SQL is helpful. Some experience with any programming language using the command line is helpful.

Point Allocation
Quizzes (12@5 pts, 4 dropped)      40 pts
HW assignments (11@10 pts)        110 pts
Group Project & Presentation       40 pts
Exam 1                             60 pts
Exam 2 (final)                     50 pts
_________________________________________
TOTAL                             300 pts

Letter Grades
270+ pts (=90%) = A
240+ pts (=80%) = B
210+ pts (=70%) = C
180+ pts (=60%) = D
Below 180 pts = F

Quizzes
There will be 12 Quizzes, worth 5 points each. The Quizzes will be closed-book and multiple-choice. They will cover the material presented in class on the day of the Quiz.
Make-up Quizzes will not be allowed, but the 4 lowest grades among the 12 Quizzes will be dropped.

Homework Assignments
HW 1: Install Cloudera Quickstart VM using files & instructions posted on Blackboard
HW 2: Introduction to Relational databases and simple SQL queries (Excel, Access)
HW 3: Cloudera Live Beginner Tutorial 1 (Sqoop, Hive, Avro)
HW 4: Cloudera Tutorial 2 (Hue, Impala, SQL queries)
HW 5: Cloudera Tutorial 3 (Flume, Impala, semi-structured data)
HW 6: Cloudera Tutorials 4, 5, 6 (Spark, Solr, Morphlines, Flume, Hue, Web log data)
HW 7: Introduction to R, regression analysis in R (R, software piracy data)
HW 8: Sentiment analysis, Word Cloud (R, Social Media, Twitter API, Twitter data)
HW 9: Text clusters (SAS EM 13.1, Yelp data)
HW 10: Text topics (SAS EM 13.1, Yelp data)
HW 11: Predictive modeling, adjustments, and comparisons (IBM SPSS, Yelp data)

Group Project
There will be a project that requires teamwork. Related handouts will be distributed in class and related datasets will be posted on the course Web site. You will be asked to form teams of 2-3 members. You will have to select a business problem that interests you (the instructor will suggest a problem in case you run out of ideas). Problem data/facts can be real, obtained from published sources, or made up by your team so that they correspond to realistic situations. Your team will prepare a written report and a PowerPoint presentation, to be presented in class.

Exams
There will be one mid-term examination during the semester, worth 60 points, and a final exam, worth 50 points (see the Point Allocation section). Both will be unit exams. The final will not be comprehensive. Exams will be multiple-choice, closed-book, and closed-notes. Calculators will not be required. Laptop computers, tablets, smartphones, smart watches, and similar communication devices will not be allowed during an exam. Use of cell phones will be allowed only in case of emergency.
Miscellaneous Policies

IMPORTANT DATES: Dates of drop deadlines, exams, final exams, etc., are published in the university catalog and schedule of classes. It is your responsibility to be informed with regard to these dates. Unawareness is no excuse. Do not wait until the "last minute" to drop if you are not making satisfactory progress in this class. Your instructor may not be available at that time.

Campus Closures
Should UNT close campus, it is your responsibility to keep checking your official UNT e-mail account (EagleConnect), the UNT Web site, and Blackboard, to learn whether and how your instructor plans to modify class activities. This may include changing assignment due dates, rescheduling quizzes and exams, etc.

Student Perceptions of Teaching (SPOT)
Student Perceptions of Teaching (SPOT) utilizes IASystem® and is a requirement for all organized classes at UNT. This short Web-based survey will be available to you at the end of the semester, providing you a chance to comment on how this class is taught. I am very interested in this feedback from my students, as I work to continually improve my teaching. I consider SPOT to be an important part of your class participation.

Use of Cell Phones
As a courtesy to your instructor and to your fellow classmates, you are asked to set your cell phone to vibrate, or switch it off. In case of a personal emergency, if you must use your cell phone, please step out of the classroom.

Students with Disabilities
UNT and the College of Business comply with the Americans with Disabilities Act in making reasonable accommodations. If you have an established disability you should request accommodation from the Office for Disability Accommodation.

Academic Integrity
This course adheres to the UNT policy on academic integrity. The policy can be found at http://vpaa.unt.edu/academic-integrity.htm. Practices that violate academic integrity, such as “cheating” or “plagiarism”, are strongly discouraged.
If you are in violation you may receive a failing grade on the test or assignment, or a failing grade in the course.

Class Schedule (Subject to change; Effective 1/21/2016)

Week  Date
1     21-Jan
2     28-Jan
3     4-Feb
4     11-Feb
5     18-Feb
6     25-Feb
7     3-Mar
8     10-Mar
      17-Mar
9     24-Mar
10    31-Mar
11    7-Apr
12    14-Apr
13    21-Apr
14    28-Apr
15    5-May
      Thu May 12

Topics (in date order)
Course Introduction
Overview of Big Data Analytics
Last week to drop for an 80% refund
RDBMS and SQL, Data cleaning
Last week to drop for a 70% refund
Intro to Hadoop; RDBMS vs. Hive
Last week to drop for a 50% refund
Hadoop ecosystem; SQL queries, Impala
Hadoop ecosystem; Spark, Scala
Hadoop ecosystem; Solr, Web log data
Hadoop ecosystem; Hue, dashboards, midterm exam review
Midterm Exam
UNT Spring Break (NO CLASS)
Introduction to R, Sentiment analysis, social media APIs (Twitter, Facebook, Guardian)
Text clusters/topics, SAS EM, HW9, HW10
Text clusters/topics, SAS EM, HW9, HW10
Last week to withdraw from a course
Predictive modeling, IBM SPSS, HW11
BDA Market Players, Project Progress
NO CLASS MEETING (team time)
Work on your group project
Project progress report, Exam review
Project presentations, course evaluation
FINAL EXAM (non-comprehensive): 6:30 pm-8:30 pm, normal classroom

Assignment Due (in date order)
HW1
HW2
HW3
HW4
HW5
HW6
Exam 1 (60 pts)
HW7
HW8
HW9
HW10
HW11
Project Presentation (40 pts)
Exam 2 (50 pts)