Introduction

Contact Information
•Instructor: Fatme El-Moukaddem
•Email: elmoukad@egr.msu.edu
•Office: Room 2312 Engineering Building
•Office Hours:
  • After class
  • By appointment (Mon/Wed)

Books
•Data Just Right
  • Michael Manoochehri
•Introduction to Data Mining
  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar
•Data Science for Business
  • Foster Provost, Tom Fawcett
•Hadoop in Action
  • Chuck Lam
•Hadoop: The Definitive Guide
  • Tom White
•Mining of Massive Datasets
  • Anand Rajaraman
•Data Mining: Practical Machine Learning Tools and Techniques
  • Ian Witten, Eibe Frank, Mark Hall

Assessment
•Homework: 35%
•In-class exercises: 20%
•Exams: 25%
•Project: 20% (paper/presentation/review)

Make-Up Exam Policy
•Only in case of an emergency
•Documentation needed
•If you know ahead of time, let me know

Homework
•Use the handin system to submit:
  • secure.cse.msu.edu/handin
•Submit on time
•25% of the full grade is deducted if submitted within 1 day after the deadline
•50% of the full grade is deducted if submitted within 2 days after the deadline
•Not accepted after 48 hours unless previously arranged with the instructor

Important Dates
•Exam 1: Monday, Feb 23rd (tentative)
•Exam 2: Wednesday, Apr 15th (tentative)
•Last date to drop with full refund: Feb 6th
•Last date to drop with no grade: Mar 4th

Course Outline
•Data Collection and Storage
•Data Processing and Analysis
•Big Data Tools
•Applications & Case Studies
•Project presentation

Programming Languages and Tools
•Python
•Java
•Weka software
•SQL, NoSQL
•Hadoop, Pig, Hive

Why Big Data Analytics
•Huge amount of data available
  • Retail
  • Banking
  • Insurance
  • Medical field
  • Engineering
•Data generated in terabytes
•Accessible

Activity on the Internet (in one minute)
•639,800 GB of global IP data transferred
•133 botnet infections
•20 million photos are viewed on Flickr; 3,000 photos are uploaded
•6 new Wikipedia articles are published
•320 new Twitter accounts are created; 100,000 new tweets are sent
•1,300 new mobile users
•277,000 Facebook logins; 6 million Facebook views
•20 new victims of identity theft
•2 million Google search queries are initiated
•204 million emails are sent
•30 hours of video are uploaded to YouTube; 1.3 million videos are viewed
•47,000 app downloads
•$83,000 in Amazon sales
•61,141 hours of music are played on Pandora
•100 new LinkedIn accounts are created
•The number of networked devices equals the global population. By 2015 that number will be double the global population.
•In 2015 it would take you 5 years to view all video crossing IP networks each second.

Goal
•Data sources
•Data Collection and Storage
•Data Preprocessing
•Data Analysis
•Knowledge
•Decision Making
•Postprocessing

Examples
•Where to open the next coffee shop location?
•How to predict which customers are likely to quit?
•How to decide whether a transaction is fraudulent?
•What factors keep customers satisfied?
•What factors contribute to infection spread/control?
•What items are customers likely to purchase together?

Single Server Solution
•Cheap hardware
•Relational database management system
  • Mature technology to store data
  • Lots of tools
  • Query language

Single Server Solution
•Limited processing
•Limited storage
•Problematic with large volumes of data
•Does not provide timely answers/analysis

Distributed Computing
•Distribute software across multiple servers
•Data processing and information retrieval are done by a collection of computers
•A distributed computing system is more economical: deploying a collection of cheap, small servers costs less than buying a custom-built, high-end server with comparable computing power

Challenges
•Distributing data across multiple machines: nontrivial
•Consistency
•Concurrency
•Retrieval (transparency, latency)
•Security
•Failure
•Heterogeneity

Design Questions
•Which database backend to use? Relational, key-value?
•Buy or build software?
•How to collect data?
•How to analyze it? Share it? Visualize it?
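
Example: why distributing data is nontrivial
The first item under Challenges can be made concrete with a small sketch. It assumes one simple placement scheme, hashing each record key and taking the result modulo the number of servers; the slides do not prescribe this scheme, and the names `assign_server`, `partition`, and `moved_fraction`, as well as the customer ids, are hypothetical. The sketch shows that adding a single server changes the placement of most records, which is one reason data distribution needs careful design.

```python
import hashlib
from collections import defaultdict


def assign_server(key, num_servers):
    """Map a record key to a server index using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def partition(keys, num_servers):
    """Group keys by the server they are assigned to."""
    placement = defaultdict(list)
    for key in keys:
        placement[assign_server(key, num_servers)].append(key)
    return dict(placement)


def moved_fraction(keys, old_servers, new_servers):
    """Fraction of keys whose server changes when the cluster is resized."""
    moved = sum(
        1 for key in keys
        if assign_server(key, old_servers) != assign_server(key, new_servers)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Hypothetical record keys, for illustration only.
    customer_ids = ["customer-{}".format(i) for i in range(1000)]

    for server, keys in sorted(partition(customer_ids, 3).items()):
        print("server {}: {} records".format(server, len(keys)))

    # Growing the cluster from 3 to 4 servers relocates most records.
    print("moved after adding a server:", moved_fraction(customer_ids, 3, 4))
```

Schemes that relocate fewer records on resize exist, but they are outside the scope of this sketch.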
Unix
•ls: show folder content
•cd path: navigate through directories
•mkdir dirname: create a directory
•pwd: print working directory
•mv sources target: move the source files to the target directory
•cp source target: create a copy of source and call it target
•rm files: remove one or more files
•grep: search utility
•cat filename: display content
•touch filename: create an empty file if it does not exist; otherwise set its timestamp to the current time

Unix
•gzip filename: compress file
•gunzip filename: uncompress gzipped file
•tar: create or extract archives
  • tar -zcvf result.tar.gz source
  • tar -zxvf result.tar.gz
•ps: list active processes
•kill pid: kill process with id pid
•ssh hostname: a program for logging in to a remote host
•man cmd: show manual entry for cmd
•Reference: http://www.cs.jhu.edu/~joanne/unixRC.pdf

Python
•Resources
  • https://docs.python.org/3.4/tutorial/
  • The Python Quick Syntax Reference
  • Beginning Python: Using Python 2.6 and Python 3.1
  • Available online at magic.msu.edu
  • IDE: PyCharm
•At CSE:
  • ssh to black.cse.msu.edu
  • python: starts the interactive shell
  • python script.py: executes a Python script
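
Example: a first script.py
As something to try with `python script.py`, the sketch below mimics a small part of the `grep` utility from the Unix slides: it prints the lines that contain a given text, with their line numbers. It assumes Python 3; the function name `matching_lines`, the sample lines, and the command-line layout (pattern first, then filename) are illustrative choices, not part of the course material.

```python
import sys

# Hypothetical sample input, used when no file is given on the command line.
SAMPLE_LINES = [
    "Data Collection and Storage\n",
    "Big Data Tools\n",
    "Project presentation\n",
]


def matching_lines(lines, pattern):
    """Return (line_number, text) pairs for the lines that contain pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern in line
    ]


def main(argv):
    if len(argv) == 3:
        # Usage: python script.py PATTERN FILENAME
        pattern, filename = argv[1], argv[2]
        with open(filename) as handle:
            lines = handle.readlines()
    else:
        # No arguments: fall back to the built-in sample.
        pattern, lines = "Data", SAMPLE_LINES

    for number, text in matching_lines(lines, pattern):
        print("{}:{}".format(number, text))


if __name__ == "__main__":
    main(sys.argv)
```

The search is a plain, case-sensitive substring match; regular expressions, which `grep` supports, are left out to keep the script short.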