CSE 6040 Computing for Data Analytics: Methods and Tools
Lecture 1 – Course Overview
Da Kuang, Polo Chau
Georgia Tech
Fall 2014

Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS

Course Staff
Instructor
  ◦ Da Kuang
  ◦ Postdoctoral Researcher, CSE
  ◦ Office: Klaus 1305 (facing the kitchen door)
  ◦ Office hour: Thu 4-5pm, Klaus 1315
Instructor
  ◦ Duen Horng (Polo) Chau
  ◦ Assistant Professor, CSE
  ◦ Office hour: Thu 4-5pm, Klaus 1315
TA
  ◦ Lianxiao (Shawn) Qiu
  ◦ MS CS Student
  ◦ Office hour: Mon 1-2pm, Klaus 2108

MS Analytics Curriculum
Computing
  ◦ Computing for Data Analysis: Methods and Tools
  ◦ Data and Visual Analytics
  ◦ Computational Data Analysis
  ◦ High Performance Computing
Statistics/Optimization
  ◦ Introduction to Analytical Methods
  ◦ Regression Analysis
  ◦ Deterministic Optimization
  ◦ Probabilistic Models
  ◦ Data Mining and Statistical Learning
  ◦ Simulation
  ◦ Time Series Analysis
Business
  ◦ Introduction to Business for Analytics
  ◦ Risk Analytics
  ◦ Project Management
  ◦ Pricing Analytics and Revenue Management
  ◦ Business Process Analysis and Design
  ◦ Customer Relationship Management
Legend: Introductory to Advanced

Data Analytics Problems
Regression: Predicting a numerical variable
  ◦ Target variable: # new homes sold in the US
  ◦ Predictors: search frequencies on Google
[Figure: y-axis is # new homes sold in the US; shaded areas indicate US recessions]
[Hal Varian, Predicting the present with search engine data, 2013]

Data Analytics Problems
Classification: Predicting a categorical variable (or its probability)
  ◦ Query classification
  ◦ News classification
  ◦ Statistical machine translation

Data Analytics Problems
Clustering: Finding patterns without human labeling
Both topic modeling and recommender systems can be viewed as clustering problems.

Data Analytics Pipeline
Stages: Data storage/retrieval, Data collection, Data analysis, Data visualization
Tools shown: sqlite, Scrapy, Selenium, BeautifulSoup, numpy, pandas, scikit-learn, igraph, bokeh
(On the original slide, names in red are Python packages.)

What you will learn in this course
Python programming (and a little bit of Java and Matlab) – 4 lectures
  ◦ One of Google’s 3 main languages
Python packages
  ◦ Data collection – 2 lectures
  ◦ Data storage and retrieval – 1 lecture
  ◦ Data analysis
  ◦ Data visualization – 2 lectures
Basic linear algebra (math tools, matrices, etc.) – 2 lectures
Basic numerical computing (how to do math programmatically) – 4 lectures
Several fundamental machine learning algorithms (focusing on intuitive ideas and software development for them)
  ◦ Linear regression – 2 lectures
  ◦ Logistic regression – 1 lecture
  ◦ K-means – 2 lectures
  ◦ Singular value decomposition – 4 lectures
(More detailed topics are in the tentative syllabus online.)

Logistics
Course website (with tentative schedule; slides and assignments will be posted here): http://www.cc.gatech.edu/~dkuang3/cse6040/
Discussion, Q&A, and finding teammates on Piazza (please sign up): https://piazza.com/gatech/fall2014/cse6040/home
Homework/Project submissions on T-square (only for submission; use Piazza for discussion): https://t-square.gatech.edu/

Logistics
3 homework assignments (30%)
Mid-term (20%)
Project (40%) – more details coming soon!
Class and Piazza participation (10%)
No late homework allowed.
Start looking for project teammates now
  ◦ 2 to 3 people per team

What you will do in this course
Attend the lectures
ACTIVELY participate in class discussion
  ◦ Based on both in-class and Piazza activities
  ◦ Chat with and help out your classmates on Piazza (but DO NOT share your answers)
  ◦ 10% of your grade
Read tutorials/references for programming languages
Read documentation for software packages
Solve simple math problems
  ◦ Included in the mid-term: 20% of your grade

What you will do in this course (cont’d)
Coding, of course!
  ◦ Homework #1: Collect real data online (10%)
  ◦ Homework #2: Visualize the data you collected (10%)
  ◦ Homework #3: Implement a machine learning algorithm (10%)
  ◦ Play with different machine learning frameworks/packages
Project (40%): Work on the Yelp Dataset
  ◦ Data for five cities (US, Canada, UK); four of them just released this month
  ◦ Includes businesses, attributes, check-ins, tips, users, user connections, reviews
  ◦ Work in teams of 2 to 3 students
  ◦ Get inspired: https://github.com/Yelp/dataset-examples (Again, in Python!) (DO NOT copy these examples for your project)
  ◦ Write your own team proposal
  ◦ (Optional) Enter the challenge ($5K prize): Round 4 through Dec 31, 2014

Course Expectation
You will never say “I don’t have data”.
You will be exposed to the entire lifecycle of data analytics (in a simplified way).
You will be able to code in Python, a common scripting language for data analytics that is used by many companies (e.g., it is one of Google’s 3 main languages), and you will gain experience with many useful packages.
You will know some of the most fundamental machine learning algorithms. If you already know them, you will gain a deeper understanding of them from the computational perspective.
Hopefully, you will know how to write fast code for data analytics.

Why Python?
One of Google’s 3 main languages
Simpler code: Focus on concepts rather than machine details
More readable
Many useful packages
  ◦ Data manipulation
  ◦ Machine learning
  ◦ Image processing
  ◦ Natural language processing
  ◦ Spatial analysis
  ◦ Web application
  ◦ ......
Reasonably fast
Easier to parallelize than C++

Python Setup
A text editor + a terminal (command-line window)
  ◦ This is the convention for (Python) developers in companies
Text editor suggestions:
  ◦ Windows: Notepad++ (open source, with auto-indent and auto-fill)
  ◦ Linux: Vim, Emacs, Sublime
  ◦ Mac: Sublime, TextWrangler
We use Python 2.7, NOT the latest version, 3.x
  ◦ Many packages support Python 2.x only

Go Jackets!
Everyone – Sign up on Piazza: https://piazza.com/gatech/fall2014/cse6040/home
Windows users – Install Python on your own machine: https://www.python.org/downloads/
  ◦ Make sure it’s Python 2.7.8, NOT Python 3.x
  ◦ Make sure “python” can be called on the command line (you may need to set up environment variables)
  ◦ Make sure the “Python27” directory is located in a root directory, NOT in “Program Files”
Everyone – Set up your development environment
  ◦ See https://developers.google.com/edu/python/set-up
Everyone – Download your own copy of the Yelp dataset (423M tarball): http://www.yelp.com/dataset_challenge
  ◦ We cannot share it with you because of its terms and conditions
  ◦ Tip: Save the page that contains the “Download Data” button for future use
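
The interpreter checks in the setup list above could be automated with a short script. What follows is only a minimal sketch, not part of the course materials: the helper name check_setup and its warning messages are hypothetical, and it covers just two of the listed items (the interpreter is Python 2.7.x, and it is not installed under "Program Files"). It is written to run unchanged on both Python 2.7 and 3.x, so that a wrong interpreter reports the problem instead of failing with a syntax error.

```python
import sys


def check_setup(version_info, executable):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper (an assumption, not from the slides). It checks that
    the interpreter is Python 2.7.x and that its path does not contain
    "Program Files".
    """
    problems = []
    # Compare only (major, minor), so any 2.7.x release is accepted.
    if tuple(version_info[:2]) != (2, 7):
        problems.append("expected Python 2.7.x, found %d.%d"
                        % (version_info[0], version_info[1]))
    # Case-insensitive match, since Windows paths are case-insensitive.
    if "program files" in executable.lower():
        problems.append("interpreter is installed under Program Files: "
                        + executable)
    return problems


if __name__ == "__main__":
    # sys.executable can be empty or None in unusual embeddings.
    issues = check_setup(sys.version_info, sys.executable or "")
    if issues:
        for issue in issues:
            print("WARNING: " + issue)
    else:
        print("Setup looks fine: Python " + sys.version.split()[0])
```

Saved under a name such as check_setup.py (a hypothetical filename), it can be run from the terminal with "python check_setup.py". If the terminal reports that "python" is not a recognized command, that is the environment-variable item from the list: the Python27 directory probably needs to be added to PATH.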