CSE 6040 Computing for Data Analytics: Methods and Tools Lecture

CSE 6040
Computing for Data Analytics:
Methods and Tools
Lecture 1 – Course Overview
Da Kuang, Polo Chau
Georgia Tech
Fall 2014
Fall 2014
CSE 6040 COMPUTING FOR DATA ANALYSIS
Course Staff
Instructor
◦ Da Kuang
◦ Postdoctoral Researcher, CSE
◦ Office: Klaus 1305 (facing the kitchen door)
◦ Office hour: Thu 4-5pm, Klaus 1315
Instructor
◦ Duen Horng (Polo) Chau
◦ Assistant Professor, CSE
◦ Office hour: Thu 4-5pm, Klaus 1315
TA
◦ Lianxiao (Shawn) Qiu
◦ MS CS Student
◦ Office hour: Mon 1-2pm, Klaus 2108
MS Analytics Curriculum
Computing
◦ Computing for Data Analysis: Methods and Tools
◦ Data and Visual Analytics
◦ Computational Data Analysis
◦ High Performance Computing
Statistics/Optimization
◦ Introduction to Analytical Methods
◦ Regression Analysis
◦ Deterministic Optimization
◦ Probabilistic Models
◦ Data Mining and Statistical Learning
◦ Simulation
◦ Time Series Analysis
Business
◦ Introduction to Business for Analytics
◦ Risk Analytics
◦ Project Management
◦ Pricing Analytics and Revenue Management
◦ Business Process Analysis and Design
◦ Customer Relationship Management
(Slide legend: Introductory / Advanced)
Data Analytics Problems
Regression: Predicting a numerical variable
Y-axis: # New homes sold in the US (shaded areas indicate US recessions)
[Hal Varian, Predicting the present with search engine data, 2013]
Data Analytics Problems
Regression: Predicting a numerical variable
Predictors: search frequencies on Google
Target variable: # new homes sold in the US
[Hal Varian, Predicting the present with search engine data, 2013]
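The idea of predicting a numerical target from a predictor can be sketched in plain Python. This is only an illustration with one predictor: the numbers are made up (they are not Varian's data), and `fit_line` is a hypothetical helper that applies the ordinary least-squares formulas.

```python
from __future__ import division, print_function


def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: a search-frequency index (predictor) and home sales (target).
search_index = [1.0, 2.0, 3.0, 4.0, 5.0]
homes_sold = [12.0, 14.0, 16.0, 18.0, 20.0]  # constructed as 10 + 2 * index

intercept, slope = fit_line(search_index, homes_sold)
prediction = intercept + slope * 6.0  # predict the target for a new index value
print(intercept, slope, prediction)
```

Because the toy data lie exactly on a line, the fit recovers that line; real data would leave residual error.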
Data Analytics Problems
Classification: Predicting a categorical variable (or its probability)
◦ Query classification
◦ News classification
◦ Statistical machine translation
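Predicting the probability of a category can be sketched with logistic regression, one of the algorithms listed later in the course outline. The example below is a minimal sketch on made-up one-feature data, trained by batch gradient descent; the function names, step count, and learning rate are illustrative assumptions.

```python
from __future__ import division, print_function
import math


def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, labels, steps=2000, rate=0.1):
    """Fit p(label = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every example under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        # Gradient of the mean log-loss with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b


# Made-up data: label 1 occurs for larger values of x.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(xs, labels)
prob_low = sigmoid(w * -1.0 + b)   # estimated probability of label 1 at x = -1
prob_high = sigmoid(w * 1.0 + b)   # estimated probability of label 1 at x = +1
print(prob_low, prob_high)
```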
Data Analytics Problems
Clustering: Finding patterns without human labeling
Both topic modeling and recommender systems can be viewed as
clustering problems.
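As a rough illustration of clustering without labels, here is a minimal k-means sketch (assignment step followed by center update) in plain Python. The points are made up, and seeding the centers with the first k points is a simplification chosen to keep the example deterministic.

```python
from __future__ import division, print_function


def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, (px, py) in enumerate(points):
            dists = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers]
            assignment[i] = dists.index(min(dists))
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignment


# Made-up points forming two well-separated groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

No labels are supplied; the grouping comes only from distances between points.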
Data Analytics Pipeline
Data storage/retrieval
Data collection
Data analysis
Data visualization
Data Analytics Pipeline
Data storage/retrieval
Data collection
Data analysis
Data visualization
Python packages named on the slide: sqlite, Scrapy, Selenium,
BeautifulSoup, numpy, pandas, scikit-learn, igraph, bokeh.
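To make the storage/retrieval stage concrete, the sketch below uses Python's standard-library `sqlite3` module with an in-memory database. The table name, column names, and rows are hypothetical and chosen only for illustration.

```python
from __future__ import print_function
import sqlite3

# An in-memory database keeps the example self-contained (no file is created).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (business TEXT, stars INTEGER)")

# Storage: insert a few made-up rows using parameterized queries.
rows = [("cafe", 4), ("cafe", 5), ("diner", 3)]
conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
conn.commit()

# Retrieval: average star rating per business.
query = ("SELECT business, AVG(stars) FROM reviews "
         "GROUP BY business ORDER BY business")
averages = conn.execute(query).fetchall()
print(averages)
conn.close()
```

The rows returned by a query like this could then be handed to an analysis or visualization step.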
What you will learn in this course
Python programming (and a little bit of Java and Matlab) – 4 lectures
◦ One of Google’s 3 main languages
Python packages
◦ Data collection – 2 lectures
◦ Data storage and retrieval – 1 lecture
◦ Data analysis
◦ Data visualization – 2 lectures
Basic linear algebra (math tools, matrices, etc.) – 2 lectures
Basic numerical computing (how to do math programmatically) – 4 lectures
Several fundamental machine learning algorithms (focusing on intuitive
ideas and software development for them)
◦ Linear regression – 2 lectures
◦ Logistic regression – 1 lecture
◦ K-means – 2 lectures
◦ Singular value decomposition – 4 lectures
(more detailed topics are in the online tentative syllabus)
Logistics
Course website (with tentative schedule; slides and assignments will be
posted here):
http://www.cc.gatech.edu/~dkuang3/cse6040/
Discussion, Q&A, find teammates on Piazza (please sign up):
https://piazza.com/gatech/fall2014/cse6040/home
Homework/Project submissions on T-square (only for submission; use
Piazza for discussion):
https://t-square.gatech.edu/
Logistics
3 homework assignments (30%)
Mid-term (20%)
Project (40%) – more details coming soon!
Class and Piazza participation (10%)
No late homework allowed.
Start looking for project teammates now
◦ 2~3 people per team
What you will do in this course
Attend the lectures
ACTIVELY participate in class discussion
◦ Based on both in-class and Piazza activities
◦ Chat with / Help out your classmates on Piazza (but DO NOT share your
answers)
◦ 10% of your grade
Read tutorials/references for programming languages
Read documentation for software packages
Solve simple math problems
◦ Included in the mid-term: 20% of your grade
What you will do in this course (cont’d)
Coding, of course!
◦ Homework #1: Collect real data online (10%)
◦ Homework #2: Visualize the data you collected (10%)
◦ Homework #3: Implement a machine learning algorithm (10%)
◦ Play with different machine learning frameworks/packages
Project (40%): Work on the Yelp Dataset
◦ Data for five cities (US, Canada, UK); four of them just released this month
◦ Includes businesses, attributes, check-ins, tips, users, user connections,
reviews
◦ Work in teams of 2~3 students
◦ Get inspired: https://github.com/Yelp/dataset-examples (Again, in Python!)
(DO NOT copy these examples for your project)
◦ Write your own team proposal
◦ (Optional) Enter the challenge ($5K prize): Round 4 through Dec 31, 2014
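A first look at the dataset might resemble the sketch below. It assumes the files store one JSON object per line and that review records carry `business_id` and `stars` fields; both are assumptions to verify against the data you download. The sample lines are made up, and `average_stars` is a hypothetical helper.

```python
from __future__ import division, print_function
import json
from collections import defaultdict


def average_stars(lines):
    """Average the 'stars' field per 'business_id' over JSON-lines input."""
    totals = defaultdict(lambda: [0, 0])  # business_id -> [sum of stars, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entry = totals[record["business_id"]]
        entry[0] += record["stars"]
        entry[1] += 1
    return {b: s / c for b, (s, c) in totals.items()}


# Made-up records standing in for lines of a review file.
sample_lines = [
    '{"business_id": "b1", "stars": 4}',
    '{"business_id": "b1", "stars": 5}',
    '',
    '{"business_id": "b2", "stars": 2}',
]
result = average_stars(sample_lines)
print(result)
```

Since the function only iterates over lines, an open file object can be passed in place of the sample list.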
Course Expectation
You will never say “I don’t have data”.
You will be exposed to the entire lifecycle of data analytics (in a
simplified way).
You will be able to code in Python, a common scripting language for
data analytics that many companies use (e.g., it is one of Google’s
3 main languages), and you will gain experience with many useful packages.
You will know some of the most fundamental machine learning algorithms.
If you already know them, you will gain a deeper understanding of them
from a computational perspective.
Hopefully, you will know how to write fast code for data analytics.
Why Python?
One of Google’s 3 main languages
Simpler code: Focus on concepts rather than machine details
More readable
Many useful packages
◦ Data manipulation
◦ Machine learning
◦ Image processing
◦ Natural language processing
◦ Spatial analysis
◦ Web application
◦ ......
Reasonably fast
Easier to parallelize than C++
Python Setup
A text editor + A terminal (command-line window)
◦ This is the convention for (Python) developers in companies
Text editor suggestions:
◦ Windows: Notepad++ (open source, with auto-indent and auto-fill)
◦ Linux: Vim, Emacs, Sublime
◦ Mac: Sublime, TextWrangler
We use Python 2.7, NOT the latest version, 3.x
◦ Many packages support Python 2.x only
Go Jackets!
Everyone – Sign up on Piazza:
https://piazza.com/gatech/fall2014/cse6040/home
Windows users – Install Python on your own machine:
https://www.python.org/downloads/
◦ Make sure it’s Python 2.7.8, NOT Python 3.x
◦ Make sure “python” can be called on command-line (may need to set up
environment variables)
◦ Make sure the “Python27” directory is located in a root directory, NOT in
“Program Files”
Everyone – Set up your development environment
◦ See https://developers.google.com/edu/python/set-up
Everyone – Download your own copy of the Yelp dataset (423M tarball):
http://www.yelp.com/dataset_challenge
◦ We cannot share it under the terms and conditions
◦ Tip: Save the page that contains the “Download Data” button for future use
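One way to confirm which interpreter the `python` command launches is a short script like the sketch below. The helper name `describe_python` is hypothetical; the script runs under both Python 2.7 and 3.x so that it can report a mismatch instead of failing.

```python
from __future__ import print_function
import sys


def describe_python(version_info=sys.version_info):
    """Return the 'major.minor' label and whether it is the 2.7 series."""
    major, minor = version_info[0], version_info[1]
    label = "{0}.{1}".format(major, minor)
    matches_course = (major, minor) == (2, 7)
    return label, matches_course


label, ok = describe_python()
print("Running Python", label)
if not ok:
    print("Note: the course slides ask for Python 2.7.")
```

Saving this to a file and running it with `python` from the command line also checks that the command is reachable through your environment variables.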