Lecture 1 - Department of Computer Science and Engineering

Introduction
Contact Information
•Instructor: Fatme El-Moukaddem
•Email: elmoukad@egr.msu.edu
•Office: Room 2312 Engineering Building
•Office Hours:
• After class
• By appointment (Mon/Wed)
Books
•Data Just Right
• Michael Manoochehri
•Introduction to Data Mining
• Pang-Ning Tan, Michael Steinbach, Vipin Kumar
•Data Science for Business
• Foster Provost, Tom Fawcett
•Hadoop in Action
• Chuck Lam
•Hadoop: The Definitive Guide
• Tom White
•Mining of Massive Datasets
• Anand Rajaraman
•Data Mining: Practical Machine Learning Tools and Techniques
• Ian Witten, Eibe Frank, Mark Hall
Assessment
•Homework: 35%
•In-class exercises: 20%
•Exams: 25%
•Project: 20%
• Paper/Presentation/Review
Make Up Exam Policy
•Only in case of an emergency
•Documentation needed
•If you know ahead of time, let me know
Homework
•Use handin system to submit:
• secure.cse.msu.edu/handin
•Submit on time
•25% of the full grade is deducted if submitted within 1 day after the deadline
•50% of the full grade is deducted if submitted within 2 days after the deadline
•Not accepted after 48 hrs unless previously arranged with the instructor
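The late policy can be written as a small function. This is only a sketch: the name `late_adjusted_score` is hypothetical, and it assumes the deduction is taken from the full grade (not from the earned score) and that a submission exactly 24 or 48 hours late still falls in the earlier bracket.

```python
def late_adjusted_score(earned, full, hours_late):
    """Return the score after the late deduction (assumed interpretation)."""
    if hours_late <= 0:
        penalty = 0.0            # submitted on time
    elif hours_late <= 24:
        penalty = 0.25 * full    # within 1 day after the deadline
    elif hours_late <= 48:
        penalty = 0.50 * full    # within 2 days after the deadline
    else:
        return 0.0               # not accepted after 48 hours
    # A score cannot drop below zero.
    return max(earned - penalty, 0.0)
```

For example, under these assumptions a homework worth 100 points that earned 90 and arrived 5 hours late would be recorded as 65.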
Important Dates
•Exam 1: Monday Feb 23rd (tentative)
•Exam 2: Wednesday Apr 15th (tentative)
•Last date to drop with full refund: Feb 6th
•Last date to drop with no grade: Mar 4th
Course Outline
•Data Collection and Storage
•Data Processing and Analysis
•Big Data Tools
•Applications & Case studies
•Project presentation
Programming Languages and Tools
•Python
•Java
•Weka software
•SQL, NoSQL
•Hadoop, Pig, Hive
Why Big Data Analytics
•Huge amount of data available
• Retail
• Banking
• Insurance
• Medical field
• Engineering
•Data generated in terabytes
•Accessible
Activity on the Internet
•639,800 GB of global IP data transferred
•133 botnet infections
•20 million photos are viewed on Flickr; 3,000 photos are uploaded
•6 new Wikipedia articles published
•320 new Twitter accounts are created; 100,000 new tweets are sent
•1,300 new mobile users
•277,000 Facebook logins; 6 million Facebook views
•20 new victims of identity theft
•2 million Google search queries are initiated
•204 million emails sent
•30 hours of video are uploaded to YouTube; 1.3 million videos are viewed
•47,000 app downloads
•$83,000 in Amazon sales
•61,141 hours of music are played on Pandora
•100 new LinkedIn accounts are created
•The number of networked devices equals the global population. By 2015 that number will be double the global population.
•In 2015 it would take you 5 years to view all video crossing IP networks each second.
Goal
Data sources → Data Collection and Storage → Data Preprocessing → Data Analysis → Postprocessing → Knowledge → Decision Making
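A toy version of the stages on this slide, from raw data to a decision, might look like the sketch below. The records, the function names (`preprocess`, `analyze`, `decide`), and the threshold are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a data source.
raw_records = ["12.5", "13.1", "", "n/a", "11.9", "40.0"]

def preprocess(records):
    """Preprocessing: drop records that cannot be parsed as numbers."""
    cleaned = []
    for record in records:
        try:
            cleaned.append(float(record))
        except ValueError:
            continue
    return cleaned

def analyze(values):
    """Analysis: summarize the cleaned values."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}

def decide(summary, threshold=30.0):
    """Decision making: act on the knowledge extracted from the data."""
    return "investigate" if summary["max"] > threshold else "no action"

values = preprocess(raw_records)
summary = analyze(values)
decision = decide(summary)
print(summary, decision)
```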
Examples
•Where to open the next coffee shop location?
•How to predict which customers are likely to quit?
•How to decide if a transaction is a fraudulent transaction?
•What factors keep customers satisfied?
•What factors contribute to infection spread/control?
•What items are customers likely to purchase together?
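The last question (items purchased together) can be approached, in its simplest form, by counting how often each pair of items shows up in the same basket. The sketch below uses made-up baskets and plain pair counting; it is not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"coffee", "milk", "sugar"},
    {"coffee", "milk"},
    {"bread", "milk"},
    {"coffee", "sugar"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always stored in the same order.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that appear in at least two baskets.
frequent_pairs = {pair for pair, n in pair_counts.items() if n >= 2}
print(frequent_pairs)
```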
Single Server Solution
•Cheap hardware
•Relational database management system
• Mature technology to store data
• Lots of tools
• Query language
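As a small illustration of a relational database and its query language on a single machine, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory SQLite database: a relational store on a single machine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (store, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# SQL query: total sales per store.
rows = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
conn.close()
print(rows)
```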
Single Server Solution: Limitations
•Limited processing
•Limited storage
•Problematic with large volumes of data
•Does not provide timely answers/analysis
Distributed Computing
•Distribute software across multiple servers
•Data processing and information retrieval done by a collection of computers
•Deploying a collection of cheap, small servers in a distributed computing system vs. buying a custom-built, beefed-up server with comparable computing power: the distributed system is more economical
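The idea of splitting work across several machines can be sketched on one machine by letting threads stand in for servers. In the example below each "server" counts words in its own share of the data and the partial results are then combined. The function names and the sample lines are hypothetical, and a real distributed system would also have to move data over a network and cope with failures.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    """Work done by one 'server': count words in its share of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def distributed_word_count(lines, n_workers=3):
    # Split the data into one chunk per worker (round-robin).
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Combine the partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["big data tools", "data analysis", "big data", "tools"]
result = distributed_word_count(lines)
print(result)
```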
Challenges
•Distributing data across multiple machines: nontrivial
•Consistency
•Concurrency
•Retrieval (transparency, latency)
•Security
•Failure
•Heterogeneity
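One reason distributing data is nontrivial can be seen with a simple placement rule: hash each key and take the remainder modulo the number of machines. The sketch below (the helper `node_for` and the key names are hypothetical) counts how many keys land on a different machine when a fifth machine is added to a group of four. With a uniform hash, roughly four in five keys would be expected to move.

```python
import hashlib

def node_for(key, n_nodes):
    """Assign a key to a node using a stable hash (assumed placement rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes

keys = [f"user{i}" for i in range(1000)]
before = {k: node_for(k, 4) for k in keys}   # 4 machines
after = {k: node_for(k, 5) for k in keys}    # one machine added

# Keys whose data would have to be moved to another machine.
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys change machines")
```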
Design Questions
•Which database backend to use? Relational, key-value?
•Buy or build software?
•How to collect data?
•How to analyze it? Share it? Visualize it?
Unix
•ls: show folder content
•cd path: navigate through directories
•mkdir dirname: create a directory
•pwd: print working directory
•mv sources target: move files sources to target directory
•cp source target: create a copy of source and call it target
•rm files: remove one or more files
•grep: search utility
•cat filename: display content
•touch filename: create a blank file if it does not exist; otherwise set its timestamp to the current time
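Since Python is one of the course languages, it may help to see rough Python counterparts of several of these commands. The sketch below works inside a temporary directory; the file and directory names are hypothetical, and the comments name the shell command each line loosely corresponds to.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    data_dir = base / "data"
    data_dir.mkdir()                                   # mkdir data
    notes = base / "notes.txt"
    notes.touch()                                      # touch notes.txt
    notes.write_text("hello\nbig data\n")
    shutil.copy(notes, base / "copy.txt")              # cp notes.txt copy.txt
    shutil.move(str(notes), str(data_dir))             # mv notes.txt data
    listing = sorted(p.name for p in base.iterdir())   # ls
    text = (data_dir / "notes.txt").read_text()        # cat data/notes.txt
    matches = [line for line in text.splitlines()
               if "data" in line]                      # grep data data/notes.txt
    (base / "copy.txt").unlink()                       # rm copy.txt
    remaining = sorted(p.name for p in base.iterdir())

print(listing, matches, remaining)
```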
Unix (continued)
•gzip filename: compress file
•gunzip filename: uncompress gzipped file
•tar: create/extract archives
• tar -zcvf result.tar.gz source
• tar -zxvf result.tar.gz
•ps: list active processes
•kill pid: kill process with id pid
•ssh hostname: a program for logging in to a remote host
•man cmd: show manual entry for cmd
•Reference: http://www.cs.jhu.edu/~joanne/unixRC.pdf
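The two tar commands above have close counterparts in Python's standard `tarfile` module. The sketch below creates a gzip-compressed archive of a directory and extracts it again inside a temporary directory. The names `source`, `a.txt`, and `out` are hypothetical; `result.tar.gz` is the archive name from the slide.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    source = base / "source"
    source.mkdir()
    (source / "a.txt").write_text("alpha\n")

    archive = base / "result.tar.gz"
    # Roughly: tar -zcvf result.tar.gz source
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="source")

    out = base / "out"
    out.mkdir()
    # Roughly: tar -zxvf result.tar.gz (extracted into out/)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out)

    restored = (out / "source" / "a.txt").read_text()

print(names)
```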
Python
•Resources
• https://docs.python.org/3.4/tutorial/
• The Python Quick Syntax Reference
• Beginning Python: Using Python 2.6 and Python 3.1
• Available online at magic.msu.edu
• IDE: PyCharm
•At CSE:
• ssh to black.cse.msu.edu
• python: starts the interactive shell
• python script.py: executes a Python script
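A minimal script that could be saved as `script.py` and run with `python script.py` is sketched below. The contents are hypothetical; any command-line arguments after the script name are used as names to greet.

```python
# script.py (file name taken from the command above; contents are illustrative)
import sys

def main(argv):
    # argv[0] is the script name; the rest are optional names to greet.
    names = argv[1:] or ["world"]
    greeting = "Hello, " + ", ".join(names) + "!"
    print(greeting)
    return greeting

if __name__ == "__main__":
    main(sys.argv)
```

Running `python script.py CSE` would call `main` with `["script.py", "CSE"]`.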