Goal of this course What is Big Data? Big Data The three(or four) Vs

advertisement
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.1
CS435 BIG DATA
Part 0. Introduction
Introduction to Big Data
PART 0. INTRODUCTION: WHAT IS BIGDATA?
Instructor: Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.2
Goal of this course
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.3
What is Big Data?
•  Understanding fundamental concepts in Big Data
•  Learn about the effective Big Data technologies and how to
apply them
•  Computing paradigm
•  Data retrieval technology
•  Dataflow management for the large applications
•  Scalable algorithms
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.4
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Big Data
The three(or four) Vs in Big Data
•  Things one can do at a large scale that cannot be done at a smaller
•  Volume
•  Voluminous
•  It does not have to be certain number of petabytes or quantity.
one
•  To extract new insights
•  Create new forms of values
•  Big Data is about analytics of huge quantities of data in order to infer
probabilities
•  Big Data is NOT about trying to “teach” a computer to “think” like humans
•  Providing a quantitative dimension it never had before
W1.A.5
•  Velocity
•  How fast the data is coming in?
•  How fast you need to be able to analyze and utilize it
•  Variety
•  Number of sources or incoming vectors
•  Veracity
•  Can you trust the data itself, source of the data, or the process?
•  User entry errors, redundancy, corruption of the values
•  Data cleaning
1
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Use all of the data, or just a little?
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.6
[1/5]
1/19/2016
Use all of the data, or just a little?
W1.A.7
[2/5]
John Graunt
John Graunt
Estimating population of London
384,000
London, 17th century
•  Estimate the population of London, which then used to estimate the number of
eligible ‘fighting men’ available to the king
•  Dispel the myth that the plague was more rife in years of monarchy changes
•  Investigate how London was expanding
•  Using a 1658 map he assumed that there were 54 families living in each space
of 100 yards square
•  Within the walls of London there were 220 such squares, making 11,880 families
there
•  The number of deaths in London as a whole, both within and outside the walls,
was four times the number of deaths within the walls alone
http://www.theactuary.com/archive/old-articles/part-3/who-was-captain-john-graunt-3F/
1/19/2016
CS435 Introduction to Big Data - Spring 2016
CS435 Introduction to Big Data - Spring 2016
Use all of the data, or just a little?
W1.A.8
[3/5]
•  John Graunt’s population example
http://www.theactuary.com/archive/old-articles/part-3/who-was-captain-john-graunt-3F/
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Use all of the data, or just a little?
W1.A.9
[4/5]
•  By the late nineteenth century, even that was proving
problematic
•  Data outstripped the Census Bureau’s ability to keep up.
•  Conducting censuses
•  Costly and time-consuming
•  The 1880 census took 8 years to complete
•  The information was obsolete even before it became available!
•  The US Constitution mandated one every decade
•  As the growing country measured itself in millions
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Use all of the data, or just a little?
•  The 1890 census was estimated to require a full 13 years
W1.A.10
[5/5]
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.11
In 2009 a new flu virus was discovered.
•  The Census Bureau contracted with Herman Hollerith
•  An American inventor used his idea of punch cards and tabulation
machines for the 1890 census
•  Reduced duration to less than one year
•  Methods of acquiring and analyzing big data is very expensive
“Image of the newly identified H1N1 influenza virus”,
taken in the CDC Influenza Laboratory.
Source: http://www.cdc.gov/h1n1flu/images.htm?s_cid=cs_001
2
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
CS435 Introduction to Big Data - Spring 2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.12
1/19/2016
CS435 Introduction to Big Data - Spring 2016
H1N1 virus
In the United States,
•  Combines elements of the viruses that causes bird flu and swine
•  The Centers for Disease Control and Prevention (CDC)
•  Requested doctors inform patients of new flu cases
•  Tabulated the numbers once a week
flu
W1.A.13
•  Some projected outbreak to be on the scale of the 1918 Spanish
flu that had:
•  Typically 1-2 week reporting lag
•  Infected half a billion people
•  Killed tens of millions
•  What will be the problems with this system?
•  No vaccine was ready
•  Public health authorities needed to know where it already was.
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.14
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.15
Detecting influenza epidemics using search
engine query data -1/2
Detecting influenza epidemics using search
engine query data -2/2
•  J. Ginsberg, et al., “Detecting influenza epidemics using search
•  Predict the spread of the winter flu in the United States
•  Nationally
•  Specific regions and states
engine query data” Nature 457 pp. 1012 ~ 1014, February 2009
http://www.nature.com/nature/journal/v457/n7232/full/
nature07634.html
•  Looking at what people were searching for on the Internet
•  Google
•  More than three billion search queries every day
•  Saves all the queries
•  50 million most common search terms
•  Americans type and compared the list with CDC data
•  On the spread of seasonal flu between 2003 and 2008
•  Identify areas infected by the flu virus by what people searched for on the
Internet
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.16
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.17
How did they build a prediction model? (1/3)
How did they build a prediction model? (2/3)
•  Aggregating historical logs of online web search queries
•  Between 2003 and 2008
•  Computed a time series of weekly counts for 50 million of the most
common search queries in the US
•  Weekly counts were kept for every query in each state
•  The probability that a random search query submitted from the
•  Simple model that estimates the probability that a random
physician visit in a particular region is related to an influenza-like
illness(ILI)
same region is ILI-related
•  logit(I(t)) = αlogit(Q(t))+ε
where:
•  I(t) is the percentage of ILI physician visits at time t
•  Q(t) is the ILI-related query fraction at time t
•  α is the multiplicative coefficient
•  ε is the error term
•  logit(p) = ln(p/(1 – p))
3
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.18
CS435 Introduction to Big Data - Spring 2016
How did they build a prediction model? (3/3)
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.19
Evaluation of combining top-scoring queries
After adding query 81
‘oscar nominations’
•  Each of the 50 million candidate queries was separately tested
•  To identify the search queries that could most accurately model the CDC ILI
visit percentage in each region
•  Generated the highest scoring search queries
Maximal performance at estimating out-of–sample points during cross-validation was obtained by summing top 45
search queries
J Ginsberg et al. Nature 000, 1-3 (2008) doi:10.1038/nature07634
1/19/2016
W1.A.20
In the top 100, topics like ‘high school basket ball’ was not Included. It coincided with influenza season
in the US
CS435 Introduction to Big Data - Spring 2016
Most co-related queries
Search query topic
Top 45 queries
n Weighted
Influenza complication
Cold/flu remedy
General influenza symptoms
Term for influenza
Specific influenza symptom
Symptoms of an influenza complication
Antibiotic medication remedies
General influenza
Symptoms of a related disease
Antiviral medication
Related disease
Unrelated to influenza
Total
11 18.15
8 5.05
5 2.60
4 3.74
4 2.54
4 2.21
3 6.23
2 0.18
2 1.66
1 0.39
1 6.66
0 0.00
45 49.40
Next 55 queries
n Weighted
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.21
Data sources
•  Public dataset from the CDC’s Influenza Sentinel Provider
Surveillance Network
5 3.40
6 5.03
1 0.07
6 0.30
6 3.74
2 0.92
3 3.17
1 0.32
2 0.77
1 0.74
3 3.77
19 28.37
55 50.60
•  Average percentage of all outpatient visits to sentinel providers that were
ILI-related
•  Query submitted to the Google search engine
•  2003 ~ 2008
Google processes more than 24 Petabytes of data per day
The top 45 queries were used in the final model; the next 55 queries are presented for
comparison purposes. The number of queries in each topic is indicated, as well as queryvolume-weighted counts, reflecting the relative frequency of queries in each topic.
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.22
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.23
What makes this work innovative?
•  There have been other approaches:
•  CDC’s weekly report
•  Call volume to telephone triage advice lines
•  Tracking over-the-counter drug sales
•  Online activity for influenza surveillance
Knowledge and
Insights
HOW?
•  Search queries submitted to a Swedish medical website
•  Tracking search keys (‘flu’, or ‘influenza’) in Yahoo search queries
Data, data, data…
•  What is the innovation in Google’s approach?
4
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
CS435 Introduction to Big Data - Spring 2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.24
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.25
Related research areas
Knowledge and
Insights
HOW?
Distributed Computing
Part 0. Introduction
Storage systems
Course Introduction
Machine Learning
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.26
1/19/2016
Data, data, data…
CS435 Introduction to Big Data - Spring 2016
Related research areas
Related courses in CS at CSU
•  Storage systems
•  How can we efficiently resolve queries on massive amounts of input data?
•  The input dataset may be presented in the form of a distributed data stream
•  Machine Learning/Statistics (CS480, CS545)
•  Machine learning
•  How can we efficiently solve large-scale machine learning problems?
•  The input data may be massive; stored in a distributed cluster of machines
W1.A.27
•  Distributed Systems (CS455, CS555)
•  Database Systems (Cs430, CS530)
•  Computer Communications (CS457, CS557)
•  Computer Security (Cs356, CS556)
•  Distributed computing
•  How can we efficiently solve large-scale optimization problems in
distributed computing environments?
•  For example, how can we efficiently solve large-scale combinatorial
problems, e.g. processing of large scale graphs?
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.28
1/19/2016
CS435 Introduction to Big Data - Spring 2016
About me
Communications (1/2)
•  Research area
•  Big Data
•  Course Website
•  http://www.cs.colostate.edu/~cs435
•  Announcements: Check the course website at least twice a week.
•  Schedule (course materials, readings, assignments)
•  Policies
•  Storage, retrievals, analytics and visualization
•  Projects
•  Glean
•  Predictive analysis at scale
•  Galileo
•  Distributed data storage system for large scale geospatial time-series datasets
•  Mendel
W1.A.29
•  Canvas
•  Assignment submission
•  Discussion board
•  Grades
•  Genomic data storage system
•  GeoLens
•  Visual Analytics
5
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.30
CS435 Introduction to Big Data - Spring 2016
Communications (2/2)
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.31
Course Structure
5. Help
Sessions
•  Contact Instructor
•  sangmi@colostate.edu
•  Office hour
1. Lectures
•  Tuesday 2:00 ~ 3:00PM
•  Friday 10:30 ~ 11:30AM and by appointment
•  Office: CSB456
4. Term
Projects
•  URL: http://www.cs.colostate.edu/~sangmi
2. Assignments
•  Contact GTA
•  Saptashwa Mitra
•  Office hour: TBA
1/19/2016
Lectures
3.
Quizzes
and
Exams
W1.A.32
CS435 Introduction to Big Data - Spring 2016
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Course Structure
Course introduction, Big Data
Example, Revisit useful
technologies
5. Help
Sessions
Part 0
Part 1
Part 2
•  Introduction to
Big Data
•  Large scale
Data Analysis
with MapReduce
•  Data storage
and flow
management
Simple examples, How
MapReduce works, Advanced
case studies, and features
1/19/2016
W1.A.33
4. Term
Projects
2. Assignments
3.
Quizzes
and
Exams
Data flow management,
NoSQL data storage, Data
exchange, semi-structured data
model
CS435 Introduction to Big Data - Spring 2016
W1.A.34
1. Lectures
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Assignments (1/2)
Assignments (2/2)
•  Providing hands-on experience
•  All assignments will be individual submissions
W1.A.35
•  Required programming language: Java
•  Assignment 1: Set up your Hadoop cluster and Ngram (Unigram,
bigram) generation and summarization of e-books using
MapReduce
•  Assignment 2: Ngram corpus analysis using Hadoop: Author
detection using an NLP algorithm
•  Grading will be based on demo
•  Late submission policy
•  Up to 2 days of late submission is accepted
•  10% deduction per 24 hours will be applied
•  Assignment submission
•  Canvas
•  Assignment 3: Identifying the author/publishing year/genre using
Ngram corpus analysis using AWS Elastic MapReduce Service
(EMS) and HBase
•  Thank you for the grants, AWS!
6
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
CS435 Introduction to Big Data - Spring 2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.36
Course Structure
1/19/2016
W1.A.37
CS435 Introduction to Big Data - Spring 2016
Quizzes and Exams (1/2)
5. Help
Sessions
•  Pop quizzes
•  About 10 quizzes this semester
•  Simple questions
•  Each quiz is about ~1% of the course grade
•  2 lowest scores will be eliminated at the end of the semester
•  Open book, open note
•  No Internet, no collaboration
1. Lectures
4. Term
Projects
2. Assignments
3.
Quizzes
and
Exams
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.38
Quizzes and Exams (2/2)
1/19/2016
W1.A.39
CS435 Introduction to Big Data - Spring 2016
Course Structure
•  Exams
•  Midterm and Final
•  40% of the course grade
•  Mixture of simple questions and comprehensive questions
5. Help
Sessions
1. Lectures
4. Term
Projects
2. Assignments
3.
Quizzes
and
Exams
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.40
Term Project
1/19/2016
W1.A.41
CS435 Introduction to Big Data - Spring 2016
How do I select my Term Project topic?
Do you have
any
teammate?
•  Comprehensive Big Data software designing experiences
•  Phase 0: Find your teammate!
Yes
Sit together!
No
•  Phase 1: Term Project Proposal
•  Phase 2: Final submission: Software and Report
Go through the example topics
•  Phase 3: Presentation (Class presentation) and demonstration of
Did you find
interesting
topic?
your software
No
•  Term project is a team effort 2-3 students per team
•  No single member team allowed
Yes
Yes
Do you have
teammate?
Yes
Did you
come up
with your
own topic?
•  http://www.cs.colostate.edu/~cs435/
No
Submit your team
No
- 
- 
List your name on the Canvas
discussion board with your
Strengths
Topics that you are interested in
7
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.42
CS435 Introduction to Big Data - Spring 2016
1/19/2016
W1.A.43
CS435 Introduction to Big Data - Spring 2016
Example topics from previous courses
Example topics from previous courses
•  Similarity and Clustering on The Million Song Dataset
•  “Time to Answer” for Questions on stackoverflow.com using Map
Reduce
•  Methods Used to Analyze Eclipse and Mozilla Bug Data
•  Analysis of words for spell-checking in search queries using
•  Movie Recommendation using Collaborative Filtering
digitized books and articles
•  Climate Visualization and Predictive Analysis
•  Who is Building Wikipedia?
•  Wikipedia Page Traffic Statistics Analysis
Trending Topics and Page Count Prediction on Wikipedia Traffic
Log Data
•  Supporting Emergency Response During Natural Disasters with
Twitter Data
1/19/2016
W1.A.44
CS435 Introduction to Big Data - Spring 2016
Course Structure
1/19/2016
W1.A.45
CS435 Introduction to Big Data - Spring 2016
Help Sessions
5. Help
Sessions
•  4 lab sessions are planned
•  Installing and configuring Apache Hadoop
•  Regarding Assignment 1
•  Regarding Assignment 2
•  Regarding Assignment 3
1. Lectures
•  Additional lab sessions (up to 3 sessions) will be scheduled
4. Term
Projects
based on student requests
2. Assignments
•  Help sessions will be led by GTA and recorded
3.
Quizzes
and
Exams
1/19/2016
W1.A.46
CS435 Introduction to Big Data - Spring 2016
Grading Policy
1/19/2016
W1.A.47
CS435 Introduction to Big Data - Spring 2016
Term Project Grading Policy
•  20% of course grade
Category
Points
Percentage of Final
Grade
Category
Points
Percentage of Final
Grade
Programming Assignments
(10% x 3)
30
30 %
Phase 0
1
1%
Term Project
20
30 %
Phase 1
5
5%
Quizzes and participation
10
10 %
Phase 2
8
8%
Midterm Exams (20% x 2)
40
40%
Phase 3
Class presentation (3%)
Software demo (3%)
6
6%
8
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
CS435 Introduction to Big Data - Spring 2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.48
Final Letter Grade
CS435 Introduction to Big Data - Spring 2016
W1.A.49
CS435 Introduction to Big Data - Spring 2016
Course Timeline
Letter grades will be based on the following standard breakpoints
•  >= 90 is an A,
•  >= 88 is an A-,
•  >=86 is a B+,
•  >=80 is a B,
•  >=78 is a B-,
•  >=76 is a C+,
•  >=70 is a C,
•  >=60 is a D,
•  and <60 is an F
1/19/2016
1/19/2016
W1.A.50
PA3
PA2
PA1
TP: Phase 0.
January
TP: Phase 2.
TP: Phase 3.
TP: Phase 1.
February
March
Midterm 1/19/2016
May
April
Final Exam
CS435 Introduction to Big Data - Spring 2016
Course Policy
Course Policy
•  No make-up for missed exams
•  Except in extraordinary circumstances (e.g., major illness, family
emergency)
•  No Cell-phones in the class.
W1.A.51
•  No Laptops during the exam, and quiz.
•  If you need to use a laptop, please sit in the back row.
•  No make-up for missed quizzes
•  Except for the case where student provided an advance notice to the
instructor based on an emergency
•  I will ask you to turn off your laptop if needed.
•  No text book.
•  I provide several on-line materials and research papers.
•  Rely on the lecture notes
1/19/2016
CS435 Introduction to Big Data - Spring 2016
W1.A.52
1/19/2016
CS435 Introduction to Big Data - Spring 2016
Plagiarism
More importantly,
•  We use plagiarism detection software
•  Attend the class, ask questions, and discuss
•  Please do not copy code or sentences from your references or
•  Check the course web page and Canvas regularly
your colleague's work
•  Discuss with each other, but do not write your software together!
•  Try new technologies and apply them
W1.A.53
•  Share your experiences with other students in class
9
CS435 Introduction to Big Data
Spring 2016 Colorado State University
1/19/2016
CS435 Introduction to Big Data - Spring 2016
1/19/2016 Week 1-A
Instructor: Sangmi Pallickara
W1.A.54
Questions?
10
Download