Big Data Processing, 2014/15
Lecture 1: Introduction
Claudia Hauff (Web Information Systems)
ti2736b-ewi@tudelft.nl
Course organisation
• 2 lectures a week (Tuesday, Thursday) for 7 weeks
• Content:
  • Data streaming algorithms
  • MapReduce (bulk of the course)
  • Approaches to iterative big-data problems
• One “free” lecture in January (suggestions?)
  • e.g. big data visualisations, an industry talk, etc.
• Course style: interactive when possible
Course content
(Small changes are possible.)
• Introduction
• Data streams 1 & 2
• The MapReduce paradigm
• Looking behind the scenes of MapReduce: HDFS & scheduling
• Algorithm design for MapReduce
• A high-level language for MapReduce: Pig 1 & 2
• MapReduce is not a database, but HBase nearly is
• Let’s iterate a bit: graph algorithms & Giraph
• How does all of this work together? ZooKeeper/YARN
Assignments
• A mix of pen-and-paper and programming
• Individual work
• 5 lab sessions on Hadoop & Co. between Nov. 24 and Jan. 5,
  on Mondays 8.45am-12.45pm
  • Lab attendance is not compulsory
• Two lab assistants will help you with your lab work (Kilian Grashoff and Robert Carosi)
• Assignments are handed out on Thursdays and due the following Thursday (6 in total)
Grading & exams
• Course grade: 25% assignments, 75% final exam
• Weekly quizzes: up to 1 additional grade point
  • +0.5 if you answer >50% correctly
  • +1 if you answer >75% correctly
  (Assignments and quizzes?? Yes!)
• You can pass without handing in the assignments (not recommended!)
• Final exam: January 28, 2-5pm (MC & open questions)
• Resit: April 15, 2-5pm (MC & open questions)
Questions?
• Questions are always welcome
• Contact preferably via mail: ti2736b-ewi@tudelft.nl
• Questions and answers may be posted on Blackboard if they are relevant for others as well
• If the pace is too slow/fast, please let me know
Reading material (lectures)
• Recommended materials are either book chapters or papers, usually available through the TU Delft campus network
• The course covers a recent topic; no single book covers all angles
• MapReduce: Data-Intensive Text Processing with MapReduce by Lin and Dyer, Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/
Reading material (assignments)
• A book on Java if you are not yet comfortable with it
• Hadoop: Hadoop: The Definitive Guide by Tom White, O’Reilly Media, 2012 (3rd edition)
Big Data
Course objectives
• Explain the ideas behind the “big data buzz”
• Understand and describe the three different paradigms covered in class
• Code productively in one of the most important big data software frameworks we have to date: Hadoop (and tools building on it)
• Transform big data problems into sensible algorithmic solutions
Today’s learning objectives
• Explain and recognise the V’s of big data in use case scenarios
• Explain the main differences between data streaming and MapReduce algorithms
• Identify the correct approach (streaming vs. MapReduce) to be taken in an application setting
What is “big data”?
• A buzzword with fuzzy boundaries:
  • “Massive amounts of diverse, unstructured data produced by high-performance applications.”
  • “Data too large & complex to be effectively handled by standard database technologies currently found in most organisations.”
• Requires novel infrastructure to support storage and processing
Large-scale computing is not new
• Weather forecasting has been a long-term scientific challenge
  • Supercomputers were already used in the 1970s
  • Equation crunching
(Image source: ECMWF)
Big data processing
• So-called big data technologies are about discovering patterns (in semi-/unstructured data)
• The main focus is on how to make computations on big data feasible, i.e. without a supercomputer
  • We use cluster(s) of commodity hardware!
• Next quarter: the Data Mining course (which will probably use Hadoop) focuses on how to discover those patterns
Just an academic exercise?
• Cloud computing: “Anything running inside a browser that gathers and stores user-generated content”
• Utility computing
  • Computing as a metered service
  • A “cloud user” buys any amount of computing power from a “cloud provider” (pay-per-use)
  • Virtual machine instances
  • IaaS: infrastructure as a service
  • Amazon Web Services is the dominant provider
Just an academic exercise? No!
You can run your own big data experiments!
(Screenshot: AWS EC2 pricing)
Progress often driven by industry
• Development of big data standards & (open source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo!, …
• Why do they care about big data?
  • More data → more knowledge → more money
  • More knowledge leads to
    • better customer engagement
    • fraud prevention
    • new products
Big data analytics: IBM pitch
https://www.youtube.com/watch?v=1RYKgj-QK4I
A concrete example: big data vs. small data
Task: confusion set disambiguation (then vs. than; to vs. two vs. too)
(Plot: quality, from poor to perfect, vs. amount of training data, from little to a lot. With little training data simple algorithms fail and a complex algorithm is needed; but with a lot of data, quality becomes near perfect.)
Scaling to very very large corpora for natural language disambiguation. M. Banko and E. Brill, 2001.
The 3 V’s
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
The 5 V’s
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? how accurate is it?
(3/5 V’s most commonly used!)
The 7 V’s
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? how accurate is it?
• Validity: ensure that the interpreted data is sound
• Visibility: data from diverse sources need to be stitched together
The 5 V’s (revisited)
Question: how do these attributes apply to the use case of Flickr?
Instantiations
Terminology
• Batch processing: running a series of computer programs without human intervention
• Near real-time: a brief delay between the data becoming available and it being processed
• Real-time: a guaranteed (bounded) delay between the data becoming available and it being processed
Terminology contd.
• Structured data (well-defined fields): standard in the past
• Semi-structured data
• Unstructured data (by humans, for humans): most common today
Unstructured text
• To get value out of unstructured text we need to impose structure automatically
  • Parse the text
  • Extract meaning from it (can be easy or difficult)
• The amount of data we create is more than doubling every two years; most new data is unstructured or at most semi-structured
• Text is not everything: images, video, audio, etc.
Extracting meaning
(Examples on a scale from easy to difficult)
IBM Watson: How it works
https://www.youtube.com/watch?v=_Xcmh1LQB9I
8 minutes worth your time, even if not in class.
Contains a nice piece about the use of unstructured text.
Examples of Volume & Velocity: Twitter
• >500 million tweets a day
• On average >5,700 tweets a second
• Peaks of >100,000 tweets a second
  • Super Bowl
  • US election
  • New Year’s Eve
  • Football World Cup (672M tweets in total for #WorldCup)
• Messages are instantly accessible for search
• Messages are used in post-hoc analyses to gather insights
#WorldCup
http://bl.ocks.org/anonymous/raw/0c64880b3a791dffb6e4/
Examples of Volume: the Large Hadron Collider @ CERN
• The world’s largest particle accelerator, buried in a tunnel with a 27km circumference
• Enormous numbers of sensors register the passage of particles
• 40 million events/second (1MB of raw data per event)
• Generates 15 Petabytes a year (15-25 million GB)
Examples of Velocity: targeted advertising on the Web
• US revenues in 2013: ~$40 billion
• Advertisers usually pay per click
• For each search request, search engines decide
  • whether to show an ad
  • which ad to show
• Users are willing to wait at most 2 seconds for their search results
• Feedback loop via user clicks, user searches, mouse movements, etc.
More interactions/data on the Web
• YouTube: 4 billion views a day, one hour of video uploaded every second
• Facebook: 483 million daily active users (Dec. 2011); 300 Petabytes of data
• Google: >1 billion searches per day (March 2011)
• Google processed 100 Terabytes of data per day in 2004 and 20 Petabytes per day in 2008
• Internet Archive: contains 2 Petabytes of data, grows by 20 Terabytes per month (2011)
Use case: movie recommendations
Question: how can we come up with a suggestion for Frank? Think about data-intensive and non-intensive approaches.
(Diagram: a user/movie preference matrix for Bob, Alice, Tom, Jane, Frank and Joe, mostly filled with unknowns. Bob likes X-MEN, Jane dislikes X-MEN. Should we suggest X-MEN to Frank?)
Use case contd.
• Ignore the data; use experts instead (movie reviewers); assumes no large subscriber/reviewer divergence
• Use all the data but ignore individual preferences; assumes that most users are close to the average
• Lump people into preference groups based on shared likes/dislikes; compute a group-based average score per movie (see the sketch below)
• Focus computational effort on difficult movies (some are universally liked/disliked)
A whole research field is concerned with this question: recommender systems.
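As a rough illustration of the group-based approach above (my own sketch, not lecture code; all class and variable names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the "preference groups" approach.
// Users are lumped into groups; each group gets an average rating per movie,
// and that group average is used as the suggestion score for members like Frank.
public class GroupAverages {

    /**
     * @param ratings userId -> (movieId -> rating)
     * @param groupOf userId -> groupId (the precomputed preference group)
     * @return groupId -> (movieId -> average rating within that group)
     */
    static Map<String, Map<String, Double>> groupMovieAverages(
            Map<String, Map<String, Integer>> ratings,
            Map<String, String> groupOf) {

        // groupId -> movieId -> {sum of ratings, number of ratings}
        Map<String, Map<String, double[]>> acc = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> user : ratings.entrySet()) {
            String group = groupOf.get(user.getKey());
            for (Map.Entry<String, Integer> rating : user.getValue().entrySet()) {
                double[] sumAndCount = acc
                        .computeIfAbsent(group, g -> new HashMap<>())
                        .computeIfAbsent(rating.getKey(), m -> new double[2]);
                sumAndCount[0] += rating.getValue();
                sumAndCount[1] += 1;
            }
        }

        // Turn the sums and counts into per-group averages.
        Map<String, Map<String, Double>> averages = new HashMap<>();
        acc.forEach((group, movies) -> {
            Map<String, Double> avg = new HashMap<>();
            movies.forEach((movie, sc) -> avg.put(movie, sc[0] / sc[1]));
            averages.put(group, avg);
        });
        return averages;
    }
}
```

With such a table, Frank’s suggestion score for X-MEN would simply be the average rating of X-MEN within his preference group.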
Use case contd.
• Netflix Prize: an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings (>100 million ratings by 0.5 million users for ~17,000 movies)
• The first competitor to improve over Netflix’s baseline by 10% receives $1,000,000
• The competition started in 2006; the prize money was paid out in 2009 (the winner was 20 minutes quicker than the runner-up, with the same performance!)
Many research teams competed: innovation driven by industry again!
Example of Variety: Restaurant Locator
Question: what kind of data do you need for this task?
• Task: given a person’s location, list the top five restaurants in the neighbourhood
• Required data:
  • World map
  • List of all restaurants in the world (opening hours, GPS coordinates, menu, special offers)
  • Reviews/ratings
  • Optional: social media stream(s)
• Data is continuously changing (restaurants close, new ones open, data formats change, etc.)
Society can benefit too, not just companies
• Accurate predictions of natural disasters and diseases
• Better responses to disaster recovery
  • Timely & effective decisions
  • Provide resources where they are needed the most
• Complete disease/genomics databases to enable biomedical discoveries
• Accurate models to support forecasting of ecosystem developments
Idea: earthquake warnings
Question: do you think this is possible? If so, what are the challenges?
• Social sensors: users (humans) that use Twitter, Facebook, Instagram, i.e. portals with real-time posting abilities
  (Earthquake occurs → people in the area tweet about it → warn people further away)
• Challenges: how to detect when a tweet is about an actual earthquake, which earthquake it is about and where its centre is
Idea: earthquake warnings contd.
It is possible!
• Goal: the warning should reach people earlier than the seismic waves
• Travel time of seismic waves: 3-7 km/s; arrival time of a wave 100km away: 20 seconds!
• Performance of an existing system (Sakaki et al., 2010):
  (Figure: Twitter-based earthquake detection vs. the traditional warning system)
A brief introduction to Streaming & MapReduce

A brief introduction to Streaming
Data streaming scenario
• Continuous and rapid input of data (a “stream of data”)
• Limited memory to store the data: less than linear in the input size
• Limited time to process each data item: sequential access
• Algorithms have one (or very few) passes over the data
• Can be approached from a practical or mathematical point of view: metric embedding, pseudo-random computations, sparse approximation theory, …
  (We go for the practical setup!)
Data streaming example
Stream: 3, 6, 5, 4, 1, 8, 2, … What is the missing number?
Setup: a stream of n numbers, a permutation of 1 to n with one number missing; we are allowed one pass over the data.
Solution 1: memorise all numbers seen so far;
memory requirement: n bits (impractical for large n)
Data streaming example
Solution 2 (closed form): compute the sum of all numbers from 1 to n, i.e. n(n+1)/2, and subtract each number seen;
memory requirement: 2 log n bits
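To make this concrete, here is a minimal Java sketch of Solution 2 (my own illustration, not lecture code; it assumes the stream is simply exposed as an iterator over the numbers):

```java
import java.util.Iterator;
import java.util.List;

public class MissingNumber {

    /** One pass over a stream that contains 1..n with exactly one number missing. */
    static long findMissing(Iterator<Long> stream, long n) {
        long expected = n * (n + 1) / 2; // closed-form sum of 1..n
        long seen = 0;                   // running sum of the numbers observed
        while (stream.hasNext()) {
            seen += stream.next();
        }
        return expected - seen;          // the gap is the missing number
    }

    public static void main(String[] args) {
        List<Long> stream = List.of(3L, 6L, 5L, 4L, 1L, 8L, 2L); // 7 is missing, n = 8
        System.out.println(findMissing(stream.iterator(), 8));   // prints 7
    }
}
```

Only two counters are kept, which matches the 2 log n memory bound on the slide.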
Data streaming contd.
Stream: 3, 4, 9, 4, 9, 8, 2, … What is the average? What is the median?
We are allowed one pass over the data and can only store 3 numbers.
• Average: can be computed by keeping track of two numbers (the sum and the count of numbers seen)
• Median: sample data points, but how? (see the sketch below)
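A minimal sketch of both computations (my own illustration; the reservoir-sampling answer to “but how?” is a standard sampling technique, not something the slide specifies):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: exact running average with two counters, plus a uniform sample of
// 3 items (the "only 3 numbers" budget of the median question) as a crude
// basis for a median estimate. Combined into one class only for compactness.
public class StreamStats {
    private long sum = 0;
    private long count = 0;

    private final long[] reservoir = new long[3];
    private final Random random = new Random();

    void observe(long x) {
        // Average: two counters suffice.
        sum += x;
        count += 1;

        // Median: keep a uniform sample (reservoir sampling). With probability
        // k/count the new item replaces a random slot, so every item ends up
        // in the sample with equal probability.
        if (count <= reservoir.length) {
            reservoir[(int) count - 1] = x;
        } else if (random.nextInt((int) count) < reservoir.length) {
            reservoir[random.nextInt(reservoir.length)] = x;
        }
    }

    double average() {
        return (double) sum / count;
    }

    /** Median of the sample: only an estimate, assumes at least 3 items were seen. */
    long approximateMedian() {
        long[] s = reservoir.clone();
        Arrays.sort(s);
        return s[s.length / 2];
    }
}
```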
Data streaming contd.
• Typically, simple functions of the stream are computed and used as input to other algorithms
  • Median
  • Number of distinct elements
  • Longest increasing sequence
  • …
• Closed-form solutions are rare
• Common approaches are approximations of the true value: sampling, hashing (a hashing-based sketch follows below)
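As an illustration of the hashing idea, here is a crude, Flajolet-Martin style sketch for estimating the number of distinct elements (my own example, not lecture material):

```java
// Crude sketch: estimate the number of distinct elements using only one small
// counter. Real implementations combine many hash functions (and a bias
// correction factor); a single hash like String.hashCode() gives rough results.
public class DistinctCount {
    private int maxTrailingZeros = 0;

    void observe(String item) {
        int h = item.hashCode();
        if (h == 0) {
            return; // an all-zero hash carries no information
        }
        // Record the largest number of trailing zero bits seen in any hash value.
        int zeros = Integer.numberOfTrailingZeros(h);
        maxTrailingZeros = Math.max(maxTrailingZeros, zeros);
    }

    /** Estimate of the number of distinct items seen so far (roughly 2^R). */
    double estimate() {
        return Math.pow(2, maxTrailingZeros);
    }
}
```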
A brief introduction to MapReduce
MapReduce is an industry standard. Hadoop is the open-source implementation of the MapReduce framework.
QCon 2013 (San Francisco)
“QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference …”
Industry is moving fast
QCon 2014 (San Francisco)
MapReduce
• Designed for batch processing over large data sets
• No limits on the number of passes, memory or time
• A programming model for distributed computations, inspired by the functional programming paradigm
MapReduce example
WordCount: given an input text, determine the frequency of each word. The “Hello World” of the MapReduce realm.
Input text: The dog walks around the house. The dog is in the house.
(Diagram: the input text is split over several Mappers; each Mapper emits (word, 1) pairs such as (the, 1), (dog, 1), (walks, 1), (house, 1), …; the pairs are sorted by word and a Reducer adds the counts: dog: 2, walks: 1, house: 2, …)
MapReduce example
We implement the Mapper and the Reducer. Hadoop (and other tools) are responsible for the “rest”.
(Same WordCount diagram as before: Mappers emit (word, 1) pairs, the framework sorts them by word, and the Reducer adds up the counts.)
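For a first impression of what “we implement the Mapper and the Reducer” means, here is a minimal WordCount sketch against Hadoop’s Java MapReduce API (an illustration only, not the exact code used in the labs):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: for every input line, emit (word, 1) for each word it contains.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // e.g. (dog, 1)
                }
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) after the sort phase and adds the counts.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // e.g. (dog, 2)
        }
    }
}
```

Everything between map and reduce, i.e. splitting the input and sorting/shuffling the (word, 1) pairs, is handled by the framework; that is the “rest” Hadoop takes care of.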
Summary
• What are the characteristics of “big data”?
• Example use cases of big data
• Hopefully a convincing argument why you should care
• A brief introduction to data streams and MapReduce
Reading material
Required reading
None.

Recommended reading
Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules Berman. Chapters 1, 14 and 15.
THE END