Introduction to Big Data Professor Emile Chungtien Chi

College of Staten Island / CUNY
Tongji University 8 May 2014
Attribution
• Professor Rouming Jin; Kent State University
– CS 4/6/79995: ST: Big Data & Analytics
• Wikipedia
• Reference and source books
– Hadoop: The Definitive Guide; Tom White; 3rd Edition; O'Reilly Media; 2012
– Hadoop In Action; Chuck Lam; Manning Publications; 2011
– Data-Intensive Text Processing with MapReduce; Jimmy Lin & Chris Dyer
(www.umiacs.umd.edu/~jimmylin/MapReduce-bookfinal.pdf)
– Data Mining: Concepts and Techniques; Jiawei Han,
Micheline Kamber; 3rd Edition; Morgan Kaufmann; 2011
What is Big Data?
No single definition; this is from Wikipedia:
• Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using database
management tools or traditional data processing
applications.
• The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with the
same total amount of data, allowing correlations to be found
to "spot business trends, determine quality of research,
prevent diseases, link legal citations, combat crime, and
determine real-time roadway traffic conditions."
Big Data: 3V’s (Volume, Variety, Velocity)
Big Numbers
• 10^3 kB kilobyte: PC memory in 1980
• 10^6 MB megabyte: PC memory in 1994
• 10^9 GB gigabyte: PC memory, population of the earth
• 10^12 TB terabyte: PC hard drive
• 10^15 PB petabyte: data in the US Library of Congress
• 10^18 EB exabyte: YouTube servers
• 10^21 ZB zettabyte: Snowden’s NSA data
• 10^24 YB yottabyte: number of stars in the universe
• 10^79: number of protons in the universe
• 10^100 googol (not Google)
• 10^googol googolplex
Volume (Scale)
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 ZB to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
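A 44x increase over 11 years implies a compound annual growth rate. As a rough check, and assuming the growth is spread evenly over the years from 2009 to 2020 (an assumption made here for illustration, not a claim from the slide), the arithmetic can be sketched as follows. It works out to roughly 40% per year.

```python
# Figures from the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009  # 11 years

growth_factor = end_zb / start_zb              # total growth, about 44x
annual_factor = growth_factor ** (1 / years)   # assumes evenly compounded growth

print(f"total growth: {growth_factor:.1f}x")
print(f"implied annual growth: {(annual_factor - 1) * 100:.0f}% per year")
```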
Volume (Scale)
• The combined space of all computer hard drives in the
world was estimated at approximately 160 EB in 2006
• Seagate Technology reported selling 330 EB worth of hard
drives in 2011
• As of 2013, the World Wide Web is estimated to hold 4 ZB of data
• Mark Liberman calculated the storage requirements for all
human speech ever spoken at 42 ZB if digitized as 16 kHz
16-bit audio
• Research from UC San Diego: in 2008, Americans consumed
3.6 ZB of data
• 5 ZB was the estimated size of the US National Security
Agency data revealed by Edward Snowden's NSA leaks
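To get a feel for the speech estimate above, the data rate of 16 kHz 16-bit audio can be worked out directly. The sketch below assumes a single uncompressed (mono) channel and the decimal definition of a zettabyte; both are assumptions made here, not details given on the slide.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, from the slide
BYTES_PER_SAMPLE = 2      # 16-bit samples
ZB = 10 ** 21             # zettabyte, decimal definition (assumed)

# Mono, uncompressed audio (assumed): 32,000 bytes per second
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
total_seconds = 42 * ZB / bytes_per_second
total_years = total_seconds / (365.25 * 24 * 3600)

print(f"{bytes_per_second} bytes per second of speech")
print(f"{total_years:.2e} years of continuous speech fit in 42 ZB")
```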
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS enabled devices sold annually
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
• A single application can be generating/collecting
many types of data
• Big Public Data (online, weather, finance, etc)
To extract knowledge, all these types of
data need to be linked together
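The "scan the data once" constraint on streaming data means a computation must update its answer as each item arrives and keep no history. A minimal sketch of that idea is an incremental mean; the function names and the sample readings below are hypothetical and only illustrate the one-pass pattern.

```python
from typing import Iterable, Iterator, Tuple

def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each element, reading the stream only once."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental update, no history kept
        yield count, mean

def sensor_readings() -> Iterator[float]:
    # Hypothetical data source standing in for a live stream
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value

final_count, final_mean = 0, 0.0
for final_count, final_mean in running_mean(sensor_readings()):
    pass
print(final_count, final_mean)
```

Memory use stays constant no matter how long the stream runs, which is the property a streaming algorithm needs.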
A Single View to the Customer
(Diagram: the Customer at the center, linked to Social Media, Gaming, Entertain, Banking, Finance, Our Known History, Purchase)
Velocity (Speed)
• Data is being generated fast and needs to be
processed fast
• Online Data Analytics
• Late decisions lead to missed opportunities
• Examples
– E-Promotions: based on your current location, your purchase history,
and what you like, send promotions right now for a store near your current
location
– Healthcare monitoring: sensors monitor your activities and body;
any abnormal measurements require immediate attention
Real-time/Fast Data
Mobile devices
(tracking all objects all the time)
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Sensor technology and networks
(measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But are hindered by the ability to manage, analyze, summarize, visualize, and
discover knowledge from the collected data in a timely manner and in a scalable
fashion
• This is the problem addressed by data analytics
Real-Time Analytics/Decision Requirement
(Diagram: the Customer at the center, with the goal "Influence Behavior", surrounded by:)
• Product Recommendations that are Relevant & Compelling
• Improving the Marketing Effectiveness of a Promotion while it is still in Play
• Learning why Customers Switch to competitors and their offers; in time to Counter
• Preventing Fraud as it is Occurring & preventing more proactively
• Friend Invitations to join a Game or Activity that expands business
Some Make it 4V’s
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
– Old Model: a few companies generate data, all others consume data
– New Model: all of us generate data, and all of us consume data
What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Architectures
• shared nothing architecture (SN)
– distributed computing architecture in which each node is
independent and self-sufficient
– none of the nodes share memory or disk storage
• massively parallel architecture
– Use a large number of processors to perform a set of
coordinated computations in parallel
• scale-out architecture
– increase capacity and performance beyond the physical
limits of a single processor and disk array
– a combination of hybrid arrays can be seamlessly
configured into a storage cluster that can accommodate
growing workloads
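One common way to realize a shared-nothing design is to give every node a disjoint slice of the data, chosen by hashing each record's key. The sketch below simulates this in a single process; the node count, the sample records, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size

def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a key to the node that owns it, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(records, num_nodes: int = NUM_NODES):
    """Split (key, value) records into per-node lists with no overlap."""
    nodes = defaultdict(list)
    for key, value in records:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes

records = [("alice", 1), ("bob", 2), ("carol", 3), ("alice", 4)]
nodes = partition(records)

# Each node sums only its own records; no memory or disk is shared
local_sums = {n: sum(v for _, v in recs) for n, recs in nodes.items()}
print(local_sums)
```

Because a given key always hashes to the same node, all records for that key end up together, and adding nodes (scale-out) only changes how the key space is divided.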
Massively parallel architecture uses
cheap off-the-shelf computers
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
High Performance Computers at
The College of Staten Island / CUNY
• “SALK” is a Cray XE6m with a total of 1280 processor cores. Salk is
reserved for large parallel jobs, particularly those requiring more
than 64 cores. Emphasis is on applications in the environmental
sciences and astrophysics. Salk is named in honor of Dr. Jonas Salk,
the developer of the first polio vaccine, and a City College alumnus
• “ANDY” is an SGI cluster with 744 processor cores and 96 NVIDIA
Fermi processor accelerators. Andy is for jobs using 64 cores or
fewer, for jobs using the NVIDIA Fermi’s, and for Gaussian
jobs. Andy is named in honor of Dr. Andrew S. Grove, a City College
alumnus and one of the founders of the Intel Corporation
• “BOB” is a Dell cluster with 232 processor cores. Bob is for jobs
using 64 cores or fewer, and parallel Matlab jobs. Bob is named in
honor of Dr. Robert E. Kahn, an alumnus of the City College, who,
along with Vinton G. Cerf, invented the TCP/IP protocol.
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
• The problem:
– Diverse input format (data diversity & heterogeneity)
– Large Scale: Terabytes, Petabytes
– Parallelization
(Dean and Ghemawat, OSDI 2004)
Divide and Conquer Strategy
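• Partition: the “Work” is split into units w1, w2, w3
• Each unit is handed to its own “worker”, producing partial results r1, r2, r3
• Combine: the partial results are merged into the “Result”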
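The partition / worker / combine pattern can be sketched on one machine with a small pool of workers. The work done here (summing squares) and the function names are hypothetical; threads are used only to keep the example short, and a process pool or a cluster would be the usual choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, num_parts):
    """Split the work list into num_parts roughly equal chunks."""
    return [work[i::num_parts] for i in range(num_parts)]

def worker(chunk):
    """Hypothetical unit of work: sum the squares in one chunk."""
    return sum(x * x for x in chunk)

def combine(partial_results):
    """Merge the partial results into the final result."""
    return sum(partial_results)

work = list(range(1, 11))
chunks = partition(work, 3)                      # w1, w2, w3
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, chunks))    # r1, r2, r3
result = combine(partials)
print(result)
```

This only works so cleanly because the chunks are independent and the combine step does not care about ordering; the next slides cover what happens when those conditions fail.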
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than
workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have
finished?
• What if workers die?
What is the common theme of all of these problems?
Common Theme?
• Parallelization problems arise from:
– Communication between workers (e.g., to
exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
Source: Ricardo Guimarães Herrmann
Managing Multiple Workers
• Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each other
– We don’t know the order in which workers access shared data
• Thus, we need:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
• Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
• Moral of the story: be careful!
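As a small illustration of why synchronization is needed, the sketch below has four threads increment one shared counter. It uses a mutual-exclusion lock from Python's standard threading module; the class name and the iteration counts are hypothetical. Without the lock, the read-modify-write on the counter is a race condition and updates can be lost.

```python
import threading

class SharedCounter:
    """A counter that is safe to increment from several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # acquire on entry, release on exit
            self.value += 1     # read-modify-write on shared data

def add_many(counter, n):
    for _ in range(n):
        counter.increment()

counter = SharedCounter()
threads = [threading.Thread(target=add_many, args=(counter, 10_000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                    # wait until every worker has finished
print(counter.value)
```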
Concurrency Challenge!
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage
everything
MapReduce: Big Data Processing Abstraction
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output
Key idea: provide a functional abstraction for these two
operations, Map and Reduce
(Dean and Ghemawat, OSDI 2004)
Functional Programming
• Is a programming paradigm, i.e. a style of computer
programming that treats computation as the evaluation of
mathematical functions
• Emphasizes functions that produce results that depend
only on their inputs and not on the program state
– i.e. pure mathematical functions
• Is a declarative programming paradigm, which means
programming is done with expressions
• In functional code, the output value of a function depends
only on the arguments that are input to the function, so
calling a function f twice with the same value for an
argument x will produce the same result f(x) both times
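The properties listed above can be seen in a few lines using Python's built-in map and functools.reduce. The two helper functions are hypothetical examples of pure functions; this is a sketch of the style, not of any particular framework.

```python
from functools import reduce

def square(x):
    """Pure: the result depends only on x."""
    return x * x

def add(a, b):
    """Pure: no hidden state is read or modified."""
    return a + b

numbers = [1, 2, 3, 4]
squares = list(map(square, numbers))   # apply square to each element
total = reduce(add, squares, 0)        # fold the results into one value

assert square(3) == square(3)          # same input, same output
print(squares, total)
```

Because neither function touches shared state, the calls to square could run in any order or on different machines, which is what makes this style attractive for parallel processing.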
MapReduce
• A programming model for processing large data
sets with a parallel, distributed algorithm on a
cluster with massively parallel architecture
– a Map() procedure that performs filtering and
sorting
• e.g. sorting students by surname into queues, one queue
for each name
– a Reduce() procedure that performs a summary
operation
• e.g. counting the number of students in each queue,
yielding surname frequencies
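The student example can be sketched in a single process. The sample names and the function names are hypothetical, and the grouping step written out by hand here is what a real MapReduce framework performs between the two phases.

```python
from collections import defaultdict

def map_student(student):
    """Map: emit (surname, 1) for one student record."""
    given_name, surname = student
    yield surname, 1

def reduce_surname(surname, counts):
    """Reduce: summarize one queue by counting its entries."""
    return surname, sum(counts)

# Hypothetical sample data
students = [("Ana", "Lee"), ("Ben", "Chen"), ("Cai", "Lee"),
            ("Dee", "Chen"), ("Eli", "Wu")]

# Group map output by key: one queue per surname
queues = defaultdict(list)
for student in students:
    for surname, one in map_student(student):
        queues[surname].append(one)

frequencies = dict(reduce_surname(s, c) for s, c in sorted(queues.items()))
print(frequencies)
```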
MapReduce
• The "MapReduce System" (also called
"infrastructure" or "framework")
– manages distributed servers, running the various
tasks in parallel
– manages all communications and data transfers
between the various parts of the system
– provides for redundancy and fault tolerance
MapReduce
• The MapReduce model is inspired by the map
and reduce functions commonly used in
functional programming
• their purpose in MapReduce is not the same
as in their original forms
• The key contributions of MapReduce
– are not the actual map and reduce functions
– are the scalability and fault-tolerance achieved for
a variety of applications by optimizing the
execution engine once
MapReduce “Runtime”
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and restarts
• Everything happens on top of a distributed file
system
MapReduce Implementations
• Google has a proprietary implementation in C++
– Bindings in Java, Python
• Hadoop is an open-source implementation in Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem
• Lots of custom research implementations
– For GPUs, cell processors, etc.
Apache Hadoop
• Scalable fault-tolerant distributed system for Big Data:
– Data Storage
– Data Processing
– A virtual Big Data machine
– Borrowed concepts/ideas from Google; open source
under the Apache license
• Core Hadoop has two main systems:
– Hadoop/MapReduce: distributed big data processing
infrastructure (abstract/paradigm, fault-tolerant, schedule,
execution)
– HDFS (Hadoop Distributed File System): fault-tolerant,
high-bandwidth, high availability distributed storage
Hadoop History
• Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application
that runs on a Linux cluster with more than 10,000 cores and produces data
that is now used in every Yahoo! Web search query.
• June 10, 2009 – Yahoo! made available the source
code to the version of Hadoop it runs in production.
• 2010 – Facebook claimed to have the largest Hadoop cluster
in the world, with 21 PB of storage. On July 27, 2011 it
announced the data had grown to 30 PB.
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Hadoop Cloud Resources
• Hadoop on your local machine
• Hadoop in a virtual machine on your local
machine (Pseudo-Distributed on Ubuntu)
• Hadoop in the clouds with Amazon EC2
Example Word Count (Map)
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  // Reused for every token to avoid allocating an object per record
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input line; emits (word, 1) for each token
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Example Word Count (Reduce)
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Called once per distinct word; sums all counts emitted for that word
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Example Word Count (Driver)
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // Strip generic Hadoop options; what remains is <in> <out>
  String[] otherArgs =
      new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  // In Hadoop 2.x this constructor is deprecated in favor of
  // Job.getInstance(conf, "word count")
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  // The reducer doubles as the combiner because summing is associative
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  // Exit code 0 on success, 1 on failure
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Word Count Execution
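Input (one split per Map):
– Map 1: the quick brown fox
– Map 2: the fox ate the mouse
– Map 3: how now brown cow
Map output, after Shuffle & Sort:
– Map 1 to Reduce 1: the, 1; brown, 1; fox, 1
– Map 1 to Reduce 2: quick, 1
– Map 2 to Reduce 1: the, 1; fox, 1; the, 1
– Map 2 to Reduce 2: ate, 1; mouse, 1
– Map 3 to Reduce 1: how, 1; now, 1; brown, 1
– Map 3 to Reduce 2: cow, 1
Reduce output:
– Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
– Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1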
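The same execution can be traced in a single Python process. This is a sketch of the data flow only, using the three input splits from the diagram; the function names are hypothetical, and the split of keys across two reducers is left out.

```python
from itertools import groupby
from operator import itemgetter

def map_words(line):
    """Map: emit (word, 1) for every token in one input line."""
    for word in line.split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: sum all counts for one word."""
    return word, sum(counts)

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Map phase: one map call per input split
intermediate = [pair for line in splits for pair in map_words(line)]

# Shuffle & sort: bring identical keys together
intermediate.sort(key=itemgetter(0))

# Reduce phase: one reduce call per distinct word
counts = dict(
    reduce_counts(word, (c for _, c in group))
    for word, group in groupby(intermediate, key=itemgetter(0))
)
print(counts)
```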
An Optimization: The Combiner
• A combiner is a local aggregation function
for repeated keys produced by the same map task
• Works for associative and commutative operations such as sum, count, max
• Decreases the size of intermediate data
• Example: local counting for Word Count:
def combiner(key, values):
    # Sum the counts one map task emitted for this key
    return key, sum(values)
Word Count with Combiner
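Input (one split per Map):
– Map 1: the quick brown fox
– Map 2: the fox ate the mouse
– Map 3: how now brown cow
Map & Combine output, after Shuffle & Sort:
– Map 1 to Reduce 1: the, 1; brown, 1; fox, 1
– Map 1 to Reduce 2: quick, 1
– Map 2 to Reduce 1: the, 2; fox, 1
– Map 2 to Reduce 2: ate, 1; mouse, 1
– Map 3 to Reduce 1: how, 1; now, 1; brown, 1
– Map 3 to Reduce 2: cow, 1
Reduce output:
– Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
– Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1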
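The effect of the combiner on the amount of intermediate data can be checked with a short simulation over the same three splits. The function names are hypothetical; the point of the sketch is that fewer pairs reach the shuffle while the final totals stay the same.

```python
from collections import Counter

def map_words(line):
    """Map: emit (word, 1) for every token in one input line."""
    for word in line.split():
        yield word, 1

def combine(pairs):
    """Combiner: sum counts per word within one map task's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

without_combiner = [list(map_words(s)) for s in splits]
with_combiner = [combine(map_words(s)) for s in splits]

pairs_before = sum(len(p) for p in without_combiner)   # pairs sent to shuffle
pairs_after = sum(len(p) for p in with_combiner)

# The final totals are unchanged by the combiner
totals = Counter()
for task_output in with_combiner:
    for word, count in task_output:
        totals[word] += count
print(pairs_before, pairs_after, totals["the"])
```

On this tiny input only the second split contains a repeated word, so the saving is a single pair; on real text, where common words repeat many times per split, the reduction is much larger.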
End of Presentation