Large-Scale Data Processing

Eiko Yoneki
eiko.yoneki@cl.cam.ac.uk
http://www.cl.cam.ac.uk/~ey204
Systems Research Group
University of Cambridge Computer Laboratory
2010s: Big Data
• Why Big Data now?
  • Increase of storage capacity
  • Increase of processing capacity
  • Availability of data
• Hardware and software technologies can manage an ocean of data
  • Up to 2003: 5 exabytes → 2012: 2.7 zettabytes (500x more) → 2015: ~8 zettabytes (3x more than 2012)
Examples of Big Data
• Facebook:
  • 300 PB data warehouse (600 TB/day)
  • 1 billion users
• Twitter Firehose:
  • 500 million tweets/day
• CERN:
  • 15 PB/year, stored in an RDB
• Google:
  • 40,000 search queries/second
• eBay:
  • 9 PB of user data, plus >50 TB/day
• Amazon Web Services:
  • Estimated ~450,000 servers for AWS
  • S3: 450B objects, peak of 290K requests/sec
• JPMorgan Chase:
  • 150 PB on 50K+ servers with 15K apps running
Scale-Up vs. Scale-Out
• The popular solution for big data processing → scale out: build on distribution, combining a theoretically unlimited number of machines into a single distributed store
• Scale up: add resources to a single node in a system
• Scale out: add more nodes to a system
Big Data: Technologies
• Distributed infrastructure
  • Cloud (e.g. Infrastructure as a Service: Amazon EC2, Google App Engine, Elastic, Azure); cf. multi-core (parallel computing)
• Storage
  • Distributed storage (e.g. Amazon S3, Hadoop Distributed File System (HDFS), Google File System (GFS))
• Data model/indexing
  • High-performance schema-free databases (e.g. NoSQL DBs: Redis, BigTable, HBase, Neo4j; see the Redis sketch below)
• Programming model
  • Distributed processing (e.g. MapReduce)
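To make the schema-free data model concrete, here is a minimal sketch using the redis-py client (it assumes a Redis server running on localhost; the key names are made up for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis instance

# No schema defined up front: just write whatever structures the application needs.
r.set("user:42:name", "alice")                                         # plain key-value
r.hset("user:42:profile", mapping={"city": "Cambridge", "logins": 7})  # hash
r.zadd("leaderboard", {"alice": 1500, "bob": 900})                     # sorted set

print(r.get("user:42:name"))                                           # b'alice'
print(r.zrange("leaderboard", 0, -1, withscores=True))
```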
MapReduce Programming
• Target problem needs to be parallelisable
• Split the problem into a set of smaller pieces of code (map)
• Each small piece of code is executed in parallel
• Results from the map operations are synthesised into the result of the original problem (reduce); see the word-count sketch below
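A minimal sketch of this pattern in plain Python, using the classic word-count problem (the phase names below are my own labels, not part of any particular framework):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: turn one input line into (word, 1) pairs; runs independently per line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into the final result."""
    return key, sum(values)

lines = ["big data needs big systems", "big clusters process data"]
intermediate = chain.from_iterable(map_phase(l) for l in lines)  # parallelisable step
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, ...}
```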
Programming
• Non-standard programming models
• Data (flow) parallel programming
  • e.g. MapReduce, Dryad/LINQ, NAIAD, Spark
• MapReduce (Hadoop): two-stage fixed dataflow
• DAG (Directed Acyclic Graph) based (Dryad/Spark/Tez): more flexible dataflow model (see the sketch below)
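To illustrate the DAG-style model, a small PySpark sketch (this assumes a local Spark installation, and the input path input.txt is hypothetical): each transformation only adds a node to the DAG, and nothing runs until the final action.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-dag")

# Each transformation extends the DAG; execution is deferred until an action.
counts = (sc.textFile("input.txt")                     # hypothetical input file
            .flatMap(lambda line: line.split())        # map-like stage
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)           # shuffle + reduce stage
            .filter(lambda kv: kv[1] > 1))             # extra stage: natural in a DAG model

print(counts.collect())  # the action triggers evaluation of the whole DAG
sc.stop()
```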
Do we need new types of algorithms?
• Cannot always store all the data
  • Online/streaming algorithms
    • Have we seen x before?
    • Rolling average of the previous K items
  • Incremental updating (see the sketch below)
• Memory vs. disk becomes critical
  • Algorithms with a limited number of passes
  • O(N²) is infeasible; fast data processing is required
  • Approximate algorithms, sampling
  • Iterative operation (e.g. machine learning)
• Data has different relations to other data
  • Algorithms for high-dimensional data (efficient multidimensional indexing)
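Two of the streaming points above in miniature: a rolling average over the previous K items, maintained incrementally in one pass with bounded memory, plus an exact "have we seen x before?" check (the class name is my own; a Bloom filter would bound the membership check's memory further):

```python
from collections import deque

class RollingAverage:
    """Average of the most recent k items, updated incrementally in O(1) per item."""
    def __init__(self, k):
        self.k, self.window, self.total = k, deque(), 0.0

    def add(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.k:          # evict the oldest item
            self.total -= self.window.popleft()
        return self.total / len(self.window)

seen = set()                                    # exact membership; a Bloom filter
stream = [3, 1, 4, 1, 5, 9, 2, 6]               # would trade accuracy for fixed memory
avg = RollingAverage(k=3)
for x in stream:
    print(x, "seen before:", x in seen, "rolling avg:", round(avg.add(x), 2))
    seen.add(x)
```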
Big Data Analytics Stack
Hadoop Big Data Analytics Stack
(figure: Hadoop-based analytics stack; labels include Storm)
Spark Big Data Analytics Stack
How about Graph (Network) Data?
Examples: brain networks (100B neurons, 700T links, requiring 100s of GB of memory); the Web (1.4B pages, 6.6B links); social media data; airline graphs; bipartite graphs of phrases in documents; gene expression data; protein interactions [genomebiology.com]
Data-Parallel vs. Graph-Parallel
• Data-parallel for everything? Graph-parallel is hard!
  • Data-parallel fits problems such as sort/search, where data can be randomly split and fed to MapReduce
  • Not every graph algorithm is parallelisable (interdependent computation)
  • Little data access locality
  • High ratio of data access to computation
Graph-Parallel
• Graph-parallel (graph-specific data parallel)
  • Vertex-based iterative computation model
  • Uses an iterative Bulk Synchronous Parallel (BSP) model: Pregel (Google), Giraph (Apache), GraphLab, GraphChi (CMU)
  • Optimisation over data parallel: GraphX/Spark (U.C. Berkeley)
  • Data-flow programming, a more general framework: NAIAD (MSR)
Bulk Synchronous Parallel Model
• Computation is a sequence of iterations
• Each iteration is called a super-step
• Computation at each vertex proceeds in parallel
Example: computing the maximum vertex value (sketched below)
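A toy sketch of the super-step structure for this example, in plain Python with synchronous message passing between supersteps (the data structures are my own simplification of Pregel-style vertex programs, not any system's actual API):

```python
# Undirected toy graph: vertex id -> neighbours, plus initial vertex values.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
value = {0: 3, 1: 6, 2: 2, 3: 1}
active = set(graph)                       # all vertices start active
inbox = {v: [] for v in graph}

superstep = 0
while active:                             # each loop iteration is one super-step
    outbox = {v: [] for v in graph}
    for v in list(active):                # conceptually parallel across vertices
        new_val = max([value[v]] + inbox[v])
        if superstep == 0 or new_val > value[v]:
            value[v] = new_val
            for u in graph[v]:            # propagate the current maximum
                outbox[u].append(new_val)
        else:
            active.discard(v)             # vote to halt
    inbox = outbox                        # barrier: messages arrive at the next super-step
    active |= {v for v, msgs in inbox.items() if msgs}   # messages reactivate vertices
    superstep += 1

print(value)                              # every vertex converges to the global maximum, 6
```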
Big Data Analytics Stack
(figure: stack layers labelled query language, machine learning, streaming processing, graph processing, execution engine, Percolator, resource manager, storage (GFS), and database)
Single Computer?
• Use of powerful HW/SW parallelism
  • SSDs as external memory
  • GPUs for massive parallelism
• Exploit the graph structure/algorithm for processing
(figure: the computation platform spans multi-core CPUs in a single computer or a cluster, GPU parallelism, input from storage/stream, and HD/SSD as external memory)
Conclusions
• The increasing capability of hardware and software will make big data accessible
• Challenges in data-intensive processing
  • Data-parallel (MapReduce) and graph-parallel
  • Graph-parallel is hard, but has great potential for advanced data mining
  • Data-driven computations and poor data locality require looking beyond the computation model
• An inter-disciplinary approach is necessary
  • Distributed systems
  • Networking
  • Databases
  • Algorithms
  • Statistics
  • Machine learning
3Vs of Big Data
• Volume: terabytes to petabytes scale
• Velocity: time-sensitive (streaming)
• Variety: beyond structured data (e.g. text, audio, video, etc.)
(figure, source IBM: terabytes or even petabytes scale; time-sensitive to maximise its value; beyond structured data)
MapReduce Programming Model
• Target problem needs to be parallelisable
External Memory
• A lot of work on computation, but little attention to storage
• Store a LARGE amount of graph-structure data (the majority of the data is edges)
• Move it efficiently to the computation
Potential solutions:
• Cost-effective but efficient storage
  • Move from RAM to SSDs (or HDs)
• Reduce latency
  • Runtime prefetching
  • Streaming (edge-centric approach)
• Reduce storage requirements
  • Compressed adjacency lists (sketched below)
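As a rough illustration of the last point, here is delta-plus-varint compression of a sorted adjacency list, one common scheme for shrinking edge storage (a simplified sketch, not the on-disk format of any particular system):

```python
def encode_varint(n):
    """Encode a non-negative integer as a variable-length byte string (7 bits per byte)."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))   # high bit marks "more bytes follow"
        if not n:
            return bytes(out)

def compress_neighbours(neighbours):
    """Delta-encode a sorted neighbour list, then varint-encode each gap."""
    prev, out = 0, bytearray()
    for v in sorted(neighbours):
        out += encode_varint(v - prev)          # small gaps compress to a single byte
        prev = v
    return bytes(out)

adj = [1000000, 1000003, 1000004, 1000010]
blob = compress_neighbours(adj)
print(len(blob), "bytes instead of", 8 * len(adj))  # 6 vs 32 with 64-bit vertex ids
```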