Thus Far
• Locality is important!!!
– Need to get processing closer to storage
– Need to get tasks close to data
• Rack locality: Hadoop
• Kill running tasks to free up local slots: Quincy
• Why?
– The network is slow: remote reads give horrible performance
– Why?
• Over-subscription of network
What Has Changed?
• The network is no longer over-subscribed
– Fat-tree, VL2
• Network has fewer congestion points
– Helios, c-Through, Hedera, MicroTE
• Server uplinks are much faster
• Implications: network transfers are much faster
– Network I/O is now about as fast as disk I/O
– Reading data rack-locally is only ~8% slower than reading it disk-locally
• Storage practices have also changed
– Compression is being used
• Less data needs to be transferred
– De-replication is being practiced
• With only one copy, locality is really hard to achieve
So What Now?
• No need to worry about locality when doing
placement
– Placement can happen faster
– Scheduling algorithms can be smaller/simpler
• The network is as fast as a SATA disk, but still a lot slower than an SSD
– If SSDs are used, then disk locality is a problem AGAIN!
– However, SSDs are too costly to use for all storage
Caching with Memory/SSD
• The inputs of 94% of jobs can fit in memory
• So a new problem is memory locality
– Want to place a task where it will have access to
data already in memory
• Interesting challenges:
– 46% of tasks read data that is never re-used
• So we need to pre-fetch for these tasks
– Current caching schemes are ineffective (placement sketch below)
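A minimal sketch of what memory-locality-aware placement with prefetching could look like; the place_task helper, server names, and cache structure below are hypothetical and not from any of the systems discussed.

```python
# Hypothetical memory-locality-aware placement: prefer a server that already
# caches the task's input; otherwise fall back to any server and prefetch.
def place_task(task_input, servers, mem_cache):
    for server in servers:
        if task_input in mem_cache.get(server, set()):
            return server, "memory-local"
    # Input not cached anywhere (e.g., data that is read exactly once):
    # pick a server and simulate prefetching the input into its memory.
    target = servers[0]
    mem_cache.setdefault(target, set()).add(task_input)
    return target, "prefetched"

cache = {"s1": {"blockA"}, "s2": set()}
print(place_task("blockA", ["s1", "s2"], cache))  # ('s1', 'memory-local')
print(place_task("blockB", ["s1", "s2"], cache))  # ('s1', 'prefetched')
```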
How do you build an FS that ignores locality?
• FDS from MSR ignores locality
• Eliminate network bottlenecks to remove the importance of locality
• Eliminate meta-data server bottlenecks to improve whole-system throughput
Meta-data Server
• Current meta-data server (name-node)
– Stores mapping of chunks to servers
– Central point of failure
– Central bottleneck
• Processing issues: every read/write must first consult the metadata server
• Storage issues: must store the location and size of EVERY chunk
FDS’s Meta-data Server
– Only stores a list of servers:
• Smaller memory footprint:
• # servers <<< # chunks
– Clients only interact with it at startup
• Not every time they need to read/write
• To determine where to read/write: consistent hashing over the server list (sketch below)
– Read/write data at the server at index Hash(GUID) mod #servers
• # client boots <<< # reads/writes
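A minimal sketch of the lookup rule above, assuming the slide's simplification (a flat server list plus a hash); real FDS uses a tract locator table, so the names and hashing details here are illustrative only.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2", "server-3"]  # fetched once at client startup

def locate(guid: str, tract: int = 0) -> str:
    """Map a blob GUID (and tract number) to a server without asking the
    metadata server on the data path."""
    h = int(hashlib.md5(guid.encode()).hexdigest(), 16)
    return SERVERS[(h + tract) % len(SERVERS)]

print(locate("blob-1234", tract=0))  # deterministic: same GUID -> same server
print(locate("blob-1234", tract=7))  # consecutive tracts spread across servers
```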
Network Changes
• Uses a VL2-style Clos network
– Eliminates over-subscription + congestion
• A single TCP connection doesn't saturate a server's 10 Gbps NIC
– Use ~5 TCP connections to saturate the link (sketch below)
• With VL2 there is no congestion in the core, but there can be at the receiver
– The receiver controls the senders' sending rates
• The receiver sends rate-limiting messages to the senders
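A toy, self-contained sketch of striping one transfer across several TCP connections to fill a fat NIC, as the slide describes; the local sink server, port, and chunk size are invented for the example, and a real receiver-driven protocol would also send rate/credit messages back to the senders.

```python
import socket
import threading
import time
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, N_CONNS = "127.0.0.1", 50007, 5
CHUNK = b"x" * (1 << 20)  # 1 MiB per connection in this toy example

def sink():
    # Accept N_CONNS connections and drain whatever they send.
    with socket.create_server((HOST, PORT)) as srv:
        for _ in range(N_CONNS):
            conn, _ = srv.accept()
            with conn:
                while conn.recv(65536):
                    pass

def send_stripe(i):
    # One of the parallel TCP connections carrying a slice of the transfer.
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(CHUNK)
    return i

threading.Thread(target=sink, daemon=True).start()
time.sleep(0.2)  # let the sink start listening
with ThreadPoolExecutor(max_workers=N_CONNS) as pool:
    print("completed stripes:", list(pool.map(send_stripe, range(N_CONNS))))
```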
Disk locality is almost a distant
problem
• Advances in networking
– Eliminate over-subscription/congestion
• We have a prototype, FDS, that doesn't need locality
– Uses VL2
– Eliminates meta-data servers
• New problem, new challenges
– Memory locality
• New cache replacement techniques
• New pre-fetching schemes
Class Wrap-Up
• What have we covered and learned?
• The big-data stack
– How to optimize each layer?
– What are the challenges in each layer?
– Are there any opportunities to optimize across
layers?
Big-Data Stack: App Paradigms
• Commodity devices impact the design of
application paradigms
– Hadoop: dealing with failures
• Addresses n/w over-subscription: rack-aware placement
• Straggler detection and mitigation: restart slow tasks
– Dryad: Hadoop for smarter programmers
• Can create more expressive task DAGs (acyclic)
• Can specify which tasks should run locally on the same devices
• Dryad adds its own optimizations, e.g., extra nodes for intermediate aggregation
Big-Data Stack: App Paradigms
Revisited
• User visible services are complex and
composed of multiple M-R jobs
– FlumeJava & DryadLINQ
• Delay Execution until output is required
• Allows for various optimizations
• Storing output to HDFS between M-R jobs adds time
– Eliminate HDFS between jobs
• Programmers aren't perfect: pipelines often contain extra, unnecessary steps
– Knowing what the final output requires, you can eliminate the unnecessary steps
Big-Data Stack: App Paradigms
Revisited-yet-again
• User-visible services require interactivity, so jobs need to be fast;
jobs should return results before processing completes
– Hadoop Online:
• Pipeline results from map to reduce before the map is done
• Pipeline too early and the reducers must do the sorting
– Increases processing overhead on reduce: BAD!!!
– RDD: Spark
• Store data in memory: much faster than disk
• Instead of processing immediately, build an abstract graph of the processing
and execute it only when output is required (sketch below)
– Allows for optimizations
• Failure recovery is the challenge
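A minimal PySpark sketch of the lazy-evaluation point above (assumes pyspark is installed; the input data is made up): transformations only record lineage, and nothing executes until an action forces it.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wrapup-example")
lines = sc.parallelize(["the quick brown fox", "the lazy dog"])

counts = (lines.flatMap(lambda l: l.split())      # no work happens yet...
               .map(lambda w: (w, 1))             # ...these only build the DAG
               .reduceByKey(lambda a, b: a + b)
               .cache())                          # keep the result in memory

print(counts.count())    # action: the whole lineage graph executes here
print(counts.collect())  # re-uses the cached partitions
sc.stop()
```

Lost cached partitions are simply recomputed from this lineage graph, which is why failure recovery is the interesting part.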
Big-Data Stack: Sharing is Caring
How to share a non-virtualized cluster
• Sharing is good: you have too much data, and it costs too much to build
many clusters for the same data
• Need dynamic sharing: static partitioning wastes resources
• Mesos:
– Resource offers: give apps a choice of resources and let them pick
– The app knows best
• Omega:
– Optimistic allocation: each scheduler picks resources; if there's a conflict,
Omega detects it and grants the resources to only one scheduler, and the
others pick new resources (sketch below)
– Even with conflicts, this is much better than a single centralized scheduler
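A toy sketch of Omega-style optimistic allocation with conflict detection; the shared cell_state, machine names, and commit helper are invented for illustration.

```python
cell_state = {"m1": "free", "m2": "free", "m3": "free"}  # shared cluster state

def commit(scheduler, wants):
    """Try to claim the machines this scheduler optimistically planned to use;
    machines already taken are rejected and the scheduler must pick again."""
    granted, rejected = [], []
    for machine in wants:
        if cell_state.get(machine) == "free":
            cell_state[machine] = scheduler
            granted.append(machine)
        else:
            rejected.append(machine)
    return granted, rejected

# Both schedulers planned to use m1 from the same (stale) view of the cell.
print(commit("batch-scheduler",   ["m1", "m2"]))  # (['m1', 'm2'], [])
print(commit("service-scheduler", ["m1", "m3"]))  # (['m3'], ['m1']) -> retry m1 elsewhere
```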
Big-Data Stack: Sharing is Caring
Cloud Sharing
• Clouds give the illusion of equality
– H/W differences → different performance
– Poor isolation → tenants can impact each other
• I/O and CPU bound jobs can conflict.
Big-Data Stack: Better Networks
• Networks give bad performance
– Cause: congestion + over-subscription
• VL2/PortLand
– Eliminate over-subscription + congestion with commodity devices + ECMP
(path-selection sketch below)
• Helios/c-Through
– Mitigate congestion by carefully adding new capacity
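A rough sketch of the ECMP idea these designs lean on: hash each flow's 5-tuple to pick one of several equal-cost paths, so flows spread across the fabric while a single flow stays on one path (no reordering). The field names and path count below are made up.

```python
import hashlib

N_CORE_PATHS = 4  # number of equal-cost paths through the core

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % N_CORE_PATHS

print(ecmp_path("10.0.0.1", "10.0.1.9", 52311, 80))  # same flow -> same path
print(ecmp_path("10.0.0.2", "10.0.1.9", 41876, 80))  # different flows spread out
```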
Big-Data Stack: Tail Latency
• When you need multiple servers to service a
request
– Even if each server is fast 99% of the time, 0.99^100 ≈ 0.37, so ~63% of
requests see at least one slow server (HORRIBLE; arithmetic below)
– Duplicate requests: send the same request to 2 servers
• At least one will likely finish within an acceptable time
– Dolly: be smart when selecting the 2 servers
• You don't want I/O contention, because that leads to bad performance
• Avoid map clones using the same replica
• Avoid reduce clones reading the same intermediate output
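The arithmetic behind the bullets above, as a quick back-of-the-envelope check (the 99% per-server probability and 100-server fan-out are the slide's numbers; the 2-way duplication case is the clone-style mitigation):

```python
p_fast = 0.99                # each server answers quickly 99% of the time
print(p_fast ** 100)         # ~0.366: all 100 servers are fast
print(1 - p_fast ** 100)     # ~0.634: at least one server straggles
print((1 - p_fast) ** 2)     # 0.0001: both clones of a duplicated task are slow
```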
Big-Data Stack: Network Sharing
• How to share the network efficiently while making guarantees
• ElasticSwitch
– A two-level bandwidth allocation system
• Orchestra
– M/R has barriers, and completion depends on a set of flows, not on
individual flows
– Optimize the set of flows (the transfer) rather than each flow alone
• Hull: trade bandwidth for latency
– Want (near-)zero buffering, but TCP needs buffering
– Limit traffic to ~90% of the link and keep the remaining ~10% as headroom
so buffers stay empty (pacing sketch below)
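A toy end-host pacer in the spirit of the Hull bullet: cap the sending rate at ~90% of line rate so queues stay (nearly) empty. The link speed, packet size, and pacing loop are invented for the example and ignore Hull's switch-side mechanisms.

```python
import time

LINK_BPS = 10e9        # assumed 10 Gbps link
CAP      = 0.90        # give up ~10% of bandwidth to keep buffers empty
PKT_BITS = 1500 * 8

def paced_send(n_packets):
    interval = PKT_BITS / (LINK_BPS * CAP)  # seconds between packet departures
    start = time.perf_counter()
    for i in range(n_packets):
        # a real sender would transmit packet i here
        target = start + (i + 1) * interval
        delay = target - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
    return time.perf_counter() - start

print(f"paced 1000 packets in {paced_send(1000):.4f}s (target: 90% of line rate)")
```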
Big-Data Stack: Enter SDN
• Remove the control plane from the switches and centralize it
• Centralization == scalability challenges
• NOX: how does it scale to data centers?
– How many controllers do you need?
• How should you design these controllers?
– Kandoo: a hierarchy (many local controllers and one global controller;
local controllers talk to the global one)
– Onix: a mesh (controllers communicate through a DHT or a DB)
Big Data Stack: SDN+Big-Data
• FlowComb:
– Detect application traffic patterns and have the SDN controller assign
paths based on knowledge of those patterns and of contention
• Sinbad:
– HDFS writes are important
– Let the SDN controller tell HDFS the best place to write data, based on
knowledge of n/w congestion
Big Data Stack: Distributed Storage
• Ideal: nice API, low latency, scalable
• Problem: H/W fails a lot, lives in a limited set of locations, and has
limited resources
• Partitioning gives good performance
– Cassandra: uses consistent hashing (ring sketch below)
– Megastore: each partition == an RDBMS with strong consistency guarantees
• Replication: multiple copies tolerate failures
– Megastore: replicas also allow for low-latency reads
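A minimal consistent-hash ring in the spirit of the Cassandra bullet: a key lands on the first node clockwise from its hash, and replicas go to the next nodes on the ring. Node names and the replication factor are made up.

```python
import bisect
import hashlib

NODES, N_REPLICAS = ["nodeA", "nodeB", "nodeC", "nodeD"], 3

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

ring = sorted((_h(n), n) for n in NODES)      # node positions on the ring
tokens = [t for t, _ in ring]

def replicas(key):
    i = bisect.bisect(tokens, _h(key)) % len(ring)   # first node clockwise
    return [ring[(i + k) % len(ring)][1] for k in range(N_REPLICAS)]

print(replicas("user:42"))   # e.g. ['nodeC', 'nodeD', 'nodeA']
print(replicas("user:43"))   # a different key may map to different nodes
```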
Big Data Stack: Disk locality Irrelevant
• Disk Locality is becoming irrelevant
– Data is getting smaller (compressed), so transfer times shrink
– Networks are getting much faster (rack-local reads are only ~8% slower)
• Memory locality is the new challenge
– Inputs of 94% of jobs fit in memory
– Need new caching + prefetching schemes
[Stack diagram: the full big-data stack covered this semester]
• App paradigms: Hadoop, Dryad, FlumeJava, DryadLINQ, Hadoop Online, Spark
• Sharing: Mesos, Omega
• Virtualization drawbacks: BobTail, RFA, Cloud Gaming
• N/W paradigm: VL2, PortLand, Helios, c-Through, Hedera, MicroTE
• Tail latency: Mantri, Dolly (Clones)
• N/W sharing: ElasticSwitch, Orchestra, Hull
• SDN: Kandoo, Onix, FlowComb, Sinbad
• Storage: Megastore, Cassandra
• Disk locality irrelevant: FDS