Thus Far
• Locality is important!
  – Need to get processing closer to storage
  – Need to get tasks close to their data
• Rack locality: Hadoop
• Kill a task and restart it when a local slot becomes available: Quincy
• Why? Because the network gives horrible performance
  – Why is the network bad? Over-subscription

What Has Changed?
• The network is no longer over-subscribed
  – Fat-tree, VL2
• The network has fewer congestion points
  – Helios, c-Through, Hedera, MicroTE
• Server uplinks are much faster
• Implication: network transfers are much faster
  – The network is now about as fast as disk I/O
  – The difference between local and rack-local reads is only ~8%
• Storage practices have also changed
  – Compression is being used, so less data needs to be transferred
  – De-replication is being practiced: with only one copy, locality is very hard to achieve anyway

So What Now?
• No need to worry about locality when doing placement
  – Placement can happen faster
  – Scheduling algorithms can be smaller and simpler
• The network is as fast as a SATA disk, but still a lot slower than an SSD
  – If SSDs are used, disk locality is a problem AGAIN!
  – However, SSDs are too costly to use for all storage

Caching with Memory/SSD
• 94% of jobs have inputs that fit in memory
• So the new problem is memory locality
  – Want to place a task where its input data is already in memory
• Interesting challenges:
  – 46% of tasks use data that is never re-used, so these tasks need prefetching
  – Current caching schemes are ineffective

How Do You Build a File System That Ignores Locality?
• FDS from MSR ignores locality
• Eliminate the networking problems to remove the importance of locality
• Eliminate the metadata-server problems to improve whole-system throughput

Meta-data Server
• The current metadata server (Hadoop's name-node):
  – Stores the mapping of chunks to servers
  – Is a central point of failure
  – Is a central bottleneck
• Processing issue: every read and write must first consult the metadata server
• Storage issue: it must store the location and size of EVERY chunk

FDS's Meta-data Server
• Only stores the list of servers
  – Smaller memory footprint: # servers <<< # chunks
• Clients only interact with it at startup, not every time they read or write
  – Metadata-server load scales with client boots, not with reads/writes (# boots <<< # reads/writes)
• To determine where to read or write, clients use consistent hashing:
  – Read/write the data at the server at position Hash(GUID) mod #servers in the server list (a minimal sketch follows below)
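Below is a minimal sketch of the lookup rule described above, assuming a client that only knows the server list it fetched from the metadata server at startup. The name `server_for` is illustrative, and this simplifies real FDS, which resolves blobs through a tract locator table rather than a bare modulo; the point is the same: placement is computed, not looked up.

```python
import hashlib

def server_for(guid: str, servers: list) -> str:
    """Map a blob GUID to a server using only the server list fetched at
    startup; no per-chunk metadata lookup is needed."""
    digest = hashlib.sha1(guid.encode()).hexdigest()
    index = int(digest, 16) % len(servers)   # Hash(GUID) mod #servers
    return servers[index]

# Every client computes the same placement independently, so readers and
# writers agree on where a blob lives without contacting the metadata server.
servers = ["srv-0", "srv-1", "srv-2", "srv-3"]
print(server_for("blob-1234", servers))
```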
Network Changes
• FDS uses a VL2-style Clos network
  – Eliminates over-subscription and congestion
• One TCP connection does not saturate a server's 10-gig NIC
  – Use 5 TCP connections to saturate the link
• With VL2 there is no congestion in the core, but there may be congestion at the receiver
  – The receiver controls the senders' sending rates by sending them rate-limiting messages

Disk Locality Is Almost a Distant Problem
• Advances in networking eliminate over-subscription and congestion
• We have a prototype, FDS, that does not need locality
  – Uses VL2
  – Eliminates metadata servers
• New problem, new challenges: memory locality
  – New cache-replacement techniques
  – New pre-caching schemes

Class Wrap-Up
• What have we covered and learned?
• The big-data stack:
  – How do we optimize each layer?
  – What are the challenges in each layer?
  – Are there opportunities to optimize across layers?

Big-Data Stack: App Paradigms
[Stack diagram: layers App, Sharing, Virt Drawbacks, N/W Paradigm, Tail Latency, N/W Sharing, SDN, Storage; the systems covered in class are added to each layer as the lecture proceeds. The full diagram is summarized at the end of these notes.]
• Commodity devices impact the design of application paradigms
  – Hadoop: dealing with failures
    • Addresses n/w over-subscription with rack-aware placement
    • Straggler detection and mitigation: restart tasks
  – Dryad: Hadoop for smarter programmers
    • Can create more expressive (acyclic) task DAGs
    • Can determine which tasks should run locally on the same devices
    • Dryad adds optimizations, e.g. extra nodes that do intermediate aggregation

Big-Data Stack: App Paradigms Revisited
• User-visible services are complex and composed of multiple M-R jobs
  – FlumeJava & DryadLINQ:
    • Delay execution until output is required, which allows for various optimizations
    • Storing output to HDFS between M-R jobs adds time, so eliminate HDFS between jobs
    • Programmers aren't perfect and often include unnecessary steps; knowing what the output requires, you can eliminate them

Big-Data Stack: App Paradigms Revisited, Yet Again
• User-visible services require interactivity, so jobs need to be fast and should return results before processing completes
  – Hadoop Online:
    • Pipeline results from map to reduce before the map is done
    • Pipeline too early and the reduce has to do the sorting, which increases processing overhead on the reducer: BAD!
  – RDDs: Spark
    • Store data in memory: much faster than disk
    • Instead of processing immediately, build an abstract graph of the processing and execute it only when output is required, which allows for optimizations
    • Failure recovery is the challenge

Big-Data Stack: Sharing Is Caring
How to share a non-virtualized cluster
• Sharing is good: you have too much data, and it costs too much to build many clusters for the same data
• Sharing must be dynamic: static sharing wastes resources
• Mesos:
  – Resource offers: give applications a choice of resources and let them pick
  – The app knows best
• Omega:
  – Optimistic allocation: each scheduler picks resources; if there is a conflict, Omega detects it and gives the resources to only one scheduler, and the others pick new resources (a minimal sketch follows below)
  – Even with conflicts, this is much better than a centralized entity
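A minimal sketch of the optimistic-allocation idea above, assuming shared cluster state with a per-machine version counter for conflict detection; the class and method names are illustrative, not Omega's actual interface.

```python
class CellState:
    """Shared view of the cluster that every scheduler writes to optimistically."""

    def __init__(self, free_cpus):
        self.free = dict(free_cpus)                  # machine -> free CPUs
        self.version = {m: 0 for m in free_cpus}     # bumped on every commit

    def snapshot(self, machine):
        return self.free[machine], self.version[machine]

    def try_commit(self, claims):
        """claims: list of (machine, cpus, version_seen).
        Apply all claims atomically, or reject the whole transaction on any conflict."""
        for machine, cpus, seen in claims:
            if self.version[machine] != seen or self.free[machine] < cpus:
                return False                         # another scheduler won the race
        for machine, cpus, _ in claims:
            self.free[machine] -= cpus
            self.version[machine] += 1
        return True

# Two schedulers race for the same machine; the loser must re-read state and retry.
cell = CellState({"m1": 8, "m2": 8})
_, seen = cell.snapshot("m1")
print(cell.try_commit([("m1", 4, seen)]))   # True: first claim commits
print(cell.try_commit([("m1", 4, seen)]))   # False: stale version, pick again
```

The occasional wasted work on a conflict is the price paid for removing a single centralized scheduler from the critical path.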
Big-Data Stack: Sharing Is Caring (Cloud Sharing)
• Clouds give the illusion of equality
  – H/W differences lead to different performance
  – Poor isolation: tenants can impact each other (I/O-bound and CPU-bound jobs can conflict)

Big-Data Stack: Better Networks
• Networks give bad performance
  – Cause: congestion + over-subscription
• VL2 / PortLand
  – Eliminate over-subscription and congestion with commodity devices + ECMP
• Helios / c-Through
  – Mitigate congestion by carefully adding new capacity

Big-Data Stack: Better Networks (Tail Latency)
• When you need multiple servers to service a single request:
  – If each server meets its latency target 99% of the time, the chance that all 100 servers do is 0.99^100 ≈ 0.37, so roughly 63% of requests are slowed by at least one straggler (HORRIBLE)
  – Duplicate requests: send the same request to 2 servers, so at least one will finish within an acceptable time
  – Dolly: be smart when selecting the 2 servers (see the worked numbers below)
    • You don't want I/O contention, because that leads to bad performance
    • Avoid maps using the same replicas
    • Avoid reducers reading the same intermediate output
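A quick check of the arithmetic above; the two-copy calculation is a simplified illustration that assumes independent servers (exactly the assumption that I/O contention breaks, which is why Dolly chooses replicas carefully), not a result from the Dolly paper.

```python
# Probability that a 100-server request sees no straggler, assuming each
# server independently meets its latency target 99% of the time.
p_fast, n_servers = 0.99, 100

single = p_fast ** n_servers
print(f"one copy per sub-request:   {single:.2f}")      # ~0.37

# Duplicate every sub-request on 2 servers: it is slow only if BOTH copies
# are slow (assuming independence between the two copies).
p_slow_duplicated = (1 - p_fast) ** 2
duplicated = (1 - p_slow_duplicated) ** n_servers
print(f"two copies per sub-request: {duplicated:.2f}")  # ~0.99
```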
Big-Data Stack: Network Sharing
• How do we share the network efficiently while making guarantees?
• ElasticSwitch
  – A two-level bandwidth-allocation system
• Orchestra
  – M/R has barriers, and completion depends on a set of flows, not on individual flows
  – So make optimizations over the set of flows
• HULL: trade bandwidth for latency
  – Want zero buffering, but TCP needs buffering
  – Limit traffic to 90% of the link and use the remaining 10% as headroom

Big-Data Stack: Enter SDN
• Remove the control plane from the switches and centralize it
• Centralization == scalability challenges
• NOX: how does it scale to data centers? How many controllers do you need?
• How should you design these controllers?
  – Kandoo: a hierarchy (many local controllers and one global controller; the local controllers communicate with the global one)
  – Onix: a mesh (controllers communicate through a DHT or a DB)

Big Data Stack: SDN + Big Data
• FlowComb:
  – Detect application traffic patterns and have the SDN controller assign paths based on knowledge of those patterns and of contention
• Sinbad:
  – HDFS writes are important
  – Let the SDN controller tell HDFS the best place to write data, based on knowledge of network congestion

Big Data Stack: Distributed Storage
• Ideal: nice API, low latency, scalable
• Problem: H/W fails a lot, exists in limited locations, and has limited resources
• Partition, which gives good performance:
  – Cassandra: uses consistent hashing (a ring sketch follows at the end of these notes)
  – Megastore: each partition is an RDBMS with good consistency guarantees
• Replicate, since multiple copies survive failures:
  – Megastore: replicas also allow for low latency

Big Data Stack: Disk Locality Is Becoming Irrelevant
• Data is getting smaller (compression), so transfer times shrink
• Networks are getting much faster (rack-local reads are only ~8% slower than local)
• Memory locality is the new challenge
  – The inputs of 94% of jobs fit in memory
  – Need new caching and prefetching schemes

The complete big-data stack covered this semester:
• App: Hadoop, Dryad, FlumeJava, DryadLINQ, Hadoop Online, Spark
• Sharing: Mesos, Omega
• Virt Drawbacks: BobTail, RFA, Cloud Gaming
• N/W Paradigm: VL2, PortLand, Helios, c-Through, Hedera, MicroTE
• Tail Latency: Dolly (Clones), Mantri
• N/W Sharing: ElasticSwitch, Orchestra, HULL
• SDN: Kandoo, Onix, FlowComb, Sinbad
• Storage: Megastore, Cassandra, FDS (disk locality irrelevant)
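The ring sketch referenced in the Distributed Storage notes: a minimal, assumed illustration of consistent hashing for partitioning, with a single token per node and no virtual nodes or replication (this is not Cassandra's actual API).

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Each node owns the arc of the hash ring that ends at its token."""

    def __init__(self, nodes):
        self._tokens = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # A key lives on the first node whose token is >= hash(key),
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, (_hash(key), ""))
        if idx == len(self._tokens):
            idx = 0
        return self._tokens[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # the same key always maps to the same node
```

Because only the keys on one arc move when a node joins or leaves, the cluster can grow and shrink without reshuffling the whole keyspace.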