Decentralizing Grids
Jon Weissman
University of Minnesota
E-Science Institute
Nov. 8, 2007

Roadmap
• Background
• The problem space
• Some early solutions
• Research frontier/opportunities
• Wrapup

Background
• Grids are distributed … but also centralized
– Condor, Globus, BOINC, Grid services, VOs
– Why? They are client-server based
• Centralization pros
– security, policy, global resource management
• Decentralization pros
– reliability, dynamic, flexible, scalable
– **fertile CS research frontier**

Challenges
• May have to live within the Grid ecosystem
– Condor, Globus, Grid services, VOs, etc.
– first-principles approaches are risky (Legion)
• 50K-foot view
– How to decentralize Grids yet retain their existing features?
– high performance, workflows, performance prediction, etc.

Decentralized Grid Platform
• Minimal assumptions about each “node”
• Nodes have associated “assets” (A)
– basic: CPU, memory, disk, etc.
– complex: application services
– exposed interface to assets: OS, Condor, BOINC, Web service
• Nodes may be up or down
• Node trust is not a given (asked to do X, does Y instead)
• Nodes may connect to other nodes, or not
• Nodes may be aggregates
• Grid may be large, > 100K nodes; scalability is key

Grid Overlay
[Diagram: an overlay spanning a Grid service, a Condor network, a BOINC network, and raw OS services]

Grid Overlay – Join
[Diagram: a node joining the overlay]

Grid Overlay – Departure
[Diagram: a node leaving the overlay]

Routing = Discovery
• A node issues “discover A”
• The query contains sufficient information to locate a node: RSL, ClassAd, etc.
• Exact match or semantic match
• [Animation: the query is routed across the overlay until a node offering A is found – bingo!]
• The discovered node returns a handle sufficient for the “client” to interact with it: perform service invocation, job/data transmission, etc.

Routing = Discovery
• Three parties
– the initiator of discovery events for A
– the client: invocation, health of A
– the node offering A
• Often the initiator and the client will be the same
• Other times the client will be determined dynamically
– if W is a web service and results are returned to a calling client, we want to locate CW near W
– => discover W, then CW!
• [Animation: further frames show a failed hop (X) rerouted until a match, discovery on behalf of an outside client, and a plural query, “discover A’s”]

Grid Overlay
• This generalizes …
– resource query (the query contains the job requirements)
– looks like decentralized “matchmaking”
• These are the easy cases …
– independent simple queries
• find a CPU with characteristics x, y, z
• find 100 CPUs, each with x, y, z
– suppose queries are complex or related?
• find N CPUs with aggregate power = G Gflops
• locate an asset near a previously discovered asset

Grid Scenarios
• Grid applications are more challenging
– an application has a more complex structure: multi-task, parallel/distributed, control/data dependencies
• an individual job/task needs a resource near a data source
• workflows
• queries are not independent
– metrics are collective
• not simply raw throughput
• makespan
• response
• QoS

Related Work
• Maryland/Purdue
– matchmaking over a CAN
• Oregon CCOF
– time-zone based

Related Work (cont’d)
• None of these approaches addresses the Grid scenarios in a decentralized manner:
– complex multi-task data/control dependencies
– collective metrics

50K-Ft Research Issues
• Overlay architecture
– structured, unstructured, hybrid: what is the right architecture?
• Decentralized control/data dependencies
– how to do it?
• Reliability
– how to achieve it?
• Collective metrics
– how to achieve them?
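To ground the routing-as-discovery idea above, here is a minimal sketch of TTL-bounded query forwarding in an unstructured overlay. It assumes a simple threshold match on numeric asset attributes; the `Node` class, the attribute names, and the random-walk forwarding are illustrative stand-ins, not the protocol of any particular system named in these slides.

```python
import random

class Node:
    """One overlay node exposing named assets (CPU, memory, services, ...)."""
    def __init__(self, name, assets):
        self.name = name
        self.assets = assets        # e.g. {"cpu_ghz": 2.4, "mem_gb": 8}
        self.neighbors = []         # Node objects this node links to

    def matches(self, query):
        # Exact/threshold match; a semantic matcher could be plugged in here.
        return all(self.assets.get(k, 0) >= v for k, v in query.items())

    def discover(self, query, ttl=6, visited=None):
        """Route the query through the overlay; return a handle (here, the
        matching Node itself) that the client can use to invoke the asset."""
        if visited is None:
            visited = set()
        visited.add(self.name)
        if self.matches(query):
            return self             # bingo!
        if ttl == 0:
            return None
        for nbr in random.sample(self.neighbors, len(self.neighbors)):
            if nbr.name not in visited:
                found = nbr.discover(query, ttl - 1, visited)
                if found is not None:
                    return found
        return None

# "discover A": find a node with >= 2 GHz CPU and >= 4 GB memory.
a = Node("a", {"cpu_ghz": 1.0})
b = Node("b", {"cpu_ghz": 1.5, "mem_gb": 2})
c = Node("c", {"cpu_ghz": 2.4, "mem_gb": 8})
a.neighbors, b.neighbors = [b], [c]
hit = a.discover({"cpu_ghz": 2.0, "mem_gb": 4})
print(hit.name if hit else "no match")   # -> c
```

A structured overlay (e.g., a DHT or CAN, as in the Maryland/Purdue work) would replace the random walk with deterministic routing on a key derived from the query, trading matching flexibility for bounded lookup cost; that trade-off is exactly the architecture question raised above.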
Context: Application Model
[Diagram legend: answer, data source, component/service, request/job/task]

Context: Application Models
• Reliability • Collective metrics • Data dependence • Control dependence

Context: Environment
• RIDGE project – ridge.cs.umn.edu
– reliable infrastructure for donation grid environments
• Live deployment on PlanetLab – planet-lab.org
– 700 nodes spanning 335 sites and 35 countries
– emulators and simulators
• Applications
– BLAST
– traffic planning
– image comparison

Application Models (roadmap): Reliability · Collective metrics · Data dependence · Control dependence

Reliability Example
[Workflow graph over nodes B, C, D, E, G]
• B discovers component G; a client node CG is responsible for G’s health
• B’s request to G carries loc(CG), so the client can monitor the interaction
• The discovery could also run the other way: discover G, then CG
• [Animation: G fails (X); CG detects the failure and re-discovers a replacement G]

Client Replication
• The client itself may be replicated: CG1 and CG2
• loc(G), loc(CG1), loc(CG2) are propagated
• [Animation: CG1 fails (X)] – client “hand-off” to CG2 depends on the nature of G and the interaction

Component Replication
• The component may be replicated as well: CG monitors replicas G1 and G2

Replication Research
• Nodes are unreliable
– crash, hacked, churn, malicious, slow, etc.
• How many replicas?
– too many – waste of resources
– too few – application suffers

System Model
• Reputation rating ri – degree of node reliability
[Diagram: donor nodes with ratings ranging from 0.3 to 0.9]
• Dynamically size the redundancy based on ri
• Nodes are not connected to each other and check in to a central server
• Note: variable-sized groups

Reputation-based Scheduling
• Reputation rating
– techniques for estimating reliability based on past interactions
• Reputation-based scheduling algorithms
– use the reliabilities for allocating work
– rely on a success-threshold parameter

Algorithm Space
• How many replicas?
– first-fit, best-fit, random, fixed, …
– the algorithms compute how many replicas are needed to meet a success threshold (see the sketch below)
• How to reach consensus?
– M-first (better for timeliness)
– majority (better for Byzantine threats)
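To make the group-sizing step concrete, here is a minimal sketch of first-fit replica selection against a success threshold. It treats each node’s reputation rating as an independent probability of a correct, timely result – a simplifying assumption – and the function name and data layout are hypothetical.

```python
def size_replica_group(candidates, threshold):
    """Add replicas until P(at least one replica succeeds) >= threshold.

    candidates: (node_id, r) pairs, r = reputation rating in [0, 1].
    First-fit: candidates are taken in the order offered; sorting them
    by rating first would give a best-fit variant.
    """
    group, p_all_fail = [], 1.0
    for node_id, r in candidates:
        group.append(node_id)
        p_all_fail *= (1.0 - r)        # P(every replica chosen so far fails)
        if 1.0 - p_all_fail >= threshold:
            break
    return group

# Ratings 0.9, 0.8, 0.7 against a 0.99 threshold -> a group of three:
print(size_replica_group([("n1", 0.9), ("n2", 0.8), ("n3", 0.7), ("n4", 0.4)], 0.99))
```

This corresponds to M-first consensus with M = 1; a majority-voting variant would instead size the group so that the probability of a correct majority meets the threshold, which needs more replicas but tolerates Byzantine answers.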
Experimental Results: Correctness
[Figure: simulation of Byzantine behavior under majority voting]

Experimental Results: Timeliness
[Figure: M-first (M=1); best BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE]

Next Steps
• Nodes are decentralized, but trust management is not!
• Need a peer-based trust-exchange framework
– Stanford’s EigenTrust project
– local exchange until the network converges to a global state

Application Models (roadmap): Reliability · Collective metrics · Data dependence · Control dependence

Collective Metrics
• Throughput is not always the best metric
• Response, completion time, application-centric metrics
– makespan, response

Communication Makespan
• Nodes download data from replicated data nodes
– nodes choose “data servers” independently (decentralized)
– minimize the maximum download time over all worker nodes (communication makespan)
• Example application: BLAST, where data download dominates

Data Node Selection
• Several possible factors
– proximity (RTT)
– network bandwidth
– server capacity

[Figures: Download Time vs. RTT (linear) and Download Time vs. Bandwidth (exp) – mean download time (msec) against concurrency (1–10) for PlanetLab sites flux, tamu, venus, ksu, ubc, wroc]

Heuristic Ranking Function
• Query to get candidates; RTT/bandwidth probes
• Node i, data-server node j
– cost function = rtt_ij · exp(k_j / bw_ij), where k_j reflects load/capacity
• Each node independently selects the least-cost data node
• Three server-selection heuristics that use k_j
– BW-ONLY: k_j = 1
– BW-LOAD: k_j = n-minute average load (past)
– BW-CAND: k_j = # of candidate responses in the last m seconds (~ future load)

Performance Comparison
[Figure: download performance of the server-selection heuristics]

Computational Makespan
• When compute dominates: BLAST
[Figure: computational makespan, equal-sized vs. variable-sized groups]

Next Steps
• Other makespan scenarios
• Eliminate probes for bandwidth and RTT -> estimation
• Richer collective metrics
– deadlines: user-in-the-loop

Application Models (roadmap): Reliability · Collective metrics · Data dependence · Control dependence

Data Dependence
• A data-dependent component needs access to one or more data sources
– the data may be large
• [Animation: “discover A” is issued for the component – where to run it?]

The Problem
• Where to run a data-dependent component?
– determine a candidate set
– select a candidate
• It is unlikely a candidate knows its downstream bandwidth from particular data nodes
• Idea: infer bandwidth from neighbor observations with respect to the data nodes!

Estimation Technique
• Candidate C1 may have had little past interaction with the data source
– … but its neighbors may have (likewise for candidate C2)
• For each neighbor, generate a download estimate from:
– DT: the neighbor’s prior download time from the data source
– RTT: from the candidate and from the neighbor to the data source
– DP: an average weighted measure of prior download times for any node to any data source

Estimation Technique (cont’d)
• Download Power (DP) characterizes the download capability of a node
– DP = average(DT · RTT)
– DT alone is not enough (a far-away vs. a nearby data source)
• An estimate is associated with each neighbor ni
– ElapsedEst[ni] = α · β · DT
• α: my_RTT / neighbor_RTT (to the data source)
• β: neighbor_DP / my_DP
• No active probes: historical data plus RTT inference
• Combining the neighbor estimates
– mean, median, min, …
– the median worked the best
• Take the min over all candidate estimates

Comparison of Candidate Selection Heuristics
[Figure: Impact of Neighbor Size (Mix, N=8, Trial=50k) – mean elapsed time (sec) vs. neighbor size (2–32) for OMNI, RANDOM, PROXIM, SELF, and NEIGHBOR; SELF uses direct observations]

Take Away
• Next steps
– routing to the best candidates
• Locality between a data source and a component
– scalable, no probing needed
– many uses

Application Models (roadmap): Reliability · Collective metrics · Data dependence · Control dependence

The Problem
• How to enable decentralized control?
– propagate downstream graph stages
– perform distributed synchronization
• Idea
– distributed dataflow
– token matching
– graph forwarding, futures (Mentat project)

Control Example
[Workflow graph over B, C, D, E, G; B’s output token is routed to a control node for token matching – a sketch of the matcher follows this sequence]
• Tokens name the downstream stage and its firing expression: {C, G} and {D, G} for G’s consumers; {E, B*C*D} for B, C, and D
• When B, C, and D complete, each emits a token carrying its output location: {E, B*C*D, loc(SB)}, {E, B*C*D, loc(SC)}, {E, B*C*D, loc(SD)}
– output stored at loc(…) – where the component ran, at the client, or on a storage node
• [Animation: the three tokens converge on a control node, which matches them and fires E]
• How to color and route tokens so that they arrive at the same control node?
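Here is a minimal sketch of the token-matching step at a control node: tokens carrying {stage, firing inputs, loc(…)} accumulate until the firing expression is satisfied. The class and method names are illustrative, not an actual RIDGE or Mentat interface.

```python
from collections import defaultdict

class ControlNode:
    """Collects dataflow tokens for downstream stages and fires a stage
    once all of its input tokens have arrived (token matching)."""
    def __init__(self):
        self.pending = defaultdict(dict)   # stage -> {input name: loc of output}

    def deposit(self, stage, required, input_name, loc):
        """A finished component deposits a token, e.g. {E, B*C*D, loc(SB)}."""
        self.pending[stage][input_name] = loc
        if set(required) <= set(self.pending[stage]):   # expression satisfied
            self.fire(stage, self.pending.pop(stage))

    def fire(self, stage, inputs):
        # In the real system this would discover a node to run `stage`
        # and forward the remaining graph; here we just report readiness.
        print(f"firing {stage} with inputs at {inputs}")

cn = ControlNode()
cn.deposit("E", ["B", "C", "D"], "B", "loc(SB)")
cn.deposit("E", ["B", "C", "D"], "C", "loc(SC)")
cn.deposit("E", ["B", "C", "D"], "D", "loc(SD)")   # the third token fires E
```

Getting all three tokens to the same control node is exactly the coloring question the slide raises; one plausible answer is to hash an (application id, stage) pair to an overlay key so that matching tokens route to one matcher.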
Open Problems
• Support for global operations
– troubleshooting – what happened?
– monitoring – application progress?
– cleanup – the application died; clean up its state
• Load balance across different applications
– routing to guarantee dispersion

Summary
• Decentralizing Grids is a challenging problem
• Re-think systems, algorithms, protocols, and middleware => fertile research
• Keep our “eye on the ball”
– reliability, scalability, and maintaining performance
• Some preliminary progress on “point solutions”

My Visit
• Looking to apply some of these ideas to existing UK projects via collaboration
• Current and potential projects
– decentralized dataflow (Adam Barker)
– decentralized applications: haplotype analysis (Andrea Christoforou, Mike Baker)
– decentralized control: OpenKnowledge (Dave Robertson)
• Goal
– improve the reliability and scalability of applications and/or infrastructures

Questions

EXTRAS

Non-stationarity
• Nodes may suddenly shift gears
– deliberately malicious, virus, detach/rejoin
– the underlying reliability distribution changes
• Solution
– window-based rating
– adapt/learn ltarget
• Experiment: blackout at round 300 (30% affected)
[Figures: “Adapting …” and “Adaptive Algorithm” vs. “Scheduling Algorithms” – throughput and success rate over time]

Estimation Accuracy
• Objects: 27 (0.5 MB – 2 MB)
• Nodes: 130 on PlanetLab
• Download: 15,000 times from a randomly chosen node
• The Download Elapsed Time Ratio (x-axis) is the ratio of the estimate to the real measured time
– ‘1’ means a perfect estimate
• Accept if the estimate is within measured ± (measured · error)
– accept with error = 0.33: 67% of the total are accepted
– accept with error = 0.50: 83% of the total are accepted

Impact of Churn
[Figure: Comparison of Elapsed Time (Candidate=8, Neighbor=8) – ratio to omniscient over 50k queries, without churn and with churn rates 0.1%, 0.5%, 1.0%; Random and Global(Prox) means shown for reference]

Estimating RTT
• We use distance = √(RTT + 1)
• A simple RTT inference technique based on the triangle inequality (see the sketch below)
• Triangle inequality: Latency(a,c) ≤ Latency(a,b) + Latency(b,c)
– hence |Latency(a,b) − Latency(b,c)| ≤ Latency(a,c) ≤ Latency(a,b) + Latency(b,c)
• Pick the intersected area as the range, and take the mean
[Diagram: lower and upper bounds via neighbors A, B, and C; the intersected range gives the final inference]

RTT Inference Result
• More neighbors, greater accuracy
• With 5 neighbors, 85% of the total have < 16% error
[Figure: CDF of |inferred − measured| latency difference for N = 2, 3, 5, 10]
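A minimal sketch of that inference, assuming both endpoints have measured RTTs to a shared set of neighbors; the function and argument names are illustrative, and the guard for inconsistent measurements (bounds that cross) is an added assumption, not specified in the deck.

```python
def infer_rtt(server_rtts, client_rtts):
    """Infer RTT(server, client) from shared neighbors via the triangle
    inequality:  |RTT(s,n) - RTT(n,c)| <= RTT(s,c) <= RTT(s,n) + RTT(n,c).

    server_rtts, client_rtts: dicts mapping neighbor -> measured RTT (ms).
    Returns the midpoint of the intersected [max lower, min upper] range.
    """
    shared = server_rtts.keys() & client_rtts.keys()
    if not shared:
        return None
    lower = max(abs(server_rtts[n] - client_rtts[n]) for n in shared)
    upper = min(server_rtts[n] + client_rtts[n] for n in shared)
    if lower > upper:                 # measurements violate the triangle
        lower, upper = upper, lower   # inequality (a small share of paths do)
    return (lower + upper) / 2.0

# Two shared neighbors bound RTT(s, c) to [20, 60]; the estimate is 40.
print(infer_rtt({"n1": 50, "n2": 30}, {"n1": 30, "n2": 30}))
```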
Other Constraints
[Workflow graph over A, B, C, D, E]
• Tokens: {E, B*C*D} at B, C, and D; {C, A, dep-CD} and {D, A, dep-CD} for A’s consumers
• C & D interact, and they should be co-allocated, nearby …
• The dep-CD tokens (bold on the slide) should route to the same control point so that a collective query for C & D can be issued

Support for Global Operations
• Troubleshooting – what happened?
• Monitoring – application progress?
• Cleanup – the application died; clean up its state
• Solution mechanism: propagate control-node IPs back to the origin (=> the origin IP is piggybacked)
• Control nodes and matcher nodes report progress (or the lack thereof, via timeouts) to the origin
• Load balance across different applications

Combining Neighbors’ Estimation
[Figure: acceptance rate within 50% error vs. neighbor size (0–30) for RANDOM, CLOSEST, MEAN, MEDIAN, RANK, WMEAN, TRMEAN]
• MEDIAN shows the best results
– using 3 neighbors, the error is within 50% 88% of the time (the variation in download times is a factor of 10–20)
• 3 neighbors gives the greatest bang

Effect of Candidate Size
[Figure: Impact of Candidate Size (Mix, N=8, Trial=25k) – mean elapsed time (sec) vs. candidate size (2 to 32, and ALL) for OMNI, RANDOM, PROXIM, SELF, NEIGHBOR]

Performance Comparison
• Parameters: data size 2 MB, replication 10, candidates 5

Computation Makespan (cont’d)
• Now bring in reliability …
[Figure: makespan improvement scales well with the number of components]

Token Loss
• A token may be lost between B and the matcher, or between the matcher and the next stage
– the matcher must notify CB when the token arrives (pass loc(CB) with B’s token)
– the destination (E) must notify CB when the token arrives (pass loc(CB) with B’s token)

RTT Inference
• ≥ 90–95% of Internet paths obey the triangle inequality
– RTT(a,c) ≤ RTT(a,b) + RTT(b,c)
– upper bound: RTT(server,c) ≤ RTT(server,ni) + RTT(ni,c)
– lower bound: |RTT(server,ni) − RTT(ni,c)|
• Iterate over all neighbors to get max L and min U
• Return the midpoint
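Closing the loop on the estimation extras: a minimal sketch of the neighbor-based download-time estimate, ElapsedEst[ni] = α · β · DT, combined with the median, which the deck found best. The record layout and field names are assumptions for illustration, and in the deck the RTTs would come from the inference above rather than from active probes.

```python
import statistics

def estimate_download_time(cand, neighbors):
    """Estimate `cand`'s download time from a data source it has never used.

    cand:      {"rtt": RTT to the data source, "dp": its download power}
    neighbors: records with the neighbor's prior download time from that
               source ("dt"), its RTT to the source ("rtt"), and its
               download power ("dp"), where DP = average(DT * RTT).
    """
    estimates = []
    for n in neighbors:
        alpha = cand["rtt"] / n["rtt"]   # my_RTT / neighbor_RTT (to the source)
        beta = n["dp"] / cand["dp"]      # neighbor_DP / my_DP
        estimates.append(alpha * beta * n["dt"])
    return statistics.median(estimates) if estimates else None

# A candidate twice as far (by RTT) as a neighbor that took 4 s to download:
print(estimate_download_time(
    {"rtt": 100, "dp": 50},
    [{"dt": 4.0, "rtt": 50, "dp": 50}, {"dt": 6.0, "rtt": 80, "dp": 40}]))
```

Per the Estimation Technique slide, the client would compute this estimate for each candidate and take the minimum over all candidates when placing the component.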