Slide 1: Toward Third Generation Desktop Grids (Private Virtual Cluster)
Ala Rezmerita, Franck Cappello
INRIA Grand-Large.lri.fr

Slide 2: Agenda
• Basic concepts of DGrids
• First and second generation Desktop Grids
• Third generation concept
• PVC
• Early evaluation
• Conclusion

Slide 3: Basic Concepts of Desktop Grids
• Bag of tasks, master-worker, divide and conquer applications
• Batch schedulers (clusters)
• Virtual machines (Java, .NET)
• Standard OS (Windows, Linux, Mac OS X)
• Cycle stealing (desktop PCs)
• Condor (single administration domain)

Slide 4: First Generation DG
• Single application / single user
  – SETI@HOME (1998)
    • Search for Extra-Terrestrial Intelligence
    • 33.79 Teraflop/s in 2003 (versus 12.3 Teraflop/s for the ASCI White!)
  – DECRYPTHON
    • Protein sequence comparison
  – RSA-155 (1996?)
    • Breaking encryption keys
  – COSM

Slide 5: First Gen DG Architecture
Centralized, monolithic architecture (diagram): the client application exchanges parameters and results with a coordinator that provides the user/admin interface and resource discovery; the coordinator dispatches parameters to, and collects results from, PCs behind firewalls/NATs, each running the application, a scheduler, and the task/data/network protocols on top of the OS and a sandbox.

Slide 6: Second Generation of DG
• Multi-application / "multi-user" platforms:
  – BOINC (2002?)
    • SETI@home, Genome@home, XtremLab…
  – XtremWeb (2001)
    • XtremWeb-CH, GTRS, XW-HEP, etc.
  – Platform (ActiveCluster), United Devices, Entropia, etc.
  – Alchemi (.NET based)

Slide 7: Second Gen DG Architecture
Still a centralized, monolithic architecture, but with split task and data management and some inter-node communication (diagram): the client application exchanges parameters and results with a coordinator/task scheduler and a separate data manager; worker PCs behind firewalls/NATs run the application and the task/data/network protocols on top of the OS and a sandbox.

Slide 8: What we have learned from the history
• Rigid architecture (open source: yes, modularity: no)
  – Dedicated job scheduler
  – Dedicated data management/file system
  – Dedicated connection protocols
  – Dedicated transport protocols
• Centralized architecture
• No direct communication
• Almost no security
• Restricted application domain (essentially High Throughput Computing)

Slide 9: Third Generation Concept
• Modular architecture
  – Pluggable / selectable job scheduler
  – Pluggable / selectable data management/file system
  – Pluggable / selectable connection protocols
  – Pluggable / selectable transport protocols
• Decentralized architecture
• Direct communications
• Strong security
• Unlimited application domain (restrictions imposed only by the platform performance)
(Diagram: every PC hosts the user/admin interface, applications, schedulers, and the task/data/network protocols on top of the OS and a sandbox, with no central coordinator.)

Slide 10: 3rd Gen DGrid: design
Virtualization at the IP level! Software stack on every PC:
  – Apps: binary codes
  – Scheduler/runtime: Condor, MPI
  – Data management: LUSTRE / BitTorrent
  – OS: XEN / Virtual PC
  – Connectivity / Security: IP virtualization
  – Network

Slide 11: PVC (Private Virtual Cluster)
A generic framework that dynamically turns a set of resources belonging to different administration domains into a cluster.
• Connectivity/Security
  – Dynamically connects firewall-protected nodes without breaking the security rules of the local domains
• Compatibility
  – Creates an execution environment for existing cluster applications and tools (schedulers, file systems, etc.)
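A minimal sketch of what this compatibility goal means from the application's point of view, assuming a hypothetical peer name (node2.pvc) under PVC's virtual domain and an arbitrary port; the application uses nothing but standard sockets, and PVC's virtualization, interposition and connectivity modules (next slides) do the rest underneath:

```python
import socket

# Hypothetical peer name under the virtual ".pvc" domain; the port is arbitrary.
VIRTUAL_PEER = "node2.pvc"
PORT = 5000

# Exactly the code an application would run inside a physical cluster: standard
# name resolution followed by a standard TCP connect. The name resolves to a
# class E virtual IP (240.0.0.1 - 255.255.255.254); PVC translates it to the
# peer's real address and traverses firewalls/NATs transparently.
addr = socket.gethostbyname(VIRTUAL_PEER)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((addr, PORT))
sock.sendall(b"hello from an unmodified cluster application\n")
sock.close()
```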
Slide 12: PVC Architecture
Peer modules:
  – Coordination
  – Communication interposition
  – Network virtualization
  – Security
  – Connectivity
Broker:
  – Collects connection requests
  – Forwards data between peers
  – Realizes the virtualization
  – Helps in the security negotiation

Slide 13: Virtualization
• Virtual network on top of the real one
• Uses a virtual network interface
• Provides virtual-to-real IP address translation
• Features its own IP range and DNS
The proposed solution respects the IP standards:
  – Virtual IPs: class E addresses (240.0.0.1 – 255.255.255.254)
  – Virtual names: a virtual domain name (.pvc)

Slide 14: Interposition
• Catch the applications' communications and transform them for transparent tunneling
• Interposition techniques:
  – LibC overload (IP masquerading, modified kernel)
  – Tun/Tap (encapsulation: IP over IP)
  – Netfilter (IP masquerading, kernel level)
  – Any other…

Slide 15: Interposition techniques
(Diagram: an application packet destined for a 240.x.x.x address is intercepted either (1) in user space by the overloaded LibC loaded via ld_preload, which performs the security check, connection setup, and IP masquerading; or, after kernel route selection, (2) by a Tun/Tap interface, which performs the security check, connection setup, and encapsulation; or (3) by Netfilter, which performs the security challenge, connection setup, and IP masquerading, with a group check, before the packet reaches the standard network interface.)

Slide 16: Connectivity
Goal: direct connections between the peers.
Firewall/NAT traversal techniques:
• UPnP: a firewall configuration technique
• UDP/TCP hole punching: used by online gaming and voice over IP
• Traversing TCP: a novel technique
• Any other…

Slide 17: Security
• Fulfil the security policy of every local domain
• Enforce a cross-domain security policy
A master peer in every virtual cluster:
  • Implements the global security policy
  • Registers new hosts
A PVC peer must:
  • Check the target peer's membership in the same virtual cluster
  • After connection establishment, authenticate the target peer's identity
(Diagram: a new peer puts its public key PKC with the master M and gets the master's public key PKM; the master distributes PkM[PKC], the peer's public key signed with the master's private key. PK: public key; Pk: private key.)

Slide 18: Security
The security protocol is based on a double asymmetric-key mechanism:
  – Steps (1)/(2): membership in the same virtual cluster
  – Steps (3)/(4)/(5): mutual authentication
The PVC security protocol ensures that:
  – Only hosts of the same virtual cluster are connected
  – Only trusted connections become visible to the application

Slide 19: Performance Evaluation
• PVC objective: minimal overhead for communications
  – Network performance with and without PVC
  – Connection establishment
• Execution of real, unmodified MPI applications
  – NAS benchmarks
  – MPIPOV program
  – Scientific application DOT
• Typical Bag of Tasks setting (BLAST on a DNA database)
  – Condor flocking strategies
  – Broadcast protocol
  – Spot checking / replication + voting
• Evaluation platforms
  – Grid'5000 (Grid eXplorer cluster)
  – DSL-Lab (PCs at home, connected to a public ADSL network)

Slide 20: Communication performance
• Connection overhead
• Direct communication overhead

Slide 21: Connection overhead
Performed on the DSL-Lab platform using a specific test suite. The overhead is reasonable in the context of P2P applications.

Slide 22: Bandwidth overhead
Technique: LibC overload. Measured with Netperf on a local PC cluster over three Ethernet networks: 1 Gbps, 100 Mbps, and 10 Mbps.

Slide 23: Communication overhead
Techniques: Tun/Tap and Netfilter.
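A minimal sketch of the Tun/Tap interposition path whose overhead is measured above, assuming a Linux tun device, an illustrative interface name (pvc0) and UDP port, and a virtual-to-real address table obtained from the broker; the real PVC peer also performs the security checks and uses its own encapsulation rather than plain UDP:

```python
import fcntl, os, socket, struct

# Linux tun/tap ioctl constants.
TUNSETIFF, IFF_TUN, IFF_NO_PI = 0x400454CA, 0x0001, 0x1000

def open_tun(name=b"pvc0"):
    """Create a tun interface (requires root). The virtual range is then routed
    through it, e.g.:  ip link set pvc0 up
                       ip route add 240.0.0.0/4 dev pvc0"""
    tun = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(tun, TUNSETIFF, struct.pack("16sH", name, IFF_TUN | IFF_NO_PI))
    return tun

def forward(tun, real_peers, port=46000):
    """Read the IP packets the kernel routed to the virtual range, translate the
    virtual destination into the peer's real address and tunnel the packet."""
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        packet = os.read(tun, 65535)                   # one raw IPv4 packet
        virtual_dst = socket.inet_ntoa(packet[16:20])  # IPv4 destination field
        real_dst = real_peers.get(virtual_dst)         # virtual -> real mapping
        if real_dst:                                   # only known, trusted peers
            out.sendto(packet, (real_dst, port))
```

A symmetric loop on the receiving side decapsulates incoming packets and writes them back to the tun device; the extra user-space copies and the encapsulation are where the overhead measured on this slide comes from.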
Slide 24: MPI applications
Applications:
• NAS benchmarks, class A (EP, FT, CG and BT)
• DOT
• MPIPOV
Results for NAS EP: the measured speedup is almost linear, with an overhead lower than 5%.
Other experiments: performance losses due to the ADSL network.

Slide 25: Typical configuration for Bag of Tasks (SETI@home like)
Using PVC! Software stack on every PC:
  – Application: BLAST (DNA)
  – Result certification
  – Scheduler/runtime: Condor
  – Data management: BitTorrent
  – Connectivity / Virtualization: PVC
  – OS
  – Network

Slide 26: Broadcast protocol?
Question: can BitTorrent replace Condor's transport protocol?
Setup: BLAST application, DNA database (3.2 GB), 64 nodes.
Result: Condor execution time grows proportionally with the number of jobs, while Condor + BitTorrent execution time stays almost constant.

Slide 27: Distribution of job management?
Question: how many job managers for a set of nodes? (Condor flocking)
Setup: 64 sequences of 10 BLAST jobs (between 40 and 160 seconds each, with an average of 70 seconds).

Slide 28: Result certification?
Question: how to detect bad results?
Two techniques, neither implemented in Condor, each about 20 lines of script (a sketch of both appears after the backup slide):
• Spot checking (blacklisting)
• Replication + voting
Test: 70 machines, 10% saboteurs (randomly chosen). How many jobs are required to detect the 7 saboteurs?

Slide 29: Computational/Data Desktop Grids
• BOINC/XtremWeb-like applications
• User-selected scheduler (Condor, Torque, OAR, etc.)
• Communication between workers (MPI, distributed file systems, distributed archives, etc.)
"Instant Grid"
• Connecting resources without effort: family Grids, inter-school Grids, etc.
• Sharing resources and content. Example: an Apple TV synchronized with a remote iTunes.
"Passe muraille" (wall-traversing) runtime
• Applications, OpenMPI
Extended Clusters
• Run applications and manage resources beyond the limits of administration domains

Slide 30: Conclusion
Third Generation Desktop Grids (2007)…
• Break the rigidity, again!
• Let users choose and run their favourite environments (engineers may help)
• PVC: connectivity + security + compatibility
  – Dynamically establishes virtual clusters
  – Modular, extensible architecture
  – Features the properties required for 3rd generation Desktop Grids: a security model, use of applications without any modification, and minimal communication overhead
• Ongoing work (on Grid'5000)
  – Test the scalability and fault tolerance of cluster tools in the DGrid context
  – Test more applications
  – Test and improve the scalability of the security system

Slide 31: Questions?

Slide 32: Condor flocking with PVC (backup)
Question: may we use several schedulers?
Setup: a synthetic job that consumes resources for 1 second; a sequence of 1000 submissions; the synthetic jobs are submitted from the same host to the Condor pool.

              Min        Max        Average    Stdev
  1 Master    00:00:27   00:33:49   00:17:09   00:09:38
  2 Masters   00:00:06   00:33:29   00:16:48   00:09:39
  4 Masters   00:00:13   00:33:34   00:16:54   00:09:38
  16 Masters  00:00:05   00:31:13   00:16:42   00:09:30
  32 Masters  00:00:05   00:33:27   00:16:46   00:09:39
  64 Masters  00:00:05   00:32:41   00:16:00   00:09:37

Future work: make Condor flocking fault tolerant.
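Slide 28 notes that spot checking and replication plus voting each fit in roughly 20 lines of script; the sketch below illustrates both ideas. The job results, blacklist handling and majority threshold are illustrative assumptions, not the scripts actually used in the experiment:

```python
from collections import Counter

def majority_vote(results):
    """Replication + voting: keep a result only if a strict majority of the
    replicated executions agree; otherwise the job should be resubmitted."""
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None

def spot_check(worker, answer, expected, blacklist):
    """Spot checking: occasionally submit a job whose result is already known
    and blacklist any worker that returns a wrong answer."""
    if answer != expected:
        blacklist.add(worker)
    return worker not in blacklist

# Illustrative use: three replicas of one BLAST job, one of them sabotaged.
print(majority_vote(["hitlist-A", "hitlist-A", "hitlist-B"]))  # -> hitlist-A

blacklist = set()
spot_check("worker-17", "wrong answer", "known answer", blacklist)
print(blacklist)  # -> {'worker-17'}
```

Voting tolerates individual wrong answers at the price of redundant computation, while spot checking spends a few known jobs to eliminate saboteurs permanently, which is why the slide asks how many jobs are needed to catch all 7 of them.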