Grid and Cloud Computing: Real-life Instances of Distributed Systems
Adriana Iamnitchi, University of South Florida
anda@cse.usf.edu
http://www.cse.usf.edu/~anda

Grid: Definitions
• Definition 1: Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities (1998)
• Definition 2: A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial quality of service (2002)

Grid: Resource-Sharing Environment
• Users:
  – 1000s of users from 10s of institutions
  – Well-established communities
• Resources:
  – Computers, data, instruments, storage, applications
  – Owned and administered by institutions
• Applications: data- and compute-intensive processing
• Approach: common infrastructure

The Globus Toolkit®
(Includes slides borrowed freely from the Globus team)

How It Started
While building and integrating a diverse range of distributed applications, the same problems kept showing up over and over again:
– Too hard to keep track of authentication data (ID/password) across institutions
– Too hard to monitor system and application status across institutions
– Too many ways to submit jobs
– Too many ways to store & access files and data
– Too many ways to keep track of data
– Too easy to leave "dangling" resources lying around (robustness)

Grid Architecture in a Nutshell
Forget homogeneity!

Grid Services vs. Web Services
• The Web Services Resource Framework (WSRF), a specification developed by OASIS, specifies how to make web services stateful.
  – Joint effort between the Grid and Web Services communities
• OGSA: Open Grid Services Architecture
  – Standardizes the common services used in grid applications (job management, resource management, security, etc.) by specifying a set of standard interfaces for these services
• Grid services: implement OGSA

Stateful vs. Stateless Services
[Diagrams contrasting a stateless service with a stateful service]

The Globus Toolkit
• The Globus Toolkit (GT) is a collection of solutions to problems that frequently come up when trying to build collaborative distributed applications.
  – Not turnkey solutions, but building blocks and tools for application developers and system integrators.
  – Some components (e.g., file transfer) go farther than others (e.g., remote job submission) toward end-user relevance.
• To date, the Toolkit has focused on simplifying heterogeneity for application developers.
  – The goal has been to capitalize on and encourage use of existing standards (IETF, W3C, OASIS, GGF).
  – The Toolkit also includes reference implementations of new/proposed standards in these organizations.

Globus Toolkit Components
[Diagram of GT2/GT3/GT4 components by area — Security: Credential Management, Community Authorization Service, Delegation Service, WS and Pre-WS Authentication/Authorization; Data Management: GridFTP, Reliable File Transfer, Replica Location Service, OGSA-DAI [Tech Preview]; Execution Management: WS GRAM, Pre-WS GRAM, Community Scheduler Framework [contribution]; Information Services: MDS4, MDS2; Common Runtime: Java WS Core, C WS Core, Python WS Core [contribution], C Common Libraries, XIO]

How It Really Happens
[Diagram: users work with client applications (web browser, web portal, simulation tool, data viewer tool, chat tool, telepresence monitor); application services (registration service, credential repository, data catalog, certificate authority) organize VOs and enable access to other services; collective services aggregate and/or virtualize resources; resources (compute servers, cameras, database services) implement standard access & management interfaces]

How It Really Happens (without Globus)
[Same architecture, built from scratch: roughly 10 components written by the application developer, 12 off the shelf, 0 from the Globus Toolkit, 0 from the Grid community]

How It Really Happens (with Globus)
[Same architecture using Globus GRAM, Globus Index Service, CHEF, the CHEF chat teamlet, MyProxy, Globus DAI, the Globus Certificate Authority, and Globus MCS/RLS: roughly 2 components written by the application developer, 9 off the shelf, 4 from the Globus Toolkit, 4 from the Grid community]

Building a Grid (in Practice)
• Building a Grid system or application is currently an exercise in software integration:
  – Define user requirements
  – Derive system requirements or features
  – Survey existing components
  – Identify useful components
  – Develop components to fill the gaps
  – Integrate the system
  – Deploy and test the system
  – Maintain the system during its operation
• This should be done iteratively, with many loops and eddies in the flow.
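The stateful-vs-stateless distinction behind WSRF can be sketched in a few lines of Python. This is a conceptual toy, not the actual WSRF interfaces: the class and method names are hypothetical, but the pattern (clients address a named resource via a key, and that resource's state persists across invocations) is the one WSRF standardizes.

```python
import uuid

# Stateless service: each call depends only on its inputs; the
# service remembers nothing between invocations.
def stateless_square(x):
    return x * x

# Stateful WSRF-style service: clients first create a resource,
# then address it by its resource key on every later call.
class CounterService:
    def __init__(self):
        self._resources = {}          # resource key -> per-resource state

    def create_resource(self):
        key = str(uuid.uuid4())       # key returned to the client
        self._resources[key] = 0
        return key

    def increment(self, key):
        self._resources[key] += 1
        return self._resources[key]

service = CounterService()
key = service.create_resource()
service.increment(key)                # -> 1
service.increment(key)                # -> 2 (state survived between calls)
```

Two clients holding different keys see independent state, which is exactly what a plain stateless web service cannot offer without the client resending all context on every call.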
Relationships between Globus and Web Services
[Diagram]

Globus Components: GridFTP
• A high-performance, secure data transfer service optimized for high-bandwidth wide-area networks
  – FTP with extensions
  – Uses basic Grid security (control and data channels)
  – Multiple data channels for parallel transfers
  – Partial file transfers
  – Third-party (direct server-to-server) transfers
  – GGF recommendation GFD.20
• Basic transfer: one control channel, several parallel data channels
• Third-party transfer: control channels to each server, several parallel data channels between the servers

(OSGCC 2008, Globus Primer: An Introduction to Globus Software)

Globus Components: Striped GridFTP
• GridFTP supports a striped (multi-node) configuration
  – Establish a control channel with one node
  – Coordinate data channels on multiple nodes
  – Allows use of many NICs in a single transfer
• Requires a shared/parallel filesystem on all nodes
  – On high-performance WANs, aggregate performance is limited by filesystem data rates

Globus Components: Reliable File Transfer
• A WSRF service for queuing file transfer requests
  – Server-to-server transfers
  – Checkpointing for restarts
  – Database back-end for failovers
• Allows clients to request transfers and then "disappear"
  – No need to manage the transfer
  – Status monitoring available if desired

Globus Components: Replica Location Service
• A distributed system for tracking replicated data
  – Consistent local state maintained in Local Replica Catalogs (LRCs)
  – Collective state with relaxed consistency maintained in Replica Location Indices (RLIs)
• Performance features
  – Soft-state maintenance of RLI state
  – Compression of state updates
  – Membership and partitioning information maintenance
• Deployment options:
  – Simple hierarchy: the most basic deployment of RLS
  – Fully connected: high availability of the data at all sites
  – Tiered hierarchy: for very large systems and/or very large collections

From Grids to Cloud Computing?

From Grids to Cloud Computing
• Logical steps:
  – Make the grids public
  – Provide much simpler interfaces (and more limited control)
  – Charge for usage of resources
    • Instead of relying on implicit incentives from science collaborations
    • Ideally, a "pay-as-you-go" rate
• In reality: a different history
  – Cloud computing as utility computing (1966 paper)
• However, the promise of cloud computing finds a great user base in science grids due to:
  – Intense computations
  – Huge storage needs
  – Yet…

P2P vs. Grid vs. Cloud Computing: Google Trends
[Search-volume trends for all regions, US only, and the last 12 months world-wide, annotated with news events:
A: Yahoo in 'cloud computing' research with HP-Intel, WA today – Jul 29, 2008
B: How Cloud Computing Is Changing The World, KMBC.com – Aug 4, 2008
C: 3Tera Brings Windows to Cloud Computing, Earthtimes (press release) – Oct 1, 2008
D: Infrastructure Cloud Computing, SYS-CON Media – Oct 28, 2008
E: Cloud Computing Expo: Cloud Reaches Washington, DC, SYS-CON Brasil (Assinatura) – Jan 26, 2009
F: Acumen Solutions First to Launch Cloud Computing Practice to Deliver Innovative Solutions to Government, Trading Markets (press release) – Feb 25, 2009]

What is Cloud Computing?
• Old idea: Software as a Service (SaaS)
  – Def: delivering applications over the Internet
• Recently: "[hardware, infrastructure, platform] as a service"
  – Poorly defined, so we avoid all "X as a service"
• Utility computing: pay-as-you-go computing
  – Illusion of infinite resources
  – No up-front cost
  – Fine-grained billing (e.g., hourly)
• Cloud computing: a new term for the long-held dream of utility computing (first articulated in 1966)
  – Refers to both the applications delivered as services over the Internet and the hardware and software systems in the datacenters that provide those services

Why Now?
• Experience with very large datacenters
  – Unprecedented economies of scale
• Other factors
  – Pervasive broadband Internet
  – Fast x86 virtualization
  – Pay-as-you-go billing model
  – Standard software stack

Amazon S3 for Science Grids: A Viable Solution?
Joint work with Mayur Palankar (USF), Matei Ripeanu (UBC), and Simson Garfinkel (Harvard)

Overview
• Science grids:
  – Data-intensive scientific collaborations
  – Produce, analyze, and archive huge volumes of data (petabytes)
  – High data management and maintenance costs
  – Files are often used by groups of users rather than individually
• Amazon Simple Storage Service (S3):
  – Novel storage 'utility': direct access to storage
  – Self-defined performance targets: scalable, infinite data durability, 99.99% availability, fast data access
  – Pay-as-you-go pricing: $0.15/month/GB stored and $0.10–$0.17/GB transferred; keeps decreasing
• Question: is offloading data storage from an in-house mass storage system to S3 feasible and cost-effective?

The DØ Experiment
• High-energy physics collaboration
• Traces from January '03 to March '05 (27 months)
• 375 TB of data stored, 5.2 PB transferred
• Shared data usage: no access control
• 561 users from 70+ institutions in 18 countries
• High-intensity data usage: ~550 Mbps sustained access rate in DØ
• 113,062 jobs running for 973,892 hours over the 27-month period

  Trace recording interval   01/2003 – 03/2005
  Number of jobs             113,062
  Hours of computation       973,892
  Total storage volume       375 TB
  Total data processed       5.2 PB
  Average data access rate   273 GB/hour

Approach
• Characterize S3
  – Does it live up to its own claims?
  – (Study since superseded by cloudstatus.com)
• Toy scenario: evaluate a representative scientific application (DØ) in this context
  – Estimate performance and costs
  – Is the functionality provided adequate?
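As a sanity check, the trace's average data access rate can be recomputed from the totals in the table above. This is a back-of-the-envelope sketch; the unit convention (binary PB→GB) and hours-per-month figure are my assumptions, which is why the result lands near, not exactly on, the reported 273 GB/hour.

```python
# Back-of-the-envelope check of the DZero trace's average access rate.
# Assumes 1 PB = 1024**2 GB and an average month of ~730.5 hours.
total_processed_gb = 5.2 * 1024**2        # 5.2 PB transferred over the trace
trace_hours = 27 * 730.5                  # 27-month trace window
avg_rate_gb_per_hour = total_processed_gb / trace_hours
print(round(avg_rate_gb_per_hour))        # ~276, close to the reported 273 GB/hour
```

The ~550 Mbps sustained figure is consistent too: 276 GB/hour ≈ 0.61 Gbps of raw payload, the same order of magnitude.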
• Outline
  – S3 architecture
  – Toy scenario: S3-supported DØ — cost and functionality requirements
  – Lessons / suggested improvements

Amazon S3 Architecture
• Two-level namespace
  – Buckets (think directories)
    – Globally unique names
    – Two goals: data organization and charging
  – Data objects
    – Opaque objects (max 5 GB)
    – Metadata (attribute-value pairs, up to 4 KB)
• Functionality
  – Simple put/get functionality
  – Limited search functionality
  – Objects are immutable and cannot be renamed
• Data access protocols
  – SOAP
  – REST
  – BitTorrent

Amazon S3 Functionality
• Amazon S3 is intentionally built with a minimal feature set:
  – Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each; the number of objects you can store is unlimited.
  – Each object is stored in a bucket and retrieved via a unique, developer-assigned key.
  – A bucket can be located in the United States or in Europe. All objects within the bucket are stored in the bucket's location, but they can be accessed from anywhere.
  – Authentication mechanisms ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.
  – Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
  – Built to be flexible so that protocol or functional layers can easily be added. The default download protocol is HTTP; a BitTorrent™ protocol interface is provided to lower costs for high-scale distribution. Additional interfaces will be added in the future.
  – Reliability backed by the Amazon S3 Service Level Agreement.

Amazon S3 Architecture
• Security
  – Identities: assigned by S3 when the initial contract is 'signed'
  – Authentication: public/private key scheme
    – But the private key is generated by Amazon!
  – Access control: access control lists (limited to 100 principals)
    – ACL attributes: FullControl; Read & Write (objects cannot be overwritten); ReadACL & WriteACL (for buckets or objects)
  – Auditing (pseudo): S3 can provide a log record

Access Performance via BitTorrent
• Our question: does BitTorrent work the way it is supposed to?
• Answer: yes.

S3 Evaluation: Cost
• Scenario 1: all data stored at S3 and processed by DØ
  – Storage cost: $691,000/year ($829,440 for S3-Europe)
  – Transfer cost: $335,012/year
  – Total: ~$85,500/month
• Scenario 2: reducing transfer costs
  – Caching: a 50 TB cooperative cache cuts transfer costs to $43,888 per year (~10 times lower)
  – BitTorrent and distributed replicas
  – Use EC2: replace transfer costs with $43,284/year
• Scenario 3: reducing storage costs
  – Archive cold data: the lifetime of 30% of files is < 24 hours, of 40% < a week, of 50% < a month
  – Throw away derived data: distinguish between raw and derived data

Key Idea: Unbundling Performance Characteristics
• Problem: S3 is about an order of magnitude more expensive than in-house maintenance of resources!
• High availability, high durability, and high performance are bundled at a single pricing point…
• … but some applications need only one or two of them
  – A cache: availability and access performance
  – A backup solution (e.g., for DØ): durability and availability
• Solution: SLAs that let users specify their requirements and choose a pricing point.
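The Scenario 1 storage figure above follows directly from the quoted pricing. A short sketch (assuming 1 TB = 1024 GB, and deriving the implied S3-Europe rate of $0.18/GB-month from the quoted $829,440 figure):

```python
# Reproduce Scenario 1's yearly S3 storage cost for the DZero data set.
# Pricing from the slides: $0.15/GB-month (US); prices kept in integer
# cents to avoid floating-point rounding.
stored_gb = 375 * 1024                            # 375 TB stored

yearly_storage_cost = stored_gb * 15 * 12 / 100   # $0.15/GB-month, 12 months
print(yearly_storage_cost)                        # 691200.0 -> the ~$691,000/year quoted

# The implied S3-Europe rate of $0.18/GB-month matches the quoted figure.
yearly_storage_cost_eu = stored_gb * 18 * 12 / 100
print(yearly_storage_cost_eu)                     # 829440.0
```

The transfer-cost line ($335,012/year for 5.2 PB moved over 27 months) depends on the tiered $0.10–$0.17/GB rates, so it cannot be reproduced from a single number the same way.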
Unbundling Performance Characteristics

  Application class    Durability  Availability  High-performance data access
  Cache                No          Depends       Yes
  Long-term archival   Yes         No            No
  Online production    No          Yes           Yes
  Batch production     No          No            Yes

• The resources needed to provide high-performance data access, high data availability, and long data durability are different.

S3 Evaluation: Security
• Traditional risks with distributed storage are still a concern:
  – Permanent data loss
  – Temporary data unavailability (DoS)
  – Loss of confidentiality
  – Malicious or erroneous data modifications
• New risk: direct monetary loss
  – Magnified because there is no built-in solution to limit loss
• The security scheme's big advantage: it's simple…
• … but it has limitations:
  – Access control
    – Hard to use ACLs in large systems; needs at least groups (now available)
    – ACLs limited to 100 principals
    – No support for fine-grained delegation
  – Implicit trust between users and the S3 service
  – No support for non-repudiation
  – No tools to limit risk

Suggested Improvements
• To lower costs: unbundle performance characteristics
  – Availability, durability, and access time are bundled at a single pricing point
    • Some applications need only one or two of them
  – Solution: SLAs that let users specify their requirements and choose a pricing point
• To provide specific support for science collaborations
  – Better security support for complex collaborations
  – Additional functionality for better usability:
    – Metadata-based searches
    – Renaming and mutating objects
  – Relax hard-coded limitations: 100 buckets, 100 users per ACL, etc.
• Lesson for application integrators: use application-level information to reduce costs
  – Raw vs. derived data
  – Exploit usage patterns (e.g., data gets cold)

Amazon S3 for Science Grids: A Viable Solution?
• Not yet
• In addition, sociological issues
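The two-level namespace and the immutability constraint that the "renaming and mutating objects" improvement targets can be illustrated with a toy in-memory model. All names here are hypothetical and this is not the real S3 API; it only demonstrates the constraints the slides describe: globally unique buckets, a 5 GB per-object cap, and write-once objects that must be deleted and re-put to change.

```python
# Toy in-memory model of S3's two-level namespace (hypothetical names,
# not the real S3 API).
MAX_OBJECT_BYTES = 5 * 1024**3          # objects are capped at 5 GB

class ToyS3:
    def __init__(self):
        self._buckets = {}              # bucket name -> {key: bytes}

    def create_bucket(self, name):
        if name in self._buckets:       # bucket names are globally unique
            raise ValueError("bucket name already taken")
        self._buckets[name] = {}

    def put(self, bucket, key, data):
        if len(data) > MAX_OBJECT_BYTES:
            raise ValueError("objects are limited to 5 GB")
        if key in self._buckets[bucket]:
            # No rename, no in-place update: delete and re-put instead.
            raise ValueError("objects are immutable")
        self._buckets[bucket][key] = data

    def get(self, bucket, key):
        return self._buckets[bucket][key]

    def delete(self, bucket, key):
        del self._buckets[bucket][key]

store = ToyS3()
store.create_bucket("dzero-raw")
store.put("dzero-raw", "run042/event.dat", b"detector payload")
print(store.get("dzero-raw", "run042/event.dat"))   # b'detector payload'
```

For a DØ-style workload, the delete-and-re-put dance is exactly why mutable or renamable objects appear on the improvement list above.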