CloudNet: Enterprise Ready Virtual Private Clouds K. K. Ramakrishnan AT&T Labs Research, Florham Park, NJ Joint work with: Timothy Wood, Jacobus van der Merwe, and Prashant Shenoy Thanks to Toby Ford, Joe Houle and others of AT&T for providing a view of the competitive landscape and of their efforts with Synaptic Computing and Storage Evolution of Computing and Communication Systems • We have seen over the last several decades, the evolution of computing, storage and networking capabilities – Each step in the evolution caused by changes in the relative speed or cost of computing, storage and communications • – – Caused processing and storage to be closer or farther from each other and from users E.g., moved from mainframes to mini-computers to workstations and PCs as computing costs and capabilities evolved relative to I/O Then moved to client-server and eventually to web based services as networking capabilities evolved • Each step may be viewed as attempt to take advantage of shared resources and multiplexing driven by cost, functionality and performance capabilities • So, is “Cloud Computing” déjà vu all over again? Page 2 Cloud Computing "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements.” Larry Ellison, CEO Oracle Wall Street Journal, September 26, 2008 Cloud computing is taking advantage of current balance in cost, performance and capability of computing vs. communications Page 3 Cloud Computing Rent computation and storage resources on demand • Fundamental Value Proposition: Shared, dynamic allocation of resources to reduce cost through multiplexing Cloud Platform – • Virtualization is a key component • Accessed by multiple enterprise sites Cloud Platform types: • Software as a Service – • Platform as a Service – • Hotmail, Google Docs Google App Engine, Microsoft Azure Infrastructure as a Service – Amazon EC2, VMware vCloud Page 4 Enterprise Sites ‘Cloud’ services grouped in 3 categories What is Purchased Bus Dev Cons Software as a Service Application that is run on a multitenant platform and delivered over the Internet Platform as a Service Service that provides platform to create and run applications Infrastructure as a Service Compute on demand (servers, storage, networking) using shared platform Connect / UC Brew Mobile Apps / Vertically Integrated Android Simple DB Apple iPhone Apple iPhone MEAP EC2, S3 Enabler: virtualization Early service versions: private server virtualization, utility hosting & storage Page 5 Moving to the Cloud Acme wants to move part of its payroll app into the cloud Should be easy, right…? Acme LAN Front End Reports Processing Tier Processing Tier Cloud Platform Page 6 Data Store Cloud Eco-System Physical Resources APIs and Services Admin Portal • Provides basic server and storage facilities • Provides LAN and load balancing facilities Compute as a Service • API Adaptation Services Cloud Enablers Service Orchestration Scheduling Customer Service Policy Virtualized machine: compute capacity – Events – – Common XaaS Tools • Image create / management / marketplace Virtual servers/ Server accessible storage Address management / Load balancing services Multi-tenant shared and dedicated environments Storage as a Service • XaaS Provisioning Configuration Application Platform as a Service Application Platform as a Service • Application environment to create, test, host/run applications (e.g. Java, .NET, J2EE, Python, etc.) • Multi-tenant shared and dedicated environments Charging Storage as a Service etc. Compute as a Service XaaS Servers Fault/Perf Monitoring Storage LAN / LB Network Capabilities Cloud Eco-system IP accessible storage services (http, NFS, CIF, etc.) • Other services (database, queuing, etc.) • Software as a Service • Advanced IaaS/PaaS/XaaS services – OSS/BSS – Cross service coordination and integration Capacity / Availability Services, Etc. Common XaaS Tools For Providing Services Leveraging Services • Services should leverage the capabilities of other services as appropriate • Compute and Storage as Service are foundational services – Basic building block for other services • – • Modularity is key to managing scale and availability – – Physical/service modularity Availability domains Build/support enhanced and new service offerings from existing service and OSS/BSS APIs Customer Service Policy/Rules – Managing Scale / Availability • Service Orchestration / Scheduling – Support enhanced services, SLA management, etc. Likely part of OSS/BSS framework (TBD) • API Adaptation • Enable the integration of disparate underlying service APIs into a coherent set of services to our customer • Includes network design, storage, servers, etc. •Cloud enablers • Strive for commonality, but do not assume one solution fits all • – At scale individual services may require specific architecture to scale appropriately Page 7 Support capabilities to aid in the development of XaaS services (e.g. logging, Resource /Service ID/reference mgmt, IDM, OSS/BSS integration, etc.) Enterprise Cloud Challenges Existing platforms do not meet the needs of enterprise customers • Insufficient security controls – • Deployment is difficult - transparency – – • Need isolation at server and network level Cloud resources are completely separate from local ones Can’t make VMs look like part of existing LAN Limited control over network resources – – Page 8 Cannot specify network topology or IP addresses Cannot reserve bandwidth or request QoS guarantees for network links Vision and Research Direction • We have been working towards making compute and storage resources location transparent for enterprises, and applications in general • Enable Location of Compute and Storage resources remotely Ensure Transparency and Security • Minimize performance impact from having remote resources • • Facilitate Migration of Resources in a Transparent and Seamless manner • • With as little application impact as possible Provide capability for Disaster Recovery • Remote resources need to be far enough away to avoid sharing of risk between local and remote resources • Page 9 Minimum RPO/RTO – to minimize impact on enterprise operations Problem #1: Transparency •Application may have been written for LAN environment – Might utilize broadcast or LAN service discovery •Must add Internet gateways for apps previously only on LAN - Must now communicate via public IPs or configure DNS Lack of transparency causes application modifications and infrastructure reconfigurations Acme LAN Front End front.acme.com GW Cloud Platform Processing proc.cloud.com Data Store data.acme.com Page 10 GW Problem #2: Security Acme’s servers are now accessible from the public internet! – Servers formerly on secure LAN now exposed to malicious users Must configure firewall rules to limit access – Fine grain rules are difficult to manage in dynamic environments Acme LAN Front End front.acme.com Lack of secure cloud connections exposes enterprise to threats from both in and out of the cloud Cloud Platform Processing proc.cloud.com Data Store data.acme.com Hacker123 hax.cloud.com Page 11 Problem #3: Flexible Resource Mgmt Benefit of cloud computing: ability to easily adjust resource capacities and add new VMs After a change must deal with transparency and security issues all over again! – Current platforms do not support network resource reservation (Bandwidth/QoS guarantees) – Enterprises want control over network resources. Cloud must support dynamic changes Acme LAN Front End Cloud Platform +1 front.acme.com +1 Processing proc.cloud.com Data Store data.acme.com +1 Processing #2 proc2.cloud.com Page 12 Key Observation Existing cloud platforms primarily cover storage and computation Cloud Platform Disk VM + + Enterprise Sites Enterprise Clouds need control over the network as well Page 13 Virtual Private Clouds A Virtual Private Cloud is… A secure collection of server, storage, and network resources spanning one or more cloud data centers – That is seamlessly connected to one or more enterprise sites – VM Enterprise Sites VM VM VM Cloud Sites Virtual Private Networks (VPNs) Layer 2 and 3 MPLS based VPNs – Created by network provider with no end host configuration – Already used by many businesses! – Page 14 VPC Benefits For the customer: – Isolates network & compute resources • – Cloud resources are only accessible through VPN Simplifies deployment since cloud looks same as local resources For the service provider: – Provides mechanism for control over resource reservation within provider network – Simplifies management of multiple data centers by combining them into large resource pools Page 15 CloudNet Cloud Manager • Allocates computation and storage resources • Manages VLAN assignment within cloud network Network Manager • Creates and configure VPN endpoints • Reserves network resources Network Manager VPN VPN Page 16 Routers Customer Edge Provider Edge Cloud Manager VLAN VM VM VLAN VM VM CloudNet: Virtual Private Clouds Cloud Site X VPN A Server VPN B VPC A Server PE PE AT&T Backbone Cloud Site Y Server PE VPN A PE VPN B Server Server VPC B Virtual Private Cloud: • Collection of cloud resources presented to customer as a private set of cloud resources, transparently and securely connected to customer VPN Page 17 CloudNet System Components CloudNet Portal Controller VPN A Cloud Platform Server Network Manager Cloud Manager Server VPN B PE AT&T Backbone PE CE Router Server VPN A Server PE VPN B Server Cloud Domain Network Domain High level abstraction: Cloud Manager: Network Manager (IRSCP): • Create compute resources • Create compute resources • VPN management (network side) • Map into VPN • Map into VPN (cloud side) • Dynamic VPN mapping/stitching • Cross domain interaction Page 18 CloudNet System Components Network Domain Network Manager VPN A VPN B IRSCP PE PE PE VPN B VPN A • Network side: Dynamic VPN mapping with IRSCP • On PE connected to cloud platform • Pre-configure VPN interfaces with Route-targets • IRSCP re-writes the RTs as needed to dynamically map these interfaces into specific VPNs Page 19 CloudNet System Components Network Domain Remapping Network Manager VPN A VPN B IRSCP PE VPN A PE VPN B PE VPN B VPN A • Network side: Dynamic VPN mapping with IRSCP • On PE connected to cloud platform • Pre-configure VPN interfaces with Route-targets • IRSCP re-writes the RTs as needed to dynamically map these interfaces into specific VPNs Page 20 System Components Cloud Domain Remapping Cloud Manager Server Server CE Server CE Server Server Switch PE Router Cloud Computing Platform On Cloud Computing Platform side, the Cloud Manager : • Dynamically creates and configures a logical router on a physical router platform • Create virtual machine with requested images and configuration • Hook all of them together Page 21 WAN Migration Layer 2 VPNs make WAN act like a LAN Page 22 Can use existing LAN migration techniques to move across WAN WAN Migration Layer 2 VPNs make WAN act like a LAN CE Customer Site PE Cloud Site 1 A VLAN PE B ARP! Layer 2 VPN (VPLS) Router Switch VPN endpoint Page 23 CE ARP! PE B VLAN Cloud Site 2 Take advantage of LAN migration techniques, suitably enhanced, to move across WAN WAN Migration Optimization • Once connectivity is setup, migration requires Storage Migration • Live Memory Migration • • Storage Migration is done through a combination of Asynchronous Copy of disk storage to remote site initially • Synchronous copy of incremental updates subsequently during live memory migration • • Live Memory Migration needs to balance multiple needs Total Migration Time for live memory (reduced application performance) • Pause Time (application has to be quiescent for final transfer) • Amount of Data Transfer (Bandwidth Requirement) • Page 24 Live Memory Migration over WANs • There has been quite a bit of work in the past for VM migration over LANs • But these need to be enhanced to work well over WANs, especially to accommodate varying/limited network capacities and latencies • We’ve been working on an approach to combine multiple techniques Dynamic Stop and Copy • Content Based Redundancy • Incremental updates (page deltas) • • Overall benefit is significant reduction in migration and pause times, especially for limited bandwidth between sites • Page 25 We’ve been experimenting with our implementation for multiple applications Performance of CloudNet Live Migration over WANs Page 26 Performance of CloudNet Live Migration over WANs TPC-W Page 27 Kernel SpecJBB Performance of CloudNet Live Migration over WANs TPC-W Page 28 Kernel SpecJBB AT&T delivers services globally across a footprint of 38 IDCs, including five ‘Super IDCs’…. UK Amsterdam NY Metro Japan San Diego DC Metro North America - 23 Singapore Asia/Pacific - 9 Super IDC: Singapore Other Locations: Bangalore, Hong Kong, Shanghai (3), Tokyo, Osaka, Sydney Super IDCs: Annapolis, Piscataway, San Diego Other Locations: Boston, New York, Secaucus, Ashburn, Atlanta (2), Orlando, Miami, Chicago (2), Dallas (2), Nashville, Mesa, Los Angeles (2), San Jose, San Francisco, Seattle, Toronto Super IDC: Amsterdam Other Locations: London, Birmingham, Frankfurt, Paris, Nice Super IDCs Additional AT&T IDCs Additional expansions underway and planned for 2009 Locations and dates are subject to change Page 29 Europe - 6 Page 29 “The Enterprise Cloud” Flexible Delivery Models Hybrid … • Access to client, partner network, and third party resources Public … • • • • • • Cloud Services Efficient, resilient Preserves capital Access by subscription Flexible, Agile, Open, Fast Leverage broad experience base Faster time to market “The Enterprise Cloud” “Workload Distribution Network” Use Cases … • • • • • © 2009 AT&T Intellectual Property. All rights reserved. Disaster Recovery Portable Migration Follow the Sun; Follow the Moon Low Cost Lowest Latency Page 30 Private … • • Limited access Added security and privacy Summary Cloud Computing for enterprises requires: • Security • Transparency • Flexibility CloudNet can help provide these features • Defines interface between cloud platform and network provider • Uses VPNs for secure, seamless connections • Employs virtualization at server, router, and network levels to improve agility and efficiency Current Work • Algorithms to optimize migration time, pause time and network bandwidth requirements for WAN migration • Network optimizations to reduce latency of WAN migration Page 31