kk-banquet-talk-final - Laboratory for Advanced System Software

CloudNet: Enterprise
Ready Virtual Private
Clouds
K. K. Ramakrishnan
AT&T Labs Research,
Florham Park, NJ
Joint work with:
Timothy Wood, Jacobus van der Merwe, and Prashant
Shenoy
Thanks to Toby Ford, Joe Houle and others of AT&T for
providing a view of the competitive landscape and of their
efforts with Synaptic Computing and Storage
Evolution of Computing and
Communication Systems
• We have seen over the last several decades, the evolution of
computing, storage and networking capabilities
–
Each step in the evolution caused by changes in the relative speed or
cost of computing, storage and communications
•
–
–
Caused processing and storage to be closer or farther from each other and
from users
E.g., moved from mainframes to mini-computers to workstations and
PCs as computing costs and capabilities evolved relative to I/O
Then moved to client-server and eventually to web based services as
networking capabilities evolved
•
Each step may be viewed as attempt to take advantage of shared
resources and multiplexing driven by cost, functionality and
performance capabilities
•
So, is “Cloud Computing” déjà vu all over again?
Page 2
Cloud Computing
"The interesting thing about cloud computing is that we've redefined
cloud computing to include everything that we already do. I can't
think of anything that isn't cloud computing with all of these
announcements.”
Larry Ellison, CEO Oracle
Wall Street Journal, September 26, 2008
Cloud computing is taking advantage of current balance in cost,
performance and capability of computing vs. communications
Page 3
Cloud Computing
Rent computation and storage resources on
demand
•
Fundamental Value Proposition:
Shared, dynamic allocation of resources
to reduce cost through multiplexing
Cloud Platform
–
• Virtualization is a key component
• Accessed by multiple enterprise sites
Cloud Platform types:
• Software as a Service
–
•
Platform as a Service
–
•
Hotmail, Google Docs
Google App Engine, Microsoft Azure
Infrastructure as a Service
–
Amazon EC2, VMware vCloud
Page 4
Enterprise Sites
‘Cloud’ services grouped in 3 categories
What is Purchased
Bus
Dev
Cons
Software as a Service
Application that is run on a multitenant platform and delivered over
the Internet
Platform as a Service
Service that provides platform to
create and run applications
Infrastructure as a
Service
Compute on demand (servers, storage,
networking) using shared platform
Connect / UC


Brew
Mobile Apps /
Vertically
Integrated


Android
Simple DB

Apple
iPhone
Apple
iPhone
MEAP

EC2, S3
Enabler: virtualization Early service versions: private server virtualization, utility hosting &
storage
Page 5
Moving to the Cloud
Acme wants to move part of its payroll app into the cloud
Should be easy, right…?
Acme LAN
Front End
Reports
Processing
Tier
Processing
Tier
Cloud Platform
Page 6
Data Store
Cloud Eco-System
Physical Resources
APIs and Services
Admin Portal
•
Provides basic server and storage facilities
•
Provides LAN and load balancing facilities
Compute as a Service
•
API
Adaptation
Services
Cloud
Enablers
Service
Orchestration
Scheduling
Customer
Service
Policy
Virtualized machine: compute capacity
–
Events
–
–
Common XaaS Tools
•
Image create / management / marketplace
Virtual servers/ Server accessible storage
Address management / Load balancing services
Multi-tenant shared and dedicated environments
Storage as a Service
•
XaaS
Provisioning
Configuration
Application Platform as a Service
Application Platform as a Service
•
Application environment to create, test, host/run
applications (e.g. Java, .NET, J2EE, Python, etc.)
•
Multi-tenant shared and dedicated environments
Charging
Storage as a Service
etc.
Compute as a Service
XaaS
Servers
Fault/Perf
Monitoring
Storage
LAN / LB Network Capabilities
Cloud Eco-system
IP accessible storage services (http, NFS, CIF, etc.)
•
Other services (database, queuing, etc.)
•
Software as a Service
•
Advanced IaaS/PaaS/XaaS services
–
OSS/BSS
–
Cross service coordination and integration
Capacity / Availability Services, Etc.
Common XaaS Tools For Providing Services
Leveraging Services
•
Services should leverage the capabilities of other services as appropriate
•
Compute and Storage as Service are foundational services
–
Basic building block for other services
•
–
•
Modularity is key to managing scale and availability
–
–
Physical/service modularity
Availability domains
Build/support enhanced and new service offerings from
existing service and OSS/BSS APIs
Customer Service Policy/Rules
–
Managing Scale / Availability
•
Service Orchestration / Scheduling
–
Support enhanced services, SLA management, etc.
Likely part of OSS/BSS framework (TBD)
•
API Adaptation
•
Enable the integration of disparate underlying service APIs into a
coherent set of services to our customer
•
Includes network design, storage, servers, etc.
•Cloud enablers
•
Strive for commonality, but do not assume one solution fits all
•
–
At scale individual services may require specific architecture to scale appropriately
Page 7
Support capabilities to aid in the development of XaaS services
(e.g. logging, Resource /Service ID/reference mgmt,
IDM, OSS/BSS integration, etc.)
Enterprise Cloud Challenges
Existing platforms do not meet the needs of
enterprise customers
•
Insufficient security controls
–
•
Deployment is difficult - transparency
–
–
•
Need isolation at server and network level
Cloud resources are completely separate from local ones
Can’t make VMs look like part of existing LAN
Limited control over network resources
–
–
Page 8
Cannot specify network topology or IP addresses
Cannot reserve bandwidth or request QoS guarantees for
network links
Vision and Research Direction
• We have been working towards making compute and
storage resources location transparent for enterprises, and
applications in general
• Enable Location of Compute and Storage resources
remotely
Ensure Transparency and Security
• Minimize performance impact from having remote resources
•
•
Facilitate Migration of Resources in a Transparent and
Seamless manner
•
•
With as little application impact as possible
Provide capability for Disaster Recovery
•
Remote resources need to be far enough away to avoid sharing
of risk between local and remote resources
•
Page 9
Minimum RPO/RTO – to minimize impact on enterprise operations
Problem #1: Transparency
•Application may have been written for LAN environment
–
Might utilize broadcast or LAN service discovery
•Must add Internet gateways for apps previously only on LAN
- Must now communicate via public IPs or configure DNS
Lack of transparency causes
application modifications and
infrastructure reconfigurations
Acme LAN
Front End
front.acme.com
GW
Cloud Platform
Processing
proc.cloud.com
Data Store
data.acme.com
Page 10
GW
Problem #2: Security
Acme’s servers are now accessible from the public internet!
–
Servers formerly on secure LAN now exposed to malicious users
Must configure firewall rules to limit access
–
Fine grain rules are difficult to manage in dynamic environments
Acme LAN
Front End
front.acme.com
Lack of secure cloud connections
exposes enterprise to threats from
both in and out of the cloud
Cloud Platform
Processing
proc.cloud.com
Data Store
data.acme.com
Hacker123
hax.cloud.com
Page 11
Problem #3: Flexible Resource Mgmt
Benefit of cloud computing: ability to easily adjust resource
capacities and add new VMs
After a change must deal with transparency and security issues
all over again!
– Current platforms do not support network resource reservation
(Bandwidth/QoS guarantees)
–
Enterprises want control over
network resources. Cloud must
support dynamic changes
Acme LAN
Front End
Cloud Platform
+1
front.acme.com
+1
Processing
proc.cloud.com
Data Store
data.acme.com
+1
Processing #2
proc2.cloud.com
Page 12
Key Observation
Existing cloud platforms primarily
cover storage and computation
Cloud Platform
Disk
VM
+
+
Enterprise Sites
Enterprise Clouds need control
over the network as well
Page 13
Virtual Private Clouds
A Virtual Private Cloud is…
A secure collection of server, storage, and network resources
spanning one or more cloud data centers
– That is seamlessly connected to one or more enterprise sites
–
VM
Enterprise
Sites
VM
VM
VM
Cloud
Sites
Virtual Private Networks (VPNs)
Layer 2 and 3 MPLS based VPNs
– Created by network provider with no end host configuration
– Already used by many businesses!
–
Page 14
VPC Benefits
For the customer:
– Isolates network & compute resources
•
–
Cloud resources are only accessible through VPN
Simplifies deployment since cloud looks same as local
resources
For the service provider:
– Provides mechanism for control over resource reservation
within provider network
– Simplifies management of multiple data centers by
combining them into large resource pools
Page 15
CloudNet
Cloud Manager
•
Allocates computation and storage resources
•
Manages VLAN assignment within cloud network
Network Manager
•
Creates and configure VPN endpoints
•
Reserves network resources
Network Manager
VPN
VPN
Page 16
Routers
Customer Edge
Provider Edge
Cloud Manager
VLAN
VM VM
VLAN
VM VM
CloudNet: Virtual Private Clouds
Cloud Site X
VPN A
Server
VPN B
VPC A
Server
PE
PE
AT&T Backbone
Cloud Site Y
Server
PE
VPN A
PE
VPN B
Server
Server
VPC B
Virtual Private Cloud:
• Collection of cloud resources presented to customer as a private set of cloud
resources, transparently and securely connected to customer VPN
Page 17
CloudNet System Components
CloudNet
Portal
Controller
VPN A
Cloud Platform
Server
Network
Manager
Cloud
Manager
Server
VPN B
PE
AT&T Backbone
PE
CE
Router
Server
VPN A
Server
PE
VPN B
Server
Cloud Domain
Network Domain
High level abstraction:
Cloud Manager:
Network Manager (IRSCP):
• Create compute resources
• Create compute resources
• VPN management (network side)
• Map into VPN
• Map into VPN (cloud side)
• Dynamic VPN mapping/stitching
• Cross domain interaction
Page 18
CloudNet System Components Network Domain
Network Manager
VPN A
VPN B
IRSCP
PE
PE
PE
VPN B
VPN A
• Network side: Dynamic VPN mapping with IRSCP
• On PE connected to cloud platform
• Pre-configure VPN interfaces with Route-targets
• IRSCP re-writes the RTs as needed to dynamically map these interfaces into
specific VPNs
Page 19
CloudNet System Components Network Domain
Remapping
Network Manager
VPN A
VPN B
IRSCP
PE
VPN A
PE
VPN B
PE
VPN B
VPN A
• Network side: Dynamic VPN mapping with IRSCP
• On PE connected to cloud platform
• Pre-configure VPN interfaces with Route-targets
• IRSCP re-writes the RTs as needed to dynamically map these interfaces into
specific VPNs
Page 20
System Components Cloud Domain
Remapping
Cloud
Manager
Server
Server
CE
Server
CE
Server
Server
Switch
PE
Router
Cloud Computing Platform
On Cloud Computing Platform side, the Cloud Manager :
• Dynamically creates and configures a logical router on a physical router
platform
• Create virtual machine with requested images and configuration
• Hook all of them together
Page 21
WAN Migration
Layer 2 VPNs make WAN act like a LAN
Page 22
Can use existing LAN migration
techniques to move across WAN
WAN Migration
Layer 2 VPNs make WAN act like a LAN
CE
Customer Site
PE
Cloud Site 1
A
VLAN
PE
B
ARP!
Layer 2 VPN (VPLS)
Router
Switch
VPN endpoint
Page 23
CE
ARP!
PE
B
VLAN
Cloud Site 2
Take advantage of LAN migration techniques,
suitably enhanced, to move across WAN
WAN Migration Optimization
• Once connectivity is setup, migration requires
Storage Migration
• Live Memory Migration
•
•
Storage Migration is done through a combination of
Asynchronous Copy of disk storage to remote site initially
• Synchronous copy of incremental updates subsequently during live
memory migration
•
•
Live Memory Migration needs to balance multiple needs
Total Migration Time for live memory (reduced application performance)
• Pause Time (application has to be quiescent for final transfer)
• Amount of Data Transfer (Bandwidth Requirement)
•
Page 24
Live Memory Migration over
WANs
• There has been quite a bit of work in the past for VM
migration over LANs
• But these need to be enhanced to work well over WANs,
especially to accommodate varying/limited network
capacities and latencies
• We’ve been working on an approach to combine multiple
techniques
Dynamic Stop and Copy
• Content Based Redundancy
• Incremental updates (page deltas)
•
•
Overall benefit is significant reduction in migration and
pause times, especially for limited bandwidth between sites
•
Page 25
We’ve been experimenting with our implementation for multiple
applications
Performance of CloudNet Live
Migration over WANs
Page 26
Performance of CloudNet Live
Migration over WANs
TPC-W
Page 27
Kernel
SpecJBB
Performance of CloudNet Live
Migration over WANs
TPC-W
Page 28
Kernel
SpecJBB
AT&T delivers services globally across a footprint
of 38 IDCs, including five ‘Super IDCs’….
UK
Amsterdam
NY Metro
Japan
San Diego
DC Metro
North America - 23
Singapore
Asia/Pacific - 9
Super IDC: Singapore
Other Locations:
Bangalore, Hong Kong,
Shanghai (3), Tokyo,
Osaka, Sydney
Super IDCs: Annapolis,
Piscataway, San Diego
Other Locations: Boston, New
York, Secaucus, Ashburn, Atlanta
(2), Orlando, Miami, Chicago (2),
Dallas (2), Nashville, Mesa, Los
Angeles (2), San Jose, San
Francisco, Seattle, Toronto
Super IDC: Amsterdam
Other Locations: London,
Birmingham, Frankfurt,
Paris, Nice
Super IDCs
Additional AT&T IDCs
Additional expansions underway and planned for 2009
Locations and dates are subject to change
Page 29
Europe - 6
Page 29
“The Enterprise Cloud” Flexible Delivery Models
Hybrid …
• Access to client, partner
network, and third party
resources
Public …
•
•
•
•
•
•
Cloud Services
Efficient, resilient
Preserves capital
Access by subscription
Flexible, Agile, Open, Fast
Leverage broad experience base
Faster time to market
“The Enterprise Cloud”
“Workload Distribution Network”
Use Cases …
•
•
•
•
•
© 2009 AT&T Intellectual Property. All rights reserved.
Disaster Recovery
Portable Migration
Follow the Sun; Follow the Moon
Low Cost
Lowest Latency
Page 30
Private …
•
•
Limited access
Added security and privacy
Summary
Cloud Computing for enterprises requires:
• Security
•
Transparency
•
Flexibility
CloudNet can help provide these features
• Defines interface between cloud platform and network provider
•
Uses VPNs for secure, seamless connections
•
Employs virtualization at server, router, and network levels to
improve agility and efficiency
Current Work
•
Algorithms to optimize migration time, pause time and network
bandwidth requirements for WAN migration
•
Network optimizations to reduce latency of WAN migration
Page 31