grid & performability
Aad van Moorsel
aadvanmoorsel.com

outline
to set the stage:
• what is grid?
• what is performability?
three perspectives on grid performability:
• ‘customer’ requirements
• system implementation – utility computing
• associated research challenges – focus on stochastic modeling

April 2003 Copyright Aad van Moorsel, HP Labs page 2

what is grid? what is performability?

grid
for me, and in this talk:
• middleware layer, Globus-like
• shares resources
• crosses boundaries
  – administrative domains, user domains, enterprise domains, …
• software-implemented boundaries
  – flexibility in who uses what when
  – flexibility in what is secured against whom when
  – flexibility in who charges for what when
  – …
• makes resources manageable
  – grades of QoS
  – dynamic management of QoS
  – service level agreements, business metrics and penalties

performability
for me, and in this talk:
• quality of service (QoS) context:
• Meyer: metric P(T<t) where T was some random variable
• my thesis: meaningful quantitative evaluation of a system (definition 2 out of 3)
• others: performance and reliability
• SPN models for system state, rewards or queuing networks for performance/metric

grid & performability
we accept the claim that grid is software that will facilitate flexible performability management
• the software design still leaves much to be desired
  – automation? autonomous? autonomic?
  – scaling? inter-business? security?
• but the applications will drive it in the right direction
  – utility computing
  – service-centric outsourcing

grid & performability
‘customer’ perspective

business
costs of owning and operating IT have gone through the roof

business cost of IT failures
downtime costs per hour:
  brokerage operations            $6,450,000
  credit card authorization       $2,600,000
  eBay (1 outage 22 hours)          $225,000
  amazon.com                        $180,000
  package shipping services         $150,000
  home shopping channel             $113,000
  catalog sales center               $90,000
  airline reservation center         $89,000
  cellular service activation        $41,000
  on-line network fees               $25,000
  ATM service fees                   $14,000
survey of computer damages in France, 2000
source: Dave Patterson keynote at FAST ‘02

operational complexity: scale
courtesy of Lisa Spainhower, IBM

operator faces heterogeneity
(figure: management tasks per layer, annotated with technology labels)
Business (content, logic, processes); labels: BPR, CDN, dynamic composition
  – place content closer to where it is needed
  – reengineer business process
  – select services for each activity in the process dynamically
Software (databases, app servers, web servers)
  – share a database vs.
    create a new database
  – re-index tables to optimize queries
  – number of app servers needed
  – number of web servers needed
  – start and stop new app servers
  – load balance transactions across servers
Hardware (servers, network, storage)
  – allocate machines to applications
  – reserve network bandwidth prior to use
  – assign storage devices to workloads
  – replace a failed machine transparently by migrating its applications
  – QoS-based routing decisions
  – configure buffer sizes in device drivers to maximize performance
technology labels for these two layers: database utility, app server utility, web server utility, ZLE, DBMS, load balancing, RSVP, UDC/QM/SF, storage management, VMs

operation faces federation needs
(figure: the same management tasks per layer)
Business (content, logic, processes)
  – place content closer to where it is needed
  – reengineer business process
  – select services for each activity in the process dynamically
Software (databases, app servers, web servers)
  – share a database vs. create a new database
  – re-index tables to optimize queries
  – number of app servers needed
  – number of web servers needed
  – start and stop new app servers
  – load balance transactions across servers
Hardware (servers, network, storage)
  – allocate machines to applications
  – reserve network bandwidth prior to use
  – assign storage devices to workloads
  – replace a failed machine transparently by migrating its applications
  – QoS-based routing decisions
  – configure buffer sizes in device drivers to maximize performance

customer needs
business-driven, automated operator tools for systems with increasing scale, heterogeneity and federation challenges

grid & performability
system perspective (utility computing)

twin UDCs in HP Labs
• built the first large utility data center in Palo Alto (US) and Bristol (UK)
  – learn what it takes to build a solution
  – move HPL IT services to the UDC
• the first virtualized data center
  – from
    server, storage, networks to energy management
  – dynamically assigns applications to resources
  – customer sees resources as ‘utility’
  – operator sees resources as ‘utility’

utility computing from usage perspective
• reserving resources
• getting resources
• flexing resources
(figure labels: UDC1, UDC2, Server Cluster, ?)

utility computing from operator perspective
(prototype developed at HP Labs using gtk2)
Utility Data Center = programmable pool of data center resources
(figure labels: Grid interface, UDC GRAM, UDC/XML Interface)
UDC GRAM = Globus Gatekeeper + UDC Adapter

configure properties

generate RSL

utility computing for operators
utility computing has great potential to improve operations:
• better utilization of resources
• better tools for setting up applications
• new business models, better accountability
but UDC is just one, high-end solution
need something that is open, extensible, uniform, …
grid based management backplane

utility computing grid middleware
(figure: layered stack, with an SLA label)
• OpenView orchestrates IT management
• HP value-add: leverage Grid
• OpenView command and control
• management backplane: monitoring, rich discovery, life-cycle, coordinated ‘act’, policy, biz-impact driven adaptation, flexible secure mgmt domains
• base Grid: uniform interface, single sign-on, federation, stateful services
• everything is a Grid service

more automation: flexing resources
objective: increase asset utilization via resource sharing while providing a desired quality of service for applications
approach: a statistical multiplexing technique for resource utilities that host business applications
characteristics of business
applications:
• require resources continuously
• changes in number of users and workload mix may result in:
  – time varying demands
  – large peak to mean ratios for demand
  – future demands that are difficult to predict precisely
• customers want assurances they will get resources when needed
  – for example, resource request will be satisfied with a prob. p=0.999
  – i.e. 999 times out of 1000
  – customers don’t always need an assurance of p=1.0

statistical demand profiles
to guide the development of our techniques we rely on gathered data:
  – 48 servers in an HP data center
  – hosting business applications
  – each with 2 to 8 CPUs
create a statistical demand profile for each application
  – compact representation of pattern for demand
  – characterize “day of week” and “day of weekend” separately
    • ignore weekends for the purpose of the study
  – characterize a “weekday” by 24 60-minute time slots
    • probability mass function (pmf) gives the observed distribution for the number of CPUs needed per slot
the profiles populate a calendar of “expected demand” for the utility
  – enables admission control

admission control approach
• a new application requests admission to the utility
• assume we admit the new application
• unfold its profile onto the utility’s calendar for a capacity planning horizon
  – for example, several months into the future
• characterize the calendar’s new per-slot distributions of aggregate demand
• use distributions to estimate required size of resource pool
• admit application if there are sufficient resources

demands for a time slot t
(figure labels: applications, utility)
utility:
  – distribution of aggregate demand is approximated by the joint pmf
  – however, we must also consider correlations between application demands

experimental design and results
• how many CPUs are
  needed if applications:
  – are statically assigned their peak numbers of CPUs?
  – are assigned the peak number of CPUs needed on per-slot basis?
  – are offered assurance p that resource requests will be satisfied?
• about the experiments:
  – include application demand correlations as measured
  – include 60 minute warm-up/warm-down application migration overheads
  – reported estimates verified using trace driven simulation

  resource access mechanism            number of CPUs required
  static                               309
  peak per slot (p=1.0)                275
  statistical multiplexing p=0.999     179 (estimate)
  statistical multiplexing p=0.99      163 (estimate)

grid & performability
modeling research perspective

modeling issue I
the many perspectives of virtualization
virtualization enables flexibility in UDC:
1. storage area networks let applications use any storage device
2. computing virtualization allows CPUs to be assigned dynamically to customers
3.
   virtual LAN creates a secure private network
virtualization gives the illusion of some traditional functionality (‘boundaries’), but implements it ‘soft’
modeling challenges: different views for different users, dynamic changing of boundaries (performability!), how to utilize the models contained by the software

modeling issue II
on-line algorithms
on-line algorithms are key to conquer complexity:
• automated adaptation needs on-line algorithms
on-line algorithms come in many shapes and forms:
• days: resource scheduling
• seconds: load balancing, admission control, retries
• milliseconds: memory optimization, real-time scheduling
typical issues:
• speed of the model solution
• choose between statistical and structural models
• obtaining the right on-line data
• plug-in algorithm module
need data model that fits with operational model

modeling issue III
how to validate large scale systems
many facets to scale:
• more and more devices
• more and more interconnected (even globally)
• increasing number of users
• multi-party and multi-ownership
• greater differences in scale: smaller devices, bigger data centers
• amount of data collected and analysis done increases with the scale of the systems
we have no good ways of analyzing large-scale systems: no test beds, no reliable data, no widely accepted modeling approaches

modeling issue IV
how to evaluate for business metrics
the real metric of interest is euros:
• how much is the total cost of ownership
• how much am I as customer willing to pay for a service
• what penalties do I as provider accept in an SLA
• if I invest x, what is the return on IT investment
how do we model the money/QoS correlation?
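
One simple, purely illustrative way to put money and QoS into the same model is to weigh the hardware saved by a weaker assurance level p against the expected SLA penalties that level causes. The sketch below is not from the talk: the per-CPU cost, the penalty per unsatisfied request and the number of requests are hypothetical values, and the function names are made up for this example. Only the CPU counts (309 for static assignment, 179 at p=0.999) come from the experimental results reported earlier.

```python
def expected_penalty(p, requests, penalty_per_miss):
    """Expected SLA penalty if each request is satisfied with probability p."""
    # Linearity of expectation: this holds even when misses are correlated.
    return (1.0 - p) * requests * penalty_per_miss


def net_saving(cpus_saved, cost_per_cpu, p, requests, penalty_per_miss):
    """Hardware saving minus expected penalties, in euros per period."""
    return cpus_saved * cost_per_cpu - expected_penalty(p, requests, penalty_per_miss)


# CPU counts are taken from the results table; everything else is hypothetical.
CPUS_SAVED = 309 - 179        # static assignment vs. multiplexing at p=0.999
COST_PER_CPU = 1000.0         # hypothetical euros per CPU per period
REQUESTS = 480                # hypothetical resource requests per period
PENALTY_PER_MISS = 500.0      # hypothetical euros per unsatisfied request

saving = net_saving(CPUS_SAVED, COST_PER_CPU, 0.999, REQUESTS, PENALTY_PER_MISS)
print(f"net saving: {saving:.2f} euros")
```

With these made-up prices the penalty term is small compared with the hardware saving; a real analysis would need measured costs and negotiated penalty clauses, which is exactly the modeling gap this slide points at.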

conclusion
• adaptive/utility/autonomic computing has intrinsic need for QoS (performability) modeling and analysis
• the grid is believed to be the platform of choice
  – applications are more interesting than the middleware
• challenges for stochastic modeling larger than ever in this setting:
  – virtualization
  – on-line algorithms
  – large-scale systems
  – business metrics
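
As an illustration of the statistical multiplexing idea from the "admission control approach" slide, the sketch below sizes a resource pool from per-slot pmfs and admits a new application only if the pool is large enough. It is a minimal sketch under stated assumptions, not the method used in the experiments: it treats application demands as independent (the talk stresses that correlations must also be considered), it uses two time slots instead of 24, and all names and demand profiles are hypothetical.

```python
def convolve(pmf_a, pmf_b):
    """pmf of the sum of two independent demands; pmfs are {cpus: probability}."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


def required_cpus(pmf, p):
    """Smallest n with P(demand <= n) >= p (small tolerance for rounding)."""
    cumulative = 0.0
    for n in sorted(pmf):
        cumulative += pmf[n]
        if cumulative >= p - 1e-12:
            return n
    return max(pmf)


def pool_size(profiles, p):
    """Pool size needed so that every slot meets assurance level p.

    profiles: one list of per-slot pmfs per application (same slot count).
    ASSUMPTION: demands are independent across applications.
    """
    needed = 0
    for slot in range(len(profiles[0])):
        aggregate = {0: 1.0}
        for profile in profiles:
            aggregate = convolve(aggregate, profile[slot])
        needed = max(needed, required_cpus(aggregate, p))
    return needed


def admit(existing, candidate, capacity, p):
    """Admit the candidate if the pool still fits within capacity."""
    return pool_size(existing + [candidate], p) <= capacity


# Hypothetical two-slot profiles for two applications.
app_a = [{1: 0.75, 4: 0.25}, {2: 1.0}]
app_b = [{1: 0.75, 4: 0.25}, {1: 0.5, 2: 0.5}]

print(pool_size([app_a, app_b], 1.0))   # size for guaranteed access
print(pool_size([app_a, app_b], 0.9))   # size for assurance p=0.9
print(admit([app_a], app_b, 6, 0.9))    # does app_b fit in a 6-CPU pool?
```

In this toy example the pool for p=0.9 is smaller than the pool for p=1.0, which mirrors the direction of the reported results; the size of the effect on real workloads depends on the measured profiles and their correlations.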