A Survey of Virtualization Technologies Ken Moreau Solutions Architect, Hewlett-Packard Overview This paper surveys the virtualization technologies for the servers, operating systems, networks, software and storage systems available from many vendors. It describes the common functions that all of the virtualization technologies perform, shows where they are the same and where they are different on each platform, and introduces a method of fairly evaluating the technologies to match them to business requirements. As much as possible, it does not discuss performance, configurations or base functionality of the operating systems, servers, networks or storage systems. The focus for the audience of this document is a person who is technically familiar with one or more of the virtualization products discussed here and who wishes to learn about one or more of the other virtualization products, as well as anyone who is evaluating various virtualization products to find the ones that fit a stated business need. Introduction Virtualization is the abstraction of server, storage, and network resources from their physical functions, e.g. processor, memory, I/O controllers, disks, network and storage switches, etc, into pools of functionality which can be managed functionally regardless of their implementation or location. In other words, all servers, storage, and network devices can be aggregated into independent pools of resources to be used as needed, regardless of the actual implementation of those resources. It also works the other way, in that some elements may be subdivided (server partitions, storage LUNs) to provide a more granular level of control. Elements from these pools can then be allocated, provisioned, and managed—manually or automatically—to meet the changing needs and priorities of your business. Virtualization technologies are present on every component of the computing environment. The types of virtualization covered in this paper include: Hard partitioning, to electrically divide large servers into multiple smaller servers Soft or dynamic partitioning, to divide a single physical server or a hard partition of a larger server, into multiple operating system instances Micro partitioning, to divide a single processor and a single I/O card among multiple operating system instances Resource partitions, to allocate some of the resources of an operating system instance to an application Workload management, to allocate an application and its resources to one or more operating system instances in a cluster or resource partitions in an operating system instance Dynamic server capacity, such as Pay Per Use, instant Capacity (iCAP) or Capacity On Demand (iCOD) Clustering, to divide an application environment among multiple servers. (This includes cluster network alias) Network Address Translation, to have a single load balancing server be the front end to many (dozens or hundreds) of servers for near infinite scalability © 2008 Hewlett-Packard Development Company, L.P. 
Media Access Control (MAC) spoofing, to simplify the network management tasks by allowing a single network port act as the front end for multiple server NIC ports Redundant Array of Independent Disks (RAID), whether it is done by operating system software, 3rd party software acting as an application, the storage controller in the server, or the storage controller in the storage array itself Channel bonding, to combine the functionality and performance of multiple I/O cards into a single logical unit In-cabinet storage virtualization, to allocate a logical disk unit across multiple physical disks that are connected to a single set of storage controllers Between-cabinet storage virtualization, to allocate a logical disk unit across multiple physical disks that are connected to multiple storage controllers. Some of the virtualization technologies that affect servers but are more focused on the desktop is application virtualization: Remote desktop, where the desktop is used only as a delivery device, supplying the display, keyboard and mouse, and the application runs on the server. Citrix Presentation Server (previously known as MetaFrame) and Microsoft Windows Virtual Desktop are the most popular products in this area. Virtual applications, where some software components are installed on the desktop and coordinate with a server to that the desktop O/S recognizes the application as fully installed. Altiris Software Virtualization is the most well known in this area, with Microsoft SoftGrid becoming a player. Application streaming allows the application to run on either the desktop or the server, whichever is most appropriate at that time. AppStream and Softricity (recently acquired by Microsoft) are the most well known in this area. Allowing software to be delivered as a service, where some components run on the desktop and some run on the server. Web-based e-mail is the most common application, but any web page which delivers Java to the desktop but does a great deal of processing on the server also fit this model. Eventually this paper will discuss all of the above technologies. However, at this time I will only discuss the server partitioning technologies. History In the early days of computing, there was a direct connection between the problem being solved and the hardware being used to solve it. In 1594 John Napier invented the Napier Bones to perform multiplication, while in 1727 Antonius Braun made the first general purpose calculator that could add, subtract, multiple and divide. It took Charles Babbage to develop a general purpose machine that could actually be “programmed” to perform different operations, when he created the Analytical Engine in 1834. In 1886 Herman Hollerith constructed a mechanical device that added and sorted for the US Census of 1890, leading to the creation of IBM. Even after the invention of the electronic calculators and computing engines, such as the “Bombe” in 1943, there was still a direct connection between the problem to be solved and the physical system needed to solve it. The earliest form of programming involved wiring a board in order to solve a specific problem, and changing the program involved physically moving the wires and connections on the board itself. The invention of the transistor and electronic circuits, plus the concept of changing the paths of the electricity flowing through those circuits without having to change their physical configuration, finally broke this “hard wired” connection. 
In 1946 Alan Turing publicized the © 2008 Hewlett-Packard Development Company, L.P. concept of “stored programming”, helping the inventors of ENIAC to create EDVAC, as well as the Manchester “Baby” and Whirlwind. For the first time, the function of the machine could be changed without modifying it physically, and the concept of software was born. But even with this, the machines were still solving only one problem at a time, and there was a direct connection between the program and the hardware it ran on: the program could only access the physical memory and other hardware resources physically present in the system, and only one program at a time could be running. Storage was the same way, in that the storage administrator had to work with the individual disk drives, and originally individual tracks and sectors, without the benefit of such abstractions as file names and directories. Further, they spent a lot of time striving for good performance by physically moving data from one disk to another in order to balance the load of both performance and physical space on the drives. This was incredibly time consuming and disrupted the use of the systems because the disks needed to be off-line while it was being done. Further, it was a job that was never finished: as new workloads appeared and as the workloads shifted day to day (for example, end of the month processing might be vastly different than middle of the month processing), the work needed to be done again and again and again. And because the storage was attached to the servers via controllers embedded in the system, adding new storage often required taking the server down to install new I/O cards. But with the invention of operating systems and timesharing, this paradigm began to be broken: each process could act as if it had an entire computer dedicated to itself, even though that computer was being shared between many programs running in different processes. In effect, the operating system “virtualized” the single physical machine into many virtual machines. This continued with the invention of virtual memory by Tom Kilburn in 1961, time-sharing on the Atlas machine in the early ‘60s, and symmetric multi-processing on many machines in the ‘70s, where each program still saw a flat memory space and a single processor, but the processor being used from one machine cycle to the next might be different physical units, and the memory might be stored on hard disk at some point, but the program neither knew nor cared about such details. Similarly, the invention of storage controller based RAID (Redundant Array of Independent Disks) allowed the storage administrator to bundle a few disks together into slightly larger groups, of two or three disks in a mirrorset (RAID-1) or four, six or eight disks in a striped mirrorset (RAID 1+0). Then placing these storage controllers into the storage units themselves removed them from the host system, removing their management from the system administrators completely. The industry has now taken that virtualization to the next level, where we are breaking the connection between the physical computer systems and the programs they run, and having the freedom to move processes between computer systems, to have many I/O cards act as a single very fast I/O card, to have many processors act as a single very fast processor, or to have a single very fast processor act as many slightly slower processors. 
And we are doing the same thing with storage, where a hundred or more disks can be grouped together and managed as a single entity, with automatic load balancing and on-line expansion, and attaching them to the servers via FibreChannel switches or network switches via iSCSI, such that new storage can be added to the environment without needing to take down the servers. Server Virtualization Hard Partitioning Hard partitioning divides a large symmetric multi-processing (SMP) server into multiple smaller servers, with full electrical isolation between the partitions. In this way each of the hard partitions can be powered off and restarted without affecting any of the other hard partitions, effectively creating multiple independent servers from a single server. Each of the hard partitions commonly has processors, memory and I/O slots allocated to them. © 2008 Hewlett-Packard Development Company, L.P. Hard partitions can be treated as individual servers, and the operating systems are unable to tell the difference between a hard partition in a large server and an entire smaller server. The operating system simply sees the resources that are allocated to it. The advantage of hard partitions over the same configuration of processors and memory in smaller servers is flexibility. If you have, for example, a single sixteen processor server that you have hard partitioned into four four-way servers, you have achieved the same functionality as if you bought four four-way servers. But you now have the ability to re-partition and create eight two-way servers, or two four-way and one eight-way server, simply by shutting down the operating system instances, issuing a few partition manager commands or mouse clicks, and booting the operating systems into the new hardware configuration. As workloads change and new business requirements occur, this can be very useful. Further, there are many cases where the maintenance and licensing costs of the single larger server can be less than the similar charges for the same processor count of multiple smaller servers. The basic architecture of the server will either allow hard partitioning or not. The three common server architectures are bus-based, switch-based and mesh-based. Bus-Based Switch-Based Mesh-Based Figure 1 Server Architectures Bus-based (also known as “planar”) architectures have a single path that enables communication between all of the components in the system. (Some of the more advanced planar systems might have two paths, one for memory and one for I/O, but the principle still applies.) All processors, memory slots and I/O slots “plug in” to the path(s) in order to send messages to the other components. This entire environment, with processor slots, memory slots, I/O slots all connected together is called a “chipset”. In this way the processors can read and write the memory and send and receive information through the network and storage cards directly, without passing through any intermediate component. The best analogy for this design is that of a flat network, where all of the systems plug directly into a single network repeater. The advantage of this design is that the latency between any two components is identical: any processor can read any memory location with the same access times, and any I/O card can directly access either any processor or any memory location with the same access times. This is commonly known as a Uniform Memory Architecture (UMA). 
The disadvantage of this design is that this single communications path can become clogged with messages, forcing the different components to wait to send their message. This sets an upper limit on scalability. However, this upper limit is high and getting higher all the time, as the clock rate on the communications path is increased by the chipset vendors. The bandwidth of the communications path is more than adequate for low-end and mid-range servers, which tend to focus on scale-out and multiple cooperating machines for larger workloads. Switch-based architectures are more like modern networks, with switches rather than hubs connecting the processors, memory and I/O slots, allowing multiple simultaneous communications. High end systems have a tiered approach, where switches connecting the processors and memory (and frequently the I/O slots) also connect to higher level switches that act as the systems interconnect, to allow communications across many components. This is analogous to a modern network, with many switches and routers connecting many sub-networks. © 2008 Hewlett-Packard Development Company, L.P. The advantage of this design is that it increases the total throughput between the components, and allows linear scalability as new components are added, since the additional new components provide additional switch throughput. The disadvantage of this design is that it increases the latency of the communications between the components that are multiple switches away. This is called Non Uniform Memory Architecture (NUMA), because the latency between components on the same switch is different than the latency between components that cross the additional switches. If the additional latency is high enough, it can cause significant problems in performance as you scale up with larger numbers of processors, multiple gigabytes of memory, and many I/O slots. Mesh-based architectures place the switches in the processor itself, allowing the processors to define the routing of any messages. All memory is directly connected to the processor, instead of being connected to the processor by an external switch. This leads to dramatically lower memory latency when the processor accesses directly connected memory, and only moderate additional latency when the processor access memory that is directly connected to another processor in the mesh. This results from the combination of the high speed of the processor interconnect where each additional processor adds memory bandwidth, and the intelligent routing of the messages that minimizes the number of intermediate hops between processors. In practice, the average memory latency in a mesh architecture compares very well to the local memory latency in a switch architecture, is better than the memory latency in a bus architecture, and is dramatically better than the memory latency across multiple switches in a switch architecture because there are no bottlenecks for the memory transfer. The most significant advantage of switch and mesh based designs is that they allow for hard partitioning, simply by having the various switches (whether they are internal or external to the processor) allow or prohibit communications between various components. This provides the electrical isolation required by hard partitions that is not possible in a bus-based design. AMD Opteron 200 and 800 (and more recently the 2000 and 8000) class processors are technically mesh-based, in that they have the inter-processor connections on the processors themselves. 
However, at this time neither AMD nor any systems vendor has implemented the dynamic routing needed for hard partitioning. All of the partitioning being done with Opteron systems is being done at the switch level, outside of the processor. Therefore, I have classified the Opteron systems based on the types of partitioning which are available today, either bus-based or switch-based. If/when AMD and systems vendors offer systems which implement mesh-based partitioning, I will modify this paper. The following shows the partitioning capabilities of the major servers on the market today: Bus-Based Servers (no hard partitions) Switch-Based Servers (hard partitions) Mesh-Based Servers (varies) Dell PowerEdge (all) None None HewlettPackard HP 9000 PA-RISC rp34xx, rp4440 Integrity rx16xx, rx26xx, rx4640 NonStop (all) ProLiant (BL, DL, ML xx0) AlphaServer DS1x, DS2x, ES40, ES45 (Note: no hard partitions) GS80, GS160, GS320 HP 9000 PA-RISC rp74xx, rp84xx, Superdome Integrity rx76xx, rx86xx, Superdome AlphaServer ES47, ES80, GS1280 (Note: hard partitions) © 2008 Hewlett-Packard Development Company, L.P. ProLiant (BL, DL, ML xx5) (Note: no hard partitions) IBM System OpenPower (all) x 2xx, 3xx, 33x, 34x i and p 510, 520, 550 System x 440, 460, 3950 i and p 560, 570 (Note: hard partitions between chassis only) System i and p 590, 595 (Note: no hard partitions) NEC Express5800 (all) Express5800 1080Rf, 1160Xf, 1320Xf None Sun Micro systems Sun Fire V1xx, V2xx, V4xx, V8xx, V1280 Sun Fire E2900, E4900, E6900, 12K, 15K, 20K, 25K (Note: no hard partitions) Sun Fire V20z, V40z X2100, X4100X4600 (Note: no hard partitions) Unisys ES3105, ES3120, ES3140 ES7000 Aires (all), Orion (all) None Figure 2 Server Hard Partitioning Dell Dell uses a bus architecture for the PowerEdge servers (Intel Pentium Xeon and AMD Opteron), therefore there is no hard partitioning in any Dell server. Hewlett-Packard HP uses a bus architecture for the low-end and mid-range servers: HP 9000 rp1xxx, rp2xxx and rp4xxx lines (PA-RISC), Integrity rx1xxx, rx2xxx, rx3xxx and rx4xxx lines (Itanium), NonStop servers (MIPS) and ProLiant BL, DL and ML lines (AMD Opteron and Intel Pentium Xeon). Therefore, there is no hard partitioning for these servers. HP uses a switch architecture for some entry level and mid-range servers: AlphaServers DS1x, DS2x, ES40 and ES45 lines (Alpha) Because these are entry level and mid-range, there is no hard partitioning for these servers. HP uses a cross-bar architecture for some high-end servers: AlphaServer GS80, GS160 and GS320 (Alpha) HP 9000 rp7xxx, rp8xxx and Superdome lines (PA-RISC) and Integrity rx7xxx, rx8xxx and Superdome lines (Itanium). The cross-bar architectures between the AlphaServer GS80, GS160 and GS320, and the HP 9000 and Integrity high-end systems are shown in Figure 3. i/o cell cell crossbar backplane superdome cabinet AlphaServer crossbar crossbar © 2008 Hewlett-Packard Development Company, L.P. crossbar i/o backplane superdome cabinet HP 9000 and Integrity Figure 3 HP Cross-Bar Architectures Each of the basic building blocks in the AlphaServer line (code named Wildfire) is called “System Building Blocks” (SBB’s), while the sx1000 and sx2000 chipset defines “cells” in the HP 9000 and Integrity systems. In both cases the internal switch (called a “cell controller” in the sx1000 and sx2000) connects the four processor slots, memory slots, ports to external I/O cages for the I/O cards, and ports to the system cross-bar switch. 
The AlphaServer has a single global switch that connects all of the SBB’s, while the HP 9000 and Integrity servers have two levels of switch between the cells: a cross-bar that connects up to four cells, and direct connections from each cross-bar to three other cross-bars. The management console directs the management processor of the system to either allow or not allow communications between specific components across the global and cross-bar switches, defining the partition configuration. This defines the hard partitioning, which is therefore granular to the SBB and cell level. This hard partitioning offers true electrical isolation such that one partition can be powered up or down without affecting any other partition. In September 2007, HP added the capability of adding and removing cells to and from the hard partitions in the cell based Integrity servers. Called “dynamic nPars”, it works similarly to Sun Dynamic System Domains, in that removing a cell board from a hard partition consists of preventing any new processes from being scheduled on the processors in the cell to be removed, moving all data from the memory in the cell to memory in other cells, and (optionally, if there is an I/O chassis attached to the cell) switching all I/O to the alternate paths on I/O chassis on the remaining cells. When this has been accomplished, the cell is detached from the hard partition. Adding a cell with or without an I/O chassis to a hard partition is simply the process in reverse. One significant difference between the two designs is the way memory is used by the operating systems. Using OpenVMS and Tru64 UNIX, the AlphaServer optimizes the use of SBB local memory in order to minimize memory latency, but this comes at the expense of much higher memory latency when crossing the global switch: the dreaded “NUMA effect”. This has significant effects on performance, far more than anyone realized at first release. The most recent versions of the operating systems worked on reducing this behavior, but it is still a factor. HP-UX on the HP 9000 and Integrity servers balances memory access across all available cells. This increases the average memory latency, but almost completely eliminates the NUMA effect, and results in almost UMA performance on this NUMA design. HP offers the option of allowing “cell local memory” for both the HP 9000 and Integrity servers, but in practice most users disable this feature. The primary exception to this rule is high performance computing (HPC), where each workload can be isolated to a particular cell’s processors and memory, and memory bandwidth and latency are critical. Note that the communication across the cross-bar switches in the HP 9000 and Integrity servers in the sx1000 chipset is limited to 250MHz, independent of the clock speed of the processors. In the sx2000 chipset this was changed to use a serial link at a much higher clock rate. This eliminated a single point of failure of the clock module. HP uses the mesh architecture for some high-end servers: AlphaServer ES47, ES80 and GS1280 (Alpha). The ES47, ES80 and GS1280 (code named Marvel) systems use the mesh architecture of the Alpha EV7 processor to connect each processor to up to four other processors (at the N(orth), S(outh), E(ast) and W(est) connections), to the memory that is directly attached to that processor through the two memory ports, and to the external I/O cage for the I/O cards. 
The management console directs the firmware of the GS1280 as to which processors and I/O cages are connected to each other, defining the partition configuration. This defines the hard partitioning, which is therefore granular to the processor and card cage level. This hard partitioning offers true electrical isolation such that one partition can be powered up or down without affecting any other partition. © 2008 Hewlett-Packard Development Company, L.P. 4-5* x RDRAM S N EV7 W E 4-5 x RDRAM I/O EV7 Processor Ports EV7 Mesh Figure 4 HP Alpha EV7 Architecture The EV7 architecture allows superb memory access times to all locally connected memory, and the high-speed processor interconnect allows good to excellent memory access times to all of the local memory on the other processors. The actual access time is dependent on the number of “hops” (processor to processor communications) that the memory access requires. Figure 5 shows the affect of hops on memory latency in nano-seconds, as the distance (“hops”) from the center point (of 75ns) increases: 355 319 283 247 211 247 283 319 319 283 247 211 175 211 247 283 283 247 211 175 140 175 211 247 247 211 175 140 75 140 175 211 283 247 211 175 140 175 211 247 319 283 247 211 175 211 247 283 355 319 283 247 211 247 283 319 391 355 319 283 247 283 319 355 Figure 5 EV7 Memory Latency The mesh based architecture tends to localize the memory accesses to either the processor which is running the program accessing the memory, or processors very close to that processor. An example of this is a large in-memory data structure such as an Oracle SGA which will not fit into the memory of a single processor. And the process scheduler tends to schedule processes on processors near the memory they will access. The net result of this is that the average hop count is less than 2, which results in memory access times which are equal or superior to bus-based systems. IBM IBM uses a bus architecture for most servers: eServer xSeries 326 (AMD Opteron), eServer BladeCenter HS20 and HS40 (Intel Pentium Xeon), eServer BladeCenter JS40 (PowerPC), eServer OpenPower (Power5), System i and p 510, 520 and 550 (Power5+), and System x servers (Intel Pentium Xeon). © 2008 Hewlett-Packard Development Company, L.P. Therefore, there is no hard partitioning in these servers. IBM Power4, Power5 and Power5+ are mesh based (in what IBM refers to as an “inter-processor fabric”) for many servers: System i servers (Power4 and Power5/5+) and System p servers (Power5+). 16-way (1 book) 64-way (4 books) Figure 5 IBM Power4/5/5+ Architecture Figure 5 shows the connections between each processor. Each blue block is a multi-processor module (MCM) holding four Power processors (the chartreuse, green, blue and purple blocks). Each processor can connect to two other processors via the distributed switch on the MCM, and one other processor on each of two other MCM’s. Each processor has its own memory and I/O slots. Note that the Power4 series does not have L3 cache, where this was added for the Power5 and Power5+. Two MCM’s with four processors each, with two threads per Power5 or Power5+ processor (which IBM calls a “16-way”, as there are 8 processors with 16 cores) is called a “book”, and multiple books are connected to form a system. Not all connections between the distributed switches are always used. In the dual-processor module (DCM) or quad-processor module (QCM) or in systems with 16 or 32 processors, only two or three connections between processors are used. 
Note the extra connection between the processors in the 64-way configuration that is not present in the 16-way configuration. IBM achieves their excellent performance because of the high bandwidth and low latency of the distributed switch. The processors can communicate with the other processors in the same MCM or QCM on each clock tick, while communicating between modules on every other clock tick. The “hop” distance and memory latency and bandwidth in the 4 book environment is similar to the Alpha EV7 environment for 32 processors, but will give better performance overall because the dual-core nature of the Power5 allows one less hop for the same number of cores. There is no hard partitioning for these servers. IBM has chosen to implement its mainframe “logical partition” capability, with either LPARs (logical partitions, prior to AIX 5.2, with fixed allocations of dedicated processors, memory and I/O slots), DLPARs (dynamic logical partitions, starting with AIX 5.2, which also has dedicated processors, memory and I/O slots but added the ability to move those between DLPARs without rebooting) and SPLPARs (shared processor logical partitions, using the IBM Hypervisor as a host operating system for several guest operating systems). IBM discusses “processor hot-swap”, but because of the lack of electrical isolation in the pSeries, this is actually part of their Capacity Upgrade on Demand (CUoD) package, where there are spare processors not allocated to a partition which can be activated as needed to either augment the existing processors or replace a failed processor in a partition. Specifically, IBM does not offer the © 2008 Hewlett-Packard Development Company, L.P. capability to physically remove and replace a processor without powering down the system. Therefore, the pSeries does offer the capability to delay shutting down all of the logical partitions until it is more convenient to do that, but does require that the entire system be shut down in order to repair or upgrade processors or memory. IBM uses the X-Architecture for the high-end System x servers. Each XA-32 chipset (Intel Pentium Xeon processors) or XA-64 chipset (Intel Itanium processors, which IBM is de-emphasizing) is called a “node” and has four processor slots, memory slots, and a PCI-X to remote I/O bridge controller. The enhanced chipset adds an L4 cache and a synchronous memory interface buffer. Note that instead of a cross-bar architecture, communications between the nodes is done via the SMP expansion ports. There are three SMP expansion ports per node, meaning that each node can communicate directly with only three other nodes. This limits scalability to 16 processors, but allows hard partitioning for these servers. Note that in this case, the SMP expansion ports are the only communications path that can be enabled or disabled, such that the partitioning granularity is at the node level. This offers true electrical isolation between nodes. IBM uses a similar architecture for the System p560 and p570 servers. Each building block contains a processor drawer with four processors, memory slots and I/O slots, and can be connected to other building blocks. The key difference between the p570 and the x servers is that the x servers connections are point-to-point, while up to four building blocks of the p570 can be connected together using the appropriate “flex cable” (FC) in a ring topology. Each processor drawer must be fully populated. 
This does allow for hard partitioning and electrical isolation at the processor drawer level, but only by removing the existing flex cable and replacing it with the specific cable for that specific number of building blocks. There are different flex cables for a two building block, a three building block, and a four building block configuration, and they are not interchangeable. Further, there must be precisely one “master node” with all of the other building blocks being “slave nodes”, so you cannot easily split a four building block configuration into (for example) a pair of two building block configurations. Both of these designs limit the environment to 16 processors (4 nodes or building blocks) and neither uses a switch (point to point in the x servers and a ring topology in the p570). These would seem to be serious limitations, but given the dual core nature of the processors (Intel Pentium Xeon in the x servers and Power5 in the p570), these servers can effectively scale to 32 cores, which will cover quite a lot of the mid-range market. Figure 6 IBM X-Architecture and p570 Architecture NEC NEC uses a bus architecture for the Intel Xeon based servers: NEC Express5800/100 Therefore there is no hard partitioning for these servers. © 2008 Hewlett-Packard Development Company, L.P. NEC uses a switch architecture for the Intel Itanium based servers: NEC Express5800/1000 Figure 7 NEC A3 Architecture The A3 architecture uses a high speed crossbar switch to connect each of the “cells”, each of which contains four processor sockets and 32 memory slots. The crossbar connects each of the cells to each other, as well as to the PCI I/O slot chassis. One of the features of the A3 chipset is Very Large Cache (VLC). VLC provides a separate Cache Coherency Interface (CCI) path both internally to the cell and externally between cells which enables processor cache coherency, which is also controlled by the service processor. The cache coherency scheme is tag based for inquiries and directory based for transfers. Both the standard communications through the crossbar, and the separate CCI, are managed by the service processor. This enables hard partitioning at the cell level. With Windows 2003, the hard partitioning was static, in that the operating system had to be shut down in order to modify the hard partitions. However, in Windows 2008, Microsoft implemented the Windows Hardware Error Architecture (WHEA), which provides significant advances in how the chipset and the operating system can manage hardware more dynamically. Calling it the Windows Hardware Error Architecture (emphasis added) is somewhat mis-leading, in that the capabilities of the WHEA’s interaction between the hardware and the operating system go far beyond simply handling errors. NEC took advantage of these advances, and implemented Dynamic Hardware Partitioning, which allows hot replacement and hot addition of components (processors, memory, I/O and full cell boards) to a hard partition. Sun Microsystems Sun uses a bus architecture for the low-end and mid-range servers: Sun Fire V100, V120 (UltraSPARC IIi), V210, V240, V250, V440 (UltraSPARC IIIi), V480, V880 and V1280 (UltraSPARC III) Sun Fire V490 and V890 (UltraSPARC IV) and Sun Fire V20z. V40z, X2100, X4100 through X4600 servers (AMD Opteron). Therefore there is no hard partitioning for these servers. © 2008 Hewlett-Packard Development Company, L.P. 
Figure 8 Sun Fireplane Architecture Sun uses a cross-bar architecture called the Fireplane for the mid-range and high-end servers: Sun Fire E2900, E4900 and E6900 (UltraSPARC IV) and Sun Fire 12K, 15K, E20K and E25K lines (UltraSPARC IV). Each system board (in the E2900 server) or Uniboard (E4900 and above) has 4 processor slots, memory slots and direct attached I/O slots. This is equivalent to an HP Alpha System Building Block or HP Integrity cell board or IBM node. The system boards and Uniboards are switch-based, where a pair of processors and their associated memory are connected by a switch, and the two processor/memory pairs are connected to each other and to the I/O slots by a second level switch. Each pair of system boards and Uniboards are connected to each other by an expander board (where one system board or Uniboard is “slot 0” and the other is “slot 1”), and expander boards are connected to each other by the Fireplane interconnect, which consists of another level of three cross-bar switches: Address Cross-Bar aka the “data switch”, which passes address requests from one system board or Uniboard to another system board or Uniboard, Response Cross-Bar aka the “system data interface”, which passes requests back to the requesting system board or Uniboard, and Data Cross-Bar aka “Sun Fireplane interconnect”, which transfers the requested data between system boards or Uniboards. Note that all remote memory accesses in a split expander take 13ns longer than the same remote memory accesses in an expander where both slots are in the same domain. Sun documentation uses the terms “3x3 switch”, “10x10 switch” and “18x18 switch”. This terminology indicates the number of ports in each switch: 3, 10 or 18 ports. So, for example, the 18x18 switch connects 18 Uniboards in the E25K, for a total of 72 processors (18 Uniboards each with 4 processors). © 2008 Hewlett-Packard Development Company, L.P. Note that there four levels of switches involved in the high end systems: “3x3 switch”, which connects a pair of processors and their memory in a system board or Uniboard to each other, “3x3 switch”, which connects two pairs of processors and their memory together into a complete system board or Uniboard, “10x10 switch”, (also known as an expander board) which connects two system boards together, and “18x18 switch”, which connects the expander boards in the E20K and E25K servers. These switches allow partitioning for these servers, which Sun refers to as Dynamic System Domains (DSD). DSD partitioning is done at the system board (slot) level. When both system boards of an expander are connected to different domains, the expander is called a “split expander”. Figure 9 Sun Fireplane Partitioning All components can be dynamically removed from a domain and added to a domain, similar to the HP Dynamic nPar. Removing a system board or Uniboard from a domain consists of preventing any new processes from being scheduled on the processors in the system board or Uniboard, moving all data from the memory in the system board or Uniboard to memory in other system boards or Uniboards, and switching all I/O to the alternate paths on other system boards or Uniboards. When this has been accomplished, the system board or Uniboard is detached from the domain, and repair operations can be performed on it. Adding a system board or Uniboard to a domain is simply the process in reverse. 
Note that a failed system board or Uniboard must have the system go through the hardware validation process (equivalent to the Power On Self Test (POST) of other systems) in order to be added to a running domain. This validation process requires that all running partitions be shut down, as this is a system wide activity, and can take an extended period of time (12 minutes for © 2008 Hewlett-Packard Development Company, L.P. one configuration of a Sun Fire 4800). Migrating the running memory from the existing system board or Uniboard to the added system board or Uniboard can take additional time (minutes). Sun states that this offers “electrically fault-isolated hard partitions”, but there is a significant difference in this than what other vendors mean by this term. Sun emphasizes the “on-line add and replace” (OLAR) features of the servers, and does not refer to any of the other capabilities of hard partitioning. Therefore, because of this different focus, by the definitions as recognized by this paper and the other vendors, Dynamic System Domains do not offer hard partitioning. Unisys Unisys uses a bus architecture for the blade and mid-range servers ES3105 (Intel Pentium Xeon), ES3120, ES3120L, ES3140, ES3140L (Intel Pentium Xeon) Therefore there is no hard partitioning on these servers. Unisys uses either a cross-bar and point-to-point architecture for the high end servers ES7000/one (Intel Itanium and Pentium Xeon) ES7000/400 (Intel Itanium and Pentium EM64T) The ES7000/400 servers are built around the Intel E8870 chipset, which has eight processor slots, up to 64GBytes of memory and eight PCI-X I/O slots. A pair of pods is connected to the Cellular MultiProcessing V1 (CMP) cross-bar switch, and up to two CMP cross-bar switches are connected together to form a large SMP server. Unisys also adds up to 256MBytes of shared L4 cache for all of the processors, and up to 12 additional PCI-X I/O slots. The ES7000/400 is built with two power domains, such that the maximum configuration in a single partition is half of the entire system, or four of these chipsets for a maximum of 16 processors, 256GBytes of memory and 32 I/O slots (using the 100MHz PCI-X cards) or 16 I/O slots (using the 133MHz PCI-X cards). The ES7000/400 can also be partitioned into four partitions, each containing two of the chipsets. The hard partitioning is at the chipset boundary. Note that the processor slots can either contain Itanium or Pentium processors, but you cannot mix processor types in a single partition. Pentium is supported for backwards compatibility, with Itanium being the premier offering. The ES7000/one (previously known as the /400 and /600) server is built around the “cell”, which is a 3U module with 4 processor slots, up to 64GBytes of memory and 5 PCI-X I/O slots. A pair of cells is connected to the Computer Interconnect Module 600 (CIM600), and 4 of the CIM600’s are connected together with the Cellular MultiProcessing V2 (CMP2) FleXbar point-to-point cabling to form a large SMP server. Unisys also adds up to 384MBytes of shared L4 cache for all of the processors, and up to 88 additional PCI-X I/O slots. Note that each cell can only be connected to two other cells, so in a fully configured ES7000/one, there will be up to 3 hops to access memory in another cell. The ES7000/one can be hard partitioned down to the cell, so the system can support from 1 to 8 hard partitions. SMP cables 8P 8P Clock Sync CIM CIM 8P 8P © 2008 Hewlett-Packard Development Company, L.P. 
Clock Module CMP Cross-Bar Switch CMP2 FleXbar cabling Figure 10 Unisys Cellular MultiProcessing Architecture Soft Partitioning Hard partitions are done exclusively in the server hardware. The operating systems are unaware of this, and treat all of the hardware that they see as the complete set of server hardware, and require a reboot in order to change that view of the hardware. But some operating environments on some servers also have the ability to further divide a hard partition into multiple instances of the operating system, and then to dynamically move hardware components between these instances without rebooting either of the running instances. These are called soft, or dynamic, partitions. One of the advantages of dynamic partitions is that they protect your environment against peaks and valleys in your workload. Traditionally, you build a system with the processors and memory for the worst-case workload for each of the operating system instances, accepting the fact that this extra hardware will be unused most of the time. In addition, with hard partitioning becoming more popular through system consolidation, each operating system instance in each hard partition requires enough processors and memory for the worst-case workload. But in many cases, you won’t need the extra hardware in all operating system instances at the same time. Dynamic partitioning lets you move this little bit of extra hardware between partitions of a larger system easily, without rebooting the operating system instance. For example, you can have the majority of your processors run in the on-line processing partition during the day when the majority of your users are on-line, automatically move them to the batch partition at night (that was basically idle all day but is extremely busy at night), and then move them back to the on-line processing partition in the morning, using a script that runs at those times of the day. This same functionality can be accomplished with hard partitions, but it requires that you reboot the operating system instances in both of the partitions. This is often inconvenient or impossible in production workloads, while dynamic partitioning does not require rebooting of either operating system instance. Soft partitioning functionality is as follows: Move processors Move I/O slots Move memory Share memory between partitions DLPARs AIX Yes Yes Yes No Galaxy OpenVMS Yes No No Yes Logical Domains Solaris T1 and T2 Yes No No No vPars HP-UX Yes No Yes No Figure 11 Soft Partitioning IBM DLPARs IBM’s Dynamic Logical PARtitions (DLPARs) on AIX (Advanced Interactive eXecutive) are an extension of the Logical PARtitions (LPARs) that have previously been available under the eServer iSeries and pSeries. LPARs are allocated processors, memory (in 256MByte chunks called “regions”) and I/O cards, and these allocations are fixed at boot time and can only be modified by rebooting the LPAR. DLPARs add the “dynamic” part, which is the movement of resources such as processors, memory and I/O cards between DLPARs. Note that, as shown in the previous section, © 2008 Hewlett-Packard Development Company, L.P. the eServer iSeries and pSeries does not offer hard partitions, so IBM offers LPARs and DLPARs instead. Moving resources between DLPARs involves cooperation between the server firmware and the operating system. This server firmware is an optional upgrade and is called the “Power Hypervisor”. 
Each AIX instance has a set of “virtual resource connectors” where processors, memory and I/O slots can be configured. When a resource is added to an instance, a message is sent by the Hardware Management Console (HMC) through the Hypervisor to the operating system instance that then connects that resource to one of the virtual resource connectors. The HMC keeps track of which resources belong to which operating system instance (or no operating system instance, if the resource is unassigned), and the operating system will receive an error if it attempts to assign a resource without the cooperation of the HMC. When an operating instance releases a resource, the Hypervisor initializes it to remove all existing data on the resource. IBM has grouped all of the virtualization technologies under the “Virtualization Engine” umbrella, which covers LPARs, DLPARs, micro partitioning, virtual I/O server and monitoring of the entire environment. I will discuss the other technologies later. HP Galaxy and vPars HP’s Galaxy on OpenVMS uses the Adaptive Partitioned MultiProcessing (APMP) model to allow multiple OpenVMS instances to run in a single server that can be either an entire server or a hard partition. Each OpenVMS instance has private processors, memory and I/O slots. OpenVMS Galaxy is unique among the soft partitioning technologies in that it can share memory between partitions for any program to use (note that the Unisys shared memory is only for LAN interconnect and is between hard partitions). The partition manager is used to set the initial configuration of the Galaxy instances by creating and setting the values of console variables. Processors can then be moved between instances either by the global WorkLoad Manager (gWLM) software or by the partition manager moving the processor between the instances. HP’s vPars on HP-UX allows multiple HP-UX 11i instances to run in a single server that can be either an entire server or a hard partition on HP 9000 PA-RISC or Integrity cell-based servers. Each HP-UX instance has private processors, memory and I/O slots. The partition manager is used to set the initial configuration of the vPars. Processors and memory can then be moved between instances by either gWLM or by the partition manager moving the resource between the instances. Note that moving an active processor from one vPar to another takes under a second, but moving active memory from one vPar to another takes longer, due to the need to drain the memory of the active information in it, and then to cleanse the memory for security purposes: this can take several minutes. Whether it is done by the partition manager, by the HP-UX or OpenVMS instance, or by the gWLM tool, the operation is the same: stopping the processor in one instance, re-assigning it to another instance, and then starting the processor in the new instance. The function of stopping and starting the processor has been in the respective operating systems for years: what is new is the ability to re-assign the processor between instances of the operating systems. Both HP-UX and OpenVMS offer tools to dynamically balance processors across dynamic partitions. As previously mentioned, both HP-UX and OpenVMS Work Load Manager (WLM) and Global Work Load Manager (gWLM) work alongside the Process Resource Manager (PRM) to offer goal- based performance management of applications. Processors can be moved between vPars in order to achieve balanced performance across the larger system. 
OpenVMS also offers the Galaxy Balancer (GCU$BALANCER) utility, that can load balance processor demand across multiple Galaxy instances. Both of these work with either clustered or un-clustered instances. HP has grouped all of the virtualization technologies under the “Virtual Infrastructure” and “Virtual Server Environment” umbrella, which covers hard partitioning, soft partitioning, clustering, pay-per-use and instant capacity. I will discuss the variable capacity technologies later. © 2008 Hewlett-Packard Development Company, L.P. Sun Microsystems Observant readers will note that Sun Dynamic System Domain (DSD) is not present in the table, even though it does allow some movement of components between domains. As I discussed in the previous section, DSD is more of an “on-line add and replace” (OLAR) capability than a partitioning technology. Sun documentation does not discuss the dynamic movement of resources in the same way as both HP and IBM do, so I have chosen to follow their emphasis and not include DSD as a dynamic partitioning technology. Sun recently introduced Logical Domains (LDoms). Sun states that they fit between the hard partition capability of Dynamic System Domains and the resource partitions of Zones. LDoms are bound to a particular set of processor and memory resources, and do allow a specific domain to directly access a specific I/O card or a specific port on a multi-port I/O card, which is very similar to an HP vPar or an IBM LPAR. However, they virtualize the physical hardware in a way more closely related to micro partitions: LDoms are assigned virtual processors which are threads inside the cores of a T1 (Niagara) processor, and they share the I/O devices between domains, which is not the way any of the other soft partitioning products from the other vendors work. Virtual processors can be moved between LDoms without rebooting, but memory and I/O resources cannot be moved. To get around this restriction, Sun introduced the concept of “delayed reconfiguration”, which means that the changes will occur the next time the LDom is rebooted. LDoms are only available on the T1000 and T2000 servers, and only for Solaris 10. Because they can run on a single thread of a single core in a processor, there can be up to 32 LDoms in a single processor T1000. Sun has characterized LDoms as soft partitions, while it seems to me that they are more similar to micro partitions. Because they are a mix of technologies, I have placed LDoms in both this section and the micro partitioning section. Micro Partitioning Hard partitioning and soft partitioning both operate on single components of the server hardware: a processor, a bank of memory, an I/O slot, etc. And while this is quite useful, the incredible advances in server performance frequently leads to server consolidation cases where an older application may require far less than an entire component of the server hardware. For example, an application that performs quite adequately on a 333MHz Pentium II with 10MByte/sec SCSI-2 and 10Mbit Ethernet, simply does not need all of the power of a 3.2GHz Pentium Xeon quad core processor with a 4Gbit FibreChannel and GigaBit Ethernet card. But if the original server needs to be retired (perhaps because parts are no longer available, or the hardware maintenance is becoming increasingly expensive), than that faster hardware might actually be the low end of the current server offerings. One popular option has been to stack multiple applications into a single operating system instance. 
This works very well for many operations, because it shares the hardware among the different applications using standard process scheduling and memory utilization algorithms in the operating system, which work extremely well. And in the cases where they do not, the standard resource management tools can be very effective. Further, this can be extremely cost effective, because it amortizes the cost of the hardware and software across the multiple applications, allowing a few application or database software per-processor licenses to serve many different applications. I will discuss this “resource partitioning” later in the paper. However, this is not a universal panacea. The different applications frequently have different operational requirements. Most often, this comes in the form of operating system or application version or patch levels, where one application needs a more recent version while another application needs to remain on an older version. One application might require system downtime for maintenance at the same time another application is experiencing a mission-critical peak time. Balancing all of these competing requirements is often difficult, and frequently impossible. © 2008 Hewlett-Packard Development Company, L.P. The solution is to divide up the extremely fast server components, and let multiple operating system instances timeshare each processor, storage card or network card: in effect, to partition the individual components in the same way we have partitioned the server in the previous cases. This is called micro partitioning. It works the same way as hard and soft partitioning, but instead of allocating whole units of server components to each operating system instance, we timeshare the components between instances of the operating system instance. The allocations are generally dynamic based on current load, in the same way that timesharing allocates processor, memory, storage and network bandwidth dynamically inside a single instance of an operating system in a server. But, again in the same way, there are tools that can guarantee specific allocations of resources to specific operating system instances. I will discuss the details of these guarantees later. The benefits of micro partitioning are many. The operating system instance can remain as it always was, with the same network name and address, accessing the same volumes, on the same patch level of the operating system and application, and in general not affecting (or being affected by) the other operating system instances. Many operating system instances can be hosted on a very small number of servers, which are treated as a “pool” of resources to be allocated dynamically among all of the operating system instances. Micro partitioning separates the physical system from the operating system instances (called the “guest” system instances) by a virtual intermediary (called the “host” system instance). The “host” instance interacts directly with the physical server hardware components, controlling and working with those components in the normal way that operating systems control and work with server hardware in a non-hosted environment. The difference is that the sole function of this host is to pretend to be the server hardware to the instances of the guest operating systems. Each guest instance operates as if it were directly interacting with the server hardware: it runs on the processor, it uses physical memory, it performs I/O through the network and storage I/O cards, etc. 
But in reality it is merely requesting some percentage of the underlying server hardware components through the host instance which is pretending to be that underlying server System Instance System Instance System Instance Application System Instance System Instance Virtualized Intermediary Virtualized Intermediary Physical System Thin Host Host OS Physical System Fat Host hardware. Figure 12 Micro Partitioning Types The host instance can be a dedicated environment (a “virtual machine monitor” or “hypervisor”) whose only function is to emulate the underlying hardware, classically known as Type 1 hypervisor but which I will call the “thin host” model. Or it can be a set of services on a standard operating © 2008 Hewlett-Packard Development Company, L.P. system, classically known as a Type 2 hypervisor but which I will call the “fat host” model. This terminology is a conscious copy of the “thin client”/”fat client” model. Note that hypervisors were first implemented in 1967 in the IBM System/360-67: like everything else in the virtualization world, they have a long history but only recently have received the attention they deserve, as the processors and I/O sub-system have become fast enough, and the memory has become dense (and cheap) enough, to allow them to be productive and cost-effective outside of the mainframe. In the fat host model, the operating system can either run its own applications in addition to the guest instances, or it can be stripped down to remove most of its services in order to enhance the performance and reduce the overhead of the guest instances. Stripping it down also has the benefit of reducing the number of targets for virus attacks, making the host instance more secure and reliable. One of the features of the “fat host” model is that it can offer services to one or more of the guest instances, such as sharing of devices such as DVDs or tape arrays. The “thin host” model cannot generally do this, because it does not offer sufficient locking primitives to the guest instances which are needed to coordinate access to shared devices or files. One interesting behavior of Type 1 (“thin host”) hypervisors is that they tend to become “fatter” over time. The IBM hypervisor running on the Power4 processors was extremely thin, much more like the Microsoft Hardware Abstraction Layer (HAL) than like a VMware ESX environment. But the Power5 and 5+ versions have added virtual I/O capabilities, processor dispatch to enable the movement of physical processors between guest instances, and reliability-availabilityserviceability (RAS) enhancements in response to customer demands. This “feature creep” has added to the overhead of the host instance, which has been partly offset by the constantly increasing speed of the underlying server hardware. But some vendors are recognizing the problems of this, and are offering the option of extremely thin hypervisors, which are actually embedded into the hardware itself. I will discuss this later in the paper. In either model, the guest instances are unaware of the existence of the host instance, and run exactly as if they were running on the “bare iron” server hardware. The guest instances generally communicate with other instances of the operating systems (whether they are guest instances on the same physical server, guest instances on another server, or standalone on yet another server) through the standard methods of networking and storage. 
Therefore, standard business-level security is the same as if these instances were running on separate systems, and no change to the application is required. All vendors provide tools which convert standalone bootable instances of operating systems (primarily the boot disk) into a format which is suitable to boot the operating system as a guest instance. In many cases this is a container file which holds the entire guest instance. This process is known as “P2V”: physical to (2) virtual. The container file is completely portable between virtual machine monitors of that type, though in general you cannot move a container file between virtual machine monitors from different vendors. In other cases, this is a standard bootable disk with a small information file needed by the host instance to boot the guest instance. The container file and information file concept leads to one of the major advantages of micro partitioning, in that some virtual machine monitors allow the guest instances to be moved dynamically between servers. A guest instance can be suspended and then effectively “swapped out” of one server, its current memory image stored in the container file or disk, and then “swapped in” and its operation resumed on another server, with all running programs and context fully intact. Some host instances can also perform the transfer of the in-memory information using standard networking, which is often faster than saving that information to the container file and then reading it back in on the target guest instance. Either way, the applications will not be executing instructions while this is occurring, but the operation is usually far faster and easier than shutting down the instance and bringing it up on another server, which would be required if the operating system instance was running standalone on a server. Note that this is done at the operating system level, not the application level. The virtual machine monitors on the separate servers take care of transparently transferring all of the network and storage connections between the servers, such that the guest instance never notices that anything has occurred. This is similar to failing over application instances between different nodes in a cluster, but is often more capable of maintaining current user connections and in-flight transactions. There are restrictions and operational requirements on this movement, but with proper planning it does allow an operating system (guest) instance to be active for longer than any of the underlying server systems have been active, which represents significant advantages in terms of achieving high availability of applications while still performing the required maintenance on the servers. One of the restrictions is that vendors frequently require extremely similar hardware. Intel Flex Migration allows a newer processor to present itself to the hypervisor as an earlier generation, so that vMotion/LiveMigrate can move guest instances across multiple generations of processors. Guest instance mobility also has an advantage in normal production, when the hardware is operating properly, which is the case almost all of the time, since today every vendor’s server hardware is extremely reliable. In the traditional model of load balancing, such as in soft partitions, we would move the resources to the workload. But with the ability to move guest instances between host instances, we can move the workload to the resources.
So if you have a group of servers which are running virtual machine monitors, you can move guest instances between them in order to achieve load balancing among a group of servers, instead of (with soft partitioning) just inside a single server. Micro partitioning is not without cost. When comparing a standalone instance of an operating system and application to that same operating system and application running as a guest instance in a micro partitioned environment, a performance overhead penalty of 20% to 40% is common. The reasons for this are many, and include the re-direction needed to execute privileged instructions, competition for resources with the other guest instances, and the simple overhead of the host instance. Further, even when there is no resource contention, each operation (usually I/O operations) will suffer from additional latency compared to the standalone model. One way to think about this is to recognize that all partitioning schemes involve some degree of lying to the operating system instances. And just like when people lie, the greater the size of the lie, the greater the amount of effort needed to support the lie. In the case of hard and soft partitioning, the lie is not that extensive, so the amount of effort (aka, overhead) is small. But in the case of micro partitioning, where the guest instance is completely isolated from the underlying hardware and the host instance (whether a thin host virtual machine monitor or a fat host operating system with micro partitioning services) is lying about every interaction the guest instance has with the hardware, the amount of effort (aka, overhead) to support the lie is high. So micro partitioning is not appropriate when the performance of the guest instance is critical and would require all of the performance of the underlying server hardware. Further, you must consider the impact of one guest instance on another guest instance, such that if two guest instances both require a large percentage of the underlying server hardware at the same time, the resources might not be available to satisfy both of them. One factor that often arises is time. Most operating systems keep track of time by counting timer interrupts or processor clock ticks. For example, Windows sets a 10 milli-second clock timer which will fire 100 times per second, in order to track timer events, though in some cases it will reset this to a 1 milli-second timer. And this works well when running on bare iron hardware. But when the processor is being shared among many guest instances, timer interrupts can be delivered late or lost entirely, and it is extremely common to see “clock drift”, where the times reported by the guest instances are slightly behind each other (and by different amounts), and significantly behind the actual time. Some operating systems notice that this is occurring, and attempt to adjust for this drift by making up for the estimated number of clock ticks which have been missed. While this is sometimes effective, if the estimation is wrong and the operating system applies too much correction, it can lead to the opposite problem: the operating system getting ahead of real time. Different virtualization products handle this in different ways. One of the most common is to use the Network Time Protocol (NTP) in the guest instance on a frequent basis to ensure that the clock is accurate. There are many white papers and other technical documentation which discuss how to handle this problem, and I have listed some of them in the appendix.
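To make the failure mode concrete, the toy model below counts timer ticks the way a guest might, drops some of them when the guest is not scheduled, and then applies a naive catch-up correction. It is purely illustrative (the tick rate, loss probability and correction factor are hypothetical), and is not the timekeeping logic of any particular operating system or hypervisor.

# Illustrative sketch only: a toy model of guest "clock drift" under a shared
# timer, not any vendor's actual timekeeping code. All names are hypothetical.
import random

TICK_MS = 10            # guest expects a timer interrupt every 10 ms
REAL_TICKS = 100_000    # roughly 1000 seconds of real time

def simulate(deschedule_probability, catchup_estimate=0):
    """Return the guest's idea of elapsed time (ms) after REAL_TICKS real ticks.

    deschedule_probability: chance that a given tick is lost because the guest
    was not running when the interrupt fired.
    catchup_estimate: ticks the guest blindly adds back whenever it notices a
    gap (a crude stand-in for "lost tick" compensation).
    """
    guest_ms = 0
    for _ in range(REAL_TICKS):
        if random.random() < deschedule_probability:
            # Tick lost: the guest's software clock silently falls behind.
            # A naive correction adds back an *estimated* number of ticks,
            # which may be too many, pushing the guest ahead of real time.
            guest_ms += catchup_estimate * TICK_MS
        else:
            guest_ms += TICK_MS
    return guest_ms

real_ms = REAL_TICKS * TICK_MS
print("real elapsed:        ", real_ms, "ms")
print("no correction:       ", simulate(0.05), "ms (behind real time)")
print("over-correction (x2):", simulate(0.05, catchup_estimate=2), "ms (ahead of real time)")

Running the sketch shows both halves of the problem described above: with no correction the guest falls behind by roughly the fraction of ticks it missed, and with a too-aggressive correction it ends up ahead of real time.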
I particularly recommend the EMC VMware paper, as it gives a solid background on the different hardware clocks and how they interact with various operating systems, primarily Linux and Windows. Note that the processors and memory are not always the point of contention. Processors have become so fast (as Moore’s Law predicted decades ago) and memory has become so large (due to 64-bit technologies becoming available on every server platform) that in very many cases the bottleneck has moved to the I/O subsystem. The overhead of the TCP/IP stack and the requirement to move very large blocks of data from the SCSI and FibreChannel cards to physical memory, while coordinating the I/O between the many guest instances across a single channel, is very expensive. And the expense is not bandwidth: guest instances can communicate with each other over the host instance’s memory, and can communicate between servers using GigaBit Ethernet (soon to be 10 GigaBit Ethernet), 4Gbit FibreChannel and even Infiniband. And all of these I/O technologies can be “trunked” together to provide massive throughput. No, the expense in most cases is latency, where the guest instances must communicate locking and other information in messages which are very small, but each message incurs latency of 20,000-50,000 nano-seconds instead of the latency of local memory which is 200-500 nano-seconds. All processors wait at exactly the same speed, and with each message taking approximately 100 times as long to reach its destination, you should carefully consider this I/O contention when evaluating how many guest instances to place on a single host. But often more important than the overhead, is the business cost of the failure of a single server. If only one application was running on that server, the impact of the failure is limited to that application (though that impact might be severe, it is still only one application). But if 10, 20 or even 100 different application instances were running on a single server, the impact would be more wide-spread and could be much more significant. This is another example of the classic “scale up” vs. “scale out” discussion. In some cases it will make more sense to have a fewer number of larger and more reliable servers with many guest instances on each server (putting all your eggs in a few baskets, and then watching those baskets extremely carefully), and in other cases it will make more sense to have a larger number of smaller and cheaper servers with fewer guest instances on each server. There is no single right answer, and you need to consider all the factors carefully when planning for micro partitioning. Para-Virtualization One of the difficulties of virtualizing the standard processors from AMD and Intel is that the operating systems use “ring 0” (the most privileged mode) for the functions such as direct memory access (DMA) for I/O, mapping page tables to processes, mapping virtual memory to physical memory, handling device interrupts, and accessing data structures which control the scheduling and interaction of the different user processes. (Interesting historical note: the Digital Equipment Corporation VAX was the first processor to use four ring levels, and most succeeding processors have followed this example, including Itanium, Opteron and Pentium). Ring 1 and ring 2 are used for varying levels of privileged functions, while applications run in ring 3 (the least privileged mode). 
But if the virtual machine monitor runs in ring 0, then a guest instance cannot also run in ring 0, because it would interfere with the virtual machine monitor and would create security problems: a guest instance could accidentally map the physical memory which the host instance had allocated to another guest instance, or respond to a device interrupt belonging to another guest instance. This problem was first addressed in software and only more recently in hardware. There are two methods that the software has for dealing with the problem of both the guest instance and the host instance each requiring exclusive access to ring 0: either the virtual machine monitor intercepts all highly privileged operations, or the developers of the guest operating system modify their source code to notice that they are running as a guest instance and therefore call the virtual machine monitor instead of executing those instructions directly. The first method is the one which is used by most early virtualization products, while the second method was pioneered in the Linux market with Xen but has since moved to the mainstream for almost all major operating systems. In addition to this interception, it is possible to use “ring compression” to run these operating systems as a guest instance. Because most operating systems did not use all of the rings (specifically, they used only ring 3 and ring 0), the hypervisor could “de-privilege” the operating system, forcing the guest instance to run in rings 3 and 1 instead of rings 3 and 0. Only the hypervisor runs in ring 0, and all privileged operations are executed by the hypervisor. This is very slow, especially when those calls occur in a loop, such as changing the page protection on a long list of pages, or mapping a long list of virtual addresses to physical addresses. If each of these operations needs to be intercepted individually, sent to the host instance for execution, and the guest instance resumes only to instantly have the next instruction intercepted as it performs the next operation in the list, performance will suffer dramatically. Para-virtualization addresses this by having the operating system modified to notice at run time whether it is running as a guest instance, and to make these calls to the hypervisor directly. This is much more efficient, but requires some assistance from the processor itself in order to fully isolate the guest instance and the hypervisor, as well as allow extremely fast communication between them. Without this hardware assistance, the hypervisor is forced to use a combination of emulation, dynamic patching and trapping in order to have the guest instance execute properly, all of which has very high overhead. This is where hardware virtualization comes in. Because of these requirements, demand grew to address the problem in hardware, specifically in the processor itself. AMD, IBM, Intel and Sun have all placed virtualization technology into their physical processors. AMD has the Pacifica project, IBM added virtualization technology to the Power4 and all subsequent processors, Intel has the Vanderpool project which implements Virtualization Technology for Itanium (VT-i) and Pentium Xeon (VT-x), and Sun has implemented the “hyper-privileged level” in the T1 (Niagara) processor. AMD, VT-x and Sun’s T1 work by creating an extra ring level, below ring 0, which Intel calls VMX Root, or “ring -1” (ring minus one).
This allows the guest instances to actually run in ring 0, and not require de-privileging the operating systems, either during booting or during development. The processor itself traps the privileged instruction or I/O operation and gives it to the virtual machine monitor, instead of executing it directly. This allows much higher performance of the guest operating systems, by reducing the overhead of transferring control between the guest and host instances whenever one of the privileged instructions is executed, as well as making it easier (or even possible) for standard editions of operating systems to run as guest instances. IBM’s POWER4 processor introduced support for logical partitioning with a new privileged processor state called hypervisor mode, similar to ring -1. It is accessed using hypervisor call functions, which are generated by the operating system kernel running in a partition. Hypervisor mode allows for a secure mode of operation that is required for various system functions where logical partition integrity and security are required. The hypervisor validates that the partition has ownership of the resources it is attempting to access, such as processor, memory, and I/O, then completes the function. This mechanism allows for complete isolation of partition resources. In the POWER5/5+ and POWER6 processors, further design enhancements were introduced that enable the sharing of processors by multiple partitions. Intel introduced virtualization technology to Itanium (VT-i), starting with the “Montecito” processor, slightly differently. A new processor status register (PSR.vm) was created, where the host instance runs with PSR.vm = 0 while guest instances run at PSR.vm = 1. All 4 privilege levels can be used regardless of the value of PSR.vm. When PSR.vm = 1, all privileged instructions and some non-privileged instructions cause a new virtualization fault. PSR.vm is cleared to 0 on all interrupts, so the host instance handles all interrupts, even those belonging to guest instances. The host instance will handle either the privileged operation or the interrupt, and then set PSR.vm = 1 and use the “rfi” instruction to return control to the guest instance. These technologies are commonly grouped under the name “Hardware Virtual Machine” (HVM), and I will use this term to refer to all of them. Note that as the percentage of processors with this HVM technology increases in the market (and this has occurred, such that all general purpose processors now have this ability), and the percentage of operating system instances running in micro partitions has increased (and this has also occurred, with virtually every customer implementing some form of server consolidation), the need for para-virtualization has increased. As stated above, para-virtualization requires the operating systems to detect whether they are running as a guest instance, and either to execute the code directly (if they are running on bare iron) or to call the virtual machine monitor (if they are running as a guest instance). These checks slow down a system running as a standard instance very slightly, but dramatically speed it up when it is running as a guest instance. As the percentage of operating system instances running in micro partitions continues to grow, this trade-off will be well worth the extremely minor cost. In 2005, VMware proposed a formal interface for the communication between the guest and host instances, called the Virtual Machine Interface (VMI).
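As a rough illustration of what such an interface buys, the sketch below shows the kind of indirection a para-virtualized kernel uses: an operations table selected at boot, so that a guest makes explicit calls to the hypervisor instead of being trapped on every privileged instruction. It is only a sketch; the class and function names are hypothetical and do not correspond to the actual VMI or paravirt-ops definitions.

# Illustrative sketch only: the general shape of a paravirtualization
# interface (an operations table chosen at boot), not the actual VMI or
# paravirt-ops definitions. All function and field names are hypothetical.

class NativeOps:
    """Used when the kernel detects it is running on bare iron."""
    def set_page_protection(self, page, flags):
        # On real hardware the kernel would write the page-table entry
        # directly with a privileged instruction.
        print(f"native: write PTE for page {page:#x} with flags {flags:#x}")

class HypervisorOps:
    """Used when the kernel detects it is running as a guest instance."""
    def set_page_protection(self, page, flags):
        # As a guest, the kernel makes one explicit call to the hypervisor
        # instead of being trapped and emulated on every privileged instruction.
        print(f"hypercall: SET_PTE(page={page:#x}, flags={flags:#x})")

def detect_hypervisor():
    # Stand-in for the run-time check ("am I a guest?") described above.
    return True

pv_ops = HypervisorOps() if detect_hypervisor() else NativeOps()

# A batch of page-protection changes now costs explicit calls the guest chose
# to make, rather than one trap-and-emulate round trip per instruction.
for page in range(0x1000, 0x5000, 0x1000):
    pv_ops.set_page_protection(page, flags=0x3)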
Xen has proposed a different formal interface for this communication, called the “hyper-call interface”, but VMware and Xen are working together to make the hyper-call interface compatible with VMI by using paravirtualization optimizations (paravirt-ops). As stated above, Linux led the way with para-virtualization, but all major O/S vendors have implemented it in their base operating systems. AIX in SPLPARs, HP-UX in HPVM (especially with the Advanced Virtual I/O system), OpenVMS in HPVM (in a future major release), and Solaris 10 in either xVM or LDoms (depending on the processor), all incorporate para-virtualization to some level, with further enhancements almost certain. Microsoft is using the Xen hyper-call interface in Windows 2008, and is calling the run-time detection “enlightenment”. So when you are deciding on a micro-partitioning solution, you need to verify whether your particular version of your particular guest operating system is running fully virtualized (where nothing needs to change in your guest environment, but performance will be relatively slow), or whether it is para-virtualized (where you need a very recent version of the operating system, but performance will be relatively good), and whether you need to use the latest processors which contain HVM as part of the mix. If you always buy the latest and greatest for your entire environment, para-virtualized guests running on processors with HVM will be very easy to find: in fact, they will be difficult to avoid. But very few customers are able to refresh their entire infrastructure just to move to the latest and greatest hardware and software, no matter how much the systems and application vendors might wish them to do so. So you need to carefully match your environment to the actual capabilities of the solution, and verify the type of virtualization that is present in your micro-partitioned solution. Dynamic Capacity There are two different meanings of dynamic capacity for micro partitioning: modifying the capacity of the host instance and modifying the capacity of the guest instance. The host instance can modify the amount of resources available to it by a variety of mechanisms. A physical resource can have an increasing number of soft errors such that the host instance operating system will remove it from the resource set in a graceful way without a crash or a reboot. A physical resource can be added to the environment via hot-add, and the host instance operating system can add it to the available resources. Some operating systems support variable capacity programs, such as IBM’s “On/Off Capacity On Demand” or HP’s “Temporary Instant Capacity”, which allow physical resources to be activated temporarily for peak processing times. No matter how it occurs, the virtual machine monitor will allocate these new resources to the guest instances, as follows. Physical resources can be allocated for the exclusive use of a guest instance, such that they are not accessible by any other guest instance. This is always done for memory, but is also useful for network or storage adapters, either for security reasons or to ensure a high level of performance. Because the guest instances usually cannot dynamically modify their virtual processor and memory resources, the guest instances will not take advantage of the new processors and memory in an exclusive way.
However, new network and storage adapters can be added to the guest instances in the standard way that host instances recognize new paths. But in most cases, the guest instances are allocated resources by the virtual machine monitor as virtual resources. In the case of processors, you specify a number of virtual processors for the guest instance, which can exceed the number of physical processors on the server, and the virtual processors are time-shared among the physical processors. In the case of network or storage adapters, you specify virtual paths which are then mapped to physical network or storage adapters or to virtual network adapters inside the virtual machine monitor. Access to the virtual resources is governed by either a priority scheme or by a percentage of the physical resources. Priority schemes work exactly as they do among processes inside a single instance of an operating system, and do not guarantee that a specific amount of resources will always be available to a specific guest instance. The percentage mechanism does provide this type of guarantee, and these percentages can be either “capped” or “uncapped”. A guest instance with capped percentages can access (for example) 50% of the physical processor resources but never any more, even if the physical resources are otherwise idle. A guest instance with uncapped percentages is guaranteed that (for example) 50% of the physical processor resources are available to it, but if the physical processors are idle it can use as much as is available. The above examples are for processors, but are equally applicable to storage and network adapter bandwidth. Note that the percentages are in terms of percentage of resources, not in terms of physical resources. 50% processor time means that the guest instance uses half of the processing time of the physical processors, such that if there are two processors it is equivalent to dedicating an entire physical processor to the guest instance. But the guest instance will execute on either of the physical processors: it is the percentage that is guaranteed, not access to a specific processor. Therefore, if the host instance modifies the amount of physical resources available to it (such as by the methods discussed above for the host instance), the percentages and priorities are computed based on the new amount of physical resources. For example, if a new processor is added to the physical server and the guest instances are running uncapped, they will see an increase in their performance. The guest instances may be able to modify their virtual resources (which I refer to as “dynamic resources”) or their percentages dynamically, or they may require a reboot of the guest instance to modify these values. Further, the computation of percentage of resources may be spread among all of the physical resources (in which case the percentages may exceed 100%), or a virtual resource may be specified in terms of physical resources (in which case the percentage cannot exceed 100%). See the specific discussions below for these details.
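A small worked example may help. The sketch below (with hypothetical function and parameter names) applies the 50%-of-two-processors case from the text to show how a capped entitlement bounds a guest even when capacity is idle, while an uncapped entitlement only guarantees a floor.

# Illustrative sketch only: how a capped versus uncapped entitlement bounds a
# guest's processor consumption, using the 50%-of-two-processors example from
# the text. The function and parameter names are hypothetical.

def cpu_bounds(physical_cpus, entitlement_pct, capped, demand_cpus):
    """Return (guaranteed, actually_usable) processor capacity for a guest.

    entitlement_pct is a percentage of the total physical processor time,
    so 50% of two physical processors equals one full processor's worth.
    """
    guaranteed = physical_cpus * entitlement_pct / 100.0
    if capped:
        usable = min(demand_cpus, guaranteed)       # never more than the cap
    else:
        usable = min(demand_cpus, physical_cpus)    # may borrow idle capacity
    return guaranteed, usable

# A guest entitled to 50% of a two-processor server, demanding 1.5 CPUs' worth:
print(cpu_bounds(physical_cpus=2, entitlement_pct=50, capped=True,  demand_cpus=1.5))
# -> (1.0, 1.0): capped at its one-processor guarantee even though capacity is idle
print(cpu_bounds(physical_cpus=2, entitlement_pct=50, capped=False, demand_cpus=1.5))
# -> (1.0, 1.5): still guaranteed one processor's worth, but can use idle capacity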
I/O Models There are two different methods for I/O in micro partitioning available today: I/O done by the virtual machine monitor, and I/O done by a dedicated partition.
[Figure 13: Micro Partitioning I/O Types — in the virtual machine model the virtualization intermediary holds the native I/O, CPU and memory interfaces and the guests' device drivers talk to its virtual I/O layer; in the dedicated partition model the intermediary virtualizes only processors and memory, while an I/O manager partition holds the back-end and physical device drivers and the guests use front-end drivers.]
In the first case, the host instance isolates the server hardware from the guest instances completely, such that the guest instances are unable to access the server hardware except through the host instance. All device drivers in the guest instances pass their requests to the virtual I/O drivers in the host instance, which then controls the hardware directly. This has the advantage that the guest operating systems can run unchanged, without even being aware that they are not running on native hardware. This helps in server consolidation of older operating systems which cannot be modified because they are no longer being actively developed, or are out of support. In the second case, a dedicated partition (which IBM calls the “Virtual I/O Server”, which came from the mainframe technology of the same name, and which Linux calls Domain Zero (Dom0)) handles all of the I/O, without having to pass through the host instance. Each of the guest instances performs all I/O by passing its request as a message to one of the other guest instances, which then performs the I/O directly and passes the results back to the original guest instance as a message. This guest instance requires dedicated physical resources such as memory and processors, since it is responsible for handling the I/O interrupts for the entire environment. The advantage of this is that the virtual machine monitor can be extremely small and efficient, since it does not need any drivers for either storage or networking. All of this is handled by the guest instance which is handling the I/O and managing the entire environment. The disadvantage of this is that the guest instance doing the I/O can become overwhelmed with the requests from the other guest instances, but this can be handled like any other performance issue. Further, it takes two complete round-trips through the host operating system to perform a single I/O: once from the guest instance to the Virtual I/O Server, and then once back from the Virtual I/O Server to the guest instance. Finally, the guest instance represents a single point of failure: no I/O can take place while the guest instance is failing and re-starting. This can be avoided by having two virtual I/O server guest instances, but this will consume extra physical resources, leaving less for the application server guest instances. In either case the host instance or the Virtual I/O Server can mask the specific I/O devices from the guest instances by presenting “standard” hardware to those guest instances. For example, a guest instance could believe it is communicating over a standard 100Mbit Ethernet card and standard SCSI-2 card (for which it has the standard drivers), while the virtual machine monitor or guest instance handling the I/O is actually using very high-speed GigaBit Ethernet and FibreChannel cards, which the guest instances do not yet support. This allows much older operating system and application versions to continue to run without modifications or upgrades.
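The sketch below illustrates the request/reply message flow of the dedicated partition model described above: the guest's front-end driver sends a message to the I/O partition, which performs the device access and sends the result back. It is a toy illustration with hypothetical names, not any vendor's driver interface; the two message hops are where the extra latency relative to a direct device access comes from.

# Illustrative sketch only: the request/reply message flow between a guest's
# front-end driver and a dedicated I/O partition's back-end driver, as in the
# second model above. Names are hypothetical; no real driver API is implied.
import queue
import threading

requests, replies = queue.Queue(), queue.Queue()

def guest_read_block(block):
    """Front-end driver in an application guest: ask the I/O partition to read."""
    requests.put({"op": "read", "block": block})   # hop 1: guest -> I/O partition
    return replies.get()                           # hop 2: I/O partition -> guest

def io_partition_service_one():
    """Back-end driver in the dedicated I/O partition (VIOS / dom0 style)."""
    req = requests.get()
    # Only this partition touches the physical adapter; here we fake the read.
    data = b"\x00" * 512 if req["op"] == "read" else None
    replies.put({"block": req["block"], "data": data})

# One I/O as seen by the guest: one request message out, one reply message back.
threading.Thread(target=io_partition_service_one).start()
print(guest_read_block(block=42)["block"])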
One corollary of this device virtualization is that many of the standard tools which operating systems offer for higher performance or reliability are not supported or recommended when running those operating systems as guest instances. For example, HP-UX does not support Auto Port Aggregation for networking or SecurePath for storage when run as a guest instance. And this makes sense, because all of these kinds of products believe that they are working directly with specific hardware channels, which is not true when running as a guest instance. Most of the micro partitioning products have the ability to create a private network among the guest instances, allowing very high speed communications with standard networking protocols between those guest instances through the host instance without passing through the Ethernet cards. These are known as “virtual switches”. The following shows the virtualization capabilities of the major micro partitioning products on the market today (the “p” or “f” after each guest operating system indicates whether that guest is capable of running para-virtualized or only fully virtualized):
Citrix XenServer — Hardware: AMD Opteron, Intel Pentium; Host model: Thin (Linux); I/O done by: Dedicated partition; Guest O/S’s: Linux (p), Windows (f); Private network: Yes (internal)
EMC VMware ESX — Hardware: AMD Opteron, Intel Pentium; Host model: Thin (dedicated virtualization intermediary); I/O done by: Virtualization intermediary; Guest O/S’s: Linux (p), NetWare (f), Solaris (p), Windows (f); Private network: Yes (host only network)
EMC VMware Server (GSX) — Hardware: AMD Opteron, Intel Pentium; Host model: Fat (Linux or Windows); I/O done by: Virtualization intermediary; Guest O/S’s: MS-DOS (f), Linux (f), Solaris (f), Windows (f); Private network: Yes (host only network)
HP Virtual Machine — Hardware: Intel Itanium; Host model: Fat (HP-UX); I/O done by: Virtualization intermediary; Guest O/S’s: HP-UX (p), Linux (p), Windows (f); Private network: Yes
IBM Micropartition (SPLPARs) — Hardware: Power5 (i5, p5, p6); Host model: Thin (hypervisor); I/O done by: Virtualization intermediary; Guest O/S’s: AIX (p), i5/OS (p), Linux (p); Private network: Yes (Virtual Ethernet)
IBM Micropartition (SPLPARs) with VIOS — Hardware: Power5 (i5, p5, p6); Host model: Thin (hypervisor); I/O done by: Dedicated partition; Guest O/S’s: AIX (p), i5/OS (p), Linux (p); Private network: Yes (Virtual Ethernet)
Linux (Xen) Red Hat, SuSE — Hardware: AMD Opteron, Intel Pentium/Itanium; Host model: Thin (Linux); I/O done by: Dedicated partition; Guest O/S’s: Linux (p), Windows (f); Private network: Yes
Microsoft Windows Hyper-V — Hardware: AMD Opteron, Intel Pentium; Host model: Thin (hypervisor); I/O done by: Dedicated partition; Guest O/S’s: Linux (p), Windows (p); Private network: Yes
Oracle VM (Xen) — Hardware: AMD Opteron, Intel Pentium; Host model: Thin (Linux); I/O done by: Dedicated partition; Guest O/S’s: Linux (p), Windows (f); Private network: No
Parallels Server — Hardware: Intel Pentium Macintosh server; Host model: Thick (Mac OS X); I/O done by: Virtualization intermediary; Guest O/S’s: Linux, Mac OS X, Windows; Private network: No
Sun Logical Domains — Hardware: UltraSPARC T1, T2; Host model: Thin (firmware); I/O done by: Dedicated partition; Guest O/S’s: Linux (p), Solaris 10 (p); Private network: Yes
Sun xVM — Hardware: UltraSPARC, Intel Pentium; Host model: Thin (hypervisor); I/O done by: Dedicated partition; Guest O/S’s: BSD (p), Linux (p), Solaris (p), Windows (f); Private network: No
Figure 14a Micro Partitioning Products
[Figure 14b: Micro Partitioning Products — for each of the products above, the limits on guests per host, virtual cores per guest, and gigabytes of memory per guest, and whether the product can over-commit resources, change resources dynamically, and migrate guests dynamically (for example XenMotion or vMotion).]
EMC VMware VMware has been doing virtualization longer than any of the other vendors in this market. VMware was a stand-alone company but was purchased by EMC at the end of 2003. VMware offers both thin (VMware ESX, which VMware refers to as the “non-hosted” model) and fat (VMware Server, which was previously known as GSX, which VMware refers to as the “hosted” model) virtual machine monitors on AMD Opteron and Intel Pentium Xeon systems. VMware also offers a version of this software called VMware Workstation, which I will not cover here because it is focused on the desktop market, and I am focusing on the server market. However, this is a very solid product that is used by many notebook and laptop PC users to allow the use of Linux and Windows at the same time without rebooting. VMware ESX runs as a dedicated virtual machine monitor. The “VMkernel” is a micro-kernel that is loaded into a Linux kernel as a driver. No applications or any other code can be run in the VMkernel. VMware ESX supports FreeBSD, many distributions of Linux, Novell NetWare and Microsoft Windows from NT to Windows 2003 Enterprise Edition in 32-bit, and 64-bit versions of Windows 2003, Red Hat and SuSE Linux, and Solaris 10 (x86/x64). Linux and Solaris can be para-virtualized, but this requires HVM hardware. VMware ESX 3i is an integrated hypervisor which eliminates the Linux O/S, fits into 32 MBytes, and is booted from an internal USB key. This is shipping from multiple vendors (Dell, HP and IBM) as an “embedded” hypervisor, providing complete integration with the hardware, and looking much more like IBM’s Advanced Power Virtualization (APV) model, where it is not possible to run an O/S instance on the bare iron, but only through the hypervisor. The system will power up directly into ESX 3i, and all operating systems will boot as guest instances. Note that at the same time as ESX 3i, VMware is moving away from SNMP and moving toward the Common Information Model (CIM) standard. This means that the standard console access will not work, but the vendors are shipping CIM providers to implement management. VMware Server (aka, GSX) installs in and runs as a service on a standard instance of Windows 2003 Web, Standard and Enterprise editions, Windows 2000 Advanced Server, and many distributions of Linux.
VMware GSX supports DOS 6.2, FreeBSD, many distributions of Linux, Novell NetWare, and Windows all the way from Windows 3.11 to Windows 2003 Enterprise Edition. Note that VMware Server is being offered as a free download by EMC, while support for it is offered as a fee-based service. EMC offers the standard ability to either install an O/S instance directly from the distribution media, or to convert a running (physical) instance into a guest (virtual) instance, using P2V (physical to (2) virtual) tools. EMC previously offered VMware P2V Assistant, but with VMware V3 it now offers VMware Converter 3, which is available in both Starter (a free download) and Enterprise (licensed with a support contract with VMware Virtual Center) editions. VMware is so popular that many server vendors also supply these kinds of tools, such as HP’s ProLiant Server Migration Pack. Note that with VMware Virtual Machine Importer, VMware ESX can read the container files created by Microsoft Virtual Server (though Microsoft has indicated that this is a violation of the Microsoft license agreement). Further, ESX 3.0 has implemented VMFS (Virtual Machine File System), a clustered file system for multiple host instances to share container files. VMware emulates a generic Intel processor-set and standard I/O cards, and performs all I/O in the virtual machine monitor, providing support for the widest variety of operating systems and applications. This is most useful for older operating system guest instances which do not offer support for 1000BaseT or 4GBit FibreChannel cards. VMware has the ability to move a powered-off guest instance (migration) or a running guest instance from one server to another (migration with VMotion). VMware can take a snapshot of a running guest instance, which can then be used to restore a guest instance to that exact state. Note that VMware recognizes that the external environment (databases on disks which were not part of the snapshot, file downloads and network connections to other systems, etc) can confuse both the restored guest instance and the external systems. VMware can create a virtual network switch to provide a private network among the guest instances, which VMware calls a “host only network”. VMware also offers a file system optimized for the guests, called VMFS. VMware manages the guest instances with Virtual Infrastructure. This offers a dashboard to start, stop and monitor all of the instances across multiple host systems, VMotion to manually migrate guest instances between host instances, virtual server provisioning to rapidly deploy new images, and performance management to show thresholding and utilization. The host instance offers fair share scheduling among all of the instances, and has tools to dedicate a percentage of resources (processor, memory, I/O) to a specific guest instance. VMware Virtual Infrastructure also offers the Distributed Resource Scheduler (DRS), which monitors and balances computing capacity across the resource pools (multiple host instances) which can accept the VMotion migration of guest instances which are experiencing performance peaks. Note that bare ESX 3i cannot be managed by Virtual Infrastructure without extra licenses (VI Standard).
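To illustrate the kind of decision a tool like DRS automates, the sketch below picks a migration source and target from host utilization figures. This is a naive illustration with hypothetical host names and thresholds, not VMware's actual algorithm.

# Illustrative sketch only: a naive load-balancing placement decision of the
# kind a tool like DRS automates. This is not VMware's actual algorithm; the
# data structures and threshold are hypothetical.

hosts = {                 # current CPU utilization of each host instance (%)
    "host-a": 92,
    "host-b": 55,
    "host-c": 38,
}
guest_load = 15           # estimated CPU demand (%) of the guest to move
THRESHOLD = 80            # rebalance when a host exceeds this utilization

def pick_migration(hosts, guest_load, threshold):
    """Suggest (source, target) for one migration, or None if already balanced."""
    source = max(hosts, key=hosts.get)
    if hosts[source] <= threshold:
        return None                               # nothing is overloaded
    # Prefer the least-loaded host that can absorb the guest without
    # itself crossing the threshold.
    candidates = [h for h in hosts
                  if h != source and hosts[h] + guest_load <= threshold]
    if not candidates:
        return None                               # no safe destination
    return source, min(candidates, key=hosts.get)

print(pick_migration(hosts, guest_load, THRESHOLD))   # ('host-a', 'host-c')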
Guest instances can be allocated 1/8th of a processor, and are tuned by specifying “resource shares” of the processors, which are specified in terms of percentages (proportions) of the available processor resources. These shares are computed across the total number of physical processors in the system, and are used to arbitrate between guest instances when processor contention occurs, as a way of preventing specific guest instances from monopolizing processor resources. Memory is specified in terms of megabytes. Each guest instance is given a minimum and maximum share, and if there are not enough processor or memory resources, the guest instance will not boot. The minimum and maximum values for processor resources are dynamic and can be changed by the Virtual Infrastructure Manager; however, changing the memory allocation requires rebooting the guest instance. Modifying the virtual resources of the guest (virtual processors, memory and network and storage paths) requires a reboot of the guest instance. Note also the licensing of the guest operating system instances: with VMware, Windows requires a license for each guest instance. HP HP Virtual Machine (HPVM) is available on all Integrity servers. The host instance is a standard version of HP-UX 11i V2 or V3 running as a fat host. It is a core part of the Virtual Server Environment (VSE), and is managed by the HP Virtualization Manager, which is WBEM-based and plugs into the HP Systems Management Home Page and VSE Management Tools. It supports HP-UX 11i V2 and V3, Linux and Windows, and will support OpenVMS in a future release (see below). Note that all of the guest operating systems must be the native HP Integrity versions, and there are restrictions around the respective versions of HP-UX hosts and guests. All of the operating systems are para-virtualized, in that HPVM uses ring compression to force them to run in ring 1 instead of ring 0. Guest instances can be allocated 1/20th of a processor, and are tuned by specifying an “entitlement” in increments of 1/100th of a processor. The percentages are specified in terms of single processors, and apply to each virtual processor. For example, if there are four physical processors, and a guest instance is specified with two virtual processors, an entitlement of 50% means that each virtual processor can consume up to ½ of a physical processor. Increasing or decreasing the number of physical processors would not affect the processing time available to this particular guest instance, but would affect how many entitlements are available to other guest instances. Memory is specified in terms of megabytes. Each guest instance is given a minimum and maximum entitlement, and if there are not enough processor or memory resources, the guest instance will not boot. The minimum, maximum and capped values for processor entitlements are dynamic and can be changed by the Virtualization Manager. Modifying the virtual resources of the guest (virtual processors, memory and network and storage paths) requires a reboot of the guest instance. HPVM performs all of the I/O for all of the guest instances. There is no option for specifying a dedicated partition for I/O. HPVM virtualizes all of the I/O ports, offering an Intel GigaBit Ethernet, SCSI-2 and serial port. All of the devices supported in HP-UX 11i V2 are supported by the host instance. SAN and other device management can only be done from the host instance.
HPVM V2.0 offers the ability to directly map a physical I/O device to a guest instance, called “attached I/O”, to allow access to media such as DVD writers or tape silos. HPVM has the standard feature of creating a guest instance from a standalone instance, which can be the same bootable disk as the standalone instance plus a small information file. Tools such as “p2vassist” are used to select the source system, which applications should be installed on the target system, and the file systems to be used by the target system. HPVM has the ability to move a powered-off guest instance (migration) but not a running guest instance from one server to another. HPVM can take a snapshot of a powered-off guest instance, but not a running guest instance, except through storage-based snapshot tools. HPVM can create a private network among the guest instances, for what HP calls “inter-VM communication”. HP offers flexible licensing of the host instances. You may license all of the physical processor cores, which then explicitly authorizes all of the guest instances. You may also license only the virtual processor cores which are allocated to the specific guest instance. For example, if one of your guest instances was running HP-UX 11i V3 High Availability (or 11i V2 Mission Critical) Operating Environment and another of your guest instances was running HP-UX Foundation Operating Environment, you could license all of the processors with HAOE/MCOE (to allow any number of guest instances to run either MCOE/HAOE or FOE), or you could license some virtual processors with HAOE/MCOE and other virtual processors with FOE. You should analyze the most cost-efficient method of licensing for your environment. Note that HPVM V2.0 does not support VT-i, but uses ring compression instead, even on processors which have VT-i technology. Because OpenVMS uses all four ring levels, de-privileging and ring compression are more difficult. A future version of HPVM with a future Itanium processor will support VT-i, such that ring compression will not be required, at which time it will be possible to have OpenVMS as a guest instance. IBM IBM micro partitioning is available on all Power5 systems, including eServer iSeries and pSeries, running AIX, i5/OS (previously known as OS/400), Linux, and Microsoft Windows (which requires specialized IXS/IXA hardware). AIX, i5/OS and Linux are para-virtualized, while Windows is not. The Power Hypervisor firmware controls the micro partitions in the same way it controls the DLPARs mentioned earlier, and is controlled by the same Integrated Virtualization Manager (IVM) console, or by the Hardware Management Console (HMC) referenced earlier. Note that some IBM documentation refers to the hypervisor, others refer to SPLPARs (Shared Processor Logical Partitions) and still others refer to SDLPARs (Shared Dynamic Logical Partitions): be aware that all of these are referring to the same product. I will refer to them as SPLPARs. Guest instances can be allocated 1/10th of a processor, and are tuned by specifying an “entitlement” in increments of 1/100th of a processor. The entitlements are in terms of percentages of the physical resources, such that if there are four physical processors, you can specify up to 400% entitlement for a specific guest instance. Each guest instance is given a minimum and maximum entitlement, and if there are fewer physical resources than the minimum entitlement, the guest instance will not boot.
If the guest instance’s maximum entitlement is defined as “capped”, it cannot use more than this even if the resources are available. If the guest instance’s maximum entitlement is not defined as “capped”, it can use all available resources, which are called the “excess”. The minimum, maximum and capped values are dynamic and can be changed by the DLPARs manager for active partitions, without requiring a reboot of the guest instance. Modifying the virtual resources of the guest (virtual processors, memory and network and storage paths) requires a reboot of the guest instance. The hypervisor performs all of the I/O for all LPARs and DLPARs, and can be set to do this for SPLPAR guest instances. In this case, each guest instance has dedicated I/O cards which cannot be shared by the other partitions. However, SPLPARs can use the optional “Virtual I/O Server” (VIOS) software, such that I/O cards can be shared between guest instances, and one of the guest instances is dedicated to perform the I/O for all of the other instances. The VIOS instance uses special APIs which communicate directly through the Power5/5+/6 inter-chip fabric, which is significantly faster than any other communications path, including DMA. Note that IBM strongly recommends that the VIOS get a dedicated processor and memory, which cannot be used by any of the other guest instances. If a second VIOS is created to avoid a single point of failure, IBM recommends that it also get a dedicated processor and memory, which cannot be shared with either the application guest instances or the original VIOS instance. In versions of AIX prior to 5.3 this was a requirement, but IBM backed off on this requirement and now makes it just a very strong recommendation, primarily so that two-processor systems would have processors available to do real work. Note that you can have multiple VIOSes, which can be dedicated to a specific type of I/O (LAN, SCSI, FibreChannel, etc), and that these virtual I/O servers can fail over to each other. The VIOS can be either a DLPAR or an SPLPAR with AIX 5.3 and beyond, but it must be a DLPAR with AIX 5.2 or earlier. Another method of achieving higher availability is to have the hypervisor use MPIO to load balance across the I/O channels and provide failover. IBM is unique in that LPARs, DLPARs and SPLPARs can be run side by side on the same server. The SPLPARs may or may not connect to the virtual I/O server(s), as described above. IBM micro partitioning has the standard feature of creating a guest instance from a standalone instance, which must be a logical volume. IBM has the ability to move a powered-off guest instance (migration), and has recently added Live Partition Migration (LPM) on Power6 for AIX 5.3 and above and specific versions of Linux, which can move a running guest instance from one server to another. Note that LPM requires the use of VIOS for both the source and targets, and the source and target environments must use the same virtualization interface (either HMC or IVM). IBM has not yet qualified LPM with HACMP. IBM can take a snapshot of a powered-off guest instance, but not a running guest instance, except through storage-based snapshot tools. IBM micro partitioning can create a private network among the guest instances, which IBM calls a “virtual Ethernet”. IBM offers flexible licensing of the host instances. You may license all of the physical processors, which then authorizes all of the guest instances.
You may also license the entitled capacity of particular guest instances, including fractional processors. Note that IBM licenses all of its software by Processor Value Unit (PVU), not by core or by processor. But IBM requires a license for all of the processors which are able to run a particular piece of software, and the documentation specifically calls out the case where an LPAR with two SPLPARs is sharing 4 processors. One of the SPLPARs is running an application while the other SPLPAR is running DB2, and the example states that because all 4 processors can be used by the second SPLPAR, you need to license DB2 for all 4 processors, using the PVU factors. Microsoft Microsoft Windows Virtual Server 2005 is part of Microsoft’s “Dynamic Systems Initiative”, and is a set of services which run on top of the standard Windows 2003 operating system. It comes in either Standard Edition (up to 4 physical processors for the host instance) or Enterprise Edition (32 physical processors for the host instance), but the functionality is otherwise the same. The host instance can run its own applications, or just act as the virtual machine monitor for the server, whichever is preferred by the administrator. The host instance can be clustered with other host instances on other physical servers with MSCS. The guest instances are fully virtualized. Microsoft is shipping Virtual Server as a free download with Windows Server 2003, and it will be available with the next release of Windows Server, known as “Longhorn”. Virtual Server supports Windows guest instances, and with the supplemental Virtual Machine Additions For Linux software, will also support a wide variety of Linux guest instances. See the Xen section for more information on support of Linux guest instances. The System Center Virtual Machine Manager will manage all of the host and guest instances. Virtual Server has the standard feature of creating a guest instance from a standalone instance. Virtual Server has the ability to move a powered-off guest instance (migration), and the ability to move a running guest instance from one server to another, through the “Virtual Hard Disks” (i.e., container files). Virtual Server can take a snapshot of a powered-off guest instance, but not a running guest instance, except through storage-based snapshot tools. Note that the guest instances must run under a specific Windows account, but Microsoft supplies a default account if none is specified. Guest instances can be allocated processor resources either by “weight” (a number between 1 and 10,000, where the guest instances with higher numbers receive preference for resources) or by capacity, where each guest instance is given a minimum and maximum percentage of processor resources. In order to guarantee that a guest instance will receive the maximum processor resources, specify 100 as the “reserve capacity”, which will allocate an entire physical processor to that guest instance. If there are not enough processor or memory resources, the guest instance will not boot. The minimum, maximum and reserved values for processor entitlements are dynamic and can be changed by the Virtual Machine Manager; however, changing the processor entitlements and memory allocation requires rebooting the guest instance. Modifying the virtual resources of the guest (virtual processors, memory and network and storage paths) requires a reboot of the guest instance. Virtual Server performs all I/O in the virtual machine monitor.
Virtual Server can create a private network which allows communication between the guest instances, from the guest instance to the virtual machine monitor hosting the guest instance, and to operating system instances running outside of the virtual machine monitor, which Microsoft calls a “virtual network”. Microsoft Hyper-V is the next generation of Virtual Server, and implements para-virtualization, but has not yet been released, so I will only summarize it briefly here. Hyper-V is a thin hypervisor running in ring -1 (it requires HVM), on AMD Opteron and Intel Pentium Xeon only, with a dedicated partition for I/O and management and a special “VMbus” for I/O (similar in concept to IBM’s approach). Linux and Windows guests can be para-virtualized; para-virtualized Linux uses the VMbus, while fully virtualized operating systems do not. Standard Edition licenses 1 guest, Enterprise Edition 4 guests, and DataCenter Edition unlimited guests. Itanium is not supported by Hyper-V, but the Itanium Edition of Windows 2008 licenses unlimited guests with 3rd party hypervisors. Linux guests require downloading the Integration Services (basically drivers). Hyper-V supports a private network, and Quick Migration, which is based on Windows Failover Clustering; it does not support Live Migration (a consideration for a future release), but a planned failover saves and restores state and takes seconds to minutes. Do not put multiple container files on the same LUN, as the LUN can only be mounted on one hypervisor at a time, so if one guest moves, all guests with their containers on that LUN move with it. Hyper-V is bundled as part of the price of each of the Windows 2008 editions, and each edition is available without Hyper-V for a very slight price reduction ($999/system for Standard Edition with Hyper-V, $971/system without). Management tools include the Hyper-V Manager MMC snap-in (which manages groups of Hyper-V hosts and guests one server at a time, and does not include copy/clone/P2V/failover wizards), System Center Virtual Machine Manager 2008 (part of the System Center Enterprise Suite and preferred for the enterprise, which includes all of the functionality, can do self-provisioning, will support VMware VI3 if Virtual Center Server is running, and can initiate a vMotion), ProLiant Essentials Virtual Machine Manager, and HP Insight Dynamics VSE. Guests can have up to 4 virtual CPUs depending on the guest operating system (Windows 2008 up to 4, Windows Vista up to 2, Windows 2003 up to 1; the limit for Linux is unclear), and can run capped or uncapped. Parallels Server for Macintosh Parallels Server for Macintosh is a package running on top of Mac OS X as a thick (fat host) hypervisor, on Apple servers with Intel Pentium Xeon processors that have HVM hardware (aka, VT-x). It supports Mac OS X, many modern Linux distributions, and Windows. Parallels Desktop is a companion product for the workstation world, which supports the same set of guest operating systems. Parallels Server has the standard features of creating a guest instance from a standalone instance and from another virtual guest (P2V and V2V). Parallels Server has the ability to move a powered-off guest instance (migration), but does not offer live migration. Parallels Server cannot create a private network among the guest instances. Management is mostly through the command line, though Parallels does offer an integrated console for GUI access, and an open API to allow user-written extensions. There is a large library of these extensions available and included in the package. The integrated Parallels Explorer allows access to the virtual disk files when the guest operating system is not running, for ease of updating inactive guests. Image Cloning (i.e., off-line backup) is also available.
Sun Microsystems Sun offers multiple products, depending on the underlying processor: Logical Domains on the T1 and T2 processors, and xVM on the UltraSPARC and Intel Pentium Xeon processors. They are otherwise very similar products. Sun is explicit that Logical Domains (LDoms) is a soft partitioning product, and not a micro partitioning product. However, I would disagree with this positioning, as the LDom hypervisor is implemented in firmware, and provides complete O/S instance isolation and sharing of processors and I/O cards among all of the guest O/S instances. LDoms are only available on the Sun Fire T1000 and T2000 servers running the T1 and T2 processors. LDoms support Solaris 10 as well as Linux, but note that neither Red Hat nor SUSE support the T1/T2 processor, so the Linux support is primarily from Ubuntu. All of the supported operating systems are para-virtualized. The LDom environment consists of, logically enough, “domains”. LDoms are managed through the Logical Domain Manager, which runs in the “Control Domain”. All I/O is done by the “I/O Domain”, and there can be up to 2 I/O domains for redundancy. One of the I/O domains must be the control domain. All guest O/S instances run in “Guest Domains”. Guest domains can communicate with the I/O domains through a Logical Domain Channel, which provides full access to all shared I/O devices. Guest domains can also directly map an I/O device, such that the device is owned by that guest domain and not shareable. Virtual network switches can be implemented for communication between guest domains, which can act as private networks. LDom documentation refers to “virtual CPUs”, but they are actually hardware threads running in the T1 and T2 processors. Since the T1 processor contains 8 cores and each core runs 4 threads (the T2 runs 8 threads per core), you can schedule up to 32 virtual CPUs with only a single physical T1 processor. Guest domains can be created from new installations or by using the P2V tools which are bundled with LDoms. Virtual CPUs can be moved dynamically between guest domains, but memory and I/O changes require rebooting the guest domain. LDoms do not support live migration. Performance management is handled using the standard Solaris tools of pools, PRM and psets. The LDoms software is currently a free download, but Sun will charge for support. Sun also offers xVM on the UltraSPARC and Intel Pentium Xeon processors. It is based on Xen. The xVM environment consists of “domains” which can run BSD, Linux, Solaris or Windows. All of the operating systems are para-virtualized. The “Control Domain” must run Solaris. All I/O is done by the “I/O Domain”, and there can be up to 2 I/O domains for redundancy. One of the I/O domains must be the control domain. All guest O/S instances run in “Guest Domains”. Guest domains can communicate with the I/O domains through a Logical Domain Channel, which provides full access to all shared I/O devices. Guest domains can also directly map an I/O device, such that the device is owned by that guest domain and not shareable. Virtual network switches can be implemented for communication between guest domains, which can act as private networks. Guest domains can be created from new installations or by using the P2V tools which are bundled with xVM. Virtual CPUs can be moved dynamically between guest domains, but memory and I/O changes require rebooting the guest domain. xVM does not support live migration. Performance management is handled using the standard Solaris tools of pools, PRM and psets.
Virtual Iron

Virtual Iron “Native Virtualization” was originally derived from the open source Xen virtual machine monitor, but has since diverged from it. Native Virtualization supports AMD Opteron and Intel Xeon processors with HVM. Native Virtualization supports 32-bit and 64-bit Linux and Windows as guest operating systems within a resource group called the “Virtual DataCenter” (VDC). Virtual Iron uses a dedicated instance, called the Virtual Services Partition, to perform all I/O.

Virtual Iron has the ability to move a powered-off guest instance (migration) and the ability to move a running guest instance from one server to another (LiveMigrate). Virtual Iron can run on any storage, but the use of LiveMigrate requires that the container file(s) for all of the Virtual Volumes for the guest instance be on storage which is shared between the two servers, and that the two servers are located on the same network sub-net. This allows the movement of the guest instance to occur in a few seconds, while the operations of the guest instance are only suspended for 100 milliseconds and all network and storage connections are fully preserved during the migration. One of the capabilities that Virtual Iron has added is the ability to use LiveMigrate to proactively load balance and optimize the performance of the guest instances across servers, called LiveCapacity. This continuously samples the performance of the guest instances and the host instances across the Virtual DataCenter, and migrates guest instances according to the policies specified by the system administrators.

Virtual Iron can take a snapshot of a powered-off guest instance, but not a running guest instance, except through storage based snapshot tools.

Virtual Iron creates virtual network switches in the host instance for the use of the guest instances. These can either be directly connected to physical networks or can be used to provide a private network among the guest instances, which Virtual Iron calls a “private internal virtual switch”.

Virtual Iron manages the guest instances with Virtualization Manager. This offers a dashboard to start, stop and monitor all of the instances across multiple host systems, LiveMigrate to manually migrate guest instances between host instances, virtual server provisioning to create virtual volumes and rapidly deploy new images, and performance management to show thresholding and utilization and implement auto-recovery (LiveRecovery).

The host instance offers weighted credit scheduling among all of the instances, which takes values from 1 to 4096 as weights for each domain (higher weighted domains get higher priority to resources). The Virtual Services Partition gets the highest weight by default, and this cannot be over-ridden. Note that there is no way to dedicate a percentage of the resources to a guest instance, except by the relative weights of the instances. If there are not enough memory resources, the guest instance will not boot. You cannot over-allocate memory, but you can over-allocate processors. The weighted credit values for processor resources are dynamic and can be changed by the Virtualization Manager. Modifying the virtual resources of the guest (virtual processors, memory and network and storage paths) requires a reboot of the guest instance.

Xen

Xen is a popular virtual machine monitor for Linux on x86 systems, from notebook PCs to large SMP servers.
Xen is an open source project founded and maintained by XenSource, and uses the para-virtualization model. Prior to Xen 3.0 and the virtualization enhancements available in the AMD and Intel processors, each of the guest instances (called “domains”) needed to be modified with a set of packages to implement para-virtualization. Xen 3.0 has led the industry in para-virtualization, and it also takes advantage of HVM in AMD Opteron, Intel Itanium and Pentium Xeon, and Sun UltraSPARC T1 processors. Xen uses this technology to allow “pass-through” execution of guest operating systems which do not support para-virtualization, but only on processors with HVM. So the operating systems which have been modified to support para-virtualization will continue to make calls to the host operating system, while operating systems which have not been modified to support para-virtualization (such as Microsoft Windows and Sun Solaris) will run as a guest operating system but execute in ring 0 normally, and the host operating system will not interfere with the direct connection to the underlying hardware. In either case the overhead of the host operating system will be dramatically less than the normal case of guest operating systems, usually on the order of 3-5%. Note that the other products are rapidly catching up to Xen in terms of para-virtualization efficiency.

Xen uses the virtual I/O server model, and all I/O is done in “domain 0” or “dom0”, which is a dedicated guest instance (all standard guest instances are referred to as “domU”). One of the ways that Xen keeps the host instance small is by performing many of the more complex functions in dom0, which is running a full version of Linux. Therefore, all of the management of the Xen virtual machine monitor and the guest operating system instances is also handled in dom0. Xen has a GUI-based management dashboard.

Xen has the standard feature of creating a guest instance from a standalone instance (P2V) into a container file. The Citrix implementation has the ability to move a running guest instance from one server to another on AMD Opteron and Intel Pentium Xeon, while the Red Hat and SUSE implementations can also do live migration on Intel Itanium. Like Virtual Iron, Xen shares processor resources among the domains with a weighted credit scheduler, where higher weighted domains get higher priority to resources.

Citrix has purchased XenSource, and is implementing Xen on AMD Opteron and Intel Pentium Xeon with Windows and Linux guests. They have not chosen to implement Citrix Xen on any other platform. Citrix Xen supports LiveMigration in the higher editions.
Citrix Xen requires virtualization support in the processor (AMD-V and Intel VT-x) and para-virtualization drivers in the host for para-virtualized Windows guests, but does not require these for Linux guests. Citrix has commercially packaged this technology into four products:

XenServer Express Edition, which supports up to 4 guest instances per host, and up to two processor sockets and 4 GBytes in the host server. Note that this does not support XenMotion or most of the management tools.

XenServer Standard Edition, which supports an unlimited number of guest instances per host (subject to memory limitations on the host), and up to two processor sockets and 128 GBytes in the host server. This edition adds some management tools, but does not support XenMotion.

XenServer Enterprise Edition, which supports an unlimited number of guest instances per host, and unlimited processor sockets and memory in the host server. This edition adds some additional management tools and XenMotion.

XenServer Platinum Edition, which supports an unlimited number of guest instances per host, and unlimited processor sockets and memory in the host server. This edition adds dynamic provisioning of guest instances.

Note that the P2V tools for XenEnterprise are available directly from Citrix, but the P2V tools for Windows are available from Leostream and Platespin. XenEnterprise does not yet have the ability to move a running guest instance from one server to another on any processor architecture. XenEnterprise can take a snapshot of a powered-off guest instance, but not a running guest instance, except through storage based snapshot tools.

All of the above discussion describes the technology which is available from Citrix. But because Xen is an open source product, many other companies have incorporated Xen into their products. Novell SUSE Linux Enterprise Edition 10, Red Hat Enterprise Linux 5 and Red Hat Enterprise Linux Advanced Server 5 are shipping Xen implementations with both para-virtualization and full virtualization. Both SUSE and Red Hat support live migration on AMD Opteron, Intel Pentium Xeon and Intel Itanium. Note that this version of Xen supports Linux with para-virtualization, but does not yet support Windows with para-virtualization. Oracle has implemented Oracle VM using Xen. It is only supported on AMD Opteron and Intel Pentium Xeon.

Clustering of Micro Partitions

• Some vendors support clustering of host instances to provide failover of guest instances
− VMware ESX 3.0 HA (aka Distributed Availability Services)
− HP-UX Virtual Machine with Serviceguard
− IBM Hypervisor does not support clustering of host instances
− Virtual Iron supports clustering of host instances
− Windows Virtual Server 2005 runs as a service with Microsoft Cluster Server (MSCS)
• Some vendors support clustering of guest instances
− AIX supports HACMP in LPARs and DLPARs, but not SPLPARs, with V5.2+
− HP-UX and Linux Serviceguard are supported in guest instances with HPVM V2.0
− Oracle RAC is supported in a guest instance of an IBM SPLPAR and a VMware ESX guest instance, but is not supported in a guest instance with any other micro partitioning product from any vendor
− Windows 2003 supports MSCS of guest instances under VMware ESX. Microsoft Cluster Server (MSCS) is not supported in a Windows guest of an HPVM, because of non-implemented SCSI features in the virtual SCSI driver, so there is no way to use it to fail an application over.
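Before moving on to resource partitioning, it is worth making the weighted credit scheduling used by Virtual Iron and Xen concrete. The following is only a rough sketch of the proportional effect of the weights; it ignores the actual credit accounting, caps and idle domains, and the domain names are invented for illustration:

    # Minimal sketch of proportional, weight-based CPU sharing as described for
    # the Virtual Iron / Xen credit scheduler. Weights range from 1 to 4096, and
    # higher weighted domains receive proportionally more CPU time.
    def cpu_share(weights: dict[str, int], total_cpu_pct: float = 100.0) -> dict[str, float]:
        total_weight = sum(weights.values())
        return {dom: total_cpu_pct * w / total_weight for dom, w in weights.items()}

    # Example: the I/O and management domain carries the highest weight.
    weights = {"dom0": 4096, "guest-a": 512, "guest-b": 256}
    print(cpu_share(weights))   # dom0 ~84.2%, guest-a ~10.5%, guest-b ~5.3%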
Operating System Instance Resource Partitioning

All of the above technologies involve running multiple operating system instances at the same time on the same server hardware. This is most common in server consolidation, where you wish to maintain the same user experience, with the same system names, network addresses, etc, as on the individual servers which have been replaced with a single larger server. But licensing and maintaining individual operating system instances can become expensive. To reduce this overhead, many businesses are choosing to consolidate their environments to a few standardized operating systems and versions, but even doing so requires applying the same maintenance to many instances, which can be quite time consuming.

An alternative to this is to “stack” many instances of the applications on a smaller number of operating system instances. This reduces the amount of maintenance which needs to be done, while also reducing the number of licenses for both the operating systems and the applications.

This helps the system administrators, but when applications are stacked the users still have many of the concerns about high availability, resource consumption and contention for resources from the other application instances which were discussed in the micro partitioning section. Because all of the application instances are running on a single instance of the operating system, they are competing for scheduling for execution on the processors, they are competing for the use of physical memory, and they are competing for bandwidth on the I/O cards. When applications are stacked onto a single server, users still want guarantees that they will be able to have their fair share of the resources on the system, and won’t be starved because some other users begin consuming all of the resources on the system.

Just as the dramatically increasing performance and scalability of modern servers can overcome these concerns in the micro partitioning model, they can overcome them in the application stacking model as well. Many operating systems have added the capability to group applications and manage them as a single entity, ensuring that they receive some pre-determined percentage of the system resources. These are called “resource partitions”. They are similar to the other partitioning technologies, except that they work entirely within a single instance of a standard operating system.

There is one additional concern about application stacking which is not present in the other partitioning models: version control. With hard partitioning, soft partitioning and micro partitioning, each partition is running a complete version of the operating system, and can therefore have different operating system versions and patches, and application versions and patches, than any other of the operating system instances. Because there is actually only a single instance of the operating system, this is more difficult in resource partitioning.

Vendors have developed two styles of doing resource partitioning: having the operating system present the same interface to the entire system and simply perform resource management, or having the operating system present virtualized operating environments to each of the resource partitions and then manage each of the virtualized operating environments. The first model isolates the applications for performance, but requires that all of the applications have the same version and patch level.
Some application vendors do allow multiple versions of their application to run in the same operating system, but that is external to resource partitioning and will not be discussed here. The second model can actually run different applications at different version and patch levels, both of the application and of the operating system itself.

But even with all of these restrictions, resource partitioning is increasingly popular, because modern server systems are far more powerful than ever before, so a single system can handle the load which overwhelmed multiple older systems, and it reduces costs dramatically, as discussed above.

In the above discussion, I have used the word “resource” to describe how the operating system allocates processing capabilities to each application group. These “resources” are primarily processors, memory and I/O bandwidth and priority. All major operating systems offer tools to control how the processor scheduling is apportioned between the different application instances. Some of the operating systems additionally offer tools which control how memory is allocated, with a few also offering tools which control access to the I/O cards.

But in some cases you need more than this scheduling; you require security isolation between the different application instances. When, for example, the Human Resources group was on its own server and was managing confidential employee records containing salary information, there was no problem, because access to that server could be tightly controlled. But when HR is simply one of the applications on the same physical server as many other users, there might need to be some additional level of security so a user who has privileges for another of the applications on that server cannot snoop into the HR application. Therefore, some operating systems additionally offer role based access controls (RBAC) to allow privileges to apply to one of the resource partitions but not any of the other resource partitions. Further, some operating systems offer tools to control the communications between processes in different resource partitions, to offer the same high level of security as if those processes were on two physically separate servers.

Many of the traditional tools, especially in the UNIX space, operate on operating system metrics, such as percentage of processor time or megabytes of allocated memory. However, users generally don’t care about such things, and are more focused on metrics which are the basis of their business function, such as having queries completed in a certain number of seconds, response time of the application, or job duration. These metrics are often formalized as Service Level Agreements (SLA’s), and IT departments are often measured on these metrics, frequently to the level of basing financial rewards for the IT department personnel on meeting or exceeding them. Some of the tools for managing resource partitions are capable of adjusting the resource allocation based on these metrics.
Figure 15 Resource Partitions

HP HP-UX Secure Resource Partitions - Manages: processor, memory, I/O; Isolation level: application; Tools for SLA's: PRM, WLM, gWLM; Higher security: yes
HP OpenVMS Class Scheduler - Manages: processor, memory, I/O; Isolation level: application; Tools for SLA's: PRM, WLM, gWLM; Higher security: not needed
IBM AIX Class Scheduler - Manages: processor, memory, I/O; Isolation level: application; Tools for SLA's: WLM; Higher security: no
Linux VServer - Manages: processor, memory; Isolation level: virtual O/S; Tools for SLA's: PRM; Higher security: yes
Microsoft Windows System Resource Manager - Manages: processor, memory; Isolation level: application; Tools for SLA's: policy based; Higher security: no
Parallels Virtuozzo, OpenVZ for Linux and Windows - Manages: processor, memory, I/O; Isolation level: virtual O/S; Tools for SLA's: policy based; Higher security: yes
Sun Solaris Container - Manages: processor, memory, I/O; Isolation level: application; Tools for SLA's: SRM, FSS; Higher security: no
Sun Solaris Zone - Manages: processor, memory, I/O; Isolation level: virtual O/S; Tools for SLA's: SRM, FSS; Higher security: yes

HP

HP-UX isolates the applications for performance management, and requires a single version and patch level. It offers a variety of tools to manage application performance. “Processor Sets” (Psets) allow you to define groups of processors (or, more recently, cores) as a set, which can then be assigned to one or more applications for their exclusive use. “Fair Share Scheduling” (FSS) allows you to specify entitlements for each application group, and the processor time of that Pset (which could be all of the processors in the system, or could be a subset of them) is scheduled based on those entitlements. Both of these can be integrated with the “Process Resource Manager” (PRM), which adds the ability to dedicate memory and I/O bandwidth to the application group. PRM works with operating system metrics, while WorkLoad Manager (WLM) and Global WorkLoad Manager (gWLM) work with application metrics such as response time. WLM works within a single server environment, where gWLM can define specific policies which work across multiple servers and can actually move resources between servers.

Secure Resource Partitions increase the security of the resource partitions by enforcing communications restrictions between partitions, ensuring that a process in one resource partition cannot communicate with a process in another resource partition, except through standard networking paths, exactly as if those two processes were on separate physical servers. HP-UX also offers toolkits for many popular applications, which can manage the different modules of those applications in different resource partitions. This allows, for example, different SAP modules to be consolidated onto a single server while still receiving guaranteed percentages of the entire system resources.

OpenVMS isolates the applications for performance management, and requires a single version and patch level. It offers a variety of tools to manage application performance. “Processor Sets” (Psets) allow you to define groups of processors (or, more recently, cores) as a set, which can then be assigned to one or more applications for their exclusive use. “Class Scheduling” allows you to specify percentages of processor time available for each application group. Both of these can be integrated with the “Process Resource Manager” (PRM), which adds the ability to dedicate memory and I/O bandwidth to the application group. PRM, WorkLoad Manager (WLM) and Global WorkLoad Manager (gWLM) work with application metrics such as response time. Note that WLM and gWLM are the same across both HP-UX and OpenVMS. WLM works within a single server environment, where gWLM can define specific policies which work across multiple servers and can actually move resources between servers.
OpenVMS does not add any specific security capabilities to the groups, as the existing security policies are already capable of controlling and enforcing the communications and other security considerations between the resource partitions.

IBM

AIX WorkLoad Manager (WLM) isolates the applications for performance management, and requires a single version and patch level. It controls the processor, memory and I/O resources available to groups of applications, which are assigned to a “class”. The resources are enforced at the class level, with all of the processes inside the class sharing the resources equally. Resources are allocated either in terms of shares of the total amount of system resources, or hard and soft limits in terms of percentages of the total system resources. Soft limits can be exceeded if the resources are available, but hard limits can never be exceeded even if the resources are otherwise idle. There is no additional security controlling communications between the classes.

Note that prior versions of the Enterprise WorkLoad Manager (EWLM) were simply a load balancer which allocated incoming requests between multiple servers, and did not manage the resources inside or between operating system instances. EWLM V2 has been enhanced to allow it to dynamically move processors between DLPARs, and to modify the resource entitlements of SPLPARs. Note that EWLM can perform these actions no matter which operating system is running in the partition. Partition Load Manager (PLM, which requires the presence of an HMC) can move processors, memory and I/O between DLPARs and SPLPARs. Each of the partitions has a minimum, maximum and guaranteed resource entitlement, and each resource is allocated to the partitions in proportion to their entitlement divided by the total of the entitlements of that resource across all of the partitions. PLM monitors the thresholds every ten seconds, and if a threshold is exceeded six times in a row, a “dynamic LPAR event” is triggered and resources will potentially be moved, assuming the resources are available.

IBM brings all of the above tools together via IBM Director, which includes server and storage deployment tools, as well as server (including micro-partition) and storage monitoring tools.

Linux

Linux VServer isolates the environments at an operating system level, and requires a single version and patch level. The operating system kernel code is separated from the user mode code, which is placed into Virtual Private Servers (VPS), which appear to the application as a physical server. Each of the VPS’s sees a set of processors, memory and I/O cards, as well as an entire distinct file system and device namespace, distinct from any other VPS. VServer controls processor and memory resources for each VPS using the standard PRM tools. The file system for each VPS is distinguished from all other file systems for all other VPS’s by the use of the “chroot” (change root) command, which defines a new “/” (root file system) for each VPS. In this way all of the applications can continue to use their own file system tree, which each sees as the entire set of file systems, without interfering with any other VPS.

Note that VServer is completely separate from Linux Virtual Server. Linux Virtual Server is a high performance (technical) computing (HPTC or HPC) environment, which allows many separate systems to cooperate in solving a difficult or very large mathematical problem.
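As a rough illustration of the PLM behaviour described in the IBM section above (proportional entitlements, plus a dynamic LPAR event only after six consecutive ten-second threshold breaches), here is a small sketch. The partition names and numbers are invented, and the real PLM policy engine is considerably richer:

    # Resources are handed out in proportion to each partition's entitlement, and
    # a "dynamic LPAR event" fires only after the utilisation threshold has been
    # exceeded on six consecutive ten-second samples.
    def entitled_share(entitlements: dict[str, int], total_units: int) -> dict[str, float]:
        total = sum(entitlements.values())
        return {p: total_units * e / total for p, e in entitlements.items()}

    def should_trigger_dlpar_event(samples: list[float], threshold: float,
                                   required_breaches: int = 6) -> bool:
        # True if the last `required_breaches` samples all exceed the threshold.
        recent = samples[-required_breaches:]
        return len(recent) == required_breaches and all(s > threshold for s in recent)

    print(entitled_share({"prod": 60, "test": 30, "dev": 10}, total_units=16))
    print(should_trigger_dlpar_event([0.7, 0.92, 0.95, 0.91, 0.93, 0.96, 0.94], 0.9))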
Microsoft

Windows System Resource Manager isolates the applications for performance management, and requires a single version and patch level. It controls processor and memory resources for each application process, which can be any non-system process. An application is defined by the command line which starts it, so multiple application instances can run in a single Windows instance, and be differentiated by their originating command line. It defines a set of “policies”, which can be exported and shared between servers, to manage a group of servers with the same policies. Resources are allocated in terms of percentage of total processor resources, and megabytes of physical memory and page file space. These are hard limits, and cannot be exceeded even if the resources are otherwise idle. There is no additional security controlling communications between the application instances.

Windows System Resource Manager is a free download for Windows 2003 Enterprise and DataCenter Edition customers, for all Windows server platforms. Note that this same technology is built into the SQL Server product, to allow multiple versions and instances of SQL Server to be run simultaneously. See the section on application resource partitions for more information.

Sun Microsystems

Sun offers two types of resource partitions: containers and zones.

Solaris Containers define a resource pool which contains one or more physical resource sets, which contain processors, memory and I/O cards. These resource pools are then allocated to the containers by means of “shares”. Resources are allocated in terms of percentage of total processor resources, and megabytes of physical memory and page file space. Containers can run either “capped” (hard limits, which cannot be exceeded even if the resources are otherwise idle) or “uncapped” (resources can be used if they are otherwise idle). There can also be resources which are unallocated, and are available on demand to any specific resource pool. Solaris uses Fair Share Scheduling (FSS) and Solaris Resource Management (SRM) to manage the resources. Containers do not implement any extra security beyond the standard restrictions on inter-process communication.

Solaris Zones are a specialized form of container which offers a virtualized operating system environment which can run Solaris. Each zone has its own interface to the environment, including network addresses, a host name, and its own root user. There is a “global zone” which is owned by the base Solaris instance, and then each of the applications runs in a “non-global zone”. Solaris can share resources between one or more zones; for example, when multiple zones are running the same application, they share a single physical in-memory copy of that image in the global zone. Further, the different zones can share disk volumes, because all of the volumes are owned by the global zone. Solaris increases the security of the resource partitions by enforcing communications restrictions between zones, ensuring that a process in one zone cannot communicate with a process in another zone, except through standard networking paths, exactly as if those two processes were on separate physical servers. Further, different non-global zones can be run in different trust domains. Solaris Zones have the same “resource pools” as Containers, which are allocated to each of the non-global zones and control the amount of resources that they can access.
Resources are allocated in terms of percentage of total processor resources, and megabytes of physical memory and page file space. These are hard limits, and cannot be exceeded even if the resources are otherwise idle. Solaris uses Fair Share Scheduling (FSS) and Solaris Resource Management (SRM) to manage the resources. Different non-global zones can be run in different trust domains, so there is extra security between zones.

Parallels (SWsoft) Virtuozzo

Parallels Virtuozzo (the company was formerly known as SWsoft, but recently changed its name to Parallels) is a set of services which run on top of 32-bit and 64-bit Linux (almost all distributions) and Windows (2003 and 2008 Server, but not earlier Windows versions), on AMD Opteron, Intel Pentium Xeon and Itanium processors. Virtuozzo presents a virtualized operating environment in a container, similar to Sun’s use of the term, which on Linux is referred to as a “Virtual Private Server” (VPS). The base operating system instance can run its own applications, or just act as the virtual machine monitor for the server, whichever is preferred by the administrator. The host instance can be clustered with other host instances on other physical servers with the appropriate clustering technology for the host operating system.

The Virtuozzo Management Console (VZMC, the primary management interface) or the Virtuozzo Control Center (VZCC, a web based management interface) will manage all of the host and guest instances. The guest instance itself can be managed through the Virtuozzo Power Panel (VZPP). Virtuozzo has the standard feature of creating a guest instance from a standalone instance (VZP2V), but can also import existing virtual environments from VMware and Xen and transform them into containers. Some of the system files (DLL’s in Windows, for example) in the new containers are replaced with links to the Virtuozzo files during this migration, which is also how Virtuozzo can maintain different patch levels in the different containers on the same host operating system. But for all common operating system files, Virtuozzo maintains a single copy in the host operating environment, which means that mass deployments of a common version of an operating system (which is the normal and recommended case) can be done with very little disk space, far less than having unique files for each operating system installation.

Virtuozzo has the ability to move a running guest instance from one server to another as true live migration, which takes about 15-20 seconds in a Windows environment and essentially zero time in a Linux environment. Note that Virtuozzo is an exception to the general rule that operating system and application vendors require error reports to come from a non-virtualized environment and will not accept error reports which occur in a virtualized environment: Microsoft has agreed to support their products inside a Virtuozzo virtual container.

The Parallels Infrastructure Manager controls all of the above tools, and provides an overview of all of the hardware resources and the containers, with the ability to drill down to each container. The DataCenter Automation Suite (DAS) tracks the performance of all of the virtual environments, and can perform chargeback. The containers can be dynamically modified for processor, memory and I/O. Virtuozzo includes a copy of Acronis TrueImage Backup for dynamic snapshots of the containers. Virtuozzo performs all I/O in the host operating system.
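To summarize the share-based allocation that runs through all of these resource partitioning products (AIX WLM's hard and soft limits, Solaris capped and uncapped containers, and so on), here is a deliberately simplified sketch of the "capped versus uncapped" idea. It is my illustration, not any vendor's algorithm:

    # Every partition gets at least its share-based entitlement; an uncapped
    # partition may also soak up whatever is left idle, while a capped partition
    # never exceeds its entitlement. Real schedulers work on far finer-grained
    # accounting than this single-pass calculation.
    def allocate_cpu(shares: dict[str, int], capped: set[str],
                     demand: dict[str, float], total_pct: float = 100.0) -> dict[str, float]:
        total_shares = sum(shares.values())
        entitlement = {p: total_pct * s / total_shares for p, s in shares.items()}
        alloc = {p: min(demand[p], entitlement[p]) for p in shares}
        spare = total_pct - sum(alloc.values())
        for p in shares:                       # hand spare capacity only to uncapped partitions
            if p not in capped and spare > 0:
                extra = min(demand[p] - alloc[p], spare)
                if extra > 0:
                    alloc[p] += extra
                    spare -= extra
        return alloc

    print(allocate_cpu({"hr": 25, "erp": 50, "batch": 25}, capped={"batch"},
                       demand={"hr": 10.0, "erp": 70.0, "batch": 40.0}))
    # The capped "batch" partition stays at its 25% entitlement even though it
    # wants more, while the uncapped "erp" partition absorbs the idle capacity.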
Variable Capacity Virtualization

Many environments have business cycles which require significantly more or less resources at various times. The increased resources required by the end of the month processing for accounting summaries and the Christmas business peak for retail businesses, or the reduced requirements for a weekend when no business users are logged in to the system, both mean that the system resources are wasted during the off-peak times. Traditionally, these resources would simply go to waste, with very low or even zero utilization for long periods of time. However, recently many vendors have implemented the ability to enable and disable resources dynamically, allowing the owners of the system to only pay for the resources that are actually being used at any given moment.

This may not seem like virtualization. However, remember my definition of virtualization at the beginning of this paper: the abstraction of resources from their physical function. One of the capabilities that vendors have implemented is the ability to migrate authorization to use the resources (aka, software licenses) between servers, such that there are fewer licenses than there are physical resources but all of the resources of any given system are available for portions of the time. Because it is using the hardware and software in a way that adapts it to the business requirement (as opposed to fitting the business requirement into the capabilities of the computing environment) it seems to me to fit the definition of virtualization, so I am including this concept in this paper.

There are several different types of variable capacity virtualization:

Some environments can guarantee that they will need to be expanded, but not be able to accurately predict when that expansion will be required, and that the resources will never be removed. A business which is growing its customer base, or its employee population, might require a constantly growing system. This is the “shrink-wrap” model, similar to the software model where if you break the shrink-wrap plastic around a software kit, you have purchased the software and cannot return it for a refund. This model allows you to have the resources physically installed in the system but not running or paid for, until you decide that they are required, at which time you purchase them and they are enabled for use.

Some environments have peaks and valleys of utilization, such that resources need to be enabled and disabled at various times. A business which has seasonal variation in utilization, or a business which needs to be ready for momentary spikes of demand, might wish to only pay for the resources which are actually being used at any given time. Or another business might need to have several servers which are fully populated with resources (such as processors), but have fewer software licenses than there are physical processors, and to migrate these software licenses to the server which most needs them at any given moment. This is the “phone card” model, similar to the pre-paid phone cards which have a certain amount of time on them, which can be used whenever the owner of the card needs to make a call. This model allows the owner of the server to buy a certain amount of time on that resource, and to consume that time whenever the business requires.

Some environments have the same peaks and valleys of utilization, but the business uses leasing as a financial tool.
Note that the licensing and support of this equipment are over and above the pricing of the equipment itself. So, for example, any Oracle or SQL Server or SAP licenses required for the additional processors would be in addition to the fees paid to the hardware vendor for the equipment itself. This should be considered when evaluating variable capacity virtualization, and the funds for the entire environment of the additional equipment should be forecast, so you won’t be surprised by an audit stating that you are out of compliance with your mission critical software packages, or by the inability to activate that software when you desperately need it.

Figure 16 Variable Capacity Virtualization

HP Integrity, PA-RISC (HP-UX, OpenVMS) - Permanent activation: iCAP; Temporary activation: TiCAP, 30 days by 30 minutes; Utility pricing: Pay Per Use (post-pay); Purchase method: Website, E-mail reporting; Granularity: cell boards, processors, memory (in cell boards)
HP ProLiant (Linux, Windows) - Permanent activation: iCAP; Temporary activation: n/a; Utility pricing: n/a; Purchase method: Order via SIM auto tracking; Granularity: server, blade enclosure
IBM iSeries, pSeries (AIX, i5) - Permanent activation: Capacity Upgrade on Demand (pre-pay); Temporary activation: Trial CoD 30 days, Reserve CoD 30 days by 24 hours (pre-pay); Utility pricing: On/Off CoD (post-pay), Capacity Backup (post-pay); Purchase method: Website, E-mail reporting; Granularity: processors, memory
Sun E2900 - E25K (Solaris) - Permanent activation: COD 2.0; Temporary activation: T-CoD, 30 days by 30 days (post-pay); Utility pricing: n/a; Purchase method: Phone call to Sun Center; Granularity: Uniboards, processors, memory

HP

HP implements Instant Capacity (iCAP) across both the Integrity and PA-RISC cell based servers and the ProLiant servers, but in very different ways. The Integrity and PA-RISC cell based servers (rx76xx, rx86xx, rp7xxx, rp8xxx and Superdome) allow iCAP for individual processors, as well as cell boards which contain processors and memory. iCAP is the purchase plan, where a customer would pay some percentage (approximately 25%) of the price of the equipment and have it installed in the system. When the customer needs the additional capacity, they would contact HP with a purchase order for the remaining cost of the equipment (note that this is the price at the time of the activation, not the price at the time of original purchase), and would get a code word, which can be entered into the system, and the new equipment is activated for that system instance. This works with either nPARs for HP-UX or OpenVMS (which would instantly see the new equipment in the operating system instance where you entered the code word), vPARs for HP-UX (which again would see the new equipment in the operating system instance where you entered the code word, but which could then move the processors to another operating system instance in the same nPAR), or dynamic nPARs for HP-UX (which could add the new cell board and its processors and memory to a running instance).

The same servers offer Temporary Instant Capacity (TiCAP) for processors. TiCAP is the “phone card” model, where you purchase 30 days of processor utilization, which you can then use in any HP-UX or OpenVMS operating system instance. The time is checked approximately every 30 minutes, and each TiCAP core which is active at that point counts against the time. So, for example, if you activated four cores for two hours, you would consume eight hours of time against your 30 days. TiCAP is bound to a single server, and customers often want to share licenses between servers. This is primarily useful in a disaster recovery scenario, where the DR server is idle for months at a time.
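The TiCAP consumption arithmetic above can be expressed as a tiny worked example (a simplification of my own; HP actually samples roughly every 30 minutes rather than multiplying cores by hours):

    # One TiCAP license is a pool of 30 days of single-core time, and every
    # active TiCAP core consumes time while it is switched on.
    LICENSE_HOURS = 30 * 24          # 30 days of one-core time

    def remaining_hours(activations: list[tuple[int, float]],
                        license_hours: float = LICENSE_HOURS) -> float:
        # activations: list of (cores_activated, hours_active) events
        consumed = sum(cores * hours for cores, hours in activations)
        return license_hours - consumed

    # Activating four cores for two hours consumes eight core-hours of the license.
    print(remaining_hours([(4, 2.0)]))   # 720 - 8 = 712 hours left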
HP offers Global Instant Capacity (GiCAP), which is an extension of TiCAP in that it covers a group of servers which may be geographically diverse. So, for example, you could have two identical rx8640 servers, one for production and one for DR, with all 32 cores active in your production site and only one core active in your DR site. If you need to fail over to your DR site, you use the GiCAP Manager to deactivate 31 licenses at your production site (since, by definition, you aren’t using them anymore) and activate them at your DR site. In this way, both the production and DR site are fully licensed (though not at the same time), and you have only purchased 33 licenses for the environment, saving the licensing costs of 31 licenses.

Both TiCAP and GiCAP allow you to consume more than is available on your TiCAP/GiCAP license. In this case, the monitoring records a negative balance, which is then adjusted when you purchase the next TiCAP/GiCAP license. So, for example, if you consume five extra days of processing time, and you purchase a new 30 day license, you will have 25 days remaining on your license when you install the new license. HP will not disable processors when you exceed the time on the license. Note that iCAP processors include 5 days of TiCAP licenses as part of their purchase.

Pay Per Use (PPU) is the leasing model, where all of the equipment is fully installed and operational at all times, but HP installs a monitor on the system to measure the actual utilization of the system. This is averaged over 24 hours, and then summed over a month, to produce an average utilization, which is then applied to your lease payment. So, for example, if your utilization was 80%, you would pay roughly 80% of your lease payment for that month. If the next month you experienced a spike to 90% utilization, your lease payment would be 90%. And so on if in the following month your utilization dropped to 70%. Many customers achieve significant savings with this model.

TiCAP, GiCAP and PPU require communication back to HP. This can be done through the OpenView Services Essentials Pack (formerly ISEE) via e-mail, through the HP website, or via phone. Note that if your system is isolated for security reasons and cannot communicate with HP, HP will accept FAX’ed or regular mail reports or other means of communication. If an active processor fails, an iCAP or TiCAP processor can be activated in the same nPAR to take the place of the failed processor. This is not a chargeable event, in that the same number of processors were active before and after the processor failed.

The ProLiant servers, both rack mount and blades, implement iCAP on entire servers and blade enclosures. HP will work with the customer to establish inventory levels and invoicing information. Customers are able to simply install servers from their inventory, which are detected by the Systems Insight Manager (SIM) and updated in the invoicing database. Then, when thresholds are reached (for example, down to 5 servers in stock), purchase orders are automatically executed and product shipped to restore the inventory levels. Blades are done the same way, except that the blade enclosure costs are amortized across the number of blades in the enclosure. So, for example, if 16 blades are to be installed in an enclosure, the price of each blade would include 1/16th of the price of the enclosure itself.
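The Pay Per Use and TiCAP/GiCAP carry-over arithmetic described above works out as follows (again just an illustrative sketch; the actual metering, averaging and rounding rules are defined by the HP contracts, not by this code):

    # Pay Per Use: the month's average utilisation scales the lease payment.
    def ppu_payment(daily_utilisation: list[float], full_lease_payment: float) -> float:
        average = sum(daily_utilisation) / len(daily_utilisation)
        return full_lease_payment * average

    # TiCAP/GiCAP: time consumed beyond the current license is carried forward
    # as a negative balance and deducted from the next license purchased.
    def new_license_balance_days(overdraft_days: float, new_license_days: float = 30.0) -> float:
        return new_license_days - overdraft_days

    print(ppu_payment([0.8] * 30, full_lease_payment=10_000.0))  # 8000.0 for an 80% month
    print(new_license_balance_days(5.0))                         # 25 days left on the new license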
IBM

The Power5 and Power6 based servers allow Capacity Upgrade on Demand (CUoD) for processors and memory. CUoD is the purchase plan, where a customer would pay some percentage of the price of the equipment and have it installed in the system. When the customer needs the additional capacity, they would contact IBM through their website with a purchase order for the remaining cost of the equipment, send the Vital Product Data (VPD, effectively the system serial number) via either FAX or e-mail from the HMC, and would get a code word, which can be entered into the system via the HMC; the new equipment is then activated and available for use by any DLPAR or SPLPAR which is uncapped.

The same servers offer Trial Capacity on Demand (Trial CoD) and Reserve Capacity on Demand (Reserve CoD) for processors and memory. Reserve CoD is the “phone card” model, where you purchase 30 days of processor utilization, which you can then use in any AIX or Linux operating system instance by simply moving the Reserve CoD processors to the active processor pool via the HMC. The time is consumed for the entire time each processor is in the active processor pool, and each processor counts against the time. So, for example, if you activated four cores for two hours, you would consume eight hours of time against your 30 days. The processors are available to any DLPAR or SPLPAR which is uncapped. Trial Capacity on Demand is slightly different, in that it operates for an entire 30 day period (which is counted only during the time when the server is powered on). It is intended for an evaluation of servers and software, or other extended period of usage. Otherwise, it works the same way as Reserve CoD.

Capacity Back Up (CBU) works with On/Off Capacity on Demand (On/Off CoD) to allow the activation and deactivation of processors and memory in a leased environment. Activation of a processor or memory unit is for an entire 24 hour period, and usage is summed and paid for at the end of the quarter. Otherwise, it works the same way as Reserve CoD.

Sun Microsystems

The Uniboard based servers (E2900, E4900, E6900, E20K and E25K) allow Capacity on Demand (CoD) for individual processors and memory. CoD is the purchase plan, where a customer would pay some percentage of the price of the equipment and have it installed in the system. When the customer needs the additional capacity, they would contact Sun via phone with a purchase order for the remaining cost of the equipment, and would get a Right to Use (RTU) license via e-mail, which can be entered into the system, and the new equipment is activated for that system instance.

The same servers (except the E2900) offer Temporary Capacity on Demand (T-CoD) for processors. T-CoD is the “phone card” model, where you purchase 30 days of processor utilization. If six months of T-CoD RTU’s are purchased, that is equivalent to purchasing the equipment. Sun offers the “headroom” feature, where you can activate the processors and memory immediately, and then purchase the CoD or T-CoD RTU’s after the fact. You have up to one month to purchase these RTU’s. If an active processor fails, a T-CoD processor can be activated to take the place of the failed processor. This is not a chargeable event, in that the same number of processors were active before and after the processor failed.
Virtualization Automation

All of the above technologies allow the system administrator to migrate, enable and disable the resources by hand, either through a command line interface (CLI), a script or a management console. However, the demands of a 24x7 business environment, and the inevitable peaks and valleys of business processes, often make it inconvenient to require the human beings who are managing the system to closely monitor the performance of all of their systems on a moment by moment basis. So some vendors have created software automation which can perform this monitoring, and which is capable of migrating, enabling and disabling the resources according to business rules.

Figure 17 Virtualization Management (summary; the original table lists, for each vendor, the management tools for soft partitions, micro partitions, resource partitions and utility/variable capacity, with “n/a” where there is no offering): Citrix XenServer is managed with XenCenter; Egenera with its Processing Area Network tools; EMC VMware with Virtual Infrastructure; HP with the CLI, Systems Insight Manager and Capacity Advisor, plus PRM and WLM for resource partitions; IBM with the CLI, HMC and IVM, plus WLM; Linux (Red Hat, SUSE) with the CLI and Virtual Machine Manager; Microsoft with Virtual Machine Manager, MOM and System Center; Sun with the CLI, xVM Ops Center and SRM; and SWsoft with the Virtuozzo Management Console and DataCenter Automation.

Note that “n/a” in the above table indicates that either this vendor or product does not offer this type of virtualization, so by definition there is no automation capability, or this vendor does not offer automation of this type of virtualization.

Figure 18 Virtualization Automation (summary): the automation capabilities include gWLM for HP soft partitions, resource partitions and utility pricing, with PRM and (g)WLM for HP resource partitions; eWLM and PLM for IBM soft and micro partitions, with WLM for IBM resource partitions; the Processing Area Network for Egenera micro partitions; DRS for VMware micro partitions; FSS, SRM and xVM Ops Center for Sun; and DataCenter Automation for SWsoft. The remaining cells are n/a.

Operating System Emulation

One method of virtualizing systems that may not be obvious as a virtualization technology is emulation. But if you think about it, emulation is the abstraction of server, storage, and network resources from their physical configurations, in that the operating system may be running on hardware which is completely different than the original server hardware, and so I include it as virtualization. The following are popular methods of emulation by virtualization:

Aries
o HP-UX PA-RISC on HP-UX Integrity
Charon-VAX
o OpenVMS VAX on Windows x86 and OpenVMS Alpha
Intel
o IA-32 Execution Layer on Itanium
Platform Solutions, Inc with T3 Technologies
o zOS, OS/390 on HP Integrity (Liberty Server)
o zOS, OS/390 on IBM xServer (tServer)
Transitive Software (QuickTransit)
o Solaris SPARC on Solaris x86, Linux x86 and Linux Itanium
o Linux x86 on IBM pServer (PowerVM Lx86)

Note that at this time (July 2008), IBM has purchased Platform Solutions, with the stated intent of “maximizing customer value” of this technology. Pardon me if I cynically believe that they will bury this technology completely, in order not to interfere with their mainframe revenue stream.

Summary

All of the above technologies fit into a continuum of server partitioning, as shown in Figure 19.
Figure 19 Server Virtualization Continuum. The diagram depicts a continuum running from a single physical node (a single system image per node within a cluster), to hard partitions within a server (each an OS image with hardware fault isolation and dedicated CPU, RAM and I/O), to soft partitions and/or micro partitions within a hard partition of a server (OS plus software fault isolation, with dedicated or virtual and shared CPU and I/O and virtualized memory), to resource partitions and/or secure partitions within a single system image (applications with guaranteed compute resources, expressed as shares or percentages), trading isolation for flexibility as you move along the continuum.

There are many choices that can be made when deciding how to virtualize a server, and these choices can be overwhelming. Not all servers can do all of the types of virtualization, but even the subset that a particular server can do can often present too many choices. You need to go back and understand the business goals for the server, in order to make good choices for the one or more virtualization technologies that you will apply to that server. And you should consider some of the following criteria when choosing among those technologies:

Internal resistance. Developers and BU representatives may resist new application hosting, partitioning and stacking strategies.

ISV support. ISV’s may not support their application running in a virtualized environment. Platforms that support virtualization at the hardware level have minimum OS versions that might be incompatible with some older applications.

Security challenges that occur when two separate applications with privileged users are combined with non-privileged users on the same OS instance.

What applications are available for that server and that operating environment? Does your application vendor support a virtualized environment for production? Do they have a set of “best practices” for a virtualized environment? And this applies equally to the case where you are running “home-grown” applications and acting as your own application vendor and support.

What is the size of the applications? Are you doing server consolidation where a single processor with a few tens of megabytes of memory is sufficient for multiple instances, or are you adding more and more load to the system where you will constantly need to add processors and memory and I/O cards?

What is the degree of fluctuation in the workload? Is it purely cyclic (Christmas buying season, end of the month processing, etc), is it random (acquiring a new company which doubles the user base), or is it constantly increasing? How predictable is the growth? And can you afford to have one workload slow down the other workloads by consuming all of the available resources on the server?

What level of security isolation do you require?
An application service provider (ASP) would typically offer and advertise strong isolation between customers, but any business might require that same level of isolation between departments, due to regulatory issues.

What degree of availability do you require? Can you afford to have many application instances become unavailable for minutes during the failover if a single system goes down?

How do you plan for new workloads? Testing and training are the most common activities which require new operating system instances quickly, and then will remove them just as quickly, but how quickly do you need a new database instance or a new application instance?

There are many choices in server virtualization, and many technologies involved in implementing those choices. But you need to focus on the business reasons you purchased the server systems, make the correct choices for your business requirements, and then choose the correct server virtualization technology to implement those choices.

Network Virtualization

Networking virtualization is not as complex as server virtualization, because there are many fewer layers of hardware and software involved. But we need to ensure many of the same features as in server virtualization, including efficiency, security, high performance, scalability, high availability, easy growth and redundancy, as our networks have almost moved beyond mission-critical into the realm of the very air we breathe. Without the networks providing communication between our users and our servers, all of the topics that I have discussed up to this point are completely useless.

I will focus primarily on TCP/IP V4, as that is by far the most common networking protocol in use today. But where applicable, I will discuss other protocols such as AppleTalk, DECnet, NetBEUI, SNA, TCP/IP V6, UDP, etc, as well as other transport mediums like InfiniBand.

TCP/IP Subnets and Virtual LANs

One of the first topics we need to discuss is subnetting. When Transmission Control Protocol / Internet Protocol (TCP/IP) was designed in 1973, the designers provided for networks of various sizes. The first section of the address was to be the network identifier, with the rest of the address being the specific host in that network. The size of the networks was explicit in the class-ful 32 bit addressing scheme. If the first bit of the network address was 0, then the network identifier was 8 bits long, and it was one of the 126 Class A networks with up to 16 million addresses each, with the leading number being from 1 through 126. If the first bit was 1 and the second bit was 0, then the network identifier was 16 bits long, and it was one of the 16,384 Class B networks with up to 65 thousand addresses each, with the leading number being from 128 through 191. If the first two bits were 1 and the third bit was 0, then the network identifier was 24 bits long, and it was one of the 2 million Class C networks with up to 254 hosts each, with the leading number being from 192 through 223. And this made sense at the time, and several of the founding networking companies got some of the prime networking real estate: Digital Equipment Corporation was assigned an early Class A network (16.x.x.x). The routers build their tables based on these addresses, and perform all traffic routing based on the network identifier, such that each network identifier belongs to a single domain (ie, company or other organization).
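The class-ful rules above can be captured in a few lines (a sketch only; it classifies an address by its first octet exactly as described, and ignores special cases such as the 127 loopback network):

    # Classify an IPv4 address by the leading bits of its first octet, which
    # under class-ful addressing determine how long the network identifier is.
    def address_class(first_octet: int) -> tuple[str, int]:
        if first_octet < 128:        # leading bit 0
            return "A", 8
        if first_octet < 192:        # leading bits 10
            return "B", 16
        if first_octet < 224:        # leading bits 110
            return "C", 24
        return "D/E", 0              # multicast / experimental, not class-ful unicast

    print(address_class(16))     # ('A', 8)  - e.g. 16.x.x.x
    print(address_class(154))    # ('B', 16) - e.g. 154.11.x.x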
But after a while, it became obvious that class-ful addressing would not work out. For example, with only 126 Class A networks available, even Digital Equipment could not efficiently use 16 million addresses, as this was 100 times as many employees as it had at its peak. And there were far more than 126 other companies which needed more than the 65 thousand addresses which were available in a Class B network. And companies which could not justify getting an entire Class B address needed more than the 254 addresses available in a Class C, so they needed multiple Class C domains. So the standard was extended to Classless Inter-Domain Routing (CIDR), where the break between the network identifier and the host address was made more extensible and flexible.

So each TCP/IP address requires a second piece of information, the “network mask”. We continue to use the class-ful addressing scheme to determine the network address, but add a new concept of the “subnet”, so we can determine which pieces of the 32 bit TCP/IP address are the network identifier, which are the subnet identifier, and which are the host. So, for example, if the TCP/IP address is 16.1.2.3 and the network mask is 255.255.255.0 (a /24), then you apply the binary AND function to the two values, yielding 16.1.2.0 as the network identifier. But because the first bit of the network identifier is 0, we know this is a Class A address, so the network identifier is 16.0.0.0, the subnet identifier is 0.1.2.0, and 0.0.0.3 is the host.

How does this help? Well, by allowing us to assign more granular groupings of networks to different organizations. For example, assume a company needs 500 addresses. Instead of assigning an entire Class B address to a single company (and wasting some 65 thousand host addresses), or assigning two separate Class C addresses to that same company (which complicates the network design and increases the routing table load on the primary routers), we can assign a single block of just the right size. So we could give this company network identifiers 154.11.204.x and 154.11.205.x with a subnet mask of 255.255.254.0 (/23), and they would both be routed to this company as a single route, while consuming only about 500 addresses, even though they are carved out of Class B address space.

So what does this have to do with networking virtualization? Well, you can apply the same concept inside your company to further subdivide your network even beyond the high level subnet that was assigned to your company. If we were still using class-ful addressing, we could only have two different Class C domains (255.255.255.0, /24): 154.11.204.1 through 154.11.204.254 and 154.11.205.1 through 154.11.205.254. But with a subnet mask of 255.255.255.128 (/25), you could assign network addresses 154.11.204.1 through 154.11.204.126 to one group, 154.11.204.129 through 154.11.204.254 to another group, 154.11.205.1 through 154.11.205.126 to a third group, and 154.11.205.129 through 154.11.205.254 to a fourth group. With a different subnet mask, you could go down to any level you wished, including subnets of 2 hosts (which would be silly, but perfectly in accordance with the standards). And you could have different subnets of different sizes. But let’s keep it simple. Now that we have the addresses segmented, we only have four domains, so our routing tables inside our company are very simple.
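The mask arithmetic in the 16.1.2.3 example, and the /23 assignment, can be checked with Python's standard ipaddress module (a worked example of the calculation described above, not part of any product):

    # AND-ing the address with the netmask gives the (sub)network identifier;
    # the remaining bits are the host part.
    import ipaddress

    iface = ipaddress.ip_interface("16.1.2.3/24")            # address plus mask 255.255.255.0
    print(iface.network)                                      # 16.1.2.0/24
    mask = iface.network.netmask
    print(ipaddress.ip_address(int(iface.ip) & int(mask)))    # 16.1.2.0 - the binary AND itself

    # The /23 example: 154.11.204.0/23 spans both 154.11.204.x and 154.11.205.x.
    net = ipaddress.ip_network("154.11.204.0/23")
    print(ipaddress.ip_address("154.11.205.7") in net)        # True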
This very lengthy discussion is simply to introduce the concept of subnetting, which is a virtualization strategy for network addressing. And in and of itself it is moderately useful (primarily to allow efficient subdivision of the increasingly scarce TCP/IP V4 network addresses, given the explosion of the number of devices that want IP addresses), but the real advantage for your company comes when we begin to do virtual Local Area Networks (LANs).

A LAN is simply a broadcast domain, where all network addresses are shared with no routing. So all network ports can see all network packets, as anyone who has experimented with a network sniffer or put their NIC into promiscuous mode can attest. Some network switches will perform some level of segmentation, such that the packet is only sent out the port to the specific device, but this is not universally true. A LAN is usually defined by one or more network switches.

A Virtual LAN (VLAN) is simply a subdivision of that LAN, so that the broadcast domain is smaller and more focused on only those servers which interact with each other the most. VLANs are defined by IEEE 802.1q. VLANs can be defined in a variety of ways, depending on the requirements of the business. VLANs can be defined either at networking Layer 2 (the data link layer) or at Layer 3 (the network layer).

Layer 2 VLANs define their members by physical attributes: either Media Access Control (MAC) addresses (the unique identifier on every network device) or by the port number on the network switch. This is extremely easy to implement, and corresponds well to the physical environment: everyone on this floor is on the same VLAN, or everyone in a specific group on this floor is on the same VLAN. However, it is often difficult to scale, as it requires tracking the MAC addresses of every network device and potentially assigning them to one of the VLANs. With tens of network devices, this is easy, but with thousands of network devices (including PC’s on desktops, servers, wireless devices wandering around the environment, plus the number of devices that are added or replaced or removed each month, etc) this is increasingly difficult to keep track of.

Layer 3 VLANs define their members by non-physical attributes: TCP/IP addresses, the networking protocol being used, whether they are part of an IP multicast group, etc. This allows much more flexibility, and removes the geographic portion of the solution. So, for example, you could assign a VLAN to everyone who is using the same networking protocol. This is common when different divisions of a company chose different solutions (DECnet vs TCP/IP, or AppleTalk vs SNA), or even when companies have grown by acquisition and the different companies had chosen different networking solutions. Protocol VLANs can achieve this level of isolation among the various groups, isolating the TCP/IP traffic from the DECnet traffic from the AppleTalk traffic.

Layer 2 VLANs are good for non-TCP/IP environments, as well as for fairly static environments, while Layer 3 VLANs are good for more dynamic environments. Assuming that your entire company has switched to TCP/IP (and this is increasingly common), you could assign a group of TCP/IP addresses to the Finance Department, and then assign those addresses to a specific VLAN.
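As a hedged illustration of Layer 3 membership (the VLAN numbers, names and subnets below are invented for this example), membership can be modeled as a simple lookup from address group to VLAN, independent of any MAC address or switch port:

    import ipaddress

    # Hypothetical Layer 3 VLAN map: membership is decided by the address group,
    # not by the physical port or the MAC address.
    vlan_map = {
        ipaddress.ip_network("154.11.204.0/25"):   ("VLAN 10", "Finance"),
        ipaddress.ip_network("154.11.204.128/25"): ("VLAN 20", "Engineering"),
    }

    def vlan_for(address: str):
        ip = ipaddress.ip_address(address)
        for subnet, vlan in vlan_map.items():
            if ip in subnet:
                return vlan
        return ("VLAN 1", "default")

    # A replaced NIC keeps the same TCP/IP address, so it stays in the same VLAN.
    print(vlan_for("154.11.204.17"))    # ('VLAN 10', 'Finance')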
All desktops, notebook PC’s with wireless cards, printers, servers and other devices which are on the network all get their TCP/IP addresses from this group, and so they are all automatically part of the Finance VLAN. Even when the MAC address changes (due to replacing a broken NIC, or upgrading to a new PC), the TCP/IP address remains the same, so membership in the VLAN remains the same. IEEE standard 802.1x covers edge access security, where a user is authenticated via some means (whether at Layer 2 or 3, as discussed above), and then is allowed access to some portion of the network, such as a specific VLAN.

The entire purpose of this, as stated before, is to restrict the size of the broadcast domain. All members of the Finance VLAN will see each other’s packets, but no one else will. This reduces the amount of traffic on the backbone routers, because if the packet is intended for one of the members of the VLAN, it never leaves their local switch and so is never routed. It also enhances security, in that all members of the Finance VLAN are automatically granted rights to things like printers, but non-members are automatically barred from those devices. In fact, the routers themselves perform this security function by simply never forwarding the packets to the devices, unless they are on the correct VLAN.

This works extremely well when all members of the VLAN are in the same geographical location, and can be (for example) all plugged into the same network switch. This is the optimal case, where no routing ever occurs. However, it is rarely achieved in the real world, given the way people move around between offices, plus wireless devices which are intended to move between areas of the company. And you certainly don’t want a person to lose their rights to access a database server just because they happen to be visiting a remote office, which is far away from their home network switch. This means that in some cases, the network packets need to be routed. The way this is normally accomplished is via tunneling or label switching.

Tunneling is taking the entire network packet, and treating it as the payload of a larger network packet. So a network packet that is isolated to a VLAN is incorporated into a network packet which is fully routable, and is not restricted to a specific VLAN. This larger packet is then sent across one or more routers to the appropriate network segment, where it is peeled away to reveal the smaller packet inside, which is then sent to the appropriate network device. We will discuss this in more detail in the VPN section.

Label switching prefixes the original packet with a label which identifies the source and target routers for this packet. Multi Protocol Label Switching (MPLS) is a common Layer 3 method of doing this. MPLS adds a prefix to the packet and forwards it on to the target router, but the value of MPLS is that the prefix tells the target router how to forward on the packet, such that the target router does not have to spend a lot of time analyzing the packet and calculating the route. This makes it extremely efficient.

You need to consider several factors when setting up VLANs. These include:

Access control – Ensure that all legitimate users have access to the services that they require, but that non-authorized users are excluded from accessing your environment. Whether this is done at Layer 2 or Layer 3, you need to correctly assign the privileges and rights to each user and each device, and to protect those systems that offer services.
Path isolation – Make sure that the primary purpose of VLANs is accomplished: reducing the amount of traffic on the primary network backbone. You need to minimize the amount of routing that occurs between network segments, keeping the traffic isolated to each group of users in each VLAN.

Centralized management – Implement centralized policy enforcement, with a common set of criteria, authentication, tracking and verification across your entire enterprise. Whether this is due to regulatory requirements, financial and reporting requirements, or simply best practices, you need to treat your entire company the same, such that people who move around in the company (such as with wireless devices or travelers who do not have a fixed office) are treated equally no matter where they are at any moment. Be careful of DHCP, because moving between subnets generates a new IP address, and the Layer 3 VLAN is based on the old IP address.

I spent a great deal of time discussing subnetting, primarily because VLANs and subnets are often used interchangeably. Subnets isolate the routing to a specific group, and VLANs isolate the packet traffic to a specific group. It is extremely common to have a VLAN map to a specific subnet, and you need to consider both when you set up either one.

If you are going to integrate Cisco and HP ProCurve hardware on the same network, and you intend to use VLANs, there are only a few things you need to remember:

For end nodes - Cisco uses "mode access", HP uses "untagged" mode.
For VLAN dot1q trunks - Cisco uses "mode trunk", HP uses "tagged" mode.
For no VLAN association - Cisco uses no notation at all, HP uses "no" mode in the configuration menu, or you have VLAN support turned off.

Virtual Private Networks (VPN)

VLANs isolate the traffic, and that isolation results in somewhat higher security, in that devices not on the VLAN simply don’t see the traffic at all. However, as we discussed above, sometimes the traffic does leave the isolated area to travel over more public routes, such as when the VLAN is distributed across an entire physical WAN with multiple routers. This forces the VLAN traffic to traverse a more public network (whether that is a network which is private to the company but accessible to all of its divisions, or a truly public network such as the Internet), and therefore we need to implement a higher level of security.

One technique is to use tunneling, such as Virtual Private LAN Service (VPLS). This tags the VLAN that the packet came from with an identifier that is unique to that VLAN. This is an extension of the MPLS technique described above, except that it more clearly identifies the VLAN and not just the target router. The target edge device that receives the packet then uses this identifier to take action on the packet, such as routing. This is at Layer 2, and binds multiple LANs and VLANs into a single larger environment. The encapsulation is most frequently done by the Point to Point Tunneling Protocol (PPTP, which requires TCP/IP) or the Layer 2 Tunneling Protocol (L2TP, which can work with many protocols and networking methods such as ATM). It fully encloses the original packet, and routes the contents to the edge device which then extracts the original packet and routes it appropriately.
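To make the encapsulation idea concrete, here is a toy sketch in Python. It is not the wire format of PPTP, L2TP or VPLS; it simply shows the inner packet riding as the payload of an outer, routable packet, with all of the addresses invented for the example.

    # Toy illustration of tunneling: the original packet becomes the payload of a
    # larger, routable packet. This is not any real protocol's wire format.
    def encapsulate(outer_src: str, outer_dst: str, inner_packet: dict) -> dict:
        return {"src": outer_src, "dst": outer_dst, "payload": inner_packet}

    vlan_packet = {"src": "154.11.204.17", "dst": "154.11.204.30", "data": "finance report"}
    tunnel_packet = encapsulate("16.1.1.1", "16.2.2.2", vlan_packet)

    # The routers along the way look only at the outer addresses; the edge device
    # at the far end peels the outer packet away and delivers the inner one unchanged.
    delivered = tunnel_packet["payload"]
    print(delivered == vlan_packet)   # True - and note the inner packet is still readable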
But this does not protect the contents of the packet: a packet sniffer could identify the VPLS or MPLS packet and then peek inside to read the source and target addresses of the original (embedded) packet, and even the contents of the packet itself. This may be acceptable if the underlying network is trusted, such as it would be inside a building or campus network. These are called Trusted VPNs. But this may not be acceptable if there are security concerns inside the company, such as either regulatory requirements or multiple levels of security in a single organization. And this is clearly not acceptable if some external network is used, such as a city-wide WAN service or the Internet itself.

In this case the security responsibility resides in the VPN, which requires some type of authentication of the users and then encryption of the data. Normal VLANs establish membership by various means (TCP/IP address, MAC address, switch port number, etc), but VPN’s demand specific authentication of the user. These strong authentication measures may include usernames and passwords, using the Password Authentication Protocol (PAP), the Challenge Handshake Authentication Protocol (CHAP, including several versions from Microsoft called MS-CHAP and MS-CHAP V2), and the Extensible Authentication Protocol (EAP). But many go further than this by requiring physical security keys implementing Public Key Infrastructure (PKI), such as USB dongles and smart cards which contain digital certificates from a trusted Certificate Authority, usually your corporate networking or security group.

Then you need to encrypt the data. There are many types of encryption, but the most common for VPN’s is public key encryption, where a public key is paired with a private key to produce quite excellent security. The private key is usually stored on the same smart card as the authentication information.

Note that VPN encryption does not provide end to end security. The data is fully protected when sent over the untrusted network, but then the data is decrypted by the VPN server at the trusted network edge, such that the information is sent in the clear over the corporate network. In most cases this is completely adequate, but if necessary, corporations can implement secure VPN’s with encryption even inside their corporate network.

Network Address Translation and Aliasing

One of the problems with TCP/IP V4 is the extremely limited and disjoint addressing space. I am sure that the original designers never imagined the scope of what they were inventing, nor could they project the simply incredible number of networks, hosts and clients that we have today. Subnetting helps, but it doesn’t do anything to increase the number of available addresses; it merely lets us use any granularity that we wish in segmenting those addresses. Another problem is that we need near infinite scalability today in some of our web services, which far exceeds the capacity of any single system by any vendor. But we don’t need that scalability all the time, as it is economically convenient to expand and contract the capabilities dynamically, in response to actual demand.

A solution for both of these problems is Network Address Translation (NAT). NAT uses the concept of a front end load balancing system with a public network address, to mask the multiple server systems behind it which actually perform the work. The load balancer simply accepts any incoming request from the user base, and then distributes it to one of the servers in its assigned pool.
The assigned servers can either use one of the addresses owned by the company providing the service, or they can take advantage of some of the special addresses implicit in the TCP/IP standard, such as the 10.*.*.*, 172.16.*.* through 172.31.*.* and 192.168.*.* sequences. For the purposes of this discussion, I will use the 192.168.*.* numbers. These special addresses are reserved via RFC 1918 for private networks, and are not routed on the public Internet, and so are usable for operations like this. The advantage of using one of these addresses is that the same addresses can be used for multiple sets of load balanced servers, optimizing the use of the addresses available to the organization.

When a request comes in to one of the load balancing servers (in the example in Figure 20 we will use hp.com/pages with TCP/IP address 16.1.1.1), it is immediately forwarded to one of the application servers. For example, it could be forwarded to the server with address 192.168.1.1. The load balancing server would remember the packet and which server it sent it to, so that the response from 192.168.1.1 could be returned to the correct requestor. This “one to many” approach is more properly called “Port Address Translation” (PAT) or “Network Address Port Translation” (NAPT), because the load balancing server appends a port number to the public address, in order to differentiate the application servers and keep the conversations straight. In a “one to one” approach, it is truly “Network Address Translation” (NAT). (Note that this is common in home networking, where a single network address in a DSL or cable modem will translate into multiple addresses, one for each home networking device). In either case, the load balancer re-writes the IP address of the application server in both directions, so that the users see only the public address hp.com/pages and 16.1.1.1 (and cannot see the private addresses), and the servers see only their private addresses (and are unaware of the translation being done by the load balancers).

Figure 20 Network Address Translation

This solves multiple problems: there are only a few public addresses required (one per service), and we can scale the number of application servers behind each service practically infinitely. AOL uses this design to serve web pages and their Instant Messaging software, and has a single environment of over 1,000 application servers. Google has almost half a million servers in this kind of structure. The individual application servers are known only to the load balancing servers: their names and/or IP addresses are not known anywhere else, since their functions are controlled strictly by the load balancing servers. So if an application server fails, or you need to add a series of new application servers, you simply place them on your network and add them to the routing tables in the load balancing server, and they are now in production. The outside world only ever sees the service offered by the load balancing server.

To advertise the load balancing services, you would use the standard DNS or WINS or Novell Directory Services or Active Directory (or whatever you care to use) to create service names, assign them IP addresses, and place these into your namespace database. This would be HP.com/pages and HP.com/apps, as a completely made up example. The IP addresses for these would also be in your directory server, such as 16.1.1.1 and 16.1.2.1.
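A minimal sketch of the port-based bookkeeping described above, with all addresses and ports invented for the example; a real load balancer or NAT device does this re-writing in hardware or in the network stack rather than in application code.

    import itertools

    public_ip = "16.1.1.1"
    servers = itertools.cycle(["192.168.1.1", "192.168.1.2", "192.168.1.3"])
    ports = itertools.count(49152)     # ports used on the public side to tell conversations apart
    translations = {}                  # public port -> (client address, chosen application server)

    def inbound(client_addr: str):
        """An incoming request is re-written to one of the private servers (simple round robin)."""
        port = next(ports)
        server = next(servers)
        translations[port] = (client_addr, server)
        return server, port

    def outbound(port: int) -> str:
        """The reply is re-written back to public_ip and returned to the original requestor."""
        client, _server = translations[port]
        return client

    print(inbound("203.0.113.7"))   # ('192.168.1.1', 49152)
    print(outbound(49152))          # '203.0.113.7'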
Then, behind the scenes, you create the set of application servers. These get assigned IP addresses, but no names, and the IP addresses would not be publicized in your directory server. The only visible address is that of the service offered by the load balancing server itself. In this way you can add new application servers, without having to change anything other than the routing table in the load balancing server. And when a system goes down, whether the outage is planned, like system maintenance or software upgrades, or unplanned, because systems do occasionally fail, the entire environment keeps going. If you are serving web pages, the user simply clicks the “Refresh” button, and the new request is sent to one of the other servers. If you are serving applications, an external transaction monitor can re-start the transaction. In either case, the entire environment stays operational. As you can also see, you can have a mix of systems offering services, so you can choose the application servers based on the best application for the job, and not be restricted in your choices of environments.

In this diagram the load balancing servers are single points of failure. In real life you could have multiple load balancing servers acting either in a failover or a load balancing capacity, depending on which load balancing server you chose.

There are basically three different approaches to load balancers: networking, hardware and software.

The traditional way of doing this was to let the Domain Name Server spread the load among the different application servers. This is simple to implement, since the DNS server simply rotated among a hard-coded list of IP addresses. It was also cheap, since you needed a DNS server anyway. But there were some serious deficiencies to this scheme: all of the other network and server components along the way tended to cache IP addresses, which defeated the round robin scheme. Further, the DNS server had no way of knowing whether any given server was ready to accept the next request, or whether it was overloaded, or whether it was even up. So requests could get lost, and servers could get overloaded, and no one would know.

To get around these problems, you can use a dedicated network box to act as a router to the application servers. This has some serious advantages, including greater scalability, no CPU load on the application servers for routing, and very simple management. Your network people will find the management of the load balancing server to be very similar to the management of the network routing tables they take care of today, and will be pleased at how simple it is to manage. The one disadvantage to this approach is cost: these systems cost between $5,000 and $50,000, depending on capacity. And you generally need to buy them in pairs, so you don’t have any single points of failure.

There are two basic kinds of hardware load balancers: network switches and appliance balancers. Network switches, such as the Alteon, are basically standard network switches which have additional software running in the ASICs to handle the load balancing function. Appliance balancers, such as those from Cisco or F5, on the other hand, run the load balancing software as their primary function, but don’t need the level of tuning or maintenance that servers need, since they are network appliances. Either type can easily handle the bandwidth of major Internet sites: you will find that your Internet connection will be more of a bottleneck than the load balancer.
The other choice is to use a software load balancer. This is basically a service to be run, and it consumes CPU cycles, which means each of the application servers has fewer cycles to do the real work that you bought them for. However, it does not require a separate load balancing server or servers, so for small sites this might be cheaper.

All of the application servers must be on the same network segment and in the same IP subnet. The network cards are set in promiscuous mode, whereby all NICs accept all packets on that network segment. The load balancing service on each of the application servers then computes which application server should handle this particular task, and the other servers ignore the packet. The Windows Network Load Balancer (NLB) is part of Windows Server 2003 and above; Microsoft purchased this code (the Convoy Cluster Software) from Valence Research. Beowulf clusters take a similar scale-out approach on Linux, although they are aimed at parallel computation rather than network load balancing.

One of the things that you must be aware of is the different amount of information that each of the load balancers has about the state of the application servers. Network load balancers are basically network switches with some added software. As such, they have only minimal information about the application servers. They do the best they can, using keepalive timers, sending “ping” and measuring the response time, and choosing the optimal application server via content routing. But these methodologies tend to measure network performance as much or more than they measure server load. They are also subject to tuning of the timeout values, where a busy server might be judged inoperable if the timeout value is too low, and an inoperable server might continue to get requests sent to it if the timeout value is too high. There are attempts to make this even more efficient by using Dynamic Feedback Protocol (DFP) and other tools to become more aware of the actual load on each of the servers. But they are still only network switches, and cannot understand the details of the servers at any given moment. This is not to say that overall they cannot do a reasonable job, but just to make you aware of some of the difficulties that they face in doing so.

Software load balancers tend to do a better job at measuring the actual load on each of the servers, primarily because they are running full operating systems which are identical to the ones on the application servers themselves. In most cases, they are simply one of the application servers which is taking on the additional task of being the load balancing server.

There is no perfect solution: network load balancers scale far higher than software load balancers, but tend to be less effective at measuring actual load. Software load balancers cost less, but tend to not be as effective at delivering services because they are also doing the load balancing work. But either solution can fully satisfy your business requirements; you simply have to keep all of the factors in mind when choosing between them.

Another case of network aliases is the cluster network alias. In this case we have a service which is being offered by a software cluster, with very close cooperation between all of the members of the cluster. The service may be something like file or print services, a database service, or an application service like SAP. The service may be offered on multiple members of the cluster (in a shared everything cluster environment) or on only one member of the cluster (in a shared nothing cluster).
The network alias is bound to one of the NIC’s on one of the members of the cluster, and accepts all incoming requests for that service. At that point the methods differ:

• The Domain Name System (DNS) method offers a single network name, which is the alias for the cluster or the service. A list of “A” records specifies the network addresses of the actual systems in the cluster. The DNS server can “round robin” the requests among the list of network addresses, offering simple load balancing. One or more of the systems in the cluster dynamically update the list of “A” records in the DNS server to reflect system availability and load. This offers load balancing and higher availability, as requests will only be sent to systems that are running and able to do the work. TCP/IP Services for OpenVMS offers the Metric Server, which calculates the workload on each of the cluster members, and the Load Broker, which updates the list of records in the DNS server.

• If the address is not shared, it is bound to one of the network interface cards (NIC) on one of the systems in the cluster. All connections go to that NIC until that system exits the cluster, at which time the network address is bound to another NIC on another system in the cluster. This offers failover only, with no load balancing, but it does allow a client to connect to the cluster without needing to know which NIC on which system is available at that moment. AIX HACMP, Linux LifeKeeper, MySQL Cluster, Oracle 9i/10g Real Application Clusters, HP-UX Serviceguard, SunCluster and Windows Microsoft Cluster Service do not share the cluster alias.

• If the address can be shared, the network address is still bound to one of the NICs on one of the systems in the cluster. Again, all connections go to that NIC until that system exits the cluster, at which time the network address is bound to another NIC on another system in the cluster. However, in this case, when a request is received, the redirection software on that system chooses the best system in the cluster to handle the request by taking into account the services offered by each system and the load on the systems at that moment, and then sends the request to the chosen system. This offers high availability and dynamic load balancing, and is more efficient than constantly updating the DNS server. NonStop Kernel and Tru64 UNIX TruCluster offer a shared cluster alias.

Channel Bonding

Even with the amazing advances in network performance, many applications exceed the capacity of a single NIC or switch port. The solution is the same one that was used for decades when applications ran out of processor performance: Symmetric Multi Processing (SMP), where we put many processors together to act as a single high bandwidth processor. This goes by many names: NIC teaming, port teaming, Ethernet trunking, Network Fault Tolerance, Redundant Array of Independent Networks (netRAIN), Auto Port Aggregation (APA), etc. But they all mean the same thing: to put many NIC’s together into a single virtual NIC, which increases the bandwidth available by the number of physical NIC’s that are bonded together. This is officially known as Link Aggregation (with its Link Aggregation Control Protocol, LACP), covered by IEEE standard 802.3ad. The different operating systems handle this in one of two ways: active/active and active/passive.
Active/active means that all channels are used at all times, with different requests being sent out and back on all channels, to achieve almost linear growth in network bandwidth. Failover is extremely fast, in that the failed packet is simply re-sent out on one of the other active channels. AIX offers EtherChannel, HP-UX offers Auto Port Aggregation (APA), Linux offers Channel Bonding, and Solaris offers Nemo (liblaadm), all active/active. In addition, several networking companies offer solutions, such as Broadcom Advanced Network Server Program for Linux and Windows, Dell PowerConnect Switches for Windows, HP’s ProLiant Essentials Intelligent Networking Pack for Linux and Windows, Intel Advanced Network Services with Multi-Vendor Teaming for Linux and Windows, and Paired Network Interface from MultiNet (and TCPware) for OpenVMS, all of which are also active/active.

Active/passive means that only one channel is used for all traffic, with all of the other channels being available for failover but not in active use. FailSAFE IP on OpenVMS and Tru64 UNIX, and LAN Failover on OpenVMS, are examples of active/passive channel bonding.

MAC Spoofing

The Media Access Control (MAC) address is the fundamental base on which all other networking rests. Every network device in the world has a unique MAC address. In a sense, all of the above discussion, from DECnet to TCP/IP, from cluster alias to NAT, is a virtualization of the MAC address, in that they are all methods used to determine the route to and from the specific NIC across all other network devices. But I will focus this discussion on MAC spoofing, because otherwise this paper would end up being much longer than it is now…

In most cases, you want to leave the MAC address as it is burned into your NIC, so that Address Resolution Protocol (ARP) can successfully translate your TCP/IP address into the route needed to get the packets to your NIC. But in some cases, you want to be able to change this address to something else. Hackers would want to do this to pretend to be a valid network device on your network. Some home networking setups require this because the MAC address of the primary PC is registered with the ISP, and the cable or DSL modem needs to pretend to be that PC. But the legitimate reason for doing so in your server environment is that it enables re-configuration of a network very quickly and easily, by using a MAC address which is standing in for the actual MAC address on the NIC. The physical NIC can then be replaced (which results in a new MAC address, and which would ordinarily have forced a network change) without any network changes. Several products do this today:

Cisco VFrame Data Center will use MAC spoofing to bind multiple channels into a single channel, increasing throughput, and will use MAC spoofing to allow network devices to be replaced without any changes to the network administration. Note that the IBM Virtual Fabric Architecture is simply an OEM version of Cisco VFrame Server Fabric Virtualization Software, using the Infiniband to LAN/SAN switches in the IBM blade enclosure.

HP Virtual Connect has the c-Class blade enclosure present MAC addresses to the network that are not the addresses on the NIC’s in the blades themselves. This way, when blades get added or replaced, nothing has changed from the network point of view. Note that this equally applies to storage virtualization in managing SAN World Wide ID’s (WWID).
IBM Open Fabric has the IBM BladeCenter present MAC addresses to the network that are managed by the Open Fabric Manager. Again, when blades get added or replaced or the workload moves between blade enclosures, nothing changes from the network point of view. Note that this equally applies to storage virtualization in managing SAN WWID’s.

Every hypervisor which has internal networking capability uses MAC spoofing to present the virtual NIC as a physical NIC.

Figure 21 MAC Spoofing

Figure 21 shows how the network connections (MAC addresses 11 through 14) are spoofing the MAC addresses 01 through 04, by re-writing the MAC addresses on the physical NIC’s. (I use a short-hand form of MAC address, as the real ones are 12 digit hexadecimal numbers, and that wouldn’t fit on the diagram). Originally, a server NIC with MAC ID 03 was in place, and had its MAC address replaced by MAC ID 14. When server NIC MAC ID 03 needs to be replaced, a new server NIC with MAC ID 04 is swapped in, but the connector will spoof MAC address 04 with MAC address 14, and the rest of the network will not notice any change. If this scheme were not in place, a new MAC address would need to be registered, associated with a TCP/IP address, reassigned to any VLAN’s that were in place, etc. This can result in work orders that take days or weeks, just because a network card went bad or a server needed to be upgraded.

But the diagram also shows one of the downsides of that very migration. If server NIC with MAC ID 04 was relocated to the connector with MAC ID 13 (which is in a different subnet than MAC ID 14), things become complicated. For example, the subnet in blue will likely have a different set of DHCP addresses than the network area outside that subnet. So the server may be assigned a TCP/IP address that is outside of the security zone that it needs to be in, preventing it from accessing the servers it needs to access, and preventing legitimate users from accessing it. This is of moderate concern for blade enclosures and most implementations of VFrame, but is a large concern when doing failover between guest operating system instances across physical machines on different subnets. And it is a very large concern when doing DR failover to a different geography, with very different subnets. It is extremely easy to setup this type of failover between virtual machines, but they need to be thoroughly tested to catch this type of error.

Storage Virtualization

Storage virtualization has many of the same attributes as network virtualization: it is primarily concerned with communication to an outside agency instead of dividing up or aggregating the hardware and software inside the server, a lot of the virtualization happens outside of the server, and there are many more standards involved than in server virtualization, where the actual implementation tends to be very specific to each vendor. But it has some major differences as well: communication from the server tends to be to a very specific set of storage hardware (with specific lists of what host based adapter (HBA) is supported with what storage array), and there are fewer standards involved and they tend to be much looser than in the networking world (compare some of the Redundant Array of Independent (or Inexpensive, depending if you are a storage buyer or vendor) Disks (RAID) definitions with something like 802.1q).
Storage virtualization, like all of the other virtualization technologies that we have been discussing, has been around for literally decades. We may have started off in the 1960’s with disk managers as people who kept maps of washing machine sized hard disks on a wall, and who handed out storage based on the physical tracks and sectors of those disks (which the user was responsible for keeping straight, with very little checking to see if they accidentally wandered into the next user’s section of disk), but file systems such as the UNIX File System (UFS, which was actually invented in 1969 by Ken Thompson in order to let him write Space Travel on a PDP-7) came into prominence in the early 1970’s, with volume managers following about 10 years later. This was the earliest form of storage virtualization, but it was soon followed by Digital Equipment Corporation with OpenVMS Volume Shadowing and IBM mirroring. At the same time, advanced RAID algorithms were invented at U.C. Berkeley, which started an industry around RAID storage systems. These were highly in vogue through the 1990s both in operating systems and storage controllers.

During the ‘90s, the advances continued. The development of FibreChannel enabled fabric-based virtualization, and the continuing advances in microprocessors meant that intelligent switches which could incorporate virtualization became reality. Importantly, the way virtualization has been talked about publicly has changed radically over time. We were using virtualization techniques, but (as was equally true in the server world) the word wasn’t mentioned at all. And so, rather than coin a lot of new terms, we stuck with the names of technologies. Hence, mirroring, or RAID-5. Eventually the excitement about “virtualization” spread to the storage world. A lot of industry messaging became focused on virtualization as being the desired end goal. DEC and HP led this, and others soon hopped on board. Suddenly storage virtualization was the buzz of the industry.

Note that the following discussion focuses on hard disk storage. Tape and optical storage virtualization is a topic that I may add in a future major release of this paper.

The first thing we have to deal with is where the virtualization takes place. It can exist either in the server software (whether that is the operating system or add-on products), the storage controller (whether that is an HBA in a PCI slot in the server or embedded in a standalone storage array) or as part of a device offering management of the entities of the Storage Area Network (SAN), such as switches, controllers and servers offering file services to other devices on the network. And it can exist at multiple levels simultaneously, with an operating system implementing mirroring between storage systems which themselves have implemented redundancy and which are chaining multiple tiers of storage together transparently using multiply redundant paths.

We can evaluate these different levels on four criteria:

Management: How is the storage managed? What software tools are used to do the management? What is the overhead involved in performing this management? What complexities does storage virtualization add to the base management?

Pooling: How can we combine the various levels of storage? How can the various vendor products be combined? What is the overhead in doing this aggregation?

Data movement: How easily is data migrated between storage elements, including elements from different vendors?
What type of tiering is available?

Asset management: What is the span of control of those tools? How can performance and capacity be distributed among the storage elements, including elements from different vendors?

The first level is at the server itself, including servers which offer file services such as NAS devices.

Management: All of the major operating systems and their clustering products have some storage virtualization included (primarily RAID software), but there are many 3rd party tools from storage vendors available to augment these native capabilities. And frequently these are very easy to manage, primarily because the system administrators are already trained and comfortable with the operating system’s way of doing things. But all of the work involved in doing this virtualization (such as issuing the 2 or more I/O requests for each write to a RAID-1 volume, or performing the data copy involved in a clone) takes place in the host, detracting from the amount of processing power and memory available to do the real work for which you purchased the server.

Pooling: In a perfect world, the servers can bring together many different subsystems from many different vendors. In this world it is somewhat less perfect than that, but it is often possible to pool together multiple types of storage from multiple vendors. This gives the servers the largest possible span of control and ability to pool storage. But it is often complex to do this, and sometimes requires additional software to manage the disparate storage subsystems. Further, just as in management, this aggregation will take processing power and memory away from doing the work that only the server can do.

Data movement: Just as in pooling, servers can more easily migrate data between storage systems than any other level. But again, just as in pooling, this takes server resources.

Asset management: Servers can see every piece of storage which they have mounted, and as pointed out above, this can be a very large group. But because most operating systems do not have the ability to mount the same storage system on multiple servers simultaneously (especially between operating systems), the servers can only manage those storage devices that they have mounted, which is a very small fraction of the total storage environment. Further, servers often do not have the ability to see behind the Logical Unit Number (LUN) which is presented by the storage array, such that they cannot manage the individual items in the storage array (disk drives, tape drives, storage controllers) or the infrastructure itself (switches, chained storage, etc).

The next level is the storage array.

Management: Each storage array has a specific set of tools which are tailored to that particular vendor’s product family (where a family might be an HP EVA or an EMC CLARiiON). As such, each does a good job of understanding the details of that family. However, these tools do not extend outside the family into even the other storage arrays of that particular vendor, never mind the storage arrays from another vendor.

Pooling: All storage arrays have the ability to group multiple disks into RAID arrays. Most storage arrays even have the ability to group multiple RAID arrays into large groups, to achieve superb throughput and bandwidth.
Further, many storage arrays have the ability to “chain”, where a storage array can connect to a second storage array, and present its LUNs transparently to the servers as if it were a single large storage array. And this is supported even across storage arrays from multiple vendors. So pooling is a strength of storage array virtualization. But in the same way that servers can only manage those volumes that are mounted on the server itself, storage arrays can only pool those arrays that are physically connected to it. As such, the span of control is fairly small when compared to the entire storage environment.

Data movement: Many storage arrays have the ability to perform snapshots or business copies, that is, to instantly make a copy of an entire group of storage for other uses such as backup or software testing. Some storage arrays have the ability to replicate data between storage arrays, using software such as EMC Symmetrix Data Replication Facility (SRDF) or HP Continuous Access (CA). But even these capabilities are extremely limited in their ability to migrate data between disparate storage arrays, and I have not found any storage array virtualization software which will replicate data between storage arrays from different vendors.

Asset management: Storage array virtualization does a good job in managing the array itself, but this is extremely limited. Most of the drawbacks to storage-based virtualization are simply the result of “span”, ie storage-based is typically limited in scope to a single storage subsystem, whereas network- and server-based approaches can encompass many storage subsystems, even systems from different vendors.

The final level is the SAN itself.

Management: Managing the overall SAN is often easier than managing individual components, as there tend to be fewer of them (a few SAN switches rather than tens or hundreds of LUNs), and you get to take a higher level view of the environment rather than focusing on the details of specific storage arrays. Further, you get to look at overall traffic flow rather than focusing on specific disk devices on specific servers. On the other hand, a heterogeneous SAN means multiple management tools, which often conflict with each other, leading to a poor overall view of the environment.

Pooling: The SAN is the ultimate span of control, as it covers every server, every switch and every storage array. Pooling is easy at the SAN level among storage arrays and switches of the same type, but is often extremely difficult to accomplish between equipment of different types and especially between equipment from different vendors.

Data movement: Again, this is very easy to do among equipment of the same type, but very difficult to do between equipment from different vendors.

Asset management: The equipment can be tracked and viewed as a whole, but it is often difficult to track problems because of the lack of view inside the storage array.

RAID

Beyond the virtualization of the file system and the volume manager (which are such an implicit part of the way we work that we don’t even consider them virtualization, even though they clearly fit even the loosest definition), RAID is the fundamental storage virtualization technology. At its core, RAID is a way of optimizing the cost per usable GByte, the performance, or the availability of the storage.
The definitions are from the Storage Networking Industry Association (SNIA), but to me the simplest way to think about it is the old project manager’s mantra of “cheap, fast, good: pick any two”. I will give the SNIA definition first, and then add my comments. RAID 0: SNIA’s definition is “A disk array data mapping technique in which fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern. Disk striping is commonly called RAID Level 0 or RAID 0 because of its similarity to common RAID data mapping techniques. It includes no data protection, however, so strictly speaking, the appellation RAID is a misnomer.” This optimizes cost per usable GByte, because all of the space on the physical disks is usable: if you have a stripeset of three 73GByte disks, you get 73+73+73=219 usable GBytes of storage. It also optimizes performance, because you can execute parallel reads and writes on all of the members of the stripeset, so both total bandwidth and latency are improved. But it offers no data protection at all, in that if any disk fails, the entire stripeset needs to be rebuilt. So it is cheap and fast, but not good. RAID 1: SNIA’s definition is “A form of storage array in which two or more identical copies of data are maintained on separate media. Also known as RAID Level 1, disk shadowing, real-time copy, and t1 copy.” I would add “mirroring” to this definition. This optimizes read performance, because you can execute parallel reads on all of the members of the mirrorset, so both bandwidth and latency are improved. It does not help write performance, because all disks in the mirrorset must be written at the same time. It also optimizes availability, because on an ‘n’ member mirrorset, n – 1 disks can fail and you still have full access to all of the data. But it is the most expensive of the RAID choices, in that no matter how many disks you have, you have only the usable storage of the smallest disk in the mirrorset: if you have a mirrorset of three 73GByte disks, you get 73 usable GBytes of storage. So it is fast and good, but not cheap. RAID 2: SNIA’s definition is “A form of RAID in which a Hamming code computed on stripes of data on some of a RAID array's disks is stored on the remaining disks and serves as check data.” This is the first of the RAID choices which employs parity to be able to re-compute the data if it is lost, rather than keeping an extra copy of the data, as RAID 1 does. This parity data takes up somewhat less space than the original data, yielding approximately 60% of the usable space of the RAIDset (rather than the 50% of a two member mirrorset): if you have a RAID 2 set of three 73GByte disks, you get approximately (73*3*0.60)=132 usable GBytes of storage. The data is striped across the disks at the bit level, which is different than the other RAID definitions. The parity is striped across all of the available disks. RAID 2 does not help read or write performance, because every read must be distributed among all of the disks, and every write must write both the data on one disk and the parity information on another disk. But it does offer decent availability, because one disk can fail and you still have full access to all of the data (though some of it will be re-computed from the parity instead of being read directly). So it is not very cheap, not very fast and good. For these reasons, RAID 2 is very seldom used. 
RAID 3: SNIA’s definition is “A form of parity RAID in which all disks are assumed to be rotationally synchronized, and in which the data stripe size is no larger than the exported block size.” RAID 3 also uses parity, but stripes the data across the disks in bytes (rather than bits as in RAID 2) and isolates the parity to one of the disks in the RAIDset (rather than distributing it across all of the disks as in RAID 2). This is slightly better in space than RAID 2, and gets better as the number of disks in the RAIDset goes up, as you lose only one entire disk to the parity information: if you have a RAID 3 set of ‘n’ 73GByte disks, you get (n – 1) x 73 usable GBytes, for 73*(3-1)=146 usable GBytes (67%) with three disks. If you have a RAID 3 set of 4 73GByte disks, you again get (n – 1) x 73 usable GBytes, for 73*(4-1) = 219 usable GBytes (75%). Performance is the same as a single disk drive: since the data is spread out among all of the disks, both reads and writes must access all of the disks for operations of any significant size, and the disks are synchronized, so random reads of different areas of the disk are not allowed. Further, all writes must additionally update the parity disk. But it does offer decent availability, because any one disk (either data or parity) can fail and you still have full access to all of the data (though some of it will be re-computed from the parity instead of being read directly, if a data disk fails). So it is cheap and good.

RAID 4: SNIA’s definition is “A form of parity RAID in which the disks operate independently, the data stripe size is no smaller than the exported block size, and all parity check data is stored on one disk.” This continues the growth we have seen from RAID 2 to RAID 3, in that the size of the data on each of the data disks is growing, with whole blocks being distributed among the data disks. Just like in RAID 3, there is a dedicated parity disk. Just like RAID 3, you get better space utilization as you increase the number of disks in the RAID 4 set. Read performance is improved, as there is now enough data on each disk to be useful, such that multiple disks can be read simultaneously with different data areas on each disk. But write performance is still limited to a single disk’s performance, as every write must update the parity disk. RAID 4 offers decent availability, because any one disk (either data or parity) can fail and you still have full access to all of the data (though some of it will be re-computed from the parity instead of being read directly, if a data disk fails). So it is cheap and good.

RAID 5: SNIA’s definition is “A form of parity RAID in which the disks operate independently, the data stripe size is no smaller than the exported block size, and parity check data is distributed across the RAID array’s disks.” This is one of the more common RAID levels, as it takes the best features of RAID 2 and RAID 4 (distributed parity and large block sizes on each disk), to optimize across all three factors. Just like in RAID 4, you get better space utilization as you increase the number of disks in a RAID 5 set. Read performance is the sum of the disks’ performance, and reading different areas from different disks to achieve parallelization is normal. Write performance is also improved as the parity is distributed among multiple disks.
RAID 5 offers decent availability, because any one disk (either data or parity) can fail and you still have full access to all of the data (though some of it will be re-computed from the parity instead of being read directly, if a data disk fails). So it is fairly cheap, pretty fast and reasonably good.

RAID 6: SNIA’s definition is “Any form of RAID that can continue to execute read and write requests to all of the RAID array’s virtual disks in the presence of any two concurrent disk failures. Several methods, including dual check data computations (parity and Reed Solomon), orthogonal dual parity check data and diagonal parity, have been used to implement RAID Level 6.” Just like in RAID 5, you get better space utilization as you add disks, but you lose the equivalent of two disk drives’ worth of space due to the second set of parity data, so it is not as space efficient as RAID 5. Read and write performance is similar to RAID 5. One of the difficulties of RAID 5 is that it cannot survive the loss of two disks. RAID 6 is just like RAID 5 except that it adds a second, independent set of parity data to be able to survive a second disk failure. So RAID 6 is fast and good.

Storage requirements have grown, and there are moderate limits to the number of disks that can be combined into a single RAIDset. People wanted to have a smaller number of much larger volumes, but not give up the high availability of some of the above RAID levels. To solve this problem, people started using multi-level RAID: one of the above RAID levels is used to create multiple RAID volumes, but then those multiple volumes are striped together into a single larger volume which is easier to manage. The nomenclature is (RAID level) + (RAID level, almost always 0), such as 1+0, but it is often written without the “+”, such as 10.

RAID 10: This uses RAID 0 (striping) on multiple RAID 1 (mirroring) volumes. It is otherwise identical to RAID 1, in that it is fast and good, but not cheap.

RAID 50: This uses RAID 0 (striping) on multiple RAID 5 volumes. It is otherwise identical to RAID 5, in that it is fairly cheap, pretty fast and reasonably good.

RAID 60: This uses RAID 0 (striping) on multiple RAID 6 volumes. It is otherwise identical to RAID 6, in that it is fast and good but not cheap.

One thing is critical in this technique: use RAID 1 or 5 or 6 first, then RAID 0. Remember that RAID 0 offers no availability whatsoever, so if one of the disks in a RAID 0 set fails, the entire RAID 0 set is unavailable. This will frequently put your entire RAIDset at risk. For example, assume that you have four disks (A, B, C and D) and you want to mirror and stripe. But you choose to stripe first and then mirror (effectively RAID 0+1). So you stripe A and B, then stripe C and D, and then mirror the resulting two volumes. Now A goes bad. This takes the entire stripeset A+B offline, and the data on C+D is now running completely unprotected. B is fine, but completely unusable. Any failure of either C or D makes your data completely unavailable.

Contrast this with the case where you have the same four disks, and you mirror then stripe. So you mirror A and B, then mirror C and D, and stripe the resulting two volumes. Now A goes bad. B still has a complete copy of the data, so you are still fully operational. Now C goes bad. But D still has a complete copy of the data, so you are still fully operational.
This shows that you can lose up to half of the disks in your RAID 10 set simultaneously, as long as both members of the same mirrored pair do not fail, and you are still operational. RAID 50 and 60 work similarly.

Most vendors offer multiple host based adapters (HBA’s) which support some level of RAID for their directly connected storage (usually SCSI, now moving to Serial Attached SCSI (SAS)). I originally attempted to classify them, but there were just too many, and they were too specific to a server and an operating system, so I didn’t think it was worth it. Please check the vendor’s website or your local Sales Representative to find the details of these cards.

Figure 22 describes the RAID support in the major storage arrays available today. The number in parentheses after each RAID level is the maximum number of disks that can be in that RAIDset.

EqualLogic PS5000x: 5 (8), 10 (8)
Dell/EMC CLARiiON CX3: 0 (16), 1 (2), 3 (9), 5 (16), 6 (16), 10 (16)
EMC Symmetrix DMX3, DMX4: 0 (8), 1 (2), 5 (8), 6 (16), 10 (4)
HP MSA: 1 (2), 5 (14), 10 (56), ADG/RAID 6 (56)
HP EVA (VRAID): 0 (168), 1 (168), 5 (168), 10 (168)
HP XP: 1 (8), 5 (32), 6 (8)
Hitachi Simple Modular Storage: 6 (12)
Hitachi Workgroup Modular Storage: 0 (3), 1 (2), 5 (16), 6 (8), 10 (6)
Hitachi Adaptive Modular Storage: 0 (3), 1 (2), 5 (16), 6 (8), 10 (6)
Hitachi Universal Storage Platform: 1 (8), 5 (8), 6 (8)
IBM DS3xxx, DS4xxx: 0 (12), 1 (2), 3 (12), 5 (12), 10 (12)
IBM DS6xxx, DS8xxx: 5 (8), 10 (8)
IBM Enterprise Storage Server F10, F20: 5 (256), 10 (256)
Sun StorageTek 2xxx, 6xxx: 0 (30), 1 (2), 3 (30), 5 (30), 10 (30)
Sun StorageTek 99xxV: 1 (2), 5 (30), 6 (30), 10 (30)
Sun StorEdge: 0 (8), 1 (2), 3 (8), 5 (8), 10 (8)

Figure 22 Storage Array RAID Support

IBM enhances the normal RAID levels by adding an “E” to the ordinary RAID levels: 1E, 1E0, 5E and 5EE. Each “E” represents an extra disk of the set offering additional redundancy, so 1E means a mirrorset with a spare disk, while 5EE means a RAID 5 set with two spare disks. These will automatically be activated in the event of a disk failure, with the data being reconstructed (whether copied in the case of mirroring or re-computed in the case of RAID 5) onto the newly activated disk.

The IBM DS4xxx and the Sun StorageTek 6xxx are the same unit, both OEM’ed from Engenio. The IBM DS8xxx is driven by Power5+ systems, which abstract the underlying disks into logical volumes which are presented as LUNs. Each has two independent Power5+ systems, each with a pair of FC-AL and FC switches, cross-connected to each Power5+ system. FAStT is the old name for the low end DS4xxx series. Sun SANtricity on the StorageTek line allows Dynamic RAID Migration, ie, conversion from one RAID level to another with no re-formatting, though it does require lots of free space.

One of the other topics that you need to understand is chunk size. Note that chunk size is different than stripe width, which is the number of drives in the stripeset. Chunk size is the smallest block of data which can be read or written at a time on each of the disks in a RAID 0 (when it is known as stripe size), 2, 3, 4, 5 and 6 RAIDset. For RAID 2 (bit striping) the chunk size is 1 bit, and for RAID 3 (byte striping), the chunk size is 1 byte. RAID 0, 4, 5, and 6 let you specify various stripe sizes, in order to optimize performance for your particular workload. And that is the key, because workloads vary. A large chunk size means that there is a large amount of any given file on each disk, while a small chunk size means that there is a small amount of any given file on each disk, such that it may take many chunks to store the entire file.
And a large chunk size means that you will always read a large amount of data from each disk (whether or not you will use all of that data), while a smaller chunk size means more parallel operations as you fetch data from all of the disks in your RAIDset. As you can see, the most important thing is the size of each I/O. Large chunk sizes mean that the I/O subsystem can break a huge I/O into a fairly small number of I/O's (in a perfect world, this would be the number of drives in the array) and execute them in parallel. Small chunk sizes mean that the I/O subsystem can execute multiple smaller independent I/O's in parallel, with each disk potentially being the source of multiple of these I/O's.

HP-UX chunk sizes are up to 262Kbyte, while the Windows NTFS default is 4Kbytes. Oracle recommends a 1MByte chunk size for ASM for the database (to match the MAXIO database buffer size), but offers 1, 2, 4, 8, 16, 32 and 64MByte chunk sizes to help out sequential reads/writes for the non-database files. Note that Sun calls a partition a "slice", and stores special information in slice 0.

The only thing you can do for chunk sizes is measure your actual I/O size over time, and optimize for both bandwidth and latency. Too large a chunk size means you may not be using all of your disk drives in parallel, and may be throwing away data that you read because your data doesn't fill up a chunk. But too small a chunk size means that you are reading multiple chunks from the same disks, and therefore are serializing your I/O's. The best thing is to accept the recommendations of your storage and database vendor for the initial implementation, and then measure your disk queue depth, to see if you are waiting on many I/O's to complete on each disk. If so, then increase the chunk size and re-measure. Modifying your chunk size is often a major operation, involving re-initializing your volumes and restoring the data, so it is not to be done casually. But chunk size can have a measurable effect on performance.

ASM/EVA Virtualization

Almost every storage array in the industry today works in a similar way: bind individual disk drives into a RAIDset, potentially bind those RAIDsets together into a multi-level RAIDset (RAID 10, RAID 50, etc), and then present the result as a Logical Unit Number (LUN), ie, as a device to be mounted by a server. And this works very well, with quite good throughput, because the LUNs are assisted with a large amount of cache and very sophisticated controllers. But this is not the only way to do things.

HP's Enterprise Virtual Array (EVA) storage array binds large groups of disks into a single volume group, and then partitions the single large storage pool into individual volumes, which can have their own RAID level. So, for example, 160 73GByte disks (with a total usable space of 11.7TBytes, leaving 8 disks for sparing) might be partitioned into a 1TByte RAID 1 volume (consuming 2TBytes of actual space) and a 1TByte RAID 5 volume (consuming 1.2TBytes of actual space), leaving 8.5TBytes of free space. But the difference between the EVA and every other storage array on the market is that the actual space is striped across every one of the 160 active disks, with automatic load balancing and re-distribution of the actual data blocks across all of those physical disks.
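The following toy Python model (not how the EVA or ASM actually lay data out internally, just an illustration of the idea) spreads blocks evenly over every disk in a group and re-levels them when a disk is added:

class DiskGroup:
    """Toy model of an EVA/ASM-style disk group: blocks are spread over every
    member disk, and adding a disk triggers a re-levelling pass."""

    def __init__(self, disks):
        self.disks = {name: [] for name in disks}

    def write(self, block_ids):
        for block in block_ids:
            # Always place the next block on the least-full disk, which keeps
            # the layout roughly even across all members.
            target = min(self.disks, key=lambda name: len(self.disks[name]))
            self.disks[target].append(block)

    def add_disk(self, name):
        self.disks[name] = []
        self._relevel()

    def _relevel(self):
        # Gather every block and lay the whole set out again across all members,
        # mimicking the automatic re-distribution described above.
        blocks = [b for placed in self.disks.values() for b in placed]
        for name in self.disks:
            self.disks[name] = []
        self.write(blocks)

group = DiskGroup(["disk1", "disk2", "disk3"])
group.write(range(12))                       # 12 blocks land 4 per disk
group.add_disk("disk4")
print({name: len(blocks) for name, blocks in group.disks.items()})   # now 3 per disk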
If a disk fails, all of the information on that disk is reconstituted from the redundant information for the RAIDset (not in the case of RAID 0, of course), exactly as if they were physical disks. If new disks are added to the volume group, the data is normalized across all of the disks in the resulting volume group, and the same is true if disks are removed from the volume group.

Oracle's Automatic Storage Manager (ASM) does the same thing to the volumes that it manages, though in this case these are LUN's presented by another storage array. ASM provides logical mirroring at the file level, not physical mirroring at the block level. Disks can be added to the "disk group" and are automatically included in the environment, with files and I/O quickly load balanced to the new space. Some of the storage arrays, such as the IBM DS3 and DS4 series, as well as some NAS devices, such as the IBM N series and the Sun NAS 5300 series, do this using standard file systems and volume management, but they do not have the same level of dynamic data migration and load balancing, nor do they do it across the number of drives that the HP EVA does.

Disk Partitioning, Volume Management and File Systems

Once all of the LUN's are presented to an operating system, whether they are individual disk volumes or RAIDsets, they can be managed by the operating system and 3rd party tools. OpenVMS deals strictly with the individual LUN's as if they were disk volumes, and does not group them together, except with HP RAID Software for OpenVMS for partitioning. Windows also deals only with individual LUN's, but then partitions them using the Disk Administrator plug-in, and offers a "Volume Set" which is strictly a RAID 0 concatenation of the LUN's into a single larger volume.

For UNIX systems (AIX, HP-UX, Linux, Solaris and Tru64 UNIX), as well as other software such as Symantec Veritas and Oracle, this is done with a volume manager. Volume managers take the individual LUN's and bind them together into larger entities called volume groups (VG's). The VG is then treated as a standard LUN, which can be mounted and partitioned in any way needed. One advantage of VG's over physical disks is that they can be expanded and contracted as needed, which physical disks cannot do. However, they do not do the same kind of automated block-based and file-based load balancing that is done in the EVA and ASM, respectively.

As the USA and England are often described as two countries separated by a common language, so it is with the various tools which manage volume groups: they all use some common terms and some unique terms to describe the same concepts. Figure 23 describes the terminology of the various volume managers: for each tool (AIX LVM, HP-UX LVM, Linux DRBD, Solaris SVM, Tru64 LSM and Veritas VxVM) it lists that tool's name for the LUN (physical volume, drive, disk or disk media), for a partition of a disk (physical partition, physical extent, slice or subdisk), for a partition of a volume (logical partition, logical extent or plex), for the volume itself (logical volume or volume), for the collection of disks (volume group or disk group) and for the mirror (mirror, mirror copy or plex).

Figure 23 Volume Group Management Terminology

All of the volume managers do some level of RAID. Figure 24 describes the operating system support for the various RAID levels, and the number of members which are supported for mirrorsets.
Figure 24 Operating System RAID Support (RAID levels 0, 1 and 5; the number in parentheses is the number of members supported in a mirrorset)
• AIX: Logical Volume Manager (3)
• HP-UX: Logical Volume Manager, MirrorDisk/UX (3)
• Linux: Distributed Replicated Block Device (2)
• NonStop Kernel: RAID-1, Process Pairs (2)
• OpenVMS: RAID Software for OpenVMS, Host Based Volume Shadowing (3)
• Oracle 10g: Automatic Storage Management (3)
• Solaris: Solaris Volume Manager (3)
• Tru64 UNIX: Logical Storage Manager (32)
• Veritas: Veritas Volume Manager (VxVM) (3)
• Windows 2003: Disk Administrator Snap-in (2)

The AIX LVM does not work with physical disks, but must convert them to logical volumes before RAID can be applied. The Oracle ASM automatically distributes 1MByte chunks for the database and 128KByte chunks for the control files and redo logs across all available volumes. Oracle ASM 10g only reads from the primary mirror, not the mirror copy, unless a failure occurs in the primary. 11g adds the ability to specify which copy is read from, but it is manual and not load-balanced: it is for extended RAC environments where one volume is closer to the server than the other volume. Solaris Volume Manager offers both RAID-0 striping and concatenation, but calls them both RAID 0. Most vendors are moving to follow the industry standards Fabric Device Management Interface (FDMI), Storage Management Initiative Specification (SMI-S) and the Storage Networking Industry Association (SNIA) HBA API.

Multi Pathing

RAID only protects the disk device itself, but it is a long way from the server to the disk, and we need to protect the entire path. We do this with redundant components and multi-pathing. Note that the following discussion focuses mostly on FibreChannel and iSCSI, as multi-pathing is more difficult to implement (and often less critical) for direct attached storage with SCSI and SAS.

In a perfect world, we would have redundant HBA's to redundant network switches to redundant storage controllers. In this way there would be no single point of failure (SPoF) in the storage environment (though the server would be an SPoF, that is outside the scope of this section). But when you have the above setup, you end up with many potential paths to a specific LUN, and the LUN will often show up with different names depending on the path chosen. You need to carefully trace out the path from the server to the LUN to ensure that you are not setting up alternate paths which share an SPoF. Most of the tools listed below are aware of this problem and will assist you in disambiguating the paths and the devices along each path, and you should use these tools carefully so that you are not surprised by a failure during a critical production run.

Another factor to consider is whether all of the paths are in use at the same time (active/active) or if there is a primary active path and all of the other paths are only there for transparent failover (active/passive). If they are active/active, what kind of load balancing is used among the paths? Some tools offer dynamic measurement of the latency and load of each path to optimize the performance of all of the paths, while others offer only simple "round-robin" scheduling regardless of the load on each path. Then you need to evaluate whether the tool automatically reconfigures the paths in the event of a failure.
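Here is a toy Python model of those behaviours (hypothetical path names, loosely patterned on the path-group behaviour described below for the Linux device mapper): active/active round robin inside a group, with active/passive failover to the next group when the first one is exhausted:

class MultipathDevice:
    """Toy model of grouped multi-pathing: active/active round robin inside the
    current group, active/passive failover to the next group when it is exhausted."""

    def __init__(self, path_groups):
        self.path_groups = path_groups       # e.g. [["hba0-swA", "hba1-swB"], ["hba0-swB"]]
        self.failed = set()

    def next_path(self):
        for group in self.path_groups:       # the first group with a healthy path wins
            healthy = [p for p in group if p not in self.failed]
            if healthy:
                path = healthy[0]
                group.remove(path)           # simple round robin: rotate the chosen
                group.append(path)           # path to the back of its group
                return path
        raise IOError("all paths to the LUN have failed")

dev = MultipathDevice([["hba0-swA", "hba1-swB"], ["hba0-swB"]])
print(dev.next_path(), dev.next_path())      # alternates within the primary group
dev.failed.update({"hba0-swA", "hba1-swB"})  # the primary group loses both paths
print(dev.next_path())                       # fails over to the secondary group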
As I said above, multiple paths to a device often share a common component. You need to consider whether the tool is aware that a single component failure (such as an HBA or a switch) will invalidate a whole series of paths, so that it only re-attempts the I/O operation on one of the remaining healthy paths. One subtle difference between the tools is whether they treat each path as an individual object and then load balance or fail over between all of the paths, or whether they bundle the paths together into groups, with active/active load balancing inside a group but active/passive failover between groups.

Finally, a note on micro-partitioning and multi-pathing. Most multi-pathing tools are not supported in a micro-partitioned environment (ie, in the guest operating system instance). This is because the I/O paths are virtualized: many virtual HBA's across multiple guest instances may be sharing a single physical HBA, in order to optimize the server consolidation cost savings in cabling and switch ports. Since the guest O/S instance is unaware of these consolidations, implementing multi-pathing in the guest instance may be (probably is) completely useless. You need to implement the multi-pathing in the hypervisor in order to protect all of the guest instances equally, even those instances running operating systems which do not support sophisticated multi-pathing tools.

Figures 25 and 26 list the tools available on each operating system for multi-pathing. Note that many of these tools come from the operating system (LVM on HP-UX and device mapper on Linux, for example) and support many different storage arrays from multiple vendors. But more of the tools come from the storage vendors, in which case it is common for the tool to run on multiple operating systems but only on that vendor's specific set of storage arrays.

Figure 25 Operating System Multi-Pathing Tools
• AIX: Subsystem Device Driver; active/active, load based scheduling, auto failback
• HP-UX: pvlinks, LVM, SLVM; active/active, load based scheduling, auto failback
• Linux: Device Mapper; active/active in a group, active/passive between groups, round robin with priority in a group, auto failback
• OpenVMS: OpenVMS I/O subsystem; active/passive, manual path balancing, auto failback
• Solaris: MPxIO; active/active, round robin, auto failback
• Windows: Multi-Path I/O; technology API, implemented by vendors such as HP MPIO for EVA and SecurePath

IBM SDD supports 16 paths, and is included in the DS8xxx series at no additional charge. It has quite sophisticated load based scheduling across LPARs and DLPARs, with active measurement of the state of each of the paths and fully automated failover and failback. Note that only active/passive multi-pathing is offered on SPLPARs. HP-UX supports pvlinks, as well as multi-pathing support in the Logical Volume Manager and the Shared Logical Volume Manager. All offer active/active with sophisticated load based scheduling. Linux device mapper bundles multiple paths into groups, and all paths in the first group are used round-robin. If all paths in a group fail, the next group is used. RHEL and SUSE Linux on ProLiant offer active/passive with static load balancing (preferred path). Multi-Path I/O (MPxIO) is integrated in Solaris 8 and beyond. It is active/active, but uses only round robin scheduling among the paths.
Windows does not offer a specific multi-path I/O module, but offers a technology API, which has been implemented by many storage vendors, including HP and QLogic, as a Device Specific Module (DSM) for each of their specific storage arrays.

Figure 26 3rd Party Multi-Pathing Tools
• EMC PowerPath (AIX, HP-UX, Linux, Solaris, Tru64, Windows): active/active, load based scheduling, auto failback
• Emulex MultiPulse (ProLiant Linux): 8 paths, active/passive, auto failback
• Hitachi Dynamic Link Mgr (AIX, HP-UX, Linux, Solaris, Windows): active/active, round robin scheduling, auto failback
• HP SecurePath, Auto-Path (AIX, HP-UX, Linux, Solaris, Windows): 32 paths, active/active, load based scheduling, auto failback
• IBM Subsystem Device Driver (AIX, HP-UX, Linux, Solaris, Windows): 16 paths, active/active, load based scheduling, auto failback
• QLogic Multi-Path (Linux, Windows): 32 paths, active/passive, auto failback
• Sun StorEdge Traffic Mgr (AIX, HP-UX, Windows): active/passive, auto failback
• Veritas Dynamic Multi-Path (AIX, HP-UX, Linux, Solaris, Windows): active/active, load based scheduling, auto failback

HP XP AutoPath supports dynamic load balancing, including shortest queue length and shortest service time per path. Symantec Veritas offers Dynamic Multi-Path (DMP). DMP does sophisticated path analysis to group paths with common components, to avoid failing over to a path with a failed component. However, note that V4.1 and later versions of Veritas Volume Manager (VxVM) and Dynamic Multipathing are supported on HP-UX 11i v3, but do not provide multi-pathing and load balancing; DMP acts as a pass-through driver, allowing multi-pathing and load balancing to be controlled by the HP-UX I/O subsystem instead. QLogic Multi-Path is a Device Specific Module (DSM) of MPIO for Windows. Hitachi Dynamic Link Manager supports round-robin load balancing and automated path failover (active/active). It is managed by Global Link Availability Manager.

Storage Chaining

One of the difficulties of implementing storage over time is that the vendors are continually leapfrogging each other in terms of performance and functionality, and you may need to purchase a storage array from another vendor in order to satisfy a real business requirement. And then there is the case where the storage vendor really wants your business, and is prepared to offer you a special deal to introduce their storage into your environment. The net result is that you often end up with a heterogeneous storage environment, with storage arrays from a variety of vendors. And this can cause problems, especially when you try to manage all of these disparate systems. There are very few tools which even offer to manage the entire environment, and none of them have the component-level understanding and management that the tool from a particular vendor offers on that vendor's own storage, and that you frequently require.

One solution for this is storage chaining, that is, to hide multiple storage arrays behind another device, and have the front-end device pretend that all of the storage is physically in that device. There are two ways that this can be accomplished: in-band and out-of-band. The in-band method is to connect one storage array directly to a second storage array, and have the second storage array simply appear as LUNs in the management tools of the first storage array.
So, in effect, the second storage array disappears and the first storage array simply appears to have a great deal more space than it had before. In this case, the host sees a LUN through a FibreChannel port. The LUN is mapped to a Logical Device called an LDEV, and an external storage array is connected through other FibreChannel ports, which are simply mapped to the original LUN and presented to the server. In this way, whether the LUN is physically present in the storage array, or mapped through FibreChannel ports, the server cannot tell the difference: a LUN is a LUN, which can simply be used. The Hitachi Universal Storage Platform (USP) and HP XP arrays offer this type of chaining. It can also be done with a server in cooperation with the storage switches. The IBM SAN Volume Controller and HP (nee Compaq nee Digital)'s CASA (Continuous Access Storage Array) are standard servers which offer in-band storage chaining.

The other choice is out-of-band, which is done with clustered servers in cooperation with the storage switches. The difference between this and the above tools is that the storage appliance is only used for management and path control, while all of the I/O goes through the storage switches normally. EMC Invista, which implements Data Path Controllers (DPC) controlled by a Cluster Path Controller (CPC), and the NetApp V series run on standard servers and offer out-of-band storage chaining.

Network Attached Storage (NAS)

As much as storage vendors have optimized their offerings, and reduced their prices in the face of the fierce competition with each other, HBA's and network switches remain very expensive. They are far too expensive to supply to each desktop which needs to access the storage in the organization (never mind the problem of wireless access to the SAN), and they may even be too expensive to supply to each small server which needs to access a great deal of storage. But network ports are cheap, and each desktop, notebook PC and small server already has one or more high speed network ports built in, since they need network access no matter what. So why can't we supply storage over the standard network? Well, we can, and we call it NAS.

NAS can offer either the block oriented storage of SCSI or FibreChannel through iSCSI, or file system oriented storage such as NFS and CIFS (aka Samba). iSCSI is simply the Small Computer Systems Interface (SCSI) protocol encapsulated into a networking protocol such as TCP/IP. It enables disk sharing over routable networks, without the distance limitations and storage isolation of standard SCSI cabling. It can be used with standard networking cards (NIC's) and a software stack that emulates a SCSI driver, or it can be similar to a storage HBA, with a NIC assisted by a TCP Offload Engine (TOE) to take care of the crushing processor overhead of TCP/IP networking protocols.

The Network File System (NFS) was developed by Sun in 1984 (V2 in 1989, V3 in 1995 and V4 in 2003). It originally used the User Datagram Protocol (UDP) for performance, but recently switched to TCP for routability and reliability. V1-3 were stateless (where each operation was discrete, with no knowledge of previous operations, which is excellent for failover but limits the efficiency of the entire session), with V4 being stateful (where the connection has more knowledge of the previous operations and can optimize caching and locking).
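As a toy illustration of the stateless versus stateful distinction (a Python sketch only, not the actual NFS wire protocol), compare a server where every request carries its full context with one that keeps per-client open-file state:

class StatelessFileServer:
    """Every request is self-contained (path, offset, length), in the spirit of
    NFS V2/V3: nothing is remembered between calls, so any surviving server
    instance can answer the next request after a failover."""

    def read(self, path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

class StatefulFileServer:
    """The server hands out an open-state handle, in the spirit of NFS V4, and can
    attach per-client state (locks, delegations, cached attributes) to it."""

    def __init__(self):
        self.open_files = {}                  # state that must be recovered or
                                              # re-established after a failover
    def open(self, path):
        handle = len(self.open_files)
        self.open_files[handle] = {"file": open(path, "rb"), "locks": []}
        return handle

    def read(self, handle, offset, length):
        f = self.open_files[handle]["file"]
        f.seek(offset)
        return f.read(length)

The stateless server can be replaced mid-session with no loss, which is why failover is straightforward for V2/V3, while the stateful server must recover or rebuild its open-file table after a failure, in exchange for better caching and locking.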
The Server Message Block (SMB) remote file system protocol, also known as the Common Internet File System (CIFS), has been the basis of Windows file serving since file serving functionality was introduced into Windows. Over the last several years, SMB's design limitations have restricted Windows file serving performance and the ability to take advantage of new local file system features. For example, the maximum buffer size that can be transmitted in a single message is about 60KB, and SMB 1.0 was not aware of the NTFS client-side symbolic links that were added in Windows Vista and Windows Server 2008. Further, it used the NetBIOS protocols.

Windows Vista (in 2006) and Windows Server 2008 (in 2008) introduced SMB 2.0, which is a new remote file serving protocol that Windows uses when both the client and server support it. Besides correctly processing client-side symbolic links and other NTFS enhancements, as well as using TCP, SMB 2.0 uses batching to minimize the number of messages exchanged between a client and a server. Batching can improve throughput on high-latency networks like wide-area networks (WANs) because it allows more data to be in flight at the same time. Whereas SMB 1.0 issued I/Os for a single file sequentially, SMB 2.0 implements I/O pipelining, allowing it to issue multiple concurrent I/Os for the same file. It measures the amount of server memory used by a client for outstanding I/Os to determine how deeply to pipeline. Because of the changes in the Windows I/O memory manager and I/O system, TCP/IP receive window auto-tuning, and improvements to the file copy engine, SMB 2.0 enables significant throughput improvements and reduction in file copy times for large transfers. SMB (both 1.0 and 2.0) is now available through open source implementations, under the Samba project, on all major operating systems.

The storage vendors have recognized the prevalence of both NFS and CIFS (SMB or Samba), and have implemented those file system protocols over their standard storage connections, such as FibreChannel. In this way, a server can access storage using a single file system protocol, whether the storage is accessed via the network or via FibreChannel.

Figure 27 describes the RAID support in the major NAS devices available today. As before, the number in parentheses after each RAID level is the number of disks that can be in that RAIDset. The protocols offered across these devices include CIFS, iSCSI and NFS, delivered over network, FibreChannel and (for the NetApp FAS) direct attached connections.

Figure 27 NAS RAID Support
• EMC Celerra NS/NSX: 1 (2), 3 (9), 5 (16), 6 (16)
• EMC CLARiiON AX4, CX3: 0 (16), 1 (2), 3 (9), 5 (16), 6 (16), 10 (16)
• HP AiO: 0 (12), 1 (2), 5 (12), 6 (12)
• HP EFS: 0 (168), 1 (168), 5 (168), 10 (168)
• HP MSA 2000i: 0 (48), 1 (2), 3 (48), 5 (48), 6 (48), 10 (48), 50 (48)
• Hitachi Essential: 6 (8)
• Hitachi High Performance: 5 (8), 50 (256)
• IBM N series: RAID-DP (6) (28)
• NetApp FASxxxx: 4 (14), RAID-DP (6) (28)

Tiering

Not all data is created equal. Some of the data is absolutely mission-critical, with requirements for extremely high performance, zero downtime, and extremely frequent backups such that no data loss is acceptable. Some other data is still important to the organization, but performance is not as critical, some downtime is acceptable, and people could reasonably reconstruct the data if it was lost.
Other data is archived, where it will be used only very occasionally, performance is not an issue, and the data never changes, such that the backup you took 2 years ago is still valid. But you need to ensure that all of this data is available when it is needed, whether for financial reporting, regulatory requirements, protection against litigation, etc.

Not all storage arrays are created equal. Some storage arrays offer incredibly fast performance with very high speed FibreChannel disks, each of which is fairly small in order to maximize the number of spindles for performance, multiple redundancy inside the array with no SPoF, and RAID 10 and multiple sparing. Other storage arrays offer lower speed FATA or SATA disks, each of which is huge in order to minimize the footprint, simplified implementation with little to no redundancy, and RAID 5 or less with very few spares. The first can be quite expensive, while the second can be fairly inexpensive.

Putting these two concepts together, we can implement storage tiering. In the same way you evaluate the business value of your data in order to choose a RAID level and a backup schedule (including cloning and long distance replication), you can choose a storage array in order to match your business requirement to the amount you are willing to pay to store that particular set of data. You should avoid the tendency to express tiers as specific technologies. For example, you might say that all Tier 1 storage is FibreChannel based and that all Tier 2 storage is SAS or SATA based. Instead, present tiers as "service levels" that express real business needs, and let those business needs drive you to a solution. Five tiers of storage are standard:

• Tier 1: The data is mission critical, requiring high performance, with zero downtime and zero data loss acceptable
• Tier 2: The data is important, with good performance requirements, and a 2-4 hour recovery time and 2-4 hours of data loss being acceptable
• Tier 3: Archived data, with minimal performance requirements, and a 4-8 hour recovery time. Since the data is archived, data loss requirements are minimal, though you do want to verify the backups occasionally.
• Tier 4: The data is now near-line, on automated tape or optical libraries, with a 1-2 hour access time. The data is automatically recovered from the backup media.
• Tier 5: The data is off-line in a tape library, with up to a 72 hour access time. The data is manually recovered from the backup media.

Data will often move from one tier to another. For example, this quarter's order data may exist in Tier 1 storage, as it represents active contracts and orders which have yet to be fulfilled. But once the quarter is over and all the contracts are completed, this data may move to Tier 2. And once the financial year is over, the entire year's data may move to Tier 3. This can be accomplished with Hierarchical Storage Management (HSM), Enterprise Content Management (ECM) and Information Lifecycle Management (ILM) tools.

HSM is policy based management of backup and data migration, and is technology based: it establishes policies for data, and migrates the files transparently between the different tiers. EMC DiskXtender works with NAS, UNIX/Linux and Windows based storage. HP XP Auto-LUN will migrate data between the multiple tiers in an XP array, including the chained storage that it is managing.
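At its core, an HSM policy is just a mapping from the age or access pattern of the data onto a tier. Here is a minimal Python sketch with purely hypothetical thresholds, file names and tier names:

from datetime import datetime, timedelta

# Hypothetical policy: the tier is chosen purely by how recently the data was accessed.
POLICY = [
    (timedelta(days=90),   "tier1"),          # active data stays on the fast array
    (timedelta(days=365),  "tier2"),          # older data moves to cheaper disk
    (timedelta(days=1825), "tier3"),          # archived data
]

def assign_tier(last_access, now=None):
    """Return the tier a file belongs on under the policy above."""
    age = (now or datetime.now()) - last_access
    for limit, tier in POLICY:
        if age <= limit:
            return tier
    return "tier4_nearline"                   # anything older goes to the tape or optical library

files = {
    "q3_orders.db":  datetime.now() - timedelta(days=10),
    "fy2006_gl.dat": datetime.now() - timedelta(days=700),
    "scratch.tmp":   datetime.now() - timedelta(days=4000),
}
for name, accessed in files.items():
    print(f"{name:<15} -> {assign_tier(accessed)}")

A real HSM product layers transparent migration, stubbing and recall on top of a policy like this one.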
IBM Grid Access Manager, IBM Tivoli Storage Manager and Veritas Volume Manager Dynamic Storage Tiering also implement HSM at the data level. It is also possible to do this with files: Arkivio Auto-Stor, F5 Acopia Intelligent File Virtualization and NuView File Lifecycle Manager implement file based HSM.

Database and application vendors recognize the need to do this inside their applications. EMC Documentum, CommVault for Exchange, Mimosa NearPoint e-mail File System Archiving, Oracle, Microsoft SQL Server and DB2 all implement data based HSM by storing all of the data in their own set of databases, categorizing the content, and migrating the documents appropriately according to the business policies of the organization.

Note that when HSM moves data from on-line (whether Tier 1, 2 or 3) to near-line or off-line (Tier 4 or 5), it will leave behind a "stub" on the on-line storage array. This stub contains all of the file information (owner, size, name, etc), but is extremely small, in that the only "data" it holds is a pointer to the near-line or off-line storage where the actual data exists. When the file is accessed, the data is transparently (except for some latency) fetched from the near-line or off-line storage, exactly as if the data had existed on the on-line storage array. The advantage of this is that user access methods don't change no matter where the data actually exists, and the backup of the stub is many times smaller than the actual data would be.

ILM goes further than that, and implements policy based management of data from creation to destruction. It tracks the data itself, and moves beyond archiving into actually removing the data from the environment.

Planning is the key to effective storage tiering. You need to establish a team and standards for the data. I know of customers who discovered that they were backing up hundreds of GBytes of .TMP files which were decades old, stored on their Tier 1 storage, and of no value whatsoever. So ensuring that the right data is on the right tier is critical. Implement a Proof of Concept, then implement across the enterprise, and then continually review new data and assign it to the proper tier.

Client Virtualization

Up to this point we have been talking about virtualizing the back end systems: servers, storage and networking. But there is another area where virtualization can yield the same kinds of benefits: the client devices.

Personal computers (PC's) today are incredible devices. They are far more capable (more processing power, more memory, more disk space, and far better graphics) than mainframe systems and high-end workstations of even a decade ago. Even notebook and tablet PC's have power that exceeds the capacity of entire data centers that existed when we started our careers. And Moore's Law has extended beyond the processor, into memory, disk and graphics, dropping the price of these devices to where it is reasonable to place this power in the hands of many or all of the people in your organization.

But in some cases this power has exceeded the requirements of the users. Customer Service Representatives on the phone or at the counter of your business who run bounded applications with limited graphics, office workers who only run simple word processing and spreadsheet applications, and manufacturing people who run data entry and reporting applications in an environmentally hostile setting simply don't need all of this power.
They would be quite satisfied with a lower power (though still quite powerful) device. And even though prices have come down, the percentage of people in your organization that require at least some level of computing power has exploded. Twenty years ago, only engineers required workstations, and most offices had a few centralized word processing devices. Today, virtually every person in your office uses word processing, spreadsheets, web browsing and specialized office applications. Many people on factory floor and warehouse production lines require computer access in order to receive information about the orders to process and to report on the fulfillment of those orders. And it is the rare person indeed who does not require e-mail for their daily job. So the total cost (lower cost of each device times a much larger number of devices) is much higher than it used to be. Further, the maintenance of the PC, especially in the vast numbers that are required today, can overwhelm an organization. And there are security concerns in having hundreds of client devices containing highly sensitive data which are highly mobile (in the case of notebook and tablet PC's).

Note that in almost every case the server device doing the virtualization is based on an AMD Opteron or Intel Xeon processor, running either Linux or Windows. The reason for this is simple: these are the processors and operating systems that the clients are running today on their current client devices (ie, desktops). So unless I specify otherwise, the processors and operating systems are x86/x64 and Linux/Windows.

I titled this section "Client Virtualization" because I wanted to incorporate the virtualization of the PC in all of its forms: desktop, notebook, tablet, thin client and other terminal devices. Client virtualization is extremely simple in concept: a common (usually very small) set of standard operating system "gold" images is deployed to a set of standard server devices in the datacenter, which run the standard desktop applications. The graphical user interface is delivered via a simplified delivery device.

Figure 28 Client Virtualization Architecture

Figure 28 shows the common components of a client virtualization solution:
• The storage needed to hold the data which is produced and consumed by the applications run by the virtual clients.
• The devices which are running the client software. The diagram specifies blade systems, but any kind of server would work equally effectively: large servers, small rack-mount servers, or, more recently, blade servers, which are becoming an increasingly popular choice due to their economic and environmental advantages.
• The virtual clients (VC's in the diagram) themselves. These can be either a single client operating system per server device, multiple client operating systems (usually virtualized, as discussed in the micro-partitioning section above) per server device, or a single operating system handling multiple clients in a timesharing environment.
• A connection manager, which handles the graphics, keyboard and pointing device connections between the virtual clients and the client devices.
• A remote protocol, which is required to interface over the network between the servers in the data center and the client devices.
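The connection manager is essentially a broker between users and virtual clients. A minimal Python sketch of that role (hypothetical names, and far simpler than any real product):

class ConnectionManager:
    """Toy connection broker: assigns an incoming user to a free virtual client
    and attaches their roaming profile, as outlined above."""

    def __init__(self, virtual_clients, profiles):
        self.free = list(virtual_clients)     # idle virtual clients (VMs, blades, sessions)
        self.profiles = profiles              # user name -> roaming profile settings
        self.sessions = {}

    def connect(self, user, protocol="RDP"):
        if user in self.sessions:             # the user already has a session: reconnect
            return self.sessions[user]
        if not self.free:
            raise RuntimeError("no virtual clients available")
        vc = self.free.pop(0)
        self.sessions[user] = {"virtual_client": vc, "protocol": protocol,
                               "profile": self.profiles.get(user, {})}
        return self.sessions[user]

broker = ConnectionManager(["vc01", "vc02"], {"alice": {"wallpaper": "blue"}})
print(broker.connect("alice"))                # alice lands on vc01 with her profile
print(broker.connect("alice"))                # reconnecting returns the same session

A real connection manager also handles authentication, roaming profiles, session reconnection after a failure, and pooling of the gold images.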
There are several popular protocols to handle the graphics communication, including Independent Computing Architecture (ICA) from Citrix, Remote Desktop Protocol (RDP) from Microsoft, Remote Frame Buffer (RFB) from the VNC project, Remote Graphics Software (RGS) from HP, and Simple Protocol for Independent Computing Environment (SPICE) from Qumranet, as well as several protocols which are proprietary to a specific client virtualization solution. Finally, the client device can be either a standard PC (desktop, notebook, tablet, etc) or a thin client device with only a monitor, keyboard, pointing device and just enough capability to display the graphics and return the keyboard and mouse events.

Because each user still requires some type of client device, you might be wondering how this is better than simply giving each user a standard PC or notebook. I covered a little bit of this earlier, but let's make it explicit:

• The thin client device is usually far cheaper than a standard desktop or notebook PC. For example, it often has no disk, and is not licensed for any software.
• It is extremely unusual to have 100% of the people using their client device at the same time. People are on vacation, traveling, in meetings, out sick, etc, leaving their client devices idle. In most organizations, 20% or more of the client devices are idle at any given moment. By reducing the number of virtualized clients by this amount, both in terms of servers in the datacenter and access devices, significant cost savings can be achieved: think of how much your organization could save by reducing your PC purchases by 20%.
• People often move around. By utilizing roaming profiles, a person can have their exact configuration (storage, networking, applications, and any other customizations allowed by their profile) available at any client device, simply by logging in. The client virtualization software will load their environment on the specific virtualized environment (an entire server, a guest operating system instance on a server, or a time-sharing client on a server) and deliver their profile to them, no matter where they are.
• A broken desktop often means that the person is idle. But a broken server in the datacenter simply means that all of the users on that device will shift over to another server and continue working with either no interruption (using guest operating system instances with live migration) or a small interruption (being forced to log in again). Roaming profiles allow the movement of users and optimal use of the servers. And a broken client device is the same way: the user will simply log into any other similar client device, and continue working without a major interruption. So client virtualization solutions mean higher availability for the clients.
• Security is also enhanced. With standard desktop and especially notebook PC's, the files are physically resident on the disk on the PC. This makes them somewhat difficult to back up, very difficult to track, and extremely difficult to enforce version control on. Further, physical control of the information is lost, either through loss of the device by the user or through the deliberate theft of the information by corporate or other espionage. By having all of the disks for the virtual clients resident on the storage in the datacenter, most of these problems are either eliminated or severely reduced.

A very important component of a client virtualization solution is the graphics protocol.
This is almost always the most important factor in whether the solution will meet the needs of the clients, and is also the one most often overlooked. It is fairly easy to analyze the performance requirements of each of the clients, and then to size the servers (processors and memory) and the back end (networking and storage) to satisfy those requirements, as the tools and expertise for this are similar to what the system administrator has been doing for servers for many years. But the graphics is often a new area.

In many cases the graphics protocol is going through the corporate backbone: this is the common case in corporate call centers or corporate offices, where the clients are in the same building or campus as the servers and the storage. But if the clients are remote, and the graphics protocol is going out through a WAN or even the Internet itself (which is often the case for home workers, remote offices, or multiple corporate offices being served by a single datacenter), then the networking infrastructure, and therefore the performance, is something over which the system administrator has no control.

And the problem is usually not bandwidth. The Internet makes amazing amounts of bandwidth available to any business and most homes, and wireless technology is available pretty much anywhere at speeds which used to be available only in dedicated circuits. Download speeds of 512Kbits per second are the low end of home networking today (whether through cable modem or telephone company DSL), are common at wireless hot-spots, and are well within the range of cellular wireless cards in major metropolitan centers. This is perfectly adequate for everything except large video images or extremely detailed graphics (real-time CAD drawings, etc).

The problem in most cases is latency, and more precisely, variable latency, where the delivery delay between network packets is inconsistent. Human beings begin to detect interruptions at around 75 milli-seconds, and find them bothersome at between 100 and 150 milli-seconds (over 1/10th of a second). When the screen is being refreshed at a certain rate (let's say every 25 milli-seconds), and a refresh is delayed by more than 100 milli-seconds, humans perceive this as an interruption and a degradation of the quality of service. This is especially true in streaming media, such as audio and video. Audio in particular is not bandwidth intensive, but hesitations caused by variable latency are quickly perceived as breaks in the audio. You need to be aware of this situation when deciding when and how to deploy client virtualization outside your campus.

There are four primary user segments for client virtualization:

At the very low end, the usage is primarily for task oriented uses such as a kiosk where only a limited set of tasks need to be done on a very infrequent basis. Low cost per user is a big driving factor for this solution. A shared Server Hosted environment, where a single operating system hosts many users in a time-sharing manner, is excellent for this segment.

In the middle are solutions that target basic productivity and knowledge workers. These solutions are ideal for users who want a full desktop user experience and want to run some multimedia applications, whether they are doing it in an office location or a remote location. At the low end, each of the virtual client sessions can be a virtualized operating
system instance, which yields a very low cost per client, but may not scale if too many virtual instances are deployed. With Virtual Desktops, the server is now in the data center as well, but the server runs multiple instances of virtual machines, each with its own OS instance. There are multiple users connecting to a single server in the backend, but each to its own OS instance in the VM. A more scalable solution is to use a dedicated server (low end to mid range) per virtual client. This delivers a guaranteed performance level, but the cost per user will be higher. The application is run on a Blade or Physical PC, and is delivered remotely to the client. The client device used by the user is typically a thin client that supports the protocol required to access the application remotely.

On the extreme high end is the blade workstation, where users require a high end graphics experience that requires dedicated hardware. This kind of solution is typically good for users in the engineering community using 3D graphics, or finance industry users that use multiple monitors to run and view several applications at the same time. This is the same as the previous segment in that each user has their own Blade or Physical PC, but in this case the system is much more powerful than in the previous segment, reflecting the greater performance demands of the clients of this segment. There is a 1:1 mapping of user to PC. Cost for this is higher, but still lower than placing a powerful CAD workstation on the desk of each user.

There is a great deal of overlap between these solutions: one organization's definition of "extreme high end" might be another organization's definition of "middle tier". But the distinction is mostly whether the server is shared (whether with different guest operating system instances or different users time-sharing a single bare iron operating system instance) or dedicated to a single user.

Figure 29 Client Virtualization Products. The figure lists, for each product, the user type it targets (Server Hosted, Virtual Desktop, Remote Desktop or Blade/Physical PC), the graphics protocol it uses (Independent Computing Architecture, Remote Desktop Protocol with or without extensions, Remote Graphics Software, Simple Protocol for ICE (SPICE), Remote Frame Buffer, X11, local or proprietary), and whether it automatically refreshes (pools) the client images. The products covered are Citrix XenDesktop, EMC VMware ACE, EMC VMware VDI, GoToMyPC, HP Consolidated Client Infrastructure, HP Virtual Desktop Interface, Microsoft Terminal Services, Qumranet Solid ICE, Symantec pcAnywhere, Virtual Network Computing and X11.

GoToMyPC and Symantec pcAnywhere are one-to-one desktop solutions which are focused on allowing individuals to access their home or work desktops remotely. In most cases the graphics bandwidth is minimal, with the primary requirement for bandwidth being file transfer. I include them for completeness, but will not focus on them. Virtual Network Computing (VNC) is similar, in that it focuses on the client device (running the "VNC Viewer") controlling a remote system (the "VNC Server"). Unlike most of the other products covered in this topic, it has been implemented on a wide variety of platforms, and works very well even between platforms. A very common situation is a Windows or Macintosh system accessing a UNIX or OpenVMS server. There is even a Java VNC viewer running inside standard web browsers.
VNC uses the Remote Frame Buffer (RFB) protocol, which transfers the bitmap of the screen to the graphics frame buffer, instead of transferring the actual graphics commands needed to re-draw the client device screen. VNC and RFB are now open source products.

The X Window System (commonly called X11 or just X) was invented in 1984 at the Massachusetts Institute of Technology by a consortium including MIT, Digital Equipment Corporation and IBM. It is a general purpose graphics protocol which has been implemented on many platforms and client devices. The major difference between it and other graphics protocols was that it was explicitly designed to be used remotely over a network, instead of to a locally attached monitor. While it remains popular in the UNIX world, it has fairly high overhead and bandwidth requirements, and is more suited for a campus environment than a truly remote delivery mechanism. Therefore, in the dramatically larger desktop world (ie, on Windows), it has been superseded by other graphics protocols such as RDP.

That leaves the client virtualization products which are the focus of this section:
− Citrix XenDesktop
− EMC VMware ACE and VDI
− HP CCI and VDI
− Microsoft Terminal Services
− Qumranet Solid ICE

Software Virtualization (Software As A Service)

In the server section we have been mostly talking about hardware virtualization, whether that was implemented in hardware (hard partitioning) or software (soft partitioning, resource partitioning, clustering, etc). This allows us to deliver the hardware resources only to the specific area (user, application, etc) that requires it. But there is a further distinction that is becoming very popular, and that is the ability to deliver only the software itself to the specific area (user, server, desktop, etc) that requires it. This is primarily for the purposes of reducing licensing costs and controlling the specific versions and patch levels of software that are available throughout your organization.

So, for example, you might want to equip all of your employees with the ability to run Microsoft Word. One way would be to issue each of them a CD/DVD with a full copy of Microsoft Office 2003 Professional, and trust each of them to install this on their work system. And this will work, but it has several problems:

• Not all people are capable of installing the software in the optimal way, and many people have their own ideas of how to install software (which, while perfectly valid, do not match the ideas of other people in your organization nor the ideas of your IT department). You will end up with a variety of installation levels, which will be difficult to trouble-shoot.
• This is extremely expensive, because Microsoft will require you to pay for every one of these licensed copies, whether the people actually need them or not.
• When you transition to the next version of Microsoft Office, you need to get back each of the original CD/DVD media (if the people still have it), and issue new DVD media. This is a very time consuming and lengthy task.

It would be better if you could maintain a single copy of the full Microsoft Office suite, and allow each of your users to run it when and if they need it. All instances would be installed and deinstalled in exactly the same way (as specified by the IT department after careful study and following industry best practices), costs would be reduced because you would only license the number of copies of the software that are running simultaneously at the peak times, and upgrades would be easy as you would only upgrade the single copy.
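The licensing arithmetic behind that claim is simple. A small Python sketch with purely hypothetical headcount, concurrency and price figures:

employees = 5000             # hypothetical headcount: everyone gets a copy if you license per install
peak_concurrency = 0.35      # hypothetical: at most 35% of them run the suite at the same time
price_per_license = 300      # hypothetical price per licensed copy

per_install = employees * price_per_license
concurrent = int(employees * peak_concurrency) * price_per_license
print(f"per-install licensing:    ${per_install:,}")
print(f"concurrent-use licensing: ${concurrent:,} (saving ${per_install - concurrent:,})")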
This is the idea of Software As A Service (SAAS): delivering software packages when and where they are needed, running either on virtual desktops (as described in the previous section) or physical desktops connected through your corporate LAN. Products in this area include:
− Altiris SVS
− Citrix XenApp (was Presentation Server)
− Microsoft SoftGrid
− VMware Thinstall

For more information

General information about virtualization
• http://www.virtualization.info, a blog about virtualization with many good links
• http://en.wikipedia.org/wiki/Hypervisor, a description of hypervisors and their history
• http://it20.info/blogs/main/archive/2007/06/17/25.aspx, IT 2.0 Next Generation IT Infrastructure discussion comparing VMware, Viridian (aka, Hyper-V) and Xen
• http://www.tcpipguide.com/free/t_IPAddressing.htm, on TCP/IP addressing and subnetting
• http://www.tech-faq.com/vlan.shtml, for a good discussion of VLAN technology

On the history of computing
• http://www.thocp.net, The History Of Computing Project, for the history of hardware and software and some of the people involved in this
• http://www.computer50.org, for information on the Mark-1, Colossus, ENIAC and EDVAC
• http://www.itjungle.com/tfh/tfh013006-story04.html, for a history of computing focused on virtualization, including some excellent pointers to other work
• http://cm.bell-labs.com/who/dmr/hist.html, for a discussion of the invention of UNIX and the UNIX File System (UFS)

On 3Com products
• http://www.3com.com/other/pdfs/solutions/en_US/20037401.pdf, for details on 3Com's approach to VLANs

On AMD products
• http://enterprise.amd.com/downloadables/Pacifica.ppt, for a description of the "Pacifica" project on processor virtualization

On Cisco products
• http://www.cisco.com/en/US/prod/collateral/ps6418/ps6423/ps6429/product_data_sheet0900aecd8029fc58.pdf, for a description of VFrame Server Fabric Virtualization

On Citrix products
• http://www.xensource.com/PRODUCTS/Pages/XenEnterprise.aspx, details on the open source Xen project
• http://www.citrix.com/English/ps2/products/product.asp?contentID=683148&ntref=hp_nav_US, details of the Citrix Xen product

On Egenera products
• http://www.egenera.com/products-panmanager-ex.htm, for a description of the Processing Area Network (PAN) hypervisor management tool
On EMC (VMware) products http://www.vmware.com/products/server/esx_features.html and http://www.vmware.com/products/server/gsx_features.html for details on the VMware virtual servers http://www.vmware.com/pdf/vi3_systems_guide.pdf for the list of supported server hardware and Chapter 13 for snapshots, and http://www.vmware.com/pdf/GuestOS_guide.pdf for the list of supported guest operating systems http://vmware-land.com/vmware_links.html for a lengthy list of VMware pointers, whitepapers and other documents http://www.vmware.com/products/esxi/, for details on ESXi http://www.vmware.com/pdf/vmware_timekeeping.pdf, for a discussion on adjusting the time in guest instances http://www.emc.com/products/family/powerpath-family.htm, EMC PowerPath http://www.emc.com/collateral/hardware/comparison/emc-celerra-systems.htm, the EMC Celerra storage home page http://www.emc.com/products/family/diskxtender-family.htm, HSM software http://www.emc.com/products/family/documentum-family.htm, EMC ILM products On Emulex products http://www.emulex.com/products/hba/multipulse/ds.jsp, RHEL and SUSE Linux on ProLiants http://www.vmware.com/products/desktop_virtualization.html, ACE and VDI desktop virtualization products On Hitachi Data Systems products • http://www.hds.com/products/storage-systems/index.html, TagmaStore SMS (primarily SMB), WMS, AMS and USP • http://www.hds.com/assets/pdf/intelligent-virtual-storage-controllers-hitachitagmastore-universal-storage-platform-and-network-storage-controller-architectureguide.pdf for a general guide to HDS storage products • http://www.hds.com/products/storage-software/hitachi-dynamic-link-manageradvanced-software.html, Hitachi Dynamic Link Manager • http://www.hds.com/products/storage-systems/index.html, home page for HDS storage products On HP products http://www.hp.com/products1/unix/operating/manageability/partitions, details on the HP-UX partitioning continuum, and http://www.hp.com/products1/unix/operating/manageability/partitions/virtual_partitions. html for the specifics on vPars © 2008 Hewlett-Packard Development Company, L.P. http://h71000.www7.hp.com/doc/72final/6512/6512pro_contents.html, details on OpenVMS Galaxy http://h71028.www7.hp.com/enterprise/cache/8886-0-0-225-121.aspx, Adaptive Enterprise Virtualization http://h71028.www7.hp.com/enterprise/downloads/040816_HP_VSE_WLM.PDF, D.H. Brown report on HP Virtual Server Environment for HP-UX http://h71028.www7.hp.com/enterprise/downloads/Intro_VM_WP_12_Sept%2005.pdf, white paper on HP Integrity Virtual Machine technology and http://h71028.www7.hp.com/ERC/downloads/c00677597.pdf, “Best Practices for Integrity Virtual Machines” http://docs.hp.com/en/hpux11iv2.html and http://docs.hp.com/en/hpux11iv3.html contains the documentation for Psets, PRM, WLM and gWLM http://h71000.www7.hp.com/doc/82FINAL/5841/aa-rnshd-te.PDF, Chapter 4.5 for a description of OpenVMS Class Scheduling http://www.hpl.hp.com/techreports/2007/HPL-2007-59.pdf, compares the performance of Xen and OpenVZ in similar workloads. 
• http://www.hp.com/rnd/pdf_html/guest_vlan_paper.htm, discusses VLAN security
• http://h18004.www1.hp.com/products/servers/networking/whitepapers.html, discusses HP networking products
• http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00814156/c00814156.pdf, discusses HP Virtual Connect
• http://docs.hp.com/en/5992-3385/ch01s01.html, discusses HP-UX RAID technology
• http://docs.hp.com/en/native-multi-pathing/native_multipathing_wp_AR0803.pdf, for HP-UX 11i v3 dynamic multi-pathing
• http://h71000.www7.hp.com/doc/83FINAL/9996/9996pro_206.html#set_pref_path, OpenVMS preferred pathing
• http://h18006.www1.hp.com/storage/aiostorage.html, All-in-One (AiO) iSCSI storage arrays
• http://h18006.www1.hp.com/products/storageworks/efs/index.html, Enterprise File Services Clustered Gateway
• http://h18006.www1.hp.com/storage/disk_storage/msa_diskarrays/san_arrays/msa2000i/specs.html, Modular Storage Array specifications
• http://h18006.www1.hp.com/products/storage/software/autolunxp/index.html, HP AutoLUN software
On IBM products
• http://www-1.ibm.com/servers/eserver/bladecenter/, http://www-1.ibm.com/servers/eserver/xseries/, http://www-1.ibm.com/servers/eserver/iseries/, and http://www-1.ibm.com/servers/eserver/pseries/hardware/pseries_family.html, eServer brochures and full descriptions
• http://www-1.ibm.com/servers/eserver/xseries/xarchitecture/enterprise/index.html, for general information on X-Architecture technology in the eServer, and specifically ftp://ftp.software.ibm.com/pc/pccbbs/pc_servers_pdf/exawhitepaper.pdf, a detailed white paper on the X-Architecture
• http://www.redbooks.ibm.com/redpapers/pdfs/redp9117.pdf, eServer p570 Technical Overview
• http://www.redbooks.ibm.com/abstracts/SG246251.html, logical partitioning on the IBM eServer iSeries
• http://www.redbooks.ibm.com/abstracts/sg247039.html and http://www.redbooks.ibm.com/redbooks/pdfs/sg247940.pdf, logical partitioning on the IBM eServer pSeries
• http://publib-b.boulder.ibm.com/abstracts/tips0119.html, LPAR planning for the IBM eServer pSeries, which describes the overhead required for LPARs
• http://www.redbooks.ibm.com/abstracts/tips0426.html, a comparison of partitioning and workload management on AIX, including the attributes of WorkLoad Manager and Partition Load Manager
• http://www-1.ibm.com/servers/eserver/about/virtualization/systems/, describes the Virtualization Engine, including micro partitioning and the Virtual I/O Server (VIOS); http://techsupport.services.ibm.com/server/vios is the VIOS documentation
• http://www.redbooks.ibm.com/redbooks/pdfs/sg246389.pdf, Appendix A describes the management of the entitlement of guest instances
• http://www.redbooks.ibm.com/redbooks/pdfs/sg246785.pdf, describes Enterprise WorkLoad Manager (eWLM) V2.1
• http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/ipha2/cuod.htm, http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/ipha2/reservecoduse.htm, http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/ipha2/tcod.htm, and http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/ipha2/cbu.htm, describe all of the Capacity On Demand permutations and billing for p5 and p6 servers
• http://www.ibm.com/software/network/dispatcher, IBM SecureWay Network Dispatcher, a front-end load balancer
• http://www-03.ibm.com/systems/storage/resource/pguide/prodguidedisk.pdf, a general product guide for IBM storage with pointers to specific products
• http://www-1.ibm.com/support/docview.wss?uid=ssg1S7000303&aid=3, IBM Subsystem Device Driver
• http://www-03.ibm.com/systems/storage/network/index.html, home page for IBM network (NAS) storage
• http://www-03.ibm.com/systems/storage/software/gam/index.html, IBM Grid Access Manager
On Intel products
• http://www.intel.com/technology/computing/vptech/, for a description of the "Vanderpool" VT-x project on processor virtualization
• ftp://download.intel.com/technology/computing/vptech/30594201.pdf, for a description of the VT-i technology, specifically Chapter 3
On Microsoft products
• http://www.microsoft.com/windowsserversystem/virtualserver/default.mspx, the Virtual Server home page; http://www.microsoft.com/technet/prodtechnol/virtualserver/2005/proddocs/vs_tr_components_vmm.mspx, the Virtual Server 2005 Administrator's Guide; http://www.microsoft.com/windowsserversystem/virtualserver/evaluation/virtualizationfaq.mspx, the Virtual Server 2005 FAQ; and http://www.microsoft.com/windowsserversystem/virtualserver/overview/vs2005tech.mspx, the Virtual Server technical overview
• http://www.microsoft.com/windowsserver2008/en/us/white-papers.aspx, for details on Windows Server 2008
• http://www.microsoft.com/windowsserversystem/virtualserver/evaluation/linuxguestsupport/default.mspx, for the description and download of Virtual Machine Additions for Linux
• http://www.microsoft.com/technet/archive/winntas/maintain/optimize/08wntpca.mspx?mfr=true, discusses Windows RAID technology
• http://www.microsoft.com/windowsserver2003/technologies/storage/mpio/default.mspx, Windows multipathing
On NEC products
• http://www.necam.com/Servers/Enterprise/collateral.cfm, the NEC Express5800/1000 main page, and especially http://www.necam.com/servers/enterprise/docs/NEC_Express58001000Series_2008_R1.pdf for details on the high-end servers
• http://www.necam.com/DP/, for a description of Dynamic Hardware Partitioning
On NetApp products
• http://www.netapp.com/us/products/storage-systems/fas6000/, home page for NetApp FAS products
On Novell products
• http://www.novell.com/solutions/datacenter/docs/dca_whitepaper.pdf, describes DataCenter Automation
• http://www.novell.com/products/zenworks/orchestrator/, describes ZENworks Orchestrator
On Oracle products
• http://www.oracle.com/corporate/pricing/partitioning.pdf, describes how Oracle "per processor" licensing works with partitioned servers
• http://www.oracle.com/technologies/virtualization/index.html, covers Oracle VM and other Oracle virtualization technologies
• http://www.oracle.com/technology/products/database/asm/pdf/take%20the%20guesswork%20out%20of%20db%20tuning%2001-06.pdf and http://www.oracle.com/technology/products/database/asm/index.html, discuss ASM and Oracle storage mechanisms
On Qumranet Solid ICE products
• http://www.qumranet.com/products-and-solutions/solid-ice-white-papers, the top-level web page for Solid ICE desktop virtualization
On Parallels (SWsoft) products
• http://www.swsoft.com/en/products/virtuozzo/, base product page for Virtuozzo
• http://www.parallels.com/en/products/server/mac/, base product page for Parallels Server for Macintosh
On Red Hat products
• http://www.redhat.com/rhel/server/compare/, compares the virtualization features of Red Hat Enterprise Linux and Red Hat Enterprise Linux Advanced Platform
• http://linux-vserver.org/Welcome_to_Linux-VServer.org and http://linux-vserver.org/Paper, Linux-VServer, which implements virtual operating system instances inside a single O/S instance
• http://sources.redhat.com/dm/ is the source for Linux Device Mapper, and http://lwn.net/Articles/124703/ is an excellent article on Device Mapper and multipath
On Sun products
• http://www.sun.com/servers/entry/, http://www.sun.com/servers/midrange/, and http://www.sun.com/servers/highend/, for specific server information
• http://www.sun.com/servers/highend/whitepapers/Sun_Fire_Enterprise_Servers_Performance.pdf, for a description of the Sun Fireplane System Interconnect, with more details in http://www.sun.com/servers/wp/docs/thirdparty.pdf, Section 4; see also http://docs.sun.com/source/817-4136-12, Chapters 2 and 4, for a description of the overhead and latency involved in the Fireplane
• http://www.sun.com/products-n-solutions/hardware/docs/pdf/819-1501-10.pdf, Sun Fire Systems Dynamic Reconfiguration User Guide, and http://www.sun.com/servers/highend/sunfire_e25k/features.xml, for a description of Dynamic System Domains in the Sun Fire E25K, which states that Dynamic System Domains are "electrically fault-isolated hard partitions"; note that it focuses on the OLAR features rather than the partitioning features recognized by the other vendors
• http://www.sun.com/bigadmin/content/zones/, for a description of Solaris containers and zones, and http://www.sun.com/blueprints/1006/820-0001.pdf, a Sun whitepaper on all virtualization technologies
• http://www.sun.com/datacenter/consolidation/docs/IDC_SystemVirtualization_Oct2006.pdf, an IDC whitepaper reviewing containers and zones, and http://www.genunix.org/wiki/index.php/Solaris10_Tech_FAQ, a FAQ on Sun virtualization
• http://www.sun.com/datacenter/cod/, Sun's Capacity On Demand program and products
• http://www.sun.com/blueprints/0207/820-0832.pdf and http://www.sun.com/servers/coolthreads/ldoms/datasheet.pdf, Sun Logical Domains
• http://docs.sun.com/app/docs/doc/819-2450 and http://www.sun.com/software/products/xvm/index.jsp, cover xVM and Zones
• http://docs.sun.com/source/820-2319-10/index.html, covers LDoms
• http://docs.sun.com/source/819-0139/index.html, the MPxIO Admin Guide
• http://www.sun.com/storagetek/disk.jsp, SANtricity on the StorageTek line
On Symantec products
• http://eval.symantec.com/mktginfo/enterprise/white_papers/entwhitepaper_vsf_5.0_dynamic_multi-pathing_05-2007.en-us.pdf, Dynamic Multi-Pathing
• http://h20219.www2.hp.com/ERC/downloads/4AA1-2792ENW.pdf, discusses Symantec Veritas Volume Manager Dynamic Storage Tiering
On Unisys products
• http://www.unisys.com/products/es7000__servers/features/crossbar__communication__paths.htm, describes the cross-bar communications, and http://www.unisys.com/products/es7000__servers/features/partition__servers__within__a__server.htm describes the hard partitioning capabilities
• http://www.serverworldmagazine.com/webpapers/2000/09_unisys.shtml, describes Cellular MultiProcessing (CMP)
On Virtual Iron products
• http://www.virtualiron.com/products/index.cfm, describes "native virtualization"
• http://www.virtualiron.com/products/System_Requirements.cfm, describes the system requirements (hardware platforms, guest O/S's, etc.) of Virtual Iron
• http://virtualiron.com/solutions/virtual_infrastructure_management.cfm, describes LiveMigrate and LiveCapacity
On Virtual Network Computing (VNC) products
• http://www.realvnc.com/products/enterprise/index.html, describing the Enterprise Edition of VNC
• http://www.tightvnc.com/, describing one of the many freeware versions of VNC
On Para-virtualization products
• http://www.trango-systems.com/english/frameset_en.html, describes the Trango real-time embedded hypervisor
• http://www.linuxdevices.com/news/NS3442674062.html, a description of Open Kernel Labs (OKL)'s offering of a Linux micro-kernel used in cell phones, which offers a "Secure HyperCell" to encapsulate an entire O/S (Symbian, primarily), an application or a device driver; VirtualLogix and Trango are competitors
• Xen and the Art of Virtualization, which describes the issues around virtualizing the x86 architecture and the Xen approach to solving those problems
• Also see the October 15, 2004 Linux Magazine article "Now and Xen" by Andrew Warfield and Keir Fraser
• http://www.cs.rochester.edu/sosp2003/papers/p133-barham.pdf, a technical white paper by the authors of Xen, detailing the design and implementation
• http://www.xensource.com/files/xensource_wp2.pdf, describes the para-virtualization model and how it can cooperate with operating systems which do not support para-virtualization, and http://xensource.com/products/xen_server for a comparison of the different editions of Xen
• http://www.egenera.com/reg/pdf/virtual_datacenter.pdf, describes the Egenera Processing Area Network (PAN) management tool, which offers datacenter virtualization, i.e. management of many host and guest operating system instances across a datacenter
On emulation products
• http://www.softresint.com/charon-vax/index.htm
• http://www.transitive.com/news/news_20060627_2.htm
• http://www.platform-solutions.com/products.php and http://www.t3t.com
On storage in general
• http://www.wwpi.com/index.php?option=com_content&task=view&id=301&Itemid=44, "Achieving Simplicity with Clustered, Virtual Storage Architectures" by Rob Peglar
• http://www.adic.com/storenextfs, http://www.sun.com/storage/software/data_mgmt/qfs/ and http://www-1.ibm.com/servers/storage/software/virtualization/sfs/index.html, for SAN-based file systems across non-clustered servers
• http://www.unixguide.net/unixguide.shtml, for a good comparison of the commands and directories of all the UNIX systems
• http://www.snia.org/education/dictionary/r, for the official definition of RAID
Acknowledgements
I would like to extend my heartfelt thanks to Matt Connon, Jeff Foster, Eric Martorell, Brad McCusker, Mohan Parthasarathy, Ram Rao, Alex Vasilevsky, John Vigeant and Scott Wiest for their invaluable input and review. I would also like to thank Nalini Bouri, K.C. Choi and Eamon Jennings for their enthusiastic support for this and my other whitepapers.
Revision History
V0.1, Mar 2005 – Preliminary version, with some material derived from "A Survey of Cluster Technologies" V2.4
V1.0, Jun 2006 – First released version, to an HP internal audience
V1.1, Sep 2006 – Presented at HP Technology Forum, Houston TX
V1.2, Jan 2007 – Published in TYE
V1.4, Jun 2007 – Presented at HP Technology Forum, Las Vegas NV
V2.0, Jun 2008 – Presented as a pre-conference seminar at HP Technology Forum, Las Vegas NV
© 2008 Hewlett-Packard Development Company, L.P.