A Survey of Virtualization Technologies

Ken Moreau
Solutions Architect, Hewlett-Packard
Overview
This paper surveys the virtualization technologies for the servers, operating systems, networks,
software and storage systems available from many vendors. It describes the common functions
that all of the virtualization technologies perform, shows where they are the same and where they
are different on each platform, and introduces a method of fairly evaluating the technologies to
match them to business requirements. As much as possible, it does not discuss performance,
configurations or base functionality of the operating systems, servers, networks or storage
systems.
This document is aimed at readers who are technically familiar with one or more of the virtualization products discussed here and who wish to learn about the others, as well as anyone who is evaluating various virtualization products to find the ones that fit a stated business need.
Introduction
Virtualization is the abstraction of server, storage, and network resources from their physical functions (e.g., processors, memory, I/O controllers, disks, network and storage switches) into pools of functionality which can be managed functionally, regardless of their implementation or
location. In other words, all servers, storage, and network devices can be aggregated into
independent pools of resources to be used as needed, regardless of the actual implementation of
those resources. It also works the other way, in that some elements may be subdivided (server
partitions, storage LUNs) to provide a more granular level of control. Elements from these pools
can then be allocated, provisioned, and managed—manually or automatically—to meet the
changing needs and priorities of your business.
Virtualization technologies are present on every component of the computing environment. The
types of virtualization covered in this paper include:

Hard partitioning, to electrically divide large servers into multiple smaller servers

Soft or dynamic partitioning, to divide a single physical server or a hard partition of a
larger server, into multiple operating system instances

Micro partitioning, to divide a single processor and a single I/O card among multiple
operating system instances

Resource partitions, to allocate some of the resources of an operating system instance to
an application

Workload management, to allocate an application and its resources to one or more
operating system instances in a cluster or resource partitions in an operating system
instance

Dynamic server capacity, such as Pay Per Use, instant Capacity (iCAP) or Capacity On
Demand (iCOD)

Clustering, to divide an application environment among multiple servers. (This includes
cluster network alias)

Network Address Translation, to have a single load balancing server be the front end to
many (dozens or hundreds) of servers for near infinite scalability

Media Access Control (MAC) spoofing, to simplify network management tasks by allowing a single network port to act as the front end for multiple server NIC ports

Redundant Array of Independent Disks (RAID), whether it is done by operating system
software, 3rd party software acting as an application, the storage controller in the server,
or the storage controller in the storage array itself

Channel bonding, to combine the functionality and performance of multiple I/O cards
into a single logical unit

In-cabinet storage virtualization, to allocate a logical disk unit across multiple physical
disks that are connected to a single set of storage controllers

Between-cabinet storage virtualization, to allocate a logical disk unit across multiple
physical disks that are connected to multiple storage controllers.
Some virtualization technologies affect servers but are more focused on the desktop; these fall under application virtualization:

Remote desktop, where the desktop is used only as a delivery device, supplying the
display, keyboard and mouse, and the application runs on the server. Citrix Presentation
Server (previously known as MetaFrame) and Microsoft Windows Virtual Desktop are the
most popular products in this area.

Virtual applications, where some software components are installed on the desktop and coordinate with a server so that the desktop O/S recognizes the application as fully
installed. Altiris Software Virtualization is the most well known in this area, with Microsoft
SoftGrid becoming a player.

Application streaming allows the application to run on either the desktop or the server,
whichever is most appropriate at that time. AppStream and Softricity (recently acquired
by Microsoft) are the most well known in this area.

Software delivered as a service, where some components run on the desktop and some run on the server. Web-based e-mail is the most common application, but any web page which delivers Java to the desktop while doing a great deal of processing on the server also fits this model.
Eventually this paper will discuss all of the above technologies. However, at this time I will only
discuss the server partitioning technologies.
History
In the early days of computing, there was a direct connection between the problem being solved
and the hardware being used to solve it. In 1594 John Napier invented Napier's Bones to perform multiplication, while in 1727 Antonius Braun made the first general purpose calculator that could add, subtract, multiply and divide. It took Charles Babbage to develop a general
purpose machine that could actually be “programmed” to perform different operations, when he
created the Analytical Engine in 1834. In 1886 Herman Hollerith constructed a mechanical device
that added and sorted for the US Census of 1890, leading to the creation of IBM.
Even after the invention of the electronic calculators and computing engines, such as the “Bombe”
in 1943, there was still a direct connection between the problem to be solved and the physical
system needed to solve it. The earliest form of programming involved wiring a board in order to
solve a specific problem, and changing the program involved physically moving the wires and
connections on the board itself.
The invention of the transistor and electronic circuits, plus the concept of changing the paths of
the electricity flowing through those circuits without having to change their physical
configuration, finally broke this “hard wired” connection. In 1946 Alan Turing publicized the
concept of “stored programming”, helping the inventors of ENIAC to create EDVAC, as well as the
Manchester “Baby” and Whirlwind. For the first time, the function of the machine could be
changed without modifying it physically, and the concept of software was born.
But even with this, the machines were still solving only one problem at a time, and there was a
direct connection between the program and the hardware it ran on: the program could only
access the physical memory and other hardware resources physically present in the system, and
only one program at a time could be running.
Storage was the same way, in that storage administrators had to work with the individual disk drives, and originally individual tracks and sectors, without the benefit of such abstractions as file names and directories. Further, they spent a lot of time striving for good performance by
physically moving data from one disk to another in order to balance the load of both
performance and physical space on the drives. This was incredibly time consuming and disrupted
the use of the systems because the disks needed to be off-line while it was being done. Further, it
was a job that was never finished: as new workloads appeared and as the workloads shifted day
to day (for example, end of the month processing might be vastly different than middle of the
month processing), the work needed to be done again and again and again. And because the
storage was attached to the servers via controllers embedded in the system, adding new storage
often required taking the server down to install new I/O cards.
But with the invention of operating systems and timesharing, this paradigm began to be broken:
each process could act as if it had an entire computer dedicated to itself, even though that
computer was being shared between many programs running in different processes. In effect, the
operating system "virtualized" the single physical machine into many virtual machines. This continued with the invention of virtual memory by Tom Kilburn in 1961, time-sharing on the Atlas machine in the early '60s, and symmetric multi-processing on many machines in the '70s. Each program still saw a flat memory space and a single processor, but the processor being used from one machine cycle to the next might be a different physical unit, and the memory might be stored on hard disk at some point; the program neither knew nor cared about such details.
Similarly, the invention of storage controller based RAID (Redundant Array of Independent Disks)
allowed the storage administrator to bundle a few disks together into slightly larger groups, of
two or three disks in a mirrorset (RAID-1) or four, six or eight disks in a striped mirrorset (RAID
1+0). Then placing these storage controllers into the storage units themselves removed them from the host system and took their management away from the system administrators completely.
The industry has now taken that virtualization to the next level, where we are breaking the
connection between the physical computer systems and the programs they run, and having the
freedom to move processes between computer systems, to have many I/O cards act as a single
very fast I/O card, to have many processors act as a single very fast processor, or to have a single
very fast processor act as many slightly slower processors. And we are doing the same thing with
storage, where a hundred or more disks can be grouped together and managed as a single entity,
with automatic load balancing and on-line expansion, and attaching them to the servers via
FibreChannel switches or network switches via iSCSI, such that new storage can be added to the
environment without needing to take down the servers.
Server Virtualization
Hard Partitioning
Hard partitioning divides a large symmetric multi-processing (SMP) server into multiple smaller
servers, with full electrical isolation between the partitions. In this way each of the hard partitions
can be powered off and restarted without affecting any of the other hard partitions, effectively
creating multiple independent servers from a single server. Each of the hard partitions commonly
has processors, memory and I/O slots allocated to them.
Hard partitions can be treated as individual servers, and the operating systems are unable to tell
the difference between a hard partition in a large server and an entire smaller server. The
operating system simply sees the resources that are allocated to it.
The advantage of hard partitions over the same configuration of processors and memory in
smaller servers is flexibility. If you have, for example, a single sixteen processor server that you
have hard partitioned into four four-way servers, you have achieved the same functionality as if
you bought four four-way servers. But you now have the ability to re-partition and create eight
two-way servers, or two four-way and one eight-way server, simply by shutting down the
operating system instances, issuing a few partition manager commands or mouse clicks, and
booting the operating systems into the new hardware configuration. As workloads change and
new business requirements occur, this can be very useful. Further, there are many cases where
the maintenance and licensing costs of the single larger server can be less than the similar
charges for the same processor count of multiple smaller servers.
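To make the re-partitioning arithmetic concrete, the following sketch enumerates the layouts available from a single sixteen processor server, assuming (purely for illustration) that hard partitions are built from two-processor building blocks; the actual granularity depends on the server's cell or building-block size.

```python
# Illustrative only: enumerate the hard partition layouts of a sixteen-processor
# server, assuming partitions are built from 2-processor building blocks.
def partition_layouts(total_blocks: int = 8, procs_per_block: int = 2):
    """Return every multiset of partition sizes (in processors) that uses all blocks."""
    layouts = set()

    def recurse(remaining, current):
        if remaining == 0:
            layouts.add(tuple(sorted(current)))
            return
        for blocks in range(1, remaining + 1):
            recurse(remaining - blocks, current + [blocks * procs_per_block])

    recurse(total_blocks, [])
    return sorted(layouts)

if __name__ == "__main__":
    for layout in partition_layouts():
        print(layout)  # includes (2, 2, 2, 2, 2, 2, 2, 2), (4, 4, 4, 4), (4, 4, 8), (16,), ...
```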
The basic architecture of the server will either allow hard partitioning or not. The three common
server architectures are bus-based, switch-based and mesh-based.
Figure 1 Server Architectures (bus-based, switch-based and mesh-based designs)
Bus-based (also known as “planar”) architectures have a single path that enables communication
between all of the components in the system. (Some of the more advanced planar systems might
have two paths, one for memory and one for I/O, but the principle still applies.) All processors,
memory slots and I/O slots “plug in” to the path(s) in order to send messages to the other
components. This entire environment, with processor slots, memory slots, I/O slots all connected
together is called a “chipset”. In this way the processors can read and write the memory and send
and receive information through the network and storage cards directly, without passing through
any intermediate component. The best analogy for this design is that of a flat network, where all
of the systems plug directly into a single network repeater. The advantage of this design is that
the latency between any two components is identical: any processor can read any memory
location with the same access times, and any I/O card can directly access either any processor or
any memory location with the same access times. This is commonly known as a Uniform Memory
Architecture (UMA). The disadvantage of this design is that this single communications path can
become clogged with messages, forcing the different components to wait to send their message.
This sets an upper limit on scalability. However, this upper limit is high and getting higher all the
time, as the clock rate on the communications path is increased by the chipset vendors. The
bandwidth of the communications path is more than adequate for low-end and mid-range
servers, which tend to focus on scale-out and multiple cooperating machines for larger workloads.
Switch-based architectures are more like modern networks, with switches rather than hubs
connecting the processors, memory and I/O slots, allowing multiple simultaneous
communications. High end systems have a tiered approach, where switches connecting the
processors and memory (and frequently the I/O slots) also connect to higher level switches that
act as the systems interconnect, to allow communications across many components. This is
analogous to a modern network, with many switches and routers connecting many sub-networks.
The advantage of this design is that it increases the total throughput between the components,
and allows linear scalability as new components are added, since the additional new components
provide additional switch throughput. The disadvantage of this design is that it increases the
latency of the communications between the components that are multiple switches away. This is
called Non Uniform Memory Architecture (NUMA), because the latency between components on
the same switch is different than the latency between components that cross the additional
switches. If the additional latency is high enough, it can cause significant problems in
performance as you scale up with larger numbers of processors, multiple gigabytes of memory,
and many I/O slots.
Mesh-based architectures place the switches in the processor itself, allowing the processors to
define the routing of any messages. All memory is directly connected to the processor, instead of
being connected to the processor by an external switch. This leads to dramatically lower memory
latency when the processor accesses directly connected memory, and only moderate additional
latency when the processor accesses memory that is directly connected to another processor in the
mesh. This results from the combination of the high speed of the processor interconnect where
each additional processor adds memory bandwidth, and the intelligent routing of the messages
that minimizes the number of intermediate hops between processors. In practice, the average
memory latency in a mesh architecture compares very well to the local memory latency in a
switch architecture, is better than the memory latency in a bus architecture, and is dramatically
better than the memory latency across multiple switches in a switch architecture because there
are no bottlenecks for the memory transfer.
The most significant advantage of switch and mesh based designs is that they allow for hard
partitioning, simply by having the various switches (whether they are internal or external to the
processor) allow or prohibit communications between various components. This provides the
electrical isolation required by hard partitions that is not possible in a bus-based design.
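As a rough mental model of that mechanism (not any vendor's firmware interface), the fabric can be thought of as a table of cell-to-partition assignments that is consulted before any traffic is forwarded:

```python
# A minimal sketch of switch-enforced isolation: the fabric only forwards
# traffic between cells that the management processor has assigned to the
# same partition. The data model is illustrative, not any vendor's firmware.
class PartitionedFabric:
    def __init__(self):
        self.partition_of = {}  # cell id -> partition name

    def assign(self, cell: str, partition: str) -> None:
        self.partition_of[cell] = partition

    def may_communicate(self, cell_a: str, cell_b: str) -> bool:
        # The crossbar never forwards traffic between cells in different
        # partitions, which is what gives each partition its isolation.
        pa = self.partition_of.get(cell_a)
        pb = self.partition_of.get(cell_b)
        return pa is not None and pa == pb

fabric = PartitionedFabric()
fabric.assign("cell0", "nPar1")
fabric.assign("cell1", "nPar1")
fabric.assign("cell2", "nPar2")
assert fabric.may_communicate("cell0", "cell1")
assert not fabric.may_communicate("cell0", "cell2")
```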
AMD Opteron 200 and 800 (and more recently the 2000 and 8000) class processors are
technically mesh-based, in that they have the inter-processor connections on the processors
themselves. However, at this time neither AMD nor any systems vendor has implemented the
dynamic routing needed for hard partitioning. All of the partitioning being done with Opteron
systems is being done at the switch level, outside of the processor. Therefore, I have classified the
Opteron systems based on the types of partitioning which are available today, either bus-based
or switch-based. If/when AMD and systems vendors offer systems which implement mesh-based
partitioning, I will modify this paper.
The following shows the partitioning capabilities of the major servers on the market today:
Dell
  Bus-based (no hard partitions): PowerEdge (all)
  Switch-based (hard partitions): none
  Mesh-based (varies): none

Hewlett-Packard
  Bus-based (no hard partitions): HP 9000 PA-RISC rp34xx, rp4440; Integrity rx16xx, rx26xx, rx4640; NonStop (all); ProLiant (BL, DL, ML xx0)
  Switch-based (hard partitions): AlphaServer DS1x, DS2x, ES40, ES45 (Note: no hard partitions); AlphaServer GS80, GS160, GS320; HP 9000 PA-RISC rp74xx, rp84xx, Superdome; Integrity rx76xx, rx86xx, Superdome
  Mesh-based (varies): AlphaServer ES47, ES80, GS1280 (Note: hard partitions); ProLiant (BL, DL, ML xx5) (Note: no hard partitions)

IBM
  Bus-based (no hard partitions): System OpenPower (all); System x 2xx, 3xx, 33x, 34x; System i and p 510, 520, 550
  Switch-based (hard partitions): System x 440, 460, 3950; System i and p 560, 570 (Note: hard partitions between chassis only)
  Mesh-based (varies): System i and p 590, 595 (Note: no hard partitions)

NEC
  Bus-based (no hard partitions): Express5800 (all)
  Switch-based (hard partitions): Express5800 1080Rf, 1160Xf, 1320Xf
  Mesh-based (varies): none

Sun Microsystems
  Bus-based (no hard partitions): Sun Fire V1xx, V2xx, V4xx, V8xx, V1280
  Switch-based (hard partitions): Sun Fire E2900, E4900, E6900, 12K, 15K, 20K, 25K (Note: no hard partitions)
  Mesh-based (varies): Sun Fire V20z, V40z, X2100, X4100-X4600 (Note: no hard partitions)

Unisys
  Bus-based (no hard partitions): ES3105, ES3120, ES3140
  Switch-based (hard partitions): ES7000 Aires (all), Orion (all)
  Mesh-based (varies): none

Figure 2 Server Hard Partitioning
Dell
Dell uses a bus architecture for the PowerEdge servers (Intel Pentium Xeon and AMD Opteron); therefore, there is no hard partitioning in any Dell server.
Hewlett-Packard
HP uses a bus architecture for the low-end and mid-range servers:

HP 9000 rp1xxx, rp2xxx and rp4xxx lines (PA-RISC),

Integrity rx1xxx, rx2xxx, rx3xxx and rx4xxx lines (Itanium),

NonStop servers (MIPS) and

ProLiant BL, DL and ML lines (AMD Opteron and Intel Pentium Xeon).
Therefore, there is no hard partitioning for these servers.
HP uses a switch architecture for some entry level and mid-range servers:

AlphaServers DS1x, DS2x, ES40 and ES45 lines (Alpha)
Because these are entry level and mid-range, there is no hard partitioning for these servers.
HP uses a cross-bar architecture for some high-end servers:

AlphaServer GS80, GS160 and GS320 (Alpha)

HP 9000 rp7xxx, rp8xxx and Superdome lines (PA-RISC) and

Integrity rx7xxx, rx8xxx and Superdome lines (Itanium).
The cross-bar architectures between the AlphaServer GS80, GS160 and GS320, and the HP 9000
and Integrity high-end systems are shown in Figure 3.
Figure 3 HP Cross-Bar Architectures (AlphaServer and HP 9000/Integrity Superdome cabinets: cells, I/O, crossbars and backplane)
Each of the basic building blocks in the AlphaServer line (code named Wildfire) is called a "System Building Block" (SBB), while the sx1000 and sx2000 chipsets define "cells" in the HP 9000 and
Integrity systems. In both cases the internal switch (called a “cell controller” in the sx1000 and
sx2000) connects the four processor slots, memory slots, ports to external I/O cages for the I/O
cards, and ports to the system cross-bar switch. The AlphaServer has a single global switch that
connects all of the SBB’s, while the HP 9000 and Integrity servers have two levels of switch
between the cells: a cross-bar that connects up to four cells, and direct connections from each
cross-bar to three other cross-bars.
The management console directs the management processor of the system to either allow or not
allow communications between specific components across the global and cross-bar switches,
defining the partition configuration. This defines the hard partitioning, which is therefore
granular to the SBB and cell level. This hard partitioning offers true electrical isolation such that
one partition can be powered up or down without affecting any other partition.
In September 2007, HP added the capability of adding and removing cells to and from the hard
partitions in the cell based Integrity servers. Called “dynamic nPars”, it works similarly to Sun
Dynamic System Domains, in that removing a cell board from a hard partition consists of
preventing any new processes from being scheduled on the processors in the cell to be removed,
moving all data from the memory in the cell to memory in other cells, and (optionally, if there is
an I/O chassis attached to the cell) switching all I/O to the alternate paths on I/O chassis on the
remaining cells. When this has been accomplished, the cell is detached from the hard partition.
Adding a cell with or without an I/O chassis to a hard partition is simply the process in reverse.
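The sketch below restates that removal sequence in code form; the dict-based data model and function are invented for illustration and are not HP's actual partition manager interfaces.

```python
# Sketch of the cell-removal sequence described above (illustrative only).
def remove_cell(partition: dict, cell: str) -> None:
    """Detach `cell` from `partition`, following the sequence in the text."""
    # 1. Stop scheduling any new processes on the cell's processors.
    partition["sched_excluded"].add(cell)
    # 2. Drain the cell's memory pages into memory on the remaining cells.
    survivors = [c for c in partition["cells"] if c != cell]
    for i, page in enumerate(partition["memory"].pop(cell, [])):
        partition["memory"][survivors[i % len(survivors)]].append(page)
    # 3. If an I/O chassis is attached to the cell, fail its traffic over to
    #    the alternate paths on I/O chassis of the remaining cells.
    partition["io_paths"].pop(cell, None)
    # 4. Detach the cell from the hard partition; adding a cell is the reverse.
    partition["cells"].remove(cell)

npar = {
    "cells": ["cell0", "cell1"],
    "sched_excluded": set(),
    "memory": {"cell0": ["pg0", "pg1"], "cell1": ["pg2"]},
    "io_paths": {"cell0": "primary", "cell1": "primary"},
}
remove_cell(npar, "cell1")
print(npar["cells"], npar["memory"])  # ['cell0'] {'cell0': ['pg0', 'pg1', 'pg2']}
```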
One significant difference between the two designs is the way memory is used by the operating
systems. Using OpenVMS and Tru64 UNIX, the AlphaServer optimizes the use of SBB local
memory in order to minimize memory latency, but this comes at the expense of much higher
memory latency when crossing the global switch: the dreaded “NUMA effect”. This has significant
effects on performance, far more than anyone realized at first release. The most recent versions
of the operating systems worked on reducing this behavior, but it is still a factor.
HP-UX on the HP 9000 and Integrity servers balances memory access across all available cells.
This increases the average memory latency, but almost completely eliminates the NUMA effect,
and results in almost UMA performance on this NUMA design. HP offers the option of allowing
“cell local memory” for both the HP 9000 and Integrity servers, but in practice most users disable
this feature. The primary exception to this rule is high performance computing (HPC), where each
workload can be isolated to a particular cell’s processors and memory, and memory bandwidth
and latency are critical.
Note that the communication across the cross-bar switches in the HP 9000 and Integrity servers
in the sx1000 chipset is limited to 250MHz, independent of the clock speed of the processors. In
the sx2000 chipset this was changed to use a serial link at a much higher clock rate. This
eliminated a single point of failure of the clock module.
HP uses the mesh architecture for some high-end servers:

AlphaServer ES47, ES80 and GS1280 (Alpha).
The ES47, ES80 and GS1280 (code named Marvel) systems use the mesh architecture of the Alpha
EV7 processor to connect each processor to up to four other processors (at the N(orth), S(outh),
E(ast) and W(est) connections), to the memory that is directly attached to that processor through
the two memory ports, and to the external I/O cage for the I/O cards. The management console
directs the firmware of the GS1280 as to which processors and I/O cages are connected to each
other, defining the partition configuration. This defines the hard partitioning, which is therefore
granular to the processor and card cage level. This hard partitioning offers true electrical isolation
such that one partition can be powered up or down without affecting any other partition.
Figure 4 HP Alpha EV7 Architecture (each EV7 processor has N, S, E and W mesh ports to neighboring processors, two RDRAM memory ports and an I/O port)
The EV7 architecture allows superb memory access times to all locally connected memory, and
the high-speed processor interconnect allows good to excellent memory access times to all of the
local memory on the other processors. The actual access time is dependent on the number of
“hops” (processor to processor communications) that the memory access requires. Figure 5
shows the effect of hops on memory latency in nanoseconds, as the distance ("hops") from the
center point (of 75ns) increases:
Figure 5 EV7 Memory Latency (an 8x8 grid of latencies: 75 ns at the center point, rising to roughly 140, 175, 211, 247, 283, 319, 355 and 391 ns as the hop count from that point increases)
The mesh based architecture tends to localize the memory accesses to either the processor which
is running the program accessing the memory, or processors very close to that processor. For example, a large in-memory data structure such as an Oracle SGA that will not fit into the memory attached to a single processor is spread across the memory of nearby processors. And the process scheduler tends to schedule processes on
processors near the memory they will access. The net result of this is that the average hop count
is less than 2, which results in memory access times which are equal or superior to bus-based
systems.
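A rough worked example, using the latencies from Figure 5 (about 75 ns locally, roughly 65 ns extra for the first hop and roughly 36 ns for each additional hop), shows why a sub-2 average hop count keeps the average latency competitive. The hop-count mix below is an assumption chosen only to illustrate the calculation.

```python
# Rough worked example using the Figure 5 latencies; the hop-count mix is an
# illustrative assumption, not measured data.
def ev7_latency_ns(hops: int) -> float:
    if hops == 0:
        return 75.0
    return 140.0 + 36.0 * (hops - 1)

hop_mix = {0: 0.40, 1: 0.35, 2: 0.15, 3: 0.10}  # assumed fraction of references per hop count

avg_hops = sum(h * p for h, p in hop_mix.items())
avg_latency = sum(ev7_latency_ns(h) * p for h, p in hop_mix.items())
print(f"average hops: {avg_hops:.2f}")           # 0.95, i.e. well under 2
print(f"average latency: {avg_latency:.0f} ns")  # about 127 ns
```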
IBM
IBM uses a bus architecture for most servers:

eServer xSeries 326 (AMD Opteron),

eServer BladeCenter HS20 and HS40 (Intel Pentium Xeon),

eServer BladeCenter JS40 (PowerPC),

eServer OpenPower (Power5),

System i and p 510, 520 and 550 (Power5+), and

System x servers (Intel Pentium Xeon).
Therefore, there is no hard partitioning in these servers.
IBM Power4, Power5 and Power5+ are mesh based (in what IBM refers to as an “inter-processor
fabric”) for many servers:

System i servers (Power4 and Power5/5+) and

System p servers (Power5+).
Figure 5 IBM Power4/5/5+ Architecture (left: 16-way, one book; right: 64-way, four books)
Figure 5 shows the connections between each processor. Each blue block is a multi-processor
module (MCM) holding four Power processors (the chartreuse, green, blue and purple blocks).
Each processor can connect to two other processors via the distributed switch on the MCM, and
one other processor on each of two other MCM’s. Each processor has its own memory and I/O
slots. Note that the Power4 series does not have L3 cache, where this was added for the Power5
and Power5+. Two MCM's with four processors each, with two cores per Power5 or Power5+ processor (which IBM calls a "16-way", as there are 8 processors with 16 cores), form a "book", and multiple books are connected to form a system.
Not all connections between the distributed switches are always used. In the dual-processor
module (DCM) or quad-processor module (QCM) or in systems with 16 or 32 processors, only two
or three connections between processors are used. Note the extra connection between the
processors in the 64-way configuration that is not present in the 16-way configuration.
IBM achieves their excellent performance because of the high bandwidth and low latency of the
distributed switch. The processors can communicate with the other processors in the same MCM
or QCM on each clock tick, while communicating between modules on every other clock tick.
The “hop” distance and memory latency and bandwidth in the 4 book environment is similar to
the Alpha EV7 environment for 32 processors, but will give better performance overall because
the dual-core nature of the Power5 allows one less hop for the same number of cores.
There is no hard partitioning for these servers. IBM has chosen to implement its mainframe
“logical partition” capability, with either LPARs (logical partitions, prior to AIX 5.2, with fixed
allocations of dedicated processors, memory and I/O slots), DLPARs (dynamic logical partitions,
starting with AIX 5.2, which also has dedicated processors, memory and I/O slots but added the
ability to move those between DLPARs without rebooting) and SPLPARs (shared processor logical
partitions, using the IBM Hypervisor as a host operating system for several guest operating
systems).
IBM discusses “processor hot-swap”, but because of the lack of electrical isolation in the pSeries,
this is actually part of their Capacity Upgrade on Demand (CUoD) package, where there are spare
processors not allocated to a partition which can be activated as needed to either augment the
existing processors or replace a failed processor in a partition. Specifically, IBM does not offer the
capability to physically remove and replace a processor without powering down the system.
Therefore, the pSeries does offer the capability to delay shutting down all of the logical partitions
until it is more convenient to do that, but does require that the entire system be shut down in
order to repair or upgrade processors or memory.
IBM uses the X-Architecture for the high-end System x servers. Each XA-32 chipset (Intel Pentium
Xeon processors) or XA-64 chipset (Intel Itanium processors, which IBM is de-emphasizing) is
called a “node” and has four processor slots, memory slots, and a PCI-X to remote I/O bridge
controller. The enhanced chipset adds an L4 cache and a synchronous memory interface buffer.
Note that instead of a cross-bar architecture, communications between the nodes is done via the
SMP expansion ports. There are three SMP expansion ports per node, meaning that each node
can communicate directly with only three other nodes. This limits scalability to 16 processors, but
allows hard partitioning for these servers. Note that in this case, the SMP expansion ports are the
only communications path that can be enabled or disabled, such that the partitioning granularity
is at the node level. This offers true electrical isolation between nodes.
IBM uses a similar architecture for the System p560 and p570 servers. Each building block
contains a processor drawer with four processors, memory slots and I/O slots, and can be
connected to other building blocks. The key difference between the p570 and the x servers is that
the x servers connections are point-to-point, while up to four building blocks of the p570 can be
connected together using the appropriate “flex cable” (FC) in a ring topology. Each processor
drawer must be fully populated. This does allow for hard partitioning and electrical isolation at
the processor drawer level, but only by removing the existing flex cable and replacing it with the
specific cable for that specific number of building blocks. There are different flex cables for a two
building block, a three building block, and a four building block configuration, and they are not
interchangeable. Further, there must be precisely one “master node” with all of the other building
blocks being “slave nodes”, so you cannot easily split a four building block configuration into (for
example) a pair of two building block configurations.
Both of these designs limit the environment to 16 processors (4 nodes or building blocks) and
neither uses a switch (point to point in the x servers and a ring topology in the p570). These
would seem to be serious limitations, but given the dual core nature of the processors (Intel
Pentium Xeon in the x servers and Power5 in the p570), these servers can effectively scale to 32
cores, which will cover quite a lot of the mid-range market.
Figure 6 IBM X-Architecture and p570 Architecture
NEC
NEC uses a bus architecture for the Intel Xeon based servers:

NEC Express5800/100
Therefore there is no hard partitioning for these servers.
NEC uses a switch architecture for the Intel Itanium based servers:

NEC Express5800/1000
Figure 7 NEC A3 Architecture
The A3 architecture uses a high speed crossbar switch to connect each of the “cells”, each of which
contains four processor sockets and 32 memory slots. The crossbar connects each of the cells to
each other, as well as to the PCI I/O slot chassis.
One of the features of the A3 chipset is Very Large Cache (VLC). VLC provides a separate Cache Coherency Interface (CCI) path, both internally to the cell and externally between cells, that enables processor cache coherency and is also controlled by the service processor. The cache coherency scheme is tag based for inquiries and directory based for transfers.
Both the standard communications through the crossbar, and the separate CCI, are managed by
the service processor. This enables hard partitioning at the cell level.
With Windows 2003, the hard partitioning was static, in that the operating system had to be shut
down in order to modify the hard partitions. However, in Windows 2008, Microsoft implemented
the Windows Hardware Error Architecture (WHEA), which provides significant advances in how the
chipset and the operating system can manage hardware more dynamically. The name Windows Hardware Error Architecture is somewhat misleading, in that the capabilities of WHEA's interaction between the hardware and the operating system go far beyond simply handling errors. NEC took advantage of these advances and implemented
Dynamic Hardware Partitioning, which allows hot replacement and hot addition of components
(processors, memory, I/O and full cell boards) to a hard partition.
Sun Microsystems
Sun uses a bus architecture for the low-end and mid-range servers:

Sun Fire V100, V120 (UltraSPARC IIi), V210, V240, V250, V440 (UltraSPARC IIIi), V480, V880
and V1280 (UltraSPARC III)

Sun Fire V490 and V890 (UltraSPARC IV) and

Sun Fire V20z, V40z, X2100, X4100 through X4600 servers (AMD Opteron).
Therefore there is no hard partitioning for these servers.
Figure 8 Sun Fireplane Architecture
Sun uses a cross-bar architecture called the Fireplane for the mid-range and high-end servers:

Sun Fire E2900, E4900 and E6900 (UltraSPARC IV) and

Sun Fire 12K, 15K, E20K and E25K lines (UltraSPARC IV).
Each system board (in the E2900 server) or Uniboard (E4900 and above) has 4 processor slots,
memory slots and direct attached I/O slots. This is equivalent to an HP Alpha System Building
Block or HP Integrity cell board or IBM node. The system boards and Uniboards are switch-based,
where a pair of processors and their associated memory are connected by a switch, and the two
processor/memory pairs are connected to each other and to the I/O slots by a second level
switch. Each pair of system boards and Uniboards are connected to each other by an expander
board (where one system board or Uniboard is “slot 0” and the other is “slot 1”), and expander
boards are connected to each other by the Fireplane interconnect, which consists of another level
of three cross-bar switches:

Address Cross-Bar aka the “data switch”, which passes address requests from one system
board or Uniboard to another system board or Uniboard,

Response Cross-Bar aka the “system data interface”, which passes requests back to the
requesting system board or Uniboard, and

Data Cross-Bar aka “Sun Fireplane interconnect”, which transfers the requested data
between system boards or Uniboards.
Note that all remote memory accesses in a split expander take 13ns longer than the same remote
memory accesses in an expander where both slots are in the same domain.
Sun documentation uses the terms “3x3 switch”, “10x10 switch” and “18x18 switch”. This
terminology indicates the number of ports in each switch: 3, 10 or 18 ports. So, for example, the
18x18 switch connects 18 Uniboards in the E25K, for a total of 72 processors (18 Uniboards each
with 4 processors).
Note that there are four levels of switches involved in the high end systems:

“3x3 switch”, which connects a pair of processors and their memory in a system board or
Uniboard to each other,

“3x3 switch”, which connects two pairs of processors and their memory together into a
complete system board or Uniboard,

“10x10 switch”, (also known as an expander board) which connects two system boards
together, and

“18x18 switch”, which connects the expander boards in the E20K and E25K servers.
These switches allow partitioning for these servers, which Sun refers to as Dynamic System
Domains (DSD). DSD partitioning is done at the system board (slot) level. When both system
boards of an expander are connected to different domains, the expander is called a “split
expander”.
Figure 9 Sun Fireplane Partitioning
All components can be dynamically removed from a domain and added to a domain, similar to
the HP Dynamic nPar. Removing a system board or Uniboard from a domain consists of
preventing any new processes from being scheduled on the processors in the system board or
Uniboard, moving all data from the memory in the system board or Uniboard to memory in other
system boards or Uniboards, and switching all I/O to the alternate paths on other system boards
or Uniboards. When this has been accomplished, the system board or Uniboard is detached from
the domain, and repair operations can be performed on it. Adding a system board or Uniboard
to a domain is simply the process in reverse.
Note that a failed system board or Uniboard must have the system go through the hardware
validation process (equivalent to the Power On Self Test (POST) of other systems) in order to be
added to a running domain. This validation process requires that all running partitions be shut
down, as this is a system wide activity, and can take an extended period of time (12 minutes for
one configuration of a Sun Fire 4800). Migrating the running memory from the existing system
board or Uniboard to the added system board or Uniboard can take additional time (minutes).
Sun states that this offers "electrically fault-isolated hard partitions", but this differs significantly from what other vendors mean by the term. Sun emphasizes the "on-line add
and replace” (OLAR) features of the servers, and does not refer to any of the other capabilities of
hard partitioning. Therefore, because of this different focus, by the definitions as recognized by
this paper and the other vendors, Dynamic System Domains do not offer hard partitioning.
Unisys
Unisys uses a bus architecture for the blade and mid-range servers:

ES3105 (Intel Pentium Xeon),

ES3120, ES3120L, ES3140, ES3140L (Intel Pentium Xeon)
Therefore there is no hard partitioning on these servers.
Unisys uses either a cross-bar or a point-to-point architecture for the high end servers:

ES7000/one (Intel Itanium and Pentium Xeon)

ES7000/400 (Intel Itanium and Pentium EM64T)
The ES7000/400 servers are built around the Intel E8870 chipset, which has eight processor slots,
up to 64GBytes of memory and eight PCI-X I/O slots. A pair of pods is connected to the Cellular
MultiProcessing V1 (CMP) cross-bar switch, and up to two CMP cross-bar switches are connected
together to form a large SMP server. Unisys also adds up to 256MBytes of shared L4 cache for all
of the processors, and up to 12 additional PCI-X I/O slots. The ES7000/400 is built with two
power domains, such that the maximum configuration in a single partition is half of the entire
system, or four of these chipsets for a maximum of 16 processors, 256GBytes of memory and 32
I/O slots (using the 100MHz PCI-X cards) or 16 I/O slots (using the 133MHz PCI-X cards). The
ES7000/400 can also be partitioned into four partitions, each containing two of the chipsets. The
hard partitioning is at the chipset boundary. Note that the processor slots can either contain
Itanium or Pentium processors, but you cannot mix processor types in a single partition. Pentium
is supported for backwards compatibility, with Itanium being the premier offering.
The ES7000/one (previously known as the /400 and /600) server is built around the “cell”, which is
a 3U module with 4 processor slots, up to 64GBytes of memory and 5 PCI-X I/O slots. A pair of
cells is connected to the Computer Interconnect Module 600 (CIM600), and 4 of the CIM600’s are
connected together with the Cellular MultiProcessing V2 (CMP2) FleXbar point-to-point cabling to
form a large SMP server. Unisys also adds up to 384MBytes of shared L4 cache for all of the
processors, and up to 88 additional PCI-X I/O slots. Note that each cell can only be connected to
two other cells, so in a fully configured ES7000/one, there will be up to 3 hops to access memory
in another cell. The ES7000/one can be hard partitioned down to the cell, so the system can
support from 1 to 8 hard partitions.
Figure 10 Unisys Cellular MultiProcessing Architecture (8-processor pods connected through CIMs and a clock module to the CMP cross-bar switch, or via CMP2 FleXbar cabling)
Soft Partitioning
Hard partitions are done exclusively in the server hardware. The operating systems are unaware
of this, and treat all of the hardware that they see as the complete set of server hardware, and
require a reboot in order to change that view of the hardware. But some operating environments
on some servers also have the ability to further divide a hard partition into multiple instances of
the operating system, and then to dynamically move hardware components between these
instances without rebooting either of the running instances. These are called soft, or dynamic,
partitions.
One of the advantages of dynamic partitions is that they protect your environment against peaks
and valleys in your workload. Traditionally, you build a system with the processors and memory
for the worst-case workload for each of the operating system instances, accepting the fact that
this extra hardware will be unused most of the time. In addition, with hard partitioning becoming
more popular through system consolidation, each operating system instance in each hard
partition requires enough processors and memory for the worst-case workload. But in many
cases, you won’t need the extra hardware in all operating system instances at the same time.
Dynamic partitioning lets you move this little bit of extra hardware between partitions of a larger
system easily, without rebooting the operating system instance. For example, you can have the
majority of your processors run in the on-line processing partition during the day when the
majority of your users are on-line, automatically move them to the batch partition at night (that
was basically idle all day but is extremely busy at night), and then move them back to the on-line
processing partition in the morning, using a script that runs at those times of the day.
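A sketch of such a scheduled script is shown below. The partition names are invented, and the vparmodify syntax is indicative only; check the vPars (or, for Galaxy, the OpenVMS configuration utilities) documentation for the exact commands on your release.

```python
# Sketch of a cron-driven "follow the workload" script (illustrative only).
import subprocess
import sys

MOVES = {
    "evening": [("online_vpar", "batch_vpar", 4)],   # 18:00: give batch 4 CPUs
    "morning": [("batch_vpar", "online_vpar", 4)],   # 06:00: give them back
}

def move_cpus(src: str, dst: str, count: int) -> None:
    # Release the processors from the source partition, then add them to the
    # destination partition, all without rebooting either instance.
    subprocess.run(["vparmodify", "-p", src, "-d", f"cpu::{count}"], check=True)
    subprocess.run(["vparmodify", "-p", dst, "-a", f"cpu::{count}"], check=True)

if __name__ == "__main__":
    shift = sys.argv[1] if len(sys.argv) > 1 else "evening"
    for src, dst, count in MOVES[shift]:
        move_cpus(src, dst, count)
```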
This same functionality can be accomplished with hard partitions, but it requires that you reboot
the operating system instances in both of the partitions. This is often inconvenient or impossible
in production workloads, while dynamic partitioning does not require rebooting of either
operating system instance.
Soft partitioning functionality is as follows:
DLPARs (AIX)
  Move processors: Yes; Move I/O slots: Yes; Move memory: Yes; Share memory between partitions: No
Galaxy (OpenVMS)
  Move processors: Yes; Move I/O slots: No; Move memory: No; Share memory between partitions: Yes
Logical Domains (Solaris, T1 and T2)
  Move processors: Yes; Move I/O slots: No; Move memory: No; Share memory between partitions: No
vPars (HP-UX)
  Move processors: Yes; Move I/O slots: No; Move memory: Yes; Share memory between partitions: No
Figure 11 Soft Partitioning
IBM DLPARs
IBM’s Dynamic Logical PARtitions (DLPARs) on AIX (Advanced Interactive eXecutive) are an
extension of the Logical PARtitions (LPARs) that have previously been available under the eServer
iSeries and pSeries. LPARs are allocated processors, memory (in 256MByte chunks called
“regions”) and I/O cards, and these allocations are fixed at boot time and can only be modified by
rebooting the LPAR. DLPARs add the “dynamic” part, which is the movement of resources such as
processors, memory and I/O cards between DLPARs. Note that, as shown in the previous section,
the eServer iSeries and pSeries does not offer hard partitions, so IBM offers LPARs and DLPARs
instead. Moving resources between DLPARs involves cooperation between the server firmware
and the operating system. This server firmware is an optional upgrade and is called the “Power
Hypervisor”.
Each AIX instance has a set of “virtual resource connectors” where processors, memory and I/O
slots can be configured. When a resource is added to an instance, a message is sent by the
Hardware Management Console (HMC) through the Hypervisor to the operating system instance
that then connects that resource to one of the virtual resource connectors. The HMC keeps track
of which resources belong to which operating system instance (or no operating system instance, if
the resource is unassigned), and the operating system will receive an error if it attempts to assign
a resource without the cooperation of the HMC. When an operating instance releases a resource,
the Hypervisor initializes it to remove all existing data on the resource.
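The following toy model captures that ownership flow: the HMC is the single bookkeeper, the hypervisor carries the messages, and a released resource returns to the free pool only after it has been scrubbed. The class and method names are invented for this sketch and are not IBM interfaces.

```python
# Toy model of the DLPAR resource-ownership flow (illustrative only).
class HMC:
    def __init__(self):
        self.owner = {}  # resource id -> owning partition, or None if in the free pool

    def register(self, resource: str) -> None:
        self.owner[resource] = None

    def assign(self, resource: str, partition: str) -> None:
        if self.owner.get(resource) is not None:
            raise RuntimeError(f"{resource} already owned by {self.owner[resource]}")
        # Message flows HMC -> hypervisor -> partition, which attaches the
        # resource to one of its virtual resource connectors.
        self.owner[resource] = partition

    def release(self, resource: str) -> None:
        # The hypervisor initializes (scrubs) the resource so no data leaks
        # from the releasing partition to the next owner.
        self.owner[resource] = None

hmc = HMC()
hmc.register("cpu7")
hmc.assign("cpu7", "dlpar_prod")
hmc.release("cpu7")
hmc.assign("cpu7", "dlpar_test")  # legal again: the resource is back in the pool
```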
IBM has grouped all of the virtualization technologies under the “Virtualization Engine” umbrella,
which covers LPARs, DLPARs, micro partitioning, virtual I/O server and monitoring of the entire
environment. I will discuss the other technologies later.
HP Galaxy and vPars
HP’s Galaxy on OpenVMS uses the Adaptive Partitioned MultiProcessing (APMP) model to allow
multiple OpenVMS instances to run in a single server that can be either an entire server or a hard
partition. Each OpenVMS instance has private processors, memory and I/O slots. OpenVMS
Galaxy is unique among the soft partitioning technologies in that it can share memory between
partitions for any program to use (note that the Unisys shared memory is only for LAN
interconnect and is between hard partitions). The partition manager is used to set the initial
configuration of the Galaxy instances by creating and setting the values of console variables.
Processors can then be moved between instances either by the global WorkLoad Manager
(gWLM) software or by the partition manager moving the processor between the instances.
HP’s vPars on HP-UX allows multiple HP-UX 11i instances to run in a single server that can be
either an entire server or a hard partition on HP 9000 PA-RISC or Integrity cell-based servers.
Each HP-UX instance has private processors, memory and I/O slots. The partition manager is used
to set the initial configuration of the vPars. Processors and memory can then be moved between
instances by either gWLM or by the partition manager moving the resource between the
instances. Note that moving an active processor from one vPar to another takes under a second,
but moving active memory from one vPar to another takes longer, due to the need to drain the
memory of the active information in it, and then to cleanse the memory for security purposes: this
can take several minutes.
Whether it is done by the partition manager, by the HP-UX or OpenVMS instance, or by the
gWLM tool, the operation is the same: stopping the processor in one instance, re-assigning it to
another instance, and then starting the processor in the new instance. The function of stopping
and starting the processor has been in the respective operating systems for years: what is new is
the ability to re-assign the processor between instances of the operating systems.
Both HP-UX and OpenVMS offer tools to dynamically balance processors across dynamic
partitions. As previously mentioned, both HP-UX and OpenVMS Work Load Manager (WLM) and
Global Work Load Manager (gWLM) work alongside the Process Resource Manager (PRM) to offer
goal-based performance management of applications. Processors can be moved between vPars
in order to achieve balanced performance across the larger system. OpenVMS also offers the
Galaxy Balancer (GCU$BALANCER) utility, that can load balance processor demand across multiple
Galaxy instances. Both of these work with either clustered or un-clustered instances.
HP has grouped all of the virtualization technologies under the “Virtual Infrastructure” and
“Virtual Server Environment” umbrella, which covers hard partitioning, soft partitioning, clustering,
pay-per-use and instant capacity. I will discuss the variable capacity technologies later.
Sun Microsystems
Observant readers will note that Sun Dynamic System Domain (DSD) is not present in the table,
even though it does allow some movement of components between domains. As I discussed in
the previous section, DSD is more of an “on-line add and replace” (OLAR) capability than a
partitioning technology. Sun documentation does not discuss the dynamic movement of
resources in the same way as both HP and IBM do, so I have chosen to follow their emphasis and
not include DSD as a dynamic partitioning technology.
Sun recently introduced Logical Domains (LDoms). Sun states that they fit between the hard
partition capability of Dynamic System Domains and the resource partitions of Zones. LDoms are
bound to a particular set of processor and memory resources, and do allow a specific domain to
directly access a specific I/O card or a specific port on a multi-port I/O card, which is very similar
to an HP vPar or an IBM LPAR. However, they virtualize the physical hardware in a way more
closely related to micro partitions: LDoms are assigned virtual processors which are threads inside
the cores of a T1 (Niagara) processor, and they share the I/O devices between domains, which is
not the way any of the other soft partitioning products from the other vendors work. Virtual
processors can be moved between LDoms without rebooting, but memory and I/O resources
cannot be moved. To get around this restriction, Sun introduced the concept of “delayed
reconfiguration”, which means that the changes will occur the next time the LDom is rebooted.
LDoms are only available on the T1000 and T2000 servers, and only for Solaris 10. Because they
can run on a single thread of a single core in a processor, there can be up to 32 LDoms in a single
processor T1000.
Sun has characterized LDoms as soft partitions, while it seems to me that they are more similar to
micro partitions. Because they are a mix of technologies, I have placed LDoms in both this section
and the micro partitioning section.
Micro Partitioning
Hard partitioning and soft partitioning both operate on single components of the server
hardware: a processor, a bank of memory, an I/O slot, etc. And while this is quite useful, the
incredible advances in server performance frequently leads to server consolidation cases where an
older application may require far less than an entire component of the server hardware. For
example, an application that performs quite adequately on a 333MHz Pentium II with
10MByte/sec SCSI-2 and 10Mbit Ethernet, simply does not need all of the power of a 3.2GHz
Pentium Xeon quad core processor with a 4Gbit FibreChannel and GigaBit Ethernet card. But if
the original server needs to be retired (perhaps because parts are no longer available, or the
hardware maintenance is becoming increasingly expensive), then that faster hardware might
actually be the low end of the current server offerings.
One popular option has been to stack multiple applications into a single operating system
instance. This works very well for many operations, because it shares the hardware among the
different applications using standard process scheduling and memory utilization algorithms in the
operating system, which work extremely well. And in the cases where they do not, the standard
resource management tools can be very effective. Further, this can be extremely cost effective,
because it amortizes the cost of the hardware and software across the multiple applications,
allowing a few application or database software per-processor licenses to serve many different
applications. I will discuss this “resource partitioning” later in the paper.
However, this is not a universal panacea. The different applications frequently have different
operational requirements. Most often, this comes in the form of operating system or application
version or patch levels, where one application needs a more recent version while another
application needs to remain on an older version. One application might require system downtime
for maintenance at the same time another application is experiencing a mission-critical peak time.
Balancing all of these competing requirements is often difficult, and frequently impossible.
The solution is to divide up the extremely fast server components, and let multiple operating
system instances timeshare each processor, storage card or network card: in effect, to partition
the individual components in the same way we have partitioned the server in the previous cases.
This is called micro partitioning.
It works the same way as hard and soft partitioning, but instead of allocating whole units of server
components to each operating system instance, we timeshare the components between instances
of the operating system instance. The allocations are generally dynamic based on current load, in
the same way that timesharing allocates processor, memory, storage and network bandwidth
dynamically inside a single instance of an operating system in a server. But, again in the same
way, there are tools that can guarantee specific allocations of resources to specific operating
system instances. I will discuss the details of these guarantees later.
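As a simple illustration of how such a guarantee can work, the sketch below grants each guest its entitlement first and then shares any spare cycles in proportion to those entitlements. Real hypervisors dispatch on much finer time scales, and the numbers and names here are invented.

```python
# Entitlement-based dispatch for one processor over one interval (sketch).
def dispatch(entitlements: dict, demand: dict, capacity: float = 1.0) -> dict:
    """Split one processor's worth of cycles among guest instances."""
    # First satisfy each guest up to the smaller of its entitlement and its demand.
    grant = {g: min(entitlements[g], demand[g]) for g in entitlements}
    spare = capacity - sum(grant.values())
    # Then hand spare capacity to guests that still want more, weighted by entitlement.
    wanting = {g: demand[g] - grant[g] for g in grant if demand[g] > grant[g]}
    weight = sum(entitlements[g] for g in wanting)
    for g in wanting:
        if weight > 0 and spare > 0:
            grant[g] += min(wanting[g], spare * entitlements[g] / weight)
    return grant

print(dispatch({"guest_a": 0.5, "guest_b": 0.3, "guest_c": 0.2},
               {"guest_a": 0.9, "guest_b": 0.1, "guest_c": 0.2}))
# {'guest_a': 0.7, 'guest_b': 0.1, 'guest_c': 0.2}
```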
The benefits of micro partitioning are many. The operating system instance can remain as it
always was, with the same network name and address, accessing the same volumes, on the same
patch level of the operating system and application, and in general not affecting (or being
affected by) the other operating system instances. Many operating system instances can be
hosted on a very small number of servers, which are treated as a “pool” of resources to be
allocated dynamically among all of the operating system instances.
Micro partitioning separates the physical system from the operating system instances (called the
“guest” system instances) by a virtual intermediary (called the “host” system instance). The “host”
instance interacts directly with the physical server hardware components, controlling and working
with those components in the normal way that operating systems control and work with server
hardware in a non-hosted environment. The difference is that the sole function of this host is to
pretend to be the server hardware to the instances of the guest operating systems.
Each guest instance operates as if it were directly interacting with the server hardware: it runs on
the processor, it uses physical memory, it performs I/O through the network and storage I/O
cards, etc. But in reality it is merely requesting some percentage of the underlying server
hardware components through the host instance which is pretending to be that underlying server
hardware.
Figure 12 Micro Partitioning Types (thin host: guest system instances on a virtualized intermediary running directly on the physical system; fat host: guest system instances on a virtualized intermediary layered on a host OS, which may also run its own applications)
The host instance can be a dedicated environment (a “virtual machine monitor” or “hypervisor”)
whose only function is to emulate the underlying hardware, classically known as Type 1 hypervisor
but which I will call the “thin host” model. Or it can be a set of services on a standard operating
system, classically known as a Type 2 hypervisor but which I will call the “fat host” model. This
terminology is a conscious copy of the “thin client”/”fat client” model. Note that hypervisors were
first implemented in 1967 in the IBM System/360-67: like everything else in the virtualization
world, they have a long history but only recently have received the attention they deserve, as the
processors and I/O sub-system have become fast enough, and the memory has become dense
(and cheap) enough, to allow them to be productive and cost-effective outside of the mainframe.
In the fat host model, the operating system can either run its own applications in addition to the
guest instances, or it can be stripped down to remove most of its services in order to enhance the
performance and reduce the overhead of the guest instances. Stripping it down also has the
benefit of reducing the number of targets for virus attacks, making the host instance more secure
and reliable.
One of the features of the “fat host” model is that it can offer services to one or more of the guest
instances, such as sharing devices like DVDs or tape arrays. The “thin host” model cannot
generally do this, because it does not offer sufficient locking primitives to the guest instances
which are needed to coordinate access to shared devices or files.
One interesting behavior of Type 1 (“thin host”) hypervisors is that they tend to become “fatter”
over time. The IBM hypervisor running on the Power4 processors was extremely thin, much more
like the Microsoft Hardware Abstraction Layer (HAL) than like a VMware ESX environment. But
the Power5 and 5+ versions have added virtual I/O capabilities, processor dispatch to enable the
movement of physical processors between guest instances, and reliability-availability-serviceability (RAS) enhancements in response to customer demands. This “feature creep” has
added to the overhead of the host instance, which has been partly offset by the constantly
increasing speed of the underlying server hardware. But some vendors are recognizing the
problems of this, and are offering the option of extremely thin hypervisors, which are actually
embedded into the hardware itself. I will discuss this later in the paper.
In either model, the guest instances are unaware of the existence of the host instance, and run
exactly as if they were running on the “bare iron” server hardware. The guest instances generally
communicate with other instances of the operating systems (whether they are guest instances on
the same physical server, guest instances on another server, or standalone on yet another server)
through the standard methods of networking and storage. Therefore, standard business level
security is the same as if these instances were running on separate systems, and no change to the
application is required.
All vendors provide tools which convert standalone bootable instances of operating systems
(primarily the boot disk) into a format which is suitable to boot the operating system as a guest
instance. In many cases this is a container file which holds the entire guest instance. This process
is known as “P2V”: physical to (2) virtual. The container file is completely portable between virtual
machine monitors of that type, though in general you cannot move a container file between
virtual machine monitors from different vendors. In other cases, this is a standard bootable disk
with a small information file needed by the host instance to boot the guest instance.
The container file and information file concept leads to one of the major advantages of micro
partitioning, in that some virtual machine monitors allow the guest instances to be moved
dynamically between servers. A guest instance can be suspended and then effectively “swapped
out” of one server, its current memory image stored in the container file or disk, and then
“swapped in” and its operation resumed on another server, with all running programs and context
fully intact. Some host instances can also perform the transfer of the in-memory information
using standard networking, which is often faster than saving that information to the container file
and then reading it back in on the target guest instance. Either way, the applications will not be
executing instructions while this is occurring, but the operation is usually far faster and easier than
shutting down the instance and bringing it up on another server, which would be required if the
operating system instance was running standalone on a server.
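To make the sequence concrete, the following Python sketch (the class and function names are purely illustrative, not any vendor's actual interface) shows the suspend, transfer and resume steps, and why the application-visible pause is essentially just the time needed to copy the state:

import time


class Host:
    def __init__(self, name):
        self.name = name
        self.guests = {}          # guest name -> restored guest

    def suspend(self, guest):
        guest["running"] = False  # the guest executes no instructions from here on

    def capture_state(self, guest):
        # In a real monitor this is the guest RAM image plus vCPU/NIC/disk context.
        return {"memory": guest["memory"], "devices": guest["devices"]}

    def restore_state(self, guest, state):
        guest["memory"], guest["devices"] = state["memory"], state["devices"]
        self.guests[guest["name"]] = guest

    def resume(self, guest):
        guest["running"] = True   # the guest continues exactly where it left off


def migrate(guest, source, target, transfer):
    """Move a suspended guest between hosts; `transfer` copies the state either
    into a container file on shared storage or over the network."""
    source.suspend(guest)
    pause_start = time.monotonic()
    state = source.capture_state(guest)
    transfer(state, target)                  # the only step whose cost varies
    target.restore_state(guest, state)
    target.resume(guest)
    return time.monotonic() - pause_start    # application-visible downtime


if __name__ == "__main__":
    guest = {"name": "web01", "memory": b"\x00" * 1024, "devices": {}, "running": True}
    downtime = migrate(guest, Host("serverA"), Host("serverB"),
                       transfer=lambda state, host: None)  # stand-in for the copy
    print(f"guest paused for {downtime * 1e6:.0f} microseconds")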
Note that this is done at the operating system level, not the application level. The virtual machine
monitors on the separate servers take care of transparently transferring all of the network and
storage connections between the servers, such that the guest instance never notices that anything
has occurred. This is similar to failing over application instances between different nodes in a
cluster, but is often more capable of maintaining current user connections and in-flight
transactions.
There are restrictions and operational requirements on this movement, but with proper planning
it does allow an operating system (guest) instance to be active for longer than any of the
underlying server systems have been active, which represents significant advantages in terms of
achieving high availability of applications while still performing the required maintenance on the
servers.
One of the restrictions is that vendors frequently require extremely similar hardware on the source and target servers. Intel Flex Migration allows newer processors to fool the hypervisor into believing that successive generations have the same feature set as earlier generations, so that vMotion/LiveMigrate can move guests across multiple generations of processors.
It also has another advantage even in normal production when the hardware is operating
properly, which is the case almost all of the time, since today every vendor’s server hardware is
extremely reliable. In the traditional model of load balancing, such as in soft partitions, we would
move the resources to the workload. But with the ability to move guest instances between host
instances, we can move the workload to the resources. So if you have a group of servers which
are running virtual machine monitors, you can move guest instances between them in order to
achieve load balancing among a group of servers, instead of (with soft partitioning) just inside a
single server.
Micro partitioning is not without cost. When comparing a standalone instance of an operating
system and application to that same operating system and application running as a guest
instance in a micro partitioned environment, a performance overhead penalty of 20% to 40% is
common. The reasons for this are many, and include the re-direction needed to execute
privileged instructions, competition for resources with the other guest instances, and the simple
overhead of the host instance. Further, even when there is no resource contention, each
operation (usually I/O operations) will suffer from additional latency compared to the standalone
model.
One way to think about this is to recognize that all partitioning schemes involve some degree of
lying to the operating system instances. And just like when people lie, the greater the size of the
lie, the greater amount of effort needed to support the lie. In the case of hard and soft
partitioning, the lie is not that extensive, so the amount of effort (aka, overhead) is small. But in
the case of micro partitioning, where the guest instance is completely isolated from the
underlying hardware and the host instance (whether a thin host virtual machine monitor or a fat
host operating system with micro partitioning services) is lying about every interaction the guest
instance has with the hardware, the amount of effort (aka, overhead) to support the lie is high.
So micro partitioning is not appropriate when the performance of the guest instance is critical
and would require all of the performance of the underlying server hardware. Further, you must
consider the impact of one guest instance on another guest instance, such that if two guest
instances both require a large percentage of the underlying server hardware at the same time, the
resources might not be available to satisfy both of them.
One factor that often arises is time. Most operating systems keep track of time by counting timer
interrupts or processor clock ticks. For example, Windows sets a 10 milli-second clock timer
which will fire 100 times per second, in order to track timer events, though in some cases it will
reset this to a 1 milli-second timer. And this works well when running on bare iron hardware. But
when these interrupts are themselves being shared among many other guest instances, it is
extremely common to see “clock drift”, where the times reported by the guest instances are
slightly behind each other (and by different amounts), and significantly behind the actual time.
Some operating systems notice that this is occurring, and attempt to adjust for this drift by
making up for the estimated number of clock ticks which have been missed. While this is
sometimes effective, if the estimation is wrong and the operating system applies too much
correction, it can lead to the opposite problem: the operating system getting ahead of real time.
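As a rough illustration of why this correction is fragile, the following sketch (illustrative only; real guests use hardware counters rather than these simple numbers) shows a lost-tick compensation routine and how an over-estimate pushes the guest clock ahead of real time:

TICK_MS = 10  # a Windows-style 10 milli-second periodic timer


def on_timer_interrupt(guest_clock_ms, reference_time_ms):
    """Advance the guest clock on each delivered tick, compensating for ticks
    the hypervisor could not deliver while this guest was descheduled."""
    lag_ms = reference_time_ms - guest_clock_ms
    missed_ticks = max(0, round(lag_ms / TICK_MS) - 1)  # an estimate, which may be wrong
    return guest_clock_ms + TICK_MS + (missed_ticks * TICK_MS)


# If the estimate overshoots (for example, part of the lag was measurement noise),
# the guest clock ends up ahead of real time instead of behind it.
print(on_timer_interrupt(1000, 1050))  # 5 ticks behind: corrected to 1050
print(on_timer_interrupt(1000, 1055))  # noisy lag: "corrected" to 1060, now ahead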
Different virtualization products handle this in different ways. One of the most common is to use
the Network Time Protocol (NTP) in the guest instance on a frequent basis to ensure that the
clock is accurate. There are many white papers and other technical documentation which discuss
how to handle this problem, and I have listed some of them in the appendix. I particularly
recommend the EMC VMware paper, as it gives a solid background on the different hardware
clocks and how they interact with various operating systems, primarily Linux and Windows.
Note that the processors and memory are not always the point of contention. Processors have
become so fast (as Moore’s Law predicted decades ago) and memory has become so large (due
to 64-bit technologies becoming available on every server platform) that in very many cases the
bottleneck has moved to the I/O subsystem. The overhead of the TCP/IP stack and the
requirement to move very large blocks of data from the SCSI and FibreChannel cards to physical
memory, while coordinating the I/O between the many guest instances across a single channel, is
very expensive.
And the expense is not bandwidth: guest instances can communicate with each other over the
host instance’s memory, and can communicate between servers using GigaBit Ethernet (soon to
be 10 GigaBit Ethernet), 4Gbit FibreChannel and even Infiniband. And all of these I/O
technologies can be “trunked” together to provide massive throughput. No, the expense in most
cases is latency, where the guest instances must communicate locking and other information in
messages which are very small, but each message incurs latency of 20,000-50,000 nano-seconds
instead of the latency of local memory which is 200-500 nano-seconds. All processors wait at
exactly the same speed, and with each message taking approximately 100 times as long to reach
its destination, you should carefully consider this I/O contention when evaluating how many guest
instances to place on a single host.
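A quick back-of-the-envelope calculation with the figures above shows where the factor of roughly 100 comes from:

local_memory_ns = (200, 500)             # local memory reference
inter_guest_message_ns = (20_000, 50_000)  # small lock/coordination message

for local, remote in zip(local_memory_ns, inter_guest_message_ns):
    print(f"{remote / local:.0f}x slower than local memory")  # prints 100x, 100x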
But often more important than the overhead is the business cost of the failure of a single server.
If only one application was running on that server, the impact of the failure is limited to that
application (though that impact might be severe, it is still only one application). But if 10, 20 or
even 100 different application instances were running on a single server, the impact would be
more wide-spread and could be much more significant. This is another example of the classic
“scale up” vs. “scale out” discussion. In some cases it will make more sense to have a fewer
number of larger and more reliable servers with many guest instances on each server (putting all
your eggs in a few baskets, and then watching those baskets extremely carefully), and in other
cases it will make more sense to have a larger number of smaller and cheaper servers with fewer
guest instances on each server. There is no single right answer, and you need to consider all the
factors carefully when planning for micro partitioning.
Para-Virtualization
One of the difficulties of virtualizing the standard processors from AMD and Intel is that the
operating systems use “ring 0” (the most privileged mode) for the functions such as direct
memory access (DMA) for I/O, mapping page tables to processes, mapping virtual memory to
physical memory, handling device interrupts, and accessing data structures which control the
scheduling and interaction of the different user processes. (Interesting historical note: the Digital
Equipment Corporation VAX was the first processor to use four ring levels, and most succeeding
processors have followed this example, including Itanium, Opteron and Pentium). Ring 1 and ring
2 are used for varying levels of privileged functions, while applications run in ring 3 (the least
privileged mode). But if the virtual machine monitor runs in ring 0, then a guest instance cannot
also run in ring 0, because it would interfere with the virtual machine monitor and represent
security problems, where a guest instance could accidentally map the physical memory which the
host instance had allocated to another guest instance, or respond to a device interrupt from
another guest instance.
This problem was first addressed in software and only more recently in hardware.
There are two methods that the software has for dealing with the problem of both the guest
instance and the host instance each requiring exclusive access to ring 0: either the virtual machine
monitor intercepts all highly privileged operations, or the developers of the guest operating
system modify their source code to notice that they are running as a guest instance and therefore
call the virtual machine monitor instead of executing those instructions directly. The first method
is the one which is used by most early virtualization products, while the second method was
pioneered in the Linux market with Xen but has since moved to the mainstream for almost all
major operating systems.
In addition to this interception, it is possible to use “ring compression” to run these operating
systems as a guest instance. Because most operating systems did not use all of the rings
(specifically, they used only ring 3 and ring 0), the hypervisor could “de-privilege” the operating
system, forcing the guest instance to run in rings 3 and 1 instead of rings 3 and 0. Only the
hypervisor runs in ring 0, and all privileged operations are executed by the hypervisor. This is very
slow, especially when those calls occur in a loop, such as changing the page protection on a long
list of pages, or mapping a long list of virtual addresses to physical addresses. If each of these
operations needs to be intercepted individually, sent to the host instance for execution, and the
guest instance resumes only to instantly have the next instruction intercepted as it performs the
next operation in the list, performance will suffer dramatically.
Para-virtualization addresses this by modifying the operating system to notice at run time whether it is running as a guest instance, and to make these calls to the hypervisor directly.
This is much more efficient, but requires some assistance from the processor itself in order to fully
isolate the guest instance and the hypervisor, as well as allow extremely fast communication
between them. Without this hardware assistance, the hypervisor is forced to use a combination of
emulation, dynamic patching and trapping in order to have the guest instance execute properly,
all of which has very high overhead. This is where hardware virtualization comes in.
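A toy cost model (the cycle counts below are purely illustrative, not measurements) shows why trapping each privileged instruction in a long loop is so much more expensive than a single para-virtualized hypercall that batches the whole list:

TRAP_COST = 1000       # cycles to intercept one privileged instruction, enter the
                       # host, emulate the instruction, and resume the guest
HYPERCALL_COST = 1000  # cycles to enter and leave the hypervisor once
PER_PAGE_WORK = 50     # cycles of actual page-table work per page


def full_virtualization(pages):
    # every privileged operation traps individually
    return pages * (TRAP_COST + PER_PAGE_WORK)


def para_virtualization(pages):
    # the modified guest batches the whole list into one hypercall
    return HYPERCALL_COST + pages * PER_PAGE_WORK


pages = 10_000
print(full_virtualization(pages) / para_virtualization(pages))  # roughly 20x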
Because of these requirements, demand grew to address the problem in hardware, specifically in
the processor itself. AMD, IBM, Intel and Sun have all placed virtualization technology into their
physical processors. AMD has the Pacifica project, IBM added virtualization technology to the
Power4 and all subsequent processors, Intel has the Vanderpool project which implements
Virtualization Technology for Itanium (VT-i) and Pentium Xeon (VT-x), and Sun has implemented
the “hyper-privileged level” in the T1 (Niagara) processor.
AMD, VT-x and Sun’s T1 work by creating an extra ring level, below ring 0, which Intel calls VMX
Root, or “ring -1” (ring minus one). This allows the guest instances to actually run in ring 0, and
not require de-privileging the operating systems, either during booting or during development.
The processor itself traps the privileged instruction or I/O operation and gives it to the virtual
machine monitor, instead of executing it directly. This allows much higher performance of the
guest operating systems, by reducing the overhead of transferring control between the guest and
host instances whenever one of the privileged instructions are executed, as well as making it
easier (or even possible) for standard editions of operating systems to run as guest instances.
IBM’s POWER4 processor introduced support for logical partitioning with a new privileged
processor state called hypervisor mode, similar to ring -1. It is accessed using hypervisor call
functions, which are generated by the operating system kernel running in a partition. Hypervisor
mode allows for a secure mode of operation that is required for various system functions where
logical partition integrity and security are required. The hypervisor validates that the partition has
ownership of the resources it is attempting to access, such as processor, memory, and I/O, then
completes the function. This mechanism allows for complete isolation of partition resources. In
the POWER5/5+ and POWER6 processors, further design enhancements are introduced that
enable the sharing of processors by multiple partitions.
Intel introduced virtualization technology to Itanium (VT-i) starting with the “Montecito”
processor slightly differently. A new processor status register (PSR.vm) was created, where the
host instance runs with PSR.vm =0 while guest instances run at PSR.vm =1. All 4 privilege levels
can be used regardless of the value of PSR.vm. When PSR.vm =1, all privileged instructions and
some non-privileged instructions cause a new virtualization fault. PSR.vm is cleared to 0 on all interrupts, so
the host instance handles all interrupts, even those belonging to guest instances. The host
instance will handle either the privileged operation or the interrupt, and then set PSR.vm =1 and
use the “rfi” instruction to return control to the guest instance.
These technologies are commonly grouped under the common name “Hardware Virtual Machine”
(HVM), and I will use this term to refer to all of them.
Note that as the percentage of processors with this HVM technology in the market has increased (and this has occurred, such that all general purpose processors now have this ability), and as the percentage of operating system instances running in micro partitions has increased (this has also occurred, with virtually every customer implementing some form of server consolidation), the need for para-virtualization has increased.
As stated above, para-virtualization requires the operating systems to detect whether they are
running as a guest instance, and either to execute the code directly (if they are running on bare
iron) or to call the virtual machine monitor (if they are running as a guest instance). These checks
will slow the system down very slightly when it runs as a standalone instance, but speed it up dramatically when it runs as a guest instance. As the percentage of operating system instances operating in
micro partitions continues to grow, this trade-off will be well worth the extremely minor cost.
In 2005, VMware proposed a formal interface for the communication between the guest and host
instances, called the Virtual Machine Interface (VMI). Xen has proposed a different formal
interface for this communication, called the “hyper-call interface”, but VMware and Xen are
working together to make the hyper-call interface compatible with VMI by using para-virtualization optimizations (paravirt-ops).
As stated above, Linux led the way with para-virtualization, but all major O/S vendors have
implemented it in their base operating systems. AIX in SPLPARs, HP-UX in HPVM (especially with
the Advanced Virtual I/O system), OpenVMS in HPVM (in a future major release), and Solaris 10 in
either xVM or LDOMs (depending on the processor), all incorporate para-virtualization to some
level, with further enhancements almost certain. Microsoft is using the Xen hyper-call interface in
Windows 2008, and is calling the run-time detection “enlightenment”.
So when you are deciding on a micro-partitioning solution, you need to verify whether your
particular version of your particular guest operating system is running fully virtualized (where
nothing needs to change in your guest environment, but performance will be relatively slow), or
whether it is para-virtualized (where you need a very recent version of the operating system, but
performance will be relatively good), and whether you need to use the latest processors which
contain HVM as part of the mix. If you always buy the latest and greatest for your entire
environment, para-virtualized guests running on processors with HVM will be very easy to find: in
fact, it will be difficult to get away from. But very few customers are able to refresh their entire
infrastructure just to move to the latest and greatest hardware and software, no matter how much
the systems and application vendors might wish them to do so. So you need to carefully match
your environment to the actual capabilities of the solution, and verify the type of virtualization
that is present in your micro-partitioned solution.
Dynamic Capacity
There are two different meanings of dynamic capacity for micro partitioning: modifying the
capacity of the host instance and modifying the capacity of the guest instance.
The host instance can modify the amount of resources available to it by a variety of mechanisms.
A physical resource can have an increasing number of soft errors such that the host instance
operating system will remove it from the resource set in a graceful way without a crash or a
reboot. A physical resource can be added to the environment via hot-add, and the host instance
operating system can add it to the available resources. Some operating systems support variable
capacity programs, such as IBM’s “On/Off Capacity On Demand” or HP’s “Temporary Instant
Capacity”, which allow physical resources to be activated temporarily for peak processing times.
No matter how it occurs, the virtual machine monitor will allocate these new resources to the
guest instances, as follows.
Physical resources can be allocated for the exclusive use of a guest instance, such that they are
not accessible by any other guest instance. This is always done for memory, but is also useful for
network or storage adapters either for security reasons or to ensure a high level of performance.
Because the guest instances usually cannot dynamically modify their virtual processor and
memory resources, the guest instances will not take advantage of the new processors and
memory in an exclusive way. However, new network and storage adapters can be added to the
guest instances in the standard way that host instances recognize new paths.
But in most cases, the guest instances are allocated resources by the virtual machine monitor as
virtual resources. In the case of processors, you specify a number of virtual processors for the
guest instance, which can exceed the number of physical processors on the server, and the virtual
processors are time-shared among the physical processors. In the case of network or storage
adapters, you specify virtual paths which are then mapped to physical network or storage
adapters or to virtual network adapters inside the virtual machine monitor. Access to the virtual
resources is done by either a priority scheme or by a percentage of the physical resources.
Priority schemes work exactly as they do among processes inside a single instance of an
operating system, and do not guarantee that a specific amount of resources will always be
available to a specific guest instance. The percentage mechanism does specify this type of
guarantee, and these percentages can be either “capped” or “uncapped”. A guest instance with
capped percentages can access (for example) 50% of the physical processor resources but never
any more, whether or not the physical resources are being used or idle. A guest instance with
uncapped percentages is guaranteed that (for example) 50% of the physical processor resources
are available to it, but if the physical processors are idle it can use as much as is available. The
above examples are for processors, but are equally applicable to storage and network adapter
bandwidth.
Note that the percentages are in terms of percentage of resources, not in terms of physical
resources. 50% processor time means that the guest instance uses half of the processing time of
the physical processors, such that if there are two processors it is equivalent to dedicating an
entire physical processor to the guest instance. But the guest instance will execute on either of
the physical processors: it is the percentage that is guaranteed, not access to a specific processor.
Therefore, if the host instance modifies the amount of physical resources available to it (such as
by the methods discussed above for the host instance), the percentages and priorities are
computed based on the new amount of physical resources. Thus, if a new processor is added to the physical server and the guest instances are running uncapped, they will see an increase in their performance.
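The following sketch (illustrative only; each vendor's tooling expresses this differently) shows how percentage entitlements translate into processor-equivalents under the capped and uncapped rules, and why adding a physical processor benefits the guests:

def cpu_share(entitlement_pct, physical_cpus, capped, idle_capacity_pct=0.0):
    """Return the processor time (in CPU-equivalents) a guest may consume,
    where the entitlement is a percentage of total physical processor time."""
    guaranteed = entitlement_pct / 100.0 * physical_cpus
    if capped:
        return guaranteed
    # uncapped: the guaranteed share plus whatever capacity is currently idle
    return guaranteed + idle_capacity_pct / 100.0 * physical_cpus


print(cpu_share(50, 2, capped=True))                         # 1.0 CPU, and never more
print(cpu_share(50, 2, capped=False, idle_capacity_pct=25))  # 1.5 CPUs while others are idle
print(cpu_share(50, 3, capped=False))                        # guarantee grows to 1.5 CPUs after a CPU is added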
The guest instances may be able to modify their virtual resources (which I refer to as “dynamic
resources”) or their percentages dynamically, or they may require a reboot of the guest instance
to modify these values. Further, the computation of percentage of resources may be spread
among all of the physical resources (in which case the percentages may exceed 100%), or a virtual
resource may be specified in terms of physical resources (in which case the percentage cannot
exceed 100%). See the specific discussions below for these details.
I/O Models
There are two different methods for I/O in micro partitioning available today: I/O done by the
virtual machine monitor, and I/O done by a dedicated partition.
Figure 13 Micro Partitioning I/O Types (left, the “Virtual Machine” model: each system instance's device drivers talk to the virtual I/O, virtual CPU and virtual memory presented by the virtualization intermediary, which owns the native I/O, CPU and memory of the physical system; right, the “Dedicated Partition” model: the guest instances' front end drivers pass requests to back end drivers and an I/O manager running in a dedicated partition, while the virtualization intermediary provides only virtual CPU and memory)
In the first case, the host instance isolates the server hardware from the guest instances
completely, such that the guest instances are unable to access the server hardware except
through the host instance. All device drivers in the guest instances pass their requests to the
virtual I/O drivers in the host instance, which then controls the hardware directly. This has the
advantage that the guest operating systems can run un-changed, without even being aware that
they are not running on native hardware. This helps in server consolidation of older operating
systems which cannot be modified because they are no longer being actively developed, or are
out of support.
In the second case, a dedicated partition (which IBM calls the “Virtual I/O Server”, which came
from the mainframe technology of the same name, and which Linux calls Domain Zero (Dom0))
handles all of the I/O, without having to pass through the host instance. Each of the guest
instances performs all I/O by passing its request as a message to one of the other guest instances,
which then performs the I/O directly and passes the results back to the original guest instance as
a message. This guest instance requires dedicated physical resources such as memory and
processors, since it is responsible for handling the I/O interrupts for the entire environment. The
advantage of this is that the virtual machine monitor can be extremely small and efficient, since it
does not need any drivers for either storage or networking. All of this is handled by the guest
instance which is handling the I/O and managing the entire environment. The disadvantage of
this is that the guest instance doing the I/O can become overwhelmed with the requests from the
other guest instances, but this can be handled like any other performance issue. Further, a single I/O takes two trips through the host environment: once from the guest instance to the Virtual I/O Server, and once back from the Virtual I/O Server to the guest instance. Finally, the guest instance doing the I/O represents a single point of failure: no I/O can take place while that guest instance is failing and re-starting. This can be avoided by having two virtual I/O server guest instances, but this will consume extra physical resources, leaving less for the application server guest instances.
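The following sketch (purely illustrative; it is not any vendor's driver interface) shows the message flow in the dedicated-partition model, with a front end driver in the guest and a back end driver in the I/O partition that is the only code touching the physical device:

import queue
import threading

request_ring = queue.Queue()   # guest -> I/O partition
response_ring = queue.Queue()  # I/O partition -> guest


def front_end_read(block_number):
    """Runs in the guest instance, which has no access to the physical hardware."""
    request_ring.put({"op": "read", "block": block_number, "from": "guest1"})
    return response_ring.get()            # first trip out, second trip back


def back_end_service(physical_disk):
    """Runs in the dedicated I/O partition, which owns the physical device."""
    req = request_ring.get()
    data = physical_disk[req["block"]]    # the only place real I/O happens
    response_ring.put({"to": req["from"], "data": data})


disk = {7: b"hello"}
threading.Thread(target=back_end_service, args=(disk,), daemon=True).start()
print(front_end_read(7))  # {'to': 'guest1', 'data': b'hello'}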
In either case the host instance or the Virtual I/O Server can mask the specific I/O devices from
the guest instances by presenting “standard” hardware to those guest instances. For example, a
guest instance could believe it is communicating over a standard 100Mbit Ethernet card and
standard SCSI-2 card (for which it has the standard drivers), while the virtual machine monitor or
guest instance handling the I/O is actually using very high-speed GigaBit Ethernet and
FibreChannel cards, which the guest instances do not yet support. This allows much older
operating system and application versions to continue to run without modifications or upgrades.
One corollary of this device virtualization is that many of the standard tools which operating
systems offer for higher performance or reliability are not supported or recommended when
running those operating systems as guest instances. For example, HP-UX does not support Auto
Port Aggregation for networking or SecurePath for storage when run as a guest instance. And
this makes sense, because all of these kinds of products believe that they are working directly
with specific hardware channels, which is not true when running as a guest instance.
Most of the micro partitioning products have the ability to create a private network among the
guest instances, allowing very high speed communications with standard networking protocols
between those guest instances through the host instance without passing through the Ethernet
cards. These private networks are known as “virtual switches”.
The following shows the virtualization capabilities of the major micro partitioning products on the
market today (the “p” or “f” in the “System Instance” column indicates whether the guest instance
is capable of running in a para-virtualized environment or a fully virtualized environment):
Product | Hardware Platform | Host Model | I/O Done By | System Instance O/S's (p/f) | Private Network
Citrix XenServer | AMD Opteron, Intel Pentium | Thin: Linux | Dedicated partition | Linux (p), Windows (f) | Yes (internal)
EMC VMware ESX | AMD Opteron, Intel Pentium | Thin: dedicated | Virtualization intermediary | Linux (p), NetWare (f), Solaris (p), Windows (f) | Yes (host only network)
EMC VMware Server (GSX) | AMD Opteron, Intel Pentium | Fat: Linux or Windows | Virtualization intermediary | MS-DOS (f), Linux (f), Solaris (f), Windows (f) | Yes (host only network)
HP Virtual Machine | Intel Itanium | Fat: HP-UX | Virtualization intermediary | HP-UX (p), Linux (p), Windows (f) | Yes
IBM Micro-partition (SPLpars) | Power5 (i5, p5, p6) | Thin: Hypervisor | Virtualization intermediary | AIX (p), i5/OS (p), Linux (p) | Yes (Virtual Ethernet)
IBM Micro-partition (SPLpars) VIOS | Power5 (i5, p5, p6) | Thin: Hypervisor | Dedicated partition | AIX (p), i5/OS (p), Linux (p) | Yes (Virtual Ethernet)
Linux (Xen) Red Hat, SuSE | AMD Opteron, Intel Pentium/Itanium | Thin: Linux | Dedicated partition | Linux (p), Windows (f) | Yes
Microsoft Windows Hyper-V | AMD Opteron, Intel Pentium | Thin: Hypervisor | Dedicated partition | Linux (p), Windows (p) | Yes
Oracle VM (Xen) | AMD Opteron, Intel Pentium | Thin: Linux | Dedicated partition | Linux (p), Windows (f) | No
Parallels Server | Intel Pentium Macintosh server | Thick: Mac OS X | Virtualization intermediary | Linux, Mac OS X, Windows | No
Sun Logical Domains | UltraSPARC T1, T2 | Thin: Firmware | Dedicated partition | Linux (p), Solaris 10 (p) | Yes
Sun xVM | UltraSPARC, Intel Pentium | Thin: Hypervisor | Dedicated partition | BSD (p), Linux (p), Solaris (p), Windows (f) | No
Figure 14a Micro Partitioning Products
Figure 14b Micro Partitioning Products (table comparing, for each of the products listed in Figure 14a, the maximum number of guests per host, vCores per guest, GBytes of memory per guest, the ability to overcommit resources, dynamic resource support, and dynamic migration support, such as XenMotion for Citrix XenServer and VMotion for EMC VMware ESX)
EMC VMware
VMware has been doing virtualization longer than any of the other products in the market.
VMware was a stand-alone company but was purchased by EMC at the end of 2003. VMware
offers both thin (VMware ESX, which VMware refers to as the “non-hosted” model) and fat
(VMware Server, which was previously known as GSX, which VMware refers to as the “hosted”
model) virtual machine monitors on AMD Opteron and Intel Pentium Xeon systems. VMware also
offers a version of this software called VMware Workstation, which I will not cover here because it
is focused on the desktop market, and I am focusing on the server market. However, this is a
very solid product that is used by many notebook and laptop PC users to allow the use of Linux
and Windows at the same time without rebooting.
VMware ESX runs as a dedicated virtual machine monitor. The “VMkernel” is a micro-kernel that
is loaded into a Linux kernel as a driver. No applications or any other code can be run in the
VMkernel. VMware ESX supports FreeBSD, many distributions of Linux, Novell NetWare and
Microsoft Windows from NT to Windows 2003 Enterprise Edition in 32-bit, and 64-bit versions of
Windows 2003, Red Hat and SuSE Linux, and Solaris 10 (x86/x64). Linux and Solaris can be para-virtualized, but this requires HVM hardware.
VMware ESX 3i is an integrated hypervisor which gets rid of the Linux O/S, fits into 32MBytes, and
is booted from an internal USB key. This is shipping from multiple vendors (Dell, HP and IBM) as
an “embedded” hypervisor, providing complete integration with the hardware, and looking much
more like IBM’s Advanced Power Virtualization (APV) model, where it is not possible to run an O/S
instance on the bare iron, but only through the hypervisor. The system will power up directly into
ESX 3i, and all operating systems will boot as guest instances. Note that at the same time as ESX
3i, VMware is moving away from SNMP and moving toward the Common Information Model
(CIM) standard. This means that the standard console access will not work, but the vendors are
shipping CIM providers to implement management.
VMware Server (aka, GSX) installs in and runs as a service on a standard instance of Windows
2003 Web, Standard and Enterprise editions, Windows 2000 Advanced Server, and many
distributions of Linux. VMware GSX supports DOS 6.2, FreeBSD, many distributions of Linux,
Novell NetWare, and Windows all the way from Windows 3.11 to Windows 2003 Enterprise
Edition. Note that VMware Server is being offered as a free download by EMC, while support for
that is offered as a fee based service.
EMC offers the standard ability to either install an O/S instance directly from the distribution
media, or to convert a running (physical) instance into a guest (virtual) instance, using P2V
(physical to (2) virtual) tools. EMC previously offered VMware P2V assistant, but with VMware V3
it now offers VMware Converter 3, which is available in both Starter (a free download) and
Enterprise (licensed with a support contract with VMware Virtual Center) editions. VMware is so
popular that many server vendors also supply these kinds of tools, such as HP’s ProLiant Server
Migration Pack. Note that with VMware Virtual Machine Importer, VMware ESX can read the
container files created by Microsoft Virtual Server (though Microsoft has indicated that this is a
violation of the Microsoft license agreement). Further, ESX 3.0 has implemented VMFS (Virtual
Machine File System), a clustered file system for multiple host instances to share container files.
VMware emulates a generic Intel processor-set and standard I/O cards, and performs all I/O in the
virtual machine monitor, providing support for the widest variety of operating systems and
applications. This is most useful for older operating system guest instances which do not offer
support for 1000BaseT or 4GBit FibreChannel cards. VMware has the ability to move a powered-off guest instance (migration) or a running guest instance from one server to another (migration
with VMotion). VMware can take a snapshot of a running guest instance, which can then be used
to restore a guest instance to that exact state. Note that VMware recognizes that the external
environment (databases on disks which were not part of the snapshot, file downloads and
network connections to other systems, etc) can confuse both the restored guest instance and the
external systems. VMware can create a virtual network switch to provide a private network
among the guest instances, which VMware calls a “host only network”. VMware also offers a file
system optimized for the guests, called VMFS.
VMware manages the guest instances with Virtual Infrastructure. This offers a dashboard to start,
stop and monitor all of the instances across multiple host systems, VMotion to manually migrate
guest instances between host instances, virtual server provisioning to rapidly deploy new images,
and performance management to show thresholding and utilization. The host instance offers fair
share scheduling among all of the instances, and has tools to dedicate a percentage of resources
(processor, memory, I/O) to a specific guest instance. VMware Virtual Infrastructure also offers
the Distributed Resource Scheduler (DRS) which monitors and balances computing capacity
across the resource pools (multiple host instances) which can accept the VMotion migration of
guest instances which are experiencing performance peaks. Note that bare ESX 3i cannot be managed by Virtual Infrastructure by itself; extra licenses (VI Standard) are required before it can be managed by Virtual Infrastructure.
Guest instances can be allocated 1/8th of a processor, and are tuned by specifying “resource
shares” of the processors, which are specified in terms of percentages (proportions) of the
available processor resources. These shares are computed across the total number of physical
processors in the system, and are used to arbitrate between guest instances when processor
contention occurs as a way of preventing specific guest instances from monopolizing processor
resources. Memory is specified in terms of megabytes. Each guest instance is given a minimum
and maximum share, and if there are not enough processor or memory resources, the guest
instance will not boot. The minimum and maximum values for processor resources are dynamic
and can be changed by the Virtual Infrastructure Manager, however changing the memory
allocation requires rebooting the guest instance. Modifying the virtual resources of the guest
(virtual processors, memory and network and storage paths) requires a reboot of the guest
instance.
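As a rough illustration of proportional shares under contention (this is the general arbitration idea, not VMware's actual scheduler), consider the following sketch:

def divide_cpu(physical_cpus, guests):
    """guests: {name: shares}. Returns each guest's CPU-equivalents when every
    guest is busy; an idle guest would free its share for the others."""
    total_shares = sum(guests.values())
    return {name: physical_cpus * shares / total_shares
            for name, shares in guests.items()}


# Two physical processors, three busy guests with 2000/1000/1000 shares:
print(divide_cpu(2, {"db": 2000, "web": 1000, "batch": 1000}))
# {'db': 1.0, 'web': 0.5, 'batch': 0.5}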
Note the licensing of the guest operating system instances. With VMware, Windows requires a
license for each guest instance.
HP
HP Virtual Machine (HPVM) is available on all Integrity servers. The host instance is a standard
version of HP-UX 11i V2 or V3 running as a fat host. It is a core part of the Virtual Server
Environment (VSE), and is managed by the HP Virtualization Manager, which is WBEM based and
plugs into the HP Systems Management Home Page and VSE Management Tools. It supports HP-UX 11iV2 and V3, Linux and Windows, and will support OpenVMS in a future release (see below).
Note that all of the guest operating systems must be the native HP Integrity versions, and there
are restrictions around the respective versions of HP-UX hosts and guests. All of the operating
systems are para-virtualized, in that HPVM uses ring compression to force them to run in ring 1
instead of ring 0.
Guest instances can be allocated 1/20th of a processor, and are tuned by specifying an
“entitlement” in increments of 1/100th of a processor. The percentages are specified in terms
of single processors, and apply to each virtual processor. For example, if there are four physical
processors, and a guest instance is specified with two virtual processors, an entitlement of 50%
means that each virtual processor can consume up to ½ of a physical processor. Increasing or
decreasing the number of physical processors would not affect the processing time available to
this particular guest instance, but would affect how many entitlements are available to other
guest instances. Memory is specified in terms of megabytes. Each guest instance is given a
minimum and maximum entitlement, and if there are not enough processor or memory resources,
the guest instance will not boot. The minimum, maximum and capped values for processor
entitlements are dynamic and can be changed by the Virtualization Manager. Modifying the
virtual resources of the guest (virtual processors, memory and network and storage paths)
requires a reboot of the guest instance.
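To make the arithmetic concrete, here is a small sketch of the HPVM-style entitlement calculation described above (illustrative only, not HP's tooling):

def hpvm_cpu_equivalents(virtual_cpus, entitlement_pct):
    # the entitlement is a percentage of a single processor, applied to each
    # virtual processor of the guest
    return virtual_cpus * entitlement_pct / 100.0


# A guest with 2 virtual processors at a 50% entitlement may consume the
# equivalent of one physical processor, regardless of how many physical
# processors the server has (more physical processors simply leave more
# entitlement available for the other guests).
print(hpvm_cpu_equivalents(2, 50))  # 1.0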
The HPVM performs all of the I/O for all of the guest instances. There is no option for specifying
a dedicated partition for I/O. HPVM virtualizes all of the I/O ports, offering an Intel GigaBit
Ethernet, SCSI-2 and serial port. All of the devices supported in HP-UX 11iV2 are supported by
the host instance. SAN and other device management can only be done from the host instance.
HPVM V2.0 offers the ability to directly map a physical I/O device to a guest instance, called
“attached I/O”, to allow access to media such as DVD writers or tape silos.
HPVM has the standard feature of creating a guest instance from a standalone instance, which
can be the same bootable disk as the standalone instance plus a small information file. Tools
such as “p2vassist” are used to select the source system, which applications should be installed on
the target system, and the file systems to be used by the target system. HPVM has the ability to
move a powered-off guest instance (migration) but not a running guest instance from one server
to another. HPVM can take a snapshot of a powered-off guest instance, but not a running guest
instance, except through storage based snapshot tools. HPVM can create a private network
among the guest instances, for what HP calls “inter-VM communication”.
HP offers flexible licensing of the host instances. You may license all of the physical processor
cores, which then explicitly authorizes all of the guest instances. You may also license only the
virtual processor cores which are allocated to the specific guest instance. For example, if one of
your guest instances was running HP-UX 11iV3 High Availability (or 11iV2 Mission Critical)
Operating Environment and another of your guest instances was running HP-UX Foundation
Operating Environment, you could license all of the processors with HAOE/MCOE (to allow any
number of guest instances to run either MCOE/HAOE or FOE), or you could license some virtual
processors with HAOE/MCOE and other virtual processors with FOE. You should analyze the most
cost-efficient method of licensing for your environment.
Note that HPVM V2.0 does not support VT-i, but uses ring compression instead, even on processors which have VT-i technology. Because OpenVMS uses all four ring levels, de-privileging and ring compression are more difficult. A future version of HPVM with a future Itanium
processor will support VT-i, such that ring compression will not be required, at which time it will
be possible to have OpenVMS as a guest instance.
IBM
IBM micro partitioning is available on all Power5 systems, including eServer iSeries and pSeries,
running AIX, i5/OS (previously known as OS/400), Linux, and Microsoft Windows (which requires
specialized IXS/IXA hardware). AIX, i5/OS and Linux are para-virtualized, while Windows is not.
The Power Hypervisor firmware controls the micro partitions in the same way it controls the
DLPARs mentioned earlier, and is controlled by the same Integrated Virtualization Manager (IVM)
console, or by the Hardware Management Console (HMC) referenced earlier. Note that some IBM
documentation refers to the hypervisor, others refer to SPLPARs (Shared Processor Logical
Partitions) and still others refer to SDLPARs (Shared Dynamic Logical Partitions): be aware that all
of these are referring to the same product. I will refer to them as SPLPARs.
Guest instances can be allocated 1/10th of a processor, and are tuned by specifying an
“entitlement” in increments of 1/100th of a processor. The entitlements are in terms of
percentages of the physical resources, such that if there are four physical processors, you can
specify up to 400% entitlement for a specific guest instance. Each guest instance is given a
minimum and maximum entitlement, and if there are less physical resources than the minimum
entitlement, the guest instance will not boot. If the guest instance’s maximum entitlement is
defined as “capped”, it cannot use more than this even if the resources are available. If the guest
instance’s maximum entitlement is not defined as “capped”, it can use all available resources,
which are called the “excess”. The minimum, maximum and capped values are dynamic and can
be changed by the DLPARs manager for active partitions, without requiring a reboot of the guest
instance. Modifying the virtual resources of the guest (virtual processors, memory and network
and storage paths) requires a reboot of the guest instance.
The hypervisor performs all of the I/O for all LPARs and DLPARs, and can be set to do this for
SPLPARs guest instances. In this case, each guest instance has dedicated I/O cards which cannot
be shared by the other partitions. However, SPLPARs can use the optional “Virtual I/O Server”
(VIOS) software, such that I/O cards can be shared between guest instances, and one of the guest
instances is dedicated to perform the I/O for all of the other instances. The VIOS instance uses
special API’s which communicate directly through the Power5/5+/6 inter-chip fabric, which is
significantly faster than any other communications path, including DMA. Note that IBM strongly
recommends that the VIOS get a dedicated processor and memory, which cannot be used by any
of the other guest instances. If a second VIOS is created to avoid a single point of failure, IBM
recommends that it also get a dedicated processor and memory, which cannot be shared with
either the application guest instances or the original VIOS instance. In versions of AIX prior to 5.3
this was a requirement, but IBM backed off on this requirement and now makes it just a very
strong recommendation, primarily so that 2 processor systems would have processors available to
do real work. Note that you can have multiple VIOS’s which can be dedicated to a specific type of
I/O: LAN, SCSI, FibreChannel, etc, and that these virtual I/O servers can failover to each other. The
VIOS can be either a DLpar or an SPLpar with AIX 5.3 and beyond, but it must be a DLpar with AIX
5.2 or earlier.
Another method of achieving higher availability is to have the hypervisor use MPIO to load
balance across the I/O channels and provide failover.
IBM is unique in that LPARs, DLPARs and SPLPARs can be run side by side on the same server.
The SPLPARs may or may not connect to the virtual I/O server(s), as described above.
IBM micro partitioning has the standard features of creating a guest instance from a standalone
instance, which must be a logical volume. IBM has the ability to move a powered-off guest
instance (migration), and has recently added Live Partition Migration (LPM) on Power6 for AIX 5.3
and above and specific versions of Linux, which can move a running guest instance from one
server to another. Note that LPM requires the use of VIOS for both the source and targets, and
the source and target environments must use the same virtualization interface (either HMC or
IVM). IBM has not yet qualified LPM with HACMP. IBM can take a snapshot of a powered-off
guest instance, but not a running guest instance, except through storage based snapshot tools.
IBM micro partitioning can create a private network among the guest instances, which IBM calls a
“virtual Ethernet”.
IBM offers flexible licensing of the host instances. You may license all of the physical processors,
which then authorize all of the guest instances. You may also license the entitled capacity of
particular guest instances, including fractional processors. Note that IBM licenses all of its
software by Processor Value Unit (PVU), not by core or by processor. But IBM requires a license
for all of the processors which are able to run a particular piece of software, and the
documentation specifically calls out the case where an LPAR with two SPLPARs is sharing 4
processors. One of the SPLPARs is running an application with the other SPLPAR running DB2,
and the example states that because all 4 processors can be used by the second SPLPAR, you
need to license DB2 for all 4 processors, using the PVU factors.
Microsoft
Microsoft Windows Virtual Server 2005 is part of Microsoft’s “Dynamic Systems Initiative”, and is a
set of services which run on top of the standard Windows 2003 operating system. It comes in
either Standard Edition (up to 4 physical processors for the host instance) or Enterprise Edition (32
physical processors for the host instance), but the functionality is otherwise the same. The host
instance can run its own applications, or just act as the virtual machine monitor for the server,
whichever is preferred by the administrator. The host instance can be clustered with other host
instances on other physical servers with MSCS. The guest instances are fully virtualized.
Microsoft is shipping Virtual Server as a free download with Windows Server 2003, and it will be
available with the next release of Windows Server, known as “Longhorn”. Virtual Server supports
Windows guest instances, and with the supplemental Virtual Machine Additions For Linux
software, will also support a wide variety of Linux guest instances. See the Xen section for more
information on support of Linux guest instances.
The System Center Virtual Machine Manager will manage all of the host and guest instances.
Virtual Server has the standard features of creating a guest instance from a standalone instance.
Virtual Server has the ability to move a powered-off guest instance (migration), and the ability to
move a running guest instance from one server to another, through the “Virtual Hard Disks” (i.e.,
container files). Virtual Server can take a snapshot of a powered-off guest instance, but not a
running guest instance, except through storage based snapshot tools. Note that the guest
instances must run under a specific Windows account, but Microsoft supplies a default account if
none is specified.
Guest instances can be allocated processor resources either by “weight” (a number between 1 and
10,000, where the guest instances with higher numbers receive higher preference to resources) or
by capacity, where each guest instance is given a minimum and maximum percentage of
processor resources. In order to guarantee that a guest instance will receive the maximum
processor resources, specify 100 as the “reserve capacity”, which will allocate an entire physical
processor to that guest instance. If there are not enough processor or memory resources, the
guest instance will not boot. The minimum, maximum and reserved values for processor
entitlements are dynamic and can be changed by the Virtual Machine Manager; however, changing the memory allocation requires rebooting the guest
instance. Modifying the virtual resources of the guest (virtual processors, memory and network
and storage paths) requires a reboot of the guest instance.
Virtual Server performs all I/O in the virtual machine monitor.
Virtual Server can create a private network which allows communication between the guest
instances, from the guest instance to the virtual machine monitor hosting the guest instance, and
to operating system instances running outside of the virtual machine monitor, which Microsoft
calls a “virtual network”.
Microsoft Hyper-V is the next generation of Virtual Server, and implements para-virtualization, but has not yet been released, so I will only summarize what Microsoft has announced so far. Hyper-V is a thin hypervisor running in ring -1 (and therefore requires HVM), on AMD Opteron and Intel Pentium processors only, with a dedicated partition for I/O and management and a special “VMbus” for I/O (similar to IBM's approach). Linux and Windows can be para-virtualized; a para-virtualized Linux guest uses the VMbus, while fully virtualized operating systems do not use it. Standard Edition licenses 1 guest, Enterprise Edition licenses 4 guests, and DataCenter Edition licenses unlimited guests; Itanium is not supported by Hyper-V, but the Itanium Edition of Windows 2008 licenses unlimited guests with 3rd party hypervisors. Linux guests require downloading the Integration Services (basically drivers). Hyper-V supports a private network, and supports Quick Migration, which is based on Windows Failover Clustering; it does not support Live Migration (a consideration for a future release: a planned failover saves and restores state, and takes seconds to minutes). Do not put multiple container files on the same LUN, as the LUN can only be mounted on one hypervisor at a time, so if one guest moves, all guests with their containers on that LUN move. Hyper-V is bundled as part of the price of each of the editions, and each edition is available without Hyper-V for a very slight price reduction ($999/system for SE with Hyper-V, $971/system for SE without Hyper-V). Management options include the Hyper-V Manager MMC snap-in (which manages groups of Hyper-V hosts and guests one server at a time, and does not include copy/clone/P2V/failover wizards), System Center Virtual Machine Manager 2008 (part of the System Center Enterprise Suite and preferred for the enterprise: it includes all of the functionality, can do self provisioning, and will support VMware VI3 if Virtual Center Server is running, including initiating a vMotion), ProLiant Essentials Virtual Machine Manager, and HP Insight Dynamics VSE. Guests can have up to 4 vCPUs, depending on the guest operating system (Windows 2008 up to 4, Windows Vista up to 2, Windows 2003 up to 1; the Linux limit is not stated), and can run capped and uncapped.
Parallels Server for Macintosh
Parallels Server for Macintosh is a package running on top of Mac OS X, as a thick hypervisor, on
Apple servers with Intel Pentium Xeon processors and HVM hardware (aka, VT-x). It supports Mac OS X, many
modern Linux distributions, and Windows. Parallels Desktop is a companion product for the
workstation world, which supports the same set of guest operating systems.
Parallels Server has the standard features of creating a guest instance from a standalone instance
and from another virtual guest (P2V and V2V). Parallels Server has the ability to move a powered-off guest instance (migration), but does not offer live migration. Parallels Server cannot create a
private network among the guest instances.
Management is mostly through the command line, though Parallels does offer an integrated
console for GUI access, and an open API to allow user written extensions. There is a large library
of these extensions available and included in the package. The integrated Parallels Explorer
allows access to the virtual disk files when the guest operating system is not running, for ease of
updating inactive guests. Image Cloning (ie, off-line backup) is also available.
Sun Microsystems
Sun offers multiple products, depending on the underlying processor: Logical Domains on the T1
and T2 processors, and xVM on the UltraSPARC and Intel Pentium Xeon processors. They are
otherwise very similar products.
Sun is explicit that Logical Domains (LDoms) is a soft partitioning product, and not a micro
partitioning product. However, I would disagree with this positioning, as the LDOM hypervisor is
implemented in firmware, and provides complete O/S instance isolation and sharing of processors
and I/O cards among all of the guest O/S instances. LDoms are only available on the Sun Fire
T1000 and T2000 servers running the T1 and T2 processors. LDoms support Solaris 10 as well as
Linux, but note that neither Red Hat nor SUSE support the T1/T2 processor, so the Linux support
is primarily from Ubuntu. All of the supported operating systems are para-virtualized.
The LDom environment consists of, logically enough, “domains”. LDoms are managed through
the Logical Domain Manager, which runs in the “Control Domain”. All I/O is done by the “I/O
Domain”, and there can be up to 2 I/O domains for redundancy. One of the I/O domains must be
the control domain. All guest O/S instances run in “Guest Domains”. Guest domains can
communicate to the I/O domains through a Logical Domain Channel, which provides full access to
all shared I/O devices. Guest domains can also directly map an I/O device, such that the device is
owned by that guest domain and not shareable. Virtual network switches can be implemented
for communication between guest domains, which can act as private networks.
LDom documentation refers to “virtual CPUs”, but they are actually threads running in the T1 and
T2 processors. Since each processor contains 4 cores, and each core has 8 threads, you can
schedule up to 32 virtual CPUs with only a single physical processor.
Guest domains can be created from new installations or by using the P2V tools which are bundled
with LDoms. Virtual CPUs can be moved dynamically between guest domains, but memory and
I/O changes require rebooting the guest domain. LDoms do not support live migration.
Performance management is handled using the standard Solaris tools of pools, PRM and psets.
LDoms is currently a free download, but Sun will charge for support.
Sun also offers xVM on the UltraSPARC and Intel Pentium Xeon processors. It is based on Xen.
The xVM environment consists of “domains” which can run BSD, Linux, Solaris or Windows. All of
the operating systems are para-virtualized. The “Control Domain” must run Solaris. All I/O is
done by the “I/O Domain”, and there can be up to 2 I/O domains for redundancy. One of the I/O
domains must be the control domain. All guest O/S instances run in “Guest Domains”. Guest
domains can communicate to the I/O domains through a Logical Domain Channel, which provides
full access to all shared I/O devices. Guest domains can also directly map an I/O device, such that
the device is owned by that guest domain and not shareable. Virtual network switches can be
implemented for communication between guest domains, which can act as private networks.
Guest domains can be created from new installations or by using the P2V tools which are bundled
with xVM. Virtual CPUs can be moved dynamically between guest domains, but memory and
I/O changes require rebooting the guest domain. xVM does not support live migration.
Performance management is handled using the standard Solaris tools of pools, PRM and psets.
Virtual Iron
Virtual Iron “Native Virtualization” was originally derived from the open source Xen virtual
machine monitor, but has since diverged from it. Native Virtualization supports AMD Opteron
and Intel Xeon processors with HVM. Native Virtualization supports 32-bit and 64-bit Linux and
Windows as guest operating systems within a resource group called the “Virtual DataCenter”
(VDC). Virtual Iron uses a dedicated instance, called the Virtual Services Partition, to perform all
I/O. Virtual Iron has the ability to move a powered-off guest instance (migration) and the ability to
move a running guest instance from one server to another (LiveMigrate). Virtual Iron can run on
any storage, but the use of LiveMigrate requires that the container file(s) for all of the Virtual
Volumes for the guest instance be on storage which is shared between the two servers, and that
the two servers are located on the same network sub-net. This allows the movement of the guest
instance to occur in a few seconds, while the operations of the guest instance are only suspended
for 100 milli-seconds and all network and storage connections are fully preserved during the
migration. One of the capabilities that Virtual Iron has added is the ability to use LiveMigrate to
proactively load balance and optimize the performance of the guest instances across servers,
called LiveCapacity. This continuously samples the performance of the guest instances and the
host instances across the Virtual Data Center, and migrates guest instances according to the
policies specified by the system administrators. Virtual Iron can take a snapshot of a powered-off guest instance, but not a running guest instance, except through storage based snapshot
tools.
Virtual Iron creates virtual network switches in the host instance for the use of the guest instances.
These can either be directly connected to physical networks or can be used to provide a private
network among the guest instances, which Virtual Iron calls a “private internal virtual switch”.
Virtual Iron manages the guest instances with Virtualization Manager. This offers a dashboard to
start, stop and monitor all of the instances across multiple host systems, LiveMigrate to manually
migrate guest instances between host instances, virtual server provisioning to create virtual
volumes and rapidly deploy new images, and performance management to show thresholding
and utilization and implement auto-recovery (LiveRecovery).
The host instance offers weighted credit scheduling among all of the instances, which takes values
from 1 to 4096 as weights for each domain (higher weighted domains get higher priority to
resources). The Virtual Services Partition gets the highest weight by default, and this cannot be
over-ridden. Note that there is no way to dedicate a percentage of the resources to a guest
instance, except by the relative weights of the instances.
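To make the weighting arithmetic concrete, here is a minimal Python sketch (the domain names and
weight values are hypothetical, not taken from Virtual Iron) showing how relative weights translate
into proportional shares of contended processor time:

    # Sketch of proportional shares under weighted credit scheduling.
    # Weights range from 1 to 4096; higher weights get more of a contended resource.
    weights = {
        "virtual_services_partition": 4096,  # the I/O partition gets the highest weight
        "guest_a": 512,
        "guest_b": 256,
    }
    total = sum(weights.values())
    for domain, weight in weights.items():
        # A domain's share of contended CPU is its weight divided by the sum of all
        # weights; there is no way to pin a domain to an absolute percentage.
        print(f"{domain}: {weight / total:.1%}")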
If there are not enough memory resources, the guest instance will not boot. You cannot over-allocate memory, but you can over-allocate processors. The weighted credit values for processor
resources are dynamic and can be changed by the Virtual Infrastructure Manager. Modifying the
virtual resources of the guest (virtual processors, memory and network and storage paths)
requires a reboot of the guest instance.
Xen
Xen is a popular virtual machine monitor for Linux on x86 systems, from notebook PCs to large
SMP servers. Xen is an open source project founded and maintained by XenSource, and uses the
para-virtualization model. Prior to Xen 3.0 and the virtualization enhancements available in the
AMD and Intel processors, each of the guest instances (called “domains”) needed to be modified
with a set of packages to implement para-virtualization.
Xen 3.0 has led the industry in para-virtualization, by taking advantage of HVM in AMD Opteron,
Intel Itanium and Pentium Xeon, and Sun UltraSPARC T1 processors. Xen uses this technology to
allow “pass-through” execution of guest operating systems which do not support para-virtualization, but only on processors with HVM. So the operating systems which have been
modified to support para-virtualization will continue to make calls to the host operating system,
while operating systems which have not been modified to support para-virtualization (such as
Microsoft Windows and Sun Solaris) will run as a guest operating system but execute in ring 0
normally, and the host operating system will not interfere with the direct connection to the
underlying hardware. In either case the overhead of the host operating system will be
dramatically less than the normal case of guest operating systems, usually on the order of 3-5%.
Note that the other products are rapidly catching up to Xen in terms of para-virtualization
efficiency.
Xen uses the virtual I/O server model, and all I/O is done in “domain 0” or “dom0”, which is a
dedicated guest instance (all standard guest instances are referred to as “domU”). One of the
ways that Xen keeps the host instance small is by performing many of the more complex
functions in dom0, which is running a full version of Linux. Therefore, all of the management of
the Xen virtual machine monitor and the guest operating system instances are also handled in
dom0. Xen has a management GUI-based dashboard.
Xen has the standard feature of creating a guest instance from a standalone instance (P2V) into a
container file. The Citrix implementation has the ability to move a running guest instance from
one server to another on AMD Opteron and Intel Pentium Xeon, while the Red Hat and SUSE
implementations can also do live migration on Intel Itanium.
The host instance offers weighted credit scheduling among all of the instances, which takes values
from 1 to 4096 as weights for each domain (higher weighted domains get higher priority to
resources). The Virtual Services Partition gets the highest weight by default, and this cannot be
over-ridden. Note that there is no way to dedicate a percentage of the resources to a guest
instance, except by the relative weights of the instances.
If there are not enough memory resources, the guest instance will not boot. You cannot over-allocate memory, but you can over-allocate processors. The weighted credit values for processor
resources are dynamic and can be changed by the Virtual Infrastructure Manager. Modifying the
virtual resources of the guest (virtual processors, memory and network and storage paths)
requires a reboot of the guest instance.
Citrix has purchased XenSource, and is implementing Xen on AMD Opteron and Intel Pentium
Xeon with Windows and Linux guests. They have not chosen to implement Citrix Xen on any
other platform. Citrix Xen supports LiveMigration in the higher editions. Citrix Xen requires
virtualization support in the processor (AMD-V and Intel VT-x), para-virtualization drivers in the host and para-virtualized Windows guests, but does not require these
for Linux guests.
Citrix has commercially packaged this technology into four products:

XenServer Express Edition, which supports up to 4 guest instances per host, and up to
two processor sockets and 4GBytes in the host server. Note that this does not support
XenMotion or most of the management tools.

XenServer Standard Edition, which supports an unlimited number of guest instances per
host (subject to memory limitations on the host), and up to two processor sockets and
128GBytes in the host server. This edition adds some management tools, but does not
support XenMotion.

XenServer Enterprise Edition, which supports an unlimited number of guest instances per
host, and unlimited processor sockets and memory in the host server. This edition adds
some additional management tools and XenMotion.

XenServer Platinum Edition, which supports an unlimited number of guest instances per
host, and unlimited processor sockets and memory in the host server. This edition adds
dynamic provisioning of guest instances.
Note that the P2V tools for XenEnterprise are available directly from Citrix, but the P2V tools for
Windows are available from Leostream and Platespin. XenEnterprise does not yet have the ability
to move a running guest instance from one server to another on any processor architecture.
XenEnterprise can take a snapshot of a powered-off guest instance, but not a running guest
instance, except through storage based snapshot tools.
All of the above discussion describes the technology which is available from Citrix. But because
Xen is an open source product, many other companies have incorporated Xen into their products.
Novell SUSE Linux Enterprise Edition 10, Red Hat Enterprise Linux 5 and Red Hat Enterprise Linux
Advanced Server 5 are shipping Xen implementations with both para-virtualization and full
virtualization. Both SUSE and Red Hat support live migration on AMD Opteron, Intel Pentium
Xeon and Intel Itanium. Note that this version of Xen supports Linux with para-virtualization, but
does not yet support Windows with para-virtualization.
Oracle has implemented Oracle VM using Xen. It is only supported on AMD Opteron and Intel
Pentium Xeon.
Clustering of Micro Partitions
• Some vendors support clustering of host instances to provide failover of guest instances
− VMware ESX 3.0 HA (aka Distributed Availability Services)
− HP-UX Virtual Machine with Serviceguard
− IBM Hypervisor does not support clustering of host instances
− Virtual Iron supports clustering of host instances.
− Windows Virtual Server 2005 runs as a service with Microsoft Cluster Server (MSCS)
• Some vendors support clustering of guest instances
− AIX supports HACMP in LPARs and DLPARs, but not SPLPARs, with V5.2+
− HP-UX and Linux Serviceguard are supported in guest instances with HPVM V2.0
− Oracle RAC is supported in a guest instance of an IBM SPLPAR and a VMware ESX guest instance, but is not supported in a guest instance with any other micro partitioning product from any vendor
− Windows 2003 supports MSCS of guest instances under VMware ESX. Microsoft Cluster Server (MSCS) is not supported in a Windows guest of an HPVM, because of non-implemented SCSI features in the virtual SCSI driver, thus there is no way to use it to fail an application over.
Operating System Instance Resource Partitioning
All of the above technologies involve running multiple operating system instances at the same
time on the same server hardware. This is most common in server consolidation, where you wish
to maintain the same user experience, with the same system names, network addresses, etc, as on
the individual servers which have been replaced with a single larger server. But licensing and
maintaining individual operating system instances can become expensive. To reduce this
overhead, many businesses are choosing to consolidate their environments to a few standardized
operating systems and versions, but even doing so requires applying the same maintenance to
many instances which can be quite time consuming.
An alternative to this is to “stack” many instances of the applications on a fewer number of
operating system instances. This reduces the amount of maintenance which needs to be done,
while also reducing the number of licenses for both the operating systems and the applications.
This helps the system administrators, but the users still have many of the concerns around high
availability, resource consumption and contention for resources by the other application instances
which were discussed in the micro partitioning section when application stacking is done.
Because all of the application instances are running on a single instance of the operating system,
they are competing for scheduling for execution on the processors, they are competing for the
use of physical memory, and they are competing for bandwidth on the I/O cards. When
applications are stacked onto a single server, users still want guarantees that they will be able to
have their fair share of the resources on the system, and won’t be starved because some other
users begin consuming all of the resources on the system.
In the same way that the dramatically increasing performance and scalability of modern servers
can overcome these concerns in the micro partitioning model, they are equally able to be
overcome in the application stacking model. Many operating systems have added the capability
to group applications and manage them as a single entity, and ensuring that they receive some
pre-determined percentage of the system resources. These are called “resource partitions”. They
are similar to the other partitioning technologies, except that they work entirely within a single
instance of a standard operating system.
There is one additional concern about application stacking which is not present in the other
partitioning models: version control. With hard partitioning, soft partitioning and micro
partitioning, each partition is running a complete version of the operating system, and can
therefore have different operating system versions and patches, and application versions and
patches, than any other of the operating system instances. Because there is actually only a single
instance of the operating system, this is more difficult in resource partitioning.
Vendors have developed two styles of doing resource partitioning: having the operating system
present the same interface to the entire system and simply perform resource management, or
have the operating system present virtualized operating environments to each of the resource
partitions and then manage each of the virtualized operating environments. The first model
isolates the applications for performance, but requires that all of the applications have the same
version and patch level. Some application vendors do allow multiple versions of their application
to run in the same operating system, but that is external to resource partitioning and will not be
discussed here. The second model can actually run different applications at different version and
patch levels, both of the application and of the operating system itself.
But even with all of these restrictions, resource partitioning is increasingly popular, because
modern server systems are far more powerful than ever before, so a single system can handle the
load which overwhelmed multiple older systems, and it reduces costs dramatically, as discussed
above.
In the above discussion, I have used the word “resource” to describe how the operating system
allocates processing capabilities to each application group. These “resources” are primarily
processors, memory and I/O bandwidth and priority. All major operating systems offer tools to
control how the processor scheduling is apportioned between the different application instances.
Some of the operating systems additionally offer tools which control how memory is allocated,
with a few also offering tools which control access to the I/O cards.
But in some cases you need more than this scheduling, you require security isolation between the
different application instances. When, for example, the Human Resources group was on its own
server and was managing confidential employee records containing salary information, there was
no problem because access to that server could be tightly controlled. But when HR is simply one
of the applications on the same physical server as many other users, there might need to be some
additional level of security so a user who has privileges for another of the applications on that
server cannot snoop into the HR application. Therefore, some operating systems additionally
offer role based access controls (RBAC) to allow privileges to apply to one of the resource
partitions but not any of the other resource partitions. Further, some operating systems offer
tools to control the communications between processes between resource partitions, to offer the
same high level of security as if those processes were on two physically separate servers.
Many of the traditional tools, especially in the UNIX space, operate on operating system metrics,
such as percentage of processor time or megabytes of allocated memory. However, users
generally don’t care about such things, and are more focused on metrics which are the basis of
their business function, such as having queries completed in a certain number of seconds,
response time of the application, or job duration. These metrics are often formalized as Service
Level Agreements (SLA’s), and IT departments are often measured on these metrics, frequently to
the level of basing financial rewards for the IT department personnel on meeting or exceeding
them. Some of the tools for managing resource partitions are capable of adjusting the resource
allocation based on these metrics.
Product | Manages | Isolation Level | Tools for SLA's | Higher Security
HP HP-UX Secure Resource Partitions | Processor, Memory, I/O | Application | PRM, WLM, gWLM | Yes
HP OpenVMS Class Scheduler | Processor, Memory, I/O | Application | PRM, WLM, gWLM | Not Needed
IBM AIX Class Scheduler | Processor, Memory, I/O | Application | WLM | No
Linux VServer | Processor, Memory | Virtual O/S | PRM | Yes
Microsoft Windows System Resource Manager | Processor, Memory | Application | Policy Based | No
Parallels Virtuozzo, OpenVZ for Linux and Windows | Processor, Memory, I/O | Virtual O/S | Policy Based | Yes
Sun Solaris Container | Processor, Memory, I/O | Application | SRM, FSS | No
Sun Solaris Zone | Processor, Memory, I/O | Virtual O/S | SRM, FSS | Yes
Figure 15 Resource Partitions
HP
HP-UX isolates the applications for performance management, and requires a single version and
patch level. It offers a variety of tools to manage application performance. “Processor Sets”
(Psets) allow you to define groups of processors (or, more recently, cores) as a set, which can then
be assigned to one or more applications for their exclusive use. “Fair Share Scheduling” (FSS)
allows you to specify entitlements for each application group, and the processor time of that Pset
(which could be all of the processors in the system, or could be a subset of them) is scheduled
based on those entitlements. Both of these can be integrated with the “Process Resource
Manager” (PRM), which adds the ability to dedicate memory and I/O bandwidth to the application
group. PRM works with operating system metrics, while WorkLoad Manager (WLM) and Global
WorkLoad Manager (gWLM) work with application metrics such as response time. WLM works
within a single server environment, where gWLM can define specific policies which work across
multiple servers and can actually move resources between servers. Secure Resource Partitions
increase the security of the resource partitions by enforcing communications restrictions between
partitions, ensuring that a process in one resource partition cannot communicate with a process
in another resource partition, except through standard networking paths, exactly as if those two
processes were on separate physical servers. HP-UX also offers toolkits for many popular
applications, which can manage the different modules of those applications in different resource
partitions. This allows, for example, different SAP modules to be consolidated onto a single server
while still receiving guaranteed percentages of the entire system resources.
OpenVMS isolates the applications for performance management, and requires a single version
and patch level. It offers a variety of tools to manage application performance. “Processor Sets”
(Psets) allow you to define groups of processors (or, more recently, cores) as a set, which can then
be assigned to one or more applications for their exclusive use. “Class Scheduling” allows you to
specify percentages of processor time available for each application group. Both of these can be
integrated with the “Process Resource Manager” (PRM), which adds the ability to dedicate
memory and I/O bandwidth to the application group. PRM, WorkLoad Manager (WLM) and
Global WorkLoad Manager (gWLM) work with application metrics such as response time. Note
that WLM and gWLM are the same across both HP-UX and OpenVMS. WLM works within a single
server environment, where gWLM can define specific policies which work across multiple servers
and can actually move resources between servers. OpenVMS does not add any specific security
capabilities to the groups, as the existing security policies are already capable of controlling and
enforcing the communications and other security considerations between the resource partitions.
IBM
AIX WorkLoad Manager (WLM) isolates the applications for performance management, and
requires a single version and patch level. It controls the processor, memory and I/O resources
available to groups of applications, which are assigned to a “class”. The resources are enforced at
the class level, with all of the processes inside the class sharing the resources equally. Resources
are allocated either in terms of shares of the total amount of system resources, or hard and soft
limits in terms of percentages of the total system resources. Soft limits can be exceeded if the
resources are available, but hard limits can never be exceeded even if the resources are otherwise
idle. There is no additional security controlling communications between the classes.
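As a rough illustration of the difference between soft and hard limits, the following Python sketch
(the numbers are hypothetical) grants a class no more than its soft limit when the system is busy,
and never more than its hard limit even when spare capacity exists:

    # Sketch of soft versus hard limits for a workload class (values hypothetical).
    def grant_cpu(requested, soft_limit, hard_limit, idle_capacity_available):
        # The soft limit caps the class while other classes want the resource;
        # the hard limit can never be exceeded, even if the system is otherwise idle.
        ceiling = hard_limit if idle_capacity_available else soft_limit
        return min(requested, ceiling)

    print(grant_cpu(0.70, soft_limit=0.50, hard_limit=0.60, idle_capacity_available=True))   # 0.6
    print(grant_cpu(0.70, soft_limit=0.50, hard_limit=0.60, idle_capacity_available=False))  # 0.5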
Note that prior versions of the Enterprise WorkLoad Manager (EWLM) were simply load
balancers which allocated incoming requests among multiple servers, and did not manage the
resources inside or between operating system instances. EWLM V2 has been enhanced to allow it
to dynamically move processors between DLPARs, and to modify the resource entitlements of
SPLPARs. Note that EWLM can perform these actions no matter the operating system running in
the partition.
Partition Load Manager (PLM, which requires the presence of an HMC) can move processors,
memory and I/O between DLPARs and SPLPARs. Each of the partitions has a minimum, maximum
and guaranteed resource entitlement, which is allocated to the partitions by their entitlement
divided by the total number of entitlements of that resource for all of the partitions. PLM will
monitor the thresholds every ten seconds, and if the thresholds are exceeded six times in a row, a
“dynamic LPAR event” will be triggered and resources will potentially be moved, assuming the
resources are available.
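The ten-second sampling and six-in-a-row trigger can be pictured with a small Python sketch; the
sampling function, threshold value and partition name are hypothetical stand-ins rather than the
PLM interface:

    import random
    import time

    POLL_INTERVAL_SECONDS = 10
    CONSECUTIVE_BREACHES_REQUIRED = 6

    def sample_utilization(partition):
        # Hypothetical stand-in for reading the partition's current utilization (0.0 to 1.0).
        return random.random()

    def monitor(partition, threshold=0.9):
        breaches = 0
        while True:
            if sample_utilization(partition) > threshold:
                breaches += 1
                if breaches >= CONSECUTIVE_BREACHES_REQUIRED:
                    print("dynamic LPAR event: consider moving resources to", partition)
                    breaches = 0
            else:
                breaches = 0  # one sample below the threshold resets the count
            time.sleep(POLL_INTERVAL_SECONDS)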
IBM brings all of the above tools together via the IBM Director, which includes server and storage
deployment tools, server (including micro-partitions) and storage monitoring tools.
Linux
Linux VServer isolates the environments at an operating system level, and requires a single
version and patch level. The operating system kernel code is separated from the user mode code,
which is placed into Virtual Private Servers (VPS), which appear to the application as a physical
server. Each of the VPS’s see a set of processors, memory and I/O cards, as well as an entire
distinct file system and device namespace, distinct from any other VPS. VServer controls
processor and memory resources for each VPS using standard prm tools.
The file system for each VPS is distinguished from all other file systems for all other VPS’s by the
use of the “chroot” (change root) command, which defines a new “/” (root file system) for each
VPS. In this way all of the applications can continue to use their own file system tree, which each
sees as the entire set of file systems, without interfering with any other VPS.
Note that VServer is completely separate from Linux Virtual Server. Linux Virtual Server is a high
performance (technical) computing (HPTC or HPC) environment, which allows many separate
systems to cooperate in solving a difficult or very large mathematical problem.
Microsoft
Windows System Resource Manager isolates the applications for performance management, and
requires a single version and patch level. It controls processor and memory resources for each
application process, which can be any non-system process. An application is defined by the
command line which starts it, so multiple application instances can run in a single Windows
instance, and be differentiated by their originating command line. It defines a set of “policies”,
which can be exported and shared between servers, to manage a group of servers with the same
policies. Resources are allocated in terms of percentage of total processor resources, and
megabytes of physical memory and page file space. These are hard limits, and cannot be
exceeded even if the resources are otherwise idle. There is no additional security controlling
communications between the application instances.
Windows System Resource Manager is a free download for Windows 2003 Enterprise and
DataCenter Edition customers, for all Windows server platforms.
Note that this same technology is built into the SQL Server product, to allow multiple versions
and instances of SQL Server to be run simultaneously. See the section on application resource
partitions for more information.
Sun Microsystems
Sun offers two types of resource partitions: containers and zones.
Solaris Containers define a resource pool which contains one or more physical resource sets,
which contain processors, memory and I/O cards. These resource pools are then allocated to the
containers by means of “shares”. Resources are allocated in terms of percentage of total
processor resources, and megabytes of physical memory and page file space. Containers can run
either “capped” (hard limits, and cannot be exceeded even if the resources are otherwise idle) or
“uncapped” (resources can be used if they are otherwise idle). There can also be resources which
are unallocated, and are available on demand to any specific resource pool. Solaris uses Fair
Share Scheduling (FSS) and Solaris Resource Management (SRM) to manage the resources.
Containers do not implement any extra security beyond the standard restrictions on inter-process
communication.
Solaris Zones are a specialized form of container which offers a virtualized operating system
environment which can run Solaris. Each zone has its own interface to the environment, including
network addresses, a host name, and its own root user. There is a “global zone” which is owned
by the base Solaris instance, and then each of the applications runs in a “non-global zone”.
Solaris can share resources between one or more zones, such as when multiple zones are running
the same application, they share a single physical in-memory copy of that image in the global
zone. Further, the different zones can share disk volumes, because all of the volumes are owned
by the global zone. Solaris increases the security of the resource partitions by enforcing
communications restrictions between zones, ensuring that a process in one zone cannot
communicate with a process in another zone, except through standard networking paths, exactly
as if those two processes were on separate physical servers. Further, different non-global zones
can be run in different trust domains.
Solaris Zones have the same “resource pools” as Containers which are allocated to each of the
non-global zones, which controls the amount of resources that they can access. Resources are
allocated in terms of percentage of total processor resources, and megabytes of physical memory
and page file space. These are hard limits, and cannot be exceeded even if the resources are
otherwise idle. Solaris uses Fair Share Scheduling (FSS) and Solaris Resource Management (SRM)
to manage the resources. Different non-global zones can be run in different trust domains, so
there is extra security between zones.
Parallels (SWsoft) Virtuozzo
Parallels Virtuozzo (from the company formerly known as SWsoft, which recently changed its name to Parallels) is a set of
services which run on top of 32-bit and 64-bit Linux (almost all distributions) and Windows (2003
and 2008 Server, but not earlier Windows versions), on AMD Opteron, Intel Pentium Xeon and
Itanium processors. Virtuozzo presents a virtualized operating environment in a container, similar
to Sun’s use of the term, which Linux refers to as “Virtual Private Servers” (VPS). The base
operating system instance can run its own applications, or just act as the virtual machine monitor
for the server, whichever is preferred by the administrator. The host instance can be clustered
with other host instances on other physical servers with the appropriate clustering technology for
the host operating system.
The Virtuozzo Management Console (VZMC, the primary management interface) or the Virtuozzo
Control Center (VZCC, a web based management interface) will manage all of the host and guest
instances. The guest instance itself can be managed through the Virtuozzo Power Panel (VZPP).
Virtuozzo has the standard features of creating a guest instance from a standalone instance
(VZP2V), but can also import existing virtual environments from VMware and Xen and transform
them into containers. Some of the system files (DLL’s in Windows, for example) in the new
containers are replaced with links to the Virtuozzo files during this migration, which is also how
Virtuozzo can maintain different patch levels in the different containers on the same host
operating system. But for all common operating system files, Virtuozzo maintains a single copy in
the host operating environment, which means that mass deployments of a common version of an
operating system (which is the normal and recommended case) can be done with very little disk
space, far less than having unique files for each operating system installation.
Virtuozzo has the ability to move a running guest instance from one server to another as true live
migration, and takes about 15-20 seconds in a Windows environment and zero time in a Linux
environment.
Note that Virtuozzo is an exception to the usual rule that O/S and application vendors require
error reports to be reproduced in a non-virtualized environment, and will not accept error reports
which occur in a virtualized environment: Microsoft has agreed to support its products inside a
Virtuozzo virtual container.
The Parallels Infrastructure Manager controls all of the above tools, and provides an overview of
all of the hardware resources and the containers, with the ability to drill-down to each container.
The DataCenter Automation Suite (DAS) tracks the performance of all of the virtual environments,
and can perform chargeback. The containers can be dynamically modified for processor, memory
and I/O. Virtuozzo includes a copy of Acronis TrueImage Backup for dynamic snapshots of the
containers.
Virtuozzo performs all I/O in the host operating system.
Variable Capacity Virtualization
Many environments have business cycles which require significantly more or less resources at
various times. The increased resources required by the end of the month processing for
accounting summaries and the Christmas business peak for retail businesses, or the reduced
requirements for a weekend when no business users are logged in to the system, both mean that
the system resources are wasted during the off-peak times. Traditionally, these resources would
simply go to waste, with very low or even zero utilization for long periods of time. However,
recently many vendors have implemented the ability to enable and disable resources dynamically,
allowing the owners of the system to only pay for the resources that are actually being used at
any given moment.
This may not seem like virtualization. However, remember my definition of virtualization at the
beginning of this paper: the abstraction of resources from their physical function. One of the
capabilities that vendors have implemented is the ability to migrate authorization to use the
resources (aka, software licenses) between servers, such that there are fewer licenses than there
are physical resources but all of the resources of any given system are available for portions of the
time. Because it is using the hardware and software in a way that adapts it to the business
requirement (as opposed to fitting the business requirement into the capabilities of the
computing environment) it seems to me to fit the definition of virtualization, so I am including
this concept in this paper.
There are several different types of variable capacity virtualization:

Some environments can guarantee that they will need to be expanded, but not be able to
accurately predict when that expansion will be required, and that the resources will never
be removed. A business which is growing its customer base, or its employee population,
might require a constantly growing system. This is the “shrink-wrap” model, similar to the
software model where if you break the shrink-wrap plastic around a software kit, you
have purchased the software and cannot return it for a refund. This model allows you to
have the resources physically installed in the system but not running or paid for, until you
decide that they are required, at which time you purchase them and they are enabled for
use.

Some environments have peaks and valleys of utilization, such that resources need to be
enabled and disabled at various times. A business which has seasonal variation in
utilization, or a business which needs to be ready for momentary spikes of demand,
might wish to only pay for the resources which are actually being used at any given time.
Or another business might need to have several servers which are fully populated with
resources (such as processors), but have fewer software licenses than there are physical
processors, and to migrate these software licenses to the server which most needs them
at any given moment. This is the “phone card” model, similar to the pre-paid phone
cards which have a certain amount of time on them, which can be used whenever the
owner of the card needs to make a call. This model allows the owner of the server to buy
a certain amount of time on that resource, and to consume that time whenever the
business requires.

Some environments have the same peaks and valleys of utilization, but the business uses
leasing as a financial tool.
Note that the licensing and support of this equipment is over and above the pricing of the
equipment itself. So, for example, any Oracle or SQL Server or SAP licenses required for the
additional processors, would be in addition to the fees paid to the hardware vendor for the
equipment itself. This should be considered when evaluating variable capacity virtualization, and
the funds for the entire environment of the additional equipment should be forecast, so you
won’t be surprised by an audit stating that you are out of compliance with your mission critical
software packages, or by the inability to activate that software when you desperately need it.
Servers | Operating Systems | Permanent Activation | Temporary Activation | Utility Pricing | Purchase Method | Granularity
HP Integrity, PA-RISC | HP-UX, OpenVMS | iCAP | TiCAP (30 days, by 30 minutes) | Pay Per Use (post-pay) | Website, E-mail reporting | Cell boards, processors, memory (in cell boards)
HP ProLiant | Linux, Windows | iCAP | n/a | n/a | Order via SIM auto tracking | Server, Blade Enclosure
IBM iSeries, pSeries | AIX, i5 | Capacity Upgrade on Demand (pre-pay) | Trial CoD (30 days), Reserve CoD (30 days, by 24 hours) (pre-pay) | On/Off CoD (post-pay), Capacity Backup (post-pay) | Website, E-mail reporting | Processors, memory
Sun E2900 - E25K | Solaris | COD 2.0 | T-CoD (30 days, by 30 days) (post-pay) | N/A (optional) | Phone call to Sun Center | Uniboards, processors, memory
Figure 16 Variable Capacity Virtualization
HP
HP implements Instant Capacity (iCAP) across both the Integrity and PA-RISC cell based servers,
and the ProLiant servers, but in a very different way.
The Integrity and PA-RISC cell based servers (rx76xx, rx86xx, rp7xxx, rp8xxx and Superdome) allow
iCAP for individual processors, as well as cell boards which contain processors and memory. iCAP
is the purchase plan, where a customer would pay some percentage (approximately 25%) of the
price of the equipment and have it installed in the system. When the customer needs the
additional capacity, they would contact HP with a purchase order for the remaining cost of the
equipment (note that this is the price at the time of the activation, not the price at the time of
original purchase), and would get a code word, which can be entered into the system and the new
equipment is activated for that system instance. This works with either nPARs for HP-UX or
OpenVMS (which would instantly see the new equipment in the operating system instance where
you entered the code word), vPARs for HP-UX (which again would see the new equipment in the
operating system instance where you entered the code word, but which could then move the
processors to another operating system instance in the same nPAR), or dynamic nPARs for HP-UX
(which could add the new cell board and its processors and memory to a running instance).
The same servers offer Temporary Instant Capacity (TiCAP) for processors. TiCAP is the “phone
card” model, where you purchase 30 days of processor utilization, and which you can then use in
any HP-UX or OpenVMS operating system instance. The time is checked approximately every 30
minutes, and each TiCAP core which is active at that point counts against the time. So, for
example, if you activated four cores for two hours, you would consume eight hours of time
against your 30 days.
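The accounting can be sketched in a few lines of Python; the 30-minute check interval comes from the
text above, while the sample data is hypothetical:

    # TiCAP consumption sketch: at each (roughly 30-minute) check, every active
    # TiCAP core consumes that interval against the purchased balance.
    CHECK_INTERVAL_HOURS = 0.5

    def hours_consumed(active_cores_per_check):
        return sum(cores * CHECK_INTERVAL_HOURS for cores in active_cores_per_check)

    # Four cores active for two hours is four 30-minute checks with 4 cores each,
    # which consumes 8 core-hours of the 30-day (720-hour) balance.
    print(hours_consumed([4, 4, 4, 4]))  # 8.0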
TiCAP is bound to a single server, and customers often want to share licenses between servers.
This is primarily useful in a disaster recovery scenario, where the DR server is idle for months at a
time. HP offers Global Instant Capacity (GiCAP), which is an extension of TiCAP, in that it will
cover a group of servers which may be geographically diverse. So, for example, you could have
two identical rx8640 servers, one for production and one for DR, with all 32 cores active in your
production site and only one core active in your DR site. If you need to fail over to your DR site,
you use the GiCAP Manager to deactivate 31 licenses at your production site (since, by definition,
you aren’t using them anymore) and activate them at your DR site. In this way, both the
production and DR site are fully licensed (though not at the same time), and you have only
purchased 33 licenses for the environment, saving the licensing costs of 31 licenses.
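A small Python sketch of the license arithmetic in that example (the server names are hypothetical):
because the GiCAP group only has to cover the cores that are active at any one moment, the license
pool can be far smaller than the installed core count.

    # GiCAP sketch: licenses follow the workload between group members, so the
    # pool only needs to cover the peak number of simultaneously active cores.
    installed_cores = {"production_rx8640": 32, "dr_rx8640": 32}
    active_cores    = {"production_rx8640": 32, "dr_rx8640": 1}   # normal operation

    licenses_needed = sum(active_cores.values())            # 33 licenses, not 64
    licenses_saved  = sum(installed_cores.values()) - licenses_needed
    print(licenses_needed, licenses_saved)                  # 33 31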
Both TiCAP and GiCAP allow you to consume more than is available on your TiCAP/GiCAP license.
In this case, the monitoring records a negative balance, which is then adjusted when you purchase
the next TiCAP/GiCAP license. So, for example, if you consume five extra days of processing time,
and you purchase a new 30 day license, you will have 25 days remaining on your license when
you install the new license. HP will not disable processors when you exceed the time on the
license.
Note that iCAP processors include 5 days of TiCAP licenses as part of their purchase.
Pay Per Use (PPU) is the leasing model, where all of the equipment is fully installed and
operational at all times, but HP installs a monitor on the system to measure the actual utilization
of the system. This is averaged over 24 hours, and then summed over a month, to produce an
average utilization, which is then applied to your lease payment. So, for example, if your
utilization was 80%, you would pay roughly 80% of your lease payment for that month. If the
next month you experienced a spike to 90% utilization, your lease payment would be 90%. And
so on if in the following month your utilization dropped to 70%. Many customers achieve
significant savings with this model.
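The proportionality described above can be sketched as follows (the lease amount and utilization
figures are hypothetical, and HP's actual billing formula may differ in detail):

    # Pay Per Use sketch: utilization is averaged over each 24-hour period, the
    # daily averages are combined into a monthly average, and that fraction of
    # the full lease payment is billed for the month.
    def monthly_ppu_payment(daily_average_utilization, full_lease_payment):
        monthly_average = sum(daily_average_utilization) / len(daily_average_utilization)
        return monthly_average * full_lease_payment

    print(monthly_ppu_payment([0.80] * 30, 10_000))  # 8000.0 for an 80% month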
TiCAP, GiCAP and PPU require communication back to HP. This can be done through the
OpenView Services Essentials Pack (formerly ISEE) via e-mail, through the HP website, or via phone.
Note that if your system is isolated for security reasons and cannot communicate to HP, HP will
accept FAX’ed or regular mail reports or other means of communication.
If an active processor fails, an iCAP or TiCAP processor can be activated in the same nPAR to take
the place of the failed processor. This is not a chargeable event, in that the same number of
processors were active before and after the processor failed.
The ProLiant servers, both rack mount and blades, implement iCAP on entire servers and blade
enclosures. HP will work with the customer to establish inventory levels and invoicing
information. Customers are able to simply install servers from their inventory, which are detected
by the Systems Insight Manager (SIM) and updated in the invoicing database. Then, when
benchmarks are reached (for example, down to 5 servers in stock), purchase orders are
automatically executed and product shipped to restore the inventory levels.
Blades are done the same way, except that the blade enclosure costs are amortized across the
number of blades in the enclosure. So, for example, if 16 blades are to be installed in an
enclosure, the price of each blade would include 1/16th of the price of the enclosure itself.
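As a simple illustration (the prices are hypothetical), the per-blade amortization works out as:

    # Each blade's invoiced price carries an equal share of the enclosure price.
    def amortized_blade_price(blade_price, enclosure_price, blades_per_enclosure=16):
        return blade_price + enclosure_price / blades_per_enclosure

    print(amortized_blade_price(2_000, 8_000))  # 2500.0, i.e. 1/16th of the enclosure per blade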
IBM
The Power5 and Power6 based servers allow Capacity Upgrade on Demand (CUoD) for processors
and memory. CUoD is the purchase plan, where a customer would pay some percentage of the
price of the equipment and have it installed in the system. When the customer needs the
additional capacity, they would contact IBM through their website with a purchase order for the
remaining cost of the equipment, send the Vital Product Data (VPD, effectively the system serial
number) via either FAX or via e-mail from the HMC, and would get a code word, which can be
entered into the system via the HMC and the new equipment is activated, and available for use
with any DLPAR or SPLPAR which are uncapped.
The same servers offer Trial Capacity on Demand (Trial CoD) and Reserve Capacity on Demand
(Reserve CoD) for processors and memory. Reserve CoD is the “phone card” model, where you
purchase 30 days of processor utilization, and which you can then use in any AIX and Linux
operating system instance by simply moving the Reserve CoD processors to the active processor
pool via the HMC. The time is consumed for the entire time each processor is in the active
processor pool, and each processor counts against the time. So, for example, if you activated four
cores for two hours, you would consume eight hours of time against your 30 days. The
processors are available to any DLPAR or SPLPAR which are uncapped.
Trial Capacity on Demand is slightly different, in that it operates for an entire 30 day period (which
is counted only during the time when the server is powered on). It is intended for an evaluation
of servers and software, or other extended period of usage. Otherwise, it works the same way as
Reserve CoD.
Capacity Back Up (CBU) works with On/Off Capacity on Demand (On/Off CoD) to allow the
activation and deactivation of processors and memory in a leased environment. Activation of a
processor or memory unit is for an entire 24 hour period, and usage is summed and paid for at
the end of the quarter. Otherwise, it works the same way as Reserve CoD.
Sun Microsystems
The Uniboard based servers (E2900, E4900, E6900, E20K and E25K) allow Capacity on Demand
(CoD) for individual processors and memory. CoD is the purchase plan, where a customer would
pay some percentage of the price of the equipment and have it installed in the system. When the
customer needs the additional capacity, they would contact Sun via phone with a purchase order
for the remaining cost of the equipment, and would get a Right to Use (RTU) license via e-mail,
which can be entered into the system and the new equipment is activated for that system
instance.
The same servers (except the E2900) offer Temporary Capacity on Demand (T-CoD) for processors. T-CoD is the “phone card” model, where you purchase 30 days of processor
utilization. If six months of T-CoD RTU’s are purchased, that is equivalent to purchasing the
equipment.
Sun offers the “headroom” feature, where you can activate the processors and memory
immediately, and then purchase the CoD or T-CoD RTU’s after the fact. You have up to one
month to purchase these RTU’s.
If an active processor fails, a T-CoD processor can be activated to take the place of the failed
processor. This is not a chargeable event, in that the same number of processors were active
before and after the processor failed.
Virtualization Automation
All of the above technologies allow the system administrator to migrate, enable and disable the
resources by hand, either through a command line interface (CLI), a script or a management
console. However, the demands of a 24x7 business environment, and the inevitable peaks and
valleys of business processes, often makes it inconvenient to require the human beings who are
managing the system to closely monitor the performance of all of their systems on a moment by
moment basis. So some vendors have created software automation which can perform this
monitoring, and which is capable of migrating, enabling and disabling the resources according to
business rules.
Citrix XenServer: XenCenter
Egenera: Processing Area Network
EMC VMware: Virtual Infrastructure (micro and resource partitions)
HP: CLI, SIM, CapAd (soft and micro partitions); CLI, PRM, SIM (resource partitions); CLI, SIM (utility)
IBM: CLI, HMC, IVM (soft and micro partitions); WLM (resource partitions)
Linux (Red Hat, SUSE): CLI, Virtual Machine Mgr (micro partitions); PRM (resource partitions)
Microsoft: VMM, MOM, System Center (micro partitions); SRM (resource partitions)
Sun: CLI (soft partitions); CLI, xVM Ops Center (micro partitions); many tools (resource partitions)
SWsoft: Virtuozzo Mgt Console, DataCenter Automation (resource partitions)
XenSource: n/a
All other combinations of vendor and virtualization type are n/a.
Figure 17 Virtualization Management
Note that “n/a” in the above table indicates that either this vendor or product does not offer this
type of virtualization so by definition there is no automation capability, or this vendor does not
offer automation of this type of virtualization.
Vendor | Soft | Micro | Resource | Utility
Egenera | n/a | PAN | n/a | n/a
EMC VMware | n/a | FSS, DRS | n/a | n/a
HP | gWLM | n/a | PRM, (g)WLM | gWLM
IBM | eWLM, PLM | eWLM, PLM | WLM | n/a
Linux (Red Hat, SUSE) | n/a | DRS | n/a | n/a
Microsoft | n/a | SRM | SRM | n/a
Sun | n/a | xVM Ops Center | FSS, SRM | n/a
SWsoft | n/a | DataCenter Automation | DataCenter Automation | n/a
XenSource | n/a | n/a | n/a | n/a
Figure 18 Virtualization Automation
Operating System Emulation
One method of virtualizing systems that may not be obvious as a virtualization technology is
emulation. But if you think about it, emulation is the abstraction of server, storage, and network
resources from their physical configurations, in that the operating system may be running on
hardware which is completely different than the original server hardware, and so I include it as
virtualization.
The following are popular methods of emulation by virtualization:
 Aries
o HP-UX PA-RISC on HP-UX Integrity
 Charon-VAX
o OpenVMS VAX on Windows x86 and OpenVMS Alpha
 Intel
o IA-32 Execution Layer on Itanium
 Platform Solutions, Inc with T3 Technologies
o zOS, OS/390 on HP Integrity (Liberty Server)
o zOS, OS/390 on IBM xServer (tServer)
 Transitive Software (QuickTransit)
o Solaris SPARC on Solaris x86, Linux x86 and Linux Itanium
o Linux x86 on IBM pServer (PowerVM Lx86)
Note that at this time (July 2008), IBM has purchased Platform Solutions, with the stated intent of
“maximizing customer value” from this technology. Pardon me if I cynically believe that they will
bury this technology completely, in order not to interfere with their mainframe revenue stream.
Summary
All of the above technologies fit into a continuum of server partitioning, as shown in Figure 19.
The continuum runs from a single physical node (or a single system image per node within a cluster), through hard partitions within a server (each an operating system image with hardware fault isolation and dedicated CPU, RAM and I/O), through soft partitions and/or micro-partitions within a hard partition of a server (each an operating system image with software fault isolation and dedicated, virtual or shared CPU, RAM and I/O, with virtualized memory), to resource partitions and/or secure partitions within a system image (each application receiving guaranteed compute resources as shares or percentages). Isolation is greatest at the left of this continuum, and flexibility is greatest at the right.
Figure 19 Server Virtualization Continuum
There are many choices that can be made when deciding how to virtualize a server, and these
choices can be overwhelming. Not all servers can do all of the types of virtualization, but even
the subset that a particular server can do can often present too many choices.
You need to go back and understand the business goals for the server, in order to make good
choices for the one or more virtualization technologies that you will apply to that server. And you
should consider some of the following criteria when choosing among those technologies:

Internal resistance. Developers and BU representatives may resist new application hosting,
partitioning and stacking strategies.

ISV Support. ISV’s may not support their application running in a virtualized environment.

Platforms that support virtualization at the hardware level have minimum OS versions
that might be incompatible with some older applications.

Security challenges that occur when two separate applications with privileged users are
combined with non-privileged users on the same OS instance.

What applications are available for that server and that operating environment? Does
your application vendor support a virtualized environment for production? Do they have
a set of “best practices” for a virtualized environment? And this applies equally to the
case where you are running “home-grown” applications and acting as your own
application vendor and support.

What is the size of the applications? Are you doing server consolidation where a single
processor with a few tens of megabytes of memory is sufficient for multiple instances, or
are you adding more and more load to the system where you will constantly need to add
processors and memory and I/O cards?

What is the degree of fluctuation in the workload? Is it purely cyclic (Christmas buying
season, end of the month processing, etc), is it random (acquiring a new company which
doubles the user base), or is it constantly increasing? How predictable is the growth?
And can you afford to have one workload slow down the other workloads by consuming
all of the available resources on the server?

What level of security isolation do you require? An application service provider (ASP)
would typically offer and advertise strong isolation between customers, but any business
might require that same level of isolation between departments, due to regulatory issues.

What degree of availability do you require? Can you afford to have many application
instances become unavailable for minutes during the failover if a single system goes
down?

How do you plan for new workloads? Testing and training are the most common
activities which require new operating system instances quickly, and then will remove
them just as quickly, but how quickly do you need a new database instance or a new
application instance?
There are many choices in server virtualization, and many technologies involved in implementing
those choices. But you need to focus on the business reasons you purchased the server systems,
make the correct choices for your business requirements, and then choose the correct server
virtualization technology to implement those choices.
Network Virtualization
Networking virtualization is not as complex as server virtualization, because there are many fewer
layers of hardware and software involved. But we need to ensure many of the same features as in
server virtualization, including efficiency, security, high performance, scalability, high availability,
easy growth and redundancy, as our networks have almost moved beyond mission-critical into
the realm of the very air we breathe. Without the networks providing communication between
our users and our servers, all of the topics that I have discussed up to this point are completely
useless.
I will focus primarily on TCP/IP V4, as that is by far the most common networking protocol in use
today. But where applicable, I will discuss other protocols such as AppleTalk, DECnet, NetBEUI,
SNA, TCP/IP V6, UDP, etc, as well as other transport media like InfiniBand.
TCP/IP Subnets and Virtual LANs
One of the first topics we need to discuss is subnetting. When Transmission Control Protocol /
Internet Protocol (TCP/IP) was designed in 1973, the designers provided for networks of various
sizes. The first section of the address was to be the network identifier, with the rest of the address
being the specific host in that network. The size of the networks was explicit in the class-ful 32 bit
addressing scheme. If the first bit of the address was 0, the network identifier was 8 bits long, and
it was one of the 126 Class A networks with up to 16 million addresses each, with the leading
number being from 1 through 126 (127 being reserved for loopback). If the first two bits were 1
and 0, the network identifier was 16 bits long, and it was one of the 16,384 Class B networks with
up to 65 thousand addresses each, with the leading number being from 128 through 191. If the
first three bits were 1, 1 and 0, the network identifier was 24 bits long, and it was one of the 2
million Class C networks with up to 254 hosts each, with the leading number being from 192
through 223. And this made sense at the time, and several of the founding networking companies
got some of the prime networking real estate: Digital Equipment Corporation was assigned an
entire Class A network (16.x.x.x).
The routers build their tables based on these addresses, and perform all traffic routing based on
the network identifier, such that each network identifier belongs to a single domain (ie, company
or other organization).
But after a while, it became obvious that class-ful addressing would not work out. For example,
even Digital Equipment could not efficiently use 16 million addresses, as this was more than 100
times the number of employees it had at its peak. There were far more than 126 other
organizations which needed more than the 65 thousand addresses available in a Class B network,
but only 126 Class A networks to hand out. And companies which could not justify getting an
entire Class B address needed more than the 254 addresses available in a Class C, so they needed
multiple Class C domains.
So the standard was extended to Classless Inter-Domain Routing (CIDR), where the break
between the network identifier and the host address was more extensible and flexible. So each
TCP/IP address requires a second piece of information, the “network mask”. We continue to use
the class-ful addressing scheme to determine the network address, but add a new concept of the
“subnet”, so we can determine which pieces of the 32 bit TCP/IP address are the network
identifier, which are the subnet identifier, and which are the host.
So, for example, if the TCP/IP address is 16.1.2.3 and the network mask is 255.255.255.0/24, then
you apply the binary AND function to the two values, yielding 16.1.2.0 as the combined network and subnet identifier.
But because the first bit of the network identifier is 0, we know this is a Class A address, so the
network identifier is 16.0.0.0, the subnet identifier is 0.1.2.0, and 0.0.0.3 is the host.
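To make that arithmetic concrete, here is a minimal Python sketch (my own illustration, not anything from the original TCP/IP tooling) of the AND operations just described, splitting 16.1.2.3 with the 255.255.255.0 mask into its class-ful network, subnet and host portions:

import ipaddress

addr = int(ipaddress.IPv4Address("16.1.2.3"))
mask = int(ipaddress.IPv4Address("255.255.255.0"))      # the /24 network mask from the example
class_a_mask = int(ipaddress.IPv4Address("255.0.0.0"))  # implied by the leading 0 bit (Class A)

network = addr & class_a_mask             # 16.0.0.0 - the class-ful network identifier
subnet = addr & mask & ~class_a_mask      # 0.1.2.0  - the subnet identifier
host = addr & ~mask                       # 0.0.0.3  - the host identifier

for label, value in (("network", network), ("subnet", subnet), ("host", host)):
    print(label, ipaddress.IPv4Address(value & 0xFFFFFFFF))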
How does this help? Well, by allowing us to assign more granular groupings of networks to
different organizations. For example, assume a company needs 500 addresses. Instead of
assigning an entire Class B address to a single company (and wasting roughly 65 thousand host
addresses), or assigning two Class C addresses to that same company (which complicates the
network design and increases the routing table load on the primary routers), we can assign
something between the two, with just enough addresses to satisfy the requirements. So we could
give this company network identifiers 154.11.204.x and 154.11.205.x with a subnet mask of
255.255.254.0/23, and they would both be routed to this company, while consuming only about
500 addresses, even though they fall in the Class B address space.
So what does this have to do with networking virtualization? Well, you can apply the same
concept inside your company to further subdivide your network even beyond the high level
subnet that was assigned to your company. If we were still using class-ful addressing, we could
only have two different Class C domains (255.255.255.0/24):
154.11.204.1 through 154.11.204.254 and
154.11.205.1 through 154.11.205.254
But with a subnet mask of 255.255.255.128/25, you could assign network addresses:
154.11.204.1 through 154.11.204.126 to one group,
154.11.204.129 through 154.11.204.254 to another group,
154.11.205.1 through 154.11.205.126 to a third group and
154.11.205.129 through 154.11.205.254 to a fourth group.
With a different subnet mask, you could go down to any level you wished, including subnets of 2
hosts (which would be silly, but perfectly in accordance with the standards). And you could have
different subnets of different sizes. But let’s keep it simple.
Now that we have the addresses segmented, we only have four domains, so our routing tables
inside our company are very simple.
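For readers who want to experiment, the following short Python sketch (using only the standard ipaddress module, and the made-up 154.11.x.x ranges from this example) reproduces the /25 split into the four groups listed above:

import ipaddress

for net in (ipaddress.ip_network("154.11.204.0/24"),
            ipaddress.ip_network("154.11.205.0/24")):
    for group in net.subnets(new_prefix=25):       # i.e. a 255.255.255.128 mask
        hosts = list(group.hosts())
        print(group, "->", hosts[0], "through", hosts[-1])

Running it prints the same four host ranges shown above, one per /25 subnet.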
This very lengthy discussion is simply to introduce the concept of subnetting, which is a
virtualization strategy for network addressing. And in and of itself it is moderately useful
(primarily to allow efficient subdivision of the increasingly scarce TCP/IP V4 network addresses,
given the explosion of the number of devices that want IP addresses), but the real advantage in
your company comes in when we begin to do virtual Local Area Networks (LANs).
A LAN is simply a broadcast domain, where all network addresses are shared with no routing. So
all network ports can see all network packets, as anyone who has experimented with a network
sniffer or put their NIC into promiscuous mode can attest. Some network switches will perform
some level of segmentation, such that the packet is only sent out the port to the specific device,
but this is not universally true. A LAN is usually defined by one or more network switches. A
Virtual LAN (VLAN) is simply a subdivision of that LAN, so that the broadcast domain is smaller
and more focused on only those servers which interact with each other the most. VLANs are
defined by IEEE 802.1q.
VLANs can be defined in a variety of ways, depending on the requirements of the business.
VLANs can be defined either at networking Layer 2 (media layer) or at Layer 3 (networking layer).
Layer 2 VLANs define their members by physical attributes: either Media Access Control (MAC)
addresses (the unique identifier on every network device) or by the port number on the network
switch. This is extremely easy to implement, and corresponds well to the physical environment:
everyone on this floor is on the same VLAN, or everyone in a specific group on this floor is on the
same VLAN. However, it is often difficult to scale, as it requires tracking the MAC addresses of
every network device and potentially assigning them to one of the VLANs. With tens of network
devices, this is easy, but with thousands of network devices (including PC’s on desktops, servers,
wireless devices wandering around the environment, plus the number of devices that are added
or replaced or removed each month, etc) this is increasingly difficult to keep track of.
Layer 3 VLANs define their members by non-physical attributes: TCP/IP addresses, the networking
protocol being used, whether they are part of an IP multicast group, etc. This allows much more
flexibility, and removes the geographic portion of the solution. So, for example, you could assign
a VLAN to everyone who is using the same networking protocol. This is common when different
divisions of a company chose different solutions (DECnet vs TCP/IP, or AppleTalk vs SNA), or even
when companies have grown by acquisition and the different companies had chosen different
networking solutions. Protocol VLANs can achieve this level of isolation among the various
groups, isolating the TCP/IP traffic from the DECnet traffic from the AppleTalk traffic.
Layer 2 VLANs are good for non-TCP/IP environments, as well as for fairly static environments,
while layer 3 VLANs are good for more dynamic environments.
Assuming that your entire company has switched to TCP/IP (and this is increasingly common), you
could assign a group of TCP/IP addresses to the Finance Department, and then assign those
addresses to a specific VLAN. All desktops, notebook PC’s with wireless cards, printers, servers
and other devices which are on the network all get their TCP/IP addresses from this group, and so
they are all automatically part of the Finance VLAN. Even when the MAC address changes (due to
replacing a broken NIC, or upgrading to a new PC), the TCP/IP address remains the same, so
membership in the VLAN remains the same.
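A minimal sketch of that kind of Layer 3 membership rule might look like the following; the subnet assignments are purely hypothetical, but they show how the VLAN is derived from the subnet an address belongs to, so that a new MAC address in the same IP range changes nothing:

import ipaddress

vlan_by_subnet = {
    ipaddress.ip_network("154.11.204.0/25"): "Finance",
    ipaddress.ip_network("154.11.204.128/25"): "Engineering",
}

def vlan_for(address):
    ip = ipaddress.ip_address(address)
    for subnet, vlan in vlan_by_subnet.items():
        if ip in subnet:
            return vlan
    return None    # device is not assigned to any VLAN

print(vlan_for("154.11.204.42"))     # Finance
print(vlan_for("154.11.204.200"))    # Engineering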
IEEE standard 802.1x covers edge access security, where a user is authenticated via some means
(whether at Layer 2 or 3, as discussed above), and then is allowed access to some portion of the
network, such as a specific VLAN.
The entire purpose of this, as stated before, is to restrict the size of the broadcast domain. All
members of the Finance VLAN will see each other’s packets, but no one else will. This reduces the
amount of traffic on the backbone routers, because if the packet is intended for one of the
members of the VLAN, it never leaves their local switch and so is never routed. It also enhances
security, in that all members of the Finance VLAN are automatically granted rights to things like
printers, but non-members are automatically barred from those devices. In fact, the routers
themselves perform this security function by simply never forwarding the packets to the devices,
unless they are on the correct VLAN.
This works extremely well when all members of the VLAN are in the same geographical location,
and can be (for example) all plugged into the same network switch. This is the optimal case,
where no routing ever occurs. However, it is rarely achieved in the real world, given the way
people move around between offices, plus wireless devices which are intended to move between
areas of the company. And you certainly don’t want a person to lose their rights to access a
database server just because they happen to be visiting a remote office, which is far away from
their home network switch.
This means that in some cases, the network packets need to be routed. The way this is normally
accomplished is via tunneling or label switching.
Tunneling is taking the entire network packet, and treating it as the payload of a larger network
packet. So a network packet that is isolated to a VLAN is incorporated into a network packet
which is fully routable, and is not restricted to a specific VLAN. This larger packet is then sent
across one or more routers to the appropriate network segment, where it is peeled away to reveal
the smaller packet inside, which is then sent to the appropriate network device. We will discuss
this in more detail in the VPN section.
Label switching prefixes the original packet with a label which identifies the source and target
routers for this packet. Multi Protocol Label Switching (MPLS) is a common Layer 3 method of
doing this. MPLS adds a prefix to the packet and forwards it on to the target router, but the value
of MPLS is that the prefix tells the target router how to forward on the packet, such that the
target router does not have to spend a lot of time analyzing the packet and calculating the route.
This makes it extremely efficient.
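The idea behind label switching is easy to sketch in a few lines of Python; the label values and next-hop names below are invented purely for illustration, but they show why forwarding becomes a simple table lookup rather than a route calculation on the full packet:

label_forwarding_table = {
    100: ("router-B", 200),      # incoming label 100: forward to router-B, swap label to 200
    200: ("router-C", 300),
    300: ("edge-router", None),  # None: pop the label and deliver the inner packet
}

def forward(label, payload):
    next_hop, new_label = label_forwarding_table[label]
    if new_label is None:
        return next_hop, payload             # label popped at the edge
    return next_hop, (new_label, payload)    # label swapped, payload untouched

print(forward(100, "original VLAN packet"))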
You need to consider several factors when setting up VLANs. These include:

Access control – Ensure that all legitimate users have access to the services that they
require, but that non-authorized users are excluded from accessing your environment.
Whether this is done at Layer 2 or Layer 3, you need to correctly assign the privileges and
rights to each user and each device, and to protect those systems that offer services.

Path isolation – Make sure that the primary purpose of VLANs is accomplished: reducing
the amount of traffic on the primary network backbone. You need to minimize the
amount of routing that occurs between network segments, keeping the traffic isolated to
each group of users in each VLAN.

Centralized management – Implement centralized policy enforcement, with a common
set of criteria, authentication, tracking and verification across your entire enterprise.
Whether this is due to regulatory requirements, financial and reporting requirements, or
simply best practices, you need to treat your entire company the same, such that people
who move around in the company (such as with wireless devices or travelers who do not
have a fixed office) are treated equally no matter where they are at any moment.

Be careful with DHCP: moving between subnets generates a new IP address, and a Layer 3
VLAN that was defined in terms of the original IP address range will no longer include the device.
I spent a great deal of time discussing subnetting, primarily because VLANs and subnets are often
used interchangeably. Subnets isolate the routing to a specific group, and VLANs isolate the
packet traffic to a specific group. It is extremely common to have a VLAN map to a specific
subnet, and you need to consider both when you set up either one.
If you are going to integrate Cisco and HP Procurve hardware on the same network, and you
intend to use VLANs there are only a few things you need to remember:

For end nodes - Cisco uses "mode access", HP uses "untagged" mode.

For VLAN dot1q trunks - Cisco uses "mode trunk", HP uses "tagged" mode.

For no VLAN association - Cisco uses no notation at all, HP uses "no" mode in the
configuration menu, or you have VLAN support turned off.
Virtual Private Networks (VPN)
VLANs isolate the traffic, and that isolation results in somewhat higher security, in that devices not
on the VLAN simply don’t see the traffic at all. However, as we discussed above, sometimes the
traffic does leave the isolated area to travel over more public routes, such as when the VLAN is
distributed across an entire physical WAN with multiple routers. This forces the VLAN traffic to
traverse a shared network (whether that is a network private to a company but accessible to all
divisions of that company, or a truly public network such as the Internet), and therefore we need to
implement a higher level of security.
One technique is to use tunneling, such as Virtual Private LAN Service (VPLS). This tags the VLAN
that the packet came from with an identifier that is unique to that VLAN. This is an extension of
the MPLS technique described above, except that it more clearly identifies the VLAN and not just
the target router. The target edge device that receives the packet then uses this identifier to take
action on the packet, such as routing. This is at Layer 2, and binds multiple LANs and VLANs into
a single larger environment.
The encapsulation is most frequently done by the Point-to-Point Tunneling Protocol (PPTP, which requires
TCP/IP) or the Layer 2 Tunneling Protocol (L2TP, which can work with many protocols and networking
methods such as ATM). It fully encloses the original packet, and routes the contents to the edge
device which then extracts the original packet and routes it appropriately.
But this does not protect the contents of the packet: a packet sniffer could identify the VPLS or
MPLS packet and then peek inside to read the source and target addresses of the original
(embedded) packet, and even the contents of the packet itself.
This may be acceptable if the underlying network is trusted, such as it would be inside a building
or campus network. These are called Trusted VPNs. But this may not be acceptable if there are
security concerns inside the company, such as either regulatory requirements or multiple levels of
security in a single organization. And this is clearly not acceptable if some external network is
used, such as a city-wide WAN service or the Internet itself.
In this case the security responsibility resides in the VPN, which requires some type of
authentication of the users and then encryption of the data.
Normal VLANs establish membership by various means (TCP/IP address, MAC address, switch
port number, etc), but VPN’s demand specific authentication of the user. These strong
authentication measures may include usernames and passwords, using the Password Authentication
Protocol (PAP), the Challenge Handshake Authentication Protocol (CHAP, including several versions
from Microsoft called MS-CHAP and MS-CHAP V2), and the Extensible Authentication Protocol
(EAP). But many go further than this by requiring physical security keys implementing Public Key
Infrastructure (PKI), such as USB dongles and smart cards which contain digital certificates from a
trusted Certificate Authority agency, usually your corporate networking or security group.
Then you need to encrypt the data. There are many types of encryption, but the most common
for VPN’s is public key encryption, where a public key is paired with a private key to produce quite
excellent security. The private key is usually stored on the same smart card as the authentication
information.
Note that VPN encryption does not provide end to end security. The data is fully protected when
sent over the untrusted network, but then the data is decrypted by the VPN server at the trusted
network edge, such that the information is sent in clear over the corporate network. In most
cases this is completely adequate, but if necessary, corporations can implement secure VPN’s with
encryption even inside their corporate network.
Network Address Translation and Aliasing
One of the problems with TCP/IP V4 is the extremely limited and disjoint addressing space. I am
sure that the original designers never imagined the scope of what they were inventing, nor could
they project the simply incredible number of networks, hosts and clients that we have today.
Subnetting helps, but it doesn’t do anything to increase the number of available addresses, it
merely lets us use any granularity that we wish in segmenting those addresses.
Another problem is that we need near infinite scalability today in some of our web services, which
far exceeds the capacity of any single system by any vendor. But we don’t need that scalability all
the time, as it is economically convenient to expand and contract the capabilities dynamically, in
response to actual demand.
A solution for both of these problems is Network Address Translation (NAT). NAT uses the
concept of a front end load balancing system with a public network address, to mask the multiple
server systems behind it which actually perform the work. The load balancer simply accepts any
incoming request from the user base, and then distributes it to one of its assigned servers. The
assigned servers can either use one of the addresses owned by the company
providing the service, or it can take advantage of some of the special addresses implicit in the
TCP/IP standard, such as the 10.*.*.*, 172.16.*.* through 172.31.*.* and 192.168.*.* sequences. For
the purposes of this discussion, I will use the 192.168.*.* numbers. These special addresses are
reserved via RFC 1918 for private networks, and are not routable, and so are usable for operations
like this. The advantage of using one of these addresses is that the same addresses can be used
for multiple sets of load balanced servers, optimizing the use of the addresses available to the
organization.
When a request comes in to one of the load balancing servers, in the example in Figure 20 we will
use hp.com/pages with TCP/IP address 16.1.1.1, it is immediately forwarded to one of the
application servers. For example, it could be forwarded to the server with address 192.168.1.1.
The load balancing server would remember the packet and which server it sent it to, so that the
response from 192.168.1.1 could be returned to the correct requestor.
This “one to many” approach is more properly called “Port Address Translation” (PAT) or “Network
Address Port Translation” (NAPT), because the load balancing server appends a port number to
the public address, in order to differentiate the application servers and keep the conversations
straight. In a “one to one” approach, where each public address maps to exactly one private
address, it is truly “Network Address Translation” (NAT). (Note that the port-based approach is the
one commonly used in home networking, where the single public address in a DSL or cable modem
is shared by multiple private addresses, one for each home networking device.) In either case, the
load balancer re-writes the IP address of the application server in both directions, so that the
users see only the public address hp.com/pages and 16.1.1.1 (and cannot see the private
addresses), and the servers see only their private addresses (and are unaware of the translation
being done by the load balancers).
Figure 20 Network Address Translation
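As a rough illustration (mine, not any vendor's implementation), the following Python sketch shows the kind of translation table a PAT load balancer maintains, using the made-up addresses from this example:

import itertools

public_address = "16.1.1.1"
application_servers = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]
next_server = itertools.cycle(application_servers)
next_public_port = itertools.count(40000)
translation_table = {}     # public-side port -> (client address, private server)

def inbound(client_addr, request):
    server = next(next_server)
    port = next(next_public_port)
    translation_table[port] = (client_addr, server)
    return server, port, request       # the server only ever sees its private address

def outbound(port, response):
    client_addr, _server = translation_table.pop(port)
    return client_addr, (public_address, response)   # the client only ever sees the public address

server, port, _ = inbound("203.0.113.7", "GET /pages")
print("forwarded to", server, "tracked as port", port)
print("reply:", outbound(port, "200 OK"))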
This solves the multiple problems: there are only a few public addresses required (one per
service), and we can scale the number of application servers behind each service practically
infinitely.
AOL uses this design to serve web pages and their Instant Messaging software, and has a single
environment of over 1,000 application servers. Google has almost half a million servers in this
kind of structure.
The individual application servers are known only to the load balancing servers: their names
and/or IP addresses are not known anywhere else, since their functions are controlled strictly by
the load balancing servers. So if an application server fails, or you need to add a series of new
application servers, you simply place them on your network and add them to the routing tables in
the load balancing server, and they are now in production. The outside world only ever sees the
service offered by the load balancing server.
To advertise the load balancing services, you would use the standard DNS or WINS or Novell
Directory Services or Active Directory (or whatever you care to use) to create service names,
assign them IP addresses, and place these into your namespace database. This would be
HP.com/pages and HP.com/apps, as a completely made up example. The IP addresses for these
would also be in your directory server, such as 16.1.1.1 and 16.1.2.1.
Then, behind the scenes, you create the set of application servers. These get assigned IP
addresses, but no names, and the IP addresses would not be publicized in your directory server.
The only visible address is that of the service offered by the load balancing server itself.
In this way you can add new application servers, without having to change anything other than
the routing table in the load balancing server.
And when a system goes down, whether the outage is planned like system maintenance or
software upgrades, or unplanned, because systems do occasionally fail, the entire environment
keeps going. If you are serving web pages, the user simply clicks the “Refresh” button, and the
new request is sent to one of the other servers. If you are serving applications, an external
transaction monitor can re-start the transaction. In either case, the entire environment stays
operational.
As you can also see, you can have a mix of systems offering services, so you can choose the
application servers based on the best application for the job, and not be restricted in your choices
of environments.
In this diagram the load balancing servers are single points of failure. In real life you could have
multiple load balancing servers acting either in a failover or a load balancing capacity, depending
on which load balancing server you chose.
There are basically three different approaches to load balancers: networking, hardware and
software.
The traditional way of doing this was to let the Domain Name Server spread the load among the
different application servers. This was simple to implement, since the DNS server simply rotated
among a hard-coded list of IP addresses. It was also cheap, since you needed a DNS server
anyway. But there were some serious deficiencies to this scheme: all of the other network and
server components along the way tended to cache IP addresses, which defeated the round robin
scheme. Further, the DNS server had no way of knowing whether any given server was ready to
accept the next request, or whether it was overloaded, or whether it was even up. So requests
could get lost, and servers could get overloaded, and no one would know.
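A sketch of that rotation, with invented addresses, shows how little the DNS server actually knows about the servers whose addresses it hands out:

import itertools

a_records = itertools.cycle(["16.1.2.10", "16.1.2.11", "16.1.2.12"])

def resolve(service_name):
    return next(a_records)     # no idea whether that server is up, idle or overloaded

for _ in range(4):
    print(resolve("pages.hp.example"))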
To get around these problems, you can use a dedicated network box to act as a router to the
application servers. This has some serious advantages, including greater scalability, no CPU load
on the application servers for routing, and very simple management. Your network people will
find the management of the load balancing server to be very similar to the management of the
network routing tables they take care of today, and will be pleased over how simple it is to
manage. The one disadvantage to this approach is cost: these systems cost between $5,000 and
$50,000, depending on capacity. And you generally need to buy them in pairs, so you don’t have
any single points of failure.
There are two basic kinds of hardware load balancers: network switches and appliance balancers.
Network switches, such as the Alteon, are basically standard network switches which have
additional software running in the ASICs to handle the load balancing function. Appliance
balancers such as the Cisco or the F5, on the other hand, run the load balancing software as their
primary function, but don’t need the level of tuning or maintenance that servers need, since they
are network appliances. Either type can easily handle the bandwidth of major Internet sites: you
will find that your Internet connection will be more of a bottleneck than the load balancer.
The other choice is to use a software load balancer. This is basically a service to be run, and it
consumes CPU cycles, which means each of the application servers has fewer cycles to do the real
work that you bought them for. However, it does not require a separate load balancing server or
servers, so for small sites this might be cheaper.
All of the application servers must be on the same network segment and in the same IP subnet.
The network cards are set in promiscuous mode, whereby all NICs accept all packets on that
network segment. The load balancing service on each of the application servers then computes
which application server should handle this particular task, and the other servers ignore the
packet.
The Windows Network Load Balancer (NLB) is part of the Windows 2003 Advanced Server and
above. Microsoft purchased this code from Convoy. The Linux Virtual Server (LVS) project does this same job with Linux.
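As a rough sketch of that distributed filtering idea (my illustration, not Microsoft's or anyone else's actual algorithm), each member can apply the same deterministic hash to the incoming connection and accept the packet only if the hash points at itself:

import hashlib

cluster_members = ["app1", "app2", "app3", "app4"]

def owner(src_ip, src_port):
    key = f"{src_ip}:{src_port}".encode()
    index = int(hashlib.md5(key).hexdigest(), 16) % len(cluster_members)
    return cluster_members[index]

def should_accept(my_name, src_ip, src_port):
    return owner(src_ip, src_port) == my_name

print(owner("203.0.113.7", 51515))
print(should_accept("app2", "203.0.113.7", 51515))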
One of the things that you must be aware of is the different amount of information that each of
the load balancers has about the state of the application servers.
Network load balancers are basically network switches with some added software. As such, they
have only minimal information about the application servers. They do the best they can, using
keepalive timers, sending “ping” and measuring the response time, and choosing the optimal
application server via content routing. But these methodologies tend to measure network
performance as much or more than they measure server load. They are also subject to tuning of
the timeout values, where a busy server might be judged inoperable if the timeout value is too
low, and an inoperable server might continue to get requests sent to it if the timeout value is too
high. There are attempts to make this even more efficient by using Dynamic Feedback Protocol
(DFP) and other tools to become more aware of the actual load on each of the servers. But they
are still only network switches, and cannot understand the details of the servers at any given
moment. This is not to say that overall they cannot do a reasonable job, but just to make you
aware of some of the difficulties that they face in doing so.
Software load balancers tend to do a better job at measuring the actual load on each of the
servers, primarily because they are running full operating systems which are identical to the ones
on the application servers themselves. In most cases, they are simply one of the application
servers which is taking on the additional task of being the load balancing server.
There is no perfect solution: network load balancers scale far higher than software load balancers,
but tend to be less effective at measuring actual load. Software load balancers cost less, but tend
to not be as effective at delivering services because they are also doing the load balancing work.
But either solution can fully satisfy your business requirements, you simply have to keep all of the
factors in mind when choosing between them.
Another case of network aliases is the cluster network alias. In this case we have a service which is
being offered by a software cluster, with very close cooperation between all of the members of
the cluster. The service may be something like file or print services, a database service, or an
application service like SAP. The service may be offered on multiple members of the cluster (in a
shared everything cluster environment) or on only one member of the cluster (in a shared nothing
cluster). The network alias is bound to one of the NIC’s on one of the members of the cluster, and
accepts all incoming requests for that service. At that point the methods differ:
•
The Domain Name System (DNS) method offers a single network name, which is the alias
for the cluster or the service. A list of “A” records specifies the network addresses of the
actual systems in the cluster. The DNS server can “round robin” the requests among the
list of network addresses, offering simple load balancing. One or more of the systems in
the cluster dynamically update the list of “A” records in the DNS server to reflect system
availability and load. This offers load balancing and higher availability, as requests will
only be sent to systems that are running and able to do the work. TCP/IP Services for
OpenVMS offers the Load Broker which calculates the workload of all of the cluster
members, and the Metric Server to update the list of records in the DNS server.
•
If the address is not shared, it is bound to one of the network interface cards (NIC) on one
of the systems in the cluster. All connections go to that NIC until that system exits the
cluster, at which time the network address is bound to another NIC on another system in
the cluster. This offers failover only, with no load balancing, but it does allow a client to
connect to the cluster without needing to know which NIC on which system is available at
that moment. AIX HACMP, Linux LifeKeeper, MySQL Cluster, Oracle 9i/10g Real
Applications Cluster, HP-UX Serviceguard, SunCluster and Windows Microsoft Cluster
Service do not share the cluster alias.
•
If the address can be shared, the network address is still bound to one of the NICs on one
of the systems in the cluster. Again, all connections go to that NIC until that system exits
the cluster, at which time the network address is bound to another NIC on another
system in the cluster. However, in this case, when a request is received, the redirection
software on that system chooses the best system in the cluster to handle the request by
taking into account the services offered by each system and the load on the systems at
that moment, and then sends the request to the chosen system. This offers high
availability and dynamic load balancing, and is more efficient than constantly updating
the DNS server. NonStop Kernel and Tru64 UNIX TruCluster offer a shared cluster alias.
Channel Bonding
Even with the amazing advances in network performance, many applications exceed the capacity
of a single NIC or switch port. The solution is the same one that was used for decades when
applications ran out of processor performance: Symmetric Multi Processing (SMP), where we put
many processors together to act as a single high bandwidth processor.
This goes by many names: NIC teaming, port teaming, Ethernet trunking, Network Fault
Tolerance, network Redundant Array of Independent Networks (netRAIN), Auto Port Aggregation
(APA), etc. But they all mean the same thing: to put many NIC’s together into a single virtual NIC,
which increases the bandwidth available by the number of physical NIC’s that are bonded together.
This is officially known as Link Aggregation, covered by IEEE standard 802.3ad (which defines the
Link Aggregation Control Protocol, LACP).
The different operating systems handle this in one of two ways: active/active and active/passive.
Active/active means that all channels are used at all times, with different requests being sent out
and back on all channels, to achieve almost linear growth in network bandwidth. Failover is
extremely fast, in that the failed packet is simply re-sent out on one of the other active channels.
AIX offers EtherChannel, HP-UX offers Auto Port Aggregation (APA), Linux offers Channel
Bonding, and Solaris offers Nemo (liblaadm), all active/active. In addition, several networking
companies offer solutions, such as Broadcom Advanced Network Server Program for Linux and
Windows, Dell PowerConnect Switches for Windows, HP’s ProLiant Essentials Intelligent
Networking Pack for Linux and Windows, Intel Advanced Network Services with Multi-Vendor
Teaming for Linux and Windows, and Paired Network Interface from MultiNet (and TCPware) for
OpenVMS, all of which are also active/active.
Active/passive means that only one channel is used for all traffic, with all of the other channels
being available for failover but not in active use.
FailSAFE IP on OpenVMS and Tru64 UNIX, and LAN Failover on OpenVMS, are examples of
active/passive channel bonding.
MAC Spoofing
The Media Access Controller (MAC) is the fundamental base on which all other networking rests.
Every network device in the world has a unique MAC address. In a sense, all of the above
discussion, from DECnet to TCP/IP, from cluster alias to NAT, is a virtualization of the MAC
address, in that they are all methods used to determine the route to and from the specific NIC
across all other network devices. But I will focus this discussion on MAC spoofing, because
otherwise this paper would end up being much longer than it is now…
In most cases, you want to leave the MAC address as it is burned into your NIC, so that Address
Resolution Protocol (ARP) can successfully translate your TCP/IP address into the route needed to
get the packets to your NIC. But in some cases, you want to be able to change this address to
something else. Hackers would want to do this to pretend to be a valid network device on your
network. Some home networking setups require this because the MAC address of the primary PC
is registered with the ISP, and the cable or DSL modem needs to pretend to be that PC. But the
legitimate reason for doing so in your server environment is that it enables re-configuration of a
network very quickly and easily, by using a MAC address which is standing in for the actual MAC
address on the NIC. The physical NIC can then be replaced (which results in a new MAC address,
and which would ordinarily have forced a network change) without any network changes.
Several products do this today:

Cisco VFrame Data Center will use MAC spoofing to bind multiple channels into a single
channel, increasing throughput, and will use MAC spoofing to allow network devices to
be replaced without any changes to the network administration. Note that the IBM
Virtual Fabric Architecture is simply an OEM version of Cisco VFrame Server Fabric
Virtualization Software, using the Infiniband to LAN/SAN switches in the IBM blade
enclosure.

HP Virtual Connect has the c-Class blade enclosure present MAC addresses to the
network that are not the addresses on the NIC’s in the blades themselves. This way, when
blades get added or replaced, nothing has changed from the network point of view.
Note that this equally applies to storage virtualization in managing SAN World Wide ID’s
(WWID).

IBM Open Fabric has the IBM BladeCenter present MAC addresses to the network that are
managed by the Open Fabric Manager. Again, when blades get added or replaced or the
workload moves between blade enclosures, nothing changes from the network point of
view. Note that this equally applies to storage virtualization in managing SAN WWID’s.

Every hypervisor which has internal networking capability uses MAC spoofing to present
the virtual NIC as a physical NIC.
Figure 21 MAC Spoofing
Figure 21 shows how the network connections (MAC addresses 11 through 14) are spoofing the
MAC addresses 01 through 04, by re-writing the MAC addresses on the physical NIC’s. (I use a
short-hand form of MAC address, as the real ones are 12 digit hexadecimal numbers, and that
wouldn’t fit on the diagram). Originally, a server NIC with MAC ID 03 was in place, and had its
MAC address replaced by MAC ID 14. When server NIC MAC ID 03 needs to be replaced, a new
server NIC with MAC ID 04 is swapped in, but the connector will spoof MAC address 04 with MAC
address 14, and the rest of the network will not notice any change. If this scheme were not in
place, a new MAC address would need to be registered, associated with a TCP/IP address, reassigned to any VLAN’s that were in place, etc. This can result in work orders that take days or
weeks, just because a network card went bad or a server needed to be upgraded.
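Conceptually, the enclosure is just maintaining a mapping table. The following toy Python sketch (connector names and short-hand MAC IDs invented, matching the style of Figure 21) shows why replacing a NIC touches only one entry and nothing that the rest of the network can see:

presented_mac = {          # connector -> MAC address the network sees (stable)
    "connector-1": "MAC-11",
    "connector-2": "MAC-12",
    "connector-3": "MAC-13",
    "connector-4": "MAC-14",
}
physical_nic = {"connector-4": "MAC-03"}   # connector -> MAC burned into the installed NIC

def replace_nic(connector, new_burned_in_mac):
    # the blade's NIC is swapped; the network still sees presented_mac[connector]
    physical_nic[connector] = new_burned_in_mac

replace_nic("connector-4", "MAC-04")
print("network sees", presented_mac["connector-4"], "- physical NIC is", physical_nic["connector-4"])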
But the diagram also shows one of the downsides of that very migration. If server NIC with MAC
ID 04 was relocated to the connector with MAC ID 13 (which is in a different subnet than MAC ID
14), things become complicated. For example, the subnet in blue will likely have a different set of
DHCP addresses than the network area outside that subnet. So the server may be assigned a
TCP/IP address that is outside of the security zone that it needs to be in, preventing it from
accessing the servers it needs to access, and preventing legitimate users from accessing it.
This is of moderate concern for blade enclosures and most implementations of VFrame, but is a
large concern when doing failover between guest operating system instances across physical
machines on different subnets. And it is a very large concern when doing DR failover to a
different geography, with very different subnets. It is extremely easy to setup this type of failover
between virtual machines, but they need to be thoroughly tested to catch this type of error.
Storage Virtualization
Storage virtualization has many of the same attributes as network virtualization: it is primarily
concerned with communication to an outside agency instead of dividing up or aggregating the
hardware and software inside the server, a lot of the virtualization happens outside of the server,
and there are many more standards involved than in server virtualization, where the actual
implementation tends to be very specific to each vendor. But it has some major differences as
well: communication from the server tends to be to a very specific set of storage hardware (with
specific lists of what host bus adapter (HBA) is supported with what storage array), and there
are fewer standards involved and they tend to be much looser than in the networking world
(compare some of the Redundant Array of Independent (or Inexpensive, depending if you are a
storage buyer or vendor) Disks (RAID) definitions with something like 802.1q).
Storage virtualization, like all of the other virtualization technologies that we have been
discussing, has been around for literally decades. We may have started off in the 1960’s with disk
managers as people who kept maps of washing machine sized hard disks on a wall, and who
handed out storage based on the physical tracks and sectors of those disks (which the user was
responsible for keeping straight, with very little checking to see if they accidentally wandered into
the next users section of disk), but file systems such as the UNIX File System (UFS, which was
actually invented in 1969 by Ken Thompson in order to let him write Space Travel on a PDP-7)
came into prominence in the early 1970’s, with volume managers following about 10 years later.
This was the earliest form of storage virtualization, but it was soon followed by Digital Equipment
Corporation with OpenVMS Volume Shadowing and IBM mirroring. At the same time, advanced
RAID algorithms were invented at U.C. Berkeley, which started an industry around RAID storage
systems. These were highly in vogue through the 1990s both in operating systems and storage
controllers.
During the ‘90s, the advances continued. The development of Fibre Channel enabled fabric-based
virtualization, and the continuing advances in microprocessors meant that intelligent switches
which could incorporate virtualization became a reality.
Importantly, the way virtualization has been talked about publicly has changed radically over time.
We were using virtualization techniques, but (as it was equally in the server world) the word
wasn’t mentioned at all. And so, rather than coin a lot of new terms, we stuck with the names of
technologies. Hence, mirroring, or RAID-5.
Eventually the excitement about “virtualization” spread to the storage world. A lot of industry
messaging became focused on virtualization as being the desired end goal. DEC and HP led this,
and others soon hopped on board. Suddenly storage virtualization was the buzz of the industry.
Note that the following discussion focuses on hard disk storage. Tape and optical storage
virtualization is a topic that I may add in a future major release of this paper.
The first thing we have to deal with is where the virtualization is taking place. It can exist either in
the server software (whether that is the operating system or add-on products), the storage
controller (whether that is an HBA in a PCI slot in the server or embedded in a standalone storage
array) or as part of a device offering management of the entities of the Storage Area Network
(SAN), such as switches, controllers and servers offering file services to other devices on the
network. And it can exist at multiple levels simultaneously, with an operating system
implementing mirroring between storage systems which themselves have implemented
redundancy and which are chaining multiple tiers of storage together transparently using multiply
redundant paths.
We can evaluate these different levels on four criteria:

Management: How is the storage managed? What software tools are used to do the
management? What is the overhead involved in performing this management? What
complexities does storage virtualization add to the base management?

Pooling: How can we combine the various levels of storage? How can the various vendor
products be combined? What is the overhead in doing this aggregation?

Data movement: How easily is data migrated between storage elements, including
elements from different vendors? What type of tiering is available?

Asset management: What is the span of control of those tools? How can performance
and capacity be distributed among the storage elements, including elements from
different vendors?
The first level is at the server itself, including servers which offer file services such as NAS devices.

Management: All of the major operating systems and their clustering products have some
storage virtualization included (primarily RAID software), but there are many 3rd party
tools from storage vendors available to augment these native capabilities. And frequently
these are very easy to manage, primarily because the system administrators are already
trained and comfortable with the operating system’s way of doing things. But all of the
work involved in doing this virtualization (such as issuing the 2 or more I/O requests for
each write to a RAID-1 volume, or performing the data copy involved in a clone) takes
place in the host, detracting from the amount of processing power and memory available
to do the real work for which you purchased the server.

Pooling: In a perfect world, the servers can bring together many different subsystems
from many different vendors. In this world it is somewhat less perfect than that, but it is
often possible to pool together multiple types of storage from multiple vendors. This
gives the servers the largest possible span of control and ability to pool storage. But it is
often complex to do this, and sometimes requires additional software to manage the
disparate storage subsystems. Further, just as in management, this aggregation will take
processing power and memory away from doing the work that only the server can do.

Data movement: Just as in pooling, servers can more easily migrate data between storage
systems than any other level. But again, just as in pooling, this takes server resources.

Asset management: Servers can see every piece of storage which they have mounted, and
as pointed out above, this can be a very large group. But because most operating
systems do not have the ability to mount the same storage system on multiple servers
simultaneously (especially between operating systems), the servers can only manage
those storage devices that they have mounted, which is a very small fraction of the total
storage environment. Further, servers often do not have the ability to see behind the
Logical Unit Number (LUN) which is presented by the storage array, such that they cannot
manage the individual items in the storage array (disk drives, tape drives, storage
controllers) or the infrastructure itself (switches, chained storage, etc).
The next level is the storage array.

Management: Each storage array has a specific set of tools which are tailored to that
particular vendor’s product family (where a family might be an HP EVA or an EMC
CLARiiON). As such, each does a good job of understanding the details of that family.
However, these tools do not extend outside the family into even the other storage arrays
of that particular vendor, never mind the storage arrays from another vendor.

Pooling: All storage arrays have the ability to group multiple disks into RAID arrays. Most
storage arrays even have the ability to group multiple RAID arrays into large groups, to
achieve superb throughput and bandwidth. Further, many storage arrays have the ability
to “chain”, where a storage array can connect to a second storage array, and present its
LUNs transparently to the servers as if it were a single large storage array. And this is
supported even across storage arrays from multiple vendors. So pooling is a strength of
storage array virtualization. But in the same way that servers can only manage those
volumes that are mounted on the server itself, storage arrays can only pool those arrays
that are physically connected to it. As such, the span of control is fairly small when
compared to the entire storage environment.

Data movement: Many storage arrays have the ability to perform snapshots or business
copies, that is, to instantly make a copy of an entire group of storage for other uses such
as backup or software testing. Some storage arrays have the ability to replicate data
between storage arrays, using software such as EMC Symmetrix Data Replication Facility
(SRDF) or HP Continuous Access (CA). But even these capabilities are extremely limited in
their ability to migrate data between disparate storage arrays, and I have not found any
storage array virtualization software which will replicate data between storage arrays from
different vendors.

Asset management: Storage array virtualization does a good job in managing the array
itself, but this is extremely limited. Most of the drawbacks to storage-based virtualization
are simply the result of “span”, ie storage-based is typically limited in scope to a single
storage subsystem, whereas network- and server-based approaches can encompass many
storage subsystems, even systems from different vendors.
The final level is the SAN itself.

Management: Managing the overall SAN is often easier than managing individual
components, as there tend to be fewer of them (a few SAN switches rather than tens or
hundreds of LUNs), and you get to take a higher level of the environment rather than
focusing on the details of specific storage arrays. Further, you get to look at overall traffic
flow rather than focusing on specific disk devices on specific servers. On the other hand,
a heterogeneous SAN means multiple management tools, which often conflict with each
other, leading to a poor overall view of the environment.

Pooling: The SAN is the ultimate span of control, as it covers every server, every switch
and every storage array. Pooling is easy at the SAN level among storage arrays and
switches of the same type, but is often extremely difficult to accomplish between
equipment of different types and especially between equipment from different vendors.

Data movement: Again, this is very easy to do among equipment of the same type, but
very difficult to do between equipment from different vendors.

Asset management: The equipment can be tracked and viewed as a whole, but it is often
difficult to track problems because of the lack of view inside the storage array.
RAID
Beyond the virtualization of the file system and the volume manager (which are such an implicit
part of the way we work that we don’t even consider them virtualization, even though they clearly
fit even the loosest definition), RAID is the fundamental storage virtualization technology.
At its core, RAID is a way of optimizing the cost per usable GByte, the performance, or the
availability of the storage. The definitions are from the Storage Networking Industry Association
(SNIA), but to me the simplest way to think about it is the old project manager’s mantra of
“cheap, fast, good: pick any two”. I will give the SNIA definition first, and then add my comments.
RAID 0: SNIA’s definition is “A disk array data mapping technique in which fixed-length sequences
of virtual disk data addresses are mapped to sequences of member disk addresses in a regular
rotating pattern. Disk striping is commonly called RAID Level 0 or RAID 0 because of its similarity
to common RAID data mapping techniques. It includes no data protection, however, so strictly
speaking, the appellation RAID is a misnomer.” This optimizes cost per usable GByte, because all
of the space on the physical disks is usable: if you have a stripeset of three 73GByte disks, you get
73+73+73=219 usable GBytes of storage. It also optimizes performance, because you can
execute parallel reads and writes on all of the members of the stripeset, so both total bandwidth
and latency are improved. But it offers no data protection at all, in that if any disk fails, the data on
the entire stripeset is lost and must be restored. So it is cheap and fast, but not good.
RAID 1: SNIA’s definition is “A form of storage array in which two or more identical copies of data
are maintained on separate media. Also known as RAID Level 1, disk shadowing, real-time copy,
and t1 copy.” I would add “mirroring” to this definition. This optimizes read performance,
because you can execute parallel reads on all of the members of the mirrorset, so both bandwidth
and latency are improved. It does not help write performance, because all disks in the mirrorset
must be written at the same time. It also optimizes availability, because on an ‘n’ member
mirrorset, n – 1 disks can fail and you still have full access to all of the data. But it is the most
expensive of the RAID choices, in that no matter how many disks you have, you have only the
usable storage of the smallest disk in the mirrorset: if you have a mirrorset of three 73GByte disks,
you get 73 usable GBytes of storage. So it is fast and good, but not cheap.
RAID 2: SNIA’s definition is “A form of RAID in which a Hamming code computed on stripes of
data on some of a RAID array's disks is stored on the remaining disks and serves as check data.”
This is the first of the RAID choices which employs parity to be able to re-compute the data if it is
lost, rather than keeping an extra copy of the data, as RAID 1 does. This parity data takes up
somewhat less space than the original data, yielding approximately 60% of the usable space of
the RAIDset (rather than the 50% of a two member mirrorset): if you have a RAID 2 set of three
73GByte disks, you get approximately (73*3*0.60)=131 usable GBytes of storage. The data is
striped across the disks at the bit level, which is different than the other RAID definitions. The
Hamming check data is stored on the dedicated check disks. RAID 2 does not help read or write performance,
because every read must be distributed among all of the disks, and every write must write both
the data on one disk and the parity information on another disk. But it does offer decent
availability, because one disk can fail and you still have full access to all of the data (though some
of it will be re-computed from the parity instead of being read directly). So it is not very cheap,
not very fast, but good. For these reasons, RAID 2 is very seldom used.
RAID 3: SNIA’s definition is “A form of parity RAID in which all disks are assumed to be
rotationally synchronized, and in which the data stripe size is no larger than the exported block
size.” RAID 3 also uses parity, but stripes the data across the disks in bytes (rather than bits as in
RAID 2) and isolates the parity to one of the disks in the RAIDset (rather than using multiple check
disks as in RAID 2). This is slightly better in space than RAID 2, and gets better as the
number of disks in the RAIDset goes up, as you lose only one entire disk to the parity information:
a RAID 3 set of ‘n’ 73GByte disks gives you (n – 1) disks’ worth of usable space. If you have a RAID 3
set of three 73GByte disks, you get 73*(3-1)=146 usable GBytes (67%). If you have a RAID 3 set of
4 73GByte disks, you again get (n – 1) disks’ worth, for
73*(4-1) = 219 usable GBytes (75%). Performance is about the same as a single disk drive: since the
data is spread out among all of the disks, both reads and writes must access all of the disks for
operations of any significant size, and the disks are rotationally synchronized, so independent
reads of different areas on different disks are not possible. Further, all writes must additionally update the parity disk.
it does offer decent availability, because any one disk (either data or parity) can fail and you still
have full access to all of the data (though some of it will be re-computed from the parity instead
of being read directly, if a data disk fails). So it is cheap and good.
RAID 4: SNIA’s definition is “A form of parity RAID in which the disks operate independently, the
data stripe size is no smaller than the exported block size, and all parity check data is stored on
one disk.” This continues the growth we have seen from RAID 2 to RAID 3, in that the size of the
data on each of the data disks is growing, with whole blocks being distributed among the data
disks. Just like in RAID 3, there is a dedicated parity disk. Just like RAID 3, you get better space
utilization as you increase the number of disks in the RAID 4 set. Read performance is improved,
as there is now enough data on each disk to be useful, such that multiple disks can be read
simultaneously with different data areas on each disk. But write performance is still limited to a
single disk’s performance, as every write must update the parity disk. RAID 4 offers decent
availability, because any one disk (either data or parity) can fail and you still have full access to all
of the data (though some of it will be re-computed from the parity instead of being read directly,
if a data disk fails). So it is cheap and good.
RAID 5: SNIA’s definition is “A form of parity RAID in which the disks operate independently, the
data stripe size is no smaller than the exported block size, and parity check data is distributed
across the RAID array’s disks.” This is one of the more common RAID levels, as it takes the
block-level striping of RAID 4 and distributes the parity across all of the disks rather than isolating
it on a single parity disk, to optimize
across all three factors. Just like in RAID 4, you get better space utilization as you increase the
number of disks in a RAID 5 set. Read performance is the sum of the disks performance, and
reading different areas from different disks to achieve parallelization is normal. Write
performance is also improved as the parity is distributed among multiple disks. RAID 5 offers
decent availability, because any one disk (either data or parity) can fail and you still have full
access to all of the data (though some of it will be re-computed from the parity instead of being
read directly, if a data disk fails). So it is fairly cheap, pretty fast and reasonably good.
RAID 6: SNIA’s definition is “Any form of RAID that can continue to execute read and write
requests to all of the RAID array’s virtual disks in the presence of any two concurrent disk failures.
Several methods, including dual check data computations (parity and Reed Solomon), orthogonal
dual parity check data and diagonal parity, have been used to implement RAID Level 6.” Just like
in RAID 5, you get better space utilization as you add disks, but you lose the equivalent of two
disk drives' worth of space to the additional parity information, so it is not as space efficient as
RAID 5. Read and write performance is similar to RAID 5. One of the difficulties of RAID 5 is that
it cannot survive the loss of two disks. RAID 6 is just like RAID 5 except that it adds a second,
independently computed set of parity data so that it can survive a second disk failure. So RAID 6 is fast and good.
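To make the space trade-offs above concrete, the arithmetic can be captured in a few lines. The following is a minimal sketch (Python, with illustrative 73GByte disks); it only models the whole-disk mirror and parity overhead discussed here, not chunk sizes or controller behavior:

    def usable_gb(level, disks, disk_gb=73):
        """Rough usable capacity for common RAID levels (whole-disk parity/mirror overhead only)."""
        if level == 0:
            return disks * disk_gb            # striping: no redundancy
        if level == 1:
            return disk_gb                    # n-way mirror: one disk's worth of data
        if level in (3, 4, 5):
            return (disks - 1) * disk_gb      # one disk's worth lost to parity
        if level == 6:
            return (disks - 2) * disk_gb      # two disks' worth lost to parity
        if level == 10:
            return (disks // 2) * disk_gb     # mirrored pairs, then striped
        raise ValueError("unsupported level")

    for level, disks in [(5, 4), (6, 8), (10, 8)]:
        cap = usable_gb(level, disks)
        print(f"RAID {level} with {disks} x 73GB disks: {cap} GB usable ({cap / (disks * 73):.0%})")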
Storage requirements have grown, and there are moderate limits to the number of disks that can
be combined into a single RAIDset. People wanted to have a smaller number of much larger
volumes, but not give up the high availability of some of the above RAID levels. To solve this
problem, people started using multi-level RAID: one of the above RAID levels is used to create
multiple RAID volumes, but then those multiple volumes are striped together into a single larger
volume which is easier to manage. The nomenclature is (RAID level) + (RAID level, almost always
0), such as 1+0, but it is often written without the “+” such as 10.
RAID 10: This uses RAID 0 (striping) on multiple RAID 1 (mirroring) volumes. It is otherwise
identical to RAID 1, in that it is fast and good, but not cheap.
RAID 50: This uses RAID 0 (striping) on multiple RAID 5 volumes. It is otherwise identical to RAID
5, in that it is fairly cheap, pretty fast and reasonably good.
RAID 60: This uses RAID 0 (striping) on multiple RAID 6 volumes. It is otherwise identical to RAID
6, in that it is fast and good but not cheap.
One thing is critical in this technique: use RAID 1 or 5 or 6 first, then RAID 0. Remember that
RAID 0 offers no availability whatsoever, so if one of the disks in a RAID 0 set fails, the entire RAID
0 set is unavailable. This will frequently put your entire RAIDset at risk.
For example, assume that you have four disks (A, B, C and D) and you want to mirror and stripe.
But you choose to stripe first and then mirror (effectively RAID 0+1). So you stripe A and B, then
stripe C and D, and then mirror the resulting two volumes. Now A goes bad. This takes the entire
stripeset A+B offline, and the data on C+D is now running completely unprotected. B is fine, but
completely unusable. Any failure of either C or D makes your data completely unavailable.
Contrast this with the case where you have the same four disks, and you mirror then stripe. So
you mirror A and B, then mirror C and D, and stripe the resulting two volumes. Now A goes bad.
B still has a complete copy of the data, so you are still fully operational. Now C goes bad. But D
still has a complete copy of the data, so you are still fully operational. This shows that you can
lose up to half of the volumes in your RAID 10 set simultaneously, and you are still operational.
RAID 50 and 60 work similarly.
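The difference is easy to enumerate. The sketch below (Python, purely illustrative) walks every two-disk failure of the four-disk example above and reports which layout still has all of its data:

    from itertools import combinations

    DISKS = ["A", "B", "C", "D"]

    def survives_1_plus_0(failed):
        # RAID 10: mirror A+B and C+D, then stripe the two mirrors.
        # Data survives while each mirror pair still has at least one healthy disk.
        return not {"A", "B"} <= failed and not {"C", "D"} <= failed

    def survives_0_plus_1(failed):
        # RAID 0+1: stripe A+B and C+D, then mirror the two stripesets.
        # A stripeset dies with any one member; data survives while at least one stripeset is intact.
        return not ({"A", "B"} & failed) or not ({"C", "D"} & failed)

    for pair in combinations(DISKS, 2):
        failed = set(pair)
        r10 = "survives" if survives_1_plus_0(failed) else "FAILS"
        r01 = "survives" if survives_0_plus_1(failed) else "FAILS"
        print(f"lose {pair}: mirror-then-stripe {r10}, stripe-then-mirror {r01}")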
Most vendors offer multiple host bus adapters (HBA's) which support some level of RAID for
their directly connected storage (usually SCSI, now moving to Serial Attached SCSI (SAS)). I
originally attempted to classify them, but there were just too many, and they were too specific to
a server and an operating system, so I didn't think it was worth it. Please check the vendor's
website or your local Sales Representative to find the details of these cards.
Figure 22 describes the RAID support in the major storage arrays available today. The number in
parentheses after each RAID level is the number of disks that can be in that RAIDset.
Vendor | RAID Levels
Dell | EqualLogic PS5000x: 5 (8), 10 (8)
EMC | CLARiiON CX3: 0 (16), 1 (2), 3 (9), 5 (16), 6 (16), 10 (16)
EMC | Symmetrix DMX3, DMX4: 0 (8), 1 (2), 5 (8), 6 (16), 10 (4)
HP | MSA: 1 (2), 5 (14), 10 (56), ADG (6) (56)
HP | EVA (VRAID): 0 (168), 1 (168), 5 (168), 10 (168)
HP | XP: 1 (8), 5 (32), 6 (8)
Hitachi | Simple Modular Storage: 6 (12)
Hitachi | Workgroup Modular Storage: 0 (3), 1 (2), 5 (16), 6 (8), 10 (6)
Hitachi | Adaptive Modular Storage: 0 (3), 1 (2), 5 (16), 6 (8), 10 (6)
Hitachi | Universal Storage Platform: 1 (8), 5 (8), 6 (8)
IBM | DS3xxx, DS4xxx: 0 (12), 1 (2), 3 (12), 5 (12), 10 (12)
IBM | DS6xxx, DS8xxx: 5 (8), 10 (8)
IBM | Enterprise Storage Server F10, F20: 5 (256), 10 (256)
Sun | StorageTek 2xxx, 6xxx: 0 (30), 1 (2), 3 (30), 5 (30), 10 (30)
Sun | StorageTek 99xxV: 1 (2), 5 (30), 6 (30), 10 (30)
Sun | StorEdge: 0 (8), 1 (2), 3 (8), 5 (8), 10 (8)
Figure 22 Storage Array RAID Support
IBM enhances the normal RAID levels by adding an “E” to the ordinary RAID levels: 1E, 1E0, 5E and
5EE. Each “E” represents an extra disk of the set offering additional redundancy, so 1E means a
mirrorset with a spare disk, while 5EE means a RAID 5 set with two spare disks. These will
automatically be activated in the event of a disk failure, with the data being reconstructed
(whether copied in the case of mirroring or re-computed in the case of RAID 5) onto the newly
activated disk.
The IBM DS4xxx and the Sun StorageTek 6xxx are the same unit, both OEM’ed from Engenio.
The IBM DS8xxx arrays are driven by Power5+ systems, which abstract the underlying disks into
logical volumes which are presented as LUNs. Each has two independent Power5+ systems, each
with a pair of FC-AL and FC switches, cross-connected to each Power5+ system.
FAStT is the old name for the low end DS4xxx series.
Sun SANtricity on the StorageTek line allows Dynamic RAID Migration, ie, conversion from one
RAID level to another with no re-formatting, though it does require lots of free space.
One of the other topics that you need to understand is chunk size. Note that chunk size is
different than stripe width, which is the number of drives in the stripeset. Chunk size is the
smallest block of data which can be read or written at a time on each of the disks in a RAID 0
(when it is known as stripe size), 2, 3, 4, 5 and 6 RAIDset. For RAID 2 (bit striping) the chunk size is
1 bit, and for RAID 3 (byte striping), the chunk size is 1 byte. RAID 0, 4, 5, and 6 let you specify
various stripe sizes, in order to optimize performance for your particular workload.
And that is the key, because workloads vary. A large chunk size means that there is a large
amount of any given file on each disk, where a small chunk size means that there is a small
amount of any given file on each disk such that it may take many chunks to store the entire file.
And a large chunk size means that you will always read a large amount of data from each disk
(whether or not you will use all of that data), while a smaller chunk size means more parallel
operations as you fetch data from all of the disks in your RAIDset.
As you can see, the most important thing is the size of each I/O. Large chunk sizes mean that
the I/O subsystem can break a huge I/O into a fairly small number of I/O's (in a perfect world, this
would be the number of drives in the array) and execute them in parallel. Small chunk sizes
mean that the I/O subsystem can execute multiple smaller independent I/O's in parallel, with
each disk potentially being the source of multiple of these I/O’s.
HP-UX chunk sizes are up to 262Kbyte, while Windows NTFS default is 4Kbytes. Oracle
recommends a 1MByte chunk size for ASM for the database (to match the MAXIO database buffer
size), but offers 1, 2, 4, 8, 16, 32 and 64MByte chunk sizes to help out sequential reads/writes for
the non database files. Note that Sun calls a partition a “slice”, and stores special information in
slice 0.
The only thing you can do for chunk sizes is measure your actual I/O size over time, and optimize
for both bandwidth and latency. Too large a chunk size means you may not be using all of your
disk drives in parallel, and may be throwing away data that you read because your data doesn’t fill
up a chunk. But too small a chunk size means that you are reading multiple chunks from the
same disks, and therefore are serializing your I/O’s. The best thing is to accept the
recommendations of your storage and database vendor for the initial implementation, and then
measure your disk queue depth, to see if you are waiting on many I/O’s to complete on each disk.
If so, then increase the chunk size and re-measure. But modifying your chunk size is often a
major operation, involving re-initializing your volumes and restoring the data, so it is not to be
done casually. But chunk size can have a measurable effect on performance.
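As a rough illustration of that trade-off, the sketch below (Python, illustrative numbers only) estimates how many data disks a single I/O touches for a few chunk sizes; the real behavior of any given array or volume manager will differ:

    def disks_touched(io_bytes, chunk_bytes, disks):
        """How many data disks one I/O of io_bytes touches, at most, in a striped set."""
        chunks = -(-io_bytes // chunk_bytes)          # ceiling division
        return min(chunks, disks)

    # Illustrative only: a 1 MB read against an 8-disk stripeset
    for chunk_kb in (4, 64, 256, 1024):
        n = disks_touched(1024 * 1024, chunk_kb * 1024, disks=8)
        print(f"{chunk_kb:>5} KB chunks: 1 MB read spans {n} disk(s)")

Small chunks spread the one large read across every spindle for bandwidth; large chunks keep it on one disk, leaving the other spindles free for concurrent, independent I/O's.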
ASM/EVA Virtualization
Almost every storage array in the industry today works in a similar way: bind individual disk drives
into a RAIDset, potentially bind those RAIDsets together into a multi-level RAIDset (RAID 10, RAID
50, etc), and then present the result as a Logical Unit Number (LUN), ie, as a device to be mounted
by a server. And this works very well, with quite good throughput, because the LUNs are assisted
with a large amount of cache and very sophisticated controllers. But this is not the only way to do
things.
HP’s Enterprise Virtual Array (EVA) storage array binds large groups of disks into a single volume
group, and then partitions the single large storage pool into individual volumes, which can have
their own RAID level. So, for example, 160 73GByte disks (with a total usable space of
11.7TBytes, leaving 8 disks for sparing) might be partitioned into a 1TByte RAID 1 (consuming
2TBytes of actual space) volume and a 1TByte RAID 5 (consuming 1.2TBytes of actual space)
volume, leaving 8.5TBytes of free space. But the difference between the EVA and every other
storage array on the market is that the actual space is striped across every one of the 160 active
disks, with automatic load balancing and re-distribution of the actual data blocks across all of
those physical disks. If a disk fails, all of the information on that disk is reconstituted from the
redundant information for the RAIDset (not in the case of RAID 0, of course), exactly as if they
were physical disks. If new disks are added to the volume group, the data is normalized across all
of the disks in the resulting volume group, and the same is true if disks are removed from the
volume group.
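The space accounting in that example works out as follows; this is simply the arithmetic restated (Python), using the consumption figures quoted above:

    # Rough space accounting for the EVA example above: VRAID1 stores the data twice,
    # and the 1 TByte VRAID5 volume is quoted above as consuming about 1.2 TBytes.
    total_tb = 160 * 73 / 1000        # ~11.7 TB of raw space in the disk group
    raid1_tb = 1 * 2                  # 1 TB of data, mirrored
    raid5_tb = 1.2                    # 1 TB of data plus parity, per the figure above
    print(f"raw {total_tb:.1f} TB, consumed {raid1_tb + raid5_tb:.2f} TB, "
          f"free ~{total_tb - raid1_tb - raid5_tb:.1f} TB")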
Oracle’s Automatic Storage Manager (ASM) does the same thing to the volumes that it manages,
though in this case these are LUN’s presented by another storage array. ASM provides logical
mirroring at the file level, not physical mirroring at the block level. Disks can be added to the
“disk group” and are automatically included in the environment, with files and I/O quickly load
balanced to the new space.
Some of the storage arrays, such as the IBM DS3 and DS4 series, as well as some NAS devices,
such as the IBM N series and the Sun NAS 5300 series, do this using standard file systems and
volume management, but they do not have the same level of dynamic data migration and load
balancing, nor do they do it to the level of the number of drives in the HP EVA.
Disk Partitioning, Volume Management and File Systems
Once all of LUN’s are presented to an operating system, whether they are individual disk volumes
or RAIDsets, they can be managed by the operating system and 3rd party tools.
OpenVMS deals strictly with the individual LUN’s as if they were disk volumes, and does not
group them together except with HP RAID Software for OpenVMS for partitioning.
Windows also deals only with individual LUN’s, but then partitions them using the Disk
Administrator plug-in, and offers a “Volume Set” which is strictly a RAID 0 concatenation of the
LUN’s into a single larger volume.
For UNIX systems (AIX, HP-UX, Linux, Solaris and Tru64 UNIX), as well as other software such as
Symantec Veritas and Oracle, this is done with a volume manager. Volume managers take the
individual LUN’s and bind them together into larger entities called volume groups (VG’s). The VG
is then treated as a standard LUN, which can be mounted and partitioned in any way needed.
One advantage of VG’s over physical disks is that they can be expanded and contracted as
needed, which physical disks cannot do. However, they do not do the same kind of automated
block-based and file-based load balancing that is done in the EVA and ASM, respectively.
As the USA and England are often described as two countries separated by a common language,
so are the various tools which manage volume groups: they all use some common terms and
some unique terms to describe the same concept. Figure 23 describes the terminology of the
various volume managers.
Tool | LUN | Partition of a disk | Volume | Volume Group | Mirror
AIX LVM | Physical Volume | Physical Partition | Logical Volume | Volume Group | Mirror Copy
HP-UX LVM | Physical Volume | Extent | Logical Volume | Volume Group | Mirror
Linux DRBD | Physical Volume | Extent | Logical Volume | Volume Group | Mirror
Solaris SVM | Drive | Slice | Volume | Disk Group | Disk Mirror
Tru64 LSM | Volume | Subdisk | Volume | Disk Group | Mirror
Veritas VxVM | Disk Media | Subdisk | Volume | Disk Group | Plex
Figure 23 Volume Group Management Terminology
All of the volume managers do some level of RAID. Figure 24 describes the operating system
support for the various RAID levels, and the number of members which are supported for
mirrorsets.
Host Environment | RAID 0 | RAID 1 (mirror members) | RAID 5 | Software
AIX | Yes | 3 | - | Logical Volume Manager
HP-UX | Yes | 3 | - | Logical Volume Manager, MirrorDisk/UX
Linux | - | 2 | - | Distributed Replicated Block Device
NonStop Kernel | - | 2 | - | RAID-1, Process Pairs
OpenVMS | Yes | 3 | Yes | RAID Software for OpenVMS, Host Based Volume Shadowing
Oracle 10g | Yes | 3 | - | Automatic Storage Management
Solaris | Yes | 3 | Yes | Solaris Volume Manager
Tru64 UNIX | Yes | 32 | Yes | Logical Storage Manager
Veritas | Yes | 3 | Yes | Veritas Volume Manager (VxVM)
Windows 2003 | Yes | 2 | Yes | Disk Administrator Snap-in
Figure 24 Operating System RAID Support
The AIX LVM does not work with physical disks, but must convert them to logical volumes before
RAID can be applied.
The Oracle ASM automatically distributes 1MByte chunks for the database and 128KByte chunks
for the control files and redo logs across all available volumes. Oracle ASM 10g only reads from
the primary mirror, not the mirror copy unless a failure occurs in the primary. 11g adds the ability
to specify which copy is read from, but it is manual and not load-balanced: it is for extended RAC
environments where one volume is closer to the server than the other volume.
Solaris Volume Manager offers both RAID-0 striping and concatenation, but calls them both RAID 0.
Most vendors are moving to follow the industry standards Fabric Device Management Interface
(FDMI), Storage Management Initiative Specification (SMI-S) and Storage Networking Industry
Association (SNIA) HBA API.
Multi Pathing
RAID only protects the disk device itself, but it is a long way from the server to the disk, and we
need to protect the entire path. We do this with redundant components and multi-pathing. Note
that the following discussion focuses mostly on FibreChannel and iSCSI, as it is more difficult to
implement (and often less critical) for direct attached storage with SCSI and SAS.
In a perfect world, we would have redundant HBA’s to redundant network switches to redundant
storage controllers. In this way there would be no single point of failure (SPoF) in the storage
environment (though the server would be an SPoF, that is outside the scope of this section).
But when you have the above setup, you end up with many potential paths to a specific LUN, and
the LUN will often show up with different names depending on the path chosen. You need to
carefully trace out the path from the server to the LUN to ensure that you are not setting up
alternate paths which share an SPoF. Most of the tools listed below are aware of this problem,
and will assist you in disambiguating the paths and the devices along that path, and you should
use these tools carefully to not be surprised by a failure during a critical production run.
Another factor to consider is whether all of the paths are in use at the same time (active/active) or
if there is a primary active path and all of the other paths are only there for transparent failover
(active/passive). If they are active/active, what kind of load balancing is used among the paths?
Some tools offer dynamic measurement of the latency and load of each path to optimize the
performance of all of the paths, while others offer only simple “round-robin” scheduling
regardless of the load on each path.
Then you need to evaluate whether the tool automatically reconfigures the paths in the event of a
failure. As discussed above, several of the paths to a device often share a common component, so
you need to consider whether the tool is aware that a single device failure (such as an HBA or a
switch) will invalidate a whole series of paths, and whether it will re-attempt the I/O operation
only on the healthy paths.
One subtle difference between the tools is whether they treat each path as an individual object
and then load balance or failover between all of the paths, or whether they bundle the paths
together into groups, with active/active load balancing inside a group but active/passive failover
between groups.
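The path-group approach can be pictured with a small data structure. The following is a toy sketch (Python, with hypothetical path names); it is not how any particular multi-pathing product is implemented:

    import itertools

    class PathGroups:
        """Toy model: round-robin inside the highest-priority healthy group,
        fail over to the next group only when every path in the current group is down."""
        def __init__(self, groups):
            self.groups = groups                      # lists of path names, in priority order
            self.failed = set()
            self.rr = itertools.count()

        def pick_path(self):
            for group in self.groups:
                healthy = [p for p in group if p not in self.failed]
                if healthy:                           # active/active within this group
                    return healthy[next(self.rr) % len(healthy)]
            raise IOError("no healthy path to the LUN")

    mp = PathGroups([["hba0:switchA", "hba1:switchB"],   # preferred group
                     ["hba0:switchB", "hba1:switchA"]])  # standby group
    print(mp.pick_path()); print(mp.pick_path())
    mp.failed.update({"hba0:switchA", "hba1:switchB"})   # whole preferred group lost
    print(mp.pick_path())                                # falls over to the standby group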
Finally, a note on micro-partitioning and multi-pathing. Most multi-pathing tools are not
supported in a micro-partitioned environment (ie, in the guest operating system instance). This is
because the I/O paths are virtualized, and many of the virtual HBA’s may be using a single
physical HBA, with multiple guest instances using a single physical HBA, in order to optimize the
server consolidation cost savings in cabling and switch ports. Since the guest O/S instance is
unaware of these consolidations, implementing multi-pathing in the guest instance may be
(probably is) completely useless. You need to implement the multi-pathing in the hypervisor in
order to protect all of the guest instances equally, even those instances running operating
systems which do not support sophisticated multi-pathing tools.
Figure 25 and 26 list the tools available on each operating system for multi-pathing. Note that
many of these tools come from the operating system (LVM on HP-UX and device mapper on
Linux, for example) and support many different storage arrays from multiple vendors. But more
of the tools come from the storage vendors, in which case it is common for the tool to run on
multiple operating systems but only on that vendor’s specific set of storage arrays.
Tool Name | O/S Support | Features
Subsystem Device Driver | AIX | active/active, load based scheduling, auto failback
pvlinks, LVM, SLVM | HP-UX | active/active, load based scheduling, auto failback
Device Mapper | Linux | active/active in a group, active/passive between groups, round robin with priority in a group, auto failback
OpenVMS I/O | OpenVMS | active/passive, manual path balancing, auto failback
MPxIO | Solaris | active/active, round robin, auto failback
Multi-Path I/O | Windows | Technology API, implemented by vendors such as HP MPIO for EVA and SecurePath
Figure 25 Operating System Multi-Pathing Tools
IBM SDD supports 16 paths, and is included in the DS8xxx series at no additional charge. It offers
quite sophisticated load based scheduling across LPARs and DLPARs, with active measurement of
the state of each of the paths and fully automated failover and failback. Note that only
active/passive multi-pathing is offered on SPLPARs.
HP-UX supports pvlinks, as well as multi-pathing support in the Logical Volume Manager and the
Shared Logical Volume Manager. All offer active/active with sophisticated load based scheduling.
Linux device mapper bundles multiple paths into groups, and all paths in the first group are used
round-robin. If all paths in a group fail, the next group is used. RHEL and SUSE Linux on ProLiant
offer active/passive with static load balancing (preferred path).
Multi-Path I/O (MPxIO) is integrated in Solaris 8 and beyond. It is active/active, but uses only
round robin scheduling among the paths.
Windows does not offer a specific multi-path I/O module, but offers a technology API, which has
been implemented by many storage vendors, including HP and QLogic, as a Device Specific
Module (DSM) for each of their specific storage arrays.
Tool Name | O/S Support | Features
EMC PowerPath | AIX, HP-UX, Linux, Solaris, Tru64, Windows | active/active, load based scheduling, auto failback
Emulex MultiPulse | ProLiant Linux | 8 paths, active/passive, auto failback
Hitachi Dynamic Link Mgr | AIX, HP-UX, Linux, Solaris, Windows | active/active, round robin scheduling, auto failback
HP SecurePath, Auto-Path | AIX, HP-UX, Linux, Solaris, Windows | 32 paths, active/active, load based scheduling, auto failback
IBM Subsystem Device Driver | AIX, HP-UX, Linux, Solaris, Windows | 16 paths, active/active, load based scheduling, auto failback
QLogic Multi-Path | Linux, Windows | 32 paths, active/passive, auto failback
Sun StorEdge Traffic Mgr | AIX, HP-UX, Windows | active/passive, auto failback
Veritas Dynamic Multi-Path | AIX, HP-UX, Linux, Solaris, Windows | active/active, load based scheduling, auto failback
Figure 26 3rd Party Multi-Pathing Tools
HP XP AutoPath supports dynamic load balancing including shortest queue length and shortest
service time per path.
Symantec Veritas offers Dynamic Multi-Path (DMP). DMP does sophisticated path analysis to
group paths with common components, to avoid failing over to a path with a failed component.
However, note that V4.1 and later versions of Veritas Volume Manager (VxVM) and Dynamic
Multipathing are supported on HP-UX 11i v3, but do not provide multi-pathing and load
balancing; DMP acts as a pass-through driver, allowing multi-pathing and load balancing to be
controlled by the HP-UX I/O subsystem instead.
QLogic Multi-Path is a Device Specific Module (DSM) of MPIO for Windows.
Hitachi Dynamic Link Manager supports round-robin load balancing, and automated path failover
(active/active). It is managed by Global Link Availability Manager.
Storage Chaining
One of the difficulties of implementing storage over time is that the vendors are continually leapfrogging each other in terms of performance and functionality, and you need to purchase a
storage array from another vendor in order to satisfy a real business requirement. And then there
is the case where the storage vendor really wants your business, and is prepared to offer you a
special deal to introduce their storage into your environment. The net result is that you often end
up with a heterogeneous storage environment, with storage arrays from a variety of vendors.
And this can cause problems, especially when you try to manage all of these disparate systems.
There are very few tools which even offer to manage the entire environment, and none of them
have the component level of understanding and management that the tool from that particular
vendor on that particular storage offers, and that you frequently require.
One solution for this is storage chaining, that is, to hide multiple storage arrays with another
device, and have the front-end device pretend that all of the storage is physically in that device.
There are two ways that this can be accomplished: in-band and out-of-band.
The in-band method is to connect one storage array directly to second storage array, and have
the second storage array simply appear as LUNs in the management tools of the first storage
array. So, in effect, the second storage array disappears and the first storage array simply appears
to have a great deal more space than it had before. In this case, the host sees a LUN through a
FibreChannel port. The LUN is mapped to a Logical Device called an LDEV, and an external
storage array is connected through other FibreChannel ports, which are simply mapped to the
original LUN and presented to the server. In this way, whether the LUN is physically present in the
storage array, or mapped through FibreChannel ports, the server cannot tell the difference: a LUN
is a LUN, which can simply be used.
The Hitachi Universal Storage Platform (USP) and HP XP arrays offer this type of chaining.
It can also be done with a server in cooperation with the storage switches. The IBM SAN Volume
Controller and HP (nee Compaq nee Digital)’s CASA (Continuous Access Storage Array) are
standard servers which offer in-band storage chaining.
The other choice is out-of-band, done with clustered servers in cooperation with the storage
switches. The difference between this and the above tools is that the storage appliance is only
used for management and path control, and all of the I/O goes through the storage switches
normally. EMC Invista implements Data Path Controllers (DPC) controlled by the Cluster Path
Controller (CPC); it and the NetApp V series run on standard servers and offer out-of-band
storage chaining.
Network Attached Storage (NAS)
As much as storage vendors have optimized their offerings, and reduced their prices in the face of
the fierce competition with each other, HBA's and storage (SAN) switches remain very expensive. They
are far too expensive to supply to each desktop which needs to access the storage in the
organization (never mind the problem of wireless access to the SAN), and they may even be too
expensive to supply to each small server which needs to access a great deal of storage.
But network ports are cheap, and each desktop, notebook PC and small server already has one or
more high speed network ports built in, since they need network access no matter what. So why
can’t we supply storage over the standard network? Well, we can, and we call it NAS.
NAS can offer either the block oriented storage of SCSI or FibreChannel through iSCSI, or file
system oriented storage such as NFS and CIFS (aka, Samba).
iSCSI is simply the Small Computer Systems Interface (SCSI) protocol encapsulated into a
networking protocol such as TCP/IP. It enables disk sharing over routable networks, without the
distance limitations and storage isolation of standard SCSI cabling. It can be used with standard
networking cards (NIC) and a software stack that emulates a SCSI driver, or it can be similar to a
storage HBA with a NIC assisted with a TCP Offload Engine (TOE) to take care of the crushing
processor overhead of TCP/IP networking protocols.
The Network File System (NFS) was developed by Sun in 1984 (V2 in 1989, V3 in 1995 and V4 in
2003). It originally used the User Datagram Protocol (UDP) for performance, but recently switched to
TCP for routability and reliability. V1-3 were stateless (where each operation was discrete, with no
knowledge of previous operations, which is excellent for failover but limits the efficiency of the
entire session), with V4 being stateful (where the connection has more knowledge of the previous
operations and can optimize caching and locking).
The Server Message Block (SMB) remote file system protocol, also known as Common Internet
File System (CIFS), has been the basis of Windows file serving since file serving functionality was
introduced into Windows. Over the last several years, SMB's design limitations have restricted
Windows file serving performance and the ability to take advantage of new local file system
features. For example, the maximum buffer size that can be transmitted in a single message is
about 60KB, and SMB 1.0 was not aware of NTFS client-side symbolic links that were added in
Windows Vista and Windows Server 2008. Further, it used the NetBIOS protocols.
Windows Vista (in 2006) and Windows Server 2008 (in 2008) introduce SMB 2.0, which is a new
remote file serving protocol that Windows uses when both the client and server support it.
Besides correctly processing client-side symbolic links and other NTFS enhancements, as well as
using TCP, SMB 2.0 uses batching to minimize the number of messages exchanged between a
client and a server. Batching can improve throughput on high-latency networks like wide-area
networks (WANs) because it allows more data to be in flight at the same time.
Whereas SMB 1.0 issued I/Os for a single file sequentially, SMB 2.0 implements I/O pipelining,
allowing it to issue multiple concurrent I/Os for the same file. It measures the amount of server
memory used by a client for outstanding I/Os to determine how deeply to pipeline.
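A back-of-the-envelope calculation shows why keeping more requests in flight matters on a high-latency link. This sketch (Python, illustrative numbers only, not the actual SMB 2.0 crediting algorithm) caps throughput by both the link speed and the amount of data outstanding per round trip:

    def throughput_mbps(rtt_ms, request_kb, outstanding, link_mbps):
        """Rough ceiling on throughput for a request/response protocol over a high-latency link."""
        bits_in_flight = outstanding * request_kb * 1024 * 8
        return min(link_mbps, bits_in_flight / (rtt_ms / 1000) / 1e6)

    # Illustrative numbers only: a 60 KB-per-message protocol over a 50 ms WAN
    for outstanding in (1, 4, 16):
        print(f"{outstanding:>2} outstanding requests: "
              f"~{throughput_mbps(50, 60, outstanding, link_mbps=100):.1f} Mbit/s")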
Because of the changes in the Windows I/O memory manager and I/O system, TCP/IP receive
window auto-tuning, and improvements to the file copy engine, SMB 2.0 enables significant
throughput improvements and reduction in file copy times for large transfers.
SMB (both 1.0 and 2.0) is now available through open source implementations, under the
Samba project, on all major operating systems.
The storage vendors have recognized the prevalence of both NFS and CIFS (SMB or Samba), and
have implemented those file system protocols over their standard storage connections, such as
FibreChannel. In this way, a server can access storage using a single file system protocol, whether
the storage is accessed via the network or FibreChannel.
Figure 27 describes the RAID support in the major NAS devices available today. As before, the
number in parentheses after each RAID level is the number of disks that can be in that RAIDset.
Vendor | RAID Levels | Protocols | Connection
EMC | Celerra NS/NSX: 1 (2), 3 (9), 5 (16), 6 (16) | CIFS, iSCSI, NFS | FibreChannel, Network
EMC | CLARiiON AX4, CX3: 0 (16), 1 (2), 3 (9), 5 (16), 6 (16), 10 (16) | CIFS, iSCSI, NFS | FibreChannel, Network
HP | AiO: 0 (12), 1 (2), 5 (12), 6 (12) | CIFS, iSCSI, NFS | Network
HP | EFS: 0 (168), 1 (168), 5 (168), 10 (168) | CIFS, NFS | FibreChannel, Network
HP | MSA 2000i: 0 (48), 1 (2), 3 (48), 5 (48), 6 (48), 10 (48), 50 (48) | iSCSI | Network
Hitachi | Essential Line: 6 (8) | CIFS, NFS | Network
IBM | High Perf: 5 (8), 50 (256); N: RAID-DP (6) (28), 1656 | CIFS, iSCSI, NFS | FibreChannel, Network
NetApp | FASxxxx: 4 (14), DP (6) (28) | CIFS, iSCSI, NFS | DAS, FibreChannel, Network
Figure 27 NAS RAID Support
Tiering
Not all data is created equal. Some of the data is absolutely mission-critical, with requirements
for extremely high performance, zero downtime, and extremely frequent backups such that no
data loss is acceptable. Some other data is still important to the organization, but performance is
not as critical, some downtime is acceptable, and people could reasonably reconstruct the data if
it was lost. Other data is archived, where it will be used only very occasionally, performance is not
an issue, and the data never changes such that the backup you took 2 years ago is still valid. But
you need to ensure that all of this data is available when it is needed, whether for financial
reporting, regulatory requirements, protection against litigation, etc.
Not all storage arrays are created equal. Some storage arrays offer incredibly fast performance
with very high speed FibreChannel disks, each of which is fairly small in order to maximize the
number of spindles for performance, multiple redundancy inside the array with no SPoF, and RAID
10 and multiple sparing. Other storage arrays offer lower speed FATA or SATA disks, each of
which is huge in order to minimize the footprint, simplified implementation with little to no
redundancy, and RAID 5 or less and very few spares. The first can be quite expensive, while the
second can be fairly inexpensive.
Putting these two concepts together, we can implement storage tiering.
In the same way you evaluate the business value of your data in order to choose a RAID level and
a backup schedule (including cloning and long distance replication), you can choose a storage
array in order to match your business requirement to the amount you are willing to pay to store
that particular set of data.
You should avoid the tendency to express tiers as specific technologies. For example, you might
say that all Tier 1 storage is FibreChannel based and that all Tier 2 storage is SAS or SATA based.
Instead, present tiers as "service levels" that express real business needs, and let those business
needs drive you to a solution.
Five tiers of storage are standard:
• Tier 1: The data is mission critical, requiring high performance with zero downtime and zero data loss acceptable
• Tier 2: The data is important, with good performance requirements, and a 2-4 hour recovery time and 2-4 hours of data loss being acceptable
• Tier 3: Archived data, with minimal performance requirements, and 4-8 hour recovery time. Since the data is archived, data loss requirements are minimal, though you do want to verify the backups occasionally.
• Tier 4: The data is now near-line, on automated tape or optical libraries, with a 1-2 hour access time. The data is automatically recovered from the backup media.
• Tier 5: The data is off-line in a tape library, with up to a 72 hour access time. The data is manually recovered from the backup media.
Data will often move between one tier and another. For example, this quarter’s order data may
exist in Tier 1 storage, as it represents active contracts and orders which have yet to be fulfilled.
But once the quarter is over and all the contracts are completed, this data may move to Tier 2.
And once the financial year is over, the entire year’s data may move to Tier 3. This can be
accomplished with Hierarchical Storage Management (HSM), Enterprise Content Management
(ECM) and Information Lifecycle Management (ILM) tools.
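A tiering policy of this kind can be as simple as a rule keyed on how recently the data was active. The following is a toy sketch (Python, with hypothetical thresholds echoing the order-data example above), not a description of any particular HSM or ILM product:

    from datetime import date, timedelta

    def tier_for(last_active: date, today: date = date(2008, 6, 30)) -> int:
        """Toy age-based placement policy."""
        age = today - last_active
        if age <= timedelta(days=92):      # current quarter: active contracts and orders
            return 1
        if age <= timedelta(days=366):     # rest of the financial year
            return 2
        return 3                           # older data is archived

    print(tier_for(date(2008, 5, 1)), tier_for(date(2007, 11, 1)), tier_for(date(2005, 3, 1)))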
HSM is policy based management of backup and data migration, and is technology based. It
establishes policies for data, and migrates the files transparently between the different tiers. EMC
DiskXtender works with NAS, UNIX/Linux and Windows based storage. HP XP Auto-LUN will
migrate data between the multiple tiers in an XP array, including the chained storage that it is
managing. IBM Grid Access Manager, IBM Tivoli Storage Manager and Veritas Volume
Manager Dynamic Storage Tiering also implement HSM at the data level.
It is also possible to do this with files. Arkivio Auto-Stor, F5 Acopia Intelligent File Virtualization
and NuView File Lifecycle Manager will implement file based HSM.
Database and application vendors recognize the need to do this inside their applications. EMC
Documentum, CommVault (for Exchange), Mimosa NearPoint e-mail File System
Archiving, Oracle, Microsoft SQL Server and DB2 all implement data based HSM by storing all of
the data in their own set of databases, and categorizing the content, and migrating the
documents appropriately according to the business policies of the organization.
Note that when HSM moves data from on-line (whether Tier 1, 2 or 3) to near-line or off-line (Tier
4 or 5), it will leave behind a “stub” on the on-line storage array. This stub contains all of the file
information (owner, size, name, etc), but is extremely small, in that the only “data” it contains is a pointer to
the near-line or off-line storage where the actual data exists. When the file is accessed, the data is
transparently (except for some latency) fetched from the near-line or off-line storage, exactly as if
the data had existed on the on-line storage array. The advantage of this is that user access
methods don’t change no matter where the data actually exists, and the backup of the stub is
many times smaller than the actual data would be.
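A stub can be pictured as nothing more than the file's metadata plus a pointer, with the recall happening behind the normal read path. The sketch below (Python) is purely illustrative; the field names and the recall function are hypothetical, not any vendor's on-disk format:

    from dataclasses import dataclass

    @dataclass
    class Stub:
        """Metadata left on the on-line tier after HSM migrates a file's data away."""
        name: str
        owner: str
        size: int
        tier: str           # e.g. "tape-library-2"
        location: str       # where the migrated data actually lives

    def read_file(stub: Stub, recall):
        """Transparent access: the caller just reads; recall() fetches from near-line storage."""
        data = recall(stub.tier, stub.location)   # adds latency, but the path and name never change
        assert len(data) == stub.size
        return data

    # Hypothetical usage, with a fake recall function standing in for the HSM back end
    stub = Stub("q3_orders.dat", "finance", 5, "tape-library-2", "vol0042/000173")
    print(read_file(stub, lambda tier, loc: b"12345"))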
ILM goes further than that, and implements policy based management of data from creation to
destruction. It tracks the data itself, and moves beyond archiving into actually removing the data
from the environment.
Planning is the key to effective storage tiering. You need to establish a team and standards for
the data. I know of customers who discovered that they were backing up hundreds of GBytes of
decades-old .TMP files, stored on their Tier 1 storage, which were of
no value whatsoever. So ensuring that the right data is on the right tier is critical.
Implement a Proof of Concept, implement across the enterprise, and then continually review new
data and assign it to the proper tier.
Client Virtualization
Up to this point we have been talking about virtualizing the back end systems: servers, storage
and networking. But there is another area where virtualization can yield the same kinds of
benefits: the client devices.
Personal computers (PC’s) today are incredible devices. They are far more capable (more
processing power, more memory, more disk space, and far better graphics) than mainframe
systems and high-end workstations of even a decade ago. Even notebook and tablet PC’s have
power that exceeds the capacity of entire data centers that existed when we started our careers.
And Moore’s Law has extended beyond the processor, into memory, disk and graphics, dropping
the price of these devices to where it is reasonable to place this power in the hands of many or all
of the people in your organization.
But in some cases this power has exceeded the requirements of the users. Customer Service
Representatives on the phone or at the counter of your business who run bounded applications
with limited graphics, office workers who only run simple word processing and spreadsheet
applications, and manufacturing people who run data entry and reporting applications in a
physically hostile environment, don't need all of this power. They would be quite satisfied
with a lower power (though still quite powerful) device.
And even though prices have come down, the percentage of people in your organization that
require at least some level of computing power has exploded. Twenty years ago, only engineers
required workstations, and most offices had a few centralized word processing devices. Today,
virtually every person in your office uses word processing, spreadsheets, web browsing and
specialized office applications. Many people on factory floor and warehouse production lines
require computer access in order to receive information about the orders to process and to report
on the fulfillment of those orders. And it is the rare person indeed who does not require e-mail
for their daily job. So the total cost (lower cost of each device times a much larger number of
devices) is much higher than it used to be.
Further, the maintenance of the PC, especially in the vast numbers that are required today, can
overwhelm an organization. And there are security concerns in having hundreds of client devices
containing highly sensitive data which are highly mobile (in the case of notebook and tablet PC’s).
Note that in almost every case the server device doing the virtualization is based on an AMD
Opteron or Intel Xeon processor, running either Linux or Windows. The reason for this is simple:
these are the processors and operating systems that the clients are running today on their current
client devices (ie, desktops). So unless I specify otherwise, the processors and operating systems
are x86/x64 and Linux/Windows.
I titled this section “Client Virtualization” because I wanted to incorporate the virtualization of the
PC in all of its forms: desktop, notebook, tablet, thin client and other terminal devices.
Client virtualization is extremely simple in concept: a common (usually very small) set of standard
operating system “gold” images are deployed to a set of standard server devices in the
datacenter, which run the standard desktop applications. The graphical user interface is delivered
via a simplified delivery device.
Figure 28 Client Virtualization Architecture
Figure 28 shows the common components of a client virtualization solution:

The storage needed to hold the data which is produced and consumed by the
applications run by the virtual clients.

The devices which are running the client software. The diagram specifies blade systems,
but any kind of server would work equally effectively. These can be large servers, small
rack-mount servers, or more recently blade servers. These are becoming an increasingly
popular choice, due to the economic and environmental advantages of blade servers.

The virtual clients (VC’s in the diagram) themselves. These can be either a single client
operating system per server device, multiple client operating systems (usually virtualized,
as discussed in the micro-partitioning section above) per server device, or a single
operating system handling multiple clients in a timesharing environment.

A connection manager handles the graphics, keyboard and pointing device connections
between the virtual clients and the client devices.

A remote protocol is required to interface over the network between the servers in the
data center and the client devices. There are several popular protocols to handle the
graphics communication, including Independent Computing Architecture (ICA) from
Citrix, Remote Desktop Protocol (RDP) from Microsoft, Remote Frame Buffer (RFB) used by
VNC, Remote Graphics Software (RGS) from HP, and Simple Protocol for
Independent Computing Environment (SPICE) from Qumranet, as well as several protocols
which are proprietary to a specific client virtualization solution.

Finally, the client device can be either a standard PC (desktop, notebook, tablet, etc) or a
thin client device with only a monitor, keyboard, pointing device and just enough
capability to display the graphics and return the keyboard and mouse events.
Because each user still requires some type of client device, you might be wondering how this is
better than simply giving each user a standard PC or notebook. I covered a little bit of this earlier,
but let’s make it explicit:

The thin client device is usually far cheaper than a standard desktop or notebook PC. For
example, it often has no disk, and is not licensed for any software.

It is extremely unusual to have 100% of the people using their client device at the same
time. People are on vacation, traveling, in meetings, out sick, etc, leaving their client
devices idle. In most organizations, 20% or more of the client devices are idle at any
given moment. By reducing the number of virtualized clients by this amount, both in
terms of servers in the datacenter and access devices, significant cost savings can be
achieved: think of how much your organization could save by reducing your PC purchases
by 20%.

People often move around. By utilizing roaming profiles, a person can have their exact
configuration (storage, networking, applications, and any other customizations allowed by
their profile) available at any client device, simply by logging in. The client virtualization
software will load their environment on the specific virtualized environment (an entire
server, a guest operating system instance on a server, or a time-sharing client on a server)
and deliver their profile to them, no matter where they are.

A broken desktop often means that the person is idle. But a broken server in the
datacenter simply means that all of the users on that device will shift over to another
server and continue working with either no interruption (using guest operating systems
instances with live migration) or a small interruption (being forced to log in again).
Roaming profiles allow the movement of users and optimal use of the servers. And a
broken client device is the same way: the user will simply log into any other similar client
device, and continue working without a major interruption. So client virtualization
solutions mean higher availability for the clients.

Security is also enhanced. With standard desktop and especially notebook PC’s, the files
are physically resident on the disk on the PC. This makes them somewhat difficult to
backup, very difficult to track, and extremely difficult to enforce version control. Further,
physical control of the information is lost, either due to loss of the device by the user or
by the deliberate theft of the information by corporate or other espionage. By having all
of the disks for the virtual clients being resident on the storage in the datacenter, most of
these problems are either eliminated or severely reduced.
A very important component of a client virtualization solution is the graphics protocol. This is
almost always the most important factor in whether the solution will meet the needs of the
clients, and is also the one most often overlooked. It is fairly easy to analyze the performance
requirements of each of the clients, and then to size the servers (processors and memory) and the
back end (networking and storage) to satisfy those requirements, as the tools and expertise for
this are similar to what the system administrator has been doing for servers for many years. But
the graphics is often a new concept.
In many cases the graphics protocol is going through the corporate backbone: this is the
common case in corporate call centers or corporate offices, where the clients are in the same
building or campus as the servers and the storage.
But if the clients are remote, and the graphics protocol is going out through a WAN or even the
Internet itself (which is often the case of home workers, remote offices, or multiple corporate
offices being served by a single datacenter), then the networking infrastructure and therefore
performance is something over which the system administrator has no control.
And the problem is usually not bandwidth. The Internet makes amazing amounts of bandwidth
available to any business and most homes, and wireless technology is available pretty much
anywhere at speeds which used to only be available in dedicated circuits. Download speeds of
512Kbits per second are the low end of home networking today (whether through cable modem
or telephone company DSL), are common at wireless hot-spots, and are well within the range of
cellular wireless cards in major metropolitan centers. This is perfectly adequate for everything
except large video images or extremely detailed graphics (real-time CAD drawings, etc).
The problem in most cases is latency, and more precisely, variable latency, where the delivery
delay between network packets is inconsistent. Human beings begin to detect interruptions at
around 75 milli-seconds, and find it becomes bothersome at between 100 and 150 milli-seconds
(over 1/10th of a second). When the screen is being refreshed at a certain rate (let’s say every 25
milli-seconds), and a refresh is delayed by more than 100 milli-seconds, humans perceive this as
an interruption and a degradation of the quality of service. This is especially true in streaming
media, such as audio and video. Audio especially is not bandwidth intensive, but hesitations
caused by variable latency are quickly perceived as breaks in the audio. You need to be aware of
this situation when deciding when and how to deploy client virtualization outside your campus.
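A quick way to see the effect is to count how many screen updates exceed the perception threshold under bursty delays. The sketch below (Python) uses the 25 milli-second refresh and roughly 100 milli-second threshold from the discussion above, with made-up delay values:

    import random
    random.seed(1)

    REFRESH_MS, THRESHOLD_MS = 25, 100      # numbers taken from the discussion above
    # Hypothetical per-update network delays (ms): mostly steady, with occasional spikes
    delays = [random.choice([20, 25, 30, 35, 40, 150, 180]) for _ in range(1000)]
    arrivals = [i * REFRESH_MS + d for i, d in enumerate(delays)]
    stalls = sum(1 for prev, cur in zip(arrivals, arrivals[1:]) if cur - prev > THRESHOLD_MS)
    print(f"{stalls} of {len(arrivals) - 1} update gaps exceed {THRESHOLD_MS} ms "
          f"and would be perceived as hesitations")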
There are four primary user segments for client virtualization:

At the very low end, the usage is primarily for task oriented uses such as a kiosk where
only a limited set of tasks need to be done on a very infrequent basis. Low cost per user is
a big driving factor for this solution. A shared Server Hosted environment, where a single
operating system hosts many users in a time-sharing manner, is excellent for this
segment.

In the middle are solutions that target basic productivity and knowledge workers. These
solutions are ideal for users who want a full desktop user experience and want to run
some multimedia applications, whether they are doing it in an office location or a remote
location. At the low end, each of the virtual client sessions can be a virtualized operating
system instance, which yields a very low cost per client, but may not scale if too many
virtual instances are deployed. With Virtual Desktops, the server is now in the data
center as well, but the server runs multiple instances of virtual machines each with its own
OS instance. There are multiple users connecting to a single server in the backend, but
each to its own OS instance in the VM.

A more scalable solution is to use a dedicated server (low end to mid range) per virtual
client. This delivers a guaranteed performance level, but the cost per user will be higher.
The application is run on a Blade or Physical PC, and is delivered remotely to the client.
The client device used by the user is typically a thin client that supports the protocol
required to access the application remotely

On the extreme high end is the blade workstation where users require a high end graphic
experience that requires dedicated hardware. This kind of solution is typically good for
users in the engineering community using 3D graphics or finance industry users that use
multiple monitors to run and view several applications at the same time. This is the same
as the previous segment in that each user has their own Blade or Physical PC, but in this
case the system is much more powerful than in the previous segment, reflecting the
greater demand for performance of the clients of this segment. There is a 1:1 mapping of
user to the PC. Cost for this is higher, but still lower than placing a powerful CAD
workstation on the desk of each user.
There is a great deal of overlap between these solutions: one organization's definition of “extreme
high end” might be another organization's definition of “middle tier”. But the distinction is mostly
whether the server is shared (whether with different guest operating system instances or different
users time-sharing a single bare iron operating system instance) or dedicated to a single user.
Product | User Type | Graphics | Auto Refresh
Citrix XenDesktop | Virtual Desktop | Independent Computing Architecture (ICA) | Yes (pooled)
EMC VMware ACE | Blade/Physical PC | Local | No
EMC VMware VDI | Virtual Desktop | RDP w/extensions | Yes
GoToMyPC | Remote Desktop | Proprietary | No
HP Consolidated Client Infrastructure | Blade/Physical PC | RDP, Remote Graphics Software (RGS) | No
HP Virtual Desktop Interface | Virtual Desktop | RDP, RGS | Yes (pooled)
Microsoft Terminal Services | Server Hosted | Remote Desktop Protocol (RDP) | No
Qumranet Solid ICE | Virtual Desktop | RDP, Simple Protocol for ICE (SPICE) | Yes
Symantec pcAnywhere | Blade/Physical PC | Proprietary | No
Virtual Network Computing | Blade/Physical PC | Remote Frame Buffer | No
X11 | Server Hosted | X11 | No
Figure 29 Client Virtualization Products
GoToMyPC and Symantec pcAnywhere are one-to-one desktop solutions which are focused on
allowing individuals to access their home or work desktops remotely. In most cases the graphics
bandwidth is minimal, with the primary requirement for bandwidth being file transfer. I include
them for completeness, but will not focus on them.
Virtual Network Computing (VNC) is similar, in that it focuses on the client device (running the
“VNC Viewer”) controlling a remote system (the “VNC Server”). Unlike most of the other products
covered in this topic, it has been implemented on a wide variety of platforms, and works very well
even between platforms. A very common situation is a Windows or Macintosh system accessing a
UNIX or OpenVMS server. There is even a Java VNC viewer running inside standard web
browsers. VNC uses the Remote Frame Buffer (RFB) protocol, which transfers the bitmap of the
screen to the graphics frame buffer, instead of transferring the actual graphics commands needed
to re-draw the client device screen. VNC and RFB are now open source products.
The “X Window System” (commonly called X11 or just X) was developed in 1984 at the
Massachusetts Institute of Technology, with backing from MIT, Digital Equipment
Corporation and IBM. It is a general purpose graphics protocol which has been implemented on
many platforms and client devices. The major difference between it and other graphics protocols
is that it was explicitly designed to be used remotely over a network, instead of to a locally
attached monitor. While it remains popular in the UNIX world, it has fairly high overhead and
bandwidth requirements, and is more suited for a campus environment than a truly remote
delivery mechanism. Therefore, in the dramatically larger desktop world (ie, on Windows), it has
been superseded by other graphics protocols such as RDP.
Citrix XenDesktop
EMC VMware ACE and VDI
HP CCI and VDI
Microsoft Terminal Services
Qumranet Solid ICE
Software Virtualization (Software As A Service)
In the server section we have been mostly talking about hardware virtualization, whether that was
implemented in hardware (hard partitioning) or software (soft partitioning, resource partitioning,
clustering, etc). This allows us to deliver the hardware resources only to the specific area (user,
application, etc) that requires it. But there is a further distinction that is becoming very popular,
and that is the ability to deliver only the software itself to the specific area (user, server, desktop,
etc) that requires it. This is primarily for the purposes of reducing licensing costs and controlling
the specific versions and patch levels of software that are available throughout your organization.
So, for example, you might want to equip all of your employees with the ability to run Microsoft
Word. One way would be to issue each of them a CD/DVD with a full copy of Microsoft Office
2003 Professional, and trust each of them to install this on their work system. And this will work,
but it has several problems:

Not all people are capable of installing the software in the optimal way, and many people
have their own ideas of how to install software (which, while perfectly valid, do not match
the ideas of other people in your organization nor the ideas of your IT department). You
will end up with a variety of installation levels, which will be difficult to trouble-shoot.

This is extremely expensive, because Microsoft will require you to pay for every one of
these licensed copies, whether the people actually need them or not.

When you transition to Microsoft Office XP Professional, you need to get back each of the
original CD/DVD media (if the people still have it), and issue new DVD media. This is a
very time consuming and lengthy task.
It would be better if you could maintain a single copy of the full Microsoft Office suite, and allow
each of your users to run it when and if they need it. All instances would be installed and de-installed in exactly the same way (as specified by the IT department after careful study and
following industry best practices), costs would be reduced because you would only license the
number of copies of the software that are running simultaneously at the peak times, and
upgrades would be easy as you would only upgrade the single copy.
This is the idea of Software As A Service (SAAS): delivering software packages when and where
they are needed, running either on virtual desktops (as described in the previous section) or
physical desktops connected through your corporate LAN.
− Altiris SVS
− Citrix XenApp (was Presentation Services)
− Microsoft SoftGrid
− VMware Thinstall
For more information
General information about virtualization

http://www.virtualization.info, a blog about virtualization with many good links

http://en.wikipedia.org/wiki/Hypervisor, a description of hypervisors and their history

http://it20.info/blogs/main/archive/2007/06/17/25.aspx, IT 2.0 Next Generation IT
Infrastructure discussion comparing VMware, Viridian (aka, Hyper-V) and Xen.

http://www.tcpipguide.com/free/t_IPAddressing.htm, on TCP/IP addressing and
subnetting

http://www.tech-faq.com/vlan.shtml, for a good discussion of VLAN technology
On the history of computing

http://www.thocp.net, The History Of Computing Project, for the history of hardware and
software and some of the people involved in this

http://www.computer50.org, for information on the Mark-1, Colossus, ENIAC and EDVAC

http://www.itjungle.com/tfh/tfh013006-story04.html for a history of computing focused
on virtualization, including some excellent pointers to other work

http://cm.bell-labs.com/who/dmr/hist.html, for a discussion of the invention of UNIX and
the UNIX File System (UFS)
On 3Com products
•
http://www.3com.com/other/pdfs/solutions/en_US/20037401.pdf, for details on 3Com’s
approach to VLANs
On AMD products

http://enterprise.amd.com/downloadables/Pacifica.ppt, for a description of the “Pacifica”
project on processor virtualization
On Cisco products

http://www.cisco.com/en/US/prod/collateral/ps6418/ps6423/ps6429/product_data_sheet
0900aecd8029fc58.pdf, for a description of VFrame Server Fabric Virtualization
On Citrix products

http://www.xensource.com/PRODUCTS/Pages/XenEnterprise.aspx, details on the open
source Xen project

http://www.citrix.com/English/ps2/products/product.asp?contentID=683148&ntref=hp_n
av_US, details of the Citrix Xen product
On Egenera products

http://www.egenera.com/products-panmanager-ex.htm, for a description of the
Processing Area Network (PAN) hypervisor management tool
On EMC (VMware) products

http://www.vmware.com/products/server/esx_features.html and
http://www.vmware.com/products/server/gsx_features.html for details on the VMware
virtual servers

http://www.vmware.com/pdf/vi3_systems_guide.pdf for the list of supported server
hardware and Chapter 13 for snapshots, and
http://www.vmware.com/pdf/GuestOS_guide.pdf for the list of supported guest operating
systems

http://vmware-land.com/vmware_links.html for a lengthy list of VMware pointers,
whitepapers and other documents

http://www.vmware.com/products/esxi/, for details on ESXi

http://www.vmware.com/pdf/vmware_timekeeping.pdf, for a discussion on adjusting the
time in guest instances

http://www.emc.com/products/family/powerpath-family.htm, EMC PowerPath

http://www.emc.com/collateral/hardware/comparison/emc-celerra-systems.htm, the EMC
Celerra storage home page

http://www.emc.com/products/family/diskxtender-family.htm, HSM software

http://www.emc.com/products/family/documentum-family.htm, EMC ILM products

http://www.vmware.com/products/desktop_virtualization.html, ACE and VDI desktop
virtualization products
On Emulex products

http://www.emulex.com/products/hba/multipulse/ds.jsp, Emulex MultiPulse for RHEL and
SUSE Linux on ProLiants
On Hitachi Data Systems products
•
http://www.hds.com/products/storage-systems/index.html, TagmaStore SMS (primarily
SMB), WMS, AMS and USP
•
http://www.hds.com/assets/pdf/intelligent-virtual-storage-controllers-hitachitagmastore-universal-storage-platform-and-network-storage-controller-architectureguide.pdf for a general guide to HDS storage products
•
http://www.hds.com/products/storage-software/hitachi-dynamic-link-manageradvanced-software.html, Hitachi Dynamic Link Manager
•
http://www.hds.com/products/storage-systems/index.html, home page for HDS storage
products
On HP products

http://www.hp.com/products1/unix/operating/manageability/partitions, details on the
HP-UX partitioning continuum, and
http://www.hp.com/products1/unix/operating/manageability/partitions/virtual_partitions.
html for the specifics on vPars

http://h71000.www7.hp.com/doc/72final/6512/6512pro_contents.html, details on
OpenVMS Galaxy

http://h71028.www7.hp.com/enterprise/cache/8886-0-0-225-121.aspx, Adaptive
Enterprise Virtualization

http://h71028.www7.hp.com/enterprise/downloads/040816_HP_VSE_WLM.PDF, D.H.
Brown report on HP Virtual Server Environment for HP-UX

http://h71028.www7.hp.com/enterprise/downloads/Intro_VM_WP_12_Sept%2005.pdf,
white paper on HP Integrity Virtual Machine technology and
http://h71028.www7.hp.com/ERC/downloads/c00677597.pdf, “Best Practices for Integrity
Virtual Machines”

http://docs.hp.com/en/hpux11iv2.html and http://docs.hp.com/en/hpux11iv3.html
contain the documentation for Psets, PRM, WLM and gWLM

http://h71000.www7.hp.com/doc/82FINAL/5841/aa-rnshd-te.PDF, Chapter 4.5 for a
description of OpenVMS Class Scheduling

http://www.hpl.hp.com/techreports/2007/HPL-2007-59.pdf, compares the performance of
Xen and OpenVZ in similar workloads.

http://www.hp.com/rnd/pdf_html/guest_vlan_paper.htm, discusses VLAN security

http://h18004.www1.hp.com/products/servers/networking/whitepapers.html discusses HP
networking products

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00814156/c00814156.pdf
, discusses HP Virtual Connect

http://docs.hp.com/en/5992-3385/ch01s01.html discusses HP-UX RAID technology

http://docs.hp.com/en/native-multi-pathing/native_multipathing_wp_AR0803.pdf for HP-UX 11iV3 dynamic multi-pathing

http://h71000.www7.hp.com/doc/83FINAL/9996/9996pro_206.html#set_pref_path,
OpenVMS preferred pathing

http://h18006.www1.hp.com/storage/aiostorage.html, All in One (AiO) iSCSI storage
arrays

http://h18006.www1.hp.com/products/storageworks/efs/index.html, Enterprise File
Services Clustered Gateway

http://h18006.www1.hp.com/storage/disk_storage/msa_diskarrays/san_arrays/msa2000i/s
pecs.html, Modular Storage Array specifications

http://h18006.www1.hp.com/products/storage/software/autolunxp/index.html, HP
AutoLun software
On IBM products

http://www-1.ibm.com/servers/eserver/bladecenter/,
http://www-1.ibm.com/servers/eserver/xseries/,
http://www-1.ibm.com/servers/eserver/iseries/, and
http://www-1.ibm.com/servers/eserver/pseries/hardware/pseries_family.html, eServer
brochures and full descriptions

http://www-1.ibm.com/servers/eserver/xseries/xarchitecture/enterprise/index.html, for
general information on X-Architecture technology in the eServer, but specifically
ftp://ftp.software.ibm.com/pc/pccbbs/pc_servers_pdf/exawhitepaper.pdf for a detailed
white paper on the X-Architecture

http://www.redbooks.ibm.com/redpapers/pdfs/redp9117.pdf, eServer p570 Technical
Overview

http://www.redbooks.ibm.com/abstracts/SG246251.html, logical partitioning on the IBM
eServer iSeries

http://www.redbooks.ibm.com/abstracts/sg247039.html and
http://www.redbooks.ibm.com/redbooks/pdfs/sg247940.pdf, logical partitioning on the
IBM eServer pSeries

http://publib-b.boulder.ibm.com/abstracts/tips0119.html, LPAR planning for the IBM
eServer pSeries, describes the overhead required for LPARs

http://www.redbooks.ibm.com/abstracts/tips0426.html, a comparison of partitioning and
workload management on AIX

http://www-1.ibm.com/servers/eserver/about/virtualization/systems/, describes the
Virtualization Engine, including micro partitioning and the Virtual I/O Server (VIOS).
http://techsupport.services.ibm.com/server/vios is the documentation for VIOS.

http://www.redbooks.ibm.com/redbooks/pdfs/sg246389.pdf, Appendix A describes the
management of the entitlement of guest instances.

http://www.redbooks.ibm.com/redbooks/pdfs/sg246785.pdf, describes Enterprise
WorkLoad Manager (eWLM) V2.1

http://www.redbooks.ibm.com/abstracts/tips0426.html describes the attributes of
WorkLoad Manager and Partition Load Manager

http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/ipha2/cuod.htm,
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/ipha2/reservecoduse.
htm, and
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/ipha2/tcod.htm,
http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/ipha2/cbu.
htm, describe all of the Capacity On Demand permutations and billing for p5 and p6
servers

http://www.ibm.com/software/network/dispatcher, IBM SecureWay Network Dispatcher
is a front end load balancer

http://www-03.ibm.com/systems/storage/resource/pguide/prodguidedisk.pdf, a general
product guide for IBM storage with pointers to specific products

http://www-1.ibm.com/support/docview.wss?uid=ssg1S7000303&aid=3, IBM Subsystem
Device Driver

http://www-03.ibm.com/systems/storage/network/index.html, home page for IBM
network (NAS) storage

http://www-03.ibm.com/systems/storage/software/gam/index.html, IBM Grid Access
Manager
On Intel products

http://www.intel.com/technology/computing/vptech/, for a description of the
“Vanderpool” VT-x project on processor virtualization

ftp://download.intel.com/technology/computing/vptech/30594201.pdf, for a description
of the VT-i technology, specifically Chapter 3
On Microsoft products

http://www.microsoft.com/windowsserversystem/virtualserver/default.mspx, the Virtual
Server home page,
http://www.microsoft.com/technet/prodtechnol/virtualserver/2005/proddocs/vs_tr_comp
onents_vmm.mspx, the Virtual Server 2005 Administrator’s Guide,
http://www.microsoft.com/windowsserversystem/virtualserver/evaluation/virtualizationfa
q.mspx, the Virtual Server 2005 FAQ, and
http://www.microsoft.com/windowsserversystem/virtualserver/overview/vs2005tech.mspx
for the Virtual Server technical overview

http://www.microsoft.com/windowsserver2008/en/us/white-papers.aspx, for details on
Windows Server 2008

http://www.microsoft.com/windowsserversystem/virtualserver/evaluation/linuxguestsupp
ort/default.mspx, for the description and download for Virtual Machine Additions For
Linux

http://www.microsoft.com/technet/archive/winntas/maintain/optimize/08wntpca.mspx?m
fr=true discusses Windows RAID technology

http://www.microsoft.com/windowsserver2003/technologies/storage/mpio/default.mspx,
Windows multipathing
On NEC products

http://www.necam.com/Servers/Enterprise/collateral.cfm, the NEC Express5800/1000
main page, but especially
http://www.necam.com/servers/enterprise/docs/NEC_Express58001000Series_2008_R1.pdf for details on the high end servers

http://www.necam.com/DP/ for a description of Dynamic Hardware Partitioning
On NetApp products
•
http://www.netapp.com/us/products/storage-systems/fas6000/, home page for NetApp
FAS products
On Novell products

http://www.novell.com/solutions/datacenter/docs/dca_whitepaper.pdf, describes
DataCenter Automation

http://www.novell.com/products/zenworks/orchestrator/, describes ZenWorks
Orchestrator
On Oracle products

http://www.oracle.com/corporate/pricing/partitioning.pdf, describes how Oracle “per
processor” licensing works with partitioned servers

http://www.oracle.com/technologies/virtualization/index.html, covers Oracle VM and
other Oracle virtualization technologies

http://www.oracle.com/technology/products/database/asm/pdf/take%20the%20guesswo
rk%20out%20of%20db%20tuning%2001-06.pdf and
http://www.oracle.com/technology/products/database/asm/index.html discuss ASM and
Oracle storage mechanisms
On Qumranet Solid ICE products

http://www.qumranet.com/products-and-solutions/solid-ice-white-papers, for the top
level web page for Solid ICE desktop virtualization
On Parallels (SWsoft) products

http://www.swsoft.com/en/products/virtuozzo/, base product page for Virtuozzo

http://www.parallels.com/en/products/server/mac/, base product page for Parallels Server
for Macintosh
On Red Hat products

http://www.redhat.com/rhel/server/compare/, compares the virtualization features of Red
Hat Enterprise Linux and Red Hat Enterprise Linux Advanced Platform

http://linux-vserver.org/Welcome_to_Linux-VServer.org and http://linux-vserver.org/Paper,
Linux-VServer, which implements virtual operating system instances inside an O/S instance.

http://sources.redhat.com/dm/ is the source, but http://lwn.net/Articles/124703/ is an
excellent article on Linux Device Mapper and multi-path
On Sun products

http://www.sun.com/servers/entry/, http://www.sun.com/servers/midrange/, and
http://www.sun.com/servers/highend/ for specific server information

http://www.sun.com/servers/highend/whitepapers/Sun_Fire_Enterprise_Servers_Performa
nce.pdf for a description of the Sun Fireplane System Interconnect, with more details at
http://www.sun.com/servers/wp/docs/thirdparty.pdf, Section 4. See also
http://docs.sun.com/source/817-4136-12, Chapters 2 and 4, for a description of the
overhead and latency involved in the Fireplane

http://www.sun.com/products-n-solutions/hardware/docs/pdf/819-1501-10.pdf, Sun Fire
Systems Dynamic Reconfiguration User Guide

http://www.sun.com/servers/highend/sunfire_e25k/features.xml, for a description of
Dynamic System Domain in the Sun Fire E25K, which states that Dynamic System
Domains are “electrically fault-isolated hard partitions”, focusing on the OLAR features
and not the partitioning features recognized by the other vendors

http://www.sun.com/bigadmin/content/zones/, for a description of Solaris containers and
zones and http://www.sun.com/blueprints/1006/820-0001.pdf for a Sun whitepaper on all
virtualization technologies

http://www.sun.com/datacenter/consolidation/docs/IDC_SystemVirtualization_Oct2006.p
df for an IDC whitepaper reviewing containers and zones and
http://www.genunix.org/wiki/index.php/Solaris10_Tech_FAQ for a FAQ on Sun
virtualization

http://www.sun.com/datacenter/cod/, Sun’s Capacity On Demand program and products

http://www.sun.com/blueprints/0207/820-0832.pdf and
http://www.sun.com/servers/coolthreads/ldoms/datasheet.pdf, Sun Logical Domains

http://docs.sun.com/app/docs/doc/819-2450 and
http://www.sun.com/software/products/xvm/index.jsp, covers xVM and Zones

http://docs.sun.com/source/820-2319-10/index.html, covers LDoms

http://www.sun.com/storagetek/disk.jsp, SANtricity on the StorageTek line

http://docs.sun.com/source/819-0139/index.html, the MPxIO Admin Guide
On Symantec products

http://eval.symantec.com/mktginfo/enterprise/white_papers/entwhitepaper_vsf_5.0_dynamic_multi-pathing_05-2007.en-us.pdf, Dynamic Multi Pathing

http://h20219.www2.hp.com/ERC/downloads/4AA1-2792ENW.pdf, discusses Symantec
Veritas Volume Manager Dynamic Storage Tiering
On Unisys products

http://www.unisys.com/products/es7000__servers/features/crossbar__communication__pa
ths.htm, describes the cross-bar communications and
http://www.unisys.com/products/es7000__servers/features/partition__servers__within__a__
server.htm describes the hard partitioning capabilities

http://www.serverworldmagazine.com/webpapers/2000/09_unisys.shtml, describes
Cellular MultiProcessing (CMP)
On Virtual Iron products

http://www.virtualiron.com/products/index.cfm describes “native virtualization”

http://www.virtualiron.com/products/System_Requirements.cfm describes the systems
requirements (hardware platforms, guest O/S’s, etc) of Virtual Iron

http://virtualiron.com/solutions/virtual_infrastructure_management.cfm describes
LiveMigrate and LiveCapacity
On Virtual Network Computing (VNC) products

http://www.realvnc.com/products/enterprise/index.html, describing the Enterprise Edition
of VNC

http://www.tightvnc.com/, describing one of the many freeware versions of VNC
On Para-virtualization products

http://www.trango-systems.com/english/frameset_en.html, describes the Trango real-time embedded hypervisor

http://www.linuxdevices.com/news/NS3442674062.html for a description of Open Kernel
Labs (OK Labs)'s offering of a Linux micro-kernel used in cell phones, which offers a "Secure
HyperCell" to encapsulate an entire guest O/S (primarily Symbian), an application or a
device driver. VirtualLogix and Trango are competitors.

Xen and the Art of Virtualization describes the issues around virtualizing the x86
architecture and the Xen approach to solving those problems. Also see the October 15,
2004 Linux Magazine article Now and Xen by Andrew Warfield and Keir Fraser

http://www.cs.rochester.edu/sosp2003/papers/p133-barham.pdf, a technical white paper
by the authors of Xen, detailing the design and implementation

http://www.xensource.com/files/xensource_wp2.pdf, describes the para-virtualization
model and how it can cooperate with operating systems which do not support para-virtualization, and http://xensource.com/products/xen_server for a comparison of the
different editions of Xen

http://www.egenera.com/reg/pdf/virtual_datacenter.pdf, describes the Egenera
Processing Area Network (PAN) management tool which offers datacenter virtualization,
or management of many host and guest operating system instances across a datacenter
On emulation products
•
http://www.softresint.com/charon-vax/index.htm
•
http://www.transitive.com/news/news_20060627_2.htm
•
http://www.platform-solutions.com/products.php and http://www.t3t.com
On storage in general

http://www.wwpi.com/index.php?option=com_content&task=view&id=301&Itemid=44,
“Achieving Simplicity with Clustered, Virtual Storage Architectures” by Rob Peglar

http://www.adic.com/storenextfs, http://www.sun.com/storage/software/data_mgmt/qfs/
and http://www-1.ibm.com/servers/storage/software/virtualization/sfs/index.html for
SAN-based file systems across non-clustered servers

http://www.unixguide.net/unixguide.shtml, for a good comparison of the commands and
directories for all the UNIX systems

http://www.snia.org/education/dictionary/r for the official definition of RAID
Acknowledgements
I would like to extend my heartfelt thanks to Matt Connon, Jeff Foster, Eric Martorell, Brad
McCusker, Mohan Parthasarathy, Ram Rao, Alex Vasilevsky, John Vigeant and Scott Wiest for their
invaluable input and review. I would also like to thank Nalini Bouri, K.C. Choi and Eamon
Jennings for their enthusiastic support for this and my other whitepapers.
Revision History
V0.1, Mar 2005 – Preliminary version, with some material derived from “A Survey of Cluster
Technologies” V2.4.
V1.0, Jun 2006 – First released version, to HP internal audience
V1.1, Sep 2006 – Presented at HP Technology Forum, Houston TX
V1.2, Jan 2007 – Published in TYE
V1.4, Jun 2007 – Presented at HP Technology Forum, Las Vegas NV
V2.0, Jun 2008 – Presented as a pre-conference seminar at HP Technology Forum, Las Vegas NV