EVA Cache: Why So Efficient?

October 5, 2004
Contents
Introduction
Storage Performance Council Benchmark
Cache functions
Random reads
Random writes
Sequential reads
Sequential writes
Summary
“It’s not the size of the dog in the fight; it’s the size of the fight in the dog.”
Archie Griffin (5’9” two-time Heisman winner)
Introduction
Modern arrays are a “system” of many dependent pieces and parts. One of those
dependent pieces is cache memory. Today almost all disk arrays employ cache
memory to increase system performance by hiding or at least minimizing the
relatively slow mechanical performance of disk drives. Some array designs require
more cache than others and some arrays have a faster backend than others. But in
any case, this one principle needs to be understood – that it’s not the size of the
cache that matters or the speed of the backend, but it is the overall system
performance that counts. Suffice it to say that overall array performance is based
on the multiple pieces and parts working together and not on the size or speed of
one component. Today, the amount of cache offered in mid-range arrays varies
widely. For example, the HP StorageWorks EVA and the IBM FAStT900 (now the
DS4500) both offer 2GB of cache while the EMC CLARiiON CX700 offers 8GB.
Whether or not an array has the proper amount of cache is best determined by
running an industry standard performance benchmark and seeing how the system
functions as a whole. An out-of-balance system is easily exposed.
The operating system environment also plays a part in determining the optimum
amount of cache. Mainframe IO applications are typically storage cache-hit
intensive and will generally benefit from larger caches in the storage subsystem. In
mainframe environments storage cache hit rates of 90% are not unusual, and this
makes cache in these situations a good investment. On the other hand, open
system applications typically cache the IO requests in the server memory, resulting
in storage cache hit rates of 20-40%, less than half that of mainframes. As a proof
point, the industry recognized Storage Performance Council open systems
benchmark is based on a cache hit rate of 20%. As such, informed data systems
managers tend to view backend disk performance, not cache size or cache
performance, as the best overall predictor for mid-range open system application
performance.
Another point to consider is the array itself and its assortment of features and
capabilities. For example, the new XP12000 from HP holds 1152 disk drives, has
128GB of cache, services thousands of servers, heavily utilizes the cache when
doing remote mirroring, and also has a Cache-LUN software product that allows
entire LUNs to be held in cache. Obviously, with this range of features, a large
cache is important even in open systems.
Despite all these considerations, the question is still frequently asked: how can the EVA
perform so well with only 2GB of cache memory? The answer is that the EVA’s
HSV110 controller was designed exclusively by HP, a systems vendor that has an
intimate knowledge of how servers, storage, applications, and operating systems
interact together. No company knows more about open systems than HP. In the
EVA, the cache size and advanced caching algorithms have been optimized for
open systems environments. In fact, according to the Storage Performance
Council (SPC) benchmark, the EVA with its 2GB of cache compares very favorably
in random performance with a Shark 800 Turbo configured with 16GB of cache, 6
heavy-duty processors, and the compute power of dual RS/6000 servers. This is
the same Shark IBM uses in its most performance intensive mainframe and open
systems environments.
Bottom line: cache is necessary in an array. However, because of the high cost of
array cache and the relatively low cache hit rates in open systems environments,
customers should look at more than just the size of the cache when trying to
estimate array performance – well designed cache algorithms can have a dramatic
impact on cache efficiency and cause the cache to perform out of all proportion to
its size.
The remainder of this paper describes how the EVA cache algorithms maximize
the effectiveness of the EVA cache.
Storage Performance Council Benchmark
The purpose of this paper is to explain why the EVA cache is so efficient. It is not
intended to prove that the EVA is one of the fastest arrays in the world. We believe
that argument has already been settled by the EVA’s Storage Performance Council
Benchmark result. Below are some results from the SPC website. Notice how
closely the EVA performs to the IBM Shark 800 Turbo and how it exceeds the
performance of the FAStT900 (now called the DS4500).
http://www.storageperformance.org/results
COMPANY   PRODUCT                         SPC IOPS
IBM       Shark 800 Turbo (16GB cache)    22,999
HP        EVA5000                         20,096
IBM       FAStT900 (DS4500)               18,447
Sun       StorEdge 6920 (8 tray)          19,496
EMC       CLARiiON CX                     Did not participate
Cache functions
The primary purpose of controller cache is to mask the relatively long service times
associated with the mechanical nature of the actuator head movement across the
face of the disk platter. Today’s disk drives have an average access time measured
in milliseconds, while cache access times for high performance controllers are
typically less than 200 microseconds. Since a disk access can take 30 to 40 times
longer than a cache access, efficient cache algorithms can have a dramatic effect
on overall storage performance.
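To put that gap in concrete terms, the short calculation below shows how strongly the
hit rate drives the average read service time. The 6 ms disk figure and the 200
microsecond cache figure are illustrative round numbers consistent with the text, not
measured EVA values.

```python
# Illustrative average service time as a function of cache hit rate.
# The 6 ms and 0.2 ms figures are round numbers, not measured values.
DISK_MS = 6.0       # average disk access time, milliseconds
CACHE_MS = 0.2      # cache access time, ~200 microseconds

def avg_service_time_ms(hit_rate: float) -> float:
    """Expected service time for a single read at a given cache hit rate."""
    return hit_rate * CACHE_MS + (1.0 - hit_rate) * DISK_MS

for hit in (0.20, 0.40, 0.90):
    print(f"hit rate {hit:.0%}: {avg_service_time_ms(hit):.2f} ms")
# hit rate 20%: 4.84 ms   (typical open systems)
# hit rate 40%: 3.68 ms
# hit rate 90%: 0.78 ms   (typical mainframe)
```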
Controller cache algorithms can be divided into four categories based on I/O
workloads: random reads, random writes, sequential reads, and sequential writes.
The amount of cache necessary for good performance in any array depends
partially on the efficiency of the algorithms specific to each of these areas. The
rest of this paper examines the nature of each of these I/O workloads and the
specific EVA algorithm that handles it.
Random reads
Read cache is used to cut down on disk accesses that repeatedly read the same
data. When a data block is accessed, the block is placed in cache on the
assumption (or hope) that it will be accessed again. If the host does access that
data again, then that block can be returned directly from cache, thereby avoiding a
much slower disk access. Although this seems like a foolproof approach, there
are a few problems when applying this theory to the real world:
1. In open systems random access environments there is a low probability of a
cache hit. For example, if the application randomly accessed all the data
held in 1TB and had 8GB of cache (4GB of read cache and 4GB of write cache),
the probability of a read cache hit would be 4/1000, or less than ½%. In a
10TB solution the hit rate would drop to roughly 0.04% (see the short
calculation after this list). In other words, given the
high cost of storage cache memory, the cache would return almost no “bang
for the buck”. Fortunately, most open systems environments are not totally
random and the cache hit ratios tend to be in the 20% range. Nevertheless,
the principle still holds true – open systems performance is mostly
dependent on backend disk performance and on the caching algorithms and
not on cache size. Note: most interactive database applications are
random access.
2. Modern databases have their own internal caches (the SGA in Oracle, for
example). Since the database knows best what data should be cached, any
data that is accessed repeatedly is kept within the host memory (the SGA),
and the storage will never see the I/O.
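A two-line calculation makes point 1 concrete. The 4GB read cache and the 1TB and
10TB working sets are the figures from the example above.

```python
# Illustrative read-cache hit probability when access is uniformly random
# across the working set; the figures match the example in the text.
def random_hit_rate(read_cache_gb: float, working_set_gb: float) -> float:
    """Fraction of uniformly random reads expected to hit the read cache."""
    return read_cache_gb / working_set_gb

print(f"{random_hit_rate(4, 1_000):.2%}")    # 1 TB working set  -> 0.40%
print(f"{random_hit_rate(4, 10_000):.2%}")   # 10 TB working set -> 0.04%
```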
In spite of these issues, there are times when the storage cache is definitely
beneficial. The most common case is when there is a “hot spot” on the disk. This
may be due to a specific file or portion thereof that is repeatedly accessed, but is
not cached by the application or operating system. A good example would be the
index area of a database. Although this access pattern does benefit from a random
access read cache, in most real world environments there is an added
complication, i.e., in addition to the I/O that repeatedly hits a “hot spot” on a disk,
there are usually other ongoing I/O operations on other disks that are completely
random. If these other accesses occur at a high rate and are read into cache, it is
likely that they will eventually displace the data from the hot spot area. The result
will be cache misses for the hot spot data.
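This displacement effect is easy to reproduce with a simple LRU model. The sketch
below is purely illustrative and makes no claim about the EVA’s actual replacement
policy; all sizes and rates are invented for the example.

```python
import random
from collections import OrderedDict

# Toy LRU read cache: a small set of "hot" blocks (e.g., a database index)
# shares the cache with a heavy, purely random read stream. The random
# traffic steadily evicts the hot data, so the hot blocks miss even though
# they would easily fit in cache on their own. Sizes are arbitrary examples.
CACHE_BLOCKS = 1_000
cache = OrderedDict()

def access(block: int) -> bool:
    """Record an access; return True on a cache hit (LRU replacement)."""
    hit = block in cache
    if hit:
        cache.move_to_end(block)
    else:
        cache[block] = None
        if len(cache) > CACHE_BLOCKS:
            cache.popitem(last=False)          # evict least recently used
    return hit

hot_blocks = list(range(50))                   # the 50-block hot spot
hot_refs = hot_hits = 0
for i in range(500_000):
    if i % 50 == 0:                            # 2% of reads touch the hot spot
        hot_refs += 1
        hot_hits += access(random.choice(hot_blocks))
    else:                                      # the rest is random traffic
        access(random.randrange(10_000_000))

print(f"hot-spot hit rate: {hot_hits / hot_refs:.0%}")   # well below 100%
```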
One common way to overcome this, which is both costly and inefficient, is to simply
increase the amount of cache memory in the hope that having a sufficiently large
amount of cache will allow the hot spot data to be retained in cache. The end
result of this thinking can be tens of GB of cache memory required just to ensure
cache hits on tens of MB of hot data. A second approach is to disable read cache
on the LUNs that have no cache hits, thereby preventing their data from polluting
cache. Although this will certainly help the hot spot data to remain in cache, it
suffers from the disadvantage that it requires manual monitoring and intervention.
Not only that, if the workload to a LUN that has read-caching disabled suddenly
develops hot spots in the data, the potential for improved performance via caching
is lost, since cache has been disabled on that LUN.
The EVA answer:
Rather than having large amounts of cache or requiring continuous manual
monitoring, the EVA has an advanced algorithm that can handle situations as
described in the previous paragraphs with a minimal amount of cache memory.
Within the EVA, there is code that monitors cache hits, and if the hit rate for a LUN
drops below a certain level, it will automatically disable read cache for that LUN. As
a result, the random access stream will not be read into cache, and will not
displace other data. The EVA will continue to monitor accesses for that LUN, and if
the access pattern is such that cache would help, the EVA will re-enable read
cache for that LUN. This dynamic enabling and disabling of read cache based on
changing I/O workloads is one of the many reasons the EVA doesn't need large
amounts of read cache.
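The control loop can be summarized in a few lines. The thresholds, window size, and
the recently-referenced-block bookkeeping below are assumptions made for this
sketch; HP has not published the EVA’s actual parameters or decision rule.

```python
from collections import OrderedDict

# Illustrative per-LUN read-cache control loop. All constants are made-up
# values chosen for the sketch, not the EVA's real parameters.
DISABLE_BELOW = 0.02    # turn read caching off below this re-reference rate
ENABLE_ABOVE = 0.10     # turn it back on above this rate
WINDOW = 10_000         # reads per decision window
HISTORY = 50_000        # recently referenced blocks remembered per LUN

class LunReadCachePolicy:
    """Decides, per LUN, whether random reads should be staged into cache."""

    def __init__(self) -> None:
        self.caching_enabled = True
        self.recent = OrderedDict()       # block -> None, in LRU order
        self.rereferences = 0
        self.reads = 0

    def on_read(self, block: int) -> bool:
        # Track re-references even while caching is off, so a LUN that later
        # develops a hot spot can win its read cache back automatically.
        if block in self.recent:
            self.rereferences += 1
            self.recent.move_to_end(block)
        else:
            self.recent[block] = None
            if len(self.recent) > HISTORY:
                self.recent.popitem(last=False)

        self.reads += 1
        if self.reads >= WINDOW:
            rate = self.rereferences / self.reads
            if self.caching_enabled and rate < DISABLE_BELOW:
                self.caching_enabled = False    # random stream: keep out of cache
            elif not self.caching_enabled and rate >= ENABLE_ABOVE:
                self.caching_enabled = True     # hot spot detected: cache again
            self.reads = self.rereferences = 0
        return self.caching_enabled
```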
Random writes
Write cache is used primarily as a speed-matching buffer. For random writes,
completion is signaled to the host as soon as the data is in cache, resulting in very
low latency. As the cache buffers start to fill, the EVA will initiate a background
flush operation, writing that data to disk.
Interestingly enough, the size of write cache has very little impact on random
workloads beyond a few hundred MB. As long as the incoming host-write-rate is
lower than the rate at which cache flushes to disk, the cache will never fill, and the
application will always see microsecond write response times. If, however, the host
write rate is greater than the rate at which the data can be flushed to disk, then
cache will eventually fill – no matter how large the cache is – and the host write
rate will be reduced to that of the array's backend disk performance.
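The arithmetic behind that statement is straightforward; all of the numbers in the
sketch below are arbitrary examples rather than EVA specifications.

```python
# Illustrative: how long a write cache can absorb a burst when the host
# writes faster than the backend can flush. All numbers are examples.
def seconds_until_cache_full(cache_mb: float,
                             host_write_mb_s: float,
                             flush_mb_s: float) -> float:
    """Time until the write cache fills; infinite if flushing keeps up."""
    surplus = host_write_mb_s - flush_mb_s
    return float("inf") if surplus <= 0 else cache_mb / surplus

# 512 MB of write cache, host writing 200 MB/s, backend flushing 150 MB/s:
print(seconds_until_cache_full(512, 200, 150))   # ~10.2 s, then disk-speed writes
# The same burst against a faster backend never fills the cache:
print(seconds_until_cache_full(512, 200, 250))   # inf
```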
The EVA answer:
Knowing all this, it makes sense that if the cache destaging algorithms are well
designed, then the amount of write cache only needs to be large enough to
accommodate transient bursts of write I/O activity. The secret to making do with a
relatively small amount of write cache is to have a very fast backend and to have
caching algorithms that quickly move random write data out of the cache to disk.
Fortunately, the EVA has both.
Sequential reads
Sequential reads are quite common with many applications and, from an array
perspective, are characterized by two issues. First: sequential reads are usually
mixed with the I/O streams from other applications. Second: many applications
issue sequential I/O requests with very small transfer sizes. These two issues can
lead to two problems. The first problem is that when IOs are mixed with those of
other applications an array can find it difficult to identify a sequential read pattern.
The second problem is that small transfer sizes oftentimes lead to an extremely
inefficient use of array bandwidth. A simplistic way of handling both these issues is
to simply fetch large amounts of extra data with every read request. The thinking
behind this is that by reading in more data than is originally requested, there is a
chance that the next requested sequential read data will already be in cache.
The fallacy with this thinking is twofold. First, only a very small percentage of
requests are sequential in nature, so the additional time taken to read in the “extra”
data significantly reduces performance on all accesses, sequential or not. Second,
this unneeded data is stored in cache, once again causing other data to be purged
from cache.
The EVA answer:
The EVA uses a much more advanced (and patented) algorithm for pre-fetching data. The EVA continuously monitors the I/O stream, searching for
sequential access patterns. A sequential I/O stream can be detected even if there
is intervening, non-sequential I/O. As an example, if a request stream requested
logical blocks 100, 2367, 17621, 48, 101, 17, 2, 15, and 102, the EVA would
recognize that within the seemingly random requests there is a single sequential
request stream present (blocks 100, 101, and 102). Since the EVA was designed
to handle large numbers of LUNs, the pre-fetch algorithm has been designed to
recognize up to 512 simultaneous sequential streams for different areas of different
LUNs. Because of this, the EVA can not only handle simultaneous I/Os from
disparate applications in a heterogeneous operating system environment, but can
also initiate parallel pre-fetch operations on many different LUNs as required.
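The block-number example above can be reproduced with a very simple detector. The
sketch below only illustrates the general idea of tracking many candidate streams at
once; it is not the EVA’s patented implementation, and the three-in-a-row threshold is
an invented assumption.

```python
# Illustrative sequential-stream detector. The EVA tracks up to 512 streams
# (per the text); everything else here is an assumption for the sketch.
MAX_STREAMS = 512          # candidate streams tracked at once
SEQUENTIAL_AFTER = 3       # consecutive blocks before a stream is declared

streams = {}               # expected next block -> length of the run so far

def observe(block: int) -> bool:
    """Return True when this request extends a detected sequential stream."""
    run = streams.pop(block, 0) + 1        # did any stream expect this block?
    if run > 1 or len(streams) < MAX_STREAMS:
        streams[block + 1] = run           # now expect the following block
    return run >= SEQUENTIAL_AFTER

for lba in (100, 2367, 17621, 48, 101, 17, 2, 15, 102):
    if observe(lba):
        print(f"sequential stream detected at block {lba}; start pre-fetching")
# -> sequential stream detected at block 102; start pre-fetching
```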
Once a sequential stream has been detected and the originally requested data
returned to the host, the EVA will request additional data from the LUN. By waiting
until the original data has been sent to the host before pre-fetching more data, no
additional latencies are imposed on the host I/O. If the EVA pre-fetches the data
before the host requests it, the algorithm concludes that the pre-fetch is working
fast enough to keep up with the host, and the process continues (another pre-fetch
after the host asks for the data that is already in cache). If, on the other hand, the
host requests the data before the EVA has pre-fetched it, it is an indication that
either the EVA is too slow, or the host is too fast. To handle this, the EVA will
increase the size of the pre-fetch and continue. As the sequential read from the
host proceeds, the EVA will dynamically increase the size of the pre-fetch as
needed to stay ahead of the host requests.
After returning the data to the host, the EVA will purge the data from cache, since
sequentially accessed data is rarely, if ever, requested again. Because this data is
immediately purged from cache, the pre-fetched data essentially takes up very little
room even if large amounts of data are pre-fetched. As a result, a host can
sequentially read an entire LUN, get 100% cache hits from the EVA, and use only a
few hundred KB of cache. The result of this algorithm is that there is no penalty for
randomly accessed data (no pre-fetch is triggered); pre-fetch occurs only when
needed and only at the size that is needed (the size is dynamically adjusted); and
minimal cache will be used (pre-fetch data is flushed after being returned to the
host). Thus, the EVA’s relatively small amount of read cache is more than sufficient
to sustain high performance simultaneous random and sequential read workloads.
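A toy model captures the grow-on-miss and purge-on-return behavior described above.
The initial and maximum pre-fetch sizes and the 8 KB block granularity are
assumptions made for the sake of the sketch, not EVA parameters.

```python
# Toy model of adaptive pre-fetch for a single sequential stream: grow the
# pre-fetch size when the host outruns the pre-fetcher, and drop pre-fetched
# data from cache as soon as it is returned. Sizes are illustrative only.
INITIAL_PREFETCH_KB = 64
MAX_PREFETCH_KB = 1_024
BLOCK_KB = 8

class PrefetchController:
    def __init__(self) -> None:
        self.prefetch_blocks = INITIAL_PREFETCH_KB // BLOCK_KB
        self.fetched_up_to = -1        # highest block already pre-fetched

    def host_read(self, block: int) -> str:
        if block <= self.fetched_up_to:
            # Pre-fetch kept ahead of the host: cache hit; the block can be
            # purged after return because sequential data is rarely reread.
            result = "hit (purged after return)"
        else:
            # Host got ahead of the pre-fetcher: double the pre-fetch size.
            self.prefetch_blocks = min(self.prefetch_blocks * 2,
                                       MAX_PREFETCH_KB // BLOCK_KB)
            result = "miss; pre-fetch size doubled"
        # The next pre-fetch is issued only after this request's data has
        # been returned, so it adds no latency to the host I/O.
        self.fetched_up_to = max(self.fetched_up_to,
                                 block + self.prefetch_blocks)
        return result

ctl = PrefetchController()
results = [ctl.host_read(lba) for lba in range(40)]   # steady sequential read
print(results.count("hit (purged after return)"), "hits out of", len(results))
# -> 39 hits out of 40 (only the very first read misses)
```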
Sequential writes
Sequential write streams are typically associated with log files, and are usually one
of the more critical I/O streams from a performance standpoint. These writes
usually involve small transfer sizes, such as 8 or 16 KB, and as such, are very
inefficient from the standpoint of the array’s internal bandwidth. Furthermore, they
are quite often associated with database “checkpoint” or “commit” operations, so
other application I/Os are often held up waiting for these writes to complete.
The EVA answer:
In addition to detecting sequential read streams, the EVA will also detect sequential
write streams. Multiple sequential write requests will be aggregated in cache and,
when the time comes to flush the data, will be sent to the disk drives as a single, large
request. A clear advantage of this approach is that a much higher data rate can be
obtained at the disk level, increasing the efficiency of the transfer. In a nutshell,
when the size of the write data reaches what is known as a “strip”, the EVA will
flush this data to the disks. A “strip” is 512 KB in size.
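A coalescing buffer of this kind can be sketched in a few lines. The 512 KB strip size
comes from the text; everything else (the 8 KB log writes, the flush callback, the
omission of contiguity checks) is illustrative.

```python
# Illustrative write coalescer: small sequential writes are acknowledged as
# soon as they are in cache and accumulated until a full 512 KB strip has
# been gathered, then flushed to disk as one large request.
# (Contiguity checking is omitted for brevity.)
STRIP_KB = 512

class SequentialWriteCoalescer:
    def __init__(self, flush):
        self.flush = flush          # callable(start_kb, length_kb)
        self.start_kb = None        # offset where the current strip begins
        self.length_kb = 0

    def write(self, offset_kb: int, size_kb: int) -> None:
        # Completion would be signaled to the host here (data is in cache).
        if self.start_kb is None:
            self.start_kb = offset_kb
        self.length_kb += size_kb
        if self.length_kb >= STRIP_KB:
            self.flush(self.start_kb, self.length_kb)   # one big backend write
            self.start_kb, self.length_kb = None, 0

coalescer = SequentialWriteCoalescer(
    flush=lambda start, length: print(f"flush {length} KB strip at offset {start} KB"))
for i in range(64):                 # 64 sequential 8 KB log writes = 512 KB
    coalescer.write(offset_kb=i * 8, size_kb=8)
# -> flush 512 KB strip at offset 0 KB
```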
As with reads, the EVA’s pre-allocated write cache is more than sufficient to
sustain high IO performance simultaneously for both random and sequential write
workloads.
Summary
Intelligent caching algorithms can take a relatively small amount of cache and
make it perform like a much larger cache. This is the secret of the EVA cache
design. There is no question that the EVA is a very fast array. The SPC numbers
prove it. What isn’t as widely known is how the EVA can achieve these numbers
with only 2GB of cache. The secret is in the caching algorithms.
Storage Competitive Team
© 2004 Hewlett-Packard Development Company, L.P. The
information contained herein is subject to change without
notice. The only warranties for HP products and services are
set forth in the express warranty statements accompanying
such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall
not be liable for technical or editorial errors or omissions
contained herein.
5982-6623EN, Rev. 1 09/2004