EVA Cache – Why so Efficient?
October 5, 2004

Contents
  Introduction
  Storage Performance Council Benchmark
  Cache functions
  Random reads
  Random writes
  Sequential reads
  Sequential writes
  Summary

“It’s not the size of the dog in the fight; it’s the size of the fight in the dog.”
Archie Griffin (5’9” two-time Heisman winner)

Introduction

Modern arrays are a “system” of many dependent pieces and parts. One of those dependent pieces is cache memory. Today almost all disk arrays employ cache memory to increase system performance by hiding, or at least minimizing, the relatively slow mechanical performance of disk drives. Some array designs require more cache than others, and some arrays have a faster backend than others. In any case, one principle needs to be understood: it is not the size of the cache or the speed of the backend that matters, but the performance of the overall system. Suffice it to say that overall array performance depends on the multiple pieces and parts working together, not on the size or speed of any one component.

Today, the amount of cache offered in mid-range arrays varies widely. For example, the HP StorageWorks EVA and the IBM FAStT900 (now the DS4500) both offer 2GB of cache, while the EMC CLARiiON CX700 offers 8GB. Whether or not an array has the proper amount of cache is best determined by running an industry standard performance benchmark and seeing how the system functions as a whole. An out-of-balance system is easily exposed.

The operating system environment also plays a part in determining the optimum amount of cache. Mainframe IO applications are typically storage cache-hit intensive and will generally benefit from larger caches in the storage subsystem. In mainframe environments storage cache hit rates of 90% are not unusual, and this makes cache in these situations a good investment. On the other hand, open system applications typically cache IO requests in server memory, resulting in storage cache hit rates of only 20-40%, less than half that of mainframes. As a proof point, the industry-recognized Storage Performance Council open systems benchmark is based on a cache hit rate of 20%.
As such, informed data systems managers tend to view backend disk performance, not cache size or cache performance, as the best overall predictor of mid-range open system application performance.

Another point to consider is the array itself and its assortment of features and capabilities. For example, the new XP12000 from HP holds 1152 disk drives, has 128GB of cache, services thousands of servers, heavily utilizes the cache when doing remote mirroring, and also has a Cache-LUN software product that allows entire LUNs to be held in cache. Obviously, with this range of features, a large cache is important even in open systems.

Despite all these conditions, it is still a frequent question how the EVA can perform so well with only 2GB of cache memory. The answer is that the EVA’s HSV110 controller was designed exclusively by HP, a systems vendor with an intimate knowledge of how servers, storage, applications, and operating systems interact. No company knows more about open systems than HP. In the EVA, the cache size and advanced caching algorithms have been optimized for open systems environments. In fact, according to the Storage Performance Council (SPC) benchmark, the EVA with its 2GB of cache compares very favorably in random performance with a Shark 800 Turbo configured with 16GB of cache, 6 heavy-duty processors, and the compute power of dual RS/6000 servers. This is the same Shark IBM uses in its most performance-intensive mainframe and open systems environments.

Bottom line: cache is necessary in an array. However, because of the high cost of array cache and the relatively low cache hit rates in open systems environments, customers should look at more than just the size of the cache when trying to estimate array performance – well-designed cache algorithms can have a dramatic impact on cache efficiency and cause the cache to perform out of all proportion to its size. The remainder of this paper describes how the EVA cache algorithms maximize the effectiveness of the EVA cache.

Storage Performance Council Benchmark

The purpose of this paper is to explain why the EVA cache is so efficient. It is not intended to prove that the EVA is one of the fastest arrays in the world. We believe that argument has already been settled by the EVA’s Storage Performance Council Benchmark result. Below are some results from the SPC website (http://www.storageperformance.org/results). Notice how close the EVA performs to the IBM Shark 800 Turbo and how it exceeds the performance of the FAStT900 (now called the DS4500).

  COMPANY   PRODUCT                        SPC IOPS
  IBM       Shark 800 Turbo (16GB cache)   22,999
  HP        EVA5000                        20,096
  IBM       FAStT900 (DS4500)              18,447
  Sun       StorEdge 6920 (8 tray)         19,496
  EMC       CLARiiON CX                    Did not participate

Cache functions

The primary purpose of controller cache is to mask the relatively long service times associated with the mechanical nature of the actuator head movement across the face of the disk platter. Today’s disk drives have an average access time measured in milliseconds, while cache access times for high-performance controllers are typically less than 200 microseconds. Since a disk access can take 30 to 40 times longer than a cache access, efficient cache algorithms can have a dramatic effect on overall storage performance. Controller cache algorithms can be divided into four categories based on I/O workloads: random reads, random writes, sequential reads, and sequential writes.
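To see what the hit rate does to average service time, the short sketch below applies the ratio described above. The 6 ms disk access and 0.2 ms cache access figures are illustrative round numbers chosen to match the 30-40x ratio, not measured values for the EVA or any other array.

# Illustrative only: average service time as a function of cache hit rate.
# 6 ms per disk access and 0.2 ms per cache access are round example numbers
# (roughly the 30-40x ratio mentioned above), not measured values for any array.

DISK_MS = 6.0
CACHE_MS = 0.2

def avg_service_time_ms(hit_rate: float) -> float:
    return hit_rate * CACHE_MS + (1.0 - hit_rate) * DISK_MS

for hit_rate in (0.20, 0.40, 0.90):   # open systems ~20-40%, mainframe ~90%
    print(f"hit rate {hit_rate:.0%}: {avg_service_time_ms(hit_rate):.2f} ms average")

# At a 20% hit rate the average (about 4.8 ms) is still dominated by the disks,
# which is why backend performance and smart algorithms matter more than cache size.

Even at a 90% hit rate the disks still account for most of the average service time, which is why the algorithms that decide what to cache, prefetch, and flush matter far more than raw cache capacity.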
The amount of cache necessary for good performance in any array depends partially on the efficiency of the algorithms specific to each of these areas. The rest of this paper delves into the nature of those individual I/O workloads and into each workload’s specific EVA algorithm.

Random reads

Read cache is used to cut down on disk accesses that repeatedly read the same data. When a data block is accessed, the block is placed in cache on the assumption (or hope) that it will be accessed again. If the host does access that data again, then that block can be returned directly from cache, thereby avoiding a much slower disk access. Although this seems like a foolproof approach, there are a few problems when applying this theory to the real world:

1. In open systems random access environments there is a low probability of a cache hit. For example, if an application randomly accessed all the data held in 1TB and had 8GB of cache (4GB of read cache and 4GB of write), the probability of a read cache hit would be 4/1000, or less than half a percent. In a 10TB solution the hit rate would drop to roughly 0.04%. (The short calculation following this discussion works through these numbers.) In other words, given the high cost of storage cache memory, the cache would return almost no “bang for the buck”. Fortunately, most open systems environments are not totally random, and cache hit ratios tend to be in the 20% range. Nevertheless, the principle still holds true – open systems performance is mostly dependent on backend disk performance and on the caching algorithms, not on cache size. Note: most interactive database applications are random access.

2. Modern databases have their own internal caches (the SGA in Oracle, for example). Since the database knows best what data should be cached, any data that is accessed repeatedly is kept within the host memory (the SGA), and the storage will never see the I/O.

In spite of these issues, there are times when the storage cache is definitely beneficial. The most common case is when there is a “hot spot” on the disk. This may be due to a specific file, or portion thereof, that is repeatedly accessed but is not cached by the application or operating system. A good example would be the index area of a database. Although this access pattern does benefit from a random access read cache, in most real-world environments there is an added complication: in addition to the I/O that repeatedly hits a “hot spot” on a disk, there are usually other ongoing I/O operations on other disks that are completely random. If these other accesses occur at a high rate and are read into cache, it is likely that they will eventually displace the data from the hot spot area. The result will be cache misses for the hot spot data.

One common way to overcome this, which is both costly and inefficient, is to simply increase the amount of cache memory in the hope that a sufficiently large cache will allow the hot spot data to be retained. The end result of this thinking can be tens of GB of cache memory required just to ensure cache hits on tens of MB of hot data. A second approach is to disable read cache on the LUNs that have no cache hits, thereby preventing their data from polluting cache. Although this will certainly help the hot spot data to remain in cache, it suffers from the disadvantage that it requires manual monitoring and intervention. Not only that, if the workload to a LUN that has read-caching disabled suddenly develops hot spots in the data, the potential for improved performance via caching is lost, since cache has been disabled on that LUN.
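As a rough check of the figures in the first problem above, the sketch below computes the expected hit rate when access is purely random and uniformly distributed across the data set. The uniform-access assumption and the helper name are illustrative; the 4GB read cache and the 1TB and 10TB capacities are the figures used in the text.

# Illustrative only: expected read-cache hit rate under uniformly random access.
# If every block is equally likely to be requested, the chance that a requested
# block is already cached is simply (read cache size) / (data set size).

def random_hit_rate(read_cache_gb: float, data_set_tb: float) -> float:
    data_set_gb = data_set_tb * 1024      # 1 TB = 1024 GB
    return read_cache_gb / data_set_gb

for capacity_tb in (1, 10):
    rate = random_hit_rate(read_cache_gb=4, data_set_tb=capacity_tb)
    print(f"{capacity_tb:>2} TB data set: expected hit rate ~ {rate:.2%}")

# Prints roughly 0.39% for 1 TB and 0.04% for 10 TB - almost no "bang for the buck",
# no matter how much the cache costs.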
The EVA answer: Rather than relying on large amounts of cache or requiring continuous manual monitoring, the EVA has an advanced algorithm that handles the situations described in the previous paragraphs with a minimal amount of cache memory. Within the EVA, there is code that monitors cache hits, and if the hit rate for a LUN drops below a certain level, it will automatically disable read cache for that LUN. As a result, the random access stream will not be read into cache and will not displace other data. The EVA continues to monitor accesses for that LUN, and if the access pattern changes such that cache would help, the EVA will re-enable read cache for that LUN. This dynamic enabling and disabling of read cache based on changing I/O workloads is one of the many reasons the EVA does not need large amounts of read cache.

Random writes

Write cache is used primarily as a speed-matching buffer. For random writes, completion is signaled to the host as soon as the data is in cache, resulting in very low latency. As the cache buffers start to fill, the EVA initiates a background flush operation, writing that data to disk. Interestingly enough, the size of write cache has very little impact on random workloads beyond a few hundred MB. As long as the incoming host write rate is lower than the rate at which cache flushes to disk, the cache will never fill, and the application will always see microsecond write response times. If, however, the host write rate is greater than the rate at which the data can be flushed to disk, then cache will eventually fill – no matter how large the cache is – and the host write rate will be reduced to that of the array’s backend disk performance.

The EVA answer: Knowing all this, it makes sense that if the cache destaging algorithms are well designed, then the amount of write cache only needs to be large enough to accommodate transient bursts of write I/O activity. The secret to making do with a relatively small amount of write cache is to have a very fast backend and caching algorithms that quickly move random write data out of the cache to disk. Fortunately, the EVA has both.
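The sketch below illustrates this sizing argument. The burst rate, flush rate, and burst duration are made-up example numbers, not EVA specifications; the point is that the write cache only has to absorb the excess of a burst over the backend flush rate.

# Illustrative only: how much write cache a transient burst actually needs.
# All rates and durations below are made-up example numbers, not EVA specifications.

def cache_needed_mb(burst_rate_mb_s: float, flush_rate_mb_s: float,
                    burst_seconds: float) -> float:
    excess = max(burst_rate_mb_s - flush_rate_mb_s, 0.0)  # data arriving faster than it drains
    return excess * burst_seconds

# Example: a 10-second burst of random writes at 300 MB/s against a backend that can
# destage 250 MB/s needs only (300 - 250) * 10 = 500 MB of write cache.
print(cache_needed_mb(300, 250, 10), "MB of write cache absorbs the burst")

# If the host write rate stays above the flush rate indefinitely, no cache size is
# large enough: the cache fills and throughput falls back to backend disk speed.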
Sequential reads

Sequential reads are quite common with many applications and, from an array perspective, are characterized by two issues. First, sequential reads are usually mixed with the I/O streams from other applications. Second, many applications issue sequential I/O requests with very small transfer sizes. These two issues can lead to two problems. The first problem is that when IOs are mixed with those of other applications, an array can find it difficult to identify a sequential read pattern. The second problem is that small transfer sizes often lead to an extremely inefficient use of array bandwidth. A simplistic way of handling both issues is to simply fetch large amounts of extra data with every read request. The thinking behind this is that by reading in more data than is originally requested, there is a chance that the next requested sequential read data will already be in cache. The fallacy with this thinking is twofold. First, only a very small percentage of requests are sequential in nature, so the additional time taken to read in the “extra” data significantly reduces performance on all accesses, sequential or not. Second, this unneeded data is stored in cache, once again causing other data to be purged from cache.

The EVA answer: A much more advanced (and patented) algorithm is used by the EVA for prefetching data. The EVA continuously monitors the I/O stream, searching for sequential access patterns. A sequential I/O stream can be detected even if there is intervening, non-sequential I/O. As an example, if a request stream asked for logical blocks 100, 2367, 17621, 48, 101, 17, 2, 15, and 102, the EVA would recognize that within the seemingly random requests there is a single sequential request stream present (blocks 100, 101, and 102). Since the EVA was designed to handle large numbers of LUNs, the pre-fetch algorithm has been designed to recognize up to 512 simultaneous sequential streams for different areas of different LUNs. Because of this, the EVA can not only handle simultaneous I/Os from disparate applications in a heterogeneous operating system environment, but can also initiate parallel pre-fetch operations on many different LUNs as required.

Once a sequential stream has been detected and the originally requested data returned to the host, the EVA requests additional data from the LUN. By waiting until the original data has been sent to the host before pre-fetching more data, no additional latency is imposed on the host I/O. If the EVA pre-fetches the data before the host requests it, the algorithm concludes that the pre-fetch is keeping up with the host, and the process continues (another pre-fetch after the host asks for the data that is already in cache). If, on the other hand, the host requests the data before the EVA has pre-fetched it, it is an indication that either the EVA is too slow or the host is too fast. To handle this, the EVA will increase the size of the pre-fetch and continue. As the sequential read from the host proceeds, the EVA dynamically increases the size of the pre-fetch as needed to stay ahead of the host requests. After returning the data to the host, the EVA purges the data from cache, since sequentially accessed data is rarely, if ever, requested again. Because this data is immediately purged, the pre-fetched data takes up very little room even when large amounts of data are pre-fetched. As a result, a host can sequentially read an entire LUN, get 100% cache hits from the EVA, and use only a few hundred KB of cache.

The result of this algorithm is that there is no penalty for randomly accessed data (no pre-fetch is triggered); pre-fetch occurs only when needed and only at the size that is needed (the size is dynamically adjusted); and minimal cache is used (pre-fetch data is flushed after being returned to the host). Thus, the EVA’s relatively small amount of read cache is more than sufficient to sustain high-performance simultaneous random and sequential read workloads.
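The toy sketch below illustrates the two ideas described above: recognizing a sequential stream inside interleaved traffic, and growing the pre-fetch when the host outruns it. It is a simplified model for illustration only, not the EVA’s patented algorithm; the stream table, pre-fetch sizes, and function names are invented, and details such as purging pre-fetched data after use are not modeled.

# Illustrative only: detect sequential streams inside interleaved reads and grow the
# prefetch when the host gets ahead of it. A toy model of the idea, not the EVA's
# patented algorithm; all sizes, limits, and names here are made up.

MAX_STREAMS = 512        # the text says up to 512 simultaneous streams are tracked
INITIAL_PREFETCH = 2     # blocks to prefetch when a stream is first recognized

streams = {}             # last block seen in a stream -> current prefetch size (blocks)

def on_read(block: int, was_cache_hit: bool) -> list[int]:
    """Return the blocks to prefetch in response to this host read."""
    if block - 1 in streams:                  # continues a known sequential stream
        size = streams.pop(block - 1)
        if not was_cache_hit:                 # host got ahead of the prefetch,
            size *= 2                         # so grow the prefetch to keep up
        streams[block] = size
        return list(range(block + 1, block + 1 + size))
    if len(streams) < MAX_STREAMS:            # remember this block as a possible stream start
        streams[block] = INITIAL_PREFETCH
    return []                                 # random access: prefetch nothing

# The request stream from the text: only blocks 100, 101, 102 form a sequential run.
cache = set()
for blk in (100, 2367, 17621, 48, 101, 17, 2, 15, 102):
    hit = blk in cache
    prefetched = on_read(blk, was_cache_hit=hit)
    cache.update(prefetched)
    if prefetched:
        print(f"block {blk} ({'hit' if hit else 'miss'}): prefetch {prefetched}")

In this run the random blocks trigger no pre-fetch at all; block 101 is a miss, so the pre-fetch size doubles, and block 102 is then served from cache while the next pre-fetch is issued at the same size.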
Sequential writes

Sequential write streams are typically associated with log files and are usually among the more critical I/O streams from a performance standpoint. These writes usually involve small transfer sizes, such as 8 or 16 KB, and as such are very inefficient from the standpoint of the array’s internal bandwidth. Furthermore, they are quite often associated with database “checkpoint” or “commit” operations, so other application I/Os are often held up waiting for these writes to complete.

The EVA answer: In addition to detecting sequential read streams, the EVA also detects sequential write streams. Multiple sequential write requests are aggregated in cache and, when the time comes to flush the data, are sent to the disk drives as a single, large request. A clear advantage of this approach is that a much higher data rate can be obtained at the disk level, increasing the efficiency of the transfer. In a nutshell, when the size of the write data reaches what is known as a “strip”, which is 512 KB, the EVA flushes this data to the disks. As with reads, the EVA’s pre-allocated write cache is more than sufficient to sustain high IO performance simultaneously for both random and sequential write workloads.

Summary

Intelligent caching algorithms can take a relatively small amount of cache and make it perform like a much larger cache. This is the secret of the EVA cache design. There is no question that the EVA is a very fast array; the SPC numbers prove it. What isn’t as widely known is how the EVA achieves these numbers with only 2GB of cache. The secret is in the caching algorithms.

Storage Competitive Team

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

5982-6623EN, Rev. 1 09/2004