Hybrid Cache - A Synonym Free Cache for Parallel TLB Accessing

Tung-Chi Wu
Department of Computer Science and Engineering, National Chung Hsing University
phd9704@csmail.nchu.edu.tw

Yen-Jen Chang
Department of Computer Science and Engineering, National Chung Hsing University
ychang@cs.nchu.edu.tw

Abstract

The virtual cache is a future trend because of its faster access time, but it suffers from the synonym problem, which is the reason it is still not in widespread use. In this paper, we propose a microarchitecture, called the hybrid cache. We relieve the hybrid cache of the synonym problem by using the frame table information of the operating system to reflect the mapping state between virtual pages and physical pages. In our proposed microarchitecture, a minimal hardware addition (in the cache controller) guards every memory reference against the synonym problem, so the TLB access is removed from the critical path entirely. We also develop a performance evaluation model to estimate our hybrid cache. The improvement depends on the occurrence frequency of the synonyms, and the results show that the improvement rate of the hybrid cache is 1.8/(1+P), where P is the occurrence frequency of the synonyms.

Keywords: Virtual Cache, Synonym Problem, Hybrid Cache, Frame Table, TLB, and Critical Path.

I. Introduction

We know that virtual memory is essential for current computers. It is used not only to share the limited physical memory but also to relieve the programmer from managing the memory hierarchy. Since the cache can also effectively reduce the speed gap between the processor and main memory, we usually use it to boost system performance. As a result, how to integrate the cache with virtual memory is a very important issue in achieving a high-performance system. Because of virtual memory, the addresses produced by the processor, i.e., virtual addresses, must be translated into physical addresses before accessing main memory. In most architectures, the address translation is implemented by the MMU (Memory Management Unit), which maps virtual addresses to physical addresses by looking up the TLB (Translation Look-aside Buffer) and executes the protection mechanism, as shown in Figure 1.

[Figure 1. The architecture of the MMU. The virtual page number is translated by the TLB and concatenated with the page offset to form the physical address sent to the memory system.]

There are two alternative points in time for performing the address translation in most architectures: one is before accessing the cache, and the other is before accessing main memory. Consequently, the cache has two index modes, virtually indexed and physically indexed, depending on whether the address translation is performed before or after the cache access. If we use the virtual addresses to access the cache directly, the cache is called a virtual cache. With a virtual cache, the cache access and the address translation are performed in parallel. In contrast, if we use the physical addresses produced by the MMU to access the cache, the cache is a physical cache [1], and the address translation and the cache access are performed serially. It is very clear that the address translation is on the critical path when we choose the physical cache. As the processor gets faster, the time spent on address translation becomes a performance bottleneck; in particular, an inefficient TLB incurs a serious degradation in system performance. From the aspect of fast cache access, the virtual cache is attractive, since it can be accessed directly without any translation procedure. The virtual cache thus seems superior to the physical cache, but there are a number of reasons why the physical cache is widely used today. The straightforward hardware is a major advantage of the physical cache, and it is simple to implement. Conversely, the most serious drawback of the virtual cache is the synonym problem, i.e., two different virtual addresses mapped to the same physical address [1]. The mechanisms for synonym handling complicate the cache access and the hardware implementation; if the synonym handling is done poorly, the hidden cost of the virtual cache can exceed the benefit of avoiding address translation.

Whatever kind of cache is used, a large number of techniques have been developed to improve the performance of virtual memory integrated with the cache system; these improvements are described in the following sections. On the premise that the TLB access will become a performance bottleneck, we believe the trend is toward the parallel execution of cache access and TLB access, if the synonym problem can be solved at reasonable cost. In this paper, we propose a microarchitecture, called the hybrid cache, to avoid the synonym problem. By integrating software and hardware, the hybrid cache reduces the access time by not waiting for the generation of physical addresses, and it simplifies the handling of the synonym problem.

The rest of this paper is organized as follows. Section 2 introduces the related work on the synonym problem. In Section 3, we develop the hybrid cache architecture, in which the operating system support and the cache architecture are described in detail. In Section 4, we evaluate the performance of the hybrid cache and compare it to a traditional physical cache. Section 5 offers some brief concluding comments.

II. The synonym problem related work

A cache, which holds the most recently used data or instructions, can speed up memory references in processors. During a cache access, the reference address is used to index the cache and is then compared with the tags to determine whether the data is available or not.
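To make the synonym problem concrete, the following sketch (our own illustration, not from the paper; the page size, cache geometry, and page-table contents are assumed) models a tiny virtually indexed, virtually tagged cache in which two virtual pages map to the same physical frame. A reference through the second virtual name misses even though the physical block is already cached — the false miss that synonym handling must catch.

```python
# Minimal illustration (not from the paper) of the synonym problem in a
# virtually indexed, virtually tagged (V/V) cache.

PAGE = 4096
NUM_SETS = 8        # tiny direct-mapped cache, one word per block
memory = {}         # physical memory: PA -> value

# Hypothetical page table: virtual pages 0x10 and 0x20 are synonyms of frame 5.
page_table = {0x10: 5, 0x20: 5}

def translate(va):
    return page_table[va // PAGE] * PAGE + va % PAGE

cache = {}          # set index -> (virtual tag, value)

def vv_read(va):
    """Index and tag with the virtual address; fill from memory on a miss."""
    idx, tag = va % NUM_SETS, va // NUM_SETS
    if idx in cache and cache[idx][0] == tag:
        return ("hit", cache[idx][1])
    value = memory.get(translate(va), 0)
    cache[idx] = (tag, value)
    return ("miss", value)

memory[translate(0x10000)] = 42     # data in the shared physical frame
print(vv_read(0x10000))             # ('miss', 42) - cold miss via VPN 0x10
print(vv_read(0x20000))             # ('miss', 42) - FALSE miss: same PA,
                                    # different VA, so the V/V cache cannot
                                    # tell the block is already cached
```

Had the two synonyms selected different sets, the cache would end up holding two copies of the same physical block, and a write through one virtual name would leave the other copy stale.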
According to the index and tag modes, we can categorize caches into four classes, as shown in Table 1. Most machines use a complete physical cache, because a P/P cache is simple to implement, while a virtual cache (V/P or V/V) is not in widespread use because of the synonym problem. For example, the HP Precision used a V/P cache, and the Berkeley SPUR used a complete virtual cache. The P/V cache, in particular, is used only in the MIPS R6000 machine, which relies on an elaborate TLB slice technique [2].

Table 1. The classification of a cache.

                      physically indexed               virtually indexed
  physically tagged   P/P cache (complete physical)    V/P cache
  virtually tagged    P/V cache                        V/V cache (complete virtual)

Once a virtual cache is used, the cache access and TLB access are performed in parallel, so synonym handling becomes the critical issue in improving overall performance. In general, the solutions to the synonym problem can be classified into two categories [3]: software-based techniques and hardware-based techniques. The two approaches have their pros and cons, and we briefly describe them in the rest of this section.

2.1 Software-based solutions to the synonym problem

Without any hardware support, the basic idea of the software-based schemes is to prevent or avoid synonyms. From the aspect of prevention, the prohibition of synonyms by the operating system [4] is the simplest scheme, but it complicates the implementation of the operating system. A second way to prevent synonyms is to enforce that each physical page has only one virtual page number, by sharing at the segment level [5]. A third is to use a single virtual-address-space operating system [6], i.e., no PID. Since all processes share a global virtual address space, no synonyms arise. The disadvantage of this method is that a very large page table must be accessed by hashing the virtual address.
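The hashed lookup that the single-address-space approach depends on can be sketched as follows. This is our own illustration under assumed names and sizes, not an implementation from [6]: one global VPN-to-PPN table with hashed buckets, so a physical frame never acquires a second virtual name by construction.

```python
# Sketch (our illustration) of a hashed global page table for a single
# virtual-address-space OS: one global VPN -> PPN mapping and no PID.

NUM_BUCKETS = 1024                  # assumed bucket count

# Each bucket holds a chain of (vpn, ppn) entries.
buckets = [[] for _ in range(NUM_BUCKETS)]

def pt_insert(vpn, ppn):
    # Refusing a second virtual name for an already-mapped frame would be
    # enforced at allocation time; here we only show the hashed lookup.
    buckets[hash(vpn) % NUM_BUCKETS].append((vpn, ppn))

def pt_lookup(vpn):
    for v, p in buckets[hash(vpn) % NUM_BUCKETS]:
        if v == vpn:
            return p
    return None                     # not mapped: page fault

pt_insert(0x12345, 7)
print(pt_lookup(0x12345))           # 7
print(pt_lookup(0x99999))           # None
```

The hashing keeps the lookup constant-time on average even though the single global table spans the entire virtual address space.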
From the aspect of avoiding synonyms, the simplest solution is to flush the entire virtual cache on a context switch [7], but this is suitable only for a small cache with a low context-switch frequency. Another solution is the alignment scheme [7], in which the operating system must align all the synonyms, that is, map all synonyms to the same cache block. Since a synonym then displaces the previously cached synonym in the same block, the synonym problem cannot arise in the cache. The main disadvantage of this method is that the cost of aligning all the synonyms is too high.

2.2 Hardware-based solutions to the synonym problem

The idea of the hardware-based schemes is to tolerate synonyms, with hardware support to resolve them. These hardware solutions disallow two copies of the same block from being present in the cache at any time, and they effectively simplify the implementation of the operating system. The basic hardware-based approach is to remove synonyms from the cache on a miss without any special hardware: when a cache miss occurs, the cache controller searches all blocks for a synonym. If a synonym exists, the miss is a false miss, and the cache controller must retag the block or simply remove it. This search is efficient in a V/P cache, but not in a V/V cache. As a result, many hardware techniques have been proposed to find the synonym as fast as possible; the process of finding the synonym is called reverse translation. One solution is the R-tag cache [8], which speeds up the reverse translation with a copy of the main cache tags. A second organization for reverse translation uses a small cache that stores only the unaligned synonyms [9].
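The miss-time reverse translation just described can be sketched in software. This is our own simplified model (assumed geometry and page table), loosely in the spirit of the R-tag idea [8] rather than its hardware design: a side copy of each block's physical address lets the controller detect a false miss and retag the block instead of refetching it.

```python
# Sketch (our illustration) of reverse translation on a V/V cache miss.

NUM_SETS = 8
PAGE = 4096
page_table = {0x10: 5, 0x20: 5}     # two virtual synonyms of frame 5
memory = {}

def translate(va):
    return page_table[va // PAGE] * PAGE + va % PAGE

# Main cache keeps virtual tags; the side structure keeps each block's
# physical address so a synonym search is a tag scan, not a TLB scan.
cache = {}   # idx -> (virtual tag, value)
rtags = {}   # idx -> physical address of the cached block

def read(va):
    idx, vtag = va % NUM_SETS, va // NUM_SETS
    if cache.get(idx, (None,))[0] == vtag:
        return ("hit", cache[idx][1])
    pa = translate(va)
    # Reverse translation: is this physical block cached under another name?
    for i, block_pa in list(rtags.items()):
        if block_pa == pa:
            value = cache[i][1]
            del cache[i], rtags[i]                    # drop the synonym copy
            cache[idx], rtags[idx] = (vtag, value), pa  # retag under this VA
            return ("false miss", value)
    value = memory.get(pa, 0)
    cache[idx], rtags[idx] = (vtag, value), pa
    return ("true miss", value)

memory[5 * PAGE] = 7
print(read(0x10000))   # ('true miss', 7)
print(read(0x20000))   # ('false miss', 7) - found without touching memory
```

The retag-on-false-miss policy is what keeps at most one copy of each physical block in the cache at any time.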
Besides an on-chip cache, for a system with a two-level cache, i.e., a first-level virtual cache and a second-level physical cache, the reverse translation can be implemented in the tags of the second-level cache [10].

III. How the hybrid cache works

To avoid synonyms, we must recognize the situation in which a synonym can arise. At the page level, the synonym problem means that two or more virtual page numbers map to the same physical page number; in other words, the mapping between virtual pages and physical pages is multiple-to-one when a synonym arises. By contrast, the mapping is one-to-one when there is no synonym. When virtual memory is implemented in paging mode, the operating system (OS) dynamically allocates physical page frames to processes. To make the best use of the limited physical memory, the OS must maintain a pool of available frames and information about the frames that have already been allocated; all of this page-allocation information is stored in the frame table [11]. In our proposed scheme, if the demanded physical frame is already in use, the OS sets the shared bit of the new page table entry to indicate that the mapping is multiple-to-one. If necessary, the OS also updates the shared bit of the first virtual page that maps to the same physical frame, as shown in Figure 4(b). It is easy to reflect the mapping state during page allocation, because the frame table maps physical pages back to virtual pages, and we just look it up to obtain the mapping information. This mapping state, i.e., one-to-one or multiple-to-one, is the critical feature that gives our scheme the ability to avoid synonyms during cache access.

3.1 TLB architecture

Figure 2 shows a conventional page table entry, which contains four components. Basically, the TLB entry has the same format as the page table entry, since the TLB is a slice of the page table. Besides the virtual page number and the corresponding physical page number, the process identification (PID) indicates that every process has a separate virtual address space.

[Figure 2. Format of a conventional page table entry: PID | Virtual Page Number | Physical Page Number | V | D | Misc.]

Typically, the TLB entry contains several status bits that control access to the page frame. The V (Valid) bit indicates whether the page frame is in main memory, and the D (Dirty) bit reveals whether the page frame has ever been modified. The rest of the control bits, i.e., the misc. field, contain other important information about the page, such as the reference bit, access mode bits, and so on. To reflect the mapping state, we append an S (Shared) bit to the page table entry, as shown in Figure 3.

[Figure 3. Variation of a conventional page table entry: S | PID | Virtual Page Number | Physical Page Number | V | D | Misc.]

A shared bit of 1 indicates that the page is shared with another process, so the synonym problem can arise. On the other side, if the shared bit is 0, there is a one-to-one mapping between the virtual page number and the physical page number, and the synonym problem cannot arise. How the current mapping is reflected is illustrated in Figure 4.

[Figure 4. Mapping state and the alteration of the corresponding page table entry. (a) One-to-one mapping (no synonym case, S = 0). (b)(c) Multiple-to-one mapping (synonym case, S = 1 in every entry that maps to the shared frame). (VPN: virtual page number, PPN: physical page number)]

3.2 Cache architecture

Since the cache can be accessed by virtual addresses or by physical addresses, we call it a hybrid cache, i.e., a hybrid-indexed, hybrid-tagged cache. To avoid unnecessary comparisons, we append one bit to the tag field to indicate whether the tag is a virtual tag or a physical tag. In a virtual access, we use the virtual address to index the cache and then compare it with the virtual tags to determine whether the data is available. In a physical access, we use the physical address to access the cache instead of the virtual address. Depending on whether the current reference is in a shared physical page, the physical access may be bypassed.

In the virtual access, we use the virtual address produced by the CPU to access the cache. Whatever the result, i.e., hit or miss, we must check the shared bit of the TLB output. A shared bit of 0 indicates that the current reference is not in a shared physical page, so no synonym can arise and the result is a true access result. A shared bit of 1 indicates that the current reference is in a shared physical page, and we must then proceed with the physical access, using the physical address produced during the preceding virtual access (the PA from the TLB).

3.3 Cache access flow

[Figure 5. The block diagram of our architecture: the CPU sends the VA to the TLB and the hybrid cache in parallel, and the TLB outputs the PA and the shared bit S to the control logic. (PA: physical address, VA: virtual address)]

Figure 5 shows our proposed cache model. In the block diagram, note that the TLB outputs the shared bit (S) to indicate whether the current memory reference has a synonym. If the shared bit is 1, the PA from the TLB is used in the following physical access. Because of the hybrid cache architecture, the cache access and the TLB lookup are performed in parallel. There are two different cache access stages in our architecture, as shown in Figure 6.

[Figure 6. Two access stages in our cache architecture; the shaded part is the physical access. Virtual access with the VA from the CPU: if S = 0, a hit is a cache hit and a miss is a cache miss; if S = 1, the result is only a possible hit or possible miss, and the cache is accessed again with the PA from the TLB, whose hit or miss is the final result.]
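The two-stage flow above can be sketched in software. This is our own simplified model (assumed page size, a tiny direct-mapped cache, and an invented TLB content), not the authors' hardware: each TLB entry carries the S bit, the virtual access runs first, and only when S = 1 does a physical access with the PA from the TLB follow, using physical tags so all synonyms of a shared frame land on one cached copy.

```python
# Simplified software model (our illustration) of the hybrid cache access
# flow: virtual access first; the TLB's shared bit S decides whether that
# result is final or a physical access with the PA must follow.

PAGE, NUM_SETS = 4096, 8

# TLB: VPN -> (PPN, S). S=1 means the physical frame is shared (synonym case).
tlb = {0x10: (5, 0), 0x20: (7, 1), 0x30: (7, 1)}

# Cache line: index -> (is_virtual_tag, tag, value). The extra tag bit tells
# a virtual tag from a physical tag, as in Section 3.2.
cache = {}
memory = {}

def fill(idx, is_virtual, tag, pa):
    cache[idx] = (is_virtual, tag, memory.get(pa, 0))
    return cache[idx][2]

def access(va):
    vpn, off = va // PAGE, va % PAGE
    ppn, s = tlb[vpn]                       # TLB lookup runs in parallel
    pa = ppn * PAGE + off
    idx, vtag = va % NUM_SETS, va // NUM_SETS
    v_hit = idx in cache and cache[idx][:2] == (True, vtag)
    if s == 0:                              # one-to-one page: result is final
        if v_hit:
            return ("virtual hit", cache[idx][2])
        return ("virtual miss", fill(idx, True, vtag, pa))
    # S = 1: the virtual result is only "possible"; redo with the PA and a
    # physical tag so every synonym of this frame shares one cached copy.
    pidx, ptag = pa % NUM_SETS, pa // NUM_SETS
    if pidx in cache and cache[pidx][:2] == (False, ptag):
        return ("physical hit", cache[pidx][2])
    return ("physical miss", fill(pidx, False, ptag, pa))

memory[7 * PAGE] = 99                       # data in the shared frame 7
print(access(0x20000))                      # physical miss: filled under PA tag
print(access(0x30000))                      # physical hit: synonym VPN 0x30
                                            # finds the copy cached by its PA
```

Because shared pages are always cached under physical tags, a second virtual name for the same frame can never create a duplicate or stale copy.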
The complete flow of a cache access is illustrated in Figure 7.

[Figure 7. Access flow in the hybrid cache architecture. (a) Virtual access: the TLB lookup and the cache access with the VA proceed in parallel (a TLB miss traps to the OS); if S = 0 the hit or miss result is final, while if S = 1 it is only a possible hit or possible miss and the physical access follows. (b) Physical access: the cache is accessed with the PA from the TLB, and its hit or miss is the final result.]

IV. Comparison and performance evaluation

In Section 2, we described many solutions to the synonym problem; no matter which techniques are used, the cost is too high to be practical. The software-based approaches complicate the implementation of the operating system. The hardware-based approaches, in contrast, must pay a higher hardware cost to handle the synonyms without any software intervention. Comparing the hybrid cache architecture, a combination of software and hardware, to the previous synonym solutions, we find it superior in both the software and the hardware portions.

In the software portion, we simply enable the OS to fill the mapping state information into the page table during page allocation. If the demanded physical page is shared, the OS not only sets the shared bit of the new mapping but also looks up the frame table to update the page table entry of the previous virtual page that maps to the same physical page; this effect is shown in Figure 4(b). Once a page table entry is altered, the OS must update the corresponding TLB entry at the same time. Since the frame table exists in any operating system, the mapping state is very easy to reflect. In the hardware portion, the cache controller must examine the shared bit of the TLB output to recognize whether the current reference has a synonym problem. Unlike the conventional reverse-translation caches, our scheme does not need extra storage to hold the synonym information.

The pipeline of a traditional physical cache access is shown in Figure 8. As is well known, the cache access suffers from the delay of the address translation in the TLB.

[Figure 8. Pipeline of the physical cache: in the MEM stage, the VA first looks up the TLB and the resulting PA then accesses the cache, in series.]

The hybrid cache can effectively reduce the cache access time by overlapping the address translation, as shown in Figure 9. Figure 9(a) shows the pipeline for the case in which the current reference is not in a shared physical page. In Figure 9(b), the current reference is in a shared physical page: since the shared bit of the TLB output is 1, we must use the PA produced during the virtual access to access the cache again. As a result, if a synonym occurs, the total access time is twice that of a single access.

[Figure 9. Pipeline of the hybrid cache architecture. (a) No synonym case: in the MEM stage, the VA looks up the TLB and accesses the cache in parallel. (b) Synonym case: the parallel access is followed by a second cache access with the PA.]

In this article, we develop an estimation model for comparing our proposal to a traditional physical cache in access time, shown as follows:

Time_physical = Time_TLB + Time_cache ----- (1)
Time_hybrid = (1-P)*Time_cache + P*2*Time_cache = (1+P)*Time_cache ----- (2)

Time_physical is the time to access a traditional physical cache, and Time_TLB and Time_cache are the times to access the TLB and the cache, respectively. In Equation (2), Time_hybrid represents the average access time of the hybrid cache, where the parameter P is the occurrence frequency of synonyms per reference. Consequently, the improvement in average access time is:

Time_physical / Time_hybrid = (Time_TLB + Time_cache) / ((1+P)*Time_cache)

In this paper, we used a practical TLB and cache as our basic model.
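As one concrete reading of Equations (1) and (2), the sketch below (our own; the function names are invented, and the 2.4 ns TLB and 3.0 ns cache latencies are the 0.35 µm figures reported by Wilton and Jouppi [12]) computes the improvement ratio as a function of the synonym frequency P.

```python
# Evaluating the access-time model of Equations (1) and (2) with the 0.35 um
# latencies from Wilton and Jouppi: t_tlb = 2.4 ns, t_cache = 3.0 ns.

def time_physical(t_tlb, t_cache):
    return t_tlb + t_cache                      # Eq. (1): serial TLB + cache

def time_hybrid(p, t_cache):
    return (1 - p) * t_cache + p * 2 * t_cache  # Eq. (2): = (1 + p) * t_cache

def improvement(p, t_tlb=2.4, t_cache=3.0):
    return time_physical(t_tlb, t_cache) / time_hybrid(p, t_cache)

print(round(improvement(0.0), 3))   # 1.8 -> the ideal 80% improvement
print(round(improvement(0.8), 3))   # 1.0 -> break-even: with t_tlb = 0.8 *
                                    # t_cache, the gain vanishes once 80% of
                                    # references fall in shared pages
```

The break-even point follows directly from the model: the hybrid cache wins whenever (1+P)*Time_cache < Time_TLB + Time_cache, i.e., whenever P < Time_TLB / Time_cache.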
Based on the measurements reported by Wilton and Jouppi [12], Time_cache for a 32KB, 32-byte block, 2-way cache is 6.9 ns and 3.0 ns for the 0.8 µm and 0.35 µm process technologies, respectively, and Time_TLB for a 32-entry, fully associative TLB is 5.5 ns and 2.4 ns for the same two technologies. Thus Time_physical / Time_hybrid is approximately 1.8 / (1+P) in both technologies. Finally, the improvement depends on the occurrence frequency of the synonyms, i.e., the parameter P. Synonyms cannot be captured unless the operating system is modified to monitor page allocation and to observe the physical addresses generated by the TLB. Because capturing the synonyms is so difficult, we were unable to measure the parameter P correctly in our simulation environment. Nevertheless, we believe that the frequency of synonyms should not be considerable. As a result, our proposed synonym-free cache would ideally improve the access time by 80% when no synonyms arise.

V. Conclusions

In a physical cache, it is clear that the TLB access is on the critical path, since even in the simplest cache the address translation must be performed for every memory reference. As the processor gets faster, the time spent on address translation becomes a performance bottleneck; in particular, an inefficient TLB incurs a serious degradation in system performance. We believe the trend is toward the parallel execution of cache access and TLB access, even though the synonym problem is hard to eliminate. This is the motivation for developing the hybrid cache architecture to solve the synonym problem in this paper. The hybrid cache is a synonym-free cache in which the cache access and the TLB access are performed in parallel; neither synonym-alignment software nor reverse-translation hardware is needed. During page allocation, we enable the OS to update the mapping state information in the page table entry and the TLB entry when synonyms arise.
In the hardware portion, a minimal hardware addition (in the cache controller) guards every memory reference against synonyms. In this paper, we also developed a performance evaluation model to estimate the hybrid cache. The final results show that, depending on the occurrence frequency of the synonyms, the hybrid cache architecture would ideally improve the access time by 80%.

VI. References

[1] A. J. Smith, "Cache Memories," Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.
[2] G. Taylor, P. Davies, and M. Farmwald, "The TLB Slice - A Low-Cost High-Speed Address Translation Mechanism," in Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990, pp. 355-363.
[3] M. Cekleov and M. Dubois, "Virtual-Address Caches, Part 1: Problems and Solutions in Uniprocessors," IEEE Micro, Sep. 1997, pp. 64-71.
[4] M. D. Hill et al., "Design Decisions in SPUR," Computer, Vol. 19, No. 11, Nov. 1986, pp. 8-22.
[5] K. Diefendorff, R. Oehler, and R. Hochsprung, "Evolution of the PowerPC Architecture," IEEE Micro, Apr. 1994, pp. 34-49.
[6] J. S. Chase et al., "Sharing and Protection in a Single-Address-Space Operating System," ACM Transactions on Computer Systems, Nov. 1994, pp. 271-307.
[7] R. Cheng, "Virtual Address Cache in Unix," in Proceedings of the 1987 Summer USENIX Conference, 1987, pp. 217-224.
[8] J. R. Goodman, "Coherency for Multiprocessor Virtual Address Caches," in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, 1987, pp. 72-81.
[9] J. Kim, S. L. Min, S. Jeon, and B. Ahn, "U-Cache: A Cost-Effective Solution to the Synonym Problem," in Proceedings of the First IEEE Conference on High-Performance Computing, Jan. 1995, pp. 243-252.
[10] W. H. Wang, J. L. Baer, and H. M. Levy, "Organization and Performance of a Two-Level Virtual Cache Hierarchy," in Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989, pp. 140-148.
[11] A. Silberschatz and P. B. Galvin, "Operating System Concepts," 5th Ed., Addison-Wesley, 1997.
[12] S. E. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches," DEC WRL Research Report 93/5, 1994.