Hybrid Cache - A Synonym Free Cache for Parallel TLB Accessing

Tung-Chi Wu
Department of Computer Science and Engineering, National Chung Hsing University
phd9704@csmail.nchu.edu.tw
Yen-Jen Chang
Department of Computer Science and Engineering, National Chung Hsing University
ychang@cs.nchu.edu.tw
Abstract
The virtual cache is a future trend because of its faster access time, but it suffers from the synonym problem, which is the reason it is still not in widespread use. In this paper, we propose a microarchitecture, called the hybrid cache, that is relieved of the synonym problem by using the frame table information of the operating system to reflect the mapping state between virtual pages and physical pages. In our proposed microarchitecture, a minimal hardware addition (in the cache controller) guarantees each memory reference against the synonym problem, so the TLB access is removed from the critical path entirely. We also develop a performance evaluation model to estimate our hybrid cache. The improvement depends on the occurrence frequency of the synonyms, and the results show that the improvement ratio of the hybrid cache is 1.8/(1+P), where P is the occurrence frequency of the synonyms.
[Figure 1. The architecture of the MMU. The CPU issues a virtual address; the TLB translates the virtual page number into a physical page number, which is concatenated (C) with the page offset to form the physical address sent to the memory system.]
Keywords: Virtual Cache, Synonym Problem, Hybrid Cache, Frame Table, TLB, and Critical Path.

I. Introduction

Virtual memory is essential in current computers. It is used not only to share the limited physical memory but also to relieve programmers from managing the memory hierarchy. Since the cache can also effectively reduce the speed gap between the processor and main memory, we usually use it to boost system performance. As a result, how to integrate the cache with virtual memory is a very important issue in achieving a high performance system. Because of virtual memory, the addresses produced by the processor, i.e., virtual addresses, must be translated into physical addresses before accessing the main memory. In most architectures, the address translation is implemented by the MMU (Memory Management Unit), which maps virtual addresses to physical addresses by looking up the TLB (Translation Look-aside Buffer) and executes the protection mechanism, as shown in Figure 1.

There are two alternative points in time for performing the address translation in most architectures: one is before accessing the cache, and the other is before accessing the main memory. Consequently, the cache has two index modes, virtually indexed and physically indexed, depending on whether the address translation is performed before or after the cache access. If we use virtual addresses to access the cache directly, the cache is called a virtual cache. With a virtual cache, the cache access and the address translation are performed in parallel. In contrast, if we use the physical addresses produced by the MMU to access the cache, the cache is a physical cache [1]; the address translation and the cache access are performed in series. It is clear that the address translation is on the critical path when we choose the physical cache. As processors get faster, the time spent on address translation becomes a performance bottleneck; in particular, an inefficient TLB incurs a serious degradation in system performance. From the aspect of fast cache access, the virtual cache is attractive, since it can be accessed directly without any translation procedure. The virtual cache seems superior to the physical cache, but there are a number of reasons why the physical cache is widely used now. Straightforward hardware is a major advantage of the physical cache, and it is simple to implement. Conversely, the most serious drawback of the virtual cache is the synonym problem, i.e., two different virtual addresses mapped to the same physical address [1]. The mechanisms of synonym handling complicate the cache access and the hardware implementation. Without careful synonym handling, the hidden cost of the virtual cache would be higher than its benefit of avoiding address translation. Whatever kind of cache is used, a large number of techniques have been developed to improve the performance of virtual memory integrated with the cache system. These improvements are described in the following sections.
Since the TLB access can become a performance bottleneck, we believe that the trend is toward the parallel execution of cache access and TLB access, provided the synonym problem can be solved at reasonable cost. In this paper, we propose a microarchitecture, called the hybrid cache, to avoid the synonym problem. By integrating software and hardware, the hybrid cache reduces the access time, since it need not wait for the generation of physical addresses, and it simplifies the handling of the synonym problem. The rest of this paper is organized as follows. Section 2 introduces the related work on the synonym problem. In Section 3, we develop the hybrid cache architecture, in which the operating system support and the cache architecture are described in detail. In Section 4, we estimate the performance of the hybrid cache and compare it with a traditional physical cache. Section 5 offers some brief concluding comments.
II. Related work on the synonym problem
A cache, which holds the most recently used data or instructions, speeds up memory references in processors. In the process of a cache access, the reference address is used to index the cache and then compared with the tags to determine whether the data is available. According to the mode of index and tag, we can categorize caches into four classes, as shown in Table 1. Most machines use a complete physical cache, because a P/P cache is simple to implement, while a virtual cache (V/P or V/V) is not in widespread use because of synonyms. For example, the HP Precision used a V/P cache, and the Berkeley SPUR used a complete virtual cache. The P/V cache, in particular, is used only in the MIPS R6000 machine, in which an elaborate TLB slice technique is used [2].
Table 1. The classification of a cache.

                     | physically indexed             | virtually indexed
  -------------------+--------------------------------+--------------------------------
  physically tagged  | P/P cache (complete physical)  | V/P cache
  virtually tagged   | P/V cache                      | V/V cache (complete virtual)
Once a virtual cache is used, the cache access and the TLB access are performed in parallel, so synonym handling becomes the critical issue in improving overall performance. In general, solutions to the synonym problem can be classified into two categories [3]: software-based techniques and hardware-based techniques. These two approaches have their pros and cons, and we briefly describe them in the rest of this section.
2.1 Software-based solutions to the synonym problem
Without any hardware support, the basic concept of the software-based schemes is to prevent or avoid synonyms. From the aspect of prevention, prohibiting synonyms in the operating system [4] is the simplest scheme, but it complicates the implementation of the operating system. The second way to prevent synonyms is to enforce that each physical page has only one virtual page number by sharing at the segment level [5]. The third is to use a single virtual-address-space operating system [6], i.e., one with no PIDs. Since all processes share a global virtual address space, no synonyms arise. The disadvantage of this method is that a very large page table must be accessed by hashing the virtual address.
From the aspect of avoiding synonyms, the simplest solution is to flush the entire virtual cache on a context switch [7], but this is suitable only for a small cache with a low context switch frequency. Another solution is the alignment scheme [7], in which the operating system must align all synonyms, that is, map all synonyms to the same cache block. Since a synonym then displaces the previously aligned synonym in the same block, the synonym problem cannot arise in the cache. The main disadvantage of this method is that the cost of aligning all synonyms is too high.
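To make the alignment condition concrete, the following sketch (our illustration, not code from [7]) tests whether two virtual addresses fall into the same set of a virtually indexed cache; the 4 KB page size and the number of index bits above the page offset are assumed values.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12   /* assumed 4 KB pages                          */
#define ALIGN_BITS        3   /* assumed: index extends 3 bits above offset  */
#define ALIGN_MASK (((1u << ALIGN_BITS) - 1) << PAGE_OFFSET_BITS)

/* Two synonyms land in the same cache set, so that the newer one simply
 * displaces the older one, only if their index bits above the page offset
 * agree.  The OS alignment policy enforces this for every pair of synonyms. */
static bool synonyms_aligned(uint32_t va1, uint32_t va2)
{
    return (va1 & ALIGN_MASK) == (va2 & ALIGN_MASK);
}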
2.2 Hardware-based solutions to the synonym problem
The concept of the hardware-based schemes is to tolerate synonyms, but these methods need hardware support to resolve them. The hardware solutions disallow two copies of the same block being present in the cache at any time, and they effectively simplify the implementation of the operating system. The basic hardware-based approach is to remove synonyms from the cache on a miss without any special hardware. When a cache miss occurs, the cache controller must search all blocks for a synonym. If a synonym exists, the miss is a false miss, and the cache controller must retag the block or simply remove it. This search process is efficient in a V/P cache but not in a V/V cache. As a result, many hardware techniques have been proposed to find the synonym as fast as possible. The process of finding the synonym is called reverse translation. One solution is the R-tag cache [8], which speeds up reverse translation with a copy of the main cache tags. A second organization for reverse translation is a small cache that stores only the unaligned synonyms [9]. Besides an on-chip cache, for a system with a two-level cache, i.e., a virtual first-level cache and a physical second-level cache, the reverse translation can be implemented in the tags of the second-level cache [10].
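As an illustration of the basic approach, the following sketch (ours, with hypothetical types, and a naive full scan; a real V/P cache would only search the few sets a synonym can occupy) detects a false miss by comparing physical tags.

#include <stddef.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

struct block {
    int      valid;
    uint32_t ptag;   /* physical tag kept with each block */
};

static struct block cache[NUM_BLOCKS];

/* On a miss whose translated physical tag is 'ptag', scan for a block that
 * already holds the same physical data.  If one is found, the miss is a
 * false miss: the controller re-tags the block (or removes it) instead of
 * fetching the line again. */
static struct block *find_synonym(uint32_t ptag)
{
    for (size_t i = 0; i < NUM_BLOCKS; i++)
        if (cache[i].valid && cache[i].ptag == ptag)
            return &cache[i];   /* false miss: synonym present */
    return NULL;                /* true miss */
}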
III. How the hybrid cache works
To avoid synonyms, we must recognize the situation in which a synonym can arise. At the page level, the synonym problem means that two or more virtual page numbers map to the same physical page number; in other words, the mapping between virtual pages and physical pages is multiple-to-one when a synonym arises. By contrast, the mapping is one-to-one when there is no synonym problem. When virtual memory is implemented in paging mode, the operating system (OS) dynamically allocates physical page frames to processes. In order to make the best use of the limited physical memory, the OS must maintain an available frame pool and information about the frames that have been allocated to processes. This information about page allocation is stored in the frame table [11]. In our proposed scheme, if the demanded physical frame is already in use, the OS must set the shared bit of the page table entry to indicate that the mapping is multiple-to-one. If necessary, the OS also updates the shared bit of the first virtual page that maps to the same physical frame, as shown in Figure 4(b). It is easy to reflect the mapping state during page allocation, because the frame table maps physical pages back to virtual pages, and we just look it up to get the mapping information. This mapping state, i.e., one-to-one or multiple-to-one, is the critical feature that gives our scheme the ability to avoid synonyms during cache access.
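The following sketch illustrates this bookkeeping under assumed data structures; pte_of() and tlb_update() are hypothetical helpers, not part of any particular OS.

struct pte { unsigned pfn; unsigned shared : 1; };

struct frame {
    int         mapped;     /* frame already allocated to some page? */
    struct pte *first_pte;  /* first virtual page mapped to it       */
};

#define NUM_FRAMES 4096
static struct frame frame_table[NUM_FRAMES];

/* Hypothetical OS helpers assumed to exist. */
extern struct pte *pte_of(int pid, unsigned vpn);
extern void tlb_update(const struct pte *pte);

/* On page allocation, reflect the mapping state in the S (shared) bit. */
void map_page(int pid, unsigned vpn, unsigned pfn)
{
    struct pte *pte = pte_of(pid, vpn);
    pte->pfn = pfn;

    if (frame_table[pfn].mapped) {
        /* Multiple-to-one mapping: a synonym may arise, so set the shared
         * bit of this entry and of the first mapping (cf. Figure 4(b)). */
        pte->shared = 1;
        struct pte *first = frame_table[pfn].first_pte;
        if (first && !first->shared) {
            first->shared = 1;
            tlb_update(first);      /* keep the TLB copy consistent */
        }
    } else {
        frame_table[pfn].mapped    = 1;
        frame_table[pfn].first_pte = pte;
        pte->shared = 0;            /* one-to-one mapping: no synonym */
    }
}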
3.1 TLB architecture

Figure 2 shows a conventional page table entry, which contains four components. Basically, a TLB entry has the same format as a page table entry, since the TLB is a slice of the page table. Besides the virtual page number and the corresponding physical page number, the process identification (PID) indicates that every process has a separate virtual address space.

[Figure 2. Format of a conventional page table entry. (Fields: PID, virtual page number, physical page number, misc.)]

Typically, the TLB entry contains several status bits that control access to the page frame. The V (Valid) bit indicates whether the page frame is in main memory, and the D (Dirty) bit reveals whether the page frame has ever been modified. The rest of the control bits, i.e., the misc. field, contain other important information about the page, such as the reference bit, access mode bits, and so on. To reflect the mapping state, we must append an S (Shared) bit to the page table entry, as shown in Figure 3.

[Figure 3. Variation of a conventional page table entry. (Fields: S, PID, virtual page number, physical page number, V, D, misc.)]

A shared bit of 1 indicates that the page is shared with another process, so the synonym problem may arise. On the other hand, if the shared bit is 0, there is a one-to-one mapping between the virtual page number and the physical page number, and the synonym problem cannot arise. The reflection of the current mapping state is illustrated in Figure 4.

[Figure 4. Mapping state and the alteration of the corresponding page table entry. (a) One-to-one mapping (no synonym case). (b)(c) Multiple-to-one mapping (synonym case). (VPN: virtual page number, PPN: physical page number)]
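For illustration, the extended entry of Figure 3 can be written as a C structure; the field widths below are assumptions, since the paper does not specify them.

#include <stdint.h>

struct tlb_entry {
    uint32_t vpn;          /* virtual page number                    */
    uint32_t ppn;          /* physical page number                   */
    uint16_t pid;          /* process identifier                     */
    unsigned v    : 1;     /* valid: page frame is in main memory    */
    unsigned d    : 1;     /* dirty: page frame has been modified    */
    unsigned s    : 1;     /* shared: mapping is multiple-to-one     */
    unsigned misc : 5;     /* reference bit, access mode, etc.       */
};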
3.2 Cache architecture
Since the cache can be accessed by either virtual addresses or physical addresses, we call it a hybrid cache, i.e., a hybrid-indexed, hybrid-tagged cache. To avoid unnecessary comparisons, we must append one bit to the tag field to indicate whether the tag is a virtual tag or a physical tag. In a virtual access, we use the virtual address to index the cache and then compare it with the virtual tags to determine whether the data is available. In a physical access, we simply use the physical address instead of the virtual address to access the cache.
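A possible layout of such a hybrid tag is sketched below; the tag width is illustrative.

#include <stdbool.h>
#include <stdint.h>

struct hybrid_tag {
    uint32_t tag;              /* virtual or physical tag bits       */
    unsigned valid    : 1;
    unsigned physical : 1;     /* 1: physical tag, 0: virtual tag    */
};

/* A tag can only match an address of the same kind; the extra bit avoids
 * comparing, e.g., a virtual tag against a physical address. */
static bool tag_matches(const struct hybrid_tag *t,
                        uint32_t tag, bool physical_access)
{
    return t->valid && t->physical == physical_access && t->tag == tag;
}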
3.3 Cache access flow

[Figure 5. The block diagram of our architecture. (PA: physical address, VA: virtual address)]

Figure 5 shows our proposed cache model. In the block diagram, note that the TLB outputs the shared bit (S) to indicate whether the current memory reference has a synonym. If the shared bit is 1, the PA from the TLB is used in the following physical access. Because of the hybrid cache architecture, the cache access and the TLB lookup are performed in parallel. There are two different cache access stages in our architecture, as shown in Figure 6.

[Figure 6. Two access stages in our cache architecture. The shaded part is the physical access. (S = 0: the hit/miss of the virtual access is the final result. S = 1: a possible hit/miss must be confirmed by a physical access with the PA from the TLB.)]

Depending on whether the current reference is in a shared physical page, the physical access can be bypassed. In the virtual access, we use the virtual address produced by the CPU to access the cache. Whatever the result, i.e., hit or miss, we must check the shared bit of the TLB output. A shared bit of 0 indicates that the current reference is not in a shared physical page, so no synonym can arise and the result is a true access result. A shared bit of 1 indicates that the current reference is in a shared physical page, and we must then proceed with the physical access using the physical address produced during the virtual access (the PA from the TLB). The complete flow of a cache access is illustrated in Figure 7.
[Figure 7. Access flow in the hybrid cache architecture. (a) Virtual access: access the cache with the VA from the CPU while looking up the TLB; a TLB miss traps to the OS. (b) Physical access: access the cache with the PA from the TLB.]
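The following sketch restates the flow of Figures 6 and 7 in C. The helper functions are hypothetical, and the virtual access and the TLB lookup, which proceed in parallel in hardware, are written sequentially here for clarity.

#include <stdbool.h>
#include <stdint.h>

struct tlb_result { uint32_t pa; bool shared; bool hit; };

/* Hypothetical hardware operations. */
extern bool cache_lookup(uint32_t addr, bool physical);
extern struct tlb_result tlb_lookup(uint32_t va);
extern void handle_miss(uint32_t pa);
extern void trap_to_os(uint32_t va);

void hybrid_access(uint32_t va)
{
    bool possible_hit = cache_lookup(va, false);  /* virtual access     */
    struct tlb_result t = tlb_lookup(va);         /* in parallel        */

    if (!t.hit) { trap_to_os(va); return; }       /* TLB miss           */

    if (!t.shared) {
        /* S = 0: one-to-one mapping, the virtual result is final.      */
        if (!possible_hit) handle_miss(t.pa);
        return;
    }

    /* S = 1: a synonym may exist, so repeat the access with the
     * physical address from the TLB (the shaded stage in Figure 6).    */
    if (!cache_lookup(t.pa, true))
        handle_miss(t.pa);
}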
IV. Comparison and performance evaluation
In Section 2, we described many solutions to the synonym problem. No matter which technique is used, the cost is too high to be practical: the software-based approaches complicate the implementation of the operating system, while the hardware-based approaches pay a higher hardware cost to handle synonyms without any software intervention. Comparing the hybrid cache architecture, a combination of software and hardware, with the previous synonym solutions, we find it superior in both software and hardware.
In the software portion, we simply enable the OS to fill the mapping state information into the page table during page allocation. If the demanded physical page is shared, the OS not only sets the shared bit of this mapping but also looks up the frame table to update the page table entry of the previous virtual page that maps to the same physical page. This effect is shown in Figure 4(b). Once a page table entry is altered, the OS must update the corresponding TLB entry at the same time. Since the frame table exists in any operating system, the mapping state is very easy to reflect. In the hardware portion, the cache controller must examine the shared bit of the TLB output to recognize whether the current reference has a synonym problem. Unlike the conventional reverse-translation caches, our scheme does not need extra storage to store the synonym information.
[Figure 8. Pipeline of the physical cache. (In the MEM stage, the VA must first look up the TLB and the resulting PA then accesses the cache.)]

The pipeline of a traditional physical cache access is shown in Figure 8. It is well known that the cache access suffers from the delay of address translation in the TLB. The hybrid cache can effectively reduce the cache access time by overlapping the address translation, as shown in Figure 9. Figure 9(a) shows the pipeline of a case in which the current reference is not in a shared physical page.

[Figure 9. Pipeline of the hybrid cache architecture. (a) No synonym case: the TLB lookup and the virtual cache access overlap in the MEM stage. (b) Synonym case: a physical access with the PA follows the virtual access.]
In Figure 9(b), the current reference is in a shared physical page. Since the shared bit of the TLB output is 1, we must use the PA produced during the virtual access to access the cache again. As a result, if a synonym occurs, the total access time is twice that of a single access. We develop the following model to compare the access time of our proposal with that of a traditional physical cache:
Time_physical = Time_TLB + Time_cache ----- (1)

Time_hybrid = (1-P) * Time_cache + P * 2 * Time_cache = (1+P) * Time_cache ----- (2)
Here Time_physical is the time to access a traditional physical cache, and Time_TLB and Time_cache are the times to access the TLB and the cache, respectively. In Equation (2), Time_hybrid represents the average access time of the hybrid cache, and the parameter P is the occurrence frequency of synonyms per reference. Consequently, the improvement in average access time is:
Time_physical / Time_hybrid = (Time_TLB + Time_cache) / ((1+P) * Time_cache) ----- (3)
In this paper, we use a practical TLB and cache as our base model. Based on the measurements reported by Wilton and Jouppi [12], the Time_cache of a 32 KB, 32-byte block, 2-way set-associative cache is 6.9 ns and 3.0 ns for the 0.8 μm and 0.35 μm process technologies, respectively. The Time_TLB of a 32-entry, fully associative TLB is 5.5 ns and 2.4 ns for the 0.8 μm and 0.35 μm process technologies, respectively. Thus Time_physical / Time_hybrid is equal to 1.8/(1+P). The improvement therefore depends on the occurrence frequency of the synonyms, i.e., the parameter P.
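The following small program (ours, for checking the arithmetic) evaluates Equations (1)-(3) with the timings quoted above for a few example values of P.

#include <stdio.h>

int main(void)
{
    const char  *tech[]    = { "0.8 um", "0.35 um" };
    const double t_tlb[]   = { 5.5, 2.4 };   /* ns, 32-entry FA TLB    */
    const double t_cache[] = { 6.9, 3.0 };   /* ns, 32 KB 2-way cache  */
    const double p[]       = { 0.0, 0.1, 0.5 };

    for (int i = 0; i < 2; i++) {
        double t_physical = t_tlb[i] + t_cache[i];         /* Eq. (1) */
        for (int j = 0; j < 3; j++) {
            double t_hybrid = (1.0 + p[j]) * t_cache[i];   /* Eq. (2) */
            printf("%s, P = %.1f: Time_physical/Time_hybrid = %.2f\n",
                   tech[i], p[j], t_physical / t_hybrid);  /* Eq. (3) */
        }
    }
    return 0;
}

For P = 0 both technologies give 12.4/6.9 and 5.4/3.0, i.e., approximately 1.8, the 80% ideal improvement cited below.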
Synonyms cannot be captured unless we modify the operating system to monitor page allocation and capture the physical addresses generated by the TLB. Because this is very difficult, we are unable to measure the parameter P correctly in our simulation environment. Nevertheless, we believe that the frequency of synonyms should be small. As a result, our proposed synonym-free cache would ideally improve the access time by 80% when no synonyms arise.
V. Conclusions
In a physical cache, it is clear that the TLB access is on the critical path, since even in the simplest cache the address translation must be performed for each memory reference. As processors get faster, the time spent on address translation becomes a performance bottleneck; in particular, an inefficient TLB incurs a serious degradation in system performance.
We believe that the trend is toward the parallel execution of cache access and TLB access, although the synonym problem is hard to eliminate. This is why we developed the hybrid cache architecture to solve the synonym problem in this paper. The hybrid cache is a synonym-free cache in which the cache access and the TLB access are performed in parallel; neither synonym-alignment software nor reverse-translation hardware is needed. During page allocation, we enable the OS to update the mapping state information in the page table entry and the TLB entry when synonyms arise. In the hardware portion, a minimal hardware addition (in the cache controller) guarantees each memory reference against synonyms. In this paper, we also develop a performance evaluation model to estimate the hybrid cache. The final results show that, depending on the occurrence frequency of the synonyms, the hybrid cache architecture would ideally improve the access time by 80%.
VI. References
[1] A. J. Smith, “Cache Memories,” Computing Surveys, Vol.
14, No. 3, September 1982, pp. 473-530.
[2] G. Taylor, P. Davies and M. Farmwald, "The TLB Slice--A
Low-Cost High-Speed Address Translation Mechanism,"
In Proceedings of the 17th Annual International
Symposium on Computer Architecture, 1990, pp. 355-363.
[3] M. Cekleov and M. Dubois, "Virtual-Address Caches,
Part 1: Problems and Solutions in Uniprocessors," IEEE
Micro, Sep. 1997, pp. 64-71.
[4] M. D. Hill et al., "Design decisions in SPUR,"
Computer, Vol. 19, No. 11, Nov. 1986, pp. 8-22.
[5] K. Diefendorff, R. Oehler, and R. Hochsprung, "Evolution
of the PowerPC Architecture," IEEE Micro, Apr. 1994, pp.
34-49.
[6] J. S. Chase et al., "Sharing and Protection in a Single-Address-Space Operating System," ACM Trans.
Computer Systems, Nov. 1994, pp. 271-307.
[7] R. Cheng, "Virtual Address Cache in Unix," In
Proceedings of the 1987 Summer USENIX conference,
1987, pp. 217-224.
[8] J. R. Goodman, "Coherency for Multiprocessor Virtual
Address Caches," In Proc. of the Second International
Conference on Architectural Support for Programming
Languages and Operating Systems, 1987, pp. 72-81.
[9] J. Kim, S. L. Min, S. Jeon, and B. Ahn, "U-Cache: A Cost-Effective Solution to the Synonym Problem," Proc. First
IEEE Conf. High-Performance Computing, IEEE CS
Press, Jan. 1995, pp. 243-252.
[10] W. H. Wang, J. L. Baer, and H. M. Levy, "Organization
and Performance of a Two-Level Virtual Cache
Hierarchy," In Proceedings of the 16th Annual
International Symposium on Computer Architecture,
May 1989, pp. 140-148.
[11] A. Silberschatz and P. B. Galvin, "Operating System Concepts," 5th Ed., Addison-Wesley, 1997.
[12] S. E. Wilton and N. Jouppi, “An Enhanced Access and
Cycle Time Model for On-Chip Caches,” DEC WRL,
Research Report 93/5, 1994.