A Burst Scheduling Access Reordering Mechanism

Jun Shao and Brian T. Davis
Department of Electrical and Computer Engineering
Michigan Technological University
{jshao, btdavis}@mtu.edu
Abstract
Utilizing the nonuniform latencies of SDRAM devices,
access reordering mechanisms alter the sequence of main
memory access streams to reduce the observed access latency. Using a revised M5 simulator with an accurate
SDRAM module, the burst scheduling access reordering
mechanism is proposed and compared to conventional in order memory scheduling as well as existing academic and industrial access reordering mechanisms. With burst scheduling, memory accesses to the same rows of the same banks
are clustered into bursts to maximize bus utilization of the
SDRAM device. Subject to a static threshold, memory reads
are allowed to preempt ongoing writes for reduced read latency, while qualified writes are piggybacked at the end of
bursts to exploit row locality in writes and prevent write
queue saturation. Performance improvements contributed
by read preemption and write piggybacking are identified. Simulation results show that burst scheduling reduces
the average execution time of selected SPEC CPU2000
benchmarks by 21% over conventional bank in order memory scheduling. Burst scheduling also outperforms Intel’s
patented out of order memory scheduling and the row hit
access reordering mechanism by 11% and 6% respectively.
1. Introduction
Memory performance can be measured in two ways:
bandwidth and latency. Memory bandwidth can largely be
addressed by increasing resources: higher frequencies, wider busses, and dual or quad channels. However, reducing memory latency often requires reducing the device
size. Although caches can hide the long main memory latency, cache misses may cost hundreds of CPU cycles
and cause pipeline stalls. Therefore main memory access
latency remains a factor limiting system performance [21].
Due to the 3-D (bank, row, column) structure, modern
SDRAM devices have nonuniform access latencies [3, 4].
The access latency depends upon the location of the requested
data and the state of the SDRAM device. Two adjacent memory accesses directed to the same row of the same bank can
be completed faster than two accesses directed to different
rows, because the accessed row can be kept active, allowing
faster access for subsequent same-row accesses.
In addition, accesses to unique banks can be pipelined, thus
two accesses directed to unique banks may complete with lower combined latency than two accesses directed to the same bank.
With aggressive out of order execution processors and
non-blocking caches, multiple main memory accesses can
be issued and outstanding while the main memory is serving the previous access. Compared with conventional in
order access scheduling, memory access reordering mechanisms execute these outstanding memory accesses in an
order which attempts to reduce execution time. By exploiting SDRAM row locality and bank parallelism, access
reordering mechanisms significantly reduce observed main
memory access latency and improve the effective memory
bandwidth. Access reordering mechanisms do not require
a large amount of chip area and only need modifications to
the memory controller.
This paper makes the following contributions:
• Studies and identifies performance contributions made
by access reordering mechanisms.
• Introduces burst scheduling which creates bursts by
clustering accesses directed to the same rows of the
same banks to achieve a high data bus utilization.
Reads are allowed to preempt ongoing writes while
qualified writes are piggybacked at the end of bursts.
• Compares burst scheduling with conventional bank in
order memory scheduling as well as existing access
reordering mechanisms, including published academic
and industrial out of order scheduling.
• Explores the design space of burst scheduling by using
a static threshold to control read preemption and write
piggybacking. Determines experimentally the threshold that
yields the shortest execution time.
The rest of the paper is organized as follows. Section 2 discusses the background of modern SDRAM devices, memory access scheduling and related work. Section 3 introduces burst scheduling. Section 4 and Section 5
present the experimental environment and simulation results.
Conclusions are drawn in Section 6 based on the results.
Section 7 briefly discusses future work.
2. Background
Modern SDRAM devices store data in arrays (banks)
which can be indexed by row address and column address [3, 4]. An access to the SDRAM device may require
three transactions besides the data transfer: bank precharge,
row activate and column access [10]. A bank precharge
charges and prepares the bank. A row activate copies an
entire row of data from the array to the sense amplifiers, which
function like a cache. Then one or more column accesses
can access the row data.
Depending on the state of the SDRAM device, a memory
access can be a row hit, row conflict or row empty, each
experiencing a different latency [14]. A row hit occurs when
the bank is open and an access is directed to the same row
as the last access to the same bank. A row conflict occurs
when an access goes to a different row than the last access
to the same bank. If the bank is closed (precharged) then a
row empty occurs. Row hits require only column accesses,
while row conflicts require all three transactions;
therefore row hits have shorter latencies than row conflicts.
After completing a memory access, the bank can be left
open or closed by a bank precharge, subject to the memory controller policy. One of two static controller policies,
Open Page (OP) and Close Page Autoprecharge (CPA), typically makes this decision [20]. Table 1 summarizes the possible SDRAM access latencies assuming the SDRAM busses are idle,
where tRP , tRCD and tCL are the timing constraints associated with bank precharge, row activate and column access
respectively [10].
Table 1. Possible SDRAM access latencies

  Controller policy   Row hit   Row empty    Row conflict
  OP                  tCL       tRCD + tCL   tRP + tRCD + tCL
  CPA                 N/A       tRCD + tCL   N/A
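As a worked example, the latencies in Table 1 follow directly from the bank state and the controller policy. The Python sketch below computes the latency of a single access; the 5-5-5 cycle timings (matching the DDR2 device simulated in Section 4) are assumed here purely for illustration:

tRP, tRCD, tCL = 5, 5, 5   # assumed DDR2 PC2-6400 (5-5-5) timings, in cycles

def access_latency(policy, open_row, requested_row):
    """Latency of a single access with idle busses (Table 1)."""
    if policy == "CPA":
        return tRCD + tCL              # bank always precharged: row empty
    if open_row is None:               # OP, bank precharged: row empty
        return tRCD + tCL
    if open_row == requested_row:      # OP, same row open: row hit
        return tCL
    return tRP + tRCD + tCL            # OP, different row open: row conflict

print(access_latency("OP", 7, 7))      # row hit: 5
print(access_latency("OP", 7, 3))      # row conflict: 15
print(access_latency("OP", None, 3))   # row empty: 10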
Multiple internal banks allow accesses to unique banks
to be executed in parallel. A set of SDRAM devices concatenated to populate the system memory bus is known as
a rank, which shares the address and data busses with other
ranks. For systems that support dual channels, different
channels have unique busses. Therefore parallelism exists
in the memory subsystem between banks and/or channels.
Main memory access streams are comprised of cache
misses from the lowest level cache and have been shown
to have significant spatial and temporal locality even after
being filtered by cache(s) [16]. These characteristics of the
main memory access stream create a design space where parallelism and
locality can be exploited by access reordering mechanisms
to reduce the main memory access latency.
Throughout the rest of this paper, the term access refers
to a memory read or write issued by the lowest level cache.
An access may require several transactions depending upon
the state of the SDRAM devices.
2.1. Memory Access Scheduling
The memory controller, usually located on the north
bridge or on the CPU die, generates the required transactions for each access and schedules them on the SDRAM
busses. SDRAM busses are split transaction busses therefore transactions belonging to different accesses can be interleaved.
[Figure 1. Memory access scheduling. Four accesses are scheduled: access0 to bank0 row0, access1 to bank1 row0, access2 to bank0 row1 and access3 to bank0 row0. (a) In order scheduling without interleaving completes them in 28 cycles; (b) out of order scheduling with interleaving completes them in 16 cycles. P: bank precharge; R: row activate; C: column access; Dx: data transfer.]
How the memory controller schedules transactions of accesses impacts the performance, as illustrated in Figure 1.
In this example four accesses are to be scheduled. Access0
and access1 are row empties; access2 and access3 are row
conflicts. The SDRAM device has timing constraints of 2-2-2 (tCL-tRCD-tRP) and a burst length of 4 (2 cycles with
double data rate). In Figure 1(a), the controller performs
these accesses strictly in order and does not interleave any
transactions. It takes 28 memory cycles to complete four
accesses. While in order scheduling is easy to implement,
it is inefficient because of the low bus utilization.
In Figure 1(b), the same four accesses are scheduled out
of order. Access3 is scheduled prior to access1 and access2, which
turns access3 from a row conflict into a row hit. The transactions of different accesses are also interleaved to maximize SDRAM bus utilization. As a result, only 16 memory
cycles are needed to complete the same four accesses. It is
possible that some accesses result in increased latency due
to access reordering; however, the average access latency
should be reduced.
2.2. Related Work
Other access reordering mechanisms exist; they are usually designed for specialized applications or hardware. Scott
Rixner et al. exploited the features of Virtual Channel
SDRAM devices [11] and proposed various access reordering policies to achieve a high memory bandwidth and low
memory latency for modern web servers [12]. Proposed
by Ibrahim Hur et al., the adaptive history-based memory
scheduler tracks the access pattern of recently scheduled
accesses and selects memory accesses matching the program’s mixture of reads and writes [7]. Zhichun Zhu et
al. proposed fine-grain priority scheduling, which splits and
maps memory accesses into different channels and returns
critical data first, to fully utilize the available bandwidth
and concurrency provided by Direct Rambus DRAM systems [24]. Sally McKee et al. described a Stream Memory
Controller system for streaming computations, which combines compile-time detection of streams with executiontime access reordering [9]. SDRAM access reordering and
prefetching were also proposed by Jahangir Hasan et al. to
increase row locality and reduce row conflict penalty for
network processors [6].
Other SDRAM related techniques such as SDRAM address mapping change the distribution of memory blocks in
the SDRAM address space to exploit the parallelism [19,
5, 23, 16]. A dynamic SDRAM controller policy predictor proposed by Ying Xu reduces main memory access latency by using a history-based predictor, similar to a branch
predictor, to decide whether or not to leave the
accessed row open after each access [22].
3. Burst Scheduling
In a packet switching network, large packets can improve
network throughput because they reduce the fraction of bandwidth used
to transfer packet headers (overhead). Considering
the three SDRAM transactions as the overhead and the data transfer
as the payload, the same idea can be applied to access
scheduling to improve bus utilization.
[Figure 2. Burst scheduling: a bank precharge (P0) and row activate (R0) form the overhead, followed by back-to-back column accesses (C0 through C4) for Access0 through Access4, which form the burst payload.]
As illustrated in Figure 2, the proposed burst scheduling
clusters outstanding accesses directed to the same rows of
the same banks into bursts. Accesses within a burst, except
for the first one, are row hits and only require column access transactions. Data transfers of these accesses can be
performed back to back on the data bus, resulting in a large
payload, therefore improving data bus utilization. Increasing the row hit rate and maximizing the memory data bus
utilization are the major design goals of burst scheduling.

[Figure 3. Structure of burst scheduling: incoming memory accesses enter per-bank read and write queues drawn from a shared access pool, write data is held in a shared write data pool, and per-bank arbiters feed the SDRAM transaction scheduler.]
Figure 3 shows the structure of burst scheduling. Outstanding accesses are stored in unique read queues and write
queues based on their target banks. The read and write
queues share a global access pool. A write data pool is used
to store the data associated with writes. The queues can be
implemented as linked lists. Depending upon the row index
of the access address, a new read will join an existing burst, or
a new burst containing the single access will be created and appended at the end of the read queue. Bank arbiters select an
ongoing access from the read or write queue for each bank.
At each memory cycle a global transaction scheduler selects
an ongoing access from all bank arbiters and schedules the
next transactions of the access.
Newly arrived accesses can join existing bursts when
bursts are being scheduled. Bursts within a bank are sorted
based on the arrival time of the first access to prevent starvation of single-access or small bursts in the same bank.
However a large or an increasing burst can still delay small
bursts from other banks. Therefore burst scheduling interleaves bursts from different banks to give relatively equal
opportunity to all bursts. High data bus utilization can be
maintained during burst interleaving. However, care must be taken to avoid bubble cycles due to certain
SDRAM timing constraints, e.g. the rank-to-rank turnaround cycles introduced by DDR2 devices [8].
The following sections present the three subroutines of
the algorithm used in burst scheduling, which can be transformed into finite state machines for incorporation into the
SDRAM controller.
3.1. Access Enter Queue Subroutine
Figure 4 shows the access enter queue subroutine, which
is called when new accesses enter the queues. Because the
write queue serves as a write buffer, all reads must search
the write queue for possible hits, although write queue hits
happen infrequently due to the small queue size. When a
write queue hit occurs, a read is requesting the data at the
same location as a preceding write. The data from the latest
write (if there are multiple) will be forwarded to the read
such that the read can complete immediately. Missed reads
enter the read queue. If a read is directed to the same row
as an existing burst, the read will be appended to that burst.
Otherwise, a new burst composed of the read will be created
and appended to the read queue. All writes enter the write
queue in order and are considered completed from the CPU's point of view.
subroutine AccessEnterQueue(access)
 1: if access is a read then
 2:   if hit in the write queue then
 3:     forward the latest write data to access
 4:     send response to access
 5:   else if found an existing burst in read queue then
 6:     append access to that burst
      else
 7:     create a new burst
 8:     append the new burst to read queue
      end if
    else
 9:   append access to the write queue
10:   send response to access
    end if

Figure 4. Access enter queue subroutine
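For concreteness, the following Python rendering of the subroutine in Figure 4 shows the write-queue search, data forwarding, and burst creation. The Access and Burst structures, the list-based queues and the four-bank sizing are illustrative assumptions rather than the controller's actual implementation:

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Access:
    is_read: bool
    bank: int
    row: int
    col: int
    data: object = None   # write data lives in the shared write data pool

@dataclass
class Burst:
    row: int
    accesses: deque = field(default_factory=deque)

NUM_BANKS = 4                                      # illustrative sizing
read_queue = {b: [] for b in range(NUM_BANKS)}     # list of Bursts per bank
write_queue = {b: [] for b in range(NUM_BANKS)}    # list of writes per bank

def access_enter_queue(a):
    if a.is_read:
        # All reads search the write queue for a hit (RAW forwarding).
        hits = [w for w in write_queue[a.bank]
                if (w.row, w.col) == (a.row, a.col)]
        if hits:
            return hits[-1].data           # forward the latest write's data
        for burst in read_queue[a.bank]:   # join an existing same-row burst
            if burst.row == a.row:
                burst.accesses.append(a)
                return None
        fresh = Burst(row=a.row)           # otherwise start a new burst
        fresh.accesses.append(a)
        read_queue[a.bank].append(fresh)
    else:
        # Writes enter in order and complete from the CPU's point of view.
        write_queue[a.bank].append(a)
    return None

# Example: a write followed by a read to the same location forwards data.
access_enter_queue(Access(False, bank=0, row=1, col=8, data=b"x"))
print(access_enter_queue(Access(True, bank=0, row=1, col=8)))  # b'x'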
3.2. Bank Arbiter Subroutine
Each bank has one ongoing access, which is the access for which transactions are currently being scheduled,
but which has not yet completed. The bank arbiter selects
the ongoing access from either the read queue or the write
queue, generally prioritizing reads over writes. Writes are
selected only when there are no outstanding reads in the
read queue, when the write queue is full or when doing write
piggybacking. The algorithm is given in Figure 5.
Two options, read preemption and write piggybacking,
are available to the bank arbiter. Read preemption allows a
newly arrived read to interrupt an ongoing write. The read
becomes the ongoing access and starts immediately, therefore reducing the latency to the read. Read preemption will
not affect the correctness of execution; the preempted write
will restart later.
subroutine BankArbiter(ongoing access)
 1: if ongoing access == NULL then
 2:   if write queue is full then
 3:     ongoing access = oldest write in write queue
 4:   else if write queue length > threshold and
        last access was an end of burst and
        any row hit in write queue then
 5:     ongoing access = oldest row hit write
 6:   else if write queue is not empty and
        read queue is empty then
 7:     ongoing access = oldest write in write queue
      else
 8:     ongoing access = first read in next burst
      end if
 9: else if ongoing access is a write and
      read queue is not empty and
      write queue length < threshold then
10:   reset ongoing access
11:   ongoing access = first read in next burst
    end if

Figure 5. Bank arbiter subroutine
The major functionality of the write queue, besides hiding the write latency and reducing write traffic [17], is to
allow reads to bypass writes. When the write queue reaches
its capacity, the main memory cannot accept any new access, causing a possible CPU pipeline stall. Write piggybacking is designed to speed up the write process by appending
qualified writes at the end of bursts. The writes being appended must be directed to the same row as the burst, so
that they will not disturb the continuous row hits created by
the burst. If there are no qualified writes available, the next
burst will start. Write piggybacking reduces the probability
of write queue saturation and exploits the locality of row
hits from writes as well.
Read preemption and write piggybacking may conflict
with each other, e.g. a piggybacked write may be preempted
by a new read. A threshold is introduced to allow the bank
arbiter to switch dynamically between read preemption and
write piggybacking. When the write queue occupancy is
less than the threshold, read preemption is enabled. Otherwise, write piggybacking is enabled. Section 5.4 presents
a detailed study of the threshold.
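A condensed Python sketch of the arbiter in Figure 5, including the threshold switch between read preemption and write piggybacking, might look as follows. It assumes the Access and Burst structures from the previous sketch, list-based queues ordered oldest first, and the 64-entry write queue used in the evaluation; the open_row parameter stands in for the row targeted by the current burst:

THRESHOLD = 52             # the experimentally selected value (Section 5.4)
WRITE_QUEUE_CAPACITY = 64

def bank_arbiter(ongoing, bursts, writes, open_row, last_was_end_of_burst):
    """Select the ongoing access for one bank (cf. Figure 5)."""
    if ongoing is None:
        if len(writes) >= WRITE_QUEUE_CAPACITY:
            return writes[0]                   # saturated: drain oldest write
        # Writes qualify for piggybacking if they hit the open (burst) row.
        row_hits = [w for w in writes if w.row == open_row]
        if len(writes) > THRESHOLD and last_was_end_of_burst and row_hits:
            return row_hits[0]                 # piggyback oldest row hit write
        if writes and not bursts:
            return writes[0]                   # no reads pending: oldest write
        if bursts:
            return bursts[0].accesses[0]       # first read of the next burst
        return None
    # Read preemption: a pending read interrupts an ongoing write as long
    # as write queue occupancy stays below the threshold; the preempted
    # write restarts later, so correctness is unaffected.
    if (not ongoing.is_read) and bursts and len(writes) < THRESHOLD:
        return bursts[0].accesses[0]
    return ongoing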
3.3. Transaction Scheduler Subroutine
A transaction is considered unblocked when all required timing constraints are met. At each memory cycle
the transaction scheduler selects from all banks one ongoing access for which the next transaction is unblocked and
schedules that transaction.
A static priority, as shown in Table 2, is used to select the
ongoing access containing the next unblocked transaction to
be scheduled.

Table 2. Transaction priority table (1: the highest, 8: the lowest)

  Transaction             Same bank   Same rank   Other ranks
  Read  bank precharge        5           5            5
  Read  row activate          5           5            5
  Read  column access         1           2            7
  Write bank precharge        6           6            6
  Write row activate          6           6            6
  Write column access         3           4            8

Among all unblocked transactions, column
accesses within the same rank as the last access scheduled
have the highest priorities. Column accesses from different banks but within the same ranks are interleaved, so that
bursts from different banks are equally served. The high
data bus utilization is maintained because interleaved accesses are still row hits. Bank precharge and row activate
have the next highest priorities as they do not require data
bus resources, and therefore can be overlapped with column
access transactions. The scheduler has priorities set to finish all bursts within a rank before switching to another rank
to avoid rank-to-rank turnaround cycles required by DDR2
devices [8]. Column accesses from different ranks thus have
the lowest priority. Within each category, read transactions
always have higher priorities than write transactions. An
oldest first policy is used to break ties.
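As an illustration, Table 2 can be encoded directly as a lookup keyed by access direction, transaction type and locality, with arrival time breaking ties; the tuple encoding below is assumed for exposition only, not the controller's actual representation:

# (is_read, transaction, locality) -> priority; lower wins (Table 2).
PRIORITY = {
    (True,  "col", "same bank"): 1, (True,  "col", "same rank"): 2,
    (False, "col", "same bank"): 3, (False, "col", "same rank"): 4,
    (True,  "pre", "any"): 5, (True,  "act", "any"): 5,
    (False, "pre", "any"): 6, (False, "act", "any"): 6,
    (True,  "col", "other ranks"): 7, (False, "col", "other ranks"): 8,
}

def select_transaction(candidates):
    """candidates: (is_read, transaction, locality, arrival) tuples for
    unblocked transactions; oldest first breaks priority ties."""
    def key(c):
        is_read, kind, locality, arrival = c
        locality = locality if kind == "col" else "any"  # pre/act ignore it
        return (PRIORITY[(is_read, kind, locality)], arrival)
    return min(candidates, key=key) if candidates else None

# A same-bank read column access (priority 1) beats an older write
# column access in the same rank (priority 4).
print(select_transaction([(False, "col", "same rank", 3),
                          (True,  "col", "same bank", 9)]))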
Based on the scheduling priority in Table 2, the subroutine of the transaction scheduler is shown in Figure 6. When
there are no unblocked transactions from any access, the
scheduler switches to the bank holding the oldest access and initiates an access from that bank in the next
memory cycle.
The priority table and transaction scheduler are the core
of burst scheduling: they maintain the structure of bursts
created by the bank arbiters while maximizing data
bus utilization by aggressively interleaving transactions between accesses.
3.4. Validation
Burst scheduling will not affect program correctness.
Reads are checked in the write queues for hits before entering the read queues. If a read hits in the write queue, the
latest data will be forwarded from the write to the read, so
read after write (RAW) hazards are avoided. Within bursts,
writes are always piggybacked after reads which have previously checked the write queue for hits, avoiding write after
read (WAR) hazards. When performing write piggybacking, the oldest qualified write will be selected first. Writes
are scheduled in program order within the same rows, therefore write after write (WAW) hazards are also avoided.
subroutine TransactionScheduler(last bank, last rank)
 1: if last bank has unblocked col access then
 2:   schedule the unblocked col access
 3: else if any unblocked col access in last rank then
 4:   schedule the oldest unblocked col access
 5: else if any unblocked precharge or row activate then
 6:   schedule the oldest precharge or row activate
 7: else if any unblocked col access in other ranks then
 8:   schedule the oldest unblocked col access
    end if
 9: if access scheduled then
10:   if scheduled access has completed then
11:     send response to that access
      end if
12:   last bank = scheduled access's target bank
13:   last rank = scheduled access's target rank
    else
14:   last bank = the bank having the oldest access
15:   last rank = the rank having the oldest access
    end if

Figure 6. Transaction scheduler subroutine
4. Experimental Environment
A revised M5 simulator [1] and SPEC CPU2000 benchmark suite [18] are used in the studies of access reordering mechanisms. The M5 simulator is selected mainly
because it supports a nondeterministic memory access latency. The revisions made to the M5 simulator include a
detailed DDR2 SDRAM module, a parameterized memory
controller as well as the addition of the access reordering
mechanisms described in this paper.
4.1. Benchmarks and Baseline Machine
Due to page limitations, results from 16 of the 26 SPEC
CPU2000 benchmarks are shown in Section 5 using the criterion that the benchmark selected exhibits more than 2%
performance difference between in order scheduling and
any out of order access reordering mechanisms studied in
this paper. While excluding non-memory intensive benchmarks provides a better illustration of impacts contributed
by access reordering mechanisms, results from the complete
suite can be found in [15]. Simulations are run through the
first 2 billion instructions with reference input sets and precompiled little-endian Alpha ISA SPEC2000 binaries [2].
Table 3 lists the configuration of the baseline machine,
which represents a typical desktop workstation in the near
future using a bank in order memory access scheduling
(BkInOrder). With BkInOrder, accesses within the same
banks are scheduled in the same order as they were issued,
while accesses from different banks are selected in a round
robin fashion.
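A minimal Python sketch of this baseline policy, assuming one FIFO of accesses per bank, is given below; the queue layout is illustrative:

def bk_in_order(bank_queues, next_bank):
    """Oldest access from the next non-empty bank, visited round robin."""
    n = len(bank_queues)
    for i in range(n):
        bank = (next_bank + i) % n
        if bank_queues[bank]:
            return bank, bank_queues[bank].pop(0)
    return None

queues = [["a0", "a3"], ["a1"], [], ["a2"]]
print(bk_in_order(queues, next_bank=1))   # -> (1, 'a1')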
Table 3. Baseline machine configuration

  CPU                  4GHz, 8-way, 32 LSQ, 196 ROB
  L1 I-cache           128KB, 2-way, 64B cache line
  L1 D-cache           128KB, 2-way, 64B cache line
  L2 cache             2MB, 16-way, 64B cache line
  FSB                  64bit, 800MHz (DDR)
  Main memory          4GB DDR2 PC2-6400 (5-5-5), 64-bit, burst length 8
  Channel/Rank/Bank    2/4/4 (a total of 32 banks)
  SDRAM row policy     Open Page
  Address mapping      Page Interleaving
  Access reordering    Bank in order (BkInOrder)
  Memory access pool   256 (maximal 64 writes)
4.2. Simulated Access Reordering Mechanisms
Besides BkInOrder scheduling, three existing access reordering mechanisms, RowHit, Intel and Intel RP, are simulated and compared with burst scheduling. Table 4 summarizes all simulated access reordering mechanisms.
Table 4. Simulated access reordering mechanisms

  BkInOrder   In order intra bank, round robin inter banks
  RowHit      Row hit first intra bank, round robin inter banks [13]
  Intel       Intel's memory scheduling [14]
  Intel RP    Intel's scheduling with read preemption
  Burst       Burst scheduling
  Burst RP    Burst scheduling with read preemption
  Burst WP    Burst scheduling with write piggybacking
  Burst TH    Burst scheduling with threshold (52)
Proposed by Scott Rixner et al., RowHit scheduling uses
a unified access queue for each bank. A row hit first policy selects the oldest access directed to the same row as the
last access to that bank. Accesses from different banks are
performed in a round robin fashion [13].
Intel’s patented out of order memory scheduling features
unique read queues per bank and a single write queue for
all banks. Reads are prioritized over writes to minimize
read latency. Once an access is started, it will receive the
highest priority so that it can finish as quickly as possible
to reduce the degree of reordering [14]. Not proposed in
the patent, Intel scheduling with read preemption (Intel RP)
allows reads to interrupt ongoing writes in a similar way to
read preemption as described in Section 3.2.
RowHit and Intel’s scheduling both attempt to prioritize
row hits; however, they employ a “best effort” mechanism
in grouping row hit accesses. Without consideration of
SDRAM timing constraints, bubble cycles can easily be
introduced, leading to performance degradation. With the
transaction priority table given in Table 2, which encompasses timing constraints and is extensible, burst scheduling
guarantees that row hits within a burst are scheduled back to
back, maximizing bus utilization. Additionally, different information is employed to make scheduling decisions
at the memory access level and the transaction level, making
burst scheduling a two-level scheduler.
Burst scheduling with various optimizations, as discussed in Section 3, is also evaluated. Burst RP allows
reads to preempt writes. Burst WP piggybacks writes at
the end of bursts. Burst TH uses an experimentally selected
static threshold to control read preemption and write piggybacking. A threshold of 52 obtains the best performance
across the simulated benchmarks, as will be shown in Section 5.4.
5. Simulation Results and Analysis
Increasing row hits and reducing access latency are the
major design goals of burst scheduling. Studies of access
latency and SDRAM row hit rate illustrate the impacts of
access reordering mechanisms and inspire further improvements to burst scheduling.
5.1. Access Latency
When a memory read access is issued to the main memory, all in-flight instructions dependent upon this read request are blocked until the requested data is returned. Write
accesses, however, can complete immediately because no
data needs to be returned. Therefore, one of the design
goals of access reordering mechanisms is to reduce the read
latency by postponing writes.
Figure 7 shows the average read latency and write latency obtained by all simulated access reordering mechanisms. All out of order access reordering mechanisms reduce the read latency by 26% to 47% compared
to BkInOrder, while all write latencies except RowHit's
are increased.
RowHit treats reads and writes equally thus it reduces
both read and write latency and achieves the lowest write
latency among all access reordering mechanisms. Burst RP
has the lowest read latency because reads are not only prioritized over writes but also allowed to interrupt ongoing
writes. Read preemption helps Intel’s scheduling to reduce
read latency as well. Intel and Burst postpone writes thus
they both have long write latencies. Read preemption makes
the write latency of Intel and Burst even longer. On the
other hand, write latency is greatly reduced by write piggybacking because more row hits from writes are exploited.
[Figure 7. Access latency in memory cycles: (a) read latency and (b) write latency for each access reordering mechanism.]

To better understand the relationship between read and
write latency, the distribution of outstanding memory accesses for the benchmark swim, which is defined as the
percentage of time that a given number of accesses are outstanding in the main memory, is shown in Figure 8.

[Figure 8. Distribution of outstanding memory accesses for benchmark swim: (a) outstanding reads and (b) outstanding writes.]

[Figure 9. Average row hit, row conflict and row empty rates and SDRAM bus utilization: (a) row hit/conflict/empty breakdown and (b) address and data bus utilization.]
RowHit slightly increases the number of outstanding
accesses compared to BkInOrder to allow row hits to be
served first. Intel and Burst have large numbers of outstanding writes in the write queue due to postponed writes. Burst
is more aggressive in prioritizing reads over writes than Intel. As a result, Intel and Burst cause write queue saturation
24% and 46% of the time respectively for the swim benchmark.
Read preemption reduces the number of outstanding reads
but causes the write queue to saturate more frequently; e.g.
Burst RP causes write queue saturation 70% of the time.
Prioritizing reads over writes can improve system performance. However, write queue saturation may result
in CPU pipeline stalls. Write piggybacking is employed
to drain writes from the write queue without causing an
undue increase in read latency. Burst WP only causes write
queue saturation 2% of the time. Burst TH with a threshold of
52 achieves a tradeoff between reducing read latency and
preventing write queue saturation, resulting in a 9% write
queue saturation rate. Burst TH yields the best performance,
as the following sections will show.
5.2. Row Hit Rate and Bus Utilization
Row hits require fewer transactions and have shorter latencies than row conflicts, as discussed in Section 2. Access
reordering mechanisms usually select row hits first and turn
potential row conflicts into row hits. Also more row hits
may become available as new accesses arrive. Figure 9(a)
shows the row hit, row conflict and row empty rates averaged
across all simulated benchmarks.
Out of order access reordering mechanisms increase row
hit rate. Among them, RowHit, Burst WP and Burst TH
have the highest row hit rates. Intel and Burst without write
piggybacking have lower row hit rates although they are still
better than BkInOrder. The reason is that in contrast to Intel
and Burst, which only search the read queues for row hits,
RowHit, Burst WP and Burst TH seek row hits in both the
read queues and the write queues.
With the static open page policy, most row empties happen
after SDRAM auto refreshes, as banks are precharged. With
read preemption, an ongoing write interrupted by a read
may have precharged the bank without yet having initiated the row activate. This causes the preempting read to be
a row empty. Therefore Intel RP, Burst RP and Burst TH
have increased row empty rates.
The SDRAM bus utilization, which is the percentage of
time that the bus is occupied, is shown in Figure 9(b). There
is only a 3% difference in address bus utilization among all
access reordering mechanisms, while the data bus utilization varies in a range from 31% to 42%. This confirms that
the data bus is more critical than the address bus. Given the
simulated DDR2 PC2-6400 SDRAM, Burst TH achieves
the highest data bus utilization of 42%. The effective memory bandwidth is increased from 2.0GB/s (BkInOrder) to
2.7GB/s (Burst TH), resulting in a 35% improvement.
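These figures can be reproduced from the reported utilizations, assuming utilization is measured against the 6.4GB/s peak of a 64-bit, 800MT/s channel:

peak_gbs = 8 * 800e6 / 1e9                  # 64-bit bus * 800 MT/s = 6.4 GB/s
for name, util in (("BkInOrder", 0.31), ("Burst_TH", 0.42)):
    print(name, round(util * peak_gbs, 1), "GB/s")  # 2.0 and 2.7 GB/s
print(round(0.42 / 0.31 - 1, 2))            # ~0.35, the 35% improvement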
[Figure 10. Execution time of access reordering mechanisms on each benchmark, normalized to BkInOrder.]

[Figure 11. Distribution of outstanding memory accesses for benchmark swim under various thresholds: (a) outstanding reads and (b) outstanding writes.]
5.3. Execution Time
Previous sections showed the access latency, row hit rate
and bus utilization of various access reordering mechanisms. Results in this section examine the execution time
of each individual benchmark with these access reordering
mechanisms. Execution times are normalized to BkInOrder
and shown in Figure 10.
RowHit achieves an average 17% reduction of execution
time compared to BkInOrder. Intel and Burst reduce the execution time by 12% and 14% respectively. Read preemption alone contributes another 3% improvement on top of
Intel and Burst. Write piggybacking alone also contributes
5% improvement on top of Burst, resulting in a total of 19%
reduction of execution time by Burst WP. Burst TH which
combines read preemption and write piggybacking using a
static threshold of 52 yields the best performance among all
access reordering mechanisms, achieving a 21% reduction
in execution time across all simulated benchmarks, which
results in a 6% improvement over RowHit, 11% and 7%
improvement over Intel and Intel RP respectively.
Read preemption and write piggybacking have a varied impact dependent upon benchmark characteristics. For
mcf, parser, perlbmk and facerec, read preemption
contributes much greater performance improvement compared to write piggybacking. For the remainder of the
benchmarks, write piggybacking generally results in more
improvement than read preemption. Especially for gcc and
lucas, Burst WP achieves 14% and 18% reduction in execution time respectively. It is desirable to take advantage
of both read preemption and write piggybacking to achieve
a maximal performance improvement. A static threshold is
employed to control read preemption and write piggybacking. The next section will show how this threshold affects
the performance and how the optimized threshold is determined.

5.4. Burst Threshold
Read preemption and write piggybacking have been
shown to perform well on some benchmarks but not on all
benchmarks. Which technique has greater impact on performance is largely dependent on the memory access patterns of benchmarks. For example, allowing a critical read
having many dependent instructions to preempt an ongoing
write may improve the performance. However, completing
the ongoing write may prevent CPU pipeline stalls due to a
saturated write queue.
When the write queue has low occupancy, read preemption is desired to reduce read latency by allowing reads to
bypass writes. When the write queue approaches capacity,
write piggybacking can keep the write queue from saturating. Read preemption and write piggybacking can be switched
dynamically based on the write queue occupancy. When the
write queue occupancy is less than a certain threshold, read
preemption is enabled; otherwise, write piggybacking is enabled.
Using the swim benchmark from Section 5.1 as an example, the distribution of outstanding accesses under various thresholds is shown in Figure 11. Note that Burst RP and
Burst WP are equivalent to Burst TH64 and Burst TH0 respectively, given that the write queue size is 64.
From Figure 11, Burst RP has fewer outstanding reads
than the other thresholds; however, the read latency of Burst RP
is slightly higher. This is because when there
are fewer reads in the read queue, there are fewer opportunities for row hits to occur. In order for burst scheduling to
create larger bursts and increase row hits, the read queue
should contain a number of outstanding reads, and these
reads should be served at a rate that will not deplete the
read queue too quickly.
As the threshold increases from 0 to 64, the peak value
of outstanding writes increases as well. The write buffer
saturation rate is below 7% when the threshold is less than
48. The saturation rate increases to 14% at threshold 56
then jumps to 70% at threshold 64 (Burst RP). The earlier
write piggybacking is enabled, the less frequently the write
queue will be saturated.
To determine the threshold that yields the best performance, simulations with various thresholds are performed
and the results are shown in Figure 12. The execution
times are averaged across all benchmarks and normalized to Burst. As the threshold increases, read latency first
decreases because more reads experience
shorter latencies by preempting writes. From threshold 40,
read latency starts increasing, mainly due to CPU pipeline
stalls caused by increased occurrences of write queue saturation. Write latency increases as expected as the threshold increases. Execution time is determined by both read
latency and write latency. According to the results, threshold 52 yields the lowest execution time based upon the 16
SPEC CPU2000 benchmarks simulated.
[Figure 12. Access latency and execution time under various thresholds: (a) execution time normalized to Burst, (b) read latency and (c) write latency in SDRAM cycles.]

6. Conclusions
Memory scheduling techniques improve system performance by changing the sequence of memory accesses to increase row hit rate and reduce memory access latency. Inspired by existing academic and industrial access reordering
mechanisms, the burst scheduling access reordering mechanism is proposed with the goal of improving upon the
performance of existing access reordering mechanisms and
addressing their shortcomings.
Using the M5 simulator with a detailed SDRAM module, burst scheduling is examined and compared to the row
hit scheduling and Intel’s out of order memory scheduling. The performance contributions of read preemption
and write piggybacking are studied and identified. The
threshold that yields the best performance is determined by
experiments. With selected SPEC CPU2000 benchmarks,
burst scheduling with a threshold of 52 achieves an average
21% reduction in execution time across the 16 simulated
benchmarks over in order scheduling. Burst scheduling also
outperforms row hit scheduling and Intel's out of order
scheduling by 6% and 11% respectively.
As SDRAM devices evolve, timing parameters (tCL-tRCD-tRP) do not improve as rapidly as bus frequency.
For instance, DDR PC-2100 devices (133MHz) have typical timings of 2-2-2 (15ns-15ns-15ns), while DDR2 PC2-6400 devices (400MHz) have typical 5-5-5 (12.5ns-12.5ns-12.5ns) timings. From DDR PC-2100 to DDR2 PC2-6400,
the bus frequency and bandwidth improve by 200%,
yet the timing parameters shrink by only 17%. Access latency, in terms of memory cycles, therefore increases; for example, row
conflict latency grows from 6 cycles to 15 cycles. Increased main memory access latency creates more performance improvement opportunities for memory optimization
techniques. As the number of cycles consumed by timing parameters
grows in the future, the performance improvement provided by access reordering mechanisms will be even more
significant than the simulation results presented in this paper.
Access reordering mechanisms will play a more important role with chip level multiprocessors, as the memory controller will have a larger number of outstanding main
memory accesses from which to select. Access reordering
mechanisms may also benefit from integrating the memory
controller onto the CPU die. Due to the tighter connection between an integrated memory controller and the CPU, more
instruction level information, such as the number of dependent instructions, is available to the controller. This information may be utilized to make intelligent scheduling decisions. An integrated memory controller can also run at the
same speed as the CPU, making complex scheduling algorithms feasible.

7. Future Work
Burst scheduling with a static threshold works well on average; however, benchmarks have unique access patterns
and therefore favor different thresholds. A dynamic
threshold, calculated on the fly from critical parameters such as the read/write ratio, could match the access
patterns of different benchmarks for further performance
improvement.
Currently reads inside bursts are scheduled in the same
order as they were issued. Changing the order of accesses
within a burst will not affect the effective memory bandwidth and the total time to complete the burst. However,
critical accesses (i.e. having many dependent instructions)
may benefit from being scheduled first inside the burst. Integrating the SDRAM controller into the CPU die makes
more instruction level information, which can be used to
perform intra burst scheduling, available to the scheduler.
Similarly the sequence of bursts within banks can also be
changed to reduce the latency of critical data. Bursts could be
sorted by criteria other than the arrival time of the first access of each
burst, as currently proposed, e.g. by burst size.
However, care is required to prevent starvation
when performing inter burst scheduling.
Other SDRAM techniques such as address mapping [19,
5, 23, 16] change the distribution of memory accesses to
increase row hit rate. Access reordering mechanisms will
benefit from the increased row hit rate. Studies of access reordering mechanisms working in conjunction with SDRAM
address mapping are ongoing [15].
8. Acknowledgments
The work of Jun Shao and Brian Davis was supported
in part by NSF CAREER Award CCR 0133777. We would
also like to thank the reviewers for their comments and suggestions.
References
[1] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. NetworkOriented Full-System Simulation using M5. In Proceedings
of the Sixth Workshop on Computer Architecture Evaluation
using Commercial Workloads (CAECW), 2003.
[2] Chris Weaver. Pre-compiled Little-endian Alpha ISA SPEC
CPU2000 Binaries.
[3] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. HighPerformance DRAMs in Workstation Environments. IEEE
Trans. Comput., 50(11):1133–1153, 2001.
[4] B. T. Davis. Modern DRAM Architectures. PhD thesis, Department of Computer Science and Engineering, the University of Michigan, 2001.
[5] W.-F. Lin. Reducing DRAM Latencies with an Integrated
Memory Hierarchy Design. In HPCA ’01: Proceedings
of the 7th International Symposium on High-Performance
Computer Architecture, page 301, Washington, DC, USA,
2001. IEEE Computer Society.
[6] J. Hasan, S. Chandra, and T. N. Vijaykumar. Efficient
Use of Memory Bandwidth to Improve Network Processor
Throughput. In ISCA ’03: Proceedings of the 30th Annual
International Symposium on Computer Architecture, pages
300–313, New York, NY, USA, 2003. ACM Press.
[7] I. Hur and C. Lin. Adaptive History-Based Memory Schedulers. In MICRO 37: Proceedings of the 37th annual International Symposium on Microarchitecture, pages 343–354,
Washington, DC, USA, 2004. IEEE Computer Society.
[8] J. Janzen. DDR2 Offers New Features and Functionality.
DesignLine, 12(2), Micron Technology, Inc., 2003.
[9] S. A. McKee, W. A. Wulf, J. H. Aylor, M. H. Salinas, R. H.
Klenke, S. I. Hong, and D. A. B. Weikle. Dynamic Access
Ordering for Streamed Computations. IEEE Trans. Comput.,
49(11):1255–1271, 2000.
[10] Micron Technology, Inc. Micron 512Mb: x4, x8, x16 DDR2
SDRAM Datasheet, 2006.
[11] NEC. 64M-bit Virtual Channel SDRAM, October 1998.
[12] S. Rixner. Memory Controller Optimizations for Web
Servers. In MICRO 37: Proceedings of the 37th annual
International Symposium on Microarchitecture, pages 355–
366, Washington, DC, USA, 2004. IEEE Computer Society.
[13] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D.
Owens. Memory Access Scheduling. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 128–138, New York, NY, USA,
2000. ACM Press.
[14] H. G. Rotithor, R. B. Osborne, and N. Aboulenein. Method
and Apparatus for Out of Order Memory Scheduling. United
States Patent 7127574, Intel Corporation, October 2006.
[15] J. Shao. Reducing Main Memory Access Latency through
SDRAM Address Mapping Techniques and Access Reordering Mechanisms. PhD thesis, Department of Electrical and
Computer Engineering, Michigan Technological University,
2006.
[16] J. Shao and B. T. Davis. The Bit-reversal SDRAM Address
Mapping. In SCOPES ’05: Proceedings of the 9th International Workshop on Software and Compilers for Embedded
Systems, pages 62–71, September 2005.
[17] K. Skadron and D. W. Clark. Design Issues and Tradeoffs
for Write Buffers. In HPCA ’97: Proceedings of the 3rd
IEEE Symposium on High-Performance Computer Architecture, page 144, Washington, DC, USA, 1997. IEEE Computer Society.
[18] Standard Performance Evaluation Corporation. SPEC CPU2000 V1.2, December 2001.
[19] R. Tomas. Indexing Memory Banks to Maximize Page
Mode Hit Percentage and Minimize Memory Latency. Technical Report HPL-96-95, Hewlett-Packard Laboratories,
June 1996.
[20] A. Wong. Breaking Through the BIOS Barrier: The Definitive BIOS Optimization Guide for PCs. Prentice Hall, 2004.
[21] W. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious. SIGARCH Comput. Archit. News,
23(1):20–24, 1995.
[22] Y. Xu. Dynamic SDRAM Controller Policy Predictor. Master’s thesis, Department of Electrical and Computer Engineering, Michigan Technological University, April 2006.
[23] Z. Zhang, Z. Zhu, and X. Zhang. A Permutation-based
Page Interleaving Scheme to Reduce Row-buffer Conflicts
and Exploit Data Locality. In MICRO 33: Proceedings of
the 33rd annual ACM/IEEE international symposium on Microarchitecture, pages 32–41, New York, NY, USA, 2000.
ACM Press.
[24] Z. Zhu, Z. Zhang, and X. Zhang. Fine-grain Priority
Scheduling on Multi-channel Memory Systems. In HPCA
’02: Proceedings of the 8th International Symposium on
High-Performance Computer Architecture, page 107, Washington, DC, USA, 2002. IEEE Computer Society.