FlashCache: A NAND Flash Memory File Cache for Low Power Web Servers
Taeho Kgil
Trevor Mudge
Advanced Computer Architecture Laboratory
The University of Michigan
Ann Arbor, USA
tkgil@eecs.umich.edu
tnm@eecs.umich.edu
ABSTRACT
We propose an architecture that uses NAND flash memory to reduce main memory power in web server platforms.
Our architecture uses a two-level file buffer cache: a relatively small DRAM that holds the primary file buffer cache, and a flash memory that serves as the secondary file buffer cache. Compared to a conventional DRAM-only architecture, our architecture consumes orders of magnitude less idle power while remaining cost effective. This is a result of using flash memory, which consumes orders of magnitude less idle power than DRAM and is twice as dense. The client request behavior in web servers allows us to show that the primary drawbacks of flash memory, endurance and long write latencies, can easily be overcome. In fact, the wear-level aware management techniques that we propose are seldom invoked.
Categories and Subject Descriptors
B.3 [Semiconductor Memories]; C.0 [System architectures]
General Terms
Design, Experimentation, Performance
Keywords
Low power, web server, application-specific architectures,
Flash memory, server platforms, embedded system, full-system
simulation
1. INTRODUCTION
With the growing importance of web servers at internet service providers like Google and AOL, there is a trend towards using simple low power systems as blade servers in power-hungry server farms. These suit the modest
computation power and high memory throughput required
in typical server workloads. As shown in Figure 1, web
servers connect directly to the client and are only in charge
of delivering web content pages to the client. Since web servers require only a modest amount of computation power, their performance depends heavily on memory and I/O bandwidth and on access latency. To mitigate I/O latency and bandwidth limitations, especially the latency of hard disk drives, server farms typically use large main memories that try to cache the entire fileset in DRAM. Unfortunately, DRAM consumes a large portion of overall system power. Today's typical servers come with large quantities of main memory (4∼32GB) and have been reported to consume as much as 45W in DRAM idle power [8]. If we consider that the Niagara core inside
the Sun T2000, a typical server, consumes about 63W, we
can clearly see that DRAM idle power contributes to a large
portion of overall system power.
We examined the behavior of DRAM main memory in Figure 2. In particular, Figure 2(a) shows the hit rate of a file buffer cache for web server applications. We see only marginal improvement in page cache hit rate beyond a certain size. However, because the access latency of a hard disk drive is in milliseconds, a large DRAM is still needed to make up for the costly hard disk drive accesses. Otherwise, the server will not be fully utilized and will remain idle waiting for data from the hard disk drive. Surprisingly though, our experiments on web server workloads in Figure 3 show that an access latency of tens to hundreds of microseconds can be tolerated when accessing a large part of the file buffer cache without affecting throughput. This is due to the multi-threaded nature of web server applications, which allows modest access latencies of microseconds to be hidden. The resulting non-uniform memory hierarchy consumes less power while performing as well as a flat DRAM-only main memory. Furthermore, reads are more frequent than writes in web server applications. These characteristics make a strong case for using flash memory as a secondary file buffer cache. Flash memories have established themselves as storage devices for embedded and mobile platforms. Their continued rapid improvement is supported by their growing usage in a wide variety of high-volume commercial products. Flash memories consume orders of magnitude less idle power and are cheaper than DRAM, especially NAND-based flash memory, making them an excellent component for energy-efficient computing.
In this paper, we propose an architecture called FlashCache that uses NAND-based flash memory as a secondary file buffer cache to reduce overall power consumption in the main memory without impacting network bandwidth. Although flash memory has limitations in terms of endurance and bandwidth, we will make the case in this paper that the benefits of reducing overall main memory power outweigh the drawbacks of adopting flash memory.
Figure 1: A typical 3-tier server architecture. Tier 1—Web Server, Tier 2—Application Server, Tier 3—Database Server. PicoServer is targeted at Tier 1. An example of an internet transaction is shown. When a client request for a Java Servlet Page comes in, it is first received by the front end server—Tier 1. Tier 1 recognizes that a Java Servlet Page must be handled and initiates a request to Tier 2, typically using Remote Method Invocation (RMI). Tier 2 initiates a database query on the Tier 3 servers, which in turn generate the results and send the relevant information up the chain all the way to Tier 1. Finally, Tier 1 sends the generated content to the client.
Aside from the cost-effectiveness and low idle power of
flash memory, our FlashCache architecture can also take
advantage of flash memory’s persistence. Integrating flash
memory into the conventional system-level memory hierarchy enables quick warm up of file buffer caches, which allows
quicker deployment of web servers. We can quickly image
the cached file content onto the flash memory and reduce
the warm-up time of the file buffer cache.
Our FlashCache architecture provides:
• Overall reduction in system-level power. The physical characteristics of flash memory significantly reduce the idle power in main memory. The overhead resulting from the increased latency in accessing items in the file buffer cache can be minimized because of the access behavior of the files.
• Overall cost reduction in main memory. A cost-effective multi-chip main memory compared to a DRAM-only solution. The density of flash memory exceeds that of DRAM by more than 2×, which is reflected in the reduced cost per bit of flash memory. Therefore, the total cost of main memory is much lower than for a DRAM-only solution.
• Quicker startup time compared to conventional DRAM-only platforms. The nonvolatile property of flash memory means file buffer cache warm-up is not required after boot up.
In the process of describing our architecture, we use page cache and file buffer cache interchangeably. This is because after Linux kernel version 2.4 the file buffer cache has been merged into the page cache.
The paper is organized as follows. In Section 2, we provide background on flash memory, web server workloads, and related work. Sections 3 and 4 provide a detailed description of our FlashCache architecture and explain how it works. Sections 5 and 6 present our experimental setup and results. Finally, we present our conclusions and future work in Section 7.
2. BACKGROUND
2.1 Flash memory
There are two types of flash memory—NAND and NOR—
that are commonly used today. Each type of memory has
been developed for a particular purpose and has its own pros
and cons. Table 1 and Table 2 summarize the properties of NAND and NOR flash along with DRAM. The biggest difference compared to DRAM arises when writing to flash memory: flash memories require a preceding erase operation before a write can be performed. The ITRS roadmap projects
NAND flash memory cell sizes to be 2∼4× smaller than
DRAM cell sizes and NOR flash memory cell sizes to be similar or slightly bigger than DRAM cell sizes. A NAND flash
memory uses a NAND structure memory array to store content. The small cell size for a NAND flash memory results in
higher storage density. With the potential to support multilevel cells, storage density is expected to improve even more
as shown in the ITRS roadmap. However, the drawback is
that the random read access time for NAND flash memory is
lengthy due to the nature of the NAND structure. Sequential read accesses are not as slow as random reads, making NAND flash a good candidate for applications that have streaming locality. As a consequence, NAND flash memory has become popular for storing digital content such as audio and video files. In addition to lengthy random access times, NAND flash memory also has issues with reliability.
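As an illustration of the erase-before-write constraint noted above, the sketch below shows the ordering a flash driver must respect when updating an erase block in place. The function names (nand_erase_block, nand_program_page) and the geometry constants are hypothetical placeholders, not taken from any particular part or from this paper.

#include <stdint.h>

#define PAGE_SIZE        2048  /* hypothetical NAND page size in bytes */
#define PAGES_PER_BLOCK    64  /* hypothetical pages per erase block   */

/* Hypothetical low-level driver hooks; real parts expose similar commands. */
extern int nand_erase_block(uint32_t block);
extern int nand_program_page(uint32_t block, uint32_t page, const uint8_t *buf);

/* Rewrite one erase block: the whole block must be erased before any page
 * in it can be programmed again, which is why in-place updates are costly. */
int nand_rewrite_block(uint32_t block,
                       const uint8_t data[PAGES_PER_BLOCK][PAGE_SIZE])
{
    if (nand_erase_block(block) != 0)      /* erase: on the order of 1.5ms (Table 1) */
        return -1;
    for (uint32_t p = 0; p < PAGES_PER_BLOCK; p++)
        if (nand_program_page(block, p, data[p]) != 0)  /* program: ~200us per write */
            return -1;
    return 0;
}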
Figure 2: (a) File buffer cache (page cache) hit rate on the server side for client requests, measured for 4, 8, and 12 core configurations (MP4, MP8, MP12) as the DRAM size is varied from 32MB to 256MB. (b) A typical cumulative distribution function of client request behavior: 90% of requests are for 20% of the web content files.
Figure 3: Measured network bandwidth (Mbps) from full-system simulation while varying the access latency (12μs to 1600μs) to a secondary page cache, for (a) SURGE and (b) SPECWeb99. We assumed 128MB of DRAM with a slower memory of 1GB and measured bandwidth for 4, 8, and 12 core configurations (MP4, MP8, MP12). The secondary page cache can tolerate access latencies of hundreds of microseconds while providing equal network bandwidth.
NAND flash memory is likely to be manufactured with
faulty cells. Furthermore, wear-out may cause good cells to
become bad cells. NAND flash memory comes with built-in error correction code support to maintain reliability and extend the lifespan of flash memory. There have been implementations of file systems that extend the endurance of NAND flash memory to more than 5,000,000 cycles using BCH (Bose, Chaudhuri, and Hocquenghem) block codes [9].
NOR flash memory, in contrast to NAND flash, has a faster random read access time. Since NOR flash
memory performs relatively well for random read accesses,
it is commonly used for applications requiring fast access to
code and data. Accordingly, NOR flash memory has typically been used to store code and data on handheld devices,
PDAs, laptops, cell phones, etc.
2.2 Web Server workload
Web Server workloads are known to have high levels of
thread level parallelism (TLP), because connection level parallelism can be translated into thread level parallelism. In addition, their working set size is determined by the client web page request behavior. Client web page requests follow a Zipf-like distribution, as shown in Figure 2(b): a large portion of requests are centered around the same group of files. These file accesses translate into memory and I/O accesses. Because the computation is modest, memory and I/O latency are critical to high performance.
Therefore, file caching in the main memory plays a critical part in providing sufficient throughput. Without a page
cache, the performance degradation due to the hard disk
drive latency would be unacceptable. To work well on this
particular workload, the preferred architecture is a simple
multicore architecture with enough page cache to reduce the
amount of access to the hard disk drive[18].
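Client request generators typically realize such a Zipf-like distribution by inverting its cumulative distribution function. The sketch below is our own illustration of that idea, not code from SURGE or SPECWeb99; the exponent s and the uniform random source are placeholders.

#include <stdlib.h>
#include <math.h>

/* Sample a file rank in [0, nfiles) from a Zipf(s) distribution, where
 * P(rank k) is proportional to 1/k^s. Low ranks (popular files) dominate,
 * which is why a small fraction of the fileset absorbs most requests. */
static int zipf_sample(int nfiles, double s)
{
    static double *cdf = NULL;
    static int cached_n = 0;

    if (cdf == NULL || cached_n != nfiles) {   /* build the CDF once */
        free(cdf);
        cdf = malloc(nfiles * sizeof(double));
        double norm = 0.0;
        for (int k = 1; k <= nfiles; k++)
            norm += 1.0 / pow((double)k, s);
        double acc = 0.0;
        for (int k = 1; k <= nfiles; k++) {
            acc += (1.0 / pow((double)k, s)) / norm;
            cdf[k - 1] = acc;
        }
        cached_n = nfiles;
    }

    double u = (double)rand() / RAND_MAX;      /* uniform in [0,1] */
    for (int k = 0; k < nfiles; k++)           /* invert the CDF   */
        if (u <= cdf[k])
            return k;
    return nfiles - 1;
}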
             Density    $/Gb   Active   Idle     Read      Write     Erase     Built-in ECC
             (Gb/cm²)          Power*   Power*   Latency   Latency   Latency   support
DDR2 DRAM    0.7        48     878mW    80mW†    55ns      55ns      N/A       No
NOR          0.57       96     86mW     16μW     200ns     200μs     1.2s      No
NAND         1.42       21     27mW     6μW      25μs      200μs     1.5ms     Yes

* Power consumed for 1Gbit of memory
† DRAM idle power in active mode. Idle power in powerdown mode is 18mW.

Table 1: Cost and power consumption for conventional DRAM, NOR, and NAND-based flash memory. NAND flash memory is the most cost-effective while consuming the least amount of power. [22][20]
                                  2005           2007           2009           2011           2013           2016
NAND cell size, SLC/MLC* (μm²)    0.0231/0.0116  0.0130/0.0065  0.0081/0.0041  0.0052/0.0013  0.0031/0.0008  0.0016/0.0004
NOR cell size (μm²)†              0.0520         0.0293         0.0204         0.0117         0.0078         0.0040
DRAM cell size (μm²)              0.0514         0.0324         0.0153         0.0096         0.0061         0.0030
Flash erase/write cycles          1E+05          1E+05          1E+05          1E+06          1E+06          1E+07
Flash data retention (years)      10-20          10-20          10-20          10-20          20             20

* SLC - Single Level Cell, MLC - Multi Level Cell
† We assume a single level cell with the smallest area size of 9F² stated in the ITRS roadmap

Table 2: ITRS 2005 roadmap for flash memory technology. NAND flash memory is projected to be up to 7∼8× as dense as DRAM. Flash memory endurance improves by an order of magnitude approximately every 5∼6 years. Data retention is over 10 years, which is a long time for server platforms. [10]
2.3 Related Work
There has been considerable work on using flash memory
as part of the memory hierarchy. In [22], it was shown that
flash memory could be used directly for high performance
code execution by adding an SRAM. In [2], the authors proposed to integrate flash memory into hard disk drives to
improve their access latencies in mobile laptops. The prototype used the flash memory as a boot buffer to speed up boot
time and as a write buffer to hide write latencies to the hard
disk drive. The issue of wear-level awareness has also been
studied. For example, wear-level aware file systems have
been outlined in [3] to extend flash memory endurance. [15] has shown that error correction codes extend the lifespan of flash memory with marginal overhead.
In the case of non-uniform main memory, [14] showed the potential of such an architecture. They examined using a slow secondary main memory
as a swap cache and measured the performance overhead.
The overhead was shown to be negligible. [17][19] showed
that a considerable amount of main memory power can be
reduced with power aware page allocation. They reduced
DRAM idle power by putting non-recently accessed pages
to sleep. Our work extends the work from [14][2] and integrates flash memory as a secondary file buffer cache for
web server applications. We will show that this architecture
reduces operating costs due to an order of magnitude reduction in main memory power consumption, while giving a
significant cost advantage compared to a DRAM-only architecture. By adapting and simplifying the methods in [3] and
incorporating them into a cache replacement algorithm, we
mitigate any wear-out problems.
3. FLASHCACHE ARCHITECTURE
Figure 4 shows an overview of our proposed architecture
using a FlashCache. Compared to a conventional DRAM-only architecture, our proposed FlashCache assumes a two-level page cache. It requires a non-uniform main memory architecture with a small DRAM that holds the primary page
cache and flash memory that functions as the secondary page
cache. A flash memory controller is also required. Additional data structures required to manage the FlashCache
are placed in DRAM.
Our FlashCache architecture uses a NAND-based flash
memory rather than a NOR-based flash memory due to the
compactness and faster write latency found in NAND flash
memory. The random read access latency for a commercial
NAND is 25μs and the write latency is 200μs with an erase
latency of 1.5ms per block [6]. We employ flash memory as
a page cache rather than a generic main memory or a storage device. Flash is unsuitable as a generic main memory,
because of the frequent writes and associated wear-out. It is
unsuitable as a storage device, because of the large memory
overhead required in a file system where wear-out is a concern. A page cache only requires cache tags for each block,
implying less memory overhead than a file system requiring
a tree structure filled with file location nodes. Data structures used in FlashCache management are read from the
hard disk drive or the flash memory and loaded to DRAM
to reduce access latency and mitigate wear-out.
In conventional DRAM-only architectures that use a single level DRAM for file caching, a fully associative page
cache table is managed in software and probed to check if
a certain file is in DRAM. The search time is sped up by
using tree structures. A considerable amount of search time
can still elapse, but is sustainable because DRAM access latency for transferring file content is in nanoseconds. This is
not the case for a flash memory-based page cache, where the
read access latency for transferring file content is 10∼100 microseconds. Although we found the search time to be only 300∼400 nanoseconds, we conservatively employ a hash table to reduce it further.
3.1 FlashCache Hash Table for tag lookup
The FlashCache Hash Table (FCHT) is a memory structure that holds tags associated with the flash memory blocks.
This table exists in DRAM. A tag is made up of a page address field and a flash memory address field. The page address field points to the location in the hard disk drive and
is used for determining whether the flash memory holds this
particular location in the hard disk drive. If a hit occurs in the FCHT, the corresponding flash memory address field is used to access the flash memory. More than 99% of the time a hit occurs in our FlashCache system, and the flash memory block containing the requested file is sent to the primary page cache in DRAM. If a miss occurs, the FlashCache management scheme determines which block to evict based on a wear-level aware replacement algorithm and schedules that block for eviction.
The FCHT is partitioned into a set associative table. In
this work, we assume 16 way set associativity. Our studies
suggested that there was a marginal improvement in the hit
rate for higher associativity. Wear-level awareness is managed both at the block level—fine grain—and at the set
level—coarse grain. Each set is made up of 16 logical blocks
that are in turn made up of 4 of the write/erase blocks in
the flash memory. A wear-level status table is required to
profile the number of writes and erases performed on a particular block. In the following subsections, we describe each
component in an architecture using the FlashCache.
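A minimal sketch of the data structures this organization implies is shown below. The field widths and struct layout are our own illustration; the paper specifies only that a tag pairs a hard disk (page) address with a flash memory address, that the table is 16-way set associative, that a logical block groups four erase blocks, and that eviction counts are kept in DRAM.

#include <stdint.h>
#include <stdbool.h>

#define FCHT_WAYS            16  /* 16-way set associative, as described above     */
#define ERASE_BLOCKS_PER_LB   4  /* one logical block = 4 flash write/erase blocks */

/* One FCHT tag: maps a hard disk location to a flash memory location. */
struct fcht_tag {
    uint64_t disk_addr;    /* page address field: location on the hard disk */
    uint32_t flash_addr;   /* flash memory address field                    */
    bool     valid;
    uint8_t  lru_age;      /* per-set LRU ordering (illustrative)           */
};

/* One set of the FCHT plus the wear-level status counters kept in DRAM.
 * Each counter is incremented when its logical block is evicted and feeds
 * the wear-level aware replacement decisions in Section 3.2. */
struct fcht_set {
    struct fcht_tag tag[FCHT_WAYS];
    uint32_t        evict_count[FCHT_WAYS];  /* wear-level status table entries */
};

/* Probe one set for a disk address; returns the way index or -1 on a miss. */
static int fcht_lookup(const struct fcht_set *set, uint64_t disk_addr)
{
    for (int w = 0; w < FCHT_WAYS; w++)
        if (set->tag[w].valid && set->tag[w].disk_addr == disk_addr)
            return w;
    return -1;
}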
3.2 Wear-level aware cache replacement
The drawback of using flash memory is wear-out. Compared to DRAM, flash memory can only be written a fixed number of times. The endurance of flash memory is expected to improve in the future [10]. Table 2 shows the ITRS projection for flash memory endurance: a 10× improvement is expected every 5∼6 years, attributed to the use of better materials. Although endurance is expected to improve to the point where it is no longer a concern, we have chosen to be conservative and make our FlashCache replacement algorithm wear-level aware. The wear-level status table exists to assist in wear-level management.
3.2.1 Wear-level status table
The wear-level status table, located in DRAM, maintains the number of erases and writes performed on each logical block. The number of erases and writes is equal to the number of evictions. A wear-level status counter exists for each logical block in the flash memory. Every time a logical block is selected for eviction, its wear-level status table entry in DRAM is incremented. The wear-level status table is used for both coarse grain and fine grain wear-level management: it determines whether a block or a set is hot. A set or block is considered hot if its status table entry exceeds a certain threshold.
3.2.2 Management at the block level and set level
Wear-level management for the FlashCache is performed on FlashCache misses. At the block level, we initially select a logical block to be evicted using an LRU policy. However, if this block is identified as a hot block, that is, if the difference between the eviction count of the LRU victim and the minimum eviction count among the other logical blocks in the same set exceeds a certain threshold, then the logical block with the minimum eviction count is evicted instead, to balance the wear level.
At the set level, we determine whether the currently accessed set is a hot set by checking whether the maximum eviction count among its logical blocks exceeds the eviction counts of other sets by a certain threshold. A hot set is swapped with a cold set to balance the wear level. The temporary staging area for swaps is a temporary buffer located in flash memory.
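The block-level part of this policy fits in a few lines. The sketch below is our own rendering of the description above; the threshold value and the LRU bookkeeping are placeholders, and the coarser set-level swap is omitted.

#include <stdint.h>

#define WAYS 16

/* Per-set state: LRU ages and per-logical-block eviction counts
 * (the wear-level status table entries kept in DRAM). */
struct set_state {
    uint8_t  lru_age[WAYS];      /* higher age = less recently used */
    uint32_t evict_count[WAYS];  /* incremented on every eviction   */
};

/* Pick a victim logical block on a FlashCache miss. Start from plain LRU;
 * if the LRU victim looks hot (its eviction count exceeds the set minimum
 * by more than a threshold), evict the least-worn block in the set instead,
 * so that writes are spread across the set. */
static int pick_victim(struct set_state *s, uint32_t hot_threshold)
{
    int lru = 0, least_worn = 0;
    for (int w = 1; w < WAYS; w++) {
        if (s->lru_age[w] > s->lru_age[lru])
            lru = w;
        if (s->evict_count[w] < s->evict_count[least_worn])
            least_worn = w;
    }

    int victim = lru;
    if (s->evict_count[lru] - s->evict_count[least_worn] > hot_threshold)
        victim = least_worn;     /* hot block: balance wear instead of LRU */

    s->evict_count[victim]++;    /* update the wear-level status table     */
    return victim;
}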
3.3 Flash memory controller with DMA support
The flash memory controller handles the unique interface
required in accessing flash memory. The flash memory controller supports DMA to handle DMA transfers from DRAM
to flash memory and vice versa. The procedure required
in transferring flash memory data is simple in principle—
similar to accessing DRAM. However, there are two potential problems in performing reads and writes in flash memory. The first potential problem is bandwidth. Usually, a
flash memory can read and write only a single byte or word
per cycle. In addition, today’s NAND flash memories have
a slow transfer clock of 50MHz. Therefore, a large amount
of time is spent in reading and writing data. This becomes a
problem when we access the flash memory frequently. The
limited bandwidth in flash memory potentially becomes a
bottleneck in providing high performance. The other potential problem is blocking writes. Today's NAND flash memories suffer from blocking writes and do not support Read
While Write (RWW). A NAND flash memory is busy during
writes, blocking all other accesses that can occur. Without
RWW, a blocking flash memory write could also become a
potential bottleneck in providing high performance.
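To put the bandwidth concern in perspective, a rough estimate of the time to fetch one block from flash is shown below. The 4KB transfer unit is an illustrative assumption of ours (the paper does not state the transfer granularity); the latency and bandwidth figures come from Table 1 and Table 3.

t_{\mathrm{read}} \approx t_{\mathrm{access}} + \frac{\mathrm{size}}{\mathrm{bandwidth}}
               = 25\,\mu\mathrm{s} + \frac{4\,\mathrm{KB}}{50\,\mathrm{MB/s}}
               \approx 25\,\mu\mathrm{s} + 80\,\mu\mathrm{s}
               \approx 105\,\mu\mathrm{s}

This is comfortably within the hundreds of microseconds that Figure 3 shows the secondary page cache can tolerate.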
Fortunately, these problems can currently be dealt with.
From our studies shown in Figure 3, we know that we
can tolerate an access latency of hundreds of microseconds.
This relieves the limited bandwidth problem. The blocking
property of flash memory writes can be solved by distributing writes. Because flash memory writes do not occur frequently, we can schedule flash memory reads to have priority
over writes, by using a lazy writeback scheme. By managing a writeback buffer in DRAM that stores blocks that
need to be updated in the flash memory, we can prevent a
flash memory write from occurring when flash memory reads
are requested. The lazy writeback scheme allows writebacks
to occur mostly when the flash memory is not accessed for
reads.
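A minimal sketch of such a lazy writeback scheduler is shown below. The queue depth and the hook names (flash_issue_read, flash_issue_write) are placeholders of ours; the paper specifies only that demand reads take priority and that pending writes are buffered in DRAM.

#include <stdint.h>
#include <stdbool.h>

#define WB_QUEUE_DEPTH 32             /* hypothetical writeback buffer depth */

struct wb_entry { uint32_t flash_addr; void *dram_buf; };

struct flash_sched {
    struct wb_entry wb[WB_QUEUE_DEPTH];   /* writeback buffer held in DRAM */
    int wb_head, wb_tail;
};

/* Hypothetical controller hooks. */
extern void flash_issue_read(uint32_t flash_addr, void *dram_buf);
extern void flash_issue_write(uint32_t flash_addr, const void *dram_buf);

/* Called whenever the flash channel becomes free. Demand reads always go
 * first; buffered writebacks are drained only when no read is waiting, so a
 * long (200us) blocking program operation never delays a read. */
static void flash_schedule(struct flash_sched *s,
                           bool have_read, uint32_t raddr, void *rbuf)
{
    if (have_read) {
        flash_issue_read(raddr, rbuf);
        return;
    }
    if (s->wb_head != s->wb_tail) {                   /* drain one writeback */
        struct wb_entry *e = &s->wb[s->wb_head];
        flash_issue_write(e->flash_addr, e->dram_buf);
        s->wb_head = (s->wb_head + 1) % WB_QUEUE_DEPTH;
    }
}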
In the long term we expect more direct solutions: improved bandwidth and non-blocking operation. Adopting emerging technologies such as 3D stacking [16] to improve flash memory bandwidth, and implementing multi-banked flash memory that supports RWW, are possible solutions.
Figure 4: General overview of our FlashCache architecture (PPC: primary page cache, FCHT: FlashCache hash table, WST: wear-level status table). We show an example of a 1GB DRAM replaced with a smaller 128MB DRAM and 1GB of NAND-based flash memory. Additional components, including a flash memory controller with DMA support, are added to control the flash memory. The total die area required by our multichip memory is 60% of the size of a conventional DRAM-only architecture.
4. HOW IT WORKS
In this section we discuss how FlashCache hits and misses
are handled.
When a file I/O is performed at the application level, the
OS searches for the file in the primary page cache located in
DRAM. On a primary page cache hit in DRAM, the FlashCache is not accessed at all, and the file content is accessed
directly from the primary page cache. On a primary page
cache miss in DRAM, the OS searches the FCHT to determine whether the requested file currently exists in the
secondary page cache. If the requested file is found, then
a flash memory read is performed and a DMA transaction
is scheduled to transfer flash memory content to DRAM.
The requested address to flash memory is obtained from the
FlashCache Hash Table.
If a miss occurs in the FlashCache Hash Table search process, a logical block is selected for eviction. The selection process first considers wear level at the fine grain block level. If the selected logical block has been evicted more times than a certain threshold (a hot block), we evict the least evicted logical block belonging to the same set instead of applying an LRU policy, as noted above. Furthermore, if the set to which the selected logical block belongs is evicted frequently (a hot set), a set swap is performed, in which the hot set is swapped with a cold set. Fortunately, this does not occur frequently as a result of the behavior of the workload. After a logical block is selected for eviction, and whether or not we decide to perform a set swap, an erase operation is performed on the flash memory for that logical block. Concurrently, a hard disk drive access is scheduled using the device driver interface. The hard disk drive content is copied to the primary page cache in DRAM. The corresponding tag in the FCHT is also updated. Finally, we schedule a writeback of the hard disk drive content to flash memory through our lazy writeback scheme using DMA.
5. EXPERIMENTAL SETUP
The architectural aspects of our studies are obtained from
a microarchitectural simulator called M5 [12] that is able to
run Linux and evaluate full system level performance. A web
server connected to multiple clients is modeled. The client
requests are generated from user level network application
programs. To measure and model behavior found in DRAM
and flash memory, we extracted information from [6][5] to
add timing and power models to M5. Our methodology is described in more detail in the following subsections.
5.1 Simulation Studies
5.1.1 Full system architectural simulator
M5 is a configurable full-system simulator that runs an SMP version of Linux 2.6. The clients and server are all modeled with distinct communicating instances of M5 connected through a point-to-point ethernet link. The server side executes Apache, a web server. The client side executes benchmarks that generate representative requests for dynamic and static page content. For comparison, a chip multiprocessor-like system modeled after [18] is used. Configurations of 4, 8, and 12 processors are used in this study. We assume a simple in-order 5-stage pipeline core, similar to but simpler than the Niagara core. A detailed breakdown of our server architecture is shown in Table 3. The clients are modeled only functionally. We use the total network bandwidth of the ethernet devices, the sum of transmitted and received bandwidth, as our performance metric.
Server configuration parameters
Processor type           single issue in-order
Number of cores          4, 8, 12
Clock frequency          1GHz
L1 cache size            4 way 16KB
L2 cache size            8 way 2MB
DRAM                     64MB∼1GB, tRC latency 50ns, bandwidth 6.4GB/s
NAND Flash Memory        1GB, 16 way, 128KB logical block size, random read latency 25μs,
                         write latency 200μs, erase latency 1.5ms, bandwidth 50MB/s
IDE disk                 average access latency 3.3ms, bandwidth 300MB/s
Ethernet Device (NIC)    1Gbps Ethernet NIC
Number of NICs           2, 4, 6

Table 3: Server configurations in our studies.
5.1.2 Server Benchmarks
We use two web server benchmarks that directly interact
with client requests, SURGE [11] and SPECWeb99 [7], to measure client web page requests. Both benchmarks request filesets of more than 1GB: 1.5GB for SURGE and 1.8GB for SPECWeb99.
SURGE The SURGE benchmark represents client requests for static web content[11]. SURGE is a multi-threaded,
multi-process workload. We modified the SURGE fileset
and used a Zipf distribution to generate reasonable client
requests. Based on the Zipf distribution, a static web page of approximately 16KB is requested 50% of the time in our client requests. We configured the SURGE
client to have 24 outstanding client requests. It has been shown in [13] that the number of requests handled per second saturates after 10 concurrent connections, implying that 24 concurrent connections are more than enough to fully utilize the server.
SPECWeb99 To evaluate a mixture of static web content and simple dynamic web content, we used a modified
version of SURGE to request SPECWeb99 filesets. We used the default SPECWeb99 configuration to generate client requests: 70% of client requests are for static web content and 30% are for dynamic web content. We also fixed the number of SPECWeb99 connections to 24, as in the SURGE case.
5.2 Modeling DRAM and Flash memory
Timing, power, and die area are difficult to estimate with great accuracy at the architectural level. To make reasonable estimates and show general trends, we relied on industry and academic publications on die area, power, and performance. We used published datasheets [6][5] to estimate timing and power consumption. For die area estimation, we used published data found in [20][21]. We discuss this further in the following subsections.
5.2.1 DRAM
The timing model for DRAM is generated from the Micron datasheets in [4]. Our timing model also considers the
DRAM command interfaces including address multiplexing,
DRAM precharge, etc. This timing model is integrated into
the M5 simulator. The Micron DRAM spreadsheet calculator generates DRAM power based on inputs of reads, writes,
and page hit rates [5]. From the platform simulator, we profile the number of cycles spent on DRAM reads and writes,
and page hit rates to obtain average power. Our power estimates correlate well with numbers from [8]. For die area
estimation, we used numbers generated from industry products found in [21].
5.2.2 Flash memory
To understand the timing and power model for NAND
flash memory, we used several of the publications found in
[6]. We assumed a single bit cell in this study and expect
density and bandwidth to continue to improve due to the
high demand of flash memory in many commercial sectors.
Our die area estimates that are used to compare with DRAM
are from [20][21] and we performed comparisons on similar
process technologies. Although flash memory has begun to
outpace DRAM in process technology, we made conservative
estimates. To estimate the power consumption of our FlashCache architecture, we used measurements from datasheets.
Published numbers in datasheets represent maximum power
consumption. Therefore, our estimates are expected to be
conservative compared to real implementations. The idle
power of flash memory is typically 3 orders of magnitude
less than that of DRAM.
Figure 5: (a), (b) Network bandwidth (Mbps) comparison for SURGE and SPECWeb99 across main memory configurations using the FlashCache, ranging from 32MB of DRAM plus 1GB of flash up to 512MB of DRAM plus 1GB of flash, for 4, 8, and 12 core configurations (MP4, MP8, MP12); the rightmost points are for a DRAM-only system with 1GB of DRAM. (c) Total die area (mm²) for the same configurations.
Figure 6: Overall memory power consumption breakdown into read, write, and idle power for (a) SURGE and (b) SPECWeb99. For SURGE, overall power is 2.7W for 1GB of DDR2 in active mode, 1.8W for 1GB of DDR2 with powerdown, and 0.7W for 128MB of DDR2 plus 1GB of flash; for SPECWeb99, the corresponding figures are 2.5W, 1.6W, and 0.6W. Our FlashCache architecture reduces idle power by several orders of magnitude. For powerdown mode in DRAM, we assume an oracle powerdown algorithm.
6. RESULTS
The feasibility of this architecture can be measured by
performance, power reduction, and wear-out. In the following subsections, we present performance in terms of network
bandwidth, an important metric in server platforms. The
power reduction results mostly from the reduction in idle
power. We also show that our FlashCache architecture does
not require frequent wear-level rebalancing.
6.1 Network Performance
Figure 5 depicts the overall network performance as the DRAM size is varied and the flash memory size is fixed at 1GB. Our baseline is a configuration with no flash and 1GB of DRAM. Our optimal multichip configuration requires less die area than the baseline. From our earlier observation that large portions of the page cache can tolerate access latencies of tens to hundreds of microseconds, we find the optimal configuration to be 128MB of DRAM and 1GB of flash memory. In terms of area, this configuration requires 40% less die area than the baseline of 1GB of DRAM and no flash, as shown in Figure 5(c).
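As a rough consistency check of our own (cell area only, taking 1Gbit as 10^9 bits and ignoring periphery, spare blocks, and packaging), the 2005 cell sizes in Table 2 give:

A_{\mathrm{1GB\ DRAM}} \approx 8 \times 10^{9} \times 0.0514\,\mu\mathrm{m}^2 \approx 411\,\mathrm{mm}^2
A_{\mathrm{128MB\ DRAM\,+\,1GB\ NAND}} \approx 10^{9} \times 0.0514\,\mu\mathrm{m}^2 + 8 \times 10^{9} \times 0.0231\,\mu\mathrm{m}^2 \approx 51 + 185 \approx 236\,\mathrm{mm}^2

That is roughly 60% of the DRAM-only cell area, in line with the die area comparison above and with the caption of Figure 4.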
6.2 Overall main memory power
With respect to overall power consumed in main memory,
our primary power savings come from the reduction in idle power from using flash memory; the reduction is several orders of magnitude. We compare our multichip configuration with DRAM
configurations that have a) no power management—active
mode and b) ideal power management—powerdown mode.
Intelligent power management reduces DRAM idle power.
Our ideal power management scheme assumes the DRAM
is set to a powerdown mode whenever possible. These modes
were derived from [17][19]. Figure 6 shows our results. As
shown in Figure 6, the FlashCache architecture reduces
overall main memory power by more than a factor of 2.5,
even compared to an oracle power management policy implemented in software for DRAM. The memory power for an
architecture using a FlashCache includes the flash memory
controller power consumption along with the extra DRAM
accesses to manage the flash memory. Since flash memory
is accessed only thousands of times per second, the overall
average contribution from additional DRAM accesses and
the flash memory controller is negligible.
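Using the per-Gbit idle power figures from Table 1, a first-order estimate of our own (which ignores the flash controller and DRAM powerdown modes) illustrates where the savings come from:

P_{\mathrm{idle}}(\mathrm{1GB\ DDR2}) \approx 8 \times 80\,\mathrm{mW} = 640\,\mathrm{mW}
P_{\mathrm{idle}}(\mathrm{128MB\ DDR2\,+\,1GB\ NAND}) \approx 1 \times 80\,\mathrm{mW} + 8 \times 6\,\mu\mathrm{W} \approx 80\,\mathrm{mW}

The flash contributes essentially nothing to idle power, so the remaining idle power is set almost entirely by how much DRAM the configuration keeps.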
Figure 7: Flash memory endurance and spatial variation for an 8 core, 128MB DRAM configuration while varying the flash memory size (128MB to 2GB) in the FlashCache architecture. (a) Temporal endurance in years for SURGE, SPECWeb99, and a worst case, assuming a flash memory endurance of 1,000,000 cycles. (b) Standard deviation of the number of evictions across flash memory blocks (spatial variation) for SURGE and SPECWeb99.
6.3 Wear-level aware behavior
Figure 7 shows the predicted lifetime for varying flash memory sizes, assuming a flash memory endurance of one million cycles. These simulations assumed 128MB of DRAM with 8 cores and FlashCache sizes from 128MB to 2GB. From our simulation results, we found that our FlashCache is accessed fewer than 2000 times a second. This implies that a 1GB flash memory has a lifetime close to 100 years when assuming 1,000,000 write/erase cycles. A worst case analysis, assuming the hard disk drive is accessed all the time, yielded a lifetime of 2 years for a 2GB flash memory. With ECC support for multiple error correction, the worst case lifetime can be extended even further; the ECC latency overhead has been found to be in nanoseconds [15][1]. Applying a 100,000-cycle endurance, which is the current figure, we expect a 1GB flash memory to have a lifetime of 10 years. We also found that the spatial variation is small, implying our architecture naturally levels out wear. This is largely due to the set associativity of the cache. In our simulations, wear-level management routines were seldom invoked: only twice for the whole duration of our simulation.
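A rough way to see how these lifetimes arise, assuming (as the small spatial variation suggests) that wear is spread evenly across logical blocks: with an endurance of C write/erase cycles, B logical blocks, and an average eviction rate of E blocks per second,

T \approx \frac{C \cdot B}{E}

For a 1GB FlashCache with 128KB logical blocks, B = 8192; with C = 10^6, a lifetime near 100 years corresponds to an eviction rate of only a few blocks per second, consistent with the FlashCache being accessed fewer than 2000 times per second with a hit rate above 99%.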
7. CONCLUSIONS AND FUTURE WORK
Our FlashCache architecture reduces idle power by several orders of magnitude while maintaining cost effectiveness over conventional DRAM-only architectures, both in terms of operating cost and energy efficiency. From our observations, a typical web server architecture can sustain a file access latency in the tens to hundreds of microseconds without noticeable loss in throughput. We also observe that the organization of a cache, especially set associativity, inherently displays wear-level aware properties. This strengthens the case for using a flash-based file buffer cache instead of a conventional DRAM-only one. Our simulations show more than a 2.5× reduction in overall main memory power with negligible network performance degradation. Our future work will investigate the bandwidth and endurance requirements for adopting flash memory for other types of workloads found in other application domains.
8. ACKNOWLEDGEMENTS
This project is supported by the National Science Foundation under grant NSF-ITR CCR-0325898. This work was also supported by gifts from Intel.
9. REFERENCES
[1] Error Correction Code in Single Level Cell NAND
Flash Memories. http://www.st.com/stonline/
products/literature/an/10123.pdf.
[2] Hybrid Hard Drives with Non-Volatile Flash and
Longhorn.
http://www.samsung.com/Products/HardDiskDrive/
news/HardDiskDrive_20050425_0000117556.htm.
[3] JFFS: The Journalling Flash File System.
http://sources.redhat.com/jffs2/jffs2.pdf.
[4] Micron DDR2 DRAM.
http://www.micron.com/products/dram/ddr2/.
[5] The Micron system-power calculator. http:
//www.micron.com/products/dram/syscalc.html.
[6] Samsung NAND Flash memory datasheet.
http://www.samsung.com/products/semiconductor/
NANDFlash/SLC_LargeBlock/8Gbit/K9K8G08U0A/
K9K8G08U0A.htm.
[7] SPECweb99 benchmark.
http://www.spec.org/osg/web99/.
[8] Sun Fire T2000 Server Power Calculator.
http://www.sun.com/servers/coolthreads/t2000/
calc/index.jsp.
[9] TrueFFS.
http://www.m-systems.com/site/en-US/Support/
DeveloperZone/Software/LifespanCalc.htm.
[10] ITRS roadmap. Technical report, 2005.
[11] P. Barford and M. Crovella. Generating representative
web workloads for network and server performance
evaluation. In Measurement and Modeling of
Computer Systems, pages 151–160, 1998.
[12] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4):52–60, Jul/Aug 2006.
[13] E. L. Congduc. Packet classification in the NIC for improved SMP-based internet servers. In Proc. Int'l Conf. on Networking, Feb. 2004.
[14] M. Ekman and P. Stenström. A cost-effective main memory organization for future servers. In Proc. of the Int'l Parallel and Distributed Processing Symp., Apr 2005.
[15] S. Gregori, A. Cabrini, O. Khouri, and G. Torelli. On-chip error correcting techniques for new-generation flash memories. 91(4), Apr 2003.
[16] S. Gupta, M. Hilbert, S. Hong, and R. Patti. Techniques for producing 3D ICs with high-density interconnect. www.tezzaron.com/about/papers/ieee_vmic_2004_finalsecure.pdf.
[17] H. Huang, P. Pillai, and K. G. Shin. Design and Implementation of Power-Aware Virtual Memory. In USENIX Annual Technical Conference, pages 57–70, 2003.
[18] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara:
A 32-way multithreaded Sparc processor. IEEE Micro,
25(2):21–29, Mar. 2005.
[19] A. R. Lebeck, X. Fan, H. Zeng, and C. S. Ellis. Power
aware page allocation. In Proc. Int’l Conf. on Arch.
Support for Programming Languages and Operating
Systems, pages 105–116, 2000.
[20] J. Lee, S.-S. Lee, O.-S. Kwon, K.-H. Lee, D.-S. Byeon,
I.-Y. Kim, K.-H. Lee, Y.-H. Lim, B.-S. Choi, J.-S. Lee,
W.-C. Shin, J.-H. Choi, and K.-D. Suh. A 90-nm
CMOS 1.8-V 2-Gb NAND Flash Memory for Mass
Storage Applications. 38(11), Nov 2003.
[21] G. MacGillivray. Process vs. density in DRAMs.
http://www.eetasia.com/ARTICLES/2005SEP/B/
2005SEP01_STOR_TA.pdf.
[22] C. Park, J. Seo, S. Bae, H. Kim, S. Kim, and B. Kim.
A Low-cost Memory Architecture With NAND XIP
for Mobile Embedded Systems. In Proc. Int’l Conf. on
HW-SW Codesign and System
Synthesis(CODES+ISSS), Oct 2003.