Hardware VM with DRAM and NAND Flash Memory

Seunghak Lee
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
seunghak@cs.cmu.edu

Jin Kyu Kim
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
jinkyuk@andrew.cmu.edu

Abstract

As the cost of NAND flash memory decreases rapidly, the SSD (Solid State Disk) is considered an alternative to the HDD in high-end server and mobile applications. Compared to HDDs, SSDs show higher I/O performance while consuming less power. The systems research community has also considered the SSD as an alternative to HDD swap space for demand paging. According to our literature survey, however, the major application of flash devices has been limited to replacing the HDD as storage. Replacing the HDD forces flash-based swap space to use the block I/O interface, and we believe this interface is the main hurdle to better performance and a longer life span for flash devices. There has been some effort to use flash devices as part of the memory hierarchy, as shown in the Intel Robson technology, but its usage is limited to a fast code cache. We believe that much greater benefits could be attained from NAND flash if it were placed closer to main memory in the hierarchy. In this study, we integrate flash devices into the memory hierarchy and make the flash device a third layer below main memory. The goal of this research is to move the page replacement algorithm into a HW-based memory controller. By placing the page replacement function in the memory controller, we can obtain more information, such as R/W operation counts, directly from the controller. Using this information, we could lower the main memory miss rate, reduce power consumption, and extend the life span of the flash device. In particular, to maximize the life span of NAND flash, we will design an extra memory translation layer that minimizes write operations on NAND flash and helps lower main memory misses.
Via this extra memory translation, we will attempt finer-granularity management of physical memory frames. Finally, we provide a large physical address space equal to the sum of main memory and swap space in existing systems; the OS VM will not face any page faults except initial ones. We will rethink the current virtual memory system in terms of page frame size, page replacement policy, and the current HW support for virtual memory, and we will look for opportunities to apply NAND flash devices, identify new requirements for HW support, and define a new HW/SW interface to enable our approach.

1 Definition of the Problem

1.1 Placement of data migration functions of the virtual memory system

In a traditional virtual memory system, the major tasks of virtual memory management are performed by the OS; only address translation is performed by HW. Data migration between the HDD and main memory is determined by the page replacement algorithm in the operating system. Due to limited HW support and SW overhead, an LRU-approximation algorithm is commonly used, relying on HW support such as the access bit. However, there has been little consideration of NV memory components such as NAND flash. Unlike DRAM, NAND flash has different power consumption and different read/write speeds, so a hybrid memory approach should take the distinct characteristics of each memory component into account. The information needed for efficient management is not available on the SW side due to the heavy communication overhead between HW and SW. Thus, we believe that at least the data migration function of VM should be moved to the HW side. The question of what kinds of information and HW support are necessary remains open in this study.

1.2 Size of the physical frame

In current virtual memory systems, the size of the physical memory chunk (frame) has tended to be large. The huge speed gap between the HDD and memory, as well as page table size, are responsible for this trend [2].
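The access-bit-driven LRU approximation mentioned in Section 1.1 can be sketched as a clock (second-chance) algorithm. This is a minimal illustration, not the paper's proposed replacement policy; the frame model and software-visible access bits are hypothetical stand-ins for what the HW would provide.

```python
# Minimal sketch of an access-bit LRU approximation (clock algorithm).
# The access_bit array models the HW-set reference bit per frame.

class ClockReplacer:
    def __init__(self, num_frames):
        self.frames = [None] * num_frames      # frame -> resident page
        self.access_bit = [0] * num_frames     # set by HW on reference
        self.hand = 0                          # clock hand position

    def reference(self, frame):
        """Model the HW setting the access bit on a memory reference."""
        self.access_bit[frame] = 1

    def pick_victim(self):
        """Advance the hand, giving a second chance to referenced frames."""
        while True:
            if self.access_bit[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.frames)
                return victim
            self.access_bit[self.hand] = 0     # clear bit and move on
            self.hand = (self.hand + 1) % len(self.frames)

r = ClockReplacer(4)
for f in (0, 2):
    r.reference(f)
print(r.pick_victim())  # frame 1: first frame with a clear access bit
```

With richer inputs from the memory controller (e.g., per-frame R/W counts), the victim-selection loop above is exactly the decision point a HW memory manager would replace.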
However, there are tradeoffs between large and small page sizes. A smaller page is more likely to capture the locality of the working set, but the long latency of the HDD has been the main hurdle to using smaller pages. Assume that we use a 512-byte frame instead of a 4KB frame. To transfer 4KB, we must transmit data 8 times, 512B at a time, which would take almost 8 times longer than transferring 4KB at once. But we now have much faster devices: the latency of flash is 50 to 100 times lower than that of the HDD. Furthermore, we could exploit parallelism across NAND chips given appropriate buffers in front of them. We will try a much smaller page size while keeping it transparent to the SW side; to achieve this, we will design an extra address translation layer. The large page table required by this extra layer is another obstacle to a smaller page size. However, in a hybrid memory hierarchy, NAND flash already needs a big table for the FTL's remapping, so we will integrate the mapping table for the extra translation layer with the FTL's remapping table. Via this approach we maintain a reasonable page table size while using small frames, and via the small physical frame size we expect a lower main memory miss rate.

1.3 Life span of NAND flash

Unfortunately, NAND flash has a relatively short life span due to its physical limitations: it is known to endure about 100,000 program/erase cycles. NAND flash performs the program operation in page units, where the page size is fixed at 2KB, 4KB, or 8KB depending on the device. Thus, a large frame has a negative impact on the life span of NAND flash. For example, assume a 4KB frame: even if the CPU overwrites only a small portion of the frame, the whole frame must be written to NAND flash. If we minimize page write operations by coalescing small updates, we can increase the life span of NAND flash dramatically.
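The coalescing idea above can be sketched as follows. The page and sub-frame sizes are the paper's running examples (4KB pages, 512B sub-frames); the flush policy and data structures are hypothetical, intended only to show how merging dirty sub-frames of the same flash page reduces program operations.

```python
# Sketch: coalesce small (512 B sub-frame) updates into one NAND page
# program. Three updates landing in the same 4 KB page cost one program
# instead of three.

PAGE_SIZE = 4096
SUB_FRAME = 512
SUBS_PER_PAGE = PAGE_SIZE // SUB_FRAME   # 8 sub-frames per flash page

class CoalescingBuffer:
    def __init__(self):
        self.dirty = {}          # page_no -> set of dirty sub-frame idxs
        self.programs = 0        # NAND page program operations issued

    def write(self, byte_addr):
        page_no = byte_addr // PAGE_SIZE
        sub_idx = (byte_addr % PAGE_SIZE) // SUB_FRAME
        self.dirty.setdefault(page_no, set()).add(sub_idx)

    def flush(self):
        """Program each dirty page once, however many sub-frames it absorbed."""
        self.programs += len(self.dirty)
        self.dirty.clear()

buf = CoalescingBuffer()
for addr in (0, 512, 1024, 4096):   # 3 updates to page 0, 1 to page 1
    buf.write(addr)
buf.flush()
print(buf.programs)   # 2 programs instead of 4 without coalescing
```

Halving the program count in this toy trace translates directly into P/E-cycle savings on the flash blocks behind those pages.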
Fine-grained FTL mapping allows several small updates to be merged into one page write. Additionally, we will integrate wear-leveling into the FTL mapping itself; we expect wear-leveling will then incur no extra overhead in our approach.

1.4 Efficiency of garbage collection

In addition to its short life span, NAND flash has an "erase-before-program" constraint, and the erase unit is a multiple of the program unit. To handle these constraints, the FTL performs remapping and garbage collection. For efficient garbage collection, it is very helpful for the host system to notify the FTL of deleted space, since the FTL can then skip copying the deleted data. SSD vendors are currently standardizing such a notification command, called TRIM, but it is not yet widespread because it requires support from both the HW and OS sides. In our approach, however, we can replace the TRIM command with a notification of process termination from the host.

2 Scope of the project

2.1 Design scope of the project

Although the virtual memory concept originated from the need to overcome small physical memory, additional functions such as memory protection, dynamic relocation, and memory sharing have become essential parts of virtual memory. In this project, we limit our scope to the first goal of VM, overcoming small physical memory size, and design a mechanism to attain this goal at the HW level. The other VM functions are left on the SW side.

2.2 Experiment and evaluation scope

To verify the feasibility of our approach, we plan to implement two static simulators: a memory manager simulator and an FTL simulator. The memory manager simulator will measure the expected performance gain in terms of main memory miss rate and the R/W operation counts of each component. The static FTL simulator will be used to estimate the life span of NAND flash.
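The life-span estimation that the static FTL simulator performs can be sketched as a per-block program/erase tally, checked against the 100,000-cycle endurance limit cited in Section 1.3. The block geometry and erase trace below are hypothetical placeholders, not measurements.

```python
# Sketch of a static life-span estimator: replay a trace of block
# erases, tally wear per block, and report worst-case consumption
# against the endurance limit.

ERASE_LIMIT = 100_000          # P/E cycles per block (from Section 1.3)
PAGES_PER_BLOCK = 64           # hypothetical NAND geometry

def simulate(erase_trace, num_blocks):
    """Replay a trace of block erases and tally per-block counts."""
    counts = [0] * num_blocks
    for blk in erase_trace:
        counts[blk] += 1
    return counts

def lifespan_fraction(erase_counts):
    """Fraction of device life consumed by the most-worn block."""
    return max(erase_counts) / ERASE_LIMIT

counts = simulate([0, 0, 1, 0, 2], num_blocks=4)
print(counts)                        # [3, 1, 1, 0]
print(lifespan_fraction(counts))     # 3 / 100000
```

The same tally also exposes wear imbalance across blocks, which is the quantity the integrated wear-leveling of Section 1.3 aims to minimize.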
3 How to attack the problem

3.1 Two levels of address translation

To abstract a large physical address space over DRAM and NAND flash, we use two levels of address translation. The first translation is done by the VM module of the OS; the size of the translation unit is unchanged. Through the first translation, a virtual address is translated into a 4KB semi-physical frame number and a byte offset within the semi-physical page. (This first-level translation keeps our approach transparent to the OS VM, which can continue to provide other VM functions such as inter-process protection and access permission checks.) The second translation is done by the memory manager in the memory controller: the semi-physical frame number and offset are translated into a physical sub-frame number and a byte offset within that sub-frame. Our approach requires one additional translation on a TLB miss, but we expect the gains from fine-grained physical frame management to offset this disadvantage.

3.2 Memory manager in the memory controller

First, we will explore the feasibility of new HW support beyond the access bit. We believe the counts of write and read operations on a given sub-frame would be good inputs for our memory manager when deciding on data migration between memory components. For this purpose, we plan to apply machine learning techniques to the new memory manager, and we will also try to enhance the LRU approximation with additional information from the memory controller.

3.3 FTL implementation

To manage NAND flash memory, we need to implement FTL software. The FTL is responsible for remapping logical addresses to physical addresses and for hiding the physical constraints of NAND flash, such as the "erase-before-write" rule and the different granularities of read/write and erase operations. Among the various FTL schemes, we will use full sector mapping, choosing the sector size to match the sub-frame size.
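The two-level translation of Section 3.1 can be sketched as follows, assuming 4KB semi-physical frames and 512B physical sub-frames. The page table and sub-frame map contents are hypothetical; in the real design the second-level map would live in the memory controller, integrated with the FTL table.

```python
# Sketch of two-level address translation:
#   level 1 (OS VM):       virtual addr -> semi-physical frame + offset
#   level 2 (mem ctrl):    semi-physical -> physical sub-frame + offset

FRAME = 4096         # semi-physical frame size (first-level unit)
SUB = 512            # physical sub-frame size (second-level unit)
SUBS = FRAME // SUB  # 8 sub-frames per semi-physical frame

def translate_l1(vaddr, page_table):
    """OS VM: virtual address -> (semi-physical frame, byte offset)."""
    return page_table[vaddr // FRAME], vaddr % FRAME

def translate_l2(sp_frame, offset, sub_map):
    """Memory controller: -> (physical sub-frame, offset in sub-frame)."""
    sub_idx = offset // SUB
    return sub_map[(sp_frame, sub_idx)], offset % SUB

page_table = {0: 7}                            # virtual frame 0 -> SP frame 7
sub_map = {(7, i): 100 + i for i in range(SUBS)}

sp, off = translate_l1(0x234, page_table)      # vaddr 0x234 is in frame 0
phys_sub, sub_off = translate_l2(sp, off, sub_map)
print(phys_sub, sub_off)                       # 101 52
```

Because the first level is unchanged, the OS VM sees ordinary 4KB frames; only the controller-side `sub_map` (here a plain dict) needs the fine-grained entries, which is exactly the table Section 3.4 proposes to fold into the FTL mapping.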
Generally, the full sector mapping scheme is known to show a lower WAF (write amplification factor) than the log block mapping scheme [1].

3.4 Big address mapping table

Our approach needs a big table for mapping semi-physical frame numbers to physical sub-frame numbers. An indirect mapping table could be a solution, but it incurs an additional memory reference on every access, so we will not take this approach. Instead, we will integrate this table into the FTL mapping table.

4 Review of Literature

[3, 4] proposed a new way to use the SSD as an alternative to HDD swap space. To eliminate the overhead of the storage SW stack, they used a PCIe SSD and modified the VM module to operate directly on it. Their research is OS-oriented; they pointed out several design decisions that are highly optimized for the HDD and suggested alternatives. [5] proposed a variant of LRU: to save the power consumed by flash operations, their policy prefers evicting clean pages over dirty pages. [6] explored using NAND flash as a disk cache for fast application launch and OS boot. [7] suggested flash memory as a disk cache to reduce DRAM usage. In these previous approaches, the usage of flash memory was very limited. To the best of our knowledge, our attempt to fully integrate the flash device into the memory hierarchy is the first in the context of virtual memory.

5 Action Plan

1. ∼ Milestone 1 Report: review the literature.
2. ∼ Milestone 2 Report: collect traces of memory accesses, page faults, swap-ins, and swap-outs; define the memory manager.
   (a) Design a new frame replacement algorithm between main memory and NAND flash.
   (b) Goals of the new algorithm: longer NAND flash life span, low main memory miss rate, low power consumption.
3. ∼ Milestone 3 Report: implement the simulators.
   (a) Implement a static simulator of the sector-mapping FTL.
   (b) Implement a static simulator of the memory manager.
4.
∼ Final Report: evaluation.
   Evaluation metrics:
   (a) Estimated main memory miss rate.
   (b) R/W/E operations on each device, for calculating power consumption.
   (c) Estimated NAND flash life span.

References

[1] Xiao-Yu Hu, Evangelos Eleftheriou, Robert Haas, Ilias Iliadis, and Roman Pletka. Write amplification analysis in flash-based solid state drives. In SYSTOR '09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, pages 1-9, New York, NY, USA, 2009. ACM.
[2] Abraham Silberschatz and Peter Galvin. Operating System Concepts, 5th Edition. John Wiley & Sons, 1998.
[3] Mohit Saxena and Michael M. Swift. FlashVM: virtual memory management on flash. In USENIX ATC '10: Proceedings of the 2010 USENIX Annual Technical Conference, pages 14-14, Berkeley, CA, USA, 2010. USENIX Association.
[4] Mohit Saxena and Michael M. Swift. FlashVM: revisiting the virtual memory hierarchy. In HotOS '09: Proceedings of the 12th Conference on Hot Topics in Operating Systems, pages 13-13, Berkeley, CA, USA, 2009. USENIX Association.
[5] Seon-yeong Park, Dawoon Jung, Jeong-uk Kang, Jin-soo Kim, and Joonwon Lee. CFLRU: a replacement algorithm for flash memory. In CASES '06: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 234-241, New York, NY, USA, 2006. ACM.
[6] Jeanna Matthews, Sanjeev Trika, Debra Hensgen, Rick Coulson, and Knut Grimsrud. Intel Turbo Memory: nonvolatile disk caches in the storage hierarchy of mainstream computer systems. ACM Trans. Storage, 4(2):1-24, 2008.
[7] T. Kgil, D. Roberts, and T. Mudge. Improving NAND flash based disk caches. In ISCA '08, pages 327-338, June 2008.
[8] Bruce Jacob, Spencer Ng, and David Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[9] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.