Linux vs. Windows NT Memory Management

Contents
1. Introduction
2. Linux Memory Management
2.1 Address Generation in x86
2.2 How Linux Does This
2.3 Page Allocation
2.4 Page Replacement Algorithm
2.5 Kernel Memory Allocation
2.5.1 Slab Layout
3. Windows NT Memory Management
3.1 Reserved vs. Committed Memory
3.2 Page Frame Database
3.3 Page Replacement
References

1. Introduction

Linux and Windows NT (WNT) share a common approach to memory management: virtual memory with paging. The purpose of this paper is to explain memory management in Linux and to give a brief introduction to WNT memory management for comparison.

Research in memory management has had a great impact on hardware design, so memory management cannot be explained without discussing the support that the microprocessor, or the hardware in general, provides to the operating system. Memory management in Linux is platform independent, and most of the platforms it runs on share a common architecture of page tables with varying numbers of paging levels. Linux memory management was designed with the 64-bit Alpha processor in mind, but it accommodates other platforms with slight modifications. This discussion is therefore somewhat hardware dependent, although the basic idea is the same everywhere. The hardware I have chosen is, of course, the Intel Pentium/x86. The information here is valid for any Intel general-purpose processor from the 80386 to the latest Pentium.

2. Linux Memory Management

As in SVR4 and Solaris, Linux uses two separate memory management schemes: virtual memory management for user processes and kernel memory management for the kernel's own use. Linux divides the 4 GB address space into two parts. Addresses from 0 up to 3 GB (0x00000000 to 0xBFFFFFFF) are used for user processes, and addresses from 3 GB up to 4 GB (0xC0000000 to 0xFFFFFFFF) are for the kernel. This arrangement is shown in Figure 2.1.

[Figure 2.1: Linux address space. User space occupies 0 to 3 GB; kernel space occupies 3 GB to 4 GB.]

In user space, a demand-paged virtual memory scheme is used.
To understand this scheme fully, consider how addresses are generated on x86.

2.1 Address Generation in x86

The Intel x86 supports both segmentation and paging. The maximum segment size is 4 GB, which is the complete linear address space of the processor. Smaller segments are created by specifying the limit field in the segment's descriptor.

[Figure 2.2: Segmentation and paging in x86. The selector of a logical address picks a segment descriptor from the global descriptor table; the segment base plus the offset gives a linear address; the page directory entry and page table entry then map the linear address to a physical address.]

To locate a byte in a particular segment, a logical address must be provided. A logical address consists of a segment selector and an offset. A selector is a unique identifier for a segment; among other things, it provides an offset into a descriptor table to a data structure called a segment descriptor. A segment descriptor provides the base address of the segment, along with its access rights and limit. The base address is added to the offset from the logical address to generate a linear address.

If paging is not used, the linear address space of the processor is mapped directly onto its physical address space. If paging is used, the 32-bit linear address is divided as follows:

Bits 31-22        Bits 21-12     Bits 11-0
Page Directory    Page Table     Offset

Figure 2.3 Linear Address

The most significant 10 bits (31-22) select a second-level page table from the first-level page table, called the page directory. The next 10 bits (21-12) select a page from the second-level page table, and the last 12 bits (11-0) give the address of the byte within the 4 KB page.

2.2 How Linux Does This

As noted above, a segment can be any size up to 4 GB. Linux uses two sizes: all user-space segments, for all processes, are 3 GB, and kernel-space segments are 1 GB, starting at 3 GB.
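The translation steps of Section 2.1 and the 3 GB split can be illustrated with a small sketch. It is only a model of the arithmetic: the function names are hypothetical (they are not kernel or processor interfaces), the limit check is simplified, and the segment values in the usage line are illustrative numbers taken from the 1 GB kernel segment described above.

```python
# Illustrative model of x86 address generation; not kernel code.
# All names below are hypothetical and exist only for this sketch.

KERNEL_BASE = 0xC0000000  # 3 GB: start of kernel space in the layout above


def logical_to_linear(segment_base, segment_limit, offset):
    """Segmentation step: check the offset against the limit, then add the base."""
    if offset > segment_limit:
        raise ValueError("offset lies outside the segment")
    return (segment_base + offset) & 0xFFFFFFFF  # keep the result to 32 bits


def split_linear(linear):
    """Paging step: split a 32-bit linear address into its three fields."""
    directory_index = (linear >> 22) & 0x3FF  # bits 31-22: page directory entry
    table_index = (linear >> 12) & 0x3FF      # bits 21-12: page table entry
    page_offset = linear & 0xFFF              # bits 11-0: byte within the 4 KB page
    return directory_index, table_index, page_offset


def is_kernel_address(linear):
    """True if the linear address falls in the 3 GB to 4 GB kernel range."""
    return linear >= KERNEL_BASE


linear = logical_to_linear(segment_base=0xC0000000,
                           segment_limit=0x3FFFFFFF,
                           offset=0x00101234)
print(hex(linear), split_linear(linear), is_kernel_address(linear))
```

For the sample address 0xC0101234, the split yields directory index 768, table index 257 and page offset 564 (0x234), and the address is classified as a kernel address.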
Because every user-space segment covers the same range, Linux in effect uses a flat memory model in which all the segments in user space share the same address space. How, then, is memory protected in this multitasking environment? Protection is enforced at the page level. In this sense, Linux uses a pure paging mechanism for virtual memory management.

Now consider Linux's platform-independent paging scheme. Linux uses a three-level page table structure consisting of the following types of tables:

Page Directory: The top-level node, known as the page global directory or "pgd".
Page Middle Directory: The middle-level node, known as the page middle directory or "pmd".
Page Table: The bottom-level node, which holds the actual page table entries (PTEs) describing pages.

Since x86 supports only two-level paging, the code that traverses the middle level of the page tables does nothing on the x86 architecture; it is preprocessed and compiled down to essentially nothing via platform-specific #ifdefs. This allows other code to be written as though all machines had three-level page tables.

2.3 Page Allocation

The part of memory management that allocates pages, that is, manages physical memory, is called the zone allocator. Different ranges of physical pages may have different properties for the kernel's purposes. For example, DMA may work only for physical addresses below 16 MB. The zone allocator handles such differences by dividing memory into a number of zones and treating each zone as a unit for allocation purposes. Within each zone, the buddy system is used to manage physical pages. Pages are always allocated in blocks of 2^n pages aligned on a 2^n-page boundary.

2.4 Page Replacement Algorithm

The major component of the page replacement mechanism is a clock algorithm. The clock algorithm is used because it approximates LRU replacement and is cheaper to implement.
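As a rough illustration of the clock idea, the sketch below implements the textbook one-bit variant: a hand sweeps over the page frames, clears the reference bit of any frame that has it set, and evicts the first frame whose bit is already clear. This is not the Linux implementation; the frame representation and the function name clock_evict are assumptions made for the example.

```python
# Textbook one-bit clock replacement; a sketch, not the Linux code.
# Each frame is modelled as a dict with a "page" name and a "referenced" bit.

def clock_evict(frames, hand):
    """Pick a victim frame, starting the sweep at index `hand`.

    Returns (victim_index, new_hand). Frames whose reference bit is set
    have it cleared and are skipped once before they can be evicted.
    """
    while True:
        frame = frames[hand]
        next_hand = (hand + 1) % len(frames)
        if frame["referenced"]:
            frame["referenced"] = False  # clear the bit and move on
            hand = next_hand
        else:
            return hand, next_hand       # bit already clear: evict this frame


frames = [
    {"page": "A", "referenced": True},
    {"page": "B", "referenced": False},
    {"page": "C", "referenced": True},
]
victim, hand = clock_evict(frames, hand=0)
print(frames[victim]["page"], hand)
```

In this run the hand clears the bit of page A, then finds page B with its bit already clear and selects it. If every bit is set, the sweep clears them all and evicts the frame where it started, so the loop always terminates.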
In addition, all common general-purpose CPUs support the clock algorithm in hardware, in the form of a reference bit that the hardware maintains for each PTE. The simple clock scheme that uses only one bit is known as the "second chance" algorithm, because it gives a page a second chance to stay in memory for one more sweep cycle. Linux uses a simple second-chance (one-bit clock) algorithm, but with several elaborations and complications.

2.5 Kernel Memory Allocation

The buddy-system-based zone allocator discussed above is simple and relatively fast, but it is a poor allocator in many respects. Because it can manage block sizes only in powers of two, using it directly requires rounding each requested block size up to a power of two, which can incur a large cost in internal fragmentation. Linux therefore uses one more memory allocator for the kernel's use, called the slab allocator.

The basic idea behind the slab allocator is "object caching", a technique for dealing with objects that are frequently allocated and freed. In the kernel, small objects, such as mutexes used for synchronization, are created and destroyed very frequently. In many cases, the cost of initializing and destroying an object exceeds the cost of allocating and freeing its memory. The idea is therefore to preserve the invariant portion of an object's initial state (its constructed state) between uses, so that it does not have to be destroyed and recreated every time the object is used. This is achieved by caching the objects in small buffers.

The slab allocator uses the zone allocator to obtain large chunks of memory and carves them into smaller pieces as needed. A slab consists of one or more pages of virtually contiguous memory carved up into equal-size chunks, with a reference count of how many of those chunks have been allocated.
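The object-caching idea can be sketched in a few lines. The class below is a toy model, not the kernel's slab allocator: the name ObjectCache, its methods, and the use of a Python list as the free list are all assumptions made for illustration. It shows the two properties described above: objects are constructed once per slab rather than once per allocation, and a count of allocated objects is kept.

```python
# Toy model of object caching; names and structure are hypothetical.

class ObjectCache:
    """Keeps constructed objects on a free list so they can be reused."""

    def __init__(self, constructor, objects_per_slab=4):
        self.constructor = constructor
        self.objects_per_slab = objects_per_slab
        self.free = []        # constructed objects waiting to be handed out
        self.in_use = 0       # how many objects are currently allocated
        self.constructed = 0  # how many times the constructor has run

    def _grow(self):
        # Stand-in for obtaining a new slab from the page allocator and
        # carving it into equal-size, pre-constructed objects.
        for _ in range(self.objects_per_slab):
            self.free.append(self.constructor())
            self.constructed += 1

    def alloc(self):
        if not self.free:
            self._grow()
        self.in_use += 1
        return self.free.pop()

    def release(self, obj):
        # The caller is expected to return the object in its constructed
        # state, so no destructor or constructor needs to run here.
        self.in_use -= 1
        self.free.append(obj)


cache = ObjectCache(lambda: {"locked": False}, objects_per_slab=2)
first = cache.alloc()
cache.release(first)
second = cache.alloc()
print(second is first, cache.constructed, cache.in_use)
```

Allocating, releasing and allocating again returns the same object without running the constructor a second time, which is the saving that object caching aims for.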
2.5.1 Slab Layout

The contents of each slab are managed by a kmem_slab structure that maintains the slab's linkage in the cache, its reference count, and its list of free buffers. In turn, each buffer in the slab is managed by a kmem_bufctl structure that holds the freelist linkage, the buffer address, and a back pointer to the controlling slab. This arrangement is shown in Figure 2.4.

[Figure 2.4: Slab layout. One kmem_slab structure, a kmem_bufctl structure for each buffer, and the buffers themselves, followed by unused space.]

3. Windows NT Memory Management

Windows NT provides a page-based memory management scheme that gives applications a 32-bit linear address space of 4 GB. Like Linux, WNT divides this address space into two parts, but into equal halves of 2 GB each, as shown in Figure 3.1. As in Linux, the upper part of the address space is reserved for the system and the lower part is for user processes.

[Figure 3.1: Windows NT's address space. 0 to 2 GB is available for use by applications; 2 GB to 4 GB is reserved for use by the system.]

Like Linux, WNT did not choose a segmented memory architecture; it implements a pure demand-paged virtual memory system. The earlier discussion of how addresses are generated on the x86 architecture also applies to WNT. As noted, the integrity of a process's address space is preserved at the page level. This is achieved in two ways. First, each process has its own page directory, so it cannot access the address space of any other process. Second, the access-rights bits of the PTE can be used to protect individual pages from being accidentally corrupted by the process itself.

3.1 Reserved vs. Committed Memory

In Windows NT, a distinction exists between memory and address space. Although each process has a 4 GB address space, it will rarely, if ever, use anywhere near that amount of physical memory. Consequently, the virtual-memory manager (VMM) must keep track of the used and unused addresses of a process, independent of the pages of memory it is actually using.
In practice, this amounts to having one structure that represents all of the physical memory in the system and one structure that represents each process's address space. As part of the process object (the overhead associated with every process in Windows NT), the VMM stores a structure called the virtual address descriptor (VAD) tree to represent the address space of a process. As address space is used by a process, the VMM updates the VAD tree to reflect which addresses are used and which are not.

3.2 The Page-Frame Database

The virtual-memory manager uses a private data structure to maintain the status of every physical page of memory in the system. This structure is called the page-frame database. It contains an entry for every page in the system, together with a status for each page. The status of each page falls into one of the following categories:

Valid: A page in use by an active process in the system. Its PTE is marked as valid.
Modified: A page that has been written to, but whose contents have not yet been written to disk. Its PTE is marked as invalid and in transition.
Standby: A page that has been removed from a process's working set and is not dirty. Its PTE is marked as invalid and in transition.
Free: A page with no corresponding PTE that is available for use. It must first be zeroed before being used, unless it is used as a read-only page.
Zeroed: A free page that has already been zeroed and is immediately available for use by any process.
Bad: A page that has generated a hardware error and cannot be used by any process in the system.

Most of the status types are common to most paged operating systems, but the two transitional page status types, Modified and Standby, are unique to Windows NT. If a process addresses a location in one of these pages, a page fault is still generated, but very little work is required of the VMM. Transitional pages are marked as invalid, but they are still resident in memory, and their location is still valid in the PTE. The VMM merely has to change the status of the page to valid in both the PTE and the page-frame database, and then let the process continue.
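The cheap handling of a fault on a transitional page can be modelled as follows. This is a simplified sketch, not Windows NT code: the PageState enumeration, the PTE class, the dictionary used as the page-frame database, and the function handle_fault are all hypothetical names, and the "hard" branch only stands in for the real work of bringing a page in from disk.

```python
# Simplified model of a fault on a transitional page; all names are hypothetical.
from enum import Enum


class PageState(Enum):
    VALID = "valid"
    MODIFIED = "modified"
    STANDBY = "standby"
    FREE = "free"
    ZEROED = "zeroed"
    BAD = "bad"


# The two transitional states: the page is still resident in memory.
TRANSITIONAL = {PageState.MODIFIED, PageState.STANDBY}


class PTE:
    def __init__(self, frame, valid=False, in_transition=False):
        self.frame = frame                  # physical frame number
        self.valid = valid
        self.in_transition = in_transition


def handle_fault(pte, frame_db):
    """Resolve a page fault; return "soft" if no disk access was needed."""
    if pte.in_transition and frame_db.get(pte.frame) in TRANSITIONAL:
        # The page never left memory: mark it valid in both structures.
        frame_db[pte.frame] = PageState.VALID
        pte.valid = True
        pte.in_transition = False
        return "soft"
    return "hard"  # placeholder: a real system would read the page from disk


frame_db = {7: PageState.STANDBY}
pte = PTE(frame=7, valid=False, in_transition=True)
print(handle_fault(pte, frame_db), frame_db[7], pte.valid)
```

The point of the sketch is that the "soft" path touches only two pieces of bookkeeping, the PTE and the database entry, which is why the text says very little work is required of the VMM.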
[Figure 3.2: A process page table and the page-frame database. PTEs in the page table refer to page-frame database entries whose states include Valid, Free, Modified, and Standby.]

3.3 Page Replacement

In Windows NT, the component responsible for making page replacement decisions is called the working-set manager. When a process starts, the VMM assigns it a default working set that indicates the minimum number of pages necessary for the process to operate efficiently. The working-set manager periodically tests this quota by stealing Valid pages of memory from a process. If the process continues to execute without generating a page fault for a stolen page, the working set is reduced by one, and the page is made available to the system.

The act of stealing a page from a process occurs in two stages. First, the working-set manager changes the PTE for the page to indicate an invalid page in transition. Second, the working-set manager updates the page-frame database entry for the physical page, marking it as either Modified or Standby, depending on whether the page is dirty or not.

References

Curt Schimmel, UNIX Systems for Modern Architectures, Addison-Wesley.
Linux Memory Management Documentation, http://www.linuxmm.org/docs.shtml
Paul Wilson, The GNU/Linux 2.2 Virtual Memory System, Part I.
William Stallings, Operating Systems, Fourth Edition, Prentice Hall.
Rik van Riel, Linux MM: Design of a Zone Based Memory Allocator, July 1998.
Microsoft, MSDN Library, Memory Management in Microsoft Windows.